Getting Started
Learn how to use the AI Infrastructure Planner to size your workloads and calculate total cost of ownership.
GPU Sizing Concepts
Memory-First Sizing Approach
This tool uses a memory-first approach to GPU sizing, recognizing that for most GenAI inference workloads, memory capacity is the primary constraint, not compute performance. The sizing calculation considers the following factors (a worked sketch follows the list):
- Model Weights: The base memory footprint determined by parameter count and precision (4-bit, 8-bit, 16-bit, or 32-bit)
- KV Cache: Memory required to store key-value attention states for all concurrent requests, calculated from actual model architecture (layers, hidden size, attention heads)
- Concurrent Requests: Each active request requires KV cache memory for its full sequence length (input + output tokens)
- Advanced Architectures: Automatically handles Grouped Query Attention (GQA), Mixture of Experts (MoE), and Multi-Head Latent Attention (MLA) optimizations
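To make the arithmetic concrete, here is a minimal sketch of that calculation; the helper names and the Llama-3-8B-like example architecture are illustrative, not the tool's internal code:

```python
def model_weights_gb(params_billion: float, bits: int = 16) -> float:
    """Weights footprint: parameter count x bytes per parameter."""
    return params_billion * 1e9 * (bits / 8) / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, concurrent: int, bits: int = 16) -> float:
    """KV cache: 2 tensors (K and V) per layer per token, sized by the
    KV head count, so GQA's reduced head count shrinks this directly."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * (bits / 8)
    return bytes_per_token * seq_len * concurrent / 1e9

# Llama-3-8B-like architecture: 32 layers, GQA with 8 KV heads, head_dim 128
weights = model_weights_gb(8)                               # ~16 GB at 16-bit
kv = kv_cache_gb(32, 8, 128, seq_len=8192, concurrent=32)   # ~34 GB
print(f"weights {weights:.0f} GB + KV cache {kv:.0f} GB")
```

At 32 concurrent 8K-token requests the KV cache already outweighs the 16-bit weights roughly two to one, which is exactly why memory, not compute, binds first.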
Model Instance Deployment
The tool determines how to deploy your model across GPUs and servers (see the sketch after this list):
- Small Models: Models using <80% of a single GPU's memory run 1 instance per GPU for maximum throughput
- Medium Models: Models using >80% of a single GPU's memory span 2 GPUs per instance to ensure stability
- Large Models: Models exceeding server capacity (default 8 GPUs) can span 2-4 servers per instance using parallelism
- Multiple Instances: When workload demands exceed single instance capacity, the tool calculates optimal instance count and distribution
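A hypothetical decision rule mirroring these tiers; the 80 GB GPU, 80% cutoff, and 8-GPU server defaults are assumptions for illustration:

```python
import math

def gpus_per_instance(model_mem_gb: float, gpu_mem_gb: float = 80.0,
                      gpus_per_server: int = 8) -> int:
    if model_mem_gb < 0.8 * gpu_mem_gb:
        return 1                                  # small: one instance per GPU
    if model_mem_gb <= gpus_per_server * gpu_mem_gb:
        return max(2, math.ceil(model_mem_gb / gpu_mem_gb))  # medium: multi-GPU
    # large: span 2-4 whole servers via parallelism
    servers = min(4, math.ceil(model_mem_gb / (gpus_per_server * gpu_mem_gb)))
    return servers * gpus_per_server

print(gpus_per_instance(50))    # 1  (fits comfortably on one GPU)
print(gpus_per_instance(70))    # 2  (>80% of one GPU, pair for stability)
print(gpus_per_instance(900))   # 16 (spans two 8-GPU servers)
```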
Performance Estimation
The tool provides three performance estimate levels for time-per-output-token:
- Conservative: Uses the end-of-sequence KV cache size (the maximum) in the memory bandwidth calculation - best for capacity planning
- Fair: Uses the average KV cache size over the sequence (50% of the maximum) - typical real-world performance
- Optimistic: Uses the average KV cache size and assumes 95% GPU efficiency - best-case scenario
Throughput estimates are memory bandwidth-limited, accounting for KV cache access patterns during autoregressive decode. For time to first token (TTFT), the tool uses active parameters rather than total parameters, which keeps estimates accurate for MoE architectures.
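A sketch of that bandwidth-bound model: it assumes each decode step streams the active weights plus the KV cache through HBM once; the H100-class 3,350 GB/s peak and the 80% baseline efficiency are assumptions, not the tool's calibrated values:

```python
HBM_BW_GB_S = 3350      # assumed peak memory bandwidth (H100 SXM class)
BASELINE_EFF = 0.80     # assumed achievable fraction of peak

def tpot_ms(weights_gb: float, kv_max_gb: float, level: str = "fair") -> float:
    kv, eff = {
        "conservative": (kv_max_gb, BASELINE_EFF),        # end-of-sequence KV
        "fair":         (0.5 * kv_max_gb, BASELINE_EFF),  # average KV (50% of max)
        "optimistic":   (0.5 * kv_max_gb, 0.95),          # avg KV, 95% efficiency
    }[level]
    return (weights_gb + kv) / (HBM_BW_GB_S * eff) * 1000

for level in ("conservative", "fair", "optimistic"):
    print(f"{level}: {tpot_ms(16, 34, level):.1f} ms/token")
# conservative: 18.7, fair: 12.3, optimistic: 10.4
```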
Using the GenAI Sizing Calculator
The GenAI Sizing Calculator can pull model configurations directly from Hugging Face or accept manual input (a fetch example follows the list):
- Automated: Select from predefined models or enter any Hugging Face model ID to automatically load parameters, architecture details, and calculate accurate memory requirements
- Manual: Enter parameter count and optionally paste config.json for custom or private models
- Precision Control: Configure model weights and KV cache precision independently for optimization
- Push to TCO: Send your sizing results directly to the TCO Calculator to see infrastructure costs
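For the automated path, fetching a config looks roughly like this; the resolve URL is Hugging Face's standard raw-file pattern, and gated models additionally require an auth token:

```python
import requests

def fetch_hf_config(model_id: str) -> dict:
    """Download config.json for a public Hugging Face model."""
    url = f"https://huggingface.co/{model_id}/resolve/main/config.json"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json()

cfg = fetch_hf_config("Qwen/Qwen2.5-7B-Instruct")
layers = cfg["num_hidden_layers"]
kv_heads = cfg.get("num_key_value_heads", cfg["num_attention_heads"])  # GQA-aware
head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
print(layers, kv_heads, head_dim)   # inputs for the KV cache formula above
```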
Additional Calculators
Vector Database Sizing:
- Calculate storage from document corpus size or direct vector count
- Support for HNSW, IVF, Flat, LSH, and DiskANN indexes
- Configurable memory modes: all-in-RAM, hybrid, or disk-based
- Production features: replication, backup retention, metadata overhead (reflected in the sketch below)
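A back-of-envelope version of that sizing; the HNSW graph term (~2×M 8-byte neighbor links per vector) is a common approximation, not necessarily the tool's exact formula:

```python
def vector_db_gb(num_vectors: int, dims: int, hnsw_m: int = 16,
                 replicas: int = 2, metadata_bytes: int = 256) -> float:
    raw = num_vectors * dims * 4                 # fp32 vectors
    graph = num_vectors * hnsw_m * 2 * 8         # ~2*M neighbor links x 8 B each
    meta = num_vectors * metadata_bytes          # IDs, payload, filter fields
    return (raw + graph + meta) * replicas / 1e9

# 10M document chunks embedded at 1,024 dimensions, 2 replicas
print(f"{vector_db_gb(10_000_000, 1024):.0f} GB")   # ~92 GB
```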
Facility Design:
- Visual rack layout designer with drag-and-drop equipment placement
- Import hardware configuration from TCO calculator
- Real-time power and space utilization tracking (see the fit check after this list)
- Configurable rack power limits (15-120kW per rack)
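The power tracking reduces to a fit check like the following; the ~10.2 kW per 8-GPU server and the 30 kW rack limit are assumed defaults:

```python
import math

def racks_needed(servers: int, server_kw: float = 10.2,
                 rack_limit_kw: float = 30.0) -> tuple[int, int]:
    """Racks required when power, not space, is the binding limit."""
    per_rack = max(1, int(rack_limit_kw // server_kw))
    return math.ceil(servers / per_rack), per_rack

racks, per_rack = racks_needed(16)
print(f"{racks} racks at {per_rack} servers each")   # 8 racks at 2 servers each
```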
TCO Methodology
What is Total Cost of Ownership?
Total Cost of Ownership (TCO) provides a comprehensive financial comparison between on-premises and cloud AI infrastructure over a specified time period (1-5 years). The tool calculates all costs including hardware, energy, personnel, facilities, and operations to help you make informed infrastructure decisions.
On-Premises Cost Components
The tool calculates on-premises costs from granular hardware specifications:
- Hardware (CAPEX): GPU servers, control servers, storage servers, network switches, and optical transceivers - each with customizable unit costs
- Networking: Calculated based on cluster size using a spine-leaf architecture (see the sketch at the end of this list):
  - Small clusters (≤16 GPU servers): 2 spine switches
  - Larger clusters: Leaf switches (1 per 8 GPU servers) + spine switches (half the number of leaf switches, minimum 2)
  - Very large clusters (>512 servers): Additional super-spine layer
  - Optics calculated from switch ports, assuming 400G connections with splitters
- Energy Costs: Power consumption calculated from GPU type and server count, with configurable PUE (Power Usage Effectiveness) for cooling overhead
- Facilities: Rack space costs calculated from power and space requirements, accounting for rack power limits (default 30kW per rack)
- Personnel: Platform engineers and network engineers with fully loaded costs (includes 30% overhead for benefits)
- Maintenance: Annual hardware maintenance as percentage of hardware cost (default 10%)
- Software: Annual costs for data platform, monitoring, and management tools
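The networking rules above translate into a few lines of arithmetic, shown here together with a PUE-adjusted energy estimate; the super-spine sizing, server wattage, and prices are assumptions:

```python
import math

def switch_counts(gpu_servers: int) -> dict:
    if gpu_servers <= 16:
        return {"leaf": 0, "spine": 2, "super_spine": 0}   # small cluster
    leaf = math.ceil(gpu_servers / 8)             # 1 leaf per 8 GPU servers
    spine = max(2, math.ceil(leaf / 2))           # half the leaves, minimum 2
    super_spine = math.ceil(spine / 2) if gpu_servers > 512 else 0
    return {"leaf": leaf, "spine": spine, "super_spine": super_spine}

def annual_energy_usd(gpu_servers: int, server_kw: float = 10.2,
                      pue: float = 1.4, usd_per_kwh: float = 0.12) -> float:
    return gpu_servers * server_kw * pue * usd_per_kwh * 24 * 365

print(switch_counts(64))                       # {'leaf': 8, 'spine': 4, 'super_spine': 0}
print(f"${annual_energy_usd(64):,.0f}/year")   # ~$961,000 at these assumptions
```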
Cloud Cost Components
The calculator supports two cloud deployment models (each illustrated with a cost sketch below):
IaaS Mode (Self-Hosted on Cloud):
- Compute: EC2 GPU instances priced via a live pricing API (p5.48xlarge for H100, p6-b200 for B200, etc.)
- Storage: EBS (block), S3 (object), FSx for Lustre (HPC), EFS (network file system)
- Networking: Inter-AZ and internet egress costs based on monthly transfer volumes
- Personnel: Cloud engineers (typically fewer than on-prem)
- Discounts: Configurable % applied to all services (via reserved instances or enterprise agreements)
- Optional: SageMaker (15% overhead), Enterprise Support (% of total)
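A rough composition of the IaaS monthly bill; the S3, egress, and hourly rates below are placeholders, whereas the tool itself pulls live pricing:

```python
def iaas_monthly_usd(instances: int, hourly_usd: float, s3_gb: float,
                     egress_gb: float, discount: float = 0.20,
                     sagemaker: bool = False) -> float:
    compute = instances * hourly_usd * 730           # ~hours per month
    storage = s3_gb * 0.023                          # assumed S3 $/GB-month
    egress = egress_gb * 0.09                        # assumed egress $/GB
    total = (compute + storage + egress) * (1 - discount)
    return total * 1.15 if sagemaker else total      # optional 15% overhead

# 2 H100-class instances at an assumed ~$98/hr on-demand, 20% discount
print(f"${iaas_monthly_usd(2, 98.0, 50_000, 10_000):,.0f}/month")
```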
Token API Mode (Fully Managed SaaS):
- Input Tokens: Per-million-token pricing with optional cached pricing (10% of standard rate)
- Output Tokens: Per-million-token pricing for generated text
- Providers: OpenAI (GPT-5 family), Anthropic (Claude), Gemini, AWS Bedrock, Azure OpenAI, or custom
- Zero Infrastructure: No hardware, facilities, energy, or personnel costs
- Throughput-Based: Costs calculated from GenAI sizing results (effective tokens/sec × time period)
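And the Token API side, where the input:output ratio and cache hit rate are assumptions you would replace with your own traffic profile:

```python
def token_api_monthly_usd(out_tokens_per_sec: float, in_per_m: float,
                          out_per_m: float, in_out_ratio: float = 3.0,
                          cache_hit: float = 0.5) -> float:
    out_m = out_tokens_per_sec * 30 * 24 * 3600 / 1e6   # M output tokens/month
    in_m = out_m * in_out_ratio                         # assumed input volume
    # cached input tokens billed at 10% of the standard rate
    in_cost = in_m * in_per_m * ((1 - cache_hit) + cache_hit * 0.10)
    return in_cost + out_m * out_per_m

# 1,000 output tok/s sustained at $3 / $15 per 1M in/out (placeholder prices)
print(f"${token_api_monthly_usd(1000, 3.0, 15.0):,.0f}/month")   # ~$51,700
```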
Key Decision Factors
- Analysis Period: Longer periods (3-5 years) allow on-prem hardware costs to amortize. The tool shows the break-even point and cumulative costs over time (see the sketch below).
- Utilization: The tool assumes 24/7 operation by default. Cloud costs scale linearly with usage, while on-prem costs are fixed regardless of utilization.
- Hardware Pricing: All hardware unit costs are customizable. Default values reflect enterprise pricing but can be adjusted based on your vendor negotiations.
- Facility Assumptions: Rack costs, power pricing ($/kWh), and PUE are all configurable to match your specific data center environment.
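The break-even mechanics in miniature, with illustrative numbers:

```python
def break_even_month(capex_usd: float, onprem_monthly_usd: float,
                     cloud_monthly_usd: float) -> float | None:
    """Month at which cumulative on-prem cost drops below cloud's."""
    delta = cloud_monthly_usd - onprem_monthly_usd
    return capex_usd / delta if delta > 0 else None   # None: on-prem never breaks even

m = break_even_month(3_000_000, 60_000, 180_000)
print(f"break-even after ~{m:.0f} months")            # ~25 months
```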
Using the TCO Calculator
The TCO Calculator supports two cloud pricing modes:
IaaS Mode (Infrastructure-as-a-Service):
- Complete GenAI Sizing or Select Preset: Use GenAI Sizing to calculate requirements from your workload, or choose a preset cluster size (2-32 GPU nodes)
- Configure Hardware: Set GPU server count, GPU type, and specifications. Networking auto-calculated.
- Set On-Prem Parameters: Configure facilities, personnel, and maintenance
- Set Cloud Parameters: Choose region, storage (S3/FSx/EFS), discounts, optional services
- Choose Analysis Period: Select 1-5 years for break-even analysis
Token API Mode (SaaS):
- Complete GenAI Sizing: Required to calculate token throughput
- Select Provider: Choose OpenAI, Anthropic, Gemini, Bedrock, Azure, or custom
- Choose Model: Select specific model for accurate per-token pricing
- Enable Cached Pricing: Toggle 10% cached input pricing (enabled by default)
- Compare: See effective cost per 1M tokens vs. self-hosted infrastructure (sketched below)
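The self-hosted side of that comparison is just monthly TCO divided by monthly token output, as in this illustrative sketch:

```python
def self_hosted_per_m_tokens(monthly_tco_usd: float,
                             out_tokens_per_sec: float) -> float:
    monthly_m_tokens = out_tokens_per_sec * 30 * 24 * 3600 / 1e6
    return monthly_tco_usd / monthly_m_tokens

# $250k/month all-in TCO serving 5,000 output tok/s (illustrative numbers)
print(f"${self_hosted_per_m_tokens(250_000, 5000):.2f} per 1M output tokens")
# ~$19.29 per 1M output tokens
```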
Workload Presets
Presets provide starting points for common AI cluster configurations with pre-calculated networking and support infrastructure:
- Small Cluster (2 nodes): 16 GPUs, minimal control plane, 2 switches
- Medium Cluster (8 nodes): 64 GPUs, production control plane (5 nodes), storage servers
- Large Cluster (16 nodes): 128 GPUs, full spine-leaf networking
- XL Cluster (32 nodes): 256 GPUs, enterprise-scale infrastructure
Key Features
Transparent Pricing: All cloud pricing is pulled from live AWS APIs and displayed in the detailed breakdown, so you can verify calculations and understand cost drivers.
Effective Token Cost: The calculator shows cost per 1M output tokens for both IaaS and Token API modes, making it easy to compare self-hosted vs. SaaS approaches on an apples-to-apples basis.
Production-Ready: All estimates include real-world considerations like cooling overhead (PUE), benefits costs (30%), and GPU efficiency modifiers.
Additional Resources
Results & Export
The Results page consolidates all analyses (TCO, GenAI sizing, Vector DB, Facility Design) into a comprehensive view. Export professional PDF reports with your organization details, complete with detailed breakdowns and a disclaimer section.
Commercial Use
The AI Infrastructure Planner is free for personal and educational use. If you want to use it as part of a commercial workflow or product, you can review licensing options on the Commercial Use & Licensing page.
Note: This tool provides planning estimates based on public specifications and current market pricing. Actual performance and costs will vary based on specific workload characteristics, vendor negotiations, and deployment details. Always conduct proof-of-concept testing and validate with your vendors before making infrastructure decisions.