Getting Started
Learn how to use the AI Infrastructure Planner to size your workloads and calculate total cost of ownership.
Platform Demo
Watch this comprehensive walkthrough to see how to use all the features of the AI Infrastructure Planner.
GPU Sizing Concepts
Memory-First Sizing Approach
This tool uses a memory-first approach to GPU sizing, recognizing that for most GenAI inference workloads, memory capacity is the primary constraint, not compute performance. The sizing calculation (sketched in code after this list) considers:
- Model Weights: The base memory footprint determined by parameter count and precision (4-bit, 8-bit, 16-bit, or 32-bit)
- KV Cache: Memory required to store key-value attention states for all concurrent requests, calculated from actual model architecture (layers, hidden size, attention heads)
- Concurrent Requests: Each active request requires KV cache memory for its full sequence length (input + output tokens)
- Advanced Architectures: Automatically handles Grouped Query Attention (GQA), Mixture of Experts (MoE), and Multi-Head Latent Attention (MLA) optimizations
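As a rough illustration of how weight and KV cache memory add up, here is a minimal Python sketch of the standard formulas. Function and parameter names are illustrative, not the planner's internals, and it omits the MoE/MLA refinements and runtime overhead the tool accounts for.

```python
def model_weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Base weight footprint: parameter count x bytes per parameter."""
    return params_billions * 1e9 * (bits_per_weight / 8) / 1e9


def kv_cache_gb(layers: int, hidden_size: int, num_heads: int, num_kv_heads: int,
                seq_len: int, concurrent_requests: int, bits_per_value: int) -> float:
    """KV cache across all concurrent requests.

    Each layer stores a key and a value vector per token, sized by the KV heads;
    Grouped Query Attention shrinks this by num_kv_heads / num_heads.
    """
    head_dim = hidden_size // num_heads
    bytes_per_token = layers * 2 * num_kv_heads * head_dim * (bits_per_value / 8)
    return bytes_per_token * seq_len * concurrent_requests / 1e9


# Hypothetical 70B dense model: 16-bit weights, 8-bit KV cache, 80 layers,
# 8192 hidden size, 64 heads with 8 KV heads (GQA), 32 concurrent requests
# at 4096 tokens (input + output) each.
weights = model_weights_gb(70, 16)
kv = kv_cache_gb(80, 8192, 64, 8, 4096, 32, 8)
print(f"weights ~{weights:.0f} GB, KV cache ~{kv:.0f} GB, total ~{weights + kv:.0f} GB")
```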
Model Instance Deployment
The tool intelligently determines how to deploy your model across GPUs and servers (a simplified decision rule is sketched after this list):
- Small Models: Models using <80% of single GPU memory run 1 instance per GPU for maximum throughput
- Medium Models: Models using >80% of GPU memory span 2 GPUs per instance to ensure stability
- Large Models: Models exceeding server capacity (default 8 GPUs) can span 2-4 servers per instance using tensor parallelism
- Multiple Instances: When workload demands exceed single instance capacity, the tool calculates optimal instance count and distribution
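The placement logic can be pictured roughly as below. The 80% threshold and 8-GPU server size come from the description above; treating the medium tier as exactly 2 GPUs and sizing large instances in whole servers are simplifying assumptions, and the real tool also weighs throughput targets when choosing instance counts.

```python
import math


def gpus_per_instance(model_mem_gb: float, gpu_mem_gb: float = 80.0,
                      gpus_per_server: int = 8) -> dict:
    """Simplified placement rule mirroring the tiers described above."""
    if model_mem_gb < 0.8 * gpu_mem_gb:
        # Small model: one instance per GPU for maximum throughput.
        return {"gpus": 1, "servers": 1}
    if model_mem_gb <= 2 * gpu_mem_gb:
        # Medium model: span 2 GPUs for headroom and stability.
        return {"gpus": 2, "servers": 1}
    # Large model: span whole servers using tensor parallelism.
    servers = math.ceil(model_mem_gb / (gpus_per_server * gpu_mem_gb))
    return {"gpus": servers * gpus_per_server, "servers": servers}


print(gpus_per_instance(300))   # a ~300 GB model fills one 8-GPU server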
Throughput Estimation
Throughput estimates assume memory bandwidth is the bottleneck for autoregressive inference (typical for LLMs). As illustrated in the example after this list, the tool:
- Calculates theoretical tokens/second based on GPU memory bandwidth and KV cache size
- Applies a configurable efficiency modifier (default 70%) to account for real-world batching and scheduling overhead
- Scales linearly with GPU count for multi-GPU deployments
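A back-of-the-envelope version of that estimate, assuming every decode step streams the weights plus the active KV cache from memory once. The 70% efficiency default comes from the description above; the bandwidth and batch figures are purely illustrative, and the planner's exact formula may differ.

```python
def decode_tokens_per_second(total_bandwidth_gbs: float, weights_gb: float,
                             kv_cache_gb: float, concurrent_requests: int,
                             efficiency: float = 0.70) -> float:
    """Memory-bandwidth-bound decode estimate: one step reads weights + KV cache
    and produces one token per in-flight request."""
    gb_read_per_step = weights_gb + kv_cache_gb
    steps_per_second = efficiency * total_bandwidth_gbs / gb_read_per_step
    return steps_per_second * concurrent_requests


# Example: 2 GPUs at ~3,350 GB/s each (bandwidth scales with GPU count),
# 140 GB of weights, 20 GB of KV cache, 32 concurrent requests.
print(f"{decode_tokens_per_second(2 * 3350, 140, 20, 32):.0f} tokens/s")
```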
Using the GenAI Sizing Calculator
The GenAI Sizing Calculator can pull model configurations directly from Hugging Face or accept manual input (a minimal fetch example follows the list):
- Automated: Select from predefined models or enter any Hugging Face model ID to automatically load parameters, architecture details, and calculate accurate memory requirements
- Manual: Enter parameter count and optionally paste config.json for custom or private models
- Precision Control: Configure model weights and KV cache precision independently for optimization
- Push to TCO: Send your sizing results directly to the TCO Calculator to see infrastructure costs
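Under the hood, the automated path amounts to reading the model's config.json from the Hugging Face Hub, roughly like the snippet below. The model ID is just an example of a public model; gated or private models require an authorization token, and the exact fields available vary by architecture.

```python
import json
import urllib.request


def load_hf_config(model_id: str) -> dict:
    """Fetch a public model's config.json from the Hugging Face Hub."""
    url = f"https://huggingface.co/{model_id}/resolve/main/config.json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


cfg = load_hf_config("Qwen/Qwen2.5-7B-Instruct")   # example public model
print(cfg.get("num_hidden_layers"), cfg.get("hidden_size"),
      cfg.get("num_attention_heads"), cfg.get("num_key_value_heads"))
```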
Vector Database Sizing
The Vector DB Sizing tool helps estimate storage and memory requirements for vector databases used in RAG applications (a rough formula is sketched after this list):
- Calculate storage from document corpus size or direct vector count
- Account for index overhead (HNSW, IVF, DiskANN, etc.)
- Estimate RAM requirements based on memory mode (all-in-RAM, hybrid, or DiskANN)
- Include replication, backup, and metadata in sizing calculations
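A simplified version of the storage math, assuming float32 embeddings; the index overhead, metadata size, and replication factor below are illustrative placeholders, not the tool's defaults.

```python
def vector_store_gb(num_vectors: int, dims: int, bytes_per_dim: int = 4,
                    index_overhead: float = 0.5, replication: int = 2,
                    metadata_bytes_per_vector: int = 256) -> float:
    """Rough storage estimate: raw vectors + index overhead (e.g. HNSW graph links)
    + per-vector metadata, multiplied by the replication factor."""
    raw = num_vectors * dims * bytes_per_dim
    total = (raw * (1 + index_overhead)
             + num_vectors * metadata_bytes_per_vector) * replication
    return total / 1e9


# Example: 10M chunks embedded at 1,536 dimensions, HNSW index, 2 replicas.
print(f"{vector_store_gb(10_000_000, 1536):.1f} GB")
```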
TCO Methodology
What is Total Cost of Ownership?
Total Cost of Ownership (TCO) provides a comprehensive financial comparison between on-premises and cloud AI infrastructure over a specified time period (1-5 years). The tool calculates all costs including hardware, energy, personnel, facilities, and operations to help you make informed infrastructure decisions.
On-Premises Cost Components
The tool calculates on-premises costs from granular hardware specifications (a simplified annual roll-up is sketched after this list):
- Hardware (CAPEX): GPU servers, control servers, storage servers, network switches, and optical transceivers - each with customizable unit costs
- Energy Costs: Power consumption calculated from GPU type and server count, with configurable PUE (Power Usage Effectiveness) for cooling overhead
- Facilities: Rack space costs calculated from power and space requirements, accounting for rack power limits (default 30kW per rack)
- Personnel: Platform engineers and network engineers with fully loaded costs (includes 30% overhead for benefits)
- Maintenance: Annual hardware maintenance as percentage of hardware cost (default 10%)
- Software: Annual costs for data platform, monitoring, and management tools
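Putting those line items together, an annualized on-prem cost looks roughly like the sketch below. The 30% benefits overhead and 10% maintenance default come from the list above; every other parameter is a placeholder you would replace with your own figures.

```python
def onprem_annual_cost(hardware_capex: float, years: int,
                       it_power_kw: float, pue: float, power_cost_per_kwh: float,
                       racks: int, rack_cost_per_month: float,
                       fte_count: int, avg_salary: float,
                       maintenance_rate: float = 0.10,
                       software_annual: float = 0.0) -> float:
    """Illustrative annual on-prem cost: amortized CAPEX plus OPEX line items."""
    capex_per_year = hardware_capex / years
    energy = it_power_kw * pue * 24 * 365 * power_cost_per_kwh   # PUE covers cooling
    facilities = racks * rack_cost_per_month * 12
    personnel = fte_count * avg_salary * 1.30                    # fully loaded (30% overhead)
    maintenance = hardware_capex * maintenance_rate
    return capex_per_year + energy + facilities + personnel + maintenance + software_annual
```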
Cloud Cost Components
Cloud costs are calculated based on 24/7 operation with configurable adjustments (a simplified monthly roll-up follows the list):
- Compute: Instance pricing pulled from live AWS pricing API, matching GPU type to appropriate instance family (p5.48xlarge for H100, p6-b200.48xlarge for B200, etc.)
- Storage: Separate pricing for EBS (block storage), S3 (object storage), FSx for Lustre (high-performance file system), and EFS (network file system)
- Networking: Data transfer costs for inter-AZ traffic and internet egress, calculated from monthly transfer volumes
- Discounts: Configurable discount percentage applied to all cloud services (compute, storage, networking, software)
- Personnel: Cloud engineers for infrastructure management (typically fewer FTEs than on-prem)
- Optional Services: SageMaker (15% overhead on compute) and configurable enterprise support (% of total cloud spend)
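A minimal monthly roll-up under the same assumptions: 24/7 compute, the discount applied across services, SageMaker as a 15% uplift on compute, and support as a percentage of cloud spend. The 730-hour month, 15% discount, and 3% support rate are illustrative values, not the tool's defaults.

```python
def cloud_monthly_cost(instance_hourly_usd: float, instance_count: int,
                       storage_monthly_usd: float, network_monthly_usd: float,
                       discount: float = 0.15, use_sagemaker: bool = False,
                       support_rate: float = 0.03,
                       personnel_monthly_usd: float = 0.0) -> float:
    """Illustrative 24/7 cloud bill built from the components listed above."""
    compute = instance_hourly_usd * instance_count * 730      # ~hours per month
    if use_sagemaker:
        compute *= 1.15                                        # 15% overhead on compute
    services = (compute + storage_monthly_usd + network_monthly_usd) * (1 - discount)
    support = services * support_rate                          # % of total cloud spend
    return services + support + personnel_monthly_usd
```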
Networking Calculations
On-premises networking is calculated based on cluster size using a spine-leaf architecture (switch counts are sketched after this list):
- Small clusters (≤16 GPU servers): 2 spine switches
- Larger clusters: Leaf switches (1 per 8 GPU servers) + spine switches (half the number of leaf switches, minimum 2)
- Very large clusters (>512 servers): Additional super-spine layer
- Optics are calculated from switch ports (96 effective ports per switch, 2 optics per connection), assuming a 400G connection to each GPU NIC; each 800G switch port uses a splitter to provide two 400G connections
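The switch-count rules translate to something like the following. The super-spine layer for clusters over 512 servers is omitted, and the optics figure is an upper bound from fully populated ports rather than the tool's exact cabling model.

```python
import math


def switch_counts(gpu_servers: int) -> dict:
    """Spine-leaf switch counts following the rules above (super-spine omitted)."""
    if gpu_servers <= 16:
        return {"leaf": 0, "spine": 2}
    leaf = math.ceil(gpu_servers / 8)
    spine = max(2, math.ceil(leaf / 2))
    return {"leaf": leaf, "spine": spine}


def optics_upper_bound(total_switches: int, ports_per_switch: int = 96,
                       optics_per_connection: int = 2) -> int:
    """Optics count if every effective switch port is populated."""
    return total_switches * ports_per_switch * optics_per_connection


counts = switch_counts(32)   # 32 GPU servers -> 4 leaf, 2 spine
print(counts, optics_upper_bound(sum(counts.values())))
```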
Key Decision Factors
- Analysis Period: Longer periods (3-5 years) allow on-prem hardware costs to amortize. The tool shows the break-even point and cumulative costs over time (a simple break-even check is sketched below).
- Utilization: The tool assumes 24/7 operation by default. Cloud costs scale linearly with usage, while on-prem costs are fixed regardless of utilization.
- Hardware Pricing: All hardware unit costs are customizable. Default values reflect enterprise pricing but can be adjusted based on your vendor negotiations.
- Facilities: Rack costs, power pricing ($/kWh), and PUE are all configurable to match your specific data center environment.
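For intuition, a cumulative break-even check can be as simple as the loop below; all dollar amounts are made-up illustrations, not defaults from the tool.

```python
def break_even_month(onprem_capex: float, onprem_monthly_opex: float,
                     cloud_monthly: float, horizon_months: int = 60):
    """First month where cumulative on-prem cost drops below cumulative cloud cost."""
    for month in range(1, horizon_months + 1):
        onprem = onprem_capex + onprem_monthly_opex * month
        cloud = cloud_monthly * month
        if onprem <= cloud:
            return month
    return None   # no break-even within the analysis period


print(break_even_month(onprem_capex=4_000_000, onprem_monthly_opex=120_000,
                       cloud_monthly=350_000))   # -> 18 with these example numbers
```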
Using the TCO Calculator
The TCO Calculator workflow:
- Select Workload Preset: Start with predefined cluster sizes (Small: 2 nodes, Medium: 8 nodes, Large: 16 nodes, XL: 32 nodes) or customize
- Configure GPU Hardware: Adjust GPU server count, GPU type, and server specifications. Networking is auto-calculated but customizable.
- Set On-Premises Parameters: Configure facilities (power cost, PUE, rack costs), personnel (engineers, salaries), and maintenance rates
- Set Cloud Parameters: Choose region, configure storage options (S3, FSx, EFS), set discounts, add optional services (SageMaker, enterprise support)
- Choose Analysis Period: Select 1-5 years to see how costs evolve and where the break-even point occurs
- Review Results: Detailed breakdown shows cost by category, break-even analysis, and per-unit cloud pricing used in calculations
- Export PDF: Generate comprehensive reports for stakeholder review
Workload Presets
Presets provide starting points for common AI cluster configurations with pre-calculated networking and support infrastructure:
- Small Cluster (2 nodes): 16 GPUs, minimal control plane, 2 switches
- Medium Cluster (8 nodes): 64 GPUs, production control plane (5 nodes), storage servers
- Large Cluster (16 nodes): 128 GPUs, full spine-leaf networking
- XL Cluster (32 nodes): 256 GPUs, enterprise-scale infrastructure
💡 Pro Tip
The detailed view in the results page shows the actual per-unit cloud pricing used in calculations (pulled from AWS pricing API). This transparency helps you verify calculations and understand cost drivers. All pricing updates automatically when you change GPU types or regions.
Additional Resources
For more information or to provide feedback on the platform, please use the feedback link in the sidebar navigation.
This tool is designed to provide guidance for infrastructure planning. Always validate assumptions and pricing with your specific vendors and requirements.