AI Infrastructure Planner

GenAI Sizing Calculator

Calculate the optimal GPU configuration for your GenAI workloads. Input your requirements and get recommendations for hardware sizing, expected throughput, and cost estimates.

Model Configuration

Memory Requirements

Model Weights: N/A (based on 16-bit precision)
KV Cache per Token: N/A (based on 16-bit precision)
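
The model-weights figure is simply the parameter count times the bytes per parameter at the selected precision. A minimal sketch of that calculation (the 70B example is illustrative, not tied to any model in the tool):

```python
def model_weights_gb(num_params: float, precision_bits: int = 16) -> float:
    """Weight memory in GB: parameters x bytes per parameter at the chosen precision."""
    return num_params * (precision_bits / 8) / 1e9

# Example: a 70B-parameter model at 16-bit precision needs ~140 GB for weights alone.
print(round(model_weights_gb(70e9), 1))  # 140.0
```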

Workload Requirements

GPU Selection

Uses the end-of-sequence KV cache size for the memory bandwidth calculation.

Efficiency (%): real-world efficiency modifier for throughput estimates.

GPUs per Server: number of GPUs in each server (default: 8 for DGX/HGX systems).

Performance Specifications

Compute (16-bit): 989 TFLOPS
Memory Bandwidth: 4.8 TB/s
Memory Capacity: 141 GB HBM3e
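
These figures feed the formulas under Key Assumptions below. A minimal sketch of how they might be carried as a record, using the values displayed above (the record shape and field names are assumptions, not the tool's actual data model):

```python
from dataclasses import dataclass

@dataclass
class GpuSpec:
    fp16_tflops: float         # dense 16-bit compute
    mem_bandwidth_tb_s: float  # HBM bandwidth
    mem_capacity_gb: float     # HBM capacity

# Values as displayed above for the currently selected GPU.
selected_gpu = GpuSpec(fp16_tflops=989.0, mem_bandwidth_tb_s=4.8, mem_capacity_gb=141.0)
```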

Sizing Results

Select a model to see GPU requirements.

Key Assumptions

Memory-First Sizing: GPU count determined by total memory requirements (model weights + KV cache), not compute requirements.
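
A minimal sketch of that rule, assuming the 141 GB per-GPU capacity shown above and no extra headroom reservation (function and parameter names are illustrative):

```python
import math

def gpus_needed(weights_gb: float, kv_cache_gb: float, gpu_memory_gb: float = 141.0) -> int:
    """Memory-first sizing: the GPU count comes from total memory, not compute."""
    total_gb = weights_gb + kv_cache_gb
    return max(1, math.ceil(total_gb / gpu_memory_gb))

# Example: 140 GB of weights plus 60 GB of KV cache needs 2 x 141 GB GPUs.
print(gpus_needed(140.0, 60.0))  # 2
```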

KV Cache Calculation: Uses actual model architecture details (layers, hidden size, attention heads) with support for GQA and MoE models.
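
A sketch of the per-token KV cache size from those fields; with GQA the key/value head count (num_key_value_heads in HuggingFace configs) is smaller than the attention head count, which shrinks the cache proportionally, while MoE mainly changes the weight and compute figures rather than this per-token value. The example config values are only illustrative:

```python
def kv_cache_bytes_per_token(num_layers: int, hidden_size: int, num_attention_heads: int,
                             num_kv_heads: int | None = None, precision_bits: int = 16) -> int:
    """Per-token KV cache: 2 (K and V) x layers x kv_heads x head_dim x bytes per element."""
    head_dim = hidden_size // num_attention_heads
    kv_heads = num_kv_heads or num_attention_heads  # plain MHA when there is no GQA
    return 2 * num_layers * kv_heads * head_dim * (precision_bits // 8)

# Example: 80 layers, 8192 hidden size, 64 attention heads, 8 KV heads, 16-bit cache.
print(kv_cache_bytes_per_token(80, 8192, 64, 8))  # 327680 bytes, i.e. 320 KiB per token
```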

Memory-Bandwidth Bound: Throughput estimates assume memory bandwidth is the bottleneck, using the average KV-cache traffic over a generation (sequence_length / 2 tokens' worth) during autoregressive decode.

Real-World Efficiency: Throughput includes configurable efficiency modifier (default 70%) to account for batching and scheduling overhead.
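
A sketch combining the two items above, assuming bandwidth-bound decode where each generated token re-reads the full weights plus, on average, sequence_length / 2 tokens' worth of KV cache, scaled by the efficiency modifier (names and example numbers are illustrative):

```python
def decode_tokens_per_sec(weights_bytes: float, kv_bytes_per_token: float, sequence_length: int,
                          mem_bandwidth_bytes_per_sec: float, efficiency: float = 0.7) -> float:
    """Bandwidth-bound decode: each generated token re-reads the weights plus,
    averaged over the generation, sequence_length / 2 tokens' worth of KV cache."""
    bytes_per_token = weights_bytes + kv_bytes_per_token * (sequence_length / 2)
    return efficiency * mem_bandwidth_bytes_per_sec / bytes_per_token

# Example: 140 GB of weights, 320 KiB KV per token, 8K sequence, 4.8 TB/s bandwidth.
print(round(decode_tokens_per_sec(140e9, 327_680, 8192, 4.8e12), 1))  # ~23.8 tokens/s
```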

TTFT (Prefill) - Per Request: Compute-bound, using the formula (2 × parameters × input_tokens) / (TFLOPS per GPU × GPUs_per_instance × efficiency). Uses the per-instance GPU count.
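
The same formula as a sketch, with the TFLOPS-to-FLOP/s conversion made explicit; the 70B model and 8-GPU instance are illustrative:

```python
def ttft_seconds(num_params: float, input_tokens: int, tflops_per_gpu: float,
                 gpus_per_instance: int, efficiency: float = 0.7) -> float:
    """Prefill is compute-bound: roughly 2 FLOPs per parameter per input token."""
    flops_needed = 2 * num_params * input_tokens
    flops_per_sec = tflops_per_gpu * 1e12 * gpus_per_instance * efficiency
    return flops_needed / flops_per_sec

# Example: 70B parameters, 1024 input tokens, 989 TFLOPS/GPU, 8 GPUs per instance.
print(round(ttft_seconds(70e9, 1024, 989, 8), 3))  # ~0.026 s
```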

TPOT (Decode) - Per Request: Memory-bound based on cluster token throughput divided by concurrent requests: 1 / (cluster_tokens_per_sec / concurrent_requests).

End-to-End Time - Per Request: Total request completion time = TTFT + (TPOT × output_tokens).
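
A sketch of the two per-request figures above; TPOT is the cluster's token throughput shared evenly across all in-flight requests, and the example numbers are illustrative:

```python
def tpot_seconds(cluster_tokens_per_sec: float, concurrent_requests: int) -> float:
    """Per-output-token latency when decode throughput is shared across requests."""
    return 1.0 / (cluster_tokens_per_sec / concurrent_requests)

def end_to_end_seconds(ttft: float, tpot: float, output_tokens: int) -> float:
    """Total request time: prefill once, then one TPOT per generated token."""
    return ttft + tpot * output_tokens

# Example: 2000 cluster tokens/s shared by 50 requests -> 25 ms/token; 512 output tokens.
print(round(end_to_end_seconds(0.026, tpot_seconds(2000, 50), 512), 2))  # ~12.83 s
```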

Cluster Throughput: Requests/sec = (concurrent_requests × 1000) / time_per_request_ms. Effective output tokens/sec = requests/sec × output_tokens per request.
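
The same relationship as a sketch; with time_per_request expressed in milliseconds this is effectively Little's law (throughput = in-flight requests / latency):

```python
def cluster_throughput(concurrent_requests: int, time_per_request_ms: float,
                       output_tokens: int) -> tuple[float, float]:
    """Requests/sec and effective output tokens/sec for the whole cluster."""
    requests_per_sec = concurrent_requests * 1000.0 / time_per_request_ms
    return requests_per_sec, requests_per_sec * output_tokens

# Example: 50 concurrent requests finishing in ~12830 ms each, 512 output tokens apiece.
print(cluster_throughput(50, 12_830, 512))  # (~3.9 requests/s, ~1995 output tokens/s)
```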

Concurrent Requests: Input value represents total cluster-wide concurrent load, not per-instance load.

Model Parameters: Extracted directly from HuggingFace model configs when available, with manual input fallback.
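
The calculator's exact lookup isn't shown; one way to pull the relevant fields is transformers.AutoConfig, sketched below (field names follow Llama-style configs and vary by architecture, which is why the manual-input fallback exists):

```python
from transformers import AutoConfig

def fetch_architecture(model_id: str) -> dict:
    """Grab the config fields the sizing math needs; fields missing from a given
    config (e.g. num_key_value_heads on non-GQA models) get sensible defaults."""
    cfg = AutoConfig.from_pretrained(model_id)
    return {
        "num_layers": cfg.num_hidden_layers,
        "hidden_size": cfg.hidden_size,
        "num_attention_heads": cfg.num_attention_heads,
        "num_kv_heads": getattr(cfg, "num_key_value_heads", cfg.num_attention_heads),
    }
```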

Precision Support: Supports 4-bit through 32-bit precision for both model weights and KV cache independently.
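
Precision only changes the bytes-per-element factor used in the memory formulas above. A trivial helper, assuming the supported widths are the power-of-two values 4, 8, 16, and 32 bits:

```python
def bytes_per_element(precision_bits: int) -> float:
    """Weights and KV cache can each use a different precision."""
    if precision_bits not in (4, 8, 16, 32):
        raise ValueError("expected 4-, 8-, 16-, or 32-bit precision")
    return precision_bits / 8
```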

Model Instance Deployment: Small models use 1 instance per GPU. Models using >80% of GPU memory span 2 GPUs. Large models can span up to 4 servers per instance.

Server Configuration: Default 8 GPUs per server. Large models span multiple servers (2-4) per instance. Smaller models achieve better utilization with multiple instances per server.
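
A rough sketch of the placement rules in the two items above; the 80% single-GPU threshold, the 2-GPU step, the 8-GPU server default, and the 4-server cap come from the text, while the intermediate sizing is an assumption:

```python
import math

def gpus_per_instance(instance_mem_gb: float, gpu_mem_gb: float = 141.0,
                      gpus_per_server: int = 8, max_servers: int = 4) -> int:
    """Placement sketch: 1 GPU while under ~80% of its memory, then 2 GPUs,
    then as many GPUs as needed up to 4 servers' worth for the largest models."""
    if instance_mem_gb <= 0.8 * gpu_mem_gb:
        return 1
    if instance_mem_gb <= 2 * gpu_mem_gb:
        return 2
    max_gpus = max_servers * gpus_per_server
    return min(max_gpus, math.ceil(instance_mem_gb / gpu_mem_gb))

# Example: ~600 GB per instance lands on 5 GPUs, still within a single 8-GPU server.
print(gpus_per_instance(600.0))  # 5
```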

Roadmap

Launch v1.0!

Multiple model instance support

Add Time to First Token (TTFT)

Add Time Per Output Token (TPOT)

Differentiate between prefill and decode timing

Add additional GPU options

Add use case templates

Mixed LLM workloads