LLM VRAM & Time Estimator
Plan memory and runtime for full fine-tuning, LoRA, or RL (GRPO).

Hugging Face Model ID (e.g., meta-llama/Llama-2-7b-hf)
  Default: meta-llama/Llama-2-7b-hf
Training Mode: fp32 | fp16 | lora
Compute Precision (weights/grads): fp32 | fp16
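
The precision choice fixes the bytes per parameter for the resident weights and, in full fine-tuning, their gradients (in lora mode gradients cover only the adapters). A minimal sketch of that baseline term, assuming the standard 4 bytes for fp32 and 2 bytes for fp16 and decimal gigabytes; how the tool computes it internally is an assumption here:

    # Baseline memory for model weights plus same-precision gradients.
    # Assumption: 4 bytes/param fp32, 2 bytes/param fp16, decimal GB.
    BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

    def weights_and_grads_gb(n_params: int, precision: str) -> float:
        b = BYTES_PER_PARAM[precision]
        return n_params * b * 2 / 1e9  # x2: weights + gradients

    print(weights_and_grads_gb(7_000_000_000, "fp16"))  # 28.0 GB for a 7B model
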

LoRA Settings
  LoRA Rank (r): 1–256
  LoRA Alpha (scaling): 1–256
  Target module name filters (comma-separated)
    Default: q_proj,k_proj,v_proj,o_proj,up_proj,down_proj,gate_proj
  Show per-layer LoRA table
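
The per-layer table follows from the LoRA factorization itself: each adapted linear layer gains two small matrices. A sketch assuming every matched module is a linear layer of shape (d_out, d_in); for a real checkpoint the shapes would come from the model config:

    # LoRA adds A (r x d_in) and B (d_out x r) per targeted linear layer,
    # so each module contributes r * (d_in + d_out) trainable params.
    # Alpha only rescales the update (alpha / r); it adds no parameters.
    def lora_params_per_module(d_in: int, d_out: int, r: int) -> int:
        return r * (d_in + d_out)

    # Example: a 4096x4096 q_proj (Llama-2-7B-sized) at r=8.
    p = lora_params_per_module(4096, 4096, r=8)
    print(p)                          # 65536 trainable params
    print(p * 2 / 2**20, "MiB fp16")  # 2 bytes/param -> 0.125 MiB
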

Batch & Sequence
  Max Sequence Length: 128–16384
  Micro-batch Size (per GPU): 1–1024
  Gradient Accumulation Steps: 1–1024
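
Together with world size, these three inputs set how many tokens each optimizer step consumes; a sketch of the usual arithmetic (whether the tool reports this exact quantity is an assumption):

    # Tokens consumed per optimizer step across all GPUs.
    def tokens_per_step(seq_len: int, micro_batch: int,
                        grad_accum: int, world_size: int) -> int:
        return seq_len * micro_batch * grad_accum * world_size

    print(tokens_per_step(4096, 4, 8, 8))  # 1048576 tokens (~1M) per step
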

Optimizer & Activations
  Optimizer
  Keep FP32 master weights (mixed precision)
  Activation Checkpointing
  Activation Checkpointing Reduction Factor: 0.3–1
  Activation Memory Multiplier (heuristic): 0.5–4
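
Optimizer state usually dominates full fine-tuning. The byte counts below are the standard Adam mixed-precision figures (fp32 master copy plus two fp32 moments); the activation formula is only a stand-in for whatever heuristic the multiplier and reduction factor actually feed:

    # Adam with fp32 master weights: 4 (master) + 4 (m) + 4 (v) = 12 B/param.
    def optimizer_state_gb(n_params: int, fp32_master: bool = True) -> float:
        bytes_per = 8 + (4 if fp32_master else 0)
        return n_params * bytes_per / 1e9

    # Hypothetical activation heuristic: scales with batch*seq*hidden*layers,
    # then shrinks by the checkpointing reduction factor.
    def activation_gb(micro_batch, seq_len, hidden, n_layers,
                      multiplier=1.0, ckpt_factor=1.0, bytes_per=2):
        return (multiplier * ckpt_factor * micro_batch * seq_len
                * hidden * n_layers * bytes_per) / 1e9

    print(optimizer_state_gb(7_000_000_000))            # 84.0 GB for 7B params
    print(activation_gb(4, 4096, 4096, 32, 1.0, 0.3))   # ~1.3 GB
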

Parallelism / Memory Strategies
  World Size (number of GPUs): 1–256
  DeepSpeed ZeRO Stage: None | ZeRO-1 | ZeRO-2 | ZeRO-3
  Enable FSDP (Fully Sharded Data Parallel)
  FSDP Auto Wrap Policy
  CPU Offload (optimizer/params)
  Temp AllGather Overhead (fraction): 0–0.5
  Safety Headroom (fraction): 0–0.5
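
ZeRO stages shard progressively more state across the world size. The stage-to-state mapping below is standard ZeRO behavior; how the two overhead fractions multiply into this tool's estimate is an assumption of the sketch:

    # ZeRO-1 shards optimizer states, ZeRO-2 adds gradients,
    # ZeRO-3 adds parameters; FSDP full-shard behaves like stage 3.
    def sharded_total_gb(params_gb, grads_gb, optim_gb, zero_stage,
                         world_size, allgather_frac=0.0, headroom_frac=0.0):
        if zero_stage >= 1:
            optim_gb /= world_size
        if zero_stage >= 2:
            grads_gb /= world_size
        if zero_stage >= 3:
            params_gb /= world_size
        total = params_gb + grads_gb + optim_gb
        total *= 1.0 + allgather_frac          # temporary all-gather buffers
        return total * (1.0 + headroom_frac)   # safety headroom

    # 7B fp16 weights/grads (14 GB each) + Adam states (84 GB) on 8 GPUs:
    print(round(sharded_total_gb(14, 14, 84, 3, 8, 0.1, 0.1), 1))  # 16.9 GB
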

Reinforcement Learning (GRPO) Settings
  Enable RL with verifiable rewards (GRPO-like)
  Load Reference Model (same as policy)
  Reference Model Precision: fp16 | fp32
  Rollout Batch Size (per step): 1–8192
  Number of Rollouts per Step: 1–64
  Reward Model Size (B params): 0–20
  Verifier / Extra Overheads (GB): 0–40
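
Enabling RL adds resident memory on top of the policy: a frozen reference model for the KL term, optionally a reward model, and any verifier overhead. A sketch assuming all extra models sit on-GPU at the chosen precision; rollout batch size and rollouts per step mainly drive generation-time activation and KV-cache memory, which this sketch does not model:

    # Extra resident memory for GRPO-like RL, beyond the policy itself.
    PREC_BYTES = {"fp16": 2, "fp32": 4}

    def rl_extra_gb(policy_params_b: float, load_reference: bool,
                    ref_precision: str, reward_model_b: float = 0.0,
                    verifier_gb: float = 0.0) -> float:
        ref = policy_params_b * PREC_BYTES[ref_precision] if load_reference else 0.0
        reward = reward_model_b * PREC_BYTES["fp16"]  # assume fp16 reward model
        return ref + reward + verifier_gb

    # 7B policy with an fp16 reference, 1B reward model, 2 GB verifier:
    print(rl_extra_gb(7, True, "fp16", 1, 2))  # 14 + 2 + 2 = 18.0 GB
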

Training Time & Cost
  Total Tokens to Train (Billions): 0.1–200
  Throughput (tokens/sec) per GPU: 1000–500000
  Cost per GPU-hour (USD): 0–100
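
The time and cost fields combine in the obvious way; whether the tool applies exactly this formula is an assumption, but the arithmetic is worth seeing once:

    # Wall-clock time and cost from aggregate throughput.
    def time_and_cost(total_tokens_b: float, tok_per_sec_per_gpu: float,
                      world_size: int, usd_per_gpu_hour: float):
        seconds = total_tokens_b * 1e9 / (tok_per_sec_per_gpu * world_size)
        wall_hours = seconds / 3600
        cost = wall_hours * world_size * usd_per_gpu_hour
        return wall_hours, cost

    hours, usd = time_and_cost(10, 3000, 8, 2.0)
    print(f"{hours:.1f} h wall-clock, ${usd:,.0f}")  # 115.7 h, $1,852
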
Estimate VRAM & Time

Outputs
  VRAM Breakdown (after sharding + overhead): table with columns component, GB
  Per-layer LoRA Params: table with columns layer, lora_params, lora_params_millions, memory_fp16_MB
  Raw VRAM JSON
  Raw Time/Cost JSON