MLIP Inference
Inference throughput benchmarks for ML interatomic potentials on A100 GPUs
All throughput numbers come from real benchmarks run with TorchSim on Modal A100 GPUs. CPS (Combined Performance Score) values are from matbench-discovery and are shown only where the exact checkpoint appears on the leaderboard; the ORB-v3 CPS refers to the conservative-inf-mpa checkpoint. Throughput bars are scaled relative to the fastest model. Source: mlip-inference-bench.
Key findings
ORB-v3-Direct: fastest leaderboard model
24k atoms/s at 42.7 ms/step with a 1.2 GB peak-memory footprint: roughly 2x faster than ORB-v3-Conservative and 31x faster than PET-OAM-XL.
NequIP-OAM-S: fastest small model
34k atoms/s with near-perfect batch parallelism: 29.9 ms/step for a batch of 16 vs. 29.3 ms/step for a single system (see the worked numbers after this list). Peak memory is only 531 MB.
Accuracy vs speed
PET-OAM-XL has the highest benchmarked CPS (0.898) but is 31x slower than ORB-v3-Direct.
XL models nearly exhaust A100 memory
NequIP-OAM-XL and PET-OAM-XL peak at 30-35 GB, close to saturating the A100's 40 GB.
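The headline figures above follow directly from the per-step latencies and the 64-atom, 16-copy batch described under Methodology. A minimal sanity-check sketch (plain arithmetic, all values taken from the results above):

```python
# Worked arithmetic behind the key findings
# (64 atoms/system x 16 systems/batch; see Methodology).
atoms_per_batch = 64 * 16                  # 1024 atoms per forward pass

orb_v3_direct = atoms_per_batch / 42.7e-3  # ~23,980 atoms/s -> "24k"
nequip_oam_s = atoms_per_batch / 29.9e-3   # ~34,250 atoms/s -> "34k"

# "Near-perfect batch parallelism": a single system takes 29.3 ms and a
# batch of 16 takes 29.9 ms, i.e. ~15.7x of the ideal 16x speedup.
batched_speedup = 16 * 29.3 / 29.9         # ~15.7
```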
Accuracy vs speed (Pareto front)
Only models with both a CPS score and benchmark results are plotted. A point is on the Pareto front if no other model is both faster and more accurate; those points offer the best available accuracy-speed tradeoffs.
Detailed charts
Per-model charts: throughput (atoms/s), latency (ms/step), peak GPU memory (GB), and batched speedup vs. single-system latency.
Methodology
Each model runs forward passes on a 64-atom FCC copper supercell using TorchSim's batched API on A100 GPUs via Modal, with a batch of 16 independent copies of the supercell. Ten warmup steps are excluded, then 100 steps are timed, with torch.cuda.synchronize() called before and after the timed loop.
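For reference, a minimal sketch of this timing protocol in plain PyTorch is below. The generic `model(batch)` callable and the batch construction are placeholders (TorchSim's actual batched API and model wrappers are not shown), and the peak-memory calls are an assumption about how memory could be tracked, not a statement of how this benchmark measures it.

```python
# Sketch of the timing protocol, assuming a generic `model(batch)` callable.
import time
import torch
from ase.build import bulk

N_WARMUP, N_TIMED, N_COPIES = 10, 100, 16

# 64-atom FCC copper supercell: 4x4x4 repeat of the 1-atom primitive cell.
atoms = bulk("Cu", "fcc", a=3.615).repeat((4, 4, 4))
assert len(atoms) == 64

def benchmark(model, batch):
    """Time `model` on a batch of N_COPIES independent copies of `atoms`."""
    for _ in range(N_WARMUP):               # warmup steps, excluded from timing
        model(batch)
    torch.cuda.synchronize()                # drain queued kernels before timing
    torch.cuda.reset_peak_memory_stats()    # (assumed) peak-memory tracking
    start = time.perf_counter()
    for _ in range(N_TIMED):
        model(batch)
    torch.cuda.synchronize()                # wait for the final step to finish
    elapsed = time.perf_counter() - start

    ms_per_step = 1e3 * elapsed / N_TIMED
    atoms_per_s = len(atoms) * N_COPIES * N_TIMED / elapsed
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    return ms_per_step, atoms_per_s, peak_gb
```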
Models
| Checkpoint | Architecture | Origin |
|---|---|---|
| EquiformerV3+DeNS-OAM | Equivariant transformer | Atomic Architects |
| NequIP-OAM-XL / S | E(3)-equivariant message passing | MIR Group (Harvard) |
| PET-OAM-XL / MAD-S | Point Edge Transformer | COSMO Lab (EPFL) |
| ORB-v3 | Graph network | Orbital Materials |
| UMA-S-1p1 | Universal Model for Atoms | FAIR (Meta) |
| eSEN | Scalable E(3) network | FAIR (Meta) |