MLIP Inference
Inference throughput benchmarks for ML interatomic potentials on A100 GPUs
All numbers are from real benchmarks on Modal A100 GPUs using TorchSim (batched) or ASE calculators (single-system). CPS scores are from matbench-discovery. Source: mlip-inference-bench.
CPS = Combined Performance Score from matbench-discovery; shown where the exact checkpoint is on the leaderboard. ORB-v3 CPS is for the conservative-inf-mpa checkpoint. Models marked (ASE) are benchmarked via ASE calculators (single-system only, no batching). Throughput bars are relative to the fastest model.
Key findings
ORB-v3-Direct: fastest leaderboard model
24k atoms/s at 42.7 ms/step with 1.2 GB. 2x faster than ORB-v3-Conservative and 31x faster than PET-OAM-XL.
NequIP-OAM-S: fastest small model
34k atoms/s with near-perfect batch parallelism (29.9 ms batched vs 29.3 ms single). Only 531 MB.
ORB-v3-Direct: fast in both frameworks
14.4 ms/step via ASE vs 16.8 ms/step via TorchSim. 14x faster than EquiformerV3 in an apples-to-apples ASE comparison.
EquiformerV3 vs AllScAIP
Both run at ~180-200 ms/step via ASE, but AllScAIP uses only 1.2 GB vs EquiformerV3's 8.2 GB. EquiformerV3 leads in CPS (0.902).
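The throughput and latency figures quoted in the findings above are linked by simple arithmetic. The sketch below assumes the setup described under Methodology (64 atoms per system, batch size 16) and assumes that "batched speedup" means batched throughput divided by single-system throughput; the function names are illustrative and are not taken from mlip-inference-bench.

```python
def throughput_atoms_per_s(latency_ms: float, n_atoms: int = 64, batch_size: int = 16) -> float:
    """Atoms processed per second for one step over `batch_size` systems of `n_atoms` each."""
    return batch_size * n_atoms / (latency_ms / 1e3)


def batched_speedup(single_ms: float, batched_ms: float, batch_size: int = 16) -> float:
    """Assumed definition: batched throughput divided by single-system throughput."""
    return batch_size * single_ms / batched_ms


# ORB-v3-Direct, batched: 42.7 ms/step
print(round(throughput_atoms_per_s(42.7)))  # 23981, i.e. ~24k atoms/s

# NequIP-OAM-S, batched: 29.9 ms/step
print(round(throughput_atoms_per_s(29.9)))  # 34247, i.e. ~34k atoms/s

# NequIP-OAM-S: 29.3 ms single vs 29.9 ms batched
print(round(batched_speedup(29.3, 29.9), 1))
```

Under that assumed definition, NequIP-OAM-S reaches a speedup of about 15.7 out of an ideal 16, which is what "near-perfect batch parallelism" amounts to numerically.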
Accuracy vs speed (Pareto front)
Only models with both a CPS and benchmark results are plotted. EquiformerV3 uses single-system ASE data (no batching available). Points on the Pareto front offer the best accuracy-speed tradeoff: no other model is both faster and more accurate.
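A Pareto front over speed and accuracy can be computed with a single sorted pass. The sketch below is an illustration only: it assumes speed is measured as throughput (higher is better) and accuracy as CPS (higher is better), and the model names and numbers are made-up placeholders, not benchmark results.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (name, throughput, score); higher is better for both


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both throughput and score."""
    front: List[Point] = []
    best_score = float("-inf")
    # Visit the fastest points first (ties: higher score first). A point is kept
    # only if it is more accurate than every faster point seen so far.
    for point in sorted(points, key=lambda p: (-p[1], -p[2])):
        if point[2] > best_score:
            front.append(point)
            best_score = point[2]
    return front


# Placeholder data for illustration only.
models = [
    ("model-a", 100.0, 0.50),
    ("model-b", 50.0, 0.70),
    ("model-c", 40.0, 0.60),  # dominated by model-b (slower and less accurate)
    ("model-d", 10.0, 0.90),
]
print([name for name, _, _ in pareto_front(models)])  # ['model-a', 'model-b', 'model-d']
```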
Detailed charts
Charts: Throughput (atoms/s), Latency (ms/step), Peak GPU Memory (GB), Batched Speedup vs Single.
Profiling: Inference Bottlenecks
Profiled with torch.profiler on an A100 (64-atom FCC Cu, 20 profiled steps after 5 warmup steps). Top CUDA operations are listed by time.
NequIP-OAM-S — 176.6 ms total CUDA time
Methodology
Each model runs forward passes on a 64-atom FCC copper supercell on A100 GPUs via Modal. Most models use TorchSim’s batched API (batch size 16). Models marked (ASE) use ASE calculators for single-system inference: EquiformerV3 via OCPCalculator, ORB-v3-Direct via ORBCalculator, AllScAIP via FAIRChemCalculator. The first 10 steps are warmup and are excluded; the following 100 steps are timed with torch.cuda.synchronize(), so that asynchronous GPU work is included in the measurement. Positions are perturbed each step to prevent calculator caching.
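The timing procedure above can be sketched as a small harness. This is a minimal illustration under stated assumptions, not the code used in mlip-inference-bench: `step_fn`, `sync_fn`, `perturb`, and the displacement scale are hypothetical names and values, and positions are plain Python tuples so that the sketch runs without a GPU. On a GPU, `sync_fn` would be `torch.cuda.synchronize`.

```python
import random
import time
from typing import Callable, List, Tuple

Position = Tuple[float, float, float]


def perturb(positions: List[Position], rng: random.Random, scale: float = 0.01) -> List[Position]:
    """Return a copy of `positions` with a small uniform random displacement per coordinate."""
    return [
        (x + rng.uniform(-scale, scale), y + rng.uniform(-scale, scale), z + rng.uniform(-scale, scale))
        for x, y, z in positions
    ]


def mean_latency_ms(
    step_fn: Callable[[List[Position]], object],
    positions: List[Position],
    n_warmup: int = 10,
    n_timed: int = 100,
    sync_fn: Callable[[], None] = lambda: None,  # pass torch.cuda.synchronize on a GPU
    seed: int = 0,
) -> float:
    """Run warmup steps untimed, then return the mean wall-clock latency of the timed steps."""
    rng = random.Random(seed)
    for _ in range(n_warmup):
        step_fn(perturb(positions, rng))  # excluded from timing
    sync_fn()

    latencies_ms = []
    for _ in range(n_timed):
        # Fresh displacements of the base structure each step, so a calculator
        # cannot return a cached result for unchanged positions.
        new_positions = perturb(positions, rng)
        sync_fn()  # make sure no earlier GPU work is still pending
        start = time.perf_counter()
        step_fn(new_positions)
        sync_fn()  # wait for this step's GPU work before stopping the clock
        latencies_ms.append((time.perf_counter() - start) * 1e3)
    return sum(latencies_ms) / len(latencies_ms)
```

Perturbing outside the timed region keeps the cost of generating new positions out of the reported latency; whether the original benchmark does the same is not stated in the source.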
Models
| Checkpoint | Architecture | Origin |
|---|---|---|
| EquiformerV3+DeNS-OAM | Equivariant transformer | Atomic Architects |
| NequIP-OAM-XL / S | E(3)-equivariant message passing | MIR Group (Harvard) |
| PET-OAM-XL / MAD-S | Point Edge Transformer | COSMO Lab (EPFL) |
| ORB-v3 | Graph network | Orbital Materials |
| UMA-S-1p1 | Universal Model for Atoms | FAIR (Meta) |
| AllScAIP-MD-Conserving | eSCN-based | FAIR (Meta) |
| eSEN | Scalable E(3) network | FAIR (Meta) |