MLIP Inference

Inference throughput benchmarks for ML interatomic potentials on A100 GPUs

Setup: NVIDIA A100-SXM4-40GB, PyTorch 2.8.0, CUDA 12.4, 64-atom FCC Cu, batch size 16, 100 timed steps + 10 warmup.

All numbers are from real benchmarks using TorchSim on Modal A100 GPUs. CPS scores are from matbench-discovery. Source: mlip-inference-bench.


CPS = Combined Performance Score from matbench-discovery; shown where the exact checkpoint is on the leaderboard. ORB-v3 CPS is for the conservative-inf-mpa checkpoint. Throughput bars are relative to the fastest model.

Key findings

ORB-v3-Direct: fastest leaderboard model

24k atoms/s at 42.7 ms/step with 1.2 GB peak memory: 2x faster than ORB-v3-Conservative and 31x faster than PET-OAM-XL.
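
A quick consistency check on that throughput figure (assuming throughput is simply atoms processed per step divided by step latency, which is a reading of the numbers above, not a documented formula):

```python
atoms_per_step = 64 * 16                 # 64-atom FCC Cu cell x batch of 16
step_latency_s = 0.0427                  # 42.7 ms/step
print(atoms_per_step / step_latency_s)   # ~23,981 atoms/s, i.e. the reported 24k
```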

NequIP-OAM-S: fastest small model

34k atoms/s with near-perfect batch parallelism (29.9 ms/step batched vs 29.3 ms single), at only 531 MB peak memory.
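
To unpack "near-perfect": a batched step handles 16 systems for roughly the cost of one single-system step, so the effective speedup approaches the ideal 16x (assuming speedup is defined as work per unit time relative to the single case):

```python
t_single_s = 0.0293                   # 29.3 ms/step, one 64-atom system
t_batched_s = 0.0299                  # 29.9 ms/step, 16 systems at once
print(16 * t_single_s / t_batched_s)  # ~15.7x out of an ideal 16x
```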

Accuracy vs speed

PET-OAM-XL has the highest benchmarked CPS (0.898) but is 31x slower than ORB-v3-Direct.

XL models are memory-bound

NequIP-OAM-XL and PET-OAM-XL use 30-35 GB of peak memory, nearly saturating the 40 GB A100.

Accuracy vs speed (Pareto front)

Only models with both a CPS score and benchmark results are shown. Points on the Pareto front offer the best accuracy-speed tradeoff: no other model is both more accurate and faster.
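
A point is on the Pareto front when no other model beats it on both axes at once. A minimal sketch of that dominance check (the `(name, cps, throughput)` tuples and the function name are illustrative, not the site's code):

```python
def pareto_front(points):
    """Keep points not dominated on (CPS, throughput).

    A point is dominated if some other point is at least as good on
    both axes and strictly better on at least one.
    """
    front = []
    for name, cps, thr in points:
        dominated = any(
            c >= cps and t >= thr and (c > cps or t > thr)
            for _, c, t in points
        )
        if not dominated:
            front.append((name, cps, thr))
    return front
```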


Detailed charts

Chart modes: Throughput (atoms/s), Latency (ms/step), Peak GPU Memory (GB), and Batched Speedup vs Single.

Methodology

Each model runs forward passes on a 64-atom FCC copper supercell using TorchSim’s batched API on A100 GPUs via Modal. 10 warmup steps are excluded, then 100 steps are timed with torch.cuda.synchronize() before and after. Batch size is 16 independent copies.
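
A minimal sketch of that timing protocol (the `model` and `batch` objects here are placeholders, not TorchSim's actual API):

```python
import time
import torch

def time_forward(model, batch, warmup: int = 10, steps: int = 100) -> float:
    """Return mean seconds per forward step, GPU-synchronized."""
    for _ in range(warmup):        # warmup passes, excluded from timing
        model(batch)
    torch.cuda.synchronize()       # drain warmup kernels before starting the clock
    t0 = time.perf_counter()
    for _ in range(steps):
        model(batch)
    torch.cuda.synchronize()       # wait for all timed kernels to finish
    return (time.perf_counter() - t0) / steps
```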

Models

| Checkpoint | Architecture | Origin |
| --- | --- | --- |
| EquiformerV2+DeNS-OAM | Equivariant transformer | Atomic Architects |
| NequIP-OAM-XL / -S | SE(3)-equivariant message passing | MIR Group (Harvard) |
| PET-OAM-XL / MAD-S | Point Edge Transformer | COSMO Lab (EPFL) |
| ORB-v3 | Graph network | Orbital Materials |
| UMA-S-1p1 | Universal Model for Atoms | FAIR (Meta) |
| eSEN | Scalable E(3) network | FAIR (Meta) |