Physical AI Leaderboard
Physical AI looks impressive in demos. But nobody can tell you whether a model is ready for production. PhAIL measures that gap – on real robots, with real commercial tasks, using the metrics operations teams actually care about.
Where this is going
Today PhAIL is a leaderboard. The larger goal is to be the real-robot evaluation layer for VLA models – we eval, you research. Public on the leaderboard, or private to your lab. And the harness we run is becoming a toolkit you’ll run yourself.
Read the paper
The methodology behind the benchmark – what we measure and why – is laid out in our paper on arXiv.
Getting started
The fastest way in is to browse the v1.0 fine-tuning dataset. With uv, this is a single command:
uv run --with positronic \
    python -m positronic.cfg.phail.v1_0
Or, with pip, install the package first and then run the module:
pip install positronic
python -m positronic.cfg.phail.v1_0
Other paths in:
- Explore results. The leaderboard shows current model scores; the run explorer drills into individual rollouts.
- Dig deeper. Model weights, evaluation rollouts, protocol, hardware spec, and the full reproduction recipe are on the v1.0 release page.
- Get evaluated. Public on the leaderboard or private to your lab – see the options on the eval page, or email [email protected]. We onboard by hand for v1.0.
Metrics
UPH – units per hour. How fast the system works.
MTBA – mean time between assists. How long it runs before a human needs to step in.
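As a rough illustration of how these two metrics can be computed from logged runs, here is a minimal sketch. It assumes the conventional definitions (units completed divided by elapsed hours, and total run time divided by the number of human assists); the exact accounting PhAIL uses is defined in the paper and may differ, for example in how runs with zero assists are treated. The `Run` record and both function names are hypothetical and are not part of the positronic package.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical record of one evaluation run (not a positronic type)."""
    duration_s: float      # wall-clock run time in seconds
    units_completed: int   # items successfully picked and placed
    assists: int           # number of times a human had to step in


def units_per_hour(runs: list[Run]) -> float:
    """UPH: total units completed divided by total elapsed hours."""
    total_s = sum(r.duration_s for r in runs)
    if total_s <= 0:
        raise ValueError("total run time must be positive")
    return sum(r.units_completed for r in runs) * 3600.0 / total_s


def mean_time_between_assists_s(runs: list[Run]) -> float:
    """MTBA in seconds: total run time divided by total assists.

    Assumption: with zero assists the value is reported as infinity.
    """
    total_s = sum(r.duration_s for r in runs)
    assists = sum(r.assists for r in runs)
    if assists == 0:
        return float("inf")
    return total_s / assists


if __name__ == "__main__":
    runs = [Run(1800.0, 50, 2), Run(1800.0, 40, 1)]
    print(f"UPH:  {units_per_hour(runs):.1f}")
    print(f"MTBA: {mean_time_between_assists_s(runs):.0f} s")
```

Pooling time and counts across runs before dividing, rather than averaging per-run ratios, keeps short runs from being over-weighted; that is a design choice of this sketch, not a statement about the benchmark's protocol.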
Current release: v1.0
- Task: bin-to-bin order picking across four object classes – towels, wooden spoons, scissors, batteries.
- Hardware: single Franka Research 3 + Robotiq 2F-85, two Stereolabs ZED cameras.
- Models evaluated: OpenPI π0.5, NVIDIA GR00T N1.6, Hugging Face SmolVLA, Action Chunking Transformer.
The full hardware spec, protocol, dataset access, model weights, and reproduction recipe are on the v1.0 release page. The evaluation methodology is described in the v1.0 white paper.
Contact
Have questions or feedback, or want to submit a model? Reach out at [email protected].