Where this is going

Today PhAIL is a leaderboard. The larger goal is to be the real-robot evaluation layer for VLA (vision-language-action) models – we eval, you research. Results can be public on the leaderboard or private to your lab, and the harness we run is becoming a toolkit you’ll run yourself.

See what we’re building →

Read the paper

The methodology behind the benchmark – what we measure and why – is laid out in our paper on arXiv.

Getting started

The fastest way in – browse the v1.0 fine-tuning dataset in one command:

uv run --with positronic \
  python -m positronic.cfg.phail.v1_0

Other paths in:

  • Explore results. The leaderboard shows current model scores; the run explorer drills into individual rollouts.
  • Dig deeper. Model weights, evaluation rollouts, protocol, hardware spec, and the full reproduction recipe are on the v1.0 release page.
  • Get evaluated. Public on the leaderboard or private to your lab – see the options on the eval page, or email [email protected]. We onboard by hand for v1.0.

Metrics

UPH – units per hour. How fast the system works.

MTBA – mean time between assists. How long it runs before a human needs to step in.
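
As a rough illustration of how the two metrics relate to a run log, here is a minimal Python sketch. It assumes UPH is completed units scaled to an hourly rate and MTBA is run time divided by the number of assists; the exact definitions and edge-case handling are in the paper, and the `RunLog` record and its field names are hypothetical, not part of the PhAIL or positronic API.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    """Hypothetical summary of one evaluation run (not PhAIL's actual schema)."""
    duration_s: float      # wall-clock run time, in seconds
    units_completed: int   # items successfully moved bin to bin
    assists: int           # times a human had to step in


def uph(run: RunLog) -> float:
    """Units per hour: completed units scaled to a one-hour rate."""
    if run.duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return run.units_completed * 3600.0 / run.duration_s


def mtba_s(run: RunLog) -> float:
    """Mean time between assists, in seconds.

    Assumption: run time divided by assist count. A run with no
    assists is reported as infinity rather than a finite number.
    """
    if run.duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if run.assists == 0:
        return float("inf")
    return run.duration_s / run.assists


# Example: a 30-minute run with 60 units picked and 3 human assists.
run = RunLog(duration_s=1800.0, units_completed=60, assists=3)
print(f"UPH:  {uph(run):.1f}")       # 60 units in half an hour
print(f"MTBA: {mtba_s(run):.1f} s")  # 1800 s split across 3 assists
```

The two numbers are read together: a system can post a high UPH while needing frequent assists, or run unattended for a long time while working slowly.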

Current release: v1.0

  • Task: bin-to-bin order picking across four object classes – towels, wooden spoons, scissors, batteries.
  • Hardware: single Franka Research 3 + Robotiq 2F-85, two Stereolabs ZED cameras.
  • Models evaluated: OpenPI π0.5, NVIDIA GR00T N1.6, Hugging Face SmolVLA, Action Chunking with Transformers (ACT).

The full hardware spec, protocol, dataset access, model weights, and reproduction recipe are on the v1.0 release page. The evaluation methodology is described in the v1.0 white paper.

Contact

Questions, feedback, or interested in submitting a model? Reach out at [email protected].