Metrics
UPH – units per hour. Throughput: how many items the system handles per hour.
MTBA – mean time between assists. Autonomy: how long the system runs, on
average, before a human needs to step in.
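As a rough illustration of how these two metrics relate to a run's raw counts, here is a minimal sketch. It assumes UPH is units divided by operating hours and MTBA is operating time divided by the number of assists; the authoritative definitions and scoring rules are in the white paper. The `RunSummary` class and its field names are hypothetical and are not part of any PhAIL or Positronic API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunSummary:
    """Hypothetical per-run summary; field names are illustrative only."""
    units_completed: int     # items counted as transferred during the run
    duration_seconds: float  # total operating time of the run
    assists: int             # number of times a human had to step in


def units_per_hour(run: RunSummary) -> float:
    """UPH, assumed here to be units completed per hour of operation."""
    if run.duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return run.units_completed / (run.duration_seconds / 3600.0)


def mean_time_between_assists(run: RunSummary) -> Optional[float]:
    """MTBA in seconds, assumed to be operating time divided by assists.

    Returns None when the run had no assists, since the mean is
    undefined in that case (the run time is only a lower bound).
    """
    if run.duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    if run.assists == 0:
        return None
    return run.duration_seconds / run.assists


# Example: a 30-minute run with 120 items moved and 3 human assists.
run = RunSummary(units_completed=120, duration_seconds=1800.0, assists=3)
print(units_per_hour(run))             # 240.0
print(mean_time_between_assists(run))  # 600.0
```

Returning None for a run with no assists is a design choice in this sketch; a real scoring pipeline might instead report the run duration as a lower bound.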
Tasks & hardware
The first task is bin-to-bin order picking – transferring individual items
between containers. Evaluations run on Franka Research 3 arms with Robotiq
grippers. More tasks and platforms are coming.
Fine-tuning dataset
The DROID teleoperation dataset is used to fine-tune all models on the
leaderboard. It contains 352 episodes (12 GB) and is available for
non-commercial use.
Evaluation runs
Every evaluation run on the leaderboard is a downloadable Positronic
dataset – multi-view video and robot telemetry.
Browse individual runs in the Run explorer.
Methodology
The full evaluation protocol, scoring, and reproducibility details are in
the white paper.
Consortium
PhAIL is governed as an open consortium. Founding partners: Nebius and Toloka.
We're looking for model builders, hardware vendors, and deployers who want
to shape how physical AI is measured.
[email protected]