What did the HumanScale research find?

HumanScale compared egocentric human video with teleoperated real-robot trajectories as pretraining sources for embodied foundation models. With the same amount of pretraining data, models pretrained on filtered human video achieved 24% lower validation loss on real-robot action prediction, and success rates improved by 52.5% on in-distribution tasks and 90% on out-of-distribution tasks.

Why does this matter for embodied AI?

Robot data collection is expensive and limited in diversity, while human video is abundant and cheap. The new paradigm (learning world representations from human video, then aligning actions with a small amount of robot data) could substantially lower the barrier to building capable embodied AI systems.
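
As a rough illustration of the two-step paradigm described above, the following Python sketch splits a mixed pool of clips into a pretraining set (filtered human video) and an action-alignment set (robot trajectories with action labels). It is a toy sketch under stated assumptions, not the HumanScale pipeline: the `Clip` fields, the `quality` score, the `min_quality` threshold, and the function name `build_two_stage_plan` are all hypothetical, since the summary does not say how the filtering is actually done.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    clip_id: str
    source: str        # "human" (egocentric video) or "robot" (teleoperated trajectory)
    quality: float     # hypothetical quality score in [0, 1]; not defined in the source
    has_actions: bool  # True when the clip carries robot action labels


def build_two_stage_plan(clips, min_quality=0.5):
    """Split clips into a pretraining set and an action-alignment set."""
    # Stage 1: filtered human video, used to learn the world representation.
    pretrain = [c for c in clips if c.source == "human" and c.quality >= min_quality]
    # Stage 2: robot trajectories with action labels, used for action alignment.
    align = [c for c in clips if c.source == "robot" and c.has_actions]
    return {"pretrain": pretrain, "align": align}


clips = [
    Clip("h1", "human", 0.9, False),
    Clip("h2", "human", 0.2, False),  # dropped by the (assumed) quality filter
    Clip("h3", "human", 0.7, False),
    Clip("r1", "robot", 1.0, True),
]
plan = build_two_stage_plan(clips)
print([c.clip_id for c in plan["pretrain"]])  # ['h1', 'h3']
print([c.clip_id for c in plan["align"]])     # ['r1']
```

In this sketch the robot set is deliberately small relative to the human set, mirroring the idea that only a small amount of robot data is needed for action alignment once the representation has been learned from human video.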

What should we watch next?

Future work should validate whether data quality assessment standards generalize across datasets, and whether the open-source community can build large-scale egocentric video benchmarks. The two-step paradigm could become the standard data pipeline for embodied AI.

HumanScale: Pretraining on Egocentric Human Video Outperforms Real Robot Data

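This paper addresses the scarcity of pretraining data for embodied foundation models by systematically comparing egocentric human video with teleoperated real-robot trajectories as pretraining sources. The study notes that robot data offers precise action supervision but is costly and limited in diversity. With a carefully designed filtering and annotation pipeline applied to human video, the model shows a clear advantage at the pretraining stage. Experiments show that, with the same amount of pretraining data, models pretrained on human video achieve 24% lower validation loss on real-robot action prediction, and success rates on in-distribution and out-of-distribution tasks improve by 52.5% and 90%, respectively. This validates a scalable new paradigm for embodied pretraining: learn world representations from human video, then use a small amount of robot data for action alignment.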

Sources

arXiv