Scaling real robot data is a key bottleneck in imitation learning, which has led to the use of auxiliary data for policy training. While other aspects of robotic manipulation, such as image or language understanding, may be learned from internet-based datasets, acquiring motion knowledge remains challenging. Human data, with its rich diversity of manipulation behaviors, offers a valuable resource for this purpose. While previous works show that using human data can bring benefits such as improved robustness and training efficiency, it remains unclear whether human data can realize its greatest advantage: enabling robot policies to directly learn new motions for task completion. In this paper, we systematically explore this potential through multi-task human-robot cotraining. We introduce MotionTrans, a framework that includes a data collection system, a human data transformation pipeline, and a weighted cotraining strategy. By cotraining on 30 human and robot tasks simultaneously, we directly transfer more than 10 motions from human data to deployable end-to-end robot policies. Notably, 9 tasks achieve non-trivial success rates in a zero-shot manner. MotionTrans also significantly enhances pretraining–finetuning performance (+40% success rate). Through ablation studies, we identify a key factor for successful motion learning: cotraining with robot data. These findings unlock the potential of motion-level learning from human data and offer insights into its effective use for training robotic manipulation policies. All data, code, and model weights are open-sourced.
From VR human demonstrations to deployable, zero/few-shot robot skills via a unified state–action space, human→robot data transformation, and weighted multi-task cotraining.
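To make the unified state–action space concrete, below is a minimal sketch of one human→robot transformation step. It assumes a MediaPipe-style 21-keypoint hand layout and uses the thumb–index aperture as a proxy for the gripper command; the function name, field names, and thresholds are illustrative assumptions, not the exact MotionTrans implementation.

```python
import numpy as np

# Hypothetical retargeting sketch: map one timestep of VR hand data into a
# unified end-effector state-action format. Field names and constants are
# illustrative, not the exact MotionTrans pipeline.

def human_to_robot_frame(hand_keypoints: np.ndarray, wrist_pose: np.ndarray) -> dict:
    """hand_keypoints: (21, 3) 3D hand keypoints (MediaPipe-style indexing assumed).
    wrist_pose:     (4, 4) homogeneous wrist pose from the VR controller/tracker.
    """
    # End-effector pose: take the wrist pose directly; in practice a calibration
    # offset between the human wrist and the robot flange would be applied here.
    ee_position = wrist_pose[:3, 3]
    ee_rotation = wrist_pose[:3, :3]

    # Gripper command: a simple proxy is the thumb-tip / index-tip distance,
    # normalized into [0, 1] (0 = closed, 1 = open).
    thumb_tip, index_tip = hand_keypoints[4], hand_keypoints[8]
    aperture = np.linalg.norm(thumb_tip - index_tip)
    gripper = np.clip(aperture / 0.10, 0.0, 1.0)  # 10 cm assumed max aperture

    return {
        "ee_position": ee_position,      # (3,)
        "ee_rotation": ee_rotation,      # (3, 3)
        "gripper": np.float32(gripper),  # scalar in [0, 1]
    }
```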
(Left: Human Data · Middle: MotionTrans-DP · Right: MotionTrans-Pi0-VLA)
(Left: Human Data · Right: MotionTrans-DP, few-shot)
We synchronize VR headset/controllers and multi-view cameras to capture 3D hand trajectories, egocentric video, and robot state with precise time alignment.
During collection we log hand keypoints, egocentric observations, and textual annotations under a unified clock, making downstream alignment and training straightforward.
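As a concrete illustration of alignment under a shared clock, the sketch below matches each camera frame to the nearest VR pose by timestamp. The stream names, rates, and 50 ms skew tolerance are assumptions for the example, not the actual logging format.

```python
import numpy as np

def align_to_reference(ref_ts: np.ndarray, stream_ts: np.ndarray, max_skew: float = 0.05):
    """For each reference timestamp, return the index of the closest sample in
    another (sorted) stream; entries are -1 where the nearest sample is more
    than `max_skew` seconds away."""
    idx = np.searchsorted(stream_ts, ref_ts)
    idx = np.clip(idx, 1, len(stream_ts) - 1)
    left, right = stream_ts[idx - 1], stream_ts[idx]
    nearest = np.where(np.abs(ref_ts - left) <= np.abs(right - ref_ts), idx - 1, idx)
    skew = np.abs(stream_ts[nearest] - ref_ts)
    return np.where(skew <= max_skew, nearest, -1)

# Example: align 30 Hz camera frames to 90 Hz VR poses logged on the same clock.
cam_ts = np.arange(0.0, 10.0, 1 / 30)
vr_ts = np.sort(np.arange(0.0, 10.0, 1 / 90) + np.random.uniform(-2e-3, 2e-3, size=900))
vr_index_per_frame = align_to_reference(cam_ts, vr_ts)
```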
We collected 3,213 demonstrations across 15 human tasks and 15 robot tasks in 10+ real-world scenes. Tasks are grouped by motion-similar skill categories to support cross-embodiment (Human→Robot) cotraining and transfer.
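The weighted cotraining strategy can be realized, for example, with a per-sample weighted sampler over the merged human- and robot-task datasets. The PyTorch sketch below is one such realization; the 2:1 robot:human weighting and the helper name are illustrative assumptions, not the ratio used in the paper.

```python
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_cotraining_loader(human_datasets, robot_datasets,
                           robot_weight=2.0, human_weight=1.0, batch_size=64):
    """Mix all human-task and robot-task datasets into one loader, sampling
    robot samples more often than human ones by a fixed per-sample weight."""
    merged = ConcatDataset(list(human_datasets) + list(robot_datasets))

    # One weight per sample, grouped by data source (human vs. robot).
    weights = []
    for ds in human_datasets:
        weights += [human_weight] * len(ds)
    for ds in robot_datasets:
        weights += [robot_weight] * len(ds)

    sampler = WeightedRandomSampler(weights, num_samples=len(merged), replacement=True)
    return DataLoader(merged, batch_size=batch_size, sampler=sampler)
```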
We would like to express our sincere gratitude to Shuo Wang, Gu Zhang, Enshen Zhou, Haoxu Huang, Jialei Huang, Ruiqian Nai, Zhengrong Xue, Junmin Zhao, and Weirui Ye for their valuable discussions. We are especially grateful to Ruiqian Nai and Fanqi Lin for their assistance with the implementation of Pi0-VLA, and to Yankai Fu for his support with the hardware implementation. Our thanks also extend to the SpiritAI and InspireRobot teams for their assistance.
@article{yuan2025motiontrans,
  title={MotionTrans: Human VR Data Enable Motion-Level Learning for Robotic Manipulation Policies},
  author={Yuan, Chengbo and Zhou, Rui and Liu, Mengzhen and Hu, Yingdong and Wang, Shengjie and Yi, Li and Wen, Chuan and Zhang, Shanghang and Gao, Yang},
  journal={arXiv preprint arXiv:2509.17759},
  year={2025}
}