We introduce GE-Sim 2.0 (Genie Envisioner World Simulator 2.0), a comprehensive embodied world simulator for robotic manipulation.
As an evolution of GE-Sim 1.0, GE-Sim 2.0 extends action-conditioned video simulation with three capabilities for scalable policy evaluation and learning: proprioceptive state estimation, automatic task evaluation, and efficient rollout.
By integrating future video generation, proprioceptive state estimation, and reward-based policy assessment, GE-Sim 2.0 moves beyond visual simulation toward an embodied world simulator with built-in evaluation capability, laying the foundation for a world-model-centric platform for closed-loop and interactive policy learning.
Together, these capabilities provide a practical framework for evaluating and training robotic manipulation policies at scale.
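To make the closed-loop setup concrete, the sketch below shows one plausible shape of such an evaluation loop: a policy proposes actions, the world model imagines the resulting video and proprioceptive states, and a judge converts the imagined rollout into a reward. All class and method names here (WorldSimulator, WorldJudge, step, score) are hypothetical illustrations, not the released GE-Sim 2.0 API.

```python
# A minimal sketch of a closed-loop evaluation interface; names are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class SimStep:
    frames: np.ndarray   # predicted future video frames, (T, H, W, 3)
    proprio: np.ndarray  # predicted proprioceptive states, (T, state_dim)

class WorldSimulator:
    """Action-conditioned video world model (placeholder)."""
    def step(self, obs: np.ndarray, actions: np.ndarray) -> SimStep:
        t = actions.shape[0]
        return SimStep(frames=np.zeros((t, 224, 224, 3)), proprio=np.zeros((t, 16)))

class WorldJudge:
    """Reward model scoring a rollout against a task instruction (placeholder)."""
    def score(self, frames: np.ndarray, instruction: str) -> float:
        return 0.0  # task-completion reward in [0, 1]

def evaluate_policy(policy, sim: WorldSimulator, judge: WorldJudge,
                    obs: np.ndarray, instruction: str, horizon: int = 8) -> float:
    """Closed-loop rollout: the policy proposes actions, the simulator imagines
    the outcome, and the judge turns the imagined video into a reward."""
    frames = []
    for _ in range(horizon):
        actions = policy(obs, instruction)  # (chunk_len, action_dim)
        out = sim.step(obs, actions)
        frames.append(out.frames)
        obs = out.frames[-1]                # feed the last frame back as observation
    return judge.score(np.concatenate(frames, axis=0), instruction)

# Usage with a dummy policy that always emits zero actions.
reward = evaluate_policy(lambda obs, instr: np.zeros((8, 7)), WorldSimulator(),
                         WorldJudge(), np.zeros((224, 224, 3)),
                         "put the blue chips into the red box")
print(reward)
```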
Capabilities
Our Signature Control-Signal Injection Mechanism for Precise, Unified Action-Driven Generation
By calibrating heterogeneous action spaces and policies into a unified control space, our method enables precise action-conditioned video generation across diverse robotic embodiments.
(Videos: pixel-aligned action condition alongside the generated vision state.)
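As an illustration of what a pixel-aligned action condition could look like, the sketch below projects end-effector waypoints from a world-frame action space into image coordinates using standard camera geometry, yielding a control signal in a shared pixel space. The projection math is standard; whether GE-Sim 2.0 calibrates actions exactly this way is our assumption.

```python
# Illustrative pixel-space calibration of end-effector waypoints; the recipe
# is a standard pinhole projection, not the paper's confirmed mechanism.
import numpy as np

def project_waypoints(ee_xyz_world: np.ndarray,
                      extrinsic: np.ndarray,
                      intrinsic: np.ndarray) -> np.ndarray:
    """Project (N, 3) world-frame end-effector positions to (N, 2) pixels."""
    n = ee_xyz_world.shape[0]
    homo = np.concatenate([ee_xyz_world, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = (extrinsic @ homo.T).T            # world -> camera frame, (N, 3)
    uvw = (intrinsic @ cam.T).T             # camera -> image plane, (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]         # perspective divide -> pixels

# Example: identity extrinsics and a simple pinhole intrinsic matrix.
K = np.array([[500.0, 0.0, 112.0],
              [0.0, 500.0, 112.0],
              [0.0, 0.0, 1.0]])
E = np.eye(4)[:3]                           # (3, 4), world frame == camera frame
waypoints = np.array([[0.1, 0.0, 0.5], [0.12, 0.02, 0.48]])
print(project_waypoints(waypoints, E, K))   # pixel-space control signal
```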
High-Consistency Multi-View Generation for Precise Robotic World Modeling
Generates highly consistent videos across the head view and the left- and right-hand views, enabling comprehensive modeling of real robot observations and more faithful simulation.
The towel is initially hidden in the blind spot of the left view. As the left arm moves, the towel comes into view and remains consistent with the main view. At the same time, the reflection in the mirror is also generated convincingly.
Stable and Consistent Minute-Level Video Generation
Delivers stable, consistent minute-level video generation, enabling realistic simulation of long-horizon action policies in embodied environments.
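One common recipe for minute-level generation is chunked autoregressive rollout: generate a short clip, then condition the next chunk on the tail frames of the previous one. The sketch below illustrates that pattern; the chunk length, overlap, and the generate_chunk placeholder are illustrative assumptions, not the model's actual mechanism.

```python
# Chunked autoregressive rollout, a generic pattern for long-horizon video.
import numpy as np

def generate_chunk(context: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Placeholder for the video model: returns (chunk_len, H, W, 3) frames."""
    return np.zeros((actions.shape[0], 224, 224, 3), dtype=np.float32)

def long_horizon_rollout(init_frames: np.ndarray, action_seq: np.ndarray,
                         chunk_len: int = 16, overlap: int = 4) -> np.ndarray:
    """Stitch chunks into a minute-level video; `overlap` tail frames of each
    chunk are reused as context so adjacent chunks stay temporally consistent."""
    video = [init_frames]
    context = init_frames
    for start in range(0, action_seq.shape[0], chunk_len):
        actions = action_seq[start:start + chunk_len]
        chunk = generate_chunk(context, actions)
        video.append(chunk)
        context = chunk[-overlap:]          # carry tail frames forward
    return np.concatenate(video, axis=0)

# 30 fps * 60 s = 1800 action steps for a one-minute rollout.
frames = long_horizon_rollout(np.zeros((4, 224, 224, 3), dtype=np.float32),
                              np.zeros((1800, 7), dtype=np.float32))
print(frames.shape)
```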
Powered by Millions of Real-World Robotic Episodes, Delivering Strong Generalization Across Scenes and Tasks
Built on large-scale teleoperation, deployment, and interaction data, the model transfers robustly across diverse environments and manipulation goals.
(Video carousel: generated vision states across four scenes.)
Diverse Data for Faithful Trajectory Following
The diversity of teleoperation, deployment, and interaction data supports richer trajectory coverage, helping the model follow diverse motions more faithfully and reduce hallucinations in video generation.
Trained with Expert, Autonomous Execution, and Failure Demonstration Data
(Video: generated vision state.)
Trained with Expert Data Only
(Video: generated vision state.)
When trained with expert data only, a hallucination appears: the gripper moves without actually grasping the towel, yet the towel moves with it. With more diverse training data, this hallucination disappears.
Joint Visual and Proprioceptive Future Prediction
A state expert decodes proprioceptive states from video latents, enabling the simulator to provide both future visual observations and proprioceptive signals for downstream policy models.
The radar chart in the upper-left shows the difference between the action condition (yellow line) and the model-generated state (blue line). J denotes joint, and G denotes gripper.
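Below is a minimal sketch of what such a state expert could look like, assuming a small regression head over pooled video latents that outputs per-arm joint angles (J) and gripper openings (G). The dimensions (14 joints plus 2 grippers for a bimanual robot) and the architecture are our assumptions, not the paper's specification.

```python
# Hypothetical state-expert head: video latents -> joint angles + gripper state.
import torch
import torch.nn as nn

class StateExpert(nn.Module):
    def __init__(self, latent_dim: int = 1024, num_joints: int = 14,
                 num_grippers: int = 2, hidden: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(latent_dim),
            nn.Linear(latent_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_joints + num_grippers),
        )
        self.num_joints = num_joints

    def forward(self, video_latents: torch.Tensor):
        """video_latents: (B, T, N, D) latent tokens per generated frame.
        Returns joint angles (B, T, J) and gripper openings (B, T, G)."""
        pooled = video_latents.mean(dim=2)   # pool spatial tokens per frame
        out = self.head(pooled)
        return out[..., :self.num_joints], out[..., self.num_joints:]

latents = torch.randn(2, 8, 256, 1024)       # batch of generated-video latents
joints, grippers = StateExpert()(latents)
print(joints.shape, grippers.shape)          # (2, 8, 14) (2, 8, 2)
```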
Automated Task Evaluation with an Integrated World Judge
An integrated world-judge reward model evaluates generated rollout videos against task instructions, automatically measuring task completion and producing reward signals for policy assessment.
Task Caption: Put the blue chips into the red box.
(Video carousel: generated vision states for two cases.)
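To illustrate the world-judge interface, the sketch below assumes a video-language model that embeds the generated rollout and the task instruction into a shared space and scores their alignment as a task-completion reward. The similarity-based scoring is a generic stand-in; the actual judge may be trained differently (e.g., as a success classifier).

```python
# Hypothetical world-judge reward model: (rollout video, instruction) -> reward.
import torch
import torch.nn as nn

class WorldJudge(nn.Module):
    def __init__(self, video_dim: int = 1024, text_dim: int = 768, dim: int = 512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)
        self.text_proj = nn.Linear(text_dim, dim)

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor):
        """video_feats: (B, T, Dv) per-frame features of the generated rollout;
        text_feats: (B, Dt) instruction embedding. Returns reward in [0, 1]."""
        v = self.video_proj(video_feats.mean(dim=1))   # temporal pooling
        t = self.text_proj(text_feats)
        sim = nn.functional.cosine_similarity(v, t, dim=-1)
        return (sim + 1.0) / 2.0                       # map [-1, 1] -> [0, 1]

reward = WorldJudge()(torch.randn(1, 16, 1024), torch.randn(1, 768))
print(reward)  # scalar task-completion reward per rollout
```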
Efficient and Scalable Simulation with Preserved Fidelity
Incorporates an acceleration framework that improves simulation efficiency while maintaining fidelity, enabling scalable policy evaluation.
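As one illustration of how scalable evaluation could be organized, the sketch below batches many candidate rollouts through the simulator and exposes the number of denoising steps as a speed/fidelity knob. Both accelerations are generic assumptions; the paper's actual acceleration framework is not specified here.

```python
# Illustrative batched evaluation with a tunable sampling budget.
import numpy as np

def simulate_batch(actions: np.ndarray, num_steps: int) -> np.ndarray:
    """Placeholder sampler: (B, T, A) actions -> (B, T, H, W, 3) frames,
    with `num_steps` denoising iterations trading speed for fidelity."""
    b, t = actions.shape[:2]
    return np.zeros((b, t, 224, 224, 3), dtype=np.float32)

def evaluate_rollouts(action_batches, num_steps: int = 8, batch_size: int = 16):
    """Render many candidate rollouts by batching them through the simulator."""
    rollouts = []
    for i in range(0, len(action_batches), batch_size):
        batch = np.stack(action_batches[i:i + batch_size])
        rollouts.append(simulate_batch(batch, num_steps))
    return np.concatenate(rollouts, axis=0)

videos = evaluate_rollouts([np.zeros((32, 7), dtype=np.float32)] * 40)
print(videos.shape)                          # (40, 32, 224, 224, 3)
```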