We introduce GE-Sim 2.0 (Genie Envisioner World Simulator 2.0), a comprehensive embodied world simulator for robotic manipulation.
As an evolution of GE-Sim 1.0, GE-Sim 2.0 extends action-conditioned video simulation with three capabilities for scalable policy evaluation and learning: proprioceptive state estimation, automatic task evaluation, and efficient rollout.
By integrating future video generation, proprioceptive state estimation, and reward-based policy assessment, GE-Sim 2.0 moves beyond visual simulation toward an embodied world simulator with built-in evaluation capability, laying the foundation for a world-model-centric platform for closed-loop and interactive policy learning.
Together, these capabilities provide a practical framework for evaluating and training robotic manipulation policies at scale.
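To make the closed-loop setup concrete, the sketch below shows one plausible shape of such an evaluation loop: a policy proposes actions, the world model imagines the resulting video and proprioceptive states, and a judge converts the imagined rollout into a reward. All class and method names here (WorldSimulator, WorldJudge, step, score) are hypothetical illustrations, not the released GE-Sim 2.0 API.

```python
# A minimal sketch of a closed-loop evaluation interface; names are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class SimStep:
    frames: np.ndarray   # predicted future video frames, (T, H, W, 3)
    proprio: np.ndarray  # predicted proprioceptive states, (T, state_dim)

class WorldSimulator:
    """Action-conditioned video world model (placeholder)."""
    def step(self, obs: np.ndarray, actions: np.ndarray) -> SimStep:
        t = actions.shape[0]
        return SimStep(frames=np.zeros((t, 224, 224, 3)), proprio=np.zeros((t, 16)))

class WorldJudge:
    """Reward model scoring a rollout against a task instruction (placeholder)."""
    def score(self, frames: np.ndarray, instruction: str) -> float:
        return 0.0  # task-completion reward in [0, 1]

def evaluate_policy(policy, sim: WorldSimulator, judge: WorldJudge,
                    obs: np.ndarray, instruction: str, horizon: int = 8) -> float:
    """Closed-loop rollout: the policy proposes actions, the simulator imagines
    the outcome, and the judge turns the imagined video into a reward."""
    frames = []
    for _ in range(horizon):
        actions = policy(obs, instruction)  # (chunk_len, action_dim)
        out = sim.step(obs, actions)
        frames.append(out.frames)
        obs = out.frames[-1]                # feed the last frame back as observation
    return judge.score(np.concatenate(frames, axis=0), instruction)

# Usage with a dummy policy that always emits zero actions.
reward = evaluate_policy(lambda obs, instr: np.zeros((8, 7)), WorldSimulator(),
                         WorldJudge(), np.zeros((224, 224, 3)),
                         "put the blue chips into the red box")
print(reward)
```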
Capabilities
Our Signature Control-Signal Injection Mechanism for Precise, Unified Action-Driven Generation
By calibrating heterogeneous action spaces and policies into a unified control space, our method enables precise action-conditioned video generation across diverse robotic embodiments.
(Videos: pixel-aligned action condition alongside the generated vision state.)
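As an illustration of what a pixel-aligned action condition could look like, the sketch below projects end-effector waypoints from a world-frame action space into image coordinates using standard camera geometry, yielding a control signal in a shared pixel space. The projection math is standard; whether GE-Sim 2.0 calibrates actions exactly this way is our assumption.

```python
# Illustrative pixel-space calibration of end-effector waypoints; the recipe
# is a standard pinhole projection, not the paper's confirmed mechanism.
import numpy as np

def project_waypoints(ee_xyz_world: np.ndarray,
                      extrinsic: np.ndarray,
                      intrinsic: np.ndarray) -> np.ndarray:
    """Project (N, 3) world-frame end-effector positions to (N, 2) pixels."""
    n = ee_xyz_world.shape[0]
    homo = np.concatenate([ee_xyz_world, np.ones((n, 1))], axis=1)  # (N, 4)
    cam = (extrinsic @ homo.T).T            # world -> camera frame, (N, 3)
    uvw = (intrinsic @ cam.T).T             # camera -> image plane, (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]         # perspective divide -> pixels

# Example: identity extrinsics and a simple pinhole intrinsic matrix.
K = np.array([[500.0, 0.0, 112.0],
              [0.0, 500.0, 112.0],
              [0.0, 0.0, 1.0]])
E = np.eye(4)[:3]                           # (3, 4), world frame == camera frame
waypoints = np.array([[0.1, 0.0, 0.5], [0.12, 0.02, 0.48]])
print(project_waypoints(waypoints, E, K))   # pixel-space control signal
```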
High-Consistency Multi-View Generation for Precise Robotic World Modeling
Generates highly consistent videos across the head view and the left- and right-hand views, enabling comprehensive modeling of real robot observations and more faithful simulation.
The towel is initially hidden in the blind spot of the left view. As the left arm moves, the towel comes into view and remains consistent with the main view. At the same time, the reflection in the mirror is also generated convincingly.
Stable and Consistent Minute-Level Video Generation
Delivers stable, consistent minute-level video generation, enabling realistic simulation of long-horizon action policies in embodied environments.
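One common recipe for minute-level generation is chunked autoregressive rollout: generate a short clip, then condition the next chunk on the tail frames of the previous one. The sketch below illustrates that pattern; the chunk length, overlap, and the generate_chunk placeholder are illustrative assumptions, not the model's actual mechanism.

```python
# Chunked autoregressive rollout, a generic pattern for long-horizon video.
import numpy as np

def generate_chunk(context: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Placeholder for the video model: returns (chunk_len, H, W, 3) frames."""
    return np.zeros((actions.shape[0], 224, 224, 3), dtype=np.float32)

def long_horizon_rollout(init_frames: np.ndarray, action_seq: np.ndarray,
                         chunk_len: int = 16, overlap: int = 4) -> np.ndarray:
    """Stitch chunks into a minute-level video; `overlap` tail frames of each
    chunk are reused as context so adjacent chunks stay temporally consistent."""
    video = [init_frames]
    context = init_frames
    for start in range(0, action_seq.shape[0], chunk_len):
        actions = action_seq[start:start + chunk_len]
        chunk = generate_chunk(context, actions)
        video.append(chunk)
        context = chunk[-overlap:]          # carry tail frames forward
    return np.concatenate(video, axis=0)

# 30 fps * 60 s = 1800 action steps for a one-minute rollout.
frames = long_horizon_rollout(np.zeros((4, 224, 224, 3), dtype=np.float32),
                              np.zeros((1800, 7), dtype=np.float32))
print(frames.shape)
```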
Powered by Millions of Real-World Robotic Episodes, Delivering Strong Generalization Across Scenes and Tasks
Built on large-scale teleoperation, deployment, and interaction data, the model transfers robustly across diverse environments and manipulation goals.
(Video carousel: generated vision states across four scenes.)
Diverse Data for Faithful Trajectory Following
The diversity of teleoperation, deployment, and interaction data supports richer trajectory coverage, helping the model follow diverse motions more faithfully and reduce hallucinations in video generation.
Trained with Expert, Autonomous Execution, and Failure Demonstration Data
(Video: generated vision state.)
Trained with Expert Data Only
(Video: generated vision state.)
When trained with expert data only, a hallucination appears: the gripper moves without actually grasping the towel, yet the towel moves with it. With more diverse training data, this hallucination disappears.
Joint Visual and Proprioceptive Future Prediction
A state expert decodes proprioceptive states from video latents, enabling the simulator to provide both future visual observations and proprioceptive signals for downstream policy models.
The radar chart in the upper-left shows the difference between the action condition (yellow line) and the model-generated state (blue line). J denotes joint, and G denotes gripper.
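Below is a minimal sketch of what such a state expert could look like, assuming a small regression head over pooled video latents that outputs per-arm joint angles (J) and gripper openings (G). The dimensions (14 joints plus 2 grippers for a bimanual robot) and the architecture are our assumptions, not the paper's specification.

```python
# Hypothetical state-expert head: video latents -> joint angles + gripper state.
import torch
import torch.nn as nn

class StateExpert(nn.Module):
    def __init__(self, latent_dim: int = 1024, num_joints: int = 14,
                 num_grippers: int = 2, hidden: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(latent_dim),
            nn.Linear(latent_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_joints + num_grippers),
        )
        self.num_joints = num_joints

    def forward(self, video_latents: torch.Tensor):
        """video_latents: (B, T, N, D) latent tokens per generated frame.
        Returns joint angles (B, T, J) and gripper openings (B, T, G)."""
        pooled = video_latents.mean(dim=2)   # pool spatial tokens per frame
        out = self.head(pooled)
        return out[..., :self.num_joints], out[..., self.num_joints:]

latents = torch.randn(2, 8, 256, 1024)       # batch of generated-video latents
joints, grippers = StateExpert()(latents)
print(joints.shape, grippers.shape)          # (2, 8, 14) (2, 8, 2)
```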
Automated Task Evaluation with an Integrated World Judge
An integrated world-judge reward model evaluates generated rollout videos against task instructions, automatically measuring task completion and producing reward signals for policy assessment.
Task Caption: Put the blue chips into the red box.
(Video carousel: generated vision states for two cases.)
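To illustrate the world-judge interface, the sketch below assumes a video-language model that embeds the generated rollout and the task instruction into a shared space and scores their alignment as a task-completion reward. The similarity-based scoring is a generic stand-in; the actual judge may be trained differently (e.g., as a success classifier).

```python
# Hypothetical world-judge reward model: (rollout video, instruction) -> reward.
import torch
import torch.nn as nn

class WorldJudge(nn.Module):
    def __init__(self, video_dim: int = 1024, text_dim: int = 768, dim: int = 512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, dim)
        self.text_proj = nn.Linear(text_dim, dim)

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor):
        """video_feats: (B, T, Dv) per-frame features of the generated rollout;
        text_feats: (B, Dt) instruction embedding. Returns reward in [0, 1]."""
        v = self.video_proj(video_feats.mean(dim=1))   # temporal pooling
        t = self.text_proj(text_feats)
        sim = nn.functional.cosine_similarity(v, t, dim=-1)
        return (sim + 1.0) / 2.0                       # map [-1, 1] -> [0, 1]

reward = WorldJudge()(torch.randn(1, 16, 1024), torch.randn(1, 768))
print(reward)  # scalar task-completion reward per rollout
```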
Efficient and Scalable Simulation with Preserved Fidelity
Incorporates an acceleration framework that improves simulation efficiency while maintaining fidelity, enabling scalable policy evaluation.
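As one illustration of how scalable evaluation could be organized, the sketch below batches many candidate rollouts through the simulator and exposes the number of denoising steps as a speed/fidelity knob. Both accelerations are generic assumptions; the paper's actual acceleration framework is not specified here.

```python
# Illustrative batched evaluation with a tunable sampling budget.
import numpy as np

def simulate_batch(actions: np.ndarray, num_steps: int) -> np.ndarray:
    """Placeholder sampler: (B, T, A) actions -> (B, T, H, W, 3) frames,
    with `num_steps` denoising iterations trading speed for fidelity."""
    b, t = actions.shape[:2]
    return np.zeros((b, t, 224, 224, 3), dtype=np.float32)

def evaluate_rollouts(action_batches, num_steps: int = 8, batch_size: int = 16):
    """Render many candidate rollouts by batching them through the simulator."""
    rollouts = []
    for i in range(0, len(action_batches), batch_size):
        batch = np.stack(action_batches[i:i + batch_size])
        rollouts.append(simulate_batch(batch, num_steps))
    return np.concatenate(rollouts, axis=0)

videos = evaluate_rollouts([np.zeros((32, 7), dtype=np.float32)] * 40)
print(videos.shape)                          # (40, 32, 224, 224, 3)
```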