Overview

Open-world object manipulation is an active research frontier in robotics. Recent vision-language-action models have achieved impressive results, but they typically rely on large amounts of task-specific action data for training. This project explores a different route: enabling a manipulator to perform open-world manipulation tasks by understanding object dynamics rather than by imitating action demonstrations.

The framework integrates three components: open-set segmentation and grasping, 3D digital twin reconstruction, and simulation-based strategy sampling. The physically grounded digital twin lets the robot simulate and evaluate candidate interaction strategies before executing one in the real world.
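
To make the sample-then-evaluate idea concrete, here is a minimal Python sketch of simulation-based strategy sampling. It is an illustration under stated assumptions, not the project's implementation: the `Strategy` fields, the `simulate` function (a toy closed-form stand-in for a physics rollout in the digital twin), the drift model, and all numeric ranges are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    # Hypothetical parameterization: where the gripper releases the object (metres).
    release_x: float
    release_y: float
    release_height: float


def simulate(strategy, target_xy, drift_per_metre=0.05):
    """Toy stand-in for a physics rollout in the digital twin.

    Assumption: the object drifts in +x in proportion to its drop height.
    Returns the landing distance from the target (lower is better).
    """
    land_x = strategy.release_x + drift_per_metre * strategy.release_height
    land_y = strategy.release_y
    dx = land_x - target_xy[0]
    dy = land_y - target_xy[1]
    return (dx * dx + dy * dy) ** 0.5


def sample_strategies(target_xy, n, rng):
    """Draw n random candidate release poses around the target."""
    return [
        Strategy(
            release_x=target_xy[0] + rng.uniform(-0.1, 0.1),
            release_y=target_xy[1] + rng.uniform(-0.1, 0.1),
            release_height=rng.uniform(0.05, 0.3),
        )
        for _ in range(n)
    ]


def select_best_strategy(target_xy, n=200, seed=0):
    """Sample candidates, score each in simulation, return (cost, strategy)."""
    rng = random.Random(seed)
    candidates = sample_strategies(target_xy, n, rng)
    scored = [(simulate(s, target_xy), s) for s in candidates]
    # Compare on cost only, so Strategy objects never need an ordering.
    return min(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    cost, best = select_best_strategy((0.4, 0.0))
    print(f"best simulated landing error: {cost:.4f} m with {best}")
```

In a real system the `simulate` call would be replaced by a rollout in a physics engine loaded with the reconstructed scene, and only the lowest-cost strategy would be sent to the robot.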

The system was tested on tasks such as putting a banana into a basket, stacking cubes, and placing a cup upside down on a box. These tasks were completed without task-specific action demonstrations, highlighting the potential of explicit world models for generalization.

Demonstrations

Put lemon into cup

Put cube into box

Stack two cubes

Put cup upside down on box