Building Explicit World Model for Open-world Object Manipulation
A manipulation framework that combines open-set perception, 3D digital twin reconstruction, and simulation-based strategy sampling, and requires no task-specific action demonstrations.
Overview
Open-world object manipulation has emerged as a popular research frontier in robotics. While recent vision-language-action models have achieved impressive results, they typically rely on large amounts of task-specific action data for training. This project explores a different route: enabling a manipulator to perform open-world manipulation tasks by understanding object dynamics rather than imitating action demonstrations.
The framework integrates open-set segmentation and grasping, 3D digital twin reconstruction, and simulation-based strategy sampling. A physically grounded digital twin allows the robot to simulate and evaluate possible interaction strategies before real-world execution.
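The idea of sampling strategies and evaluating them in simulation before execution can be illustrated with a minimal sketch. This is not the project's implementation: the source does not specify the simulator, how strategies are parameterized, or how they are scored. Everything below is a hypothetical stand-in, including the `Strategy` fields, the one-line `rollout` function that takes the place of a physics rollout in the digital twin, and the distance-based `score`.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """Hypothetical candidate strategy: where to release a held object."""
    release_x: float       # horizontal release position (m), assumed parameter
    release_height: float  # release height above the target surface (m), assumed parameter


def rollout(strategy: Strategy, drift_gain: float = 0.2) -> float:
    """Toy stand-in for a physics rollout in the digital twin.

    Returns the predicted landing x-position. The linear drift term is an
    illustrative assumption, not a real dynamics model.
    """
    return strategy.release_x + drift_gain * strategy.release_height


def score(strategy: Strategy, target_x: float) -> float:
    """Higher is better: negative distance from predicted landing point to target."""
    return -abs(rollout(strategy) - target_x)


def select_strategy(target_x: float, num_samples: int = 200, seed: int = 0) -> Strategy:
    """Sample candidate strategies, evaluate each in the toy twin, keep the best."""
    rng = random.Random(seed)  # local generator keeps the sampling reproducible
    candidates = [
        Strategy(rng.uniform(0.0, 1.0), rng.uniform(0.05, 0.30))
        for _ in range(num_samples)
    ]
    return max(candidates, key=lambda s: score(s, target_x))


if __name__ == "__main__":
    best = select_strategy(target_x=0.5)
    print(best, "predicted landing error:", -score(best, 0.5))
```

In this sketch only the selected strategy would be passed on for real-world execution. A real system would replace `rollout` with a simulation of the reconstructed scene and would likely use a richer strategy space and success criterion.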
The system was tested on tasks such as putting a banana into a basket, stacking cubes, and placing a cup upside down on a box. These tasks were completed without task-specific action demonstrations, highlighting the potential of explicit world models for generalization.