We recommend accessing this website with Microsoft Edge or Google Chrome on Ubuntu for the best interactive visualization experience.
Semantics-driven 3D spatial constraints align high-level semantic representations with low-level action spaces, facilitating the unification of task understanding and execution in robotic manipulation. The synergistic reasoning of Multimodal Large Language Models (MLLMs) and Vision Foundation Models (VFMs) enables cross-modal 3D spatial constraint construction. Nevertheless, existing methods have three key limitations: (1) coarse semantic granularity in constraint modeling, (2) lack of real-time closed-loop planning, and (3) compromised robustness in semantically diverse environments. To address these challenges, we propose ReSem3D, a unified manipulation framework for semantically diverse environments that leverages the synergy between VFMs and MLLMs to achieve fine-grained visual grounding and dynamically construct hierarchical 3D spatial constraints for real-time manipulation. Specifically, the framework is driven by hierarchical recursive reasoning in MLLMs, which interact with VFMs to automatically construct 3D spatial constraints from natural language instructions and RGB-D observations in two stages: part-level extraction and region-level refinement. These constraints are then encoded as real-time optimization objectives in joint space, enabling reactive behavior under dynamic disturbances. Extensive simulation and real-world experiments are conducted in semantically rich household environments and semantically sparse chemical lab environments. The results demonstrate that ReSem3D performs diverse manipulation tasks under zero-shot conditions, exhibiting strong adaptability and generalization.
Framework. Given a natural language instruction and RGB-D observations, the VFM segments semantically relevant part-level regions and overlays visual prompts to facilitate initial constraint generation. Under the MLLM-driven TAMP framework, constraint modeling proceeds hierarchically in two stages: part-level extraction and region-level refinement. The resulting 3D spatial constraints are encoded as cost functions, parsed in real time, and solved in closed loop by an MPPI-based optimizer in Isaac Gym, enabling joint-space velocity control with point tracking.
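The closed-loop stage above can be illustrated with a minimal MPPI sketch. This is a hypothetical toy (a 2-DoF planar arm with a single keypoint-distance cost standing in for the paper's hierarchical 3D spatial constraints), not the actual Isaac Gym implementation: sample joint-velocity perturbations, roll them out, score each rollout with the constraint cost, and blend the first controls with softmax weights.

```python
import numpy as np

def constraint_cost(q, target):
    # Toy "constraint as cost": distance of a 2-DoF planar arm's
    # end-effector to a target keypoint (stand-in for the hierarchical
    # 3D spatial constraints used in the paper).
    x = np.cos(q[0]) + np.cos(q[0] + q[1])
    y = np.sin(q[0]) + np.sin(q[0] + q[1])
    return np.hypot(x - target[0], y - target[1])

def mppi_step(q, target, n_samples=256, horizon=10, dt=0.05,
              sigma=0.5, temperature=1.0):
    # Sample joint-velocity sequences and roll them out.
    rng = np.random.default_rng(0)
    noise = rng.normal(0.0, sigma, size=(n_samples, horizon, 2))
    costs = np.zeros(n_samples)
    for i in range(n_samples):
        qi = q.copy()
        for t in range(horizon):
            qi = qi + noise[i, t] * dt          # integrate sampled velocities
            costs[i] += constraint_cost(qi, target)
    # Softmax-weight rollouts by cost, then average their first controls.
    w = np.exp(-(costs - costs.min()) / temperature)
    w /= w.sum()
    return np.einsum('i,ij->j', w, noise[:, 0, :])

# Closed loop: repeatedly apply only the first optimized joint velocity.
q = np.array([0.1, 0.1])
target = (1.2, 0.8)
for _ in range(60):
    q = q + mppi_step(q, target) * 0.05
```

Re-solving at every step is what makes the behavior reactive: if the target keypoint moves between iterations, the next `mppi_step` simply optimizes against the updated cost.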
Task. ReSem3D is a unified robotic manipulation framework for semantically diverse environments. It leverages the synergy between MLLMs and VFMs to construct semantics-driven, two-stage hierarchical 3D spatial constraints, which are mapped into real-time optimization objectives in joint space to enable closed-loop perception-action control.
Interactive visualization 1
Interactive visualization 2
Disturbance for Household Tasks
"Fold towel."
"Throw trash into the bin."
Disturbance for Chemical Lab Tasks
"Pick up pestle on mortar."
"Pour liquid with beaker."
🌟 With two-stage hierarchical constraint modeling and MLLM-driven Task and Motion Planning, ReSem3D shows strong potential for long-horizon tasks: it achieves closed-loop control through autonomous multi-stage task decomposition, combined with condition reasoning and cost optimization.
Step 1: Pick up the tweezers.
Step 2: Grasp the stir bar with the tweezers.
Step 3: Place the stir bar into the empty beaker.
Step 4: Pour the liquid into the beaker.
Step 5: Place the beaker on the magnetic stirrer.
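The staged execution above can be sketched as a condition-checked plan loop. All names here are hypothetical illustrations of the idea, not the paper's code: each subtask carries a precondition and a postcondition, and the executor retries a stage until its postcondition holds before advancing.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Subtask:
    name: str
    precondition: Callable[[Dict], bool]   # e.g. "gripper is free"
    action: Callable[[Dict], None]         # one cost-optimized motion stage
    postcondition: Callable[[Dict], bool]  # e.g. "stir bar grasped"

def execute(plan, state, max_retries=3):
    # Run subtasks in order; verify each postcondition before moving on.
    for task in plan:
        if not task.precondition(state):
            raise RuntimeError(f"precondition failed: {task.name}")
        for _ in range(max_retries):
            task.action(state)
            if task.postcondition(state):
                break
        else:
            raise RuntimeError(f"postcondition failed: {task.name}")
    return state

# Toy two-stage plan mirroring Steps 1-2 above (state changes are mocked).
state = {"holding": None, "stir_bar": "table"}
plan = [
    Subtask("pick_tweezers",
            lambda s: s["holding"] is None,
            lambda s: s.update(holding="tweezers"),
            lambda s: s["holding"] == "tweezers"),
    Subtask("grasp_stir_bar",
            lambda s: s["holding"] == "tweezers",
            lambda s: s.update(stir_bar="tweezers"),
            lambda s: s["stir_bar"] == "tweezers"),
]
execute(plan, state)
```

In the real system the `action` would invoke the cost-optimized controller for that stage and the pre/postconditions would come from MLLM condition reasoning; here they are plain dictionary checks to keep the sketch self-contained.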
MLLM Extraction | MLLM Refinement
Task Planner | Constraint Extraction | Constraint Refinement | Subtask Execution | PreConditions Building | Cost Function Building | PostConditions Building