TaMMa: Target-driven Multi-subscene Mobile Manipulation

Fudan University

Overview of our multi-subscene mobile manipulation approach. A coarse scene prior and refined pose estimation let the mobile base and the robotic arm carry out challenging compositional tasks such as pouring, stacking, and tidying up effectively and efficiently.

Abstract

For everyday service robots, the ability to navigate back and forth between subscenes as a task requires, and to perform delicate manipulation in each of them, is crucial and highly practical. Existing robotic systems mostly focus either on complex tasks within a single scene or on simple tasks across scalable scenes. A robot consisting of a mobile base and a robotic arm must instead efficiently represent multiple subscenes, coordinate the mobile base with the arm, and handle delicate tasks in scalable environments.

To address these challenges, we propose Target-driven Multi-subscene Mobile Manipulation (TaMMa), which efficiently handles mobile base movement and fine-grained manipulation across subscenes. Specifically, we obtain a reliable 3D Gaussian initialization of the whole scene from a sparse 3D point cloud with encoded semantics. By querying the coarse Gaussians, we acquire the approximate pose of the target, navigate the mobile base toward it, and restrict precise target pose estimation to the corresponding subscene. While the base is moving, we use diffusion-based depth completion to optimize fine-grained Gaussians and estimate the target's refined pose. For target-driven manipulation, we adopt Gaussian inpainting to obtain precise poses for the origin and destination of the operation in a "think before you do it" manner, enabling fine-grained manipulation.

We conduct various experiments on a real robot to demonstrate that our method effectively and efficiently accomplishes precise manipulation tasks across multiple tabletop subscenes.
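
The coarse query step described in the abstract can be illustrated with a minimal sketch. It assumes the coarse Gaussians are available as an array of centers with one integer semantic label each, takes the median center of the matching Gaussians as the approximate target position, and places the base goal a fixed standoff distance short of the target. The function names, the label encoding, and the `standoff` parameter are all hypothetical and are not taken from the TaMMa implementation.

```python
import numpy as np


def coarse_target_position(centers, labels, target_label):
    """Approximate target position from coarse Gaussians (hypothetical helper).

    centers: (N, 3) Gaussian centers in the world frame.
    labels:  (N,) integer semantic label per Gaussian (assumed encoding).
    """
    mask = labels == target_label
    if not mask.any():
        raise ValueError("no Gaussians carry the requested label")
    # The median is less sensitive than the mean to a few mislabeled Gaussians.
    return np.median(centers[mask], axis=0)


def base_goal(base_xy, target_xy, standoff=0.6):
    """Ground-plane goal for the base: stop `standoff` metres short of the target.

    Returns the goal position and the yaw (radians) that faces the target.
    """
    base_xy = np.asarray(base_xy, dtype=float)
    target_xy = np.asarray(target_xy, dtype=float)
    offset = target_xy - base_xy
    dist = np.linalg.norm(offset)
    if dist <= standoff:
        # Already close enough: stay in place and only turn toward the target.
        goal_xy = base_xy.copy()
    else:
        goal_xy = target_xy - standoff * offset / dist
    yaw = float(np.arctan2(offset[1], offset[0]))
    return goal_xy, yaw
```

The refined pose would then be estimated only within the subscene around this approximate position; that refinement is not shown here.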

Method Overview

RGB-D sequences with camera poses are unprojected into a point cloud, which initializes the Gaussians. Grounded-Light-HQ-SAM is used to encode semantics for the Gaussians and for depth completion. The coarse Gaussians provide a scene prior for mobile base navigation; a fine-grained manipulation target pose is then optimized from the completed and inpainted Gaussians using an object mask.
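
The point cloud unprojection mentioned above can be sketched with a standard pinhole camera model. This is a generic illustration, not the authors' code: it assumes a depth image in metres, a 3x3 intrinsics matrix `K`, and a 4x4 camera-to-world pose, and the function name is hypothetical.

```python
import numpy as np


def unproject_depth(depth, K, cam_to_world):
    """Back-project a depth image into world-frame 3D points.

    depth:        (H, W) depth in metres; zeros mark missing measurements.
    K:            (3, 3) pinhole intrinsics (assumed, no lens distortion).
    cam_to_world: (4, 4) camera pose as a homogeneous transform.
    Returns an (N, 3) array with one point per valid pixel.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z], axis=1)
    # Rotate and translate from the camera frame into the world frame.
    return pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
```

Points from all frames of the sequence would be concatenated, and typically downsampled, before serving as initial Gaussian centers.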

Challenging tasks supported

BibTeX


        @inproceedings{hou2024tamma,
            title = {TaMMa: Target-driven Multi-subscene Mobile Manipulation},
            author = {Hou, Jiawei and Wang, Tianyu and Wang, Shouyan and Xue, Xiangyang and Fu, Yanwei},
            booktitle = {arXiv},
            year = {2024},
        }