Bi-Adapt: Few-shot Bimanual Adaptation for Novel Categories of 3D Objects via Semantic Correspondence

Anonymous Authors

Teaser


Abstract

Bimanual manipulation tasks, such as unfolding a box or opening scissors, are prevalent in daily life and demand approaches that account for collaboration between both hands. These tasks, however, present substantial difficulties due to the need for large amounts of training data. Additionally, when manipulating novel object categories, the variation in geometric characteristics and physical properties across categories complicates generalization. To address these challenges, we present Bi-Adapt, a novel framework designed for efficient learning of bimanual manipulation for novel categories. Bi-Adapt leverages semantic correspondence from diffusion models, known for their strong generalization abilities, to transfer affordance to new categories. Furthermore, it introduces an efficient few-shot adaptation strategy that fine-tunes actions with minimal data, improving performance on novel object categories. Our experimental results demonstrate the high efficiency of Bi-Adapt, achieving high success rates in complex bimanual manipulation tasks with restricted data.

Pipeline


(Left) We collect large quantities of data in a simulator and train the Action Learning Network to build our supporting set. The corresponding affordances and point-level actions of the training categories serve as our prior knowledge. (Middle) When facing an unseen object from a novel category and a task, we retrieve familiar objects from our training categories within the same task. We leverage the powerful semantic correspondence capability of the diffusion model to achieve cross-category affordance generalization; in particular, we often obtain multiple contact-point pair candidates. (Lower Right) We then sample the affordance candidates and the action directions proposed by the pre-trained Action Learning Network and execute them. Through few-shot learning, our network is fine-tuned on the interaction results to adapt to the novel categories. (Upper Right) Finally, our fine-tuned framework facilitates manipulating unseen instances from novel categories with better performance.
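As a rough illustration of this loop, the sketch below walks through retrieval, affordance transfer, candidate execution, and fine-tuning. All names used here (supporting_set.retrieve, map_contact_points, action_net.propose, execute, finetune) are hypothetical placeholders for the components described above, not our actual implementation.

```python
# Minimal sketch of the Bi-Adapt adaptation loop. Every helper used here is a
# hypothetical placeholder standing in for a component described in the caption.

def adapt_to_novel_category(novel_obs, task, supporting_set, action_net, n_trials=8):
    """Few-shot adaptation on an object from a novel category."""
    # Retrieve a familiar training-category example for the same task.
    source = supporting_set.retrieve(task)

    # Cross-category affordance transfer via diffusion-feature correspondence;
    # this typically yields several contact-point pair candidates.
    candidates = map_contact_points(source, novel_obs)

    interactions = []
    for pair in candidates[:n_trials]:
        # The pre-trained Action Learning Network proposes action directions
        # for the two contact points.
        action = action_net.propose(novel_obs, pair)
        success = execute(pair, action)  # roll out in simulation or on the robot
        interactions.append((novel_obs, pair, action, success))

    # Fine-tune the network on the small set of interaction outcomes.
    finetune(action_net, interactions)
    return action_net
```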

More Visuals

Cross-category affordance generalization results


In each group of figures, from left to right, we present a source image as reference (markers indicate the first and second grippers' contact points) and a target image with the corresponding contact points inferred by the diffusion model. The degree of highlighting in the target image indicates the strength of the correspondence.
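The correspondence step can be sketched as a DIFT-style nearest-neighbour lookup over intermediate diffusion (U-Net) features. The snippet below assumes such feature maps have already been extracted for the source and target images; it is an illustration of the general technique under that assumption, not the exact procedure used in the paper.

```python
import torch.nn.functional as F

def transfer_contact_point(src_feat, tgt_feat, src_uv):
    """Map a contact point from a source image to a target image by
    nearest-neighbour matching of diffusion features.

    src_feat, tgt_feat: (C, H, W) feature maps from an intermediate layer of a
    diffusion model's U-Net (the feature extractor is assumed, not shown).
    src_uv: (row, col) of the annotated contact point in the source feature map.
    Returns the matched (row, col) in the target map and the similarity heat map
    (the "degree of highlighting" shown in the figures).
    """
    _, _, w = tgt_feat.shape
    query = src_feat[:, src_uv[0], src_uv[1]]                         # (C,)
    sim = F.cosine_similarity(query[:, None, None], tgt_feat, dim=0)  # (H, W)
    row, col = divmod(sim.flatten().argmax().item(), w)
    return (row, col), sim
```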

Adaptation Results


Visualization of manipulation results


In each row, from left to right, we show the process of each task, as indicated by the arrows above. The blue block represents action learning on the training categories, which builds the supporting set. The green block shows the mapping of contact points to novel categories. After few-shot learning on the novel categories (yellow block), our fine-tuned model selects the best contact-point pairs and proposes suitable actions for the novel categories (pink block).
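The few-shot learning step (yellow block) can be illustrated, under our own assumptions, as fine-tuning the Action Learning Network on a handful of interaction outcomes with a binary success loss. The network interface below (a call returning a success logit for an observation, contact-point pair, and action) is hypothetical, and serves only to make the adaptation step concrete.

```python
import torch
import torch.nn.functional as F

def fewshot_finetune(action_net, interactions, epochs=50, lr=1e-4):
    """Fine-tune the Action Learning Network on a small set of interactions
    collected on the novel category.

    Each element of `interactions` is assumed to be a tuple
    (observation_features, contact_pair, action, success_label), and
    `action_net(feats, pair, action)` is assumed to return a success logit.
    """
    opt = torch.optim.Adam(action_net.parameters(), lr=lr)
    for _ in range(epochs):
        for feats, pair, action, success in interactions:
            logit = action_net(feats, pair, action)
            label = torch.tensor([float(success)])
            loss = F.binary_cross_entropy_with_logits(logit.view(1), label)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return action_net
```

After fine-tuning, the same network scores the contact-point pair candidates so that the best pair and its proposed actions (pink block) can be selected.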

Real Experiment Setup

(a) Robot Setup


(b) Objects


(a) We use two UFactory xArm6 arms for manipulation and two RealSense D435 cameras to capture RGBD observations (a minimal capture sketch is shown below).
(b) We test on a diverse set of objects.
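For reference, grabbing one aligned RGB-D frame from one of the D435 cameras with pyrealsense2 might look like the following. The serial number, resolution, and frame rate are placeholders; this is a generic capture sketch, not our exact acquisition code.

```python
import numpy as np
import pyrealsense2 as rs

def capture_rgbd(serial, width=640, height=480, fps=30):
    """Grab one color/depth frame pair from the RealSense D435 with the given serial."""
    pipeline = rs.pipeline()
    config = rs.config()
    config.enable_device(serial)  # select one of the two D435 cameras
    config.enable_stream(rs.stream.color, width, height, rs.format.bgr8, fps)
    config.enable_stream(rs.stream.depth, width, height, rs.format.z16, fps)
    pipeline.start(config)
    try:
        align = rs.align(rs.stream.color)              # align depth to the color frame
        frames = align.process(pipeline.wait_for_frames())
        color = np.asanyarray(frames.get_color_frame().get_data())
        depth = np.asanyarray(frames.get_depth_frame().get_data())
    finally:
        pipeline.stop()
    return color, depth
```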

Visualization of real-world experiments


Each row represents a different task: (top) unfolding, (middle) opening, and (bottom) uncapping.
For each row, from left to right: (a) the target object from an unseen category; (b) contact points on the training categories; (c) contact points mapped onto the novel categories; (d) the initial pose of the grippers; (e) the final result after manipulation.

Real-world Demo

BibTeX
