For 3D object manipulation, methods that build an explicit 3D representation perform better than those relying only on camera images. But explicit 3D representations such as voxels come at a large computational cost, which hurts scalability.
We propose RVT, a multi-view transformer for 3D manipulation that is both scalable and accurate. RVT takes camera images and a task language description as inputs and predicts the gripper pose action. In simulation, we find that a single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than the existing state-of-the-art method (PerAct). It also trains 36X faster than PerAct to reach the same performance and runs inference 2.3X faster than PerAct. Further, RVT can perform a variety of manipulation tasks in the real world with just a few (~10) demonstrations per task.
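The 26% figure above is a relative gain, which is easy to confuse with a difference in percentage points. The following is a minimal sketch of the distinction, assuming the usual definition (new - baseline) / baseline. The function name and the two success rates are hypothetical placeholders chosen for illustration, not results reported for RVT or PerAct.

```python
def relative_improvement(new_rate: float, baseline_rate: float) -> float:
    """Fractional gain of new_rate over baseline_rate (hypothetical helper)."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate


# Placeholder success rates for illustration only; NOT numbers from the paper.
baseline = 0.50
improved = 0.63

# Absolute gain: plain difference, expressed in percentage points.
absolute_gain = improved - baseline

# Relative gain: the difference scaled by the baseline rate.
relative_gain = relative_improvement(improved, baseline)

print(f"absolute gain: {absolute_gain * 100:.1f} percentage points")
print(f"relative gain: {relative_gain * 100:.1f}%")
```

With these placeholder rates, the absolute gain is 13 percentage points while the relative gain is 26%, so the same improvement looks twice as large when stated relatively.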
We trained one RVT model on real-world data and another on RLBench simulation data. In each setting, that single trained model is evaluated on all tasks.
Success: Put orange bottle in drawer
Success: Put orange bottle in drawer
Success: Put orange bottle in drawer
Failure: Put blue marker in drawer
Success: Put yellow block in top shelf
Success: Put yellow block in bottom shelf
Success: Put yellow block in top shelf
Failure: Put yellow block in top shelf
Success: Put yellow block on blue block
Success: Put blue block on red block
Success: Put red block on yellow block
Success: Press Sanitizer
Success: Press Sanitizer
Success: Press Sanitizer
Failure: Press Sanitizer
Failure: Put green marker in bowl
Failure: Put blue marker in bowl
Failure: Put green marker in mug
Success: put the item in the top drawer
Success: put the item in the bottom drawer
Success: put the item in the top drawer
Failure: put the item in the middle drawer
Success: sweep dirt to the short dustpan
Success: sweep dirt to the short dustpan
Success: sweep dirt to the tall dustpan
Failure: sweep dirt to the tall dustpan
Success: take the steak off the grill
Success: take the steak off the grill
Success: take the steak off the grill
Failure: take the steak off the grill
Success: open the top drawer
Success: open the middle drawer
Success: open the bottom drawer
Failure: open the top drawer
Success: turn right tap
Success: turn left tap
Success: turn left tap
Success: turn right tap
Success: close the cyan jar
Success: close the orange jar
Success: close the navy jar
Failure: close the red jar
Success: use the stick to drag the cube onto the navy target
Success: use the stick to drag the cube onto the gray target
Success: use the stick to drag the cube onto the red target
Success: use the stick to drag the cube onto the silver target
Success: stack 3 teal blocks
Success: stack 3 gray blocks
Success: stack 4 navy blocks
Failure: stack 2 maroon blocks
Success: screw in the rose light bulb
Success: screw in the gray light bulb
Success: screw in the violet light bulb
Failure: screw in the silver light bulb
Success: slide the block to pink target
Success: slide the block to yellow target
Success: slide the block to green target
Failure: slide the block to pink target
Success: put the money away in the safe on the top shelf
Success: put the money away in the safe on the bottom shelf
Success: put the money away in the safe on the middle shelf
Failure: put the money away in the safe on the top shelf
Success: stack the wine bottle to the left of the rack
Success: stack the wine bottle to the middle of the rack
Success: stack the wine bottle to the right of the rack
Failure: stack the wine bottle to the middle of the rack
Success: put the coffee in the cupboard
Success: put the mustard in the cupboard
Success: put the chocolate jello in the cupboard
Failure: put the coffee in the cupboard
Success: put the cylinder in the shape sorter
Success: put the star in the shape sorter
Success: put the moon in the shape sorter
Failure: put the star in the shape sorter
Success: push the maroon button, then push the green button, then push the navy button
Success: push the maroon button
Success: push the maroon button
Failure: push the maroon button
Success: put the ring on the violet spoke
Success: put the ring on the black spoke
Failure: put the ring on the green spoke
Failure: put the ring on the azure spoke
Success: stack the other cups on top of the lime cup
Success: stack the other cups on top of the gray cup
Success: stack the other cups on top of the red cup
Failure: stack the other cups on top of the maroon cup
Failure: place 3 cups on the cup holder
Failure: place 2 cups on the cup holder
Failure: place 2 cups on the cup holder
@article{goyal2023rvt,
author = {Goyal, Ankit and Xu, Jie and Guo, Yijie and Blukis, Valts and Chao, Yu-Wei and Fox, Dieter},
title = {RVT: Robotic View Transformer for 3D Object Manipulation},
journal = {arXiv:2306.14896},
year = {2023},
}