
Mobile ALOHA

Mobile ALOHA is a low-cost, whole-body teleoperation system for collecting data on mobile manipulation tasks. It augments the original ALOHA system with a mobile base and a whole-body teleoperation interface.

With the data collected by Mobile ALOHA, the authors perform supervised behavior cloning and co-train with existing static ALOHA datasets to boost performance on mobile manipulation tasks.

According to the paper, co-training increases success rates by up to 90%, allowing Mobile ALOHA to autonomously complete complex mobile manipulation tasks.




Why Did They Make Mobile ALOHA?

Imitation learning from human demonstrations allows people to teach robots arbitrary skills; however, many tasks in realistic, everyday environments require whole-body coordination of both mobility and dexterous manipulation, rather than just individual mobility or manipulation behaviors.

To address this problem, the paper studied the feasibility of extending imitation learning to tasks that require whole-body control of bimanual mobile robots.




How Did They Make Mobile ALOHA?

While developing Mobile ALOHA, they identified two main factors that hinder the wide adoption of imitation learning for bimanual mobile manipulation.

  1. Lack of accessible, plug-and-play hardware for whole-body teleoperation:
    • To enable teleoperation of all degrees of freedom, they added low-cost hardware on top of the existing ALOHA platform, as described in the hardware section below.


  2. Prior robot learning works, including ALOHA, have not demonstrated high-performance bimanual mobile manipulation for complex tasks:
    • To solve this problem, they leveraged data from the existing static ALOHA datasets. Despite the differences in tasks and morphology, they achieved positive transfer in nearly all mobile manipulation tasks, attaining equivalent or better performance and data efficiency than policies trained using only Mobile ALOHA data.


As a result, the system achieves the following four properties:

  1. Mobile: The system can move at a speed comparable to human walking, around 1.42 m/s.

  2. Stable: It is stable when manipulating heavy household objects, such as pots and cabinets.

  3. Whole-body teleoperation: All degrees of freedom can be teleoperated simultaneously, including both arms and the mobile base.

  4. Untethered: Onboard power and compute.

They chose the AgileX Tracer AGV (“Tracer”) as the mobile base for “Mobile” and “Stable,” as you can see above. For “Whole-body teleoperation”, they selected a teleoperation system that allows simultaneous control of both the base and two arms. Finally, they solved the “Untethered” problem by attaching the battery and laptop to the robot, as shown below:





How Did They Do Co-training with Static ALOHA Data?

Policies trained on specialized datasets collected by human operators are often not robust to perceptual perturbations, such as distractors or lighting changes, because of the limited visual diversity in those datasets. To address this, they used a co-training pipeline that leverages the existing static ALOHA datasets to improve the performance of imitation learning.



Formula of the Pipeline

$\mathbb{E}_{(o^{i}, a^{i}_{\text{arms}},a^{i}_{\text{base}}) \ \sim \ D^{m}_{\text{mobile}}}[L(a^{i}_{\text{arms}},a^{i}_{\text{base}},\pi^{m}(o^{i}))] \ + \\ \mathbb{E}_{(o^{i},a^{i}_{\text{arms}}) \ \sim \ D_{\text{static}}}[L(a^{i}_{\text{arms}},[0,0], \pi^{m}(o^{i}))]$

where :

  • $o^{i}$ : the observation, consisting of two wrist camera RGB observations, one egocentric top camera RGB observation, and the joint positions of the arms.

  • $a_{\text{arms}}$ : bimanual actions formulated as target joint positions, $a_{\text{arms}} \in \mathbb{R}^{14}$.

  • $a_{\text{base}}$ : the base actions formulated as target base linear and angular velocities $a_{\text{base}} \in \mathbb{R}^{2}$.

  • $D^{m}_{\text{mobile}}$ : Mobile ALOHA dataset for a task $m$.

  • $\pi^{m}(o^{i})$ : the action predicted by the mobile manipulation policy $\pi^{m}$ for task $m$, given observation $o^{i}$.

  • $D_{\text{static}}$ : static ALOHA data

The formula represents:

  • The sum of the expected value of $L(a^{i}_{\text{arms}}, a^{i}_{\text{base}}, \pi^{m}(o^{i}))$ over $(o^{i}, a^{i}_{\text{arms}}, a^{i}_{\text{base}})$ sampled from $D^{m}_{\text{mobile}}$, and the expected value of $L(a^{i}_{\text{arms}}, [0,0], \pi^{m}(o^{i}))$ over $(o^{i}, a^{i}_{\text{arms}})$ sampled from $D_{\text{static}}$.
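To make the objective concrete, here is a minimal PyTorch sketch of how the two expectations could be evaluated on one batch from each dataset. The function name `cotraining_loss`, the assumption that the policy outputs a 16-dimensional action (14 arm joint targets plus 2 base velocities), and the choice of an L1 loss for $L$ (as in ACT) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def cotraining_loss(policy, mobile_batch, static_batch):
    # Term 1: Mobile ALOHA data, supervised on both arm and base actions.
    obs_m, arms_m, base_m = mobile_batch              # arms_m: (B, 14), base_m: (B, 2)
    target_m = torch.cat([arms_m, base_m], dim=-1)    # (B, 16)
    loss_mobile = F.l1_loss(policy(obs_m), target_m)

    # Term 2: static ALOHA data, base action labels padded with [0, 0].
    obs_s, arms_s = static_batch                      # arms_s: (B, 14)
    zero_base = torch.zeros(arms_s.shape[0], 2, device=arms_s.device)
    target_s = torch.cat([arms_s, zero_base], dim=-1)
    loss_static = F.l1_loss(policy(obs_s), target_s)

    # Sum of the two expectations in the formula above.
    return loss_mobile + loss_static
```

Summing the two per-batch losses and backpropagating corresponds to one gradient step of the co-training objective.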

They sample with equal probability from $D_{\text{static}}$ and $D^{m}_{\text{mobile}}$ and set the batch size to 16.
Since $D_{\text{static}}$ data points have no mobile base actions, they zero-pad the action labels so that the two datasets have the same dimensions.
They also ignore the front camera in the $D_{\text{static}}$ data so that both datasets have three cameras.
Finally, they normalize every action based on the statistics of $D^{m}_{\text{mobile}}$ alone.
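The data pipeline can be pictured as a sampler like the one sketched below, which mixes the two datasets with equal probability, zero-pads the static base actions, and normalizes with Mobile ALOHA statistics. The class name `CoTrainingSampler` and the data layout are hypothetical; the authors' actual data loader may be organized differently.

```python
import numpy as np

class CoTrainingSampler:
    """Hypothetical sketch of 50/50 sampling from static and Mobile ALOHA data."""

    def __init__(self, mobile_data, static_data, batch_size=16):
        self.mobile = mobile_data    # list of (obs, arm_action[14], base_action[2])
        self.static = static_data    # list of (obs, arm_action[14]); front camera already dropped
        self.batch_size = batch_size
        # Normalization statistics come from the Mobile ALOHA data alone.
        mobile_actions = np.stack([np.concatenate([a, b]) for _, a, b in self.mobile])
        self.mean = mobile_actions.mean(axis=0)
        self.std = mobile_actions.std(axis=0) + 1e-6

    def sample_batch(self):
        obs_batch, action_batch = [], []
        for _ in range(self.batch_size):
            # Draw from the two datasets with equal probability.
            if np.random.rand() < 0.5:
                obs, arms, base = self.mobile[np.random.randint(len(self.mobile))]
            else:
                obs, arms = self.static[np.random.randint(len(self.static))]
                base = np.zeros(2)   # zero-pad the missing base action labels
            action = (np.concatenate([arms, base]) - self.mean) / self.std
            obs_batch.append(obs)
            action_batch.append(action)
        return obs_batch, np.stack(action_batch)
```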

In their experiments, they combined the co-training recipe with multiple base imitation learning approaches, including ACT, Diffusion Policy, and VINN.





Tasks

They selected 7 tasks:



Below, you can see all tasks:



Wipe Wine

  • The robot base is initialized within a square of $1.5 m \times 1.5 m$ with a yaw of up to $30^{\circ}$ (init).

  • It first navigates to the sink and picks up the towel hanging on the faucet (#1).

  • It then turns around and approaches the kitchen island, picks up the wine glass (randomized in a $30 cm \times 30 cm$ area), and wipes the spilled wine (#2).

  • It then puts down the wine glass on the table (#3).

  • Each demo consists of 1300 steps or 26 seconds.



Cook Shrimp

  • The robot’s position is randomized up to 5 cm, and all objects’ positions are randomized up to 2 cm (init).
  • The right gripper first pours oil into the hot pan (#1).
  • After that, it pours raw shrimp into the hot pan (#2).
  • With the left gripper lifting the pan at an angle, the right gripper grasps the spatula and flips the shrimp (#3).
  • The robot then turns around and pours the shrimp into an empty bowl (#4).
  • Each demo consists of 3750 steps or 75 seconds.



Wash Pan

  • The pan’s position is randomized up to 10 cm with a yaw of up to $45^{\circ}$.
  • The left gripper grasps the pan (#1).
  • The right gripper opens and then closes the faucet while the left gripper holds the pan to receive the water (#2).
  • The left gripper then swirls the water inside the pan and pours it out (#3).
  • Each demo consists of 1100 steps or 22 seconds.



Use Cabinet

  • The robot’s position is randomized up to 10 cm, and the pot’s position is randomized up to 5 cm (init).
  • The robot approaches the cabinet, grasps both handles, then backs up, pulling both doors open (#1).
  • Both arms grasp the handles of the pot, move forward, and place it inside the cabinet (#2, #3).
  • The robot backs up and closes both cabinet doors (#4).
  • Each demo consists of 1500 steps or 30 seconds.



Take Elevator

  • The robot starts 15 m from the elevator and is randomized across the 10 m wide lobby (init).
  • The robot goes around a column to reach the elevator button (#1).
  • The right gripper presses the button (#2).
  • The robot enters the elevator (#3).
  • Each demo consists of 2250 steps or 45 seconds.



Push Chairs

  • The robot’s initial position is randomized up to 10 cm (init).
  • The demonstration dataset contains the robot pushing the first 3 chairs (#1, #2, #3).
  • Each demo consists of 2000 steps or 40 seconds.



High Five

  • The robot base is initialized next to the kitchen island (init).
  • The robot keeps moving around the kitchen island until a human is in front of it, then gives a high five to the human (#1, #2).
  • Each demo consists of 2000 steps or 40 seconds.





Result

Training robots through teleoperation is a very clever idea. It allows anyone, even without any robotics knowledge, to teach precise motions by demonstration. Additionally, Mobile ALOHA overcomes the disadvantages of traditional manipulation approaches built on analytical kinematics libraries such as KDL.
