ALOHA
ALOHA is an abbreviation for “A Low-cost Open-source Hardware system for bimanual teleoperation”. As the name suggests, it is a low-cost bimanual teleoperation setup, which the authors use for fine manipulation tasks. In this paper, they ask how robots can learn such tasks effectively at low cost, and their answer is a system that performs end-to-end imitation learning directly from real demonstrations collected with a custom teleoperation interface.
However, imitation learning has problems such as errors in the policy compounding over time and human demonstrations being non-stationary. In this paper, they present a method to overcome these problems. Now, let’s look at the details.
Architecture of ALOHA
To make the robot move precisely while keeping costs low, they designed two components:
Teleoperation system
They devised a teleoperation setup with two pairs of robot arms:
In the image above, the two arms in the back are called “follower robots” and the two arms in front are called “leader robots”. As the name suggests, the follower robots follow the leader robots controlled by a human, as shown below:
To make the robot aware of the scene, they set up 4 cameras: two mounted on the wrists of the follower robots, one at the front, and one at the top of the setup.
Imitation learning algorithm
They found that small errors in the predicted actions can incur large differences in the state, exacerbating the compounding error problem of imitation learning. To tackle it, they used action chunking.
Predicting action sequences also helps tackle temporally correlated confounders, such as pauses in demonstrations, that are hard to model with Markovian single-step policies.
Also, to improve the smoothness of the policy, they proposed temporal ensembling.
They implemented the action chunking policy with transformers and trained it as a CVAE (Conditional Variational Autoencoder) to capture the variability in human data, naming it ACT (Action Chunking with Transformers).
Action Chunk
Action chunking is a concept that describes how sequences of actions are grouped together as a chunk and executed as one unit. In their case, the policy predicts the target joint positions for the next $k$ timesteps. This means that every $k$ steps, the agent receives an observation, generates the next $k$ actions, and executes the actions in sequence.
This implies a $k$-fold reduction in the effective horizon of the task: the policy models $\pi_{\theta}(a_{t:t+k}|s_{t})$ instead of $\pi_{\theta}(a_{t}|s_{t})$. Chunking also helps with temporally correlated confounders, such as pauses in the middle of a demonstration, where the behavior depends not only on the state but also on the timestep, which a single-step policy struggles to capture. Since $k$ dictates how long the sequence in each chunk is, we can analyze this hypothesis by varying $k$: $k=1$ corresponds to no action chunking, and $k=\text{episode length}$ corresponds to fully open-loop control, where the robot outputs the entire episode’s action sequence based on the first observation.
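As a rough sketch of this idea (the `policy` and `env` objects below are hypothetical placeholders, not the paper's code):

```python
# Minimal sketch of naive action chunking: the policy is queried once every k steps,
# and the whole predicted chunk is executed open-loop before observing again.
def run_episode(policy, env, episode_length, k):
    t = 0
    while t < episode_length:
        obs = env.observe()            # one observation per chunk
        chunk = policy(obs)            # next k target joint positions
        for action in chunk[:k]:
            env.step(action)
            t += 1
    # k = 1 recovers a single-step policy; k = episode_length is fully open-loop control.
```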
However, a naive implementation of action chunking can be suboptimal because a new environment observation is incorporated abruptly every $k$ steps, which can result in jerky robot motion. To improve smoothness and avoid discrete switching between executing and observing, they query the policy at every timestep.
This approach causes different action chunks to overlap with each other, resulting in more than one predicted action at a given timestep. To combine these overlapping predictions at each timestep, they proposed temporal ensembling.
Temporal Ensemble
Temporal ensembling queries the policy more frequently and averages across the overlapping action chunks:
Specifically, it performs a weighted average over those actions using an exponential weighting scheme $w_{i} = \exp(-m \cdot i)$, where $w_{0}$ is the weight of the oldest action and a smaller $m$ means new observations are incorporated faster.
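A small sketch of this weighting (illustrative only; `actions_for_t` is assumed to hold every action that earlier chunks predicted for the current timestep, oldest first):

```python
import numpy as np

def temporal_ensemble(actions_for_t, m=0.01):
    """Weighted average of the overlapping predictions for one timestep."""
    actions = np.stack(actions_for_t)               # (num_overlapping_chunks, action_dim)
    weights = np.exp(-m * np.arange(len(actions)))  # w_i = exp(-m * i); w_0 is the oldest
    return (weights[:, None] * actions).sum(axis=0) / weights.sum()
```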
Human Data Modeling
Modeling human data is challenging because the policy must learn from noisy human demonstrations. Therefore, they make the policy focus on regions where high precision matters.
They trained the policy as a CVAE to generate an action sequence conditioned on current observations.
CVAE
CVAE stands for Conditional Variational Autoencoder. It consists of two components:
CVAE encoder
The CVAE encoder is used only to train the CVAE decoder and is discarded during testing. Specifically, the CVAE encoder predicts the mean and variance of the style variable $z$’s distribution, which is parameterized as a diagonal Gaussian, given the current observation and action sequence as inputs.
Unlike standard CVAEs, they omit the image observations and condition only on the proprioceptive observation and the action sequence to make training faster.
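A minimal sketch of the encoder's role might look as follows; note this is illustrative, with an MLP standing in for the paper's transformer encoder and made-up dimensions (e.g. 14 joint values for the two arms, chunk size $k$, a 32-dimensional $z$):

```python
import torch
import torch.nn as nn

class StyleEncoderSketch(nn.Module):
    """Predicts the mean and variance of the style variable z from joints + action chunk."""
    def __init__(self, joint_dim=14, action_dim=14, k=100, hidden=512, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_dim + k * action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),               # mean and log-variance of z
        )

    def forward(self, joints, action_chunk):
        x = torch.cat([joints, action_chunk.flatten(1)], dim=-1)
        mu, logvar = self.net(x).chunk(2, dim=-1)
        # Reparameterization: sample z ~ N(mu, diag(exp(logvar))) during training.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar
```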
CVAE decoder
The CVAE decoder conditions on both $z$ and the current observations to predict the action sequence. The entire model is trained to maximize the log-likelihood of demonstration action chunks. The formula is:
$\min_{\theta} \; -\sum_{s_{t},\, a_{t:t+k} \in D} \log \pi_{\theta}(a_{t:t+k} \mid s_{t})$
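In practice, this log-likelihood is optimized through the standard CVAE objective, which splits into a reconstruction term and a KL term that pulls the encoder’s $z$ distribution toward the unit Gaussian prior (the paper uses an L1 reconstruction loss and weights the KL term with a hyperparameter $\beta$); roughly:

$\mathcal{L}(\theta, \phi) = \lVert a_{t:t+k} - \hat{a}_{t:t+k} \rVert_{1} + \beta \, D_{\mathrm{KL}}\big(q_{\phi}(z \mid s_{t}, a_{t:t+k}) \,\Vert\, \mathcal{N}(0, I)\big)$

where $\hat{a}_{t:t+k}$ is the CVAE decoder’s prediction and $q_{\phi}$ is the CVAE encoder.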
ACT Architecture
ACT consists of two elements: the CVAE encoder and the CVAE decoder described above, both implemented with transformers.
ACT Training
They record the joint positions of the leader robots (input from the human operator) using ALOHA and use them as actions. The observations are composed of the current joint positions of the follower robots and the image feed from four cameras (Step 1).
Then, they infer the style variable $z$ using the CVAE encoder (Step 2). As you can see in the image below, the encoder takes a [CLS] token (whose weights are randomly initialized and learned during training) together with the embedded joint positions and action sequence, and the output feature at the [CLS] position is used to predict the mean and variance of $z$.
Next, they obtain the predicted action from the CVAE decoder (Step 3). Each image observation is first processed by a ResNet18 to obtain a feature map, which is then flattened to get a sequence of features. These features are projected to the embedding dimension with a linear layer, and 2D sinusoidal position embeddings are added to preserve the spatial information. The feature sequences from all cameras are concatenated and used as input to the transformer encoder. Additionally, the joint positions and $z$ are each projected to the embedding dimension with a linear layer. The outputs of the transformer encoder are used as “keys” and “values” in the cross-attention layers of the transformer decoder, which predicts the action sequence given the encoder output.
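A hedged PyTorch-style sketch of this pathway (illustrative dimensions and layer counts; the 2D sinusoidal position embeddings and other details of the released implementation are omitted):

```python
import torch
import torch.nn as nn
import torchvision

class ActDecoderSketch(nn.Module):
    """Illustrative CVAE decoder: images + joints + z -> chunk of k target joint positions."""
    def __init__(self, d_model=512, joint_dim=14, z_dim=32, k=100, action_dim=14):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # keep the feature map
        self.img_proj = nn.Conv2d(512, d_model, kernel_size=1)          # project to embedding dim
        self.joint_proj = nn.Linear(joint_dim, d_model)
        self.z_proj = nn.Linear(z_dim, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8, batch_first=True)
        self.query_embed = nn.Embedding(k, d_model)                     # one query per predicted timestep
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, images, joints, z):
        # images: (B, num_cams, 3, H, W); joints: (B, joint_dim); z: (B, z_dim)
        tokens = []
        for cam in range(images.shape[1]):
            feat = self.img_proj(self.backbone(images[:, cam]))   # (B, d_model, h, w)
            # 2D sinusoidal position embeddings would be added to `feat` here (omitted).
            tokens.append(feat.flatten(2).transpose(1, 2))        # flatten to a token sequence
        tokens.append(self.joint_proj(joints)[:, None])           # joints as one extra token
        tokens.append(self.z_proj(z)[:, None])                    # z as one extra token
        src = torch.cat(tokens, dim=1)                            # transformer encoder input
        queries = self.query_embed.weight[None].expand(images.shape[0], -1, -1)
        out = self.transformer(src, queries)   # decoder cross-attends to the encoder output
        return self.action_head(out)           # (B, k, action_dim) predicted joint targets
```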
Intuitively, ACT tries to imitate what a human operator would do in the following time steps given the current observations.
We can see the steps of training:
Also, we can see the pseudocode:
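The pseudocode itself is in the original figure; as a rough stand-in, here is a hedged sketch of one training step, reusing the illustrative `StyleEncoderSketch`/`ActDecoderSketch` modules above and the L1 + $\beta$-weighted KL objective:

```python
import torch.nn.functional as F

def training_step(batch, encoder, decoder, optimizer, beta=10.0):
    images, joints, action_chunk = batch             # Step 1: demo observations + next k leader actions
    z, mu, logvar = encoder(joints, action_chunk)    # Step 2: infer the style variable z
    pred_actions = decoder(images, joints, z)        # Step 3: predict the action chunk
    recon = F.l1_loss(pred_actions, action_chunk)    # reconstruction term
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()  # KL to N(0, I)
    loss = recon + beta * kl                         # beta is a hyperparameter
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```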
ACT Testing
At test time, only the CVAE decoder is used, serving as the policy. The incoming observations (images and joint positions) are fed into the model in the same way as during training. The only difference is in $z$.
They simply set $z$ to a zero vector, which is the mean of the unit Gaussian prior used during training. Thus, given an observation, the output of the policy is always deterministic, benefiting policy evaluation.
We can see the processing pipeline at test time:
Also, we can see the pseudocode:
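Again, the original pseudocode is a figure; a hedged sketch of the test-time loop (hypothetical `decoder` and `env` interfaces, $z$ fixed to zero, temporal ensembling over the buffered overlapping chunks) might look like:

```python
import numpy as np
import torch

def rollout(decoder, env, episode_length, k, z_dim=32, m=0.01):
    """Query the policy every timestep with z = 0 and execute the ensembled action."""
    buffers = [[] for _ in range(episode_length + k)]    # predictions collected per timestep
    z = torch.zeros(1, z_dim)                            # z fixed to the prior mean (zeros)
    for t in range(episode_length):
        images, joints = env.observe()
        with torch.no_grad():
            chunk = decoder(images, joints, z)[0].numpy()    # (k, action_dim)
        for i, action in enumerate(chunk):
            buffers[t + i].append(action)                # file each action under its timestep
        preds = np.stack(buffers[t])                     # overlapping predictions for step t
        w = np.exp(-m * np.arange(len(preds)))           # exponential weights, oldest first
        env.step((w[:, None] * preds).sum(axis=0) / w.sum())
```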
Result
Through these methods, they made the robot arms perform tasks without direct human manipulation. Additionally, the robot can modify its actions when a task fails partway through. You can see the demonstration below.