ALOHA
ALOHA is an abbreviation for “A Low-cost Open-source Hardware system for bimanual teleoperation”. As the name suggests, it is a low-cost bimanual teleoperation setup, which the authors use for fine manipulation tasks. In this paper, they ask how robots can learn such tasks effectively at low cost, and their answer is a system that performs end-to-end imitation learning directly from real demonstrations collected with a custom teleoperation interface.
However, imitation learning has problems such as errors in the policy compounding over time and human demonstrations being non-stationary. In this paper, they present a method to overcome these problems. Now, let’s look at the details.
Architecture of ALOHA
To make the robot move precisely while keeping costs low, they designed two components:
Teleoperation system
They devised a teleoperation setup with two pairs of robot arms:
In the image above, the two arms in the back are called “follower robots” and the two arms in front are called “leader robots”. As the name suggests, the follower robots follow the leader robots controlled by a human, as shown below:
To make the robot aware of the scene, they set up 4 cameras: two mounted on the wrists of the follower robots, one at the front, and one at the top of the setup.
Imitation learning algorithm
They found that small errors in the predicted actions can incur large differences in the state, exacerbating the compounding error problem of imitation learning. To tackle it, they used action chunking.
Predicting action sequences also helps tackle temporally correlated confounders, such as pauses in demonstrations, that are hard to model with Markovian single-step policies.
Also, to improve the smoothness of the policy, they proposed temporal ensembling.
They implemented the action chunking policy with transformers and trained it as a CVAE (Conditional Variational Autoencoder) to capture the variability in human data, naming it ACT (Action Chunking with Transformers).
Action Chunk
Action chunking is a concept that describes how sequences of actions are grouped together as a chunk and executed as one unit. In their case, the policy predicts the target joint positions for the next $k$ timesteps. This means that every $k$ steps, the agent receives an observation, generates the next $k$ actions, and executes the actions in sequence.
This implies a $k$-fold reduction in the effective horizon of the task: the policy models $\pi_{\theta}(a_{t:t+k}|s_{t})$ instead of $\pi_{\theta}(a_{t}|s_{t})$. Chunking also helps with temporally correlated confounders, such as pauses in the middle of a demonstration, where the behavior depends not only on the state but also on the timestep, which a single-step policy struggles to capture. Since $k$ dictates how long the sequence in each chunk is, we can analyze this hypothesis by varying $k$: $k=1$ corresponds to no action chunking, and $k=\text{episode length}$ corresponds to fully open-loop control, where the robot outputs the entire episode’s action sequence based on the first observation.
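As a rough sketch of this idea (the `policy` and `env` objects below are hypothetical placeholders, not the paper's code):

```python
# Minimal sketch of naive action chunking: the policy is queried once every k steps,
# and the whole predicted chunk is executed open-loop before observing again.
def run_episode(policy, env, episode_length, k):
    t = 0
    while t < episode_length:
        obs = env.observe()            # one observation per chunk
        chunk = policy(obs)            # next k target joint positions
        for action in chunk[:k]:
            env.step(action)
            t += 1
    # k = 1 recovers a single-step policy; k = episode_length is fully open-loop control.
```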
However, a naive implementation of action chunking can be suboptimal because a new environment observation is incorporated abruptly every $k$ steps, which can result in jerky robot motion. To improve smoothness and avoid discrete switching between executing and observing, they query the policy at every timestep.
This approach causes different action chunks to overlap with each other, resulting in more than one predicted action at a given timestep. To combine these overlapping predictions at each timestep, they proposed temporal ensembling.
Temporal Ensemble
Temporal ensembling queries the policy more frequently and averages across the overlapping action chunks:
Specifically, it performs a weighted average over those actions using an exponential weighting scheme $w_{i} = \exp(-m \cdot i)$, where $w_{0}$ is the weight of the oldest action and a smaller $m$ means new observations are incorporated faster.
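A small sketch of this weighting (illustrative only; `actions_for_t` is assumed to hold every action that earlier chunks predicted for the current timestep, oldest first):

```python
import numpy as np

def temporal_ensemble(actions_for_t, m=0.01):
    """Weighted average of the overlapping predictions for one timestep."""
    actions = np.stack(actions_for_t)               # (num_overlapping_chunks, action_dim)
    weights = np.exp(-m * np.arange(len(actions)))  # w_i = exp(-m * i); w_0 is the oldest
    return (weights[:, None] * actions).sum(axis=0) / weights.sum()
```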
Human Data Modeling
Modeling human data is challenging because the policy must learn from noisy human demonstrations. Therefore, they make the policy focus on regions where high precision matters.
They trained the policy as a CVAE to generate an action sequence conditioned on current observations.
CVAE
CVAE stands for Conditional Variational Autoencoder. It consists of two components:
CVAE encoder
The CVAE encoder is used only to train the CVAE decoder and is discarded during testing. Specifically, the CVAE encoder predicts the mean and variance of the style variable $z$’s distribution, which is parameterized as a diagonal Gaussian, given the current observation and action sequence as inputs.
Unlike standard CVAEs, they omit the image observations and condition only on the proprioceptive observation and the action sequence to make training faster.
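A minimal sketch of the encoder's role might look as follows; note this is illustrative, with an MLP standing in for the paper's transformer encoder and made-up dimensions (e.g. 14 joint values for the two arms, chunk size $k$, a 32-dimensional $z$):

```python
import torch
import torch.nn as nn

class StyleEncoderSketch(nn.Module):
    """Predicts the mean and variance of the style variable z from joints + action chunk."""
    def __init__(self, joint_dim=14, action_dim=14, k=100, hidden=512, z_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_dim + k * action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim),               # mean and log-variance of z
        )

    def forward(self, joints, action_chunk):
        x = torch.cat([joints, action_chunk.flatten(1)], dim=-1)
        mu, logvar = self.net(x).chunk(2, dim=-1)
        # Reparameterization: sample z ~ N(mu, diag(exp(logvar))) during training.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar
```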
CVAE decoder
The CVAE decoder conditions on both $z$ and the current observations to predict the action sequence. The entire model is trained to maximize the log-likelihood of demonstration action chunks. The formula is:
$\min_{\theta} \; -\sum_{s_{t},\, a_{t:t+k} \in D} \log \pi_{\theta}(a_{t:t+k} \mid s_{t})$
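In practice, this log-likelihood is optimized through the standard CVAE objective, which splits into a reconstruction term and a KL term that pulls the encoder’s $z$ distribution toward the unit Gaussian prior (the paper uses an L1 reconstruction loss and weights the KL term with a hyperparameter $\beta$); roughly:

$\mathcal{L}(\theta, \phi) = \lVert a_{t:t+k} - \hat{a}_{t:t+k} \rVert_{1} + \beta \, D_{\mathrm{KL}}\big(q_{\phi}(z \mid s_{t}, a_{t:t+k}) \,\Vert\, \mathcal{N}(0, I)\big)$

where $\hat{a}_{t:t+k}$ is the CVAE decoder’s prediction and $q_{\phi}$ is the CVAE encoder.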
ACT Architecture
ACT consists of two elements: the CVAE encoder and the CVAE decoder described above, both implemented with transformers.
ACT Training
They record the joint positions of the leader robots (input from the human operator) using ALOHA and use them as actions. The observations are composed of the current joint positions of the follower robots and the image feed from four cameras (Step 1).
Then, they infer the style variable $z$ using the CVAE encoder (Step 2). As you can see in the image below, the encoder takes a [CLS] token (whose weights are randomly initialized and learned during training) together with the embedded joint positions and action sequence, and the output feature at the [CLS] position is used to predict the mean and variance of $z$.
Next, they obtain the predicted action from the CVAE decoder (Step 3). Each image observation is first processed by a ResNet18 to obtain a feature map, which is then flattened to get a sequence of features. These features are projected to the embedding dimension with a linear layer, and 2D sinusoidal position embeddings are added to preserve the spatial information. The feature sequences from all cameras are concatenated and used as input to the transformer encoder. Additionally, the joint positions and $z$ are each projected to the embedding dimension with a linear layer. The outputs of the transformer encoder are used as “keys” and “values” in the cross-attention layers of the transformer decoder, which predicts the action sequence given the encoder output.
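A hedged PyTorch-style sketch of this pathway (illustrative dimensions and layer counts; the 2D sinusoidal position embeddings and other details of the released implementation are omitted):

```python
import torch
import torch.nn as nn
import torchvision

class ActDecoderSketch(nn.Module):
    """Illustrative CVAE decoder: images + joints + z -> chunk of k target joint positions."""
    def __init__(self, d_model=512, joint_dim=14, z_dim=32, k=100, action_dim=14):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # keep the feature map
        self.img_proj = nn.Conv2d(512, d_model, kernel_size=1)          # project to embedding dim
        self.joint_proj = nn.Linear(joint_dim, d_model)
        self.z_proj = nn.Linear(z_dim, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8, batch_first=True)
        self.query_embed = nn.Embedding(k, d_model)                     # one query per predicted timestep
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, images, joints, z):
        # images: (B, num_cams, 3, H, W); joints: (B, joint_dim); z: (B, z_dim)
        tokens = []
        for cam in range(images.shape[1]):
            feat = self.img_proj(self.backbone(images[:, cam]))   # (B, d_model, h, w)
            # 2D sinusoidal position embeddings would be added to `feat` here (omitted).
            tokens.append(feat.flatten(2).transpose(1, 2))        # flatten to a token sequence
        tokens.append(self.joint_proj(joints)[:, None])           # joints as one extra token
        tokens.append(self.z_proj(z)[:, None])                    # z as one extra token
        src = torch.cat(tokens, dim=1)                            # transformer encoder input
        queries = self.query_embed.weight[None].expand(images.shape[0], -1, -1)
        out = self.transformer(src, queries)   # decoder cross-attends to the encoder output
        return self.action_head(out)           # (B, k, action_dim) predicted joint targets
```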
Intuitively, ACT tries to imitate what a human operator would do in the following time steps given the current observations.
We can see the steps of training:
Also, we can see the pseudocode:
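The pseudocode itself is in the original figure; as a rough stand-in, here is a hedged sketch of one training step, reusing the illustrative `StyleEncoderSketch`/`ActDecoderSketch` modules above and the L1 + $\beta$-weighted KL objective:

```python
import torch.nn.functional as F

def training_step(batch, encoder, decoder, optimizer, beta=10.0):
    images, joints, action_chunk = batch             # Step 1: demo observations + next k leader actions
    z, mu, logvar = encoder(joints, action_chunk)    # Step 2: infer the style variable z
    pred_actions = decoder(images, joints, z)        # Step 3: predict the action chunk
    recon = F.l1_loss(pred_actions, action_chunk)    # reconstruction term
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()  # KL to N(0, I)
    loss = recon + beta * kl                         # beta is a hyperparameter
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```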
ACT Testing
At test time, only the CVAE decoder is used, serving as the policy. The incoming observations (images and joint positions) are fed into the model in the same way as during training. The only difference is in $z$.
They simply set $z$ to a zero vector, which is the mean of the unit Gaussian prior used during training. Thus, given an observation, the output of the policy is always deterministic, benefiting policy evaluation.
We can see the processing pipeline at test time:
Also, we can see the pseudocode:
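Again, the original pseudocode is a figure; a hedged sketch of the test-time loop (hypothetical `decoder` and `env` interfaces, $z$ fixed to zero, temporal ensembling over the buffered overlapping chunks) might look like:

```python
import numpy as np
import torch

def rollout(decoder, env, episode_length, k, z_dim=32, m=0.01):
    """Query the policy every timestep with z = 0 and execute the ensembled action."""
    buffers = [[] for _ in range(episode_length + k)]    # predictions collected per timestep
    z = torch.zeros(1, z_dim)                            # z fixed to the prior mean (zeros)
    for t in range(episode_length):
        images, joints = env.observe()
        with torch.no_grad():
            chunk = decoder(images, joints, z)[0].numpy()    # (k, action_dim)
        for i, action in enumerate(chunk):
            buffers[t + i].append(action)                # file each action under its timestep
        preds = np.stack(buffers[t])                     # overlapping predictions for step t
        w = np.exp(-m * np.arange(len(preds)))           # exponential weights, oldest first
        env.step((w[:, None] * preds).sum(axis=0) / w.sum())
```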
Result
Through these methods, they made the robot arms perform tasks without direct human manipulation. Additionally, the robot can modify its actions when a task fails partway through. You can see the demonstration below.