Octo: Generalist Visuomotor Policy via Transformer and Diffusion
Octo is a generalist visuomotor policy designed to control a wide range of robots and tasks using a single Transformer-based backbone and a diffusion-based action head.
Unlike classical robotics pipelines that explicitly separate perception, planning, and control, Octo directly maps visual observations and task conditions to low-level continuous actions.
TL;DR
Octo represents all inputs as tokens, uses a Transformer to summarize task-conditioned observations via a readout token, and generates continuous actions using a diffusion-based action head.
The architecture is designed to support multiple robots, sensor configurations, and action spaces through finetuning rather than architectural redesign.
Methodology
Tokenized Input Representation
Octo unifies heterogeneous inputs into a single token sequence:
[Task Tokens | Observation Tokens | Readout Token]
- Task Tokens: Generated from language instructions using a pretrained language encoder.
- Observation Tokens: Generated from RGB images using a shallow CNN and patch-based tokenization. Only a short observation history is used.
- Readout Token: A learnable embedding vector appended to the end of the token sequence.
This design allows language, vision, and temporal information to be processed uniformly by the Transformer.
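As a rough sketch (not the actual Octo implementation), the assembly of this token sequence could look as follows; the dimensions, module names, and the TokenAssembler class are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative dimension; the released Octo models use different sizes.
EMBED_DIM = 384

class TokenAssembler(nn.Module):
    """Sketch of concatenating task, observation, and readout tokens."""
    def __init__(self, num_readout_tokens: int = 1):
        super().__init__()
        # Learnable readout embedding(s), analogous to BERT's [CLS] token.
        self.readout = nn.Parameter(torch.zeros(num_readout_tokens, EMBED_DIM))

    def forward(self, task_tokens: torch.Tensor, obs_tokens: torch.Tensor) -> torch.Tensor:
        # task_tokens: (B, T_task, D), from a pretrained language encoder
        # obs_tokens:  (B, T_obs, D), from a shallow CNN + patch tokenizer
        batch = task_tokens.shape[0]
        readout = self.readout.unsqueeze(0).expand(batch, -1, -1)
        # [Task Tokens | Observation Tokens | Readout Token]
        return torch.cat([task_tokens, obs_tokens, readout], dim=1)
```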
Block-wise Masked Attention
Octo does not use full self-attention.
Instead, attention is constrained as follows:
- Observation tokens can attend only to observations from the same or earlier timesteps.
- Task tokens are always visible.
- Tokens corresponding to missing modalities are fully masked out.
Here, “masking” means blocking attention connections between tokens, not disabling Transformer layers.
This block-wise masking enforces causality while enabling training on datasets with different sensor configurations.
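The masking rules can be made concrete with a small sketch. The helper below (blockwise_mask, a hypothetical name) builds a boolean matrix in which True means attention is allowed; it is a simplified illustration of the scheme, not the exact mask used by Octo, and it also encodes the readout-token rules described in the next section.

```python
import torch

def blockwise_mask(num_task: int, obs_steps: list[int], obs_present: list[bool]) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed) for a sequence laid out as
    [task tokens | observation tokens | readout token].

    obs_steps[i]   -- timestep index of the i-th observation token
    obs_present[i] -- False if that token's modality is missing (fully masked out)
    """
    num_obs = len(obs_steps)
    n = num_task + num_obs + 1          # +1 for the single readout token
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Task tokens are always visible: every token may attend to them.
    mask[:, :num_task] = True

    # Observation tokens attend only to observations at the same or earlier timesteps.
    for q in range(num_obs):
        for k in range(num_obs):
            if obs_steps[k] <= obs_steps[q]:
                mask[num_task + q, num_task + k] = True

    # The readout token attends to all task and observation tokens (last row),
    # but no other token may attend to it (its column stays False elsewhere).
    mask[-1, : num_task + num_obs] = True
    mask[-1, -1] = True

    # Missing modalities are blocked as keys everywhere.
    for k, present in enumerate(obs_present):
        if not present:
            mask[:, num_task + k] = False

    return mask
```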
Readout Token
The readout token is not computed from observations.
- It is not a CNN output.
- It is not an average pooling result.
- It is a learnable parameter vector, analogous to the [CLS] token in BERT.
Attention masks ensure that:
- The readout token can attend to all task and observation tokens.
- Other tokens cannot attend to the readout token.
As a result, the readout token aggregates task-conditioned information without influencing other token representations.
The output embedding of the readout token at the final Transformer layer represents a compact summary of the current state.
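Given such a mask, extracting the state summary amounts to reading the final-layer output at the readout position. The toy snippet below uses PyTorch's stock nn.TransformerEncoder (not Octo's actual backbone) and reuses the blockwise_mask helper from the sketch above; note that PyTorch's mask convention is the inverse of ours, so the mask is negated.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only.
EMBED_DIM, NUM_TASK, NUM_OBS = 384, 16, 64

encoder_layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=6, batch_first=True)
backbone = nn.TransformerEncoder(encoder_layer, num_layers=4)

tokens = torch.randn(2, NUM_TASK + NUM_OBS + 1, EMBED_DIM)        # [task | obs | readout]
allowed = blockwise_mask(NUM_TASK, list(range(NUM_OBS)), [True] * NUM_OBS)

# PyTorch convention: True entries in the attention mask are *blocked*, so invert "allowed".
out = backbone(tokens, mask=~allowed)

# The final-layer embedding of the readout token is the compact state summary
# that conditions the action head.
readout_embedding = out[:, -1, :]                                  # shape (batch, EMBED_DIM)
```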
Role of the Transformer
In Octo, the Transformer does not directly generate actions.
Its role is to:
- Select relevant visual information.
- Model relationships between task instructions and observations.
- Produce a compact state representation via the readout token.
Action generation is handled by a separate action head.
Diffusion-based Action Head
Octo uses a diffusion model rather than direct regression to generate actions.
Training Objective
Let $a$ denote a clean ground-truth action from the dataset.
During training:
- A diffusion step $k$ is sampled.
- Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ is sampled.
- A noisy action is constructed as:
$$x^k = \sqrt{\alpha_k}\, a + \sqrt{1 - \alpha_k}\, \epsilon$$
The model receives $(x^k, e, k)$ as input, where $e$ is the readout embedding, and is trained to predict the injected noise $\epsilon$.
Importantly, the model is not trained to predict noisy actions.
It is trained to predict the noise that was added.
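A minimal training sketch follows. DiffusionActionHead and training_step are illustrative stand-ins (a small MLP head and a cumulative noise schedule `alphas` playing the role of $\alpha_k$ in the equation above); the real Octo head and schedule differ in detail.

```python
import torch
import torch.nn as nn

class DiffusionActionHead(nn.Module):
    """Minimal noise-prediction head: inputs are (noisy action, readout embedding, step k)."""
    def __init__(self, action_dim: int, embed_dim: int, num_steps: int = 20):
        super().__init__()
        self.step_emb = nn.Embedding(num_steps, embed_dim)
        self.net = nn.Sequential(
            nn.Linear(action_dim + 2 * embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, noisy_action, readout, k):
        x = torch.cat([noisy_action, readout, self.step_emb(k)], dim=-1)
        return self.net(x)                                   # predicted noise epsilon_hat


def training_step(head, alphas, action, readout):
    """One training step: add noise to the clean action and regress the noise.

    alphas: (num_steps,) tensor, e.g. torch.linspace(0.99, 0.01, 20) as a toy schedule.
    """
    k = torch.randint(0, len(alphas), (action.shape[0],))    # sampled diffusion step
    eps = torch.randn_like(action)                           # sampled Gaussian noise
    a_k = alphas[k].unsqueeze(-1)
    noisy = torch.sqrt(a_k) * action + torch.sqrt(1.0 - a_k) * eps   # x^k
    eps_hat = head(noisy, readout, k)
    return nn.functional.mse_loss(eps_hat, eps)              # predict the noise, not x^k
```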
Inference Procedure
At inference time, no ground-truth action is available.
- Start from pure noise $x^K \sim \mathcal{N}(0, I)$.
- Iteratively denoise from step $K$ to $0$ using the learned denoising network.
- The final output $x^0$ is used as the action.
The readout embedding is used as a fixed conditioning vector throughout the diffusion process.
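A minimal sampling loop might look like the sketch below, which reuses the head and schedule from the training sketch. For simplicity it applies a deterministic DDIM-style update in place of the exact sampler; it is an illustration of the reverse process, not Octo's implementation.

```python
import torch

@torch.no_grad()
def sample_action(head, alphas, readout, action_dim):
    """Start from pure noise and iteratively denoise to an executable action."""
    K = len(alphas)
    batch = readout.shape[0]
    x = torch.randn(batch, action_dim)                       # x^K ~ N(0, I)
    for k in reversed(range(K)):
        k_idx = torch.full((batch,), k, dtype=torch.long)
        eps_hat = head(x, readout, k_idx)                    # predicted noise, conditioned on readout
        a_k = alphas[k]
        x0_hat = (x - torch.sqrt(1 - a_k) * eps_hat) / torch.sqrt(a_k)   # estimate of the clean action
        if k > 0:
            a_prev = alphas[k - 1]
            x = torch.sqrt(a_prev) * x0_hat + torch.sqrt(1 - a_prev) * eps_hat
        else:
            x = x0_hat                                        # x^0: the action to execute
    return x
```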
Finetuning Strategy (Fig. 2 Interpretation)
Figure 2 in the Octo paper illustrates structural extensibility rather than parameter freezing.
Key points:
- The Transformer backbone remains structurally unchanged.
- New observation modalities (e.g., depth) can be added via new encoders.
- New action spaces can be supported via new action heads.
- Finetuning is performed on the entire model, including the backbone.
The figure does not imply freezing the Transformer during finetuning.
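To make the extensibility pattern concrete, the sketch below shows one plausible way to organize the model so that encoders and heads can be swapped while the backbone stays in place; all class and attribute names here are hypothetical, not the Octo codebase API.

```python
import torch
import torch.nn as nn

class OctoPolicy(nn.Module):
    """Hypothetical container illustrating the extensibility pattern."""
    def __init__(self, backbone: nn.Module, tokenizers: dict, action_head: nn.Module):
        super().__init__()
        self.backbone = backbone                      # pretrained Transformer, structurally unchanged
        self.tokenizers = nn.ModuleDict(tokenizers)   # per-modality input encoders
        self.action_head = action_head                # action-space-specific head

# Finetuning pattern: add a new modality encoder and/or a new action head,
# then train all parameters, including the backbone (nothing is frozen).
# policy.tokenizers["depth"] = DepthTokenizer(...)                        # hypothetical depth encoder
# policy.action_head = DiffusionActionHead(action_dim=7, embed_dim=384)   # new action space
# optimizer = torch.optim.AdamW(policy.parameters(), lr=3e-5)
```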
Pipeline of Octo
The pipeline consists of:
- Tokenization of task and observation inputs.
- Transformer-based state summarization via a readout token.
- Diffusion-based action generation conditioned on the readout embedding.
Conclusion
Octo is not a zero-shot solution for all robotic manipulation problems.
In particular, tasks such as strawberry harvesting involve:
- Deformable objects.
- Stem–fruit separation.
- Force-sensitive interactions.
These behaviors lie outside the pretraining distribution of Octo.
However, Octo provides a strong foundation for such tasks by offering:
- A pretrained visuomotor representation.
- A flexible token-based input structure.
- A principled action generation mechanism via diffusion.
In this context, Octo should be viewed not as a plug-and-play solution, but as a backbone for task-specific finetuned policies, including agricultural manipulation tasks such as strawberry harvesting.