ContextVLA: Vision-Language-Action Model with Amortized Multi-Frame Context

Huiwon Jang1,2, Sihyun Yu1, Heeseung Kwon1,2, Hojin Jeon1,2, Younggyo Seo3*, Jinwoo Shin1,2*
1KAIST, 2RLWRLD, 3UC Berkeley
*Equal advising

TL;DR. We propose ContextVLA, an efficient Vision-Language-Action model that leverages multi-frame observations.

Abstract

Leveraging temporal context is crucial for success in partially observable robotic tasks. However, prior work in behavior cloning has demonstrated inconsistent performance gains when using multi-frame observations. In this paper, we introduce ContextVLA, a policy model that robustly improves robotic task performance by effectively leveraging multi-frame observations. Our approach is motivated by the key observation that Vision-Language-Action models (VLA), i.e., policy models built upon a Vision-Language Model (VLM), utilize multi-frame observations for action generation more effectively than traditional policy models. This suggests that the VLM's inherent temporal understanding enables the policy to extract more meaningful context from multi-frame observations. However, the high dimensionality of video inputs introduces significant computational overhead, making VLA training and inference inefficient. To address this, ContextVLA compresses past observations into a single context token, allowing the policy to efficiently leverage temporal context for action generation. Our experiments show that ContextVLA consistently improves over single-frame VLAs and achieves the benefits of full multi-frame training with reduced training and inference time.

Observation: Effect of multi-frame observations on various policy models.

(a) When policy models are trained with multi-frame observations, a traditional policy model (Diffusion Policy) shows significant performance degradation, whereas recent Vision-Language-Action models (VLA; $\pi_0$ and GR00T N1.5) do not. (b) We find that the key factor in overcoming this problem is leveraging a pre-trained Vision-Language Model (VLM) to extract temporal information for action generation. ViT, VLM, and VLA-init indicate how the VLA architecture is initialized before training: with a pre-trained vision encoder, a pre-trained VLM, or a pre-trained VLA, respectively, while all other parameters are randomly initialized.

Method: ContextVLA

We propose ContextVLA, an efficient framework that leverages a VLM's temporal understanding to learn a policy model that exploits multi-frame observations. A Vision-Language Model (VLM) encodes the multi-frame observations $\mathbf{o}_{t-k:t}$, and at the $n$-th VLM block we compress the past observations $\mathbf{o}_{t-k:t-1}$ into a single context token $\mathbf{m}$. We then leverage the VLM features to generate actions via either autoregressive or diffusion-based modeling.
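To make the mechanism concrete, below is a minimal PyTorch sketch of the amortized-context idea: past-frame tokens flow through the first $n$ VLM blocks and are then collapsed into a single context token before the remaining blocks and the action head. The transformer block, the mean-pooling compression, and the linear action head are illustrative assumptions chosen for readability, not the authors' implementation (which builds on a pre-trained VLM and an autoregressive or diffusion action decoder).

```python
# Minimal sketch of amortized multi-frame context (illustrative, not the official code).
import torch
import torch.nn as nn


class VLMBlock(nn.Module):
    """Stand-in for one transformer block of a pre-trained VLM."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class ContextVLASketch(nn.Module):
    """Compress past-frame tokens into one context token m at block n."""

    def __init__(self, dim: int = 256, depth: int = 8, compress_at: int = 4, action_dim: int = 7):
        super().__init__()
        self.blocks = nn.ModuleList([VLMBlock(dim) for _ in range(depth)])
        self.compress_at = compress_at              # the "block n" where past frames are amortized
        self.action_head = nn.Linear(dim, action_dim)  # placeholder for an AR / diffusion head

    def forward(self, past_tokens: torch.Tensor, current_tokens: torch.Tensor) -> torch.Tensor:
        # past_tokens:    (B, T_past * P, D) visual tokens of o_{t-k:t-1}
        # current_tokens: (B, P, D)          visual tokens of o_t
        x = torch.cat([past_tokens, current_tokens], dim=1)
        n_past = past_tokens.shape[1]
        for i, block in enumerate(self.blocks):
            if i == self.compress_at:
                # Amortize the past: collapse all past-frame tokens into a single
                # context token m (mean pooling here as an illustrative choice),
                # so later blocks attend over [m, current-frame tokens] only.
                m = x[:, :n_past].mean(dim=1, keepdim=True)
                x = torch.cat([m, x[:, n_past:]], dim=1)
            x = block(x)
        return self.action_head(x.mean(dim=1))  # (B, action_dim)


if __name__ == "__main__":
    model = ContextVLASketch()
    past = torch.randn(2, 3 * 64, 256)   # 3 past frames, 64 tokens each
    current = torch.randn(2, 64, 256)    # current frame
    print(model(past, current).shape)    # torch.Size([2, 7])
```

In this sketch, only the current-frame tokens plus the single token $\mathbf{m}$ are processed after the compression point, which is where the savings in training and inference time over full multi-frame processing come from.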

Main Results: Simulation

Main Results: Real-world