Vision-Language Models for Continuous Outputs

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

Zhuoming Liu^*, Jinhong Lin^*, Kwan Man Cheng^*, Lin Zhang, Shayok Bagchi, Yin Li

^*Co-first Author

DRIFT adapts pretrained vision-language models to tasks that require precise continuous predictions. It couples a base predictor, which provides a coarse target estimate, with a flow-matching residual refinement module that iteratively improves the output around that prior.

Code Hugging Face arXiv

Introduction

Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions.

To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction.

This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning Multi-modal Large Language Models (MLLMs), Vision-Language-Action Models (VLAs), and World Action Models (WAMs), DRIFT consistently outperforms a strong set of regression- and generative-based baselines.

Problem

Discrete token decoding is convenient for language, but can introduce quantization errors and discard ordinal structure in continuous prediction tasks.

Core Idea

Leverage the knowledge in the pretrained VLMs and use a light weight module to predict strong prior, then learn a residual flow that refines the coarse prediction into a precise continuous output.

Scope

The adapter is designed for MLLMs, VLAs, and WAMs in perception and planning settings.

Method Overview

DRIFT preserves the knowledge and transfer behavior of pretrained vision-language backbones while adding a compact continuous decoder. The base predictor gives the adapter a useful starting point; the residual flow focuses its capacity on the remaining correction.

Overview of the DRIFT residual flow adapter method

1. Pretrained VLM

Visual and language context are encoded by an existing model trained with a discrete autoregressive interface.

2. Base Predictor

A base predictor estimates the continuous target base on the information in the special tokens, creating a strong prior close to the final answer.

3. Residual Flow

A flow-matching refinement module models the residual from base predictor to the final answer and iteratively improves the prediction.

Performance

DRIFT consistently improves continuous decoding across action generation, temporal localization, spatial grounding, and world-action modeling.

Vision-Language-Action

Action success rates on Libero and Simpler. Higher is better.

Method	Base Model	Action Decoder	Libero					Simpler
Method	Base Model	Action Decoder	Spatial	Object	Goal	Long	AVG	Simpler
OpenVLA* [20]	Prismatic-7B [17]	Tokenizer	83.1	88.5	76.4	54.1	75.5	7.7
VLM4VLA [69]	Qwen3-VL-2B [2]	MLP	N/A	N/A	N/A	55.8	N/A	49.0
InternVLA [7]	Qwen2.5-VL-3B [3]	Diffusion	98.0	99.0	93.8	92.6	95.8	54.2
DRIFT	OpenVLA-7B [20]	DRIFT-Tokenizer	87.0	89.1	77.5	57.0	77.7	13.5
DRIFT	Qwen3-VL-2B [2]	DRIFT-MLP	98.8	99.2	97.0	96.4	97.9	61.5

Temporal Video Grounding

Zero-shot generalization on Charades-STA and ActivityNet-Captions. Higher is better.

Method	# Train	Charades-STA				ActivityNet-Captions
Method	# Train	R@0.3	R@0.5	R@0.7	mIoU	R@0.3	R@0.5	R@0.7	mIoU
LT-ZVG [19]	-	52.9	37.2	19.3	36.0	47.6	32.6	15.4	31.8
ED-VTG [44]	136K	59.5	39.3	19.8	40.2	52.1	33.1	16.0	35.2
ET-Chat [32]	164K	65.7	45.9	20.0	42.3	24.1	12.7	6.2	18.9
DRIFT (ET-Chat)	100K	67.2	44.8	20.6	43.8	57.9	35.8	14.1	37.3

Spatial Grounding

RefCOCO benchmark accuracy with Qwen3-VL-2B. Higher is better.

Method	refcoco	refcoco+	refcocog	AVG
Qwen-VL [2]	89.6	81.8	85.5	85.6
Qwen-VL (w/ DRIFT)	91.7	85.8	87.9	88.5

World Action Models

Libero benchmark results for FastWAM with and without DRIFT. Higher is better.

Method	Spatial	Object	Goal	Long	AVG
FastWAM [65]	97.6	99.4	95.6	95.0	96.9
FastWAM (w/ DRIFT)	99.2	99.0	98.2	96.2	98.2

Qualitative Results

Visual examples for action rollouts and temporal grounding, showing how residual flow refinement improves continuous outputs.

Vision-Language-Action

DRIFT rollout examples across LIBERO-style manipulation tasks.

Put both the alphabet soup and the tomato sauce in the basket.

DRIFT predicts a more accurate action to avoid knock down other boxes on the table.

Turn on the stove and put the moka pot on it.

DRIFT has higher success rate in picking up the moka pot.

Put the white mug on the plate and place the chocolate pudding to the right of the plate.

DRIFT place the cup on the plate precisely by iterative refine the action.

Put both moka pots on the stove.

Shared failure case in picking up the moka pot, which require deeper physical understanding of the object.

Temporal Video Grounding

Qualitative VTG examples compare the localized temporal window with the reference segment.

Temporal video grounding visualization example 2 — Localize the video moment according to the query 'person take a drink from their water bottle'.

DRIFT predicts a more accurate action boundary by iteratively refining the coarse prediction from base predictor.

Temporal video grounding visualization example 3 — Localize the video moment according to the query 'person walking around drinking from glass'.

DRIFT predicts a more accurate action boundary by iteratively refining the coarse prediction from base predictor.

Temporal video grounding visualization example 4 — Localize the video moment according to the query 'a person sits on a bed'.

The shared error pattern caused by ambigurity in the text query.

Resources

Paper

Read the paper on arXiv or download the PDF.

Code

Coming Soon!

Hugging Face

Stay tuned!

What To Add Next

Training Code

We will release the training code on github repository soon.

Adaptor Checkpoints

We will release the adaptor checkpoints in Huggingface soon.

Citation

@article{drift2026,
  title={DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models},
  author={Liu, Zhuoming and Lin, Jinhong and Cheng, Kwan Man and Zhang, Lin and Bagchi, Shayok and Li, Yin},
  journal={arXiv preprint arXiv:2606.05758},
  eprint={2606.05758},
  archivePrefix={arXiv},
  year={2026}
}

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

Introduction

Problem

Core Idea

Scope

Method Overview

1. Pretrained VLM

2. Base Predictor

3. Residual Flow

Performance

Vision-Language-Action

Temporal Video Grounding

Spatial Grounding

World Action Models

Qualitative Results

Vision-Language-Action

Put both the alphabet soup and the tomato sauce in the basket.

Turn on the stove and put the moka pot on it.

Put the white mug on the plate and place the chocolate pudding to the right of the plate.

Put both moka pots on the stove.

Temporal Video Grounding

Localize the video moment according to the query 'a person reads a book'.

Localize the video moment according to the query 'person take a drink from their water bottle'.

Localize the video moment according to the query 'person walking around drinking from glass'.

Localize the video moment according to the query 'a person sits on a bed'.

Resources

Paper

Code

Hugging Face

What To Add Next

Training Code

Adaptor Checkpoints

Citation