Context Unrolling in Omni Models

Released in April 2026

Ceyuan Yang*†, Zhijie Lin*, Yang Zhao*, Fei Xiao*, Hao He*, Qi Zhao*,
Chaorui Deng, Kunchang Li, Zihan Ding, Yuwei Guo, Fuyun Wang, Fangqi Zhu, Xiaonan Nie, Shenhan Zhu, Shanchuan Lin, Hongsheng Li, Weilin Huang, Guang Shi, Haoqi Fan

ByteDance Seed  ·  * Equal contribution  ·  † Corresponding authors

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.

Context Unrolling

A model natively trained across diverse modalities develops the ability to unroll its internal reasoning across heterogeneous modal representations before producing outputs. Each modality becomes an atomic primitive — composable, invocable, and writable back into a shared context.

Figure: Context Unrolling in Omni. Given an arbitrary task, Omni selectively activates task-relevant contexts from a heterogeneous context pool (spanning text, image, video, 3D geometry, and beyond) into a shared workspace before producing predictions. This mechanism enables the model to aggregate complementary information across modalities, improving downstream reasoning and generation fidelity.
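The mechanism can be pictured as a small data structure: a pool of per-modality context items, any of which the model can write back into a single interleaved token stream that later steps condition on. Below is a minimal conceptual sketch in Python; the class and method names are illustrative assumptions, not Omni's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class ContextItem:
    modality: str   # "text", "image", "video", "3d", ...
    tokens: list    # the item's representation in the unified token space

@dataclass
class Workspace:
    """Hypothetical shared multimodal workspace (names assumed)."""
    items: list = field(default_factory=list)

    def unroll(self, item: ContextItem) -> None:
        # Write an intermediate modal representation back into context,
        # making it visible to every subsequent prediction step.
        self.items.append(item)

    def flat_tokens(self) -> list:
        # The model conditions on one interleaved stream across modalities.
        return [tok for item in self.items for tok in item.tokens]
```

In these terms, each atomic primitive (text thinking, camera pose estimation, novel-view synthesis, and so on) reads the workspace and appends its result back before the final prediction is made.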

Visual Understanding

Context unrolling in visual understanding primarily occurs through CoT-style textual rollout, which enriches the latent workspace with finer semantic decompositions before producing the final answer. Text thinking context yields consistent gains across perception, reasoning, document, and spatial understanding benchmarks.
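Operationally, text thinking amounts to a two-pass decode: first unroll a textual chain of thought, then answer conditioned on it. A minimal sketch, assuming a hypothetical `model.generate` API rather than Omni's published interface:

```python
def answer_with_text_thinking(model, image, question: str) -> str:
    """Two-pass inference sketch; `model.generate` is an assumed API."""
    # Pass 1: roll out a CoT-style textual context into the workspace.
    thought = model.generate(image=image,
                             prompt=f"{question}\nThink step by step.")
    # Pass 2: produce the final answer conditioned on the unrolled text.
    return model.generate(image=image,
                          prompt=f"{thought}\nNow answer: {question}")
```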

Understanding with Text-Thinking. Validated on downsampled versions of multiple benchmarks. Text thinking context improves performance consistently across all evaluated dimensions.

Context           BLINK ↑   MMStar ↑   MMBench-v1.1 ↑   SimpleVQA ↑   AI2D ↑   ChartQA ↑   DocVQA ↑   HallusionBench ↑   ERQA ↑   MMSI ↑
baseline          60.8      59.4       76.2             50.4          90.2     85.5        93.5       69.6               41.5     31.5
+ text thinking   61.6      66.5       77.1             51.4          92.3     88.0        94.0       71.3               44.5     32.6

Visual Generation

Visual generation benefits from multi-granularity context, reducing the inherent ambiguity of mapping language to images. Before synthesizing an image, Omni can (i) roll out fine-grained textual specifications — attributes, counts, spatial constraints — via text thinking, and/or (ii) unroll visual context representations carrying strong structural information. Longer textual context consistently improves both aggregate and high-atomicity performance; combining text thinking with visual token unrolling yields the most consistent gains, confirming that textual and visual contexts are complementary.
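As a sketch of how the two context sources compose sequentially (the `mode` and `context` arguments below are assumptions standing in for whatever conditioning Omni uses internally):

```python
def generate_image(model, prompt: str):
    """Illustrative two-stage unrolling pipeline; names are hypothetical."""
    # (i) Roll out a fine-grained textual spec: attributes, counts, layout.
    spec = model.generate(prompt=prompt, mode="text_thinking")
    # (ii) Unroll coarse visual context tokens carrying structural cues.
    visual_ctx = model.generate(prompt=prompt, context=[spec],
                                mode="visual_unrolling")
    # Final synthesis conditions on both complementary contexts.
    return model.generate(prompt=prompt, context=[spec, visual_ctx],
                          mode="image")
```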

Text-to-Image Generation Ablation. GenEval-2 (TIFAGM) and per-attribute breakdown. Higher is better.

Context                           TIFAGM    Object   Attr.   Count   Pos.    Verb    Overall
baseline                          29.25     91.64    90.00   52.03   77.67   26.25   0.56
+ short text                      37.35     93.18    92.45   60.14   76.92   38.83   0.59
+ long text                       43.94     91.86    91.13   67.03   77.03   38.31   0.61
+ visual unrolling                48.02     94.42    92.96   66.92   79.28   53.96   0.61
+ short text & visual unrolling   49.16     93.13    92.68   68.36   76.83   43.34   0.64
+ long text & visual unrolling    53.44     92.34    92.32   72.98   80.23   42.81   0.66
+ oracle & visual unrolling       57.21     94.77    97.89   69.47   90.64   56.00   0.73

Depth Estimation

Context unrolling also benefits monocular depth estimation. A depth-related caption describing the scene's structural layout (textual context) recovers key geometric features; combining it with unrolled visual context further corrects residual ambiguities, yielding sharper depth boundaries and globally consistent depth maps.
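For reference, the two metrics reported below are the standard monocular depth measures; a straightforward NumPy implementation:

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray):
    """delta_1 (higher is better) and AbsRel (lower is better)."""
    valid = gt > 0                       # evaluate on valid ground-truth pixels
    p, g = pred[valid], gt[valid]
    ratio = np.maximum(p / g, g / p)
    delta1 = float((ratio < 1.25).mean())         # share of pixels within 25% of GT
    abs_rel = float((np.abs(p - g) / g).mean())   # mean absolute relative error
    return delta1, abs_rel
```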

Depth Estimation Ablation. δ₁ ↑ higher is better; AbsRel ↓ lower is better. Evaluated on NYU-Depth v2.

Metric     baseline   text context   visual context
δ₁ ↑       83.21%     83.27%         84.01%
AbsRel ↓   0.2028     0.2029         0.1970

Spatial Understanding

Spatial understanding requires resolving geometric ambiguities — viewpoint changes, foreshortening, and occlusions — that freeform text reasoning alone cannot handle. Omni incorporates camera pose estimation and novel-view synthesis as atomic primitives: camera pose provides geometry-grounded textual context, while synthesized views enable visual imagination as context. Both substantially improve spatial reasoning on MMSI-Bench compared to text chain-of-thought alone.

Spatial Understanding Evaluation. Textual contexts denote geometry-grounded text (e.g., camera pose estimation results). Visual contexts refer to novel-view synthesis results used as context. Evaluated on a downsampled MMSI-Bench.

Context                       Overall   Positional Relationship
baseline                      27.14     19.63
+ text thinking               28.15     30.25
+ text geometric contexts     30.15     33.95
+ visual geometric contexts   34.17     35.80

System Performance

Competitive results across understanding benchmarks and state-of-the-art results on generation and geometry benchmarks, from a single unified model with only 3B activated parameters.

Multimodal Understanding. Compared against models with similar MoE architecture and activation scale (no thinking). A dash (–) marks entries not available.

Benchmark        Qwen3-VL-30B-A3B   InternVL3.5-30B-A3B   Omni
BLINK            67.7               60.4                  63.0
MMStar           78.4               72.0                  63.8
MMBench-v1.1     78.4               84.8                  75.3
VLMsAreBlind     67.5               –                     76.4
SimpleVQA        52.7               –                     53.3
RealWorldQA      73.7               72.3                  76.0
TextVQA          80.5               –                     81.0
AI2D             85.0               86.8                  91.5
ChartQA          86.8               87.4                  86.9
DocVQA           95.0               94.2                  92.8
HallusionBench   61.5               53.8                  70.1
MuirBench        73.0               53.1                  64.2
ERQA             51.3               41.5                  45.0
MMSI-Bench       30.3               27.5                  31.5
MVBench          72.3               72.1                  68.4
Video-MME        74.5               68.7                  67.2

Text-to-Image Generation. GenEval-2 measures compositional accuracy; DPG measures dense prompt following; LongText measures long-form prompt following. Higher is better.

Model        GenEval-2 ↑   DPG ↑   LongText-EN ↑   LongText-CN ↑   Inhouse ↑
Qwen-Image   30.67         88.32   94.3            94.6            55.16
Z-Image      41.83         88.14   93.5            93.6            55.19
Flux         34.59         83.84   60.7            –               49.91
Omni         54.12         88.55   97.5            96.8            63.87

Image Editing — GEdit-Bench EN (Full set). G_SC = semantic correctness; G_PQ = perceptual quality; G_O = overall. Higher is better.

Model              G_SC ↑   G_PQ ↑   G_O ↑
Flux-Kontext-dev   7.16     7.37     6.51
Step1X-Edit v1.1   7.66     7.35     6.97
Step1X-Edit v1.2   7.77     7.65     7.24
Emu-3.5            8.11     7.70     7.59
Z-Image-Edit       8.11     7.72     7.57
Qwen-Image-Edit    8.15     7.86     7.54
Omni               8.42     7.85     7.75

Text-to-Video Generation — VBench 1.0. Total Score combines Quality Score and Semantic Score. Higher is better.

Model           Total Score ↑   Quality Score ↑   Semantic Score ↑
Wan2.1          85.59           83.43             76.11
Hunyuan Video   83.69           83.35             76.88
Omni            85.07           84.29             83.11

Video Editing — FiVE Benchmark. Evaluated across structure preservation, background fidelity, text alignment, motion quality, and instruction-following (FiVE score). Higher is better except Dist.↓ and LPIPS↓.

Method         Structure     Background Preservation              Text Alignment          Motion         FiVE
               Dist.×10³ ↓   PSNR ↑   LPIPS×10² ↓   SSIM×10² ↑   CLIPS ↑   CLIPSedit ↑   Fid.S.×10² ↑   YN ↑    MC ↑    ∪ ↑     ∩ ↑     Acc ↑
TokenFlow      35.62         19.06    263.61        72.51        26.46     21.15         89.00          19.36   35.51   36.68   18.18   27.43
DMT            85.95         14.71    404.60        51.64        26.66     21.44         82.30          34.78   62.06   62.98   33.86   48.42
VidToMe        22.37         21.15    263.91        70.69        26.84     21.05         90.06          20.03   33.50   36.20   17.34   26.77
AnyV2V         71.36         15.90    348.59        50.77        24.89     19.72         60.36          30.62   45.42   48.96   27.09   38.02
VideoGrain     12.40         27.05    185.21        79.13        25.69     20.31         88.57          30.50   43.97   44.30   30.17   37.23
Pyramid-Edit   28.65         20.84    276.59        71.72        26.82     20.20         80.59          33.67   54.01   56.36   31.31   43.84
Wan-Edit       12.53         25.57    94.61         82.55        26.39     21.23         89.43          41.41   52.53   55.72   38.22   46.97
Omni           42.96         20.30    245.18        67.00        27.32     21.47         83.37          69.67   86.15   88.50   67.23   78.03

Camera Pose Estimation on RealEstate10K and CO3Dv2

AUC@30↑ higher is better; RPE trans↓ and RPE rot↓ lower is better.

Method   RealEstate10K                          CO3Dv2
         AUC@30 ↑   RPE trans ↓   RPE rot ↓     AUC@30 ↑   RPE trans ↓   RPE rot ↓
FLARE    84.42      0.4215        0.0532        72.23      2.1242        0.0342
CUT3R    85.32      0.4023        0.0424        75.62      1.5321        0.0331
VGGT     88.23      0.3886        0.0386        86.23      1.1432        0.0285
Omni     88.32      0.3766        0.0289        75.21      1.5955        0.0269
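The exact evaluation protocol behind these numbers (thresholds, units) is not spelled out here; for orientation, the rotation component of a relative pose error is conventionally the geodesic angle between the predicted and ground-truth relative rotations:

```python
import numpy as np

def relative_rotation_error(R_pred: np.ndarray, R_gt: np.ndarray) -> float:
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    R = R_pred.T @ R_gt
    cos = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.arccos(cos))
```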

Monocular Depth Estimation

δ₁↑ higher is better; AbsRel↓ lower is better. Zero-shot evaluation across five standard benchmarks.

Method      NYU                KITTI              Sintel             ETH3D              DIODE
            δ₁ ↑   AbsRel ↓    δ₁ ↑   AbsRel ↓    δ₁ ↑   AbsRel ↓    δ₁ ↑   AbsRel ↓    δ₁ ↑   AbsRel ↓
Marigold    92.75  0.0781      87.87  0.1108      62.24  0.4666      97.12  0.0564      81.64  0.2266
CUT3R       91.64  0.0824      86.42  0.1253      55.64  0.4723      95.34  0.0632      73.21  0.3521
DA3 giant   94.78  0.0579      93.96  0.0824      66.54  0.3821      98.79  0.0324      82.69  0.2050
VGGT        96.10  0.0499      94.29  0.0803      66.11  0.4551      98.35  0.0326      82.15  0.2115
Omni        96.22  0.0542      96.92  0.0621      74.27  0.3340      98.91  0.0312      83.83  0.2034

See What Omni Can Do

Image Generation & Editing

Omni generates images from complex textual prompts and performs precise instruction-following edits. The model can also estimate depth from a single image and synthesize novel views from arbitrary camera poses.

Text-to-Image Generation
Example prompts:

Double exposure portrait of a woman's side profile silhouette combined with a misty pine forest at sunrise. Flock of birds flying out from the back of her head. The forest is inside her silhouette. White background, high contrast, surreal art, dreamy, monochrome with a splash of orange light, minimal style.

A hyper-realistic close-up of a miniature world inside an ancient, cracked lightbulb. Inside the bulb is a lush, bioluminescent rainforest with tiny waterfalls and glowing mushrooms. A small wooden cabin sits on a mossy rock. Volumetric lighting beams through the glass cracks, dust particles dancing in the light. 8k resolution, macro photography, tilt-shift effect, octane render, unreal engine 5 style.

A tiny, cute fluffy hedgehog wearing vintage aviator goggles and a small leather scarf, standing on a mossy log in a magical forest. Soft sunlight filtering through giant fern leaves (Tyndall effect). Macro photography, shallow depth of field, bokeh, highly detailed fur, Pixar movie style, adorable, cinematic lighting.

A cute anthropomorphic red fox wearing a vintage green adventurer outfit and a small hat. It is standing on a mossy rock in a sunlit magical forest, holding a wooden sign board with both paws. The text on the sign reads "Welcome back, my friend" written in rustic paint. Soft cinematic lighting, fluffy fur texture, 3D render, Pixar movie style, 8k resolution.
Image Editing
Example editing instructions (each applied to a before/after image pair):

Remove the motorcycle.
Present the brushstroke characteristics of digital painting.
Add a candle on top of the cake.
Please help me add a filter suitable for this picture.
Depth Estimation
Figure: RGB inputs and predicted depth maps for two example scenes.
Novel View Synthesis
The first image is the conditioning input; subsequent images are generated autoregressively. Each step uses an instruction caption containing the camera pose template <campose>tx ty tz rx ry rz</campose> <fov>fovh</fov><fov>fovw</fov>: the six values give the translation and rotation of the next view relative to the current one, and fovh and fovw are the fields of view in the vertical and horizontal directions. The entire instruction, including all camera pose and FOV values, is encoded purely by the language tokenizer, with no separate geometric encoder.
Scene A
The relative camera pose from the next image to the current image is <campose>0.011252 0.099488 0.19671 -0.049303 -0.015919 0.002047</campose> <fov>65</fov><fov>96</fov>, generate the next image.
The relative camera pose from the next image to the current image is <campose>-0.054978 0.182972 0.185104 -0.131286 -0.010527 -0.007159</campose> <fov>65</fov><fov>96</fov>, generate the next image.
The relative camera pose from the next image to the current image is <campose>-0.088696 0.00953 0.157078 -0.05298 0.002812 -0.000716</campose> <fov>65</fov><fov>96</fov>, generate the next image.
Scene B
The relative camera pose from the next image to the current image is <campose>-0.366414 -0.118678 0.246736 0.068828 0.178609 0.192226</campose> <fov>65</fov><fov>96</fov>, generate the next image.
The relative camera pose from the next image to the current image is <campose>-0.334792 -0.095082 0.187188 0.118345 0.204825 0.225027</campose> <fov>65</fov><fov>96</fov>, generate the next image.
The relative camera pose from the next image to the current image is <campose>0.068664 0.046758 0.156604 0.041073 -0.020259 -0.039201</campose> <fov>65</fov><fov>96</fov>, generate the next image.
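Assembling such an instruction is pure string formatting, which is the point: the pose is handed to the model as ordinary text. A minimal sketch follows (the helper name is hypothetical; the example values are Scene A's first step):

```python
def build_nvs_instruct(pose, fov_h: int, fov_w: int) -> str:
    """Format the camera-pose instruction template described above.

    pose: (tx, ty, tz, rx, ry, rz), the relative translation and rotation
          of the next view with respect to the current one.
    """
    campose = " ".join(f"{v:g}" for v in pose)
    return ("The relative camera pose from the next image to the current "
            f"image is <campose>{campose}</campose> "
            f"<fov>{fov_h}</fov><fov>{fov_w}</fov>, generate the next image.")

# Scene A, first step:
print(build_nvs_instruct((0.011252, 0.099488, 0.19671,
                          -0.049303, -0.015919, 0.002047), 65, 96))
```

Because the whole string, numbers included, goes through the language tokenizer, no geometric encoder is needed on the input side.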

Video Generation & Editing

Omni handles diverse video generation tasks in a single unified model: generating from text prompts or reference images, composing specified subjects into new scenes, and applying instruction-based edits — all with high temporal consistency.

Text-to-Video
Example prompts:
A fluffy gray domestic cat laps water from a blue ceramic bowl on a sunlit wooden floor. The camera holds a steady close-up, capturing the cat's gentle movements and small ripples forming in the water as it drinks.
A beautiful coastal beach in spring, waves lapping on sand, zoom out.
In a bar, a long-haired cat sits at a table with a glass of beer in front of it, raising the glass with both paws to toast toward the camera.
An anthropomorphic horse wearing a long white top and dark-blue trousers sits at an office workstation, slumped back in its chair in exhaustion, arms lowered, eyes on the computer screen. Pixar style, 90-degree side view, wide angle.
Image-to-Video
The first frame is the conditioning image.
Subject-to-Video
Prompt
In a serene park setting, a young person with pink hair is seated on a stone bench, engrossed in handling a vintage camera. Surrounding her are tall trees with purple foliage, hinting at the onset of autumn with fallen leaves scattered on the ground. Parked cars line the distant background, subtly blending into the serene environment. The person, dressed in a floral blouse and stylish green high-waisted pants, appears focused on preparing the camera, adjusting its settings diligently. The scene captures a moment of quiet reflection intermixed with the vintage charm of analog photography.
Reference subjects
Generated Video
Video Editing
Replace the white ibises with drones and change pecking to scanning.
Remove the backpack from the man.
Replace the real dog with a plush dog.
Add sunglasses to the dog.

World Navigation

Omni predicts the next first-person or third-person views from discrete action text inputs — no separate policy network required. Trained on diverse video navigation data, the model develops precise action control and internalizes fundamental physical laws: objects carry inertia, characters obey gravity, and solid boundaries enforce collision constraints. These behaviors emerge naturally from unified multimodal training rather than explicit physics simulation.

WASD: camera movement  ·  View rotation  ·  Space: jump
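A navigation rollout could then be driven by plain action strings, in the same spirit as the pose-template prompts above. A sketch under assumed names (`model`, `initial_frame`, and the `<action>` tag are illustrative, not Omni's documented interface):

```python
# Keys from the demo legend: W/A/S/D move the camera, Space jumps.
actions = ["W", "W", "D", "Space"]

frame = initial_frame  # assumed conditioning view
for act in actions:
    # Predict the next view conditioned on the current frame and action text.
    frame = model.generate(image=frame,
                           prompt=f"<action>{act}</action>, generate the next frame.")
```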

Cite this work

@online{yang2026omni,
  title  = {Context Unrolling in Omni Models},
  author = {Yang, Ceyuan and Lin, Zhijie and Zhao, Yang and Xiao, Fei and He, Hao and Zhao, Qi and Deng, Chaorui and Li, Kunchang and Ding, Zihan and Guo, Yuwei and Wang, Fuyun and Zhu, Fangqi and Nie, Xiaonan and Zhu, Shenhan and Lin, Shanchuan and Li, Hongsheng and Huang, Weilin and Shi, Guang and Fan, Haoqi},
  url    = {https://omni-model.com},
  year   = {2026}
}