mni Models · Released in April 2026
ByteDance Seed · * Equal contribution · † Corresponding authors
We present mni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations.
We find that such training enables Context Unrolling, where the model explicitly reasons
across multiple modal representations before producing predictions. This process enables the model to
aggregate complementary information across heterogeneous modalities, facilitating a more faithful
approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity.
As a result, mni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, images, video, and 3D geometry.
A model natively trained across diverse modalities develops the ability to unroll its internal reasoning across heterogeneous modal representations before producing outputs. Each modality becomes an atomic primitive — composable, invocable, and writable back into a shared context.
Context unrolling in mni. Given an arbitrary task, mni selectively activates task-relevant contexts from a heterogeneous context pool (spanning text, image, video, 3D geometry, and beyond) into a shared workspace before producing predictions. This mechanism enables the model to aggregate complementary information across modalities, improving downstream reasoning and generation fidelity.
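As a rough sketch of this mechanism (the `Context` type and the relevance scorer below are illustrative, not the actual mni interface), unrolling amounts to selecting task-relevant contexts from the pool and concatenating them into one workspace sequence that the decoder conditions on:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Context:
    modality: str        # "text", "image", "video", "3d", ...
    tokens: List[int]    # token ids (or discrete latent tokens) for this context

def unroll(task_tokens: List[int],
           pool: List[Context],
           relevance: Callable[[Context], float],
           k: int = 3) -> List[int]:
    """Select the k most task-relevant contexts from a heterogeneous pool and write
    them into one shared workspace (a flat token sequence) that the decoder
    conditions on before producing its prediction."""
    selected = sorted(pool, key=relevance, reverse=True)[:k]
    workspace = list(task_tokens)
    for ctx in selected:
        workspace += ctx.tokens   # each modality is an atomic, composable primitive
    return workspace
```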
Context unrolling in visual understanding primarily occurs through CoT-style textual rollout, which enriches the latent workspace with finer semantic decompositions before producing the final answer. Text thinking context yields consistent gains across perception, reasoning, document, and spatial understanding benchmarks.
Understanding with Text Thinking. Validated on downsampled versions of multiple benchmarks. Text-thinking context improves performance consistently across all evaluated dimensions.
| Context | BLINK ↑ | MMStar ↑ | MMBench-v1.1 ↑ | SimpleVQA ↑ | AI2D ↑ | ChartQA ↑ | DocVQA ↑ | HallusionBench ↑ | ERQA ↑ | MMSI ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| baseline | 60.8 | 59.4 | 76.2 | 50.4 | 90.2 | 85.5 | 93.5 | 69.6 | 41.5 | 31.5 |
| + text thinking | 61.6 | 66.5 | 77.1 | 51.4 | 92.3 | 88.0 | 94.0 | 71.3 | 44.5 | 32.6 |
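A minimal sketch of how such a text-thinking pass could be wired at inference time, assuming a generic `model.generate(image=..., prompt=...)` interface (an assumption, not the released mni API):

```python
def answer_with_text_thinking(model, image, question: str) -> str:
    """Two-pass sketch: (1) roll out a textual decomposition of the scene,
    (2) answer conditioned on the question plus the unrolled thinking context."""
    thinking = model.generate(
        image=image,
        prompt=f"Question: {question}\nDescribe the relevant objects, attributes, "
               f"and relations step by step before answering.")
    return model.generate(
        image=image,
        prompt=f"Question: {question}\nThinking context: {thinking}\nFinal answer:")
```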
Visual generation benefits from multi-granularity context, reducing the inherent ambiguity of mapping language to images. Before synthesizing an image, mni can (i) roll out fine-grained textual specifications (attributes, counts, spatial constraints) via text thinking, and/or (ii) unroll visual context representations carrying strong structural information. Longer textual context consistently improves both aggregate and high-atomicity performance; combining text thinking with visual token unrolling yields the most consistent gains, confirming that textual and visual contexts are complementary.
Text-to-Image Generation Ablation. TIFAGM score and per-attribute GenEval-2 breakdown. Higher is better.
| Context | TIFAGM | Object | Attr. | Count | Pos. | Verb | Overall |
|---|---|---|---|---|---|---|---|
| baseline | 29.25 | 91.64 | 90.00 | 52.03 | 77.67 | 26.25 | 0.56 |
| + short text | 37.35 | 93.18 | 92.45 | 60.14 | 76.92 | 38.83 | 0.59 |
| + long text | 43.94 | 91.86 | 91.13 | 67.03 | 77.03 | 38.31 | 0.61 |
| + visual unrolling | 48.02 | 94.42 | 92.96 | 66.92 | 79.28 | 53.96 | 0.61 |
| + short text & visual unrolling | 49.16 | 93.13 | 92.68 | 68.36 | 76.83 | 43.34 | 0.64 |
| + long text & visual unrolling | 53.44 | 92.34 | 92.32 | 72.98 | 80.23 | 42.81 | 0.66 |
| + oracle & visual unrolling | 57.21 | 94.77 | 97.89 | 69.47 | 90.64 | 56.00 | 0.73 |
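A hedged sketch of the best-performing configuration above (long text plus visual unrolling); `expand_prompt`, `unroll_visual_tokens`, and `decode_image` are hypothetical helpers standing in for the corresponding model calls:

```python
def generate_with_unrolled_context(model, prompt: str):
    """Sketch: build a multi-granularity context before image synthesis.
    1) long textual spec: attributes, counts, spatial constraints (text thinking)
    2) visual unrolling: intermediate visual tokens carrying structural information
    3) decode the final image conditioned on all three context pieces."""
    long_text = model.expand_prompt(prompt)                      # fine-grained textual spec
    visual_ctx = model.unroll_visual_tokens(prompt, long_text)   # structural visual context
    return model.decode_image(context=[prompt, long_text, visual_ctx])
```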
Context unrolling also benefits monocular depth estimation. A depth-related caption describing the scene's structural layout (textual context) recovers key geometric features; combining it with unrolled visual context further corrects residual ambiguities, yielding sharper depth boundaries and globally consistent depth maps.
Depth Estimation Ablation. δ₁ ↑ higher is better; AbsRel ↓ lower is better. Evaluated on NYU-Depth v2.
| Metric | baseline | text context | visual context |
|---|---|---|---|
| δ₁ ↑ | 83.21% | 83.27% | 84.01% |
| AbsRel ↓ | 0.2028 | 0.2029 | 0.1970 |
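For reference, the two metrics in the table follow their standard definitions and can be computed as below (generic NumPy code, not mni-specific):

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray):
    """delta_1: fraction of pixels with max(pred/gt, gt/pred) < 1.25 (higher is better).
    AbsRel: mean of |pred - gt| / gt over valid pixels (lower is better)."""
    valid = gt > 0                                   # ignore pixels without ground truth
    p, g = pred[valid], gt[valid]
    delta1 = float(np.mean(np.maximum(p / g, g / p) < 1.25))
    absrel = float(np.mean(np.abs(p - g) / g))
    return delta1, absrel
```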
Spatial understanding requires resolving geometric ambiguities — viewpoint changes, foreshortening, and occlusions — that freeform text reasoning alone cannot handle.
mni incorporates camera pose estimation and novel-view synthesis as atomic primitives: camera pose provides geometry-grounded textual context, while synthesized views enable visual imagination as context. Both substantially improve spatial reasoning on MMSI-Bench compared to text chain-of-thought alone.
Spatial Understanding Evaluation. Textual contexts denote geometry-grounded text (e.g., camera pose estimation results). Visual contexts refer to novel-view synthesis results used as context. Evaluated on a downsampled MMSI-Bench.
| Context | Overall | Positional Relationship |
|---|---|---|
| baseline | 27.14 | 19.63 |
| + text thinking | 28.15 | 30.25 |
| + text geometric contexts | 30.15 | 33.95 |
| + visual geometric contexts | 34.17 | 35.80 |
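One way to picture how the two geometric contexts are composed (the `estimate_camera_poses` and `synthesize_novel_view` names are placeholders for the corresponding atomic primitives, not documented API calls):

```python
def spatial_answer(model, views, question: str) -> str:
    """Sketch: ground the question in geometry before answering.
    Text geometric context   = estimated camera poses rendered as text.
    Visual geometric context = a synthesized novel view used as extra input."""
    poses = model.estimate_camera_poses(views)            # atomic primitive: pose estimation
    pose_text = "; ".join(f"view {i}: {p}" for i, p in enumerate(poses))
    imagined = model.synthesize_novel_view(views)          # atomic primitive: view synthesis
    return model.generate(
        images=list(views) + [imagined],
        prompt=f"Camera poses: {pose_text}\nQuestion: {question}\nAnswer:")
```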
State-of-the-art results across understanding, generation, and geometry benchmarks — from a single unified model with only 3B activated parameters.
Multimodal Understanding. Compared against models with similar MoE architecture and activation scale (no thinking). Bold = best result for each benchmark.
| Benchmark | Qwen3-VL-30B-A3B ↑ | InternVL3.5-30B-A3B ↑ | mni ↑ |
|---|---|---|---|
| BLINK | **67.7** | 60.4 | 63.0 |
| MMStar | **78.4** | 72.0 | 63.8 |
| MMBench-v1.1 | 78.4 | **84.8** | 75.3 |
| VLMs are Blind | 67.5 | — | **76.4** |
| SimpleVQA | 52.7 | — | **53.3** |
| RealWorldQA | 73.7 | 72.3 | **76.0** |
| TextVQA | — | 80.5 | **81.0** |
| AI2D | 85.0 | 86.8 | **91.5** |
| ChartQA | 86.8 | **87.4** | 86.9 |
| DocVQA | **95.0** | 94.2 | 92.8 |
| HallusionBench | 61.5 | 53.8 | **70.1** |
| MuirBench | **73.0** | 53.1 | 64.2 |
| ERQA | **51.3** | 41.5 | 45.0 |
| MMSI-Bench | 30.3 | 27.5 | **31.5** |
| MVBench | **72.3** | 72.1 | 68.4 |
| Video-MME | **74.5** | 68.7 | 67.2 |
Text-to-Image Generation. GenEval2 measures compositional accuracy; DPG measures dense prompt following; LongText measures long-form prompt following. Higher is better.
| Model | GenEval2 ↑ | DPG ↑ | LongText-EN ↑ | LongText-CN ↑ | Inhouse ↑ |
|---|---|---|---|---|---|
| Qwen-Image | 30.67 | 88.32 | 94.3 | 94.6 | 55.16 |
| Z-Image | 41.83 | 88.14 | 93.5 | 93.6 | 55.19 |
| Flux | 34.59 | 83.84 | 60.7 | - | 49.91 |
| mni | 54.12 | 88.55 | 97.5 | 96.8 | 63.87 |
Image Editing — GEdit-Bench EN (Full set). G_SC = semantic correctness; G_PQ = perceptual quality; G_O = overall. Higher is better.
| Model | G_SC ↑ | G_PQ ↑ | G_O ↑ |
|---|---|---|---|
| Flux-Kontext-dev | 7.16 | 7.37 | 6.51 |
| Step1X-Edit v1.1 | 7.66 | 7.35 | 6.97 |
| Step1X-Edit v1.2 | 7.77 | 7.65 | 7.24 |
| Emu-3.5 | 8.11 | 7.70 | 7.59 |
| Z-Image-Edit | 8.11 | 7.72 | 7.57 |
| Qwen-Image-Edit | 8.15 | 7.86 | 7.54 |
| mni | 8.42 | 7.85 | 7.75 |
Text-to-Video Generation — VBench 1.0. Total Score combines Quality Score and Semantic Score. Higher is better.
| Model | Total Score ↑ | Quality Score ↑ | Semantic Score ↑ |
|---|---|---|---|
| Wan2.1 | 85.59 | 83.43 | 76.11 |
| Hunyuan Video | 83.69 | 83.35 | 76.88 |
| mni | 85.07 | 84.29 | 83.11 |
Video Editing — FiVE Benchmark. Evaluated across structure preservation, background fidelity, text alignment, motion quality, and instruction-following (FiVE score). Higher is better except Dist.↓ and LPIPS↓.
| Method | Structure: Dist.×10³ ↓ | Background: PSNR ↑ | Background: LPIPS×10² ↓ | Background: SSIM×10² ↑ | Text Align.: CLIPS ↑ | Text Align.: CLIPSedit ↑ | Motion: Fid.S.×10² ↑ | FiVE: YN ↑ | FiVE: MC ↑ | FiVE: ∪ ↑ | FiVE: ∩ ↑ | FiVE: Acc ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TokenFlow | 35.62 | 19.06 | 263.61 | 72.51 | 26.46 | 21.15 | 89.00 | 19.36 | 35.51 | 36.68 | 18.18 | 27.43 |
| DMT | 85.95 | 14.71 | 404.60 | 51.64 | 26.66 | 21.44 | 82.30 | 34.78 | 62.06 | 62.98 | 33.86 | 48.42 |
| Vid2Me | 22.37 | 21.15 | 263.91 | 70.69 | 26.84 | 21.05 | 90.06 | 20.03 | 33.50 | 36.20 | 17.34 | 26.77 |
| AnyV2V | 71.36 | 15.90 | 348.59 | 50.77 | 24.89 | 19.72 | 60.36 | 30.62 | 45.42 | 48.96 | 27.09 | 38.02 |
| VideoGrain | 12.40 | 27.05 | 185.21 | 79.13 | 25.69 | 20.31 | 88.57 | 30.50 | 43.97 | 44.30 | 30.17 | 37.23 |
| Pyramid-Edit | 28.65 | 20.84 | 276.59 | 71.72 | 26.82 | 20.20 | 80.59 | 33.67 | 54.01 | 56.36 | 31.31 | 43.84 |
| Wan-Edit | 12.53 | 25.57 | 94.61 | 82.55 | 26.39 | 21.23 | 89.43 | 41.41 | 52.53 | 55.72 | 38.22 | 46.97 |
| mni | 42.96 | 20.30 | 245.18 | 67.00 | 27.32 | 21.47 | 83.37 | 69.67 | 86.15 | 88.50 | 67.23 | 78.03 |
Camera Pose Estimation on RealEstate10K and CO3Dv2
AUC@30↑ higher is better; RPE trans↓ and RPE rot↓ lower is better.
| Method | RealEstate10K AUC@30 ↑ | RealEstate10K RPE trans ↓ | RealEstate10K RPE rot ↓ | CO3Dv2 AUC@30 ↑ | CO3Dv2 RPE trans ↓ | CO3Dv2 RPE rot ↓ |
|---|---|---|---|---|---|---|
| Flare | 84.42 | 0.4215 | 0.0532 | 72.23 | 2.1242 | 0.0342 |
| Cut3r | 85.32 | 0.4023 | 0.0424 | 75.62 | 1.5321 | 0.0331 |
| VGGT | 88.23 | 0.3886 | 0.0386 | 86.23 | 1.1432 | 0.0285 |
| mni | 88.32 | 0.3766 | 0.0289 | 75.21 | 1.5955 | 0.0269 |
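As a reference for the rotation part of the RPE metric above, a common formulation is the geodesic angle between predicted and ground-truth relative rotations (exact benchmark definitions may differ):

```python
import numpy as np

def rotation_error_deg(R_gt: np.ndarray, R_pred: np.ndarray) -> float:
    """Geodesic angle, in degrees, between two 3x3 rotation matrices."""
    cos = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```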
Monocular Depth Estimation
δ₁↑ higher is better; AbsRel↓ lower is better. Zero-shot evaluation across five standard benchmarks.
| Method | NYU δ₁ ↑ | NYU AbsRel ↓ | KITTI δ₁ ↑ | KITTI AbsRel ↓ | SINTEL δ₁ ↑ | SINTEL AbsRel ↓ | ETH3D δ₁ ↑ | ETH3D AbsRel ↓ | DIODE δ₁ ↑ | DIODE AbsRel ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Marigold | 92.75 | 0.0781 | 87.87 | 0.1108 | 62.24 | 0.4666 | 97.12 | 0.0564 | 81.64 | 0.2266 |
| Cut3r | 91.64 | 0.0824 | 86.42 | 0.1253 | 55.64 | 0.4723 | 95.34 | 0.0632 | 73.21 | 0.3521 |
| DA3 giant | 94.78 | 0.0579 | 93.96 | 0.0824 | 66.54 | 0.3821 | 98.79 | 0.0324 | 82.69 | 0.2050 |
| VGGT | 96.10 | 0.0499 | 94.29 | 0.0803 | 66.11 | 0.4551 | 98.35 | 0.0326 | 82.15 | 0.2115 |
| mni | 96.22 | 0.0542 | 96.92 | 0.0621 | 74.27 | 0.3340 | 98.91 | 0.0312 | 83.83 | 0.2034 |
What mni Can Do
mni generates images from complex textual prompts and performs precise instruction-following edits. The model can also estimate depth from a single image and synthesize novel views from arbitrary camera poses.








`<campose>tx ty tz rx ry rz</campose> <fov>fovh</fov><fov>fovw</fov>`
The six values give the translation and rotation of the next view relative to the current one; fovh and fovw are the vertical and horizontal fields of view.
The entire instruction, including all camera pose and FOV values, is encoded purely by the language tokenizer, with no separate geometric encoder.
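A small helper that serializes a target pose into this instruction format might look like the following; the tag layout mirrors the description above, while the function itself is only illustrative:

```python
def campose_instruction(tx, ty, tz, rx, ry, rz, fovh, fovw) -> str:
    """Serialize the next view's relative translation/rotation and the vertical and
    horizontal fields of view as plain text; the string is consumed by the language
    tokenizer directly, with no separate geometric encoder."""
    return (f"<campose>{tx} {ty} {tz} {rx} {ry} {rz}</campose> "
            f"<fov>{fovh}</fov><fov>{fovw}</fov>")
```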
mni handles diverse video generation tasks in a single unified model: generating from text prompts or reference images, composing specified subjects into new scenes, and applying instruction-based edits — all with high temporal consistency.