Overview

TL;DR: We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data as context.

Demo Video

We sample frames from a video as context and perform instruction-following spatial tasks: video-based object grounding and scene navigation. Our method generates videos that complete the instructed spatial task while remaining geometrically consistent with the provided video context.
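A minimal sketch of how context frames might be drawn from a source clip; uniform spacing, the tensor layout, and the default frame count are illustrative assumptions rather than the exact protocol:

```python
import torch

def sample_context_frames(video, num_context=337):
    """Draw evenly spaced frames from a source clip to use as video context.

    video: [T, C, H, W] tensor of decoded frames. Uniform spacing and the
    default frame count are illustrative assumptions, not the exact protocol.
    """
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, steps=min(num_context, t)).long()
    return video[idx]
```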

Input Video Context


Object Grounding

* The red bounding box is generated by the model.

Instruction: The camera moves smoothly through a bedroom. Finally, it focuses on a guitar in the center of the frame.

Instruction: The camera moves smoothly through a bedroom. Finally, it focuses on a green plant in the center of the frame.

Instruction: The camera moves smoothly through a bedroom. Finally, it focuses on a door in the center of the frame (seed 1).

Instruction: The camera moves smoothly through a bedroom. Finally, it focuses on a door in the center of the frame (seed 2).


Scene Navigation

The camera moves smoothly in a bedroom.

The camera moves smoothly in a bedroom.

Framework

Our framework is intentionally simple: a standard video diffusion architecture trained only with the diffusion objective. The inputs are a video context (several frames from the same environment) and instructions; the output is a video that completes the instructed spatial task while maintaining scene geometry and temporal coherence.
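A minimal sketch of this training setup, assuming a latent video diffusion model with a plain noise-prediction loss; `denoiser`, `text_encoder`, and the toy noise schedule below are illustrative placeholders, not the project's actual modules:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, text_encoder, context_latents,
                            target_latents, instruction_tokens):
    """One training step with only the standard denoising objective.

    context_latents: [B, T_ctx, C, H, W]  encoded context frames (kept clean)
    target_latents:  [B, T_out, C, H, W]  encoded target (instructed) video
    All module and tensor names are illustrative placeholders.
    """
    b = target_latents.size(0)
    # Sample a noise level per example (toy linear schedule, an assumption).
    t = torch.rand(b, device=target_latents.device)
    alpha = (1.0 - t).view(b, 1, 1, 1, 1)
    noise = torch.randn_like(target_latents)
    noisy = alpha.sqrt() * target_latents + (1 - alpha).sqrt() * noise

    # Condition on the clean context frames and the encoded instruction.
    text_emb = text_encoder(instruction_tokens)
    pred = denoiser(noisy, t, context=context_latents, text=text_emb)

    # Plain noise-prediction loss; no auxiliary geometric losses.
    return F.mse_loss(pred, noise)
```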

Illustrative figure

Comparisons

Object Grounding

Our method faithfully grounds the object present in the context, whereas other methods hallucinate the target object.

* Wan2.2-5B and Veo3 take 1 frame as context, FramePack uses 105 frames, and ours uses 337 frames.

Context

Wan2.2-5B

Veo3

FramePack

Ours

Instruction: The camera moves smoothly through a bedroom. Finally, it focuses on a bag on top of the dresser in the center of the frame.


Context

Wan2.2-5B

Veo3

FramePack

Ours

Instruction: The camera moves through a desk area. Finally, it focuses on a monitor in the center of the frame.

Qualitative comparison of object grounding performance.

Comparison Quantitative Results Table

Scene Navigation

Our method delivers the highest perceptual quality among the compared methods while maintaining good camera controllability.
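For the camera-pose instruction, a common encoding in camera-controllable video generation is a per-pixel Plücker ray map; the sketch below shows that encoding purely as an illustration and does not claim it is the encoding used here, with `K_inv` and `c2w` as assumed inputs:

```python
import torch

def plucker_rays(K_inv, c2w, h, w):
    """Per-pixel Plücker ray map for one camera pose (illustrative encoding).

    K_inv: [3, 3] inverse intrinsics, c2w: [4, 4] camera-to-world matrix.
    Returns a [6, h, w] map of (ray direction, moment = origin x direction).
    """
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    dirs = (pix @ K_inv.T) @ c2w[:3, :3].T          # pixel rays in world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)
    moment = torch.linalg.cross(origin, dirs, dim=-1)
    return torch.cat([dirs, moment], dim=-1).permute(2, 0, 1)
```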

Context

Instruction: Camera Pose

GT (only for reference)

AnySplat

Gen3C

TrajectoryCrafter

Ours


Context

Instruction: Camera Pose

GT (only for reference)

AnySplat

Gen3C

TrajectoryCrafter

Ours

Qualitative comparison of scene navigation results.

navigation comparison results

Ablation Study

Visualization of ablations on the number of condition frames, CFG, and the auxiliary bbox (a sketch of the context-CFG combination follows the list below). The instruction is "The camera moves through an office space. Finally, it focuses on a monitor in the center of the frame."

Context

Default

W/ 45 context frames (infer)

W/ only the first frame (infer)

W/o context CFG

W/o auxiliary bbox
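A minimal sketch of the context CFG referenced above, assuming the standard two-branch classifier-free guidance combination; `denoiser`, the context-dropping strategy, and the guidance scale are illustrative assumptions:

```python
import torch

@torch.no_grad()
def context_cfg_step(denoiser, noisy_latents, t, context_latents, text_emb,
                     context_scale=2.0):
    """One guided denoising call with classifier-free guidance on the context.

    The conditional branch sees the context frames; the unconditional branch
    drops them (here, zeroed out). All names are illustrative placeholders.
    """
    cond = denoiser(noisy_latents, t, context=context_latents, text=text_emb)
    uncond = denoiser(noisy_latents, t,
                      context=torch.zeros_like(context_latents), text=text_emb)
    # Standard CFG combination: push the sample toward context-consistent output.
    return uncond + context_scale * (cond - uncond)
```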

Out-of-Distribution Generalization

Our model generalizes to out-of-distribution (OOD) scenarios such as outdoor scenes and new categories, performing both object grounding and scene navigation. Below are two real scenes scanned with an iPhone.

Context

Instruction (simplified): Tree

Instruction: Bicycle

Instruction: Camera Pose

Results


Context

Instruction: Monitor

Instruction: Window

Instruction: Camera Pose

Results

Failure Cases

Our method still suffers from artifacts such as temporal discontinuities (row 1) and incorrect grounding (row 2) in some cases. We observe that accuracy improves as the number of repeated sampling attempts increases; a minimal repeated-sampling sketch follows the examples below.

Context

Failure Case

Success Case

Instruction: The camera moves smoothly through a kitchen. Finally, it focuses on a sink in the center of the frame.

Context

Failure Case

Success Case

Instruction: The camera moves from a desk area. Finally, it focuses on a suitcase in the center of the frame.
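A minimal repeated-sampling sketch, as referenced above: draw several generations with different seeds for the same context and instruction. Selecting the best sample would need an external check (e.g., a detector or a human), which is not modeled here; `generate` is an illustrative placeholder for the model's sampling function.

```python
import torch

def repeated_sampling(generate, context, instruction, num_repeats=4, base_seed=0):
    """Draw several generations with different seeds for one instruction.

    The chance that at least one sample grounds the target correctly rises
    with num_repeats. `generate` is an illustrative placeholder.
    """
    samples = []
    for i in range(num_repeats):
        torch.manual_seed(base_seed + i)
        samples.append(generate(context=context, instruction=instruction))
    return samples
```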


BibTeX

@article{YourPaperKey2024,
  title={Your Paper Title Here},
  author={First Author and Second Author and Third Author},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}

Acknowledgements

We are grateful for the insightful discussions and invaluable support provided by Ritwik Kumar, Pablo Delgado, Serhan Uslubas, Avin Regmi, Nilesh Kulkarni, Dan Zheng, Soshi Shimada, Yihang Luo, Runsen Xu, Yunhao Fang, Xiaohan Mao, Yifan Zhou, Hao Li, and Xiao Chen. We thank the great projects Wan, ScanNet++, ArkitScene, LaCT, and ReCamMaster, from which we borrow insights, data, and code. The webpage is modified from eliahuhorwitz.github.io.