Overview

TL;DR: We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data as context.

Demo Video

We sample frames from a video as context and perform instruction-following spatial tasks: video-based object grounding and scene navigation. Our method generates videos that complete the instructed spatial task while remaining geometrically consistent with the provided video context.
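A minimal sketch of how context frames might be drawn from a source clip; uniform spacing, the tensor layout, and the default frame count are illustrative assumptions rather than the exact protocol:

```python
import torch

def sample_context_frames(video, num_context=337):
    """Draw evenly spaced frames from a source clip to use as video context.

    video: [T, C, H, W] tensor of decoded frames. Uniform spacing and the
    default frame count are illustrative assumptions, not the exact protocol.
    """
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, steps=min(num_context, t)).long()
    return video[idx]
```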

Input Video Context


Object Grounding

* The red bounding box is generated by the model.

Instruction: The camera moves smoothly through a bedroom. Finally, it focuses on a guitar in the center of the frame.

Instruction: The camera moves smoothly through a bedroom. Finally, it focuses on a green plant in the center of the frame.

Instruction: The camera moves smoothly through a bedroom. Finally, it focuses on a door in the center of the frame (seed 1).

Instruction: The camera moves smoothly through a bedroom. Finally, it focuses on a door in the center of the frame (seed 2).


Scene Navigation

The camera moves smoothly in a bedroom.

The camera moves smoothly in a bedroom.

Framework

Our framework is intentionally simple: a standard video diffusion architecture trained only with the diffusion objective. The inputs are a video context (several frames from the same environment) and instructions; the output is a video that completes the instructed spatial task while maintaining scene geometry and temporal coherence.
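A minimal sketch of this training setup, assuming a latent video diffusion model with a plain noise-prediction loss; `denoiser`, `text_encoder`, and the toy noise schedule below are illustrative placeholders, not the project's actual modules:

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, text_encoder, context_latents,
                            target_latents, instruction_tokens):
    """One training step with only the standard denoising objective.

    context_latents: [B, T_ctx, C, H, W]  encoded context frames (kept clean)
    target_latents:  [B, T_out, C, H, W]  encoded target (instructed) video
    All module and tensor names are illustrative placeholders.
    """
    b = target_latents.size(0)
    # Sample a noise level per example (toy linear schedule, an assumption).
    t = torch.rand(b, device=target_latents.device)
    alpha = (1.0 - t).view(b, 1, 1, 1, 1)
    noise = torch.randn_like(target_latents)
    noisy = alpha.sqrt() * target_latents + (1 - alpha).sqrt() * noise

    # Condition on the clean context frames and the encoded instruction.
    text_emb = text_encoder(instruction_tokens)
    pred = denoiser(noisy, t, context=context_latents, text=text_emb)

    # Plain noise-prediction loss; no auxiliary geometric losses.
    return F.mse_loss(pred, noise)
```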

Illustrative figure

Comparisons

Object Grounding

Our method faithfully grounds the object present in the context, whereas other methods hallucinate the target object.

* Wan2.2-5B and Veo3 take 1 frame as context, FramePack uses 105 frames, and ours uses 337 frames.

Context

Wan2.2-5B

Veo3

FramePack

Ours

Instruction: The camera moves smoothly through a bedroom. Finally, it focuses on a bag on top of the dresser in the center of the frame.


Context

Wan2.2-5B

Veo3

FramePack

Ours

Instruction: The camera moves through a desk area. Finally, it focuses on a monitor in the center of the frame.

Qualitative comparison of object grounding performance.

Comparison Quantitative Results Table

Scene Navigation

Our method delivers the highest perceptual quality among the compared methods while maintaining good camera controllability.
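For the camera-pose instruction, a common encoding in camera-controllable video generation is a per-pixel Plücker ray map; the sketch below shows that encoding purely as an illustration and does not claim it is the encoding used here, with `K_inv` and `c2w` as assumed inputs:

```python
import torch

def plucker_rays(K_inv, c2w, h, w):
    """Per-pixel Plücker ray map for one camera pose (illustrative encoding).

    K_inv: [3, 3] inverse intrinsics, c2w: [4, 4] camera-to-world matrix.
    Returns a [6, h, w] map of (ray direction, moment = origin x direction).
    """
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    dirs = (pix @ K_inv.T) @ c2w[:3, :3].T          # pixel rays in world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)
    moment = torch.linalg.cross(origin, dirs, dim=-1)
    return torch.cat([dirs, moment], dim=-1).permute(2, 0, 1)
```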

Context

Instruction: Camera Pose

GT (only for reference)

AnySplat

Gen3C

TrajectoryCrafter

Ours


Context

Instruction: Camera Pose

GT (only for reference)

AnySplat

Gen3C

TrajectoryCrafter

Ours

Qualitative comparison of scene navigation results.

navigation comparison results

Ablation Study

Visualization of ablations on the number of condition frames, CFG, and the auxiliary bbox (a sketch of the context-CFG combination follows the list below). The instruction is "The camera moves through an office space. Finally, it focuses on a monitor in the center of the frame."

Context

Default

W/ 45 context frames (infer)

W/ only the first frame (infer)

W/o context CFG

W/o auxiliary bbox
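A minimal sketch of the context CFG referenced above, assuming the standard two-branch classifier-free guidance combination; `denoiser`, the context-dropping strategy, and the guidance scale are illustrative assumptions:

```python
import torch

@torch.no_grad()
def context_cfg_step(denoiser, noisy_latents, t, context_latents, text_emb,
                     context_scale=2.0):
    """One guided denoising call with classifier-free guidance on the context.

    The conditional branch sees the context frames; the unconditional branch
    drops them (here, zeroed out). All names are illustrative placeholders.
    """
    cond = denoiser(noisy_latents, t, context=context_latents, text=text_emb)
    uncond = denoiser(noisy_latents, t,
                      context=torch.zeros_like(context_latents), text=text_emb)
    # Standard CFG combination: push the sample toward context-consistent output.
    return uncond + context_scale * (cond - uncond)
```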

Out-of-Distribution Generalization

Our model generalizes to out-of-distribution (OOD) scenarios such as outdoor scenes and new categories, performing both object grounding and scene navigation. Below are two real scenes scanned with an iPhone.

Context

Instruction (simplified): Tree

Instruction: Bicycle

Instruction: Camera Pose

Results


Context

Instruction: Monitor

Instruction: Window

Instruction: Camera Pose

Results

Failure Cases

Our method still suffers from artifacts such as temporal discontinuities (row 1) and incorrect grounding (row 2) in some cases. We observe that accuracy improves as the number of repeated sampling attempts increases; a minimal repeated-sampling sketch follows the examples below.

Context

Failure Case

Success Case

Instruction: The camera moves smoothly through a kitchen. Finally, it focuses on a sink in the center of the frame.

Context

Failure Case

Success Case

Instruction: The camera moves from a desk area. Finally, it focuses on a suitcase in the center of the frame.
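A minimal repeated-sampling sketch, as referenced above: draw several generations with different seeds for the same context and instruction. Selecting the best sample would need an external check (e.g., a detector or a human), which is not modeled here; `generate` is an illustrative placeholder for the model's sampling function.

```python
import torch

def repeated_sampling(generate, context, instruction, num_repeats=4, base_seed=0):
    """Draw several generations with different seeds for one instruction.

    The chance that at least one sample grounds the target correctly rises
    with num_repeats. `generate` is an illustrative placeholder.
    """
    samples = []
    for i in range(num_repeats):
        torch.manual_seed(base_seed + i)
        samples.append(generate(context=context, instruction=instruction))
    return samples
```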


BibTeX

@article{YourPaperKey2024,
  title={Your Paper Title Here},
  author={First Author and Second Author and Third Author},
  journal={Conference/Journal Name},
  year={2024},
  url={https://your-domain.com/your-project-page}
}

Acknowledgements

We are grateful for the insightful discussions and invaluable support provided by Ritwik Kumar, Pablo Delgado, Serhan Uslubas, Avin Regmi, Nilesh Kulkarni, Dan Zheng, Soshi Shimada, Yihang Luo, Runsen Xu, Yunhao Fang, Xiaohan Mao, Yifan Zhou, Hao Li, and Xiao Chen. We thank the great projects Wan, ScanNet++, ArkitScene, LaCT, and ReCamMaster, from which we borrow insights, data, and code. The webpage is modified from eliahuhorwitz.github.io.