ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

¹National University of Singapore, ²Baidu, AMU
Work done during Fengyuan's internship at Baidu. Project leader. *Corresponding authors

Abstract

We propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation.

Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We further propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate scenes without any heuristic 3D alignments.

To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism to maintain subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis.

Method

ONE-SHOT method overview
Model Architecture. Our model builds on a pretrained video foundation model augmented with a conditioning branch for concept injection. Environmental cues are encoded from projected point clouds and depth maps, while identity appearance and context memory maintain visual coherence. Disentangled human dynamics are injected via Decoupled Motion Cross-Attention.
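To make the environmental-cue pathway concrete, the sketch below renders a sparse depth map by projecting a camera-frame point cloud through a pinhole intrinsic matrix, keeping the nearest point per pixel. The function name and the simple z-buffer loop are our own illustrative assumptions, not the paper's actual encoder, which feeds such maps into the conditioning branch:

```python
import numpy as np

def project_point_cloud_to_depth(points, K, H, W):
    """Render a sparse depth map from 3-D points (camera frame).

    points: (N, 3) array of x, y, z in the camera coordinate frame.
    K:      (3, 3) pinhole intrinsic matrix.
    Returns an (H, W) depth map; pixels with no point are 0.
    """
    z = points[:, 2]
    pts = points[z > 1e-6]                      # drop points behind the camera
    uvw = (K @ pts.T).T                         # homogeneous image coordinates
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    depth = np.full((H, W), np.inf)
    inb = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi in zip(u[inb], v[inb], pts[inb][:, 2]):
        if zi < depth[vi, ui]:                  # z-buffer: keep nearest point
            depth[vi, ui] = zi
    depth[np.isinf(depth)] = 0.0
    return depth
```

A map like this gives the generator dense geometric evidence of the target scene while remaining agnostic to the human subject, which is injected separately.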
Decoupled Motion Cross-Attention
Decoupled Motion Cross-Attention. The canonical-space human pose is injected into the video through cross-attention, where the proposed Dynamic-Grounded-RoPE bridges the spatial discrepancy between the environment and human spaces.
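A minimal single-head sketch of this idea in plain NumPy: queries come from the video (environment) tokens, keys and values come from canonical-space pose tokens, and the pose positions are re-anchored into the video's coordinate frame before rotary embedding is applied. The function names, the 1-D positional simplification (the real model operates on 3-D video token grids), and the grounding interface are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Apply 1-D rotary position embedding (RoPE) along the last dim.

    x: (T, D) token features with D even; positions: (T,) scalar positions.
    Adjacent feature pairs are rotated by position-dependent angles.
    """
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)       # (D/2,)
    angles = positions[:, None] * freqs[None, :]        # (T, D/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def decoupled_motion_cross_attention(video_tokens, pose_tokens,
                                     video_pos, grounded_pose_pos):
    """Single-head cross-attention from video queries to pose keys/values.

    grounded_pose_pos holds canonical-space pose positions already mapped
    into the video's coordinate frame -- the role Dynamic-Grounded-RoPE
    plays -- so query/key rotations share one spatial reference.
    """
    q = rope_1d(video_tokens, video_pos)
    k = rope_1d(pose_tokens, grounded_pose_pos)
    v = pose_tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ v
```

Because rotary embeddings make attention depend only on relative positions, shifting the video and grounded pose positions by the same offset leaves the output unchanged; this is what lets one set of pose tokens be re-grounded into arbitrary environments.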

Compositional Controls

Freely compose scene, identity, motion, and camera into new videos.

Scene Replacement

Identity Replacement

Motion Editing

Camera Motion Editing

Context Memory (Viewpoint Revisit Consistency)

Instruction-based Editing

Our method retains strong compatibility with the pretrained VFM, with minimal loss of its native text-conditioned editing ability.

Long-horizon Generation

Video

BibTeX

@misc{yang2026oneshot,
  title={ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration},
  author={Fengyuan Yang and Luying Huang and Jiazhi Guan and Quanwei Yang and Dongwei Pan and Jianglin Fu and Haocheng Feng and Wei He and Kaisiyuan Wang and Hang Zhou and Angela Yao},
  year={2026},
  eprint={2604.01043},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.01043}
}