ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

¹National University of Singapore, ²Baidu, AMU
Work done during Fengyuan's internship at Baidu. Project leader. *Corresponding authors

Abstract

We propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation.

Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We further propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate scenes without any heuristic 3D alignments.

To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism to maintain subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis.

Method

ONE-SHOT method overview
Model Architecture. Our model builds on a pretrained video foundation model augmented with a conditioning branch for concept injection. Environmental cues are encoded from projected point clouds and depth maps, while identity appearance and context memory maintain visual coherence. Disentangled human dynamics are injected via Decoupled Motion Cross-Attention.
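To make the environmental-cue pathway concrete, the sketch below renders a sparse depth map by projecting a camera-frame point cloud through a pinhole intrinsic matrix, keeping the nearest point per pixel. The function name and the simple z-buffer loop are our own illustrative assumptions, not the paper's actual encoder, which feeds such maps into the conditioning branch:

```python
import numpy as np

def project_point_cloud_to_depth(points, K, H, W):
    """Render a sparse depth map from 3-D points (camera frame).

    points: (N, 3) array of x, y, z in the camera coordinate frame.
    K:      (3, 3) pinhole intrinsic matrix.
    Returns an (H, W) depth map; pixels with no point are 0.
    """
    z = points[:, 2]
    pts = points[z > 1e-6]                      # drop points behind the camera
    uvw = (K @ pts.T).T                         # homogeneous image coordinates
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    depth = np.full((H, W), np.inf)
    inb = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi in zip(u[inb], v[inb], pts[inb][:, 2]):
        if zi < depth[vi, ui]:                  # z-buffer: keep nearest point
            depth[vi, ui] = zi
    depth[np.isinf(depth)] = 0.0
    return depth
```

A map like this gives the generator dense geometric evidence of the target scene while remaining agnostic to the human subject, which is injected separately.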
Decoupled Motion Cross-Attention
Decoupled Motion Cross-Attention. The canonical-space human pose is injected into the video through cross-attention, where the proposed Dynamic-Grounded-RoPE bridges the spatial discrepancy between the environment and human spaces.
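A minimal single-head sketch of this idea in plain NumPy: queries come from the video (environment) tokens, keys and values come from canonical-space pose tokens, and the pose positions are re-anchored into the video's coordinate frame before rotary embedding is applied. The function names, the 1-D positional simplification (the real model operates on 3-D video token grids), and the grounding interface are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Apply 1-D rotary position embedding (RoPE) along the last dim.

    x: (T, D) token features with D even; positions: (T,) scalar positions.
    Adjacent feature pairs are rotated by position-dependent angles.
    """
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)       # (D/2,)
    angles = positions[:, None] * freqs[None, :]        # (T, D/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def decoupled_motion_cross_attention(video_tokens, pose_tokens,
                                     video_pos, grounded_pose_pos):
    """Single-head cross-attention from video queries to pose keys/values.

    grounded_pose_pos holds canonical-space pose positions already mapped
    into the video's coordinate frame -- the role Dynamic-Grounded-RoPE
    plays -- so query/key rotations share one spatial reference.
    """
    q = rope_1d(video_tokens, video_pos)
    k = rope_1d(pose_tokens, grounded_pose_pos)
    v = pose_tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ v
```

Because rotary embeddings make attention depend only on relative positions, shifting the video and grounded pose positions by the same offset leaves the output unchanged; this is what lets one set of pose tokens be re-grounded into arbitrary environments.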

Compositional Controls

Freely compose scene, identity, motion, and camera into new videos.

Scene Replacement

Identity Replacement

Motion Editing

Camera Motion Editing

Context Memory (Viewpoint Revisit Consistency)

Instruction-based Editing

Our method retains strong compatibility with the pretrained VFM, with minimal loss of its native text-conditioned editing ability.

Long-horizon Generation

Video

BibTeX

@misc{yang2026oneshot,
  title={ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration},
  author={Fengyuan Yang and Luying Huang and Jiazhi Guan and Quanwei Yang and Dongwei Pan and Jianglin Fu and Haocheng Feng and Wei He and Kaisiyuan Wang and Hang Zhou and Angela Yao},
  year={2026},
  eprint={2604.01043},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.01043}
}