MotionClone: Training-Free Motion Cloning for Controllable Video Generation

∗Equal contribution. †Corresponding author.

Pengyang Ling1,4∗  Jiazi Bu2,4∗  Pan Zhang4†  Xiaoyi Dong4  Yuhang Zang4  Tong Wu3  Huaian Chen1  Jiaqi Wang4  Yi Jin1†
1University of Science and Technology of China  2Shanghai Jiao Tong University  3The Chinese University of Hong Kong  4Shanghai AI Laboratory
https://github.com/LPengYang/MotionClone

Abstract

Motion-based controllable video generation offers the potential for creating captivating visual content. Existing methods typically necessitate model training to encode particular motion cues or incorporate fine-tuning to inject certain motion patterns, resulting in limited flexibility and generalization. In this work, we propose MotionClone, a training-free framework that enables motion cloning from reference videos for versatile motion-controlled video generation, including text-to-video and image-to-video. Based on the observation that the dominant components in temporal-attention maps drive motion synthesis, while the rest mainly capture noisy or very subtle motions, MotionClone utilizes sparse temporal attention weights as motion representations for motion guidance, facilitating diverse motion transfer across varying scenarios. Meanwhile, MotionClone allows for the direct extraction of motion representations through a single denoising step, bypassing cumbersome inversion processes and thus promoting both efficiency and flexibility. Extensive experiments demonstrate that MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.

1 Introduction

Video generation that aligns with human intentions and produces high-quality outputs has recently attracted significant attention, particularly with the rise of mainstream text-to-video (Guo et al., 2023b; Blattmann et al., 2023b; Chen et al., 2024) and image-to-video (Guo et al., 2023a; Blattmann et al., 2023a; Dai et al., 2023) diffusion models. Despite the substantial progress in conditional image generation, video generation presents unique challenges, primarily due to the complexities introduced by motion synthesis. Incorporating additional motion control not only mitigates the ambiguity inherent in video synthesis for superior motion modeling but also enhances the manipulability of the synthesized content for customized creation.

In the realm of video generation steered by motion cues, pioneering methods can generally be classified into two principal strategies: one leverages dense depth or sketch sequences from reference videos (Wang et al., 2024; Jeong & Ye, 2023; Guo et al., 2023a), and the other relies on motion trajectories (Wang et al., 2023b; Yin et al., 2023; Niu et al., 2024). The former typically integrates a pre-trained model to extract motion cues at the pixel level. Despite achieving highly aligned motion, these dense motion cues can be intricately entangled with the structural elements of the reference videos, impeding their transferability to novel scenarios. The latter, trajectory-based methodology provides a more user-friendly approach for capturing broader object movements but struggles to delineate finer, localized motions such as head turns or hand raises. Additionally, both methodologies typically entail model training to encode particular motion cues, implying suboptimal generation when applied outside the trained domain. The same limitation is observed in approaches relying on fine-tuning (Jeong et al., 2023; Zhao et al., 2023), which aim to fit the motion patterns of certain videos.

[Figure 1]

In this work, we introduce MotionClone, a novel training-free framework designed to clone motions from reference videos for controllable video generation. Diverging from traditional approaches involving tailored training or fine-tuning, MotionClone employs the commonly used temporal-attention mechanism within video generation models to capture the motion in reference videos. This strategy renders detailed motion while preserving minimal interdependencies with the structural components of the reference video, offering flexible motion cloning in varying scenarios, as shown in Fig. 1. Specifically, we observe that the dominant components in temporal-attention weights significantly drive motion synthesis, while the rest mainly reflect noisy or very subtle motions. When the whole temporal-attention map is applied uniformly across the model, the majority of temporal-attention weights can overshadow the motion guidance, resulting in the suppression of the primary motion. We therefore propose to leverage the principal components of the temporal-attention weights as the motion representation, which serves as motion guidance that disregards noisy or less significant motions and concentrates on the primary motion, thus substantially enhancing the fidelity of motion cloning. Moreover, we demonstrate that the motion representation extracted from a single denoising step provides effective guidance across all time steps, offering high efficiency without the burden of cumbersome video inversion. Furthermore, MotionClone is compatible with a range of video generation tasks, including text-to-video (T2V) and image-to-video (I2V), highlighting its versatility and broad applicability.

In summary, (1) we propose MotionClone, a novel motion-guided video generation framework that enables training-free motion cloning from given reference videos; (2) we design a primary motion control strategy that performs substantial motion guidance over sparse temporal attention maps, allowing for efficient motion transfer across scenarios; (3) we validate the effectiveness and versatility of MotionClone on various video generation tasks, where extensive experiments demonstrate its proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.

2 Related Work

2.1 Text-to-video diffusion models

Equipped with sophisticated text encoders (Radford et al., 2021; Zhang et al., 2024), great breakthroughs have been achieved in text-to-image (T2I) generation (Gu et al., 2022; Nichol et al., 2021; Rombach et al., 2022; Podell et al., 2023), which has sparked enthusiasm for advanced text-to-video (T2V) models (Blattmann et al., 2023b; Wang et al., 2023a; Chen et al., 2023a; 2024; Guo et al., 2023b). Notably, VideoLDM (Blattmann et al., 2023b) introduces a motion module that utilizes 3D convolutions and temporal attention to capture frame-to-frame correlations. AnimateDiff (Guo et al., 2023b) enhances a pre-trained T2I diffusion model with motion modeling capabilities by fine-tuning a series of specialized temporal attention layers on extensive video datasets, allowing for a harmonious fusion with the original T2I generation process. To address the challenge of data scarcity, VideoCrafter2 (Chen et al., 2024) proposes learning motion from low-quality videos (Bain et al., 2021) while simultaneously learning appearance from high-quality images (Sun et al., 2024). Despite these advancements, a significant quality gap remains between available T2V models and their sophisticated T2I counterparts, primarily due to the intricate nature of diverse motions and the limited availability of high-quality video data. In this work, we develop a motion guidance strategy that incorporates motion cues from given videos to ease the challenge of motion modeling, yielding more realistic and coherent video sequences without model fine-tuning.

2.2 Controllable video generation

Building on the success of controllable image generation through the integration of additional conditions (Zhang et al., 2023; Kim et al., 2023; Li et al., 2023; Qin et al., 2023; Huang et al., 2023), a multitude of studies (Chen et al., 2023a; Yin et al., 2023; Dai et al., 2023; Ma et al., 2024; Blattmann et al., 2023a) have endeavored to introduce diverse control signals for versatile video generation, including control over the first video frame (Chen et al., 2023a), motion trajectory (Yin et al., 2023), motion region (Dai et al., 2023), and moving objects (Ma et al., 2024). Furthermore, in pursuit of high-quality video customization, several studies explore reference-based video generation, leveraging the motion of an existing real video to direct the creation of new video content. A straightforward solution, adopted in Wang et al. (2024); Esser et al. (2023); Xing et al. (2024), is the direct integration of frame-wise depth or canny maps to regularize motion. However, this approach inadvertently introduces motion-independent features, such as structures in static areas, which can disrupt the alignment of the resulting video appearance with new text. To address this issue, motion-specific fine-tuning frameworks (Zhao et al., 2023; Jeong et al., 2023) have been developed to extract a distinct motion pattern from a single video or a collection of videos with identical motion. While promising, these methods are subject to complex training processes and potential model degradation. In contrast, we present a novel motion cloning scheme that extracts temporal correlations from existing videos as explicit motion cues to guide the generation of new video content, providing a plug-and-play motion customization solution.

2.3 Attention feature control

Attention mechanisms have been shown to be vital for high-quality content generation. Prompt2Prompt (Hertz et al., 2022) illustrates that cross-attention maps are instrumental in dictating the spatial layout of synthesized images. This observation subsequently motivated a series of works on semantic preservation (Chefer et al., 2023), multi-object generation (Ma et al., 2023; Xiao et al., 2023), and video editing (Liu et al., 2023). AnyV2V (Ku et al., 2024) reveals that dense injection of both CNN and attention features facilitates improved alignment with source videos in video editing. FreeControl (Mo et al., 2023) highlights that the feature space within self-attention layers encodes structural image information, facilitating reference-based image generation. While previous methods mainly concentrate on spatial attention layers, our work uncovers the untapped potential of temporal attention layers for effective motion guidance, enabling flexible motion cloning.

3 MotionClone

In this section, we first introduce video diffusion models and temporal attention mechanisms. Then we explore the potential of primary control over sparse temporal attention maps for substantial motion guidance. Subsequently, we elaborate on the proposed MotionClone framework, which performs motion cloning by deliberately manipulating temporal attention weights.

3.1 Preliminaries

Diffusion sampling. Following pioneering work (Rombach et al., 2022), video diffusion models encode an input video $x$ into a latent representation $z_{0}=\mathcal{E}(x)$ using a pre-trained encoder $\mathcal{E}(\cdot)$. To enable video distribution learning, the diffusion model $\epsilon_{\theta}$ is trained to estimate the noise component $\epsilon$ from the noised latent $z_{t}$ that follows a time-dependent noise schedule (Ho et al., 2020), i.e.,

$$\mathcal{L}(\theta)=\mathbb{E}_{\mathcal{E}(x),\,\epsilon\sim\mathcal{N}(0,1),\,t\sim\mathcal{U}(1,T)}\left[\|\epsilon-\epsilon_{\theta}(z_{t},c,t)\|_{2}^{2}\right], \qquad (1)$$

where $t$ is the time step and $c$ is the condition signal, such as text or image. In the inference phase, the sampling process starts from standard Gaussian noise. The sampling trajectory, however, can be adjusted by incorporating guidance for extra controllability. This is typically achieved by a customized energy function $g(z_{t},y,t)$, with label $y$ indicating the guidance direction, i.e.,

$$\hat{\epsilon}_{\theta}=\epsilon_{\theta}(z_{t},c,t)+s\left(\epsilon_{\theta}(z_{t},c,t)-\epsilon_{\theta}(z_{t},\phi,t)\right)-\lambda\sqrt{1-\bar{\alpha}_{t}}\,\nabla_{z_{t}}g(z_{t},y,t), \qquad (2)$$

where $\epsilon_{\theta}(z_{t},\phi,t)$ is the unconditional prediction used for classifier-free guidance (Ho & Salimans, 2022), $\phi$ denotes the unconditional class identifier (e.g., null text for a textual condition), $s$ and $\lambda$ are guidance weights, and $\sqrt{1-\bar{\alpha}_{t}}$ converts the gradient of the energy function $g(\cdot)$ into a noise prediction, where $\bar{\alpha}_{t}$ is the noise-schedule hyperparameter, i.e., $z_{t}=\sqrt{\bar{\alpha}_{t}}\,z_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$. During sampling, the gradient of the energy function $g(\cdot)$ indicates the direction toward the generation target.
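To make the guided sampling of Eq. 2 concrete, the following is a minimal PyTorch sketch of a guided noise prediction, assuming a generic `model(z, cond, t)` noise predictor and a differentiable `energy_fn`; it is an illustration of the guidance mechanism rather than the authors' released implementation.

```python
import torch

def guided_noise_prediction(model, z_t, cond, uncond, t, alpha_bar_t,
                            energy_fn, s=7.5, lam=2000.0):
    """Sketch of Eq. 2: classifier-free guidance plus energy-based guidance.

    `model`, `energy_fn`, and the argument layout are assumptions for
    illustration; in practice the model is a pre-trained video diffusion UNet.
    """
    with torch.no_grad():
        eps_cond = model(z_t, cond, t)      # conditional noise prediction
        eps_uncond = model(z_t, uncond, t)  # unconditional (null-text) prediction

    # Gradient of the energy function g(z_t, y, t) w.r.t. the noised latent
    z = z_t.detach().requires_grad_(True)
    grad = torch.autograd.grad(energy_fn(z, t), z)[0]

    # Eq. 2: CFG term plus the rescaled energy gradient
    return eps_cond + s * (eps_cond - eps_uncond) \
        - lam * (1.0 - alpha_bar_t) ** 0.5 * grad
```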

Temporal attention. In video motion synthesis, the temporal attention mechanism is broadly applied to establish correlations across frames. Given an $f$-frame video feature $f_{in}\in\mathbb{R}^{b\times f\times c\times h\times w}$, where $b$ denotes the batch size, $c$ the number of channels, and $h$ and $w$ the spatial resolution, temporal attention first reshapes it into a 3D tensor $f_{in}^{\prime}\in\mathbb{R}^{(b\times h\times w)\times f\times c}$ by merging the spatial dimensions into the batch dimension. It then performs self-attention along the frame axis, which can be expressed as:

$$f_{out}=\mathrm{Attention}\left(Q(f_{in}^{\prime}),\,K(f_{in}^{\prime}),\,V(f_{in}^{\prime})\right), \qquad (3)$$

where $Q(\cdot)$, $K(\cdot)$, and $V(\cdot)$ are projection layers. The corresponding attention map is denoted $\mathcal{A}\in\mathbb{R}^{(b\times h\times w)\times f\times f}$, which captures the temporal relations of each spatial location.
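As a reference for the tensor bookkeeping above, the following PyTorch sketch computes the per-location temporal attention map; the linear projections `to_q` and `to_k` are assumed stand-ins for the model's actual query/key layers.

```python
import torch

def temporal_attention_map(f_in, to_q, to_k):
    """Compute the f-by-f temporal attention map per spatial location (Eq. 3).

    f_in: video features of shape (b, f, c, h, w); to_q / to_k are assumed
    nn.Linear(c, c) projections. Returns A of shape (b*h*w, f, f), whose rows
    sum to 1 along the last (temporal) axis.
    """
    b, f, c, h, w = f_in.shape
    # Merge spatial dimensions into the batch axis: (b*h*w, f, c)
    x = f_in.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
    q, k = to_q(x), to_k(x)
    attn = torch.softmax(q @ k.transpose(-1, -2) / c ** 0.5, dim=-1)
    return attn
```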

3.2 Observation

[Figure 2]

Since the temporal attention mechanism governs the motion in the generated video, videos with similar temporal attention maps are expected to share similar motion characteristics. To investigate this hypothesis, we manipulate the sampling trajectory by aligning the temporal attention maps of the generated video with those of a reference video. As depicted in Fig. 2, simply enforcing alignment on the entire temporal attention map (plain control) only partly restores the coarse motion patterns of the reference videos, such as the gait of a cat and the directional movement of a tank, demonstrating limited motion alignment. We postulate that this is because not all temporal attention weights are essential for motion synthesis, with some reflecting scene-specific noise or extremely small motions. Indiscriminate alignment with the entire temporal attention map dilutes the critical motion guidance, resulting in suboptimal motion cloning in novel scenarios. As evidence, primary control over the sparse temporal attention map significantly boosts motion alignment, which can be attributed to the emphasis on motion-related cues and the disregard of motion-irrelevant factors.

[Figure 3]

3.3 Motion Representation

Given a reference video, the corresponding temporal attention map at denoising step $t$ is denoted $\mathcal{A}_{ref}^{t}\in\mathbb{R}^{(1\times h\times w)\times f\times f}$, which satisfies $\sum_{j=1}^{f}[\mathcal{A}_{ref}^{t}]_{p,i,j}=1$. The value $[\mathcal{A}_{ref}^{t}]_{p,i,j}$ reflects the relation between frame $i$ and frame $j$ at position $p$; a larger value implies a stronger correlation. The motion guidance over temporal attention maps, expressed by the energy function $g(\cdot)$, is modeled as:

$$g=\left\|\mathcal{M}^{t}\cdot\left(\mathcal{A}_{ref}^{t}-\mathcal{A}_{gen}^{t}\right)\right\|_{2}^{2}, \qquad (4)$$

where $\mathcal{M}^{t}$ is the temporal mask imposing the sparse constraint, and $\mathcal{A}_{gen}^{t}$ is the temporal attention map of the generated video at time step $t$. Essentially, Eq. 4 promotes motion cloning by forcing $\mathcal{A}_{gen}^{t}$ to be close to $\mathcal{A}_{ref}^{t}$, while $\mathcal{M}^{t}$ determines the sparsity of the constraint; the time-dependent pair $\{\mathcal{A}_{ref}^{t},\mathcal{M}^{t}\}$ constitutes the motion guidance. In particular, $\mathcal{M}^{t}\equiv 1$ corresponds to the "plain control" that exhibits limited motion transfer capability, as illustrated in Fig. 2. Since the value of $\mathcal{A}_{ref}^{t}$ is indicative of the strength of inter-frame correlation, we propose to obtain the sparse temporal mask according to the rank of the $\mathcal{A}_{ref}^{t}$ values along the temporal axis, i.e.,

$$\mathcal{M}_{p,i,j}^{t}:=\begin{cases}1, & \text{if }[\mathcal{A}_{ref}^{t}]_{p,i,j}\in\Omega_{p,i}^{t},\\ 0, & \text{otherwise},\end{cases} \qquad (5)$$

where $\Omega_{p,i}^{t}=\{\tau_{1},\tau_{2},\dots,\tau_{k}\}$ is the index subset comprising the top-$k$ values of $\mathcal{A}_{ref}^{t}$ along the temporal axis $j$, and $k$ is a hyper-parameter. In particular, when $k=1$, motion guidance focuses solely on the highest activation at each spatial location. With the mask defined in Eq. 5, the motion guidance in Eq. 4 encourages sparse alignment with the primary components of $\mathcal{A}_{ref}^{t}$ while ensuring a spatially even constraint, facilitating stable and reliable motion transfer.
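The masked objective of Eqs. 4 and 5 reduces to a top-k selection followed by a masked squared error; a minimal PyTorch sketch, assuming attention maps of shape (h*w, f, f), is shown below.

```python
import torch

def sparse_motion_energy(A_ref, A_gen, k=1):
    """Sketch of Eqs. 4-5: align generated attention with the reference only
    on the top-k reference entries along the temporal axis.

    A_ref, A_gen: temporal attention maps of shape (h*w, f, f) whose last
    axis sums to 1. With k=1, only the strongest inter-frame correlation per
    position and query frame is constrained.
    """
    # Eq. 5: binary mask selecting the top-k reference activations along j
    topk_idx = A_ref.topk(k, dim=-1).indices
    mask = torch.zeros_like(A_ref)
    mask.scatter_(-1, topk_idx, 1.0)

    # Eq. 4: squared L2 distance restricted to the masked (primary) entries
    return (mask * (A_ref - A_gen)).pow(2).sum()
```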

Despite enabling effective motion cloning, the above scheme has two obvious flaws: i) for real reference videos, a laborious and time-consuming inversion operation is required to prepare $\mathcal{A}_{ref}^{t}$; and ii) the considerable size of the time-dependent pair $\{\mathcal{A}_{ref}^{t},\mathcal{M}^{t}\}$ poses significant challenges for large-scale preparation and efficient deployment. Fortunately, we find that the representation from a single denoising step can provide substantial and consistent motion guidance throughout the generation process. Mathematically, the motion guidance in Eq. 4 can be converted into

$$g=\left\|\mathcal{M}^{t_{\alpha}}\cdot\left(\mathcal{A}_{ref}^{t_{\alpha}}-\mathcal{A}_{gen}^{t}\right)\right\|_{2}^{2}=\left\|\mathcal{L}^{t_{\alpha}}-\mathcal{M}^{t_{\alpha}}\cdot\mathcal{A}_{gen}^{t}\right\|_{2}^{2}, \qquad (6)$$

where $t_{\alpha}$ denotes a specific time step and $\mathcal{L}^{t_{\alpha}}=\mathcal{M}^{t_{\alpha}}\cdot\mathcal{A}_{ref}^{t_{\alpha}}$. For a given reference video, the corresponding motion representation is denoted $\mathcal{H}^{t_{\alpha}}=\{\mathcal{L}^{t_{\alpha}},\mathcal{M}^{t_{\alpha}}\}$, comprising two elements that are both highly sparse along the temporal axis. For real reference videos, $\mathcal{H}^{t_{\alpha}}$ can be easily derived by directly adding noise to shift the video latent to time step $t_{\alpha}$, followed by a single denoising step. This straightforward strategy proves remarkably effective: as shown in Fig. 3, over a large range of time steps ($t_{\alpha}$ from 200 to 600), the mean intensity of $\mathcal{H}^{t_{\alpha}}$ effectively highlights the region and magnitude of motion. However, $\mathcal{H}^{t_{\alpha}}$ from the early denoising stage ($t_{\alpha}=800$) shows some discrepancies with the "head-turning" motion, which can be attributed to the fact that motion synthesis has not yet been fully determined at this early stage. Therefore, we employ the motion-aligned $\mathcal{H}^{t_{\alpha}}$ from the later denoising stage to guide motion synthesis throughout the entire sampling process, providing substantial and consistent motion guidance for superior motion alignment.
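The single-step extraction of $\mathcal{H}^{t_{\alpha}}$ can be sketched as follows, assuming a diffusers-style scheduler and UNet and a hook-based helper `read_temporal_attention` (a placeholder for capturing the attention maps of the chosen block); it is illustrative rather than the released code.

```python
import torch

@torch.no_grad()
def extract_motion_representation(unet, scheduler, z0_ref, null_text_emb,
                                  read_temporal_attention, t_alpha=400, k=1):
    """Sketch: build H^{t_alpha} = {L, M} from a real reference video with one
    noise-adding step and one denoising forward pass (no inversion).

    `read_temporal_attention` is an assumed callable returning the temporal
    attention map of shape (h*w, f, f) captured by forward hooks.
    """
    t = torch.tensor([t_alpha], device=z0_ref.device)
    noise = torch.randn_like(z0_ref)
    z_t = scheduler.add_noise(z0_ref, noise, t)   # z_t = sqrt(a_bar) z0 + sqrt(1 - a_bar) eps

    _ = unet(z_t, t, encoder_hidden_states=null_text_emb)  # single denoising step
    A_ref = read_temporal_attention()

    mask = torch.zeros_like(A_ref)                          # Eq. 5 (top-k mask)
    mask.scatter_(-1, A_ref.topk(k, dim=-1).indices, 1.0)
    return {"L": mask * A_ref, "M": mask}                   # H^{t_alpha} for Eq. 6
```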

[Figure 4]

3.4 Motion Guidance

The pipeline of MotionClone is depicted in Fig. 4. Given a real reference video, the corresponding motion representation $\mathcal{H}^{t_{\alpha}}$ is obtained by performing a single noise-adding and denoising step. During video generation, an initial latent is sampled from a standard Gaussian distribution and subsequently undergoes an iterative denoising procedure with a pre-trained video diffusion model, guided by both classifier-free guidance and the proposed motion guidance. Given that the image structure is determined in the early steps of the denoising process (Hertz et al., 2022), and that motion fidelity primarily depends on the structure of each frame, motion guidance is applied only in the early denoising steps. This leaves sufficient flexibility for semantic adjustment and thus enables premium video generation with compelling motion fidelity and precise textual alignment.
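Putting the pieces together, the overall sampling procedure can be summarized by the sketch below, where `motion_energy` is an assumed callable implementing Eq. 6 on the temporal attention maps of the chosen block (captured via hooks) and the scheduler follows a diffusers-style interface; this is a schematic of the pipeline rather than the official implementation.

```python
import torch

def motionclone_sampling(model, scheduler, motion_energy, cond, uncond, shape,
                         num_steps=100, guidance_steps=50, s=7.5, lam=2000.0):
    """Sketch of the MotionClone sampling loop: classifier-free guidance at
    every step, motion guidance (Eq. 6) only during the early steps."""
    z = torch.randn(shape)
    scheduler.set_timesteps(num_steps)
    for i, t in enumerate(scheduler.timesteps):
        with torch.no_grad():
            eps_c, eps_u = model(z, cond, t), model(z, uncond, t)
        eps = eps_c + s * (eps_c - eps_u)                   # classifier-free guidance
        if i < guidance_steps:                              # early-step motion guidance
            z_req = z.detach().requires_grad_(True)
            grad = torch.autograd.grad(motion_energy(z_req, t), z_req)[0]
            a_bar = scheduler.alphas_cumprod[t]
            eps = eps - lam * (1.0 - a_bar) ** 0.5 * grad   # Eq. 2
        z = scheduler.step(eps, t, z).prev_sample
    return z
```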

4 Experiments

4.1 Implementation details

In this work, we employ AnimateDiff (Guo et al., 2023b) as the base text-to-video generation model and leverage SparseCtrl (Guo et al., 2023a) as the image-to-video and sketch-to-video generator. For given real videos, we apply a single denoising step at $t_{\alpha}=400$ to extract the motion representation. $k=1$ is adopted for the mask in Eq. 5 to enforce the sparse constraint. A null-text prompt is uniformly used when preparing motion representations, enabling more convenient video customization. The motion guidance is applied to the temporal attention layers in "up_block.1". Detailed ablations of these settings are presented in Sec. 4.6. The guidance weights $s$ and $\lambda$ in Eq. 2 are empirically set to 7.5 and 2000, respectively. For camera motion cloning, the number of denoising steps is set to 100, with motion guidance applied in the first 50 steps; for object motion cloning, the number of denoising steps is raised to 300, with motion guidance applied in the first 180 steps.
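For quick reference, these default settings can be collected in a small configuration dictionary; the key names below are illustrative rather than taken from the released code.

```python
# Default MotionClone settings summarized from Sec. 4.1 (key names are illustrative).
MOTIONCLONE_DEFAULTS = {
    "t_alpha": 400,                    # time step for motion-representation extraction
    "top_k": 1,                        # sparsity of the temporal mask (Eq. 5)
    "guidance_block": "up_block.1",    # temporal attention layers receiving guidance
    "cfg_weight_s": 7.5,               # classifier-free guidance weight
    "energy_weight_lambda": 2000,      # motion-guidance weight
    "camera_motion": {"denoising_steps": 100, "guidance_steps": 50},
    "object_motion": {"denoising_steps": 300, "guidance_steps": 180},
}
```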

4.2 Experimental setup

Dataset. For experimental evaluation, 40 real videos sourced from DAVIS (Pont-Tuset et al., 2017) and public websites are utilized for a thorough analysis, comprising 15 videos with camera motion and 25 videos with object motion. These videos encompass a rich variety of motion types and scenarios, ranging from the dynamic motions of animals and humans to global camera motion.

Evaluation metrics. For objective evaluation, two commonly used metrics are adopted: i) textual alignment, which quantifies the congruence with the provided textual prompt; following previous work (Wang et al., 2024), it is measured by the average CLIP (Radford et al., 2021) cosine similarity between all video frames and the text (Jeong et al., 2023); ii) temporal consistency, an indicator of video smoothness, quantified by the average CLIP similarity between consecutive video frames. Beyond these objective metrics, a user study provides a more nuanced assessment of human preferences in video quality, incorporating two additional criteria: i) motion preservation, which evaluates the motion's adherence to the reference video, and ii) appearance diversity, which assesses the visual range and diversity in contrast to the reference video. The user study scores are the average ratings provided by 20 volunteers on a scale from 1 to 5.
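The two objective metrics can be reproduced with an off-the-shelf CLIP model; the sketch below assumes the HuggingFace `transformers` CLIP interface and a list of PIL frames, and is meant as an illustration of the metric definitions rather than the exact evaluation script.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_metrics(frames, prompt, model_name="openai/clip-vit-base-patch32"):
    """Average frame-text similarity (textual alignment) and average
    adjacent-frame similarity (temporal consistency) for one video."""
    model = CLIPModel.from_pretrained(model_name)
    proc = CLIPProcessor.from_pretrained(model_name)
    with torch.no_grad():
        img = model.get_image_features(**proc(images=frames, return_tensors="pt"))
        txt = model.get_text_features(**proc(text=[prompt], return_tensors="pt",
                                             padding=True))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    textual_alignment = (img @ txt.T).mean().item()
    temporal_consistency = (img[:-1] * img[1:]).sum(-1).mean().item()
    return textual_alignment, temporal_consistency
```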

[Figure 5]

Baselines. For a thorough comparative analysis, several alternative methods are examined, including VideoComposer (Wang et al., 2024), Tune-A-Video (Wu et al., 2023), Control-A-Video (Chen et al., 2023b), VMC (Jeong et al., 2023), and Gen-1 (Esser et al., 2023). A detailed description of each method is provided in the Appendix.

4.3 Qualitative Comparison

Camera motion cloning. As shown in Fig. 5, the "clockwise rotation" and "view switching" motions present a significant challenge. VMC and Tune-A-Video generate scenes with acceptable textual alignment but exhibit deficiencies in motion transfer. The outputs of VideoComposer, Gen-1, and Control-A-Video are notably unrealistic, which can be attributed to the dense integration of structural elements from the original videos. Conversely, MotionClone demonstrates superior text alignment and motion consistency, suggesting superior motion transfer capability in global camera motion scenarios.

Object motion cloning. Beyond camera motion, proficiency in handling local object motions has also been rigorously validated. As evidenced in Fig. 6, VMC falls short in matching the motion of the source videos. VideoComposer tends to generate grayish colors with limited prompt-following ability. Gen-1 is constrained by the original videos' structure. Tune-A-Video struggles to capture detailed body motions, while Control-A-Video cannot maintain a faithful appearance. In contrast, MotionClone stands out in scenarios with localized object motions, delivering enhanced motion accuracy and improved text alignment.

[Figure 6]
Table 1: Quantitative comparison on 40 real videos. The first two rows report CLIP-based metrics; the remaining rows report user-study ratings on a 1-5 scale.

| Metric | VMC | VideoComposer | Gen-1 | Tune-A-Video | Control-A-Video | MotionClone |
|---|---|---|---|---|---|---|
| Textual Alignment (CLIP) | 0.3134 | 0.2854 | 0.2462 | 0.3002 | 0.2859 | 0.3187 |
| Temporal Consistency (CLIP) | 0.9614 | 0.9577 | 0.9563 | 0.9351 | 0.9513 | 0.9621 |
| Motion Preservation (user) | 2.59 | 3.28 | 3.50 | 2.44 | 3.33 | 3.69 |
| Appearance Diversity (user) | 3.51 | 3.23 | 3.25 | 3.09 | 3.27 | 4.31 |
| Textual Alignment (user) | 3.79 | 2.71 | 2.80 | 3.04 | 2.82 | 4.15 |
| Temporal Consistency (user) | 2.85 | 2.79 | 3.34 | 2.28 | 2.81 | 4.28 |

4.4 Quantitative comparison

The quantitative comparison on 40 real videos with various motion patterns is outlined in Tab. 1. MotionClone achieves competitive scores in both textual alignment and temporal consistency. Moreover, MotionClone outperforms its rivals in motion preservation, appearance diversity, temporal consistency, and textual alignment in the human preference study, underscoring its ability to produce visually compelling outcomes.

4.5 Versatile application

Beyond T2V, MotionClone is also compatible with I2V and sketch-to-video generation. As shown in Fig. 7, by incorporating the first frame or a sketch image as an additional condition, MotionClone achieves impressive motion transfer while adhering to the specified condition, underscoring its potential for a wide range of applications.

[Figure 7]

4.6 Ablation and Analysis

Choice of $k$. $k$ determines the mask in Eq. 5 and thus the sparsity of the motion constraint. As illustrated in Fig. 8, a lower $k$ value yields better primary motion alignment, attributed to the enhanced elimination of scene-specific noise and subtle motions.

Choice of $t_{\alpha}$. The value of $t_{\alpha}$ determines the diffusion feature distribution used for preparing motion representations. As shown in Fig. 8, an excessively large $t_{\alpha}=800$ causes a substantial loss of motion information due to excessive noise injection, while $t_{\alpha}\in\{200,400,600\}$ all achieve a certain degree of motion alignment, indicating robustness to the choice of $t_{\alpha}$. We choose $t_{\alpha}=400$ as the default value, as it typically yields appealing motion cloning in our experiments.

Choice of temporal attention block. Fig. 9 illustrates the results with motion guidance applied in different blocks. "up_block.1" stands out for its superior motion manipulation capability while safeguarding visual quality, underscoring its dominant role in motion synthesis.

[Figure 8]
[Figure 9]

Does a precise prompt help? During the motion representation preparation procedure, few differences arise when using tailored prompts describing the video content, as shown in Fig. 9. We speculate that motion-related information is effectively preserved in the diffusion features at $t_{\alpha}=400$, thereby diminishing the importance of a precise prompt.

Does video inversion help? Video inversion can provide the time-dependent $\{\mathcal{A}_{ref}^{t},\mathcal{M}^{t}\}$ for Eq. 4 and the single-step $\{\mathcal{L}^{t_{\alpha}},\mathcal{M}^{t_{\alpha}}\}$ for Eq. 6, but entails considerable time costs. As shown in Fig. 9 (Inversion_1 vs. Inversion_2), $\{\mathcal{L}^{t_{\alpha}},\mathcal{M}^{t_{\alpha}}\}$ outperforms $\{\mathcal{A}_{ref}^{t},\mathcal{M}^{t}\}$ owing to the consistent motion guidance from a single representation. Meanwhile, there is no obvious quality difference with or without DDIM inversion (MotionClone vs. Inversion_2). We leave better diffusion inversion for enhanced motion cloning to future work.

4.7 Limitation

[Figure 10]

Since MotionClone operates in latent space, the spatial resolution of the diffusion features in temporal attention is significantly lower than that of the input videos; consequently, MotionClone struggles with subtle local motions, such as winking, as shown in Fig. 10. Additionally, when multiple moving objects overlap, MotionClone risks quality degradation, as the coupled motion raises the difficulty of motion cloning.

5 Conclusion

In this work, we observe that the temporal attention layers embedded within video generation models harbor substantial representational capacities pertinent to video motion transfer. Motivated by these findings, we introduce a training-free method dubbed MotionClone for motion cloning. Leveraging sparse temporal attention weights as motion representations, MotionClone facilitates motion guidance by promoting primary motion alignment, enabling diverse motion transfer across different scenarios. Employing a real reference video, MotionClone demonstrates its capability to preserve motion fidelity robustly while concurrently assimilating novel textual semantics. Furthermore, MotionClone demonstrates efficiency by avoiding cumbersome inversion processes and offers versatility across various video generation tasks, establishing itself as a highly adaptable and efficient tool for motion customization.

References

  • Bain etal. (2021)Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman.Frozen in time: A joint video and image encoder for end-to-end retrieval.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728–1738, 2021.
  • Blattmann etal. (2023a)Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, etal.Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023a.
  • Blattmann etal. (2023b)Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, SeungWook Kim, Sanja Fidler, and Karsten Kreis.Align your latents: High-resolution video synthesis with latent diffusion models.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023b.
  • Chefer etal. (2023)Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or.Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models.ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
  • Chen etal. (2023a)Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, etal.Videocrafter1: Open diffusion models for high-quality video generation.arXiv preprint arXiv:2310.19512, 2023a.
  • Chen etal. (2024)Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan.Videocrafter2: Overcoming data limitations for high-quality video diffusion models.arXiv preprint arXiv:2401.09047, 2024.
  • Chen etal. (2023b)Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin.Control-a-video: Controllable text-to-video generation with diffusion models.arXiv preprint arXiv:2305.13840, 2023b.
  • Dai etal. (2023)Zuozhuo Dai, Zhenghao Zhang, Yao Yao, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang.Animateanything: Fine-grained open domain image animation with motion guidance.arXiv e-prints, pp. arXiv–2311, 2023.
  • Esser etal. (2023)Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis.Structure and content-guided video synthesis with diffusion models.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7346–7356, 2023.
  • Gu etal. (2022)Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, BoZhang, Dongdong Chen, LuYuan, and Baining Guo.Vector quantized diffusion model for text-to-image synthesis.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10696–10706, 2022.
  • Guo etal. (2023a)Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and BoDai.Sparsectrl: Adding sparse controls to text-to-video diffusion models.arXiv preprint arXiv:2311.16933, 2023a.
  • Guo etal. (2023b)Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, YuQiao, Dahua Lin, and BoDai.Animatediff: Animate your personalized text-to-image diffusion models without specific tuning.arXiv preprint arXiv:2307.04725, 2023b.
  • Hertz etal. (2022)Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or.Prompt-to-prompt image editing with cross attention control.arXiv preprint arXiv:2208.01626, 2022.
  • Ho & Salimans (2022)Jonathan Ho and Tim Salimans.Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022.
  • Ho etal. (2020)Jonathan Ho, Ajay Jain, and Pieter Abbeel.Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020.
  • Huang etal. (2023)Lianghua Huang, DiChen, YuLiu, Yujun Shen, Deli Zhao, and Jingren Zhou.Composer: Creative and controllable image synthesis with composable conditions.arXiv preprint arXiv:2302.09778, 2023.
  • Jeong & Ye (2023)Hyeonho Jeong and JongChul Ye.Ground-a-video: Zero-shot grounded video editing using text-to-image diffusion models.arXiv preprint arXiv:2310.01107, 2023.
  • Jeong etal. (2023)Hyeonho Jeong, GeonYeong Park, and JongChul Ye.Vmc: Video motion customization using temporal attention adaption for text-to-video diffusion models.arXiv preprint arXiv:2312.00845, 2023.
  • Kim etal. (2023)Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, and Jun-Yan Zhu.Dense text-to-image generation with attention modulation.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7701–7711, 2023.
  • Ku etal. (2024)Max Ku, Cong Wei, Weiming Ren, Huan Yang, and Wenhu Chen.Anyv2v: A plug-and-play framework for any video-to-video editing tasks.arXiv preprint arXiv:2403.14468, 2024.
  • Li etal. (2023)Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and YongJae Lee.Gligen: Open-set grounded text-to-image generation.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22511–22521, 2023.
  • Liu etal. (2023)Shaoteng Liu, Yuechen Zhang, Wenbo Li, Zhe Lin, and Jiaya Jia.Video-p2p: Video editing with cross-attention control.arXiv preprint arXiv:2303.04761, 2023.
  • Ma etal. (2023)Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu.Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning.arXiv preprint arXiv:2307.11410, 2023.
  • Ma etal. (2024)Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Chenyang Qi, Chengfei Cai, Xiu Li, Zhifeng Li, Heung-Yeung Shum, Wei Liu, etal.Follow-your-click: Open-domain regional image animation via short prompts.arXiv preprint arXiv:2403.08268, 2024.
  • Mo etal. (2023)Sicheng Mo, Fangzhou Mu, KuanHeng Lin, Yanli Liu, Bochen Guan, Yin Li, and Bolei Zhou.Freecontrol: Training-free spatial control of any text-to-image diffusion model with any condition.arXiv preprint arXiv:2312.07536, 2023.
  • Nichol etal. (2021)Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen.Glide: Towards photorealistic image generation and editing with text-guided diffusion models.arXiv preprint arXiv:2112.10741, 2021.
  • Niu etal. (2024)Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, and Yinqiang Zheng.Mofa-video: Controllable image animation via generative motion field adaptions in frozen image-to-video diffusion model.arXiv preprint arXiv:2405.20222, 2024.
  • Podell etal. (2023)Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach.Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023.
  • Pont-Tuset etal. (2017)Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc VanGool.The 2017 davis challenge on video object segmentation.arXiv preprint arXiv:1704.00675, 2017.
  • Qin etal. (2023)Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, JuanCarlos Niebles, Caiming Xiong, Silvio Savarese, etal.Unicontrol: A unified diffusion model for controllable visual generation in the wild.arXiv preprint arXiv:2305.11147, 2023.
  • Radford etal. (2021)Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etal.Learning transferable visual models from natural language supervision.In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
  • Rombach etal. (2022)Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.High-resolution image synthesis with latent diffusion models.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
  • Sun etal. (2024)Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, YiWang, etal.Journeydb: A benchmark for generative image understanding.Advances in Neural Information Processing Systems, 36, 2024.
  • Wang etal. (2024)Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou.Videocomposer: Compositional video synthesis with motion controllability.Advances in Neural Information Processing Systems, 36, 2024.
  • Wang etal. (2023a)Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, YiWang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, etal.Lavie: High-quality video generation with cascaded latent diffusion models.arXiv preprint arXiv:2309.15103, 2023a.
  • Wang etal. (2023b)Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan.Motionctrl: A unified and flexible motion controller for video generation.arXiv preprint arXiv:2312.03641, 2023b.
  • Wu etal. (2023)JayZhangjie Wu, Yixiao Ge, Xintao Wang, StanWeixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and MikeZheng Shou.Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7623–7633, 2023.
  • Xiao etal. (2023)Guangxuan Xiao, Tianwei Yin, WilliamT Freeman, Frédo Durand, and Song Han.Fastcomposer: Tuning-free multi-subject image generation with localized attention.arXiv preprint arXiv:2305.10431, 2023.
  • Xing etal. (2024)Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, YHe, HLiu, HChen, XCun, XWang, YShan, etal.Make-your-video: Customized video generation using textual and structural guidance.IEEE Transactions on Visualization and Computer Graphics, 2024.
  • Yin etal. (2023)Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan.Dragnuwa: Fine-grained control in video generation by integrating text, image, and trajectory.arXiv preprint arXiv:2308.08089, 2023.
  • Zhang etal. (2024)Beichen Zhang, Pan Zhang, Xiaoyi Dong, Yuhang Zang, and Jiaqi Wang.Long-clip: Unlocking the long-text capability of clip.arXiv preprint arXiv:2403.15378, 2024.
  • Zhang etal. (2023)Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.Adding conditional control to text-to-image diffusion models.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847, 2023.
  • Zhao etal. (2023)Rui Zhao, Yuchao Gu, JayZhangjie Wu, DavidJunhao Zhang, Jiawei Liu, Weijia Wu, Jussi Keppo, and MikeZheng Shou.Motiondirector: Motion customization of text-to-video diffusion models.arXiv preprint arXiv:2310.08465, 2023.

Appendix A Appendix

A.1 Baseline description

Among the compared methods, VideoComposer (Wang et al., 2024) creates videos by extracting specific features, such as frame-wise depth or canny maps, from existing videos, achieving a compositional approach to controllable video generation. Gen-1 (Esser et al., 2023) leverages the original structure of reference videos to generate new video content, akin to video-to-video translation. Tune-A-Video (Wu et al., 2023) expands the spatial self-attention of pre-trained text-to-image models into spatio-temporal attention and then fine-tunes it for motion-specific generation. Control-A-Video (Chen et al., 2023b) incorporates the first video frame as an additional motion cue for customized video generation. VMC (Jeong et al., 2023) distills motion patterns by fine-tuning the temporal attention layers of a pre-trained text-to-video diffusion model.

A.2 More generated results

A broader array of generated content is displayed to validate the versatile generation capability. As shown in Figs. 11-14, MotionClone adeptly extracts motion cues from a diverse range of existing videos and thus enables the creation of content that is both prompt-aligned and motion-preserving, showcasing its robust motion cloning capabilities. For a better demonstration of MotionClone, we highly recommend viewing the video file at https://github.com/LPengYang/MotionClone.

A.3 Broader Impact

The development of MotionClone, a novel training-free framework for motion-based controllable video generation, carries distinct societal implications, both beneficial and challenging.

On the positive side, MotionClone’s capability to efficiently clone motions from reference videos while ensuring high fidelity and textual alignment opens new avenues in numerous fields. In the realm of digital content creation, film and media professionals can utilize this technology to streamline the production process, enhance narrative expressions, and create more engaging visual experiences without extensive resource commitments. Furthermore, in the educational sector, instructors and content creators can leverage this innovation to produce customized instructional videos that incorporate precise motions aligned with textual descriptions, potentially increasing engagement and comprehension among students. This could be particularly transformative for subjects where demonstration of physical actions or processes plays a crucial role, such as in sports training or scientific experiments.

On the negative side, the power of MotionClone to generate realistic videos based on text and existing motion cues raises concerns about its potential misuse, including the creation of deepfakes or misleading media content. Such applications can undermine trust in media, affect public opinion through the dissemination of false information, and infringe on personal rights and privacy. Moreover, the ease of generating convincing videos might enable the proliferation of propaganda or harmful content that can have widespread negative implications on society.

In conclusion, while MotionClone presents significant advancements in the field of AI-driven video generation, it is imperative that these technologies are developed and utilized with a conscious commitment to ethical standards and regulatory oversight. Promoting transparency in AI-generated content, establishing clear usage guidelines, and fostering an open dialogue about the capabilities and ethics of such technologies are crucial steps in ensuring that the benefits of MotionClone are realized while its risks are effectively mitigated. This involves collaborative efforts among technologists, policymakers, industry stakeholders, and the broader public to steer the responsible development and application of AI-driven media tools.

[Figures 11-14]