ByteDance Proposes OmniHuman-1: An End-to-End Multimodality Framework That Generates Human Videos from a Single Human Image and Motion Signals

Despite progress in AI-driven human animation, existing models often face limitations in motion realism, adaptability, and scalability. Many struggle to produce fluid body movements and rely on heavily filtered training datasets, which restricts their ability to handle diverse scenarios. Facial animation has seen notable improvements, but full-body animation remains a challenge due to inconsistencies in gesture accuracy and alignment. In addition, many frameworks are restricted to specific aspect ratios and body proportions, which narrows their applicability across media formats. Addressing these challenges requires a more flexible and scalable approach to motion learning.

ByteDance has introduced OmniHuman-1, a Diffusion Transformer-based AI model that can generate realistic human videos from a single image and motion signals, including audio, video, or a combination of both. Unlike previous methods that focus on portrait or static body animation, OmniHuman-1 incorporates omni-conditions training, enabling gesture realism, natural body movement, and human-object interaction.

OmniHuman-1 supports several forms of motion conditioning:

  • Audio-driven animation: generating synchronized lip movements and gestures from speech input.
  • Video-driven animation: replicating motion from a reference video.
  • Multimodal fusion: combining audio and video signals for precise control over different parts of the body.

Its ability to handle different aspect ratios and body proportions makes it a versatile tool for applications that require human animation and sets it apart from previous models.
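To make the three conditioning modes above concrete, here is a minimal, hypothetical sketch in Python of how such motion signals might be bundled before generation. The `MotionConditions` container and the dummy feature arrays are illustrative assumptions, not ByteDance's released API.

```python
# Hypothetical sketch (not ByteDance's API): representing the three conditioning
# modes -- audio-only, video-only, or fused audio+video -- for a single reference image.
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class MotionConditions:
    """Motion signals that drive the animation of a single reference image."""
    audio: Optional[np.ndarray] = None  # e.g. audio features per frame, shape (T, n_mels)
    pose: Optional[np.ndarray] = None   # e.g. 2D keypoints from a driving video, shape (T, K, 2)

    def mode(self) -> str:
        """Report which of the three conditioning modes this input corresponds to."""
        if self.audio is not None and self.pose is not None:
            return "multimodal fusion"
        if self.audio is not None:
            return "audio-driven"
        if self.pose is not None:
            return "video-driven"
        raise ValueError("At least one motion signal (audio or video) is required.")


# Usage: random arrays stand in for real feature extractors.
audio_feats = np.random.randn(120, 80)     # 120 frames of 80-dim audio features
pose_feats = np.random.randn(120, 33, 2)   # 120 frames of 33 2D keypoints
print(MotionConditions(audio=audio_feats).mode())                   # audio-driven
print(MotionConditions(audio=audio_feats, pose=pose_feats).mode())  # multimodal fusion
```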

Technical foundations and advantages

OmniHuman-1 employs a Diffusion Transformer (DiT) architecture, integrating multiple motion-related conditions to improve video generation. Its key innovations include:

  1. Multimodal motion conditioning: incorporating text, audio, and pose conditions during training allows the model to generalize across animation styles and input types.
  2. Scalable training strategy: unlike conventional methods that discard large amounts of data through strict filtering, OmniHuman-1 makes use of both strongly and weakly conditioned motion data, achieving high-quality animation from minimal input.
  3. Omni-conditions training: the training strategy follows two principles (a toy sampling sketch follows this list):
    • Stronger-conditioned tasks (e.g., pose-driven animation) draw on data from weaker-conditioned tasks (e.g., text- or audio-driven motion) to increase data diversity.
    • Training ratios are adjusted so that weaker conditions receive a higher weight, balancing generalization across modalities.
  4. Realistic motion generation: OmniHuman-1 excels at co-speech gestures, natural head movements, and detailed hand interactions, making it particularly effective for virtual avatars, AI-driven character animation, and digital storytelling.
  5. Versatile style adaptation: the model is not limited to photorealistic outputs; it also supports cartoon, stylized, and anthropomorphic character animation, expanding its creative applications.
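As referenced in point 3, the condition-ratio idea can be illustrated with a toy sampler that picks the conditioning modality for each training batch according to fixed weights, giving weaker conditions a larger share. The specific weights below are assumptions for illustration, not the values used in the paper.

```python
# Toy sketch of condition-ratio balancing in omni-conditions training:
# weaker-conditioned data (text, audio) is sampled more often than
# strongly-conditioned data (pose). Weights are illustrative only.
import random

rng = random.Random(0)

TRAINING_RATIOS = {
    "text": 0.45,   # weakest condition, highest sampling weight (assumed value)
    "audio": 0.35,  # assumed value
    "pose": 0.20,   # strongest condition, lowest sampling weight (assumed value)
}


def sample_condition_type() -> str:
    """Draw which conditioning modality the next training batch uses."""
    modalities, weights = zip(*TRAINING_RATIOS.items())
    return rng.choices(modalities, weights=weights, k=1)[0]


# Over many draws the batch mix approximately follows the configured ratios.
print([sample_condition_type() for _ in range(10)])
```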

Performance and benchmarking

OmniHuman-1 was evaluated against leading animation models such as Loopy, CyberHost, and DiffTED, demonstrating superior performance on most metrics:

  • Lip-sync accuracy (higher is better):
    • OmniHuman-1: 5.255
    • Loopy: 4.814
    • CyberHost: 6.627
  • Fréchet Video Distance (FVD) (lower is better; see the computation sketch after this list):
    • OmniHuman-1: 15.906
    • Loopy: 16.134
    • DiffTED: 58.871
  • Gesture expressiveness (HKV metric):
    • OmniHuman-1: 47.561
    • CyberHost: 24.733
    • DiffTED: 23.409
  • Hand keypoint confidence (HKC) (higher is better):
    • OmniHuman-1: 0.898
    • CyberHost: 0.884
    • DiffTED: 0.769
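FVD compares the distributions of features extracted from real and generated videos using the Fréchet distance. The sketch below applies the standard formula to already-extracted feature vectors; the random arrays stand in for real video embeddings (e.g., I3D features), and this is a generic illustration, not ByteDance's evaluation code.

```python
# Generic Fréchet distance between Gaussians fit to two feature sets, as used by FVD.
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two (N, D) feature matrices."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))


# Usage with dummy features standing in for embeddings of real vs. generated videos.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(256, 64))
fake = rng.normal(0.1, 1.0, size=(256, 64))
print(frechet_distance(real, fake))  # small value: the two sets are nearly identical
```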

Ablation studies further confirm how important it is to balance pose, reference-image, and audio conditions during training in order to achieve natural and expressive motion generation. The model's ability to generalize across different body proportions and aspect ratios gives it a clear advantage over existing approaches.

Conclusion

OmniHuman-1 represents a significant step forward in AI-driven human animation. By integrating omni-conditions training and a DiT-based architecture, ByteDance has developed a model that effectively bridges the gap between static image input and dynamic, lifelike video generation. Its ability to animate human figures from a single image using audio, video, or both makes it a valuable tool for virtual influencers, digital avatars, game development, and AI-assisted filmmaking.

As AI-generated human videos become more sophisticated, OmniHuman-1 signals a shift toward more flexible, scalable, and adaptable animation models. By addressing long-standing challenges in motion realism and training scalability, it lays the groundwork for further advances in generative AI for human animation.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don't forget to join our 75k+ ML SubReddit.
