OmniHuman-1: Advancing Single-Image Video Generation
TikTok parent ByteDance's new GenAI video model
Input: first frame + audio track
Output: a video of the subject with synchronized speech, natural motion, and lip sync
ByteDance has introduced OmniHuman-1, a sophisticated AI model that transforms single images into dynamic videos, complete with natural movements and synchronized speech.
The model represents a significant step forward in video synthesis technology, particularly in its ability to handle full-body animation and complex motion patterns.
Before you get too excited: there are no downloads, and you can’t try it yourself on any service right now. For now, it’s research only.
“If you created a TikTok video there’s a good chance you’re now in a database that’s going to be used to create virtual humans.”
Freddy Tran Nager, Clinical associate professor of communications at the University of Southern California’s Annenberg School for Communication and Journalism
Technical Innovation Through Scale
At the heart of OmniHuman-1's capabilities lies an impressive training dataset of over 18,700 hours of human video data. This extensive dataset, combined with a novel "omni-conditions" training approach, enables the model to process multiple types of inputs simultaneously, including text, audio, and physical poses.
The model employs a Diffusion Transformer (DiT) architecture, merging diffusion-based generative modeling with transformer-based sequence handling. This allows for one-stage, end-to-end generation of output videos, streamlining what traditionally required multiple specialized models. Notably, the audio and video come out of a single pass, without post-hoc syncing tricks.
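The paper doesn't include code, but as a rough mental model of how a DiT block might fuse these conditions during denoising, here is a minimal PyTorch sketch. Every class, name, and tensor shape below is a hypothetical illustration of the general idea, not ByteDance's implementation.

```python
import torch
import torch.nn as nn

class OmniConditionDiTBlock(nn.Module):
    """Hypothetical sketch of one DiT block that attends over noisy video
    latents while injecting reference-image, audio, and pose conditions."""

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, cond_tokens: torch.Tensor) -> torch.Tensor:
        # Self-attention over the noisy spatio-temporal video latents.
        x = video_tokens
        x = x + self.self_attn(self.norm1(x), self.norm1(x), self.norm1(x))[0]
        # Cross-attention into the concatenated conditions
        # (reference image + audio features + optional pose features).
        x = x + self.cross_attn(self.norm2(x), cond_tokens, cond_tokens)[0]
        return x + self.mlp(self.norm3(x))


# One denoising step: conditions of different modalities are projected to a
# shared width and concatenated along the sequence axis before cross-attention.
dim = 1024
block = OmniConditionDiTBlock(dim)
video_tokens = torch.randn(1, 256, dim)   # noisy video latent tokens
image_tokens = torch.randn(1, 64, dim)    # reference-image embedding
audio_tokens = torch.randn(1, 128, dim)   # speech features (e.g. wav2vec-style)
pose_tokens = torch.randn(1, 128, dim)    # optional driving-pose features
cond = torch.cat([image_tokens, audio_tokens, pose_tokens], dim=1)
denoised = block(video_tokens, cond)
print(denoised.shape)  # torch.Size([1, 256, 1024])
```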
Key Capabilities
OmniHuman-1 stands out in several areas:
Flexible Input Handling: The model works with images of any aspect ratio, from portraits to full-body shots (a generic preprocessing sketch follows this list). It is limited, though, to a single image and a single audio input.
Natural Motion Generation: Produces realistic co-speech gestures and body movements synchronized with audio
Style Versatility: Supports both photorealistic and stylized outputs, suitable for various creative applications
Comprehensive Animation: Handles everything from subtle facial expressions to complex full-body movements, whereas previous research in this area typically covered only the head and shoulders (the ‘bust’)
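OmniHuman-1's preprocessing pipeline isn't documented. A common way to accept arbitrary aspect ratios, though, is to letterbox the reference image into the nearest supported resolution bucket; the sketch below illustrates that generic technique only, and the bucket list and function name are assumptions rather than anything from the paper.

```python
from PIL import Image, ImageOps

# Hypothetical set of resolution "buckets" (width, height); a model would
# typically pick the bucket whose aspect ratio is closest to the input's.
BUCKETS = [(512, 512), (512, 896), (896, 512), (640, 768), (768, 640)]

def fit_to_bucket(path: str) -> Image.Image:
    """Letterbox a reference image into the closest aspect-ratio bucket,
    preserving the whole subject instead of cropping it."""
    img = Image.open(path).convert("RGB")
    ratio = img.width / img.height
    target = min(BUCKETS, key=lambda wh: abs(wh[0] / wh[1] - ratio))
    # ImageOps.pad resizes to fit, then pads the remainder (letterboxing).
    return ImageOps.pad(img, target, color=(0, 0, 0))

# Usage: portrait, square, and full-body shots all map to a valid bucket.
frame = fit_to_bucket("reference.jpg")
print(frame.size)
```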
Performance Benchmarks
The model has demonstrated impressive metrics across key performance indicators:
Lip-sync accuracy score of 5.255, surpassing several established baseline methods
Significantly improved gesture expressiveness, scoring 47.56 compared to baseline scores of 24.73
Enhanced video quality, with a Fréchet Video Distance of 15.9 (lower is better)
Practical Applications
This technology opens up numerous possibilities:
Content Creation: Streamlined video production from still images
Virtual Presentations: Enhanced digital communications and remote interactions
Character Animation: Efficient creation of animated characters for various media
Educational Content: Dynamic video content generation from static resources
Limitations
Inputs
Only single 🖼️ human image input
Motion signals: 🔊 audio only, 📽️ video only, or both
No textual guidance input is mentioned
Output
Single 📽️ video with 🔊 audio, with motion including lip syncing and gestures
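Since there is no public API or checkpoint, the following is purely a hypothetical sketch of what an inference interface matching the inputs and output above could look like; every name here (GenerationRequest, generate_video, the file paths) is invented for illustration.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class GenerationRequest:
    """Hypothetical request mirroring the documented inputs:
    one human image, plus audio and/or a driving video for motion."""
    reference_image: Path               # single human image (required)
    audio: Path | None = None           # speech/song driving lip sync and gesture
    driving_video: Path | None = None   # optional pose/motion reference

    def validate(self) -> None:
        if self.audio is None and self.driving_video is None:
            raise ValueError("Provide audio, a driving video, or both.")

def generate_video(request: GenerationRequest, out_path: Path) -> Path:
    """Placeholder: a real implementation would run the diffusion model and
    write a single video with synchronized audio, lip sync, and gestures."""
    request.validate()
    ...
    return out_path

# Example call shape only; no such service or checkpoint exists today.
req = GenerationRequest(reference_image=Path("portrait.jpg"), audio=Path("speech.wav"))
```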
Looking Forward
While OmniHuman-1 represents notable progress in AI video synthesis, ByteDance has emphasized their commitment to responsible development and implementation.
I’d be interested in seeing this in an open-source offering with ControlNet-style features such as OpenPose built in.
I would bet that within six months we’ll have an OpenPose/ControlNet UI that lets us textually direct an entire scene from a single image. And I believe that will be doable on my RTX 4090. Watch this open-source space.