What is Alibaba's EMO?

What are Alibaba, OpenAI, Haiper, Pika and others doing exactly?

Mar 08, 2024

Hey Everyone,

In late February 2024, an organization called the "Institute for Intelligent Computing" within the Chinese e-commerce juggernaut Alibaba released a paper about an intriguing new AI video generator.

With the emergence of startups like Haiper, Pika Labs and Runway, plus OpenAI’s Sora and a group in China that wants to emulate Sora, a lot of text-to-video work will actually be led by Chinese AI researchers. Some of them may have been educated in the UK or the U.S. and worked at Google, but it’s striking how many of them are Chinese-born.

So it was interesting to see what Alibaba have done with EMO.

Researchers from Chinese tech giant Alibaba unveiled a new generative AI tool that pushes the boundaries of realism in animating subjects, not unlike what Pika Labs and Haiper are working on.

Text-to-video and multimodal AI are actually fairly interesting. Tom Alison, the head of Facebook, said on Wednesday that the company is working on an AI model to “power our entire video ecosystem.” You really do have to wonder why Meta bought as many H100s from Nvidia as Microsoft did, right?

Meta recently released V-JEPA: Video Joint Embedding Predictive Architecture.

Towards an Onslaught of Synthetic Video Creations

Fair enough, so how about Alibaba's EMO? EMO stands for Emote Portrait Alive, and it is not that different from what the companies I have mentioned are doing.

EMO: Emote Portrait Alive

Alibaba Researchers: Linrui Tian, Qi Wang, Bang Zhang, Liefeng Bo

EMO Project: https://humanaigc.github.io/emote-portrait-alive/

EMO Research Paper: https://arxiv.org/abs/2402.17485

EMO PDF: https://arxiv.org/pdf/2402.17485.pdf

EMO GitHub: https://github.com/HumanAIGC/EMO

EMO is an expressive audio-driven portrait-video generation framework. Given a single reference image and vocal audio (talking or singing, for example), it can generate avatar videos with expressive facial expressions and varied head poses. The videos can be of any duration, depending on the length of the input audio.
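
To make the "any duration" point concrete, here is a minimal Python sketch of the arithmetic, not EMO's actual code. It assumes the output clip runs exactly as long as the driving audio track; the function name `frames_for_audio` and the frame rate are hypothetical, picked only for illustration.

```python
import math


def frames_for_audio(audio_seconds: float, fps: int = 30) -> int:
    """Return how many video frames are needed to cover an audio clip.

    Hypothetical helper for illustration only; not part of EMO.
    """
    if audio_seconds < 0 or fps <= 0:
        raise ValueError("audio_seconds must be >= 0 and fps must be > 0")
    # Round up so the final partial frame interval is still covered.
    return math.ceil(audio_seconds * fps)


if __name__ == "__main__":
    # A 10-second vocal clip at an assumed 30 frames per second.
    print(frames_for_audio(10.0, fps=30))
```

Under that assumed frame rate, a three-minute song works out to 5,400 frames, all animated from one still image.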

It’s not clear, of course, how this actually benefits the internet or makes businesses more productive.

This represents a major advance in audio-driven talking head video generation, an area that has challenged AI researchers for years.
