Meta's New Model Makes Videos That Look and Sound Real!
Although the exact applications of generative video models are still unclear, companies like Runway, OpenAI, and Meta are investing millions of dollars in their development. Meta's newest is called Movie Gen, and as its name suggests, it turns text prompts into fairly realistic video with sound (though, thankfully, without speech just yet). And, sensibly, the company is not releasing this one to the general public.
Movie Gen is actually a collection (or "cast," as Meta calls it) of foundation models, the most significant of which is the text-to-video model. Meta claims it beats the likes of Kling 1.5, Runway's Gen-3, and Luma Labs' most recent model, but as usual, this kind of comparison is more to demonstrate that they are in the same game than that Movie Gen wins outright. The technical particulars can be found in the paper Meta put out describing all the components.
Audio is generated to match the contents of the video, adding, for instance, engine noises that correspond with car movements, the rush of a waterfall in the background, or a crack of thunder halfway through the video when it's called for. It'll even add music if that seems relevant.
It was trained on "a combination of licensed and publicly available datasets" that Meta called "proprietary/commercially sensitive" and declined to detail further. We can only guess that means a lot of Instagram and Facebook videos, plus some partner material and a lot of other content that is inadequately protected from scrapers, a.k.a. "publicly available."
What Meta is clearly aiming for here, however, is not simply capturing the “state of the art” crown for a month or two, but a practical, soup-to-nuts approach where a solid final product can be produced from a very simple, natural-language prompt. Stuff like “imagine me as a baker making a shiny hippo cake in a thunderstorm.”
For instance, one sticking point for these video generators has been how difficult they usually are to edit. If you ask for a video of someone walking across the street, then realize you want them walking right to left instead of left to right, there's a good chance the whole shot will look different when you repeat the prompt with that additional instruction. Meta is adding a simple, text-based editing method where you can just say "change the background to a busy intersection" or "change her clothes to a red dress" and it will attempt to make that change, and only that change.
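To make the distinction concrete, here is a minimal sketch of the two workflows. Movie Gen has no public API, so the generate() and edit() functions below are stand-in stubs invented purely for illustration:

```python
# Hypothetical sketch only: Movie Gen has no public API, and generate()
# and edit() here are made-up stand-ins for the workflow the article
# describes.

def generate(prompt: str) -> str:
    """Stand-in for a text-to-video call; returns a fake video handle."""
    return f"<video for: {prompt!r}>"

def edit(video: str, instruction: str) -> str:
    """Stand-in for instruction-based editing of an existing video."""
    return f"<{video} with edit: {instruction!r}>"

# Re-prompting from scratch can reshuffle the whole shot: new framing,
# new person, new street.
take_one = generate("a woman walking across the street")
take_two = generate("a woman walking across the street, right to left")

# Instruction-based editing takes the existing video as input and
# applies one targeted change, leaving everything else alone.
edited = edit(take_one, instruction="change her clothes to a red dress")
print(edited)
```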
Camera movements are also generally understood, with things like “tracking shot” and “pan left” taken into account when generating the video. This is still pretty clumsy compared with real camera control, but it’s a lot better than nothing.
The limitations of the model are a little weird. It generates video 768 pixels wide, a dimension familiar to most from the famous but outdated 1024×768, but which is also three times 256, making it play well with other HD formats. The Movie Gen system upscales this to 1080p, which is the source of the claim that it generates that resolution. Not really true, but we’ll give them a pass because upscaling is surprisingly effective.
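For the curious, here is the arithmetic behind those numbers. This is my back-of-the-envelope math, and the 16:9 native frame of 768×432 is an assumption; only the 768-pixel width and the 1080p target come from Meta:

```python
# Back-of-the-envelope arithmetic -- mine, not Meta's. Only the 768 px
# width and the 1080p upscale target are stated; the 16:9 native frame
# (768x432) is an assumption.

native_w = 768
print(native_w / 256)                 # 3.0 -- an exact multiple of 256

# Scaling a 16:9, 768-wide frame up to full HD (1920x1080):
scale = 1920 / native_w
print(scale)                          # 2.5
print(native_w * scale, 432 * scale)  # 1920.0 1080.0
```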
Weirdly, it generates up to 16 seconds of video… at 16 frames per second, a frame rate no one in history has ever wanted or asked for. You can, however, also do 10 seconds of video at 24 FPS. Lead with that one!
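Counting frames hints at why those two odd combinations exist. This is an inference, not something Meta states, but both modes land near the same total frame count, which suggests the model generates a roughly fixed budget of frames and the frame rate is a playback decision:

```python
# Frame-count arithmetic for the two advertised modes. The "fixed frame
# budget" reading is my inference, not a claim from Meta's paper.

print(16 * 16)  # 256 frames: 16 seconds at 16 fps
print(10 * 24)  # 240 frames: 10 seconds at 24 fps
```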
As for why it doesn't do voice... well, there are likely two reasons. First, it's super hard. Generating speech is easy now, but matching it to lip movements, and those lips to face movements, is a much more complicated proposition. I don't blame them for leaving this one till later, since it would be a minute-one failure case. Someone could say "generate a clown delivering the Gettysburg Address while riding a tiny bike in circles": nightmare fuel primed to go viral.
The second reason is likely political: putting out what amounts to a deepfake generator a month before a major election is... not the best for optics. Crimping its capabilities a bit, so that malicious actors would have to do some real work to misuse it, is a practical preventive step. One certainly could combine this generative model with a speech generator and an open lip-syncing model, but you can't just have it generate a candidate making wild claims.
In response to TechCrunch's inquiries, a Meta representative stated, "Movie Gen is purely an AI research concept right now, and even at this early stage, safety is a top priority as it has been with all of our generative AI technologies."
Unlike its Llama family of large language models, Meta will not be making Movie Gen publicly available. Its techniques can be partially replicated by following the research paper, but none of the code will be released, aside from the "underlying evaluation prompt dataset," that is, the list of prompts used to generate the test videos.