EPISODE · Jan 23, 2025 · 45 MIN

Video generation with realistic motion

from Changelog Master Feed · host Practical AI LLC

We seem to be experiencing a surge of video generation tools, models, and applications. However, video generation models generally struggle with some basic physics, like realistic walking motion. This leaves some generated videos lacking true motion with disappointing, simplistic panning camera views. Genmo is focused on the motion side of video generation and has released some of the best open models. Paras joins us to discuss video generation and their journey at Genmo.

Sponsors:
Domo – The AI and data products platform. Strengthen your entire data journey with Domo’s AI and data products.

Featuring:
Paras Jain – LinkedIn, X
Chris Benson – Website, GitHub, LinkedIn, X
Daniel Whitenack – Website, GitHub, X

Show Notes:
Genmo

Upcoming Events: Register for upcoming webinars here!

TRANSCRIPT · AUTO-GENERATED

Welcome to Practical AI, the podcast that makes artificial intelligence practical, productive, and accessible to all. If you like this show, you will love The Changelog. It's news on Mondays, deep technical interviews on Wednesdays, and on Fridays an awesome talk show for your weekend enjoyment. Find us by searching for The Changelog wherever you get your podcasts. Thanks to our partners at Fly.io. Launch your AI apps in five minutes or less. Learn how at fly.io.

Welcome to another episode of the Practical AI podcast. This is Daniel Whitenack. I am CEO at Prediction Guard, and I'm joined as always by my co-host, Chris Benson, who is a principal AI research engineer at Lockheed Martin. How are you doing, Chris?

Doing great.

Happy New Year. This is our first show of 2025.

Happy New Year. Yeah, this is the first one we're recording for the year, the first time jumping back on the mics to talk about AI, and definitely something that I think will be a theme in 2025, which will be, of course, multimodal AI in general. But something that a lot of people are wondering about, in terms of where it's going to go in 2025, is video generation. So we're very pleased to have Paras Jain with us today, who's CEO at Genmo. How are you doing?

I'm doing great. Happy New Year, everyone.

It's really wonderful to be here.

Yeah, welcome. I know we've been trying to make this one happen for a little while, and I think the timing worked out well because, like I say, people are thinking a lot about video generation and how it will evolve in 2025. A lot of our listeners have just started thinking about this topic recently, but you've been thinking about it deeply for some time. Could you give us a sense of what led up to video generation and where it was in 2024, and then, as we're entering this new year, what's the current state of video generation more generally, in terms of what people can access in actually released systems or released models?

Yeah, absolutely. There's been a long path to where we are today. Given so much excitement in what you might call the left brain of AI, that is, language models, reasoning, your o-series of models, I think the right brain has kind of lagged in progress for quite a while. People didn't really widely use creative AI at a huge scale. I think video is the ultimate creative modality here. If you think about it, so much of how we communicate as humans is through visual mediums, and specifically video, through motion. So video, as this ultimate form of creative multimodal synthesis, was always really exciting, but the technology was far behind what people really wanted from it. It's interesting: my co-founder worked on some of the earliest image generation models, and then 3D generation, and video was always this big modality we wanted to target. What I think was really interesting in 2023 and 2024 was first the development of image generation, which is kind of a precursor to video. But even then, the gap from image generation models to video generation models was always really big, because if you think about it, an image might have thousands of pixels, or even a million pixels, but a video would have hundreds of millions or even a billion pixels in just a short clip. So there was a huge gap to cross. Compute has scaled a lot, and that has enabled larger models. And top of mind, since we last rescheduled this podcast, Sora came to market. I think that was really exciting. That was a watershed moment for a lot of people to see what was possible with video generation. To me, this is a really early bellwether of what is to come. I think we're still really early here.

Yeah, and for those that don't know, Sora is from OpenAI, right?

Yeah, correct.

Yeah, and you talk about some of the challenges with video generation being a different kind of animal. Longtime listeners of the show will know we've had episodes about Stable Diffusion and those sorts of models for image generation. If you want to be a video generation model builder, what do you have to think about differently, both in terms of the type of model you would use and the process you'd have to go through in terms of curating data and that sort of thing?

Yeah. I think first and foremost, video is really data intensive. Just think about it compared to images or text: text is tiny, images are more expensive, and video is something like a hundred times more in terms of data volume for a short clip than you might have for images. So when you think about training these models, that's really the most important challenge: how do you build architectures and then systems that can scale to process large datasets?
That was a big bottleneck for the community. Again, Genmo has been innovating heavily to actually make that possible. But I think this was why progress took a little bit longer to come to market than, say, for images or language. Companies that are training these models are having to curate massive-scale datasets, easily in the petabytes of data, just to pre-train them, and that's really intensive. Many practitioners are beginning to fine-tune these models too, but even that remains more challenging than with, for example, Stable Diffusion.

That's got to make it hard for new entrants to come into the field. Just the sheer volume of what you have to set up ahead of time to handle that is probably, I would imagine, beyond what most organizations are really able to do unless they have specific expertise or experience in the area.
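
[Editor's sketch] The scale gap described in this exchange can be checked with back-of-the-envelope arithmetic. The resolutions, frame rate, and clip length below are illustrative assumptions, not figures from the episode; the point is only that a few seconds of video can hold on the order of a hundred times more pixels than a single still image.

```python
# Back-of-the-envelope pixel counts.
# All resolutions, the frame rate, and the clip length are assumed values
# chosen for illustration; none of them come from the episode.

def image_pixels(width: int, height: int) -> int:
    """Number of pixels in one still image."""
    return width * height


def video_pixels(width: int, height: int, fps: int, seconds: int) -> int:
    """Number of pixels in a clip: pixels per frame times number of frames."""
    frames = fps * seconds
    return width * height * frames


still = image_pixels(1024, 1024)                  # roughly a million pixels
clip = video_pixels(1280, 720, fps=30, seconds=5)  # a short 720p clip
ratio = clip / still

print(f"image: {still:,} pixels")
print(f"clip:  {clip:,} pixels")
print(f"ratio: {ratio:.0f}x")
```

Under these assumptions the clip lands in the hundreds of millions of pixels, which is consistent with the rough "hundred times" figure quoted in the conversation.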

Yeah, absolutely. It took us a long time to get ready to pre-train models. We'll talk about this more, but we open sourced one of the state-of-the-art video generation models, and part of the goal was to let other people have a chance: let them pick up a model and begin to fine-tune it, skipping past a massive volume of technical infrastructure they otherwise would have needed to build.

Maybe talk a little bit about the data side, because I know this has been a struggle on the text side, but I think especially on the image and video side, where there are a lot of questions about where you can actually source all of this video and imagery and what rights are associated with it. I also imagine there's definite curation needed for the prompts. I've seen people prompt with things like "generate this, shot with a Canon DSLR" or whatever, and all of that has to be curated on the prompt side as well. So could you talk a little bit about that data curation: where you could even get videos, and then the curation process?

Yeah. I think pre-training in general, and that's true for image models, text models, audio models, and video, relies on large volumes of internet-scale data. But what's uniquely challenging with video is that it's easy to get drowned out in the noise. One angle here that I think is really interesting, and that we, for example, are already known for, is: how do we learn high-quality motion with a video model?
It turns out the vast majority of video you find on the internet doesn't move. It's a static object, or it's someone talking. If you think about it, that doesn't actually teach a generative model about the world. It doesn't teach it about physics, it doesn't teach it about how objects interact, and so it's not going to learn strong reasoning. The way we think about it, the goal with the video model is to learn physics and realism and the laws that govern our world: inertia, mass, optics, fluid dynamics, all these base properties and how they interact. That's really the goal with video generation, to learn an engine that can simulate this. The output is a video that we can consume, and it's creative and it's beautiful, but the hard step is finding data that can really help you learn these base rules of the world. This was one of the most fundamental gaps we had to cross. It's kind of non-trivial.
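
[Editor's sketch] One simple way to picture the curation problem described above is a filter that scores clips by how much they change from frame to frame and drops the static ones. This is a hypothetical illustration, not Genmo's pipeline: the `motion_score` and `keep_clip` helpers and the threshold are assumptions, and plain frame differencing cannot tell subject motion from a camera pan, which the episode treats as a separate concern.

```python
from typing import Sequence

# One grayscale frame, flattened to a list of pixel values in [0, 1].
Frame = Sequence[float]


def motion_score(frames: Sequence[Frame]) -> float:
    """Mean absolute per-pixel change between consecutive frames."""
    total = 0.0
    count = 0
    for prev, cur in zip(frames, frames[1:]):
        for a, b in zip(prev, cur):
            total += abs(b - a)
            count += 1
    # A clip with fewer than two frames has no measurable motion.
    return total / count if count else 0.0


def keep_clip(frames: Sequence[Frame], threshold: float = 0.02) -> bool:
    """Keep a clip only if its motion score reaches the (assumed) threshold."""
    return motion_score(frames) >= threshold


# A static clip: the same 4-pixel frame repeated four times.
static_clip = [[0.5, 0.5, 0.5, 0.5]] * 4

# A moving clip: a single bright pixel travels across the frame.
moving_clip = [
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]

print(keep_clip(static_clip), keep_clip(moving_clip))
```

A production filter would likely need something closer to optical flow or a learned motion estimator to separate a moving subject from a moving camera; the sketch only shows the shape of the idea.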

I'm curious, as you're describing that, I would imagine that some things you're training for are harder for the model to learn than others. If you narrow it down to just animals, mammals, and humans, they move differently, and the physiology and anatomy are a bit different across those, and that all has to somehow be inferred by the model if it's going to make a video that's realistic. You talked about motion being so important. What are some of the harder things to get right over time, not just where you're at today, but for the industry, and what did you struggle with early on?

Yeah, it's kind of funny that one of the test cases people now use to test different video generators is gymnastics. There are hilarious videos you'll see of Sora or other video generators doing gymnastics, and I think the reason is that video generation models just can't do it right now. It's really complex human motion. It's really rare.

I talked about data curation, for example. There isn't that much complex motion out in nature; we rarely see people doing twists and twirls and backflips. So what's interesting is that it requires a fundamental understanding of how human kinematics behave for you to synthesize it properly without it feeling disturbing, and this has been one of the challenges for people. For example, we've gone through three foundational pre-training runs in our history as a company, and what's interesting with Mochi, which is our latest model, and the prior one, Replay, was that walking was actually a really basic thing that was really hard to nail. It turned out that most video generators in early to mid 2023 would make humans hover as if they were hovercraft. They would not walk; they would just levitate off the ground and move. So the models were not capable of synthesizing, forget gymnastics, just walking. That was one of the critical watershed moments we had to cross as a company.

I would suggest that might have been some folks who had a little too much bourbon in their eggnog over the holidays, that floating thing. I've definitely seen the Jedi vibe, which is kind of cool in one respect but not that awesome if you don't want it.

One thing here is that we've invested heavily in evaluation infrastructure at Genmo, and part of it is how you benchmark these capabilities. One of the test cases we have is a woman drinking a glass of water with ice. You want to check: does the ice move realistically? Does the water flow? But what's also interesting is that, once in a blue moon, the character will try to drink the water through the side of the glass, which is just not physically coherent or consistent. You see this with some competitive models, and it's something we had been trying to develop. I think that isolated test case alone communicates a lot about a video generation model's capability to understand the laws of reality.

Right. Yeah, it's a Jedi mind trick. You should not be able to do that, right? When you're using those test cases that you've developed, is that a lot of human review? How do you create the tooling around that? I know there are comparisons between this image and that image, or this frame and that frame, where you can compare closeness and all that. But there could be a lot of closeness in the overall image, and if the woman is drinking from the side of the glass, that's a major failure even though everything around it is really good.

Yeah. Look, there's a lack of external, publicly available quantitative benchmarks. One of the ones that is publicly available is these leaderboards; Artificial Analysis has a video generation leaderboard. We are the number one open-source model there, and kind of neck and neck with closed models. That's just human preferences: hundreds of thousands of people look at two videos side by side and say this one's better or that one's better, and you get a chess-style Elo rating. I think this has been one of the better public benchmarks. Internally, though, one of the ways we think about this is that, as we're measuring capabilities such as world understanding and physics, it's very hard for a human to actually rate by that. It turns out that when we as humans look at two videos side by side, the question being asked is: which one do you prefer?
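
[Editor's sketch] The chess-style Elo rating mentioned above turns a stream of pairwise "this one's better" votes into a score per model. The version below is the textbook Elo update, shown only to illustrate the mechanism; the starting rating, the K-factor, and the model names are assumptions, and the leaderboard's actual methodology is not described in the episode.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one pairwise vote: shift points from the loser to the winner."""
    gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


# Hypothetical models, both starting from the same assumed rating.
ratings = {"model_a": 1000.0, "model_b": 1000.0}

# Each tuple is one human vote: (preferred video's model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_b"),
    ("model_b", "model_a"),
]

for winner, loser in votes:
    update_elo(ratings, winner, loser)

print(ratings)
```

Because an upset win against a higher-rated model moves more points than an expected win, the ratings settle toward a ranking that reflects the overall preference pattern rather than the raw vote count.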

You often prefer the one that might have slightly higher resolution or more detail. But if I'm going to use this in an actual production application, like film production or gaming or something else, I probably care more about the motion. So we actually have to override that first-order human intuition to select for detail, and use these test cases as a kind of functional testing of how we measure these capabilities. In my career, I actually started in self-driving. I worked at one of the early companies applying deep learning to self-driving perception, and I take a lot of inspiration from how we built functional safety testing for deep learning systems. In that approach you enumerate the test cases and use cases, and you can actually say yes or no: did you pass that test case scenario? Whether a human has to do that review, and we're starting to think about more automated metrics, producing more structured forms of evaluation is, I think, really important, because otherwise the world is just too intricate for us to test everything, right?
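
[Editor's sketch] The functional-testing idea described above amounts to enumerating scenarios, recording a yes/no verdict for each, and tracking the fraction that pass. The scenario descriptions below are drawn from examples mentioned in the conversation, but the verdicts and the `pass_rate` helper are made up for illustration.

```python
from typing import List, Tuple

# Each entry is (scenario description, reviewer's yes/no verdict).
# The verdicts are invented for illustration only.
CaseResult = Tuple[str, bool]


def pass_rate(results: List[CaseResult]) -> float:
    """Fraction of enumerated test cases that passed."""
    if not results:
        return 0.0
    passed = sum(1 for _, ok in results if ok)
    return passed / len(results)


results: List[CaseResult] = [
    ("ice moves realistically in the glass", True),
    ("water flows as the glass tilts", True),
    ("character drinks from the rim, not through the side", False),
    ("person walks instead of hovering", True),
]

print(f"pass rate: {pass_rate(results):.0%}")
```

Comparing this number across model sizes or dataset versions gives the kind of per-capability progress signal that a side-by-side preference vote does not isolate.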
So we have to go use case by use case and just measure progress. It turns out that as you scale the models and scale the datasets, we begin to see percentage completion rates improve, and this gives us a semi-quantitative benchmark of progress.

Friends, AI is transforming how we do business, but we need AI solutions that are not only ambitious, but practical and adaptable too. That's where Domo's AI and data products platform comes into play. It's built for the challenges of today's AI landscape. With Domo, you and your team can channel AI and data into innovative uses that deliver measurable impact. While many companies focus on narrow applications or single-model solutions, Domo's all-in-one platform is more robust, with trustworthy AI results without having to overhaul your entire data infrastructure, secure AI agents that connect, prepare, and automate your workflows, helping you and your team to gain insights, receive alerts, and act with ease through guided apps tailored to your role, and the flexibility to choose which AI models you want to use. Domo goes beyond productivity. It's designed to transform your processes, helping you make smarter and faster decisions that drive real growth. And it's all powered by Domo's trust, flexibility, and years of expertise in data and AI innovation. And of course, the best companies rely on Domo to make smarter decisions. See how Domo can unlock your data's full potential. Learn more at ai.domo.com. That's ai.domo.com.

So Paras, you mentioned this history of pre-training at Genmo and the most recent model, which of course we want to talk about. I'm sure that most recent model is informed by things you tried in the past. Could you give a snapshot of the history of your team and how they approached this problem?

How you all approached this problem, and the kind of generations that you went through with that?

Absolutely. So we're just about two years old at this point. We actually started work on the company at Christmas 2022, over the holiday. Ajay and I, the co-founders of the company, are first and foremost brothers. I think that's really unique. We didn't really plan to start a company as brothers, and it's a little weird. Normally you have sibling rivalries and things like that. We didn't have much of that, but it turned out our skill sets were super complementary. Both of us were doing our PhDs at UC Berkeley. I was working on large-scale distributed systems in the UC Berkeley AMPLab and RISELab. This is the same lab that created Apache Spark and Ray and the Anyscale project. So, really hardcore machine learning systems for scaling large models; that was my dissertation topic. Concurrent to that, Ajay was working on the foundations of modern image generation. He had joined Berkeley to work on early image generation models. This was in the GAN era. And I think for him, one deeply unsatisfying thing was that a generative adversarial network was like a mirage. It wasn't actually a grounded loss objective that was learning real motion or dynamics. It was more like a game, where you got image generation as a side artifact. So his story was really interesting in that he ended up writing the DDPM paper, the denoising diffusion probabilistic models paper, which is one of the foundations for how we think about image generation with diffusion today. It's one of the most highly cited papers in the area, and it came from that early inclination: how do we build video and image models that understand physics and realism, instead of artificially playing this game that results in image generation, and ground it in real generative pre-training?

So that's some of the early academic history of the company. But starting the company, we decided to do video because it seemed impossible. Back in 2022, it was completely outside the frontier, and we said, fundamentally, we need a new architecture to solve this. So let us discover, from a distributed systems perspective but also a machine learning perspective, what the right approach is. So yeah, it's been about two years since founding. We've gone through three large pre-training runs, and each time we learn something new about the world and integrate that into our approach, our framework, and our architecture for how we train these models. I think the single underpinning thing, though, is motion. We always joke that the name Genmo doesn't really have an explanation, but we retroactively apply this idea of generative motion: Gen-mo. We care so much about motion in video that it's really a core element of our founding history and our framework for how we approach video generation.

I had a question as you were talking through it. You talked about that evolution, starting with GANs, the generative adversarial networks, and finding your way across the architectural progression. Could you talk a little bit about that? If you were coming into it during the age of GANs, and that was the thing, I'm curious, at a high level, what was the problem with that?

Why did that not work for you? What did you look to next? Could you give us a highlight, skipping over the top of a couple of the major architectural twists and turns, to give us a sense of what your journey might have been like?

Yeah. I think the earliest image generation models that started to work well were autoregressive image generation models. This is very similar to a large language model. You take an image and you make it a single vector, a line. So if it's a 28 by 28 image, now you have 784 pixels in a straight line, and you just go one by one and decode the next one. That was the earliest form of image generation: models like PixelRNN or PixelCNN, or Image GPT from OpenAI, were the earliest works here that worked well. But the problem is that images have millions of pixels, so this would never scale to produce high-resolution images. What's interesting is that when Ajay was working on this early on, in 2018 or 2019, I remember he trained an autoregressive image generation model. The first models he trained were trained on LSUN, which is basically a dataset of bedrooms. What was so interesting was that, in a little 5 by 5 or 10 by 10 pixel region, it would start to put artwork in the background of people's bedrooms. Why?

Because that's just what nature looks like. That's what real estate listings look like. But it was the first indication of AI-generated art, in some sense, with an early image generation model. The problem is this would not scale, because you're going pixel by pixel by pixel. It would take hours to make a small image. So GANs were the major approach that really worked well for this. GANs are trained with this generative adversarial objective. It's a kind of dueling game between a generator and a discriminator. But they were really hard to train. It turned out they would go into these bad states where you might get mode collapse, for example, which was one of the biggest issues. It meant you could produce images of a single domain, but you couldn't produce everything in the world with GANs. You'd get a really good model for making faces, or a really good model for making bedroom pictures, or a really good model of tigers, but it was really hard to, say, train a model on all of ImageNet, meaning cover thousands and thousands of different categories. So diffusion models were a really exciting approach that Ajay began to work on, because they had the potential to provide that kind of mode coverage. You could learn diverse representations of the world that generalize beyond a single domain, like faces or animals, to everything. That was what resulted in DDPM. Since then you've had latent diffusion, the Stable Diffusion approach, and video generation is, I think, the next major evolution. But the learning paradigm has mostly remained similar: learning through this diffusion setup, a kind of iterative denoising. That's the formulation of the diffusion problem. It's remarkable to see how far that has scaled, literally 10,000x in pixel scale, from the earliest diffusion models to where we are with video generation.

I guess on that front, how did you decide? I know that part of what you've done, and I'm assuming the intention with the models you've created, is to open source them in one way or another and, as you mentioned earlier, release things into a community where people could experiment and try things and fine-tune. How did you think about the size of the model? Was that purely driven by what was needed to produce a certain size of video, a certain resolution, or a certain performance metric you were after? How did you make some of those trade-off decisions, maybe also given the compute you had access to?

Yeah. First of all, pre-training is incredibly GPU intensive. We've accessed more than 1,000 NVIDIA H100-grade GPUs. But I think it's also a question of how you utilize that hardware effectively, and one of the critical challenges with video is that it has really long sequence lengths. Training a video generation model is equivalent to training a million-token-length context window for a language model. This introduces a huge set of challenges that are kind of orthogonal to the parameter scaling you might typically see with large language models. What I think is interesting, though, is that certain capabilities only emerge at certain parameter scales. I talked about walking: it's very difficult to get walking to work with a one or two billion parameter model, or something smaller than that. It just won't learn that capability. So you do need a certain amount of scale for it to work. But at the same time, you're not seeing models at the hundred billion or trillion parameter scale, as you see with a frontier-grade language model. So we open sourced Mochi 1.
It's a 10 or 11 billion parameter model. So it's big, a lot bigger than your conventional older-generation video generation models, but it can still run on a consumer-grade GPU. People can access it and they can use it. That was a very intentional choice by us, to right-size it for the community while making sure it wasn't so small as to limit its capabilities.

One of the things I've noticed over time, as I've experimented with different video generation demos or products, is that you can only generate so much. I'm imagining that, as you mentioned, there's a sequence being generated in a way similar to a sequence generated out of a language model, with iterations of calling the model, which is more compute intensive the more you generate. Is that a true assumption about video models? I think people are somewhat familiar, at least if they've been around the podcast or have done their own research, with how language models generate tokens. I have a prompt, the model generates a token, that token is added to my prompt, and then I iteratively generate another token. So the model is being called more the more I'm generating. Is that same thing true for generating these sequences of video? What are the concerns around actual compute and usage of these models in a realistic environment?
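
[Editor's sketch] The token-by-token loop in this question, and the contrast Paras draws in the answer that follows (a fixed number of denoising passes, each updating every value at once), can be pictured with toy stand-ins. Nothing below is a real model: `fake_next_token` and `fake_denoise` are hypothetical placeholders, and the step counts simply echo the rough numbers quoted in the conversation.

```python
import random
from typing import List, Tuple


def fake_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for one language model forward pass."""
    return len(context) % 7


def generate_autoregressive(prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """One forward pass per new token; each token is appended to the context."""
    tokens = list(prompt)
    calls = 0
    for _ in range(n_new):
        tokens.append(fake_next_token(tokens))
        calls += 1
    return tokens, calls


def fake_denoise(sample: List[float]) -> List[float]:
    """Hypothetical stand-in for one diffusion forward pass.

    It shrinks every value a little, so the whole sample changes at once.
    """
    return [v * 0.9 for v in sample]


def generate_diffusion(n_values: int, n_steps: int, seed: int = 0) -> Tuple[List[float], int]:
    """Start from pure noise and refine all values together for n_steps passes."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in range(n_values)]
    calls = 0
    for _ in range(n_steps):
        sample = fake_denoise(sample)
        calls += 1
    return sample, calls


tokens, ar_calls = generate_autoregressive([1, 2, 3], n_new=1000)
video, diff_calls = generate_diffusion(n_values=10_000, n_steps=50)

print(f"autoregressive: {ar_calls} forward passes for {ar_calls} new tokens")
print(f"diffusion:      {diff_calls} forward passes for {len(video)} values")
```

In the autoregressive loop the number of model calls grows with the length of the output, while in the diffusion loop it is set by the number of denoising steps; the cost of each diffusion pass, however, grows with how many values it has to process.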

Yeah, I think video generation models share common elements with large language models, but they also differ in some key ways So first and foremost a language model decodes tokens autoregressively one at a time So if you want to generate, you know, a thousand tokens Whatever let's say 500 words you need to do 500 forward pass or a thousand four passes the model in a video generation model Like every pixel is kind of generated at once So each four pass produces all of the pixels that you see in your video across space and time And we do multiple denoising steps We start with a kind of a pure noise sample and through maybe 50 or 100 four passes all of those pixels eventually become full resolution So you actually see if you use our product we stream those pixels as they're getting denoised in real time to your browser So you'll see a full video like not just a frame but a full video But it's kind of blurry and slowly the video gets let you know sharper and sharper and the details begin to resolve like you'll see blobs That become more and more detail and eventually get fine details like hair Or teeth or you know plant leaves and so on like that appears on the last stage of this and similarly the motion might start with coarse green motion That eventually becomes much more detailed and realistic as the denoising process proceeds So it's kind of a different axis in which we do the compute like just like you do tokens and decoding like if you don't models You have this denoising step, but one really important thing to talk about architect actually with at least with mochi as we open source It is it's kind of a multi-stage model There's first what we call it's a variational autoencoder which or vie which refers to essentially video compression There's just too many pixels and video for us to learn over natively in the model It's just it's just way too expensive So in mochi we train this hundred x video compression model through through through the variational autoencoder 
setup That takes the input video and actually projects and makes that sequence like we talk about So you're going from something that's like you know hundreds of millions of pixels or something down to something that ends up effectively taking about You know 50,000 tokens equivalent 50,000 tokens equivalent in a language model So we do that compression stage first and then in that laden space That is actually what the diffusion models learning right so that 10 billion parameter models learning to kind of reconstruct You know that hundred x down sampled or compressed space Do you envision that there's ever a point you know with with compute you know growing so fast Is there ever a point where you think compression will no longer be needed and you'll be able to do You know very large and detailed Videos without the need for that just because compute is so available in the future Or do you think that's unlikely and we're gonna keep chasing it with compression and doing other things So there was the first diffusion models actually were what we would call pixel space models So they were done at the full resolution of the sequence And so this is actually still doable for images I think what's interesting is that this like um laden diffusion setup has outperformed the pixel space approach Even in images where it is feasible computationally still do that You know, I think it is interesting though because like there has been a lot of hybridization of architectures between like autoregressive setups and diffusion setups That was one trend at like for example our team went to new rips this year Then to 2024 and you know several people have begun to explore combining different elements of autoregressive models diffusion models both in pixel space and laden space I think it's like a really diverse space that like it's just extremely underexplored For example, when we open source moji we actually developed a new architecture We call it the asymptit or asymptor did it was just an 
evolution on the kind of area that people were in. I mean, people leveraged the diffusion transformer setup as the architecture, and it's part of why it's so expensive. But we began to take some early steps to do architectural exploration. So I hope, long story short, we can eventually find some global optimum between compression and the actual generation part. Today we kind of factorize it for computational reasons, and I think it'll just get more and more blurry as we combine these different elements.

Well, Paras, you've mentioned Mochi, and this is the latest wave of what you have created at Genmo. Could you talk a little bit about Mochi in relation to previous models? You mentioned that Mochi is achieving top performance on certain benchmarks. Could you help us understand where it fits into the ecosystem of video models out there, and also what it represents to you all in the progression from your last generation to this generation?

First and foremost, before I dig into this, my belief is that video generation is super early. I think we're 1% of the way there. People look at this stuff and it's really surprising, but there's a huge gap between reality and where the state of video generation is, right? And I think that mindset is really important, because when we looked at the field of video generation as of mid-2023, or rather mid-2024, when we had our last-generation model, Replay, the issue was that models would synthesize high-resolution videos, but they just wouldn't move. They weren't that interesting, right? You would see a video of a person, and they would just stand there, and maybe there was camera motion. The camera would orbit the person or pan a little bit, but the subject wouldn't be moving. And to us that indicated some kind of learning failure with the video generation setup as of the last generation of these
models. And so, first and foremost, the most important thing we wanted to solve for video generation was motion, and subject motion specifically. Mochi 1 is neck and neck with the latest frontier-grade closed-source models, your Google Veos or Soras, specifically on motion benchmarks. I think this is really important and subtle, but that was the key component we wanted to solve with video generation. The second thing that was really important for us to solve in Mochi was prompt adherence. I think many people have this experience with video generation: you say, I want X. A classic test for this is, I want a dog wearing a hat holding a teacup. And it'll make that, but the order of those things and the composition of those elements is wrong, right?

So they might be sitting next to it but not holding it. We talked to a user in a user study about video generation, and they described the state of video generation as kind of like pushing on a rope. You want the rope to go one way, but you just can't get it to go, right? It's just really hard. And so with Mochi, we also invested heavily in prompt adherence in addition to motion. Prompt following, I think, is a really important element that will be critical to making these systems practically usable. We open-sourced this also because there was no good open model, let alone one that came close. There were a few of these closed models, like Runway, and Sora had been previewed in that blog post for several months. But nobody had actually trained and released an open model, and so that was holding this field back. And because we're so early, our viewpoint is that releasing this model and creating this bedrock foundation for people to actually do the research on aspects like motion and prompt adherence was going to be critical for the field. And it benefits us as a company, because people are building on top of our models, right?
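
To make the compression arithmetic from earlier in the conversation concrete, here is a minimal sketch of how a clip's pixel count relates to the number of tokens a latent diffusion model attends over. Every number in it is an assumption chosen for illustration and is not stated in the episode: the clip size (163 frames at 848x480), the 8x spatial and 6x temporal compression factors, the causal frame-count formula, and the 2x2 patching. The names `VideoShape` and `latent_token_count` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    """Raw clip dimensions (assumed values, not from the episode)."""
    frames: int
    height: int
    width: int

    @property
    def pixels(self) -> int:
        # Pixel positions across the whole clip (ignores color channels).
        return self.frames * self.height * self.width


def latent_token_count(video: VideoShape,
                       spatial_factor: int = 8,
                       temporal_factor: int = 6,
                       patch: int = 2) -> int:
    """Estimate the token count after compression and patching.

    Assumes a causal video autoencoder that keeps the first frame and
    then one latent frame per `temporal_factor` input frames, followed
    by grouping each `patch` x `patch` block of latents into one token.
    """
    latent_frames = (video.frames - 1) // temporal_factor + 1
    latent_h = video.height // spatial_factor
    latent_w = video.width // spatial_factor
    return latent_frames * (latent_h // patch) * (latent_w // patch)


clip = VideoShape(frames=163, height=480, width=848)
tokens = latent_token_count(clip)
print(f"{clip.pixels:,} pixels -> {tokens:,} tokens")
```

Under these assumed settings, a clip of roughly 66 million pixels (about 200 million values once three color channels are counted) becomes 44,520 tokens, the same order of magnitude as the roughly 50,000 tokens quoted above. Changing any of the assumed factors shifts the result, so the sketch shows the shape of the calculation rather than any particular model's configuration.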

So what kinds of things are you seeing people want to do with the model, and what are the different categories of use cases that people might be addressing? Which ones are high value?

Yeah, I think everyone's first experience is just play. People just want to open it up and they want to see something wild, right? Like a baby riding a dog, right?

And so I think that was always a funny one. You might have these things that just don't happen in reality that you want to see the model do, and so people start with that and explore the surface area. But when we look at actual real use cases, I think what's really interesting is that this video generation technology is working its way into enterprise content creation workflows. I think of this as creation, and then there's editing, right? These are the two halves of practical application of video generation. So, creation: first and foremost, many people are beginning to explore using video generation as a substitute for stock video. If you can't find exactly what you want in a stock catalog, you can just go generate it, and it's going to come with all the right licenses. It's exclusive to you, right? No one else gets that video because you made it, and it's n equals one. And so that's actually really powerful for a lot of content creation workflows. Video is also just really hard and expensive to iterate with, right?

You shoot it once, and if it's not perfect, you might want to re-prompt and re-edit it. And so I think that's an exciting application, for example, in the brainstorming, pre-visualization, and storyboarding process of content production. That goes way faster with a generator in the loop. And then editing...

Actually, that's exactly where I was about to go, on the editing: how do you envision that fitting in? Is that becoming a problem that people are attacking aggressively? What does it mean to edit video in the context of video generation? If you're generating the video from scratch, what does it mean to edit a video like that, and how might that be done? Is anyone really thinking about that right now?

Is that on the table?

So we released Mochi 1 as open source. We didn't know what people would use it for, and one really exciting thing is that within two weeks of open-sourcing it, one of the community members built this workflow called Mochi Edit. It's a full video editing pipeline built on top of our open-source model, and you can use it to add, remove, or change an object. It's a crazy video; you can search for Mochi Edit on GitHub. The demo he showed me that I think was really cool: they took a video of a person talking and they said, give him a hat, and it actually put a fully realistic, exactly 3D-tracked hat on him. It just looked totally realistic. And I think that full process with the conventional video editing pipeline, between tracking and rendering and compositing everything, would have taken like two or three weeks, honestly.

Very cool. I know, if I remember, Coca-Cola did a commercial for their winter advertisement with gen AI. This is maybe a wider question, but how do you think people in 2025 are going to experience video generation at the general public level?

In what ways do you think it will start to filter into people's everyday lives? Because, well, everybody remembers, Chris and I were talking about lots of language models on the podcast before ChatGPT, but we weren't talking about them at Thanksgiving dinner, right? And so you do have those moments, like the Coca-Cola video, where people were talking about this more widely. But that's probably not the ChatGPT moment of video generation. Any thoughts on how the general public will start to intersect with this technology in the coming year?

I think the early adopters are certainly here for video generation. I mean, our platform has well passed more than two million users, just beyond open source, and open source is probably many multiples of that. But I think that still represents a drop in the bucket compared to conventional media. And I think one of the biggest limiters, like I shared, was your ability to control it. Once you can actually get something out of it, the wow moment is almost instant. You'll ask it for something that just couldn't exist in the real world, and you see it in front of your eyes, and that is a jaw-dropping experience for most people, right? But I think the hard part there is that the tech has required too much expertise with prompting, and too much understanding of how to actually get good results out of the model, to make it usable. I think 2024 is the year that we will see instruction following and prompt adherence solved here. That makes this actually follow what you want to say. And I think of this as like going from GPT-3, which was just an unaligned language model in some sense, which would ramble about whatever topic on end but not in a particularly useful way, towards chat-based instruction tuning, right? That was the breakthrough moment for those models. I think it's very similar for video models. It kind of
comes to the moment where somebody can pick it up and use it without being an AI expert. Today, many people who are already talented with Midjourney or other conventional forms of image generation can kind of translate that into video. And I think this is really one of the critical things that has to be solved for this to have breakout exposure. But I just imagine a world, I think in five years, when we hit a point where there might be a poor kid in Mumbai or Kenya or somewhere who just has a phone and a good idea, pushes the button on their phone, and it wins an Academy Award, right? That's going to change the world. I don't think we're that far from that, to be honest.

Yeah, I love how you've framed that in that kind of expanded-agency sort of way. With AI models generally, I think the way people think about them as a bummer is, oh, these things are going to automate everything; I'm never going to see cool videos again because they're all going to be AI-generated without creativity. But what we're seeing with language models, and what we've seen even with image generation, is that there's so much creativity that the human can bring into that. It also democratizes a lot of potential production and that sort of thing for those who have amazing ideas but maybe not access to a Hollywood film crew, right? So I love that there's still that element in your vision of human agency being expanded, and even people getting to tell stories that maybe they wouldn't otherwise. So I love that.

I've got a question for you.
It's a little bit of a random one, but interesting. People ask me this a lot: what does creativity mean as we go forward? Human creativity is coming to bear, and you have these tools that some people consider creative in a sense and some people don't. What does that look like? What is it when that person and a tool together go and do the thing the Kenyan boy is doing? How do you think about that? How do you contextualize that?

You know, I think human ingenuity and creativity is the root of all interesting forms of content. I know people are scared that AI is going to automate all this stuff. But if you look at what LLMs will just ramble on about, it's just the aggregate average of all their training inputs, and that's not particularly interesting or novel to anybody, right? I think the greatest films come from someone with a new idea, a new lens on the world, a new interpretation of what it means to be human and live in the world that we do. From that you have great media, and I think that will forever be true. The human's role here is always going to be pushing the frontier. Language models and video models learn by averaging, aggregating, and compressing all the information around them, but in some sense they won't ever really push the frontier alone. A human plus a video model, though, is something entirely different, right? Now you have something, and I like the term creative amplification, that is possible, right?

The human alone is producing the creativity, but the video model now amplifies it in a way that just would never have been possible with the older generation of media in the older world, right? That iteration cycle might have taken years, an entire lifetime, to go through and discover an idea space. And now somebody can do that within a matter of months or weeks, just iterating on new ideas, testing them, and seeing them visualized.

I guess that leads us naturally into this. That was a great, wider vision, but what is your vision for Genmo specifically? What keeps you up at night? What are you most excited about as you move into a new year with a lot of new possibilities?

So I think our vision has been very consistent over a long period of time, which is to build frontier models for video generation. But the goal is to unlock the right brain of artificial general intelligence. It's completely neglected. I mean, OpenAI and these frontier models have taken over the left brain, and we say, hey, this other side is just as capable and just as important as the left brain here. And so I think of that as: imagine AI that can say anything, possible or impossible, right?

And I think the first step here is creativity, is media, people creating, like I described with this vision of empowering creators. But longer term, I actually think this is really interesting, in that if we can explore this world of synthetic realities, it'll unlock huge progress in, I think, embodied AI, for example. And that's when this tech starts to become really powerful, right? I started my career in self-driving, and the big problem is there are too many edge cases to simulate, right? Even if you get millions of miles on the road, there are still new things that will happen. But I think for the first time, video models will enable training robust agents that can operate in the real world and actually understand all the possible realities, which they can just simulate through, right? That's, I'm telling you, apparent. And I think we're starting to see exploration even in reasoning with the o1-scale models as well. But to me, that's one of the most exciting long-term, ten-year potentials that we'll see for video generation. And we again are trying to work towards that future.

Well, thank you for how you're digging in in the space. It is truly inspirational, and we really appreciate you taking time to chat with us as you head into those innovations and exciting stuff. Please come back when you release whatever the next thing is. You're welcome back to chat about it.

Thank you so much, Paras. It's great to chat.

Thank you, Daniel. Thank you, Chris.

All right, that is our show for this week. If you haven't checked out our Changelog newsletter, head to changelog.com. There you'll find 29 reasons, yes, 29 reasons, why you should subscribe. I'll tell you reason number 17: you might actually start looking forward to Mondays. Sounds like somebody's got a case of the Mondays. 28 more reasons are waiting for you at changelog.com/news.

Thanks again to our partners at fly.io, to Breakmaster Cylinder for the beats, and to you for listening. That is all for now, but we'll talk to you again next time.


Frequently Asked Questions

How long is this episode of Changelog Master Feed?

This episode is 45 minutes long.

When was this Changelog Master Feed episode published?

This episode was published on January 23, 2025.
