Can a Video Generative Model "Understand" the Laws of Physics? The Doubao Big Model Team Announces New Findings


Date: 2024-11-08

Category: Tech

ByteDance's Doubao Big Model Team recently released a study titled "How Far is Video Generation from World Model: A Physical Law Perspective". The study systematically analyzes whether mainstream DiT-based video generation models can abstract and understand physical laws from their training data, and designs a set of experiments to test this.


This post describes the experimental methodology, the investigation process, and the key findings of the study.


In AI research, we have always strived to build machines with "human-like intelligence" that not only perceive the world and understand its rules, but can also predict the future.


Today's video generative models can already produce entirely new, never-before-seen content, and in related introductions these models are often described as following the laws of physics and having significant potential as the basis for world models. However, can a video generative model observe how things interact and extract a stable set of physical laws from those observations? This question deserves a closer look.


Recently, the Doubao Big Model Team of ByteDance published the study "How Far is Video Generation from World Model: A Physical Law Perspective", hoping to answer this question.


How Far is Video Generation from World Model: A Physical Law Perspective


Link to study: https://team.doubao.com/zh/publication/how-far-is-video-generation-from-world-model-a-physical-law-perspective?view_from=research


Results Showcase Page: https://phyworld.github.io/


In this study, we take a closer look at the generalization behavior of video generative models when learning the laws of physics, under three scenarios: in-distribution generalization (ID), out-of-distribution generalization (OOD), and combinatorial generalization. Together, these scenarios help us understand whether models can extract common physical laws from data and apply them.


  • In-Distribution Generalization (ID)

In-distribution generalization means that the training and test data come from the same distribution. In our experiments, the training data and the test data lie within the same range of values.


  • Out-of-Distribution Generalization (OOD)

Out-of-distribution generalization refers to a model's ability to apply learned physical laws to scenarios it has never seen before. It examines whether a model can extrapolate to unknown physical situations from known physical laws without direct experience, rather than merely memorizing data patterns. This is an important criterion for whether the model has truly internalized the physical laws.


  • Combinatorial Generalization

Combinatorial generalization sits between ID and OOD and is of greater practical value. It examines whether a model can recombine information learned during training into new scenarios and produce valid predictions at test time. Here, the training data already contains all the "concepts" or objects, but not all possible combinations of them, and not in more complex forms.


In video-based observation, each frame represents a point in time, so predicting physical laws corresponds to generating future frames from past and present ones. Therefore, in each experiment, we train a frame-conditioned video generative model to simulate and predict the evolution of physical phenomena.
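To make the frame-conditioned setup concrete, below is a highly simplified sketch in PyTorch. It is not the paper's DiT-based diffusion model: a small convolutional network simply regresses the next frame from three conditioning frames, which is enough to illustrate the "past frames in, future frames out" formulation. All names and hyperparameters here are illustrative assumptions.

```python
# Simplified stand-in for frame-conditioned video prediction (not the paper's DiT).
import torch
import torch.nn as nn

class NextFramePredictor(nn.Module):
    def __init__(self, cond_frames: int = 3):
        super().__init__()
        # Treat the 3 conditioning frames as input channels and regress the next frame.
        self.net = nn.Sequential(
            nn.Conv2d(cond_frames, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, past):          # past: (B, 3, H, W)
        return self.net(past)         # predicted next frame: (B, 1, H, W)

model = NextFramePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(past, future):
    """One optimization step on a batch of (conditioning frames, ground-truth next frame)."""
    pred = model(past)
    loss = nn.functional.mse_loss(pred, future)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```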


By measuring the change in an object's position across the frames (time points) of a generated video, we can determine its motion state, then compare it with the ground-truth simulated video to check whether the generated content obeys the equations of classical physics.
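As a rough illustration of this measurement, here is a minimal sketch (not the paper's evaluation code) that locates the ball in each grayscale frame, estimates its velocity by differencing positions over time, and compares it against the velocity recovered from the ground-truth simulation. The thresholding scheme and helper names are assumptions.

```python
# Minimal velocity-error measurement sketch for single-ball clips.
import numpy as np

def ball_center(frame: np.ndarray) -> np.ndarray:
    """Center of mass of bright pixels in a grayscale frame of shape (H, W)."""
    ys, xs = np.nonzero(frame > 0.5)          # assume the ball is the only bright blob
    return np.array([xs.mean(), ys.mean()])   # (x, y) in pixels

def estimated_velocity(frames: np.ndarray, dt: float = 1.0) -> np.ndarray:
    """Average per-frame velocity estimated from a clip of shape (T, H, W)."""
    centers = np.stack([ball_center(f) for f in frames])
    return np.diff(centers, axis=0).mean(axis=0) / dt

def velocity_error(generated: np.ndarray, ground_truth: np.ndarray) -> float:
    """L2 distance between velocities measured from generated and simulated clips."""
    return float(np.linalg.norm(estimated_velocity(generated) - estimated_velocity(ground_truth)))
```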


1. Can the model learn in-distribution generalization / out-of-distribution generalization?


In this study, we focused on deterministic tasks governed by basic kinematic equations.


These tasks allow us to define in-distribution (ID) and out-of-distribution (OOD) generalization precisely, as well as to quantify the error in an intuitive way.


We selected the following three physical scenarios for evaluation, each of which is fully determined by its initial frames (a toy simulation sketch follows the list):


  1. Uniform linear motion: a ball moves horizontally at a constant velocity, illustrating the law of inertia.

  2. Perfectly elastic collision: two balls of different sizes and speeds move horizontally and collide, embodying the conservation of energy and momentum.

  3. Parabolic motion: a ball with an initial horizontal velocity falls under gravity, in accordance with Newton's second law.

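The sketch below shows how such deterministic scenarios can be simulated in closed form to produce per-frame positions; the actual study uses a physics engine, so treat these equations purely as an illustration of the three laws involved.

```python
# Toy closed-form simulators for the three deterministic scenarios (illustrative only).
import numpy as np

def uniform_motion(x0, v, n_frames, dt=0.1):
    """Constant-velocity horizontal motion (law of inertia); returns (n_frames, 2) positions."""
    t = np.arange(n_frames) * dt
    return np.stack([x0 + v * t, np.zeros(n_frames)], axis=1)

def parabola(x0, y0, vx, n_frames, dt=0.1, g=9.8):
    """Horizontal launch under gravity (Newton's second law)."""
    t = np.arange(n_frames) * dt
    return np.stack([x0 + vx * t, y0 - 0.5 * g * t**2], axis=1)

def elastic_collision_1d(m1, m2, v1, v2):
    """Post-collision velocities for a perfectly elastic 1-D collision
    (conservation of momentum and kinetic energy)."""
    u1 = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)
    u2 = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
    return u1, u2
```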

In tests of in-distribution (ID) generalization, we observed that as the model size increases (from DiT-S to DiT-L) or the amount of training data increases (from 30K to 3M), the velocity error of the model decreases in all three physical tasks. This shows that scaling model size and data volume is critical for in-distribution generalization.


However, the out-of-distribution (OOD) results contrast sharply with the in-distribution (ID) results:


  • Higher error: in all settings, the OOD velocity error is an order of magnitude higher than the ID error (roughly 0.3 vs. 0.02).

  • Limited impact of scaling data and model: unlike in-distribution generalization, scaling the training data and model size has little effect on reducing OOD error. This shows that simply increasing data volume and model size does not effectively improve the model's reasoning ability in OOD scenarios.
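For concreteness, the ID/OOD distinction used in this section can be sketched as a simple split over initial velocities: training and ID test conditions are drawn from the same range, while OOD test conditions fall outside it. The specific ranges and sample counts below are illustrative, not the paper's exact values.

```python
# Illustrative ID/OOD split over initial velocities.
import numpy as np

rng = np.random.default_rng(0)
TRAIN_RANGE = (2.5, 4.0)      # velocities seen during training (illustrative)
OOD_RANGE   = (5.0, 6.0)      # velocities never seen during training (illustrative)

train_v    = rng.uniform(*TRAIN_RANGE, size=3000)
id_test_v  = rng.uniform(*TRAIN_RANGE, size=100)   # in-distribution evaluation
ood_test_v = rng.uniform(*OOD_RANGE, size=100)     # out-of-distribution evaluation
```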



2. Can the model learn combinatorial generalization?


The inability of video generative models to reason in out-of-distribution (OOD) scenarios is not unexpected. After all, extracting exact physical laws from data is difficult even for humans. Humans, however, can predict what will happen by combining their past experiences.


In this section, we assess the combinatorial generalization capabilities of diffusion-based video generation.


  • Combinatorial Physical Scenario

We used the PHYRE simulator to evaluate the combinatorial generalization capability of the model.


PHYRE is a two-dimensional simulation environment containing multiple object types such as balls, jars, bars, and walls. Objects can be fixed or dynamic and can take part in complex physical interactions such as collisions, parabolic trajectories, and rotations, while the underlying physical laws of the environment remain deterministic.


  • Video data structure

Each video involved eight object types, including two dynamic gray balls, a group of fixed black balls, a fixed black bar, a dynamic bar, a dynamic vertical bar, a dynamic jar, and a dynamic vertical stick.


Each task contained one red ball plus four objects randomly selected from these eight types, giving a total of C(8, 4) = 70 unique templates. An example of the data is shown below:



For each training template, we held out a small portion of the videos to create an in-template evaluation set, and we also reserved 10 unused templates as an out-of-template evaluation set to evaluate the model's ability to generalize to combinations not seen during training.
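A minimal sketch of this template construction and split is shown below: enumerate the C(8, 4) = 70 combinations of four object types that accompany the red ball, reserve 10 templates for out-of-template evaluation, and keep the rest for training. The object names, random seed, and hold-out scheme are illustrative assumptions.

```python
# Illustrative template enumeration and out-of-template split.
from itertools import combinations
import random

OBJECT_TYPES = [
    "gray_ball_1", "gray_ball_2", "fixed_black_balls", "fixed_black_bar",
    "dynamic_bar", "dynamic_vertical_bar", "dynamic_jar", "dynamic_vertical_stick",
]

templates = list(combinations(OBJECT_TYPES, 4))   # C(8, 4) = 70 unique templates
assert len(templates) == 70

random.seed(0)
random.shuffle(templates)
out_of_template_eval = templates[:10]             # never seen during training
training_templates   = templates[10:]             # up to 60 templates used for training
```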


  • Analysis Results

As the table below shows, when the number of training templates increased from 6 to 60, all metrics (FVD, SSIM, PSNR, LPIPS) improved significantly on the out-of-template evaluation set. In particular, the anomaly rate (the proportion of generated videos that violate the laws of physics) dropped dramatically from 67% to 10%. This shows that when the training set covers more combinations, the model generalizes much better to combinations it has never seen.


For the in-template evaluation set, however, the model trained on only 6 templates performs best on metrics such as SSIM, PSNR, and LPIPS, most likely because with fewer templates each training example is seen many more times, so the model fits the in-template data more closely.




These results suggest that model capacity and coverage of the combinatorial space are critical for combinatorial generalization. This further implies that scaling laws for video generation should focus on increasing combination diversity, not just expanding data volume.



Note: Sample video generated on an out-of-template evaluation set. First row: Real Video. Second row: Video generated from a model trained on 60 templates. Third row: Video generated from a model trained on 30 templates. Fourth row: Video generated from a model trained on 6 templates.


3. Mechanistic Investigation: Abstracting Rules or Imitating Cases?


As shown above, video generative models do not perform well at out-of-distribution generalization, but in the combinatorial scenario, scaling data and models brings some improvement. Is this the result of case-based learning, or of an abstract understanding of the underlying laws? We designed experiments to find out.


  • Models seem to rely more on memory and case imitation

We trained on uniform-motion videos with speeds in the range v ∈ [2.5, 4.0], using the first 3 frames as the conditioning input. We trained on two datasets and compared the results: Set-1 contains only balls moving from left to right, while Set-2 contains balls moving from left to right as well as balls moving from right to left.
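A hedged sketch of how the two probe datasets could be sampled is shown below: Set-1 draws only left-to-right speeds from [2.5, 4.0], while Set-2 mixes both directions. Dataset sizes and the sampling scheme are assumptions for illustration.

```python
# Illustrative sampling of the Set-1 / Set-2 probe datasets.
import numpy as np

rng = np.random.default_rng(0)

def sample_speeds(n):
    """Speeds drawn from the training range [2.5, 4.0]."""
    return rng.uniform(2.5, 4.0, size=n)

set_1 = sample_speeds(2000)                        # all positive: left-to-right only
directions = rng.choice([1.0, -1.0], size=2000)    # mix of both directions
set_2 = sample_speeds(2000) * directions           # positive and negative velocities
```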


As the figure below shows, given conditioning frames depicting low-speed forward (left-to-right) motion, the model trained on Set-1 generates videos with only positive velocities, skewed towards the high-speed range. In contrast, the model trained on Set-2 occasionally generates videos with negative velocity, as marked by the green circles in the figure.




Faced with this difference, the team speculated that the model may be reasoning along the lines of "the training example closest to a low-speed ball is a small ball moving in the opposite direction," so it is swayed by "misleading" examples in the training data. In other words, rather than abstracting general physical laws to achieve out-of-distribution (OOD) generalization, the models seem to rely more on memory and case imitation.
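One way to probe this hypothesis, sketched below under assumed data structures, is to compare the velocity the model generates with the velocity of the training clip whose conditioning frames are most similar: if the generated velocity tracks the nearest training case more closely than the true physics, case matching is the better explanation.

```python
# Hedged sketch: does the generated velocity track the nearest training case
# more closely than the ground-truth physics? All data structures are assumed.
import numpy as np

def nearest_training_velocity(cond_frames, train_conds, train_velocities):
    """Velocity of the training clip whose conditioning frames are closest in L2 distance."""
    dists = [np.linalg.norm(cond_frames - c) for c in train_conds]
    return train_velocities[int(np.argmin(dists))]

def case_matching_evidence(generated_v, true_v, nearest_v):
    """Returns (gap to nearest training case, gap to true physics); a smaller
    first value suggests memorization / case matching rather than law abstraction."""
    return abs(generated_v - nearest_v), abs(generated_v - true_v)
```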


  • Models rely most on color when choosing which cases to imitate

As explored above, models rely on memory and similar cases to imitate when generating video. The next question is which attributes most strongly influence which cases they imitate.


After comparing the four attributes of color, shape, size, and velocity, we found that diffusion-based video generation models are inherently biased toward attributes other than shape, which may also explain why current open-set video generation models often struggle to preserve shape.


As shown below, the first row is the real video and the second row is a video generated by our model: the color stays consistent, but the shape is hard to maintain.



After pairwise comparisons, the team found that the video generative model is most inclined to find similar reference cases by "color" when generating an object's motion, followed by size, then velocity, and lastly shape. The prioritization of color/size/velocity over shape is shown below:




  • Complex Combinatorial Generalizations

Finally, regarding why complex combinatorial generalization can happen at all, the team proposes that the video model exhibits three basic combination patterns: attribute combinations, spatial combinations (different motion states of multiple objects), and temporal combinations (different states of multiple objects at different points in time).


The experiments found that the model exhibits some combinatorial generalization for attribute pairs such as velocity and size, or color and size. At the same time, as shown in the figure below, the model is able to recombine local segments of the training data along the temporal and spatial dimensions.


It is worth noting, however, that combinatorial generalization does not produce physically consistent videos in all cases. The model's reliance on case matching limits its effectiveness: without understanding the underlying rules, it retrieves and combines fragments, which can produce unrealistic results.




  • Limitations of Video Representation

Finally, the team explored whether generating in the raw video representation space is sufficient for a world model, and found that visual ambiguity leads to significant errors in fine-grained physical modeling.


For example, when the relevant size difference is only a few pixels, it is hard to tell from visual inspection alone whether a ball can pass through a gap, which can lead to results that look plausible but are actually wrong.



Note: The first row is the real video and the second row is the video generated by our model.
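The toy calculation below (with assumed pixel values) illustrates how such ambiguity arises: whether the ball fits through the gap flips with a one-pixel change in the measured radius, so a generated video can look plausible while being physically wrong.

```python
# Pixel-level ambiguity: a one-pixel difference in measured radius flips the outcome.
def fits_through(gap_width_px: int, ball_radius_px: int) -> bool:
    """True if the ball's diameter is smaller than the gap, both measured in pixels."""
    return 2 * ball_radius_px < gap_width_px

print(fits_through(gap_width_px=21, ball_radius_px=10))  # True  (diameter 20 < 21)
print(fits_through(gap_width_px=21, ball_radius_px=11))  # False (diameter 22 > 21)
```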


These findings suggest that relying solely on video representations is not sufficient for accurate physical modeling.


For more details on the experimental design, conclusions, and open discussion, please read the original paper.


4. A study carried out by two researchers, one born after 1995 and one after 2000


Both authors of this study are quite young: one was born after 1995 and the other after 2000. They work on fundamental vision research in the Doubao Big Model Team and have long been interested in world models. During the eight months of exploration, they read a large body of physical-reasoning literature and also looked to games for inspiration. After many failed attempts, they finally settled on the research questions and worked out the experimental methodology step by step.


As early as 2023, when he joined ByteDance, Kang, who was born after 1995, began conceiving research related to the "world model". After Sora took off, the two decided to start by asking whether video generative models can truly understand the laws of physics, and from there gradually uncover the mechanisms of world models.


Initially, Kang intended to feed the model gameplay videos from games such as "Angry Birds", whose scenes contain classical physical motions such as parabolic trajectories and collisions, but quantitative analysis of game videos proved very difficult. Later, after reading at least a hundred papers in the physical-reasoning field, Kang's team converged on synthesizing motion videos with a physics engine, which minimizes confounding variables and provides unlimited video data. Two months passed from the initial idea to the design of the experiments.


Before the experiments, the two worked together to build a physics engine for the experimental design. Then, for three or four months, they generated videos, trained models, and ran experiments. Model sizes grew from tens of millions to hundreds of millions of parameters, and the experimental data grew from tens of thousands to several million videos.


After reaching the experimental conclusions, the research stalled when they tried to explain the reasons behind the model's behavior. For three or four weeks the project made no progress, until during one experiment they noticed a hidden, counter-intuitive phenomenon. After designing a controlled comparison, they confirmed that "the model is not summarizing laws, but matching the closest training sample."


"Doing research is rarely a matter of suddenly having a good idea, trying it, and finding that it works. In reality, you just keep troubleshooting. But after a period of trial and error, you may unexpectedly find a solution hiding in one direction," said Kang.


Although the study took eight months and involved running quantitative experiments on virtual balls in videos every day, Kang felt not bored but "excited" and full of ideas. Recalling that time, Kang said: "The team gave us enough room to carry out fundamental research."


The second author, born after 2000, shared that this was the most challenging and time-consuming project they had ever worked on, involving building a physics engine, an evaluation system, and the experimental methodology. It was very laborious, and the project got "stuck" several times. However, both the team lead and the mentor were patient and encouraging throughout: "No one rushed us to get the project done."


At present, related positions in the Doubao Big Model Team's vision research area are still open. Click here to learn more about the positions.