Make-a-video snippets: Google’s AI diffusion model Imagen makes short videos in HD


Shortly after Meta, Google has also presented an AI tool for creating videos from text prompts. Imagen Video feels somewhat more mature than Meta's Make-A-Video.

In the field of AI-based image generation, the major providers are now firing off their powder at what feels like a rapid clip: instead of static images, moving images have already arrived, earlier than even insiders would have recently suspected. That some of these are rushed releases shows in the somewhat immature results, which above all radiate one thing: everyone wants to be first, or at least present what they already have as quickly as possible. Apparently, the open-source image generator Stable Diffusion, which is available to everyone (almost) without restrictions, not only startled DALL·E provider OpenAI but also pushed the AI development departments of Facebook parent Meta and Google Brain into high gear. Or were the AI systems in both places already sitting half-finished in a drawer? But one thing at a time.

Almost a week after the US group Meta presented its AI video generator Make-A-Video, Google's AI division is now showing the state of its own research in a paper and an accompanying website with video demos. As with Meta, these are clips of only a few seconds in the style of short GIF animations, and they sometimes still show the flickering typical of AI-generated image sequences, which occurs between the individual frames created from a text prompt. However, the Imagen video snippets flicker less than the Meta demos, which (purely subjectively) looked a bit clumsy to the untrained eye at the end of September. Striking is the Imagen video model's apparently pronounced ability to render readable text.



Imagen Video prompt examples: “A bunch of autumn leaves falling on a calm lake to form the text ‘Imagen Video’. Smooth.” The video produced here has a resolution of 1280 x 768 pixels and is 5.3 seconds long at a frame rate of 24 frames per second. (Image: Google Research)

Imagen Video extends Google's Imagen text-to-image system, introduced in May 2022, into the time dimension. According to the Imagen team, the video-generation system is based on a "cascade of diffusion models": starting from a text prompt, the tool gradually creates high-resolution videos. A base neural network generates an initial video, followed by a series of chained models (which is apparently what the term cascade refers to) that improve spatial fidelity, temporal dynamics, and high-resolution appearance over multiple processing stages.
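To illustrate the idea, here is a minimal, purely structural sketch of such a cascade in Python: a base model produces a rough clip description from the prompt embedding, and a chain of refinement stages then improves spatial and temporal detail. All names and numbers in this sketch are illustrative placeholders, not Google's actual code or API.

```python
# Structural sketch of a "cascade of diffusion models" (illustrative only).
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Clip:
    frames: int   # number of frames
    height: int   # pixels
    width: int    # pixels
    fps: float    # frames per second

# Each cascade stage maps a clip to a refined clip.
Stage = Callable[[Clip], Clip]

def spatial_sr(factor: int) -> Stage:
    """Spatial super-resolution: more pixels per frame, frame count unchanged."""
    return lambda c: Clip(c.frames, c.height * factor, c.width * factor, c.fps)

def temporal_sr(factor: int) -> Stage:
    """Temporal super-resolution: more frames (higher fps), duration unchanged."""
    return lambda c: Clip(c.frames * factor, c.height, c.width, c.fps * factor)

def run_cascade(base_clip: Clip, stages: Sequence[Stage]) -> Clip:
    clip = base_clip
    for stage in stages:          # the cascade: each stage refines the last output
        clip = stage(clip)
    return clip

# Illustrative factors only; the real model's stage schedule is in the paper.
print(run_cascade(Clip(16, 24, 48, 3), [temporal_sr(2), spatial_sr(4)]))
# Clip(frames=32, height=96, width=192, fps=6)
```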


Examples of text prompts and the resulting image sequences: the frames form a coherent flow of time and clearly implement the text prompts visually.

(Image: Google Research)

If you want to take a closer look at the architecture of this pipeline, you can consult the research paper that the team has published in addition to the website. Reading the paper, it is immediately noticeable that a number of previous works form its basis, including a paper by the OpenAI researchers on DALL·E 2 and several research reports on diffusion models, among them the one by Robin Rombach and team on Stable Diffusion as well as work on 3D modeling of AI-generated graphics.

The technical background is more complex and mathematically demanding than the website, which aims for visual impact, can convey. Instructive is the illustration of the individual steps, from a first blurry draft to the finished high-resolution video, which can be viewed on the website. According to the team, up to 24 frames per second (fps) are possible at a resolution of 1280 x 768 pixels.

The rendering does not happen in one pass but step by step. The base diffusion model for video generation first creates a sequence of 16 images (frames) at a resolution of 24 x 48 pixels and a frame rate of 3 frames per second. With the further diffusion models, the AI system successively upscales the video and adds intermediate frames until it reaches the full resolution at 24 fps. The result after going through all the stages is a 5.3-second video in HD. The researchers outline the pipeline in the paper as follows:


Cascaded pipeline: starting from the text prompt, via the T5 text encoder and a base model that creates the first image sequence, a finished short video emerges through various stages of spatial and temporal processing.

(Image: Google Research)
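A quick back-of-the-envelope check of the figures quoted above (a sketch based on the article's numbers, not Google's code) shows what the temporal stages contribute: the frame rate rises from 3 to 24 fps while the clip duration stays roughly constant at about 5.3 seconds.

```python
# Sanity check of the published figures: 16 frames at 3 fps already span
# about 5.3 seconds; the temporal super-resolution stages raise the frame
# rate to 24 fps, which implies roughly 128 frames for the same duration.

base_frames, base_fps = 16, 3      # output of the base diffusion model
final_fps = 24                     # frame rate after the temporal stages

duration = base_frames / base_fps              # ~5.33 s
final_frames = round(duration * final_fps)     # ~128 frames

print(f"base clip:  {base_frames} frames @ {base_fps} fps = {duration:.1f} s")
print(f"final clip: {final_frames} frames @ {final_fps} fps = {final_frames / final_fps:.1f} s")
```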

For the input side (the context-aware interpretation of text prompts), the Google team used a T5-XXL text encoder; T5 stands for Text-To-Text Transfer Transformer (five Ts), and XXL apparently for its dimensions. The library of the T5 model is available on GitHub, where interested parties can also find further information on the subject. The weights of the model were "frozen", as a technique used in training and adapting large artificial neural networks is called. By freezing a layer of the neural network, the team retains control over which weights continue to be updated; frozen weights are not modified any further. The technique is used for fine-tuning.

Among other things, this saves computing time, while accuracy should suffer little; in subsequent training stages, correspondingly fewer layers have to be trained. An overview can be found in an article in the magazine Analytics India. In the paper, the Imagen team explains the exact procedure and the individual design and architecture decisions made by the Google research team. According to the team, the use of the T5-XXL text encoder was crucial for the alignment between text input and video output. The model was trained on 14 million video-text pairs and 60 million image-text pairs as well as image data from the publicly accessible LAION-400M database (around 400 million image-text pairs; LAION merely indexes pairs available on the web and makes them accessible).
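What "freezing" looks like in practice can be sketched in a few lines of PyTorch, here with a small, publicly available T5 encoder from Hugging Face standing in for T5-XXL; this illustrates the general technique, not the team's actual training code.

```python
# Sketch: use a pretrained T5 encoder as a frozen text encoder. Gradients are
# disabled for all encoder weights, so only the downstream (diffusion) model
# would be updated during training; this saves compute during fine-tuning.
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")   # small stand-in for T5-XXL
encoder = T5EncoderModel.from_pretrained("t5-small")

for param in encoder.parameters():
    param.requires_grad = False    # "frozen": weights receive no further updates
encoder.eval()

with torch.no_grad():
    tokens = tokenizer("A bunch of autumn leaves falling on a calm lake",
                       return_tensors="pt")
    text_embedding = encoder(**tokens).last_hidden_state  # conditioning signal

print(text_embedding.shape)   # (batch, sequence length, hidden size)
```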

The move from 2D images to 3D is important for video creation, which is why the team chose Video U-Net as the diffusion architecture. Google Brain researcher Ben Poole presented a related text-to-3D method separately under the name DreamFusion on September 29. In Video U-Net, the interleaved diffusion models can apparently process multiple video frames in blocks simultaneously, with SSR and TSR models concatenated (SSR stands for Spatial Super-Resolution, TSR for Temporal Super-Resolution). This apparently makes it possible to depict longer temporal dynamics and processes without losing visual coherence. The further steps, up to a progressive distillation for faster sampling, can be followed in the research paper.

U-Net architecture with separate space-time blocks – the adjustments are made per frame. The team uses techniques such as spatial convolution, spatial self-attention and temporal self-attention.

(Image: Google Research)
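How such separated space-time blocks can look in code is sketched below in PyTorch: the spatial convolution and spatial self-attention treat each frame on its own, while the temporal self-attention mixes information across frames at each spatial position. This is a reconstruction based on the description above, not Google's implementation.

```python
# Minimal factorized space-time block: spatial ops per frame, temporal
# attention across frames at each spatial position (illustrative sketch).
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.spatial_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.spatial_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape

        # Spatial convolution, applied to each frame independently.
        y = self.spatial_conv(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)

        # Spatial self-attention: tokens are the h*w positions within one frame.
        s = y.permute(0, 1, 3, 4, 2).reshape(b * t, h * w, c)
        s, _ = self.spatial_attn(s, s, s)
        y = y + s.reshape(b, t, h, w, c).permute(0, 1, 4, 2, 3)

        # Temporal self-attention: tokens are the t frames at one spatial position.
        v = y.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        v, _ = self.temporal_attn(v, v, v)
        y = y + v.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return y

# Example: 2 videos, 16 frames, 64 channels, 24x48 feature maps.
block = SpaceTimeBlock(channels=64)
out = block(torch.randn(2, 16, 64, 24, 48))
print(out.shape)  # torch.Size([2, 16, 64, 24, 48])
```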

At Google, you cannot yet create such a video yourself or test the tool in any other way. The Imagen team justifies this with safety concerns: first, problematic imagery has to be filtered out in order to curb potential misuse. Since the model was trained on freely available image material from the internet, it apparently still contains, and can therefore generate, fake, hateful, "explicit" (that is, depictions of nudity and sexual acts) and otherwise harmful content. Internal work on filtering prompts and outputs does not yet appear to be complete.

According to the Imagen team, obviously violent content is relatively easy to filter out; it is harder with stereotypes and representations that contain an implicit (social) bias. "We have decided not to release the Imagen video model or its source code until these concerns are resolved," the team's blog post concludes. It should be noted that Google Imagen, the tool for creating static images from text prompts, has so far also only been shown via hand-picked, internally created outputs; a demo version is still not available. The text-to-image generator Imagen was presented in May 2022, shortly after DALL·E 2, which was released in April 2022.

Anyone who wants to look at the Imagen Video demos will find them on Google's research page. At the moment, only roughly five-second film sequences are available online, video snippets rather than films, but they already show the potential of the new technology. This limitation applies to the visuals published by the Google AI team as well as to Meta's demos; meanwhile, film clips several minutes long and the first music videos, made with Stable Diffusion by creative programmers with filmmaking knowledge, are already circulating on the internet.

AI film pioneer Glenn Marshall in particular stands out here as a trailblazer for the new techniques: after winning an award for his AI short film The Crow in Cannes, he is now experimenting with Stable Diffusion, among other things, and presents ongoing research and art pieces on his Twitter channel, such as a project in which he gives visual form to texts by James Joyce and to poems.