Text-to-image revolution: Stable Diffusion enables AI image generation for everyone


With Stable Diffusion, an AI system is freely available that creates impressive images. Organizations such as LAION and EleutherAI support the non-profit project.

 

The family of AI image generators is growing again, this time from the open-source corner: With Stable Diffusion, a neural text-to-image model has appeared that might have what it takes to beat the previous top dogs, DALL·E 2 from OpenAI (released in April 2022) and Imagen by Google Brain (presented in May, so far without a demo). This is due not only to the high quality of the images, but also to their accessibility: Thanks to the publishers' non-profit approach, the Stable Diffusion model and the output generated with it are freely available to the general public.

 

After a closed release in early August via a registration form (for researchers), the Stable Diffusion team has now opened the model to everyone. According to the release notification, the time since the research release was apparently used to clarify remaining legal, ethical and technical questions. The public release contains an AI-based safety classifier that removes undesired output; the parameters of this safety mechanism can apparently be customized by users.

The model is licensed under the “Creative ML OpenRAIL-M” license, the details of which can be viewed on Hugging Face. Accepting this license is a prerequisite for using the model; after that, the weights can be downloaded. Use is not limited to private purposes: commercial use and offering services based on Stable Diffusion are also expressly permitted, provided the license's restrictions are observed (illegal or harmful output and content are prohibited). In return, users themselves bear full responsibility for their use.

Numerous users are already testing the system extensively, especially in combination with Midjourney (beta), and sharing some impressive results on the Internet. During the test phase, over 10,000 beta testers used the model and produced around 1.7 million images per day.

 

Beginners can use the Lexica search engine to browse through the images and text prompts previously generated with Stable Diffusion. Lexica is currently indexing more than five million entries, and the number is growing all the time. Anyone who creates AI-based images with Stable Diffusion or another text-to-image system will find creative inspiration here, but also food for thought for research.

Image gallery: “Stable Diffusion: Finding Images and Prompts with Lexica” (10 images)

Example image: Middle-earth, Mines of Moria and Khazad Dum
Prompt: “Mines of moria, khazad dum, halls of durin, middle earth, tolkien, a bright orb of light in the center of a grand hall, outer edges shrouded in darkness with creatures crawling out into the light, in the style of hieronymus bosch”
(Image: Stable Diffusion, via Lexica.art)

For training, the Stable Diffusion team used an image dataset from the freely accessible LAION-5B database, which contains around 5.85 billion CLIP-filtered image-text pairs and is therefore fourteen times larger than its predecessor, LAION-400M. LAION datasets are to be understood as indexes of the Internet: they list the URLs of the original images together with the associated ALT texts. Until early August 2022, LAION-400M was the world's largest publicly accessible image-text database, and numerous machine learning projects rely on datasets from this source.
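To illustrate what “index of the Internet” means in practice, here is a hedged sketch (the entry below is a made-up placeholder, not an actual LAION record): each row essentially pairs an image URL with its ALT text, and anyone who wants the pixels has to download them separately.

```python
# Illustrative only: the URL and caption below are placeholders, not real LAION data.
from io import BytesIO

import requests
from PIL import Image

entry = {
    "url": "https://example.com/some-image.jpg",              # link to the original image
    "caption": "a red bicycle leaning against a brick wall",  # the ALT text of the image
}

# LAION ships only the index; the images themselves are fetched from the source URLs.
response = requests.get(entry["url"], timeout=10)
image = Image.open(BytesIO(response.content))
print(entry["caption"], image.size)
```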

CLIP stands for Contrastive Language-Image Pre-Training, a technique developed by OpenAI that is also used in DALL·E 2 and maps visual concepts to language categories. In addition to their own work (mainly by the CompVis and Runway ML teams), the researchers drew on the work behind DALL·E 2 (from OpenAI) and Imagen (from Google Brain) as well as on contributions by AI developer Katherine Crowson.
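As a rough illustration of what CLIP-based filtering involves, the following sketch scores how well a caption matches an image using the publicly available openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers; the idea (simplified here, with a hypothetical local image file) is to keep only image-text pairs whose similarity exceeds a threshold.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("bicycle.jpg")  # hypothetical example image
caption = "a red bicycle leaning against a brick wall"

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Scaled cosine similarity between the image and text embeddings;
# a dataset filter would drop pairs that score below a chosen threshold.
similarity = outputs.logits_per_image.item()
print(similarity)
```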

“Stable diffusion” takes place in two steps: the encoder compresses an image (x) into a low-dimensional representation (z) in latent space. Diffusion and noise reduction (denoising) then operate primarily on the representation (z) rather than on the original image (x).

 

Diagram of stable diffusion: The encoder compresses an image x to a representation z, followed by diffusion and noise reduction (Fig. 1).

(Image: CompVis / AI Pub)
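To make the two-step idea above concrete, here is a toy structural sketch in PyTorch. It is not the CompVis implementation; the tiny autoencoder and the simplified noising function are purely illustrative. The point is only the structure: compress x to a latent z, run diffusion on z, and decode back to pixel space.

```python
# Toy structural sketch of latent diffusion (illustrative, not the real model).
import torch

class TinyAutoencoder(torch.nn.Module):
    """Illustrative stand-in: maps 3x512x512 images to a small latent and back."""
    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Conv2d(3, 4, kernel_size=8, stride=8)            # x -> z
        self.decoder = torch.nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)   # z -> x

def add_noise(z, t, steps):
    """Simplified forward diffusion: blend the latent with Gaussian noise at step t."""
    alpha = 1.0 - t / steps
    return alpha * z + (1.0 - alpha) * torch.randn_like(z)

autoencoder = TinyAutoencoder()
x = torch.randn(1, 3, 512, 512)                  # a stand-in "image"
z = autoencoder.encoder(x)                       # step 1: compress x to the latent z
z_noisy = add_noise(z, t=10, steps=50)           # step 2: diffusion runs on z, not on x
x_reconstructed = autoencoder.decoder(z_noisy)   # the decoder maps latents back to pixels
print(z.shape, x_reconstructed.shape)
```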

 

Anyone who would like to delve deeper into the technical background can consult the paper by the CompVis research group led by Robin Rombach from LMU Munich and Patrick Esser from Runway ML, who developed the technology (“High-Resolution Image Synthesis with Latent Diffusion Models”). The Heidelberg and Munich team uploaded the scientific paper in December 2021; the revised version from April 2022 is currently available online. The CompVis group also provides a GitHub repository on latent diffusion, in which they describe their approach to pre-training the diffusion models and the path to text-to-image generation, including pre-trained weights. In it, the researchers also explain how to train one's own models.

If you want to test Stable Diffusion, you need a user account on the Hugging Face Hub and an access token for the code. First, install diffusers==0.2.4 on your own machine with the following command: pip install diffusers==0.2.4 transformers scipy ftfy. The model card for the required version 1-4 is stored at Hugging Face; after reading the license, you have to accept it (by ticking a check box). Once the access token is supplied, the setup is complete and you can start with inference.
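Put together, the inference step looks roughly like the following sketch, which assumes the diffusers==0.2.4 API described above, a CUDA-capable GPU, and that the license for CompVis/stable-diffusion-v1-4 has been accepted with a valid access token:

```python
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True,   # requires having accepted the license on Hugging Face
)
pipe = pipe.to("cuda")

prompt = "a photograph of an astronaut riding a horse"
with autocast("cuda"):
    # diffusers 0.2.4 returns a dict whose "sample" entry holds the PIL images
    image = pipe(prompt)["sample"][0]
image.save("astronaut_rides_horse.png")
```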

How to get the token and what the next steps are is explained in detail in the blog entry on Stable Diffusion. One hurdle could be computing power, because GPU memory is required to operate the model. If you have less than 10 GB of GPU RAM, you have to be content with a smaller version of the Stable Diffusion pipeline (float16 precision instead of float32). In this case, extra steps are necessary during setup to load the alternative fp16 branch of the model.
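A minimal sketch of that smaller variant, assuming the fp16 weights branch documented on the model card; the pipeline can then be used exactly as in the full-precision example above:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",            # alternative fp16 weights branch
    torch_dtype=torch.float16,  # half precision instead of float32
    use_auth_token=True,
)
pipe = pipe.to("cuda")
```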

 

Behind Stable Diffusion is a team of researchers and engineers from the non-profit organization LAION, the research department of the AI company Stability AI, and CompVis, a group at Ludwig-Maximilians-Universität Munich (LMU) that grew out of the former Computer Vision Group at the University of Heidelberg. Stable Diffusion builds on the combined forces of the machine learning community, such as the grassroots collective EleutherAI, which already created GPT-Neo and GPT-J.

LAION stands for Large-scale Artificial Intelligence Open Network and is a non-profit. This is remarkable because the organization (unlike companies such as Google, Meta and OpenAI, which is associated with Microsoft) is financed exclusively by donations and public research funds. According to the project description, LAION's goal is to make the essential results of large-scale machine learning accessible to everyone who is interested. Stable Diffusion is apparently also intended to serve this goal.

With the text-to-image model, billions of people can potentially create digital works of art in a format of 512 x 512 pixels within seconds, because the model is said to run “on consumer GPUs” (less than 10 GB of video RAM), i.e. it does not require a dedicated data center – good news for underfunded researchers.

Details on the release of Stable Diffusion can be found in the public release announcement on the Stability AI blog and in a blog post on Hugging Face – where the weights, model card and code are also stored. The advance announcement of the research release provides additional insight and input from stakeholders. If you want to better understand how diffusion models work, you can check out the Diffusers library at Hugging Face (Colab “Getting started with diffusers”).

Explanatory threads can already be found on Twitter, including one by machine learning professor Tom Goldstein with an overview of relevant research work (thread: “How diffusion models work, how we understand them, and why I think this understanding is broken”). The overview at AI Pub is particularly recommended.

 

Anyone who works (or intends to do so) with large image data sets and is concerned about copyright issues will find an informative FAQ collection on the LAION website.


(her)