The OpenFlamingo project has announced the release of its open source framework for training and evaluating multimodal vision-language models (MVLMs) with in-context learning. The project seeks to build a multimodal system that can address a wide range of vision-language tasks and approach the power and versatility of GPT-4 in processing visual and text inputs.
The goal of OpenFlamingo is to create an open source version of DeepMind’s Flamingo model, which is capable of processing and reasoning about images, videos, and text.
OpenFlamingo contributions
OpenFlamingo is a Python framework that enables Flamingo-style training and evaluation of MVLMs, based on Lucidrains' Flamingo implementation and David Hansmair's flamingo-mini repository. The project includes the following contributions:
Multimodal C4 dataset
A large-scale multimodal dataset with interleaved sequences of images and text. Multimodal C4 is an extension of the C4 dataset that was used to train the T5 models. For each document in the C4 en.clean dataset, the original Common Crawl web page is retrieved and its downloadable images are collected. The data is cleaned through deduplication and content filtering, which aims to remove images that are not safe for work (NSFW) as well as irrelevant ones, such as advertisements. In addition, face detection is run and images with positive detections are discarded. Finally, images and sentences are interleaved within each document using bipartite matching, with CLIP ViT-L/14 image-text similarities serving as edge weights. Multimodal C4 consists of approximately 75 million documents, encompassing around 400 million images and 38 billion tokens.
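To make the matching step concrete, the sketch below shows one way such an assignment could be computed: CLIP embeddings yield a similarity matrix between a document's images and sentences, and a maximum-weight bipartite matching pairs each image with a sentence. This is an illustration of the idea rather than the actual Multimodal C4 pipeline; the use of the open_clip library and the helper name assign_images_to_sentences are assumptions.

```python
# Illustrative sketch of image-to-sentence assignment via bipartite matching.
# CLIP similarities between every (image, sentence) pair act as edge weights;
# scipy's Hungarian solver then finds the maximum-weight matching.
import torch
import open_clip
from scipy.optimize import linear_sum_assignment

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")

def assign_images_to_sentences(images, sentences):
    """Return (image_index, sentence_index) pairs for one document."""
    with torch.no_grad():
        img_feats = model.encode_image(
            torch.stack([preprocess(im) for im in images])
        )
        txt_feats = model.encode_text(tokenizer(sentences))
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)

    # Cosine similarities are the edge weights of the bipartite graph.
    sim = (img_feats @ txt_feats.T).cpu().numpy()

    # linear_sum_assignment minimizes cost, so negate the similarities.
    rows, cols = linear_sum_assignment(-sim)
    return list(zip(rows.tolist(), cols.tolist()))
```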
In-context task evaluation
A suite of vision-language evaluations that measures OpenFlamingo's performance across a wide range of tasks. It currently includes visual question answering, image captioning, and image classification. The ultimate goal is to create an open source version of the Flamingo evaluation suite and to extend it into a standard for evaluating vision-language tasks.
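As an illustration of what few-shot, in-context evaluation looks like in practice, the sketch below assembles a two-shot image-captioning prompt in the interleaved format used by Flamingo-style models, where <image> and <|endofchunk|> markers indicate where visual features enter the text stream. The helper function is hypothetical and not part of the OpenFlamingo evaluation code.

```python
# Hypothetical helper that builds a few-shot captioning prompt for a
# Flamingo-style model. The <image> and <|endofchunk|> markers show where
# visual features are injected into the text stream.

def build_fewshot_caption_prompt(demo_captions, query_prefix="Output:"):
    """demo_captions: captions for the in-context demonstration images.

    Interleaves one <image> placeholder per demonstration, followed by a
    final <image> for the query whose caption the model must complete.
    """
    prompt = ""
    for caption in demo_captions:
        prompt += f"<image>{query_prefix} {caption}<|endofchunk|>"
    prompt += f"<image>{query_prefix}"
    return prompt


# Two-shot example: the model sees two captioned images, then completes
# the caption for a third image supplied alongside the prompt.
print(build_fewshot_caption_prompt(
    ["Two cats sleeping on a couch.", "A bathroom sink under a mirror."]
))
```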
OpenFlamingo-9B model
A first checkpoint, OpenFlamingo-9B, built on LLaMA, has been published as part of this initial release. Although the model is not yet fully optimized, it demonstrates the project's potential. It was trained on 5 million samples from Multimodal C4 and 10 million samples from LAION-2B.
Technical details
The OpenFlamingo implementation largely follows the Flamingo architecture. Flamingo models are trained on large-scale web datasets containing interleaved text and images, which is crucial to giving them few-shot, in-context learning capabilities. OpenFlamingo implements the same architecture proposed in the original Flamingo paper; however, since Flamingo's training data is not publicly available, open source datasets are used to train the OpenFlamingo models.
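The following is a minimal PyTorch sketch of the gated cross-attention block that Flamingo-style architectures insert between the layers of a frozen language model so that text tokens can attend to visual features. The zero-initialized tanh gates and the overall block structure follow the general recipe of the Flamingo paper; the class itself is an illustration, not the OpenFlamingo implementation.

```python
# Sketch of a Flamingo-style gated cross-attention block.
import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )
        # Gates start at zero so the frozen language model behaves exactly as
        # before training; they open gradually as the gate parameters learn.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (batch, text_len, dim) hidden states from the LM
        # visual_tokens: (batch, vis_len, dim) features from the vision side
        attended, _ = self.attn(self.norm(text_tokens), visual_tokens, visual_tokens)
        text_tokens = text_tokens + torch.tanh(self.attn_gate) * attended
        text_tokens = text_tokens + torch.tanh(self.ff_gate) * self.ff(text_tokens)
        return text_tokens
```

In the full architecture, blocks like this are interleaved with the frozen language-model layers, and only these new layers (together with the visual resampler) are trained, while the language model and vision encoder remain frozen.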
The inclusion of multimodal datasets such as Multimodal C4 is essential for training MVLMs, since combining visual and textual data is what enables vision-language tasks to be addressed. In addition, in-context evaluation is a valuable tool for measuring model performance, since it captures a model's ability to learn from a handful of examples presented in its context.
Ethics and safety considerations
It is important to note that OpenFlamingo-9B is a research artifact, not a finished product. As such, it can produce inappropriate, offensive, and inaccurate outputs. Safety and ethical considerations must also be taken into account when using large-scale MVLMs, as they can have unintended effects and perpetuate bias and discrimination. OpenFlamingo inherits the issues of its parent models, so extensive testing is needed to mitigate them in future releases.