AIs have a problem: they are opaque and closed. BLOOM is the great open source project that wants to change everything
DALL-E, GPT-3, Imagen… these are some of the most recognized names in the field of artificial intelligence, and they all have something in common: they are not open models. These AIs can generate amazing images and conversations, but it is not clear to everyone how they got there, through an extremely complex process that many researchers also find opaque.
BLOOM is the great open source project that wants to change this situation: an open multilingual model with 176 billion parameters, trained on 1.5 terabytes of text. If the existing models are, in terms of relevance, the Google of their day, BLOOM could become the equivalent of Wikipedia.
After a year of work, the community finally has its own great open AI
The number of parameters is no accident. BLOOM (‘BigScience Large Open-science Open-access Multilingual Language Model’) is just slightly larger than GPT-3 (175 billion parameters). But it is not its power that makes it so relevant, rather the process by which it was built. Companies like Meta or OpenAI have also opened up some of their AI, but all of those initiatives have a commercial interest behind them.
This is where the community and BLOOM come in. BigScience, the organization responsible for this model, is a group of more than 1,000 artificial intelligence researchers brought together through Hugging Face, the leading platform and community around AI. And they have not been alone: in total, more than 250 institutions have collaborated on this project, which began in early 2021.
As Nature describes, BLOOM was trained in France on the $7 million publicly funded Jean Zay supercomputer. The result was published in mid-June.
How BLOOM is used will be up to researchers, but some applications are already on the table, such as extracting information from historical texts and performing classification tasks in biology. Since it is an open project, Hugging Face will launch a web application and will let any user download BLOOM and run it themselves.
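As a rough illustration of what downloading and running BLOOM can look like in practice, here is a minimal sketch using Hugging Face's transformers library. It is an assumption for the sake of the example, not something described in the article: the full bigscience/bloom checkpoint (176 billion parameters) is far too large for an ordinary machine, so the sketch loads the much smaller bigscience/bloom-560m variant, and the prompt and generation settings are purely illustrative.

# Minimal sketch: loading an open BLOOM checkpoint with Hugging Face transformers.
# The full "bigscience/bloom" model needs far more memory than a typical computer,
# so this example uses the smaller "bigscience/bloom-560m" checkpoint (assumption).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"  # small variant chosen for demo purposes

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "BLOOM is an open multilingual language model that"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a short continuation of the prompt.
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The same tokenizer and model classes would work with the full 176-billion-parameter checkpoint, given enough hardware; only the model name changes.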
"Social impact of LLM series with @BigscienceW and @mtlaiethics part 3: let's talk about data! Training data is at the heart of modern ML, yet we still pay too little attention to how it's governed, curated, and to who has a say in data choices." — Yacine Jernite (@YJernite), June 23, 2022
One of BLOOM's distinguishing features is the data used. AI results are closely tied to the datasets they are built on. In this case, the team of researchers hand-selected almost 70% of the 341 billion words on which the model was trained.
Another goal of the initiative was to feed the AI a diverse dataset, one sufficiently representative of different languages and cultures.
“Values such as openness, inclusion, diversity, responsibility and reproducibility are the DNA of this project. BigScience and BLOOM embody the most remarkable and honest attempt to break down the barriers that Big Tech has erected around AI during these years,” points out Alberto Romero, analyst at CambrianAI.
We will have to wait to see the results, but the fact that the open source community has already presented an open alternative to these AI models is great news, especially considering the enormous amount of work and the high technical requirements behind creating them.
More information | BigScience