Artificial intelligence (AI) is advancing by leaps and bounds. In 2022 we saw remarkable text-to-image generators like DALL·E 2, Stable Diffusion 2.0, and Midjourney. Just when it seemed the year would close without any further notable developments in this field, ChatGPT appeared and sparked a true revolution.
These tools did not go unnoticed and were quickly adopted in a range of scenarios. As a result, the world began to prepare for the challenges that come with AI, from controversies over possible copyright infringement to its use in academia.
However, the seemingly unstoppable advance of this technology could soon hit a limit. The possibilities offered by the creations of companies like OpenAI do not happen by magic. The secret lies in huge datasets, and we are consuming them faster than we are producing them.
Datasets, the secret behind ChatGPT and other AI apps
Datasets are essential for machine learning tasks. In the case of ChatGPT, they provide the information the model needs to produce coherent and natural responses. The larger and more varied the datasets used, the more capable the model is of learning to produce a wide variety of texts.
If we take DALL·E as an example, in general terms, datasets provide the AI model with example images and their corresponding descriptions. In this way, using a neural network specifically designed to process text input, the model can generate images from textual descriptions.
You may be wondering, then, where the problem is. According to a group of researchers from Epoch AI, an organization that studies the development of artificial intelligence, the high-quality datasets being used to train the aforementioned advanced language models will be exhausted by 2026, which could harm their development.
According to a paper the researchers published in the online archive arXiv, the demand for high-quality datasets for training AI language models is growing by approximately 50% every year. The generation of these datasets, on the other hand, grows at a rate of only 7% per year.
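A back-of-the-envelope projection shows why those two growth rates collide so quickly. The sketch below uses only the 50% and 7% figures from the paper; the starting point (demand at 30% of the available stock in 2022) is an illustrative assumption, not a number from the researchers.

```python
# Toy projection: demand for high-quality training data grows ~50%/year,
# while the stock of such data grows only ~7%/year.
# The initial demand-to-stock ratio (0.30 in 2022) is an assumption
# chosen for illustration, not a figure from the Epoch AI paper.

def exhaustion_year(start_year=2022, demand=0.30, stock=1.0,
                    demand_growth=0.50, stock_growth=0.07):
    """Return the first year in which projected demand exceeds the stock."""
    year = start_year
    while demand <= stock:
        year += 1
        demand *= 1 + demand_growth
        stock *= 1 + stock_growth
    return year

print(exhaustion_year())  # under these assumptions, demand overtakes supply in 2026
```

Whatever starting ratio one picks, exponential demand growing seven times faster than supply overtakes it within a handful of years; the assumption here only shifts the crossover by a year or two.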
The process of generating quality datasets is also demanding. Remember that these collections draw on public information and must be large enough for the model to learn effectively, as well as varied and coherent. This is where manual human labor comes into play: people review and clean the data.
This process, as explained by Epoch AI, is slow and expensive. There are, however, tools that help automate parts of dataset cleaning, and even the possibility of using AI to review the data itself, but this carries risks, such as the proliferation of errors and biases that could affect the model.
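To make the cleaning step concrete, here is a minimal sketch of the kind of automated filtering such tools perform: collapsing whitespace, dropping exact duplicates, and discarding entries too short to be useful. The threshold and the helper name are illustrative assumptions, not the pipeline of any specific tool.

```python
# Minimal sketch of automated dataset cleaning: normalize whitespace,
# drop exact duplicates, and filter out texts that are too short.
# The min_length threshold is an arbitrary illustrative value.

def clean_dataset(records, min_length=20):
    """Return records with duplicates and too-short texts removed."""
    seen = set()
    cleaned = []
    for text in records:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        if len(normalized) < min_length or normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned

raw = [
    "The cat sat on the mat and looked around.",
    "The  cat sat on the mat   and looked around.",  # duplicate after normalization
    "Too short.",
]
print(clean_dataset(raw))  # only the first record survives
```

Real pipelines go much further (near-duplicate detection, language identification, toxicity and quality filters), which is precisely why the human review that Epoch AI describes remains slow and expensive.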
What will happen between now and 2026 remains to be seen. If datasets start to run out, as the researchers predict, the evolution of AI could slow considerably over time. But for now, artificial intelligence enthusiasts are looking forward to the arrival of GPT-4, the evolution of the famous GPT-3 that brings ChatGPT to life.