Machine Learning: Ray 2.0 makes distributed workloads of large models more scalable

The framework for scaling AI and Python applications now supports random shuffling of datasets of 100 terabytes and more, and introduces the Ray AI Runtime (AIR).

Ray, an open source framework for scaling large machine learning applications, has reached its second major release. Ray 2.0 arrives two years and a series of minor releases after the first major version, bringing significant improvements across all libraries, new features to unify ML workflows and, according to its maintainers, better support for running ML applications in production. The libraries are also said to be easier to use and integrate than before.

According to the release announcement for Ray 2.0, the Ray team set itself the goal of unifying ML workloads across different tools. With the new major version, it should be possible to combine TensorFlow, PyTorch and Hugging Face in a single ML workload. This is enabled by new tools that are still in beta, in particular the Ray AI Runtime (AIR for short) for scaling and unifying ML applications, and KubeRay for running Ray on Kubernetes. KubeRay is intended to replace the old, Python-based Ray operator. With the Ray Datasets library, Ray now natively supports randomly shuffling datasets of 100 terabytes and more, as sketched below.
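A minimal sketch of how the Datasets shuffle and AIR fit together, based on the Ray 2.0 API; the S3 path, worker count and training loop are illustrative assumptions, not part of the announcement:

```python
import ray
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

ray.init()  # connect to a local or remote Ray cluster

# Globally shuffle a large dataset with Ray Datasets;
# the S3 path is a placeholder.
ds = ray.data.read_parquet("s3://example-bucket/training-data/")
ds = ds.random_shuffle()  # distributed shuffle across the cluster

def train_loop_per_worker(config):
    # Each worker reads its shard of the shuffled dataset.
    shard = session.get_dataset_shard("train")
    for batch in shard.iter_batches(batch_size=1024):
        ...  # framework-specific training step, e.g. a PyTorch update

# AIR's TorchTrainer scales the training loop across workers.
trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
    datasets={"train": ds},
)
result = trainer.fit()
```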

Another highlight is Ray Serve's Deployment Graph API, which provides a new, straightforward way to build, test and deploy an inference graph of deployments (this API is also still in beta). According to the maintainers, the Ray Serve deployment graph makes Ray 2.0 particularly well suited to serving a large number of ML models with complex interdependencies.
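What such a graph can look like with the Ray 2.0 Deployment Graph API is sketched here; the two-stage preprocessor-plus-model pipeline and its parameters are illustrative assumptions:

```python
from ray import serve
from ray.serve.drivers import DAGDriver
from ray.serve.deployment_graph import InputNode
from ray.serve.http_adapters import json_request

# Hypothetical two-stage graph: a preprocessing step feeding a model.
@serve.deployment
def preprocess(payload: dict) -> list:
    return payload["features"]

@serve.deployment
class Model:
    def __init__(self, factor: float):
        self.factor = factor

    def predict(self, features: list) -> list:
        return [x * self.factor for x in features]

# Wire the deployments into a graph; InputNode stands for the request.
with InputNode() as request:
    features = preprocess.bind(request)
    model = Model.bind(2.0)
    prediction = model.predict.bind(features)

# DAGDriver exposes the graph over HTTP and routes JSON requests into it.
serve.run(DAGDriver.bind(prediction, http_adapter=json_request))
```

A JSON POST to the resulting HTTP endpoint then flows through preprocess and Model.predict in turn, and each deployment in the graph can be scaled independently.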

The project team announced general availability at the Ray Summit, which took place from August 22 to 24, 2022 in San Francisco. The framework and its ecosystem are considered proven for scaling and powering large, complex AI workloads: among other things, Ray was used to train GPT-3, OpenAI's large language model, and according to the project team companies such as Shopify and Amazon rely on it. Ray is also an established tool for managing workloads in the field of MLOps.

Behind the open source project is the company Anyscale, which also presented a new enterprise platform for operating Ray at the summit. The framework emerged from a small university project at UC Berkeley; co-founder and CEO Robert Nishihara presented the project's background and goals in a keynote at the summit. Greg Brockman, CTO and co-founder of OpenAI, also spoke there. According to him, OpenAI uses Ray to train its largest models. Brockman described Ray as developer-friendly and highlighted as an advantage of such a third-party tool that maintenance is not OpenAI's own responsibility, which saves resources. Ray appears to be part of OpenAI's basic infrastructure.

If you are already using Ray 1.x and want to upgrade, the Ray 2.0 migration guide provides orientation. Further details can be found in the release notes on GitHub, which list all technical changes. Those who want to delve deeper can take a look at the Ray 2.0 documentation, where the Ray team also offers an overview for beginners and introduces the framework and its components step by step. General information is available on the Ray project website.