Date: 2023-06-18

Google AI describes a notable technical advance on its blog. Its engineers Juhyun Lee and Raman Sarokin have developed a method to speed up large diffusion models (LDMs) on mobile devices, so that sizable models run efficiently on modern smartphones.

The Challenge of LDMs on Mobile Devices

LDMs are large models for image generation. Deploying them in a mobile environment is a considerable challenge because of their high memory and computation requirements. To overcome these obstacles, the Google AI team focused on optimizing memory efficiency and reducing the overall latency of ML inference.

Optimization Techniques: Memory Efficiency

To run LDMs efficiently, Google AI applied several techniques. One of them is making the attention module more memory efficient. This module is central to the LDM denoiser model, because it lets the model focus on specific parts of the input. Two of the optimizations used are partially fused softmax and FlashAttention.
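To make the memory problem concrete, here is a minimal sketch of unoptimized scaled dot-product attention in plain Python. It is not Google's implementation; the function names and the tiny inputs are illustrative assumptions. The point to notice is the full N x N score matrix that is built before the softmax.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def naive_attention(q, k, v):
    """Scaled dot-product attention; q, k, v are lists of vectors."""
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    # Full N x N score matrix: this intermediate dominates memory use.
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
              for qi in q]
    # A second N x N matrix holding the normalized weights.
    weights = [softmax(row) for row in scores]
    # Weighted sum of the value vectors for each query.
    return [[sum(w * vj[c] for w, vj in zip(row, v))
             for c in range(len(v[0]))]
            for row in weights]

def score_matrix_bytes(num_tokens, bytes_per_value=2):
    """Size of one N x N score matrix, assuming 16-bit values by default."""
    return num_tokens * num_tokens * bytes_per_value

# Hypothetical toy inputs: two tokens with two features each.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = naive_attention(q, k, v)
```

For example, score_matrix_bytes(4096) gives 33,554,432 bytes (32 MiB) for a single score matrix at 16 bits per value. The token count is an illustrative assumption, not a figure from the article.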

Partially Fused Softmax

Partially fused softmax is an optimization that avoids extensive memory reads and writes between the softmax and the matrix multiplication in the attention module. This significantly reduces the amount of memory required.
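One way to picture this idea is to split softmax into a small reduction pass and an elementwise pass that is folded into the following matrix multiplication. The sketch below follows that reading in plain Python; the function names are hypothetical and the code illustrates the principle only, not the GPU kernels the team wrote.

```python
import math

def softmax_row_stats(scores):
    """Reduction pass: per-row max and sum of exponentials (two numbers per row)."""
    stats = []
    for row in scores:
        m = max(row)
        s = sum(math.exp(x - m) for x in row)
        stats.append((m, s))
    return stats

def matmul_with_fused_softmax(scores, stats, v):
    """Compute softmax(scores) @ v, normalizing each score on the fly."""
    out = []
    for row, (m, s) in zip(scores, stats):
        acc = [0.0] * len(v[0])
        for x, vj in zip(row, v):
            # Elementwise softmax step: computed when needed, never stored.
            w = math.exp(x - m) / s
            for c, val in enumerate(vj):
                acc[c] += w * val
        out.append(acc)
    return out

# Hypothetical toy inputs chosen so the softmax weights are easy to check.
scores = [[0.0, 0.0], [0.0, math.log(3.0)]]
v = [[1.0, 0.0], [0.0, 1.0]]
stats = softmax_row_stats(scores)
result = matmul_with_fused_softmax(scores, stats, v)
```

Only the per-row statistics are kept between the two passes, so the normalized weight matrix is never written out in full.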

FlashAttention

FlashAttention, in turn, is an I/O-aware, exact attention algorithm that reduces the number of accesses to the GPU's high-bandwidth memory. In practice, the method proved effective only with certain SRAM sizes and when a large number of registers is available.
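The core trick that lets an attention algorithm be exact while touching memory one block at a time is a running (online) softmax. The following plain-Python sketch shows that trick under simplifying assumptions: it processes one query at a time, the block size is arbitrary, and the function name is made up for illustration. It is not the FlashAttention kernel itself.

```python
import math

def tiled_attention(q, k, v, block_size=2):
    """Exact attention computed over key/value blocks with a running softmax."""
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for qi in q:
        running_max = float("-inf")
        running_sum = 0.0
        acc = [0.0] * len(v[0])
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            s = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(s))
            # Rescale earlier partial results to the new maximum.
            correction = math.exp(running_max - new_max)
            running_sum *= correction
            acc = [a * correction for a in acc]
            for sj, vj in zip(s, v_blk):
                p = math.exp(sj - new_max)
                running_sum += p
                for c, val in enumerate(vj):
                    acc[c] += p * val
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out

# Hypothetical toy inputs: two queries, three keys, one value channel.
q = [[0.0, 0.0], [1.0, 2.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [6.0]]
tiled = tiled_attention(q, k, v, block_size=2)
# With a single block covering all keys this reduces to ordinary softmax.
full = tiled_attention(q, k, v, block_size=len(k))
```

Because each block only needs the running maximum, the running sum, and the accumulator, the full score matrix never has to exist in memory at once.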

Optimization Techniques: Winograd Fast Convolutions

Winograd fast convolutions are another technique that Google AI used to optimize LDMs. Although they increase memory consumption and numerical error, these convolutions proved effective in speeding up LDM inference, and the team balanced computational efficiency against memory usage when applying them.
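The smallest instance of the Winograd idea is the one-dimensional F(2,3) case, which produces two outputs of a three-tap filter with four multiplications instead of six. The sketch below shows that textbook case in plain Python as an illustration; the article concerns 2D convolutions in LDMs, and the function names here are assumptions.

```python
def transform_filter(g):
    """Precompute the transformed filter: 4 values instead of the original 3."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2.0, (g0 - g1 + g2) / 2.0, g2]

def winograd_f23(d, u):
    """Two outputs of a 3-tap correlation from 4 inputs, using 4 multiplications."""
    d0, d1, d2, d3 = d
    m1 = (d0 - d2) * u[0]
    m2 = (d1 + d2) * u[1]
    m3 = (d2 - d1) * u[2]
    m4 = (d1 - d3) * u[3]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_f23(d, g):
    """Reference: the same two outputs with 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

# Hypothetical toy data.
d = [1.0, 2.0, 3.0, 4.0]
g = [1.0, 2.0, 3.0]
u = transform_filter(g)
fast = winograd_f23(d, u)
slow = direct_f23(d, g)
```

The transformed filter holds more values than the original one, and the extra additions and divisions introduce rounding, which mirrors the memory and numerical-error trade-off described above.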

Optimization Techniques: Fusion of Specialized Operators

Another important finding was that the layers commonly used in LDMs need larger fused operators than off-the-shelf ML inference engines for GPUs provide. As a solution, the team developed specialized implementations for a wider range of neural operators, such as the Gaussian Error Linear Unit (GELU) and group normalization layers.
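For readers unfamiliar with these two operators, the following plain-Python sketch shows what each one computes. It is reference math only, with hypothetical function names and no learned scale or shift parameters; it says nothing about how the specialized GPU kernels are written.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def group_norm(x, num_groups, eps=1e-5):
    """Normalize a [channels][positions] tensor within each channel group."""
    channels = len(x)
    if channels % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    per_group = channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * per_group:(g + 1) * per_group]
        values = [val for ch in group for val in ch]
        mean = sum(values) / len(values)
        var = sum((val - mean) ** 2 for val in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(val - mean) * inv_std for val in ch] for ch in group])
    return out

# Hypothetical toy tensor: two channels, two positions, one channel per group.
x = [[1.0, 3.0], [5.0, 7.0]]
normed = group_norm(x, num_groups=2)
activated = [[gelu(val) for val in ch] for ch in normed]
```

Each operator involves several elementary steps (mean, variance, scaling, error function), which is why running them as a single kernel rather than a chain of small ones can save memory traffic.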

Notable Results

By applying these optimizations, Google AI was able to run sizable, high-resolution diffusion models on modern mobile devices in less than 12 seconds. These results mark an important milestone for ML on mobile devices.

Google AI’s achievement shows that the challenges of running large ML models on mobile devices are not insurmountable. By focusing on optimal memory utilization and on the balance between ALU and memory efficiency, the team achieved state-of-the-art latency for ML inference. This advance is a significant step toward more powerful and efficient mobile technology. For more details about this work, see the original article on the Google AI blog.