Deep Learning Frameworks and Compilers
In previous blogs, we covered the topic of scheduling tensor programs. That task can be viewed as one step inside a machine learning compiler, which I will introduce in this blog.
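To make the idea of "scheduling" concrete, here is a toy sketch in plain Python (not any compiler's real API): two loop nests that compute the same matrix product but traverse the iteration space differently. A tensor-program scheduler searches over such semantically equivalent variants to find the fastest one for a given machine.

```python
def matmul_naive(A, B, n):
    # Straightforward triple loop: one schedule of the matmul algorithm.
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, t=2):
    # Tiled schedule: visit the iteration space in t-by-t blocks,
    # which can improve cache locality. The result is identical.
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for jj in range(0, n, t):
            for kk in range(0, n, t):
                for i in range(ii, min(ii + t, n)):
                    for j in range(jj, min(jj + t, n)):
                        for k in range(kk, min(kk + t, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Both functions compute the same product; only the loop structure (the schedule) differs, which is exactly the degree of freedom a scheduler exploits.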
Compilers are inherently designed to bridge the gap between software and hardware. In the same sense, machine learning compilers are primarily developed to target the hardware that runs machine learning programs, e.g., deep neural networks.
Owing to the success of deep neural networks and other machine learning applications, dedicated hardware devices have been developed for these programs. Such devices are collectively referred to as AI accelerators, as their main purpose is to accelerate tensor program computations.
Below I list some chips that might be used for AI computation:
- CPUs: these are the most generic chips we see every day, and they are of course able to execute tensor programs, though with high latency.
- NVIDIA GPUs, e.g., the Volta (2017), Turing (2018), and Ampere (2020) microarchitectures.
- Google’s TPUs. So far, four versions have been released: TPU-v1 in 2016, TPU-v2 in 2017, TPU-v3 in 2018, and TPU-v4 in 2021. The Edge TPU was announced in 2018 and has been used in the Pixel 4 smartphone.
- Huawei’s Ascend series. See the paper “Ascend: a Scalable and Unified Architecture for Ubiquitous Deep Neural Network Computing” (Industry Track Paper).
- Others: there are many other AI accelerators.
Tensor program software frameworks (neural network frameworks)
Dedicated hardware alone is insufficient to accelerate tensor program computations; software support is also needed. Over the past few years, many neural network frameworks have been developed: TensorFlow, PyTorch, JAX, and Huawei’s MindSpore, to name a few.
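Under the hood, these frameworks typically first capture user code as a graph of tensor operations, which can later be lowered to device-specific code. The following is a minimal sketch of such graph capture in plain Python; the class and operation names are hypothetical and do not correspond to any framework's real API.

```python
class Var:
    """A symbolic value that records the ops applied to it."""
    def __init__(self, name, graph):
        self.name, self.graph = name, graph

    def _emit(self, op, other):
        # Append an (op, lhs, rhs, result) record and return the result.
        out = Var(f"t{len(self.graph)}", self.graph)
        self.graph.append((op, self.name, other.name, out.name))
        return out

    def __add__(self, other): return self._emit("add", other)
    def __mul__(self, other): return self._emit("mul", other)

graph = []
x, y = Var("x", graph), Var("y", graph)
z = x * y + x          # high-level, hardware-agnostic user code
# graph now holds [("mul", "x", "y", "t0"), ("add", "t0", "x", "t1")],
# an intermediate form a backend could lower to device code.
```

Real frameworks differ in how they obtain this graph (tracing, source transformation, or eager recording), but some such intermediate form is what the compiler layers below operate on.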
These frameworks are all created under the software architecture stack above, and the core question they address is how to translate high-level, hardware-agnostic code into low-level, efficient, hardware-aware code. This core task is essentially the same however different the frameworks may look on the surface. Thus, deep learning compilers have good reason to exist as standalone projects. Some of them are listed below.
This specifically refers to the “intermediate” part of the software framework above.
XLA, developed by Google, originally for TensorFlow, but it now also supports PyTorch and JAX.
GLOW, developed by Facebook.
Tiramisu, developed by MIT.
Tensor Comprehensions, developed by Facebook.
AutoTVM, as the name indicates, is developed within TVM, but it also supports many other front ends, e.g., TensorFlow, PyTorch, etc.
Halide is a language embedded in C++ as well as a compiler. It was first created to accelerate image-processing pipelines, but it can also be used for deep neural network models, provided the model specification has been converted into Halide’s own representation.
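Halide's central idea is to separate the *algorithm* (what values to compute) from the *schedule* (in what order to compute them). Below is an illustrative sketch in plain Python, not Halide's actual C++ API: the algorithm is a pure function of coordinates, and two different "schedules" realize it in different traversal orders while producing identical output.

```python
def brighten(x):
    # Algorithm: a pure definition; no execution order is implied.
    return x * 2 + 1

def realize_row_major(f, w, h):
    # Schedule 1: realize the output row by row.
    return [[f(x + y * w) for x in range(w)] for y in range(h)]

def realize_column_major(f, w, h):
    # Schedule 2: realize column by column; same values, different order.
    out = [[None] * w for _ in range(h)]
    for x in range(w):
        for y in range(h):
            out[y][x] = f(x + y * w)
    return out
```

Because the algorithm never commits to an order, the compiler is free to pick whichever schedule best fits the target hardware.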
In summary, the differences among these tools mainly lie in the following aspects:
frontend language expressiveness. For example, Halide does not support cyclic computation graphs, e.g., LSTMs.
search space modelling. For example, Tiramisu models the scheduling space with the polyhedral model, while Halide models the search space as a decision tree.
algorithms for scheduling. For example, Tiramisu uses an ILP solver, AutoTVM uses evolutionary algorithms, and Halide uses tree search.
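To give a feel for what "searching a schedule space" means in the simplest case, here is a toy sketch in plain Python. The cost model and all numbers are invented for illustration and do not correspond to any of the tools above: we enumerate candidate tile sizes and keep the one with the lowest predicted cost.

```python
def toy_cost(tile, n=512, cache_budget=4096):
    # Hypothetical analytical model: penalize tiles whose working set
    # (three t-by-t blocks) overflows the cache budget, and penalize
    # tiny tiles for loop overhead.
    working_set = 3 * tile * tile
    overflow = max(0, working_set - cache_budget)
    overhead = n / tile
    return overflow + overhead

candidates = [4, 8, 16, 32, 64, 128]
best = min(candidates, key=toy_cost)   # exhaustive search over the space
```

Real schedulers face spaces far too large to enumerate, which is why the tools above resort to ILP, evolutionary search, or tree search, and why some (like AutoTVM) replace the analytical cost model with measurements or a learned model.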
Besides the algorithmic differences, all of the tools above rely on some format of intermediate representation, usually developed for that specific tool. It would be ideal if this aspect of the design were standardized, so that different compilers could focus on developing intelligent compilation algorithms rather than engineering superficially different but theoretically equivalent designs. MLIR represents an effort toward unifying these intermediate representations.