Scheduling tensor programs
Tensor programs are ubiquitous, ranging from general matrix muliplication to deep neural networks. However, due to extensive float number computation, execution them can incur high latency. Indeed, it is well-known that the rejuvenation of neural networks is very much due to the use of GPUs for these computation — thus accelerating tensor programs is of great importance.
The remaining question is how should we achieve that? In below I capsulize the methods into two categories:
- Optimize the tensor program with either reformulation or approximation
- Reformulation: this optimizes only the algorithm from which the tensor program is created. For example, Strassen’s algorithm accelerates matrix multiplication since it uses less multiplications.
- Approximation: this kind of opitmization generates a new tensor program that is similar but is “cheaper”. For example, by quantization, one can obtain a significantly smaller network whose peformance is usually similar to the original net.
- Optimize the execution with hardware advancement
- This keeps tensor program untouched, but instead optimzing how they could be executed on a given hardware. For example, the development of cuDNN is one such endeavour.
We are primarily interested in category 2), but with yet one more aspiration: automate the process. The approach to this end requires two ingredients:
- A language — this refers to a domain specific language through which tensor program can be specified.
- A compiler — this refers to a tool that can translate the high-level source code into efficient low-level implementation.
In below are two instances of the above idea.
Halide is a
C++ embedded language, proposed Jonathan Ragan-Kelley and Andrew Adams in 2012.
OpenTuner: An extensible framework for program autotuning
By Jason Ansel et al., PACI 2014
Automatically Scheduling Halide Image Pipelines
By Mullapudi et al. SIGGRAPH 2016
Differentiable Programming for Image Processing and Deep Learning in Halide
By Tzu-Mao Li eta al., SIGGRAPH 2018
Learning to Optimize Halide with Tree Search and Random Programs
By Andrew Adams et al., SIGGRAPH 2019
Efficient automatic scheduling of imaging and vision pipelines for the GPU
By Lunke Andreson et al., Proceedings of the ACM on Programming Languages 2021
TVM is a
Python embedded language; its grammar is very much similar to
Learning to Optimize Tensor Programs
By Chen et al., Neurips 2018
Ansor : Generating High-Performance Tensor Programs for Deep Learning
By Zhang et al., USENIX 2020
Difference between Halide and TVM
Language expressiveness difference
DAG like computational logics. TVM is fully compatible with
PyTorch, so it also supports recurrent network architectures.
Auto-scheduler algorithms difference
- The autoscheduler in Halide schedules a program
stage by stage, where a
stageis a functional object. At each stage, it makes two type of decisions
- Cross-stage granularity:
- Intra-stage order:
- The autoscheduler in TVM divides the computation graph into independent
operators(an operator is similar to a stage in Halide), then optimize each operator separately.