Scheduling tensor programs


Tensor programs are ubiquitous, ranging from general matrix multiplication to deep neural networks. However, because they involve extensive floating-point computation, executing them can incur high latency. Indeed, it is well known that the rejuvenation of neural networks owes much to the use of GPUs for these computations; accelerating tensor programs is therefore of great importance.

The question, then, is how to achieve this. Below I group the methods into two categories:

  1. Optimize the tensor program itself, via reformulation or approximation
    • Reformulation: this optimizes only the algorithm from which the tensor program is created. For example, Strassen’s algorithm accelerates matrix multiplication because it uses fewer multiplications (see the sketch after this list).
    • Approximation: this kind of optimization generates a new tensor program that is similar but “cheaper”. For example, through quantization one can obtain a significantly smaller network whose performance is usually close to that of the original net.
  2. Optimize the execution with hardware advancement
    • This keeps the tensor program untouched and instead optimizes how it is executed on a given piece of hardware. For example, the development of cuDNN is one such endeavour.
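
To make the reformulation example concrete, here is a minimal NumPy sketch of Strassen's algorithm for square matrices whose side length is a power of two; the `leaf` cutoff for falling back to ordinary multiplication is my own illustrative choice, not part of the algorithm.

```python
import numpy as np

def strassen(A, B, leaf=64):
    """Multiply two 2^k x 2^k matrices using 7 recursive multiplications instead of 8."""
    n = A.shape[0]
    if n <= leaf:  # small blocks: ordinary matrix multiplication
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    # Reassemble the four quadrants of the result.
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])
```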

We are primarily interested in category 2), but with one more aspiration: automating the process. The approach to this end requires two ingredients:

  • A language: a domain-specific language through which tensor programs can be specified.
  • A compiler: a tool that translates the high-level source code into an efficient low-level implementation.

Below are two instances of this idea.

Halide

Halide is a domain-specific language embedded in C++, proposed by Jonathan Ragan-Kelley and Andrew Adams in 2012.
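
Halide's central idea is that the algorithm (what to compute) is written separately from the schedule (how to compute it). As a rough illustration, here is a sketch of the classic two-stage blur; I write it against Halide's Python bindings for consistency with the TVM example below (the C++ embedding is analogous), and the exact binding API may differ slightly across versions.

```python
import halide as hl

x, y, xi, yi = hl.Var("x"), hl.Var("y"), hl.Var("xi"), hl.Var("yi")
inp = hl.ImageParam(hl.Float(32), 2, "inp")

# Algorithm: a horizontal then vertical 3-tap box blur.
blur_x, blur_y = hl.Func("blur_x"), hl.Func("blur_y")
blur_x[x, y] = (inp[x, y] + inp[x + 1, y] + inp[x + 2, y]) / 3.0
blur_y[x, y] = (blur_x[x, y] + blur_x[x, y + 1] + blur_x[x, y + 2]) / 3.0

# Schedule: tile the output, vectorize and parallelize, and interleave the
# producer stage with its consumer. Changing these lines changes performance
# but never the result.
blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y)
blur_x.compute_at(blur_y, x).vectorize(x, 8)
```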

Auto-schedulers

TVM

TVM is a language embedded in Python; its syntax is quite similar to that of TensorFlow and PyTorch.
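
For a flavor of the API, here is a minimal sketch using TVM's tensor-expression (`tvm.te`) interface, in which a matrix multiplication is declared once and scheduling decisions are applied separately, much as in Halide; the calls follow the classic `te` workflow and may differ in newer TVM releases.

```python
import tvm
from tvm import te

M = N = K = 1024
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")

# Declare what to compute: C[i, j] = sum_k A[i, k] * B[k, j]
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

# Declare how to compute it: tile both output loops, reorder, parallelize.
s = te.create_schedule(C.op)
io, ii = s[C].split(C.op.axis[0], factor=32)
jo, ji = s[C].split(C.op.axis[1], factor=32)
s[C].reorder(io, jo, ii, ji)
s[C].parallel(io)

func = tvm.build(s, [A, B, C], target="llvm")  # compile to native code
```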

Auto-schedulers

Differences between Halide and TVM

Language expressiveness difference

Halide programs are expressed as directed acyclic graphs (DAGs) of compute stages. TVM can ingest models from TensorFlow and PyTorch, so it also supports recurrent network architectures.
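
One way TVM expresses recurrence directly is through its scan operator. Below is a sketch in the `tvm.te` style that computes a cumulative sum over the rows of a matrix; the particular shapes and names are illustrative.

```python
import tvm
from tvm import te

m, n = te.var("m"), te.var("n")
X = te.placeholder((m, n), name="X")
state = te.placeholder((m, n), name="state")  # carries values across time steps

init = te.compute((1, n), lambda _, i: X[0, i])                       # step 0
update = te.compute((m, n), lambda t, i: state[t - 1, i] + X[t, i])   # recurrence
Y = te.scan(init, update, state, inputs=[X])

s = te.create_schedule(Y.op)
func = tvm.build(s, [X, Y], target="llvm")
```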

Auto-scheduler algorithms difference

  • The auto-scheduler in Halide schedules a program stage by stage, where a stage is a functional object. At each stage, it makes two types of decisions:
    • Cross-stage granularity: at which loop level of its consumer the stage is computed and stored, i.e., how much is inlined or recomputed versus materialized.
    • Intra-stage order: how the loops within the stage are tiled, reordered, vectorized, and parallelized.
  • The auto-scheduler in TVM divides the computation graph into independent operators (an operator is roughly the counterpart of a stage in Halide) and then optimizes each operator separately; a usage sketch follows this list.
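
As a concrete illustration of the TVM side, here is a sketch of tuning the matrix multiplication from above with TVM's auto-scheduler (Ansor), following the style of the official tutorials; option names and defaults vary across TVM versions, and the trial count and log file name are arbitrary choices of mine.

```python
import tvm
from tvm import te, auto_scheduler

@auto_scheduler.register_workload
def matmul(M, N, K):
    A = te.placeholder((M, K), name="A")
    B = te.placeholder((K, N), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

target = tvm.target.Target("llvm")
task = auto_scheduler.SearchTask(func=matmul, args=(1024, 1024, 1024), target=target)

# Search over candidate schedules, measuring each one on real hardware.
log_file = "matmul_tuning.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=64,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
task.tune(tune_option)

# Recover the best schedule found and compile it.
sch, args = task.apply_best(log_file)
func = tvm.build(sch, args, target)
```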