My personal thoughts are as follows.
Overall:
In terms of the results, some of the claims seem a bit exaggerated to me.
AlphaTensor did advance the state of the art on some cases of matrix multiplication.
However, these cases all appear to come from scenarios that few people have devoted much effort to. The emphasis on advancing a 50-year-old problem seems a little misleading: the problem AlphaTensor solves better is not really the same problem Strassen was solving 50 years ago.
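For context, the 49-multiplication baseline mentioned below comes from applying Strassen's 7-multiplication scheme for 2x2 blocks recursively (7 x 7 = 49 for 4x4). A minimal sketch of that scheme, checked against ordinary matrix multiplication:

```python
import numpy as np

def strassen_2x2(A, B):
    """Strassen's 1969 scheme: multiply two 2x2 matrices with 7 multiplications
    instead of 8. Applied recursively to 4x4 block matrices it gives 7*7 = 49
    multiplications, the baseline AlphaTensor improved on."""
    a, b, c, d = A[0, 0], A[0, 1], A[1, 0], A[1, 1]
    e, f, g, h = B[0, 0], B[0, 1], B[1, 0], B[1, 1]
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return np.array([[m1 + m4 - m5 + m7, m3 + m5],
                     [m2 + m4,           m1 - m2 + m3 + m6]])

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])
assert np.allclose(strassen_2x2(A, B), A @ B)
```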
In terms of the algorithm, things are more interesting, although the article only emphasizes that AlphaZero was used. After a deeper look, what I see is that it differs from AlphaZero in a number of places.
Further comments on the results:
1) The result on 4x4 matrices. This result is highlighted in the Abstract, which suggests that 4x4 is a case of great interest to domain experts. The article says that AlphaTensor reduced Strassen's 49 multiplications to 47. But after a careful reading, I realized that this is not a result over the real numbers, but over a finite field, i.e., modular arithmetic on {0, 1}. I believe many people might wonder the same as me: how important is 4x4 matrix multiplication mod 2 in practice?
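To make the distinction concrete, here is what multiplication over the finite field {0, 1} means: ordinary matrix multiplication with every sum and product reduced mod 2. A small NumPy illustration:

```python
import numpy as np

def matmul_mod2(A, B):
    """4x4 matrix multiplication over the field {0, 1}, i.e. arithmetic
    mod 2: the setting in which AlphaTensor's 47-multiplication result holds.
    Note e.g. that 1*1 + 1*1 = 0 in this field."""
    return (A @ B) % 2

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(4, 4))
B = rng.integers(0, 2, size=(4, 4))
print(matmul_mod2(A, B))  # all entries are 0 or 1
```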
2) The result on 4x5, 5x5 matrix multiplication. This result is indeed obtained over the real numbers. The best known solution was 80 multiplications, and AlphaTensor pushed it to 76. This looks like a more impressive result, but it was not highlighted in the Abstract. It seems that irregular cases like 4x5, 5x5 are ones few people pay attention to, so perhaps they have smaller practical value?
3) The result on 3x3. AlphaTensor obtained a result of 23 multiplications, which is not emphasized in this paper, probably because the same result was already published elsewhere last year. That paper uses a "SAT modeling + pure search" approach, which was applied only to the 3x3 case. It seems the SAT instance would become unmanageably large for 4x4 or larger matrices, though I have not yet checked the technical details of this approach. I just wonder: if more research effort and computational resources were put into this line of development, would it achieve the same or better results than AlphaTensor?
4) GPU/TPU runtime results. The paper does not seem to put much emphasis on this part. It simply shows that, if the optimization objective of AlphaTensor is modified to reduce real-hardware runtime, then by deploying AlphaTensor on top of a GPU/TPU compiler (e.g., XLA), they can achieve about a 20% improvement. It should be emphasized that the best matrix multiplication found for the GPU is not TPU-friendly, and vice versa. This is reasonable: in practice, the best runtime is obtained when the hardware architecture is best exploited for computation and data reuse, and GPUs and TPUs differ significantly in architecture design. The "runtime" objective does not match the "Strassen objective" either: as mentioned in the paper, the solution found by AlphaTensor does not contain fewer multiplications than Strassen's algorithm; it is just better suited to the TPU/GPU. Since the only modification they made is changing the reward function, I suspect that if more adjustments were made to the AlphaZero algorithm itself for the runtime objective, better results might be achieved.
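The reward-function change described above can be sketched in a few lines. This is only an illustration of the idea, not the paper's actual benchmarking setup (which measures compiled XLA kernels); the function name is mine:

```python
import time
import numpy as np

def runtime_reward(matmul_fn, A, B, repeats=10):
    """Hypothetical sketch of the modified objective: instead of rewarding
    fewer multiplications, reward the negative measured wall-clock runtime
    of a candidate matmul algorithm on the device at hand."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        matmul_fn(A, B)
        times.append(time.perf_counter() - t0)
    return -min(times)  # higher reward = faster kernel on this hardware

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
print(runtime_reward(np.matmul, A, B))
```

The same candidate algorithm would score differently on a GPU and a TPU, which is exactly why the best discovered algorithm is hardware-specific.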
Compilers are inherently designed to bridge the gap between software and hardware. In the same sense, machine learning compilers are primarily developed to target hardware for machine learning programs, e.g., deep neural networks.
Due to the success of deep neural networks and other machine learning applications, dedicated hardware devices have been developed for these programs. They might be collectively referred to as AI accelerators, as their major aim is to accelerate tensor program computations.
Below I list some chips that might be used for AI computation:
Dedicated hardware alone is insufficient for accelerating tensor program computations; software support is also needed. Over the past few years, many neural network frameworks have been developed, to name a few: TensorFlow, PyTorch, JAX, and Huawei's MindSpore.
These frameworks are all created under the software architecture stack above, in which the core question being addressed is how to translate high-level, hardware-agnostic code into low-level, efficient, hardware-aware code. In fact, this core task is largely identical however different these frameworks may look on the surface. Thus, deep learning compilers have every right to exist as standalone projects. Some of them are listed below.
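A toy illustration of that core translation task, using operator fusion, one classic transformation these compilers perform. The high-level style materializes a temporary array per operation; a compiler fuses the chain into a single pass over the data (the example is a pure-Python stand-in, not any framework's real output):

```python
import math

xs = [float(i) for i in range(10_000)]

# High-level, hardware-agnostic style: every op produces a temporary buffer.
def unfused(xs):
    t1 = [x * 2.0 for x in xs]          # temporary 1
    t2 = [x + 1.0 for x in t1]          # temporary 2
    return [math.sqrt(x) for x in t2]

# What a fusing compiler would effectively emit: one loop, no intermediates.
def fused(xs):
    return [math.sqrt(x * 2.0 + 1.0) for x in xs]

assert fused(xs) == unfused(xs)  # same result, one pass instead of three
```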
This specifically refers to the "intermediate" part of the software stack above.
XLA, developed by Google, originally for TensorFlow, but now also supporting PyTorch and JAX.
GLOW, developed by Facebook.
Tiramisu, developed by MIT.
TensorComprehensions by Facebook.
AutoTVM, as the name indicates, is developed within TVM, but it also supports many other front-ends, e.g., TensorFlow, PyTorch, etc.
Halide is a language embedded in C++ as well as a compiler. It was first created for accelerating image processing pipelines, but it can also be used for deep neural network models, given that the model specification has been converted into Halide's C++ language.
In summary, the differences of these tools are majorly reflected in the following aspects:
frontend language expressiveness. For example, Halide does not support cyclic computation graphs, e.g., LSTMs.
search space modelling. For example, Tiramisu models the scheduling space using the polyhedral model, while Halide models the search space as a decision tree.
algorithms for scheduling. For example, Tiramisu uses an ILP solver, AutoTVM uses evolutionary algorithms, and Halide uses tree search.
backend support
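To make the algorithmic differences above concrete, here is a toy evolutionary search over a single schedule parameter (a tile size), in the spirit of AutoTVM's approach. The cost function is synthetic (real systems measure compiled kernels on hardware), and 64 is the pretend optimum:

```python
import random

def cost(tile):
    # Synthetic stand-in for a hardware measurement; pretend 64 is optimal.
    return abs(tile - 64)

def evolve(generations=20, pop_size=8, seed=0):
    rng = random.Random(seed)
    # initial population of candidate tile sizes
    pop = [rng.choice([4, 8, 16, 32, 128, 256]) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[: pop_size // 2]       # keep the fittest half
        # mutate survivors by doubling or halving the tile size
        children = [max(1, p * 2 if rng.random() < 0.5 else p // 2)
                    for p in parents]
        pop = parents + children
    return min(pop, key=cost)

print(evolve())  # converges toward the synthetic optimum
```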
Besides the algorithmic differences, all the tools above rely on some format of intermediate representation, usually developed for that specific tool. It would be ideal if this aspect of the design could be standardized, so that different compilers could focus on developing intelligent compilation algorithms rather than engineering superficially different but in-theory equivalent designs. MLIR represents an effort toward unifying these intermediate representations.
The 2019 Halide auto-scheduler combines heuristic search and a neural network for scheduling tensor programs (even though neural nets are tensor programs themselves).
The Adams2019 auto-scheduler's algorithm can be summarized in the following diagram.
The TVM auto-scheduler shares a similar spirit with Halide's; see below:
Although the algorithmic details of these auto-schedulers are quite different, they share the same high-level structure:
These three aspects are exactly in line with the search-based RL framework of AlphaGo and AlphaZero. Naturally, further development on any of these three fronts can be expected to produce a better auto-scheduler. Indeed, this is what is happening in reality.
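The shared structure can be sketched as a loop: sample candidate schedules from the search space, rank them cheaply with a learned cost model, measure the most promising few on real hardware, and retrain the model on the measurements. Every component below is a stand-in, not any system's actual API:

```python
import random

def auto_schedule(sample, predict, measure, retrain, rounds=3, top_k=4):
    """Skeleton shared by the Halide-2019 and TVM auto-schedulers (sketch)."""
    dataset, best = [], None
    for _ in range(rounds):
        candidates = [sample() for _ in range(32)]
        candidates.sort(key=predict)            # cheap learned ranking
        for sched in candidates[:top_k]:        # expensive real measurement
            t = measure(sched)
            dataset.append((sched, t))
            if best is None or t < best[1]:
                best = (sched, t)
        retrain(dataset)                        # improve the cost model
    return best

random.seed(0)
best = auto_schedule(
    sample=lambda: random.randint(1, 256),      # a "schedule" = a tile size
    predict=lambda s: abs(s - 64),              # pretend learned cost model
    measure=lambda s: abs(s - 64) * 1e-3,       # pretend runtime in seconds
    retrain=lambda data: None,                  # no-op learner in this sketch
)
print(best)
```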
Yet one more important step we overlooked above is how the search space of schedules is defined. Clearly, these search spaces cannot be directly extracted from our sketchy natural-language description of the problem; there must be a formal definition by which the space of schedules can be generated and searched systematically (just as in games we need action definitions and rules governing action transitions). Indeed, in both Halide and TVM, defining the search space is an important task (possibly no less important than the algorithms for searching and learning) that requires yet another kind of algorithm for automatic modelling.
In my view, a search space modelling algorithm has to strike a trade-off between the following properties:
Once search space modelling is settled, i.e., the problem is formally defined, algorithm development can focus on heuristic search, machine learning, and the combination of both. The overall algorithmic framework is thus search-based reinforcement learning.
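A toy formal definition of such a search space, analogous to the action rules of a game: a schedule chooses a tile size per loop plus an unroll factor, subject to simple legality constraints. The parameters are illustrative, not taken from Halide or TVM:

```python
from itertools import product

TILE_SIZES = [8, 16, 32, 64]
UNROLL_FACTORS = [1, 2, 4]

def schedules(loop_extent=128):
    """Enumerate every legal schedule: tile sizes must divide the loop
    extent, and the unroll factor must not exceed the inner tile."""
    for ti, tj, u in product(TILE_SIZES, TILE_SIZES, UNROLL_FACTORS):
        if loop_extent % ti == 0 and loop_extent % tj == 0 and u <= ti:
            yield {"tile_i": ti, "tile_j": tj, "unroll": u}

space = list(schedules())
print(len(space))  # 48 legal schedules: finite and systematically searchable
```

Once the space is formal like this, it can be generated, searched, and scored mechanically, which is exactly the prerequisite for the search-and-learning algorithms above.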
First, I would summarize traditional problem-solving with the following diagram.
On the left is the raw problem, say from industry, that the problem-solver (e.g., an engineer or a group of them) is trying to solve. They propose/design solutions, and these solutions are intrinsically related to their understanding of the problem. They then try to improve the solutions while observing their effect. As their knowledge accumulates, continual refinement of the solutions is expected.
I would capsulize the AI-aided problem-solving paradigm as above. As before, the problem-solver receives a raw problem. The difference is that the problem-solver is now equipped with extensive knowledge of AI, which is useful in two respects:
While people usually pay great attention to the aspect of algorithm development, in my opinion the other aspect, problem modelling, is often overlooked. The reason I want to emphasize the role of problem modelling is simple: algorithms for the raw problem are built upon the surrogate formal problem model, so the capability of your algorithms is limited by the restrictions arising from the way the problem is formally defined. On the other hand, too general a problem model only creates extra difficulty for developing an algorithm specifically effective for the raw problem you are interested in.
Many sub-areas have seen relatively separate development since the concept of artificial intelligence was proposed. They are fundamentally motivated by different scientific inquiries. Some are motivated by the quest to understand the nature of knowledge, including how knowledge is represented, how it accumulates, and how it is reproduced.
Some are more concerned with learning, arguing that knowledge might be more conveniently considered a fuzzy object (e.g., it might just be represented by a large set of floating-point numbers). Some focus on the algorithmic procedures through which a well-defined goal can be achieved.
In the above diagram I listed a few paradigms in AI that could help with either problem modelling or problem-solving.
The remaining question is how we should achieve that. Below I group the methods into two categories:
We are primarily interested in category 2), but with one more aspiration: to automate the process. The approach to this end requires two ingredients:
Below are two instances of the above idea.
Halide is a C++-embedded language, proposed by Jonathan Ragan-Kelley and Andrew Adams in 2012.
OpenTuner: An extensible framework for program autotuning
By Jason Ansel et al., PACT 2014
Automatically Scheduling Halide Image Pipelines
By Mullapudi et al. SIGGRAPH 2016
Differentiable Programming for Image Processing and Deep Learning in Halide
By Tzu-Mao Li et al., SIGGRAPH 2018
Learning to Optimize Halide with Tree Search and Random Programs
By Andrew Adams et al., SIGGRAPH 2019
Efficient automatic scheduling of imaging and vision pipelines for the GPU
By Luke Anderson et al., Proceedings of the ACM on Programming Languages, 2021
TVM is a Python-embedded language; its grammar is very similar to TensorFlow and PyTorch.
AutoTVM:
Learning to Optimize Tensor Programs
By Chen et al., Neurips 2018
Ansor:
Ansor: Generating High-Performance Tensor Programs for Deep Learning
By Zheng et al., USENIX OSDI 2020
Halide supports DAG-like computational logic.
TVM is fully compatible with TensorFlow and PyTorch, so it also supports recurrent network architectures.
Halide schedules the pipeline stage by stage, where a stage is a functional object. At each stage, it makes two types of decisions.
TVM partitions the computation graph into operators (an operator is similar to a stage in Halide), then optimizes each operator separately.
The game of Hex was first described to the public by the polymath Piet Hein in 1942 under the name Polygon.
Piet Hein followed several principles to design the new game:
Fair: both players shall have an equal chance to win.
Progressive: game positions do not reappear.
Final: the game is guaranteed to end within a limited number of moves.
Comprehensible rules: beginners should be able to learn how to play quickly.
Strategic: the game must be versatile in its possibilities for combination.
Decisive: the game never ends in a draw.
Piet Hein's initial idea was a game on a quadrilateral board, where two players (like two armies on a battlefield) are each assigned a pair of opposite sides. Players alternately place their stones on square cells of the board, and the player who joins their two sides with a connected chain wins the game. Indeed, on a planar graph, the underlying topological property of Hex is also the basis of the four color theorem, which was proven in 1976. Piet Hein initially considered a board with square cells, but one quickly realizes that in such a game an undesired situation can arise: the two players can block each other, leaving no winner.
In 1942, Piet Hein finished the game after seeing that this mutual-blocking problem is solved if square cells are replaced with hexagonal ones. He chose an 11x11 diamond-shaped board, and the design was complete.
Middle picture from: https://en.wikipedia.org/wiki/Hex_(board_game)
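The winning condition (a chain of your stones joining your two sides) is easy to check mechanically with a disjoint-set (union-find) structure. A minimal sketch for one player on an n x n rhombus board, using two virtual nodes for that player's two sides; the board encoding is my own:

```python
class DSU:
    """Minimal union-find with path halving."""
    def __init__(self, n):
        self.p = list(range(n))
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]
            x = self.p[x]
        return x
    def union(self, a, b):
        self.p[self.find(a)] = self.find(b)

def connects_top_bottom(stones, n):
    """True iff the player's stones (a set of (row, col) cells) form a
    chain joining the top edge to the bottom edge of an n x n Hex board."""
    TOP, BOTTOM = n * n, n * n + 1              # virtual edge nodes
    dsu = DSU(n * n + 2)
    # the six hexagonal neighbours of a cell on a rhombus grid
    nbrs = [(-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0)]
    for (r, c) in stones:
        if r == 0:
            dsu.union(r * n + c, TOP)
        if r == n - 1:
            dsu.union(r * n + c, BOTTOM)
        for dr, dc in nbrs:
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < n and (rr, cc) in stones:
                dsu.union(r * n + c, rr * n + cc)
    return dsu.find(TOP) == dsu.find(BOTTOM)

# a vertical chain on a 3x3 board wins for the top-bottom player
print(connects_top_bottom({(0, 0), (1, 0), (2, 0)}, 3))  # True
```

The hexagonal adjacency is what rules out the mutual-blocking draws of the square-cell design: exactly one player can ever complete a crossing.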
Six years later, in 1948, John Nash at Princeton reinvented the game in a grid format where stones are placed at intersections, similar to the game of Go. David Gale soon converted it into the more natural hexagonal style.
Nash then proved, by a strategy-stealing argument, that from the empty board there exists a winning strategy for the first player. Yet the game-theoretic value of each individual opening is unknown. As such, a swap rule was introduced: on the second player's first turn, the second player has the option to steal the first player's first move (equivalent to swapping colors).
Hex has extremely simple rules, but this does not make the game simple to play! Just watch a few games played by two strong computer Hex players: can you follow the strategies?
Visit Prof. Ryan Hayward’s homepage for more about the game of Hex.
Solving an arbitrary Hex position is PSPACE-complete, since generalized Hex is equivalent to Quantified SAT (QSAT).
QSAT is a fundamental problem believed to be strictly harder than NP-complete problems.
Computing a Nash equilibrium in a general two-player game is PPAD-complete (for the two-player zero-sum case, an equilibrium can be computed in polynomial time via linear programming).
Solving Hex
Playing Hex