# DEVELOPING A BLAS LIBRARY FOR THE AMD AI ENGINE

Tristan Laan, Tiziano De Matteis (<u>t.de.matteis@vu.nl</u>) VU Amsterdam

H2RC – November 22, 2024



### **CAMBRIAN EXPLOSION OF "AI" HARDWARE**





Source: Cerebras

TILE O TILE 1

TILE 2

DRAM (TBs) Host Scale-Out

Virtual Memory Manager

PCle

Source: AMD /Xilinx



Source: Tenstorrent



Cardinal SN10: Tile 321 Fabric Router 1 32b Compute

Source: Cerebras





T Comp

#### Source: Sambanova

GE DRAM

Source: Tenstorrent

Intelligent Engines

Each has its own characteristics...but they are all **spatial architectures** that can be programmed using a dataflow approach



UNIVERSITEIT



### CAMBRIAN EXPLOSION OF "AI" HARDWARE



@Large Research Massivizing Computing Systems http://atlarge.science

#### THE SW ECOSYSTEMS ... FOR ML USERS

Manufacturers provide integration with high-level ML programming frameworks (Pytorch/TensorFlow) and pre-trained models ready-to-use

@Large Research Massivizing Computing Systems http://atlarge.science



Source: Cerebras





### THE SW ECOSYSTEMS ... FOR NON-ML USERS

@Large Resea Massivizing Computing Sys Business logic
AMD Versal AI Engine

For all the rest, we have to rely on lower-level APIs. For instance vector add on AMD AI Engine:

| <pre>def vector_add(x, y):     return x+y</pre>                                                                                                     | <pre>class simpleGraph : public graph {     private:         kernel vadd; input_plio x, y; output_plio out;     public:</pre> |
|-----------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
| <pre>void vector_add(input_window<int32> input_window<int32> *y, output_wind int N) {   for (unsigned i = 0; i &lt; N / 16; +</int32></int32></pre> | <pre>*x,<br/>ow<int32> *out,</int32></pre>                                                                                    |

## A total of 300+ lines of code to simply add two vectors. This requires knowledge about the HW, the API, how to optimize, ...

```
#include <his_stream.h>
#include <ap_axi_sdata.h>
                                                                       ...and:
extern "C"{
                                                                             Configuration
   void mm2s(ap_int<32> *mem, int size, hls::stream<qdma_axis<32, 0,</pre>
                                                                         0, 0 >> \&s {
                                                                             Placement info
                                                                         #pragma HLS INTERFACE m_axi port = mem offset = slave bundle = gmem
                                                                             Compilation files
#pragma HLS interface axis port = s
                                                                         #pragma HLS INTERFACE s_axilite port = mem bundle = control
                                                                         5
                                                                              . . .
....
```

#### How to avoid an "Hardware Lottery"

Despite the promise of massive parallelism, the scientific and HPC communities have yet to systematically explore the use of spatial devices in areas other than ML

We need proper programming abstractions, open-source libraries of reusable components, and guidelines to democratize access to ML Accelerators

Our work-in-progress project is **AIEBLAS**, a an open-source implementation of Basic Linear Algebra Routines (BLAS) for the AMD AI Engine (AIE) spatial architecture. Our goals are:



- To facilitate the rapid developments of numerical applications
- Leveraging the hardware unique characteristics
- Offering a library that can be easily extend





### DENSE NUMERICAL ALGEBRA ON AMD AI ENGINES

The AMD AI Engines are being offered in data center acceleration card, and commodity CPUs<sup>1</sup>

Let's consider a VCK5000 (data center card):

- An array of 8x50 AIEs (400 in total):
  - Each one with 32 KB local memory and a VLIW vector processor
  - Each can communicate with neighbours or non-neighbour AIEs
- Programmable Logic (PL) for custom hardware

The AIEs can be programmed using the Adaptive Dataflow (ADF) API, the application is represented by a dataflow graph of kernels scheduled one the AIEs.



<sup>&</sup>lt;sup>1</sup><u>https://www.amd.com/en/technologies/xdna.html</u>

#### AIE-BLAS: A BLAS LIBRARY FOR THE AMD AI ENGINE (WIP)

**AXPYDOT:**  $\beta = z^T u$  where  $z = w - \alpha v$ 





#### AIE-BLAS: A BLAS LIBRARY FOR THE AMD AI ENGINE (WIP)



Run on Targeted Device



#### **FIRST RESULTS**

Comparison VCK5000 against OpenBLAS running on a 2x10-cores Xeon Silver 4210R.





Vector size

AXPYDOT

Similar behavior with GEMV

@Large Research Massivizing Computing Systems

http://atlarge.science



#### FIRST RESULTS: W/O MEMORY ACCESSES

Comparison VCK5000 against OpenBLAS running on a 2x10-cores Xeon Silver 4210R.



#### Accessing off-chip memory is non trivial



11

UNIVERSITEIT

#### LEVERAGING MULTIPLE AIEs

Each PL-AIE connection supports 4 GB/s (312 connections in total)



#### **Multi-AIEs implementation is necessary**







#### **AIE-BLAS: NEXT STEPS**

Write good code for spatial accelerator is complicated also for experts. Next steps:

- Support for multi-AIE implementation
- Optimize memory accesses and tiling
- Favor dataflow composability
- Coverage
- Considering to other spatial architectures

A BLAS porting for AMD AI Engine architecture, to **facilitate the development of numerical applications** 

https://github.com/atlarge-research/AIE-BLAS







# **THANK YOU!**



