

## What should be used for Reconfigurable HPC, FPGA or Coarser-Grain Reconfigurable Architecture?

### **Kentaro Sano**

Leader, Processor Research Team; Leader, Advanced AI Device Development Unit; Leader, Architecture Research G in Feasibility Study for FugakuNEXT; RIKEN Center for Computational Science (R-CCS)




### PROCT Processor Research Team, Advanced AI Device Development Unit

### **Goal: Establish HPC & AI architectures suitable in Post-Moore Era**



Invited Talk, H2RC @ SC24

### **This Talk**

- What should be used for Reconfigurable HPC, FPGA or CGRA?
- FPGA-based HPC
  - ESSPER: Elastic and scalable FPGA-cluster system for high-performance reconfigurable computing, a prototype FPGA cluster for HPC
  - FPGA-based decoder for quantum error correction (in progress)
- CGRA (coarse-grained reconfigurable array) for HPC
  - RIKEN CGRA research
  - Architectural exploration example for HPC workloads






[https://top500.org]

Nov 22, 2024

### What Eats Power?

#### Data movement rather than computing

We should remove unnecessary data movement and shorten what remains.

#### Unsuitable architecture, resulting in low efficiency and scalability

- Von Neumann architectures (CPU & GPU) cannot scale efficiently due to
  - memory-bottlenecked structure, such as register files and LLC slices distributed over a NoC for multiple cores
  - extra mechanisms that consume power just to increase IPC, such as out-of-order execution, branch predictors, and thread schedulers

#### Recent semiconductor scaling cannot save it

Power improvement per generation is limited, even though transistors per area can still increase at advanced technology nodes such as 4, 2, and 1.5 nm.

#### **Communication Dominates Arithmetic**





### **Custom Data-Flow Computing**

### Data-flow computing

- Localized data movement
- Lower pressure on memory access, thanks to highly pipelined computing on regular data streams
- No extra mechanisms for non-computing tasks

### Customization & reconfiguration

- Higher efficiency by specialization
- Programmability for various problems

What technology is suitable for custom data-flow computing? FPGA?





#### 1. Advancement of Fugaku

Research on Functional extension with FPGAs
 (FPGA cluster development, specialized hardware for HPC)





## Experimental FPGA Cluster connected with Supercomputer Fugaku

### Open-Access paper









#### Goal : Design & demonstrate a proof-of-concept FPGA cluster for HPC research

- **ESSPER**: Elastic and scalable FPGA-cluster system for high-performance reconfigurable computing
- Contributions
  - Design concept of FPGA cluster for HPC
  - Classification of FPGA cluster architectures
  - Proposed system stack with software-bridged APIs
  - Implementation and evaluation for FPGA-based extension of the world's top-class supercomputer, Fugaku












### **Architecture of ESSPER**



#### Productive customizability

- No OpenCL (so as not to limit the computing model)
- FPGA Shell & HLS/HDL programming, where any hardware can be easily implemented

#### Performance scalability

FPGA Shell supporting high-bandwidth and low-latency network dedicated to FPGAs

#### Interoperability

Software-bridged driver and APIs to access FPGAs remotely through host-FPGA bridging network





Elastic and Scalable System for High-Performance Reconfigurable Computing

## **System Design**



### Hardware Organization of ESSPER



**FPGA Shell** 

### **Two Types of Networks**

|                 | Direct network | Indirect network |
|-----------------|----------------|------------------|
| Characteristics | **p2p connection without switches**; typical: torus network | **connection with switches**; typical: Ethernet |
| Switching       | **circuit** or packet (w/ on-chip router) | packet |
| Pros            | **low latency**, easy to use with simple HW | **flexibility**, small diameter, easy adoption of **cutting-edge** |
| Cons            | large diameter, inflexibility in resource allocation | higher latency due to packet processing, *complex and difficult* to use |
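The "large diameter" versus "small diameter" contrast in the table can be made concrete with a small sketch. It assumes a torus whose side lengths are given in `dims` and, for the indirect case, a single switch that every FPGA attaches to. Both are simplifying assumptions for illustration, not the ESSPER configuration.

```python
from math import prod


def torus_diameter(dims):
    """Worst-case hop count between two nodes of a torus with the given side lengths."""
    # With wrap-around links, the farthest node on a ring of k nodes is k // 2 hops away.
    return sum(k // 2 for k in dims)


def single_switch_diameter():
    """FPGA -> switch -> FPGA: two links regardless of node count (one-switch assumption)."""
    return 2


for dims in [(4, 4), (8, 8), (16, 16)]:
    print(f"{prod(dims):4d} FPGAs: torus {dims} diameter = {torus_diameter(dims)} hops, "
          f"single switch = {single_switch_diameter()} hops")
```

The torus diameter grows with the side length while the switched path stays constant, which is the trade-off the table states; per-hop latency, the direct network's advantage, is not modelled here.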



### **FPGA Shells for Direct and Indirect Networks**

#### **Direct connection network (DCN)**

#### **Indirect network (VCSN)**



Tomohiro Ueno, Atsushi Koshiba, Kentaro Sano, "Virtual Circuit-Switching Network with Flexible Topology for High-Performance FPGA Cluster," Procs. of ASAP, pp.41-48, 2021.









## Applications, Joint Research Projects



### **Projects with ESSPER (Selected)**








### **Multi-FPGA Application using Direct Connection Network**

- Stream computing of fluid simulation with multiple FPGAs
  - Lattice Boltzmann method (LBM)
  - Extended pipeline with FPGAs connected in a ring
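One way to picture an extended pipeline across ringed FPGAs is to cascade time-step stages so that a single pass of the data stream advances many time steps. The sketch below is a toy model under stated assumptions: `stencil_stage` is a 1D 3-point average standing in for the LBM cell update, and `ring_pipeline`, `num_fpgas`, and `stages_per_fpga` are hypothetical names, not taken from the talk.

```python
def stencil_stage(stream):
    """One pipeline stage = one time step of a toy 3-point average on a 1D stream.

    Stand-in for an LBM cell update; boundary cells pass through unchanged.
    """
    it = iter(stream)
    prev = next(it, None)
    cur = next(it, None)
    if prev is None:
        return
    yield prev  # left boundary
    if cur is None:
        return
    for nxt in it:
        yield (prev + cur + nxt) / 3.0
        prev, cur = cur, nxt
    yield cur  # right boundary


def fpga(stream, stages_per_fpga):
    """Model one FPGA as a chain of cascaded time-step stages."""
    for _ in range(stages_per_fpga):
        stream = stencil_stage(stream)
    return stream


def ring_pipeline(data, num_fpgas, stages_per_fpga):
    """Stream the data once through all FPGAs; each FPGA's output feeds the next."""
    stream = iter(data)
    for _ in range(num_fpgas):
        stream = fpga(stream, stages_per_fpga)
    return list(stream)


cells = [0.0] * 7 + [9.0] + [0.0] * 8
# 4 FPGAs x 2 stages: one pass over the stream advances 8 time steps.
print(ring_pipeline(cells, num_fpgas=4, stages_per_fpga=2))
```

Each stage keeps only a three-element window, so intermediate time steps never go back to memory; adding FPGAs to the ring deepens the pipeline without raising memory traffic per pass.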





### Performance of 2D LBM with 100Gbps Ring NW

Computational performance (FLOPS) when processing about 2GB data





### Quantum Error Correction with FPGAs & MOONSHOT

#### Fault-tolerant quantum computers (FTQC) using quantum error correction (QEC)

- Need to solve the minimum-weight perfect matching (MWPM) problem
- Ultimately need to encode 1,000 logical qubits using 1M physical qubits
- Scalability and low latency (< 10 µs) are required
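The qubit counts above imply a budget of about 1,000 physical qubits per logical qubit. As a rough sketch, assume a rotated surface code layout, which uses d² data qubits and d² − 1 measurement qubits per logical qubit at code distance d. The layout is an assumption made here for the arithmetic; the slide does not state which layout is targeted.

```python
def physical_qubits_per_logical(d):
    """Rotated surface code (assumed layout): d*d data qubits plus d*d - 1 measurement qubits."""
    return 2 * d * d - 1


def max_odd_distance(budget):
    """Largest odd code distance whose qubit count fits within the per-logical-qubit budget."""
    d = 1
    while physical_qubits_per_logical(d + 2) <= budget:
        d += 2
    return d


budget = 1_000_000 // 1_000  # physical qubits available per logical qubit
d = max_odd_distance(budget)
print(f"budget = {budget}, largest odd distance = {d}, "
      f"qubits used = {physical_qubits_per_logical(d)}")
```

Under this assumption the decoder has to handle code distances in the low twenties, which gives a sense of the syndrome-graph size behind the scalability requirement.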

#### Goal


- Explore scalable QEC hardware algorithm and system
- Demonstrate for proof-of-concept



**Classical computers** 

### **Surface Code for Quantum Error Correction**



Surface code with code distance of d = 5 (single logical qubit)

Parity measurement for data errors (same for X and Z, respectively)



### **Surface Code for Quantum Error Correction**



Data qubits with errors

Syndrome Measurement

**Decoding graph** 

(What we know as syndrome)

#### How can we know where the errors exist?
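A minimal sketch of why the syndrome alone does not pinpoint the errors. It models each data qubit as an edge between the two parity checks that measure it, and a check fires when an odd number of its data qubits are in error. The coordinates and the `syndrome` helper are illustrative assumptions; code boundaries, the X/Z distinction, and measurement errors are ignored.

```python
from collections import Counter


def syndrome(error_edges):
    """Return the set of check nodes whose measured parity is flipped.

    Each errored data qubit is given as an edge (check_a, check_b).
    """
    hits = Counter()
    for a, b in error_edges:
        hits[a] += 1
        hits[b] += 1
    return {node for node, count in hits.items() if count % 2 == 1}


# A single error between checks (1, 1) and (1, 2): both checks fire.
single = [((1, 1), (1, 2))]
# A chain of two adjacent errors: the shared check (1, 2) sees two errors and stays quiet.
chain = [((1, 1), (1, 2)), ((1, 2), (1, 3))]

print(sorted(syndrome(single)))
print(sorted(syndrome(chain)))
```

Only the endpoints of an error chain show up in the syndrome, so different chains with the same endpoints are indistinguishable and the decoder has to infer the most likely one.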




### Minimum-Weight Perfect Matching in Syndrome Graph





#### Minimum-weight perfect matching (MWPM)

Syndrome graph with weight per edge (weight ≈ Manhattan distance)

Decoding graph (what we know as syndrome)
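To make the matching step concrete, here is an exhaustive MWPM over a handful of syndrome nodes with Manhattan-distance edge weights. It only illustrates the problem definition: it runs in exponential time, ignores boundary nodes, and is not the hardware algorithm from the talk. Production decoders typically use blossom-style algorithms or approximations.

```python
def manhattan(a, b):
    """Edge weight between two syndrome nodes given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def min_weight_perfect_matching(nodes):
    """Exhaustive MWPM over an even number of nodes.

    Returns (total_weight, list_of_pairs). Illustration only: exponential time.
    """
    if len(nodes) % 2 != 0:
        raise ValueError("perfect matching needs an even number of nodes")
    if not nodes:
        return 0, []
    first, rest = nodes[0], nodes[1:]
    best = None
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        weight, pairs = min_weight_perfect_matching(remaining)
        weight += manhattan(first, partner)
        if best is None or weight < best[0]:
            best = (weight, [(first, partner)] + pairs)
    return best


defects = [(0, 0), (0, 1), (3, 3), (4, 3)]
weight, pairs = min_weight_perfect_matching(defects)
print(weight, pairs)
```

Here the two nearby pairs are matched for a total weight of 2, while the alternative pairings each cost 12; the matched pairs mark the endpoints of the most likely error chains.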




### **Decoding Results for Most likely Errors**

Need to handle 3D lattice for measurement errors.

- Minimum-weight perfect matching (MWPM)
- MWPM paths in decoding graph
- Most likely errors of data qubits in the paths



### Hardware Design for Syndrome Subgraph Decoder





### Lessons Learned with ESSPER




- FPGA-based reconfigurable computing works.
- Productivity is not high, especially for multiple FPGAs.
  - Even HLS requires optimization know-how.
  - Debugging tools and simulation environments are lacking.
- High scalability, but floating-point performance is lower than GPUs in major domains.
  - FPGAs have higher overhead (area, power, frequency) and lower memory bandwidth.
    - Sometimes the FPGA's resource balance does not fit the requirements (e.g., insufficient on-chip RAM).
  - Customization with FPGAs can give a better solution for some specific requirements.
    - E.g., non-numerical, low-latency processing for quantum error correction
- Reconfigurable data-flow computing itself should be fine, but how can we make it a first-class citizen in HPC?



#### 2. Exploration of new HPC & AI architectures

- Research on reconfigurable accelerator (e.g. CGRA)
- Research on next-generation AI chip architecture



## **Coarse-Grained Reconfigurable Array for HPC** (and AI)



### Coarse-Grained Reconfigurable Array (CGRA)

- Architecture for data-driven computing
  - Composed of an array of processing elements (PEs), onto which we map data-flow graphs (DFGs) for computing
  - Provides word-level reconfigurability (e.g., 32-bit)
  - Higher energy efficiency than FPGAs (which are bit-level)
  - Performance close to ASIC-based accelerators
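As a rough illustration of mapping a DFG onto PEs and executing it in a data-driven way, the sketch below describes y = a·x + b as a two-node graph, places the nodes on hypothetical PE coordinates, and fires each node once its operands are available. The `DFG`, `PLACEMENT`, and `run_dataflow` names are assumptions for this example, not the RIKEN CGRA toolchain.

```python
import operator

# DFG for y = (a * x) + b. Each node maps to (operation, operand names).
DFG = {
    "t": (operator.mul, ("a", "x")),
    "y": (operator.add, ("t", "b")),
}

# Hypothetical placement of DFG nodes on a PE array: node -> (row, col).
PLACEMENT = {"t": (0, 0), "y": (0, 1)}


def run_dataflow(dfg, inputs):
    """Fire each node as soon as all of its operands are available."""
    values = dict(inputs)
    pending = dict(dfg)
    while pending:
        ready = [n for n, (_, srcs) in pending.items() if all(s in values for s in srcs)]
        if not ready:
            raise ValueError("DFG has a cycle or a missing input")
        for node in ready:
            op, srcs = pending.pop(node)
            values[node] = op(*(values[s] for s in srcs))
    return values


# Stream several x values through the same configured graph.
for x in (1.0, 2.0, 3.0):
    result = run_dataflow(DFG, {"a": 2.0, "x": x, "b": 1.0})
    print(f"x = {x}: y = {result['y']} (computed on PE {PLACEMENT['y']})")
```

There is no program counter or instruction fetch in this model: the configuration fixes what each PE computes and data arrival alone triggers execution, which is where the efficiency argument for data-flow architectures comes from.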

- Application area of CGRAs
  - Traditionally targeted at low-power embedded applications, e.g., image processing
  - Recently expected to serve high-performance deep learning
- Questions
  - Are CGRAs also promising for HPC?
  - What architecture and design decisions does HPC require?

[1] Liu, Leibo, et al. "A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications." ACM Computing Surveys (CSUR) 52.6 (2019): 1-39.
 [2] Takuya Kojima, et al., "Exploration Framework for Synthesizable CGRAs Targeting HPC: Initial Design and Evaluation," Procs. CGRA4HPC, May 30-June 3, 2022.







#### General structure of the CGRAs [2]

### **HPC Performance Requirement in Roofline Model**

### Roofline model

- Peak attainable performance as a function of *arithmetic intensity*
- Memory-bound or compute-bound
- Streaming processors can cover memory-bound applications.
- What architecture should be applied to compute-bound applications?
  - Higher compute density
  - Higher performance per power


#### **Roofline model for different performance characteristics**



## **Exploration of Trade-offs Between General-Purpose and Specialized Processing Elements in HPC-Oriented CGRA**

Emanuele Del Sozzo, Xinyuan Wang, Boma Adhi, Carlos Cortes, Jason Anderson, Kentaro Sano (Presented at IPDPS'24)



### **RIKEN Baseline CGRA**

- HPC-oriented CGRA with the following design philosophy
  - ✓ Modular design for design space exploration with various configurations and sizes
  - ✓ **Isolation** between computation and memory access
  - ✓ Floating-point operation capability for HPC

[Figure: block diagram of the RIKEN baseline CGRA. Recoverable labels: MM; address generator for 3-nested loops; loop info; offset & command for ex; FIFO; control; data from / to switch block]
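The diagram mentions an address generator for 3-nested loops, which lets memory access run separately from computation. A minimal software model of such a generator is sketched below; the `base`, `counts`, and `strides` parameters are assumptions about what the loop information might contain, not the actual hardware interface.

```python
def address_generator(base, counts, strides):
    """Yield addresses for a 3-nested loop: base + i*s0 + j*s1 + k*s2."""
    n0, n1, n2 = counts
    s0, s1, s2 = strides
    for i in range(n0):
        for j in range(n1):
            for k in range(n2):
                yield base + i * s0 + j * s1 + k * s2


# Row-major walk over a 2 x 3 x 4 array of words: addresses 0..23 in order.
dense = list(address_generator(base=0, counts=(2, 3, 4), strides=(12, 4, 1)))
print(dense)

# A 2 x 3 tile inside a row of width 8, starting at word 10.
tile = list(address_generator(base=10, counts=(1, 2, 3), strides=(0, 8, 1)))
print(tile)
```

With the loop bounds and strides configured up front, the generator can feed a FIFO with a regular stream, and the PEs never compute addresses themselves.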

### **Heterogeneity for HPC**

#### Extend CGRA with different types of PEs

- Basic PE: Add, Sub, Mul, FMA
- Complex PE: Exp, Log, Sqrt, Div
- Full PE: all of the above
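The PE types above can be treated as operation sets, which makes it easy to check where a given operation may be placed and how many PEs of each kind a kernel needs. The operation list in the example is a made-up kernel for illustration, and the helper names are hypothetical.

```python
BASIC_OPS = {"add", "sub", "mul", "fma"}
COMPLEX_OPS = {"exp", "log", "sqrt", "div"}
PE_TYPES = {
    "basic": BASIC_OPS,
    "complex": COMPLEX_OPS,
    "full": BASIC_OPS | COMPLEX_OPS,
}


def candidate_pe_types(op):
    """PE types that are able to execute the given operation."""
    return sorted(name for name, ops in PE_TYPES.items() if op in ops)


def pe_demand(dfg_ops):
    """Count how many operations need basic versus complex capability."""
    demand = {"basic": 0, "complex": 0}
    for op in dfg_ops:
        if op in BASIC_OPS:
            demand["basic"] += 1
        elif op in COMPLEX_OPS:
            demand["complex"] += 1
        else:
            raise ValueError(f"unsupported operation: {op}")
    return demand


kernel_ops = ["mul", "fma", "fma", "sqrt", "div"]  # hypothetical kernel
print(candidate_pe_types("fma"))
print(candidate_pe_types("sqrt"))
print(pe_demand(kernel_ops))
```

If most kernels need far more basic than complex operations, an array made only of full PEs spends area on units that sit idle, which is the motivation for mixing PE types.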









### **Chip Floorplans of Heterogeneous CGRA**



[Figure: chip floorplans of the heterogeneous CGRA, with PE coordinates from 0,0 to 7,7: 1:1 cluster floorplan, 3:1 cluster floorplan, 7:1 column floorplan, and 7:1 cluster floorplan]
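Assuming the floorplan labels 1:1, 3:1, and 7:1 denote the ratio of Basic PEs to Complex PEs (an interpretation, since the slide text does not spell it out), the PE counts on an 8×8 array follow directly:

```python
def pe_counts(rows, cols, basic_parts, complex_parts):
    """Split a rows x cols PE array according to a basic:complex ratio."""
    total = rows * cols
    unit = basic_parts + complex_parts
    if total % unit != 0:
        raise ValueError("array size is not divisible by the ratio")
    return total * basic_parts // unit, total * complex_parts // unit


for basic_parts, complex_parts in [(1, 1), (3, 1), (7, 1)]:
    basic, complex_ = pe_counts(8, 8, basic_parts, complex_parts)
    print(f"{basic_parts}:{complex_parts} -> {basic} basic PEs, {complex_} complex PEs")
```

Under this reading, the 7:1 configuration leaves only one column's worth of complex PEs, which is consistent with a "column" floorplan being one of the options shown.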



### Larger Area for Basic PE in Cluster Floorplan

[Figure: PE cluster layouts with Basic PE and Complex PE regions, annotated with utilization values (98.1%, 73%, 67.2%, 64%, 62%, and 29%)]

#### Lower utilization due to region constraint









### Summary

**Reconfigurable data-flow computing should be promising for power-efficient HPC.** 

- FPGAs are suitable for domain-specific computing.
  - ✓ ESSPER: FPGA cluster testbed
  - ✓ Quantum error correction
- CGRA should be better for general HPC.
  - ✓ RIKEN CGRA for HPC and AI
  - Need engineering work and compiler development

#### **Future work**

- ESSPER2 with Altera Agilex-M FPGA (mainly for Quantum research)
- ✓ SoC design of CGRA for HPC and AI (preparation for future ASIC)
- Have more collaboration!



#### Hiring researchers, Contact me!







# Thank you!