"Programming Reconfigurable Heterogeneous Computing Clusters Using MPI With Transpilation"
Burkhard Ringlein, IBM Research Zurich
"Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of Selected Benchmarks of the HPCChallenge Benchmark Suite"
Marius Meyer, Paderborn University
"Exploring the acceleration of Nekbone on reconfigurable architectures"
Nick Brown, EPCC at the University of Edinburgh
"Fast, Scalable Quantized Neural Network Inference on FPGAs with FINN and LogicNets"
Yaman Umuroglu, Xilinx
Mixing machine learning into high-performance applications needs co-designed solutions to meet the throughput and latency requirements. Quantized Neural Networks (QNNs) combined with custom FPGA dataflow implementations offer a good balance of performance and flexibility, but building such implementations by hand is difficult and time-consuming. In this talk, we will introduce FINN, an open-source experimental framework by Xilinx Research Labs to help the broader community explore QNN inference on FPGAs. Providing a full-stack solution from quantization-aware training to bitfile, FINN generates high-performance dataflow-style FPGA architectures customized for each network. We will also introduce LogicNets, the newest member of the FINN ecosystem. Through circuit-network co-design, LogicNets enables nanosecond-latency inference with performance in the hundreds of millions of samples per second.
"Rapid System Level Design and Evaluation of Near Memory Fixed Function Units: A Reconfigurable Computing Application"
Maya Gokhale, Lawrence Livermore National Laboratory
Scaling limits have spurred innovation in heterogeneous accelerator design from wafer scale down to IoT, leading to a proliferation of specialized circuits targeted to particular application domains. Energy reduction concerns have motivated a concurrent interest in bringing compute closer to memory and storage. Design choices abound in this new landscape of computing devices. However, the speed with which computer architects can explore the design space to yield quantitative, actionable system-level evaluation remains a formidable barrier. Software simulation offers the highest flexibility, and simulations are relatively fast to build, but slow simulation speed can severely limit the scale and variety of experiments. Hardware emulators, often FPGA-based, have traditionally been used by ASIC and processor design companies, but require considerable investment in capital and labor.
Our research is motivated by the desire to study architectures targeting sparse, irregular memory access patterns. Algorithms that exhibit poor data locality can't take advantage of specialized compute units. Thus we focus on near-memory, configurable, fixed-function IP blocks designed for irregular access. Our team has pursued an approach that eliminates or reduces time-consuming software simulation while avoiding massive hardware designs that require tens (or more) of FPGA boards. We exploit System on a Chip FPGAs that combine hard processors with configurable logic: the software infrastructure and much of the application run on the hard processors while the IP blocks are mapped onto FPGA logic. With careful clock management, this approach enables quantitative design space exploration with orders-of-magnitude speedup over software simulation using only a low-cost FPGA development board.
While our open-source Logic in Memory Emulator (https://github.com/LLNL/lime) has led to significant insights, the approach is nevertheless not a universal solution. Discovering what can and can't be learned through hybrid emulation has suggested, and continues to suggest, new ways of combining the speed of FPGA emulation with the convenience of software approaches.
"FPGA Acceleration of Fluid-Flow Kernels"
Ryan Blanchard, University of Florida
"FPGA Fabric is Eating the World"
Steve Casselman, CommaCorp
FPGAs have historically incorporated hardened features when LUTs alone have not been efficient or fast enough for the market. These features have been woven into the beautiful fabric we call modern FPGAs. This talk is about the past, present and future of FPGA fabric as seen through the eyes of Steve Casselman. Some of the topics addressed will be: Why are FPGAs great for computing? Why is FPGA fabric eating the world? What's the best way to program FPGAs? Where will FPGAs dominate in HPC/Datacenter computing and why?
Steve Casselman first conceived of compiling high level languages into hardware in 1987. He has been working ever since to build the fastest computer in the world based on FPGA fabric.
"FPGAs-as-a-Service Toolkit (FaaST)"
Dylan Rankin, MIT
"FPGA programming and the oneAPI industry initiative"
Michael Kinsner, Intel
For decades, field programmable gate arrays (FPGAs) have been programmed primarily through register transfer level (RTL) languages and other low-level tooling. Many higher-level language and high-level synthesis solutions have been developed based on C-like languages, and have met with varying levels of success.
This talk will cover some of the challenges with high-level design and sequential languages on spatial architectures, and will summarize a number of the resultant language constructs and controls that are particularly useful when programming FPGAs from high-level sequential languages. An overview will be provided of the oneAPI industry initiative, and also of Data Parallel C++ (DPC++), which is based on C++ and extends the Khronos SYCL specification. An example will be shown to highlight that kernels optimized for specific classes of accelerators (including FPGA) can be written within DPC++ and SYCL, while maintaining most of the system-level code to reduce development effort and learning curves. A call will be made for broader input and participation in the definition of cross-vendor, cross-industry solutions that enable FPGAs and other accelerators in the heterogeneous compute ecosystem.
"AIgean: An Open Framework for Machine Learning on a Heterogeneous Cluster"
Paul Chow, University of Toronto
AIgean, pronounced like the sea, is an open framework to build and deploy machine learning (ML) algorithms on a heterogeneous cluster of devices (CPUs and FPGAs). We leverage two open-source projects: Galapagos, for multi-FPGA deployment, and hls4ml, for generating machine learning kernels synthesizable using Vivado HLS. AIgean provides a full end-to-end multi-FPGA/CPU implementation of a neural network. The user supplies a high-level neural network description, and our tool flow is responsible for synthesizing the individual layers, partitioning layers across different nodes, and providing the bridging and routing required for these layers to communicate. If the user is an expert in a particular domain and would like to tinker with the implementation details of the neural network, we define a flexible implementation stack for ML that includes the layers of Applications & Algorithms, Cluster Deployment & Communication, and Hardware. This allows the user to modify specific layers of abstraction without having to worry about components outside of their area of expertise. We demonstrate the effectiveness of AIgean with three use cases: a small network running on a single network-connected FPGA, an autoencoder running on three FPGAs, and ResNet-50 running across twelve FPGAs.
"OpenCL-enabled Parallel Raytracing for Astrophysical Application on Multiple FPGAs with Optical Links"
Norihisa Fujita, University of Tsukuba