Keynote 1: FPGAs in AWS and First Use Cases (joint talk by AWS, NGCodec, and Xilinx)
Amazon focuses on the development and deployment of cloud-scale FPGA-accelerated applications, showing you how to get started building high-performance solutions on AWS. The talk also provides an update on AWS FPGA partners and the AWS Marketplace. With the availability of FPGAs on AWS, developers are creating custom hardware-accelerated solutions to solve complex problems in areas like big data processing, healthcare and life sciences, security, image and video processing, finance, and applied research. Amazon EC2 F1 gives developers a cloud development framework, the ability to rapidly scale their accelerated applications, and access to FPGA technology for millions of AWS customers via the AWS Marketplace.
NGCodec will show the development and deployment of an H.265/HEVC encoder for AWS F1. The talk covers the design of the complete system, including the algorithm and implementation trade-offs involved. The resulting implementation is a significant improvement over any implementation that uses CPUs alone. The combination of FPGAs and CPUs in a cloud environment offers novel technical and business opportunities.
Xilinx will illustrate both the silicon and the software environment available in the AWS EC2 F1 instance. This environment exposes the detail needed for efficient hardware implementations while also providing abstractions that enable software developers.
Mark Duffield (AWS)
Mark Duffield is a Principal Solutions Architect at Amazon Web Services. Prior to joining AWS, he was a High Performance Computing SME at IBM, and designed multi-petabyte solutions at DDN Storage. He has deep experience with High Performance Computing, cluster computing, enterprise software development, and distributed file systems. He architects solutions in several verticals, including weather modeling and forecasting, Electronic Design Automation, manufacturing, and scientific simulations.
Kees Vissers (Xilinx)
Kees Vissers graduated from Delft University in the Netherlands. Since then, he has worked as a researcher at Philips Research in Eindhoven, an industrial fellow at Carnegie Mellon University, a visiting industrial fellow at UC Berkeley, director of architecture at TriMedia, CTO at Chameleon Systems, and a board member of BEEcube. Today he heads a team of researchers at Xilinx developing next-generation programming environments for processors and FPGA fabrics. His current research includes work on neural networks, in particular reduced-precision neural networks.
Oliver Gunasekara (NGCodec)
Oliver Gunasekara joined ARM in 1995 to lead its mobile activities, initially in Europe and then later in Japan and Silicon Valley. His last role at ARM was Vice President of Corporate Business Development and M&A. He joined video start-up W&W Communications in 2007 as Vice President of the mobile business and founded NGCodec in 2012. Oliver holds a Bachelor's degree with honours in Electrical and Electronic Engineering from the University of Greenwich, London, UK, and a Mini-MBA from the AeA/Executive Institute, Stanford University Graduate School of Business.
Case Study: Usage of High Level Synthesis in HPC Networking
Pavel Benácek, Viktor Puš, Michal Kekely and Lukáš Richter
Adopting OpenCAPI for High Bandwidth Database Accelerators
Jian Fang, Yvo Mulder, Kangli Huang, Yang Qiao, Xianwei Zeng, Peter Hofstee, Jinho Lee and Jan Hidders
A FPGA-Pipelined Approach for Accelerated Discrete-Event Simulation of HPC Systems
Carlo Pascoe, Sai P. Chenna, Greg Stitt and Herman Lam
OpenCL for FPGAs/HPC: Case Study in 3D FFT
Ahmed Sanaullah, Vipin Sachdeva and Martin Herbordt
RE-HASE: Regular-Expressions Hardware Synthesis Engine
Mohamed El-Hadedy, Xinfei Guo, Xiaoping Huang and Martin Margala
LESS: Loop Nest Execution Strategies for Spatial Architectures
Amalee Wilson, Swapna Raj and Kermin Fleming
Kalin Ovtcharov, Microsoft Research
"Accelerating Deep Neural Networks at Datacenter Scale with the BrainWave Architecture"
In the last several years, advances in deep neural networks (DNN) have led to many breakthroughs in machine learning, spanning diverse fields such as computer vision, speech recognition, and natural language processing. Since then, the size and complexity of DNNs have significantly outpaced the growth of CPU performance, leading to an explosion of DNN accelerators.
Recently, major cloud vendors have turned to ASICs and FPGAs to accelerate DNN serving in latency-sensitive online services. While significant attention has been devoted to deep convolutional neural networks (CNN), much less has been given to the more difficult problems posed by memory-intensive Multilayer Perceptrons (MLP) and Recurrent Neural Networks (RNN).
Improving the performance of memory-intensive DNNs (e.g., LSTM) can be achieved using a variety of methods. Batching is one popular approach that can be used to drive up device utilization, but can be harmful to tail latencies (e.g., 99.9th) in the context of online services. Increasing off-chip bandwidth is another option, but incurs high cost and power, and still may not be sufficient to saturate an accelerator with tens or even hundreds of TFLOPS of raw performance.
In this work, we present BrainWave, a scalable, distributed DNN acceleration architecture that enables inference on MLPs and RNNs at near-peak device efficiency without the need for input batching. The BrainWave architecture is enabled and powered by FPGAs and reduces latency by 10-100X relative to well-tuned software.
Kalin Ovtcharov is a Senior Research Hardware Design Engineer in the Silicon Futures Group at Microsoft Research. He is involved in the development of the BrainWave Accelerator Architecture, an FPGA-based machine learning accelerator used in Microsoft datacenters. Prior to joining Microsoft, Kalin was a founding member of a start-up company. He has also worked at several other companies employing FPGAs for applications such as video compression and radar imaging. Kalin holds a Bachelor's degree in Computer Engineering from McMaster University.
Automated Systolic Array Architecture Synthesis for High Throughput CNN Inference on AWS F1 FPGA
Xuechao Wei, Cody Hao Yu, Peng Zhang and Jim Wu
A Reconfigurable Heterogeneous Microserver Architecture for Energy-efficient Computing
Martin Kaiser, René Griessl, Jens Hagemeyer, Dirk Jungewelter, Florian Porrmann, Sarah Pilz, Mario Porrmann, Micha Vor Dem Berge and Stefan Krupop
Heterogeneous Multi-Processing in Software-Defined Cloud Storage Nodes
Ulrich Langenbach, Jim Peek and Endric Schubert
Porting a GAMESS Computational Chemistry Kernel to FPGAs
Umayanganie Klaassen, Shirley V. Moore, Kristopher Keipert, Mark S. Gordon, Jeffery S. Vetter and Seyong Lee
Snowflake: Efficient Accelerator for Deep Neural Networks
Aliasger Zaidy, Andre Xian Ming Chang, Abhishek Chaurasia and Eugenio Culurciello
Hal Finkel, Argonne National Laboratory
"FPGAs for Supercomputing? Progress and Challenges"
Field-programmable gate arrays (FPGAs) have become keystones of our modern electronics ecosystem. Yet despite the immense processing power of higher-end parts, FPGAs are rarely employed as general-purpose computational accelerators. In this talk, I'll discuss why FPGAs seem attractive as HPC accelerators and our experience experimenting with different kinds of algorithms on FPGAs. Finally, I'll share some thoughts on how common HPC programming models and environments can be adapted to support FPGAs, and moreover, how partial reconfigurability and combined CPU/FPGA packages can be leveraged to make this easier.
Hal Finkel graduated from Yale University in 2011 with a Ph.D. in theoretical physics focusing on numerical simulation of early-universe cosmology. He’s now the Lead for Compiler Technology and Programming Languages at the ALCF. Hal has contributed to the LLVM compiler infrastructure project for many years and is currently the code owner of the PowerPC backend and the pointer-aliasing-analysis subsystem, among others. He is the lead developer on the bgclang project, which provides LLVM/Clang on the IBM Blue Gene/Q supercomputer, and represents Argonne on the C++ Standards Committee. Hal also helps develop the Hardware/Hybrid Accelerated Cosmology Code (HACC), a two-time IEEE/ACM Gordon Bell Prize finalist. He has designed and implemented a tree-based force evaluation scheme and the I/O subsystem and contributed to many other HACC components.