







# From Dataflow Specifications to Customised Reconfigurable Datapaths Using HLS: the OpenCL Case for FPGAs

IETR research team

### **Rubén Salvador**

[Kindly hosted by INSA: K. Desnos, M. Pelcat, J.-F. Nezan, D. Menard, L. Morin...]

Universidad Politécnica de Madrid (UPM), School of Telecommunications Systems and Engineering (ETSIST), Research Center on Software Technologies and Multimedia Systems (CITSEM)

> Dataflow Workshop Rennes, 12-14 December 2017











### Context













Co-funded by the European Union







Motivation

OpenCL FPGA

Dataflow Mapping On Top Of OpenCL FPGA





### Motivation




### Customised FPGA-based datapaths for dataflow graphs



### What can dataflow bring to the OpenCL community?

|      | OpenCL | Dataflow |
|------|--------|----------|
| Pros | <ul> <li>Functionally portable</li> <li>Wide community acceptance</li> <li>Support for HLS</li> </ul> | <ul> <li>Graph analysis &amp; guarantees</li> <li>Schedulability, deadlocks, FIFO sizing</li> <li>Concurrent execution model</li> <li>Comms interaction</li> </ul> |
| Cons | <ul> <li>Not dataflow (streaming) friendly</li> <li>Global memory comms</li> <li>Compute accelerator model</li> <li>Data offload (writes/reads)</li> <li>Throughput-oriented (vs. latency)</li> </ul> | <ul> <li>Niche domain</li> <li>Most work targets multi/manycore</li> </ul> |
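To make the "deadlocks" and "FIFO sizing" guarantees concrete, the C sketch below simulates the self-timed execution of a single SDF edge (the producer writes `p` tokens per firing, the consumer reads `c`) through a bounded FIFO, and searches for the smallest capacity that lets one graph iteration complete. It is an illustration under stated assumptions, not a tool from the talk: the FIFO starts empty, capacity is counted in tokens, the consumer has priority when both actors can fire, and all function names are hypothetical. For an edge without initial tokens, the result agrees with the known closed form `p + c - gcd(p, c)`.

```c
#include <assert.h>
#include <stdbool.h>

/* Greatest common divisor (Euclid). */
static int gcd_int(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* Self-timed execution of one SDF edge P --(p -> c)--> C through a FIFO of
 * capacity `cap` tokens, starting empty. Runs one graph iteration
 * (q_P = c/g firings of P and q_C = p/g firings of C, with g = gcd(p, c))
 * and returns true if it completes, false if it deadlocks. */
static bool edge_completes(int p, int c, int cap) {
    assert(p > 0 && c > 0 && cap >= 0);
    int g = gcd_int(p, c);
    int left_p = c / g;   /* producer firings still to do */
    int left_c = p / g;   /* consumer firings still to do */
    int tokens = 0;       /* current FIFO occupancy */
    while (left_p > 0 || left_c > 0) {
        if (left_c > 0 && tokens >= c) {             /* consumer firing rule */
            tokens -= c;
            left_c--;
        } else if (left_p > 0 && tokens + p <= cap) { /* producer needs free space */
            tokens += p;
            left_p--;
        } else {
            return false;                             /* no actor can fire: deadlock */
        }
    }
    return true;  /* FIFO is back to its initial (empty) state */
}

/* Smallest FIFO capacity for which one iteration runs without deadlock. */
static int min_fifo_capacity(int p, int c) {
    int cap = 0;
    while (!edge_completes(p, c, cap))
        cap++;
    return cap;
}
```

For example, an edge producing 2 and consuming 3 tokens per firing deadlocks with a 3-token FIFO and completes with a 4-token FIFO. Since the token count returns to zero after one iteration, simulating a single iteration is enough to decide deadlock freedom for this edge.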









### OpenCL: framework for heterogeneous/parallel computing



*[Figure: Host, Host Memory, ...]*

### **OpenCL FPGA Model**









### Dataflow Mapping On Top Of OpenCL FPGA

Desired Features & Expected Gains:

- Hardware acceleration (custom datapath)
- Reduced processor (communication) overhead
- Reduced memory transactions
- Self-timed execution






Dataflow Community

Leverage OpenCL FPGA constructs to generate efficient dataflow implementations

Dataflow-driven "OpenCL" code generation

Tool Expertise & Design Space Exploration

**OpenCL** Community

### **OpenCL Khronos Group Standard**

Recent (2017) proposal: add dataflow semantics to the OpenCL standard





### Applying Models of Computation to OpenCL Pipes for FPGA Computing

Nachiket Kapre and Hiren Patel, University of Waterloo, 200 University Ave W., Waterloo, Ontario, Canada N2L 3G1 (nachiket@uwaterloo.ca, hiren.patel@uwaterloo.ca)

#### ABSTRACT

OpenCL pipes offer a powerful construct for synthesizing multikernel FPGA applications with inter-kernel communication dependencies. The communication discipline between the FPGA kernels is restricted to producer-consumer style patterns supported with on-chip FPGA FIFOs. While this provides few restrictions on the ...

... compared to the alternatives. They can also be reconfigured as needed to support varying demands of user applications. However, FPGAs have traditionally been difficult to configure as programmers describe their applications as low-level *circuits* rather than high-level software programs. However, at a fundamental level, hardware circuits are specialized parallel programs. This means, ...

### Adding MoC semantics to OpenCL

Kapre, Nachiket, and Hiren Patel. *Applying Models of Computation to OpenCL Pipes for FPGA Computing*. Proc. 5th **IWOCL**. ACM, **2017**.

OpenCL compute model + MoC communication schemes

Proposal for the OpenCL standard

a.k.a. the compiler's job

- *Synchronous Dataflow (SDF)*
- Bulk Synchronous Parallel (BSP)
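The SDF side of such a proposal rests on the balance equations, which a compiler can solve statically. As an illustration (not code from the cited paper), the C sketch below computes the repetition vector of a chain of SDF actors, i.e. the smallest number of firings per actor that returns every FIFO to its initial state. `repetition_vector` and its parameters are hypothetical names.

```c
#include <assert.h>

/* Greatest common divisor (Euclid). */
static long gcd_long(long a, long b) {
    while (b != 0) { long t = a % b; a = b; b = t; }
    return a;
}

/* Repetition vector of an SDF chain A0 -> A1 -> ... -> A(n-1).
 * Edge i (A_i -> A_{i+1}): A_i produces prod[i] tokens per firing and
 * A_{i+1} consumes cons[i] tokens per firing.
 * Solves the balance equations q[i] * prod[i] == q[i+1] * cons[i]
 * for the smallest positive integers q[0..n-1]. */
static void repetition_vector(int n, const long *prod, const long *cons, long *q) {
    assert(n >= 1);
    q[0] = 1;
    for (int i = 0; i + 1 < n; i++) {
        assert(prod[i] > 0 && cons[i] > 0);
        long num = q[i] * prod[i];
        /* Scale all actors solved so far so that q[i+1] becomes an integer. */
        long scale = cons[i] / gcd_long(num, cons[i]);
        for (int j = 0; j <= i; j++)
            q[j] *= scale;
        q[i + 1] = q[i] * prod[i] / cons[i];
    }
    /* Normalise to the smallest integer solution. */
    long g = q[0];
    for (int i = 1; i < n; i++)
        g = gcd_long(g, q[i]);
    for (int i = 0; i < n; i++)
        q[i] /= g;
}
```

For instance, `prod = {3, 1}` and `cons = {2, 3}` give `q = {2, 3, 1}` (2 × 3 = 3 × 2 and 3 × 1 = 1 × 3). A chain is always consistent; graphs with forks or cycles would additionally need a consistency check (all balance equations satisfied at once) before a schedule can be derived.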



Pipes (OpenCL 2.0)



### Standard OpenCL Kernel-to-Kernel communication

Overlap multi-kernel operation

Channels (Intel FPGA)

Preferred Kernel-to-Kernel communication

Self-triggered kernels (free-running, decoupled from the host)

Host-Kernel Pipes



Kang, K., and P. Yiannacouras. *Host Pipes: Direct Streaming Interface Between OpenCL Host and Kernel*. Proc. 5th **IWOCL**. ACM, **2017**.

... only a prototype demo so far




### **Kernel Operation Possibilities**

*[Figure: Kernels 1 to 4 exchanging data through Global Memory vs. Kernels 1 to 4 connected through Channels/Pipes]*
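A back-of-the-envelope count shows why the channel/pipe organisation reduces memory transactions. Assume, purely for illustration, a chain of $K$ kernels where each kernel maps $N$ input elements to $N$ output elements, and count only global-memory element accesses ($K$, $N$ and the one-read-one-write-per-element model are assumptions, not measurements):

```latex
\begin{aligned}
A_{\text{global}}   &= \underbrace{K N}_{\text{reads}} + \underbrace{K N}_{\text{writes}} = 2KN \\
A_{\text{channels}} &= \underbrace{N}_{\text{first kernel reads}} + \underbrace{N}_{\text{last kernel writes}} = 2N \\
\frac{A_{\text{global}}}{A_{\text{channels}}} &= K \qquad (K = 4:\ 8N \text{ vs. } 2N)
\end{aligned}
```

With channels, intermediate results travel through on-chip FIFOs instead of off-chip memory. Kernels that reuse data from local memory, or that change the token rate, will not follow this simple count.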

### Autorun kernels


No host-kernel communication logic

Autostart & Auto-restart

Communicate through channels

Intel FPGA SDK for OpenCL: Best Practices Guide https://www.altera.com/en\_US/pdfs/literature/hb/openclsdk/aocl-best-practices-guide.pdf




Channels





Blocking/Non-blocking Read/Write API

Synchronization mechanisms

I/O Channels -> Streaming DSP

Intel FPGA SDK for OpenCL: Programming Guide https://www.altera.com/content/dam/alterawww/global/en\_US/pdfs/literature/hb/openclsdk/aocl\_programming\_guide.pdf
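The difference between blocking and non-blocking channel accesses can be modelled in plain C. The sketch below is a software model, not the SDK API: `channel_model`, `try_write`, `try_read` and `merge_step` are hypothetical names, and the depth of 4 is arbitrary. Non-blocking accesses return immediately and report success through a flag, which lets a kernel poll several inputs; a blocking access would instead stall the kernel pipeline until data (or space) is available.

```c
#include <assert.h>
#include <stdbool.h>

#define FIFO_DEPTH 4  /* arbitrary depth chosen for the sketch */

/* Software model of an on-chip FIFO channel (ring buffer). */
typedef struct {
    int data[FIFO_DEPTH];
    int head;   /* index of the oldest token */
    int count;  /* number of tokens currently stored */
} channel_model;

/* Non-blocking write: returns false without storing anything if the FIFO is full. */
static bool try_write(channel_model *ch, int value) {
    if (ch->count == FIFO_DEPTH)
        return false;
    ch->data[(ch->head + ch->count) % FIFO_DEPTH] = value;
    ch->count++;
    return true;
}

/* Non-blocking read: *valid tells the caller whether a token was actually read. */
static int try_read(channel_model *ch, bool *valid) {
    if (ch->count == 0) {
        *valid = false;
        return 0;
    }
    int value = ch->data[ch->head];
    ch->head = (ch->head + 1) % FIFO_DEPTH;
    ch->count--;
    *valid = true;
    return value;
}

/* One step of a two-input merge: forward a token from `a`, or from `b` if `a`
 * is empty, without stalling on an empty input. Returns true if a token moved. */
static bool merge_step(channel_model *a, channel_model *b, channel_model *out) {
    if (out->count == FIFO_DEPTH)
        return false;               /* no space downstream: consume nothing */
    bool valid;
    int value = try_read(a, &valid);
    if (!valid)
        value = try_read(b, &valid);
    if (!valid)
        return false;               /* both inputs empty */
    bool written = try_write(out, value);
    assert(written);                /* cannot fail: space was checked above */
    (void)written;                  /* avoid an unused-variable warning under NDEBUG */
    return true;
}
```

Checking for output space before reading is what keeps the merge from losing a token; the same ordering concern appears whenever a kernel mixes non-blocking reads and writes.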












### 2.1.- Actor firing rules (scheduler)

Actor I/O IFs, firing rules, templates

- Host code:
  - <u>Platform</u> initialization: *automatic*
  - Job management: *automatic* 
    - input data & result data
    - "only" necessary at the host/device boundary
    - pointers mapped to device buffers
- Kernel code:
  - <u>I/O interfaces (firing rules)</u>: *automatic*
  - <u>Functionality</u>: <u>manual</u> (provided by user)
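The split between generated and user-written code can be sketched as an actor template: the firing rule (enough input tokens and enough output space) is generic and could be derived from the graph, while only the actor body is supplied by the user. This is a plain-C software model under assumed names (`token_buf`, `actor`, `can_fire`, `fire` and `sum_pairs` are all hypothetical), using linear buffers instead of channels for brevity; it is not the PREESM or OpenCL code generator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Linear token buffer (no wrap-around, to keep the sketch short). */
typedef struct {
    int *data;
    size_t cap;  /* capacity in tokens */
    size_t rd;   /* index of the next token to consume */
    size_t wr;   /* index of the next free slot */
} token_buf;

/* User-supplied functionality: reads `cons` tokens from `in`, writes `prod` to `out`. */
typedef void (*actor_body)(const int *in, int *out);

typedef struct {
    size_t cons;      /* tokens consumed per firing */
    size_t prod;      /* tokens produced per firing */
    actor_body body;  /* the only hand-written part in this sketch */
} actor;

/* Firing rule: enough input tokens and enough output space. */
static bool can_fire(const actor *a, const token_buf *in, const token_buf *out) {
    return in->wr - in->rd >= a->cons && out->cap - out->wr >= a->prod;
}

/* Generic firing wrapper: check the rule, run the body, update the buffers. */
static bool fire(const actor *a, token_buf *in, token_buf *out) {
    assert(a->body != NULL);
    if (!can_fire(a, in, out))
        return false;
    a->body(in->data + in->rd, out->data + out->wr);
    in->rd += a->cons;
    out->wr += a->prod;
    return true;
}

/* Example user functionality (hypothetical): add pairs of samples (cons = 2, prod = 1). */
static void sum_pairs(const int *in, int *out) {
    out[0] = in[0] + in[1];
}
```

With five input tokens `{1, 2, 3, 4, 5}` and room for two outputs, an actor `{2, 1, sum_pairs}` fires twice (producing 3 and 7) and then stops, because only one input token is left and its firing rule is no longer met.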

Is channel synchronization enough? Or borrow from *CAPH*?





### Mapping (PiSDF) Dataflow Graphs to the OpenCL Model



### 2.1.- Actor firing rules (scheduler)

Actor I/O IFs, firing rules, templates

### 2.2.- Buffer generation

Leverage current PREESM buffer generation: pipes vs. channels vs. ad-hoc buffers
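One simple baseline for buffer sizing follows directly from the balance equations (an illustration, not PREESM's allocator). On an edge $e = (a, b)$ where actor $a$ produces $p_e$ tokens per firing and actor $b$ consumes $c_e$, one graph iteration moves $T_e$ tokens, which is the buffer an edge without initial tokens needs if $a$ completes all its firings before $b$ starts. An interleaved, self-timed execution through a channel can get by with less:

```latex
\begin{aligned}
T_e &= p_e \, q_a = c_e \, q_b && \text{(tokens per graph iteration)} \\
p_e = 2,\ c_e = 3 &\;\Rightarrow\; q_a = 3,\ q_b = 2,\ T_e = 6 \\
p_e + c_e - \gcd(p_e, c_e) &= 2 + 3 - 1 = 4 && \text{(minimum self-timed capacity, no initial tokens)}
\end{aligned}
```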

### 2.3.- Memory Accesses Optimization

- Streaming dataflow
- Shared/global memory
- Local FPGA DDRs (kernel only)




Different workloads?

Out-of-order accesses?




### future(future)















Centro de Investigación en Tecnologías del Software y Sistemas Multimedia para la Sostenibilidad (Research Center on Software Technologies and Multimedia Systems for Sustainability)



## Thanks for your attention!!

ruben.salvador@upm.es https://twitter.com/RubenSalvadorP http://blogs.upm.es/rubensalvador/