Skip to content

FPGA: Move all FPGA samples to IP Authoring, targeting Agilex by default #1307

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Jan 25, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 0 additions & 2 deletions DirectProgramming/C++SYCL_FPGA/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -269,8 +269,6 @@ qsub -I -l nodes=1:fpga_runtime:ppn=2 -d .

Only `fpga_compile` nodes support compiling to FPGA. When compiling for FPGA hardware, increase the job timeout to 24 hours.

Executing programs on FPGA hardware is only supported on `fpga_runtime` nodes of the appropriate type, such as `fpga_runtime:arria10` or `fpga_runtime:stratix10`.

Neither compiling nor executing programs on FPGA hardware are supported on the login nodes. For more information, see the [Intel® oneAPI Base Toolkit Get Started Guide](https://devcloud.intel.com/oneapi/documentation/base-toolkit/).

>**Note**: Since Intel® DevCloud for oneAPI includes the appropriate development environment already configured for you, you do not need to set environment variables.
Expand Down
53 changes: 34 additions & 19 deletions DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ You can also find more information about [troubleshooting build errors](/DirectP
| Optimized for | Description
|:--- |:---
| OS | Ubuntu* 18.04/20.04 <br> RHEL*/CentOS* 8 <br> SUSE* 15 <br> Windows* 10
| Hardware | Intel® Programmable Acceleration Card (PAC) with Intel Arria® 10 GX FPGA <br> FPGA Programmable Acceleration Card (PAC) D5005 (with Intel Stratix® 10 SX) <br> Intel Xeon® CPU E5-1650 v2 @ 3.50GHz (host machine)
| Hardware | Intel® Agilex™, Arria® 10, and Stratix® 10 FPGAs
| Software | Intel® oneAPI DPC++/C++ Compiler

> **Note**: Even though the Intel DPC++/C++ OneAPI compiler is enough to compile for emulation, generating reports and generating RTL, there are extra software requirements for the simulation flow and FPGA compiles.
Expand All @@ -48,6 +48,8 @@ You can also find more information about [troubleshooting build errors](/DirectP
> - ModelSim® SE
>
> When using the hardware compile flow, Intel® Quartus® Prime Pro Edition must be installed and accessible through your PATH.
>
> :warning: Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation.

## Key Implementation Details

Expand Down Expand Up @@ -149,17 +151,26 @@ The design uses the following generic header files.
### On Linux*

1. Change to the sample directory.
2. Configure the build system for **Intel® PAC with Intel Arria® 10 GX FPGA**, which is the default.
2. Configure the build system for the Agilex™ device family, which is the default.

```
mkdir build
cd build
cmake ..
```
For **Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX)**, enter the following:
```
cmake .. -DFPGA_DEVICE=intel_s10sx_pac:pac_s10
```

> **Note**: You can change the default target by using the command:
> ```
> cmake .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
> ```
>
> Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:
> ```
> cmake .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
> ```
>
> You will only be able to run an executable on the FPGA if you specified a BSP.

3. Compile the design. (The provided targets match the recommended development flow.)

1. Compile for emulation (fast compile time, targets emulated FPGA device).
Expand All @@ -181,23 +192,27 @@ The design uses the following generic header files.
make fpga
```

(Optional) The hardware compiles listed above can take several hours to complete; alternatively, you can download FPGA precompiled binaries (compatible with Linux* Ubuntu* 18.04) from [https://iotdk.intel.com/fpga-precompiled-binaries/latest/anr.fpga.tar.gz](https://iotdk.intel.com/fpga-precompiled-binaries/latest/anr.fpga.tar.gz).

### On Windows*

>**Note**: The Intel® PAC with Intel Arria® 10 GX FPGA and Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX) do not yet support Windows*. Compiling to FPGA hardware on Windows* requires a third-party or custom Board Support Package (BSP) with Windows* support.

1. Change to the sample directory.
2. Configure the build system for **Intel® PAC with Intel Arria® 10 GX FPGA**, which is the default.
2. Configure the build system for the Agilex™ device family, which is the default.
```
mkdir build
cd build
cmake -G "NMake Makefiles" ..
```
To compile for the **Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX)**, enter the following:
```
cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=intel_s10sx_pac:pac_s10
```

> **Note**: You can change the default target by using the command:
> ```
> cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
> ```
>
> Alternatively, you can target an explicit FPGA board variant and BSP by using the following command:
> ```
> cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
> ```
>
> You will only be able to run an executable on the FPGA if you specified a BSP.

3. Compile the design. (The provided targets match the recommended development flow.)

Expand Down Expand Up @@ -229,11 +244,11 @@ The design uses the following generic header files.
```
./anr.fpga_emu
```
2. Run the sample on the FPGA simulator device:
2. Run the sample on the FPGA simulator device.
```
CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./anr.fpga_sim
```
3. Alternatively, run the sample on the FPGA device.
3. Alternatively, run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=<board-support-package>:<board-variant>`).
```
./anr.fpga
```
Expand All @@ -244,13 +259,13 @@ The design uses the following generic header files.
```
anr.fpga_emu.exe
```
2. Run the sample on the FPGA simulator device:
2. Run the sample on the FPGA simulator device.
```
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1
anr.fpga_sim.exe
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=
```
3. Alternatively, run the sample on the FPGA device.
3. Alternatively, run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=<board-support-package>:<board-variant>`).
```
anr.fpga.exe
```
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -6,12 +6,36 @@ set(FPGA_TARGET ${TARGET_NAME}.fpga)

# FPGA board selection
if(NOT DEFINED FPGA_DEVICE)
set(FPGA_DEVICE "intel_a10gx_pac:pac_a10")
set(FPGA_DEVICE "Agilex")
set(DEVICE_FLAG "Agilex")
message(STATUS "FPGA_DEVICE was not specified.\
\nConfiguring the design to run on the default FPGA board ${FPGA_DEVICE} (Intel(R) PAC with Intel Arria(R) 10 GX FPGA). \
\nPlease refer to the README for information on board selection.")
\nConfiguring the design to the default FPGA family: ${FPGA_DEVICE}\
\nPlease refer to the README for information on target selection.")

set(BSP_FLAG "")
else()
message(STATUS "Configuring the design to run on FPGA board ${FPGA_DEVICE}")
string(TOLOWER ${FPGA_DEVICE} FPGA_DEVICE_NAME)
if(FPGA_DEVICE_NAME MATCHES ".*a10.*" OR FPGA_DEVICE_NAME MATCHES ".*arria10.*")
set(DEVICE_FLAG "A10")
elseif(FPGA_DEVICE_NAME MATCHES ".*s10.*" OR FPGA_DEVICE_NAME MATCHES ".*stratix10.*")
set(DEVICE_FLAG "S10")
elseif(FPGA_DEVICE_NAME MATCHES ".*agilex.*")
set(DEVICE_FLAG "Agilex")
else()
message(FATAL_ERROR "An unrecognized or custom board was passed, but DEVICE_FLAG was not specified. \
Please make sure you have set -DDEVICE_FLAG=A10, -DDEVICE_FLAG=S10 or \
-DDEVICE_FLAG=Agilex.")
endif()
message(STATUS "Configuring the design with the following target: ${FPGA_DEVICE}")

# Check if the target is a BSP
if(IS_BSP MATCHES "1" OR FPGA_DEVICE MATCHES ".*pac_a10.*|.*pac_s10.*")
set(BSP_FLAG "-DIS_BSP")
else()
set(BSP_FLAG "")
message(STATUS "The selected target ${FPGA_DEVICE} is assumed to be an FPGA part number, so the IS_BSP macro will not be passed to your C++ code.")
message(STATUS "If the target is actually a BSP, run cmake with -DIS_BSP=1 to pass the IS_BSP macro to your C++ code.")
endif()
endif()

# These are Windows-specific flags:
Expand Down Expand Up @@ -46,11 +70,11 @@ endif()
# e.g. cmake .. -DSEED=7
if(NOT DEFINED SEED)
# the default seed
if(FPGA_DEVICE MATCHES ".*a10.*")
if(DEVICE_FLAG MATCHES "A10")
set(SEED 1)
elseif(FPGA_DEVICE MATCHES ".*s10.*")
elseif(DEVICE_FLAG MATCHES "S10")
set(SEED 2)
elseif(FPGA_DEVICE MATCHES ".*agilex.*")
elseif(DEVICE_FLAG MATCHES "Agilex")
set(SEED 3)
else()
set(SEED 4)
Expand Down Expand Up @@ -79,11 +103,11 @@ if(PIXELS_PER_CYCLE)
message(STATUS "PIXELS_PER_CYCLE explicitly set to ${PIXELS_PER_CYCLE}")
else()
# Default PIXELS_PER_CYCLE based on the board being used
if(FPGA_DEVICE MATCHES ".*a10.*")
if(DEVICE_FLAG MATCHES "A10")
set(PIXELS_PER_CYCLE 2)
elseif(FPGA_DEVICE MATCHES ".*s10.*")
elseif(DEVICE_FLAG MATCHES "S10")
set(PIXELS_PER_CYCLE 2)
elseif(FPGA_DEVICE MATCHES ".*agilex.*")
elseif(DEVICE_FLAG MATCHES "Agilex")
set(PIXELS_PER_CYCLE 1)
else()
message(WARNING "Unknown board: setting PIXELS_PER_CYCLE to 1")
Expand Down Expand Up @@ -120,13 +144,13 @@ endif()
# 1. The "compile" stage compiles the device code to an intermediate representation (SPIR-V).
# 2. The "link" stage invokes the compiler's FPGA backend before linking.
# For this reason, FPGA backend flags must be passed as link flags in CMake.
set(EMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${CONSTEXPR_STEPS} ${WIN_FLAG} ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} -DFPGA_EMULATOR")
set(EMULATOR_LINK_FLAGS "-fsycl -fintelfpga ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG}")
set(SIMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${CONSTEXPR_STEPS} ${WIN_FLAG} ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} -Xssimulation -DFPGA_SIMULATOR")
set(SIMULATOR_LINK_FLAGS "-fsycl -fintelfpga -Xssimulation -Xsghdl -Xstarget=${FPGA_DEVICE} ${USER_HARDWARE_FLAGS}")
set(REPORT_LINK_FLAGS "-fsycl -fintelfpga -Xshardware ${PROFILE_FLAG} ${FLAT_COMPILE_FLAG} -Xsparallel=2 ${SEED_FLAG} -Xstarget=${FPGA_DEVICE} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} ${IP_MODE_FLAG} ${USER_HARDWARE_FLAGS}")
set(HARDWARE_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${CONSTEXPR_STEPS} ${WIN_FLAG} ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} -DFPGA_HARDWARE")
set(HARDWARE_LINK_FLAGS "${REPORT_LINK_FLAGS} ${AC_TYPES_FLAG}")
set(EMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${CONSTEXPR_STEPS} ${WIN_FLAG} ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} -DFPGA_EMULATOR ${BSP_FLAG}")
set(EMULATOR_LINK_FLAGS "-fsycl -fintelfpga ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} ${BSP_FLAG}")
set(SIMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${CONSTEXPR_STEPS} ${WIN_FLAG} ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} -Xssimulation -DFPGA_SIMULATOR ${BSP_FLAG}")
set(SIMULATOR_LINK_FLAGS "-fsycl -fintelfpga -Xssimulation -Xsghdl -Xstarget=${FPGA_DEVICE} ${USER_HARDWARE_FLAGS} ${AC_TYPES_FLAG} ${BSP_FLAG}")
set(REPORT_LINK_FLAGS "-fsycl -fintelfpga -Xshardware ${PROFILE_FLAG} ${FLAT_COMPILE_FLAG} -Xsparallel=2 ${SEED_FLAG} -Xstarget=${FPGA_DEVICE} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} ${USER_HARDWARE_FLAGS} ${BSP_FLAG}")
set(HARDWARE_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${CONSTEXPR_STEPS} ${WIN_FLAG} ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} -DFPGA_HARDWARE ${BSP_FLAG}")
set(HARDWARE_LINK_FLAGS "${REPORT_LINK_FLAGS} ${AC_TYPES_FLAG} ${BSP_FLAG}")
# use cmake -D USER_HARDWARE_FLAGS=<flags> to set extra flags for FPGA backend compilation

###############################################################################
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -347,15 +347,7 @@ std::vector<event> SubmitANRKernels(queue& q, int cols, int rows,
// submit the vertical kernel using a column stencil
auto vertical_kernel = q.single_task<VerticalKernelID>([=] {
// copy host side intensity sigma LUT to the device
// For testing the kernel system as an IP and checking the area and Fmax,
// we allow the user to turn off connections to device memory. In this case
// (the DISABLE_DEVICE_MEM macro IS defined), the results will be incorrect
// since there is no way to get the data to/from the device.
#if defined(IP_MODE)
IntensitySigmaLUT sig_i_lut;
#else
IntensitySigmaLUT sig_i_lut(sig_i_lut_data_ptr);
#endif

// build the constexpr exp() and inverse LUT ROMs
constexpr ExpLUT exp_lut;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -22,11 +22,13 @@ template <typename KernelId, typename T, typename Pipe, int pixels_per_cycle>
event SubmitInputDMA(queue &q, T *in_ptr, int rows, int cols, int frames) {
using PipeType = DataBundle<T, pixels_per_cycle>;

#if defined (IS_BSP)
// LSU attribute to turn off caching
using NonCachingLSU =
ext::intel::lsu<ext::intel::burst_coalesce<true>, ext::intel::cache<0>,
ext::intel::statically_coalesce<true>,
ext::intel::prefetch<false>>;
#endif

// validate the number of columns
if ((cols % pixels_per_cycle) != 0) {
Expand All @@ -41,7 +43,12 @@ event SubmitInputDMA(queue &q, T *in_ptr, int rows, int cols, int frames) {

// Using device memory
return q.single_task<KernelId>([=]() [[intel::kernel_args_restrict]] {

#if defined (IS_BSP)
device_ptr<T> in(in_ptr);
#else
T* in(in_ptr);
#endif

// coalesce the following two loops into a single for-loop using the
// loop_coalesce attribute
Expand All @@ -51,7 +58,11 @@ event SubmitInputDMA(queue &q, T *in_ptr, int rows, int cols, int frames) {
PipeType pipe_data;
#pragma unroll
for (int k = 0; k < pixels_per_cycle; k++) {
#if defined (IS_BSP)
pipe_data[k] = NonCachingLSU::load(in + i * pixels_per_cycle + k);
#else
pipe_data[k] = in[i * pixels_per_cycle + k];
#endif
}
Pipe::write(pipe_data);
}
Expand All @@ -77,7 +88,12 @@ event SubmitOutputDMA(queue &q, T *out_ptr, int rows, int cols, int frames) {

// Using device memory
return q.single_task<KernelId>([=]() [[intel::kernel_args_restrict]] {

#if defined (IS_BSP)
device_ptr<T> out(out_ptr);
#else
T* out(out_ptr);
#endif

// coalesce the following two loops into a single for-loop using the
// loop_coalesce attribute
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ class IntensitySigmaLUT {
// default constructor
IntensitySigmaLUT() {}

#if defined (IS_BSP)
// construct from a device_ptr (for constructing from device memory)
IntensitySigmaLUT(device_ptr<float> ptr) {
// use a pipelined LSU to load from device memory since we don't
Expand All @@ -25,6 +26,14 @@ class IntensitySigmaLUT {
data_[i] = PipelinedLSU::load(ptr + i);
}
}
#else
// construct from a regular pointer
IntensitySigmaLUT(float* ptr) {
for (int i = 0; i < lut_depth; i++) {
data_[i] = ptr[i];
}
}
#endif

// construct from the ANR parameters (actually builds the LUT)
IntensitySigmaLUT(ANRParams params) {
Expand All @@ -39,8 +48,12 @@ class IntensitySigmaLUT {
}

// helper static method to allocate enough memory to hold the LUT
static float* AllocateDevice(sycl::queue& q) {
static float* Allocate(sycl::queue& q) {
#if defined (IS_BSP)
float* ptr = sycl::malloc_device<float>(lut_depth, q);
#else
float* ptr = sycl::malloc_shared<float>(lut_depth, q);
#endif
if (ptr == nullptr) {
std::cerr << "ERROR: could not allocate space for 'ptr'\n";
std::terminate();
Expand All @@ -49,7 +62,7 @@ class IntensitySigmaLUT {
}

// helper method to copy the data to the device
sycl::event CopyDataToDevice(sycl::queue& q, float* ptr) {
sycl::event CopyData(sycl::queue& q, float* ptr) {
return q.memcpy(ptr, data_, lut_depth * sizeof(float));
}

Expand Down
18 changes: 16 additions & 2 deletions DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/src/main.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,7 @@ int main(int argc, char* argv[]) {
// create the output pixels (initialize to all 0s)
std::vector<PixelT> out_pixels(in_pixels.size(), 0);

#if defined (IS_BSP)
// allocate memory on the device for the input and output
PixelT *in, *out;
if ((in = malloc_device<PixelT>(pixel_count, q)) == nullptr) {
Expand All @@ -127,18 +128,31 @@ int main(int argc, char* argv[]) {
std::cerr << "ERROR: could not allocate space for 'out'\n";
std::terminate();
}
#else
// allocate memory on the host for the input and output
PixelT *in, *out;
if ((in = malloc_shared<PixelT>(pixel_count, q)) == nullptr) {
std::cerr << "ERROR: could not allocate space for 'in'\n";
std::terminate();
}
if ((out = malloc_shared<PixelT>(pixel_count, q)) == nullptr) {
std::cerr << "ERROR: could not allocate space for 'out'\n";
std::terminate();
}
#endif


// copy the input data to the device memory and wait for the copy to finish
q.memcpy(in, in_pixels.data(), pixel_count * sizeof(PixelT)).wait();

// allocate space for the intensity sigma LUT
float* sig_i_lut_data_ptr = IntensitySigmaLUT::AllocateDevice(q);
float* sig_i_lut_data_ptr = IntensitySigmaLUT::Allocate(q);

// create the intensity sigma LUT data locally on the host
IntensitySigmaLUT sig_i_lut_host(params);

// copy the intensity sigma LUT to the device
sig_i_lut_host.CopyDataToDevice(q, sig_i_lut_data_ptr).wait();
sig_i_lut_host.CopyData(q, sig_i_lut_data_ptr).wait();
//////////////////////////////////////////////////////////////////////////////

// track timing information in ms
Expand Down
Loading