oneapi-src · yuguen · Jan 25, 2023 · Jan 12, 2023 · Jan 13, 2023 · Jan 13, 2023
diff --git a/DirectProgramming/C++SYCL_FPGA/README.md b/DirectProgramming/C++SYCL_FPGA/README.md
@@ -269,8 +269,6 @@ qsub -I -l nodes=1:fpga_runtime:ppn=2 -d .
 
 Only `fpga_compile` nodes support compiling to FPGA. When compiling for FPGA hardware, increase the job timeout to 24 hours.
 
-Executing programs on FPGA hardware is only supported on `fpga_runtime` nodes of the appropriate type, such as `fpga_runtime:arria10` or `fpga_runtime:stratix10`.
-
 Neither compiling nor executing programs on FPGA hardware are supported on the login nodes. For more information, see the [Intel® oneAPI Base Toolkit Get Started Guide](https://devcloud.intel.com/oneapi/documentation/base-toolkit/).
 
 >**Note**: Since Intel® DevCloud for oneAPI includes the appropriate development environment already configured for you, you do not need to set environment variables.

diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/README.md b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/README.md
@@ -37,7 +37,7 @@ You can also find more information about [troubleshooting build errors](/DirectP
 | Optimized for                     | Description
 |:---                               |:---
 | OS                                | Ubuntu* 18.04/20.04 <br> RHEL*/CentOS* 8 <br> SUSE* 15 <br> Windows* 10
-| Hardware                          | Intel® Programmable Acceleration Card (PAC) with Intel Arria® 10 GX FPGA <br> FPGA Programmable Acceleration Card (PAC) D5005 (with Intel Stratix® 10 SX) <br> Intel Xeon® CPU E5-1650 v2 @ 3.50GHz (host machine)
+| Hardware                          | Intel® Agilex™, Arria® 10, and Stratix® 10 FPGAs
 | Software                          | Intel® oneAPI DPC++/C++ Compiler
 
 > **Note**: Even though the Intel DPC++/C++ OneAPI compiler is enough to compile for emulation, generating reports and generating RTL, there are extra software requirements for the simulation flow and FPGA compiles.
@@ -48,6 +48,8 @@ You can also find more information about [troubleshooting build errors](/DirectP
 > - ModelSim® SE
 >
 > When using the hardware compile flow, Intel® Quartus® Prime Pro Edition must be installed and accessible through your PATH.
+>
+> :warning: Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation.
 
 ## Key Implementation Details
 
@@ -149,17 +151,26 @@ The design uses the following generic header files.
 ### On Linux*
 
 1. Change to the sample directory.
-2. Configure the build system for **Intel® PAC with Intel Arria® 10 GX FPGA**, which is the default.
+2. Configure the build system for the Agilex™ device family, which is the default.
 
    ```
    mkdir build
    cd build
    cmake ..
    ```
-   For **Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX)**, enter the following:
-   ```
-   cmake .. -DFPGA_DEVICE=intel_s10sx_pac:pac_s10
-   ```
+
+   > **Note**: You can change the default target by using the command:
+   >  ```
+   >  cmake .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
+   >  ``` 
+   >
+   > Alternatively, you can target an explicit FPGA board variant and BSP by using the following command: 
+   >  ```
+   >  cmake .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
+   >  ``` 
+   >
+   > You will only be able to run an executable on the FPGA if you specified a BSP.
+
 3. Compile the design. (The provided targets match the recommended development flow.)
 
    1. Compile for emulation (fast compile time, targets emulated FPGA device).
@@ -181,23 +192,27 @@ The design uses the following generic header files.
       make fpga
       ```
 
-   (Optional) The hardware compiles listed above can take several hours to complete; alternatively, you can download FPGA precompiled binaries (compatible with Linux* Ubuntu* 18.04) from [https://iotdk.intel.com/fpga-precompiled-binaries/latest/anr.fpga.tar.gz](https://iotdk.intel.com/fpga-precompiled-binaries/latest/anr.fpga.tar.gz).
-
 ### On Windows*
 
->**Note**: The Intel® PAC with Intel Arria® 10 GX FPGA and Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX) do not yet support Windows*. Compiling to FPGA hardware on Windows* requires a third-party or custom Board Support Package (BSP) with Windows* support.
-
 1. Change to the sample directory.
-2. Configure the build system for **Intel® PAC with Intel Arria® 10 GX FPGA**, which is the default.
+2. Configure the build system for the Agilex™ device family, which is the default.
    ```
    mkdir build
    cd build
    cmake -G "NMake Makefiles" ..
    ```
-   To compile for the **Intel® FPGA PAC D5005 (with Intel Stratix® 10 SX)**, enter the following:
-   ```
-   cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=intel_s10sx_pac:pac_s10
-   ```
+
+   > **Note**: You can change the default target by using the command:
+   >  ```
+   >  cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<FPGA device family or FPGA part number>
+   >  ``` 
+   >
+   > Alternatively, you can target an explicit FPGA board variant and BSP by using the following command: 
+   >  ```
+   >  cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<board-support-package>:<board-variant>
+   >  ``` 
+   >
+   > You will only be able to run an executable on the FPGA if you specified a BSP.
 
 3. Compile the design. (The provided targets match the recommended development flow.)
 
@@ -229,11 +244,11 @@ The design uses the following generic header files.
    ```
    ./anr.fpga_emu
    ```
-2. Run the sample on the FPGA simulator device:
+2. Run the sample on the FPGA simulator device.
    ```
    CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./anr.fpga_sim
    ```
-3. Alternatively, run the sample on the FPGA device.
+3. Alternatively, run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=<board-support-package>:<board-variant>`).
    ```
    ./anr.fpga
    ```
@@ -244,13 +259,13 @@ The design uses the following generic header files.
    ```
    anr.fpga_emu.exe
    ```
-2. Run the sample on the FPGA simulator device:
+2. Run the sample on the FPGA simulator device.
    ```
    set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1
    anr.fpga_sim.exe
    set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=
    ```
-3. Alternatively, run the sample on the FPGA device.
+3. Alternatively, run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=<board-support-package>:<board-variant>`).
    ```
    anr.fpga.exe
    ```

diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/src/CMakeLists.txt b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/src/CMakeLists.txt
@@ -6,12 +6,36 @@ set(FPGA_TARGET ${TARGET_NAME}.fpga)
 
 # FPGA board selection
 if(NOT DEFINED FPGA_DEVICE)
-    set(FPGA_DEVICE "intel_a10gx_pac:pac_a10")
+    set(FPGA_DEVICE "Agilex")
+    set(DEVICE_FLAG "Agilex")
     message(STATUS "FPGA_DEVICE was not specified.\
-                    \nConfiguring the design to run on the default FPGA board ${FPGA_DEVICE} (Intel(R) PAC with Intel Arria(R) 10 GX FPGA). \
-                    \nPlease refer to the README for information on board selection.")
+                    \nConfiguring the design to the default FPGA family: ${FPGA_DEVICE}\
+                    \nPlease refer to the README for information on target selection.")
+
+    set(BSP_FLAG "")
 else()
-    message(STATUS "Configuring the design to run on FPGA board ${FPGA_DEVICE}")
+    string(TOLOWER ${FPGA_DEVICE} FPGA_DEVICE_NAME)
+    if(FPGA_DEVICE_NAME MATCHES ".*a10.*" OR FPGA_DEVICE_NAME MATCHES ".*arria10.*")
+      set(DEVICE_FLAG "A10")
+    elseif(FPGA_DEVICE_NAME MATCHES ".*s10.*" OR FPGA_DEVICE_NAME MATCHES ".*stratix10.*")
+      set(DEVICE_FLAG "S10")
+    elseif(FPGA_DEVICE_NAME MATCHES ".*agilex.*")
+      set(DEVICE_FLAG "Agilex")
+    else()
+      message(FATAL_ERROR "An unrecognized or custom board was passed, but DEVICE_FLAG was not specified. \
+                           Please make sure you have set -DDEVICE_FLAG=A10, -DDEVICE_FLAG=S10 or \
+                           -DDEVICE_FLAG=Agilex.")
+    endif()
+    message(STATUS "Configuring the design with the following target: ${FPGA_DEVICE}")
+
+    # Check if the target is a BSP
+    if(IS_BSP MATCHES "1" OR FPGA_DEVICE MATCHES ".*pac_a10.*|.*pac_s10.*")
+        set(BSP_FLAG "-DIS_BSP")
+    else()
+        set(BSP_FLAG "")
+        message(STATUS "The selected target ${FPGA_DEVICE} is assumed to be an FPGA part number, so the IS_BSP macro will not be passed to your C++ code.")
+        message(STATUS "If the target is actually a BSP, run cmake with -DIS_BSP=1 to pass the IS_BSP macro to your C++ code.")
+    endif()
 endif()
 
 # These are Windows-specific flags:
@@ -46,11 +70,11 @@ endif()
 # e.g. cmake .. -DSEED=7
 if(NOT DEFINED SEED)
     # the default seed
-  if(FPGA_DEVICE MATCHES ".*a10.*")
+  if(DEVICE_FLAG MATCHES "A10")
     set(SEED 1)
-  elseif(FPGA_DEVICE MATCHES ".*s10.*")
+  elseif(DEVICE_FLAG MATCHES "S10")
     set(SEED 2)
-  elseif(FPGA_DEVICE MATCHES ".*agilex.*")
+  elseif(DEVICE_FLAG MATCHES "Agilex")
     set(SEED 3)
   else()
     set(SEED 4)
@@ -79,11 +103,11 @@ if(PIXELS_PER_CYCLE)
     message(STATUS "PIXELS_PER_CYCLE explicitly set to ${PIXELS_PER_CYCLE}")
 else()
   # Default PIXELS_PER_CYCLE based on the board being used
-  if(FPGA_DEVICE MATCHES ".*a10.*")
+  if(DEVICE_FLAG MATCHES "A10")
     set(PIXELS_PER_CYCLE 2)
-  elseif(FPGA_DEVICE MATCHES ".*s10.*")
+  elseif(DEVICE_FLAG MATCHES "S10")
     set(PIXELS_PER_CYCLE 2)
-  elseif(FPGA_DEVICE MATCHES ".*agilex.*")
+  elseif(DEVICE_FLAG MATCHES "Agilex")
     set(PIXELS_PER_CYCLE 1)
   else()
     message(WARNING "Unknown board: setting PIXELS_PER_CYCLE to 1")
@@ -120,13 +144,13 @@ endif()
 # 1. The "compile" stage compiles the device code to an intermediate representation (SPIR-V).
 # 2. The "link" stage invokes the compiler's FPGA backend before linking.
 #    For this reason, FPGA backend flags must be passed as link flags in CMake.
-set(EMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${CONSTEXPR_STEPS} ${WIN_FLAG} ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} -DFPGA_EMULATOR")
-set(EMULATOR_LINK_FLAGS "-fsycl -fintelfpga ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG}")
-set(SIMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${CONSTEXPR_STEPS} ${WIN_FLAG} ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} -Xssimulation -DFPGA_SIMULATOR")
-set(SIMULATOR_LINK_FLAGS "-fsycl -fintelfpga -Xssimulation -Xsghdl -Xstarget=${FPGA_DEVICE} ${USER_HARDWARE_FLAGS}")
-set(REPORT_LINK_FLAGS "-fsycl -fintelfpga -Xshardware ${PROFILE_FLAG} ${FLAT_COMPILE_FLAG} -Xsparallel=2 ${SEED_FLAG} -Xstarget=${FPGA_DEVICE} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} ${IP_MODE_FLAG} ${USER_HARDWARE_FLAGS}")
-set(HARDWARE_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${CONSTEXPR_STEPS} ${WIN_FLAG} ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} -DFPGA_HARDWARE")
-set(HARDWARE_LINK_FLAGS "${REPORT_LINK_FLAGS} ${AC_TYPES_FLAG}")
+set(EMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${CONSTEXPR_STEPS} ${WIN_FLAG} ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} -DFPGA_EMULATOR ${BSP_FLAG}")
+set(EMULATOR_LINK_FLAGS "-fsycl -fintelfpga ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} ${BSP_FLAG}")
+set(SIMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${CONSTEXPR_STEPS} ${WIN_FLAG} ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} -Xssimulation -DFPGA_SIMULATOR ${BSP_FLAG}")
+set(SIMULATOR_LINK_FLAGS "-fsycl -fintelfpga -Xssimulation -Xsghdl -Xstarget=${FPGA_DEVICE} ${USER_HARDWARE_FLAGS} ${AC_TYPES_FLAG} ${BSP_FLAG}")
+set(REPORT_LINK_FLAGS "-fsycl -fintelfpga -Xshardware ${PROFILE_FLAG} ${FLAT_COMPILE_FLAG} -Xsparallel=2 ${SEED_FLAG} -Xstarget=${FPGA_DEVICE} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} ${USER_HARDWARE_FLAGS} ${BSP_FLAG}")
+set(HARDWARE_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${CONSTEXPR_STEPS} ${WIN_FLAG} ${AC_TYPES_FLAG} ${FILTER_SIZE_FLAG} ${PIXELS_PER_CYCLE_FLAG} ${MAX_COLS_FLAG} ${PIXEL_BITS_FLAG} -DFPGA_HARDWARE ${BSP_FLAG}")
+set(HARDWARE_LINK_FLAGS "${REPORT_LINK_FLAGS} ${AC_TYPES_FLAG} ${BSP_FLAG}")
 # use cmake -D USER_HARDWARE_FLAGS=<flags> to set extra flags for FPGA backend compilation
 
 ###############################################################################

diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/src/anr.hpp b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/src/anr.hpp
@@ -347,15 +347,7 @@ std::vector<event> SubmitANRKernels(queue& q, int cols, int rows,
   // submit the vertical kernel using a column stencil
   auto vertical_kernel = q.single_task<VerticalKernelID>([=] {
     // copy host side intensity sigma LUT to the device
-    // For testing the kernel system as an IP and checking the area and Fmax,
-    // we allow the user to turn off connections to device memory. In this case
-    // (the DISABLE_DEVICE_MEM macro IS defined), the results will be incorrect
-    // since there is no way to get the data to/from the device.
-#if defined(IP_MODE)
-    IntensitySigmaLUT sig_i_lut;
-#else
     IntensitySigmaLUT sig_i_lut(sig_i_lut_data_ptr);
-#endif
 
     // build the constexpr exp() and inverse LUT ROMs
     constexpr ExpLUT exp_lut;

diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/src/dma_kernels.hpp b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/src/dma_kernels.hpp
@@ -22,11 +22,13 @@ template <typename KernelId, typename T, typename Pipe, int pixels_per_cycle>
 event SubmitInputDMA(queue &q, T *in_ptr, int rows, int cols, int frames) {
   using PipeType = DataBundle<T, pixels_per_cycle>;
 
+#if defined (IS_BSP)
   // LSU attribute to  turn off caching
   using NonCachingLSU =
       ext::intel::lsu<ext::intel::burst_coalesce<true>, ext::intel::cache<0>,
                       ext::intel::statically_coalesce<true>,
                       ext::intel::prefetch<false>>;
+#endif 
 
   // validate the number of columns
   if ((cols % pixels_per_cycle) != 0) {
@@ -41,7 +43,12 @@ event SubmitInputDMA(queue &q, T *in_ptr, int rows, int cols, int frames) {
 
   // Using device memory
   return q.single_task<KernelId>([=]() [[intel::kernel_args_restrict]] {
+
+#if defined (IS_BSP)
     device_ptr<T> in(in_ptr);
+#else 
+    T* in(in_ptr);
+#endif  
 
     // coalesce the following two loops into a single for-loop using the
     // loop_coalesce attribute
@@ -51,7 +58,11 @@ event SubmitInputDMA(queue &q, T *in_ptr, int rows, int cols, int frames) {
         PipeType pipe_data;
         #pragma unroll
         for (int k = 0; k < pixels_per_cycle; k++) {
+#if defined (IS_BSP)
           pipe_data[k] = NonCachingLSU::load(in + i * pixels_per_cycle + k);
+#else 
+          pipe_data[k] = in[i * pixels_per_cycle + k];
+#endif   
         }
         Pipe::write(pipe_data);
       }
@@ -77,7 +88,12 @@ event SubmitOutputDMA(queue &q, T *out_ptr, int rows, int cols, int frames) {
 
   // Using device memory
   return q.single_task<KernelId>([=]() [[intel::kernel_args_restrict]] {
+
+#if defined (IS_BSP)
     device_ptr<T> out(out_ptr);
+#else 
+    T* out(out_ptr);
+#endif      
 
     // coalesce the following two loops into a single for-loop using the
     // loop_coalesce attribute

diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/src/intensity_sigma_lut.hpp b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/src/intensity_sigma_lut.hpp
@@ -16,6 +16,7 @@ class IntensitySigmaLUT {
   // default constructor
   IntensitySigmaLUT() {}
 
+#if defined (IS_BSP)
   // construct from a device_ptr (for constructing from device memory)
   IntensitySigmaLUT(device_ptr<float> ptr) {
     // use a pipelined LSU to load from device memory since we don't
@@ -25,6 +26,14 @@ class IntensitySigmaLUT {
       data_[i] = PipelinedLSU::load(ptr + i);
     }
   }
+#else 
+  // construct from a regular pointer
+  IntensitySigmaLUT(float* ptr) {
+    for (int i = 0; i < lut_depth; i++) {
+      data_[i] = ptr[i];
+    }
+  }
+#endif
 
   // construct from the ANR parameters (actually builds the LUT)
   IntensitySigmaLUT(ANRParams params) {
@@ -39,8 +48,12 @@ class IntensitySigmaLUT {
   }
 
   // helper static method to allocate enough memory to hold the LUT
-  static float* AllocateDevice(sycl::queue& q) {
+  static float* Allocate(sycl::queue& q) {
+#if defined (IS_BSP)
     float* ptr = sycl::malloc_device<float>(lut_depth, q);
+#else 
+    float* ptr = sycl::malloc_shared<float>(lut_depth, q);
+#endif   
     if (ptr == nullptr) {
       std::cerr << "ERROR: could not allocate space for 'ptr'\n";
       std::terminate();
@@ -49,7 +62,7 @@ class IntensitySigmaLUT {
   }
 
   // helper method to copy the data to the device
-  sycl::event CopyDataToDevice(sycl::queue& q, float* ptr) {
+  sycl::event CopyData(sycl::queue& q, float* ptr) {
     return q.memcpy(ptr, data_, lut_depth * sizeof(float));
   }
 

diff --git a/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/src/main.cpp b/DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/anr/src/main.cpp
@@ -117,6 +117,7 @@ int main(int argc, char* argv[]) {
   // create the output pixels (initialize to all 0s)
   std::vector<PixelT> out_pixels(in_pixels.size(), 0);
 
+#if defined (IS_BSP)
   // allocate memory on the device for the input and output
   PixelT *in, *out;
   if ((in = malloc_device<PixelT>(pixel_count, q)) == nullptr) {
@@ -127,18 +128,31 @@ int main(int argc, char* argv[]) {
     std::cerr << "ERROR: could not allocate space for 'out'\n";
     std::terminate();
   }
+#else 
+  // allocate memory on the host for the input and output
+  PixelT *in, *out;
+  if ((in = malloc_shared<PixelT>(pixel_count, q)) == nullptr) {
+    std::cerr << "ERROR: could not allocate space for 'in'\n";
+    std::terminate();
+  }
+  if ((out = malloc_shared<PixelT>(pixel_count, q)) == nullptr) {
+    std::cerr << "ERROR: could not allocate space for 'out'\n";
+    std::terminate();
+  }
+#endif   
+
 
   // copy the input data to the device memory and wait for the copy to finish
   q.memcpy(in, in_pixels.data(), pixel_count * sizeof(PixelT)).wait();
 
   // allocate space for the intensity sigma LUT
-  float* sig_i_lut_data_ptr = IntensitySigmaLUT::AllocateDevice(q);
+  float* sig_i_lut_data_ptr = IntensitySigmaLUT::Allocate(q);
 
   // create the intensity sigma LUT data locally on the host
   IntensitySigmaLUT sig_i_lut_host(params);
 
   // copy the intensity sigma LUT to the device
-  sig_i_lut_host.CopyDataToDevice(q, sig_i_lut_data_ptr).wait();
+  sig_i_lut_host.CopyData(q, sig_i_lut_data_ptr).wait();
   //////////////////////////////////////////////////////////////////////////////
 
   // track timing information in ms