FPGA: New Code Sample minimum_latency
#1302
Merged
Commits (7):
02b9fc0 Introduce the code sample (shuoniu-intel)
54a43ec Update the design and add a new target manual_revert (shuoniu-intel)
18d3609 Create README (shuoniu-intel)
c8cdfa0 Address Yohann's comments (shuoniu-intel)
ef74d46 Update directory path in README (shuoniu-intel)
d2d1344 Incorporate a global change (shuoniu-intel)
e98e5bd Address John's comments (shuoniu-intel)
20 changes: 20 additions & 0 deletions
...amming/C++SYCL_FPGA/Tutorials/Features/optimization_levels/minimum_latency/CMakeLists.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
if(UNIX) | ||
# Direct CMake to use icpx rather than the default C++ compiler/linker | ||
set(CMAKE_CXX_COMPILER icpx) | ||
else() # Windows | ||
# Force CMake to use icx-cl rather than the default C++ compiler/linker | ||
# (needed on Windows only) | ||
include (CMakeForceCompiler) | ||
CMAKE_FORCE_CXX_COMPILER (icx-cl IntelDPCPP) | ||
include (Platform/Windows-Clang) | ||
endif() | ||
|
||
cmake_minimum_required (VERSION 3.4) | ||
|
||
project(MinimumLatency CXX) | ||
|
||
set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) | ||
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) | ||
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}) | ||
|
||
add_subdirectory (src) |
278 changes: 278 additions & 0 deletions
...g/C++SYCL_FPGA/Tutorials/Features/optimization_levels/minimum_latency/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,278 @@ | ||
# `Minimum Latency Flow` sample | ||
|
||
This FPGA tutorial demonstrates how to compile your design with the minimum latency flow to achieve low latency at the cost of reduced f<sub>MAX</sub>. | ||
|
||
| Optimized for | Description | ||
|:--- |:--- | ||
| OS | Linux* Ubuntu* 18.04/20.04 <br> RHEL*/CentOS* 8 <br> SUSE* 15 <br> Windows* 10 | ||
| Hardware | Intel® Agilex™, Arria® 10, and Stratix® 10 FPGAs | ||
| Software | Intel® oneAPI DPC++/C++ Compiler | ||
| What you will learn | How to use the minimum latency flow to compile low-latency designs<br>How to manually override underlying controls set by the minimum latency flow | ||
| Time to complete | 20 minutes | ||
|
||
> **Note**: Although the Intel® oneAPI DPC++/C++ Compiler is sufficient for emulation, report generation, and RTL generation, the simulation flow and FPGA hardware compiles have additional software requirements. | ||
> | ||
> To use the simulator flow, Intel® Quartus® Prime Pro Edition and one of the following simulators must be installed and accessible through your PATH: | ||
> - Questa*-Intel® FPGA Edition | ||
> - Questa*-Intel® FPGA Starter Edition | ||
> - ModelSim® SE | ||
> | ||
> When using the hardware compile flow, Intel® Quartus® Prime Pro Edition must be installed and accessible through your PATH. | ||
> | ||
> :warning: Make sure you add the device files associated with the FPGA that you are targeting to your Intel® Quartus® Prime installation. | ||
|
||
## Prerequisites | ||
|
||
This sample is part of the FPGA code samples. | ||
It is categorized as a Tier 3 sample that demonstrates a compiler feature. | ||
|
||
```mermaid | ||
flowchart LR | ||
tier1("Tier 1: Get Started") | ||
tier2("Tier 2: Explore the Fundamentals") | ||
tier3("Tier 3: Explore the Advanced Techniques") | ||
tier4("Tier 4: Explore the Reference Designs") | ||
|
||
tier1 --> tier2 --> tier3 --> tier4 | ||
|
||
style tier1 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff | ||
style tier2 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff | ||
style tier3 fill:#f96,stroke:#333,stroke-width:1px,color:#fff | ||
style tier4 fill:#0071c1,stroke:#0071c1,stroke-width:1px,color:#fff | ||
``` | ||
|
||
Find more information about how to navigate this part of the code samples in the [FPGA top-level README.md](/DirectProgramming/C++SYCL_FPGA/README.md). | ||
You can also find more information about [troubleshooting build errors](/DirectProgramming/C++SYCL_FPGA/README.md#troubleshooting), [running the sample on the Intel® DevCloud](/DirectProgramming/C++SYCL_FPGA/README.md#build-and-run-the-samples-on-intel-devcloud-optional), [using Visual Studio Code with the code samples](/DirectProgramming/C++SYCL_FPGA/README.md#use-visual-studio-code-vs-code-optional), [links to selected documentation](/DirectProgramming/C++SYCL_FPGA/README.md#documentation), etc. | ||
|
||
## Purpose | ||
|
||
This FPGA tutorial demonstrates how to use the minimum latency flow to compile low-latency designs and how to manually override underlying controls set by the minimum latency flow. By default, the minimum latency flow tries to achieve lower latency at the cost of decreased f<sub>MAX</sub>, so this flow is a good starting point for optimizing latency-sensitive designs. | ||
|
||
To compile your design with the minimum latency flow, pass the `-Xsoptimize=latency` flag to the `icpx` command. | ||
|
||
The minimum latency flow implies the following compiler controls: | ||
- Disable hyper-optimized handshaking on Intel Stratix® 10 and Intel Agilex™ devices | ||
|
||
- Use a zero-latency exit FIFO for stall-free clusters | ||
- Disable loop speculation | ||
- Remove the 1-cycle delay on the pipelined loop limiter | ||
|
||
The following table shows how users can manually override these underlying controls: | ||
|Underlying Control |Control Flags/Attributes |Reference | ||
|:--- |:--- |:--- | ||
|Hyper-optimized handshaking |`-Xshyper-optimized-handshaking=<auto\|off\|on>` |[Modify the Handshaking Protocol Between Clusters](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/flags-attr-prag-ext/optimization-flags/hyper-opt-handshaking.html) | ||
|Exit FIFO latency of stall-free clusters|`-Xssfc-exit-fifo-type=<default\|zero-latency\|low-latency>`|[Global Control of Exit FIFO Latency of Stall-free Clusters](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/flags-attr-prag-ext/optimization-flags/control-exit-fifo-latency.html) | ||
|Loop speculation |`[[intel::speculated_iterations(N)]]` |[`speculated_iterations` Attribute](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/flags-attr-prag-ext/loop-directives/speculated-iterations-attribute.html) | ||
|Pipelined loop limiter |N/A |N/A | ||
|
||
> **Note**: Using these manual controls overrides the underlying controls individually without affecting other underlying controls introduced by the `-Xsoptimize=latency` compiler flag. | ||
|
||
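As a rough illustration of how the flags in the table compose, the sketch below prints candidate `icpx` command lines for the three compiles described in this tutorial. The helper name `latency_flags`, the source file name `src/minimum_latency.cpp`, the output names, and the choice of `auto` and `default` as the revert values are assumptions made for illustration; the sample's actual build scripts may pass different options (device selection, for instance, is omitted here). Nothing is compiled; the commands are only echoed.

```bash
# Sketch only: print the latency-related flags for each compile variant.
# Variant names mirror the tutorial's targets; the revert values
# (auto / default) are assumptions, not taken from the sample's build files.
latency_flags() {
  case "$1" in
    no_control)      echo "" ;;
    minimum_latency) echo "-Xsoptimize=latency" ;;
    manual_revert)   echo "-Xsoptimize=latency -Xshyper-optimized-handshaking=auto -Xssfc-exit-fifo-type=default" ;;
    *) echo "unknown variant: $1" >&2; return 1 ;;
  esac
}

for variant in no_control minimum_latency manual_revert; do
  # The source file name below is hypothetical.
  echo "icpx -fsycl -fintelfpga -Xshardware $(latency_flags "$variant") src/minimum_latency.cpp -o $variant.fpga"
done
```

Loop speculation has no command-line flag in the table, so reverting it would go through the `[[intel::speculated_iterations(N)]]` attribute in the kernel source rather than through this flag list.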
### Understanding the Tutorial Design | ||
|
||
The tutorial kernel performs an RGB-to-grayscale conversion. To see the impact of the minimum latency flow on latency and f<sub>MAX</sub>, and to see how to override the flow with specific manual controls, the design needs to be compiled three times. | ||
|
||
Part 1 compiles the design without passing the `-Xsoptimize=latency` flag. In this default flow, the compiler targets higher throughput and f<sub>MAX</sub> at the expense of latency and area. | ||
|
||
Part 2 compiles the design with the `-Xsoptimize=latency` flag, so the minimum latency flow is used in this compile. By setting up the underlying compiler controls listed above, the minimum latency flow achieves lower latency by trading off f<sub>MAX</sub>. | ||
|
||
Part 3 also compiles the design with the minimum latency flow, but adds manual controls that revert the flow's default underlying controls. Therefore, the latency and f<sub>MAX</sub> of this compile should match those of part 1. | ||
|
||
## Key Concepts | ||
|
||
* How to use the minimum latency flow to compile low-latency designs | ||
* How to manually override underlying controls set by the minimum latency flow | ||
|
||
## Building the `minimum_latency` Tutorial | ||
|
||
> **Note**: When working with the command-line interface (CLI), you should configure the oneAPI toolkits using environment variables. | ||
> Set up your CLI environment by sourcing the `setvars` script located in the root of your oneAPI installation every time you open a new terminal window. | ||
> This practice ensures that your compiler, libraries, and tools are ready for development. | ||
> | ||
> Linux*: | ||
> - For system wide installations: `. /opt/intel/oneapi/setvars.sh` | ||
> - For private installations: `. ~/intel/oneapi/setvars.sh` | ||
> - For non-POSIX shells, like csh, use the following command: `bash -c 'source <install-dir>/setvars.sh ; exec csh'` | ||
> | ||
> Windows*: | ||
> - `C:\Program Files (x86)\Intel\oneAPI\setvars.bat` | ||
> - Windows PowerShell*, use the following command: `cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'` | ||
> | ||
> For more information on configuring environment variables, see [Use the setvars Script with Linux* or macOS*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-linux-or-macos.html) or [Use the setvars Script with Windows*](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/oneapi-development-environment-setup/use-the-setvars-script-with-windows.html). | ||
|
||
### On a Linux* System | ||
|
||
1. Generate the `Makefile` by running `cmake`: | ||
|
||
``` | ||
mkdir build | ||
cd build | ||
``` | ||
To compile for the default target (the Agilex™ device family), run `cmake` using the command: | ||
``` | ||
cmake .. | ||
``` | ||
|
||
> **Note**: You can change the default target by using the command: | ||
> ``` | ||
> cmake .. -DFPGA_DEVICE=<FPGA device family or FPGA part number> | ||
> ``` | ||
> | ||
> Alternatively, you can target an explicit FPGA board variant and BSP by using the following command: | ||
> ``` | ||
> cmake .. -DFPGA_DEVICE=<board-support-package>:<board-variant> | ||
> ``` | ||
> | ||
> You will only be able to run an executable on the FPGA if you specified a BSP. | ||
|
||
2. Compile the design using the generated `Makefile`. The following build targets are provided, matching the recommended development flow: | ||
|
||
* Compile for emulation (fast compile time, targets emulated FPGA device): | ||
|
||
```bash | ||
make fpga_emu | ||
``` | ||
|
||
* Generate the optimization reports: | ||
|
||
```bash | ||
make report | ||
``` | ||
|
||
* Compile for simulation (fast compile time, targets simulated FPGA device, reduced data size): | ||
|
||
```bash | ||
make fpga_sim | ||
``` | ||
|
||
* Compile for FPGA hardware (longer compile time, targets FPGA device): | ||
|
||
```bash | ||
make fpga | ||
``` | ||
|
||
### On a Windows* System | ||
|
||
1. Generate the `Makefile` by running `cmake`. | ||
|
||
``` | ||
mkdir build | ||
cd build | ||
``` | ||
To compile for the default target (the Agilex™ device family), run `cmake` using the command: | ||
``` | ||
cmake -G "NMake Makefiles" .. | ||
``` | ||
> **Note**: You can change the default target by using the command: | ||
> ``` | ||
> cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<FPGA device family or FPGA part number> | ||
> ``` | ||
> | ||
> Alternatively, you can target an explicit FPGA board variant and BSP by using the following command: | ||
> ``` | ||
> cmake -G "NMake Makefiles" .. -DFPGA_DEVICE=<board-support-package>:<board-variant> | ||
> ``` | ||
> | ||
> You will only be able to run an executable on the FPGA if you specified a BSP. | ||
|
||
2. Compile the design through the generated `Makefile`. The following build targets are provided, matching the recommended development flow: | ||
|
||
* Compile for emulation (fast compile time, targets emulated FPGA device): | ||
``` | ||
nmake fpga_emu | ||
``` | ||
* Generate the optimization reports: | ||
``` | ||
nmake report | ||
``` | ||
* Compile for simulation (fast compile time, targets simulated FPGA device, reduced data size): | ||
``` | ||
nmake fpga_sim | ||
``` | ||
* Compile for FPGA hardware (longer compile time, targets FPGA device): | ||
``` | ||
nmake fpga | ||
``` | ||
|
||
> **Note**: If you encounter any issues with long paths when compiling under Windows*, you may have to create your `build` directory in a shorter path, for example `c:\samples\build`. You can then run `cmake` from that directory, and provide `cmake` with the full path to your sample directory. | ||
|
||
## Examining the Reports | ||
|
||
Locate the three `report.html` files in one of the following sets of directories: | ||
|
||
* **Report-only compile**: `no_control_report.prj`, `minimum_latency_report.prj`, and `manual_revert_report.prj` | ||
* **FPGA hardware compile**: `no_control.fpga.prj`, `minimum_latency.fpga.prj`, and `manual_revert.fpga.prj` | ||
|
||
Open the reports in Chrome*, Firefox*, Edge*, or Internet Explorer*. | ||
|
||
Navigate to **Loop Analysis** (**Throughput Analysis > Loop Analysis**). In this viewer, you can find the latency of loops in the kernel. The latency of the compile with the minimum latency flow (part 2) should be lower than that of the other two compiles. Also, the latency of the other two compiles (parts 1 and 3) should be the same. | ||
|
||
Navigate to **Clock Frequency Summary** (**Summary > Clock Frequency Summary**) in `no_control.fpga.prj/reports/report.html`, `minimum_latency.fpga.prj/reports/report.html`, and `manual_revert.fpga.prj/reports/report.html` (after `make fpga` completes). In this table, you can find the actual f<sub>MAX</sub>. The f<sub>MAX</sub> of the compile with the minimum latency flow (part 2) should be lower than that of the other two compiles. Also, the f<sub>MAX</sub> of the other two compiles (parts 1 and 3) should be the same. Note that only the report generated by the FPGA hardware compile reflects the true f<sub>MAX</sub> affected by the minimum latency flow. The difference is **not** apparent in the reports generated by `make report` because a design's f<sub>MAX</sub> cannot be predicted. | ||
|
||
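The directory layout described above can be checked with a small helper before opening the reports. This is a minimal sketch: the function name `find_reports` is hypothetical, and the `reports/report.html` path under the report-only directories is assumed to match the layout shown for the hardware compile.

```bash
# Sketch: list each variant's report under a build directory, or flag it as missing.
# $1 = build directory, $2 = directory suffix ("_report.prj" or ".fpga.prj").
find_reports() {
  fr_build_dir="$1"
  fr_suffix="$2"
  fr_missing=0
  for fr_variant in no_control minimum_latency manual_revert; do
    fr_path="$fr_build_dir/$fr_variant$fr_suffix/reports/report.html"
    if [ -f "$fr_path" ]; then
      echo "$fr_path"
    else
      echo "missing: $fr_path" >&2
      fr_missing=1
    fi
  done
  return "$fr_missing"
}
```

From the build directory, `find_reports . _report.prj` would cover the report-only compile and `find_reports . .fpga.prj` the hardware compile; a non-zero return value indicates that at least one report was not found.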
## Running the Sample | ||
|
||
1. Run the sample on the FPGA emulator (the kernel executes on the CPU): | ||
|
||
```bash | ||
./no_control.fpga_emu (Linux) | ||
no_control.fpga_emu.exe (Windows) | ||
``` | ||
|
||
2. Run the sample on the FPGA simulator device: | ||
|
||
* On Linux | ||
```bash | ||
CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./no_control.fpga_sim | ||
CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./minimum_latency.fpga_sim | ||
CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 ./manual_revert.fpga_sim | ||
``` | ||
* On Windows | ||
```bash | ||
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA=1 | ||
no_control.fpga_sim.exe | ||
minimum_latency.fpga_sim.exe | ||
manual_revert.fpga_sim.exe | ||
set CL_CONTEXT_MPSIM_DEVICE_INTELFPGA= | ||
``` | ||
|
||
3. Run the sample on the FPGA device (only if you ran `cmake` with `-DFPGA_DEVICE=<board-support-package>:<board-variant>`): | ||
|
||
```bash | ||
./no_control.fpga (Linux) | ||
./minimum_latency.fpga (Linux) | ||
./manual_revert.fpga (Linux) | ||
no_control.fpga.exe (Windows) | ||
minimum_latency.fpga.exe (Windows) | ||
manual_revert.fpga.exe (Windows) | ||
``` | ||
|
||
### Example of Output | ||
|
||
Output of sample without minimum latency flow: | ||
```txt | ||
Kernel Throughput: 195.716MB/s | ||
Exec Time: 1.9491e-05s, InputMB: 0.0038147MB | ||
PASSED: all kernel results are correct | ||
``` | ||
|
||
Output of sample with minimum latency flow: | ||
```txt | ||
Kernel Throughput: 137.764MB/s | ||
Exec Time: 2.769e-05s, InputMB: 0.0038147MB | ||
PASSED: all kernel results are correct | ||
``` | ||
|
||
Output of sample with minimum latency flow but controls manually reverted: | ||
```txt | ||
Kernel Throughput: 192.934MB/s | ||
Exec Time: 1.9772e-05s, InputMB: 0.0038147MB | ||
PASSED: all kernel results are correct | ||
``` | ||
|
||
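The throughput figures in the outputs above appear to be the input size divided by the execution time. The sketch below recomputes them from the printed values as a sanity check; the helper name `throughput_mbps` is hypothetical, and the one-decimal rounding is a choice made here, not the sample's own formatting.

```bash
# Sketch: recompute throughput (MB/s) from input size (MB) and execution time (s).
throughput_mbps() {
  awk -v mb="$1" -v sec="$2" 'BEGIN { printf "%.1f\n", mb / sec }'
}

throughput_mbps 0.0038147 1.9491e-05   # without the minimum latency flow
throughput_mbps 0.0038147 2.769e-05    # with the minimum latency flow
throughput_mbps 0.0038147 1.9772e-05   # controls manually reverted
```

Because the input size is identical across the three runs, the differences in throughput follow directly from the differences in execution time.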
### Discussion of Results | ||
|
||
Compared with the Intel Arria® 10 GX FPGA, the effect of the minimum latency flow is more pronounced on the Intel Stratix® 10 SX FPGA, where it significantly reduces latency, along with f<sub>MAX</sub> and throughput. This is because the minimum latency flow disables hyper-optimized handshaking, which achieves higher f<sub>MAX</sub> at the cost of increased latency. For more information on the hyper-optimized handshaking protocol on Intel Stratix® 10 and Intel Agilex™ devices, see [Modify the Handshaking Protocol Between Clusters](https://www.intel.com/content/www/us/en/develop/documentation/oneapi-fpga-optimization-guide/top/flags-attr-prag-ext/optimization-flags/hyper-opt-handshaking.html). | ||
|
||
## License | ||
|
||
Code samples are licensed under the MIT license. See [License.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/License.txt) for details. | ||
|
||
Third-party program licenses can be found here: [third-party-programs.txt](https://github.com/oneapi-src/oneAPI-samples/blob/master/third-party-programs.txt). |
71 changes: 71 additions & 0 deletions
...ogramming/C++SYCL_FPGA/Tutorials/Features/optimization_levels/minimum_latency/sample.json
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
{ | ||
"guid": "22F00FD4-D485-449F-8612-EDF2C276B1B5", | ||
"name": "Minimum Latency", | ||
"categories": ["Toolkit/oneAPI Direct Programming/C++SYCL FPGA/Tutorials/Features/optimization_levels"], | ||
"description": "An Intel® FPGA tutorial demonstrating the minimum latency optimization level", | ||
"toolchain": ["icpx"], | ||
"os": ["linux", "windows"], | ||
"targetDevice": ["FPGA"], | ||
"builder": ["ide", "cmake"], | ||
"languages": [{"cpp":{}}], | ||
"commonFolder": { | ||
"base": "../../../..", | ||
"include": [ | ||
"README.md", | ||
"Tutorials/Features/optimization_levels/minimum_latency", | ||
"include" | ||
], | ||
"exclude": [] | ||
}, | ||
"ciTests": { | ||
"linux": [ | ||
{ | ||
"id": "fpga_emu", | ||
"steps": [ | ||
"icpx --version", | ||
"mkdir build", | ||
"cd build", | ||
"cmake ..", | ||
"make fpga_emu", | ||
"./no_control.fpga_emu" | ||
] | ||
}, | ||
{ | ||
"id": "report", | ||
"steps": [ | ||
"icpx --version", | ||
"mkdir build", | ||
"cd build", | ||
"cmake ..", | ||
"make report" | ||
] | ||
} | ||
], | ||
"windows": [ | ||
{ | ||
"id": "fpga_emu", | ||
"steps": [ | ||
"icpx --version", | ||
"cd ../../..", | ||
"mkdir build", | ||
"cd build", | ||
"cmake -G \"NMake Makefiles\" ../Tutorials/Features/optimization_levels/minimum_latency", | ||
"nmake fpga_emu", | ||
"no_control.fpga_emu.exe" | ||
] | ||
}, | ||
{ | ||
"id": "report", | ||
"steps": [ | ||
"icpx --version", | ||
"cd ../../..", | ||
"mkdir build", | ||
"cd build", | ||
"cmake -G \"NMake Makefiles\" ../Tutorials/Features/optimization_levels/minimum_latency", | ||
"nmake report" | ||
] | ||
} | ||
] | ||
}, | ||
"expertise": "Concepts and Functionality" | ||
} |