You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: DirectProgramming/C++SYCL_FPGA/Tutorials/Tools/dynamic_profiler/README.md
+4-2
Original file line number
Diff line number
Diff line change
@@ -12,6 +12,8 @@ This FPGA tutorial demonstrates how to use the Intel® FPGA Dynamic Profiler for
12
12
| What you will learn | About the Intel® FPGA Dynamic Profiler for DPC++ <br> How to set up and use this tool <br> A case study of using this tool to identify performance bottlenecks in pipes.
13
13
| Time to complete | 15 minutes
14
14
15
+
> **Note**: This sample has been tuned to show the results described on Arria 10 devices. While it compiles and runs on the other supported devices, the hardware profiling results may differ slighly from what is described below.
16
+
15
17
> **Note**: Even though the Intel DPC++/C++ OneAPI compiler is enough to compile for emulation, generating reports and generating RTL, there are extra software requirements for the simulation flow and FPGA compiles.
16
18
>
17
19
> For using the simulator flow, Intel® Quartus® Prime Pro Edition and one of the following simulators must be installed and accessible through your PATH:
@@ -139,7 +141,7 @@ When analyzing performance data to optimize a design, the goal is to get as clos
139
141
140
142
#### Analyzing Stall and Occupancy Metrics
141
143
142
-
In this tutorial, there are two design scenarios defined in dynamic_profiler.cpp. One showing a naive pre-optimized design, and a second showing the same design optimized based on data collected through the Intel® FPGA Dynamic Profiler for DPC++.
144
+
In this tutorial, there are two design scenarios defined in dynamic_profiler.cpp. One showing a naive pre-optimized design, and a second showing the same design optimized based on data collected through the Intel® FPGA Dynamic Profiler for DPC++ on an Arria 10 device.
143
145
144
146
##### Pre-optimization Version #####
145
147
@@ -155,7 +157,7 @@ The second scenario is an example of what the design might look like after being
155
157
- a producer SYCL kernel (ProducerAfter) that reads data from a buffer, performs the first computation on the data and writes this value to a pipe (ProducerToConsumerAfterPipe), and
156
158
- a consumer SYCL kernel (ConsumerAfter) that reads from the pipe (ProducerToConsumerAfterPipe), does the second set of computations and fills up the output buffer.
157
159
158
-
When looking at the performance data for the two "after optimization" kernels in the Bottom-Up view, you should see that ProducerAfter's pipe write (on line 105) and the ConsumerAfter's pipe read (line 120) both have stall percentages near 0%. This indicates the pipe is being used more effectively - now the read and write side of the pipe are being used at similar rates, so the pipe operations are not creating stalls in the pipeline. This also speeds up the overall design execution - the two "after" kernels take less time to execute than the two before kernels.
160
+
When looking at the performance data for the two "after optimization" kernels in the Bottom-Up view, you should see that ProducerAfter's pipe write (on line 126) and the ConsumerAfter's pipe read (line 139) both have stall percentages near 0%. This indicates the pipe is being used more effectively - now the read and write side of the pipe are being used at similar rates, so the pipe operations are not creating stalls in the pipeline. This also speeds up the overall design execution - the two "after" kernels take less time to execute than the two before kernels.
0 commit comments