Commit 6819ab9

FPGA: Bug fix to the functionality of the dynamic_profiler tutorial (#1340)
1 parent 1e2fdf0 commit 6819ab9

2 files changed: +10 -12 lines changed

DirectProgramming/C++SYCL_FPGA/Tutorials/Tools/dynamic_profiler/README.md

+4 -2
@@ -12,6 +12,8 @@ This FPGA tutorial demonstrates how to use the Intel® FPGA Dynamic Profiler for
 | What you will learn | About the Intel® FPGA Dynamic Profiler for DPC++ <br> How to set up and use this tool <br> A case study of using this tool to identify performance bottlenecks in pipes.
 | Time to complete | 15 minutes
 
+> **Note**: This sample has been tuned to show the results described on Arria 10 devices. While it compiles and runs on the other supported devices, the hardware profiling results may differ slightly from what is described below.
+
 > **Note**: Even though the Intel DPC++/C++ OneAPI compiler is enough to compile for emulation, generating reports and generating RTL, there are extra software requirements for the simulation flow and FPGA compiles.
 >
 > For using the simulator flow, Intel® Quartus® Prime Pro Edition and one of the following simulators must be installed and accessible through your PATH:
@@ -139,7 +141,7 @@ When analyzing performance data to optimize a design, the goal is to get as clos
 
 #### Analyzing Stall and Occupancy Metrics
 
-In this tutorial, there are two design scenarios defined in dynamic_profiler.cpp. One showing a naive pre-optimized design, and a second showing the same design optimized based on data collected through the Intel® FPGA Dynamic Profiler for DPC++.
+In this tutorial, there are two design scenarios defined in dynamic_profiler.cpp. One showing a naive pre-optimized design, and a second showing the same design optimized based on data collected through the Intel® FPGA Dynamic Profiler for DPC++ on an Arria 10 device.
 
 ##### Pre-optimization Version #####
 
@@ -155,7 +157,7 @@ The second scenario is an example of what the design might look like after being
 - a producer SYCL kernel (ProducerAfter) that reads data from a buffer, performs the first computation on the data and writes this value to a pipe (ProducerToConsumerAfterPipe), and
 - a consumer SYCL kernel (ConsumerAfter) that reads from the pipe (ProducerToConsumerAfterPipe), does the second set of computations and fills up the output buffer.
 
-When looking at the performance data for the two "after optimization" kernels in the Bottom-Up view, you should see that ProducerAfter's pipe write (on line 105) and the ConsumerAfter's pipe read (line 120) both have stall percentages near 0%. This indicates the pipe is being used more effectively - now the read and write side of the pipe are being used at similar rates, so the pipe operations are not creating stalls in the pipeline. This also speeds up the overall design execution - the two "after" kernels take less time to execute than the two before kernels.
+When looking at the performance data for the two "after optimization" kernels in the Bottom-Up view, you should see that ProducerAfter's pipe write (on line 126) and the ConsumerAfter's pipe read (line 139) both have stall percentages near 0%. This indicates the pipe is being used more effectively - now the read and write side of the pipe are being used at similar rates, so the pipe operations are not creating stalls in the pipeline. This also speeds up the overall design execution - the two "after" kernels take less time to execute than the two before kernels.
 
 ![](profiler_pipe_tutorial_bottom_up.png)
 
DirectProgramming/C++SYCL_FPGA/Tutorials/Tools/dynamic_profiler/src/dynamic_profiler.cpp

+6 -10
@@ -33,21 +33,17 @@ class ProducerAfterKernel;
 class ConsumerAfterKernel;
 
 // kSize = # of floats to process on each kernel execution.
-#if defined(FPGA_EMULATOR)
-constexpr int kSize = 4096;
-#elif defined(FPGA_SIMULATOR)
+#if defined(FPGA_EMULATOR) or defined(FPGA_SIMULATOR)
 constexpr int kSize = 64;
 #else
 constexpr int kSize = 262144;
 #endif
 
 // Number of iterations performed in the consumer kernels
 // This controls the amount of work done by the Consumer.
-#if defined(FPGA_SIMULATOR)
-constexpr int kComplexity = 2000;
-#else
-constexpr int kComplexity = 32;
-#endif
+// After the optimization, the Producer and Consumer split the work.
+constexpr int kComplexity1 = 1900;
+constexpr int kComplexity2 = 2000;
 
 // Perform two stages of processing on the input data.
 // The output of ConsumerWork1 needs to go to the input
@@ -56,15 +52,15 @@ constexpr int kComplexity = 32;
 // can be replaced with more useful operations.
 float ConsumerWork1(float f) {
   float output = f;
-  for (int j = 0; j < kComplexity; j++) {
+  for (int j = 0; j < kComplexity1; j++) {
     output = 20 * f + j - output;
   }
   return output;
 }
 
 float ConsumerWork2(float f) {
   auto output = f;
-  for (int j = 0; j < kComplexity; j++) {
+  for (int j = 0; j < kComplexity2; j++) {
     output = output + f * j;
   }
   return output;
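
Reviewer note: to sanity-check the arithmetic of the two work functions with the new per-stage constants, they can be compiled as plain host C++ without any SYCL headers. The sketch below copies the post-commit versions from the hunks above. The closing brace of `ConsumerWork2` falls outside the hunk and is supplied here as an assumption.

```cpp
// Host-only copy of the constants and functions touched by this commit.
// No SYCL or FPGA toolchain is needed to compile this fragment.
constexpr int kComplexity1 = 1900;  // iterations in the first stage
constexpr int kComplexity2 = 2000;  // iterations in the second stage

// First stage of processing, as it appears after the commit.
float ConsumerWork1(float f) {
  float output = f;
  for (int j = 0; j < kComplexity1; j++) {
    output = 20 * f + j - output;
  }
  return output;
}

// Second stage of processing, as it appears after the commit.
float ConsumerWork2(float f) {
  auto output = f;
  for (int j = 0; j < kComplexity2; j++) {
    output = output + f * j;
  }
  return output;  // closing brace below is outside the diff hunk (assumed)
}
```

For inputs where every intermediate value is a small integer the results are exact: `ConsumerWork1(0.0f)` returns `950.0f`, and `ConsumerWork2(1.0f)` returns `1999001.0f` (1 plus the sum of 0 through 1999).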
