Bug fix to the functionality of the dynamic_profiler tutorial #1340

Merged
@@ -12,6 +12,8 @@ This FPGA tutorial demonstrates how to use the Intel® FPGA Dynamic Profiler for
| What you will learn | About the Intel® FPGA Dynamic Profiler for DPC++ <br> How to set up and use this tool <br> A case study of using this tool to identify performance bottlenecks in pipes.
| Time to complete | 15 minutes

+> **Note**: This sample has been tuned to show the results described on Arria 10 devices. While it compiles and runs on the other supported devices, the hardware profiling results may differ slightly from what is described below.

> **Note**: Even though the Intel® oneAPI DPC++/C++ Compiler is enough to compile for emulation, generate reports, and generate RTL, there are extra software requirements for the simulation flow and FPGA compiles.
>
> To use the simulator flow, Intel® Quartus® Prime Pro Edition and one of the following simulators must be installed and accessible through your PATH:
@@ -139,7 +141,7 @@ When analyzing performance data to optimize a design, the goal is to get as clos

#### Analyzing Stall and Occupancy Metrics

-In this tutorial, there are two design scenarios defined in dynamic_profiler.cpp. One showing a naive pre-optimized design, and a second showing the same design optimized based on data collected through the Intel® FPGA Dynamic Profiler for DPC++.
+In this tutorial, there are two design scenarios defined in dynamic_profiler.cpp. One showing a naive pre-optimized design, and a second showing the same design optimized based on data collected through the Intel® FPGA Dynamic Profiler for DPC++ on an Arria 10 device.

##### Pre-optimization Version #####

@@ -155,7 +157,7 @@ The second scenario is an example of what the design might look like after being
- a producer SYCL kernel (ProducerAfter) that reads data from a buffer, performs the first computation on the data and writes this value to a pipe (ProducerToConsumerAfterPipe), and
- a consumer SYCL kernel (ConsumerAfter) that reads from the pipe (ProducerToConsumerAfterPipe), does the second set of computations and fills up the output buffer.

-When looking at the performance data for the two "after optimization" kernels in the Bottom-Up view, you should see that ProducerAfter's pipe write (on line 105) and the ConsumerAfter's pipe read (line 120) both have stall percentages near 0%. This indicates the pipe is being used more effectively - now the read and write side of the pipe are being used at similar rates, so the pipe operations are not creating stalls in the pipeline. This also speeds up the overall design execution - the two "after" kernels take less time to execute than the two before kernels.
+When looking at the performance data for the two "after optimization" kernels in the Bottom-Up view, you should see that ProducerAfter's pipe write (on line 126) and the ConsumerAfter's pipe read (line 139) both have stall percentages near 0%. This indicates the pipe is being used more effectively - now the read and write side of the pipe are being used at similar rates, so the pipe operations are not creating stalls in the pipeline. This also speeds up the overall design execution - the two "after" kernels take less time to execute than the two before kernels.
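
To see why similar read and write rates push the stall percentages toward 0%, here is a minimal host-only sketch in plain C++ (no SYCL, no FPGA). It models the pipe as a bounded FIFO and counts the cycles in which each side is blocked. The names `StallStats` and `SimulatePipe`, the pipe depth, and the per-item cycle costs are all hypothetical and chosen for illustration. The stall percentage computed here is only a rough analogue of what the profiler reports, not its actual definition.

```cpp
#include <cassert>

// Hypothetical result record; not part of the tutorial or the profiler.
struct StallStats {
  double producer_stall_pct;  // % of cycles the producer waited on a full pipe
  double consumer_stall_pct;  // % of cycles the consumer waited on an empty pipe
  long total_cycles;          // cycles until the consumer finished the last item
};

// Toy cycle-by-cycle model (an assumption, not the hardware behaviour):
// the producer computes for producer_cost cycles and then writes one item,
// blocking while the pipe holds `depth` items. The consumer reads one item,
// blocking while the pipe is empty, and then computes for consumer_cost cycles.
StallStats SimulatePipe(int items, int depth, int producer_cost,
                        int consumer_cost) {
  assert(items > 0 && depth > 0 && producer_cost >= 0 && consumer_cost >= 0);
  int fifo = 0;                // items currently in the pipe
  int produced = 0;            // items written so far
  int consumed = 0;            // items read so far
  int p_busy = producer_cost;  // compute cycles left before the next write
  int c_busy = 0;              // compute cycles left on the current item
  long p_stall = 0, c_stall = 0, cycles = 0;

  while (consumed < items || c_busy > 0) {
    ++cycles;
    // Producer side: compute, write, or stall on a full pipe.
    if (produced < items) {
      if (p_busy > 0) {
        --p_busy;
      } else if (fifo < depth) {
        ++fifo;
        ++produced;
        p_busy = producer_cost;
      } else {
        ++p_stall;
      }
    }
    // Consumer side: compute, read, or stall on an empty pipe.
    if (c_busy > 0) {
      --c_busy;
    } else if (consumed < items) {
      if (fifo > 0) {
        --fifo;
        ++consumed;
        c_busy = consumer_cost;
      } else {
        ++c_stall;
      }
    }
  }
  return {100.0 * p_stall / cycles, 100.0 * c_stall / cycles, cycles};
}
```

For example, `SimulatePipe(1000, 4, 1, 40)` models a producer that does almost none of the work, while `SimulatePipe(1000, 4, 20, 20)` models an even split of roughly the same total work. In the first case the producer is blocked on a full pipe for most of the run. In the second, both stall percentages fall below 1% and the run finishes in about half as many cycles.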

![](profiler_pipe_tutorial_bottom_up.png)

@@ -33,21 +33,17 @@ class ProducerAfterKernel;
class ConsumerAfterKernel;

// kSize = # of floats to process on each kernel execution.
-#if defined(FPGA_EMULATOR)
-constexpr int kSize = 4096;
-#elif defined(FPGA_SIMULATOR)
+#if defined(FPGA_EMULATOR) or defined(FPGA_SIMULATOR)
constexpr int kSize = 64;
#else
constexpr int kSize = 262144;
#endif

// Number of iterations performed in the consumer kernels
// This controls the amount of work done by the Consumer.
-#if defined(FPGA_SIMULATOR)
-constexpr int kComplexity = 2000;
-#else
-constexpr int kComplexity = 32;
-#endif
+// After the optimization, the Producer and Consumer split the work.
+constexpr int kComplexity1 = 1900;
+constexpr int kComplexity2 = 2000;

// Perform two stages of processing on the input data.
// The output of ConsumerWork1 needs to go to the input
@@ -56,15 +52,15 @@ constexpr int kComplexity = 32;
// can be replaced with more useful operations.
float ConsumerWork1(float f) {
  float output = f;
-  for (int j = 0; j < kComplexity; j++) {
+  for (int j = 0; j < kComplexity1; j++) {
    output = 20 * f + j - output;
  }
  return output;
}

float ConsumerWork2(float f) {
  auto output = f;
-  for (int j = 0; j < kComplexity; j++) {
+  for (int j = 0; j < kComplexity2; j++) {
    output = output + f * j;
  }
  return output;
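
The split named in the new comment (after the optimization, the Producer and Consumer share the work) can be sketched on the host in plain C++. The two work functions and the `kComplexity1` / `kComplexity2` values are copied from the diff. Everything else is a hypothetical illustration: `RunBefore` and `RunAfter` are invented names, a `std::queue` stands in for the SYCL pipe, and the "before" arrangement (producer only forwards data, consumer runs both stages) is an assumption, because that part of the tutorial is not shown here.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Iteration counts copied from the diff.
constexpr int kComplexity1 = 1900;
constexpr int kComplexity2 = 2000;

float ConsumerWork1(float f) {
  float output = f;
  for (int j = 0; j < kComplexity1; j++) {
    output = 20 * f + j - output;
  }
  return output;
}

float ConsumerWork2(float f) {
  auto output = f;
  for (int j = 0; j < kComplexity2; j++) {
    output = output + f * j;
  }
  return output;
}

// Assumed "before" arrangement: the producer forwards raw data and the
// consumer performs both stages of work on every item.
std::vector<float> RunBefore(const std::vector<float>& in) {
  std::queue<float> pipe;  // host stand-in for a SYCL pipe
  for (float x : in) pipe.push(x);
  std::vector<float> out;
  while (!pipe.empty()) {
    out.push_back(ConsumerWork2(ConsumerWork1(pipe.front())));
    pipe.pop();
  }
  assert(out.size() == in.size());
  return out;
}

// "After" arrangement: the producer performs the first stage before
// writing to the pipe, and the consumer performs only the second stage.
std::vector<float> RunAfter(const std::vector<float>& in) {
  std::queue<float> pipe;  // host stand-in for a SYCL pipe
  for (float x : in) pipe.push(ConsumerWork1(x));
  std::vector<float> out;
  while (!pipe.empty()) {
    out.push_back(ConsumerWork2(pipe.front()));
    pipe.pop();
  }
  assert(out.size() == in.size());
  return out;
}
```

Both arrangements apply the same two stages in the same order, so they produce the same output for the same input. What changes is which side of the pipe pays for each stage: with 1900 and 2000 iterations the two sides do a similar amount of work per item, which is the condition under which neither side should wait long on the pipe.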