FPGA: Bug fix to the functionality of the dynamic_profiler tutorial (#1340)

alisanikiforova · web-flow · commit 6819ab9ea80a · 2023-02-10T16:06:44.000+01:00
diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/Tools/dynamic_profiler/README.md b/DirectProgramming/C++SYCL_FPGA/Tutorials/Tools/dynamic_profiler/README.md
@@ -12,6 +12,8 @@ This FPGA tutorial demonstrates how to use the Intel® FPGA Dynamic Profiler for
 | What you will learn               | About the Intel® FPGA Dynamic Profiler for DPC++ <br> How to set up and use this tool <br> A case study of using this tool to identify performance bottlenecks in pipes.
 | Time to complete                  | 15 minutes
 
+> **Note**: This sample has been tuned to show the results described on Arria 10 devices. While it compiles and runs on the other supported devices, the hardware profiling results may differ slightly from what is described below.
+
 > **Note**: Even though the Intel DPC++/C++ OneAPI compiler is enough to compile for emulation, generating reports and generating RTL, there are extra software requirements for the simulation flow and FPGA compiles.
 >
 > For using the simulator flow, Intel® Quartus® Prime Pro Edition and one of the following simulators must be installed and accessible through your PATH:
@@ -139,7 +141,7 @@ When analyzing performance data to optimize a design, the goal is to get as clos
 
 #### Analyzing Stall and Occupancy Metrics
 
-In this tutorial, there are two design scenarios defined in dynamic_profiler.cpp. One showing a naive pre-optimized design, and a second showing the same design optimized based on data collected through the Intel® FPGA Dynamic Profiler for DPC++.
+In this tutorial, there are two design scenarios defined in dynamic_profiler.cpp: one showing a naive pre-optimized design, and a second showing the same design optimized based on data collected through the Intel® FPGA Dynamic Profiler for DPC++ on an Arria 10 device.
 
 ##### Pre-optimization Version #####
 
@@ -155,7 +157,7 @@ The second scenario is an example of what the design might look like after being
 - a producer SYCL kernel (ProducerAfter) that reads data from a buffer, performs the first computation on the data and writes this value to a pipe (ProducerToConsumerAfterPipe), and
 - a consumer SYCL kernel (ConsumerAfter) that reads from the pipe (ProducerToConsumerAfterPipe), does the second set of computations and fills up the output buffer.
 
-When looking at the performance data for the two "after optimization" kernels in the Bottom-Up view, you should see that ProducerAfter's pipe write (on line 105) and the ConsumerAfter's pipe read (line 120) both have stall percentages near 0%. This indicates the pipe is being used more effectively - now the read and write side of the pipe are being used at similar rates, so the pipe operations are not creating stalls in the pipeline. This also speeds up the overall design execution - the two "after" kernels take less time to execute than the two before kernels.
+When looking at the performance data for the two "after optimization" kernels in the Bottom-Up view, you should see that ProducerAfter's pipe write (on line 126) and ConsumerAfter's pipe read (on line 139) both have stall percentages near 0%. This indicates the pipe is being used more effectively: the read and write sides of the pipe are now used at similar rates, so the pipe operations no longer create stalls in the pipeline. This also speeds up the overall design execution: the two "after" kernels take less time to execute than the two "before" kernels.
 
 ![](profiler_pipe_tutorial_bottom_up.png)
 
diff --git a/DirectProgramming/C++SYCL_FPGA/Tutorials/Tools/dynamic_profiler/src/dynamic_profiler.cpp b/DirectProgramming/C++SYCL_FPGA/Tutorials/Tools/dynamic_profiler/src/dynamic_profiler.cpp
@@ -33,21 +33,17 @@ class ProducerAfterKernel;
 class ConsumerAfterKernel;
 
 // kSize = # of floats to process on each kernel execution.
-#if defined(FPGA_EMULATOR)
-constexpr int kSize = 4096;
-#elif defined(FPGA_SIMULATOR)
+#if defined(FPGA_EMULATOR) or defined(FPGA_SIMULATOR)
 constexpr int kSize = 64;
 #else
 constexpr int kSize = 262144;
 #endif
 
 // Number of iterations performed in the consumer kernels
 // This controls the amount of work done by the Consumer.
-#if defined(FPGA_SIMULATOR)
-constexpr int kComplexity = 2000;
-#else
-constexpr int kComplexity = 32;
-#endif
+// After the optimization, the Producer and Consumer split the work.
+constexpr int kComplexity1 = 1900;
+constexpr int kComplexity2 = 2000;
 
 // Perform two stages of processing on the input data.
 // The output of ConsumerWork1 needs to go to the input
@@ -56,15 +52,15 @@ constexpr int kComplexity = 32;
 // can be replaced with more useful operations.
 float ConsumerWork1(float f) {
   float output = f;
-  for (int j = 0; j < kComplexity; j++) {
+  for (int j = 0; j < kComplexity1; j++) {
     output = 20 * f + j - output;
   }
   return output;
 }
 
 float ConsumerWork2(float f) {
   auto output = f;
-  for (int j = 0; j < kComplexity; j++) {
+  for (int j = 0; j < kComplexity2; j++) {
     output = output + f * j;
   }
   return output;
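
Note on the kComplexity split: the README text in this change says ProducerAfter performs the first computation and ConsumerAfter performs the second. The host-only sketch below shows how the two new constants divide the per-element loop work across the pipe. It is an illustration rather than code from the sample: `StageWork`, `WorkBefore`, `WorkAfter`, `ProcessBefore`, `ProcessAfter` and `CheckSplit` are hypothetical names, no SYCL kernels or pipes are involved, and it assumes that the pre-optimization consumer runs both stages while its producer only forwards data.

```cpp
#include <cassert>

// Iteration counts taken from the diff above.
constexpr int kComplexity1 = 1900;
constexpr int kComplexity2 = 2000;

// The two placeholder work stages, as they read after this change.
float ConsumerWork1(float f) {
  float output = f;
  for (int j = 0; j < kComplexity1; j++) {
    output = 20 * f + j - output;
  }
  return output;
}

float ConsumerWork2(float f) {
  auto output = f;
  for (int j = 0; j < kComplexity2; j++) {
    output = output + f * j;
  }
  return output;
}

// Hypothetical bookkeeping type (not in the sample): loop iterations that
// each side of the pipe performs per element.
struct StageWork {
  int producer;
  int consumer;
};

// Assumption: before optimization the consumer runs both stages and the
// producer only moves data into the pipe.
constexpr StageWork WorkBefore() { return {0, kComplexity1 + kComplexity2}; }

// After optimization the producer runs stage 1 and the consumer runs stage 2.
constexpr StageWork WorkAfter() { return {kComplexity1, kComplexity2}; }

// Host-side stand-ins for the two arrangements. Both apply the same two
// stages in the same order; only the place where stage 1 runs differs.
float ProcessBefore(float in) { return ConsumerWork2(ConsumerWork1(in)); }

float ProcessAfter(float in) {
  float piped = ConsumerWork1(in);  // producer side; value would cross the pipe
  return ConsumerWork2(piped);      // consumer side
}

// Sanity check: the split moves work across the pipe without changing the
// total number of loop iterations per element.
void CheckSplit() {
  assert(WorkBefore().producer + WorkBefore().consumer ==
         WorkAfter().producer + WorkAfter().consumer);
}
```

Under these assumptions the split is 1900 iterations on the write side and 2000 on the read side, instead of 0 and 3900. That near balance is consistent with the README's claim that both pipe endpoints stall close to 0% after optimization.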