Skip to content

Remove fp-relaxed flag #1321

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 2, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -83,16 +83,13 @@ The following list shows the key optimization techniques included in the referen
2. Using two copies of the compute matrix to read a full row and a full column per cycle.
3. Converting the nested loop into a single merged loop and applying Triangular Loop optimizations. This approach enables the ability to generate a design that is pipelined efficiently.
4. Fully vectorizing the dot products using loop unrolling.
5. Using the `-Xsfp-relaxed` compiler option to reorder floating point operations and allowing the inference of a specialized dot-product DSP. This option further reduces the number of DSP blocks needed by the implementation, the overall latency, and pipeline depth.
6. Using an efficient memory banking scheme to generate high performance hardware (all local memories are single-read, single-write).
7. Using the `fpga_reg` attribute to insert more pipeline stages where needed to improve the frequency achieved by the design.
5. Using an efficient memory banking scheme to generate high performance hardware (all local memories are single-read, single-write).
6. Using the `fpga_reg` attribute to insert more pipeline stages where needed to improve the frequency achieved by the design.

### Matrix Dimensions and FPGA Resources

In this reference design, the Cholesky decomposition algorithm is used to factor a real _n_ × _n_ matrix. The algorithm computes the vector dot product of two rows of the matrix. In our FPGA implementation, the dot product is computed in a loop over the _n_ elements in the row. The loop is fully unrolled to maximize throughput, so *n* real multiplication operations are performed in parallel on the FPGA and followed by sequential additions to compute the dot product result.

The sample uses the `-fp-relaxed` compiler option, which permits the compiler to reorder floating point additions (for example, to assume that floating point addition is commutative). The compiler reorders the additions so that the dot product arithmetic can be optimally implemented using the specialized floating point Digital Signal Processing (DSP) hardware on the FPGA.

With this optimization, our FPGA implementation requires _n_ DSPs to compute the real floating point dot product. The input matrix is also replicated two times in order to be able to read two full rows per cycle. The matrix size is constrained by the total FPGA DSP and RAM resources available.

### Compiler Flags Used
Expand All @@ -101,7 +98,6 @@ With this optimization, our FPGA implementation requires _n_ DSPs to compute the
|:--- |:---
|`-Xshardware` | Target FPGA hardware (as opposed to FPGA emulator)
|`-Xsclock=<target fmax>MHz` | The FPGA backend attempts to achieve <target fmax> MHz
|`-Xsfp-relaxed` | Allows the FPGA backend to re-order floating point arithmetic operations (for example, permit assuming $(a + b + c) == (c + a + b)$ )
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You removed the explanation of fp-relaxed but you kept this flag in the Windows flags in the CMakeLists.txt. Should you instead explain that this flag is necessary on Windows only (and why)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Remove fp-relaxed for all platform, added fp-precise for window's emulator only.

|`-Xsparallel=2` | Use 2 cores when compiling the bitstream through Quartus
|`-Xsseed` | Specifies the Quartus compile seed, to potentially yield slightly higher fmax

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -66,11 +66,13 @@ endif()

# This is a Windows-specific flag that enables error handling in host code
if(WIN32)
set(PLATFORM_SPECIFIC_COMPILE_FLAGS "/EHsc /Qactypes /Wall /fp:precise")
set(PLATFORM_SPECIFIC_LINK_FLAGS "/Qactypes /fp:precise")
set(PLATFORM_SPECIFIC_COMPILE_FLAGS "/EHsc /Qactypes /Wall")
set(PLATFORM_SPECIFIC_LINK_FLAGS "/Qactypes ")
set(EMULATOR_PLATFORM_FLAGS "/fp:precise")
else()
set(PLATFORM_SPECIFIC_COMPILE_FLAGS "-qactypes -Wall -fno-finite-math-only -fp-model=precise")
set(PLATFORM_SPECIFIC_LINK_FLAGS "-fp-model=precise")
set(PLATFORM_SPECIFIC_COMPILE_FLAGS "-qactypes -Wall -fno-finite-math-only ")
set(PLATFORM_SPECIFIC_LINK_FLAGS "")
set(EMULATOR_PLATFORM_FLAGS "")
endif()

if(IGNORE_DEFAULT_SEED)
Expand Down Expand Up @@ -98,12 +100,12 @@ message(STATUS "SEED=${SEED}")
# 1. The "compile" stage compiles the device code to an intermediate representation (SPIR-V).
# 2. The "link" stage invokes the compiler's FPGA backend before linking.
# For this reason, FPGA backend flags must be passed as link flags in CMake.
set(EMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${PLATFORM_SPECIFIC_COMPILE_FLAGS} -Wformat-security -Werror=format-security -fbracket-depth=512 -DFIXED_ITERATIONS=${FIXED_ITERATIONS} -DCOMPLEX=${COMPLEX} -DMATRIX_DIMENSION=${MATRIX_DIMENSION} -DFPGA_EMULATOR ${BSP_FLAG}")
set(EMULATOR_LINK_FLAGS "-fsycl -fintelfpga ${PLATFORM_SPECIFIC_LINK_FLAGS} ${BSP_FLAG}")
set(SIMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${PLATFORM_SPECIFIC_COMPILE_FLAGS} -DFPGA_SIMULATOR -fbracket-depth=512 -DFIXED_ITERATIONS=${FIXED_ITERATIONS} -DCOMPLEX=${COMPLEX} -DMATRIX_DIMENSION=${MATRIX_DIMENSION} -Xsfp-relaxed ${USER_SIMULATOR_FLAGS} ${BSP_FLAG}")
set(SIMULATOR_LINK_FLAGS "-fsycl -fintelfpga ${PLATFORM_SPECIFIC_LINK_FLAGS} -Xssimulation -Xsghdl -Xsclock=${CLOCK_TARGET} -Xstarget=${FPGA_DEVICE} ${USER_HARDWARE_FLAGS} -Xsfp-relaxed ${BSP_FLAG}")
set(HARDWARE_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${PLATFORM_SPECIFIC_COMPILE_FLAGS} -Wformat-security -Werror=format-security -fbracket-depth=512 -DFIXED_ITERATIONS=${FIXED_ITERATIONS} -DCOMPLEX=${COMPLEX} -DMATRIX_DIMENSION=${MATRIX_DIMENSION} -Xsfp-relaxed -DFPGA_HARDWARE ${BSP_FLAG}")
set(HARDWARE_LINK_FLAGS "-fsycl -fintelfpga ${PLATFORM_SPECIFIC_LINK_FLAGS} -Xshardware -Xsclock=${CLOCK_TARGET} -Xsparallel=2 ${SEED} -Xstarget=${FPGA_DEVICE} ${USER_HARDWARE_FLAGS} -Xsfp-relaxed ${BSP_FLAG}")
set(EMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${EMULATOR_PLATFORM_FLAGS} ${PLATFORM_SPECIFIC_COMPILE_FLAGS} -Wformat-security -Werror=format-security -fbracket-depth=512 -DFIXED_ITERATIONS=${FIXED_ITERATIONS} -DCOMPLEX=${COMPLEX} -DMATRIX_DIMENSION=${MATRIX_DIMENSION} -DFPGA_EMULATOR ${BSP_FLAG}")
set(EMULATOR_LINK_FLAGS "-fsycl -fintelfpga ${EMULATOR_PLATFORM_FLAGS} ${PLATFORM_SPECIFIC_LINK_FLAGS} ${BSP_FLAG}")
set(SIMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${PLATFORM_SPECIFIC_COMPILE_FLAGS} -DFPGA_SIMULATOR -fbracket-depth=512 -DFIXED_ITERATIONS=${FIXED_ITERATIONS} -DCOMPLEX=${COMPLEX} -DMATRIX_DIMENSION=${MATRIX_DIMENSION} ${USER_SIMULATOR_FLAGS} ${BSP_FLAG}")
set(SIMULATOR_LINK_FLAGS "-fsycl -fintelfpga ${PLATFORM_SPECIFIC_LINK_FLAGS} -Xssimulation -Xsghdl -Xsclock=${CLOCK_TARGET} -Xstarget=${FPGA_DEVICE} ${USER_HARDWARE_FLAGS} ${BSP_FLAG}")
set(HARDWARE_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${PLATFORM_SPECIFIC_COMPILE_FLAGS} -Wformat-security -Werror=format-security -fbracket-depth=512 -DFIXED_ITERATIONS=${FIXED_ITERATIONS} -DCOMPLEX=${COMPLEX} -DMATRIX_DIMENSION=${MATRIX_DIMENSION} -DFPGA_HARDWARE ${BSP_FLAG}")
set(HARDWARE_LINK_FLAGS "-fsycl -fintelfpga ${PLATFORM_SPECIFIC_LINK_FLAGS} -Xshardware -Xsclock=${CLOCK_TARGET} -Xsparallel=2 ${SEED} -Xstarget=${FPGA_DEVICE} ${USER_HARDWARE_FLAGS} ${BSP_FLAG}")
# use cmake -D USER_HARDWARE_FLAGS=<flags> to set extra flags for FPGA backend compilation


Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -86,8 +86,6 @@ Performance results are based on testing as of April 26, 2022.

In this reference design, the Cholesky decomposition algorithm is used to factor a real _n_ × _n_ matrix. The algorithm computes the vector dot product of two rows of the matrix. In our FPGA implementation, the dot product is computed in a loop over the row's _n_ elements. The loop is fully unrolled to maximize throughput. As a result, *n* real multiplication operations are performed in parallel on the FPGA, followed by sequential additions to compute the dot product result.

We use the compiler option `-fp-relaxed`, which permits the compiler to reorder floating point additions (i.e. to assume that floating point addition is commutative). The compiler uses this freedom to reorder the additions so that the dot product arithmetic can be optimally implemented using the FPGA's specialized floating point DSP (Digital Signal Processing) hardware.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar remark for all the affected samples


With this optimization, our FPGA implementation requires _n_ DSPs to compute the real floating point dot product. The input matrix is also replicated two times in order to be able to read two full rows per cycle. The matrix size is constrained by the total FPGA DSP and RAM resources available.

The matrix inversion algorithm used in this reference design performs a Gaussian elimination to invert the triangular matrix _L_ obtained by the Cholesky decomposition. To do so, another _n_ DSPs are required to perform the associated dot-product. Finally, the matrix product of $LI^{\star}LI$ also requires _n_ DSPs.
Expand All @@ -111,10 +109,9 @@ The design uses the following key optimization techniques:
2. Using two copies of the compute matrix in order to be able to read a full row and a full column per cycle.
3. Converting the nested loop into a single merged loop and applying Triangular Loop optimizations. This allows us to generate a design that is very well pipelined.
4. Fully vectorizing the dot products using loop unrolling.
5. Using the compiler flag -Xsfp-relaxed to re-order floating point operations and allowing the inference of a specialized dot-product DSP. This further reduces the number of DSP blocks needed by the implementation, the overall latency, and pipeline depth.
6. Using an efficient memory banking scheme to generate high performance hardware (all local memories are single-read, single-write).
7. Using the `fpga_reg` attribute to insert more pipeline stages where needed to improve the frequency achieved by the design.
8. Using the input matrices properties (hermitian positive matrices) to reduce the number of operations. For example, the (_LI_*) * _LI_ computation only requires to compute half of the output matrix as the result is symmetric.
5. Using an efficient memory banking scheme to generate high performance hardware (all local memories are single-read, single-write).
6. Using the `fpga_reg` attribute to insert more pipeline stages where needed to improve the frequency achieved by the design.
7. Using the input matrices properties (hermitian positive matrices) to reduce the number of operations. For example, the (_LI_*) * _LI_ computation only requires to compute half of the output matrix as the result is symmetric.

### Source Code Breakdown

Expand All @@ -134,7 +131,6 @@ For descriptions of `streaming_cholesky.hpp`, `streaming_cholesky_inversion.hpp`
|:--- |:---
|`-Xshardware` | Target FPGA hardware (as opposed to FPGA emulator)
|`-Xsclock=<target fmax>MHz` | The FPGA backend attempts to achieve <target fmax> MHz
|`-Xsfp-relaxed` | Allows the FPGA backend to re-order floating point arithmetic operations (for example, permit assuming $(a + b + c) == (c + a + b)$)
|`-Xsparallel=2` | Use 2 cores when compiling the bitstream through Quartus
|`-Xsseed` | Specifies the Quartus compile seed to yield slightly higher, possibly, fmax

Expand Down Expand Up @@ -332,4 +328,4 @@ PASSED

Code samples are licensed under the MIT license. See [License.txt](/License.txt) for details.

Third party program Licenses can be found here: [third-party-programs.txt](/third-party-programs.txt).
Third party program Licenses can be found here: [third-party-programs.txt](/third-party-programs.txt).
Original file line number Diff line number Diff line change
Expand Up @@ -41,11 +41,13 @@ endif()

# This is a Windows-specific flag that enables error handling in host code
if(WIN32)
set(PLATFORM_SPECIFIC_COMPILE_FLAGS "/EHsc /Qactypes /Wall /fp:precise")
set(PLATFORM_SPECIFIC_LINK_FLAGS "/Qactypes /fp:precise")
set(PLATFORM_SPECIFIC_COMPILE_FLAGS "/EHsc /Qactypes /Wall ")
set(PLATFORM_SPECIFIC_LINK_FLAGS "/Qactypes ")
set(EMULATOR_PLATFORM_FLAGS "/fp:precise")
else()
set(PLATFORM_SPECIFIC_COMPILE_FLAGS "-qactypes -Wall -fno-finite-math-only -fp-model=precise ")
set(PLATFORM_SPECIFIC_LINK_FLAGS "-fp-model=precise ")
set(PLATFORM_SPECIFIC_COMPILE_FLAGS "-qactypes -Wall -fno-finite-math-only ")
set(PLATFORM_SPECIFIC_LINK_FLAGS "")
set(EMULATOR_PLATFORM_FLAGS "")
endif()

if(DEVICE_FLAG MATCHES "A10")
Expand Down Expand Up @@ -106,12 +108,12 @@ message(STATUS "SEED=${SEED}")
# 1. The "compile" stage compiles the device code to an intermediate representation (SPIR-V).
# 2. The "link" stage invokes the compiler's FPGA backend before linking.
# For this reason, FPGA backend flags must be passed as link flags in CMake.
set(EMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${PLATFORM_SPECIFIC_COMPILE_FLAGS} -Wformat-security -Werror=format-security -fbracket-depth=512 -DFIXED_ITERATIONS_DECOMPOSITION=${FIXED_ITERATIONS_DECOMPOSITION} -DFIXED_ITERATIONS_INVERSION=${FIXED_ITERATIONS_INVERSION} -DCOMPLEX=${COMPLEX} -DMATRIX_DIMENSION=${MATRIX_DIMENSION} -DFPGA_EMULATOR ${BSP_FLAG}")
set(EMULATOR_LINK_FLAGS "-fsycl -fintelfpga ${PLATFORM_SPECIFIC_LINK_FLAGS} ${BSP_FLAG}")
set(SIMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${PLATFORM_SPECIFIC_COMPILE_FLAGS} -DFPGA_SIMULATOR -DFIXED_ITERATIONS_DECOMPOSITION=${FIXED_ITERATIONS_DECOMPOSITION} -DFIXED_ITERATIONS_INVERSION=${FIXED_ITERATIONS_INVERSION} -DCOMPLEX=${COMPLEX} -DMATRIX_DIMENSION=${MATRIX_DIMENSION} -Xsfp-relaxed ${USER_HARDWARE_FLAGS} ${BSP_FLAG}")
set(SIMULATOR_LINK_FLAGS "-fsycl -fintelfpga ${PLATFORM_SPECIFIC_LINK_FLAGS} -Xssimulation -Xsghdl -Xsclock=${CLOCK_TARGET} -Xstarget=${FPGA_DEVICE} ${USER_SIMULATOR_FLAGS} -Xsfp-relaxed ${BSP_FLAG}")
set(HARDWARE_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${PLATFORM_SPECIFIC_COMPILE_FLAGS} -Wformat-security -Werror=format-security -fbracket-depth=512 -DFIXED_ITERATIONS_DECOMPOSITION=${FIXED_ITERATIONS_DECOMPOSITION} -DFIXED_ITERATIONS_INVERSION=${FIXED_ITERATIONS_INVERSION} -DCOMPLEX=${COMPLEX} -DMATRIX_DIMENSION=${MATRIX_DIMENSION} -Xsfp-relaxed -DFPGA_HARDWARE ${BSP_FLAG}")
set(HARDWARE_LINK_FLAGS "-fsycl -fintelfpga ${PLATFORM_SPECIFIC_LINK_FLAGS} -Xshardware -Xsclock=${CLOCK_TARGET} -Xsparallel=2 ${SEED} -Xstarget=${FPGA_DEVICE} ${USER_HARDWARE_FLAGS} -Xsfp-relaxed ${BSP_FLAG}")
set(EMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${EMULATOR_PLATFORM_FLAGS} ${PLATFORM_SPECIFIC_COMPILE_FLAGS} -Wformat-security -Werror=format-security -fbracket-depth=512 -DFIXED_ITERATIONS_DECOMPOSITION=${FIXED_ITERATIONS_DECOMPOSITION} -DFIXED_ITERATIONS_INVERSION=${FIXED_ITERATIONS_INVERSION} -DCOMPLEX=${COMPLEX} -DMATRIX_DIMENSION=${MATRIX_DIMENSION} -DFPGA_EMULATOR ${BSP_FLAG}")
set(EMULATOR_LINK_FLAGS "-fsycl -fintelfpga ${EMULATOR_PLATFORM_FLAGS} ${PLATFORM_SPECIFIC_LINK_FLAGS} ${BSP_FLAG}")
set(SIMULATOR_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${PLATFORM_SPECIFIC_COMPILE_FLAGS} -DFPGA_SIMULATOR -DFIXED_ITERATIONS_DECOMPOSITION=${FIXED_ITERATIONS_DECOMPOSITION} -DFIXED_ITERATIONS_INVERSION=${FIXED_ITERATIONS_INVERSION} -DCOMPLEX=${COMPLEX} -DMATRIX_DIMENSION=${MATRIX_DIMENSION} ${USER_HARDWARE_FLAGS} ${BSP_FLAG}")
set(SIMULATOR_LINK_FLAGS "-fsycl -fintelfpga ${PLATFORM_SPECIFIC_LINK_FLAGS} -Xssimulation -Xsghdl -Xsclock=${CLOCK_TARGET} -Xstarget=${FPGA_DEVICE} ${USER_SIMULATOR_FLAGS} ${BSP_FLAG}")
set(HARDWARE_COMPILE_FLAGS "-fsycl -fintelfpga -Wall ${PLATFORM_SPECIFIC_COMPILE_FLAGS} -Wformat-security -Werror=format-security -fbracket-depth=512 -DFIXED_ITERATIONS_DECOMPOSITION=${FIXED_ITERATIONS_DECOMPOSITION} -DFIXED_ITERATIONS_INVERSION=${FIXED_ITERATIONS_INVERSION} -DCOMPLEX=${COMPLEX} -DMATRIX_DIMENSION=${MATRIX_DIMENSION} -DFPGA_HARDWARE ${BSP_FLAG}")
set(HARDWARE_LINK_FLAGS "-fsycl -fintelfpga ${PLATFORM_SPECIFIC_LINK_FLAGS} -Xshardware -Xsclock=${CLOCK_TARGET} -Xsparallel=2 ${SEED} -Xstarget=${FPGA_DEVICE} ${USER_HARDWARE_FLAGS} ${BSP_FLAG}")
# use cmake -D USER_HARDWARE_FLAGS=<flags> to set extra flags for FPGA backend compilation

###############################################################################
Expand Down
8 changes: 2 additions & 6 deletions DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/qrd/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -74,8 +74,6 @@ Performance results are based on testing as of July 29, 2020.

The QR decomposition algorithm factors a complex _m_ × _n_ matrix, where _m_ ≥ _n_. The algorithm computes the vector dot product of two columns of the matrix. In our FPGA implementation, the dot product is computed in a loop over the column's _m_ elements. The loop is unrolled fully to maximize throughput. The *m* complex multiplication operations are performed in parallel on the FPGA followed by sequential additions to compute the dot product result.

The design uses the `-fp-relaxed` option, which permits the compiler to reorder floating point additions (to assume that floating point addition is commutative). The compiler reorders the additions so that the dot product arithmetic can be optimally implemented using the specialized floating point DSP (Digital Signal Processing) hardware in the FPGA.

With this optimization, our FPGA implementation requires 4*m* DSPs to compute the complex floating point dot product or 2*m* DSPs for the real case. The matrix size is constrained by the total FPGA DSP resources available.

By default, the design is parameterized to process 128 × 128 matrices when compiled targeting an Intel® Arria® 10 FPGA. It is parameterized to process 256 × 256 matrices when compiled targeting a Intel® Stratix® 10 or Intel® Agilex™ FPGA; however, the design can process matrices from 4 x 4 to 512 x 512.
Expand All @@ -92,17 +90,15 @@ The key optimization techniques used are as follows:
1. Refactoring the original Gram-Schmidt algorithm to merge two dot products into one, reducing the total number of dot products needed to three from two. This helps us reduce the DSPs required for the implementation.
2. Converting the nested loop into a single merged loop and applying Triangular Loop optimizations. This allows us to generate a design that is very well pipelined.
3. Fully vectorizing the dot products using loop unrolling.
4. Using the compiler flag -Xsfp-relaxed to re-order floating point operations and allowing the inference of a specialized dot-product DSP. This further reduces the number of DSP blocks needed by the implementation, the overall latency, and pipeline depth.
5. Using an efficient memory banking scheme to generate high performance hardware.
6. Using the `fpga_reg` attribute to insert more pipeline stages where needed to improve the frequency achieved by the design.
4. Using an efficient memory banking scheme to generate high performance hardware.
5. Using the `fpga_reg` attribute to insert more pipeline stages where needed to improve the frequency achieved by the design.

### Compiler Flags Used

| Flag | Description
|:--- |:---
| `-Xshardware` | Target FPGA hardware (as opposed to FPGA emulator)
| `-Xsclock=360MHz` | The FPGA backend attempts to achieve 360 MHz
| `-Xsfp-relaxed` | Allows the FPGA backend to re-order floating point arithmetic operations (e.g. permit assuming (a + b + c) == (c + a + b) )
| `-Xsparallel=2` | Use 2 cores when compiling the bitstream through Intel® Quartus®
| `-Xsseed` | Specifies the Intel® Quartus® compile seed, to yield slightly higher fmax

Expand Down
Loading