DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/cholesky/README.md (+2 -6)
@@ -83,16 +83,13 @@ The following list shows the key optimization techniques included in the referen
2. Using two copies of the compute matrix to read a full row and a full column per cycle.
3. Converting the nested loop into a single merged loop and applying Triangular Loop optimizations. This approach enables generating a design that is pipelined efficiently.
4. Fully vectorizing the dot products using loop unrolling.
-5. Using the `-Xsfp-relaxed` compiler option to reorder floating point operations and allowing the inference of a specialized dot-product DSP. This option further reduces the number of DSP blocks needed by the implementation, the overall latency, and pipeline depth.
-6. Using an efficient memory banking scheme to generate high performance hardware (all local memories are single-read, single-write).
-7. Using the `fpga_reg` attribute to insert more pipeline stages where needed to improve the frequency achieved by the design.
+5. Using an efficient memory banking scheme to generate high performance hardware (all local memories are single-read, single-write).
+6. Using the `fpga_reg` attribute to insert more pipeline stages where needed to improve the frequency achieved by the design.

### Matrix Dimensions and FPGA Resources

In this reference design, the Cholesky decomposition algorithm is used to factor a real _n_ × _n_ matrix. The algorithm computes the vector dot product of two rows of the matrix. In our FPGA implementation, the dot product is computed in a loop over the _n_ elements in the row. The loop is fully unrolled to maximize throughput, so *n* real multiplication operations are performed in parallel on the FPGA, followed by sequential additions to compute the dot product result.

-The sample uses the `-fp-relaxed` compiler option, which permits the compiler to reorder floating point additions (for example, to assume that floating point addition is commutative). The compiler reorders the additions so that the dot product arithmetic can be optimally implemented using the specialized floating point Digital Signal Processing (DSP) hardware on the FPGA.
-
With this optimization, our FPGA implementation requires _n_ DSPs to compute the real floating point dot product. The input matrix is also replicated two times in order to be able to read two full rows per cycle. The matrix size is constrained by the total FPGA DSP and RAM resources available.

### Compiler Flags Used
@@ -101,7 +98,6 @@ With this optimization, our FPGA implementation requires _n_ DSPs to compute the
|:--- |:---
|`-Xshardware` | Target FPGA hardware (as opposed to FPGA emulator)
|`-Xsclock=<target fmax>MHz` | The FPGA backend attempts to achieve <target fmax> MHz
-|`-Xsfp-relaxed` | Allows the FPGA backend to re-order floating point arithmetic operations (for example, permit assuming $(a + b + c) == (c + a + b)$ )
|`-Xsparallel=2` | Use 2 cores when compiling the bitstream through Quartus
|`-Xsseed` | Specifies the Quartus compile seed, to potentially yield slightly higher fmax
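
As a quick illustration of the fully unrolled dot product that the cholesky README describes above (and that the `-fp-relaxed`/`-Xsfp-relaxed` discussion refers to), here is a minimal sketch. It is not the sample's kernel source; the function name, the compile-time `kRowSize`, and the `#pragma unroll` placement are assumptions for illustration only.

```cpp
// Sketch only: a dot product over a row whose length is known at compile
// time. Fully unrolling the loop lets the kRowSize multiplications be
// issued in parallel (mapping to DSP blocks), and a relaxed floating point
// mode lets the compiler re-associate the additions that accumulate them.
#include <cstddef>

template <typename T, std::size_t kRowSize>
T UnrolledDotProduct(const T (&row_a)[kRowSize], const T (&row_b)[kRowSize]) {
  T sum{};
  #pragma unroll  // fully vectorize: one multiplier per element of the row
  for (std::size_t k = 0; k < kRowSize; k++) {
    sum += row_a[k] * row_b[k];
  }
  return sum;
}
```

In the design itself the row length corresponds to the matrix dimension _n_, which is why the README quotes roughly _n_ DSPs for the real dot product; the flags in the table above (`-Xshardware`, `-Xsclock`, `-Xsparallel=2`, `-Xsseed`) only steer the hardware compile.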
DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/cholesky_inversion/README.md (+4 -8)
@@ -86,8 +86,6 @@ Performance results are based on testing as of April 26, 2022.

In this reference design, the Cholesky decomposition algorithm is used to factor a real _n_ × _n_ matrix. The algorithm computes the vector dot product of two rows of the matrix. In our FPGA implementation, the dot product is computed in a loop over the row's _n_ elements. The loop is fully unrolled to maximize throughput. As a result, *n* real multiplication operations are performed in parallel on the FPGA, followed by sequential additions to compute the dot product result.

-We use the compiler option `-fp-relaxed`, which permits the compiler to reorder floating point additions (i.e. to assume that floating point addition is commutative). The compiler uses this freedom to reorder the additions so that the dot product arithmetic can be optimally implemented using the FPGA's specialized floating point DSP (Digital Signal Processing) hardware.
-
With this optimization, our FPGA implementation requires _n_ DSPs to compute the real floating point dot product. The input matrix is also replicated two times in order to be able to read two full rows per cycle. The matrix size is constrained by the total FPGA DSP and RAM resources available.

The matrix inversion algorithm used in this reference design performs a Gaussian elimination to invert the triangular matrix _L_ obtained by the Cholesky decomposition. To do so, another _n_ DSPs are required to perform the associated dot-product. Finally, the matrix product of $LI^{\star}LI$ also requires _n_ DSPs.
@@ -111,10 +109,9 @@ The design uses the following key optimization techniques:
2. Using two copies of the compute matrix in order to be able to read a full row and a full column per cycle.
3. Converting the nested loop into a single merged loop and applying Triangular Loop optimizations. This allows us to generate a design that is very well pipelined.
4. Fully vectorizing the dot products using loop unrolling.
-5. Using the compiler flag -Xsfp-relaxed to re-order floating point operations and allowing the inference of a specialized dot-product DSP. This further reduces the number of DSP blocks needed by the implementation, the overall latency, and pipeline depth.
-6. Using an efficient memory banking scheme to generate high performance hardware (all local memories are single-read, single-write).
-7. Using the `fpga_reg` attribute to insert more pipeline stages where needed to improve the frequency achieved by the design.
-8. Using the input matrices properties (hermitian positive matrices) to reduce the number of operations. For example, the (_LI_*) * _LI_ computation only requires to compute half of the output matrix as the result is symmetric.
+5. Using an efficient memory banking scheme to generate high performance hardware (all local memories are single-read, single-write).
+6. Using the `fpga_reg` attribute to insert more pipeline stages where needed to improve the frequency achieved by the design.
+7. Using the input matrices' properties (Hermitian positive matrices) to reduce the number of operations. For example, the (_LI_*) * _LI_ computation only requires computing half of the output matrix, as the result is symmetric.

### Source Code Breakdown

@@ -134,7 +131,6 @@ For descriptions of `streaming_cholesky.hpp`, `streaming_cholesky_inversion.hpp`
|:--- |:---
|`-Xshardware` | Target FPGA hardware (as opposed to FPGA emulator)
|`-Xsclock=<target fmax>MHz` | The FPGA backend attempts to achieve <target fmax> MHz
-|`-Xsfp-relaxed` | Allows the FPGA backend to re-order floating point arithmetic operations (for example, permit assuming $(a + b + c) == (c + a + b)$)
|`-Xsparallel=2` | Use 2 cores when compiling the bitstream through Quartus
|`-Xsseed` | Specifies the Quartus compile seed, to potentially yield slightly higher fmax

@@ -332,4 +328,4 @@ PASSED

Code samples are licensed under the MIT license. See [License.txt](/License.txt) for details.

-Third party program Licenses can be found here: [third-party-programs.txt](/third-party-programs.txt).
+Third party program Licenses can be found here: [third-party-programs.txt](/third-party-programs.txt).
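
To make the symmetry optimization listed above concrete, here is a minimal sketch for the real case (plain C++, not the sample's SYCL kernel; the names `Matrix`, `InverseFromCholeskyFactorInverse`, and `kN` are invented for this example) of computing only half of the symmetric output matrix and mirroring it:

```cpp
// Sketch only: for a real matrix, A = L * L^T and A^-1 = LI^T * LI with
// LI = L^-1. The product is symmetric, so only the lower triangle is
// computed and then mirrored into the upper triangle.
#include <array>
#include <cstddef>

template <std::size_t kN>
using Matrix = std::array<std::array<float, kN>, kN>;

template <std::size_t kN>
Matrix<kN> InverseFromCholeskyFactorInverse(const Matrix<kN>& li) {
  Matrix<kN> inv{};
  for (std::size_t row = 0; row < kN; row++) {
    for (std::size_t col = 0; col <= row; col++) {  // lower triangle only
      float dot = 0.0f;
      // LI is lower triangular, so terms with k < row vanish
      // (row >= col inside this loop).
      for (std::size_t k = row; k < kN; k++) {
        dot += li[k][row] * li[k][col];  // element (row, col) of LI^T * LI
      }
      inv[row][col] = dot;
      inv[col][row] = dot;  // mirror into the upper half
    }
  }
  return inv;
}
```

The Hermitian (complex) case follows the same pattern with the conjugate transpose (_LI_*) in place of LI^T.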
DirectProgramming/C++SYCL_FPGA/ReferenceDesigns/qrd/README.md (+2 -6)
@@ -74,8 +74,6 @@ Performance results are based on testing as of July 29, 2020.

The QR decomposition algorithm factors a complex _m_ × _n_ matrix, where _m_ ≥ _n_. The algorithm computes the vector dot product of two columns of the matrix. In our FPGA implementation, the dot product is computed in a loop over the column's _m_ elements. The loop is unrolled fully to maximize throughput. The *m* complex multiplication operations are performed in parallel on the FPGA, followed by sequential additions to compute the dot product result.

-The design uses the `-fp-relaxed` option, which permits the compiler to reorder floating point additions (to assume that floating point addition is commutative). The compiler reorders the additions so that the dot product arithmetic can be optimally implemented using the specialized floating point DSP (Digital Signal Processing) hardware in the FPGA.
-
With this optimization, our FPGA implementation requires 4*m* DSPs to compute the complex floating point dot product or 2*m* DSPs for the real case. The matrix size is constrained by the total FPGA DSP resources available.

By default, the design is parameterized to process 128 × 128 matrices when compiled targeting an Intel® Arria® 10 FPGA. It is parameterized to process 256 × 256 matrices when compiled targeting an Intel® Stratix® 10 or Intel® Agilex™ FPGA; however, the design can process matrices from 4 x 4 to 512 x 512.
@@ -92,17 +90,15 @@ The key optimization techniques used are as follows:
1. Refactoring the original Gram-Schmidt algorithm to merge two dot products into one, reducing the total number of dot products needed from three to two. This helps us reduce the DSPs required for the implementation.
2. Converting the nested loop into a single merged loop and applying Triangular Loop optimizations. This allows us to generate a design that is very well pipelined.
3. Fully vectorizing the dot products using loop unrolling.
-4. Using the compiler flag -Xsfp-relaxed to re-order floating point operations and allowing the inference of a specialized dot-product DSP. This further reduces the number of DSP blocks needed by the implementation, the overall latency, and pipeline depth.
-5. Using an efficient memory banking scheme to generate high performance hardware.
-6. Using the `fpga_reg` attribute to insert more pipeline stages where needed to improve the frequency achieved by the design.
+4. Using an efficient memory banking scheme to generate high performance hardware.
+5. Using the `fpga_reg` attribute to insert more pipeline stages where needed to improve the frequency achieved by the design.
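
As a companion to the dot-product description in the qrd README above, here is a minimal sketch of a conjugated complex dot product (not the sample's kernel source; the `Complex` struct, the function name, and the use of `conj(a) * b`, as a Gram-Schmidt projection would need, are assumptions for illustration):

```cpp
// Sketch only: each complex multiply-accumulate expands into four real
// multiplications plus additions, which lines up with the 4*m DSP figure
// quoted above for the fully unrolled complex dot product.
#include <cstddef>

struct Complex {
  float re;
  float im;
};

template <std::size_t kRows>
Complex ConjDotProduct(const Complex (&col_a)[kRows],
                       const Complex (&col_b)[kRows]) {
  Complex acc{0.0f, 0.0f};
  #pragma unroll  // fully unrolled in the FPGA design: kRows MACs in parallel
  for (std::size_t k = 0; k < kRows; k++) {
    // conj(col_a[k]) * col_b[k]: 4 real multiplies per element
    acc.re += col_a[k].re * col_b[k].re + col_a[k].im * col_b[k].im;
    acc.im += col_a[k].re * col_b[k].im - col_a[k].im * col_b[k].re;
  }
  return acc;
}
```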