[RISCV] Add a generic OOO CPU #120712
//===----------------------------------------------------------------------===//

def GenericOOOModel : SchedMachineModel {
  int IssueWidth = 6;
Why did we go to 6? I thought 4 was reasonable.
My thought is that we should use the Arm Neoverse N series (and the trend of high-performance server cores) as a reference. I don't have a strong opinion on this.
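One way to sanity-check the issue width against the rest of the model is to compare it with the number of execution units declared in the patch (3 IXU + 1 DIV, 2 LSU, 2 FPU). The sketch below is only a back-of-the-envelope bound, assuming every instruction occupies exactly one unit for one cycle; it is not something LLVM computes, and the names `UNITS` and `peak_ipc` are hypothetical.

```python
# Unit counts taken from the ProcResource definitions in this patch.
UNITS = {
    "GenericOOOIXU": 3,
    "GenericOOODIV": 1,
    "GenericOOOLSU": 2,
    "GenericOOOFPU": 2,
}


def peak_ipc(issue_width: int, units: dict[str, int]) -> int:
    """Rough upper bound on instructions per cycle.

    Assumes each instruction occupies one unit for one cycle, so the bound
    is the smaller of the issue width and the total number of units.
    """
    return min(issue_width, sum(units.values()))


for width in (4, 6, 8, 10):
    print(f"IssueWidth={width}: peak IPC bound = {peak_ipc(width, UNITS)}")
```

Under these assumptions, both 4 and 6 leave the issue width as the limiting factor; the execution units would only become the bottleneck at widths above 8.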
dc29064 to 8a245cd
@llvm/pr-subscribers-clang @llvm/pr-subscribers-backend-risc-v
Author: Pengcheng Wang (wangpc-pp)
Changes
Patch is 104.22 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/120712.diff
6 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCV.td b/llvm/lib/Target/RISCV/RISCV.td
index 963124140cd035..57843b9b93abd8 100644
--- a/llvm/lib/Target/RISCV/RISCV.td
+++ b/llvm/lib/Target/RISCV/RISCV.td
@@ -45,7 +45,7 @@ include "RISCVMacroFusion.td"
//===----------------------------------------------------------------------===//
// RISC-V Scheduling Models
//===----------------------------------------------------------------------===//
-
+include "RISCVSchedGenericOOO.td"
include "RISCVSchedMIPSP8700.td"
include "RISCVSchedRocket.td"
include "RISCVSchedSiFive7.td"
diff --git a/llvm/lib/Target/RISCV/RISCVProcessors.td b/llvm/lib/Target/RISCV/RISCVProcessors.td
index 61c7c21367036f..c846c27989d8c2 100644
--- a/llvm/lib/Target/RISCV/RISCVProcessors.td
+++ b/llvm/lib/Target/RISCV/RISCVProcessors.td
@@ -103,6 +103,8 @@ def GENERIC_RV64 : RISCVProcessorModel<"generic-rv64",
// Support generic for compatibility with other targets. The triple will be used
// to change to the appropriate rv32/rv64 version.
def GENERIC : RISCVTuneProcessorModel<"generic", NoSchedModel>, GenericTuneInfo;
+def GENERIC_OOO : RISCVTuneProcessorModel<"generic-ooo", GenericOOOModel>,
+ GenericTuneInfo;
def MIPS_P8700 : RISCVProcessorModel<"mips-p8700",
MIPSP8700Model,
diff --git a/llvm/lib/Target/RISCV/RISCVSchedGenericOOO.td b/llvm/lib/Target/RISCV/RISCVSchedGenericOOO.td
new file mode 100644
index 00000000000000..f7bf824ccebe0c
--- /dev/null
+++ b/llvm/lib/Target/RISCV/RISCVSchedGenericOOO.td
@@ -0,0 +1,494 @@
+//===-- RISCVSchedGenericOOO.td - Generic O3 Processor -----*- tablegen -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+//===----------------------------------------------------------------------===//
+// We assume that:
+// * 6-issue out-of-order CPU with 192 ROB entries.
+// * Units:
+// * IXU (Integer ALU Unit): 4 units, only one can execute division.
+// * FXU (Floating-point Unit): 2 units.
+// * LSU (Load/Store Unit): 2 units.
+// * VXU (Vector Unit): 1 unit.
+// * Latency:
+// * Integer instructions: 1 cycle.
+// * Multiplication instructions: 4 cycles.
+// * Division/remainder instructions: 13-21 cycles.
+// * Floating-point instructions: 4-6 cycles.
+// * Vector instructions: 2-6 cycles.
+// * Load/Store:
+// * IXU: 4 cycles.
+// * FXU: 6 cycles.
+// * VXU: 6 cycles.
+// * Integer/floating-point/vector div/rem/sqrt/... are non-pipelined.
+//===----------------------------------------------------------------------===//
+
+def GenericOOOModel : SchedMachineModel {
+ int IssueWidth = 6;
+ int MicroOpBufferSize = 192;
+ int LoadLatency = 4;
+ int MispredictPenalty = 8;
+ let CompleteModel = 0;
+}
+
+let SchedModel = GenericOOOModel in {
+//===----------------------------------------------------------------------===//
+// Resource groups
+//===----------------------------------------------------------------------===//
+def GenericOOODIV : ProcResource<1>;
+def GenericOOOIXU : ProcResource<3>;
+def GenericOOOALU : ProcResGroup<[GenericOOODIV, GenericOOOIXU]>;
+def GenericOOOLSU : ProcResource<2>;
+def GenericOOOFPU : ProcResource<2>;
+// TODO: Add vector scheduling.
+// def GenericOOOVXU : ProcResource<1>;
+
+//===----------------------------------------------------------------------===//
+// Branches
+//===----------------------------------------------------------------------===//
+def : WriteRes<WriteJmp, [GenericOOOALU]>;
+def : WriteRes<WriteJalr, [GenericOOOALU]>;
+def : WriteRes<WriteJal, [GenericOOOALU]>;
+
+//===----------------------------------------------------------------------===//
+// Integer arithmetic and logic
+//===----------------------------------------------------------------------===//
+def : WriteRes<WriteIALU, [GenericOOOALU]>;
+def : WriteRes<WriteIALU32, [GenericOOOALU]>;
+def : WriteRes<WriteShiftImm, [GenericOOOALU]>;
+def : WriteRes<WriteShiftImm32, [GenericOOOALU]>;
+def : WriteRes<WriteShiftReg, [GenericOOOALU]>;
+def : WriteRes<WriteShiftReg32, [GenericOOOALU]>;
+
+//===----------------------------------------------------------------------===//
+// Integer multiplication
+//===----------------------------------------------------------------------===//
+let Latency = 4 in {
+ def : WriteRes<WriteIMul, [GenericOOOALU]>;
+ def : WriteRes<WriteIMul32, [GenericOOOALU]>;
+}
+
+//===----------------------------------------------------------------------===//
+// Integer division
+//===----------------------------------------------------------------------===//
+def : WriteRes<WriteIDiv32, [GenericOOODIV]> {
+ let Latency = 13;
+ let ReleaseAtCycles = [13];
+}
+def : WriteRes<WriteIDiv, [GenericOOODIV]> {
+ let Latency = 21;
+ let ReleaseAtCycles = [21];
+}
+def : WriteRes<WriteIRem32, [GenericOOODIV]> {
+ let Latency = 13;
+ let ReleaseAtCycles = [13];
+}
+def : WriteRes<WriteIRem, [GenericOOODIV]> {
+ let Latency = 21;
+ let ReleaseAtCycles = [21];
+}
+
+//===----------------------------------------------------------------------===//
+// Integer memory
+//===----------------------------------------------------------------------===//
+// Load
+let Latency = 4 in {
+ def : WriteRes<WriteLDB, [GenericOOOLSU]>;
+ def : WriteRes<WriteLDH, [GenericOOOLSU]>;
+ def : WriteRes<WriteLDW, [GenericOOOLSU]>;
+ def : WriteRes<WriteLDD, [GenericOOOLSU]>;
+}
+
+// Store
+def : WriteRes<WriteSTB, [GenericOOOLSU]>;
+def : WriteRes<WriteSTH, [GenericOOOLSU]>;
+def : WriteRes<WriteSTW, [GenericOOOLSU]>;
+def : WriteRes<WriteSTD, [GenericOOOLSU]>;
+
+//===----------------------------------------------------------------------===//
+// Atomic
+//===----------------------------------------------------------------------===//
+let Latency = 4 in {
+ def : WriteRes<WriteAtomicLDW, [GenericOOOLSU]>;
+ def : WriteRes<WriteAtomicLDD, [GenericOOOLSU]>;
+}
+
+let Latency = 5 in {
+ def : WriteRes<WriteAtomicW, [GenericOOOLSU]>;
+ def : WriteRes<WriteAtomicD, [GenericOOOLSU]>;
+}
+
+def : WriteRes<WriteAtomicSTW, [GenericOOOLSU]>;
+def : WriteRes<WriteAtomicSTD, [GenericOOOLSU]>;
+
+//===----------------------------------------------------------------------===//
+// Floating-point
+//===----------------------------------------------------------------------===//
+// Floating-point load
+let Latency = 6 in {
+ def : WriteRes<WriteFLD32, [GenericOOOLSU]>;
+ def : WriteRes<WriteFLD64, [GenericOOOLSU]>;
+}
+
+// Floating-point store
+def : WriteRes<WriteFST32, [GenericOOOLSU]>;
+def : WriteRes<WriteFST64, [GenericOOOLSU]>;
+
+// Arithmetic and logic
+let Latency = 4 in {
+ def : WriteRes<WriteFAdd32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFAdd64, [GenericOOOFPU]>;
+}
+
+let Latency = 5 in {
+ def : WriteRes<WriteFMul32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFMul64, [GenericOOOFPU]>;
+}
+
+let Latency = 6 in {
+ def : WriteRes<WriteFMA32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFMA64, [GenericOOOFPU]>;
+}
+
+def : WriteRes<WriteFSGNJ32, [GenericOOOFPU]>;
+def : WriteRes<WriteFSGNJ64, [GenericOOOFPU]>;
+def : WriteRes<WriteFMinMax32, [GenericOOOFPU]>;
+def : WriteRes<WriteFMinMax64, [GenericOOOFPU]>;
+
+// Compare
+let Latency = 2 in {
+ def : WriteRes<WriteFCmp32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCmp64, [GenericOOOFPU]>;
+}
+
+// Division
+let Latency = 13, ReleaseAtCycles = [13] in {
+ def : WriteRes<WriteFDiv32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFSqrt32, [GenericOOOFPU]>;
+}
+
+let Latency = 17, ReleaseAtCycles = [17] in {
+ def : WriteRes<WriteFDiv64, [GenericOOOFPU]>;
+ def : WriteRes<WriteFSqrt64, [GenericOOOFPU]>;
+}
+
+// Conversions
+let Latency = 4 in {
+ def : WriteRes<WriteFCvtI32ToF32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCvtI32ToF64, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCvtI64ToF32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCvtI64ToF64, [GenericOOOFPU]>;
+}
+
+let Latency = 4 in {
+ def : WriteRes<WriteFCvtF32ToI32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCvtF32ToI64, [GenericOOOFPU]>;
+}
+
+let Latency = 4 in {
+ def : WriteRes<WriteFCvtF64ToI32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCvtF64ToI64, [GenericOOOFPU]>;
+}
+
+let Latency = 4 in {
+ def : WriteRes<WriteFCvtF64ToF32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCvtF32ToF64, [GenericOOOFPU]>;
+}
+
+let Latency = 6 in {
+ def : WriteRes<WriteFMovI32ToF32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFMovI64ToF64, [GenericOOOFPU]>;
+ def : WriteRes<WriteFMovF32ToI32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFMovF64ToI64, [GenericOOOFPU]>;
+}
+
+// Classify
+def : WriteRes<WriteFClass32, [GenericOOOFPU]>;
+def : WriteRes<WriteFClass64, [GenericOOOFPU]>;
+
+//===----------------------------------------------------------------------===//
+// Zicsr extension
+//===----------------------------------------------------------------------===//
+def : WriteRes<WriteCSR, [GenericOOOALU]>;
+
+//===----------------------------------------------------------------------===//
+// Zabha extension
+//===----------------------------------------------------------------------===//
+let Latency = 5 in {
+ def : WriteRes<WriteAtomicB, [GenericOOOLSU]>;
+ def : WriteRes<WriteAtomicH, [GenericOOOLSU]>;
+}
+
+//===----------------------------------------------------------------------===//
+// Zba extension
+//===----------------------------------------------------------------------===//
+def : WriteRes<WriteSHXADD, [GenericOOOALU]>;
+def : WriteRes<WriteSHXADD32, [GenericOOOALU]>;
+
+//===----------------------------------------------------------------------===//
+// Zbb extension
+//===----------------------------------------------------------------------===//
+def : WriteRes<WriteCLZ, [GenericOOOALU]>;
+def : WriteRes<WriteCTZ, [GenericOOOALU]>;
+def : WriteRes<WriteCPOP, [GenericOOOALU]>;
+def : WriteRes<WriteCLZ32, [GenericOOOALU]>;
+def : WriteRes<WriteCTZ32, [GenericOOOALU]>;
+def : WriteRes<WriteCPOP32, [GenericOOOALU]>;
+def : WriteRes<WriteRotateReg, [GenericOOOALU]>;
+def : WriteRes<WriteRotateImm, [GenericOOOALU]>;
+def : WriteRes<WriteRotateReg32, [GenericOOOALU]>;
+def : WriteRes<WriteRotateImm32, [GenericOOOALU]>;
+def : WriteRes<WriteREV8, [GenericOOOALU]>;
+def : WriteRes<WriteORCB, [GenericOOOALU]>;
+def : WriteRes<WriteIMinMax, [GenericOOOALU]>;
+
+//===----------------------------------------------------------------------===//
+// Zbc extension
+//===----------------------------------------------------------------------===//
+def : WriteRes<WriteCLMUL, [GenericOOOALU]>;
+
+//===----------------------------------------------------------------------===//
+// Zbs extension
+//===----------------------------------------------------------------------===//
+def : WriteRes<WriteSingleBit, [GenericOOOALU]>;
+def : WriteRes<WriteSingleBitImm, [GenericOOOALU]>;
+def : WriteRes<WriteBEXT, [GenericOOOALU]>;
+def : WriteRes<WriteBEXTI, [GenericOOOALU]>;
+
+//===----------------------------------------------------------------------===//
+// Zbkb extension
+//===----------------------------------------------------------------------===//
+def : WriteRes<WriteBREV8, [GenericOOOALU]>;
+def : WriteRes<WritePACK, [GenericOOOALU]>;
+def : WriteRes<WritePACK32, [GenericOOOALU]>;
+def : WriteRes<WriteZIP, [GenericOOOALU]>;
+
+//===----------------------------------------------------------------------===//
+// Zbkx extension
+//===----------------------------------------------------------------------===//
+def : WriteRes<WriteXPERM, [GenericOOOALU]>;
+
+//===----------------------------------------------------------------------===//
+// Zfa extension
+//===----------------------------------------------------------------------===//
+let Latency = 3 in {
+ def : WriteRes<WriteFRoundF16, [GenericOOOFPU]>;
+ def : WriteRes<WriteFRoundF32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFRoundF64, [GenericOOOFPU]>;
+}
+
+let Latency = 5 in {
+ def : WriteRes<WriteFLI16, [GenericOOOFPU]>;
+ def : WriteRes<WriteFLI32, [GenericOOOFPU]>;
+ def : WriteRes<WriteFLI64, [GenericOOOFPU]>;
+}
+
+//===----------------------------------------------------------------------===//
+// Zfh extension
+//===----------------------------------------------------------------------===//
+// Zfhmin
+// Load/Store
+let Latency = 6 in
+def : WriteRes<WriteFLD16, [GenericOOOLSU]>;
+def : WriteRes<WriteFST16, [GenericOOOLSU]>;
+
+// Conversions
+let Latency = 3 in {
+ def : WriteRes<WriteFCvtF16ToF64, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCvtF64ToF16, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCvtF32ToF16, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCvtF16ToF32, [GenericOOOFPU]>;
+}
+
+let Latency = 4 in {
+ def : WriteRes<WriteFMovI16ToF16, [GenericOOOFPU]>;
+ def : WriteRes<WriteFMovF16ToI16, [GenericOOOFPU]>;
+}
+
+// Other than Zfhmin
+let Latency = 4 in {
+ def : WriteRes<WriteFCvtI64ToF16, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCvtI32ToF16, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCvtF16ToI64, [GenericOOOFPU]>;
+ def : WriteRes<WriteFCvtF16ToI32, [GenericOOOFPU]>;
+}
+
+// Arithmetic and logic
+let Latency = 4 in
+def : WriteRes<WriteFAdd16, [GenericOOOFPU]>;
+
+let Latency = 5 in
+def : WriteRes<WriteFMul16, [GenericOOOFPU]>;
+
+let Latency = 6 in
+def : WriteRes<WriteFMA16, [GenericOOOFPU]>;
+
+def : WriteRes<WriteFSGNJ16, [GenericOOOFPU]>;
+def : WriteRes<WriteFMinMax16, [GenericOOOFPU]>;
+
+// Compare
+let Latency = 2 in
+def : WriteRes<WriteFCmp16, [GenericOOOFPU]>;
+
+// Division
+let Latency = 9, ReleaseAtCycles = [9] in {
+ def : WriteRes<WriteFDiv16, [GenericOOOFPU]>;
+ def : WriteRes<WriteFSqrt16, [GenericOOOFPU]>;
+}
+
+// Classify
+def : WriteRes<WriteFClass16, [GenericOOOFPU]>;
+
+//===----------------------------------------------------------------------===//
+// Misc
+//===----------------------------------------------------------------------===//
+let Latency = 0 in
+def : WriteRes<WriteNop, [GenericOOOALU]>;
+
+//===----------------------------------------------------------------------===//
+// Bypass and advance
+//===----------------------------------------------------------------------===//
+def : ReadAdvance<ReadJmp, 0>;
+def : ReadAdvance<ReadJalr, 0>;
+def : ReadAdvance<ReadCSR, 0>;
+def : ReadAdvance<ReadStoreData, 0>;
+def : ReadAdvance<ReadMemBase, 0>;
+def : ReadAdvance<ReadIALU, 0>;
+def : ReadAdvance<ReadIALU32, 0>;
+def : ReadAdvance<ReadShiftImm, 0>;
+def : ReadAdvance<ReadShiftImm32, 0>;
+def : ReadAdvance<ReadShiftReg, 0>;
+def : ReadAdvance<ReadShiftReg32, 0>;
+def : ReadAdvance<ReadIDiv, 0>;
+def : ReadAdvance<ReadIDiv32, 0>;
+def : ReadAdvance<ReadIRem, 0>;
+def : ReadAdvance<ReadIRem32, 0>;
+def : ReadAdvance<ReadIMul, 0>;
+def : ReadAdvance<ReadIMul32, 0>;
+def : ReadAdvance<ReadAtomicWA, 0>;
+def : ReadAdvance<ReadAtomicWD, 0>;
+def : ReadAdvance<ReadAtomicDA, 0>;
+def : ReadAdvance<ReadAtomicDD, 0>;
+def : ReadAdvance<ReadAtomicLDW, 0>;
+def : ReadAdvance<ReadAtomicLDD, 0>;
+def : ReadAdvance<ReadAtomicSTW, 0>;
+def : ReadAdvance<ReadAtomicSTD, 0>;
+def : ReadAdvance<ReadFStoreData, 0>;
+def : ReadAdvance<ReadFMemBase, 0>;
+def : ReadAdvance<ReadFAdd32, 0>;
+def : ReadAdvance<ReadFAdd64, 0>;
+def : ReadAdvance<ReadFMul32, 0>;
+def : ReadAdvance<ReadFMA32, 0>;
+def : ReadAdvance<ReadFMA32Addend, 0>;
+def : ReadAdvance<ReadFMul64, 0>;
+def : ReadAdvance<ReadFMA64, 0>;
+def : ReadAdvance<ReadFMA64Addend, 0>;
+def : ReadAdvance<ReadFDiv32, 0>;
+def : ReadAdvance<ReadFDiv64, 0>;
+def : ReadAdvance<ReadFSqrt32, 0>;
+def : ReadAdvance<ReadFSqrt64, 0>;
+def : ReadAdvance<ReadFCmp32, 0>;
+def : ReadAdvance<ReadFCmp64, 0>;
+def : ReadAdvance<ReadFSGNJ32, 0>;
+def : ReadAdvance<ReadFSGNJ64, 0>;
+def : ReadAdvance<ReadFMinMax32, 0>;
+def : ReadAdvance<ReadFMinMax64, 0>;
+def : ReadAdvance<ReadFCvtF32ToI32, 0>;
+def : ReadAdvance<ReadFCvtF32ToI64, 0>;
+def : ReadAdvance<ReadFCvtF64ToI32, 0>;
+def : ReadAdvance<ReadFCvtF64ToI64, 0>;
+def : ReadAdvance<ReadFCvtI32ToF32, 0>;
+def : ReadAdvance<ReadFCvtI32ToF64, 0>;
+def : ReadAdvance<ReadFCvtI64ToF32, 0>;
+def : ReadAdvance<ReadFCvtI64ToF64, 0>;
+def : ReadAdvance<ReadFCvtF32ToF64, 0>;
+def : ReadAdvance<ReadFCvtF64ToF32, 0>;
+def : ReadAdvance<ReadFMovF32ToI32, 0>;
+def : ReadAdvance<ReadFMovI32ToF32, 0>;
+def : ReadAdvance<ReadFMovF64ToI64, 0>;
+def : ReadAdvance<ReadFMovI64ToF64, 0>;
+def : ReadAdvance<ReadFClass32, 0>;
+def : ReadAdvance<ReadFClass64, 0>;
+
+// Zabha
+def : ReadAdvance<ReadAtomicBA, 0>;
+def : ReadAdvance<ReadAtomicBD, 0>;
+def : ReadAdvance<ReadAtomicHA, 0>;
+def : ReadAdvance<ReadAtomicHD, 0>;
+
+// Zba extension
+def : ReadAdvance<ReadSHXADD, 0>;
+def : ReadAdvance<ReadSHXADD32, 0>;
+
+// Zbb extension
+def : ReadAdvance<ReadRotateImm, 0>;
+def : ReadAdvance<ReadRotateImm32, 0>;
+def : ReadAdvance<ReadRotateReg, 0>;
+def : ReadAdvance<ReadRotateReg32, 0>;
+def : ReadAdvance<ReadCLZ, 0>;
+def : ReadAdvance<ReadCLZ32, 0>;
+def : ReadAdvance<ReadCTZ, 0>;
+def : ReadAdvance<ReadCTZ32, 0>;
+def : ReadAdvance<ReadCPOP, 0>;
+def : ReadAdvance<ReadCPOP32, 0>;
+def : ReadAdvance<ReadREV8, 0>;
+def : ReadAdvance<ReadORCB, 0>;
+def : ReadAdvance<ReadIMinMax, 0>;
+
+// Zbc extension
+def : ReadAdvance<ReadCLMUL, 0>;
+
+// Zbs extension
+def : ReadAdvance<ReadSingleBit, 0>;
+def : ReadAdvance<ReadSingleBitImm, 0>;
+
+// Zbkb
+def : ReadAdvance<ReadBREV8, 0>;
+def : ReadAdvance<ReadPACK, 0>;
+def : ReadAdvance<ReadPACK32, 0>;
+def : ReadAdvance<ReadZIP, 0>;
+
+// Zbkx
+def : ReadAdvance<ReadXPERM, 0>;
+
+// Zfa extension
+def : ReadAdvance<ReadFRoundF32, 0>;
+def : ReadAdvance<ReadFRoundF64, 0>;
+def : ReadAdvance<ReadFRoundF16, 0>;
+
+// Zfh extension
+def : ReadAdvance<ReadFCvtF16ToF64, 0>;
+def : ReadAdvance<ReadFCvtF64ToF16, 0>;
+def : ReadAdvance<ReadFCvtF32ToF16, 0>;
+def : ReadAdvance<ReadFCvtF16ToF32, 0>;
+def : ReadAdvance<ReadFMovI16ToF16, 0>;
+def : ReadAdvance<ReadFMovF16ToI16, 0>;
+
+def : ReadAdvance<ReadFAdd16, 0>;
+def : ReadAdvance<ReadFClass16, 0>;
+def : ReadAdvance<ReadFCvtI64ToF16, 0>;
+def : ReadAdvance<ReadFCvtI32ToF16, 0>;
+def : ReadAdvance<ReadFCvtF16ToI64, 0>;
+def : ReadAdvance<ReadFCvtF16ToI32, 0>;
+def : ReadAdvance<ReadFDiv16, 0>;
+def : ReadAdvance<ReadFCmp16, 0>;
+def : ReadAdvance<ReadFMA16, 0>;
+def : ReadAdvance<ReadFMA16Addend, 0>;
+def : ReadAdvance<ReadFMinMax16, 0>;
+def : ReadAdvance<ReadFMul16, 0>;
+def : ReadAdvance<ReadFSGNJ16, 0>;
+def : ReadAdvance<ReadFSqrt16, 0>;
+
+//===----------------------------------------------------------------------===//
+// Unsupported extensions
+//===----------------------------------------------------------------------===//
+defm : UnsupportedSchedV;
+defm : UnsupportedSchedZvk;
+defm : UnsupportedSchedZvkned;
+defm : UnsupportedSchedSFB;
+defm : UnsupportedSchedXsfvcp;
+}
diff --git a/llvm/test/tools/llvm-mca/RISCV/GenericOOO/atomic.s b/llvm/test/tools/llvm-mca/RISCV/GenericOOO/atomic.s
new file mode 100644
index 00000000000000..e8c19eaa4c618d
--- /dev/null
+++ b/llvm/test/tools/llvm-mca/RISCV/GenericOOO/atomic.s
@@ -0,0 +1,601 @@
+# NOTE: Assertions have been autogenerated by utils/update_mca_test_checks.py
+# RUN: llvm-mca -mtriple=riscv64 -mattr=+rva23u64,+zabha -mcpu=generic-ooo --all-stats -iterations=1 < %s | FileCheck %s
+
+# Zalrsc
+lr.w t0, (t1)
+lr.w.aq t1, (t2)
+lr.w.rl t2, (t3)
+lr.w.aqrl t3, (t4)
+sc.w t6, t5, (t4)
+sc.w.aq t5, t4, (t3)
+sc.w.rl t4, t3, (t2)
+sc.w.aqrl t3, t2, (t1)
+
+lr.d t0, (t1)
+lr.d.aq t1, (t2)
+lr.d.rl t2, (t3)
+lr.d.aqrl t3, (t4)
+sc.d t6, t5, (t4)
+sc.d.aq t5, t4, (t3)
+sc.d.rl t4, t3, (t2)
+sc.d.aqrl t3, t2, (t1)
+
+# Zaamo
+amoswap.w a4, ra, (s0)
+amoadd.w a1, a2, (a3)
+amoxor.w a2, a3, (a4)
+amoand.w a3, a4, (a5)
+amoor.w a4, a5, (a6)
+amomin.w a5, a6, (a7)
+amomax.w s7, s6, (s5)
+amominu.w s6, s5, (s4)
+amomaxu.w s5, s4, (s3)
+
+amoswap.w.aq a4, ra, (s0)
+amoadd.w.aq a1, a2, (a3)
+amoxor.w.aq a2, a3, (a4)
+amoand.w.aq a3, a4, (a5)
+amoor.w.aq a4, a5, (a6)
+amomin.w.aq a5, a6, (a7)
+amomax.w.aq s7, s6, (s5)
+amominu.w.aq s6, s5, (s4)
+amomaxu.w.aq s5, s4, (s3)
+
+amoswap.w.rl a4, ra, (s0)
+amoadd.w.rl a1, a2, (a3)
+amoxor.w.rl a2, a3, (a4)
+amoand.w.rl a3, a4, (a5)
+amoor.w.rl a4, a5, (a6)
+amomin.w.rl a5, a6, (a7)
+amomax.w.rl s7, s6, (s5)
+amominu.w.rl s6, s5, (s4)
+amomaxu.w.rl s5, s4, (s3)
+
+amoswap.w.aqrl a4, ra, (s0)
+amoadd.w.aqrl a1, a2, (a3)
+amoxor.w.aqrl a2, a3, (a4)
+amoand.w.aqrl a3, a4, (a5)
+amoor.w.aqrl ...
[truncated]
def : WriteRes<WriteCPOP, [GenericOOOALU]>;
def : WriteRes<WriteCLZ32, [GenericOOOALU]>;
def : WriteRes<WriteCTZ32, [GenericOOOALU]>;
def : WriteRes<WriteCPOP32, [GenericOOOALU]>;
I think CPOP and friends like CTZ/CLZ are usually more expensive than one cycle.
There must be some improvements that can be made to CPOP in SiFive's implementations, because SCR7 and TTAscalonD8 both have a 1-cycle latency (which matches XiangShan-KunMingHu and several other cores I know).
RISCVSchedXiangShanNanHu.td has Latency=3 for CPOP.
Oh, I see you wrote KunMingHu.
GCC uses a latency of 2 for cpop in its generic-ooo model. It's a lot of logic to implement and takes work to make it fast. It's 3 cycles on Intel x86 CPUs.
I am still not convinced, because I think a high-performance O3 CPU should have such instruction characteristics (same for CLMUL). The 3-cycle CTPOP on x86 is the vector CTPOP, right?
As for GCC's generic O3 model, I don't think it is reasonable enough; we may adjust it as well.
The 3-cycle CTPOP on Intel is for scalar. AMD Zen is better than Intel: https://uops.info/table.html?search=popcnt&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_ADLE=on&cb_ZEN4=on&cb_measurements=on&cb_doc=on&cb_base=on&cb_sse=on
I guess that is why Intel's CPUs are not as competitive as Zen* now. :-)
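The practical effect of the latency choice shows up mostly on dependent chains, where each instruction has to wait for the previous result. A minimal sketch of that arithmetic, using the three latencies mentioned in this thread (1 cycle in this patch, 2 in GCC's generic-ooo, 3 for scalar popcnt on Intel); the helper `chain_cycles` is hypothetical and ignores everything except result latency.

```python
def chain_cycles(num_dependent_ops: int, latency: int) -> int:
    """Cycles to finish a chain in which each op consumes the previous result.

    Assumes no forwarding shortcut (ReadAdvance of 0, as in this patch), so
    the chain simply costs one full latency per instruction.
    """
    return num_dependent_ops * latency


# Latency values quoted in the review discussion above.
CANDIDATES = (
    ("this patch", 1),
    ("GCC generic-ooo", 2),
    ("Intel scalar popcnt", 3),
)

for name, latency in CANDIDATES:
    print(f"{name}: {chain_cycles(8, latency)} cycles for 8 dependent cpop")
```

Independent cpop instructions are not affected in the same way, since a pipelined unit can still start one per cycle regardless of the result latency.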
LGTM. Thank you!
Please wait for additional approval from other reviewers :)
//===----------------------------------------------------------------------===//
// Zbc extension
//===----------------------------------------------------------------------===//
def : WriteRes<WriteCLMUL, [GenericOOOALU]>;
It appears that GCC uses a latency of 2 for this in its generic-ooo model.
Ping. Any more comments?
LGTM
We add a generic out-of-order CPU model here, just as GCC has done. People may use this model to evaluate optimizations and, more importantly, as a template for customizing their own CPU models. The design (units, cycles, ...) of this model is arbitrary, so don't take it seriously.
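When adapting the model as a template, it helps to keep in mind how unit counts and `ReleaseAtCycles` interact: an instruction that holds a unit for N cycles can start at most once every N cycles on that unit. The sketch below works through that arithmetic with values from this patch. The helper `reciprocal_throughput` is hypothetical and only approximates what a tool such as llvm-mca would report for a stream of identical, independent instructions.

```python
from fractions import Fraction


def reciprocal_throughput(release_at_cycles: int, num_units: int) -> Fraction:
    """Average cycles between issues of identical, independent instructions.

    Each instruction occupies one unit for `release_at_cycles` cycles, and
    `num_units` such units can work in parallel.
    """
    return Fraction(release_at_cycles, num_units)


# WriteIDiv32: held for 13 cycles on the single GenericOOODIV unit.
print("IDiv32 :", reciprocal_throughput(13, 1))
# WriteFDiv32: held for 13 cycles on one of the two GenericOOOFPU units.
print("FDiv32 :", reciprocal_throughput(13, 2))
# WriteIALU: default occupancy of 1 cycle on the GenericOOOALU group
# (3 IXU + 1 DIV = 4 units).
print("IALU   :", reciprocal_throughput(1, 4))
```

The non-pipelined divide therefore dominates any loop that issues it back to back, which is why the patch sets `ReleaseAtCycles` equal to `Latency` for div/rem/sqrt but leaves the default for pipelined operations.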
a2a4791 to e731820