Skip to content

[StrTable] Switch intrinsics to StringTable and work around MSVC #123548

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 28, 2025

Conversation

chandlerc
Copy link
Member

@chandlerc chandlerc commented Jan 20, 2025

Historically, the main example of very large string tables used the
EmitCharArray to work around MSVC limitations with string literals,
but that was switched (without removing the API) in order to consolidate
on a nicer emission primitive.

While this large string table in IntrinsicsImpl.inc seems to compile
correctly on MSVC without the work around in EmitCharArray (and that
this PR adds back to the nicer emission path), other users have
repeatedly hit this MSVC limitation as you can see in the discussion on
PR #120534. This PR teaches the string offset table emission to look at
the size of the table and switch to the char array emission strategy
when the table becomes too large.

This work around does have the downside of making compile times worse
for large string tables, but that appears unavoidable until we can
identify known good MSVC versions and switch to requiring them for all
LLVM users. It also reduces searchability of the generated string table
-- I looked at emitting a comment with each string but it is tricky
because the escaping rules for an inline comment are different from
those of of a string literal, and there's no real way to turn the string
literal into a comment.

While improving the output in this way, also clean up the output to not
emit an extraneous empty string at the end of the string table, and
update the StringTable class to not look for that. It isn't actually
used by anything and is wasteful.

This PR also switches the IntrinsicsImpl.inc string tables over to the
new StringTable runtime abstraction. I didn't want to do this until
landing the MSVC workaround in case it caused even this example to start
hitting the MSVC bug, but I wanted to switch here so that I could
simplify the API for emitting the string table with the workaround
present. With the two different emission strategies, its important to
use a very exact syntax and that seems better encapsulated in the API.

Last but not least, the SDNodeInfoEmitter is updated, including its
tests to match the new output.

This PR should unblock landing #120534 and letting us switch all of
Clang's builtins to use string tables. That PR has all the details
motivating the overall effort.

Follow-up patches will try to consolidate the remaining users onto the
single interface, but those at least were easy to separate into
follow-ups and keep this PR somewhat smaller.

@chandlerc chandlerc requested a review from rnk January 20, 2025 05:04
@llvmbot llvmbot added clang Clang issues not falling into any other category lldb clang:frontend Language frontend issues, e.g. anything involving "Sema" tablegen llvm:ir llvm:binary-utilities llvm:adt labels Jan 20, 2025
@llvmbot
Copy link
Member

llvmbot commented Jan 20, 2025

@llvm/pr-subscribers-lldb
@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-tablegen
@llvm/pr-subscribers-llvm-adt

@llvm/pr-subscribers-clang

Author: Chandler Carruth (chandlerc)

Changes

Note: This PR depends on #123302 and #123308 -- only the last of the three commits should be reviewed here.


Historically, the main example of very large string tables used the
EmitCharArray to work around MSVC limitations with string literals,
but that was switched (without removing the API) in order to consolidate
on a nicer emission primitive.

While this large string table in IntrinsicsImpl.inc seems to compile
correctly on MSVC without the work around in EmitCharArray (and that
this PR adds back to the nicer emission path), other users have
repeatedly hit this MSVC limitation as you can see in the discussion on
PR #120534. This PR teaches the string offset table emission to look at
the size of the table and switch to the char array emission strategy
when the table becomes too large.

This work around does have the downside of making compile times worse
for large string tables, but that appears unavoidable until we can
identify known good MSVC versions and switch to requiring them for all
LLVM users. It also reduces searchability of the generated string table
-- I looked at emitting a comment with each string but it is tricky
because the escaping rules for an inline comment are different from
those of of a string literal, and there's no real way to turn the string
literal into a comment.

This PR also switches the IntrinsicsImpl.inc string tables over to the
new StringTable runtime abstraction. I didn't want to do this until
landing the MSVC workaround in case it caused even this example to start
hitting the MSVC bug, but I wanted to switch here so that I could
simplify the API for emitting the string table with the workaround
present. With the two different emission strategies, its important to
use a very exact syntax and that seems better encapsulated in the API.

This PR should unblock landing #120534 and letting us switch all of
Clang's builtins to use string tables. That PR has all the details
motivating the overall effort.

Follow-up patches will try to consolidate the remaining users onto the
single interface, but those at least were easy to separate into
follow-ups and keep this PR somewhat smaller.


Patch is 37.28 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123548.diff

15 Files Affected:

  • (modified) clang/lib/Basic/DiagnosticIDs.cpp (+8-10)
  • (modified) clang/lib/Frontend/CompilerInvocation.cpp (+1-1)
  • (modified) clang/tools/diagtool/DiagnosticNames.cpp (+2-1)
  • (modified) clang/utils/TableGen/ClangDiagnosticsEmitter.cpp (+6-18)
  • (modified) lldb/source/Plugins/Platform/MacOSX/PlatformDarwin.cpp (+2-1)
  • (modified) llvm/include/llvm/ADT/StringTable.h (+49-1)
  • (modified) llvm/include/llvm/Option/OptTable.h (+38-29)
  • (modified) llvm/include/llvm/TableGen/StringToOffsetTable.h (+51-31)
  • (modified) llvm/lib/IR/Intrinsics.cpp (+10-9)
  • (modified) llvm/lib/Option/OptTable.cpp (+36-34)
  • (modified) llvm/test/TableGen/MixedCasedMnemonic.td (+1-1)
  • (modified) llvm/tools/llvm-objdump/llvm-objdump.cpp (+2-1)
  • (modified) llvm/unittests/Option/OptionMarshallingTest.cpp (+2-1)
  • (modified) llvm/utils/TableGen/Basic/IntrinsicEmitter.cpp (+3-4)
  • (modified) llvm/utils/TableGen/OptionParserEmitter.cpp (+5-8)
diff --git a/clang/lib/Basic/DiagnosticIDs.cpp b/clang/lib/Basic/DiagnosticIDs.cpp
index d77f28c80b2eb2..55f868147134b7 100644
--- a/clang/lib/Basic/DiagnosticIDs.cpp
+++ b/clang/lib/Basic/DiagnosticIDs.cpp
@@ -16,6 +16,7 @@
 #include "clang/Basic/SourceManager.h"
 #include "llvm/ADT/STLExtras.h"
 #include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/StringTable.h"
 #include "llvm/Support/ErrorHandling.h"
 #include "llvm/Support/Path.h"
 #include <map>
@@ -618,11 +619,7 @@ namespace {
     uint16_t SubGroups;
     StringRef Documentation;
 
-    // String is stored with a pascal-style length byte.
-    StringRef getName() const {
-      return StringRef(DiagGroupNames + NameOffset + 1,
-                       DiagGroupNames[NameOffset]);
-    }
+    StringRef getName() const { return DiagGroupNames[NameOffset]; }
   };
 }
 
@@ -669,11 +666,12 @@ StringRef DiagnosticIDs::getWarningOptionForDiag(unsigned DiagID) {
 
 std::vector<std::string> DiagnosticIDs::getDiagnosticFlags() {
   std::vector<std::string> Res{"-W", "-Wno-"};
-  for (size_t I = 1; DiagGroupNames[I] != '\0';) {
-    std::string Diag(DiagGroupNames + I + 1, DiagGroupNames[I]);
-    I += DiagGroupNames[I] + 1;
-    Res.push_back("-W" + Diag);
-    Res.push_back("-Wno-" + Diag);
+  for (StringRef Name : DiagGroupNames) {
+    if (Name.empty())
+      continue;
+
+    Res.push_back((Twine("-W") + Name).str());
+    Res.push_back((Twine("-Wno-") + Name).str());
   }
 
   return Res;
diff --git a/clang/lib/Frontend/CompilerInvocation.cpp b/clang/lib/Frontend/CompilerInvocation.cpp
index 58658dedbaf1ee..3bf124e4827be9 100644
--- a/clang/lib/Frontend/CompilerInvocation.cpp
+++ b/clang/lib/Frontend/CompilerInvocation.cpp
@@ -282,7 +282,7 @@ using ArgumentConsumer = CompilerInvocation::ArgumentConsumer;
 #undef OPTTABLE_STR_TABLE_CODE
 
 static llvm::StringRef lookupStrInTable(unsigned Offset) {
-  return &OptionStrTable[Offset];
+  return OptionStrTable[Offset];
 }
 
 #define SIMPLE_ENUM_VALUE_TABLE
diff --git a/clang/tools/diagtool/DiagnosticNames.cpp b/clang/tools/diagtool/DiagnosticNames.cpp
index eb90f082437b33..1004e7bf2063b1 100644
--- a/clang/tools/diagtool/DiagnosticNames.cpp
+++ b/clang/tools/diagtool/DiagnosticNames.cpp
@@ -9,6 +9,7 @@
 #include "DiagnosticNames.h"
 #include "clang/Basic/AllDiagnostics.h"
 #include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/StringTable.h"
 
 using namespace clang;
 using namespace diagtool;
@@ -74,7 +75,7 @@ static const GroupRecord OptionTable[] = {
 };
 
 llvm::StringRef GroupRecord::getName() const {
-  return StringRef(DiagGroupNames + NameOffset + 1, DiagGroupNames[NameOffset]);
+  return DiagGroupNames[NameOffset];
 }
 
 GroupRecord::subgroup_iterator GroupRecord::subgroup_begin() const {
diff --git a/clang/utils/TableGen/ClangDiagnosticsEmitter.cpp b/clang/utils/TableGen/ClangDiagnosticsEmitter.cpp
index fb00c640d6b144..50dbe4d5a8cab7 100644
--- a/clang/utils/TableGen/ClangDiagnosticsEmitter.cpp
+++ b/clang/utils/TableGen/ClangDiagnosticsEmitter.cpp
@@ -1782,19 +1782,11 @@ static void emitDiagArrays(DiagsInGroupTy &DiagsInGroup,
 
 /// Emit a list of group names.
 ///
-/// This creates a long string which by itself contains a list of pascal style
-/// strings, which consist of a length byte directly followed by the string.
-///
-/// \code
-///   static const char DiagGroupNames[] = {
-///     \000\020#pragma-messages\t#warnings\020CFString-literal"
-///   };
-/// \endcode
+/// This creates an `llvm::StringTable` of all the diagnostic group names.
 static void emitDiagGroupNames(const StringToOffsetTable &GroupNames,
                                raw_ostream &OS) {
-  OS << "static const char DiagGroupNames[] = {\n";
-  GroupNames.EmitString(OS);
-  OS << "};\n\n";
+  GroupNames.EmitStringTableDef(OS, "DiagGroupNames");
+  OS << "\n";
 }
 
 /// Emit diagnostic arrays and related data structures.
@@ -1806,7 +1798,7 @@ static void emitDiagGroupNames(const StringToOffsetTable &GroupNames,
 ///  #ifdef GET_DIAG_ARRAYS
 ///     static const int16_t DiagArrays[];
 ///     static const int16_t DiagSubGroups[];
-///     static const char DiagGroupNames[];
+///     static constexpr llvm::StringTable DiagGroupNames;
 ///  #endif
 ///  \endcode
 static void emitAllDiagArrays(DiagsInGroupTy &DiagsInGroup,
@@ -1858,9 +1850,7 @@ static void emitDiagTable(DiagsInGroupTy &DiagsInGroup,
                                "0123456789!@#$%^*-+=:?") != std::string::npos)
       PrintFatalError("Invalid character in diagnostic group '" + Name + "'");
     OS << Name << " */, ";
-    // Store a pascal-style length byte at the beginning of the string.
-    std::string PascalName = char(Name.size()) + Name.str();
-    OS << *GroupNames.GetStringOffset(PascalName) << ", ";
+    OS << *GroupNames.GetStringOffset(Name) << ", ";
 
     // Special handling for 'pedantic'.
     const bool IsPedantic = Name == "pedantic";
@@ -1949,9 +1939,7 @@ void clang::EmitClangDiagGroups(const RecordKeeper &Records, raw_ostream &OS) {
 
   StringToOffsetTable GroupNames;
   for (const auto &[Name, Group] : DiagsInGroup) {
-    // Store a pascal-style length byte at the beginning of the string.
-    std::string PascalName = char(Name.size()) + Name.str();
-    GroupNames.GetOrAddStringOffset(PascalName, false);
+    GroupNames.GetOrAddStringOffset(Name);
   }
 
   emitAllDiagArrays(DiagsInGroup, DiagsInPedantic, GroupsInPedantic, GroupNames,
diff --git a/lldb/source/Plugins/Platform/MacOSX/PlatformDarwin.cpp b/lldb/source/Plugins/Platform/MacOSX/PlatformDarwin.cpp
index 2a36f95c94d0ce..51e9a6d81b8390 100644
--- a/lldb/source/Plugins/Platform/MacOSX/PlatformDarwin.cpp
+++ b/lldb/source/Plugins/Platform/MacOSX/PlatformDarwin.cpp
@@ -42,6 +42,7 @@
 #include "lldb/Utility/Status.h"
 #include "lldb/Utility/Timer.h"
 #include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/StringTable.h"
 #include "llvm/Support/Error.h"
 #include "llvm/Support/FileSystem.h"
 #include "llvm/Support/Threading.h"
@@ -1083,7 +1084,7 @@ void PlatformDarwin::AddClangModuleCompilationOptionsForSDKType(
   if (!version.empty() && sdk_type != XcodeSDK::Type::Linux &&
       sdk_type != XcodeSDK::Type::XROS) {
 #define OPTION(PREFIX_OFFSET, NAME_OFFSET, VAR, ...)                           \
-  llvm::StringRef opt_##VAR = &OptionStrTable[NAME_OFFSET];                    \
+  llvm::StringRef opt_##VAR = OptionStrTable[NAME_OFFSET];                     \
   (void)opt_##VAR;
 #include "clang/Driver/Options.inc"
 #undef OPTION
diff --git a/llvm/include/llvm/ADT/StringTable.h b/llvm/include/llvm/ADT/StringTable.h
index 4049f892fa66e0..5f9ecc452808de 100644
--- a/llvm/include/llvm/ADT/StringTable.h
+++ b/llvm/include/llvm/ADT/StringTable.h
@@ -10,6 +10,8 @@
 #define LLVM_ADT_STRING_TABLE_H
 
 #include "llvm/ADT/StringRef.h"
+#include "llvm/ADT/iterator.h"
+#include <iterator>
 #include <limits>
 
 namespace llvm {
@@ -51,6 +53,14 @@ class StringTable {
     constexpr Offset() = default;
     constexpr Offset(unsigned Value) : Value(Value) {}
 
+    friend constexpr bool operator==(const Offset &LHS, const Offset &RHS) {
+      return LHS.Value == RHS.Value;
+    }
+
+    friend constexpr bool operator!=(const Offset &LHS, const Offset &RHS) {
+      return LHS.Value != RHS.Value;
+    }
+
     constexpr unsigned value() const { return Value; }
   };
 
@@ -69,9 +79,13 @@ class StringTable {
     assert(!Table.empty() && "Requires at least a valid empty string.");
     assert(Table.data()[0] == '\0' && "Offset zero must be the empty string.");
     // Ensure that `strlen` from any offset cannot overflow the end of the table
-    // by insisting on a null byte at the end.
+    // by insisting on a null byte at the end. We also insist on the last string
+    // within the table being *separately* null terminated. This structure is
+    // used to enable predictable iteration over all the strings when needed.
     assert(Table.data()[Table.size() - 1] == '\0' &&
            "Last byte must be a null byte.");
+    assert(Table.data()[Table.size() - 2] == '\0' &&
+           "Next-to-last byte must be a null byte.");
   }
 
   // Get a string from the table starting with the provided offset. The returned
@@ -84,6 +98,40 @@ class StringTable {
 
   /// Returns the byte size of the table.
   constexpr size_t size() const { return Table.size(); }
+
+  class Iterator
+      : public iterator_facade_base<Iterator, std::forward_iterator_tag,
+                                    const StringRef> {
+    friend StringTable;
+
+    const StringTable *Table;
+    Offset O;
+
+    // A cache of one value to allow `*` to return a reference.
+    mutable StringRef S;
+
+    explicit constexpr Iterator(const StringTable &Table, Offset O)
+        : Table(&Table), O(O) {}
+
+  public:
+    constexpr Iterator(const Iterator &RHS) = default;
+    constexpr Iterator(Iterator &&RHS) = default;
+
+    bool operator==(const Iterator &RHS) const {
+      assert(Table == RHS.Table && "Compared iterators for unrelated tables!");
+      return O == RHS.O;
+    }
+
+    const StringRef &operator*() const { return (S = (*Table)[O]); }
+
+    Iterator &operator++() {
+      O = O.value() + (*Table)[O].size() + 1;
+      return *this;
+    }
+  };
+
+  constexpr Iterator begin() const { return Iterator(*this, 0); }
+  constexpr Iterator end() const { return Iterator(*this, size() - 1); }
 };
 
 } // namespace llvm
diff --git a/llvm/include/llvm/Option/OptTable.h b/llvm/include/llvm/Option/OptTable.h
index 38a03fef7ae124..61a58aa304ecb4 100644
--- a/llvm/include/llvm/Option/OptTable.h
+++ b/llvm/include/llvm/Option/OptTable.h
@@ -12,6 +12,7 @@
 #include "llvm/ADT/ArrayRef.h"
 #include "llvm/ADT/SmallString.h"
 #include "llvm/ADT/StringRef.h"
+#include "llvm/ADT/StringTable.h"
 #include "llvm/Option/OptSpecifier.h"
 #include "llvm/Support/StringSaver.h"
 #include <cassert>
@@ -54,7 +55,7 @@ class OptTable {
   /// Entry for a single option instance in the option data table.
   struct Info {
     unsigned PrefixesOffset;
-    unsigned PrefixedNameOffset;
+    StringTable::Offset PrefixedNameOffset;
     const char *HelpText;
     // Help text for specific visibilities. A list of pairs, where each pair
     // is a list of visibilities and a specific help string for those
@@ -80,34 +81,37 @@ class OptTable {
 
     bool hasNoPrefix() const { return PrefixesOffset == 0; }
 
-    unsigned getNumPrefixes(ArrayRef<unsigned> PrefixesTable) const {
-      return PrefixesTable[PrefixesOffset];
+    unsigned getNumPrefixes(ArrayRef<StringTable::Offset> PrefixesTable) const {
+      // We embed the number of prefixes in the value of the first offset.
+      return PrefixesTable[PrefixesOffset].value();
     }
 
-    ArrayRef<unsigned>
-    getPrefixOffsets(ArrayRef<unsigned> PrefixesTable) const {
-      return hasNoPrefix() ? ArrayRef<unsigned>()
+    ArrayRef<StringTable::Offset>
+    getPrefixOffsets(ArrayRef<StringTable::Offset> PrefixesTable) const {
+      return hasNoPrefix() ? ArrayRef<StringTable::Offset>()
                            : PrefixesTable.slice(PrefixesOffset + 1,
                                                  getNumPrefixes(PrefixesTable));
     }
 
-    void appendPrefixes(const char *StrTable, ArrayRef<unsigned> PrefixesTable,
+    void appendPrefixes(const StringTable &StrTable,
+                        ArrayRef<StringTable::Offset> PrefixesTable,
                         SmallVectorImpl<StringRef> &Prefixes) const {
-      for (unsigned PrefixOffset : getPrefixOffsets(PrefixesTable))
-        Prefixes.push_back(&StrTable[PrefixOffset]);
+      for (auto PrefixOffset : getPrefixOffsets(PrefixesTable))
+        Prefixes.push_back(StrTable[PrefixOffset]);
     }
 
-    StringRef getPrefix(const char *StrTable, ArrayRef<unsigned> PrefixesTable,
+    StringRef getPrefix(const StringTable &StrTable,
+                        ArrayRef<StringTable::Offset> PrefixesTable,
                         unsigned PrefixIndex) const {
-      return &StrTable[getPrefixOffsets(PrefixesTable)[PrefixIndex]];
+      return StrTable[getPrefixOffsets(PrefixesTable)[PrefixIndex]];
     }
 
-    StringRef getPrefixedName(const char *StrTable) const {
-      return &StrTable[PrefixedNameOffset];
+    StringRef getPrefixedName(const StringTable &StrTable) const {
+      return StrTable[PrefixedNameOffset];
     }
 
-    StringRef getName(const char *StrTable,
-                      ArrayRef<unsigned> PrefixesTable) const {
+    StringRef getName(const StringTable &StrTable,
+                      ArrayRef<StringTable::Offset> PrefixesTable) const {
       unsigned PrefixLength =
           hasNoPrefix() ? 0 : getPrefix(StrTable, PrefixesTable, 0).size();
       return getPrefixedName(StrTable).drop_front(PrefixLength);
@@ -117,13 +121,13 @@ class OptTable {
 private:
   // A unified string table for these options. Individual strings are stored as
   // null terminated C-strings at offsets within this table.
-  const char *StrTable;
+  const StringTable *StrTable;
 
   // A table of different sets of prefixes. Each set starts with the number of
   // prefixes in that set followed by that many offsets into the string table
   // for each of the prefix strings. This is essentially a Pascal-string style
   // encoding.
-  ArrayRef<unsigned> PrefixesTable;
+  ArrayRef<StringTable::Offset> PrefixesTable;
 
   /// The option information table.
   ArrayRef<Info> OptionInfos;
@@ -161,7 +165,8 @@ class OptTable {
 protected:
   /// Initialize OptTable using Tablegen'ed OptionInfos. Child class must
   /// manually call \c buildPrefixChars once they are fully constructed.
-  OptTable(const char *StrTable, ArrayRef<unsigned> PrefixesTable,
+  OptTable(const StringTable &StrTable,
+           ArrayRef<StringTable::Offset> PrefixesTable,
            ArrayRef<Info> OptionInfos, bool IgnoreCase = false);
 
   /// Build (or rebuild) the PrefixChars member.
@@ -171,10 +176,12 @@ class OptTable {
   virtual ~OptTable();
 
   /// Return the string table used for option names.
-  const char *getStrTable() const { return StrTable; }
+  const StringTable &getStrTable() const { return *StrTable; }
 
   /// Return the prefixes table used for option names.
-  ArrayRef<unsigned> getPrefixesTable() const { return PrefixesTable; }
+  ArrayRef<StringTable::Offset> getPrefixesTable() const {
+    return PrefixesTable;
+  }
 
   /// Return the total number of option classes.
   unsigned getNumOptions() const { return OptionInfos.size(); }
@@ -187,25 +194,25 @@ class OptTable {
 
   /// Lookup the name of the given option.
   StringRef getOptionName(OptSpecifier id) const {
-    return getInfo(id).getName(StrTable, PrefixesTable);
+    return getInfo(id).getName(*StrTable, PrefixesTable);
   }
 
   /// Lookup the prefix of the given option.
   StringRef getOptionPrefix(OptSpecifier id) const {
     const Info &I = getInfo(id);
     return I.hasNoPrefix() ? StringRef()
-                           : I.getPrefix(StrTable, PrefixesTable, 0);
+                           : I.getPrefix(*StrTable, PrefixesTable, 0);
   }
 
   void appendOptionPrefixes(OptSpecifier id,
                             SmallVectorImpl<StringRef> &Prefixes) const {
     const Info &I = getInfo(id);
-    I.appendPrefixes(StrTable, PrefixesTable, Prefixes);
+    I.appendPrefixes(*StrTable, PrefixesTable, Prefixes);
   }
 
   /// Lookup the prefixed name of the given option.
   StringRef getOptionPrefixedName(OptSpecifier id) const {
-    return getInfo(id).getPrefixedName(StrTable);
+    return getInfo(id).getPrefixedName(*StrTable);
   }
 
   /// Get the kind of the given option.
@@ -418,19 +425,21 @@ class OptTable {
 /// Specialization of OptTable
 class GenericOptTable : public OptTable {
 protected:
-  GenericOptTable(const char *StrTable, ArrayRef<unsigned> PrefixesTable,
+  GenericOptTable(const StringTable &StrTable,
+                  ArrayRef<StringTable::Offset> PrefixesTable,
                   ArrayRef<Info> OptionInfos, bool IgnoreCase = false);
 };
 
 class PrecomputedOptTable : public OptTable {
 protected:
-  PrecomputedOptTable(const char *StrTable, ArrayRef<unsigned> PrefixesTable,
+  PrecomputedOptTable(const StringTable &StrTable,
+                      ArrayRef<StringTable::Offset> PrefixesTable,
                       ArrayRef<Info> OptionInfos,
-                      ArrayRef<unsigned> PrefixesUnionOffsets,
+                      ArrayRef<StringTable::Offset> PrefixesUnionOffsets,
                       bool IgnoreCase = false)
       : OptTable(StrTable, PrefixesTable, OptionInfos, IgnoreCase) {
-    for (unsigned PrefixOffset : PrefixesUnionOffsets)
-      PrefixesUnion.push_back(&StrTable[PrefixOffset]);
+    for (auto PrefixOffset : PrefixesUnionOffsets)
+      PrefixesUnion.push_back(StrTable[PrefixOffset]);
     buildPrefixChars();
   }
 };
diff --git a/llvm/include/llvm/TableGen/StringToOffsetTable.h b/llvm/include/llvm/TableGen/StringToOffsetTable.h
index d4bb685acce327..9cec97ed31848d 100644
--- a/llvm/include/llvm/TableGen/StringToOffsetTable.h
+++ b/llvm/include/llvm/TableGen/StringToOffsetTable.h
@@ -27,6 +27,12 @@ class StringToOffsetTable {
   std::string AggregateString;
 
 public:
+  StringToOffsetTable() {
+    // Ensure we always put the empty string at offset zero. That lets empty
+    // initialization also be zero initialization for offsets into the table.
+    GetOrAddStringOffset("");
+  }
+
   bool empty() const { return StringOffset.empty(); }
   size_t size() const { return AggregateString.size(); }
 
@@ -51,28 +57,62 @@ class StringToOffsetTable {
     return II->second;
   }
 
-  // Emit the string using string literal concatenation, for better readability
-  // and searchability.
-  void EmitStringLiteralDef(raw_ostream &OS, const Twine &Decl,
-                            const Twine &Indent = "  ") const {
+  // Emit a string table definition with the provided name and indent.
+  //
+  // When possible, this uses string-literal concatenation to emit the string
+  // contents in a readable and searchable way. However, for (very) large string
+  // tables MSVC cannot reliably use string literals and so there we use a large
+  // character array. We still use a line oriented emission and add comments to
+  // provide searchability even in this case.
+  //
+  // The string table, and its input string contents, are always emitted as both
+  // `static` and `constexpr`. Both `Name` and (`Name` + "Storage") must be
+  // valid identifiers to declare.
+  void EmitStringTableDef(raw_ostream &OS, const Twine &Name,
+                          const Twine &Indent = "") const {
     OS << formatv(R"(
 #ifdef __GNUC__
 #pragma GCC diagnostic push
 #pragma GCC diagnostic ignored "-Woverlength-strings"
 #endif
-{0}{1} = )",
-                  Indent, Decl);
+{0}static constexpr char {1}Storage[] = )",
+                  Indent, Name);
 
+    // MSVC silently miscompiles string literals longer than 64k in some
+    // circumstances. When the string table is longer, emit it as an array of
+    // character literals.
+    bool UseChars = AggregateString.size() > (64 * 1024);
+    OS << (UseChars ? "{\n" : "\n");
+
+    llvm::ListSeparator LineSep(UseChars ? ",\n" : "\n");
     for (StringRef Str : split(AggregateString, '\0')) {
-      OS << "\n" << Indent << "  \"";
-      OS.write_escaped(Str);
-      OS << "\\0\"";
+      OS << LineSep << Indent << "  ";
+      // If we can, just emit this as a string literal to be concatenated.
+      if (!UseChars) {
+        OS << "\"";
+        OS.write_escaped(Str);
+        OS << "\\0\"";
+        continue;
+      }
+
+      llvm::ListSeparator CharSep(", ");
+      for (char C : Str) {
+        OS << CharSep << "'";
+        OS.write_escaped(StringRef(&C, 1));
+        OS << "'";
+      }
+      OS << CharSep << "'\\0'";
     }
-    OS << R"(;
+    OS << LineSep << Indent << (UseChars ? "};" : "  ;");
+
+    OS << formatv(R"(
 #ifdef __GNUC__
 #pragma GCC diagnostic pop
 #endif
-)";
+
+{0}static constexpr llvm::StringTable {1} = {1}Storage;
+)",
+                  Indent, Name);
   }
 
   // Emit the string as one single string.
@@ -110,26 +150,6 @@ class StringToOffsetTable {
     }
     O << "\"";
   }
-
-  /// Emit the string using character literals. MSVC has a limitation that
-  /// string literals cannot be longer than 64K.
-  void EmitCharArray(raw_ostream &O) {
-    assert(AggregateString.find(')') == std::string::npos &&
-           "can't emit raw string with closing parens");
-    int Count = 0;
...
[truncated]

Historically, the main example of *very* large string tables used the
`EmitCharArray` to work around MSVC limitations with string literals,
but that was switched (without removing the API) in order to consolidate
on a nicer emission primitive.

While this large string table in `IntrinsicsImpl.inc` seems to compile
correctly on MSVC without the work around in `EmitCharArray` (and that
this PR adds back to the nicer emission path), other users have
repeatedly hit this MSVC limitation as you can see in the discussion on
PR llvm#120534. This PR teaches the string offset table emission to look at
the size of the table and switch to the char array emission strategy
when the table becomes too large.

This work around does have the downside of making compile times worse
for large string tables, but that appears unavoidable until we can
identify known good MSVC versions and switch to requiring them for all
LLVM users. It also reduces searchability of the generated string table
-- I looked at emitting a comment with each string but it is tricky
because the escaping rules for an inline comment are different from
those of of a string literal, and there's no real way to turn the string
literal into a comment.

While improving the output in this way, also clean up the output to not
emit an extraneous empty string at the end of the string table, and
update the `StringTable` class to not look for that. It isn't actually
used by anything and is wasteful.

This PR also switches the `IntrinsicsImpl.inc` string tables over to the
new `StringTable` runtime abstraction. I didn't want to do this until
landing the MSVC workaround in case it caused even this example to start
hitting the MSVC bug, but I wanted to switch here so that I could
simplify the API for emitting the string table with the workaround
present. With the two different emission strategies, its important to
use a very exact syntax and that seems better encapsulated in the API.

Last but not least, the `SDNodeInfoEmitter` is updated, including its
tests to match the new output.

This PR should unblock landing llvm#120534 and letting us switch all of
Clang's builtins to use string tables. That PR has all the details
motivating the overall effort.

Follow-up patches will try to consolidate the remaining users onto the
single interface, but those at least were easy to separate into
follow-ups and keep this PR somewhat smaller.
@chandlerc
Copy link
Member Author

Ping -- all the dependent changes have landed and this is ready to be reviewed / merged!

I've already had to update this as more string table using code landed, so I'd like it to not wait too long...

Copy link
Contributor

@s-barannikov s-barannikov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chandlerc
Copy link
Member Author

Sounds good, and thanks for all the reviews!

@chandlerc chandlerc merged commit f4de28a into llvm:main Jan 28, 2025
8 checks passed
@chandlerc chandlerc deleted the strtable-large branch January 28, 2025 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:frontend Language frontend issues, e.g. anything involving "Sema" clang Clang issues not falling into any other category lldb llvm:adt llvm:binary-utilities llvm:ir tablegen
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants