Skip to content

[lld][COFF] Remove duplicate strtab entries #141197

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

HaohaiWen
Copy link
Contributor

String table size is too big for large binary when symbol table is
enabled. Some strings in strtab is same so it can be reused.

String table size is too big for large binary when symbol table is
enabled. Some strings in strtab is same so it can be reused.
@llvmbot
Copy link
Member

llvmbot commented May 23, 2025

@llvm/pr-subscribers-platform-windows

@llvm/pr-subscribers-lld-coff

Author: Haohai Wen (HaohaiWen)

Changes

String table size is too big for large binary when symbol table is
enabled. Some strings in strtab is same so it can be reused.


Full diff: https://github.com/llvm/llvm-project/pull/141197.diff

1 Files Affected:

  • (modified) lld/COFF/Writer.cpp (+8-4)
diff --git a/lld/COFF/Writer.cpp b/lld/COFF/Writer.cpp
index db6133e20a037..6ec83a5f77e5a 100644
--- a/lld/COFF/Writer.cpp
+++ b/lld/COFF/Writer.cpp
@@ -285,6 +285,7 @@ class Writer {
   std::unique_ptr<FileOutputBuffer> &buffer;
   std::map<PartialSectionKey, PartialSection *> partialSections;
   std::vector<char> strtab;
+  StringMap<size_t> strtabMap;
   std::vector<llvm::object::coff_symbol16> outputSymtab;
   std::vector<ECCodeMapEntry> codeMap;
   IdataContents idata;
@@ -1439,10 +1440,13 @@ void Writer::assignOutputSectionIndices() {
 
 size_t Writer::addEntryToStringTable(StringRef str) {
   assert(str.size() > COFF::NameSize);
-  size_t offsetOfEntry = strtab.size() + 4; // +4 for the size field
-  strtab.insert(strtab.end(), str.begin(), str.end());
-  strtab.push_back('\0');
-  return offsetOfEntry;
+  size_t newOffsetOfEntry = strtab.size() + 4; // +4 for the size field
+  auto res = strtabMap.try_emplace(str, newOffsetOfEntry);
+  if (res.second) {
+    strtab.insert(strtab.end(), str.begin(), str.end());
+    strtab.push_back('\0');
+  }
+  return res.first->getValue();
 }
 
 std::optional<coff_symbol16> Writer::createSymbol(Defined *def) {

@llvmbot
Copy link
Member

llvmbot commented May 23, 2025

@llvm/pr-subscribers-lld

Author: Haohai Wen (HaohaiWen)

Changes

String table size is too big for large binary when symbol table is
enabled. Some strings in strtab is same so it can be reused.


Full diff: https://github.com/llvm/llvm-project/pull/141197.diff

1 Files Affected:

  • (modified) lld/COFF/Writer.cpp (+8-4)
diff --git a/lld/COFF/Writer.cpp b/lld/COFF/Writer.cpp
index db6133e20a037..6ec83a5f77e5a 100644
--- a/lld/COFF/Writer.cpp
+++ b/lld/COFF/Writer.cpp
@@ -285,6 +285,7 @@ class Writer {
   std::unique_ptr<FileOutputBuffer> &buffer;
   std::map<PartialSectionKey, PartialSection *> partialSections;
   std::vector<char> strtab;
+  StringMap<size_t> strtabMap;
   std::vector<llvm::object::coff_symbol16> outputSymtab;
   std::vector<ECCodeMapEntry> codeMap;
   IdataContents idata;
@@ -1439,10 +1440,13 @@ void Writer::assignOutputSectionIndices() {
 
 size_t Writer::addEntryToStringTable(StringRef str) {
   assert(str.size() > COFF::NameSize);
-  size_t offsetOfEntry = strtab.size() + 4; // +4 for the size field
-  strtab.insert(strtab.end(), str.begin(), str.end());
-  strtab.push_back('\0');
-  return offsetOfEntry;
+  size_t newOffsetOfEntry = strtab.size() + 4; // +4 for the size field
+  auto res = strtabMap.try_emplace(str, newOffsetOfEntry);
+  if (res.second) {
+    strtab.insert(strtab.end(), str.begin(), str.end());
+    strtab.push_back('\0');
+  }
+  return res.first->getValue();
 }
 
 std::optional<coff_symbol16> Writer::createSymbol(Defined *def) {

@HaohaiWen HaohaiWen requested a review from mstorsjo May 23, 2025 04:44
@MaskRay
Copy link
Member

MaskRay commented May 23, 2025

Test needed. In ELF we can use llvm-readelf -p .strtab to dump the section table. I don't know what the COFF counterpart is.

@mstorsjo
Copy link
Member

Instead of just using std::map to deduplicate whole strings, it's also possible to use the StringTableBuilder class for deduplicating and doing tail merging of strings - optimizing the string table even further (at the cost of a little longer link time).

This was attempted before in 9ffeaaa, but I had to revert it in 4d2eda2; see the revert commit for an explanation of the issues I ran into, including a suggestion on what one could do to reapply it.

Anyway, using a std::map probably is lower overhead than using StringTableBuilder, and it keeps the strings in their current order (i.e. the early section names don't end up too far back in the string table).

Do you have any numbers on how much binary size you save with this? It could be interesting to compare with the StringTableBuilder case.

@HaohaiWen
Copy link
Contributor Author

Instead of just using std::map to deduplicate whole strings, it's also possible to use the StringTableBuilder class for deduplicating and doing tail merging of strings - optimizing the string table even further (at the cost of a little longer link time).

This was attempted before in 9ffeaaa, but I had to revert it in 4d2eda2; see the revert commit for an explanation of the issues I ran into, including a suggestion on what one could do to reapply it.

Anyway, using a std::map probably is lower overhead than using StringTableBuilder, and it keeps the strings in their current order (i.e. the early section names don't end up too far back in the string table).

Do you have any numbers on how much binary size you save with this? It could be interesting to compare with the StringTableBuilder case.

Thanks for reminding.
The blender binary size has been reduced about 300MB. There are many duplicate symbol name.

@HaohaiWen
Copy link
Contributor Author

Instead of just using std::map to deduplicate whole strings, it's also possible to use the StringTableBuilder class for deduplicating and doing tail merging of strings - optimizing the string table even further (at the cost of a little longer link time).

This was attempted before in 9ffeaaa, but I had to revert it in 4d2eda2; see the revert commit for an explanation of the issues I ran into, including a suggestion on what one could do to reapply it.

Anyway, using a std::map probably is lower overhead than using StringTableBuilder, and it keeps the strings in their current order (i.e. the early section names don't end up too far back in the string table).

Do you have any numbers on how much binary size you save with this? It could be interesting to compare with the StringTableBuilder case.

How about using two StringTableBuilder, one for long section name and another for others?
I have observed large strtab size many times for large projects.
Will you revive your patch with supporting for prioritized long section name?

@mstorsjo
Copy link
Member

How about using two StringTableBuilder, one for long section name and another for others? I have observed large strtab size many times for large projects. Will you revive your patch with supporting for prioritized long section name?

I don't plan on reviving it myself right now, but if you're working in this area, I would appreciate if you'd try to reapply that patch (revert the revert) and see if it can be made to work again on current git main. Then with that in place, it would be interesting to see if you see any further significant size reductions on your blender testcase.

If it doesn't really reduce the binary size any much further (perhaps cases where tail merging makes any difference are rare?), then perhaps it's best to just go with the much simpler approach of your current patch here.

@williamweixiao williamweixiao requested a review from MaskRay May 23, 2025 12:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants