Skip to content

[Serialization] Use stable hash functions #96136

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

MaskRay
Copy link
Member

@MaskRay MaskRay commented Jun 20, 2024

clangSerialization currently uses hash_combine/hash_value from
Hashing.h, which are not guaranteed to be deterministic.
Replace these uses with xxh3_64bits.

Created using spr 1.3.5-bogner
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:modules C++20 modules and Clang Header Modules llvm:adt labels Jun 20, 2024
@MaskRay MaskRay requested a review from ChuanqiXu9 June 20, 2024 03:51
@llvmbot
Copy link
Member

llvmbot commented Jun 20, 2024

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-llvm-adt

@llvm/pr-subscribers-clang-modules

Author: Fangrui Song (MaskRay)

Changes

clangSerialization currently uses hash_combine/hash_value from
Hashing.h, which are not guaranteed to be deterministic.
Replace these uses with xxh3_64bits.


Full diff: https://github.com/llvm/llvm-project/pull/96136.diff

4 Files Affected:

  • (modified) clang/lib/AST/ODRHash.cpp (+1-1)
  • (modified) clang/lib/Serialization/ASTReader.cpp (+6-4)
  • (modified) clang/lib/Serialization/ASTWriter.cpp (+5-2)
  • (modified) llvm/include/llvm/ADT/FoldingSet.h (+19-4)
diff --git a/clang/lib/AST/ODRHash.cpp b/clang/lib/AST/ODRHash.cpp
index 1249531eab09f..fbfe92318dc5e 100644
--- a/clang/lib/AST/ODRHash.cpp
+++ b/clang/lib/AST/ODRHash.cpp
@@ -251,7 +251,7 @@ unsigned ODRHash::CalculateHash() {
 
   assert(I == Bools.rend());
   Bools.clear();
-  return ID.ComputeHash();
+  return ID.computeStableHash();
 }
 
 namespace {
diff --git a/clang/lib/Serialization/ASTReader.cpp b/clang/lib/Serialization/ASTReader.cpp
index 973475cf56b8c..0bf8e416aff4a 100644
--- a/clang/lib/Serialization/ASTReader.cpp
+++ b/clang/lib/Serialization/ASTReader.cpp
@@ -1197,7 +1197,7 @@ unsigned DeclarationNameKey::getHash() const {
     break;
   }
 
-  return ID.ComputeHash();
+  return ID.computeStableHash();
 }
 
 ModuleFile *
@@ -2029,7 +2029,10 @@ const FileEntry *HeaderFileInfoTrait::getFile(const internal_key_type &Key) {
 }
 
 unsigned HeaderFileInfoTrait::ComputeHash(internal_key_ref ikey) {
-  return llvm::hash_combine(ikey.Size, ikey.ModTime);
+  uint8_t buf[sizeof(ikey.Size) + sizeof(ikey.ModTime)];
+  memcpy(buf, &ikey.Size, sizeof(ikey.Size));
+  memcpy(buf + sizeof(ikey.Size), &ikey.ModTime, sizeof(ikey.ModTime));
+  return llvm::xxh3_64bits(buf);
 }
 
 HeaderFileInfoTrait::internal_key_type
@@ -2636,8 +2639,7 @@ InputFile ASTReader::getInputFile(ModuleFile &F, unsigned ID, bool Complain) {
       return OriginalChange;
     }
 
-    // FIXME: hash_value is not guaranteed to be stable!
-    auto ContentHash = hash_value(MemBuffOrError.get()->getBuffer());
+    auto ContentHash = xxh3_64bits(MemBuffOrError.get()->getBuffer());
     if (StoredContentHash == static_cast<uint64_t>(ContentHash))
       return Change{Change::None};
 
diff --git a/clang/lib/Serialization/ASTWriter.cpp b/clang/lib/Serialization/ASTWriter.cpp
index 50fa44d34f524..546ebcc7b9a19 100644
--- a/clang/lib/Serialization/ASTWriter.cpp
+++ b/clang/lib/Serialization/ASTWriter.cpp
@@ -1782,7 +1782,7 @@ void ASTWriter::WriteInputFiles(SourceManager &SourceMgr,
             .ValidateASTInputFilesContent) {
       auto MemBuff = Cache->getBufferIfLoaded();
       if (MemBuff)
-        ContentHash = hash_value(MemBuff->getBuffer());
+        ContentHash = xxh3_64bits(MemBuff->getBuffer());
       else
         PP->Diag(SourceLocation(), diag::err_module_unable_to_hash_content)
             << Entry.File.getName();
@@ -1987,7 +1987,10 @@ namespace {
       // The hash is based only on size/time of the file, so that the reader can
       // match even when symlinking or excess path elements ("foo/../", "../")
       // change the form of the name. However, complete path is still the key.
-      return llvm::hash_combine(key.Size, key.ModTime);
+      uint8_t buf[sizeof(key.Size) + sizeof(key.ModTime)];
+      memcpy(buf, &key.Size, sizeof(key.Size));
+      memcpy(buf + sizeof(key.Size), &key.ModTime, sizeof(key.ModTime));
+      return llvm::xxh3_64bits(buf);
     }
 
     std::pair<unsigned, unsigned>
diff --git a/llvm/include/llvm/ADT/FoldingSet.h b/llvm/include/llvm/ADT/FoldingSet.h
index f82eabd5044b2..3c2eaade57e47 100644
--- a/llvm/include/llvm/ADT/FoldingSet.h
+++ b/llvm/include/llvm/ADT/FoldingSet.h
@@ -21,6 +21,7 @@
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/iterator.h"
 #include "llvm/Support/Allocator.h"
+#include "llvm/Support/xxhash.h"
 #include <cassert>
 #include <cstddef>
 #include <cstdint>
@@ -294,12 +295,19 @@ class FoldingSetNodeIDRef {
   FoldingSetNodeIDRef() = default;
   FoldingSetNodeIDRef(const unsigned *D, size_t S) : Data(D), Size(S) {}
 
-  /// ComputeHash - Compute a strong hash value for this FoldingSetNodeIDRef,
-  /// used to lookup the node in the FoldingSetBase.
+  // Compute a strong hash value used to lookup the node in the FoldingSetBase.
+  // The hash value is not guaranteed to be deterministic across processes.
   unsigned ComputeHash() const {
     return static_cast<unsigned>(hash_combine_range(Data, Data + Size));
   }
 
+  // Compute a deterministic hash value across processes that is suitable for
+  // on-disk serialization.
+  unsigned computeStableHash() const {
+    return static_cast<unsigned>(xxh3_64bits(ArrayRef(
+        reinterpret_cast<const uint8_t *>(Data), sizeof(unsigned) * Size)));
+  }
+
   bool operator==(FoldingSetNodeIDRef) const;
 
   bool operator!=(FoldingSetNodeIDRef RHS) const { return !(*this == RHS); }
@@ -366,12 +374,19 @@ class FoldingSetNodeID {
   /// object to be used to compute a new profile.
   inline void clear() { Bits.clear(); }
 
-  /// ComputeHash - Compute a strong hash value for this FoldingSetNodeID, used
-  /// to lookup the node in the FoldingSetBase.
+  // Compute a strong hash value for this FoldingSetNodeID, used to lookup the
+  // node in the FoldingSetBase. The hash value is not guaranteed to be
+  // deterministic across processes.
   unsigned ComputeHash() const {
     return FoldingSetNodeIDRef(Bits.data(), Bits.size()).ComputeHash();
   }
 
+  // Compute a deterministic hash value across processes that is suitable for
+  // on-disk serialization.
+  unsigned computeStableHash() const {
+    return FoldingSetNodeIDRef(Bits.data(), Bits.size()).computeStableHash();
+  }
+
   /// operator== - Used to compare two nodes to each other.
   bool operator==(const FoldingSetNodeID &RHS) const;
   bool operator==(const FoldingSetNodeIDRef RHS) const;

@MaskRay
Copy link
Member Author

MaskRay commented Jun 20, 2024

This patch fixes most clangSerialization issues if we make llvm::hash_combine/llvm::hash_value non-deterministic.
https://github.com/MaskRay/llvm-project/tree/hashing-seed

Some check-clang-modules tests still fail (@ChuanqiXu9) along with other failures (e.g. HIP @jhuber6 )

@ChuanqiXu9
Copy link
Member

This looks generally good.

Some check-clang-modules tests still fail (@ChuanqiXu9)

What does this mean? Do you mean there are some failures after this patch? What are they? Or are these tests failing before this patch?

@MaskRay
Copy link
Member Author

MaskRay commented Jun 20, 2024

This looks generally good.

Some check-clang-modules tests still fail (@ChuanqiXu9)

What does this mean? Do you mean there are some failures after this patch? What are they? Or are these tests failing before this patch?

When https://github.com/MaskRay/llvm-project/tree/hashing-seed is applied, some module tests will fail.
This PR fixes most of them but some are not and I am having a hard time figuring out them...

Applying this PR alone does not cause failures.

@ChuanqiXu9
Copy link
Member

This looks generally good.

Some check-clang-modules tests still fail (@ChuanqiXu9)

What does this mean? Do you mean there are some failures after this patch? What are they? Or are these tests failing before this patch?

When https://github.com/MaskRay/llvm-project/tree/hashing-seed is applied, some module tests will fail. This PR fixes most of them but some are not and I am having a hard time figuring out them...

Applying this PR alone does not cause failures.

Which tests are still failing?

@MaskRay
Copy link
Member Author

MaskRay commented Jun 21, 2024

  • Clang :: APINotes/module-cache.m
  • Clang :: Driver/cuda-bindings.cu
  • Clang :: Driver/cuda-phases.cu
  • Clang :: Driver/hip-phases.hip
  • Clang :: Driver/hip-toolchain-no-rdc.hip
  • Clang :: Driver/openmp-system-arch.c
  • Clang :: Modules/Rmodule-build.m
  • Clang :: Modules/context-hash.c
  • Clang :: Modules/extensions.c
  • Clang :: Modules/implicit-invalidate-common.c
  • Clang :: Modules/module-feature.m
  • Clang :: Modules/module-map-path-hash.cpp
  • Clang :: Modules/modules-with-same-name.m
  • Clang :: Modules/prebuilt-implicit-modules.m
  • Clang :: Modules/shadow.m
  • Clang :: Modules/system-out-of-date-test.m
  • Clang :: Modules/warn-unused-local-typedef.cpp
  • Clang :: VFS/module-header-mismatches.m
  • Clang :: VFS/module_missing_vfs.m

@MaskRay
Copy link
Member Author

MaskRay commented Jun 21, 2024

This looks generally good.

Which tests are still failing?

OK. 12c0281 has fixed these failures. If this PR is applied as well, applying https://github.com/MaskRay/llvm-project/tree/hashing-seed will not cause check-clang failures.

Copy link
Member

@ChuanqiXu9 ChuanqiXu9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM then.

@MaskRay MaskRay merged commit 874dcae into main Jun 21, 2024
12 checks passed
@MaskRay MaskRay deleted the users/MaskRay/spr/serialization-use-stable-hash-functions branch June 21, 2024 06:53
@ChuanqiXu9
Copy link
Member

I suspect this is the root cause that some modules related test are failing on armv8-quick. It passes with commit 12c0281 (https://lab.llvm.org/buildbot/#/builders/154/builds/320) and fails with b39f523 (https://lab.llvm.org/buildbot/#/builders/154/builds/322). I feel b39f523 is not related to tests.

@mgorny
Copy link
Member

mgorny commented Jun 22, 2024

Also hit it on 32-bit x86; filed #96379 before I managed to bisect it.

MaskRay added a commit that referenced this pull request Jun 22, 2024
https://reviews.llvm.org/D67249 added content hash (see
-fvalidate-ast-input-files-content) using llvm::hash_code (size_t).
The hash value is 32-bit on 32-bit systems, which was unintentional.

Fix #96379: #96136 switched the hash function to xxh3_64bit but did not
update the ContentHash type, leading to mismatch between ASTReader and
ASTWriter.
@MaskRay
Copy link
Member Author

MaskRay commented Jun 22, 2024

I suspect this is the root cause that some modules related test are failing on armv8-quick. It passes with commit 12c0281 (lab.llvm.org/buildbot/#/builders/154/builds/320) and fails with b39f523 (lab.llvm.org/buildbot/#/builders/154/builds/322). I feel b39f523 is not related to tests.

Also hit it on 32-bit x86; filed #96379 before I managed to bisect it.

Thanks for the reports. Reproducible with -DLLVM_TARGETS_TO_BUILD=host -DLLVM_ENABLE_PROJECTS='clang;lld' -DCMAKE_BUILD_TYPE=Debug -DCMAKE_{C,CXX}_FLAGS=-m32 -DCMAKE_{EXE,SHARED}_LINKER_FLAGS=-m32.

The content hash unintentionally used size_t, which is 32-bit and this change caused a mismatch. Fixed by f3005d5

@mgorny
Copy link
Member

mgorny commented Jun 23, 2024

Thanks!

AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this pull request Jul 9, 2024
clangSerialization currently uses hash_combine/hash_value from
Hashing.h, which are not guaranteed to be deterministic.
Replace these uses with xxh3_64bits.

Pull Request: llvm#96136
AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this pull request Jul 9, 2024
https://reviews.llvm.org/D67249 added content hash (see
-fvalidate-ast-input-files-content) using llvm::hash_code (size_t).
The hash value is 32-bit on 32-bit systems, which was unintentional.

Fix llvm#96379: llvm#96136 switched the hash function to xxh3_64bit but did not
update the ContentHash type, leading to mismatch between ASTReader and
ASTWriter.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:modules C++20 modules and Clang Header Modules clang Clang issues not falling into any other category llvm:adt
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants