Skip to content

[StaticDataLayout][PGO] Add profile format for static data layout, and the classes to operate on the profiles. #138170

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 16, 2025
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions llvm/include/llvm/ADT/MapVector.h
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,8 @@ class MapVector {
return std::move(Vector);
}

ArrayRef<value_type> getArrayRef() const { return Vector; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably put this into a separate prep patch with a test in llvm/unittests/ADT/MapVectorTest.cpp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in #138726


size_type size() const { return Vector.size(); }

/// Grow the MapVector so that it can contain at least \p NumEntries items
Expand Down
156 changes: 156 additions & 0 deletions llvm/include/llvm/ProfileData/DataAccessProf.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,156 @@
//===- DataAccessProf.h - Data access profile format support ---------*- C++
//-*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file contains support to construct and use data access profiles.
//
// For the original RFC of this pass please see
// https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_PROFILEDATA_DATAACCESSPROF_H_
#define LLVM_PROFILEDATA_DATAACCESSPROF_H_

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/DenseMapInfoVariant.h"
#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ProfileData/InstrProf.h"
#include "llvm/Support/Allocator.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/StringSaver.h"

#include <cstdint>
#include <variant>

namespace llvm {

namespace data_access_prof {
// The location of data in the source code.
struct DataLocation {
// The filename where the data is located.
StringRef FileName;
// The line number in the source code.
uint32_t Line;
};

// The data access profiles for a symbol.
struct DataAccessProfRecord {
// Represents a data symbol. The semantic comes in two forms: a symbol index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would the different forms be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantic of this field depends on the IsStringLiteral field below. For a string literal, IsStringLiteral is true and SymbolID has the semantic of 'hash'; otherwise, IsStringLiteral is false and SymbolID represent an index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood that, but my question was more about why in practice some would be string literals and some would be hashes. Might be useful to note this in a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood that, but my question was more about why in practice some would be string literals and some would be hashes.

This makes sense. Added comment at L55 to explain why two forms are used.

// for symbol name if `IsStringLiteral` is false, or the hash of a string
// content if `IsStringLiteral` is true. Required.
uint64_t SymbolID;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little confusing that SymbolID is a different thing (a type) in the following class. Suggest making these different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I renamed the type std::variant<StringRef, uint64_t> to SymbolHandle.


// The access count of symbol. Required.
uint64_t AccessCount;

// True iff this is a record for string literal (symbols with name pattern
// `.str.*` in the symbol table). Required.
bool IsStringLiteral;

// The locations of data in the source code. Optional.
llvm::SmallVector<DataLocation> Locations;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, we can have many instances of DataAccessProfRecord. Do we know the most common number of entries here? If we don't, we might want to start out with llvm::SmallVector<DataLocation, 0>, which still saves space relative to std::vector<...>. I'm afraid that the inlined storage goes wasted (and multiplied by the number of DataAccessProfRecord entries) if the real usage is way off the number of inlined entries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know the most common number of entries here? If we don't, we might want to start out with llvm::SmallVector<DataLocation, 0>

I'd expect that many records have small single-digit number of locations, and definitely within the range of unsigned, which is type for SmallVector<T, 0> size and capacity per https://llvm.org/docs/ProgrammersManual.html#llvm-adt-smallvector-h.

Done.

};

/// Encapsulates the data access profile data and the methods to operate on it.
/// This class provides profile look-up, serialization and deserialization.
class DataAccessProfData {
public:
// SymbolID is either a string representing symbol name, or a uint64_t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to question for above class - what determines which type is used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type is determined by the symbol itself. This variant is uint64_t for string literals (which has the unstable name pattern like .str.<N>[.llvm.<hash>]), and the canonicalized symbol name for those symbols with stable mangled names relative to source code. Updated the comment. PTAL.

// representing the content hash of a string literal.
using SymbolID = std::variant<StringRef, uint64_t>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the use of StringRef, is there a risk of use-after-free error (if the caller passes in a Ref to short lived string)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question. This class keeps owned copy of strings (inside saver member of type llvm::UniqueStringSaver) and references the owned copies as opposed to caller's copy. FWIW, I moved saver to the beginning of class member list to make sure it's destructed after other class members that references the strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably add a comment here and in the below type that the StringRef is owned by this class's saver object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also fixed a potential use-after-free bug in L69 with std::tie(Key, RecordID) = *saveStringToMap(StrToIndexMap, Saver, *CanonicalName); in the updated change. Basically Key should use the owned copy.

using StringToIndexMap = llvm::MapVector<StringRef, uint64_t>;

DataAccessProfData() : saver(Allocator) {}

/// Serialize profile data to the output stream.
/// Storage layout:
/// - Serialized strings.
/// - The encoded hashes.
/// - Records.
Error serialize(ProfOStream &OS) const;

/// Deserialize this class from the given buffer.
Error deserialize(const unsigned char *&Ptr);

/// Returns a pointer of profile record for \p SymbolID, or nullptr if there
/// isn't a record. Internally, this function will canonicalize the symbol
/// name before the lookup.
const DataAccessProfRecord *getProfileRecord(const SymbolID SymID) const;

/// Returns true if \p SymID is seen in profiled binaries and cold.
bool isKnownColdSymbol(const SymbolID SymID) const;

/// Methods to add symbolized data access profile. Returns error if duplicated
/// symbol names or content hashes are seen. The user of this class should
/// aggregate counters that corresponds to the same symbol name or with the
/// same string literal hash before calling 'add*' methods.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'add*' and 'set*' methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Error addSymbolizedDataAccessProfile(SymbolID SymbolID, uint64_t AccessCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shorten the name --> addDataAccessProfile

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the semantics of the method, may be change the name to 'setDataAccessProfile'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Error addSymbolizedDataAccessProfile(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

document location parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

SymbolID SymbolID, uint64_t AccessCount,
const llvm::SmallVector<DataLocation> &Locations);
Error addKnownSymbolWithoutSamples(SymbolID SymbolID);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it the same as setting the symbol access count to zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, and the idea is similar to SamplePGO's function symbol list.


/// Returns a iterable StringRef for strings in the order they are added.
auto getStrings() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Prefer spelling out the type instead of auto for the return type.

Same for the one below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://google.github.io/styleguide/cppguide.html#Type_deduction also favors spelling out the return type [1], so did it for the return type below.

For this function, updated it to return an ArrayRef of StrToIndexMap and the callsites to use make_first_range accordingly. I'm open to return SmallVector<StringRef> directly in this function since currently each caller of this helper function takes the return value to make a small vector.

  • The return type of llvm::make_first_range(StrToIndexMap.getArrayRef()) looks complex. If I understand correctly, its type depends on the lambda function type and the lambda function type depends on element type..

[1] Around Return type deduction, it has
Use return type deduction (for both functions and lambdas) only if the function body has a very small number of return statements, and very little other code ... Furthermore, use it only if the function or lambda has a very narrow scope, because functions with deduced return types don't define abstraction boundaries: the implementation is the interface. In particular, public functions in header files should almost never have deduced return types.

ArrayRef<std::pair<StringRef, uint64_t>> RefSymbolNames(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use SymbolID instead of the std::pair for consistency and simplicity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated change used new MapVector::getArrayRef as suggested, and this line is simplified away.

StrToIndexMap.begin(), StrToIndexMap.end());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use your new MapVector::getArrayRef here instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! done.

return llvm::make_first_range(RefSymbolNames);
}

/// Returns array reference for various internal data structures.
inline ArrayRef<
std::pair<std::variant<StringRef, uint64_t>, DataAccessProfRecord>>
getRecords() const {
return Records.getArrayRef();
}
inline ArrayRef<StringRef> getKnownColdSymbols() const {
return KnownColdSymbols.getArrayRef();
}
inline ArrayRef<uint64_t> getKnownColdHashes() const {
return KnownColdHashes.getArrayRef();
}

private:
/// Serialize the symbol strings into the output stream.
Error serializeStrings(ProfOStream &OS) const;

/// Deserialize the symbol strings from \p Ptr and increment \p Ptr to the
/// start of the next payload.
Error deserializeStrings(const unsigned char *&Ptr,
const uint64_t NumSampledSymbols,
uint64_t NumColdKnownSymbols);

/// Decode the records and increment \p Ptr to the start of the next payload.
Error deserializeRecords(const unsigned char *&Ptr);

/// A helper function to compute a storage index for \p SymbolID.
uint64_t getEncodedIndex(const SymbolID SymbolID) const;

// `Records` stores the records and `SymbolToRecordIndex` maps a symbol ID to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see SymbolToRecordIndex defined anywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoops, SymbolToRecordIndex refers to deleted code. Removed it.

Originally I had

DenseMap<SymbolID, uint64_t> SymbolToRecordIndex;
SmallVector<DataAccessProfRecord> Records;

as opposed to
MapVector<SymbolID, DataAccessProfRecord>, in order to get a ArrayRef of the records.

Later I realized that it's simpler to use the latter and add MapVector::getArrayRef. Strictly speaking MapVector::getArrayRef returns ArrayRef<key, value> rather than ArrayRef<value>, but the implementation is overall simpler to have MapVector.

// its record index.
MapVector<SymbolID, DataAccessProfRecord> Records;

// Use MapVector to keep input order of strings for serialization and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this comment to the typedef?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

// deserialization.
StringToIndexMap StrToIndexMap;
llvm::SetVector<uint64_t> KnownColdHashes;
llvm::SetVector<StringRef> KnownColdSymbols;
// Keeps owned copies of the input strings.
llvm::BumpPtrAllocator Allocator;
llvm::UniqueStringSaver saver;
};

} // namespace data_access_prof
} // namespace llvm

#endif // LLVM_PROFILEDATA_DATAACCESSPROF_H_
16 changes: 11 additions & 5 deletions llvm/include/llvm/ProfileData/InstrProf.h
Original file line number Diff line number Diff line change
Expand Up @@ -357,6 +357,12 @@ void createPGONameMetadata(GlobalObject &GO, StringRef PGOName);
/// the duplicated profile variables for Comdat functions.
bool needsComdatForCounter(const GlobalObject &GV, const Module &M);

/// \c NameStrings is a string composed of one of more possibly encoded
/// sub-strings. The substrings are separated by 0 or more zero bytes. This
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe "null" bytes? the "0" and "zero" are confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done by mentioning the substrings are separated by \01, which is returned by getInstrProfNameSeparator.

/// method decodes the string and calls `NameCallback` for each substring.
Error readAndDecodeStrings(StringRef NameStrings,
std::function<Error(StringRef)> NameCallback);

/// An enum describing the attributes of an instrumented profile.
enum class InstrProfKind {
Unknown = 0x0,
Expand Down Expand Up @@ -493,6 +499,11 @@ class InstrProfSymtab {
public:
using AddrHashMap = std::vector<std::pair<uint64_t, uint64_t>>;

// Returns the canonial name of the given PGOName. In a canonical name, all
// suffixes that begins with "." except ".__uniq." are stripped.
// FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
static StringRef getCanonicalName(StringRef PGOName);

private:
using AddrIntervalMap =
IntervalMap<uint64_t, uint64_t, 4, IntervalMapHalfOpenInfo<uint64_t>>;
Expand Down Expand Up @@ -528,11 +539,6 @@ class InstrProfSymtab {

static StringRef getExternalSymbol() { return "** External Symbol **"; }

// Returns the canonial name of the given PGOName. In a canonical name, all
// suffixes that begins with "." except ".__uniq." are stripped.
// FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
static StringRef getCanonicalName(StringRef PGOName);

// Add the function into the symbol table, by creating the following
// map entries:
// name-set = {PGOFuncName} union {getCanonicalName(PGOFuncName)}
Expand Down
1 change: 1 addition & 0 deletions llvm/lib/ProfileData/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
add_llvm_component_library(LLVMProfileData
DataAccessProf.cpp
GCOV.cpp
IndexedMemProfData.cpp
InstrProf.cpp
Expand Down
Loading