Skip to content

[StaticDataLayout][PGO] Add profile format for static data layout, and the classes to operate on the profiles. #138170

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 16, 2025
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
167 changes: 167 additions & 0 deletions llvm/include/llvm/ProfileData/DataAccessProf.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,167 @@
//===- DataAccessProf.h - Data access profile format support ---------*- C++
//-*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file contains support to construct and use data access profiles.
//
// For the original RFC of this pass please see
// https://discourse.llvm.org/t/rfc-profile-guided-static-data-partitioning/83744
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_PROFILEDATA_DATAACCESSPROF_H_
#define LLVM_PROFILEDATA_DATAACCESSPROF_H_

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/DenseMapInfoVariant.h"
#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ProfileData/InstrProf.h"
#include "llvm/Support/Allocator.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/StringSaver.h"

#include <cstdint>
#include <variant>

namespace llvm {

namespace data_access_prof {
// The location of data in the source code.
struct DataLocation {
// The filename where the data is located.
StringRef FileName;
// The line number in the source code.
uint32_t Line;
};

// The data access profiles for a symbol.
struct DataAccessProfRecord {
DataAccessProfRecord(uint64_t SymbolID, uint64_t AccessCount,
bool IsStringLiteral)
: SymbolID(SymbolID), AccessCount(AccessCount),
IsStringLiteral(IsStringLiteral) {}

// Represents a data symbol. The semantic comes in two forms: a symbol index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When would the different forms be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantic of this field depends on the IsStringLiteral field below. For a string literal, IsStringLiteral is true and SymbolID has the semantic of 'hash'; otherwise, IsStringLiteral is false and SymbolID represent an index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood that, but my question was more about why in practice some would be string literals and some would be hashes. Might be useful to note this in a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understood that, but my question was more about why in practice some would be string literals and some would be hashes.

This makes sense. Added comment at L55 to explain why two forms are used.

// for symbol name if `IsStringLiteral` is false, or the hash of a string
// content if `IsStringLiteral` is true. For most of the symbolizable static
// data, the mangled symbol names remain stable relative to the source code
// and therefore used to identify symbols across binary releases. String
// literals have unstable name patterns like `.str.N[.llvm.hash]`, so we use
// the content hash instead. This is a required field.
uint64_t SymbolID;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little confusing that SymbolID is a different thing (a type) in the following class. Suggest making these different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. I renamed the type std::variant<StringRef, uint64_t> to SymbolHandle.


// The access count of symbol. Required.
uint64_t AccessCount;

// True iff this is a record for string literal (symbols with name pattern
// `.str.*` in the symbol table). Required.
bool IsStringLiteral;

// The locations of data in the source code. Optional.
llvm::SmallVector<DataLocation, 0> Locations;
};

/// Encapsulates the data access profile data and the methods to operate on it.
/// This class provides profile look-up, serialization and deserialization.
class DataAccessProfData {
public:
// SymbolID is either a string representing symbol name if the symbol has
// stable mangled name relative to source code, or a uint64_t representing the
// content hash of a string literal (with unstable name patterns like
// `.str.N[.llvm.hash]`). The StringRef is owned by the class's saver object.
using SymbolHandle = std::variant<StringRef, uint64_t>;
using StringToIndexMap = llvm::MapVector<StringRef, uint64_t>;

DataAccessProfData() : Saver(Allocator) {}

/// Serialize profile data to the output stream.
/// Storage layout:
/// - Serialized strings.
/// - The encoded hashes.
/// - Records.
Error serialize(ProfOStream &OS) const;

/// Deserialize this class from the given buffer.
Error deserialize(const unsigned char *&Ptr);

/// Returns a pointer of profile record for \p SymbolID, or nullptr if there
/// isn't a record. Internally, this function will canonicalize the symbol
/// name before the lookup.
const DataAccessProfRecord *getProfileRecord(const SymbolHandle SymID) const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the future intended usage but handing out a pointer to a record makes it hard to reason about the lifetime. How can we prevent users from accidentally introducing bugs?

In practice I think the user will query the record for a symbol and then take some action based on the record. So usually 1 (or a small number of) DataAccessProfRecord will be live at a time. If so we can inline the contents (the string for the location etc) to avoid any lifetime issues. This will also allow us to move the optimized implementation out of the public api. Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the look-up API to return a record with owned strings (i.e., of type std::string), and the internal records with referenced strings. Please take another look, thanks!


/// Returns true if \p SymID is seen in profiled binaries and cold.
bool isKnownColdSymbol(const SymbolHandle SymID) const;

/// Methods to set symbolized data access profile. Returns error if duplicated
/// symbol names or content hashes are seen. The user of this class should
/// aggregate counters that correspond to the same symbol name or with the
/// same string literal hash before calling 'set*' methods.
Error setDataAccessProfile(SymbolHandle SymbolID, uint64_t AccessCount);
/// Similar to the method above, for records with \p Locations representing
/// the `filename:line` where this symbol shows up. Note because of linker's
/// merge of identical symbols (e.g., unnamed_addr string literals), one
/// symbol is likely to have multiple locations.
Error setDataAccessProfile(SymbolHandle SymbolID, uint64_t AccessCount,
ArrayRef<DataLocation> Locations);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In actual usage (outside of the existing tests) will we need to accept an ArrayRef<DataLocation> or can it be something like setDataAccessProfile(SymbolHandle, AccessCount, ArrayRef<DILocation>)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The source location of a global variable is recorded in DIGlobalVariable for many cases (e.g., https://gcc.godbolt.org/z/rW3cKdnfP is a small test case), and filename/line are inlined in this DIGlobalVariable class as opposed to carried by DILocation.

That said, it's more general to name them SourceLocation and SourceLocationRef like how we have LineLocation, so took the liberty to choose SourceLocation.

Error addKnownSymbolWithoutSamples(SymbolHandle SymbolID);

/// Returns an iterable StringRef for strings in the order they are added.
/// Each string may be a symbol name or a file name.
auto getStrings() const {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Prefer spelling out the type instead of auto for the return type.

Same for the one below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://google.github.io/styleguide/cppguide.html#Type_deduction also favors spelling out the return type [1], so did it for the return type below.

For this function, updated it to return an ArrayRef of StrToIndexMap and the callsites to use make_first_range accordingly. I'm open to return SmallVector<StringRef> directly in this function since currently each caller of this helper function takes the return value to make a small vector.

  • The return type of llvm::make_first_range(StrToIndexMap.getArrayRef()) looks complex. If I understand correctly, its type depends on the lambda function type and the lambda function type depends on element type..

[1] Around Return type deduction, it has
Use return type deduction (for both functions and lambdas) only if the function body has a very small number of return statements, and very little other code ... Furthermore, use it only if the function or lambda has a very narrow scope, because functions with deduced return types don't define abstraction boundaries: the implementation is the interface. In particular, public functions in header files should almost never have deduced return types.

return llvm::make_first_range(StrToIndexMap.getArrayRef());
}

/// Returns array reference for various internal data structures.
auto getRecords() const { return Records.getArrayRef(); }
ArrayRef<StringRef> getKnownColdSymbols() const {
return KnownColdSymbols.getArrayRef();
}
ArrayRef<uint64_t> getKnownColdHashes() const {
return KnownColdHashes.getArrayRef();
}

private:
/// Serialize the symbol strings into the output stream.
Error serializeSymbolsAndFilenames(ProfOStream &OS) const;

/// Deserialize the symbol strings from \p Ptr and increment \p Ptr to the
/// start of the next payload.
Error deserializeSymbolsAndFilenames(const unsigned char *&Ptr,
const uint64_t NumSampledSymbols,
const uint64_t NumColdKnownSymbols);

/// Decode the records and increment \p Ptr to the start of the next payload.
Error deserializeRecords(const unsigned char *&Ptr);

/// A helper function to compute a storage index for \p SymbolID.
uint64_t getEncodedIndex(const SymbolHandle SymbolID) const;

// Keeps owned copies of the input strings.
// NOTE: Keep `Saver` initialized before other class members that reference
// its string copies and destructed after they are destructed.
llvm::BumpPtrAllocator Allocator;
llvm::UniqueStringSaver Saver;

// `Records` stores the records.
MapVector<SymbolHandle, DataAccessProfRecord> Records;

// Use MapVector to keep input order of strings for serialization and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this comment to the typedef?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

// deserialization.
StringToIndexMap StrToIndexMap;
llvm::SetVector<uint64_t> KnownColdHashes;
llvm::SetVector<StringRef> KnownColdSymbols;
};

} // namespace data_access_prof
} // namespace llvm

#endif // LLVM_PROFILEDATA_DATAACCESSPROF_H_
17 changes: 12 additions & 5 deletions llvm/include/llvm/ProfileData/InstrProf.h
Original file line number Diff line number Diff line change
Expand Up @@ -357,6 +357,13 @@ void createPGONameMetadata(GlobalObject &GO, StringRef PGOName);
/// the duplicated profile variables for Comdat functions.
bool needsComdatForCounter(const GlobalObject &GV, const Module &M);

/// \c NameStrings is a string composed of one or more possibly encoded
/// sub-strings. The substrings are separated by `\01` (returned by
/// InstrProf.h:getInstrProfNameSeparator). This method decodes the string and
/// calls `NameCallback` for each substring.
Error readAndDecodeStrings(StringRef NameStrings,
std::function<Error(StringRef)> NameCallback);

/// An enum describing the attributes of an instrumented profile.
enum class InstrProfKind {
Unknown = 0x0,
Expand Down Expand Up @@ -493,6 +500,11 @@ class InstrProfSymtab {
public:
using AddrHashMap = std::vector<std::pair<uint64_t, uint64_t>>;

// Returns the canonial name of the given PGOName. In a canonical name, all
// suffixes that begins with "." except ".__uniq." are stripped.
// FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
static StringRef getCanonicalName(StringRef PGOName);

private:
using AddrIntervalMap =
IntervalMap<uint64_t, uint64_t, 4, IntervalMapHalfOpenInfo<uint64_t>>;
Expand Down Expand Up @@ -528,11 +540,6 @@ class InstrProfSymtab {

static StringRef getExternalSymbol() { return "** External Symbol **"; }

// Returns the canonial name of the given PGOName. In a canonical name, all
// suffixes that begins with "." except ".__uniq." are stripped.
// FIXME: Unify this with `FunctionSamples::getCanonicalFnName`.
static StringRef getCanonicalName(StringRef PGOName);

// Add the function into the symbol table, by creating the following
// map entries:
// name-set = {PGOFuncName} union {getCanonicalName(PGOFuncName)}
Expand Down
1 change: 1 addition & 0 deletions llvm/lib/ProfileData/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
add_llvm_component_library(LLVMProfileData
DataAccessProf.cpp
GCOV.cpp
IndexedMemProfData.cpp
InstrProf.cpp
Expand Down
Loading