Skip to content

[clang] Introduce "binary" StringLiteral for #embed data #127629

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 20, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 12 additions & 3 deletions clang/include/clang/AST/Expr.h
Original file line number Diff line number Diff line change
Expand Up @@ -1756,7 +1756,14 @@ enum class StringLiteralKind {
UTF8,
UTF16,
UTF32,
Unevaluated
Unevaluated,
// Binary kind of string literal is used for the data coming via #embed
// directive. File's binary contents is transformed to a special kind of
// string literal that in some cases may be used directly as an initializer
// and some features of classic string literals are not applicable to this
// kind of a string literal, for example finding a particular byte's source
// location for better diagnosing.
Binary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment explaining this is for embed?

I'm sorry it took me a while to understand how this patch works.
(The reason is that this allows us to not "cast" to char in getCodeUnitS() - which is only used in C23 mode)

Maybe also add a comment in getCodeUnitS and/or CheckC23ConstexprInitStringLiteral

Copy link
Contributor Author

@Fznamznon Fznamznon Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some comments in 8ff7d03 . Do you think it turned out helpful?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, Thanks!

};

/// StringLiteral - This represents a string literal expression, e.g. "foo"
Expand Down Expand Up @@ -1888,6 +1895,8 @@ class StringLiteral final
int64_t getCodeUnitS(size_t I, uint64_t BitWidth) const {
int64_t V = getCodeUnit(I);
if (isOrdinary() || isWide()) {
// Ordinary and wide string literals have types that can be signed.
// It is important for checking C23 constexpr initializers.
unsigned Width = getCharByteWidth() * BitWidth;
llvm::APInt AInt(Width, (uint64_t)V);
V = AInt.getSExtValue();
Expand Down Expand Up @@ -5029,9 +5038,9 @@ class EmbedExpr final : public Expr {
assert(EExpr && CurOffset != ULLONG_MAX &&
"trying to dereference an invalid iterator");
IntegerLiteral *N = EExpr->FakeChildNode;
StringRef DataRef = EExpr->Data->BinaryData->getBytes();
N->setValue(*EExpr->Ctx,
llvm::APInt(N->getValue().getBitWidth(), DataRef[CurOffset],
llvm::APInt(N->getValue().getBitWidth(),
EExpr->Data->BinaryData->getCodeUnit(CurOffset),
Comment on lines +5042 to +5043
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change that line (and nothing else) does it still works?
I have a feeling this is enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish it did. This particular line helps for generic case of #embed inside of an initializer list. The rest that this patch is adding is for #embed "fast path" where we simply put StringLiteral to the initializer list instead of EmbedExpr when a single #embed is used to initialize char array.
When we do that, the iterators of EmbedExpr won't be in use and the fail from #119256 is still in place.

N->getType()->isSignedIntegerType()));
// We want to return a reference to the fake child node in the
// EmbedExpr, not the local variable N.
Expand Down
7 changes: 7 additions & 0 deletions clang/lib/AST/Expr.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1104,6 +1104,7 @@ unsigned StringLiteral::mapCharByteWidth(TargetInfo const &Target,
switch (SK) {
case StringLiteralKind::Ordinary:
case StringLiteralKind::UTF8:
case StringLiteralKind::Binary:
CharByteWidth = Target.getCharWidth();
break;
case StringLiteralKind::Wide:
Expand Down Expand Up @@ -1216,6 +1217,7 @@ void StringLiteral::outputString(raw_ostream &OS) const {
switch (getKind()) {
case StringLiteralKind::Unevaluated:
case StringLiteralKind::Ordinary:
case StringLiteralKind::Binary:
break; // no prefix.
case StringLiteralKind::Wide:
OS << 'L';
Expand Down Expand Up @@ -1332,6 +1334,11 @@ StringLiteral::getLocationOfByte(unsigned ByteNo, const SourceManager &SM,
const LangOptions &Features,
const TargetInfo &Target, unsigned *StartToken,
unsigned *StartTokenByteOffset) const {
// No source location of bytes for binary literals since they don't come from
// source.
if (getKind() == StringLiteralKind::Binary)
return getStrTokenLoc(0);

assert((getKind() == StringLiteralKind::Ordinary ||
getKind() == StringLiteralKind::UTF8 ||
getKind() == StringLiteralKind::Unevaluated) &&
Expand Down
2 changes: 1 addition & 1 deletion clang/lib/Parse/ParseInit.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -445,7 +445,7 @@ ExprResult Parser::createEmbedExpr() {
Context.MakeIntValue(Str.size(), Context.getSizeType());
QualType ArrayTy = Context.getConstantArrayType(
Ty, ArraySize, nullptr, ArraySizeModifier::Normal, 0);
return StringLiteral::Create(Context, Str, StringLiteralKind::Ordinary,
return StringLiteral::Create(Context, Str, StringLiteralKind::Binary,
false, ArrayTy, StartLoc);
};

Expand Down
1 change: 1 addition & 0 deletions clang/lib/Sema/SemaInit.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -106,6 +106,7 @@ static StringInitFailureKind IsStringInit(Expr *Init, const ArrayType *AT,
return SIF_None;
[[fallthrough]];
case StringLiteralKind::Ordinary:
case StringLiteralKind::Binary:
// char array can be initialized with a narrow string.
// Only allow char x[] = "foo"; not char x[] = L"foo";
if (ElemTy->isCharType())
Expand Down
21 changes: 21 additions & 0 deletions clang/test/Preprocessor/embed_constexpr.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
// RUN: %clang_cc1 %s -fsyntax-only --embed-dir=%S/Inputs -verify -std=c23

static constexpr unsigned char data[] = {
#embed "big_char.txt"
};

static constexpr char data1[] = {
#embed "big_char.txt" // expected-error {{constexpr initializer evaluates to 255 which is not exactly representable in type 'const char'}}
};

static constexpr int data2[] = {
#embed "big_char.txt"
};

static constexpr unsigned data3[] = {
#embed "big_char.txt" suffix(, -1) // expected-error {{constexpr initializer evaluates to -1 which is not exactly representable in type 'const unsigned int'}}
};

static constexpr int data4[] = {
#embed "big_char.txt" suffix(, -1)
};