Skip to content

[SE-0456] Span properties over utf8 views #80116

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Apr 3, 2025

Conversation

glessard
Copy link
Contributor

@glessard glessard commented Mar 19, 2025

Implements most of the String-related parts of SE-0456 not implemented in #78561

Spans vended from the small string representation are disabled at the moment

Addresses the remainder of rdar://137710901

@glessard glessard force-pushed the rdar137710901-utf8view-span-notsmol branch 2 times, most recently from 6129e40 to fed549e Compare March 19, 2025 22:22
@glessard
Copy link
Contributor Author

@swift-ci please smoke test

@glessard glessard force-pushed the rdar137710901-utf8view-span-notsmol branch from fed549e to deee376 Compare March 20, 2025 00:01
@glessard
Copy link
Contributor Author

@swift-ci please test

@glessard glessard force-pushed the rdar137710901-utf8view-span-notsmol branch from deee376 to aab1e19 Compare March 20, 2025 15:53
@glessard glessard marked this pull request as ready for review March 20, 2025 15:54
@glessard glessard requested a review from a team as a code owner March 20, 2025 15:54
@glessard
Copy link
Contributor Author

@swift-ci please test

@glessard
Copy link
Contributor Author

@swift-ci please test macOS platform

@glessard
Copy link
Contributor Author

@swift-ci please test macOS platfom

@glessard
Copy link
Contributor Author

The full macOS test is failing due to swift-testing. Falling back to smoke-test.

@glessard
Copy link
Contributor Author

@swift-ci please smoke test macOS platform

@glessard glessard force-pushed the rdar137710901-utf8view-span-notsmol branch from aab1e19 to f1f8c8f Compare March 22, 2025 00:27
@glessard
Copy link
Contributor Author

@swift-ci please test

@glessard glessard force-pushed the rdar137710901-utf8view-span-notsmol branch from a5b3374 to 94a16f3 Compare March 25, 2025 16:49
@glessard
Copy link
Contributor Author

@swift-ci please test

@glessard glessard requested a review from jmschonfeld March 25, 2025 20:02
@milseman milseman mentioned this pull request Mar 26, 2025
// handle non-UTF8 Objective-C bridging cases here
if !_wholeGuts.isFastUTF8 && _wholeGuts._object.hasObjCBridgeableObject {
let base: String.UTF8View = self._base
let first = base._foreignDistance(from: base.startIndex, to: startIndex)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine and I'm not suggesting changing it now, but I wonder if we can avoid a bunch of -characterAtIndex: calls by making a version of this that operates on the associated buffer directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case the indices are to the UTF16 buffer, so a translation is needed. I'd like @lorentey's eyes on this one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks correct to me, but I think David is right, performance is going to be a real issue. 😫

The point of these span properties is that they provide fast direct access to in-memory storage; doing a full UTF-16 → UTF-8 transcoding of an arbitrarily large prefix in the base string (plus the entire contents of the substring itself) every time span is accessed is violating this goal. We want to encourage developers to think of these properties as cheap enough to call whenever and wherever they need them; loops like the one below should not raise any eyebrows:

for word in string.split(separator: / /) {
  let span = word.span // oops
  ...
}

The one-time transcoding of the base string is often amortized away to near-zero with repeated calls to span, but the two index conversions in the substring's implementation will be repeated from scratch on every iteration. Straightforward-looking loops will become accidentally quadratic.

Unfortunately the only way to avoid this footgun I can see is to not provide this property on this type at all; we should consider doing that!

If we do end up wanting to keep this property, then I think the doc comment needs to loudly explain this major gotcha, and it should recommend strategies to avoid hitting it. (E.g., by first copying the substring into a standalone string instance.)

for word in string.split(separator: / /) {
  let str = String(word)
  let span = str.span // OK
  ...
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that we're steadily reducing the number of cases this will apply to, but yes, we don't currently have a strategy for getting that number to zero

Copy link
Contributor

@Catfish-Man Catfish-Man left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The associated objects approach is miserable, but we knew that. Changes look good

extension String.UTF8View {

@available(SwiftStdlib 6.2, *)
public var span: Span<UTF8.CodeUnit> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're missing an #if !(os(watchOS) && _pointerBitWidth(_32)) around this. 😭

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and we'll need a doc comment, too -- the complexity (amortized over many calls) is particularly relevant.

// handle non-UTF8 Objective-C bridging cases here
if !_wholeGuts.isFastUTF8 && _wholeGuts._object.hasObjCBridgeableObject {
let base: String.UTF8View = self._base
let first = base._foreignDistance(from: base.startIndex, to: startIndex)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation looks correct to me, but I think David is right, performance is going to be a real issue. 😫

The point of these span properties is that they provide fast direct access to in-memory storage; doing a full UTF-16 → UTF-8 transcoding of an arbitrarily large prefix in the base string (plus the entire contents of the substring itself) every time span is accessed is violating this goal. We want to encourage developers to think of these properties as cheap enough to call whenever and wherever they need them; loops like the one below should not raise any eyebrows:

for word in string.split(separator: / /) {
  let span = word.span // oops
  ...
}

The one-time transcoding of the base string is often amortized away to near-zero with repeated calls to span, but the two index conversions in the substring's implementation will be repeated from scratch on every iteration. Straightforward-looking loops will become accidentally quadratic.

Unfortunately the only way to avoid this footgun I can see is to not provide this property on this type at all; we should consider doing that!

If we do end up wanting to keep this property, then I think the doc comment needs to loudly explain this major gotcha, and it should recommend strategies to avoid hitting it. (E.g., by first copying the substring into a standalone string instance.)

for word in string.split(separator: / /) {
  let str = String(word)
  let span = str.span // OK
  ...
}

extension Substring.UTF8View {

@available(SwiftStdlib 6.2, *)
public var span: Span<UTF8.CodeUnit> {
Copy link
Member

@lorentey lorentey Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this particular case it is extremely important for this property to have a doc comment that explains its complexity.

Unlike the other span properties, this one has linear time complexity for bridged strings, and that requires developers to be careful to avoid calling it in a loop. (E.g., most Substring.UTF8View operations in the stdlib (and elsewhere) must not use this property, as they have no way to avoid getting called in a loop.)

private static var associationKey: UnsafeRawPointer {
// We never dereference this, we only use this address as a unique key
unsafe unsafeBitCast(
ObjectIdentifier(_StringGuts.self),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How fast is this metatype lookup in practice? (Elsewhere we sometimes decided to use the address of a dummy symbol instead; but this may in fact be the better pattern.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, on second look, this isn't exactly right -- _StringGuts is a public type, so it is trivial to generate this ID outside the stdlib. It would be better to use the metatype of an internal type here.

(It's not that hard to work out what ID the stdlib uses, but making it too easy sends the wrong signal.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two options are to cast about for a suitable type, or make up a private type and use that

Suggested change
ObjectIdentifier(_StringGuts.self),
ObjectIdentifier(__StringStorage.self),

Copy link
Member

@lorentey lorentey Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__StringStorage is technically internal, but it is the isa of native Swift strings bridged to ObjC, so its identity is still very public. Something like _StringObject would be a better choice, I think -- a pure internal (non-@usableFromInline) or private struct would be even better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go with a new made-up private type, then.

@glessard glessard force-pushed the rdar137710901-utf8view-span-notsmol branch from c772470 to 279c036 Compare March 31, 2025 19:07
@glessard glessard requested a review from lorentey March 31, 2025 19:07
@glessard glessard force-pushed the rdar137710901-utf8view-span-notsmol branch from 5291f1e to 4d23610 Compare April 1, 2025 05:24
@glessard
Copy link
Contributor Author

glessard commented Apr 1, 2025

@swift-ci please test

@glessard
Copy link
Contributor Author

glessard commented Apr 1, 2025

@swift-ci please smoke test macOS platform

@glessard
Copy link
Contributor Author

glessard commented Apr 2, 2025

@swift-ci please test

@glessard
Copy link
Contributor Author

glessard commented Apr 2, 2025

@swift-ci please test macOS platform

@glessard glessard force-pushed the rdar137710901-utf8view-span-notsmol branch from bacebab to 7539366 Compare April 3, 2025 00:54
@glessard
Copy link
Contributor Author

glessard commented Apr 3, 2025

@swift-ci please test

@glessard glessard enabled auto-merge April 3, 2025 00:54
@glessard glessard merged commit aedb869 into swiftlang:main Apr 3, 2025
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants