-
Notifications
You must be signed in to change notification settings - Fork 10.5k
[SE-0456] Span properties over utf8 views #80116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SE-0456] Span properties over utf8 views #80116
Conversation
6129e40
to
fed549e
Compare
@swift-ci please smoke test |
fed549e
to
deee376
Compare
@swift-ci please test |
deee376
to
aab1e19
Compare
@swift-ci please test |
@swift-ci please test macOS platform |
@swift-ci please test macOS platfom |
The full macOS test is failing due to swift-testing. Falling back to smoke-test. |
@swift-ci please smoke test macOS platform |
aab1e19
to
f1f8c8f
Compare
@swift-ci please test |
a5b3374
to
94a16f3
Compare
@swift-ci please test |
// handle non-UTF8 Objective-C bridging cases here | ||
if !_wholeGuts.isFastUTF8 && _wholeGuts._object.hasObjCBridgeableObject { | ||
let base: String.UTF8View = self._base | ||
let first = base._foreignDistance(from: base.startIndex, to: startIndex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine and I'm not suggesting changing it now, but I wonder if we can avoid a bunch of -characterAtIndex: calls by making a version of this that operates on the associated buffer directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case the indices are to the UTF16 buffer, so a translation is needed. I'd like @lorentey's eyes on this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks correct to me, but I think David is right, performance is going to be a real issue. 😫
The point of these span
properties is that they provide fast direct access to in-memory storage; doing a full UTF-16 → UTF-8 transcoding of an arbitrarily large prefix in the base string (plus the entire contents of the substring itself) every time span
is accessed is violating this goal. We want to encourage developers to think of these properties as cheap enough to call whenever and wherever they need them; loops like the one below should not raise any eyebrows:
for word in string.split(separator: / /) {
let span = word.span // oops
...
}
The one-time transcoding of the base string is often amortized away to near-zero with repeated calls to span
, but the two index conversions in the substring's implementation will be repeated from scratch on every iteration. Straightforward-looking loops will become accidentally quadratic.
Unfortunately the only way to avoid this footgun I can see is to not provide this property on this type at all; we should consider doing that!
If we do end up wanting to keep this property, then I think the doc comment needs to loudly explain this major gotcha, and it should recommend strategies to avoid hitting it. (E.g., by first copying the substring into a standalone string instance.)
for word in string.split(separator: / /) {
let str = String(word)
let span = str.span // OK
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's worth noting that we're steadily reducing the number of cases this will apply to, but yes, we don't currently have a strategy for getting that number to zero
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The associated objects approach is miserable, but we knew that. Changes look good
extension String.UTF8View { | ||
|
||
@available(SwiftStdlib 6.2, *) | ||
public var span: Span<UTF8.CodeUnit> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're missing an #if !(os(watchOS) && _pointerBitWidth(_32))
around this. 😭
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and we'll need a doc comment, too -- the complexity (amortized over many calls) is particularly relevant.
// handle non-UTF8 Objective-C bridging cases here | ||
if !_wholeGuts.isFastUTF8 && _wholeGuts._object.hasObjCBridgeableObject { | ||
let base: String.UTF8View = self._base | ||
let first = base._foreignDistance(from: base.startIndex, to: startIndex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks correct to me, but I think David is right, performance is going to be a real issue. 😫
The point of these span
properties is that they provide fast direct access to in-memory storage; doing a full UTF-16 → UTF-8 transcoding of an arbitrarily large prefix in the base string (plus the entire contents of the substring itself) every time span
is accessed is violating this goal. We want to encourage developers to think of these properties as cheap enough to call whenever and wherever they need them; loops like the one below should not raise any eyebrows:
for word in string.split(separator: / /) {
let span = word.span // oops
...
}
The one-time transcoding of the base string is often amortized away to near-zero with repeated calls to span
, but the two index conversions in the substring's implementation will be repeated from scratch on every iteration. Straightforward-looking loops will become accidentally quadratic.
Unfortunately the only way to avoid this footgun I can see is to not provide this property on this type at all; we should consider doing that!
If we do end up wanting to keep this property, then I think the doc comment needs to loudly explain this major gotcha, and it should recommend strategies to avoid hitting it. (E.g., by first copying the substring into a standalone string instance.)
for word in string.split(separator: / /) {
let str = String(word)
let span = str.span // OK
...
}
extension Substring.UTF8View { | ||
|
||
@available(SwiftStdlib 6.2, *) | ||
public var span: Span<UTF8.CodeUnit> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this particular case it is extremely important for this property to have a doc comment that explains its complexity.
Unlike the other span
properties, this one has linear time complexity for bridged strings, and that requires developers to be careful to avoid calling it in a loop. (E.g., most Substring.UTF8View operations in the stdlib (and elsewhere) must not use this property, as they have no way to avoid getting called in a loop.)
stdlib/public/core/StringGuts.swift
Outdated
private static var associationKey: UnsafeRawPointer { | ||
// We never dereference this, we only use this address as a unique key | ||
unsafe unsafeBitCast( | ||
ObjectIdentifier(_StringGuts.self), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How fast is this metatype lookup in practice? (Elsewhere we sometimes decided to use the address of a dummy symbol instead; but this may in fact be the better pattern.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, on second look, this isn't exactly right -- _StringGuts is a public type, so it is trivial to generate this ID outside the stdlib. It would be better to use the metatype of an internal type here.
(It's not that hard to work out what ID the stdlib uses, but making it too easy sends the wrong signal.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two options are to cast about for a suitable type, or make up a private type and use that
ObjectIdentifier(_StringGuts.self), | |
ObjectIdentifier(__StringStorage.self), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__StringStorage
is technically internal, but it is the isa of native Swift strings bridged to ObjC, so its identity is still very public. Something like _StringObject
would be a better choice, I think -- a pure internal (non-@usableFromInline
) or private struct would be even better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll go with a new made-up private type, then.
- move to a new file - create a new `NSString` object for each new test
c772470
to
279c036
Compare
5291f1e
to
4d23610
Compare
@swift-ci please test |
@swift-ci please smoke test macOS platform |
@swift-ci please test |
@swift-ci please test macOS platform |
Co-authored-by: Karoy Lorentey <[email protected]>
- with tweaks in associated object code
bacebab
to
7539366
Compare
@swift-ci please test |
Implements most of the
String
-related parts of SE-0456 not implemented in #78561Span
s vended from the small string representation are disabled at the momentAddresses the remainder of rdar://137710901