Improper truncation when reading from UTF-8 CHAR(n) fields containing characters outside of the Basic Multilingual Plane #1213

Closed
@YetNothingThunders

When using the ADO.NET provider to read a UTF-8 CHAR(n) field containing at least one character outside of the Basic Multilingual Plane (e.g. any emoji), the result will be improperly truncated. As an example, reading a CHAR(1) field containing the character '😊' (code point 0x1F60A) will result in a string value containing only the high surrogate (0xD83D). If this same character is stored in a VARCHAR(1) field, reading it works as expected.
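The symptom can be reproduced without a database. The following is a minimal sketch (the class name `SurrogateDemo` is made up for illustration) showing that '😊' occupies two UTF-16 code units in a .NET string, so cutting it at index 1 leaves only the high surrogate:

```csharp
using System;

class SurrogateDemo
{
    static void Main()
    {
        // One Unicode code point (U+1F60A), stored as a surrogate pair.
        string s = "\U0001F60A";
        Console.WriteLine(s.Length); // 2 (UTF-16 code units, not code points)

        // Truncating by code units splits the pair.
        string truncated = s.Substring(0, 1);
        Console.WriteLine(((int)truncated[0]).ToString("X4")); // D83D
        Console.WriteLine(char.IsHighSurrogate(truncated[0])); // True
    }
}
```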

I believe the cause of this issue can be found in GdsStatement.ReadRawValue:

    var s = xdr.ReadString(innerCharset, field.Length);
    if ((field.Length % field.Charset.BytesPerCharacter) == 0 &&
        s.Length > field.CharCount)
    {
        return s.Substring(0, field.CharCount);
    }

After reading the string value from the IXdrReader, that value is truncated to remove the extra characters that were present in the buffer as padding. However, this truncation combines usage of the DbField.CharCount property (the number of Unicode code points stored in the field) with the .NET string.Length property and string.Substring method (which are based on the number of UTF-16 code units), leading to incorrect behavior when a single code point is encoded using multiple code units.
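One possible direction, sketched here under assumptions rather than as the provider's actual fix, is to truncate by code points instead of code units. `TruncateToCodePoints` is a hypothetical helper, not an existing provider method; it walks the string and advances two code units whenever it meets a surrogate pair:

```csharp
using System;

static class CodePointTruncation
{
    // Hypothetical helper: keeps at most maxCodePoints Unicode code points,
    // never splitting a surrogate pair.
    public static string TruncateToCodePoints(string s, int maxCodePoints)
    {
        int index = 0; // position in UTF-16 code units
        int count = 0; // code points consumed so far
        while (index < s.Length && count < maxCodePoints)
        {
            // A surrogate pair is one code point but two code units.
            index += char.IsSurrogatePair(s, index) ? 2 : 1;
            count++;
        }
        return index < s.Length ? s.Substring(0, index) : s;
    }

    static void Main()
    {
        // CHAR(1) holding an emoji: the whole pair is kept.
        Console.WriteLine(TruncateToCodePoints("\U0001F60A", 1).Length); // 2

        // CHAR(1) holding 'a' followed by padding spaces: padding is dropped.
        Console.WriteLine(TruncateToCodePoints("a   ", 1)); // a
    }
}
```

In the snippet from `ReadRawValue`, the `s.Length > field.CharCount` check would also have to count code points for the condition and the truncation to agree; that part is left out of this sketch.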
