Description
When using the ADO.NET provider to read a UTF-8 CHAR(n) field containing at least one character outside of the Basic Multilingual Plane (e.g. any emoji), the result will be improperly truncated. As an example, reading a CHAR(1) field containing the character '😊' (code point 0x1F60A) will result in a string value containing only the high surrogate (0xD83D). If this same character is stored in a VARCHAR(1) field, reading it works as expected.
I believe the cause of this issue can be found in GdsStatement.ReadRawValue: after reading the string value from the IXdrReader, that value is truncated to remove the extra characters that were present in the buffer as padding. However, this truncation combines usage of the DbField.CharCount property (the number of Unicode code points stored in the field) with the .NET string.Length property and string.Substring method (which are based on the number of UTF-16 code units), leading to incorrect behavior when a single code point is encoded using multiple code units.
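The mismatch can be reproduced in isolation. The following is a minimal sketch, not the provider's actual code: the class and method names are hypothetical, TruncateByCodeUnits is a simplified stand-in for the pattern described above, and TruncateByCodePoints is one possible way to truncate by code points instead, assuming the padding always follows the real content.

```csharp
using System;

public static class CodePointTruncation
{
    // Simplified stand-in for the problematic pattern (hypothetical name):
    // charCount is measured in Unicode code points, but Length and
    // Substring operate on UTF-16 code units.
    public static string TruncateByCodeUnits(string value, int charCount)
    {
        return value.Length > charCount ? value.Substring(0, charCount) : value;
    }

    // Possible alternative (hypothetical name): advance through the string
    // one code point at a time, treating a surrogate pair as a single unit.
    public static string TruncateByCodePoints(string value, int charCount)
    {
        int index = 0;       // position in UTF-16 code units
        int codePoints = 0;  // code points consumed so far
        while (index < value.Length && codePoints < charCount)
        {
            // IsSurrogatePair returns false at the last index, so a lone
            // trailing surrogate is counted as one unit.
            index += char.IsSurrogatePair(value, index) ? 2 : 1;
            codePoints++;
        }
        return index < value.Length ? value.Substring(0, index) : value;
    }
}
```

With the CHAR(1) example, the decoded value "\U0001F60A" has a Length of 2, so TruncateByCodeUnits(value, 1) keeps only the high surrogate 0xD83D, while TruncateByCodePoints(value, 1) keeps the full surrogate pair. For a value such as "a" followed by three padding spaces, both methods return "a".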