Skip to content

Drivers can provide PDO_ODBC incorrectly converted data for long columns with stateful encodings #10733

Open
@NattyNarwhal

Description

@NattyNarwhal

Description

A user contacted me about some data that was fetched incorrectly with PDO_ODBC, but worked with PDO_IBM and procedural ODBC.

The following code:

<?php
// Db2i as the example; full dataset available on request.
$user = 'user here';
$password = 'password here';
$rdb = 'DSN here' ;
// for those unaware of the db2i options: CCSID=1208 specifies UTF-8, ALWAYSCALCLEN=1 forces the driver to provide calculate the output length of a column after character conversions, NAM=1 affects naming rules/schema order
$dsn = "DSN={{$rdb}};NAM=1;ALWAYSCALCLEN=1;CCSID=1208";

$schema = 'CSC2023' ;
$table = 'ITEM' ;
$column = 'CODE' ;
$sql = <<< NOWDOC
SELECT DESC
 FROM "$schema"."$table"
 WHERE $column >= ?
 LIMIT 5555 OFFSET 0
NOWDOC;
$args = ["99999"];

$attrOpions = [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    PDO::ATTR_PERSISTENT => false,
    PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
];
$db = new PDO("odbc:$dsn", $user, $password, $attrOpions);

$stmt = $db->prepare($sql);
$stmt->execute($args);
while (($row = $stmt->fetch())) {
        var_dump($row["DESC"]);
}

$db = odbc_connect($dsn, $user, $password);

$stmt = odbc_prepare($db, $sql);
odbc_execute($stmt, $args);
while (($row = odbc_fetch_array($stmt))) {
        var_dump($row["DESC"]);
}

Resulted in this output:

string(1463) "�^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^ABCDEFGHIJKLMOPR                                                                                                                            "
string(1463) "☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘ABCDEFGHIJKLMOPR                                                                                                                            "

But I expected this output instead:

string(1463) "☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘ABCDEFGHIJKLMOPR                                                                                                                            "
string(1463) "☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘ABCDEFGHIJKLMOPR                                                                                                                            "

My suspicion is this is because a really annoying corner case where:

  • The driver is doing character conversions
  • The driver is converting from a stateful character encoding (Not like UTF-8 where it's technically stateful but you can just blithely append them and decode from the beginning to the continuation characters. I'm talking like Shift-JIS where shift in/out characters are used and you must know what state you're in before decoding a bunch of characters)
  • PDO_ODBC is fetching a long column, so PDO_ODBC fetches in 256 byte chunks

In this case:

  • The Db2i driver does do character conversion from the EBCDIC source to the ASCII/Unicode encoding we want
  • The column is in CCSID 5035 (an MBCS EBCDIC encoding that uses shift in/out characters to switch between SBCS/DBCS Japanese encodings)
  • The column is over 256 bytes in length

I believe this is because, in a stateful encoding, it defaults to SBCS, and switches to DBCS with a shift characater. If I look at the hex dump of the EBCDIC string in the database, indeed, the string begins with a shift character to switch it into DBCS. There lies the problem - converting characters at the beginning is fine, but converting from the middle chunk is a problem because the shift occurs earlier in the string. Thus, it interprets DBCS data as SBCS and converts that to UTF-8 instead. (This also jacks up the length calculation.)

FWIW, I think PDO_ODBC isn't at fault here, and is really the victim - the driver failed to keep track of shift state when decoding a chunk, or it should just decode all at once and deliver substrings. (The user is reporting this issue w/ the driver to IBM, so there may be a fix for that driver at least.) However, it does show how uncommon behaviours interacting with each other can cause confusing issues - even if it's technically correct. Character encodings and lengths are annoying; dealing with keeping track of multiple is really annoying. I can imagine there could be other drivers that trigger this behaviour.

It also raises the question of why PDO_ODBC does this to begin with, which is why I bring this up. The existing behaviour might not be worth keeping if it has "fun" interactions with drivers. In #81688, me and Christoph seemed a little confused why PDO_ODBC does this to begin with, when procedural ODBC and many other languages' ODBC bindings don't. The performance benefits seems counter-intuitive (more round-trips?) and it introduces some possible corner cases (i.e. this, GH-6716, or GH-9498, which brings up 86188).

As a possible solution that might end up being similar, I wonder if it might be best to keep the bind only small data (i.e. 256 bytes; could that be raised?), and call SQLGetData with long data, but allocate and fetch the whole thing (with the size provided by the driver) all at once. It'd probably avoid confusing issues with encoding conversion/BLOBs for good. Procedural ODBC just seems to bind the larger size anyways...

FWIW, a unixODBC trace and some printfs I put down, to see what was happening: https://gist.github.com/NattyNarwhal/278aedf8e3c48bb8dcb4ae14cde0b283 - note how when the driver starts misconverting, the length stops adding up. Perhaps that could be used to abort and return an error for a bandaid fix.

PHP Version

PHP 8.3.0-DEV

Operating System

Fedora 37

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions