Description
Description
A user contacted me about some data that was fetched incorrectly with PDO_ODBC
, but worked with PDO_IBM
and procedural ODBC.
The following code:
<?php
// Db2i as the example; full dataset available on request.
$user = 'user here';
$password = 'password here';
$rdb = 'DSN here' ;
// for those unaware of the db2i options: CCSID=1208 specifies UTF-8, ALWAYSCALCLEN=1 forces the driver to provide calculate the output length of a column after character conversions, NAM=1 affects naming rules/schema order
$dsn = "DSN={{$rdb}};NAM=1;ALWAYSCALCLEN=1;CCSID=1208";
$schema = 'CSC2023' ;
$table = 'ITEM' ;
$column = 'CODE' ;
$sql = <<< NOWDOC
SELECT DESC
FROM "$schema"."$table"
WHERE $column >= ?
LIMIT 5555 OFFSET 0
NOWDOC;
$args = ["99999"];
$attrOpions = [
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
PDO::ATTR_PERSISTENT => false,
PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC,
];
$db = new PDO("odbc:$dsn", $user, $password, $attrOpions);
$stmt = $db->prepare($sql);
$stmt->execute($args);
while (($row = $stmt->fetch())) {
var_dump($row["DESC"]);
}
$db = odbc_connect($dsn, $user, $password);
$stmt = odbc_prepare($db, $sql);
odbc_execute($stmt, $args);
while (($row = odbc_fetch_array($stmt))) {
var_dump($row["DESC"]);
}
Resulted in this output:
string(1463) "�^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^」Vァョ」W」�」T「"「^ABCDEFGHIJKLMOPR "
string(1463) "☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘ABCDEFGHIJKLMOPR "
But I expected this output instead:
string(1463) "☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘ABCDEFGHIJKLMOPR "
string(1463) "☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘☆株★■▲┐┘ABCDEFGHIJKLMOPR "
My suspicion is this is because a really annoying corner case where:
- The driver is doing character conversions
- The driver is converting from a stateful character encoding (Not like UTF-8 where it's technically stateful but you can just blithely append them and decode from the beginning to the continuation characters. I'm talking like Shift-JIS where shift in/out characters are used and you must know what state you're in before decoding a bunch of characters)
PDO_ODBC
is fetching a long column, soPDO_ODBC
fetches in 256 byte chunks
In this case:
- The Db2i driver does do character conversion from the EBCDIC source to the ASCII/Unicode encoding we want
- The column is in CCSID 5035 (an MBCS EBCDIC encoding that uses shift in/out characters to switch between SBCS/DBCS Japanese encodings)
- The column is over 256 bytes in length
I believe this is because, in a stateful encoding, it defaults to SBCS, and switches to DBCS with a shift characater. If I look at the hex dump of the EBCDIC string in the database, indeed, the string begins with a shift character to switch it into DBCS. There lies the problem - converting characters at the beginning is fine, but converting from the middle chunk is a problem because the shift occurs earlier in the string. Thus, it interprets DBCS data as SBCS and converts that to UTF-8 instead. (This also jacks up the length calculation.)
FWIW, I think PDO_ODBC
isn't at fault here, and is really the victim - the driver failed to keep track of shift state when decoding a chunk, or it should just decode all at once and deliver substrings. (The user is reporting this issue w/ the driver to IBM, so there may be a fix for that driver at least.) However, it does show how uncommon behaviours interacting with each other can cause confusing issues - even if it's technically correct. Character encodings and lengths are annoying; dealing with keeping track of multiple is really annoying. I can imagine there could be other drivers that trigger this behaviour.
It also raises the question of why PDO_ODBC
does this to begin with, which is why I bring this up. The existing behaviour might not be worth keeping if it has "fun" interactions with drivers. In #81688, me and Christoph seemed a little confused why PDO_ODBC
does this to begin with, when procedural ODBC and many other languages' ODBC bindings don't. The performance benefits seems counter-intuitive (more round-trips?) and it introduces some possible corner cases (i.e. this, GH-6716, or GH-9498, which brings up 86188).
As a possible solution that might end up being similar, I wonder if it might be best to keep the bind only small data (i.e. 256 bytes; could that be raised?), and call SQLGetData
with long data, but allocate and fetch the whole thing (with the size provided by the driver) all at once. It'd probably avoid confusing issues with encoding conversion/BLOBs for good. Procedural ODBC just seems to bind the larger size anyways...
FWIW, a unixODBC trace and some printfs I put down, to see what was happening: https://gist.github.com/NattyNarwhal/278aedf8e3c48bb8dcb4ae14cde0b283 - note how when the driver starts misconverting, the length stops adding up. Perhaps that could be used to abort and return an error for a bandaid fix.
PHP Version
PHP 8.3.0-DEV
Operating System
Fedora 37