Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# content of pandas/tests/io/parser/test_example.py
from io import StringIO

import numpy as np
import pytest

from pandas._libs.parsers import TextReader
from pandas.api.types import is_extension_array_dtype
import pandas._testing as tm
from pandas import array
from pandas.io.parsers.c_parser_wrapper import ensure_dtype_objs


@pytest.mark.parametrize(
    "dtype",
    [
        "uint64", "int64", "uint32", "int32", "uint16", "int16", "uint8", "int8",
        "UInt64", "Int64", "UInt32", "Int32", "UInt16", "Int16", "UInt8", "Int8",
    ],
)
def test_integer_overflow_with_user_dtype(dtype):
    dtype = ensure_dtype_objs(dtype)
    is_ext_dtype = is_extension_array_dtype(dtype)
    maxint = np.iinfo(dtype.type if is_ext_dtype else dtype).max

    reader = TextReader(StringIO(f"{maxint}"), header=None, dtype=dtype)
    result = reader.read()
    if is_ext_dtype:
        expected = array([maxint], dtype=dtype)
        tm.assert_extension_array_equal(result[0], expected)
    else:
        expected = np.array([maxint], dtype=dtype)
        tm.assert_numpy_array_equal(result[0], expected)

    reader = TextReader(StringIO(f"{maxint + 1}"), header=None, dtype=dtype)
    with pytest.raises(Exception):
        result = reader.read()
        print(result, end=" ")
$ pytest pandas/tests/io/parser/test_example.py -sv
========================================================================================================= test session starts ==========================================================================================================
platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0 -- /opt/conda/bin/python
cachedir: .pytest_cache
hypothesis profile 'ci' -> deadline=None, suppress_health_check=[HealthCheck.too_slow], database=DirectoryBasedExampleDatabase('/home/pandas/.hypothesis/examples')
rootdir: /home/pandas, configfile: pyproject.toml
plugins: cython-0.2.0, xdist-2.5.0, cov-3.0.0, asyncio-0.18.3, forked-1.4.0, hypothesis-6.46.9, instafail-0.4.1
asyncio: mode=strict
collected 16 items
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[uint64] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[int64] {0: array([9223372036854775808], dtype=uint64)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[uint32] {0: array([0], dtype=uint32)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[int32] {0: array([-2147483648], dtype=int32)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[uint16] {0: array([0], dtype=uint16)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[int16] {0: array([-32768], dtype=int16)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[uint8] {0: array([0], dtype=uint8)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[int8] {0: array([-128], dtype=int8)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[UInt64] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[Int64] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[UInt32] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[Int32] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[UInt16] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[Int16] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[UInt8] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[Int8] PASSED
...
Issue Description
As the example shows, all invocations with the extension dtype variants (UInt64, etc.) and with the non-extension dtype uint64 parse the maximum value correctly and raise an exception at max + 1. More specifically, we get an OverflowError for uint64, a ValueError for UInt64, and TypeErrors for all other extension dtypes, so I simply checked for any exception in the example. This is the safe and IMHO expected behavior.
The issue arises when parsing an integer value with a user-defined dtype, i.e. TextReader(..., dtype != None), and only for non-extension dtypes:
- Requesting int64, we obtain maxint64 + 1 as a uint64. This is at least safe, but not expected and different from the behavior of Int64.
- For all other non-extension dtypes, a silent overflow occurs.
The second problem (the silent overflow) comes from pandas/pandas/_libs/parsers.pyx, line 1191 in c355145, where the casting="unsafe" parameter is used. Furthermore, for int64, we do not reach this line and just return the result from _try_uint64.
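To illustrate why casting="unsafe" leads to the values printed in the test output, here is a minimal NumPy-only sketch. It assumes the value was first parsed into an int64 array and then cast down with `ndarray.astype`; the helper name `wrapped_value` is made up for this illustration and is not part of pandas.

```python
import numpy as np


def wrapped_value(dtype_name):
    """Cast (max + 1) of the given dtype from int64 using casting="unsafe".

    Hypothetical helper for illustration only; not a pandas function.
    """
    dtype = np.dtype(dtype_name)
    too_big = np.array([int(np.iinfo(dtype).max) + 1], dtype=np.int64)
    # "unsafe" permits any conversion, so out-of-range values wrap around
    # instead of raising.
    return int(too_big.astype(dtype, casting="unsafe")[0])


for name in ["uint32", "int32", "uint16", "int16", "uint8", "int8"]:
    print(name, wrapped_value(name))
```

The wrapped values (0 for the unsigned dtypes, the minimum value for the signed ones) are the same ones that show up in the FAILED lines of the test run above.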
Expected Behavior
Non-extension integer dtypes should behave the same as the extension dtypes, i.e. return exactly the requested dtype (if specified by the user) and raise when this dtype is insufficient to hold the parsed value.
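As a rough sketch of the behavior I have in mind (not a proposed patch), a range check before the cast would be enough to turn the silent overflow into an error. The function `checked_cast` below is hypothetical and only uses NumPy; how this would be wired into the parser is left open.

```python
import numpy as np


def checked_cast(values, dtype):
    """Cast an integer array to `dtype`, raising if any value does not fit.

    Hypothetical helper for illustration only; not a pandas API.
    """
    dtype = np.dtype(dtype)
    info = np.iinfo(dtype)
    if values.size:
        # Compare as Python ints so int64 and uint64 inputs are handled exactly.
        lo, hi = int(values.min()), int(values.max())
        if lo < int(info.min) or hi > int(info.max):
            raise OverflowError(f"value out of range for dtype {dtype}")
    # The range check above guarantees that this cast cannot wrap around.
    return values.astype(dtype, casting="unsafe")


print(checked_cast(np.array([127], dtype=np.int64), "int8"))
try:
    checked_cast(np.array([128], dtype=np.int64), "int8")
except OverflowError as err:
    print("raised:", err)
```

With such a check, a uint64 result holding maxint64 + 1 would also be rejected when int64 was requested, instead of being returned as uint64.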
Installed Versions
1.5.0.dev0+839.gc355145c7f