BUG: integer overflow in csv_reader  #47167

Open
@SandroCasagrande

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# content of pandas/tests/io/parser/test_example.py
from io import StringIO

import numpy as np
import pytest
from pandas._libs.parsers import TextReader
from pandas.api.types import is_extension_array_dtype

import pandas._testing as tm
from pandas import array
from pandas.io.parsers.c_parser_wrapper import ensure_dtype_objs


@pytest.mark.parametrize(
    "dtype", [
        "uint64", "int64", "uint32", "int32", "uint16", "int16", "uint8", "int8",
        "UInt64","Int64", "UInt32", "Int32", "UInt16", "Int16", "UInt8", "Int8"
    ]
)
def test_integer_overflow_with_user_dtype(dtype):
    dtype = ensure_dtype_objs(dtype)
    is_ext_dtype = is_extension_array_dtype(dtype)
    maxint = np.iinfo(dtype.type if is_ext_dtype else dtype).max

    reader = TextReader(StringIO(f"{maxint}"), header=None, dtype=dtype)
    result = reader.read()
    if is_ext_dtype:
        expected = array([maxint], dtype=dtype)
        tm.assert_extension_array_equal(result[0], expected)
    else:
        expected = np.array([maxint], dtype=dtype)
        tm.assert_numpy_array_equal(result[0], expected)

    reader = TextReader(StringIO(f"{maxint + 1}"), header=None, dtype=dtype)
    with pytest.raises(Exception):
        result = reader.read()
        print(result, end=" ")

$ pytest pandas/tests/io/parser/test_example.py -sv
========================================================================================================= test session starts ==========================================================================================================
platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0 -- /opt/conda/bin/python
cachedir: .pytest_cache
hypothesis profile 'ci' -> deadline=None, suppress_health_check=[HealthCheck.too_slow], database=DirectoryBasedExampleDatabase('/home/pandas/.hypothesis/examples')
rootdir: /home/pandas, configfile: pyproject.toml
plugins: cython-0.2.0, xdist-2.5.0, cov-3.0.0, asyncio-0.18.3, forked-1.4.0, hypothesis-6.46.9, instafail-0.4.1
asyncio: mode=strict
collected 16 items                                                                                                                                                                                                                     

pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[uint64] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[int64] {0: array([9223372036854775808], dtype=uint64)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[uint32] {0: array([0], dtype=uint32)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[int32] {0: array([-2147483648], dtype=int32)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[uint16] {0: array([0], dtype=uint16)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[int16] {0: array([-32768], dtype=int16)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[uint8] {0: array([0], dtype=uint8)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[int8] {0: array([-128], dtype=int8)} FAILED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[UInt64] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[Int64] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[UInt32] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[Int32] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[UInt16] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[Int16] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[UInt8] PASSED
pandas/tests/io/parser/test_example.py::test_integer_overflow_with_user_dtype[Int8] PASSED

...

Issue Description

As the example shows, every invocation with an extension dtype (UInt64, etc.) and with the non-extension dtype uint64 parses the maximum value but raises an exception at max + 1. More specifically, we get an OverflowError for uint64, a ValueError for UInt64, and a TypeError for every other extension dtype, so I simply checked for any exception in the example. This is the safe and, IMHO, expected behavior.

The issue arises when parsing an integer value with a user-defined dtype, i.e. TextReader(..., dtype=...) with a dtype other than None, and only for non-extension dtypes:

  1. Requesting int64, we obtain maxint64 + 1 as a uint64. This is at least safe, but it is not expected and differs from the behavior of Int64.
  2. For all other non-extension dtypes, the value silently overflows and wraps around (in the output above, uint32 yields 0 and int32 yields -2147483648).

The second problem comes from the line

result = result.astype(dtype)

which uses the default casting="unsafe" parameter and therefore wraps out-of-range values silently. Furthermore, for int64 we never reach this line; the result of _try_uint64 is returned directly.
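The unsafe cast described above can be reproduced with NumPy alone. The following is a minimal sketch, not the parser code itself; the variable names are made up for illustration. It also shows that casting="safe" raises, although that check is based on the dtypes only (not on the values), so it would reject in-range values as well and is probably not a drop-in fix.

```python
import numpy as np

# One past the maximum of the target dtype, held in a wider dtype
# (illustrative names, not taken from pandas).
too_big_for_uint32 = np.array([np.iinfo(np.uint32).max + 1], dtype=np.uint64)
too_big_for_int32 = np.array([np.iinfo(np.int32).max + 1], dtype=np.int64)

# astype defaults to casting="unsafe", so out-of-range values wrap silently.
wrapped_unsigned = too_big_for_uint32.astype(np.uint32)
wrapped_signed = too_big_for_int32.astype(np.int32)
print(wrapped_unsigned, wrapped_signed)

# casting="safe" refuses the int64 -> int32 conversion outright,
# regardless of the values actually stored in the array.
try:
    too_big_for_int32.astype(np.int32, casting="safe")
    safe_cast_raised = False
except TypeError:
    safe_cast_raised = True
print(safe_cast_raised)
```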

Expected Behavior

Non-extension integer dtypes should behave the same as the extension dtypes, i.e. return exactly the requested dtype (if specified by the user) and raise when this dtype cannot hold the parsed value.
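One way to picture the expected behavior is a value-based range check before the cast. The sketch below is only an illustration under stated assumptions: checked_astype is a hypothetical helper, not an existing pandas or NumPy function, and raising OverflowError is just one possible choice of exception.

```python
import numpy as np


def checked_astype(values, dtype):
    """Hypothetical helper: cast an integer array to `dtype`,
    raising instead of wrapping when a value does not fit."""
    dtype = np.dtype(dtype)
    info = np.iinfo(dtype)
    values = np.asarray(values)
    # Compare as Python ints to avoid mixed signed/unsigned promotion.
    if values.size and (
        int(values.min()) < info.min or int(values.max()) > info.max
    ):
        raise OverflowError(f"value out of range for dtype {dtype}")
    return values.astype(dtype)


# In-range values are cast to exactly the requested dtype.
ok = checked_astype(np.array([127], dtype=np.int64), "int8")

# Out-of-range values raise instead of wrapping around.
try:
    checked_astype(np.array([128], dtype=np.int64), "int8")
    overflow_raised = False
except OverflowError:
    overflow_raised = True
```

With such a check, the int64 case would raise as well, since a uint64 value of maxint64 + 1 does not fit into int64.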

Installed Versions

1.5.0.dev0+839.gc355145c7f
