BUG: pd.read_csv error when parse_dates used on index_col for dtype_backend="pyarrow" #57930

Open
@MCRE-BE

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime

import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [datetime.datetime.now()]})
df.to_csv('test.csv', index=False)


# Works: pyarrow backend together with the pyarrow engine.
a = pd.read_csv(
    'test.csv',
    parse_dates=['b'],
    index_col="b",
    dtype_backend="pyarrow",
    engine="pyarrow",
)

# Works: default backend with the default (C) engine.
pd.read_csv(
    'test.csv',
    parse_dates=['b'],
    index_col="b",
)

The following fails:

# Fails: pyarrow backend with the default (C) engine.
pd.read_csv(
    'test.csv',
    parse_dates=['b'],
    index_col="b",
    dtype_backend="pyarrow",
)
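To compare the three calls in one run, the snippet below is a minimal sketch that loops over the same keyword combinations and records whether each one succeeds or raises. The labels, the `variants` dictionary, the fixed timestamp, and the temporary file path are illustrative assumptions, not part of the original report, and the outcomes will depend on the installed pandas and pyarrow versions.

```python
import datetime
import os
import tempfile

import pandas as pd

# Same one-row frame as in the report, written to a temporary file.
# A fixed timestamp is used here only to keep the example deterministic.
df = pd.DataFrame({"a": [1], "b": [datetime.datetime(2024, 3, 20, 12, 0, 0)]})
path = os.path.join(tempfile.mkdtemp(), "test.csv")
df.to_csv(path, index=False)

# The three keyword combinations from the report (labels are illustrative).
variants = {
    "pyarrow backend + pyarrow engine": {"dtype_backend": "pyarrow", "engine": "pyarrow"},
    "default backend + default engine": {},
    "pyarrow backend + default engine": {"dtype_backend": "pyarrow"},
}

outcomes = {}
for label, extra in variants.items():
    try:
        pd.read_csv(path, parse_dates=["b"], index_col="b", **extra)
        outcomes[label] = "ok"
    except Exception as exc:  # record the failure instead of stopping the loop
        outcomes[label] = f"{type(exc).__name__}: {exc}"

for label, outcome in outcomes.items():
    print(f"{label}: {outcome}")
```

On an affected version, the last combination is the one expected to report the `ValueError` described under Issue Description.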

Issue Description

returns
ValueError: not all elements from date_cols are numpy arrays

ValueError Traceback (most recent call last)
Cell In[78], line 1
----> 1 pd.read_csv(
2 'test.csv',
3 parse_dates=['b'],
4 index_col="b",
5 dtype_backend="pyarrow",
6 )

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\readers.py:948, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
935 kwds_defaults = _refine_defaults_read(
936 dialect,
937 delimiter,
(...)
944 dtype_backend=dtype_backend,
945 )
946 kwds.update(kwds_defaults)
--> 948 return _read(filepath_or_buffer, kwds)

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\readers.py:617, in _read(filepath_or_buffer, kwds)
614 return parser
616 with parser:
--> 617 return parser.read(nrows)

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\readers.py:1748, in TextFileReader.read(self, nrows)
1741 nrows = validate_integer("nrows", nrows)
1742 try:
1743 # error: "ParserBase" has no attribute "read"
1744 (
1745 index,
1746 columns,
1747 col_dict,
-> 1748 ) = self._engine.read( # type: ignore[attr-defined]
1749 nrows
1750 )
1751 except Exception:
1752 self.close()

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\c_parser_wrapper.py:333, in CParserWrapper.read(self, nrows)
330 data = {k: v for k, (i, v) in zip(names, data_tups)}
332 names, date_data = self._do_date_conversions(names, data)
--> 333 index, column_names = self._make_index(date_data, alldata, names)
335 return index, column_names, date_data

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\base_parser.py:371, in ParserBase._make_index(self, data, alldata, columns, indexnamerow)
369 elif not self._has_complex_date_col:
370 simple_index = self._get_simple_index(alldata, columns)
--> 371 index = self._agg_index(simple_index)
372 elif self._has_complex_date_col:
373 if not self._name_processed:

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\base_parser.py:468, in ParserBase._agg_index(self, index, try_parse_dates)
466 for i, arr in enumerate(index):
467 if try_parse_dates and self._should_parse_dates(i):
--> 468 arr = self._date_conv(
469 arr,
470 col=self.index_names[i] if self.index_names is not None else None,
471 )
473 if self.na_filter:
474 col_na_values = self.na_values

File c:\ProgramData\miniconda3\envs\DLL_Forecast\Lib\site-packages\pandas\io\parsers\base_parser.py:1142, in _make_date_converter.<locals>.converter(col, *date_cols)
1139 return date_cols[0]
1141 if date_parser is lib.no_default:
-> 1142 strs = parsing.concat_date_cols(date_cols)
1143 date_fmt = (
1144 date_format.get(col) if isinstance(date_format, dict) else date_format
1145 )
1147 with warnings.catch_warnings():

File parsing.pyx:1155, in pandas._libs.tslibs.parsing.concat_date_cols()
ValueError: not all elements from date_cols are numpy arrays

Expected Behavior

As with dtype_backend="numpy_nullable", the index column should be parsed correctly.

The same call works when engine="pyarrow" is also passed (the first call in the reproducible example), so this is likely an issue with the C parser (?)

Installed Versions

INSTALLED VERSIONS

commit : a671b5a
python : 3.11.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Dutch_Belgium.1252

pandas : 2.1.4
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0
setuptools : 69.2.0
pip : 24.0
Cython : None
pytest : 8.1.1
hypothesis : None
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.2
pandas_datareader : None
bs4 : 4.12.3
bottleneck : 1.3.8
dataframe-api-compat: None
fastparquet : None
fsspec : 2024.3.0
gcsfs : None
matplotlib : 3.8.3
numba : 0.59.0
numexpr : 2.9.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.12.0
sqlalchemy : 2.0.28
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : 0.22.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None

Labels

  • Arrow: pyarrow functionality
  • Bug
  • Datetime: Datetime data dtype
  • ExtensionArray: Extending pandas with custom dtypes or arrays.
  • IO CSV: read_csv, to_csv
