BUG: in case of dtype mismatch (int vs category), error message from concat is not crystal clear #42552

Closed
@yohplala

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
exi = pd.read_parquet('/home/yoh/Documents/code/data/existing.parquet', engine='fastparquet')
new = pd.read_parquet('/home/yoh/Documents/code/data/new.parquet', engine='fastparquet')
to_record = pd.concat([exi, new])

Please find the files attached. I was unable to re-create the faulty DataFrames manually.
faulty_dataframes.zip
(size of each is 5 rows x 8 columns)

Problem description

Before installing pandas 1.3.0, I was using pandas 1.2.5 with fastparquet 0.6.4.dev0, and this data extract caused no problem.
Since I installed pandas 1.3.0, the concat command raises the following error:

to_record = pd.concat([exi, new])
Traceback (most recent call last):

  File "<ipython-input-2-9967cb321e9e>", line 4, in <module>
    to_record = pd.concat([exi, new])

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 307, in concat
    return op.get_result()

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 532, in get_result
    new_data = concatenate_managers(

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 222, in concatenate_managers
    values = _concatenate_join_units(join_units, concat_axis, copy=copy)

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 486, in _concatenate_join_units
    to_concat = [

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 487, in <listcomp>
    ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/internals/concat.py", line 403, in get_reindexed_values
    values = self.block.get_values()

  File "/home/yoh/anaconda3/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 1360, in get_values
    return np.asarray(values).reshape(self.shape)

ValueError: cannot reshape array of size 5 into shape (1,0)
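
For readers unfamiliar with the final message: it comes from NumPy and says only that 5 values cannot fit into a shape holding 0 elements. The minimal sketch below triggers the same message with plain NumPy. It does not reproduce the pandas bug itself; the array `values` is a hypothetical stand-in for the block's data.

```python
import numpy as np

# Hypothetical stand-in for a block holding 5 values.
values = np.arange(5)

message = None
try:
    # Shape (1, 0) holds 1 * 0 = 0 elements, so 5 values cannot fit.
    values.reshape((1, 0))
except ValueError as exc:
    message = str(exc)
    print(message)
```

The message names neither the column nor the dtypes involved, which is why it gives little help in locating the cause.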

I noticed that reading the files back with pyarrow yields DataFrames that do not cause any error.
I also tried concatenating various extracts of the DataFrames, selecting columns one by one or several at once, and none of these raise the error. For instance, the following concat calls work without trouble:

to_record = pd.concat([exi[['timestamp','period','side']], new[['side','timestamp','period']]])
to_record = pd.concat([exi[['period','id']], new[['id','period']]])
to_record = pd.concat([exi['tracking'], new['tracking']])
# etc...
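
Since the issue title points to a dtype mismatch (int vs category), one way to narrow things down might be to compare the dtypes of the shared columns before concatenating. The sketch below is only an illustration under that assumption: `dtype_mismatches`, `exi_demo` and `new_demo` are hypothetical names, and the demo frames are made-up stand-ins, not the attached data.

```python
import pandas as pd


def dtype_mismatches(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return the shared columns whose dtypes differ between the two frames."""
    shared = left.columns.intersection(right.columns)
    report = pd.DataFrame(
        {
            "left": left.dtypes.loc[shared].astype(str),
            "right": right.dtypes.loc[shared].astype(str),
        }
    )
    return report[report["left"] != report["right"]]


# Hypothetical stand-ins for `exi` and `new`: `id` is categorical in one
# frame and a plain integer column in the other.
exi_demo = pd.DataFrame({"id": pd.Categorical([1, 2, 3]), "period": [60, 60, 60]})
new_demo = pd.DataFrame({"id": [4, 5, 6], "period": [60, 60, 60]})

mismatches = dtype_mismatches(exi_demo, new_demo)
print(mismatches)
```

Running the same helper on the real `exi` and `new` frames would list any columns whose dtypes disagree, which may help identify candidates for the root cause.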

I am at a loss to narrow the problem down to its root cause.
Would anyone have some advice, please?

Expected Output

No error :)

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f00ed8f
python : 3.8.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.8.0-59-generic
Version : #66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 1.4.4
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.06.0
fastparquet : 0.6.4.dev0
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.19
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

Labels: Bug; Categorical (Categorical Data Type); Error Reporting (Incorrect or improved errors from pandas); IO Parquet (parquet, feather); Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
