Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Issue Description
Apparently, it is not possible to read in a .csv table with named column index AND named row index.
How to reproduce:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3),
columns=["A", "B", "C"]).round(1)
df.index.name = "Count"
df.columns.name = "Name"
# INCONSISTENCY 1:
# .to_csv() only saves the row index name, it ignores the column index name.
df.to_csv("out.csv")
# The work-around: convert to column index to MultiIndex
df.columns = pd.MultiIndex.from_arrays([df.columns])
# Now, .to_csv() stores both column and index names.
df.to_csv("out.csv")
This is how the file looks like:
Name,A,B,C
Count,,,
0,1.8,0.4,1.0
1,2.2,1.9,-1.0
2,1.0,-0.2,-0.1
3,0.4,0.1,1.5
4,0.8,0.1,0.4
However, when reading in, I found no way to preserve both the index name of row and column indices.
df_new = pd.read_csv("out.csv")
print(df_new)
# ==> Line 1 is a row instead of column header
df_new = pd.read_csv("out.csv", header=[0])
print(df_new)
# ==> Line 1 is a row instead of column header
df_new = pd.read_csv("out.csv", header=[0, 1])
print(df_new)
# ==> Creates a two-level multi-index column header
df_new = pd.read_csv("out.csv", header=[0, 1], index_col=[0])
print(df_new)
# ==> Creates a two-level multi-index column header
df_new = pd.read_csv("out.csv", header=[1], index_col=[0])
print(df_new)
# ==> Correctly create the row index, but loses the column info
df_new = pd.read_csv("out.csv", header=[1], index_col=[0])
print(df_new)
# ==> Correctly create the row index, but loses the column info
I think this behavior is inconsistent, as switching to a multi-index with two or more levels works as expected:
# INCONSISTENCY 2:
# A multi-index column header with >1 levels is handled correctly
df_mh = df.copy()
l1 = list("ABC")
l2 = list("abc")
df_mh.columns = pd.MultiIndex.from_arrays([l1, l2])
df_mh.columns.names = ["Name1", "Name2"]
df_mh.to_csv("out_mh.csv")
df_new = pd.read_csv("out_mh.csv", header=[0,1], index_col=[0])
This reads in the CSV correctly.
The following SO posts are related:
Expected Behavior
In summary, I identified the following inconsistencies:
- For a simple pd.Index, the row index name is preserved using
to_csv()
. However, the column index name is not preserved. This is unexpected, and at least could be documented - For the corner case with a pd.MultiIndex using one level, pd.read_csv() fails to read the multi-header back in.
I know, it's a corner case. However, the fact that the column index can have a name, and the fact that column and header are both represented using the same object, it is somewhat unexpected that those inconsistencies occur.
Installed Versions
INSTALLED VERSIONS
commit : fd3f571
python : 3.10.5.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:56 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.UTF-8
pandas : 2.2.0
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 58.1.0
pip : 24.0
Cython : 0.29.30
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : 2.9.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None