Skip to content

BUG: Inconsistent handling of named column indices in to_csv() and read_csv(). #57453

Open
@normanius

Description

@normanius

Pandas version checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • I have confirmed this bug exists on the main branch of pandas.

Issue Description

Apparently, it is not possible to read in a .csv table with named column index AND named row index.

How to reproduce:

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), 
                  columns=["A", "B", "C"]).round(1)
df.index.name = "Count"
df.columns.name = "Name"

# INCONSISTENCY 1: 
# .to_csv() only saves the row index name, it ignores the column index name.
df.to_csv("out.csv")

# The work-around: convert to column index to MultiIndex
df.columns = pd.MultiIndex.from_arrays([df.columns])

# Now, .to_csv() stores both column and index names.
df.to_csv("out.csv")

This is how the file looks like:

Name,A,B,C
Count,,,
0,1.8,0.4,1.0
1,2.2,1.9,-1.0
2,1.0,-0.2,-0.1
3,0.4,0.1,1.5
4,0.8,0.1,0.4

However, when reading in, I found no way to preserve both the index name of row and column indices.

df_new = pd.read_csv("out.csv")
print(df_new)
# ==> Line 1 is a row instead of column header

df_new = pd.read_csv("out.csv", header=[0])
print(df_new)
# ==> Line 1 is a row instead of column header

df_new = pd.read_csv("out.csv", header=[0, 1])
print(df_new)
# ==> Creates a two-level multi-index column header

df_new = pd.read_csv("out.csv", header=[0, 1], index_col=[0])
print(df_new)
# ==> Creates a two-level multi-index column header

df_new = pd.read_csv("out.csv", header=[1], index_col=[0])
print(df_new)
# ==> Correctly create the row index, but loses the column info

df_new = pd.read_csv("out.csv", header=[1], index_col=[0])
print(df_new)
# ==> Correctly create the row index, but loses the column info

I think this behavior is inconsistent, as switching to a multi-index with two or more levels works as expected:

# INCONSISTENCY 2:
# A multi-index column header with >1 levels is handled correctly
df_mh = df.copy()
l1 = list("ABC")
l2 = list("abc")
df_mh.columns = pd.MultiIndex.from_arrays([l1, l2])
df_mh.columns.names = ["Name1", "Name2"]
df_mh.to_csv("out_mh.csv")
df_new = pd.read_csv("out_mh.csv", header=[0,1], index_col=[0])

This reads in the CSV correctly.

The following SO posts are related:

Expected Behavior

In summary, I identified the following inconsistencies:

  1. For a simple pd.Index, the row index name is preserved using to_csv(). However, the column index name is not preserved. This is unexpected, and at least could be documented
  2. For the corner case with a pd.MultiIndex using one level, pd.read_csv() fails to read the multi-header back in.

I know, it's a corner case. However, the fact that the column index can have a name, and the fact that column and header are both represented using the same object, it is somewhat unexpected that those inconsistencies occur.

Installed Versions

INSTALLED VERSIONS

commit : fd3f571
python : 3.10.5.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Thu Sep 29 20:13:56 PDT 2022; root:xnu-8020.240.7~1/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.UTF-8

pandas : 2.2.0
numpy : 1.23.1
pytz : 2022.1
dateutil : 2.8.2
setuptools : 58.1.0
pip : 24.0
Cython : 0.29.30
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : 2.9.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugIO CSVread_csv, to_csvNeeds DiscussionRequires discussion from core team before further action

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions