Description
This issue is to discuss how to obtain the size of a dataframe. I'll illustrate with an example based on the pandas API.
Given a dataframe:
import pandas
data = {'col1': [1, 2, 3, 4],
'col2': [5, 6, 7, 8]}
df = pandas.DataFrame(data)
I think the simplest and most Pythonic way to get the number of rows and columns is to just use Python's len, which is what pandas does:
>>> len(df) # number of rows
4
>>> len(df.columns) # number of columns
2
I guess an alternative could be to use df.num_rows and df.num_columns, but IMHO it doesn't add much value, and just makes the API more complex.
One thing to note is that pandas mostly implements the dict API for a dataframe (as if it were a dictionary of lists, like data in the example). But returning the number of rows with len(df) is inconsistent with the dict API, which would return the number of columns (keys). So, with the proposed API, len(data) != len(df). I think being fully consistent with the dict API would be misleading, but it's worth considering.
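To make the inconsistency concrete, here is a small sketch reusing the example data from above. It only relies on current pandas behavior (iterating a dataframe yields its column names); the variable names n_dict and n_df are just for illustration.

```python
import pandas

data = {'col1': [1, 2, 3, 4],
        'col2': [5, 6, 7, 8]}
df = pandas.DataFrame(data)

# dict-like behavior: iteration and membership operate on column names
assert list(df) == list(data)
assert 'col1' in df

# but len() disagrees: keys for the dict, rows for the dataframe
n_dict = len(data)  # number of keys (columns)
n_df = len(df)      # number of rows
print(n_dict, n_df)
```

So anyone treating a dataframe as a mapping of column name to values would expect len(df) to be 2 here, while pandas returns 4.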
Then, pandas offers some extra properties:
df.ndim == 2
df.shape == (len(df), len(df.columns))
df.size == len(df) * len(df.columns)
I guess the reason for the first two is that pandas originally implemented Panel, a three-dimensional data structure, and ndim and shape made sense with it. But I don't think they add much value now.
I don't think size is that commonly used (will check once we have the data from analyzing pandas usage), and it's trivial for users to implement it themselves, so I wouldn't add it to the API.
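As a rough illustration of how little users would lose, this sketch rebuilds the three extra properties from len alone and checks them against pandas. The helper names shape_of and size_of are hypothetical, not part of pandas or of this proposal.

```python
import pandas

df = pandas.DataFrame({'col1': [1, 2, 3, 4],
                       'col2': [5, 6, 7, 8]})

def shape_of(df):
    # hypothetical helper: (rows, columns) built only from len()
    return len(df), len(df.columns)

def size_of(df):
    # hypothetical helper: total number of cells
    rows, columns = shape_of(df)
    return rows * columns

shape = shape_of(df)
size = size_of(df)

# these match what pandas exposes today
assert shape == df.shape
assert size == df.size
assert len(shape) == df.ndim
```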
Proposal
len(df) returning the number of rows
len(df.columns) returning the number of columns
And nothing else regarding the shape of a dataframe.