ENH: pd.DataFrame.info() to show line numbers GH17304 #17332

Closed
pratapvardhan wants to merge 3 commits into master

@pratapvardhan pratapvardhan commented Aug 25, 2017

Refactored to self.columns and len(self.columns)

New output

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.rand(4, 10),
...                   columns=['%s%s' % (x, np.random.randint(2, 10) * 'a') for x in range(10)])
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 10 columns):
 #.  Column        Non-Null Count & Dtype
---  ------        ----------------------
 0   0aaaaaaaaa    4 non-null float64
 1   1aaaaa        4 non-null float64
 2   2aaaaaa       4 non-null float64
 3   3aaaaaa       4 non-null float64
 4   4aaaaaa       4 non-null float64
 5   5aaaaaaa      4 non-null float64
 6   6aa           4 non-null float64
 7   7aaaaaaa      4 non-null float64
 8   8aaaaaaa      4 non-null float64
 9   9aaaaaaaaa    4 non-null float64
dtypes: float64(10)
memory usage: 392.0 bytes

         counts = None

         tmpl = "%s%s"
         if show_counts:
             counts = self.count()
             if len(cols) != len(counts):  # pragma: no cover
                 raise AssertionError('Columns must equal counts (%d != %d)'
-                                     % (len(cols), len(counts)))
+                                     % (cols_count, len(counts)))
Member

Let's use .format string-formatting instead.
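
For reference, the two styles can be compared side by side. This is only a minimal sketch: `cols_count` and `counts_len` are made-up stand-ins for the values in the diff, and the named fields are one possible spelling, not necessarily what the PR ended up using.

```python
# Hypothetical stand-ins for the values used in the diff above.
cols_count = 10
counts_len = 9

# Old style: printf-style interpolation with the % operator.
old_msg = 'Columns must equal counts (%d != %d)' % (cols_count, counts_len)

# Requested style: str.format with named fields.
new_msg = 'Columns must equal counts ({cols:d} != {counts:d})'.format(
    cols=cols_count, counts=counts_len)

# Both spellings produce the same message.
assert old_msg == new_msg
print(new_msg)
```

Named fields make the template easier to read when the argument tuple is built on a separate line, which is the situation in the diff.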

dtype = dtypes.iloc[i]
col = pprint_thing(col)

line_no = ("%d. " % (i + 1)).rjust(space_num)
Member

Same as above.

         count = ""
         if show_counts:
             count = counts.iloc[i]

-        lines.append(_put_str(col, space) + tmpl % (count, dtype))
+        lines.append(line_no + _put_str(col, space) +
+                     tmpl % (count, dtype))
Member

Same as above.

-        assert 'a 1 non-null int64\n' == lines[3]
-        assert 'a 1 non-null float64\n' == lines[4]
+        assert '1. a 1 non-null int64\n' == lines[3]
+        assert '2. a 1 non-null float64\n' == lines[4]
Member

@bashtage : Do you still prefer 0-indexing? I'm actually okay with 1-indexing because 0-indexing may not be as intuitive for non-purist programmers like ourselves 😉

Contributor Author

@pratapvardhan pratapvardhan Aug 25, 2017

I agree, I'm in favor of 1-indexing usage for .info() too.

Contributor

On the purist point, I think 1-indexing is more of a loss of information, since 0 is meaningful while 1 is not.

Thanks guys for working on this feature :). Is it ready in the new version of pandas?

lines.append('Data columns (total %d columns):' %
len(self.columns))
space = max([len(pprint_thing(k)) for k in self.columns]) + 4
lines.append('Data columns (total %d columns):' % cols_count)
Member

@gfyoung gfyoung Aug 25, 2017

Same as above.

@gfyoung
Member

gfyoung commented Aug 25, 2017

Thanks for the PR! Just address the test failures and add a whatsnew, and you should be set I think.

@gfyoung gfyoung added the Enhancement and Error Reporting (Incorrect or improved errors from pandas) labels Aug 25, 2017
@bashtage
Contributor

I find 1-based indexing confusing, since df.iloc[:, 1] will not be the column numbered 1.

Maybe a header:

Data columns (total 10 columns):
Index  Column   Non-Null Count
-----  ------       ----------
 0     0aaaaa       4 non-null float64
 1     1aa          4 non-null float64
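
The mismatch described here can be shown without pandas. In the sketch below, a plain list stands in for `df.columns` (an assumption made for illustration), since list indexing is zero-based just like `df.iloc`.

```python
# A plain list stands in for df.columns; list indexing is zero-based,
# like the positional indexing done by df.iloc[:, 1].
columns = ['0aaaaa', '1aa', '2aaa']

one_based = {n: name for n, name in enumerate(columns, start=1)}
zero_based = {n: name for n, name in enumerate(columns)}

# With 1-based labels, the label 1 names a different column than position 1.
assert one_based[1] == '0aaaaa'
assert columns[1] == '1aa'

# With 0-based labels, the printed number and the position always agree.
assert all(zero_based[i] == columns[i] for i in range(len(columns)))
```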

@codecov

codecov bot commented Aug 25, 2017

Codecov Report

Merging #17332 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17332      +/-   ##
==========================================
- Coverage   91.01%   90.99%   -0.02%     
==========================================
  Files         162      162              
  Lines       49567    49570       +3     
==========================================
- Hits        45113    45107       -6     
- Misses       4454     4463       +9
Flag         Coverage Δ
#multiple    88.77% <100%> (ø) ⬆️
#single      40.24% <0%> (-0.07%) ⬇️

Impacted Files          Coverage Δ
pandas/core/frame.py    97.72% <100%> (-0.1%) ⬇️
pandas/io/gbq.py        25% <0%> (-58.34%) ⬇️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96f92eb...234b042. Read the comment docs.

@codecov

codecov bot commented Aug 25, 2017

Codecov Report

Merging #17332 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17332      +/-   ##
==========================================
+ Coverage   91.92%   91.92%   +<.01%     
==========================================
  Files         160      160              
  Lines       49913    49926      +13     
==========================================
+ Hits        45882    45895      +13     
  Misses       4031     4031
Flag         Coverage Δ
#multiple    90.3% <100%> (ø) ⬆️
#single      42.09% <0%> (-0.02%) ⬇️

Impacted Files          Coverage Δ
pandas/core/frame.py    97.21% <100%> (+0.01%) ⬆️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da6e26d...7c13220. Read the comment docs.

@pratapvardhan
Contributor Author

@gfyoung -- Changes have been pushed, any thoughts?

Contributor

@jreback jreback left a comment

@pratapvardhan if you want to rebase and revise to a style more like the one @bashtage proposed in #17332 (comment), we can proceed.
Also, move the whatsnew to 0.22.

@pratapvardhan
Contributor Author

@jreback -- updated with @bashtage's proposed style (0-indexed, with a header)

@bashtage
Contributor

Have you checked with very long column names and all possible dtypes? Just wondering whether it is always readable on, say, 80 columns.

@pratapvardhan
Contributor Author

pratapvardhan commented Oct 29, 2017

@bashtage -- some samples

In [20]: df = pd.DataFrame(np.random.rand(4, 10), columns=[
    ...:         '%s%s' % (x, np.random.randint(2, 50)*'a') for x in range(10)])

In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 10 columns):
 #.  Column                                               Non-Null Count & Dtype
---  ------                                               ----------------------
 0   0aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa         4 non-null float64
 1   1aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa    4 non-null float64
 2   2aaaaaa                                              4 non-null float64
 3   3aaaaaaaaaaaaaaaaaaaaaaaaaaaaa                       4 non-null float64
 4   4aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa                  4 non-null float64
 5   5aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa               4 non-null float64
 6   6aaaaaaaaaaaaaaaaaaaa                                4 non-null float64
 7   7aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa                   4 non-null float64
 8   8aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa      4 non-null float64
 9   9aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa       4 non-null float64
dtypes: float64(10)
memory usage: 392.0 bytes

In [22]: df = pd.DataFrame(np.random.rand(4,4), columns=['%s' % x for x in range(4)])

In [23]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #.  Column    Non-Null Count & Dtype
---  ------    ----------------------
 0   0         4 non-null float64
 1   1         4 non-null float64
 2   2         4 non-null float64
 3   3         4 non-null float64
dtypes: float64(4)
memory usage: 200.0 bytes

In [16]: df.info(null_counts=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #.  Column    Dtype
---  ------    -----
 0   0         float64
 1   1         float64
 2   2         float64
 3   3         float64
dtypes: float64(4)
memory usage: 200.0 bytes

In [24]: pd.DataFrame({'A': np.random.randn(5),
    ...:               'B': pd.date_range('1/1/2000', periods=5),
    ...:               'C': ['c']*5,
    ...:               'D': [1]*5}).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #.  Column    Non-Null Count & Dtype
---  ------    ----------------------
 0   A         5 non-null float64
 1   B         5 non-null datetime64[ns]
 2   C         5 non-null object
 3   D         5 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 232.0+ bytes

The space computation takes care of the column name lengths, and verbosity defaults to the provided settings.
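
As a rough illustration of that width computation, the sketch below pads every column name to the longest name plus 4 characters, following the `space = max(...) + 4` line quoted earlier in the diff. The `put_str` helper, the sample column names, and the fixed number-column width of 5 are assumptions made for this example, not the PR's actual code.

```python
def put_str(s, space):
    """Hypothetical helper: truncate or left-justify s to exactly `space` chars."""
    return str(s)[:space].ljust(space)


cols = ['A', 'B', 'a_much_longer_column_name']
dtypes = ['float64', 'datetime64[ns]', 'object']

# Width of the name column: longest name plus 4 characters of padding.
space = max(len(str(k)) for k in cols) + 4

lines = []
for i, (col, dtype) in enumerate(zip(cols, dtypes)):
    # Number column: a leading space, the 0-based position, padded to width 5.
    line_no = put_str(' {i}'.format(i=i), 5)
    lines.append(line_no + put_str(col, space) + dtype)

print('\n'.join(lines))
```

Because every name is padded to the same width, the dtype column starts at the same offset on every row, however long the longest name is.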

@jreback
Contributor

jreback commented Nov 25, 2017

can you rebase / update

@pratapvardhan
Contributor Author

I've rebased, but I'm not sure what is causing the lint issue. The diff seems fine under flake8.

@jreback
Contributor

jreback commented Jan 6, 2018

Check for use of lists instead of generators in built-in Python functions
pandas/core/frame.py:            space = max([len(pprint_thing(k)) for k in cols])

pushed a commit to fix
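
The lint check quoted above flags a list comprehension passed straight to a built-in. A minimal sketch of the difference follows, using `str` as a stand-in for `pprint_thing` and made-up column names (both assumptions for this example):

```python
cols = ['int_col', 'text_col', 'float_col']

# Flagged form: builds a temporary list before max() consumes it.
space_list = max([len(str(k)) for k in cols])

# Preferred form: a generator expression feeds max() one value at a time.
space_gen = max(len(str(k)) for k in cols)

# Same result either way; only the intermediate list is avoided.
assert space_list == space_gen == 9
```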

@pratapvardhan
Contributor Author

@jreback -- Thanks for that.

@@ -462,3 +462,4 @@ Other

- Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
- :func:`Timestamp.replace` will now handle Daylight Savings transitions gracefully (:issue:`18319`)
- :func:`DataFrame.info()` now shows line numbers for column summary (:issue:`17304`)
Contributor

This will need a sub-section to show it, as this is a major change to output formatting.

Contributor

put this in new features

@@ -239,8 +239,8 @@ def test_info_duplicate_columns_shows_correct_dtypes(self):
         frame.info(buf=io)
         io.seek(0)
         lines = io.readlines()
-        assert 'a 1 non-null int64\n' == lines[3]
Contributor

can you add an explicit test for this, one that specifically checks the formatting (a bit duplicative of this one, but I'd like it separate)
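
One possible shape for such a test is sketched below, reusing the buffer pattern from the hunk above (write to a buffer, seek to the start, read the lines back). The `write_info` function is a hypothetical, simplified stand-in for `DataFrame.info`, so the sketch runs without pandas. The expected strings follow the `#.` header style from the sample outputs in this thread and are not pandas' real output.

```python
import io


def write_info(columns, dtypes, counts, buf):
    """Hypothetical, simplified writer mimicking the numbered column summary."""
    last = 'Non-Null Count & Dtype'
    # Name column width: longest of the names and the 'Column' title, plus 4.
    space = max(len(c) for c in list(columns) + ['Column']) + 4
    buf.write(' #.  ' + 'Column'.ljust(space) + last + '\n')
    buf.write('---  ' + '------'.ljust(space) + '-' * len(last) + '\n')
    for i, (col, dtype, count) in enumerate(zip(columns, dtypes, counts)):
        line_no = ' {i}'.format(i=i).ljust(5)
        buf.write(line_no + col.ljust(space)
                  + '{count} non-null {dtype}\n'.format(count=count, dtype=dtype))


def test_info_line_numbers():
    buf = io.StringIO()
    # Duplicate column names, as in the existing duplicate-columns test.
    write_info(['a', 'a'], ['int64', 'float64'], [1, 1], buf)
    buf.seek(0)
    lines = buf.readlines()
    assert lines[0] == ' #.  Column    Non-Null Count & Dtype\n'
    assert lines[2] == ' 0   a' + ' ' * 9 + '1 non-null int64\n'
    assert lines[3] == ' 1   a' + ' ' * 9 + '1 non-null float64\n'


test_info_line_numbers()
```

Checking whole lines, padding included, is what separates a formatting test from the dtype check in the existing test.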

Contributor

@jreback jreback left a comment

can you show the output of a sample

counts = None

tmpl = "%s%s"
header = _put_str('Index', space_num) + _put_str('Column', space)
Contributor

Maybe call this 'N' rather than 'Index'. Also can you put a line of '-' after the header, formatted the same

@jreback
Contributor

jreback commented Feb 24, 2018

can you rebase and update

@jreback
Contributor

jreback commented Jul 7, 2018

This looked OK if we can come back and rebase it. @pandas-dev/pandas-core

Member

@jorisvandenbossche jorisvandenbossche left a comment

We shouldn't use "Index" to indicate the number, that's a bit confusing with the other meaning of that.

@pratapvardhan
Contributor Author

pratapvardhan commented Jul 9, 2018

@jreback @jorisvandenbossche -- now using #. instead of Index for the line number column. Below are the sample outputs. Moved this to the 0.24.0 whatsnew section.

In [20]: df = pd.DataFrame(np.random.rand(4, 10), columns=[
    ...:         '%s%s' % (x, np.random.randint(2, 50)*'a') for x in range(10)])

In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 10 columns):
 #.  Column                                               Non-Null Count & Dtype
---  ------                                               ----------------------
 0   0aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa         4 non-null float64
 1   1aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa    4 non-null float64
 2   2aaaaaa                                              4 non-null float64
 3   3aaaaaaaaaaaaaaaaaaaaaaaaaaaaa                       4 non-null float64
 4   4aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa                  4 non-null float64
 5   5aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa               4 non-null float64
 6   6aaaaaaaaaaaaaaaaaaaa                                4 non-null float64
 7   7aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa                   4 non-null float64
 8   8aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa      4 non-null float64
 9   9aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa       4 non-null float64
dtypes: float64(10)
memory usage: 392.0 bytes

In [22]: df = pd.DataFrame(np.random.rand(4,4), columns=['%s' % x for x in range(4)])

In [23]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #.  Column    Non-Null Count & Dtype
---  ------    ----------------------
 0   0         4 non-null float64
 1   1         4 non-null float64
 2   2         4 non-null float64
 3   3         4 non-null float64
dtypes: float64(4)
memory usage: 200.0 bytes

In [16]: df.info(null_counts=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #.  Column    Dtype
---  ------    -----
 0   0         float64
 1   1         float64
 2   2         float64
 3   3         float64
dtypes: float64(4)
memory usage: 200.0 bytes

In [24]: pd.DataFrame({'A': np.random.randn(5),
    ...:               'B': pd.date_range('1/1/2000', periods=5),
    ...:               'C': ['c']*5,
    ...:               'D': [1]*5}).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #.  Column    Non-Null Count & Dtype
---  ------    ----------------------
 0   A         5 non-null float64
 1   B         5 non-null datetime64[ns]
 2   C         5 non-null object
 3   D         5 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 232.0+ bytes

@bashtage
Contributor

bashtage commented Jul 9, 2018 via email

Output Formatting Enhancements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- `df.info()` now shows line numbers for the columns summary (:issue:`17304`)
Contributor

you can use a :func:`DataFrame.info` here

int_col 5 non-null int64
text_col 5 non-null object
float_col 5 non-null float64
#. Column Non-Null Count
Contributor

Maybe we want to call this
Non-Null Count & Dtype?

We will need to adjust this header if null_counts=False is passed to .info() (add a test as well).

Contributor Author

@pratapvardhan pratapvardhan Jul 10, 2018

For df.info(null_counts=False), the header will show dtype (or would you prefer Dtype?). Is that fine? I'll push a test case once you confirm.

In [38]: df.info(null_counts=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #.  Column    dtype
---  ------    -----
 0   foo1      object
 1   foo2      float64
dtypes: float64(1), object(1)
memory usage: 168.0+ bytes
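
The header switch under discussion could look roughly like the sketch below. `build_header` is a hypothetical helper, not the PR's code, and it assumes the capitalised `Dtype` label used in the earlier sample outputs; the lowercase spelling asked about above would work the same way.

```python
def build_header(show_counts, space):
    """Hypothetical helper returning the two header lines of the column summary."""
    # The last column title depends on whether non-null counts are shown.
    last = 'Non-Null Count & Dtype' if show_counts else 'Dtype'
    titles = ' #.  ' + 'Column'.ljust(space) + last
    rule = '---  ' + '------'.ljust(space) + '-' * len(last)
    return titles, rule


for flag in (True, False):
    titles, rule = build_header(flag, space=10)
    print(titles)
    print(rule)
```

Deriving the dashed rule from the chosen title keeps the two lines the same length in both modes, which makes the requested test for `null_counts=False` easy to write.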

@WillAyd
Member

WillAyd commented Feb 27, 2019

Closing as stale. Ping if you'd like to continue

@WillAyd WillAyd closed this Feb 27, 2019
@rotuna

rotuna commented Sep 8, 2019

@WillAyd What was left to be done? I can try finishing it up

@WillAyd
Member

WillAyd commented Sep 9, 2019

@rotuna if you'd like to pick this up, you can make your own branch off of this, merge in master, address the comments above, and post your own PR

Note we've had a lot of stylistic changes to our code since this was started (namely introducing black), so it might be a large diff

Labels
Enhancement Error Reporting Incorrect or improved errors from pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feature wanted: pd.DataFrame.info() should show line numbers
8 participants