ENH: pd.DataFrame.info() to show line numbers GH17304 #17332

pratapvardhan · 2017-08-25T08:42:20Z

closes feature wanted: pd.DataFrame.info() should show line numbers #17304
tests updated and passed
passes flake8 diff
whatsnew entry

Refactored to self.columns and len(self.columns)

New output

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.rand(4, 10),
...                   columns=['%s%s' % (x, np.random.randint(2, 10)*'a') for x in range(10)])
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 10 columns):
 #.  Column        Non-Null Count & Dtype
---  ------        ----------------------
 0   0aaaaaaaaa    4 non-null float64
 1   1aaaaa        4 non-null float64
 2   2aaaaaa       4 non-null float64
 3   3aaaaaa       4 non-null float64
 4   4aaaaaa       4 non-null float64
 5   5aaaaaaa      4 non-null float64
 6   6aa           4 non-null float64
 7   7aaaaaaa      4 non-null float64
 8   8aaaaaaa      4 non-null float64
 9   9aaaaaaaaa    4 non-null float64
dtypes: float64(10)
memory usage: 392.0 bytes

gfyoung · 2017-08-25T08:55:50Z

pandas/core/frame.py

            counts = None

            tmpl = "%s%s"
            if show_counts:
                counts = self.count()
                if len(cols) != len(counts):  # pragma: no cover
                    raise AssertionError('Columns must equal counts (%d != %d)'
-                                         % (len(cols), len(counts)))
+                                         % (cols_count, len(counts)))


Let's use .format string-formatting instead.
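
For reference, a minimal sketch of what the requested switch from %-interpolation to .format could look like for the two expressions in this diff. The values assigned to cols_count, counts_len, i and space_num are placeholders chosen for illustration, not values taken from the PR.

```python
# Placeholder values standing in for len(cols) and len(counts).
cols_count, counts_len = 10, 9

# %-style, as written in the diff.
old_style = 'Columns must equal counts (%d != %d)' % (cols_count, counts_len)

# Equivalent .format call with named fields.
new_style = 'Columns must equal counts ({cols:d} != {counts:d})'.format(
    cols=cols_count, counts=counts_len)

# The row prefix from the same diff (1-based in this first revision).
i, space_num = 0, 5  # placeholder loop index and prefix width
old_prefix = ('%d. ' % (i + 1)).rjust(space_num)
new_prefix = '{num}. '.format(num=i + 1).rjust(space_num)
```

Both spellings produce identical strings, so the change is purely stylistic.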

gfyoung · 2017-08-25T08:56:15Z

pandas/core/frame.py

                dtype = dtypes.iloc[i]
                col = pprint_thing(col)
-
+                line_no = ("%d. " % (i + 1)).rjust(space_num)


Same as above.

gfyoung · 2017-08-25T08:56:19Z

pandas/core/frame.py

                count = ""
                if show_counts:
                    count = counts.iloc[i]

-                lines.append(_put_str(col, space) + tmpl % (count, dtype))
+                lines.append(line_no + _put_str(col, space) +
+                             tmpl % (count, dtype))


Same as above.

gfyoung · 2017-08-25T08:57:21Z

pandas/tests/frame/test_repr_info.py

-        assert 'a    1 non-null int64\n' == lines[3]
-        assert 'a    1 non-null float64\n' == lines[4]
+        assert '1. a    1 non-null int64\n' == lines[3]
+        assert '2. a    1 non-null float64\n' == lines[4]


@bashtage : Do you still prefer 0-indexing? I'm actually okay with 1-indexing because 0-indexing may not be as intuitive for non-purist programmers like ourselves 😉

I agree, I'm in favor of 1-indexing usage for .info() too.

From a purist's point of view, I think it is more a loss of information, since 0 is meaningful while 1 is not.

Thanks, guys, for working on this feature :). Is it available in a new version of pandas?

gfyoung · 2017-08-25T08:57:58Z

pandas/core/frame.py

-            lines.append('Data columns (total %d columns):' %
-                         len(self.columns))
-            space = max([len(pprint_thing(k)) for k in self.columns]) + 4
+            lines.append('Data columns (total %d columns):' % cols_count)


Same as above.

gfyoung · 2017-08-25T08:58:44Z

Thanks for the PR! Just address the test failures and add a whatsnew, and you should be set I think.

bashtage · 2017-08-25T09:08:11Z

I find 1-based indexing confusing, since df.iloc[:, 1] will not be the column labeled 1.

Maybe a header:

Data columns (total 10 columns):
Index  Column   Non-Null Count
-----  ------       ----------
 0     0aaaaa       4 non-null float64
 1     1aa          4 non-null float64
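
The argument for 0-based numbering can be illustrated without pandas, using a plain list as a stand-in for the column order. This is only a sketch: the column names and the helper labels_match_positions are made up for the example, and positional lookup on a list plays the role that df.iloc[:, n] plays on a DataFrame.

```python
columns = ['0aaaaa', '1aa', '2aaaa']  # hypothetical column names

# Labels as they would be printed next to each column name.
zero_based = list(enumerate(columns))           # (0, '0aaaaa'), (1, '1aa'), ...
one_based = list(enumerate(columns, start=1))   # (1, '0aaaaa'), (2, '1aa'), ...


def labels_match_positions(labelled, columns):
    """True if every printed label can be used directly as a position."""
    return all(label < len(columns) and columns[label] == name
               for label, name in labelled)


zero_ok = labels_match_positions(zero_based, columns)
one_ok = labels_match_positions(one_based, columns)
```

Here zero_ok is True and one_ok is False: only the 0-based labels can be fed straight back into a positional lookup.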

codecov · 2017-08-25T17:09:40Z

Codecov Report

Merging #17332 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17332      +/-   ##
==========================================
- Coverage   91.01%   90.99%   -0.02%     
==========================================
  Files         162      162              
  Lines       49567    49570       +3     
==========================================
- Hits        45113    45107       -6     
- Misses       4454     4463       +9

Flag	Coverage Δ
#multiple	`88.77% <100%> (ø)`	⬆️
#single	`40.24% <0%> (-0.07%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.72% <100%> (-0.1%)`	⬇️
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96f92eb...234b042. Read the comment docs.

codecov · 2017-08-25T17:09:48Z

Codecov Report

Merging #17332 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #17332      +/-   ##
==========================================
+ Coverage   91.92%   91.92%   +<.01%     
==========================================
  Files         160      160              
  Lines       49913    49926      +13     
==========================================
+ Hits        45882    45895      +13     
  Misses       4031     4031

Flag	Coverage Δ
#multiple	`90.3% <100%> (ø)`	⬆️
#single	`42.09% <0%> (-0.02%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.21% <100%> (+0.01%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update da6e26d...7c13220. Read the comment docs.

pratapvardhan · 2017-09-06T06:49:54Z

@gfyoung -- Changes have been pushed, any thoughts?

jreback

@pratapvardhan if you want to rebase and revise to a style more like the one @bashtage proposed in #17332 (comment), we can proceed.
Also, move the whatsnew entry to 0.22.

pratapvardhan · 2017-10-29T10:17:00Z

@jreback -- updated with @bashtage proposed style (0-index with header)

bashtage · 2017-10-29T11:32:50Z

Have you checked with very long column names and all possible dtypes? Just wondering whether it is always readable on, say, 80 columns.

pratapvardhan · 2017-10-29T11:47:08Z

@bashtage -- some samples

In [20]: df = pd.DataFrame(np.random.rand(4, 10), columns=[
    ...:         '%s%s' % (x, np.random.randint(2, 50)*'a') for x in range(10)])

In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 10 columns):
 #.  Column                                               Non-Null Count & Dtype
---  ------                                               ----------------------
 0   0aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa         4 non-null float64
 1   1aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa    4 non-null float64
 2   2aaaaaa                                              4 non-null float64
 3   3aaaaaaaaaaaaaaaaaaaaaaaaaaaaa                       4 non-null float64
 4   4aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa                  4 non-null float64
 5   5aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa               4 non-null float64
 6   6aaaaaaaaaaaaaaaaaaaa                                4 non-null float64
 7   7aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa                   4 non-null float64
 8   8aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa      4 non-null float64
 9   9aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa       4 non-null float64
dtypes: float64(10)
memory usage: 392.0 bytes

In [22]: df = pd.DataFrame(np.random.rand(4,4), columns=['%s' % x for x in range(4)])

In [23]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #.  Column    Non-Null Count & Dtype
---  ------    ----------------------
 0   0         4 non-null float64
 1   1         4 non-null float64
 2   2         4 non-null float64
 3   3         4 non-null float64
dtypes: float64(4)
memory usage: 200.0 bytes

In [16]: df.info(null_counts=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #.  Column    Dtype
---  ------    -----
 0   0         float64
 1   1         float64
 2   2         float64
 3   3         float64
dtypes: float64(4)
memory usage: 200.0 bytes

In [24]: pd.DataFrame({'A': np.random.randn(5),
    ...:               'B': pd.date_range('1/1/2000', periods=5),
    ...:               'C': ['c']*5,
    ...:               'D': [1]*5}).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #.  Column    Non-Null Count & Dtype
---  ------    ----------------------
 0   A         5 non-null float64
 1   B         5 non-null datetime64[ns]
 2   C         5 non-null object
 3   D         5 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 232.0+ bytes

The space computation takes care of the column-name length, and verbosity defaults to the provided settings.
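
A rough sketch of how the padding in the samples above could be derived. It is not the code in pandas/core/frame.py: put_str and render_rows are hypothetical helpers, the fixed number-column width of 5 and the minimum width based on the 'Column' title are inferred from the printed samples, and only the "longest name plus 4" rule comes from the diff.

```python
def put_str(text, width):
    """Pad (or truncate) text to exactly `width` characters."""
    return str(text)[:width].ljust(width)


def render_rows(columns, counts, dtypes, gap=4):
    """Build the numbered per-column lines of an info()-style summary."""
    # Width of the name column: longest name (or the 'Column' title,
    # an assumption inferred from the samples) plus a gap of 4.
    space = max(len('Column'), max(len(str(c)) for c in columns)) + gap
    rows = []
    for i, (col, count, dtype) in enumerate(zip(columns, counts, dtypes)):
        number = put_str(' {}'.format(i), 5)  # assumed fixed-width number column
        rows.append(number + put_str(col, space)
                    + '{} non-null {}'.format(count, dtype))
    return rows


rows = render_rows(['A', 'B'], [5, 5], ['float64', 'datetime64[ns]'])
```

Because the width is computed from the longest name, long column names push the count and dtype to the right instead of breaking the alignment.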

jreback · 2017-11-25T16:16:36Z

can you rebase / update

pratapvardhan · 2018-01-06T18:04:48Z

I've rebased, not sure what is causing the lint issue. The diff seems flake8 fine.

jreback · 2018-01-06T18:12:38Z

Check for use of lists instead of generators in built-in Python functions
pandas/core/frame.py:            space = max([len(pprint_thing(k)) for k in cols])

pushed a commit to fix
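
For context, the lint rule objects to passing a list comprehension to a built-in that accepts any iterable. A small sketch of the flagged and the generator form, using made-up column names; the exact change in the pushed commit is not shown in this thread.

```python
cols = ['int_col', 'text_col', 'float_col']  # illustrative column names

# Flagged form: builds a temporary list just to take its maximum.
space_list = max([len(k) for k in cols])

# Generator form: same result, no intermediate list.
space_gen = max(len(k) for k in cols)
```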

pratapvardhan · 2018-01-07T06:31:43Z

@jreback -- Thanks for that.

jreback · 2018-01-07T15:17:09Z

doc/source/whatsnew/v0.23.0.txt

@@ -462,3 +462,4 @@ Other

 - Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
 - :func:`Timestamp.replace` will now handle Daylight Savings transitions gracefully (:issue:`18319`)
+- :func:`DataFrame.info()` now shows line numbers for column summary (:issue:`17304`)


This will need a sub-section to show it, as this is a major change to output formatting.

Put this in new features.

jreback · 2018-01-07T15:20:23Z

pandas/tests/frame/test_repr_info.py

@@ -239,8 +239,8 @@ def test_info_duplicate_columns_shows_correct_dtypes(self):
        frame.info(buf=io)
        io.seek(0)
        lines = io.readlines()
-        assert 'a    1 non-null int64\n' == lines[3]


Can you add an explicit test for this, one that specifically checks the formatting? It is a bit duplicative of this one, but I'd like it separate.

jreback

can you show the output of a sample

jreback · 2018-01-07T15:21:36Z

pandas/core/frame.py

            counts = None

-            tmpl = "%s%s"
+            header = _put_str('Index', space_num) + _put_str('Column', space)


Maybe call this 'N' rather than 'Index'. Also can you put a line of '-' after the header, formatted the same
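
One way the requested dashed line could be produced so that it stays aligned with the header. This is a sketch under assumptions: header_with_rule is a hypothetical helper, and the titles and widths are copied from the sample outputs posted in this thread rather than from the PR code.

```python
def header_with_rule(titles, widths):
    """Return the header line and a matching line of dashes.

    Each title is left-justified in its column; the rule repeats '-'
    once per title character and is padded to the same column width.
    """
    header = ''.join(t.ljust(w) for t, w in zip(titles, widths)).rstrip()
    rule = ''.join(('-' * len(t)).ljust(w)
                   for t, w in zip(titles, widths)).rstrip()
    return header, rule


titles = [' #.', 'Column', 'Non-Null Count & Dtype']
widths = [5, 10, 22]  # assumed column widths, taken from the samples
header, rule = header_with_rule(titles, widths)
```

Deriving the rule from the same titles and widths as the header keeps the two lines in step when a title or a column width changes.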

jreback · 2018-02-24T17:25:57Z

can you rebase and update

jreback · 2018-07-07T14:47:34Z

This looked OK, if we can come back and rebase it. @pandas-dev/pandas-core

jorisvandenbossche

We shouldn't use "Index" to indicate the number, that's a bit confusing with the other meaning of that.

pratapvardhan · 2018-07-09T18:02:21Z

@jreback @jorisvandenbossche -- using #. instead of Index for the line number column. Below are the sample outputs. Moved this to the 0.24.0 whatsnew section.

In [20]: df = pd.DataFrame(np.random.rand(4, 10), columns=[
    ...:         '%s%s' % (x, np.random.randint(2, 50)*'a') for x in range(10)])

In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 10 columns):
 #.  Column                                               Non-Null Count & Dtype
---  ------                                               ----------------------
 0   0aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa         4 non-null float64
 1   1aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa    4 non-null float64
 2   2aaaaaa                                              4 non-null float64
 3   3aaaaaaaaaaaaaaaaaaaaaaaaaaaaa                       4 non-null float64
 4   4aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa                  4 non-null float64
 5   5aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa               4 non-null float64
 6   6aaaaaaaaaaaaaaaaaaaa                                4 non-null float64
 7   7aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa                   4 non-null float64
 8   8aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa      4 non-null float64
 9   9aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa       4 non-null float64
dtypes: float64(10)
memory usage: 392.0 bytes

In [22]: df = pd.DataFrame(np.random.rand(4,4), columns=['%s' % x for x in range(4)])

In [23]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #.  Column    Non-Null Count & Dtype
---  ------    ----------------------
 0   0         4 non-null float64
 1   1         4 non-null float64
 2   2         4 non-null float64
 3   3         4 non-null float64
dtypes: float64(4)
memory usage: 200.0 bytes

In [16]: df.info(null_counts=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #.  Column    Dtype
---  ------    -----
 0   0         float64
 1   1         float64
 2   2         float64
 3   3         float64
dtypes: float64(4)
memory usage: 200.0 bytes

In [24]: pd.DataFrame({'A': np.random.randn(5),
    ...:               'B': pd.date_range('1/1/2000', periods=5),
    ...:               'C': ['c']*5,
    ...:               'D': [1]*5}).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #.  Column    Non-Null Count & Dtype
---  ------    ----------------------
 0   A         5 non-null float64
 1   B         5 non-null datetime64[ns]
 2   C         5 non-null object
 3   D         5 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 232.0+ bytes

bashtage · 2018-07-09T20:27:05Z

What about iloc instead of #.? If staying with # the . seems unnecessary.


jreback · 2018-07-09T21:45:18Z

doc/source/whatsnew/v0.24.0.txt

+Output Formatting Enhancements
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+- `df.info()` now shows line numbers for the columns summary (:issue:`17304`)


you can use a :func:`DataFrame.info` here

jreback · 2018-07-09T21:46:58Z

pandas/core/frame.py

-        int_col      5 non-null int64
-        text_col     5 non-null object
-        float_col    5 non-null float64
+         #.  Column       Non-Null Count


Maybe call this Non-Null Count & Dtype?

This header will need to be adjusted if null_counts=False is passed to .info() (add a test as well).

For df.info(null_counts=False), the header will show dtype. Or would you prefer Dtype? I'll push a test case once you confirm.

In [38]: df.info(null_counts=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #.  Column    dtype
---  ------    -----
 0   foo1      object
 1   foo2      float64
dtypes: float64(1), object(1)
memory usage: 168.0+ bytes

WillAyd · 2019-02-27T23:31:54Z

Closing as stale. Ping if you'd like to continue

rotuna · 2019-09-08T11:27:14Z

@WillAyd What was left to be done? I can try finishing it up

WillAyd · 2019-09-09T17:02:16Z

@rotuna if you'd like to pick this up, you can make your own branch off of this, merge in master, address the comments above and post your own PR.

Note we've had a lot of stylistic changes to our code since this was started (namely introducing black) so might be a large diff

gfyoung reviewed Aug 25, 2017

gfyoung added the Enhancement and Error Reporting (Incorrect or improved errors from pandas) labels Aug 25, 2017

pratapvardhan force-pushed the info branch from 5bb9ba3 to 234b042 Compare August 25, 2017 17:09

jreback requested changes Oct 28, 2017

pratapvardhan force-pushed the info branch from 234b042 to 98e275a Compare October 29, 2017 09:38

pratapvardhan force-pushed the info branch from 98e275a to 89a6a01 Compare January 6, 2018 17:00

jreback requested changes Jan 7, 2018

jorisvandenbossche requested changes Jul 8, 2018

ENH: pd.DataFrame.info() to show line numbers GH17304

ba8f01c

pratapvardhan force-pushed the info branch from 33872e4 to 1b26f7a Compare July 9, 2018 17:38

pratapvardhan force-pushed the info branch from 1b26f7a to ff70d33 Compare July 9, 2018 18:12

df.info column summary line numbers with header separator

ff70d33

jreback requested changes Jul 9, 2018

pratapvardhan force-pushed the info branch from 1ef63c8 to 7c13220 Compare July 10, 2018 04:54

TST: df.info(null_counts=False) header as Dtype

7c13220

WillAyd closed this Feb 27, 2019

WillAyd mentioned this pull request Oct 1, 2019

feature wanted: pd.DataFrame.info() should show line numbers #17304 #28696

Closed

5 tasks

WillAyd mentioned this pull request Oct 10, 2019

ENH: show numbers on .info() with verbose flag #28876

Merged

5 tasks
