ENH: pd.DataFrame.info() to show line numbers GH17304 #17332
pandas/core/frame.py
Outdated
counts = None

tmpl = "%s%s"
if show_counts:
counts = self.count()
if len(cols) != len(counts): # pragma: no cover
raise AssertionError('Columns must equal counts (%d != %d)'
% (len(cols), len(counts)))
% (cols_count, len(counts)))
Let's use .format string formatting instead.
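For illustration, here is how the two styles compare on the assertion message from the diff above. The values of cols_count and counts are made up for the example, and the field names count and lencount are hypothetical; only the message text comes from the PR.

```python
# Hypothetical values, chosen only to show the two formatting styles.
cols_count = 3
counts = [4, 4]

# Old style: printf-like % interpolation, as in the diff under review.
old_msg = 'Columns must equal counts (%d != %d)' % (cols_count, len(counts))

# Requested style: str.format with named fields (field names are assumptions).
new_msg = 'Columns must equal counts ({count:d} != {lencount:d})'.format(
    count=cols_count, lencount=len(counts))

print(new_msg)
```

Both calls should produce the same string, so the change is purely stylistic.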
pandas/core/frame.py
Outdated
dtype = dtypes.iloc[i]
col = pprint_thing(col)

line_no = ("%d. " % (i + 1)).rjust(space_num)
Same as above.
pandas/core/frame.py
Outdated
count = ""
if show_counts:
count = counts.iloc[i]

lines.append(_put_str(col, space) + tmpl % (count, dtype))
lines.append(line_no + _put_str(col, space) +
tmpl % (count, dtype))
Same as above.
pandas/tests/frame/test_repr_info.py
Outdated
assert 'a 1 non-null int64\n' == lines[3]
assert 'a 1 non-null float64\n' == lines[4]
assert '1. a 1 non-null int64\n' == lines[3]
assert '2. a 1 non-null float64\n' == lines[4]
@bashtage : Do you still prefer 0-indexing? I'm actually okay with 1-indexing because 0-indexing may not be as intuitive for non-purist programmers like ourselves 😉
I agree, I'm in favor of 1-indexing usage for .info() too.
From a purist's point of view, I think 1-indexing is more of a loss of information, since 0 is meaningful while 1 is not.
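One way to read this point: a 0-based number in the summary is the same position you would pass to positional indexing (such as .iloc), whereas a 1-based number has to be shifted by one first. A plain-Python sketch, with made-up column names and a list standing in for the DataFrame's columns:

```python
# Hypothetical column names; a plain list stands in for DataFrame.columns.
columns = ['int_col', 'text_col', 'float_col']

# 0-based numbering: the printed number is directly usable as a position.
zero_based = ['{} {}'.format(i, name) for i, name in enumerate(columns)]

# 1-based numbering: the reader must subtract 1 before indexing.
one_based = ['{}. {}'.format(i, name)
             for i, name in enumerate(columns, start=1)]

# Look a column up again from the number shown in the 0-based listing.
shown = int(zero_based[2].split()[0])
print(columns[shown])
```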
Thanks guys for working on this feature :). Is it available in the new version of pandas?
pandas/core/frame.py
Outdated
lines.append('Data columns (total %d columns):' %
len(self.columns))
space = max([len(pprint_thing(k)) for k in self.columns]) + 4
lines.append('Data columns (total %d columns):' % cols_count)
Same as above.
Thanks for the PR! Just address the test failures and add a |
I find 1 based indexing confusing since Maybe a header:
|
Codecov Report
@@            Coverage Diff             @@
##           master   #17332      +/-   ##
==========================================
- Coverage   91.01%   90.99%   -0.02%
==========================================
  Files         162      162
  Lines       49567    49570       +3
==========================================
- Hits        45113    45107       -6
- Misses       4454     4463       +9
Codecov Report
@@            Coverage Diff             @@
##           master   #17332      +/-   ##
==========================================
+ Coverage   91.92%   91.92%   +<.01%
==========================================
  Files         160      160
  Lines       49913    49926      +13
==========================================
+ Hits        45882    45895      +13
  Misses       4031     4031
@gfyoung -- Changes have been pushed, any thoughts?
@pratapvardhan if you want to rebase and revise to a style more like the one @bashtage proposed in #17332 (comment), we can proceed.
also move the whatsnew to 0.22
Have you checked with very long column names and all possible dtypes? Just wondering if it is always readable on, say, 80 columns.
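A rough way to check this without running pandas: the diff sizes the name column as the longest column name plus 4, so the line width grows with the longest name. The sketch below applies that rule to made-up names and dtypes. The render_line helper and the number-column width of 5 are assumptions for illustration, not the PR's code.

```python
# Hypothetical column names and dtypes, not taken from the PR.
columns = ['a' * 10, 'b' * 45, 'c' * 70]
dtypes = ['float64', 'datetime64[ns]', 'object']

# Same sizing rule as the diff: longest name plus 4 spaces of padding.
space = max(len(name) for name in columns) + 4


def render_line(i, name, count, dtype):
    """Build one summary line; the layout here is an assumption."""
    return '{:<5}{}{} non-null {}'.format(i, name.ljust(space), count, dtype)


lines = [render_line(i, n, 4, d)
         for i, (n, d) in enumerate(zip(columns, dtypes))]

# Collect the lines that would not fit in an 80-column terminal.
too_wide = [line for line in lines if len(line) > 80]
print('{} of {} lines exceed 80 characters'.format(len(too_wide), len(lines)))
```

Because every row is padded to the longest name, a single very long column name pushes all rows past the limit, not just its own.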
@bashtage -- some samples
space constructor takes care of column character length and verbosity defaults to provided settings.
can you rebase / update
I've rebased, not sure what is causing the lint issue. The diff seems flake8 fine.
pushed a commit to fix
@jreback -- Thanks for that.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -462,3 +462,4 @@ Other

- Improved error message when attempting to use a Python keyword as an identifier in a ``numexpr`` backed query (:issue:`18221`)
- :func:`Timestamp.replace` will now handle Daylight Savings transitions gracefully (:issue:`18319`)
- :func:`DataFrame.info()` now shows line numbers for column summary (:issue:`17304`)
this will need a sub-section for this to show as this is a major change to output formatting
put this in new features
@@ -239,8 +239,8 @@ def test_info_duplicate_columns_shows_correct_dtypes(self):
frame.info(buf=io)
io.seek(0)
lines = io.readlines()
assert 'a 1 non-null int64\n' == lines[3]
can you add an explicit test for this, one that specifically checks the formatting (a bit duplicative of this one, but I'd like it separate)
can you show the output of a sample
pandas/core/frame.py
Outdated
counts = None

tmpl = "%s%s"
header = _put_str('Index', space_num) + _put_str('Column', space)
Maybe call this 'N' rather than 'Index'. Also can you put a line of '-' after the header, formatted the same
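A minimal sketch of what this suggestion could look like: the header and the dash line are built from the same titles and widths, so they always line up. Here put_str is a stand-in for the PR's _put_str (its exact behaviour is assumed), and the column names and the space_num value are made up.

```python
def put_str(s, space):
    """Truncate to `space` characters and left-justify (assumed behaviour)."""
    return '{}'.format(s)[:space].ljust(space)


# Hypothetical inputs for the example.
columns = ['A', 'B', 'C']
space_num = 5
# Wide enough for the longest name and for the 'Column' title itself.
space = max(max(len(c) for c in columns), len('Column')) + 4

titles = [('N', space_num), ('Column', space)]

# Header and dash line share the same widths, so they stay aligned.
header = ''.join(put_str(title, width) for title, width in titles)
dashes = ''.join(put_str('-' * len(title), width) for title, width in titles)

print(header)
print(dashes)
```

Deriving the dash line from the titles, rather than hard-coding it, keeps the two lines in sync if a title is renamed later.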
can you rebase and update
this looked ok if we can come back and rebase it @pandas-dev/pandas-core
We shouldn't use "Index" to indicate the number, that's a bit confusing with the other meaning of that.
@jreback @jorisvandenbossche -- using
|
What about iloc instead of #.? If staying with # the . seems unnecessary.
…On Mon, Jul 9, 2018, 19:03 Pratap Vardhan ***@***.***> wrote:
@jreback <https://github.com/jreback> @jorisvandenbossche <https://github.com/jorisvandenbossche> -- using #. instead of Index.
Below are the sample outputs. Moved this to the 0.24.0 whatsnew section
In [20]: df = pd.DataFrame(np.random.rand(4, 10), columns=[
...: '%s%s' % (x, np.random.randint(2, 50)*'a') for x in range(10)])
In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 10 columns):
#. Column Non-Null Count
--- ------ --------------
0 0aaaaaaaaaaaaaa 4 non-null float64
1 1aaaaa 4 non-null float64
2 2aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 4 non-null float64
3 3aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 4 non-null float64
4 4aaaaaaaaaaaaaaaaaaaa 4 non-null float64
5 5aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 4 non-null float64
6 6aaaaaaaaaaaaa 4 non-null float64
7 7aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 4 non-null float64
8 8aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 4 non-null float64
9 9aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa 4 non-null float64
dtypes: float64(10)
memory usage: 392.0 bytes
In [22]: df = pd.DataFrame(np.random.rand(4,4), columns=['%s' % x for x in range(4)])
In [23]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
#.   Column  Non-Null Count
---  ------  --------------
0    0       4 non-null      float64
1    1       4 non-null      float64
2    2       4 non-null      float64
3    3       4 non-null      float64
dtypes: float64(4)
memory usage: 200.0 bytes
In [24]: pd.DataFrame({'A': np.random.randn(5),
...: 'B': pd.date_range('1/1/2000', periods=5),
...: 'C': ['c']*5,
...: 'D': [1]*5}).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
#.   Column  Non-Null Count
---  ------  --------------
0    A       5 non-null      float64
1    B       5 non-null      datetime64[ns]
2    C       5 non-null      object
3    D       5 non-null      int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 232.0+ bytes
doc/source/whatsnew/v0.24.0.txt
Outdated
Output Formatting Enhancements
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- `df.info()` now shows line numbers for the columns summary (:issue:`17304`)
you can use a :func:`DataFrame.info` here
pandas/core/frame.py
Outdated
int_col 5 non-null int64
text_col 5 non-null object
float_col 5 non-null float64
#. Column Non-Null Count
maybe want to call this Non-Null Count & Dtype?
will need to adjust this header if null_counts=False is passed to .info() (add a test as well)
for df.info(null_counts=False) the header will have dtype, or would you prefer Dtype -- is that fine? I'll push a test case once you confirm.
In [38]: df.info(null_counts=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
#.   Column  dtype
---  ------  -----
0    foo1    object
1    foo2    float64
dtypes: float64(1), object(1)
memory usage: 168.0+ bytes
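The header adjustment discussed here could be sketched as follows. This is only an illustration: build_header is a hypothetical helper, the column widths are made up, and show_counts mirrors the internal flag seen in the diff rather than the public null_counts argument.

```python
def build_header(show_counts):
    """Return (header, separator) lines; titles and widths are assumptions."""
    titles = ['#', 'Column']
    widths = [5, 10]
    if show_counts:
        # Counts requested: include the Non-Null Count column.
        titles += ['Non-Null Count', 'Dtype']
        widths += [16, 0]
    else:
        # Counts suppressed: only the dtype column remains.
        titles.append('Dtype')
        widths.append(0)
    header = ''.join(t.ljust(w) for t, w in zip(titles, widths))
    dashes = ''.join(('-' * len(t)).ljust(w) for t, w in zip(titles, widths))
    return header, dashes


for flag in (True, False):
    header, dashes = build_header(flag)
    print(header)
    print(dashes)
```

A test for this behaviour would then only need to check that the Non-Null Count title is absent when counts are turned off.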
Closing as stale. Ping if you'd like to continue
@WillAyd What was left to be done? I can try finishing it up
@rotuna if you'd like to pick this up you can make your own branch off of this, merge in master, address the comments above and post your own PR. Note we've had a lot of stylistic changes to our code since this was started (namely introducing black), so it might be a large diff
Refactored to self.columns and len(self.columns)
New output