Add HTML repr for groupby dataframe and series #34926
@JulianWgs Are you still working on this?
Is this something of interest?
I would suggest that if you were expecting feedback on the code, you mark the PR as ready for review. Alternatively, if you want feedback on the idea, first open a feature request issue (and fill out the template).
For 20,000 groups the function takes 260 ms ± 12.2 ms.
Alternative approach; how can the dots in the middle be added elegantly?

    def _repr_html_(self) -> str:
        group_names = list(self.groups.keys())
        max_groups = get_option("display.max_groups")
        # keep only the first and last groups when there are too many
        if max_groups < self.ngroups:
            n_start = (max_groups + 1) // 2
            n_end = max_groups - n_start
            group_names = group_names[:n_start] + group_names[-n_end:]
        repr_html = ""
        for group_name in group_names:
            group = self.groups[group_name]
            if not hasattr(group, "to_html"):
                group = group.to_frame()
            repr_html += f"<H3>Group Key: {group_name}</H3>"
            repr_html += group.to_html(
                max_rows=get_option("display.max_rows") // self.ngroups
            )
        return repr_html
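One way to get the dots into the middle is to put an explicit placeholder between the head and tail keys and let the renderer turn that placeholder into an ellipsis row. The following is a minimal sketch in plain Python, independent of pandas; `truncate_keys` is a hypothetical helper, not existing pandas API, and using Python's built-in `...` (`Ellipsis`) as the marker is an assumption made for illustration. The sketch also avoids slicing with `-n_end`, since `keys[-0:]` returns the whole list rather than an empty one when `n_end` is 0.

```python
from typing import List


def truncate_keys(keys: List[object], max_groups: int) -> List[object]:
    """Return the head and tail of `keys` with an Ellipsis marker in between.

    Hypothetical helper for illustration; assumes max_groups >= 0.
    """
    if len(keys) <= max_groups:
        return list(keys)
    n_start = (max_groups + 1) // 2
    n_end = max_groups - n_start
    head = keys[:n_start]
    # Slice from an explicit start index so that n_end == 0 yields an empty tail.
    tail = keys[len(keys) - n_end:]
    return head + [...] + tail


print(truncate_keys(list("abcdefg"), 3))  # ['a', 'b', Ellipsis, 'g']
```

A renderer could then emit a "..." heading whenever it meets the marker instead of looking up a group.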
Hello @JulianWgs! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-07-04 09:10:36 UTC
- takes 8.97 ms ± 352 µs for 20,000 groups and is not that dependent on the number of groups
- max rows per group was calculated with the number of groups in the groupby and did not consider the max_groups setting
- load dataframes from html output and compare them
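The row-budget fix described in the commit notes above can be illustrated in isolation. This is a sketch, not the PR's actual code: `rows_per_group` is a hypothetical helper, and the floor of one row per displayed group is an assumption made here.

```python
def rows_per_group(max_rows: int, n_groups: int, max_groups: int) -> int:
    """Split the overall row budget across the groups that are actually shown."""
    shown = min(n_groups, max_groups)
    if shown <= 0:
        return max_rows
    # Assumption: always show at least one row per displayed group.
    return max(1, max_rows // shown)


# 20,000 groups, of which only 10 are displayed: each shown group gets 6 rows.
print(rows_per_group(max_rows=60, n_groups=20_000, max_groups=10))  # 6
# Dividing by the total number of groups instead gives 60 // 20_000 == 0.
```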
@simonjayhawkins How do I fix the "ImportError: lxml not found, please install it" error in Travis CI?
sorry. will look soon. in the meantime merging upstream/master to resolve conflicts may help
can you show what this does for a truncated repr? e.g. > 10 groups.
I've updated the original pull request text accordingly
@@ -548,6 +549,29 @@ def __repr__(self) -> str:
        # TODO: Better repr for GroupBy object
        return object.__repr__(self)

    def _repr_html_(self) -> str:
I'd like to reorg this to use a GroupbyFormatter located in pandas/io/formats/groupby.py (it can do pretty much this but just locate the code there) as this is where we keep all of the formatting code.
could also add a .to_string() method but not sure that's actually worth it (maybe open an issue for that).
Thanks for the review! Do you mean pandas/io/formats/html.py? Should I add a new function and then just call that function from the above location?
no, i mean pandas/io/formats/format.py (ok to just shove it in there; we should split that file up but that's for later).
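The separation being asked for, formatting logic in a formatter and `_repr_html_` only delegating to it, might be sketched as follows. Everything here is hypothetical and written in plain Python so that it runs without pandas: `GroupByHTMLFormatter` and `FakeGroupBy` are illustrative names, not pandas classes, and the real implementation would build on pandas' own formatter machinery instead.

```python
from html import escape
from typing import Dict, List


class GroupByHTMLFormatter:
    """Hypothetical formatter that owns all HTML generation (not part of pandas)."""

    def __init__(self, groups: Dict[str, List[List[object]]]) -> None:
        self.groups = groups

    def _table(self, rows: List[List[object]]) -> str:
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in rows
        )
        return f"<table>{body}</table>"

    def to_html(self) -> str:
        parts = []
        for name, rows in self.groups.items():
            parts.append(f"<h3>Group Key: {escape(str(name))}</h3>")
            parts.append(self._table(rows))
        return "".join(parts)


class FakeGroupBy:
    """Stand-in for a GroupBy object; it only delegates to the formatter."""

    def __init__(self, groups: Dict[str, List[List[object]]]) -> None:
        self._groups = groups

    def _repr_html_(self) -> str:
        return GroupByHTMLFormatter(self._groups).to_html()


rendered = FakeGroupBy({"a": [[1, 2]], "b": [[3, 4]]})._repr_html_()
print(rendered)
```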
@jreback: Any thoughts on whether tests for |
- add link to Github Pull Request
- why only the first/last rows/groups are tested
I don't think checking only the first/last row/group is sufficient; we should be checking that exactly the right ones are included.
It seems feasible to combine these into one test by parametrizing the test with n_groups, n_rows, check_n_rows, check_n_groups (so the number of rows/groups to check is hard coded as a parameter), or names of that sort. If necessary, can use something like "all" if separate logic is needed when no rows/groups are hidden.
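The parametrization suggested above might look like the following. This is a sketch in plain Python so that it runs without pandas or pytest; `visible_positions` is a hypothetical helper that mirrors the head/tail split discussed earlier in this thread, and the `CASES` tuples are the kind of hard-coded values that could be handed to `pytest.mark.parametrize`.

```python
from typing import List, Tuple


def visible_positions(n_items: int, max_items: int) -> List[int]:
    """Positions of the rows or groups expected to survive truncation."""
    if n_items <= max_items:
        return list(range(n_items))
    n_start = (max_items + 1) // 2
    n_end = max_items - n_start
    return list(range(n_start)) + list(range(n_items - n_end, n_items))


# (n_items, max_items, expected positions), hard coded per case
CASES: List[Tuple[int, int, List[int]]] = [
    (3, 5, [0, 1, 2]),        # nothing hidden
    (5, 5, [0, 1, 2, 3, 4]),  # exactly at the limit
    (10, 3, [0, 1, 9]),       # odd limit: the extra item goes to the head
    (10, 4, [0, 1, 8, 9]),    # even limit: symmetric split
]

for n_items, max_items, expected in CASES:
    assert visible_positions(n_items, max_items) == expected
```

Checking the full list of expected positions, rather than only the first and last one, is what verifies that exactly the right rows or groups are included.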
pandas/tests/groupby/test_groupby.py (outdated)

    for k, (group_name, df_group) in enumerate(df_groupby):
        dtype = df_group.iloc[0].dtype
        tm.assert_series_equal(
            dfs_from_html[k].iloc[0].astype(dtype), df_group.iloc[0], check_names=False
Why is the astype and check_names necessary here? (May be worth a comment)
New code should be much clearer :) HTML is parsed as string type and needs to be converted to integer manually.
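The conversion mentioned here can be illustrated without pandas. The sketch below uses only the standard library's `html.parser`; `CellCollector` is a hypothetical helper and the sample table is made up for illustration. It shows that cell values come back as strings and have to be cast before they compare equal to integer data.

```python
from html.parser import HTMLParser
from typing import List


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, row by row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


html = "<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>"
parser = CellCollector()
parser.feed(html)
parser.close()

as_strings = parser.rows
as_ints = [[int(cell) for cell in row] for row in as_strings]

assert as_strings != [[1, 2], [3, 4]]  # strings do not equal integers
assert as_ints == [[1, 2], [3, 4]]     # equal only after the explicit cast
```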
@@ -2020,3 +2020,39 @@ def buffer_put_lines(buf: IO[str], lines: List[str]) -> None:
    if any(isinstance(x, str) for x in lines):
        lines = [str(x) for x in lines]
    buf.write("\n".join(lines))


def repr_html_groupby(group_obj) -> str:
this should use the same machinery as DataFrameFormatter/DataFrameRenderer (subclass as appropriate), which was recently changed.
Sorry for the long inactivity. I don't understand how I would use the DataFrameFormatter. Is there documentation on this?
You can find examples in pandas/io/formats/latex.py and in other IO methods (grepping for DataFrameFormatter will get you the lot)
Sorry for coming back to this again, but I really don't understand what code I should change or how. Could you tell me which line in my code I have to rewrite?
Is this still relevant? I still would need some guidance :)
@JulianWgs can you address the comments? There's also a minor pre-commit failure.
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
- combine all previous test cases into one parametrized test case
Thanks for picking this up!
In addition to addressing comments you'll want to merge master to make sure that this is up to date, and add a whatsnew (1.3, under enhancements)
One more comment: we might want an example for this in the groupby section of the user guide (but that could be left for a follow-on too).
@JulianWgs can you resolve conflicts (and move release note to 1.4)
@simonjayhawkins Done :)
this is quite old, happy to reopen if actively worked on.
After I saw the representation of a group by in Airtable, I wanted something similar in pandas. It always bothered me that there was no representation for a group by.
Update: 25.07.2020:
These are old screenshots, but they still resemble the current look. A new addition is the limit on the number of groups displayed. I don't know how to make the display width the same across all groups. If anyone has an idea, I'd appreciate it.
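One possible direction for the width question, offered as an unverified idea rather than a tested solution for notebook front ends: give every group's table the same inline CSS width with a fixed table layout. The sketch below only manipulates HTML strings; `with_fixed_width` is a hypothetical helper, and the sample tables are made up to resemble `to_html` output.

```python
from typing import List


def with_fixed_width(table_html: str, width: str = "100%") -> str:
    """Add the same inline width style to the first <table> tag in the markup."""
    style = f'style="width: {width}; table-layout: fixed;"'
    return table_html.replace("<table", f"<table {style}", 1)


tables: List[str] = [
    '<table border="1" class="dataframe"><tr><td>1</td></tr></table>',
    '<table border="1" class="dataframe"><tr><td>1</td><td>22222</td></tr></table>',
]
uniform = [with_fixed_width(t) for t in tables]
print(uniform[0])
```

Whether a given front end honours the inline style would need to be checked against the actual rendering.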
Update: 26.07.2020:
I've added three tests:
I am unsure whether it is wise to put all cases into one test with an if statement that checks which of the above cases applies. Feedback welcome.
DataFrame:

Series:

This is a WIP implementation. Open issues are:
- improve variable names
- produce cleaner HTML code
- limit the maximum number of group by DataFrames displayed (similar to the row limit)
- improve look (did not improve it, but I'm fine with it)
- same width across all groups
- closes #xxxx
- tests added / passed
- passes black pandas
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
Truncated groups with a maximum of 3 groups:
