
Add HTML repr for groupby dataframe and series #34926


Closed
wants to merge 35 commits into from


@JulianWgs JulianWgs commented Jun 22, 2020

After I saw the representation of a group by in Airtable, I wanted something similar in pandas. It always bothered me that there was no representation for a group by.

Update: 25.07.2020:
The screenshots below are old, but they still reflect the current look. A new addition is a limit on the number of groups displayed. I don't know how to make the display width the same across all groups. If anyone has an idea, I'd appreciate it.

Update: 26.07.2020:
I've added three tests:

  1. All rows in groups and all groups are shown
  2. All groups are shown, but not all rows in groups.
  3. All rows in groups are shown, but not all groups.

I'm unsure whether it is wise to put all cases into one test with an if statement that checks which of the above cases applies. Feedback welcome.

DataFrame:
Screenshot from 2020-06-22 03-13-44

Series:
Screenshot from 2020-06-22 03-14-46

This is a WIP implementation. Open issues are:

  • improve variable names

  • produce cleaner HTML code

  • limit the maximum number of group by DataFrames displayed (similar to the row limit)

  • improve look (Did not improve it, but I'm fine with it)

  • Same width across all groups

  • closes #xxxx

  • tests added / passed

  • passes black pandas

  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

  • whatsnew entry

Truncated groups with a maximum of 3 groups:
Screenshot from 2020-09-05 16-17-33

@simonjayhawkins
Member

@JulianWgs Are you still working on this?

@simonjayhawkins simonjayhawkins added the Output-Formatting __repr__ of pandas objects, to_string label Jul 24, 2020
@JulianWgs
Author

Is this something of interest?

  1. I'm not sure if the added complexity of the code is worth the cleaner HTML.
  2. Also, I would need some help on how to improve the visual quality (design hints, not necessarily technical help). The only thing I can think of is to set the column width equally across all tables.

@simonjayhawkins
Member

I would suggest that, if you are expecting feedback on the code, you mark the PR as ready for review. Alternatively, if you want feedback on the idea, first open a feature request issue (and fill out the template).

For 20,000 groups the function takes 260 ms ± 12.2 ms

- Alternative approach; how can the dots in the middle be added elegantly?

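from pandas._config import get_option


def _repr_html_(self) -> str:
    # Note: "display.max_groups" is the option proposed in this PR
    group_names = list(self.groups.keys())
    max_groups = get_option("display.max_groups")
    if max_groups < self.ngroups:
        n_start = (max_groups + 1) // 2
        n_end = max_groups - n_start
        # Slice from an explicit index: group_names[-0:] would keep everything
        group_names = group_names[:n_start] + group_names[len(group_names) - n_end :]
    # Split the row budget among the groups that are actually displayed
    max_rows = max(get_option("display.max_rows") // max(len(group_names), 1), 1)
    repr_html = ""
    for group_name in group_names:
        # self.groups maps keys to index labels; get_group returns the data
        group = self.get_group(group_name)
        if not hasattr(group, "to_html"):
            group = group.to_frame()
        repr_html += f"<h3>Group Key: {group_name}</h3>"
        repr_html += group.to_html(max_rows=max_rows)
    return repr_html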
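
One possible way to get the dots in the middle is to render the leading and trailing groups separately and join them with a placeholder. The following is a minimal sketch that uses only public pandas API; it is not the PR's implementation. The function name `groupby_to_html` and its `max_groups` and `max_rows` parameters are hypothetical stand-ins for the proposed display option.

```python
import html

import pandas as pd


def groupby_to_html(grouped, max_groups=10, max_rows=20):
    """Render each group as an HTML table, eliding the middle groups."""
    # Iterating a GroupBy yields (key, sub-frame) pairs in group order.
    items = list(grouped)
    elided = len(items) > max_groups
    if elided:
        n_start = (max_groups + 1) // 2
        n_end = max_groups - n_start
        # Slice from an explicit index: items[-0:] would return everything.
        head, tail = items[:n_start], items[len(items) - n_end:]
    else:
        head, tail = items, []

    def render(key, group):
        # A SeriesGroupBy yields Series, which have no to_html method.
        if isinstance(group, pd.Series):
            group = group.to_frame()
        title = f"<h3>Group Key: {html.escape(str(key))}</h3>"
        return title + group.to_html(max_rows=max_rows)

    parts = [render(key, group) for key, group in head]
    if elided:
        parts.append("<p>...</p>")  # placeholder for the hidden groups
    parts.extend(render(key, group) for key, group in tail)
    return "\n".join(parts)


# Four groups, but only three are shown: two from the start, one from the end.
df = pd.DataFrame({"key": list("aabbccdd"), "value": range(8)})
html_repr = groupby_to_html(df.groupby("key"), max_groups=3)
```

Keeping the placeholder as a separate list element means the head and tail rendering stay identical, and the marker can later be swapped for something styled without touching the slicing logic.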
@pep8speaks

pep8speaks commented Jul 25, 2020

Hello @JulianWgs! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-04 09:10:36 UTC

- takes 8.97 ms ± 352 µs for 20,000 groups and is not that dependent on the
number of groups
- max rows per group was calculated with the number of groups in the
groupby and did not consider the max_groups setting
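
To make the second point concrete: if the row budget is divided by the total number of groups rather than by the number of groups actually shown, each displayed group gets far fewer rows than it could. A small sketch of the arithmetic follows; the helper name `rows_per_group` and the numbers are illustrative assumptions, not code from the PR.

```python
def rows_per_group(max_rows, n_groups, max_groups):
    """Split a total row budget among the groups that are displayed."""
    n_displayed = min(n_groups, max_groups)
    # Show at least one row per group, even if the budget is small.
    return max(max_rows // max(n_displayed, 1), 1)


# With 20,000 groups, a 60-row budget and at most 10 groups shown:
naive = 60 // 20_000                       # divides by all groups -> 0
adjusted = rows_per_group(60, 20_000, 10)  # divides by displayed groups -> 6
```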
@JulianWgs JulianWgs marked this pull request as ready for review July 25, 2020 13:27
@JulianWgs
Author

@simonjayhawkins How do I fix the "ImportError: lxml not found, please install it" error in Travis CI (Link)?

@JulianWgs
Author

@simonjayhawkins


@simonjayhawkins
Member

sorry. will look soon. in the meantime merging upstream/master to resolve conflicts may help

@JulianWgs
Author

The problem was that the lxml dependency was missing. I've added @td.skip_if_no("lxml") to skip the test if it is missing. This is also done here.

Please review the PR :)
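
`td.skip_if_no` is a helper from pandas' internal test decorators. As a rough standard-library analogue (a sketch of the same idea, not what pandas does internally), a test can be skipped when an optional dependency is missing by probing for it with `importlib.util.find_spec`. The test class and method names below are hypothetical.

```python
import importlib.util
import unittest

# True when the optional parser can be imported in this environment.
HAS_LXML = importlib.util.find_spec("lxml") is not None


class TestGroupbyHtmlRepr(unittest.TestCase):
    @unittest.skipUnless(HAS_LXML, "lxml not found, please install it")
    def test_cell_can_be_parsed(self):
        # Imported inside the test so collection never fails without lxml.
        from lxml import etree

        root = etree.fromstring("<td>1</td>")
        self.assertEqual(root.text, "1")


# Run the case programmatically; it is reported as skipped, not failed,
# when lxml is unavailable.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGroupbyHtmlRepr)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```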

@jreback
Contributor

jreback commented Sep 5, 2020

can you show what this does for a truncated repr? e.g. > 10 groups.

@jreback jreback added Groupby IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Sep 5, 2020
@JulianWgs
Author

> can you show what this does for a truncated repr? e.g. > 10 groups.

I've updated the original pull request text accordingly

@@ -548,6 +549,29 @@ def __repr__(self) -> str:
        # TODO: Better repr for GroupBy object
        return object.__repr__(self)

    def _repr_html_(self) -> str:
Contributor

I'd like to reorg this to use a GroupbyFormatter located in pandas/io/formats/groupby.py (it can do pretty much this but just locate the code there) as this is where we keep all of the formatting code.

could also add a .to_string() method but not sure that's actually worth it (maybe open an issue for that).

Author

Thanks for the review! Do you mean pandas/io/formats/html.py? Should I add a new function and then just call that function from the above location?

Contributor

no, i mean pandas/io/formats/format.py (it's fine to just shove it in there; we should split that file up, but that's for later).

@rhshadrach
Member

@jreback: Any thoughts on whether tests for groupby._repr_html_ should be in groupby tests or io.formats.test_to_html (or somewhere else)?

Member

@rhshadrach rhshadrach left a comment

I don't think checking only the first/last row/group is sufficient; we should be checking that exactly the right ones are included.

It seems feasible to combine these into one test by parametrizing the test with n_groups, n_rows, check_n_rows, check_n_groups (so the number of rows/groups to check is hard coded as a parameter), or names of that sort. If necessary, can use something like "all" if separate logic is needed when no rows/groups are hidden.
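
A sketch of the parametrization suggested above, written as a plain table of cases so that it runs without any test framework; in the pandas test suite the same tuples would normally be passed to `pytest.mark.parametrize`. The helper `displayed_positions` and the concrete numbers are hypothetical and only mimic a first/last truncation rule.

```python
def displayed_positions(n_items, limit):
    """Positions kept when only the first and last items are shown."""
    if n_items <= limit:
        return list(range(n_items))
    n_start = (limit + 1) // 2
    n_end = limit - n_start
    return list(range(n_start)) + list(range(n_items - n_end, n_items))


# (n_groups, max_groups, expected group positions): the expectation is
# hard coded so every displayed group is checked, not only the first/last.
CASES = [
    (3, 5, [0, 1, 2]),        # nothing hidden
    (5, 5, [0, 1, 2, 3, 4]),  # exactly at the limit
    (10, 3, [0, 1, 9]),       # middle groups hidden
    (10, 4, [0, 1, 8, 9]),
]


def check_case(n_groups, max_groups, expected):
    assert displayed_positions(n_groups, max_groups) == expected


for case in CASES:
    check_case(*case)
```

The same table shape works for rows within a group: add `n_rows` and the expected row positions as further tuple fields.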

for k, (group_name, df_group) in enumerate(df_groupby):
    dtype = df_group.iloc[0].dtype
    tm.assert_series_equal(
        dfs_from_html[k].iloc[0].astype(dtype), df_group.iloc[0], check_names=False
Member

Why is the astype and check_names necessary here? (May be worth a comment)

Author

New code should be much clearer :) HTML is parsed as string type and needs to be converted to integer manually
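
To illustrate why both arguments can be needed: values recovered from HTML text may start out as strings, and a row selected with `iloc` carries its index label as the Series name. A minimal sketch using the public `pandas.testing` module (aliased as `tm` only to match the excerpt above), with made-up data standing in for the parsed HTML:

```python
import pandas as pd
import pandas.testing as tm

# Row taken from the original frame: integer values, name is the index label.
expected = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[10, 11]).iloc[0]

# Stand-in for the same row recovered from HTML text: the cells are strings
# and the parsed frame has a fresh default index, so the name differs.
parsed = pd.DataFrame({"a": ["1", "2"], "b": ["3", "4"]}).iloc[0]

# Cast back to the original dtype and ignore the differing Series names.
tm.assert_series_equal(parsed.astype(expected.dtype), expected, check_names=False)
```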

@@ -2020,3 +2020,39 @@ def buffer_put_lines(buf: IO[str], lines: List[str]) -> None:
    if any(isinstance(x, str) for x in lines):
        lines = [str(x) for x in lines]
    buf.write("\n".join(lines))


def repr_html_groupby(group_obj) -> str:
Contributor

this should use the same machinery as DataFrameFormatter/DataFrameRenderer (subclass as appropriate), which was recently changed.

Author

Sorry for the long inactivity. I don't understand how I would use the DataFrameFormatter. Is there documentation on this?

Member

You can find examples in pandas/io/formats/latex.py and in other IO methods (grepping for DataFrameFormatter will get you the lot)
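
For orientation, the rough shape of how these two classes fit together is sketched below. `DataFrameFormatter` and `DataFrameRenderer` live in the private module `pandas.io.formats.format`; they are not part of the public API and their signatures may differ between pandas versions, so treat this as an assumption-laden illustration rather than a recommended call. The function name `frame_to_html` is hypothetical.

```python
import pandas as pd
from pandas.io.formats import format as fmt  # private module, may change


def frame_to_html(frame, max_rows=None):
    """Render one frame via the formatter/renderer pair."""
    # The formatter holds the display settings (truncation, NaN text, ...).
    formatter = fmt.DataFrameFormatter(frame, max_rows=max_rows)
    # The renderer turns a configured formatter into an output format;
    # with no buffer given, to_html returns the markup as a string.
    return fmt.DataFrameRenderer(formatter).to_html()


table = frame_to_html(pd.DataFrame({"a": [1, 2, 3]}), max_rows=2)
```

A groupby-specific formatter could presumably reuse this pair once per displayed group, which would keep truncation and escaping consistent with the regular DataFrame repr.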

Author

Sorry for coming back to this again, but I really don't understand what code I should change or how. Could you tell me which line in my code I have to rewrite?

Author

Is this still relevant? I still would need some guidance :)

Member

@arw2019 arw2019 left a comment

@JulianWgs can you address the comments? There's also a minor pre-commit failure.

@github-actions
Contributor

github-actions bot commented Jan 2, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 2, 2021
- combine all previous test cases into one parametrized test case
Member

@arw2019 arw2019 left a comment

Thanks for picking this up!

In addition to addressing comments you'll want to merge master to make sure that this is up to date, and add a whatsnew (1.3, under enhancements)


Member

@arw2019 arw2019 left a comment

One more comment - we might want an example for this in the groupby section of the user guide (but that could be left for a follow-on too)

@simonjayhawkins
Member

@JulianWgs can you resolve conflicts (and move release note to 1.4)

@JulianWgs
Author

@simonjayhawkins Done :)

@jreback
Contributor

jreback commented Nov 28, 2021

this is quite old, happy to reopen if actively worked on.

Labels: Groupby; IO HTML (read_html, to_html, Styler.apply, Styler.applymap); Output-Formatting (__repr__ of pandas objects, to_string); Stale

6 participants