Add HTML repr for groupby dataframe and series #34926

JulianWgs · 2020-06-22T01:24:35Z

After I saw the representation of a group by in Airtable, I wanted something similar in pandas. It always bothered me that there was no representation for a group by.

Update: 25.07.2020:
These are old screenshots, but they still reflect the current look. A new addition is a limit on the number of groups displayed. I don't know how to make the display width the same across all groups. If anyone has an idea, I'd appreciate it.

Update: 26.07.2020:
I've added three tests:

All rows in groups and all groups are shown.
All groups are shown, but not all rows in groups.
All rows in groups are shown, but not all groups.

I'm unsure whether it is wise to put all cases into one test with an if statement that checks which of the above cases applies. Feedback welcome.

DataFrame:

Series:

This is a WIP implementation. Open issues are:

Truncated groups with a maximum of 3 groups:

simonjayhawkins · 2020-07-24T11:48:48Z

@JulianWgs Are you still working on this?

JulianWgs · 2020-07-24T13:42:29Z

Is this something of interest?

I'm not sure if the added complexity of the code is worth the cleaner HTML.
Also, I would need some help on how to improve the visual quality (design hints, not necessarily technical help). The only thing I can think of is to set the column width equally across all tables.

simonjayhawkins · 2020-07-24T14:00:29Z

I would suggest that if you are expecting feedback on the code, you mark the PR as ready for review. Alternatively, if you want feedback on the idea, first open a feature request issue (and fill out the template).

- For 20,000 groups the function takes 260 ms ± 12.2 ms
- Alternative approach, how to get in the dots in the middle elegantly?

    def _repr_html_(self) -> str:
        group_names = list(self.groups.keys())
        max_groups = get_option("display.max_groups")
        if max_groups < self.ngroups:
            n_start = (max_groups + 1) // 2
            n_end = max_groups - n_start
            group_names = group_names[:n_start] + group_names[-n_end:]
        repr_html = ""
        for group_name in group_names:
            group = self.groups[group_name]
            if not hasattr(group, "to_html"):
                group = group.to_frame()
            repr_html += f"<H3>Group Key: {group_name}<H3/>"
            repr_html += group.to_html(
                max_rows=get_option("display.max_rows") // self.ngroups
            )
        return repr_html
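One possible answer to the "dots in the middle" question is to keep the head/tail split from the snippet above but insert a sentinel between the two halves, so the renderer can emit an ellipsis row at that position. The following is a minimal sketch, not the PR's code: `GAP`, `truncate_group_names` and `render_headings` are hypothetical names, and plain strings stand in for real group keys. It also guards the case `n_end == 0`, where a slice such as `names[-0:]` would return the whole list instead of nothing.

```python
from html import escape
from typing import Hashable, List, Sequence

GAP = object()  # hypothetical sentinel marking where groups were left out


def truncate_group_names(names: Sequence[Hashable], max_groups: int) -> List[object]:
    """Keep the first and last group names and mark the omitted middle."""
    names = list(names)
    if max_groups >= len(names):
        return names  # nothing to hide
    n_start = (max_groups + 1) // 2  # the head gets the extra slot when odd
    n_end = max_groups - n_start
    # names[-0:] would be the full list, so handle n_end == 0 explicitly.
    tail = names[-n_end:] if n_end > 0 else []
    return names[:n_start] + [GAP] + tail


def render_headings(names: List[object]) -> str:
    """Render one heading per kept group and a row of dots for the gap."""
    parts = []
    for name in names:
        if name is GAP:
            parts.append("<p>...</p>")
        else:
            parts.append(f"<h3>Group Key: {escape(str(name))}</h3>")
    return "\n".join(parts)


shown = truncate_group_names(list("abcdefg"), 3)
headings = render_headings(shown)
```

Using a dedicated sentinel object rather than `None` or a string avoids clashing with a group whose key happens to have that value.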

pep8speaks · 2020-07-25T12:47:12Z

Hello @JulianWgs! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-04 09:10:36 UTC

JulianWgs · 2020-07-28T08:36:10Z

@simonjayhawkins How do I fix the "ImportError: lxml not found, please install it" error in Travis CI (Link)?

JulianWgs · 2020-08-11T07:34:15Z

@simonjayhawkins

JulianWgs · 2020-08-20T13:52:01Z

@simonjayhawkins

simonjayhawkins · 2020-08-20T14:05:13Z

sorry. will look soon. in the meantime merging upstream/master to resolve conflicts may help

JulianWgs · 2020-09-03T18:46:50Z

The problem was that the lxml dependency was missing. I've added @td.skip_if_no("lxml") to skip the test if it is missing. This is also done here.
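The idea of skipping a test when an optional dependency is absent can be sketched with the standard library alone. This is a stand-in for illustration, not pandas' `td.skip_if_no` decorator: `has_module` and `HtmlDependentTest` are hypothetical names, and `unittest.skipUnless` plays the role of the skip marker. Under pytest, `pytest.importorskip("lxml")` serves a similar purpose.

```python
import importlib.util
import unittest


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


class HtmlDependentTest(unittest.TestCase):
    # The test is reported as skipped, not failed, when lxml is missing.
    @unittest.skipUnless(has_module("lxml"), "lxml is not installed")
    def test_needs_lxml(self):
        import lxml  # noqa: F401  (only reached when lxml is available)

        self.assertTrue(has_module("lxml"))
```

A skipped test still shows up in the report, which makes it visible that part of the suite did not run on that CI job.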

Please review the PR :)

jreback · 2020-09-05T03:35:32Z

can you show what this does for a truncated repr? e.g. > 10 groups.

JulianWgs · 2020-09-05T14:19:36Z

> can you show what this does for a truncated repr? e.g. > 10 groups.

I've updated the original pull request text accordingly

jreback · 2020-09-05T14:45:12Z

pandas/core/groupby/groupby.py

@@ -548,6 +549,29 @@ def __repr__(self) -> str:
        # TODO: Better repr for GroupBy object
        return object.__repr__(self)

+    def _repr_html_(self) -> str:


I'd like to reorg this to use a GroupbyFormatter located in pandas/io/formats/groupby.py (it can do pretty much this but just locate the code there) as this is where we keep all of the formatting code.

could also add a .to_string() method but not sure that's actually worth it (maybe open an issue for that).

Thanks for the review! Do you mean pandas/io/formats/html.py? Should I add a new function and then just call that function from the above location?

no, i mean pandas/io/formats/format.py (it's fine to just shove it in there; we should split that file up, but that's for later).
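The reorganization requested in this thread amounts to moving the rendering logic out of the groupby class and leaving `_repr_html_` as a thin delegate. A rough sketch of that shape follows, under loud assumptions: `GroupByHTMLFormatter` and `ToyGroupBy` are hypothetical names, a plain dict of lists stands in for real groups, and none of this reflects the actual pandas formatter classes.

```python
from html import escape
from typing import Dict, List


class GroupByHTMLFormatter:
    """Hypothetical formatter that owns all HTML rendering for grouped data."""

    def __init__(self, groups: Dict[str, List[int]]) -> None:
        self.groups = groups

    def to_html(self) -> str:
        parts = []
        for key, values in self.groups.items():
            items = "".join(f"<li>{value}</li>" for value in values)
            parts.append(f"<h3>Group Key: {escape(str(key))}</h3><ul>{items}</ul>")
        return "\n".join(parts)


class ToyGroupBy:
    """Stand-in for a groupby object; it only delegates to the formatter."""

    def __init__(self, groups: Dict[str, List[int]]) -> None:
        self.groups = groups

    def _repr_html_(self) -> str:
        return GroupByHTMLFormatter(self.groups).to_html()


html_output = ToyGroupBy({"a": [1, 2]})._repr_html_()
```

Keeping the formatter separate means display logic can be tested and changed without touching the groupby code path.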

rhshadrach · 2020-10-29T21:24:20Z

@jreback: Any thoughts on whether tests for groupby._repr_html_ should be in groupby tests or io.formats.test_to_html (or somewhere else)?

rhshadrach

I don't think checking only the first/last row/group is sufficient; we should be checking that exactly the right ones are included.

It seems feasible to combine these into one test by parametrizing the test with n_groups, n_rows, check_n_rows, check_n_groups (so the number of rows/groups to check is hard coded as a parameter), or names of that sort. If necessary, can use something like "all" if separate logic is needed when no rows/groups are hidden.
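The suggestion above, one test driven by a table of hard-coded expectations, might look roughly like this. It is a sketch under assumptions: `visible_positions` is a hypothetical helper that mirrors the head/tail truncation discussed in this PR, and `unittest`'s `subTest` stands in for `pytest.mark.parametrize`, which would take the same tuples.

```python
import unittest
from typing import List


def visible_positions(n_items: int, max_items: int) -> List[int]:
    """Positions kept when only the first and last items are displayed."""
    if max_items >= n_items:
        return list(range(n_items))
    n_start = (max_items + 1) // 2
    n_end = max_items - n_start
    return list(range(n_start)) + list(range(n_items - n_end, n_items))


# (n_groups, max_groups, expected positions); expectations are hard-coded.
CASES = [
    (4, 10, [0, 1, 2, 3]),  # nothing hidden
    (10, 4, [0, 1, 8, 9]),  # even limit: two from each end
    (10, 3, [0, 1, 9]),  # odd limit: the head gets the extra slot
]


class VisibleGroupsTest(unittest.TestCase):
    def test_exact_positions(self):
        for n_groups, max_groups, expected in CASES:
            with self.subTest(n_groups=n_groups, max_groups=max_groups):
                # Compare the full list, not just the first and last entry.
                self.assertEqual(visible_positions(n_groups, max_groups), expected)
```

Comparing the complete list of positions addresses the concern that checking only the first and last row or group could miss a wrong item in between.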

rhshadrach · 2020-10-30T18:55:37Z

pandas/tests/groupby/test_groupby.py

+    for k, (group_name, df_group) in enumerate(df_groupby):
+        dtype = df_group.iloc[0].dtype
+        tm.assert_series_equal(
+            dfs_from_html[k].iloc[0].astype(dtype), df_group.iloc[0], check_names=False


Why is the astype and check_names necessary here? (May be worth a comment)

New code should be much clearer :) HTML is parsed as string type and needs to be converted to integer manually.
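The point about string-typed cells can be illustrated without pandas. The sketch below uses the standard library's `html.parser` as a stand-in for whatever reader the test uses; `CellCollector` is a hypothetical helper and the table is made-up sample data. Every cell comes back as `str`, so an explicit conversion is needed before comparing against integer data.

```python
from html.parser import HTMLParser
from typing import List


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, row by row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


sample = "<table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>"
parser = CellCollector()
parser.feed(sample)

raw = parser.rows  # every cell is a str at this point
as_int = [[int(cell) for cell in row] for row in raw]  # explicit cast
```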

jreback · 2020-10-31T17:57:51Z

pandas/io/formats/format.py

@@ -2020,3 +2020,39 @@ def buffer_put_lines(buf: IO[str], lines: List[str]) -> None:
    if any(isinstance(x, str) for x in lines):
        lines = [str(x) for x in lines]
    buf.write("\n".join(lines))
+
+
+def repr_html_groupby(group_obj) -> str:


this should use the same machinery as DataFrameFormatter/DataFrameRenderer (subclass as appropriate), which was recently changed.

Sorry for the long inactivity. I don't understand how I would use the DataFrameFormatter. Is there documentation on this?

You can find examples in pandas/io/formats/latex.py and in other IO methods (grepping for DataFrameFormatter will get you the lot)

Sorry for coming back to this again, but I really don't get what code I should change or how. Could you tell me which line in my code I have to rewrite?

Is this still relevant? I still would need some guidance :)

arw2019

@JulianWgs if you can address comments. There's also a minor pre-commit failure

github-actions · 2021-01-02T00:24:34Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

arw2019

Thanks for picking this up!

In addition to addressing comments you'll want to merge master to make sure that this is up to date, and add a whatsnew (1.3, under enhancements)

arw2019 · 2021-01-03T14:58:28Z

arw2019

One more comment - we might want an example for this in the groupby section of the user guide (but that could be left for a follow-on too)

simonjayhawkins · 2021-07-03T10:54:13Z

@JulianWgs can you resolve conflicts (and move release note to 1.4)

JulianWgs · 2021-07-04T09:20:32Z

@simonjayhawkins Done :)

jreback · 2021-11-28T21:04:45Z

this is quite old, happy to reopen if actively worked on.

JulianWgs added 2 commits June 22, 2020 03:19

Add html repr for groupby dataframe and series

5e1cb8c

Sort imports with isort in groupby.py

9c3df8a

simonjayhawkins added the Output-Formatting __repr__ of pandas objects, to_string label Jul 24, 2020

JulianWgs added 3 commits July 25, 2020 13:48

Improve variable naming

4a3911a

Add display.max_groups to config

46f5353

JulianWgs force-pushed the master branch from 0068e5b to c840c29 Compare July 25, 2020 12:50

JulianWgs added 3 commits July 25, 2020 15:13

Implement faster and more scalable list variant

139bdc6

- takes 8.97 ms ± 352 µs for 20,000 groups and is largely independent of the number of groups

Black config_init

1020be9

Fix bug which displayed too few rows

2e4a6ee

- max rows per group was calculated with the number of groups in the groupby and did not consider max_groups setting
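The bug described here is a small piece of arithmetic: dividing the row budget by the total number of groups instead of by the number of groups actually displayed. A minimal sketch of the corrected calculation follows; `rows_per_group` is a hypothetical helper name, and the parameters mirror the `display.max_rows` and `display.max_groups` options mentioned in this PR.

```python
def rows_per_group(max_rows: int, ngroups: int, max_groups: int) -> int:
    """Split the row budget across the groups that are actually displayed."""
    shown_groups = min(ngroups, max_groups)
    if shown_groups <= 0:
        return 0  # nothing is displayed, so no rows are needed
    return max_rows // shown_groups


# With 20,000 groups but only 3 displayed, dividing by ngroups leaves no rows.
old_budget = 60 // 20_000
new_budget = rows_per_group(max_rows=60, ngroups=20_000, max_groups=3)
```

With these sample numbers the old calculation yields 0 rows per group, while the corrected one yields 20.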

JulianWgs marked this pull request as ready for review July 25, 2020 13:27

JulianWgs added 3 commits July 26, 2020 10:47

Add test for groupby representation

ea2f151

- load dataframes from html output and compare them

Delete trailing whitespace in comment

2443b80

- https://github.com/pandas-dev/pandas/pull/34926/checks?check_run_id=909776147#step:6:16 - black did not pick it up

Add test cases for truncated rows and groups

913afb0

JulianWgs added 2 commits August 24, 2020 09:08

Merge remote-tracking branch 'upstream/master'

7efc505

Skip test if lxml is not installed

d85fc63

jreback added Groupby IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Sep 5, 2020

jreback reviewed Sep 5, 2020

JulianWgs added 3 commits October 30, 2020 16:24

Fix typo and capitalize pandas objs correctly

669c047

Change docstring to comment in groupby repr test

8a75299

- add link to Github Pull Request

Add additional explanation in groupby_repr test

b36177d

- why only the first/last rows/groups are tested

rhshadrach requested changes Oct 30, 2020

jreback requested changes Oct 31, 2020

JulianWgs added 2 commits November 11, 2020 07:56

Test more rows in groupby repr when truncated

edff21d

Test more groups in groupby repr when truncated

580d09b

arw2019 reviewed Nov 30, 2020

github-actions bot added the Stale label Jan 2, 2021

Refactor groups repr html

0c948e1

- combine all previous test cases into one parametrized test case

arw2019 suggested changes Jan 3, 2021

JulianWgs added 5 commits January 3, 2021 17:47

Merge remote-tracking branch 'upstream/master'

e41ff00

Add whatsnew entry for group-by HTML representation

ae8721d

Fix test case name

1c92ed8

Rename groupby objects

b92d61f

Add case for single and tuple groupby key

579998a

arw2019 reviewed Jan 3, 2021

Merge remote-tracking branch 'upstream/master'

8d8b260

simonjayhawkins added the Needs Review label Jun 16, 2021

JulianWgs force-pushed the master branch from 49ebb30 to 8d8b260 Compare July 3, 2021 11:11

Merge branch 'master' into master

5865cfb

JulianWgs force-pushed the master branch from 5d19833 to 5865cfb Compare July 4, 2021 09:08

Move whats new to 1.4.0 release

7a11be8

jreback closed this Nov 28, 2021

JulianWgs mentioned this pull request Apr 18, 2023

ENH: HTML repr for groupby dataframe and series #52721

Open

3 tasks
