Skip to content

Fix for DataFrame.hist() with by- and weights-keyword #11028

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

Fix for DataFrame.hist() with by- and weights-keyword #11028

wants to merge 3 commits into from

Conversation

Twizzledrizzle
Copy link

closes #9540

for example:

import pandas as pd
d = {'one' : ['A', 'A', 'B', 'C'],
     'two' : [4., 3., 2., 1.],
     'three' : [10., 8., 5., 7.]}     
df = pd.DataFrame(d)
df.hist('two', by='one', weights='three', bins=range(0, 10))

does not seem to break anything, but this is my first meddling in the pandas library, so a review would be nice

will make this work:

import pandas as pd
d = {'one' : ['A', 'A', 'B', 'C'],
     'two' : [4., 3., 2., 1.],
     'three' : [10., 8., 5., 7.]}
     
df = pd.DataFrame(d)
df.hist('two', by='one', weights='three', bins=range(0, 10))
ax.hist(group.dropna().values, bins=bins, **kwargs)
def plot_group(group, ax, weights=None):
if weights is not None:
weights=weights.dropna().values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect this will fail if the missing values in the weights column don't align perfectly with the missing values in the group column. It might be cleaner to refactor this to drop rows missing either group or weight earlier on, so that plot_group only has to deal with valid observations.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this because on the next line, group drops na values as well?
ax.hist(group.dropna().values

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, issues when na is different on weights and group arrays, did not think of this, will think a little bit more

@TomAugspurger TomAugspurger added this to the 0.18.0 milestone Sep 9, 2015
@TomAugspurger
Copy link
Contributor

Just gave a quick look through here. This is a good idea that we should support.

We'll need tests for this. Put them in pandas/tests/test_graphics.py. Make sure to cover cases where

  • by is None or a column
  • weights is an array of weights or a string column
  • values has missing data and / or weights has missing data

The plotting stuff is being refactored a bunch currently, so I've tagged this for the 0.18. You might want to hold off on making more changes, but in the meantime you can write tests for this.

@jreback jreback added the Visualization plotting label Sep 10, 2015
@jreback
Copy link
Contributor

jreback commented Sep 10, 2015

FYI the checking is quite similar to how weights are checked for DataFrame.sample, so would want to make this a common function (could be a private function on a dataframe)

@Twizzledrizzle
Copy link
Author

Thanks jreback, I will try my best to add some tests. I will also check what can be done with synching dropna with the weighs.

I was thinking in the lines of
drow rows in weighs, that are na in group
drop rows na in group
fillna in weighs with zero, so they do not count to anything

would this be ok?

@Twizzledrizzle
Copy link
Author

Or more logical to drop all rows that are NA in group or in weights?

@TomAugspurger
Copy link
Contributor

I'd say your lest method. Something like data.dropna(subset=['group', 'weight'])

@jreback
Copy link
Contributor

jreback commented Oct 25, 2015

@TomAugspurger can you review

@Twizzledrizzle
Copy link
Author

Sorry I have not had time to continue with the tests, my limited git knowledge made it tough interacting with the repo. I planned to look how my patch worked with the new release

@Twizzledrizzle
Copy link
Author

Added a new pull request here: #11441

for the new version, sorry for not using git correctly :(

@TomAugspurger
Copy link
Contributor

Superseded by #11441

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants