WIP: Add value_counts() to DataFrame #5381
@jtratner looks fine...(obv need some tests as you have indicated)
yeah, I'm pretty sure there's a bug with the binning that I need to address
let's put this off to 0.14....
Yep, that's what I was thinking too.
def value_counts(self, axis=0, normalize=False, sort=True,
                 ascending=False, bins=None, numeric_only=False):
    """
    Returns DataFrame containing counts of unique values. The resulting
Can you put the first sentence on a separate line (so putting "The resulting" on the next line)? When following the numpy docstring standard exactly, there should even be a blank line after the first sentence. This will ensure that the summary in the API docs (http://pandas.pydata.org/pandas-docs/dev/api.html) is limited to that one sentence.
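To illustrate the layout being asked for, here is a minimal sketch of a numpy-style docstring with a one-line summary followed by a blank line. The function name `value_counts_stub` and all of the docstring wording are placeholders made up for this example, not the PR's actual text.

```python
import inspect


def value_counts_stub(frame, normalize=False):
    """
    Return counts of unique values per column.

    Extended description goes here, separated from the one-line summary
    by a blank line (placeholder text, not the PR's actual wording).
    """
    raise NotImplementedError  # stub: only the docstring layout matters here


# inspect.cleandoc strips the common indentation and the leading blank line,
# which is roughly how documentation tools see the docstring.
doc_lines = inspect.cleandoc(value_counts_stub.__doc__).splitlines()
summary, separator = doc_lines[0], doc_lines[1]
```

With this layout the first line is a complete sentence on its own, so a tool that only takes the summary line has a clean cut-off point.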
Added some documentation comments.
Thanks for the comments! I'll incorporate those when I return to finish
When would someone use this? I can't think of a case where I would use it :s It should fill in the NaNs with 0, since the count is 0 if a value doesn't appear.
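For reference, the NaN behaviour mentioned here can be reproduced today with existing pandas calls, without the method proposed in this PR. This is only a sketch; the frame and its column names `a` and `b` are invented for illustration.

```python
import pandas as pd

# Made-up example data: "z" never occurs in column "a", "x" never in "b".
df = pd.DataFrame({"a": ["x", "y", "x"], "b": ["y", "y", "z"]})

# Apply Series.value_counts column by column. The per-column results are
# aligned on a shared index, so a value missing from a column shows up as NaN.
raw = df.apply(pd.Series.value_counts)

# Fill the gaps with 0 and restore an integer dtype.
counts = raw.fillna(0).astype(int)
```

Whether a DataFrame-level `value_counts` should do the `fillna(0)` step itself is the design question raised in this comment.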
In general I think anything that you can do for a series you should be able to do for a column/columns of a DataFrame. I think there are many uses for this, especially when binning data: essentially it's a histogram function that you can apply to a bunch of columns at once rather than going through each series.
A column of a DataFrame is a Series (so I agree with the first half). However, I strongly disagree with extending all Series functionality to DataFrame for the sake of it; rather, things should be extended only where they make sense... Personally I can't see this particular one making sense, but I am open to seeing an example where it does.
So, could you give a specific example?
Exactly my point: it's a series of Series, so it's a natural extension to be able to apply a Series function to a series of Series. One example is a DataFrame of different price timeseries where the user is interested in counts in bins of some increment (e.g., $10 or $5 or some other measure). Or one could think of binning returns on a DataFrame, or binning measurement data. Another one is when each column holds the states of a different variable and one is interested in the counts of these states across the columns. There are other ways to do the counts, but I think the binning is the real key to the usefulness of value_counts. It is just awaiting tests, thanks to the implementation work already done by @jtratner.
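The price-binning use case can be sketched with existing pandas functions, independent of this PR's implementation. The column names, prices, and bin edges below are invented for illustration.

```python
import pandas as pd

# Hypothetical price series, one column per instrument.
prices = pd.DataFrame({
    "stock_a": [101.0, 104.0, 112.0, 118.0, 109.0],
    "stock_b": [95.0, 99.0, 103.0, 121.0, 117.0],
})

# Assumed $10 increments; intervals are right-closed, e.g. (100, 110].
bin_edges = [90, 100, 110, 120, 130]

# Cut each column into the same bins and count per bin. sort=False keeps the
# bins in their natural order instead of ordering by frequency.
binned = prices.apply(
    lambda col: pd.cut(col, bins=bin_edges).value_counts(sort=False)
)
```

The result has one row per bin and one column per instrument, which is the "histogram over several columns at once" described above.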
I've run into a couple of situations in the past couple of weeks where I could have used value counts on a DataFrame. Specifically, needing to get value counts for flows from period to period:
In [12]: df.head()
Out[12]:
                                1    2    3    4    5    6    7    8
HRHHID      HRHHID2 PULINENO
12807008622 87001   1         NaN   nn   nn   nn   nn   nn   nn   nn
                    2         NaN   nn   nn   nn   nn   nn   nn   nn
                    3         NaN   ee   ee   ee   ee   ee   ee   ee
17141220290 87001   1         NaN   un   un   ne   ee   ee   ee   ee
                    2         NaN   ee   ee   ee   ee   ee   ee   ee
Granted, using
This is a pretty trivial addition overall - I just haven't had time to get
@jtratner up 2 u if you want this for 0.13.1
@jtratner still working on this?
@jtratner status?
@jtratner i can take this if you want
+ abstract bin generation from cut to use elsewhere. Panel goes to ndarray on apply so that's a future TODO.
Conflicts:
    pandas/core/frame.py
    pandas/core/series.py
ca9c6f9 to fc8f0b9
I don't know if this bug is still at the "needing use cases" stage, but I have a dataframe where the columns are categorical variables. I want to prune values for these variables which appear below a certain frequency in the dataset. So doing value_counts() on a dataframe is the first step for that.
closing as stale
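A rough sketch of that pruning workflow using per-column `Series.value_counts`, which exists today. The data, the threshold `min_count`, and the helper `mask_rare` are all hypothetical names chosen for this example.

```python
import pandas as pd

# Made-up categorical data.
df = pd.DataFrame({
    "color": ["red", "red", "blue", "red", "green"],
    "size": ["S", "M", "M", "M", "L"],
})

min_count = 2  # hypothetical frequency threshold


def mask_rare(col, threshold):
    """Replace values occurring fewer than `threshold` times with NaN."""
    counts = col.value_counts()
    frequent = counts[counts >= threshold].index
    return col.where(col.isin(frequent))


# Extra keyword arguments to DataFrame.apply are forwarded to the function.
pruned = df.apply(mask_rare, threshold=min_count)
```

Here the rare values are masked rather than the rows dropped; which of the two is wanted depends on the analysis.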
apply so that's a future TODO.
I'm relatively certain that this works, but I don't have explicit tests
for it. I'm hoping others who are interested (cough cough @rockg) will
be willing to create some of them. The key things to test are that binning,
normalization and sorting all work correctly.
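As a starting point, the three properties could be checked along these lines. This sketch exercises the existing `Series.value_counts`, not the DataFrame method from this PR, and the sample data is invented; tests for the new method would presumably assert the same properties per column.

```python
import pandas as pd

s = pd.Series([1, 1, 1, 2, 2, 3])

# Normalization: relative frequencies should sum to 1.
normalized = s.value_counts(normalize=True)
assert abs(normalized.sum() - 1.0) < 1e-12

# Sorting: ascending=True orders by count, smallest first.
ascending = s.value_counts(ascending=True)
assert ascending.tolist() == [1, 2, 3]

# Binning: every observation should land in exactly one bin.
binned = s.value_counts(bins=2, sort=False)
assert binned.sum() == len(s)
```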
Will close #5377 when completed.