WIP: Add value_counts() to DataFrame #5381

Closed

jtratner
Contributor

  • abstract bin generation from cut to use elsewhere. Panel goes to ndarray on
    apply so that's a future TODO.

I'm relatively certain that this works, but I don't have explicit tests
for it. I'm hoping others who are interested (cough cough @rockg) will
be willing to create some of them. The key things to test are that binning,
normalization, and sorting all work correctly.

Will close #5377 when completed.

@jreback
Contributor

jreback commented Oct 30, 2013

@jtratner looks fine...(obv need some tests as you have indicated)

@jtratner
Contributor Author

yeah, I'm pretty sure there's a bug with the binning that I need to address
and flesh out some of the edge cases (esp dup columns :-/). Should this
wait until 0.14 b/c it's a new feature?

@jreback
Contributor

jreback commented Oct 30, 2013

let's put this off to 0.14....

@jtratner
Contributor Author

Yep, that's what I was thinking too.

def value_counts(self, axis=0, normalize=False, sort=True,
ascending=False, bins=None, numeric_only=False):
"""
Returns DataFrame containing counts of unique values. The resulting
Member

Can you put the first sentence on a separate line (i.e., move "The resulting" to the next line)? When following the numpy docstring standard exactly, there should even be a blank line after the first sentence. This ensures that the summary in the API docs (http://pandas.pydata.org/pandas-docs/dev/api.html) is limited to that one sentence.
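
As a rough sketch of the layout being requested, the docstring could look like the following. The signature and summary sentence are taken from the diff above; the longer description text is a placeholder written for illustration, and the function body is deliberately omitted.

```python
def value_counts(self, axis=0, normalize=False, sort=True,
                 ascending=False, bins=None, numeric_only=False):
    """
    Returns DataFrame containing counts of unique values.

    Placeholder for the longer description. It starts after a blank
    line so that only the one-sentence summary above is picked up as
    the short description.
    """
    # Body omitted: only the docstring layout matters in this sketch.
    raise NotImplementedError
```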

@jorisvandenbossche
Member

Added some documentation comments.

@jtratner
Contributor Author

Thanks for the comments! I'll incorporate those when I return to finish
this up.

@hayd
Contributor

hayd commented Nov 22, 2013

When would someone use this? I can't think of when I would use this :s

Should fill in the NaNs with 0, since count is 0 if no values appear.
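
To illustrate where those NaNs come from, here is a minimal sketch using the existing per-column `Series.value_counts` via `DataFrame.apply` (not the method proposed in this PR). The frame and its column names are made up for the example.

```python
import pandas as pd

# Toy data: each column has a value the other column never contains.
df = pd.DataFrame({"a": ["x", "y", "x"], "b": ["y", "y", "z"]})

# Counting per column and aligning on the union of values leaves NaN
# wherever a value never occurs in a given column.
counts = df.apply(lambda col: col.value_counts())

# A value that never appears has a count of 0, so fill and cast back to int.
filled = counts.fillna(0).astype(int)
print(filled)
```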

@rockg
Contributor

rockg commented Nov 22, 2013

In general I think anything that you can do for a series you should be able to do for a column/columns of a DataFrame. I think there are many uses for this especially when binning data--essentially it's a histogram function that you can apply to a bunch of columns at once rather than going through each series.

@hayd
Contributor

hayd commented Nov 22, 2013

> In general I think anything that you can do for a series you should be able to do for a column/columns of a DataFrame

A column of a DataFrame is a Series (so I agree with the first half). However, I strongly disagree with extending all Series functionality to DataFrame for the sake of it; things should be extended only where they make sense... Personally I can't see this particular one making sense, but I am open to seeing an example where it does.

> I think there are many uses for this especially when binning data...

So, could you give a specific example?

@rockg
Contributor

rockg commented Nov 22, 2013

Exactly my point: it's a series of Series, so being able to apply a Series function to a series of Series is a natural extension.

One example is if there is a DataFrame of different price time series and the user is interested in counts in bins of some increment (e.g., $10 or $5 or some other measure). Or one could think of binning returns on a DataFrame or binning measurement data. Another is if each column holds the states of a different variable and one is interested in the counts of these states across the columns. There are other ways to do the counts, but I think the binning is the real key to the usefulness of value_counts.

It is just awaiting tests, thanks to the implementation work already done by @jtratner.
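
The price-binning case can be approximated today with `pd.cut` and per-column `value_counts`; the sketch below is one way to do it, not the implementation in this PR. The prices, column names, bin edges, and labels are all invented for illustration. Sharing explicit edges across columns keeps the rows comparable.

```python
import pandas as pd

# Hypothetical price series, one per column.
prices = pd.DataFrame({
    "stock_a": [12.0, 18.5, 25.0, 31.0],
    "stock_b": [9.0, 22.0, 27.5, 38.0],
})

# Shared $10 bins (assumed for the example) so every column uses the same edges.
edges = [0, 10, 20, 30, 40]
labels = ["0-10", "10-20", "20-30", "30-40"]

# Bin each column with the same edges, then count how many prices fall in each bin.
binned = prices.apply(
    lambda col: pd.cut(col, bins=edges, labels=labels).value_counts(sort=False)
)
print(binned)
```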

@TomAugspurger
Contributor

I've run into a couple of situations in the past couple of weeks where I could have used value counts on a DataFrame. Specifically, I needed to get value counts for flows from period to period:

In [12]: df.head()
Out[12]: 
                                1   2   3   4   5   6   7   8
HRHHID      HRHHID2 PULINENO                                 
12807008622 87001   1         NaN  nn  nn  nn  nn  nn  nn  nn
                    2         NaN  nn  nn  nn  nn  nn  nn  nn
                    3         NaN  ee  ee  ee  ee  ee  ee  ee
17141220290 87001   1         NaN  un  un  ne  ee  ee  ee  ee
                    2         NaN  ee  ee  ee  ee  ee  ee  ee

Granted, using df.apply(lambda x: x.value_counts()) isn't onerous, but it wasn't the first thing I turned to (df.value_counts was).

@jtratner
Contributor Author

This is a pretty trivial addition overall - I just haven't had time to get
to it (been super busy these past two weeks). Sorry guys.

@jreback
Contributor

jreback commented Jan 3, 2014

@jtratner up 2 u if you want this for 0.13.1

@jreback
Contributor

jreback commented Feb 16, 2014

@jtratner ?

@jreback
Contributor

jreback commented Mar 9, 2014

@jtratner still working on this?

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 28, 2014
@jreback
Contributor

jreback commented Apr 22, 2014

@jtratner status?

@cpcloud
Member

cpcloud commented Jun 5, 2014

@jtratner i can take this if you want

+ abstract bin generation from cut to use elsewhere. Panel goes to
ndarray on apply so that's a future TODO.

Conflicts:
	pandas/core/frame.py
	pandas/core/series.py
@jtratner jtratner force-pushed the add-value-counts-to-frame branch from ca9c6f9 to fc8f0b9 Compare November 17, 2014 06:23
@chrish42
Contributor

chrish42 commented Dec 1, 2014

I don't know if this bug is still at the "needing use cases" stage, but I have a dataframe where the columns are categorical variables. I want to prune values for these variables which appear below a certain frequency in the dataset. So doing value_counts() on a dataframe is the first step for that.
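
A minimal sketch of that pruning workflow, using per-column `Series.value_counts` as a stand-in for the proposed method. The data, column names, and the `min_count` threshold are hypothetical.

```python
import pandas as pd

# Toy frame of categorical variables.
df = pd.DataFrame({
    "color": ["red", "red", "blue", "green", "red"],
    "size": ["s", "m", "m", "m", "xl"],
})

min_count = 2  # hypothetical frequency threshold

# For every cell, look up how often its value occurs within its own column.
cell_counts = df.apply(lambda col: col.map(col.value_counts()))

# Keep values that meet the threshold; rare values become NaN.
pruned = df.where(cell_counts >= min_count)
print(pruned)
```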

@jreback
Contributor

jreback commented Jan 18, 2015

closing as stale

@jreback jreback closed this Jan 18, 2015
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff API Design Enhancement
Development

Successfully merging this pull request may close these issues.

ENH: DataFrame.value_counts()
8 participants