WIP: Add value_counts() to DataFrame #5381

jtratner · 2013-10-30T03:43:00Z

abstract bin generation from cut to use elsewhere. Panel goes to ndarray on
apply so that's a future TODO.

I'm relatively certain that this works, but I don't have explicit tests
for it. I'm hoping others who are interested (cough cough @rockg) will
be willing to create some of them. The key things to test are that binning,
normalization and sorting all work correctly.

Will close #5377 when completed.

jreback · 2013-10-30T17:29:31Z

@jtratner looks fine...(obv need some tests as you have indicated)

jtratner · 2013-10-30T21:08:24Z

yeah, I'm pretty sure there's a bug with the binning that I need to address
and flesh out some of the edge cases (esp dup columns :-/). Should this
wait until 0.14 b/c it's a new feature?

jreback · 2013-10-30T21:11:43Z

let's put this off to 0.14....

jtratner · 2013-10-30T21:18:35Z

Yep, that's what I was thinking too.

jorisvandenbossche · 2013-11-14T17:49:42Z

pandas/core/frame.py

+    def value_counts(self, axis=0, normalize=False, sort=True,
+                     ascending=False, bins=None, numeric_only=False):
+        """
+        Returns DataFrame containing counts of unique values. The resulting


Can you put the first sentence on a separate line (i.e., move "The resulting" to the next line)? When following the numpy docstring standard exactly, there should even be a blank line after the first sentence. This will ensure that the summary in the API docs (http://pandas.pydata.org/pandas-docs/dev/api.html) is limited to that one sentence.
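For illustration, here is a minimal sketch of the docstring layout being requested: a one-sentence summary, a blank line, then the longer description. The function name `value_counts_sketch` and its body are hypothetical stand-ins, not the code from this PR; only the docstring shape is the point.

```python
import inspect

import pandas as pd


def value_counts_sketch(frame, normalize=False):
    """
    Return a DataFrame containing counts of unique values per column.

    The resulting frame is indexed by the distinct values found in
    ``frame``; a value that is absent from a column appears as NaN there.

    Parameters
    ----------
    frame : DataFrame
    normalize : bool, default False
        If True, return relative frequencies instead of counts.

    Returns
    -------
    DataFrame
    """
    # Hypothetical body: count each column independently and let pandas
    # align the per-column results on the union of observed values.
    return frame.apply(lambda col: col.value_counts(normalize=normalize))


# The first line of the cleaned docstring is what a summary table would show.
doc_lines = inspect.getdoc(value_counts_sketch).splitlines()
summary = doc_lines[0]
```

With this layout, a tool that takes only the text up to the first blank line picks up the single summary sentence and nothing else.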

jorisvandenbossche · 2013-11-14T17:54:27Z

Added some documentation comments.

jtratner · 2013-11-14T23:01:00Z

Thanks for the comments! I'll incorporate those when I return to finish
this up.

hayd · 2013-11-22T00:08:17Z

When would someone use this ? I can't think of when I would use this :s

Should fill in the NaNs with 0, since count is 0 if no values appear.

rockg · 2013-11-22T01:36:54Z

In general I think anything that you can do for a series you should be able to do for a column/columns of a DataFrame. I think there are many uses for this especially when binning data--essentially it's a histogram function that you can apply to a bunch of columns at once rather than going through each series.

hayd · 2013-11-22T02:04:32Z

In general I think anything that you can do for a series you should be able to do for a column/columns of a DataFrame

A column of a DataFrame is a Series (so I agree with the first half). However, I strongly disagree with extending all Series functionality to DataFrame for the sake of it; rather, things should be extended only where they make sense... Personally I can't see this particular one making sense, but am open to seeing an example where it does.

I think there are many uses for this especially when binning data...

So, could you give a specific example?

rockg · 2013-11-22T02:44:50Z

Exactly my point: a DataFrame is a series of Series, so it's a natural extension to be able to apply a Series function to a series of Series.

One example is if there is a DataFrame of different price timeseries and the user is interested in counts in bins of some increment (e.g., $10 or $5 or some other measure). Or one could think of binning returns on a DataFrame or binning measurement data. Another one is if there are states of different variables in each column and one is interested in the counts of these states for the various columns. There are other ways to do the counts, but I think the binning is the real key to the usefulness of value_counts.

It is just awaiting tests, thanks to the implementation work already done by @jtratner.
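As a rough sketch of the price-binning use case described above, the per-column histogram can already be approximated with existing pandas calls. The frame `prices`, the column names, and the $10 bin edges are made up for illustration, and this uses `pd.cut` directly rather than the `bins` argument proposed in this PR.

```python
import pandas as pd

# Hypothetical price series, one per column.
prices = pd.DataFrame({
    "a": [3.0, 12.0, 18.0, 25.0],
    "b": [7.0, 8.0, 22.0, 29.0],
})

# Shared, explicit bin edges ($10 increments) so every column is counted
# against the same intervals: (0, 10], (10, 20], (20, 30].
edges = [0, 10, 20, 30]

# Bin each column, then count how many values fall in each interval.
# sort=False keeps the result ordered by interval rather than by frequency.
binned = prices.apply(
    lambda col: pd.cut(col, bins=edges).value_counts(sort=False)
)
```

Passing explicit edges is a deliberate choice here: with an integer bin count, each column would get its own edges and the rows of the result would no longer be comparable across columns.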

TomAugspurger · 2013-11-22T03:02:31Z

I've run into a couple of situations in the past couple of weeks where I could have used value counts on a DataFrame. Specifically, I needed to get value counts for flows from period to period:

In [12]: df.head()
Out[12]: 
                                1   2   3   4   5   6   7   8
HRHHID      HRHHID2 PULINENO                                 
12807008622 87001   1         NaN  nn  nn  nn  nn  nn  nn  nn
                    2         NaN  nn  nn  nn  nn  nn  nn  nn
                    3         NaN  ee  ee  ee  ee  ee  ee  ee
17141220290 87001   1         NaN  un  un  ne  ee  ee  ee  ee
                    2         NaN  ee  ee  ee  ee  ee  ee  ee

Granted, using df.apply(lambda x: x.value_counts()) isn't onerous, but it wasn't the first thing I turned to (df.value_counts was).
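A small sketch of that `apply` idiom, on a made-up frame shaped like the one above (the state codes and column labels are illustrative only). It also fills the gaps with 0, as suggested earlier in the thread, since a state that never occurs in a period otherwise shows up as NaN.

```python
import pandas as pd

# Hypothetical flow states for two periods.
states = pd.DataFrame({
    1: ["nn", "nn", "ee", "un"],
    2: ["nn", "ee", "ee", "ne"],
})

# Count each column separately; pandas aligns the results on the union of
# observed states, leaving NaN where a state does not occur in a column.
raw_counts = states.apply(lambda col: col.value_counts())

# A missing state means a count of zero, so fill and restore integer dtype.
counts = raw_counts.fillna(0).astype(int)
```

Here `counts.loc["un", 2]` is 0 because "un" appears only in period 1.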

jtratner · 2013-11-22T04:19:57Z

This is a pretty trivial addition overall - I just haven't had time to get
to it (been super busy these past two weeks). Sorry guys.

jreback · 2014-01-03T22:19:49Z

@jtratner up 2 u if you want this for 0.13.1

jreback · 2014-02-16T21:57:35Z

@jtratner ?

jreback · 2014-03-09T15:01:48Z

@jtratner still working on this?

jreback · 2014-04-22T15:35:11Z

@jtratner status?

cpcloud · 2014-06-05T17:03:06Z

@jtratner i can take this if you want


chrish42 · 2014-12-01T18:45:55Z

I don't know if this bug is still at the "needing use cases" stage, but I have a dataframe where the columns are categorical variables. I want to prune values for these variables which appear below a certain frequency in the dataset. So doing value_counts() on a dataframe is the first step for that.
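A minimal sketch of that pruning workflow using per-column `Series.value_counts`, assuming unique column names. The helper `prune_rare`, the threshold, and the sample data are hypothetical and not part of this PR.

```python
import pandas as pd


def prune_rare(frame, min_count):
    """Replace values seen fewer than ``min_count`` times in a column with NaN."""
    pruned = frame.copy()
    for name in frame.columns:
        column = frame[name]
        # Frequency of each distinct value in this column.
        counts = column.value_counts()
        rare_values = counts[counts < min_count].index
        # Keep common values; rare ones become missing.
        pruned[name] = column.where(~column.isin(rare_values))
    return pruned


# Hypothetical categorical data.
survey = pd.DataFrame({
    "color": ["red", "red", "blue", "red"],
    "size": ["S", "M", "M", "M"],
})
pruned = prune_rare(survey, min_count=2)
```

In this sample, "blue" and "S" each occur once, so they are the only values replaced.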

jreback · 2015-01-18T21:39:35Z

closing as stale

jorisvandenbossche reviewed Nov 14, 2013

jreback added Algos labels Feb 16, 2014

jreback modified the milestones: 0.15.0, 0.14.0 Mar 28, 2014

jreback mentioned this pull request Jun 5, 2014

ENH/GBY: add nlargest/nsmallest to Series.groupby #7356

Merged

jtratner added 3 commits November 16, 2014 21:37

ENH: Add value_counts() to DataFrame

a3d490f

+ abstract bin generation from cut to use elsewhere. Panel goes to ndarray on apply so that's a future TODO. Conflicts: pandas/core/frame.py pandas/core/series.py

Value count tests

c452ed8

CLoser to working test cases

fc8f0b9

jtratner force-pushed the add-value-counts-to-frame branch from ca9c6f9 to fc8f0b9 Compare November 17, 2014 06:23

jreback closed this Jan 18, 2015
