Skip to content

BUG: preserve dtypes in interpolate #6291

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

TomAugspurger
Copy link
Contributor

Closes #6290

@jreback is there a good way to update part of a BlockManager inplace? I have a new BlockManager from interpolate and the original BlockManager (or DataFrame) which contains some "all good" columns that didn't get interpolate on, and some bad columns that did get interpolated on.

For the non inplace version I'm calling self._constructor on the BlockManager returned from interpolating and concating that to the all good columns.

@jreback
Copy link
Contributor

jreback commented Feb 6, 2014

@TomAugspurger no....this is where it gets tricky

you basically split the block's values then create and return 2 blocks from the Block routine (the apply at the BlockManager level willcombine them properly).

downcast, putmask, eval, convert all do this; they do it item by item (which you don't want to do)

give a shot and I can help

@jreback
Copy link
Contributor

jreback commented Feb 6, 2014

you want to do this check at a much lower level, in Block.interpolate; e.g. instead of passing every array to the interpolate routines, you just selectively pass, then reassemble

@jreback jreback added this to the 0.14.0 milestone Feb 7, 2014
@TomAugspurger
Copy link
Contributor Author

@jreback two things

  1. We raise a ValueError if you interpolate with method='time' on an object that doesn't have a time series index. With this PR if you don't have any NaNs in the Series/Frame you just get it back, no ValueError. If there are any NaNs it still raises. I think this is actually an improvement, but I could be persuaded otherwise.
  2. I'm just doing this on the self._interpolate method in core/internals (the part I added). Would you like to do a similar check for the self._interpolate_with_fill side also?

@@ -823,6 +823,12 @@ def interpolate(self, method='pad', axis=0, index=None,
m = None

if m is not None:
# Skip interpolating this block if no NaNs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fine; but you of course have the problem if say part of block is interpolated and part is not. but better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I'm still looking into how to handle blocks with some good cols, and some that need interpolating.

@jreback
Copy link
Contributor

jreback commented Feb 7, 2014

@TomAugspurger you could do this on _interpolate_with_fill as well. keeping in mind things:

  • this may break some tests (which expect the float to come back always)
  • this may introduce a performance issue (isnull is not free; though reasonably cheap)...maybe just give an overall vbench to see if anything pops up

@jreback
Copy link
Contributor

jreback commented Feb 7, 2014

you could try to a-prior split blocks (or post-process the changed ones)....can do later PR

@jreback
Copy link
Contributor

jreback commented Feb 14, 2014

this looks good...you punted the splitting blocks?

@TomAugspurger
Copy link
Contributor Author

The perf isn't looking good (assuming I'm running it correctly). There's an 80x slowdown on one of my new vbenches. I'll look at it closer tomorrow and ping you.

On Feb 13, 2014, at 9:35 PM, "jreback" <[email protected]mailto:[email protected]> wrote:

this looks good...you punted the splitting blocks?


Reply to this email directly or view it on GitHubhttps://github.com//pull/6291#issuecomment-35052784.

@TomAugspurger
Copy link
Contributor Author

Am I doing this correctly? I'm running python test_perf.py -b cab2a93 -H,

cab2a93 is just a merge commit from the hexbin plots earlier today.

The script runs and here are the interpolate related ones:

$cat ../vb_suite.log | grep interpolate
frame_interpolate_some_good                  |   1.8273 |
frame_interpolate_some_good_infer            |   3.2444 |
frame_interpolate                            |  79.5443 |

But when I checkout those commits manually, I get:

cab2a93

In [11]: %timeit df.interpolate()
10 loops, best of 3: 110 ms per loop

master  # should be the same as cab2a93, and it is.

In [5]: %timeit df.interpolate()
10 loops, best of 3: 110 ms per loop

my branch

In [5]: %timeit df.interpolate()
10 loops, best of 3: 78.4 ms per loop

So it's faster when I actually check it out, not 80x slower.

df is the same as the one from the vbench:

df = DataFrame(randn(10000, 100))
df.values[::2] = np.nan

@jreback
Copy link
Contributor

jreback commented Feb 14, 2014

no...those are the actual running times in ms, not the ratio...

./test_perf.sh -b <1 prior commit to yours> -t <last commit of yours> -r interpolate

I make sure its checked in and pushed because it pulls from the remote repo

@TomAugspurger
Copy link
Contributor Author

Thanks. This makes more sense.

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_interpolate_some_good                  |   1.8317 |   3.9976 |   0.4582 |
frame_interpolate                            |  79.5573 | 109.9803 |   0.7234 |
frame_interpolate_some_good_infer            |   3.3453 |   3.9697 |   0.8427 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

@@ -823,6 +823,11 @@ def interpolate(self, method='pad', axis=0, index=None,
m = None

if m is not None:
if isinstance(self, (IntBlock, BoolBlock)):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want me to do this check in on the self._interpolate_with_fill part a few lines up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

@TomAugspurger
Copy link
Contributor Author

Here's the perf for the fillna parts:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_fillna_inplace                         |   9.4923 |   9.4840 |   1.0009 |
replace_fillna                               |   1.5057 |   1.4993 |   1.0042 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |

I fixed that typo on vbench/merge_join also.

@@ -802,6 +802,16 @@ def interpolate(self, method='pad', axis=0, index=None,
values=None, inplace=False, limit=None,
fill_value=None, coerce=False, downcast=None, **kwargs):

def check_int_bool(self, inplace):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also do this by self.is_integer, self.is_bool, self.is_timedelta.... these are defined for all blocks (but not a biggie)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback, are you able to view this PR on github? I’m getting an error every
time I try to view it. Not any other pages though.

Anyway I think it’s ready to go.

On Feb 14, 2014, at 11:22 AM, jreback <[email protected]mailto:[email protected]> wrote:

In pandas/core/internals.py:

@@ -802,6 +802,16 @@ def interpolate(self, method='pad', axis=0, index=None,
values=None, inplace=False, limit=None,
fill_value=None, coerce=False, downcast=None, **kwargs):

  •    def check_int_bool(self, inplace):
    

you can also do this by self.ishttp://self.is_integer, self.ishttp://self.is_bool, self.ishttp://self.is_timedelta.... these are defined for all blocks (but not a biggie)


Reply to this email directly or view it on GitHubhttps://github.com//pull/6291/files#r9756105.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

github seems kind of flaky now

will try later

On Feb 14, 2014, at 4:39 PM, Tom Augspurger [email protected] wrote:

In pandas/core/internals.py:

@@ -802,6 +802,16 @@ def interpolate(self, method='pad', axis=0, index=None,
values=None, inplace=False, limit=None,
fill_value=None, coerce=False, downcast=None, **kwargs):

  •    def check_int_bool(self, inplace):
    
    @jreback, are you able to view this PR on github? I’m getting an error every time I try to view it. Not any other pages though. Anyway I think it’s ready to go. On Feb 14, 2014, at 11:22 AM, jreback <[email protected]mailto:[email protected]> wrote: In pandas/core/internals.py:
    @@ -802,6 +802,16 @@ def interpolate(self, method='pad', axis=0, index=None, values=None, inplace=False, limit=None, fill_value=None, coerce=False, downcast=None, **kwargs): + def check_int_bool(self, inplace):
    you can also do this by self.ishttp://self.is_integer, self.ishttp://self.is_bool, self.ishttp://self.is_timedelta.... these are defined for all blocks (but not a biggie) — Reply to this email directly or view it on GitHubhttps://github.com/BUG: preserve dtypes in interpolate #6291/files#r9756105.

    Reply to this email directly or view it on GitHub.

@jreback
Copy link
Contributor

jreback commented Feb 15, 2014

@TomAugspurger looks like github finally works again...pls rebase and can get this in (and if you want to change to is_bool.....etc cool

@jreback
Copy link
Contributor

jreback commented Feb 16, 2014

github emailed me back - they had a bug that was preventing things from working today

fixed now

@TomAugspurger
Copy link
Contributor Author

Closing without merging in favor of #6378.

@TomAugspurger TomAugspurger deleted the interpolate-ignore-good branch November 3, 2016 12:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions Dtype Conversions Unexpected or buggy dtype conversions Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Interpolate should be more careful with dtypes
2 participants