Closed
Description
dupe of #3707
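groupby often casts int64 data to float64 before applying functions to the grouped data, and then casts the result back to int64. For integers too large to be represented exactly in float64 (magnitudes above 2**53), this round trip silently produces incorrect results.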
Here is a simple example:
import pandas as pd
import numpy as np
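# np.int was removed in NumPy 1.24; use an explicit fixed-width dtype instead
label = np.array([1, 2, 2, 3], dtype=np.int64)
# values above 2**53, so float64 cannot represent all of them exactly
data = np.array([1, 2, 3, 4], dtype=np.int64) + 24650000000000000
z = pd.DataFrame({'a': label, 'data': data})
f = z.groupby('a').first()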
In [40]: print(z)
   a               data
0  1  24650000000000001
1  2  24650000000000002
2  2  24650000000000003
3  3  24650000000000004
In [41]: print(f)
                data
a
1  24650000000000000
2  24650000000000000
3  24650000000000004
The result is clearly incorrect and should be:
                data
a
1  24650000000000001
2  24650000000000002
3  24650000000000004
z.data and f.data are both numpy.int64, which masks the fact that z.data was converted to a float with a loss of precision during the groupby operation.
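The precision loss can be reproduced without groupby at all. The following is a minimal sketch, assuming the internal conversion is equivalent to a plain int64 -> float64 -> int64 round trip (the variable names `round_tripped` and `changed` are only for illustration). It uses the same values as the example above.

```python
import numpy as np

data = np.array([1, 2, 3, 4], dtype=np.int64) + 24650000000000000

# Emulate the cast described above: int64 -> float64 -> int64.
round_tripped = data.astype(np.float64).astype(np.int64)

# float64 has a 53-bit significand. These values lie between 2**54 and
# 2**55, where only multiples of 4 are exactly representable, so each
# value is rounded to the nearest multiple of 4.
changed = data != round_tripped

print(round_tripped)
print(changed)
```

Both arrays report dtype int64, so inspecting the dtype alone gives no hint that the values passed through a float.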
Tests of software using groupby on int64 will only show a problem if the tests include integers large enough to cause this loss of precision. Otherwise the tests will pass and the software will just quietly produce incorrect results in production.
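One way to check whether a test data set would expose this problem is to look for values that do not survive a float64 round trip. This is a rough sketch, not part of pandas: `not_exact_in_float64` is a hypothetical helper name, and it assumes the inputs fit in int64 and stay well inside the int64 range.

```python
import numpy as np

def not_exact_in_float64(values):
    """Return the int64 values that change when cast to float64 and back.

    Hypothetical helper for illustration; not a pandas or NumPy function.
    """
    arr = np.asarray(values, dtype=np.int64)
    round_tripped = arr.astype(np.float64).astype(np.int64)
    return arr[round_tripped != arr]

small = np.arange(1, 5, dtype=np.int64)   # 1, 2, 3, 4
large = small + 24650000000000000         # same offset as the example above

print(not_exact_in_float64(small))  # empty: a test on this data cannot fail
print(not_exact_in_float64(large))  # non-empty: this data can reveal the bug
```

If the helper returns an empty array for the test data, a passing groupby test says nothing about correctness on large int64 values.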