Closed
Description
I'm not at all familiar with pandas internals :), but I don't think this is expected behaviour.
from pandas import DataFrame

f3 = DataFrame(
    [
        [95820843523155097, 1, 'director', 1],
        [95820843523155098, 1, 'director', 2],
        [95820843523155099, 1, 'director', 3],
        [95820843523155100, 2, 'director', 4],
        [95820843523155101, 2, 'computer system management (director)', 5],
        [95820843523155102, 3, 'company director', 6],
        [95820843523155103, 3, 'office manager', 7]
    ],
    columns=['uid', 'cid', 'role', 'idx']
)
f3.dtypes
uid int64
cid int64
role object
idx int64
dtype: object
Observed behaviour
f3.groupby('cid').first()
cid | uid | role | idx |
---|---|---|---|
1 | 95820843523155104 | director | 1 |
2 | 95820843523155104 | director | 4 |
3 | 95820843523155104 | company director | 6 |
The uid column contains values that are all the same and aren't in the original data. (This isn't always true in larger sets; sometimes there's an overlap.)
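The repeated value is consistent with the uid column passing through a 64-bit float at some point, although nothing in this report confirms that this is the cause. A minimal sketch of the arithmetic, using plain Python floats (IEEE 754 doubles) rather than pandas itself:

```python
# uid values from the report. All of them exceed 2**53, beyond which
# not every integer can be represented exactly as a 64-bit float.
uids = [95820843523155097 + i for i in range(7)]

# Round-trip each value through a Python float (an IEEE 754 double).
round_tripped = [int(float(u)) for u in uids]

print(all(u > 2**53 for u in uids))   # True
print(sorted(set(round_tripped)))     # [95820843523155104]
```

At this magnitude adjacent doubles are 16 apart, so all seven uids round to the same neighbour, which is the value shown in the observed output.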
Expected behaviour
f3.groupby('cid').apply(lambda g: g[:1])
cid | | uid | role | idx |
---|---|---|---|---|
1 | 0 | 95820843523155097 | director | 1 |
2 | 3 | 95820843523155100 | director | 4 |
3 | 5 | 95820843523155102 | company director | 6 |
This is what I expected to happen (i.e. the uid matches the rest of the row).
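One possible way to get the expected rows without any aggregation is to select the first row seen for each cid. This is only a sketch and not something proposed in the report: it assumes that "first" means first in row order, and unlike first() it does not skip missing values.

```python
import pandas as pd

f3 = pd.DataFrame(
    [
        [95820843523155097, 1, 'director', 1],
        [95820843523155098, 1, 'director', 2],
        [95820843523155099, 1, 'director', 3],
        [95820843523155100, 2, 'director', 4],
        [95820843523155101, 2, 'computer system management (director)', 5],
        [95820843523155102, 3, 'company director', 6],
        [95820843523155103, 3, 'office manager', 7],
    ],
    columns=['uid', 'cid', 'role', 'idx'],
)

# Keep the first row seen for each cid. Rows are selected rather than
# aggregated, so each column keeps its original dtype and values.
first_rows = f3.drop_duplicates(subset='cid').set_index('cid')
print(first_rows)
```

Here the uid column stays int64 and each uid matches the role and idx from the same source row.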
INSTALLED VERSIONS
------------------
commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.15.2
nose: 1.3.4
Cython: 0.21.1
numpy: 1.8.2
scipy: 0.14.0
statsmodels: 0.6.1
IPython: 2.3.1
sphinx: None
patsy: 0.3.0
dateutil: 2.1
pytz: 2014.9
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.0.2
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.4
lxml: 3.4.1
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: 2.5.4 (dt dec pq3 ext)