groupby type coercion dependent on presence of datetime column in grouped data #14849

Closed
@wes-turner

Description

Code Sample, a copy-pastable example if possible

from datetime import datetime  # pd.datetime was only an alias and is gone in pandas 2.0+

import pandas as pd

foo = pd.DataFrame.from_records(
    [
      (datetime(2016,1,1), 'red', 'dark', 1, '8'),
      (datetime(2015,1,1), 'green', 'stormy', 2, '9'),
      (datetime(2014,1,1), 'blue', 'bright', 3, '10'),
      (datetime(2013,1,1), 'blue', 'calm', 4, 'potato')
    ],
    columns=['observation', 'color', 'mood', 'intensity', 'score'])

# The type of 'score' changes depending on the types passed through the groupby
print(pd.concat(
    [
        foo.dtypes,
        foo.loc[:,['observation', 'color', 'mood', 'intensity', 'score']].groupby('color').apply(lambda g: g.iloc[0]).dtypes,
        foo.loc[:,[               'color', 'mood', 'intensity', 'score']].groupby('color').apply(lambda g: g.iloc[0]).dtypes
    ],
    axis=1,
    keys=['original DF', 'w/ datetime', 'w/o datetime']))

Problem description

When each group's result is a Series that contains a datetime, and those Series are combined back into a DataFrame, columns of object type are cast to a numeric type when possible. When the Series contain no datetime, they are not.

The presence of a datetime elsewhere in the Series should not affect unrelated columns. Doing no implicit type coercion seems (to me) like the safest option (especially in a language where "1" != 1). But regardless, whether type coercion is done for a column 'A' should not depend on the types of the other columns 'B'.
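To illustrate why any dtype inference happens here at all, the sketch below looks at what a single group hands back from `lambda g: g.iloc[0]`. It uses a two-row subset of the data above, and it imports `datetime` from the standard library (an assumption made so the snippet also runs on pandas versions where `pd.datetime` is unavailable). A row taken from a mixed-dtype frame is one object-dtype Series, so the original per-column dtypes are not carried along and have to be re-inferred when the rows are reassembled.

```python
from datetime import datetime

import pandas as pd

# Two-row subset of the frame from the report above.
foo = pd.DataFrame.from_records(
    [
        (datetime(2016, 1, 1), 'red', 'dark', 1, '8'),
        (datetime(2013, 1, 1), 'blue', 'calm', 4, 'potato'),
    ],
    columns=['observation', 'color', 'mood', 'intensity', 'score'])

# This is what each group returns from `lambda g: g.iloc[0]`:
# a single Series holding values of several different types.
row = foo.iloc[0]

print(row.dtype)           # dtype of the whole row Series
print(type(row['score']))  # the score value itself is still a Python str
```

The value of 'score' is still a string at this point; the report is about what happens to it afterwards, when the per-group rows are combined into the result frame.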

Issue #14423 is a different problem over the same code.
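As a possible workaround rather than a fix, the first row of each group can be selected without going through `apply`. The sketch below assumes that `groupby(...).head(1)` is an acceptable substitute for `apply(lambda g: g.iloc[0])`; it selects rows from the original frame instead of rebuilding the result from per-group Series, so the column dtypes are the ones the frame already had. Note that the result shape differs: 'color' stays a regular column and the original row index is kept.

```python
from datetime import datetime

import pandas as pd

foo = pd.DataFrame.from_records(
    [
        (datetime(2016, 1, 1), 'red', 'dark', 1, '8'),
        (datetime(2015, 1, 1), 'green', 'stormy', 2, '9'),
        (datetime(2014, 1, 1), 'blue', 'bright', 3, '10'),
        (datetime(2013, 1, 1), 'blue', 'calm', 4, 'potato'),
    ],
    columns=['observation', 'color', 'mood', 'intensity', 'score'])

# head(1) filters the original rows (first row per group), so nothing is
# reassembled from object-dtype row Series and no dtype is re-inferred.
first_rows = foo.groupby('color').head(1)

# Side-by-side dtypes: both columns should be identical.
print(pd.concat([foo.dtypes, first_rows.dtypes],
                axis=1, keys=['original DF', 'head(1)']))
```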

Expected Output

Current:

                original DF     w/ datetime w/o datetime
color                object          object       object
intensity             int64           int64        int64
mood                 object          object       object
observation  datetime64[ns]  datetime64[ns]          NaN
score                object           int64       object

Expected:

                original DF     w/ datetime w/o datetime
color                object          object       object
intensity             int64           int64        int64
mood                 object          object       object
observation  datetime64[ns]  datetime64[ns]          NaN
score                object          object       object

-or-

                original DF     w/ datetime w/o datetime
color                object          object       object
intensity             int64           int64        int64
mood                 object          object       object
observation  datetime64[ns]  datetime64[ns]          NaN
score                object           int64        int64

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-101-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.1
nose: 1.3.1
pip: 1.5.4
setuptools: 3.3
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
