Skip to content

using DataFrame.resample with 'agg' method on non-existant columns provides unexpected behavior #16766

Closed
@dwilson-icr

Description

@dwilson-icr

Code Sample, a copy-pastable example if possible

import pandas as pd
from datetime import datetime

data = [
    {
        't': datetime(2017,6,1,0),
        'x': 1.0,
        'y': 2.0
    },
    {
        't': datetime(2017,6,1,1),
        'x': 2.0,
        'y': 2.0
    },
    {
        't': datetime(2017,6,1,2),
        'x': 3.0,
        'y': 1.5
    }

]

df = pd.DataFrame(data)
df = df.set_index('t')

# Perform a resample is get a binned time series DataFrame... this works fine
ts = df.resample('30T').agg({'x':['mean'],'y':['median']})
print ts['x'].shape
# (5,1)

# What if I put a field in there that doesn't exist? 
ts = df.resample('30T').agg({'x':['mean'],'y':['median'],'z':['sum']})
print ts['x'].shape
# (5,2) ??? I don't understand why the shape isn't (5,1)
print ts['x'].values
#[[ 1.   2. ]
# [ nan  nan]
# [ 2.   2. ]
# [ nan  nan]
# [ 3.   1.5]]
# Looks like a copy of the full aggregation even though I only requested the 'x' column
# Furthermore, now ts['z'] exists

Output:

(5, 1)
[[ 1.]
[ nan]
[ 2.]
[ nan]
[ 3.]]
(5, 2)
[[ 1. 2. ]
[ nan nan]
[ 2. 2. ]
[ nan nan]
[ 3. 1.5]]

Problem description

I am using pandas on records from an Elasticsearch database. The queries are pulling from multiple indices with overlapping column/field names. Most records will have most of their data in common, but when available, I want to know the values of other fields. I'm essentially creating a time series for each column with a specific time-based binning and a per-column aggregation.

I think this should either ignore columns that don't exist or raise an exception. If it silently ignores these columns, ts['x'] should return the same result as the first example. I spent several hours on a workaround today that required that I check for which columns were available for each aggregation and remove those from my agg dictionary. I feel like the current behavior doesn't have a purpose, but perhaps I'm missing something.

Expected Output

(5, 1)
[[  1.]
 [ nan]
 [  2.]
 [ nan]
 [  3.]]
(5, 1)
[[  1.]
 [ nan]
 [  2.]
 [ nan]
 [  3.]]

OR

ValueError("Column 'z' does not exist!")

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.5.final.0 python-bits: 64 OS: Linux OS-release: 3.10.0-514.2.2.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.None

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 0.9.8
Cython: None
numpy: 1.11.3
scipy: 0.18.1
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 1.5
pytz: 2012d
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions