Unexpected behaviour in pandas.DataFrame.first_valid_index (and last...) #20499

Closed
@asdf8601

Code Sample

import pandas as pd
idx_w_freq = pd.date_range('20100101', periods=3, freq='B')

# Series
# ======

# this works fine
s = pd.Series([1,2,3], index=idx_w_freq)
s.first_valid_index()
Out[5]: Timestamp('2010-01-01 00:00:00', freq='B')

# this works fine
s_nan = pd.Series([None,None,3], index=idx_w_freq)
s_nan.first_valid_index()
Out[7]: Timestamp('2010-01-05 00:00:00', freq='B')

# this works fine
s_nan = pd.Series([1,None,3], index=idx_w_freq)
s_nan.first_valid_index()
Out[9]: Timestamp('2010-01-01 00:00:00', freq='B')

# DataFrame (here is the problem)
# =========

# this works fine
df = pd.DataFrame([1,2,3], index=idx_w_freq)
df.first_valid_index()
Out[11]: Timestamp('2010-01-01 00:00:00', freq='B')

# this works fine
df_nan = pd.DataFrame([None,2,3], index=idx_w_freq)
df_nan.first_valid_index()
Out[13]: Timestamp('2010-01-04 00:00:00', freq='B')

# this works fine
df_nan = pd.DataFrame([[None,None], [None, None], [None, 3]], index=idx_w_freq)
df_nan.first_valid_index()
Out[15]: Timestamp('2010-01-05 00:00:00', freq='B')

# UNEXPECTED OUTPUT WITHOUT FREQUENCY
df_w_holes = pd.DataFrame([[1,None], [None, None], [3, 3]], index=idx_w_freq)
df_w_holes.first_valid_index()
Out[17]: Timestamp('2010-01-01 00:00:00')

Problem description

The method implemented in pandas.Series works correctly in all cases. However, the method implemented in pandas.DataFrame returns a Timestamp without the frequency when there are holes in the values.

This is because _get_valid_indices() selects from the index with a plain boolean mask (fancy indexing), and the resulting index loses its frequency, as shown below:

# pandas.core.frame.DataFrame#_get_valid_indices
def _get_valid_indices(self):
    is_valid = self.count(1) > 0
    return self.index[is_valid]

The problem has been present since 0.19.1.
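The effect of the masked selection can be checked on the index alone, without a DataFrame. The snippet below is a minimal sketch rather than the pandas implementation; the name mask_with_hole is an assumption introduced for illustration, and it mimics the row validity of df_w_holes (valid, invalid, valid). It inspects the frequency of the index itself, so it does not depend on scalar Timestamps carrying a freq.

```python
import numpy as np
import pandas as pd

idx_w_freq = pd.date_range('20100101', periods=3, freq='B')

# Hypothetical mask with a hole, like the rows of df_w_holes:
# valid, invalid, valid.
mask_with_hole = np.array([True, False, True])

# Selecting with a non-contiguous boolean mask builds a new DatetimeIndex.
selected = idx_w_freq[mask_with_hole]

print(idx_w_freq.freq)  # frequency of the original index
print(selected.freq)    # frequency after the masked selection
```

The original index reports a business-day frequency, while the selection made through the mask reports none, which is consistent with the Timestamp in Out[17] having no freq.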

Output of pd.show_versions()

pd.show_versions()
INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: es_ES.UTF-8
pandas: 0.22.0
pytest: 3.4.2
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.2.0
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

Possible solution

I would like to know whether this is actually a bug. If it is, I would suggest adding the following functions as methods of the NDFrame class, replacing the current methods, because they work with both Series and DataFrames. I would be happy to open a pull request if that is OK.

def first_valid_index(data):
    """Return label for first non-NA/null value.

    Parameters
    ----------
    data : pandas.Series or pandas.DataFrame
        Input data object.

    Returns
    -------
    first valid index: index
    """
    if len(data) == 0:
        return None
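    mask = data.count(1) > 0  # True for rows with at least one non-null value
    i = mask.values.argmax()  # integer position of the first True value
    if not mask.iat[i]:  # no valid values in data
        return None
    else:
        return data.index[i]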


def last_valid_index(data):
    """Return index label for last non-NA/null value.

    Parameters
    ----------
    data : pandas.Series or pandas.DataFrame
        Input data object.

    Returns
    -------
    index_label: type of input index
        Index label for the last non-NA/null value.
    """
    if len(data) == 0:
        return None

    # Count the non-null values per row; a row is valid if the count is > 0.
    mask = data.count(1) > 0
    # Find the integer offset of the first True value, starting from the end.
    i = mask._values[::-1].argmax()
    if not mask.iat[len(data) - i - 1]:  # no valid values in data
        return None
    else:
        return data.index[len(data) - i - 1]
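For comparison, here is a minimal sketch of the same idea that accepts both Series and DataFrame by building the validity mask with notna() and looking the label up by integer position, so the index is never sliced with a boolean mask. The names _row_validity, first_valid_label and last_valid_label are hypothetical helpers introduced for this example; this is not the code that pandas ships.

```python
import pandas as pd


def _row_validity(data):
    """Return a boolean NumPy array: True where a row has any non-null value."""
    valid = data.notna()
    if isinstance(valid, pd.DataFrame):
        # A DataFrame row is valid if at least one of its columns is non-null.
        valid = valid.any(axis=1)
    return valid.to_numpy()


def first_valid_label(data):
    """Return the index label of the first valid row, or None."""
    mask = _row_validity(data)
    if len(mask) == 0 or not mask.any():
        return None
    # argmax gives the integer position of the first True value.
    return data.index[mask.argmax()]


def last_valid_label(data):
    """Return the index label of the last valid row, or None."""
    mask = _row_validity(data)
    if len(mask) == 0 or not mask.any():
        return None
    # Search the reversed mask, then convert back to a forward position.
    return data.index[len(mask) - 1 - mask[::-1].argmax()]


idx_w_freq = pd.date_range('20100101', periods=3, freq='B')
df_w_holes = pd.DataFrame([[1, None], [None, None], [3, 3]], index=idx_w_freq)
s_nan = pd.Series([None, None, 3], index=idx_w_freq)

print(first_valid_label(df_w_holes), last_valid_label(df_w_holes))
print(first_valid_label(s_nan), last_valid_label(s_nan))
```

Because the label is taken from the original index by position, whatever metadata the index attaches to its elements is handled the same way for Series and DataFrame inputs.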

Labels

Bug, Frequency, Indexing, Reshaping
