Description
Code Sample
import pandas as pd
idx_w_freq = pd.date_range('20100101', periods=3, freq='B')
# Series
# ======
# this works fine
s = pd.Series([1,2,3], index=idx_w_freq)
s.first_valid_index()
Out[5]: Timestamp('2010-01-01 00:00:00', freq='B')
# this works fine
s_nan = pd.Series([None,None,3], index=idx_w_freq)
s_nan.first_valid_index()
Out[7]: Timestamp('2010-01-05 00:00:00', freq='B')
# this works fine
s_nan = pd.Series([1,None,3], index=idx_w_freq)
s_nan.first_valid_index()
Out[9]: Timestamp('2010-01-01 00:00:00', freq='B')
# DataFrame (here is the problem)
# =========
# this works fine
df = pd.DataFrame([1,2,3], index=idx_w_freq)
df.first_valid_index()
Out[11]: Timestamp('2010-01-01 00:00:00', freq='B')
# this works fine
df_nan = pd.DataFrame([None,2,3], index=idx_w_freq)
df_nan.first_valid_index()
Out[13]: Timestamp('2010-01-04 00:00:00', freq='B')
# this works fine
df_nan = pd.DataFrame([[None,None], [None, None], [None, 3]], index=idx_w_freq)
df_nan.first_valid_index()
Out[15]: Timestamp('2010-01-05 00:00:00', freq='B')
# UNEXPECTED OUTPUT WITHOUT FREQUENCY
df_w_holes = pd.DataFrame([[1,None], [None, None], [3, 3]], index=idx_w_freq)
df_w_holes.first_valid_index()
Out[17]: Timestamp('2010-01-01 00:00:00')
Problem description
The method implemented in pandas.Series works fine in all cases. However, the method implemented in pandas.DataFrame returns an index label without the frequency when there are holes in the values.
This is because _get_valid_indices() selects from the index with a plain boolean mask (fancy indexing), which drops the frequency, as shown below:
# pandas.core.frame.DataFrame#_get_valid_indices
def _get_valid_indices(self):
    is_valid = self.count(1) > 0
    return self.index[is_valid]
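The effect can be checked on the index itself, independently of `first_valid_index()`. The following is a minimal sketch that reuses the frame from the code sample; the variable names `selected` and `first_by_position` are only illustrative. It compares the `freq` attribute of the index before and after the boolean-mask selection, and also takes the first valid label by integer position, which is the alternative the rest of this report relies on.

```python
import pandas as pd

idx_w_freq = pd.date_range("20100101", periods=3, freq="B")
df_w_holes = pd.DataFrame([[1, None], [None, None], [3, 3]], index=idx_w_freq)

# Same mask as in _get_valid_indices: a row is valid if it has any non-null value.
is_valid = (df_w_holes.count(axis=1) > 0).values  # array([True, False, True])

# Boolean-mask selection yields a new index; with a hole in the mask the
# remaining timestamps are no longer regularly spaced.
selected = idx_w_freq[is_valid]

original_freq = idx_w_freq.freq   # business-day offset
selected_freq = selected.freq     # None once the mask has a hole

# Alternative: look up the first valid row by integer position on the
# original index instead of building a masked index first.
first_by_position = idx_w_freq[is_valid.argmax()]

print(original_freq, selected_freq, first_by_position)
```

Because the middle row is entirely null, the masked index cannot be described by a single offset any more, so its `freq` is reset and any label taken from it carries no frequency.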
The problem has been present since 0.19.1.
Output of pd.show_versions()
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: es_ES.UTF-8
pandas: 0.22.0
pytest: 3.4.2
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.2
scipy: 1.0.0
pyarrow: 0.9.0
xarray: 0.10.2
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.2.0
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None
Possible solution
I would like to know whether this is actually a bug. If it is, I would propose adding the following functions as methods of the NDFrame class, replacing the current methods, since they work with both Series and DataFrames. I would be happy to open the pull request if this is OK.
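def first_valid_index(data):
    """Return index label for first non-NA/null value.

    Parameters
    ----------
    data : pandas.Series or pandas.DataFrame
        Input data object.

    Returns
    -------
    index_label: type of input index
        Index label for the first non-NA/null value.
    """
    if len(data) == 0:
        return None

    # count number of non-null values per row, if
    # result is greater than 0, then the row is valid
    mask = data.count(1) > 0
    # find the integer position of the first True value
    i = mask._values.argmax()
    if not mask.iat[i]:  # no valid values in data
        return None
    else:
        return data.index[i]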
def last_valid_index(data):
    """Return index label for last non-NA/null value.

    Parameters
    ----------
    data : pandas.Series or pandas.DataFrame
        Input data object.

    Returns
    -------
    index_label: type of input index
        Index label for the last non-NA/null value.
    """
    if len(data) == 0:
        return None

    # count number of non-null values per row, if
    # result is greater than 0, then the row is valid
    mask = data.count(1) > 0
    # find the integer index of the first True
    # value starting from the end
    i = mask._values[::-1].argmax()
    if not mask.iat[len(data) - i - 1]:  # no valid values in data
        return None
    else:
        return data.index[len(data) - i - 1]
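For discussion, here is a standalone variant of the same idea that can be run outside pandas internals. It is only a sketch under a few assumptions: the helper names (`_valid_row_mask`, `first_valid_label`, `last_valid_label`) are hypothetical, and the validity mask is built with `notna()` rather than `count(1)`, because `count(1)` is a DataFrame-style call that `Series.count` may not accept. The key point is unchanged: find the integer position of the first or last valid row, then look the label up on the original index instead of on a masked copy.

```python
import numpy as np
import pandas as pd


def _valid_row_mask(data):
    """Boolean NumPy array, True where a row holds at least one non-NA value."""
    valid = data.notna()
    if valid.ndim == 2:  # DataFrame: a row is valid if any column is non-NA
        valid = valid.any(axis=1)
    return np.asarray(valid, dtype=bool)


def first_valid_label(data):
    """Label of the first row with a non-NA value, or None if there is none."""
    mask = _valid_row_mask(data)
    if not mask.any():  # also covers empty input
        return None
    return data.index[mask.argmax()]  # positional lookup on the original index


def last_valid_label(data):
    """Label of the last row with a non-NA value, or None if there is none."""
    mask = _valid_row_mask(data)
    if not mask.any():
        return None
    # argmax on the reversed mask gives the distance from the end
    return data.index[len(mask) - 1 - mask[::-1].argmax()]


idx = pd.date_range("20100101", periods=3, freq="B")
df_w_holes = pd.DataFrame([[1, None], [None, None], [3, 3]], index=idx)
s_nan = pd.Series([None, None, 3], index=idx)

first_df = first_valid_label(df_w_holes)
last_df = last_valid_label(df_w_holes)
first_s = first_valid_label(s_nan)
```

Since the label is taken directly from `data.index`, it is the same object that plain positional indexing on the index would return, which is presumably why the Series results in the code sample keep their frequency.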