Description
Minimal reproducible example
import pandas as pd
import numpy as np

tdf = pd.DataFrame({"tree": [0, 0, 0, 0, 1, 1, 1, 1],
                    "into": ["0-2", "0-4", "0-10", np.nan, "1-2", "1-3", "1-7", np.nan],
                    "leaf_value": [0, 0, 0, 3, 0, 0, 0, 4]},
                   index=["0-0", "0-2", "0-4", "0-10", "1-0", "1-1", "1-2", "1-7"])

def deduce_tree(df):
    print("DEDUCING TREE WITH INDEX:\n", df.index)
    next_id = df.index[0]
    # Follow the "into" pointers until a leaf (NaN) is reached
    while isinstance(next_id, str):
        print(next_id)
        next_node = df.loc[next_id, :]
        next_id = df.loc[next_id, "into"]
    print("RETURNING:\n", next_node)
    return next_node

tdf.groupby("tree").apply(deduce_tree)
DEDUCING TREE WITH INDEX:
Index(['0-0', '0-2', '0-4', '0-10'], dtype='object')
0-0
0-2
0-4
0-10
RETURNING:
tree 0
into NaN
leaf_value 3
Name: 0-10, dtype: object
DEDUCING TREE WITH INDEX:
Index(['1-0', '1-1', '1-2', '1-7'], dtype='object')
1-0
DEDUCING TREE WITH INDEX:
Index(['0-0', '0-2', '0-4', '0-10'], dtype='object')
0-0
0-2
0-4
0-10
RETURNING:
tree 0
into NaN
leaf_value 3
Name: 0-10, dtype: object
DEDUCING TREE WITH INDEX:
Index(['1-0', '1-1', '1-2', '1-7'], dtype='object')
1-0
1-2
1-7
RETURNING:
tree 1
into NaN
leaf_value 4
Name: 1-7, dtype: object
Problem description
This came up while analyzing the internals of a boosted tree model. Tracing the apply call with print statements shows that, for group 1, when execution reaches the line next_node = df.loc[next_id, :], the function stops and deduce_tree is called again, starting over with group 0.
As a result, DEDUCING TREE WITH INDEX is printed more often than the expected two times (four times in the output above), and the first call for group 1 is interrupted partway through.
The result of tdf.groupby("tree").apply(deduce_tree) is correct, but apply appears to do unnecessary work, and any side effects added to deduce_tree would run more than once.
Can anyone explain why it works like this? Is this a bug? How can a .loc call interrupt a function?
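For comparison, here is a minimal sketch of one possible workaround, not an explanation of what apply does internally. It assumes the goal is only to have deduce_tree run exactly once per group, and it iterates over the groups explicitly instead of calling apply. The calls list is a hypothetical stand-in for whatever side effect deduce_tree would perform.

```python
import numpy as np
import pandas as pd

tdf = pd.DataFrame(
    {
        "tree": [0, 0, 0, 0, 1, 1, 1, 1],
        "into": ["0-2", "0-4", "0-10", np.nan, "1-2", "1-3", "1-7", np.nan],
        "leaf_value": [0, 0, 0, 3, 0, 0, 0, 4],
    },
    index=["0-0", "0-2", "0-4", "0-10", "1-0", "1-1", "1-2", "1-7"],
)

calls = []  # hypothetical side effect: records which tree each call handled


def deduce_tree(df):
    calls.append(int(df["tree"].iloc[0]))
    next_id = df.index[0]
    # Follow the "into" pointers until a leaf (NaN, i.e. not a str) is reached
    while isinstance(next_id, str):
        next_node = df.loc[next_id, :]
        next_id = df.loc[next_id, "into"]
    return next_node


# Explicit iteration: the function is called once per group, in group order
leaves = {int(key): deduce_tree(group) for key, group in tdf.groupby("tree")}

# Assemble one row per tree, similar in shape to the apply result
result = pd.DataFrame(leaves).T
```

With this loop the number of calls depends only on the number of groups, so side effects stay predictable regardless of how apply schedules its calls in a given pandas version.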
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.2.1-1.el7.elrepo.x86_64
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.18.1
pytz : 2018.4
dateutil : 2.7.3
pip : 19.2.3
setuptools : 41.0.1
Cython : None
pytest : 4.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.1 (dt dec pq3 ext lo64)
jinja2 : 2.10
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.2.8
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None