Description
### Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
### Reproducible Example
import pandas as pd
xml = """
<data>
<row>
<circle>round</circle>
<square>sq</square>
</row>
<row>
<circle angle="360">round too</circle>
<square>blocky</square>
</row>
</data>
"""
df = pd.read_xml(xml, xpath="./row")
print(df)
Produces:
      circle  square
0      round      sq
1  round too  blocky
### Issue Description
`pd.read_xml` does not return attributes when they do not appear in the first row. The implementation appears to expect the attribute to be present in every row, but XML attributes are often used only intermittently, so this does not seem like correct behavior.
Interestingly, if you try to extract only the attributes with the above example:
df = pd.read_xml(xml, xpath="./row", attrs_only=True)
an IndexError is thrown:
df = pd.read_xml(xml, xpath="./row", attrs_only=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "...Python39\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "...Python39\lib\site-packages\pandas\io\xml.py", line 938, in read_xml
return _parse(
File "...Python39\lib\site-packages\pandas\io\xml.py", line 733, in _parse
data_dicts = p.parse_data()
File "...Python39\lib\site-packages\pandas\io\xml.py", line 398, in parse_data
return self._parse_nodes()
File "...Python39\lib\site-packages\pandas\io\xml.py", line 469, in _parse_nodes
if self.namespaces or "}" in list(dicts[0].keys())[0]:
IndexError: list index out of range
which is probably not correct either.
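A possible explanation for the `IndexError`, sketched without pandas. It assumes that in `attrs_only` mode the parser builds one dict per matched node from that node's own attributes; that is a guess about the internals, not something confirmed here. Under that assumption the `<row>` nodes carry no attributes, so the first dict is empty and the expression from the last traceback frame fails:

```python
import xml.etree.ElementTree as ET

xml = """
<data>
<row>
<circle>round</circle>
<square>sq</square>
</row>
<row>
<circle angle="360">round too</circle>
<square>blocky</square>
</row>
</data>
"""

rows = ET.fromstring(xml).findall("./row")

# Assumption: attrs_only keeps only the attributes of each matched node.
# <row> has no attributes of its own, so every dict is empty.
dicts = [dict(row.attrib) for row in rows]

error_name = None
try:
    # Same expression as the last frame of the traceback above.
    "}" in list(dicts[0].keys())[0]
except IndexError as exc:
    error_name = type(exc).__name__
    print(f"{error_name}: {exc}")
```

If this reading is right, the error comes from indexing into an empty key list rather than from the `angle` attribute itself.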
### Expected Behavior
The DataFrame was expected to contain:
      circle circle_angle  square
0      round           NA      sq
1  round too          360  blocky
where the `circle_angle` column could simply have been named `angle`, although that could cause collisions when different elements carry attributes with the same name. Instead, the `angle` attribute was ignored.
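Until this is supported, the expected layout can be approximated by flattening the XML by hand. This is a minimal sketch using only the standard library; `rows_to_records` is a hypothetical helper (not part of pandas), and the `tag_attr` column naming is just the convention proposed above:

```python
import xml.etree.ElementTree as ET


def rows_to_records(xml_text, row_path="./row"):
    """Hypothetical helper: one dict per row, child attributes stored as tag_attr."""
    records = []
    for row in ET.fromstring(xml_text).findall(row_path):
        record = {}
        for child in row:
            # Element text goes under the element's tag name.
            record[child.tag] = child.text
            # Each attribute goes under "<tag>_<attribute>" to avoid name clashes.
            for attr, value in child.attrib.items():
                record[f"{child.tag}_{attr}"] = value
        records.append(record)
    return records


xml = """
<data>
<row>
<circle>round</circle>
<square>sq</square>
</row>
<row>
<circle angle="360">round too</circle>
<square>blocky</square>
</row>
</data>
"""

records = rows_to_records(xml)
print(records)
```

Passing `records` to `pd.DataFrame` should then give a frame with a missing value where the attribute is absent. The attribute values stay strings, and the column order follows first appearance rather than the order shown above.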
### Installed Versions
<details>
INSTALLED VERSIONS
------------------
commit : e8093ba372f9adfe79439d90fe74b0b5b6dea9d6
python : 3.9.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19043
machine : AMD64
processor : Intel64 Family 6 Model 167 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.4.3
numpy : 1.21.2
pytz : 2021.3
dateutil : 2.8.2
setuptools : 56.0.0
pip : 22.1.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : 1.0.9
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : 2.1.1
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
</details>