BUG: beautifulsoup4 breaks pandas.read_html #58086

Closed
@neutrinoceros

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from io import StringIO

table = """<table border="1" class="dataframe">
<thead>
    <tr style="text-align: right;">
    <th>col0</th>
    <th>col1</th>
    <th>col2</th>
    </tr>
</thead>
<tbody>
    <tr>
    <td>1</td>
    <td>1.0</td>
    <td>a</td>
    </tr>
    <tr>
    <td>2</td>
    <td>2.5</td>
    <td>b</td>
    </tr>
    <tr>
    <td>3</td>
    <td>5.0</td>
    <td>c</td>
    </tr>
</tbody>
</table>
"""
buf = StringIO()
buf.write(table)
buf.seek(0)
pd.read_html(buf, flavor="bs4")

Issue Description

beautifulsoup4 version 4.13.0b2 breaks this example, raising the following exception:

Traceback (most recent call last):
  File "/Users/clm/dev/astropy-project/coordinated/astropy/bugs/16251/t.py", line 34, in <module>
    pd.read_html(buf, flavor="bs4")
  File "/Users/clm/.pyenv/versions/astropy.dev/lib/python3.12/site-packages/pandas/io/html.py", line 1213, in read_html
    return _parse(
           ^^^^^^^
  File "/Users/clm/.pyenv/versions/astropy.dev/lib/python3.12/site-packages/pandas/io/html.py", line 972, in _parse
    tables = p.parse_tables()
             ^^^^^^^^^^^^^^^^
  File "/Users/clm/.pyenv/versions/astropy.dev/lib/python3.12/site-packages/pandas/io/html.py", line 242, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/clm/.pyenv/versions/astropy.dev/lib/python3.12/site-packages/pandas/io/html.py", line 594, in _parse_tables
    element_name = self._strainer.name
                   ^^^^^^^^^^^^^^^^^^^
AttributeError: 'SoupStrainer' object has no attribute 'name'

I'm not sure whether it should be addressed in pandas or in bs4.
For context, this was discovered while testing astropy.
xref: astropy/astropy#16251
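
For illustration only, here is a minimal sketch of how the attribute lookup that fails in the traceback could be made tolerant of a missing `name`. It does not use the real `SoupStrainer`: `LegacyStrainer`, `NewStrainer`, and `table_element_name` are hypothetical names, and the `"table"` fallback is an assumption, not something confirmed by pandas or bs4.

```python
class LegacyStrainer:
    """Hypothetical stand-in: exposes a public ``name`` attribute."""

    def __init__(self, name):
        self.name = name


class NewStrainer:
    """Hypothetical stand-in: has no public ``name`` attribute."""

    def __init__(self, name):
        self._name = name


def table_element_name(strainer, default="table"):
    """Return ``strainer.name`` if it exists, else ``default`` (assumed fallback)."""
    return getattr(strainer, "name", default)


# Both stand-ins resolve to an element name without raising AttributeError.
print(table_element_name(LegacyStrainer("table")))
print(table_element_name(NewStrainer("table")))
```

Whether a fallback like this belongs in pandas, or whether the attribute should stay available in bs4, is the open question above.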

Expected Behavior

No exception.

Installed Versions

INSTALLED VERSIONS

commit : 4241ba5
python : 3.12.2.final.0
python-bits : 64
OS : Darwin
OS-release : 23.4.0
Version : Darwin Kernel Version 23.4.0: Fri Mar 15 00:12:49 PDT 2024; root:xnu-10063.101.17~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 3.0.0.dev0+644.g4241ba5e1
numpy : 1.26.0
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.1.0
pip : 24.0
Cython : None
pytest : 8.1.1
hypothesis : 6.98.9
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.0b2
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.3
numba : None
numexpr : None
odfpy : None
openpyxl : None
pyarrow : 15.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Metadata

    Labels

    Bug, Needs Triage (issue that has not been reviewed by a pandas team member)
