ENH: read_xml() does not allow to specify huge_tree=True for the 'lxml' parser. #61290

Open
@sergiykhan

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

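With the 'lxml' parser, read_xml() fails on my file with the error message XMLSyntaxError: xmlSAX2Characters: huge text node, and there is no way to pass huge_tree=True through to the underlying parser.

The same error can be avoided when parsing the tree manually with lxml, by constructing the parser with huge_tree=True: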

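from lxml import etree

# huge_tree=True disables libxml2's safety limits on text size and tree depth
parser = etree.XMLParser(huge_tree=True)
# pass the path directly so lxml handles the file's encoding itself
tree = etree.parse(filename, parser)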

Feature Description

I am not sure what the best way to supply options to the parser would be.

Alternative Solutions

Right now, I have to read the file using the 'etree' parser instead:

import pandas as pd

df = pd.read_xml(
    filename,
    parser='etree',
)
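Another possible workaround, sketched here under assumptions, is to parse the file with lxml directly using huge_tree=True and build the DataFrame by hand. The helper name read_xml_huge is hypothetical and not part of pandas. It only handles flat records (one row per matched element, one column per attribute or child element), loosely mirroring the default xpath="./*" of read_xml().

```python
import io

import pandas as pd
from lxml import etree


def read_xml_huge(source, xpath="./*"):
    """Hypothetical helper: parse with huge_tree=True, then build a DataFrame.

    `source` may be a file path or a binary file-like object.
    """
    parser = etree.XMLParser(huge_tree=True)
    tree = etree.parse(source, parser)

    rows = []
    for elem in tree.getroot().xpath(xpath):
        # attributes first, then the text of each direct child element
        row = dict(elem.attrib)
        for child in elem:
            # skip comments and processing instructions (their tag is not a str)
            if isinstance(child.tag, str):
                row[child.tag] = child.text
        rows.append(row)
    return pd.DataFrame(rows)
```

All values come back as strings (or None for empty elements), so numeric columns would still need an explicit conversion, for example with pd.to_numeric.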

Additional Context

Similarly, it would be useful to be able to pass other options to the lxml parser, such as recover=True.
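For illustration, this is roughly what recover=True does when lxml is used directly. The sample document is made up for this sketch. The strict parser rejects it, while the recovering parser returns whatever tree it can reconstruct.

```python
from lxml import etree

# made-up malformed input: the second <row> is never closed
broken = b"<data><row><a>1</a></row><row><a>2</a></data>"

# the default (strict) parser raises on malformed input
try:
    etree.fromstring(broken)
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# recover=True asks libxml2 to keep going and return a best-effort tree;
# it can be combined with huge_tree=True on the same parser
parser = etree.XMLParser(recover=True, huge_tree=True)
root = etree.fromstring(broken, parser)
```

Because recovery is best effort, the resulting tree may silently drop or restructure content, which is probably a reason to keep such an option opt-in.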

Metadata

Labels

Enhancement, Needs Triage
