BUG: iterparse of read_xml not parsing duplicate element and attribute names #47414
Changes from 6 commits
4aa32cc
4a1a2a0
65c09d5
29211b5
3c241d5
d73b32f
35bc554
0379764
@@ -413,11 +413,21 @@ def _iterparse_nodes(self) -> list[dict[str, str | None]]:
                     row = {}
 
             if row is not None:
-                for col in self.iterparse[row_node]:
-                    if curr_elem == col:
-                        row[col] = elem.text.strip() if elem.text else None
-                    if col in elem.attrib:
-                        row[col] = elem.attrib[col]
+                if self.names:
+                    for col, nm in zip(self.iterparse[row_node], self.names):
+                        if curr_elem == col:
+                            elem_val = elem.text.strip() if elem.text else None
+                            if elem_val not in row.values() and nm not in row:
+                                row[nm] = elem_val
+                        if col in elem.attrib:
+                            if elem.attrib[col] not in row.values() and nm not in row:
+                                row[nm] = elem.attrib[col]
+                else:
+                    for col in self.iterparse[row_node]:
+                        if curr_elem == col:
+                            row[col] = elem.text.strip() if elem.text else None
+                        if col in elem.attrib:
+                            row[col] = elem.attrib[col]
 
             if event == "end":
                 if curr_elem == row_node and row is not None:

Review comments on the added line "if self.names:":
Reviewer: Could you share this between the parsers?
Author: Good point. I refactored so the code lives once in the parsers' base class.
Reviewer: This is a bit confusing now. Could you split the refactoring and the bug fix into two PRs? It is hard to review like this.
Author: Will revert to the bug fix only and, in a subsequent PR, refactor to remove the repetitive code between the parsers.
Reviewer: Thanks. I was not sure at first whether you could share more, but if you can share most of the code, then we can separate these.

@@ -661,11 +671,21 @@ def _iterparse_nodes(self) -> list[dict[str, str | None]]:
                     row = {}
 
             if row is not None:
-                for col in self.iterparse[row_node]:
-                    if curr_elem == col:
-                        row[col] = elem.text.strip() if elem.text else None
-                    if col in elem.attrib:
-                        row[col] = elem.attrib[col]
+                if self.names:
+                    for col, nm in zip(self.iterparse[row_node], self.names):
+                        if curr_elem == col:
+                            elem_val = elem.text.strip() if elem.text else None
+                            if elem_val not in row.values() and nm not in row:
+                                row[nm] = elem_val
+                        if col in elem.attrib:
+                            if elem.attrib[col] not in row.values() and nm not in row:
+                                row[nm] = elem.attrib[col]
+                else:
+                    for col in self.iterparse[row_node]:
+                        if curr_elem == col:
+                            row[col] = elem.text.strip() if elem.text else None
+                        if col in elem.attrib:
+                            row[col] = elem.attrib[col]
 
             if event == "end":
                 if curr_elem == row_node and row is not None:

@@ -1020,7 +1040,8 @@ def read_xml(
 
     names : list-like, optional
         Column names for DataFrame of parsed XML data. Use this parameter to
-        rename original element names and distinguish same named elements.
+        rename original element names and distinguish same named elements and
+        attributes.
 
     dtype : Type name or dict of column -> type, optional
         Data type for data or columns. E.g. {{'a': np.float64, 'b': np.int32,

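To make the effect of the names branch easier to follow, here is a minimal stand-alone sketch using only the standard library. It is not the pandas implementation: the function name iterparse_rows and the sample XML are hypothetical, and the sketch assumes attributes are read on "start" events and text on "end" events, whereas the diff runs both checks on every event. It keeps the same two guards, so each new name is filled once, by the first value not already stored in the row.

```python
import io
import xml.etree.ElementTree as ET


def iterparse_rows(xml_bytes, row_node, cols, names):
    """Hypothetical helper: one dict per <row_node>, pairing cols with names by position."""
    rows = []
    row = None
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("start", "end")):
        # Drop a namespace prefix such as "{uri}tag" if present.
        curr_elem = elem.tag.split("}")[1] if "}" in elem.tag else elem.tag

        if event == "start" and curr_elem == row_node:
            row = {}

        if row is not None:
            for col, nm in zip(cols, names):
                # Assumption of this sketch: attributes are complete on "start",
                # element text is only guaranteed on "end".
                if event == "start" and col in elem.attrib:
                    val = elem.attrib[col]
                    if val not in row.values() and nm not in row:
                        row[nm] = val
                if event == "end" and curr_elem == col:
                    val = elem.text.strip() if elem.text else None
                    if val not in row.values() and nm not in row:
                        row[nm] = val

        if event == "end" and curr_elem == row_node and row is not None:
            rows.append(row)
            row = None
            elem.clear()  # release the finished row element
    return rows


# "type" appears both as an attribute of <shape> and as a child element.
xml = b"""<shapes>
  <shape type="2D"><name>circle</name><type>curved</type></shape>
  <shape type="3D"><name>sphere</name><type>curved</type></shape>
</shapes>"""

rows = iterparse_rows(
    xml,
    row_node="shape",
    cols=["type", "name", "type"],
    names=["type_dim", "name", "type_edge"],
)
```

One consequence of the "not in row.values()" guard, at least in this sketch: a value already stored in the row is never stored again, so two columns that legitimately share a value within one row would leave the later column unset.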