ENH: show the raw unicode in the output formatting of Index/array?

I ran into a slightly malformed CSV file. We automatically remove the BOM character from the data, but this file started with two such characters, and right now we strip only the first and keep the second. So essentially I had a dataframe like this:


```python
>>> df = pd.DataFrame({"\ufeffCol": [1, 2, 3]})
>>> df
   ﻿Col
0     1
1     2
2     3
```
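For context, the double-BOM situation can be sketched with the standard library alone. This is only an illustration of how Python's `utf-8-sig` codec behaves, not a claim about how the pandas reader handles the BOM internally. The byte string `raw` is a made-up stand-in for the file I had:

```python
import codecs

# Hypothetical file contents: two UTF-8 BOMs followed by a one-column CSV.
raw = codecs.BOM_UTF8 * 2 + b"Col\n1\n2\n3\n"

# The "utf-8-sig" codec strips at most one leading BOM, so the second survives.
text = raw.decode("utf-8-sig")
header = text.splitlines()[0]

print(repr(header))                      # the repr escapes the leftover BOM
print(header == "Col")                   # False: the label is not plain "Col"
print(header.lstrip("\ufeff") == "Col")  # True once leading BOMs are stripped
```

Stripping every leading `\ufeff` from the header, as in the last line, is one possible cleanup for files like this.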

In the dataframe repr, I think it is expected that we don't show the character (since U+FEFF is a "zero width no-break space"). In any case, I was using a notebook, and the HTML repr would certainly render the unicode character as-is.

But to diagnose the issue of `df["Col"]` failing with a KeyError, I looked at the columns:

```python
>>> df.columns
Index(['﻿Col'], dtype='str')
```

Here we do show the value as a string (i.e. it is quoted), but still don't show the unicode character in escaped form, while the Python repr of the string and the equivalent numpy array repr both do:

```python
>>> df.columns[0]
'\ufeffCol'
>>> df.columns.to_numpy()
array(['\ufeffCol'], dtype=object)
```
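Until the Index repr changes, a workaround along these lines can make hidden characters visible. This is a rough sketch, not a proposed implementation: `labels_with_hidden_chars` is a hypothetical helper name, and the plain list `labels` stands in for `list(df.columns)`. It relies on the built-in `ascii()` and on `unicodedata.category`, which reports U+FEFF as a format character (`"Cf"`):

```python
import unicodedata

# Stand-in for list(df.columns) from the example above.
labels = ["\ufeffCol", "Other"]

# ascii() escapes every non-ASCII character, so invisible ones show up.
escaped = [ascii(label) for label in labels]
print(escaped)


def labels_with_hidden_chars(labels):
    """Return the string labels containing Unicode format (Cf) characters.

    Hypothetical helper for illustration; U+FEFF falls in category Cf.
    """
    return [
        label
        for label in labels
        if isinstance(label, str)
        and any(unicodedata.category(ch) == "Cf" for ch in label)
    ]


print([ascii(label) for label in labels_with_hidden_chars(labels)])
```

Checking only category `Cf` is a deliberate simplification. Other invisible characters, such as non-breaking spaces, fall in different categories and would need their own check.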

(The above shows the new "str" dtype, but I originally ran into this with object dtype, so both have the same issue.)

It would have been much easier to debug this issue if the Index repr showed the unicode character.

ENH: show the raw unicode in the output formatting of Index/array? #60819

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

ENH: show the raw unicode in the output formatting of Index/array? #60819

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions