
BUG: read_csv skips leading space where it shouldn't #34085

Open
@plammens

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

I wasn't able to isolate a (more) minimal example, so I'll just share what I was working on. nltk here is version 3.4.5.

import csv
import string

import nltk
import pandas as pd


UNFORMATTED = set(string.ascii_lowercase)
PUNCTUATION = set(" !\"&'(),-.:;?[]_`")
ALLOWED = UNFORMATTED | set(string.ascii_uppercase) | PUNCTUATION

EMPTY = '<NONE>'
CAPITALIZE = '<CAP>'


def generate_sequences(text: str, k: int):
    """
    Yields tuples of subsequence of k characters, next character
    (if within a special set)
    """
    for i in range(len(text) - k):
        seq = text[i:i + k]
        next_char = text[i + k]
        punct_char = (next_char if next_char in PUNCTUATION else
                      CAPITALIZE if next_char.isupper() else EMPTY)
        yield seq, punct_char


gutenberg = nltk.corpus.gutenberg
gutenberg.ensure_loaded()

sample_file = gutenberg.fileids()[0]
sample = ' '.join(gutenberg.raw(sample_file).split())

with open('seq.txt', 'w') as file:
    file.writelines(f"{seq}|{punct}\n" for seq, punct in generate_sequences(sample, k=10))

df = pd.read_csv('seq.txt', sep='|', quoting=csv.QUOTE_NONE,
                 names=['sequence', 'next_char'], skipinitialspace=False,
                 dtype=str, na_filter=False)

seq_length = len(df.at[0, 'sequence'])
lengths = df['sequence'].apply(len)
assert (lengths == seq_length).all()

Problem description

An AssertionError is raised because one element of the sequence column does not have length 10, even though the script writes every line of the text file as exactly 10 characters, followed by the separator |, followed by another value.

Upon inspection:

>>> df.assign(length=lengths)[lengths != seq_length]
          sequence next_char  length
763047  it could     <NONE>       9

but

>>> with open('seq.txt') as file:
...    lines = file.readlines()
...
>>> lines[763047]
' it could |<NONE>\n'
>>> len(lines[763047].split('|')[0])
10

This is unexpected behaviour because skipinitialspace=False, quoting=csv.QUOTE_NONE, dtype=str and na_filter=False were all passed to pd.read_csv, so the values should be read verbatim (i.e. including any leading space).

Moreover, this behaviour is inconsistent, since there are plenty of other values in seq.txt with leading spaces that do get parsed correctly.
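
To list every affected row rather than checking one line by hand, the manual comparison above can be wrapped in a small helper. This is only a sketch: the name find_field_mismatches is hypothetical (it is not part of pandas), and it assumes one record per line with the separator never appearing inside the first field, which holds for seq.txt as generated by the script.

```python
from typing import Iterable, Iterator, Tuple


def find_field_mismatches(
    path: str, parsed: Iterable[str], sep: str = "|"
) -> Iterator[Tuple[int, str, str]]:
    """Yield (row, raw_field, parsed_field) for every row whose first field,
    obtained with a plain str.split on the raw line, differs from the value
    the parser returned for that row."""
    # newline="" keeps the original line endings so they can be stripped explicitly.
    with open(path, newline="") as file:
        for row, (line, value) in enumerate(zip(file, parsed)):
            raw_field = line.rstrip("\r\n").split(sep, 1)[0]
            if raw_field != value:
                yield row, raw_field, value


# Hypothetical usage with the DataFrame from the script above:
#     mismatches = list(find_field_mismatches("seq.txt", df["sequence"]))
```

Going by the inspection above, this would be expected to flag row 763047 and no other row, but I have not run it against the full file.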

What's even weirder (and probably close to the crux of the problem) is that setting EMPTY to '<EMPTY>' or something else instead of '<NONE>' in the script above makes the problem disappear. Furthermore, values of EMPTY consisting of exactly four characters enclosed in angle brackets and starting with NA also produce the error: '<NAAA>' and '<NAZZ>' trigger it, but '<NA>', '<NAAA' and '<NAA>' do not.

Probably this has to do with NA parsing? Though I thought passing na_filter=False should have fixed that.
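
As a sanity check on that guess, the markers tried above can be tabulated by length. The outcomes in this sketch are copied from this report, not re-run, and lengths_by_outcome is just an illustrative helper name.

```python
# Outcomes as reported in this issue (True = AssertionError raised); not re-run here.
REPORTED = {
    "<NONE>": True,
    "<NAAA>": True,
    "<NAZZ>": True,
    "<EMPTY>": False,
    "<NA>": False,
    "<NAAA": False,
    "<NAA>": False,
}


def lengths_by_outcome(reported):
    """Return (sorted lengths of failing markers, sorted lengths of passing markers)."""
    failing = sorted({len(marker) for marker, fails in reported.items() if fails})
    passing = sorted({len(marker) for marker, fails in reported.items() if not fails})
    return failing, passing


failing, passing = lengths_by_outcome(REPORTED)
print("lengths that fail:", failing)  # lengths that fail: [6]
print("lengths that pass:", passing)  # lengths that pass: [4, 5, 7]
```

Every failing marker is six characters long and no passing marker is, so the trigger may be the length of the second field (and therefore of each line) rather than NA parsing as such. That is only an inference from the handful of values tried here.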

Expected Output

All elements of df['sequence'] are strings of the same length (10 in this case), so no AssertionError.

Output of pd.show_versions()

For installed environment:

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.7.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None
pandas           : 1.0.3
numpy            : 1.18.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 46.1.3.post20200330
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.13.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.2.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : None

For test on master:

INSTALLED VERSIONS
------------------
commit           : 998a0deea39f11fa06071af77cc1afba65900330
python           : 3.8.2.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United Kingdom.1252
pandas           : 1.0.3
numpy            : 1.18.4
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.1
setuptools       : 46.1.3
Cython           : 0.29.17
pytest           : 5.4.2
hypothesis       : 5.11.0
sphinx           : 3.0.3
blosc            : 1.9.1
feather          : None
xlsxwriter       : 1.2.8
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.14.0
pandas_datareader: None
bs4              : 4.9.0
bottleneck       : 1.3.2
fastparquet      : 0.3.3
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.2.1
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : 0.17.0
pytables         : None
pytest           : 5.4.2
pyxlsb           : None
s3fs             : 0.4.2
scipy            : 1.4.1
sqlalchemy       : 1.3.16
tables           : 3.6.1
tabulate         : None
xarray           : 0.15.1
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : 1.2.8
numba            : 0.49.1
