Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
I wasn't able to isolate a more minimal example, so I'll just share what I was working on. nltk here is version 3.4.5.
import csv
import string

import nltk
import pandas as pd

UNFORMATTED = set(string.ascii_lowercase)
PUNCTUATION = set(" !\"&'(),-.:;?[]_`")
ALLOWED = UNFORMATTED | set(string.ascii_uppercase) | PUNCTUATION
EMPTY = '<NONE>'
CAPITALIZE = '<CAP>'


def generate_sequences(text: str, k: int):
    """
    Yields tuples of subsequence of k characters, next character
    (if within a special set)
    """
    for i in range(len(text) - k):
        seq = text[i:i + k]
        next_char = text[i + k]
        punct_char = (next_char if next_char in PUNCTUATION else
                      CAPITALIZE if next_char.isupper() else EMPTY)
        yield seq, punct_char


gutenberg = nltk.corpus.gutenberg
gutenberg.ensure_loaded()
sample_file = gutenberg.fileids()[0]
sample = ' '.join(gutenberg.raw(sample_file).split())

with open('seq.txt', 'w') as file:
    file.writelines(f"{seq}|{punct}\n" for seq, punct in generate_sequences(sample, k=10))

df = pd.read_csv('seq.txt', sep='|', quoting=csv.QUOTE_NONE,
                 names=['sequence', 'next_char'], skipinitialspace=False,
                 dtype=str, na_filter=False)

seq_length = len(df.at[0, 'sequence'])
lengths = df['sequence'].apply(len)
assert (lengths == seq_length).all()
Problem description
AssertionError is raised because there is one element in the sequence column that isn't of length 10, even though the text file was deliberately crafted to contain sequences of exactly 10 characters, followed by the separator |, followed by another value.
Upon inspection:
>>> df.assign(length=lengths)[lengths != seq_length]
sequence next_char length
763047 it could <NONE> 9
but
>>> with open('seq.txt') as file:
...     lines = file.readlines()
...
>>> lines[763047]
' it could |<NONE>\n'
>>> len(lines[763047].split('|')[0])
10
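The manual check above can be extended to every line of the file without involving pandas. The following is a minimal sketch using only the standard library. The helper name bad_first_fields and its parameters are hypothetical, chosen for illustration, and it assumes the separator never occurs inside the sequence itself.

```python
from typing import Iterable, List, Tuple


def bad_first_fields(lines: Iterable[str], sep: str = '|',
                     expected: int = 10) -> List[Tuple[int, str]]:
    """Return (line_index, first_field) for each line whose first field
    is not exactly `expected` characters long."""
    bad = []
    for idx, line in enumerate(lines):
        # Split on the first separator only and keep the field untouched,
        # so leading and trailing spaces are counted.
        first = line.rstrip('\n').split(sep, 1)[0]
        if len(first) != expected:
            bad.append((idx, first))
    return bad


# The line reported above has a 10-character first field, so nothing is flagged.
print(bad_first_fields([' it could |<NONE>\n', 'it could b|<NONE>\n']))  # → []
```

Passing an open file object for seq.txt instead of the list should return an empty list if the file really contains only 10-character sequences, which would confirm that the discrepancy comes from parsing rather than from the file.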
This is unexpected behaviour because skipinitialspace=False, quoting=csv.QUOTE_NONE, dtype=str and na_filter=False were all passed to pd.read_csv, meaning that the values should be interpreted as raw as they come (i.e. including any leading space).
Moreover, this behaviour is inconsistent, since there are plenty of other examples in seq.txt of values with leading spaces that do get parsed correctly.
What's even weirder (and probably close to the crux of the problem) is that setting EMPTY to '<EMPTY>' or something else instead of '<NONE>' in the script above makes the problem disappear. Furthermore, any value of EMPTY with exactly four characters enclosed in angle brackets starting with NA produces the error. That is, '<NAAA>' and '<NAZZ>' do produce the error, but '<NA>', '<NAAA' and '<NAA>' do not.
Probably this has to do with NA parsing? Though I thought passing na_filter=False should have fixed that.
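One pattern in the values listed above: every EMPTY value that triggers the error ('<NONE>', '<NAAA>', '<NAZZ>') is six characters long, and every value that does not has a different length. Changing the length of EMPTY shifts the byte position of every later line in seq.txt, so the position of the affected line in the file, rather than NA matching, might be what matters. This is only a guess and has not been confirmed. A minimal sketch for comparing line positions between runs follows; line_start_offsets is a hypothetical helper, not a pandas function.

```python
import io
from typing import List


def line_start_offsets(data: bytes) -> List[int]:
    """Return the byte offset at which each line of `data` starts."""
    offsets = []
    position = 0
    # Iterating a binary stream yields lines split on b'\n', newline included.
    for raw_line in io.BytesIO(data):
        offsets.append(position)
        position += len(raw_line)
    return offsets


# A longer EMPTY token moves every subsequent line further into the file.
with_none = b"it could b|<NONE>\n" * 3
with_empty = b"it could b|<EMPTY>\n" * 3
print(line_start_offsets(with_none))   # → [0, 18, 36]
print(line_start_offsets(with_empty))  # → [0, 19, 38]
```

Reading seq.txt in binary mode ('rb') and looking up the offset of line 763047 for several EMPTY values would show whether the failing runs share a position that the passing runs do not.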
Expected Output
All elements of df['sequence'] are strings of the same length (10 in this case), so no AssertionError.
Output of pd.show_versions()
For installed environment:
INSTALLED VERSIONS
------------------
commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.0.3
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3.post20200330
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None
For test on master:
INSTALLED VERSIONS
------------------
commit : 998a0deea39f11fa06071af77cc1afba65900330
python : 3.8.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252
pandas : 1.0.3
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1
setuptools : 46.1.3
Cython : 0.29.17
pytest : 5.4.2
hypothesis : 5.11.0
sphinx : 3.0.3
blosc : 1.9.1
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.14.0
pandas_datareader: None
bs4 : 4.9.0
bottleneck : 1.3.2
fastparquet : 0.3.3
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.0
pytables : None
pytest : 5.4.2
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.16
tables : 3.6.1
tabulate : None
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.8
numba : 0.49.1