read_excel with dtype=str converts empty cells to np.nan #20429

Closed
Changes from 24 commits
Commits
39 commits
dd53df8
TST: Test for astype_nansafe. Modified test for astype
nikoskaragiannakis Mar 20, 2018
6f771fb
BUG: np.nan should stay as it is when we cast to str/basestring
nikoskaragiannakis Mar 20, 2018
37f00ad
BUG: revert change in lib.pyx. modify excel functionality directly
nikoskaragiannakis Mar 20, 2018
f194b70
TST: revert changes in dtypes/test_cast. test excel functionality
nikoskaragiannakis Mar 20, 2018
eb8f4c5
DOC: added description
nikoskaragiannakis Mar 20, 2018
ac6a409
TST: correction and pep8
nikoskaragiannakis Mar 20, 2018
6994bb0
BUG: pep8
nikoskaragiannakis Mar 20, 2018
40a563f
TST: remove unused import
nikoskaragiannakis Mar 20, 2018
9858259
DOC: resolved conflict
nikoskaragiannakis Mar 20, 2018
5f71a99
Update v0.23.0.txt
nikoskaragiannakis Mar 20, 2018
0a93b60
conflict again
nikoskaragiannakis Mar 20, 2018
f296f9a
arghh
nikoskaragiannakis Mar 20, 2018
7c0af1f
DOC: add disallowing of Series construction of len-1 list with index …
jorisvandenbossche Mar 19, 2018
f0fd0a7
Bug: Allow np.timedelta64 objects to index TimedeltaIndex (#20408)
mroeschke Mar 19, 2018
61e0519
DOC: Only use ~ in class links to hide prefixes. (#20402)
dukebody Mar 19, 2018
9fdac27
DOC: update the pandas.DataFrame.plot.hist docstring (#20155)
liopic Mar 19, 2018
ddb904f
DOC" update the Pandas core window rolling count docstring" (#20264)
tommy-stone Mar 19, 2018
694849d
BUG: astype_unicode astype_str turn a np.nan to empty string (#20377)
nikoskaragiannakis Mar 24, 2018
5ba95a1
TST: added unitest for read_excel and modified series/test_dtypes for…
nikoskaragiannakis Mar 24, 2018
d3ceec3
TST: added unitest for read_csv (#20377)
nikoskaragiannakis Mar 25, 2018
ea1d73a
BUG: patched TextReader to turn np.nan to empty string if dtype=str (…
nikoskaragiannakis Mar 25, 2018
c1376a5
DOC: updated IO section (#20377)
nikoskaragiannakis Mar 25, 2018
3103811
DOC: updated IO section (#20377)
nikoskaragiannakis Mar 25, 2018
7d5f6b2
pull from master
nikoskaragiannakis Mar 25, 2018
478d08d
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
edb26d7
BUG: np.nan stays as np.nan (#20377)
nikoskaragiannakis Apr 2, 2018
c3ab9cb
TXT: Moved test from series.test_io to io.parser.na_values. Corrected…
nikoskaragiannakis Apr 2, 2018
69f6c95
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
97a345a
TST: pep8 (#20377)
nikoskaragiannakis Apr 2, 2018
8b2fb0b
TXT: Moved test from series.test_io to io.parser.na_values. Corrected…
nikoskaragiannakis Apr 2, 2018
c9f5120
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
fab0b27
resolve conflict
nikoskaragiannakis Apr 2, 2018
571d5c4
pep8 correction
nikoskaragiannakis Apr 2, 2018
0712392
Merge remote-tracking branch 'upstream/master' into nikoskaragiannaki…
TomAugspurger Apr 3, 2018
47bc105
DOC: Better explanation (#20377)
nikoskaragiannakis Apr 5, 2018
3740dfe
BUG: use checknull (#20377)
nikoskaragiannakis Apr 5, 2018
7d453bb
TST: update tests (#20377)
nikoskaragiannakis Apr 8, 2018
bcd739d
BUG: string nans to np.nan in Series for list data (#20377)
nikoskaragiannakis Apr 8, 2018
7341cd1
sync
nikoskaragiannakis Apr 8, 2018
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
@@ -984,6 +984,8 @@ I/O
- Bug in :meth:`pandas.io.stata.StataReader.value_labels` raising an ``AttributeError`` when called on very old files. Now returns an empty dict (:issue:`19417`)
- Bug in :func:`read_pickle` when unpickling objects with :class:`TimedeltaIndex` or :class:`Float64Index` created with pandas prior to version 0.20 (:issue:`19939`)
- Bug in :meth:`pandas.io.json.json_normalize` where subrecords are not properly normalized if any subrecords values are NoneType (:issue:`20030`)

Contributor

you are including some other changes here, pls rebase on master.

Contributor Author

it's not mine. I deleted it by mistake and added it back.
You can check master here https://github.com/pandas-dev/pandas/blob/master/doc/source/whatsnew/v0.23.0.txt#L985
However, even after rebasing, I keep getting this conflict

Contributor

If you rebased off master and resolved the conflicts in the rebase then it should be ok. Did you fetch the current master before rebasing?

Contributor Author
@nikoskaragiannakis Mar 25, 2018

I did now

- Bug in :func:`read_excel` and :class:`TextReader` now turn np.nan to empty string when dtype=str. They used to turn np.nan to 'nan' (:issue `20377`)

Contributor

Perhaps phrase this as "Bug in read_excel, old behavior, new behavior". Right now it reads a bit like turning NaN to the empty string was the bug.

TextReader isn't part of the public API so this link won't work (and it'll cause a warning in the doc build.)

Double backticks around np.nan.

Double backticks around dtype=str.

Double backticks around 'nan' (keep the quotes though).

Contributor
@TomAugspurger Mar 27, 2018

Should also mention that this affected parsing CSVs with dtype=str, right?

Member

Should also mention that this affected parsing CSVs with dtype=str, right?

@TomAugspurger : yes
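
For reference, the behaviour debated in this thread can be reproduced without pandas. The sketch below is an illustration under stated assumptions: `to_str_old` and `to_str_new` are hypothetical helpers, not pandas functions, and a plain `float('nan')` stands in for `np.nan` (which is itself a Python float). It shows why a missing cell ended up as the string 'nan' once every value was passed through `str()`, and what the whatsnew entry above proposes instead.

```python
import math


def to_str_old(value):
    """Hypothetical helper: pass every value, NaN included, through str()."""
    return str(value)


def to_str_new(value):
    """Hypothetical helper: map NaN to the empty string, as this entry proposes."""
    if isinstance(value, float) and math.isnan(value):
        return ''
    return str(value)


# One missing cell among four; float('nan') stands in for np.nan.
cells = [1.0, 2.0, float('nan'), 4.0]

old = [to_str_old(c) for c in cells]
new = [to_str_new(c) for c in cells]

print(old)  # ['1.0', '2.0', 'nan', '4.0']
print(new)  # ['1.0', '2.0', '', '4.0']
```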



Plotting
^^^^^^^^
6 changes: 4 additions & 2 deletions pandas/_libs/lib.pyx
@@ -465,7 +465,8 @@ cpdef ndarray[object] astype_unicode(ndarray arr):
     for i in range(n):
         # we can use the unsafe version because we know `result` is mutable
         # since it was created from `np.empty`
-        util.set_value_at_unsafe(result, i, unicode(arr[i]))
+        arr_i = arr[i]

Contributor

is arr_i in the cdef?

Contributor Author

d'oh!

+        util.set_value_at_unsafe(result, i, unicode(arr_i) if arr_i is not np.nan else '')

     return result

@@ -478,7 +479,8 @@ cpdef ndarray[object] astype_str(ndarray arr):
     for i in range(n):
         # we can use the unsafe version because we know `result` is mutable
         # since it was created from `np.empty`
-        util.set_value_at_unsafe(result, i, str(arr[i]))
+        arr_i = arr[i]
+        util.set_value_at_unsafe(result, i, str(arr_i) if arr_i is not np.nan else '')

     return result
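
One caveat with the `is not np.nan` check used in both hunks: an identity test only recognises the `np.nan` singleton itself, which may be why a later commit in this PR is titled "use checknull". A minimal sketch, assuming a module-level `NAN` as a stand-in for `np.nan` and two hypothetical helpers (neither is the Cython code above):

```python
import math

# Stand-in for the np.nan singleton (np.nan is an ordinary Python float).
NAN = float('nan')


def astype_str_identity(values):
    """Hypothetical helper: blank out only the NAN singleton (identity check)."""
    return [str(v) if v is not NAN else '' for v in values]


def astype_str_checked(values):
    """Hypothetical helper: blank out any float NaN (value check)."""
    return ['' if isinstance(v, float) and math.isnan(v) else str(v)
            for v in values]


# The last element is a NaN that is a different object from NAN.
values = [1.5, NAN, float('nan')]

print(astype_str_identity(values))  # ['1.5', '', 'nan']
print(astype_str_checked(values))   # ['1.5', '', '']
```

The identity version lets the second NaN through as the string 'nan'; the value check catches both.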

19 changes: 15 additions & 4 deletions pandas/_libs/parsers.pyx
@@ -1217,17 +1217,28 @@ cdef class TextReader:
                 return result, 0

             # treat as a regular string parsing
-            return self._string_convert(i, start, end, na_filter,
-                                        na_hashset)
+            res, na_count = self._string_convert(i, start, end, na_filter,
+                                                 na_hashset)

+            for i in range(len(res)):
+                if res[i] is np.nan:
+                    res[i] = ''
+            return res, na_count

         elif dtype.kind == 'U':
             width = dtype.itemsize
             if width > 0:
                 raise TypeError("the dtype %s is not "
                                 "supported for parsing" % dtype)

             # unicode variable width
-            return self._string_convert(i, start, end, na_filter,
-                                        na_hashset)
+            res, na_count = self._string_convert(i, start, end, na_filter,
+                                                 na_hashset)
+            for i in range(len(res)):
+                if res[i] is np.nan:
+                    res[i] = ''
+            return res, na_count

         elif is_categorical_dtype(dtype):
             # TODO: I suspect that _categorical_convert could be
             # optimized when dtype is an instance of CategoricalDtype
29 changes: 29 additions & 0 deletions pandas/tests/io/test_excel.py
@@ -361,6 +361,35 @@ def test_reader_dtype(self, ext):
with pytest.raises(ValueError):
actual = self.get_exceldf(basename, ext, dtype={'d': 'int64'})

def test_reader_dtype_str(self, ext):
# GH 20377
basename = 'testdtype'
actual = self.get_exceldf(basename, ext)

expected = DataFrame({
'a': [1, 2, 3, 4],
'b': [2.5, 3.5, 4.5, 5.5],
'c': [1, 2, 3, 4],
'd': [1.0, 2.0, np.nan, 4.0]}).reindex(
columns=['a', 'b', 'c', 'd'])

Contributor

just specify columns=list('abcd') rather than reindex


tm.assert_frame_equal(actual, expected)

actual = self.get_exceldf(basename, ext,
dtype={'a': 'float64',
'b': 'float32',
'c': str,
'd': str})

expected['a'] = expected['a'].astype('float64')

Contributor

move this higher (by the expected), you can simply construct things directly by using e.g. Series(...., dtype='float32') rather than a list

Contributor Author

move this higher (by the expected)

I'm not sure what you mean here.

you can simply construct things directly by using e.g. Series(...., dtype='float32') rather than a list

First of all, this is copy-paste from the previous test, which was added for #8212

Do you mean to do

expected = DataFrame({'a': Series([1,2,3,4], dtype='float64'),
                      'b': Series([2.5,3.5,4.5,5.5], dtype='float32'),
                      ...})

?
If I use, for example, Series([1, 2, 3, 4], dtype=str) for the 'c' column, it will give me ['1', '2', '3', '4'] instead of the expected ['001', '002', '003', '004'].

Contributor

yes exactly. If you need things like '001', then just do it that way, e.g. Series(['001', '002'....])
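
A small sketch of the point settled in this thread, using plain Python lists rather than Series: casting the integers 1 to 4 to `str` cannot produce the zero-padded strings the test expects, so the expected values have to be written as string literals (or padded explicitly). The variable names are illustrative only.

```python
numbers = [1, 2, 3, 4]

# Casting alone drops any notion of padding.
cast = [str(n) for n in numbers]

# Explicit padding, or literals, recovers the three-character form.
padded = [str(n).zfill(3) for n in numbers]
literal = ['001', '002', '003', '004']

print(cast)    # ['1', '2', '3', '4']
print(padded)  # ['001', '002', '003', '004']
```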

expected['b'] = expected['b'].astype('float32')
expected['c'] = ['001', '002', '003', '004']
expected['d'] = ['1', '2', '', '4']
tm.assert_frame_equal(actual, expected)

with pytest.raises(ValueError):

Member

What's the error message?

Contributor Author

ValueError: Unable to convert column d to type int64

which is the same as in test_reader_dtype, which I copied this from. No reason to have this here I guess.

Member

No reason to have this here I guess.

What do you mean?

actual = self.get_exceldf(basename, ext, dtype={'d': 'int64'})

def test_reading_all_sheets(self, ext):
# Test reading all sheetnames by setting sheetname to None,
# Ensure a dict is returned.
2 changes: 1 addition & 1 deletion pandas/tests/series/test_dtypes.py
@@ -142,7 +142,7 @@ def test_astype_datetime64tz(self):
tm.rands(1000)]),
Series([string.digits * 10,
tm.rands(63),
- tm.rands(64), nan, 1.0])])
+ tm.rands(64), '', 1.0])])
def test_astype_str_map(self, dtype, series):
# see gh-4405
result = series.astype(dtype)
10 changes: 10 additions & 0 deletions pandas/tests/series/test_io.py
@@ -162,6 +162,16 @@ def test_to_csv_compression(self, compression):
index_col=0,
squeeze=True))

def test_from_csv_dtype_str(self):

Member

  • This test should really belong in the tests/io/parser/na_values tests
  • Do we really need to round-trip to surface this bug? Can we try to just use StringIO?

# GH20377
s = Series([1, 2, np.nan, 4], index=['A', 'B', 'C', 'D'],
name='X')
with ensure_clean() as filename:
s.to_csv(filename, header=True)
rs = pd.read_csv(filename, dtype=str)
expected = Series(['1.0', '2.0', '', '4.0'], name=s.name)
assert_series_equal(rs.X, expected)
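
The reviewer's StringIO suggestion can be sketched as follows. To keep the example free of third-party dependencies it uses the standard-library `csv` module as a stand-in for `pd.read_csv`; that substitution is an assumption made for illustration, and pandas may treat the empty field differently depending on version and `dtype`. The CSV text mirrors the Series written in the test above.

```python
import csv
from io import StringIO

# CSV text for the Series [1, 2, nan, 4] with index A..D and name 'X';
# the missing value is written as an empty field.
data = ",X\nA,1.0\nB,2.0\nC,\nD,4.0\n"

# Parse straight from memory: no temporary file, no write/read round trip.
rows = list(csv.DictReader(StringIO(data)))
x_column = [row['X'] for row in rows]

print(x_column)  # ['1.0', '2.0', '', '4.0']
```

Feeding the parser an in-memory buffer keeps the test focused on parsing and removes the dependency on `to_csv` and the filesystem.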


class TestSeriesIO(TestData):
