strings not properly detected despite correct dtype in read_csv #16569

Closed
@randomgambit

Hello there!

I am working with text data, which I read in using:

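import pandas as pd

# all_files: list of CSV file paths (defined elsewhere)
full_list = []

for myfile in all_files:
    print("processing " + myfile)
    # Read only the two columns needed and request str for HEADLINE
    news = pd.read_csv(myfile, usecols=['FULL_TIMESTAMP', 'HEADLINE'], dtype={'HEADLINE': str})
    full_list.append(news)

data_full = pd.concat(full_list)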

As you can see, I make sure that my HEADLINE column is read as str. However, when I run

collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))

I get:

  File "<ipython-input-1-8ce0197f52ac>", line 34, in <module>
    collapsed =data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))

  File "C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 2668, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)

  File "C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 2786, in _aggregate_named
    output = func(group, *args, **kwargs)

  File "<ipython-input-1-8ce0197f52ac>", line 34, in <lambda>
    collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))

TypeError: sequence item 21: expected string, float found

To work around the problem, I first need to run

data_full['HEADLINE'] = data_full['HEADLINE'].astype(str)
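
One caveat about this workaround: if the float in the error message is a NaN from an empty HEADLINE field (an assumption, since the input files are not shown), astype(str) in this pandas version would typically turn it into the literal text "nan", which then ends up in the joined output. A minimal sketch of an alternative, using a small hypothetical frame in place of data_full, drops missing headlines inside each group before joining:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_full; the NaN mimics an empty HEADLINE field.
data_full = pd.DataFrame({
    "day": ["2017-01-01", "2017-01-01", "2017-01-02"],
    "HEADLINE": ["first", np.nan, "second"],
})

# Drop missing headlines within each group, then join what is left.
collapsed = data_full.groupby("day")["HEADLINE"].agg(lambda x: "| ".join(x.dropna()))
```

Whether to drop the missing headlines or replace them with an empty string (for example with fillna('')) depends on whether the gaps should stay visible in the collapsed text.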

Is that expected? I thought specifying the dtypes in read_csv was the most robust way to get consistent types in the data. I am still using pandas 0.19.2.
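
A likely explanation, offered as a sketch and not a confirmed diagnosis: read_csv treats empty fields as missing values by default, so they come back as NaN (a float) even when dtype requests str. The example below uses a made-up two-row CSV in place of the real files to show the behavior, along with the keep_default_na=False option, which leaves empty fields as empty strings.

```python
import io
import pandas as pd

# Hypothetical two-row file; the second row has an empty HEADLINE field.
csv_text = (
    "FULL_TIMESTAMP,HEADLINE\n"
    "2017-06-01 09:00,Markets open higher\n"
    "2017-06-01 09:05,\n"
)

# dtype=str applies to parsed values, but the empty field is still read as missing (NaN).
default = pd.read_csv(io.StringIO(csv_text), dtype={"HEADLINE": str})
missing_mask = default["HEADLINE"].isna().tolist()

# keep_default_na=False turns off the default NA markers, so the empty field stays "".
# Note that this applies to every column that is read, not just HEADLINE.
no_na = pd.read_csv(io.StringIO(csv_text), dtype={"HEADLINE": str}, keep_default_na=False)
headlines = no_na["HEADLINE"].tolist()
```

With the second form, '| '.join(...) only ever sees strings, at the cost of no longer recognising markers such as "NA" as missing in any column.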

Thanks!

Labels: Dtype Conversions, IO CSV, Missing-data