Skip to content

read_csv : using day first 23x to 35x slower than setting the format explicitly #25848

Closed
@garfieldthecat

Description

@garfieldthecat

Code Sample: create a small csv with dates, then imports it and times the import

import numpy as np
import pandas as pd
import time
import timeit
start_time=time.time()
n=int(100e3)
out =pd.DataFrame()
out['day']=np.random.randint(1,12,n)
out['month']=np.random.randint(1,12,n)
out['year']=np.random.randint(2000,2010,n)
out['date1']= out.day.astype(str) + '-' + out.month.astype(str) + '-' + out.year.astype(str)
out['date2']= out['day'].map('{:0>2d}'.format ) + '-' + out['month'].map('{:0>2d}'.format ) + '-' + out.year.astype(str)
out.to_csv('test dates.csv')
print('Exported in ')
print(time.time()-start_time)


def noparse():
    return pd.read_csv('test dates.csv', dtype={'date1':str,'date2':str} )

def parse1():
    return pd.read_csv('test dates.csv', parse_dates=['date1','date2'], dayfirst=True )

def parse2():
    df = pd.read_csv('test dates.csv', dtype={'date1':str,'date2':str} )
    for d in ['date1','date2']:
        df[d]= pd.to_datetime( df[d], dayfirst=True )
    return df

def parse3():
    df = pd.read_csv('test dates.csv', dtype={'date1':str,'date2':str} )
    # formatting described at: http://php.net/manual/en/function.strftime.php
    df['date1']=pd.to_datetime(df['date1'], format='%d-%m-%Y')
    df['date2']=pd.to_datetime(df['date2'], format='%d-%m-%Y')
    return df

results=pd.DataFrame()
results['no parse'] = timeit.Timer( "noparse()", globals=globals() ).repeat(repeat=3, number =1)
results['parse 1']=timeit.Timer( "parse1()", globals=globals() ).repeat(repeat=3, number =1)
results['parse 2']=timeit.Timer( "parse2()", globals=globals() ).repeat(repeat=3, number =1)
results['parse 3']=timeit.Timer( "parse3()", globals=globals() ).repeat(repeat=3, number =1)
print(results)

Problem description

I am trying to import a CSV with dates, where the days are always < 13 , i.e. there is no way to infer the proper format - it must be specified explicitly (how can you tell whether 1-2 is Jan 2nd or Feb 1st?).

What I noticed is that setting the date format with dayfirst is 25 to 35x (I tried it on two PCs) slower than setting the format explicitly with format='%d-%m-%Y'

How can this be? It's insane. I appreciate there may be a little bit of additional overhead, as dayfirst must guess the position of the year, but 25x to 35x slower leaves me speechless. With dayfirst it took 16 seconds to import a csv with 2 date columns and only 100k rows.

There must be something very wrong with how read_csv implements dayfirst - I'm hoping it shouldn't be too complicated to fix it?

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None

Metadata

Metadata

Assignees

No one assigned

    Labels

    DatetimeDatetime data dtypeIO CSVread_csv, to_csvPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions