Description
Code Sample: create a small csv with dates, then imports it and times the import
import numpy as np
import pandas as pd
import time
import timeit
start_time=time.time()
n=int(100e3)
out =pd.DataFrame()
out['day']=np.random.randint(1,12,n)
out['month']=np.random.randint(1,12,n)
out['year']=np.random.randint(2000,2010,n)
out['date1']= out.day.astype(str) + '-' + out.month.astype(str) + '-' + out.year.astype(str)
out['date2']= out['day'].map('{:0>2d}'.format ) + '-' + out['month'].map('{:0>2d}'.format ) + '-' + out.year.astype(str)
out.to_csv('test dates.csv')
print('Exported in ')
print(time.time()-start_time)
def noparse():
return pd.read_csv('test dates.csv', dtype={'date1':str,'date2':str} )
def parse1():
return pd.read_csv('test dates.csv', parse_dates=['date1','date2'], dayfirst=True )
def parse2():
df = pd.read_csv('test dates.csv', dtype={'date1':str,'date2':str} )
for d in ['date1','date2']:
df[d]= pd.to_datetime( df[d], dayfirst=True )
return df
def parse3():
df = pd.read_csv('test dates.csv', dtype={'date1':str,'date2':str} )
# formatting described at: http://php.net/manual/en/function.strftime.php
df['date1']=pd.to_datetime(df['date1'], format='%d-%m-%Y')
df['date2']=pd.to_datetime(df['date2'], format='%d-%m-%Y')
return df
results=pd.DataFrame()
results['no parse'] = timeit.Timer( "noparse()", globals=globals() ).repeat(repeat=3, number =1)
results['parse 1']=timeit.Timer( "parse1()", globals=globals() ).repeat(repeat=3, number =1)
results['parse 2']=timeit.Timer( "parse2()", globals=globals() ).repeat(repeat=3, number =1)
results['parse 3']=timeit.Timer( "parse3()", globals=globals() ).repeat(repeat=3, number =1)
print(results)
Problem description
I am trying to import a CSV with dates, where the days are always < 13 , i.e. there is no way to infer the proper format - it must be specified explicitly (how can you tell whether 1-2 is Jan 2nd or Feb 1st?).
What I noticed is that setting the date format with dayfirst is 25 to 35x (I tried it on two PCs) slower than setting the format explicitly with format='%d-%m-%Y'
How can this be? It's insane. I appreciate there may be a little bit of additional overhead, as dayfirst must guess the position of the year, but 25x to 35x slower leaves me speechless. With dayfirst it took 16 seconds to import a csv with 2 date columns and only 100k rows.
There must be something very wrong with how read_csv implements dayfirst - I'm hoping it shouldn't be too complicated to fix it?
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: AMD64 Family 23 Model 1 Stepping 1, AuthenticAMD
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None