Description
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([[1, 0, 0],[0,0,0]])
# non sparse
In [19]: %time df.loc[0]
CPU times: user 367 µs, sys: 14 µs, total: 381 µs
Wall time: 374 µs
Out[19]:
0 1
1 0
2 0
Name: 0, dtype: int64
In [20]: %time df.loc[[0]]
CPU times: user 808 µs, sys: 57 µs, total: 865 µs
Wall time: 827 µs
Out[20]:
0 1 2
0 1 0 0
In [3]: df = df.to_sparse()
# sparse
In [5]: %time df.loc[0]
CPU times: user 429 µs, sys: 14 µs, total: 443 µs
Wall time: 436 µs
Out[5]:
0 1.0
1 0.0
2 0.0
Name: 0, dtype: float64
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([3], dtype=int32)
In [6]: %time df.loc[[0]]
CPU times: user 4.07 ms, sys: 406 µs, total: 4.48 ms
Wall time: 6.16 ms
Out[6]:
0 1 2
0 1.0 0.0 0.0
It seems in SparseDataFrame, selecting as dataframe, the time increased much more than it does in DataFrame, I would expect it runs faster(on my 1m x 500 SparseDataFrame it takes seconds for the loc operation)
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_GB.UTF-8
LANG: en_GB.UTF-8
pandas: 0.18.1
nose: None
pip: 8.1.2
setuptools: 24.0.2
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: None
boto: None
pandas_datareader: None