BUG: ValueError when doing HDFStore.select of contiguous mixed-data table ft. VLArray #17021

Closed
@johanhoog

Description

Code Sample, a copy-pastable example if possible

import pandas as pd
myDf = pd.DataFrame({'a' : pd.Series([1443525810,1443540836,1443609470]),
                     'b' : pd.Series(['ab','cd','ab'])})
myDf.to_hdf('test.h5', 'test')

with pd.HDFStore('test.h5') as myFile:
    # omit "start=0, stop=2" to prevent the error
    df = myFile.select('/test', start=0, stop=2)
    print(df)

Problem description

Calling select with start=0, stop=2 raises:

ValueError: Shape of passed values is (2, 3), indices imply (2, 2)

Expected Output

             a   b
0   1443525810  ab
1   1443540836  cd

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Windows
OS-release: 2012ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Other remarks:

  • please be gentle, this is my first GitHub interaction :)

  • notebook attached that contains problem and solution output

  • My guess is that read_array in pytables.py takes the one-dimensional behavior of VLArray into account too late: it is handled only after the slice "data = node[start:stop]" has been taken, so the slice returns the whole column instead of the requested rows. The following implementation of the method seems to fix it:

      def read_array(self, key, start=None, stop=None):
          """ read an array for the specified node (off of group) """
          # np, u and _set_tz are the names already available in pytables.py
          import tables
          node = getattr(self.group, key)
          attrs = node._v_attrs

          transposed = getattr(attrs, 'transposed', False)

          if isinstance(node, tables.VLArray):
              # take the single row first, then slice the values inside it
              ret = node[0][start:stop]
          else:
              dtype = getattr(attrs, 'value_type', None)
              shape = getattr(attrs, 'shape', None)

              if shape is not None:
                  # length 0 axis
                  ret = np.empty(shape, dtype=dtype)
              else:
                  ret = node[start:stop]

              if dtype == u('datetime64'):

                  # reconstruct a timezone if indicated
                  ret = _set_tz(ret, getattr(attrs, 'tz', None), coerce=True)

              elif dtype == u('timedelta64'):
                  ret = np.asarray(ret, dtype='m8[ns]')

          if transposed:
              return ret.T
          else:
              return ret
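
To illustrate the guess above without needing an HDF5 file, here is a minimal sketch that uses a plain nested list as a stand-in for the VLArray node. The assumption (not verified against PyTables here) is that the node holds the whole object column in a single row; the names rows, sliced_rows_first and sliced_values are made up for this illustration.

```python
# Hypothetical stand-in for the VLArray node: one row holding the whole 'b' column.
rows = [["ab", "cd", "ab"]]

start, stop = 0, 2

# Slice the container first, then take row 0: the slice only selects rows,
# so the single row comes back with all three values.
sliced_rows_first = rows[start:stop][0]

# Take row 0 first, then slice it: only the requested values remain.
sliced_values = rows[0][start:stop]

print(sliced_rows_first)
print(sliced_values)
```

If that assumption holds, the first form would hand back three values for a two-row selection, which matches the (2, 3) versus (2, 2) mismatch in the error message, while the second form is what the proposed read_array does.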
    
