
CLN: Refactor *SON RFC #9166

Closed
@lJoublanc

Description


N.B. I'm looking at how to store a DataFrame in BSON format in MongoDB (see #4329). This is my attempt to gather all the relevant information before adding yet another *SON serializer. Please don't treat this as a change request - I'm just trying to get my head around all the relevant bits. Feel free to skip to the summary and start from there ...

Refactoring *SON to use a common schema

There are a number of similar *SON formats: e.g. JSON, Msgpack (in pandas), BSON (not in pandas), and the recently announced JBSON. They have some common components:

  1. Conversion of the Python object graph into document(s) (think dict) made up only of JSON/Msgpack/BSON primitives. You can think of this in terms of a schema, as discussed in ENH: support for msgpack serialization/deserialization #3525, WIP: Support metadata at de/serialization time #3297, and Allow custom metadata to be attached to panel/df/series? #2485. In an ideal world, this schema would be shared by all formats.
  2. Serialization into a binary format (Msgpack/BSON only) or text (JSON). This is generally done by an external library, separately for each format (a minimal sketch of both steps follows this list).
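
To make the split concrete, here is a hand-rolled sketch of the two steps. The document layout and the use of the third-party msgpack package are assumptions for illustration only; this is not the schema pandas actually uses.

import json

import msgpack  # third-party; pip install msgpack
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.rand(3, 2),
    columns=["bid", "offer"],
    index=pd.date_range("2014-12-15 10:07:00", periods=3, freq="S"),
)

# Step 1: reduce the object graph to a document of primitives only
# (an illustrative layout, not the real pandas schema).
doc = {
    "typ": "dataframe",
    "columns": list(df.columns),
    "index": [ts.isoformat() for ts in df.index],  # dates as strings here
    "values": df.values.tolist(),                  # nested lists of floats
}

# Step 2: hand the same document to an external serializer per format.
as_json = json.dumps(doc)        # text
as_msgpack = msgpack.packb(doc)  # binary

In an ideal world only step 2 would differ between the formats; step 1, and the schema it produces, would be shared.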

Currently, I don't think any code is shared between JSON and Msgpack. The schema used by Msgpack is entirely different from the one used by JSON. I'm guessing this is by design: JSON has been kept as human-readable as possible, whereas Msgpack is focused on being an efficient serialization format. To illustrate this, here are two examples. First, a simple DataFrame:

                          bid     offer
2014-12-15 10:07:00  0.562804  0.398798
2014-12-15 10:07:01  0.399399  0.896809
2014-12-15 10:07:02  0.747191  0.098048

produces the following document before serialization in msgpack:

{'axes': [{'data': ['bid', 'offer'],
   'dtype': 17,
   'klass': 'Index',
   'name': None,
   'typ': 'index'},
  {'data': '...',
   'dtype': 21,
   'freq': 'S',
   'klass': 'DatetimeIndex',
   'name': None,
   'typ': 'datetime_index',
   'tz': None}],
 'blocks': [{'compress': 'blosc',
   'dtype': 12,
   'items': {'data': ['bid', 'offer'],
    'dtype': 17,
    'klass': 'Index',
    'name': None,
    'typ': 'index'},
   'klass': 'FloatBlock',
   'shape': (2, 3),
   'values': '...'}],
 'klass': 'DataFrame',
 'typ': 'block_manager'}
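
For reference, here is a sketch of how a document like this can be produced by hand. It assumes the pandas.io.packers module; pack essentially registers encode as msgpack's fallback (default=) encoder, so msgpack applies it recursively, and the recursion is spelled out below.

from pandas.io.packers import encode  # pandas' msgpack support module

def to_document(obj):
    # Apply encode until only primitives, dicts and lists remain.
    out = encode(obj)
    if isinstance(out, dict):
        return {k: to_document(v) for k, v in out.items()}
    if isinstance(out, (list, tuple)):
        return [to_document(v) for v in out]
    return out

doc = to_document(df)  # df: the bid/offer frame above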

I produced this by calling encode recursively in packers.py, replicating what pack does (roughly as in the sketch above). There have been plenty of discussions about storing metadata in #3525 and #3297. Now, the result of calling to_json():

'{"bid":{"1418638020000":0.5628044127,"1418638021000":0.3993987818,"1418638022000":0.7471914537},"offer":{"1418638020000":0.398797779,"1418638021000":0.8968090851,"1418638022000":0.0980482752}}'

Both appear to use blocking (see #9130). Also, as described in #3525, dates are stored as integers (in strings) for performance reasons.
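
As a small illustration of the integer dates, the JSON output above can be decoded by hand; a sketch assuming the default orient and the epoch-millisecond keys shown above:

import json
import pandas as pd

doc = json.loads(df.to_json())  # df: the bid/offer frame from the first example
# Keys are epoch milliseconds stored as strings; convert them back.
bids = {pd.to_datetime(int(k), unit="ms"): v for k, v in doc["bid"].items()}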

The second example has a MultiIndex on the columns:

                       bid                                ask                \
                      95.0   95.5   96.0   96.5   97.0   97.5   98.0   98.5   
2014-12-15 10:07:00  25030  16800  42580  58560  75110  10400  89240   3990   
2014-12-15 10:07:01  42620  57860  80010  81500  98880  99610  65770  83930   
2014-12-15 10:07:02  94170  98040  74840  41690  48960  76510  88530  48770   
2014-12-15 10:07:03  65090  23700  16390  45700    500  29290  32370  68350  

This kind of works for to_msgpack(), but to_json() throws an exception. The msgpack document:

{'axes': [{'data': [('bid', 95.0),
    ('bid', 95.5),
    ('bid', 96.0),
    ...
    ('ask', 99.5)],
   'dtype': 17,
   'klass': 'MultiIndex',
   'names': FrozenList([None, None]),
   'typ': 'multi_index'},
  {'data': '...',
   'dtype': 21,
   'freq': 'S',
   'klass': 'DatetimeIndex',
   'name': None,
   'typ': 'datetime_index',
   'tz': None}],
 'blocks': [{'compress': 'blosc',
   'dtype': 12,
   'items': {'data': [('bid', 95.0),
     ('bid', 95.5),
     ('bid', 96.0), 
     ...
     ('ask', 99.5)],
    'dtype': 17,
    'klass': 'MultiIndex',
    'names': FrozenList([None, None]),
    'typ': 'multi_index'},
   'klass': 'FloatBlock',
   'shape': (10, 4),
   'values': '...'}],
 'klass': 'DataFrame',
 'typ': 'block_manager'}
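
A minimal sketch to reproduce the asymmetry (the column levels and values are illustrative rather than the exact data above):

import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product([["bid", "ask"],
                                   [95.0, 95.5, 96.0, 96.5, 97.0]])
idx = pd.date_range("2014-12-15 10:07:00", periods=4, freq="S")
mdf = pd.DataFrame(np.random.randint(0, 10**5, (4, 10)),
                   index=idx, columns=cols)

mdf.to_msgpack()  # works: the MultiIndex ends up in the block_manager schema
mdf.to_json()     # raises: the default JSON schema has no MultiIndex column support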

It would be nice to expose an API for the 'intermediate' representation (i.e. the output of step 1) to the advanced user, in order to store the DataFrame as multiple 'documents'. When storing to a file this doesn't make sense - you want a single document with many embedded documents; this is the JSON paradigm. But when storing into a database, you would want the flexibility to store each nested doc/dict in a different collection/table. There are reasons for this: one technical reason is MongoDB's maximum document size, but it would generally be driven by the use case.
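
For the database case, a hedged sketch assuming pymongo, assuming doc is the intermediate dict shown above with its parts already reduced to BSON-serializable primitives, and with made-up collection names:

from pymongo import MongoClient

client = MongoClient()
db = client["pandas_store"]  # hypothetical database name

# Keep the lightweight metadata (axes, klass, typ) in one collection ...
frame_id = db["frames"].insert_one(
    {"klass": doc["klass"], "typ": doc["typ"], "axes": doc["axes"]}
).inserted_id

# ... and fan the (potentially large) blocks out into another collection,
# one document each, so no single document hits MongoDB's size limit.
for i, block in enumerate(doc["blocks"]):
    block.update({"frame_id": frame_id, "block_no": i})
    db["blocks"].insert_one(block)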

What would this intermediate document look like? A dictionary containing only Python (and numpy - see below) primitives? Which primitives differ between the *SON formats (e.g. dates)?

Serialization of vectorized data

A separate issue, somewhat at odds with the above, is the storage of vectorized data: the underlying numpy array in to_msgpack() is encoded into a string and stored in binary form. While this makes sense for performance reasons (see #3525 (comment)), in my opinion it gives up one of the advantages of using msgpack (or BSON going forward): portability. If you go to http://msgpack.org, you'll see they have APIs for every conceivable language (likewise for MongoDB). Wouldn't it be nice if you could use one of these APIs and load up your pandas data in e.g. Java, Haskell or C? At present this is possible, with the exception of the 'values' fields, which you would have to decode by hand. Sure, you could argue to use JSON for portability, but then you're trading off performance.

This is at odds with adding compression, which we still want. A compromise could be to use the native *SON list type when compress=None (where we assume speed isn't important), and the current solution (encode as a string) when compression is active and speed is important (compressed data is not handled by the above-mentioned APIs anyway).
Note also that, as described in the links in #4329 (comment), BSON/MongoDB appears to have some support for vectorised lists, which avoids having to convert to a list and then serialize each element separately.
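
A sketch of that compromise, assuming the third-party msgpack and blosc packages (the field names are again illustrative):

import blosc    # third-party
import msgpack  # third-party
import numpy as np

values = np.random.rand(1000)

# compress=None: store a native msgpack list. Any msgpack client in any
# language can read it back, at the cost of per-element encoding.
portable = msgpack.packb({"values": values.tolist()})

# compress='blosc': store the raw ndarray bytes, compressed, as a single
# binary string. Compact and fast, but a foreign reader must know the
# dtype, shape and codec to decode the 'values' field by hand.
opaque = msgpack.packb({
    "dtype": str(values.dtype),
    "shape": list(values.shape),
    "values": blosc.compress(values.tobytes(), typesize=8),
})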

Summary

When adding yet another *SON serializer, it would not make sense to introduce a third codebase that handles the issues discussed above in yet another way:

  • Would it make sense to combine the existing msgpack/JSON 'packing' code into a standard intermediate 'schema', shared by all *SON formats? Cons: breaks backward compatibility and makes the JSON document much more complex. Pros: supports more complex pandas structures (DataFrame, Panel, sparse DataFrame, etc.).
  • Could the code for storing vector data be shared? Could we allow native storage of vector data (e.g. as supported by BSON), for example when the compress argument of to_msgpack is not used, to allow portability? Ideally also with compress? Could we somehow replicate what monary does for BSON in the msgpack serializer, i.e. avoid conversion to lists and memory copies altogether?

Again, I want to stress that I'm just trying to create some discussion here rather than opening a specific change request... Sorry if raising an issue wasn't the correct procedure.


    Labels

    API Design, Compat, IO Data, IO JSON, Performance
