Description
n.b. I'm looking at how to store a DataFrame in BSON format (see #4329) in MongoDB. This is my attempt to gather all the relevant information before adding yet another *SON serializer. Please don't treat this as a change request - I'm just trying to get my head around all the relevant bits. Maybe skip to the summary and start from there...
Refactoring *SON to use common schema
There are a number of *SON formats which are similar, e.g. JSON, Msgpack (in pandas), BSON (not in pandas), and the recently announced JBSON. They have some common components:
- Conversion of a Python object graph into Document(s) (think `dict`) made up only of JSON/Msgpack/BSON primitives. You can think of this in terms of a schema, as discussed in #3525 (ENH: support for msgpack serialization/deserialization), #3297 (WIP: Support metadata at de/serialization time) and #2485 (Allow custom metadata to be attached to panel/df/series?). In an ideal world, this schema would be shared by all formats.
- Serialization into a binary format (Msgpack/BSON only) or text (JSON). This is generally done by an external library, separately for each format. (A rough sketch of both steps follows this list.)
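
To make the two steps concrete, here is a minimal sketch. The `to_document()` helper is purely illustrative (pandas exposes no such function today) and keeps only the data; the real schema carries dtype/index/block metadata, as the examples below show.

```python
import json

import pandas as pd

df = pd.DataFrame({"bid": [0.56, 0.40], "offer": [0.40, 0.90]})

# Step 1 (toy stand-in): convert the object graph into a document made up
# only of primitives that every *SON library understands.
def to_document(frame):
    return {
        "klass": "DataFrame",
        "columns": list(frame.columns),
        "index": [str(i) for i in frame.index],
        "data": frame.values.tolist(),
    }

doc = to_document(df)

# Step 2: hand the primitive document to an external serializer.
text = json.dumps(doc)             # JSON (text)
# binary = msgpack.packb(doc)      # Msgpack (binary), if msgpack is installed
# binary = bson.BSON.encode(doc)   # BSON (binary), via pymongo's bson module
```
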
Currently, I don't think any code is shared between the JSON and Msgpack paths. The schema used by Msgpack is entirely different from JSON's. I'm guessing this is by design: JSON has been kept as human-readable as possible, whereas Msgpack is focused on being an efficient serialization format. To illustrate this, here are two examples. First, a simple DataFrame:

                              bid     offer
    2014-12-15 10:07:00  0.562804  0.398798
    2014-12-15 10:07:01  0.399399  0.896809
    2014-12-15 10:07:02  0.747191  0.098048

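(For reference, a frame of this shape can be built as follows; the values are random, so they won't match the numbers above.)

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2014-12-15 10:07:00", periods=3, freq="S")
df = pd.DataFrame(np.random.rand(3, 2), index=idx, columns=["bid", "offer"])
```
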
This frame produces the following document before serialization in msgpack:

    {'axes': [{'data': ['bid', 'offer'],
               'dtype': 17,
               'klass': 'Index',
               'name': None,
               'typ': 'index'},
              {'data': '...',
               'dtype': 21,
               'freq': 'S',
               'klass': 'DatetimeIndex',
               'name': None,
               'typ': 'datetime_index',
               'tz': None}],
     'blocks': [{'compress': 'blosc',
                 'dtype': 12,
                 'items': {'data': ['bid', 'offer'],
                           'dtype': 17,
                           'klass': 'Index',
                           'name': None,
                           'typ': 'index'},
                 'klass': 'FloatBlock',
                 'shape': (2, 3),
                 'values': '...'}],
     'klass': 'DataFrame',
     'typ': 'block_manager'}

I produced this by calling `encode` recursively in `packing.py`, replicating what `pack` does.
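Roughly, the idea is to apply the per-object encode hook to nested values as well. A minimal sketch (the hook is passed in as a parameter here; the real function in pandas may have a different signature):

```python
def encode_recursive(obj, encode):
    """Apply an encode hook (pandas object -> dict of primitives) and then
    recurse into any containers it returns."""
    obj = encode(obj)
    if isinstance(obj, dict):
        return {k: encode_recursive(v, encode) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [encode_recursive(v, encode) for v in obj]
    return obj
```
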
There have been plenty of discussions regarding storage of metadata in #3525 and #3297. Now the result of calling `to_json()`:
'{"bid":{"1418638020000":0.5628044127,"1418638021000":0.3993987818,"1418638022000":0.7471914537},"offer":{"1418638020000":0.398797779,"1418638021000":0.8968090851,"1418638022000":0.0980482752}}'
Both appear to use blocking (see #9130). Also, as described in #3525, dates are stored as integers (in the JSON case, as string keys holding epoch milliseconds) for performance reasons.
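For example, the `"1418638020000"` keys in the JSON above are just epoch milliseconds:

```python
import pandas as pd

# Timestamp.value is nanoseconds since the epoch; divide to get milliseconds.
pd.Timestamp("2014-12-15 10:07:00").value // 10**6   # 1418638020000
```
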
The second example has a MultiIndex on its columns:

                            bid                                 ask                 \
                           95.0   95.5   96.0   96.5   97.0   97.5   98.0   98.5
    2014-12-15 10:07:00  25030  16800  42580  58560  75110  10400  89240   3990
    2014-12-15 10:07:01  42620  57860  80010  81500  98880  99610  65770  83930
    2014-12-15 10:07:02  94170  98040  74840  41690  48960  76510  88530  48770
    2014-12-15 10:07:03  65090  23700  16390  45700    500  29290  32370  68350

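(A frame of this shape can be built with something like the following; the values are random and the exact split of price levels between 'bid' and 'ask' is a guess, but it reproduces the 10-column MultiIndex and the (10, 4) block seen below.)

```python
import numpy as np
import pandas as pd

prices = np.arange(95.0, 100.0, 0.5)                        # 95.0 ... 99.5
cols = pd.MultiIndex.from_arrays([["bid"] * 5 + ["ask"] * 5, prices])
idx = pd.date_range("2014-12-15 10:07:00", periods=4, freq="S")
df = pd.DataFrame(np.random.randint(0, 100000, (4, 10)) * 1.0,
                  index=idx, columns=cols)
```
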
Serializing this frame kind of works for `to_msgpack()` but not for `to_json()` (which throws an exception). The intermediate msgpack document:

    {'axes': [{'data': [('bid', 95.0),
                        ('bid', 95.5),
                        ('bid', 96.0),
                        ...
                        ('ask', 99.5)],
               'dtype': 17,
               'klass': 'MultiIndex',
               'names': FrozenList([None, None]),
               'typ': 'multi_index'},
              {'data': '...',
               'dtype': 21,
               'freq': 'S',
               'klass': 'DatetimeIndex',
               'name': None,
               'typ': 'datetime_index',
               'tz': None}],
     'blocks': [{'compress': 'blosc',
                 'dtype': 12,
                 'items': {'data': [('bid', 95.0),
                                    ('bid', 95.5),
                                    ('bid', 96.0),
                                    ...
                                    ('ask', 99.5)],
                           'dtype': 17,
                           'klass': 'MultiIndex',
                           'names': FrozenList([None, None]),
                           'typ': 'multi_index'},
                 'klass': 'FloatBlock',
                 'shape': (10, 4),
                 'values': '...'}],
     'klass': 'DataFrame',
     'typ': 'block_manager'}

It would be nice to expose the API of the 'intermediate' representation (i.e. the output of step 1) to the advanced user, in order to store the DataFrame across multiple 'documents'. If storing to a file, this doesn't make any sense - you want a single document with many embedded documents; this is the JSON paradigm. But if you're storing into a database, you would want the flexibility to store each nested doc/dict in a different collection/table. There are reasons for this: a technical one is MongoDB's maximum record/document size, but it would generally be driven by a use case.
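As a sketch of what that could look like (assuming a hypothetical `to_document()` returning the step-1 dict for the `df` above, and a recent pymongo):

```python
from pymongo import MongoClient

doc = to_document(df)              # hypothetical step-1 API proposed above

client = MongoClient()
db = client.market_data
key = {"symbol": "ABC", "asof": "2014-12-15"}

# Lightweight frame metadata goes into one collection...
db.frames.insert_one(dict(key, typ=doc["typ"], klass=doc["klass"],
                          axes=doc["axes"]))

# ...while each (potentially large) block goes into another, linked by the
# same key, which helps stay under MongoDB's per-document size limit.
for i, block in enumerate(doc["blocks"]):
    db.blocks.insert_one(dict(key, block_no=i, **block))
```
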
What would this intermediate document look like? A dictionary containing only Python (and NumPy - see below) primitives? Which primitives differ between the *SON formats (e.g. dates)?
Serialization of vectorized data
A separate issue, somewhat at odds with the above, is the storage of vectorized data: the underlying numpy array in `to_msgpack()` is encoded into a string and stored in binary form. While this makes sense for performance reasons (see #3525 (comment)), in my opinion it gives up one of the advantages of using msgpack (or BSON going forward): portability. If you go to http://msgpack.org, you'll see they have APIs for every conceivable language (likewise for MongoDB). Wouldn't it be nice if you could use one of those APIs and load up your pandas data in e.g. Java, Haskell or C? At present this is possible, with the exception of the 'values' fields, which you would have to decode by hand. Sure, you could argue for using JSON for portability, but then you're trading off performance.
This is at odds with adding compression - which we still want. A compromise could be to use the native *SON list type when `compress=None` (where we assume speed isn't critical), and the current solution (encode as a string) when compression is active and speed matters (compressed data isn't handled by the above-mentioned APIs anyway)?
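A sketch of the two representations of a block's values (blosc here just mirrors the existing `compress='blosc'` option; none of this is current pandas API):

```python
import blosc
import numpy as np

values = np.random.rand(4, 10)

# Portable: a native *SON list that any JSON/Msgpack/BSON client can read,
# at the cost of converting element by element.
portable = {"values": values.tolist(), "shape": values.shape}

# Fast and compact: the raw buffer, compressed and stored as a binary string.
# This mirrors what the msgpack path does today; a foreign client would have
# to know how to undo it by hand.
compact = {"values": blosc.compress(values.tobytes(),
                                    typesize=values.dtype.itemsize),
           "dtype": str(values.dtype),
           "shape": values.shape}
```
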
Note also that, as described in the links in #4329 (comment), BSON/MongoDB appears to have some support for vectorised lists, which avoids having to convert to a list and then serialize each element separately.
Summary
In adding yet another *SON serializer, it would not make sense to have a third codebase that handles the issues discussed above in a new way:
- Would it make sense to combine the existing msgpack/JSON 'packing' code into a standard intermediate 'schema' shared by all *SON formats? Cons: breaks backward compatibility and makes the JSON document much more complex. Pros: support for more complex pandas structures (DataFrame, Panel, sparse DataFrame, etc.) across all formats.
- Could the code for storing vector data be shared? Could we allow native storage of vector data (e.g. as supported by BSON), for example when the `compress` argument to `to_msgpack` is not used, to allow portability? Ideally also with `compress`? Could we somehow replicate what monary is doing for BSON in the msgpack serializer, i.e. avoid the conversion to lists and the memory copies altogether?
Again, I want to stress that I'm just trying to start some discussion here rather than open a specific change request... Sorry if raising an issue wasn't the correct procedure.