Description
n.b. I'm looking at how to store a DataFrame in BSON format (see #4329) in MongoDB. This is my attempt to gather all the relevant information before adding yet another *SON serializer. Please don't treat this as a change request - I'm just trying to get my head around all the relevant bits. Maybe skip to the summary and start from there...
Refactoring *SON to use common schema
There are a number of *SON formats which are similar, e.g. JSON, Msgpack (in pandas), BSON (not in pandas), and the recently announced JBSON. They have some common components:
- Conversion of a Python object graph into Document(s) (think `dict`) made up only of JSON/Msgpack/BSON primitives. You can think of this in terms of a schema, as discussed in #3525 (ENH: support for msgpack serialization/deserialization), #3297 (WIP: Support metadata at de/serialization time) and #2485 (Allow custom metadata to be attached to panel/df/series?). In an ideal world, this schema would be shared by all formats.
- Serialization into a binary format (Msgpack/BSON only) or text (JSON). This is generally done by an external library, separately for each format. (A rough sketch of both steps follows this list.)
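
To make the two steps concrete, here is a minimal sketch. The `to_document()` helper is purely illustrative (pandas exposes no such function today) and keeps only the data; the real schema carries dtype/index/block metadata, as the examples below show.

```python
import json

import pandas as pd

df = pd.DataFrame({"bid": [0.56, 0.40], "offer": [0.40, 0.90]})

# Step 1 (toy stand-in): convert the object graph into a document made up
# only of primitives that every *SON library understands.
def to_document(frame):
    return {
        "klass": "DataFrame",
        "columns": list(frame.columns),
        "index": [str(i) for i in frame.index],
        "data": frame.values.tolist(),
    }

doc = to_document(df)

# Step 2: hand the primitive document to an external serializer.
text = json.dumps(doc)             # JSON (text)
# binary = msgpack.packb(doc)      # Msgpack (binary), if msgpack is installed
# binary = bson.BSON.encode(doc)   # BSON (binary), via pymongo's bson module
```
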
Currently, I don't think any code is shared between the JSON and Msgpack paths. The schema used by Msgpack is entirely different from JSON's. I'm guessing this is by design: JSON has been kept as human-readable as possible, whereas Msgpack is focused on being an efficient serialization format. To illustrate this, here are two examples. First, a simple DataFrame:

                              bid     offer
    2014-12-15 10:07:00  0.562804  0.398798
    2014-12-15 10:07:01  0.399399  0.896809
    2014-12-15 10:07:02  0.747191  0.098048

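(For reference, a frame of this shape can be built as follows; the values are random, so they won't match the numbers above.)

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2014-12-15 10:07:00", periods=3, freq="S")
df = pd.DataFrame(np.random.rand(3, 2), index=idx, columns=["bid", "offer"])
```
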
This frame produces the following document before serialization in msgpack:

    {'axes': [{'data': ['bid', 'offer'],
               'dtype': 17,
               'klass': 'Index',
               'name': None,
               'typ': 'index'},
              {'data': '...',
               'dtype': 21,
               'freq': 'S',
               'klass': 'DatetimeIndex',
               'name': None,
               'typ': 'datetime_index',
               'tz': None}],
     'blocks': [{'compress': 'blosc',
                 'dtype': 12,
                 'items': {'data': ['bid', 'offer'],
                           'dtype': 17,
                           'klass': 'Index',
                           'name': None,
                           'typ': 'index'},
                 'klass': 'FloatBlock',
                 'shape': (2, 3),
                 'values': '...'}],
     'klass': 'DataFrame',
     'typ': 'block_manager'}

I produced this by calling `encode` recursively in `packing.py`, replicating what `pack` does.
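Roughly, the idea is to apply the per-object encode hook to nested values as well. A minimal sketch (the hook is passed in as a parameter here; the real function in pandas may have a different signature):

```python
def encode_recursive(obj, encode):
    """Apply an encode hook (pandas object -> dict of primitives) and then
    recurse into any containers it returns."""
    obj = encode(obj)
    if isinstance(obj, dict):
        return {k: encode_recursive(v, encode) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [encode_recursive(v, encode) for v in obj]
    return obj
```
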
There have been plenty of discussions regarding storage of metadata in #3525 and #3297. Now the result of calling `to_json()`:
'{"bid":{"1418638020000":0.5628044127,"1418638021000":0.3993987818,"1418638022000":0.7471914537},"offer":{"1418638020000":0.398797779,"1418638021000":0.8968090851,"1418638022000":0.0980482752}}'
Both appear to use blocking (see #9130). Also, as described in #3525, dates are stored as integers (in the JSON case, as string keys holding epoch milliseconds) for performance reasons.
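For example, the `"1418638020000"` keys in the JSON above are just epoch milliseconds:

```python
import pandas as pd

# Timestamp.value is nanoseconds since the epoch; divide to get milliseconds.
pd.Timestamp("2014-12-15 10:07:00").value // 10**6   # 1418638020000
```
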
The second example has a MultiIndex on its columns:

                            bid                                 ask                 \
                           95.0   95.5   96.0   96.5   97.0   97.5   98.0   98.5
    2014-12-15 10:07:00  25030  16800  42580  58560  75110  10400  89240   3990
    2014-12-15 10:07:01  42620  57860  80010  81500  98880  99610  65770  83930
    2014-12-15 10:07:02  94170  98040  74840  41690  48960  76510  88530  48770
    2014-12-15 10:07:03  65090  23700  16390  45700    500  29290  32370  68350

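(A frame of this shape can be built with something like the following; the values are random and the exact split of price levels between 'bid' and 'ask' is a guess, but it reproduces the 10-column MultiIndex and the (10, 4) block seen below.)

```python
import numpy as np
import pandas as pd

prices = np.arange(95.0, 100.0, 0.5)                        # 95.0 ... 99.5
cols = pd.MultiIndex.from_arrays([["bid"] * 5 + ["ask"] * 5, prices])
idx = pd.date_range("2014-12-15 10:07:00", periods=4, freq="S")
df = pd.DataFrame(np.random.randint(0, 100000, (4, 10)) * 1.0,
                  index=idx, columns=cols)
```
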
Serializing this frame kind of works for `to_msgpack()` but not for `to_json()` (which throws an exception). The intermediate msgpack document:

    {'axes': [{'data': [('bid', 95.0),
                        ('bid', 95.5),
                        ('bid', 96.0),
                        ...
                        ('ask', 99.5)],
               'dtype': 17,
               'klass': 'MultiIndex',
               'names': FrozenList([None, None]),
               'typ': 'multi_index'},
              {'data': '...',
               'dtype': 21,
               'freq': 'S',
               'klass': 'DatetimeIndex',
               'name': None,
               'typ': 'datetime_index',
               'tz': None}],
     'blocks': [{'compress': 'blosc',
                 'dtype': 12,
                 'items': {'data': [('bid', 95.0),
                                    ('bid', 95.5),
                                    ('bid', 96.0),
                                    ...
                                    ('ask', 99.5)],
                           'dtype': 17,
                           'klass': 'MultiIndex',
                           'names': FrozenList([None, None]),
                           'typ': 'multi_index'},
                 'klass': 'FloatBlock',
                 'shape': (10, 4),
                 'values': '...'}],
     'klass': 'DataFrame',
     'typ': 'block_manager'}

It would be nice to expose the API of the 'intermediate' representation (i.e. the output of step 1) to the advanced user, in order to store the DataFrame across multiple 'documents'. If storing to a file, this doesn't make any sense - you want a single document with many embedded documents; this is the JSON paradigm. But if you're storing into a database, you would want the flexibility to store each nested doc/dict in a different collection/table. There are reasons for this: a technical one is MongoDB's maximum record/document size, but it would generally be driven by a use case.
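As a sketch of what that could look like (assuming a hypothetical `to_document()` returning the step-1 dict for the `df` above, and a recent pymongo):

```python
from pymongo import MongoClient

doc = to_document(df)              # hypothetical step-1 API proposed above

client = MongoClient()
db = client.market_data
key = {"symbol": "ABC", "asof": "2014-12-15"}

# Lightweight frame metadata goes into one collection...
db.frames.insert_one(dict(key, typ=doc["typ"], klass=doc["klass"],
                          axes=doc["axes"]))

# ...while each (potentially large) block goes into another, linked by the
# same key, which helps stay under MongoDB's per-document size limit.
for i, block in enumerate(doc["blocks"]):
    db.blocks.insert_one(dict(key, block_no=i, **block))
```
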
What would this intermediate document look like? A dictionary containing only Python (and NumPy - see below) primitives? Which primitives differ between the *SON formats (e.g. dates)?
Serialization of vectorized data
A separate issue, somewhat at odds with the above, is the storage of vectorized data: the underlying numpy array in `to_msgpack()` is encoded into a string and stored in binary form. While this makes sense for performance reasons (see #3525 (comment)), in my opinion it gives up one of the advantages of using msgpack (or BSON going forward): portability. If you go to http://msgpack.org, you'll see they have APIs for every conceivable language (likewise for MongoDB). Wouldn't it be nice if you could use one of those APIs and load up your pandas data in e.g. Java, Haskell or C? At present this is possible, with the exception of the 'values' fields, which you would have to decode by hand. Sure, you could argue for using JSON for portability, but then you're trading off performance.
This is at odds with adding compression - which we still want. A compromise could be to use the native *SON list type when `compress=None` (where we assume speed isn't critical), and the current solution (encode as a string) when compression is active and speed matters (compressed data isn't handled by the above-mentioned APIs anyway)?
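A sketch of the two representations of a block's values (blosc here just mirrors the existing `compress='blosc'` option; none of this is current pandas API):

```python
import blosc
import numpy as np

values = np.random.rand(4, 10)

# Portable: a native *SON list that any JSON/Msgpack/BSON client can read,
# at the cost of converting element by element.
portable = {"values": values.tolist(), "shape": values.shape}

# Fast and compact: the raw buffer, compressed and stored as a binary string.
# This mirrors what the msgpack path does today; a foreign client would have
# to know how to undo it by hand.
compact = {"values": blosc.compress(values.tobytes(),
                                    typesize=values.dtype.itemsize),
           "dtype": str(values.dtype),
           "shape": values.shape}
```
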
Note also that, as described in the links in #4329 (comment), BSON/MongoDB appears to have some support for vectorised lists, which avoids having to convert to a list and then serialize each element separately.
Summary
In adding yet another *SON serializer, it would not make sense to have a third codebase that handles the issues discussed above in a new way:
- Would it make sense to combine the existing msgpack/JSON 'packing' code into a standard intermediate 'schema' shared by all *SON formats? Cons: breaks backward compatibility and makes the JSON document much more complex. Pros: support for more complex pandas structures (DataFrame, Panel, sparse DataFrame, etc.) across all formats.
- Could the code for storing vector data be shared? Could we allow native storage of vector data (e.g. as supported by BSON), for example when the `compress` argument to `to_msgpack` is not used, to allow portability? Ideally also with `compress`? Could we somehow replicate what monary is doing for BSON in the msgpack serializer, i.e. avoid the conversion to lists and the memory copies altogether?
Again, I want to stress that I'm just trying to start some discussion here rather than open a specific change request... Sorry if raising an issue wasn't the correct procedure.