json_normalize skips an entry in a pymongo cursor #30323

Closed
@languitar

Description

I am sorry that I cannot provide a reproducible example; every attempt to reduce the problem to a smaller case makes it disappear.

In [54]: res = client.events.api.longterm.find({'foo': 'bar'})

In [55]: res.count()
Out[55]: 76845

In [56]: len(pd.io.json.json_normalize(res))
Out[56]: 76844

In [57]: res = client.events.api.longterm.find({'foo': 'bar'})

In [58]: len(pd.io.json.json_normalize(list(res)))
Out[58]: 76845

Problem description

I have a fairly large collection of documents in MongoDB, which I query using pymongo. I pass the resulting cursor to pd.io.json.json_normalize to convert the data into a data frame. In one case, which I am unfortunately unable to reduce to something reproducible, one of the 76845 documents in the cursor is missing from the resulting data frame. If I convert the cursor to a list before calling json_normalize, all entries are present. The affected document itself looks completely sane, with nothing suspicious about it. Moreover, the default for json_normalize should be to raise errors, not to silently swallow rows.
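One possible explanation, which this report does not confirm, is that the cursor is a single-pass iterator and something inspects it before the rows are collected. The sketch below uses only the standard library and shows how a short-circuiting check such as any() can drop the first element of an iterator while leaving a list untouched. The function peek_then_count and the sample documents are hypothetical and are not taken from pandas or pymongo.

```python
def peek_then_count(records):
    """Hypothetical stand-in for a function that inspects its input
    before consuming it. Returns (has_nested, number_of_rows_seen)."""
    # any() stops at the first truthy value. On a one-shot iterator,
    # the record pulled for this check is gone afterwards.
    has_nested = any(
        isinstance(value, dict)
        for record in records
        for value in record.values()
    )
    # Second pass: a list restarts from the beginning, an iterator
    # continues from wherever the check above left off.
    rows_seen = sum(1 for _ in records)
    return has_nested, rows_seen


# Made-up sample documents, each with one nested field.
docs = [{"id": i, "meta": {"k": i}} for i in range(5)]

from_list = peek_then_count(docs)        # list: can be iterated twice
from_iter = peek_then_count(iter(docs))  # iterator: single pass only

print(from_list)  # (True, 5)
print(from_iter)  # (True, 4)
```

If something like this is happening, it would be consistent with the observation that wrapping the cursor in list() before calling json_normalize yields all rows.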

Expected Output

All 76845 rows are present in the result when passing the cursor directly to json_normalize.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.3-arch1-1
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.10.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

Labels: Bug; Compat (pandas objects compatibility with NumPy or Python functions); IO JSON (read_json, to_json, json_normalize)
