BUG: Inconsistent results using pd.json_normalize() on a generator object versus list (off by one) #35923

Closed
@ldacey

Description

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandas.


Code Sample, a copy-pastable example

Only one value is returned with this:

import pandas as pd

def gen():
    test = [{'created_at': '2020-08-24T09:30:05Z',
             '_id': '5f43889de6a98fd57afce7be'},
            {'created_at': '2020-08-23T11:16:09Z',
             '_id': '5f44b03799944352493d9317'},
           ]
    for val in test:
        yield val
        
results = gen()
pd.json_normalize(results)


This returns all values though:

results = gen()
list_ = [x for x in results]
pd.json_normalize(list_)

And so does this:

def list_():
    final = []
    test = [{'created_at': '2020-08-24T09:30:05Z',
             '_id': '5f43889de6a98fd57afce7be'},
            {'created_at': '2020-08-23T11:16:09Z',
             '_id': '5f44b03799944352493d9317'},
           ]
    for val in test:
        final.append(val)
    return final

results = list_()
pd.json_normalize(results)


Problem description

Using pd.json_normalize() on a generator seems to always return one record fewer than expected. I first noticed this with a REST API where a column told me to expect 901 results, but I got 900 every time. When I appended the results to a list and normalized that instead, I got the expected 901 results.
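One plausible explanation for an off-by-one like this is that something inspects the input before building the frame. A generator can only be read once, so any item pulled out during that inspection is gone by the time the data is actually used. The sketch below is plain Python and does not reproduce pandas internals: `peek_then_collect` is a hypothetical stand-in, and whether `pd.json_normalize()` does something equivalent is an assumption, not something confirmed in this report.

```python
def gen():
    records = [
        {"created_at": "2020-08-24T09:30:05Z", "_id": "5f43889de6a98fd57afce7be"},
        {"created_at": "2020-08-23T11:16:09Z", "_id": "5f44b03799944352493d9317"},
    ]
    for record in records:
        yield record


def peek_then_collect(data):
    """Hypothetical stand-in for a function that inspects `data` before using it."""
    # any() stops at the first truthy result, but the item it looked at has
    # already been pulled from a generator and cannot be read a second time.
    has_records = any(isinstance(item, dict) for item in data)
    return has_records, list(data)


# Generator input: the record consumed by the check is missing afterwards.
_, remaining_from_generator = peek_then_collect(gen())

# List input: a list can be iterated repeatedly, so nothing is lost.
_, remaining_from_list = peek_then_collect(list(gen()))

print(len(remaining_from_generator), len(remaining_from_list))
```

If this is indeed the mechanism, it would also explain why materializing the generator into a list first avoids the problem.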

Expected Output

Perhaps this is the expected behavior. It just caused me some headaches earlier, and it was not immediately obvious that I was missing one record. I would expect my generator example above to produce the same 2-row DataFrame as the list versions.
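Until the behavior is the same for both input types, a defensive option is to materialize any iterable into a list before normalizing it. The helper below is a minimal sketch of that idea; `materialize_records` is a hypothetical name introduced here, not part of pandas.

```python
from collections.abc import Iterable, Mapping


def materialize_records(data):
    """Return `data` as a list of records (hypothetical helper, not a pandas API)."""
    # A single record is wrapped so the caller always gets a list back.
    if isinstance(data, Mapping):
        return [data]
    # Strings and bytes are iterable but are never a collection of records.
    if isinstance(data, (str, bytes)) or not isinstance(data, Iterable):
        raise TypeError(f"expected a mapping or an iterable of mappings, got {type(data).__name__}")
    # list() reads a generator exactly once, so no item can be lost later.
    return list(data)


records = materialize_records({"_id": n} for n in range(3))
print(len(records))
```

With the `gen()` function from the example above, the call would be `pd.json_normalize(materialize_records(gen()))`. Having a list in hand also makes it easy to compare `len(records)` against the count reported by the API before building the DataFrame.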

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-1028-azure
Version : #29~18.04.1-Ubuntu SMP Fri Jun 5 14:32:34 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.17.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : 3.6.1
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.48.0

Metadata

Labels

Bug, IO JSON (read_json, to_json, json_normalize)