Jsonlines export error #2615

Closed
@TevenLeScao

Describe the bug

When exporting large datasets to jsonlines (c4 in my case), the created file has an error every 9999 lines: the 9999th and 10000th lines are concatenated, which breaks the jsonlines format. This looks related to batching, which defaults to 10000 rows per batch.
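To illustrate how a batching bug could produce this symptom, here is a minimal sketch. It is not the `datasets` implementation: `write_batches` and its `terminate_batches` flag are hypothetical names, and the assumption is that each batch is serialized with newlines between rows but without a trailing newline before the next batch is appended.

```python
import json


def write_batches(rows, batch_size, terminate_batches):
    """Serialize rows batch by batch and return the combined text.

    Hypothetical helper for illustration only, not the library's code.
    """
    chunks = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # Rows inside a batch are separated by newlines.
        text = "\n".join(json.dumps(row) for row in batch)
        if terminate_batches:
            # Ending each batch with a newline keeps batches separated.
            text += "\n"
        chunks.append(text)
    return "".join(chunks)


rows = [{"id": i} for i in range(6)]

# Without a trailing newline, the last row of one batch and the first
# row of the next batch end up on the same line.
broken_lines = write_batches(rows, 3, terminate_batches=False).splitlines()

# With a trailing newline per batch, every row gets its own line.
fixed_lines = write_batches(rows, 3, terminate_batches=True).splitlines()
```

With a batch size of 3, the unterminated variant yields 5 lines instead of 6, and the line at the batch boundary contains two JSON objects back to back.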

Steps to reproduce the bug

This is what I'm running.

In Python:

from datasets import load_dataset
ptb = load_dataset("ptb_text_only")
ptb["train"].to_json("ptb.jsonl")

Then, outside of Python:

head -10000 ptb.jsonl

Expected results

Properly separated lines

Actual results

The last line is a concatenation of two lines
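Rather than inspecting the output by eye, the malformed lines can be located programmatically. The following is a minimal sketch using only the standard library; `find_bad_lines` is a hypothetical helper name, and it assumes that a concatenated line fails to parse as a single JSON value.

```python
import json


def find_bad_lines(lines):
    """Return the 1-based numbers of lines that are not a single valid JSON value.

    `lines` can be any iterable of strings, including an open text file.
    """
    bad = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            # Skip blank lines, e.g. a trailing newline at end of file.
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            # Two objects on one line raise an "Extra data" decode error.
            bad.append(lineno)
    return bad
```

To check the exported file, pass an open file handle, for example `find_bad_lines(open("ptb.jsonl", encoding="utf-8"))`. The returned list gives the line numbers where two records were merged.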

Environment info

  • datasets version: 1.9.1.dev0
  • Platform: Linux-5.4.0-1046-gcp-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.6.9
  • PyArrow version: 4.0.1

Metadata

Labels

bug: Something isn't working
