Closed
Description
Describe the bug
When exporting large datasets to JSON Lines (c4 in my case), the created file has an error every 9999 lines: the 9999th and 10000th lines are concatenated, which breaks the JSON Lines format. This looks related to batching, which uses a batch size of 10000 by default.
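To illustrate how batching could produce this symptom, here is a minimal sketch in plain Python. It is not the `datasets` implementation: `write_batched` and its `add_trailing_newline` flag are hypothetical names, and the assumption is simply that each batch is serialized with newlines between records but none after the last record of the batch.

```python
import json


def write_batched(records, batch_size, add_trailing_newline):
    """Serialize records batch by batch and return the concatenated text.

    Hypothetical helper for illustration only; it mimics a writer that
    joins the records inside a batch with newlines.
    """
    chunks = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        chunk = "\n".join(json.dumps(record) for record in batch)
        if add_trailing_newline:
            # Terminate the batch so the next one starts on a new line.
            chunk += "\n"
        chunks.append(chunk)
    return "".join(chunks)


records = [{"id": i} for i in range(7)]

# Without a trailing newline, the last record of each batch is glued
# to the first record of the next batch.
broken_lines = write_batched(records, 3, add_trailing_newline=False).splitlines()

# With a trailing newline per batch, every record gets its own line.
fixed_lines = write_batched(records, 3, add_trailing_newline=True).splitlines()

print(broken_lines[2])
print(len(broken_lines), len(fixed_lines))
```

With a batch size of 3, the third line of the broken output holds two records, which matches the pattern of seeing a concatenated line at the end of `head -10000` when the batch size is 10000.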
Steps to reproduce the bug
This is what I'm running:
In Python:
from datasets import load_dataset
ptb = load_dataset("ptb_text_only")
ptb["train"].to_json("ptb.jsonl")
Then, from the shell:
head -10000 ptb.jsonl
Expected results
Properly separated lines
Actual results
The last line is a concatenation of two lines
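Rather than inspecting the output of `head` by eye, the affected lines can be located programmatically. The following is a small sketch using only the standard library; `find_bad_lines` is a hypothetical helper name, and it treats any line that does not parse as exactly one JSON value (including a blank line) as malformed.

```python
import json


def find_bad_lines(lines):
    """Return the 1-based numbers of lines that are not a single JSON value."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        try:
            # Two objects glued together fail here with an "Extra data" error.
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(lineno)
    return bad


# In-memory sample: the second line contains two concatenated records.
sample = ['{"a": 1}\n', '{"a": 2}{"a": 3}\n', '{"a": 4}\n']
print(find_bad_lines(sample))
```

To check the exported file, pass an open file handle instead of the sample list, for example `with open("ptb.jsonl", encoding="utf-8") as f: print(find_bad_lines(f))`.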
Environment info
- datasets version: 1.9.1.dev0
- Platform: Linux-5.4.0-1046-gcp-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.6.9
- PyArrow version: 4.0.1