GitHub Pull Request Review Comments (size: 1.5 GB)

Download link.

25.3 million pull request review comments posted on GitHub from January 2015 through December 2018.

Format

xz-compressed CSV, with columns:

  • COMMENT_ID - identifier of the comment in the parent dataset, GH Archive
  • COMMIT_ID - commit hash to which the review comment is attached
  • URL - path to the GitHub pull request the comment comes from
  • AUTHOR - GitHub username of the comment's author
  • CREATED_AT - creation date of the comment
  • BODY - raw content of the comment
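
As a minimal sketch of how these columns might be consumed, the snippet below checks the header row before yielding records. The `read_comments` helper and the sample row are hypothetical and invented for illustration; the exact value formats (for example, of CREATED_AT) are not specified here and are assumed.

```python
import csv
import io

# Column names as listed above.
EXPECTED_COLUMNS = ["COMMENT_ID", "COMMIT_ID", "URL", "AUTHOR", "CREATED_AT", "BODY"]


def read_comments(text_stream):
    """Yield one dict per comment after checking that all columns are present."""
    reader = csv.DictReader(text_stream)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    yield from reader


# Hypothetical sample data, not taken from the real dataset.
SAMPLE = (
    "COMMENT_ID,COMMIT_ID,URL,AUTHOR,CREATED_AT,BODY\n"
    '1,abc123,example/repo/pull/1,octocat,2015-01-01,"Looks good, thanks"\n'
)

rows = list(read_comments(io.StringIO(SAMPLE)))
print(rows[0]["AUTHOR"], "wrote:", rows[0]["BODY"])
```

The same generator should work on any text-mode file object, including one returned by `lzma.open(..., "rt")`.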

Sample code

Python:

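# The archive is too big to load at once with pandas.read_csv,
# so stream it record by record instead.
import csv
import lzma

# Open in text mode; newline="" lets the csv module handle line breaks
# that occur inside quoted comment bodies.
with lzma.open("review_comments.csv.xz", "rt", encoding="utf-8", newline="") as archf:
    reader = csv.DictReader(archf)
    for record in reader:
        print(record)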
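
Because the file is streamed, aggregates can be computed without holding all records in memory. The following is a sketch only: `count_by_author` and its `limit` parameter are hypothetical names introduced here, and the file name is the one used in the sample above.

```python
import csv
import lzma
from collections import Counter


def count_by_author(path, limit=None):
    """Count comments per AUTHOR, optionally stopping after `limit` records."""
    counts = Counter()
    with lzma.open(path, "rt", encoding="utf-8", newline="") as archf:
        for i, record in enumerate(csv.DictReader(archf)):
            if limit is not None and i >= limit:
                break
            counts[record["AUTHOR"]] += 1
    return counts


if __name__ == "__main__":
    # Look at only the first 100,000 records to get a quick impression.
    top = count_by_author("review_comments.csv.xz", limit=100_000)
    print(top.most_common(10))
```

Passing `limit=None` scans the whole archive, which may take a while given its size.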

Origin

The dataset was generated from GH Archive in the following notebook. Comments that exceeded Python's default csv.field_size_limit of 128 KB were discarded (about 10 comments).
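
To illustrate the size filter described above, here is a sketch of one way oversized records could be dropped before writing. This is an assumption about the general approach, not the notebook's actual code; `within_field_limit` is a hypothetical helper and the records are invented.

```python
import csv


def within_field_limit(record, limit=None):
    """Return True if every field in the record fits the csv field size limit."""
    if limit is None:
        # With no argument, field_size_limit() returns the current limit.
        limit = csv.field_size_limit()
    return all(len(value) <= limit for value in record.values())


# Hypothetical records: one short comment and one that exceeds the limit.
records = [
    {"AUTHOR": "octocat", "BODY": "short comment"},
    {"AUTHOR": "octocat", "BODY": "x" * (csv.field_size_limit() + 1)},
]

kept = [r for r in records if within_field_limit(r)]
print(f"kept {len(kept)} of {len(records)} records")
```

Filtering at write time means readers can keep the default limit and never hit a `csv.Error` for an oversized field.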

We gathered some statistics about the dataset.

License

Open Data Commons Open Database License (ODbL)