Guidance on defining collections to group datasets released as a series or in fragments #258
Description
Data.gov has had the notion of a "collection" that can be used to group multiple datasets that would logically be considered a single dataset, but have been released in separate parts. The most common scenario for this is a series of releases over time. In some cases a dataset may be published in monthly or yearly releases, but if the only thing that distinguishes these is the date, then they should really be packaged as a single dataset. This also makes browsing simpler, because it prevents many similar datasets from crowding out more unique ones. Some datasets might also be published by location, such as data relating to each state being released as a separate file. These should also be grouped together to appear as a single dataset.
Ideally agencies should package these all together as a single file/release before publishing, e.g. one file that is continuously updated is preferable to separate releases over time. At the very least, there should be a way to define this kind of packaged grouping at the metadata level, as is currently the case on data.gov.
The way data.gov handles this is that the collection is treated like a normal dataset entry, except that it refers to many child entries. Something similar could be done with the data.json schema, but we would need to establish a convention for defining that parent/child relationship between entries.
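To make the idea concrete, here is a rough sketch of one possible convention, not a settled proposal. It assumes each child entry carries a field (called `isPartOf` here purely for illustration) holding the `identifier` of its parent entry. The identifiers, titles, and the `group_collections` helper are all made up for the example.

```python
from collections import defaultdict

# Hypothetical data.json entries. "isPartOf" is an assumed field name for the
# child-to-parent link; the identifiers and titles below are invented.
catalog = [
    {"identifier": "blocks-2010", "title": "Census Blocks 2010 (series)"},
    {"identifier": "blocks-2010-al", "title": "Census Blocks 2010, Alabama",
     "isPartOf": "blocks-2010"},
    {"identifier": "blocks-2010-ak", "title": "Census Blocks 2010, Alaska",
     "isPartOf": "blocks-2010"},
    {"identifier": "standalone", "title": "An unrelated dataset"},
]


def group_collections(entries):
    """Split entries into top-level entries and a parent id -> children map."""
    children = defaultdict(list)
    top_level = []
    for entry in entries:
        parent_id = entry.get("isPartOf")
        if parent_id is None:
            # No parent link: show this entry in normal browsing.
            top_level.append(entry)
        else:
            # Has a parent link: file it under its parent instead.
            children[parent_id].append(entry)
    return top_level, dict(children)


top_level, children = group_collections(catalog)
for entry in top_level:
    count = len(children.get(entry["identifier"], []))
    print(f'{entry["title"]}: {count} child dataset(s)')
```

With a link like this, a catalog could list only the top-level entries when browsing and show the children from the parent's page, which is roughly how the data.gov collection views behave.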
Here's a current example of a collection on data.gov:
View of the collection "parent" metadata:
http://catalog.data.gov/dataset/tiger-line-shapefile-2010-series-information-file-for-the-2010-census-block-state-ba
View of all its "child" datasets:
http://catalog.data.gov/dataset?collection_package_id=2a8b7f0b-1ae5-453c-ba56-996547266a63