Single-file /data catalog is problematic -- optional alternative suggested #105
Description
Current guidance is that each agency's "/data" inventory must be a single list in a single file of JavaScript Object Notation (JSON) summary metadata, one entry per dataset, even if an agency has tens of thousands of datasets distributed across multiple facilities and servers. I believe this single list will pose problems for inventory creation, maintenance, and usability. I enumerate my concerns below, but first propose a specific solution.
PROPOSAL:
I recommend the single-list approach be made optional. Specifically, I suggest that the top-level JSON file be permitted to contain either a list of datasets or a list of child nodes. Each node would at minimum have the 'title' and 'accessURL' elements from your JSON schema (http://project-open-data.github.io/schema/), an agreed-upon value of 'format' such as "inventory_node" to indicate that the destination is not a data file, and optionally other useful elements (e.g., 'person', 'mbox', 'modified', 'accessLevel') describing that node. Each child node could likewise contain either a list of datasets or a list of children.
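A minimal sketch of what a top-level /data file might look like under this proposal (hostnames, titles, and dates are hypothetical; the field names come from the project-open-data schema, and "inventory_node" is the proposed sentinel value for 'format'):

```json
[
  {
    "title": "Science data inventory",
    "accessURL": "https://data.example.gov/data/science/catalog.json",
    "format": "inventory_node",
    "person": "Jane Smith",
    "mbox": "mailto:jane.smith@example.gov",
    "modified": "2013-08-01",
    "accessLevel": "public"
  },
  {
    "title": "Business data inventory",
    "accessURL": "https://data.example.gov/data/business/catalog.json",
    "format": "inventory_node",
    "modified": "2013-06-15",
    "accessLevel": "public"
  },
  {
    "title": "Agency budget summary (an ordinary dataset at the top level)",
    "accessURL": "https://data.example.gov/data/budget.csv",
    "format": "text/csv",
    "modified": "2013-07-15",
    "accessLevel": "public"
  }
]
```

A harvester that encounters 'format' equal to "inventory_node" would fetch the 'accessURL' and recurse; any other entry is treated as an ordinary dataset.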
CONCERNS REGARDING THE SINGLE-LIST APPROACH:
(1) We should not build these inventories only to support data.gov. We want to leverage them for other efforts internal to our agencies, for PARR, and to support other external portals such as the Global Earth Observation System of Systems (GEOSS) or the Global Change Master Directory (GCMD). A distributed organization will be more useful for them (even if data.gov itself can handle a single long unsorted list).
(2) The inventory will need to be compiled from many different sources, including multiple web-accessible folders (WAFs) of geospatial metadata, existing catalog servers, and other databases or lists. Each type of input will need some code to produce its subset of the list, and then additional software will need to merge everything into one giant list. A failure at any step in that process may leave the inventory out of date, incomplete, or broken entirely. A distributed organization makes it easier to keep most of the inventory correct and complete even when one sub-part is not, and requires far less fault-handling code (see the first sketch after this list).
(3) Some of our data change very frequently, on timescales of minutes or hours, while other data are modified yearly or less often. A distributed organization more easily allows partial updates, and the addition or removal of entire collections of data, without regenerating the whole list.
(4) The inventory is supposed to include both our scientific observations and our "business" data, and both public and non-public data. That alone suggests a top-level division into, for example, /data/science, /data/business, and /data/internal (as in the example above); the last of these may need to live on a separate machine with different access control.
(5) A distributed organization would also make it easier to publish usable parallel versions of the inventory in formats other than JSON (e.g., HTML with schema.org tags).
(6) I understand that the data.gov harvester has successfully parsed very long JSON files. However, recursively traversing a web-based, directory-tree-like structure would be trivial for data.gov to implement, would scale better, and would solve many problems for agencies and users alike. data.gov's own harvesting could even speed up if the 'modified' date on each node were checked to decide whether the whole subtree can be skipped (see the second sketch below).
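To illustrate concern (2): a minimal, hypothetical sketch of a per-source build under the proposed layout. The harvest_* functions stand in for whatever code extracts records from a WAF, catalog server, or database; the point is that a failure in one source leaves every other node's previous good copy in place, whereas a single-list build has to succeed end to end.

```python
import json
from pathlib import Path

# Stand-ins for real extractors (one per WAF, catalog server, database, ...).
# Each returns a list of dataset records already mapped to the schema fields.
def harvest_waf():
    return [{"title": "Example WAF dataset",
             "accessURL": "https://data.example.gov/waf/item1.xml",
             "format": "text/xml"}]

def harvest_catalog_server():
    raise RuntimeError("catalog server is down")  # simulate one broken source

SOURCES = {
    "geospatial-waf": harvest_waf,
    "catalog-server": harvest_catalog_server,
}

def build_nodes(out_dir="inventory"):
    """Write one node file per source; a failure spoils one node, not all."""
    for name, harvest in SOURCES.items():
        try:
            datasets = harvest()
        except Exception as err:
            # The previously published copy of this node stays in place;
            # every other node is still rebuilt and stays current.
            print(f"skipping {name}: {err}")
            continue
        path = Path(out_dir) / name / "catalog.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(datasets, indent=2))

if __name__ == "__main__":
    build_nodes()
```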
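And for concern (6): a sketch, again hypothetical, of the recursive traversal, showing how a harvester could skip an entire subtree whose 'modified' stamp has not changed since the previous run.

```python
import json
from urllib.request import urlopen

def harvest(url, last_seen, datasets):
    """Walk an inventory tree rooted at `url`.

    `last_seen` maps node URLs to the 'modified' value recorded on the
    previous harvest; `datasets` accumulates ordinary dataset entries.
    """
    with urlopen(url) as resp:
        entries = json.load(resp)
    for entry in entries:
        if entry.get("format") == "inventory_node":
            child = entry["accessURL"]
            # Unchanged 'modified' stamp: skip this entire subtree.
            if last_seen.get(child) == entry.get("modified"):
                continue
            harvest(child, last_seen, datasets)
            last_seen[child] = entry.get("modified")
        else:
            datasets.append(entry)
    return datasets

# Usage (hypothetical root URL):
# datasets = harvest("https://data.example.gov/data.json", {}, [])
```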