Description
NOTE: The goal of this is to find something resembling community consensus on a direction, or at least a notable lean in one direction or another from a large swath of the community.
We are not trying to discredit either idea, although we all tend to lurch in that direction from time to time, myself included. What we need is something that more people than the usual tiny number of participants would be willing to try out.
The discussion here can get very fast-paced. I am trying to periodically pause it to allow new folks, or people who don't have quite as much time, to catch up. Please feel free to comment requesting such a pause if you would like to contribute but are having trouble following it all.
This proposal attempts to create one or more general mechanisms, consistent with our overall approach, that will address the `"additionalProperties": false` use cases that do not work well with our existing modularity and re-use features.
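To recap the problem for context, a common failure mode (my own minimal sketch; the property names are invented for illustration) looks like this:

```json
{
  "allOf": [
    { "properties": { "from": { "type": "string" } } },
    { "properties": { "to": { "type": "string" } } }
  ],
  "additionalProperties": false
}
```

Because `additionalProperties` only considers `properties` and `patternProperties` within its own schema object, it cannot see "from" or "to" inside the `allOf` branches, so every property of every instance is rejected as "additional". This is why combining `"additionalProperties": false` with re-use via `allOf` does not work.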
TL;DR: We should look to the multi-level approach of URI Templates to solve complex problems that only a subset of users require. Implementations can choose what level of functionality to provide, and vocabularies can declare what level of support they require.
Existing implementations are generally Level 3 by the following list. Draft-07 introduces annotation collection rules, which are optional to implement; implementations that do support annotation collection will be Level 4. This PR proposes Levels 5 and 6, and also examines how competing proposals (schema transforms) impact Level 1.
EDIT: Deferred keywords are intended to make use of subschema results, and not results from parent or sibling schemas as the original write-up accidentally stated.
- Level 1: Basic media type functionality. Identify and link schemas, allow for basic modularity and re-use
- Level 2: Full structural access. Apply subschemas to the current location and combine the results, and/or apply subschemas to child locations
- Level 3: Assertions. Evaluate the assertions within a schema object without regard to the contents of any other schema object
- Level 4: Annotations. Collect all annotations that apply to a given location and combine the values as defined by each keyword
- Level 5: Deferred Assertions. Evaluate these assertions across all subschemas that apply to a given location
- Level 6: Deferred Annotations. Collect annotations and combine them with existing level 4 results as specified by the keyword. Deferred annotations may specify override rules for level 4 annotations collected from subschemas
A general JSON Schema processing model
With the keyword classifications developed during draft-07 (and a bit further in #512), we can lay out a conceptual processing model for a generic JSON Schema implementation.
NOTE 1: This does not mean that implementations need to actually organize their code in this manner. In particular, an implementation focusing on a specific vocabulary, e.g. validation, may want to optimize performance by taking a different approach and/or skipping steps that are not relevant to that vocabulary. A validator does not necessarily need to collect annotations. However, Hyper-Schema relies on the annotation collection step to build hyperlinks.
NOTE 2: Even if this approach is used, the steps are not executed linearly. `$ref` must be evaluated lazily, and it makes sense to alternate evaluation of assertions and applicability keywords to avoid evaluating subschemas that are irrelevant because of failed assertions.
1. Process schema linking and URI base keywords (`$schema`, `$id`, `$ref`, and `definitions`, as discussed in #512)
2. Process applicability keywords to determine the set of subschema objects relevant to the current instance location, and the logic rules for combining their assertion results
3. Process each subschema object's assertions, and remove any subschema objects with failed assertions from the set
4. Collect annotations from the remaining relevant subschemas
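To make the mapping concrete, here is a small schema (invented for illustration) with each step's keywords called out: `$schema` and `$id` are handled in step 1, `properties` is an applicability keyword (step 2), `type` and `required` are assertions (step 3), and `title` and `default` are annotations (step 4):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://example.com/schemas/person",
  "type": "object",
  "required": ["name"],
  "properties": {
    "name": {
      "type": "string",
      "title": "Full name",
      "default": ""
    }
  }
}
```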
There is a basic example in one of the comments.
Note that (assuming #512 is accepted) step 1 is entirely determined by the Core spec, and (if #513 is accepted) step 2 is entirely determined by either the Core spec or its own separate spec.
Every JSON Schema implementation MUST handle step 1, and all known vocabularies also require step 2.
Steps 3 and 4 are where things get more interesting.
Step 3 is required to implement validation, and AFAIK most validators stop with step 3. Step 4 was formalized in draft-07, but previously there was no guidance on what to do with the annotation keywords (if anything).
Implementations that want to follow draft-07's guidance on the annotation keywords in the validation spec would need to add step 4 (though doing so is optional in draft-07).
Strictly speaking, Hyper-Schema could implement steps 1, 2, and 4, as it does not define any schema assertions to evaluate in step 3. But as a practical matter, Hyper-Schema will almost always be implemented alongside validation, so a Hyper-Schema implementation will generally include all four steps.
So far, none of this involves changing anything. It's just laying out a way to think about the things that the spec already requires (or optionally recommends).
To solve the re-use problem, there are basically two approaches, both of which can be viewed as extensions to this processing model:
Deferred processing
To solve the re-use problems I propose defining a step 5:
- Process additional assertions (a.k.a. deferred assertions) that may make use of all subschemas that are relevant at the end of step 4. Note that we must already process all existing subschema keywords before we can provide the overall result for a schema object.
EDIT: The proposal was originally called `unknownProperties`, which produced confusion over the definition of "known", as can be seen in many later comments. This write-up has been updated to call the intended proposed behavior `unevaluatedProperties` instead, but that name does not otherwise appear until much later in this issue.
This easily allows a keyword to implement "ban unknown properties", among other things. We can define `unevaluatedProperties` to be a deferred assertion analogous to `additionalProperties`. Its value is a schema that is applied to all properties that are not addressed by the union, over all relevant schemas, of `properties` and `patternProperties`.
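A sketch of the intended behavior (the keyword does not exist in any published draft, so this is illustrative only; the property names are invented):

```json
{
  "allOf": [
    { "properties": { "from": { "type": "string" } } },
    { "properties": { "to": { "type": "string" } } }
  ],
  "unevaluatedProperties": false
}
```

Because `unevaluatedProperties` is deferred, it is evaluated only after the `allOf` branches have been processed, so "from" and "to" are already accounted for and only genuinely unrecognized properties are rejected.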
There is an example of how `unevaluatedProperties` (called `unknownProperties` in the example) would work in the comments. You should read the basic processing example in the previous comment first if you have not already.
We could then easily define other similar keywords if we have use cases for them. One I can think of offhand would be `unevaluatedItems`, which would be analogous to `additionalItems` except that it would apply to elements after the maximum length of the `items` arrays across all relevant schemas. (I don't think anyone's ever asked for this, though.)
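If we did define it, a sketch (hypothetical keyword and behavior, exactly as described above) might look like:

```json
{
  "allOf": [
    { "items": [ { "type": "integer" } ] },
    { "items": [ {}, { "type": "string" } ] }
  ],
  "unevaluatedItems": false
}
```

The longest `items` array across the relevant schemas covers two positions, so a third array element would fail `unevaluatedItems`.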
Deferred annotations would also be possible (which I suppose would be a step 6). Maybe something like `deferredDefault`, which would override any/all `default` values, and perhaps trigger an error if it appears in multiple relevant schemas for the same location. (I am totally making this behavior up as I write it; do not take it as a serious proposal.)
Deferred keywords require collecting annotation information from subschemas, and are therefore somewhat more costly to implement in terms of memory and processing time. Therefore, it would make sense to allow implementations to opt-in to this as an additional level of functionality.
Implementations could also provide both a performance mode (that goes only to level 3) and a full-feature mode (that implements all levels).
Schema transforms
In the interest of thoroughly covering all major re-use proposals, I'll note that solutions such as `$merge` or `$patch` would be added as a step 1.5, as they are processed after `$ref` but before all other keywords.
These keywords introduce schema transformations, which are not present in the above processing model. All of the other remaining proposals (`$spread`, `$use`, single-level overrides) can be described as limited versions of `$merge` and/or `$patch`, so they would fit in the same place. They all still introduce schema transformations, just with a smaller set of possible transformations.
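For reference, @epoberezkin's ajv-merge-patch extension expresses this roughly as follows (my paraphrase of its syntax; the `$ref` target is invented), with `$merge` applying an RFC 7396 JSON Merge Patch and `$patch` applying an RFC 6902 JSON Patch to the source schema:

```json
{
  "$merge": {
    "source": { "$ref": "base.json#" },
    "with": {
      "properties": {
        "extra": { "type": "boolean" }
      }
    }
  }
}
```

In the processing model above, the transform would run at step 1.5: resolve `source`, apply `with`, and hand the resulting schema object to steps 2 onward.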
It's not clear to me how schema transform keywords work with the idea that `$ref` is delegation rather than inclusion (see #514 for a detailed discussion of these options and why it matters).
[EDIT: @epoberezkin has proposed a slightly different `$merge` syntax that avoids some of these problems, but I'm leaving this part as I originally wrote it to show the progress of the discussion.]
If `$ref` is lazily replaced with its target (with `$id` and `$schema` adjusted accordingly), then transforms are straightforward. However, we currently forbid changing `$schema` while processing a schema document, and merging schema objects that use different `$schema` values seems impossible to do correctly in the general case.

Imposing a restriction of identical `$schema` values seems undesirable, given that a target schema maintainer could change their draft version independent of the source schema maintainer.
On the other hand, if `$ref` is delegation, it is handled by processing its target and "returning" the resulting assertion outcome (and optionally the collected annotations). This works fine with different `$schema` values, but it is not at all clear to me how schema transforms would apply.
@epoberezkin, I see that you have some notes on ajv-merge-patch about this but I'm having a bit of trouble following. Could you add how you think this should work here?
Conclusions
Based on my understanding so far, I prefer deferred keywords as a solution. It does not break any aspect of the existing model; it just extends it by applying the same concepts (assertions and annotations) at a different stage of processing (after collecting the relevant subschemas, instead of processing each relevant schema on its own). It also places a lot of flexibility in the hands of vocabulary designers, which is how JSON Schema is designed to work.
Schema transforms introduce an entirely new behavior to the processing model. They do not seem to work with how we are now conceptualizing `$ref`, although I may well be missing something there. However, if I'm right, that would be the most compelling argument against them.
I also still dislike having arbitrary editing/transform functionality as a part of JSON Schema at all, but that's more of a philosophical objection, and I haven't yet figured out how to articulate it in a convincing way.
I do think that this summarizes the two possible general approaches and defines them in a generic way. Once we choose which to include in our processing model, then picking the exact keywords and behaviors will be much less controversial. Hopefully :-)