Description
NOTE: The goal of this is to find something resembling community consensus on a direction, or at least a notable lean in one direction or another from a large swath of the community.
We are not trying to discredit either idea, although we all tend to lurch in that direction from time to time, myself included. What we need is something that more people than the usual tiny number of participants would be willing to try out.
The discussion here can get very fast-paced. I am trying to periodically pause it to allow new folks, or people who don't have quite as much time, to catch up. Please feel free to comment requesting such a pause if you would like to contribute but are having trouble following it all.
This proposal attempts to create one or more general mechanisms, consistent with our overall approach, that will address the `"additionalProperties": false` use cases that do not work well with our existing modularity and re-use features.
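To recap the problem for context, a common failure mode (my own minimal sketch; the property names are invented for illustration) looks like this:

```json
{
  "allOf": [
    { "properties": { "from": { "type": "string" } } },
    { "properties": { "to": { "type": "string" } } }
  ],
  "additionalProperties": false
}
```

Because `additionalProperties` only considers `properties` and `patternProperties` within its own schema object, it cannot see "from" or "to" inside the `allOf` branches, so every property of every instance is rejected as "additional". This is why combining `"additionalProperties": false` with re-use via `allOf` does not work.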
TL;DR: We should look to the multi-level approach of URI Templates to solve complex problems that only a subset of users require. Implementations can choose what level of functionality to provide, and vocabularies can declare what level of support they require.
Existing implementations are generally Level 3 by the following list. Draft-07 introduces annotation collection rules, which are optional to implement; implementations that do support annotation collection will be Level 4. This PR proposes Levels 5 and 6, and also examines how competing proposals (schema transforms) impact Level 1.
EDIT: Deferred keywords are intended to make use of subschema results, and not results from parent or sibling schemas as the original write-up accidentally stated.
- Level 1: Basic media type functionality. Identify and link schemas, allow for basic modularity and re-use
- Level 2: Full structural access. Apply subschemas to the current location and combine the results, and/or apply subschemas to child locations
- Level 3: Assertions. Evaluate the assertions within a schema object without regard to the contents of any other schema object
- Level 4: Annotations. Collect all annotations that apply to a given location and combine the values as defined by each keyword
- Level 5: Deferred Assertions. Evaluate these assertions across all subschemas that apply to a given location
- Level 6: Deferred Annotations. Collect annotations and combine them with existing level 4 results as specified by the keyword. Deferred annotations may specify override rules for level 4 annotations collected from subschemas
A general JSON Schema processing model
With the keyword classifications developed during draft-07 (and a bit further in #512), we can lay out a conceptual processing model for a generic JSON Schema implementation.
NOTE 1: This does not mean that implementations need to actually organize their code in this manner. In particular, an implementation focusing on a specific vocabulary, e.g. validation, may want to optimize performance by taking a different approach and/or skipping steps that are not relevant to that vocabulary. A validator does not necessarily need to collect annotations. However, Hyper-Schema relies on the annotation collection step to build hyperlinks.
NOTE 2: Even if this approach is used, the steps are not executed linearly. `$ref` must be evaluated lazily, and it makes sense to alternate evaluation of assertions and applicability keywords to avoid evaluating subschemas that are irrelevant because of failed assertions.
1. Process schema linking and URI base keywords (`$schema`, `$id`, `$ref`, and `definitions`, as discussed in #512)
2. Process applicability keywords to determine the set of subschema objects relevant to the current instance location, and the logic rules for combining their assertion results
3. Process each subschema object's assertions, and remove any subschema objects with failed assertions from the set
4. Collect annotations from the remaining relevant subschemas
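To make the mapping concrete, here is a small schema (invented for illustration) with each step's keywords called out: `$schema` and `$id` are handled in step 1, `properties` is an applicability keyword (step 2), `type` and `required` are assertions (step 3), and `title` and `default` are annotations (step 4):

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://example.com/schemas/person",
  "type": "object",
  "required": ["name"],
  "properties": {
    "name": {
      "type": "string",
      "title": "Full name",
      "default": ""
    }
  }
}
```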
There is a basic example in one of the comments.
Note that (assuming #512 is accepted) step 1 is entirely determined by the Core spec, and (if #513 is accepted) step 2 is entirely determined by either the Core spec or its own separate spec.
Every JSON Schema implementation MUST handle step 1, and all known vocabularies also require step 2.
Steps 3 and 4 are where things get more interesting.
Step 3 is required to implement validation, and AFAIK most validators stop with step 3. Step 4 was formalized in draft-07, but previously there was no guidance on what to do with the annotation keywords (if anything).
Implementations that want to follow draft-07's guidance on the annotation keywords in the validation spec would need to add step 4 (though doing so is optional in draft-07).
Strictly speaking, Hyper-Schema could implement steps 1, 2, and 4, as it does not define any schema assertions to evaluate in step 3. But as a practical matter, Hyper-Schema will almost always be implemented alongside validation, so a Hyper-Schema implementation will generally include all four steps.
So far, none of this involves changing anything. It's just laying out a way to think about the things that the spec already requires (or optionally recommends).
To solve the re-use problem, there are basically two approaches, both of which can be viewed as extensions to this processing model:
Deferred processing
To solve the re-use problems I propose defining a step 5:
- Process additional assertions (a.k.a. deferred assertions) that may make use of all subschemas that are relevant at the end of step 4. Note that we must already process all existing subschema keywords before we can provide the overall result for a schema object.
EDIT: The proposal was originally called `unknownProperties`, which produced confusion over the definition of "known", as can be seen in many later comments. This write-up has been updated to call the intended proposed behavior `unevaluatedProperties` instead, but that name does not otherwise appear until much later in this issue.
This easily allows a keyword to implement "ban unknown properties", among other things. We can define `unevaluatedProperties` to be a deferred assertion analogous to `additionalProperties`. Its value is a schema that is applied to all properties that are not addressed by the union, over all relevant schemas, of `properties` and `patternProperties`.
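A sketch of the intended behavior (the keyword does not exist in any published draft, so this is illustrative only; the property names are invented):

```json
{
  "allOf": [
    { "properties": { "from": { "type": "string" } } },
    { "properties": { "to": { "type": "string" } } }
  ],
  "unevaluatedProperties": false
}
```

Because `unevaluatedProperties` is deferred, it is evaluated only after the `allOf` branches have been processed, so "from" and "to" are already accounted for and only genuinely unrecognized properties are rejected.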
There is an example of how `unevaluatedProperties` (called `unknownProperties` in the example) would work in the comments. You should read the basic processing example in the previous comment first if you have not already.
We could then easily define other similar keywords if we have use cases for them. One I can think of offhand would be `unevaluatedItems`, which would be analogous to `additionalItems` except that it would apply to elements after the maximum length of the `items` arrays across all relevant schemas. (I don't think anyone's ever asked for this, though.)
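If we did define it, a sketch (hypothetical keyword and behavior, exactly as described above) might look like:

```json
{
  "allOf": [
    { "items": [ { "type": "integer" } ] },
    { "items": [ {}, { "type": "string" } ] }
  ],
  "unevaluatedItems": false
}
```

The longest `items` array across the relevant schemas covers two positions, so a third array element would fail `unevaluatedItems`.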
Deferred annotations would also be possible (which I suppose would be a step 6). Maybe something like `deferredDefault`, which would override any/all `default` values, and perhaps trigger an error if it appears in multiple relevant schemas for the same location. (I am totally making this behavior up as I write it; do not take it as a serious proposal.)
Deferred keywords require collecting annotation information from subschemas, and are therefore somewhat more costly to implement in terms of memory and processing time. Therefore, it would make sense to allow implementations to opt-in to this as an additional level of functionality.
Implementations could also provide both a performance mode (that goes only to level 3) and a full-feature mode (that implements all levels).
Schema transforms
In the interest of thoroughly covering all major re-use proposals, I'll note that solutions such as `$merge` or `$patch` would be added as a step 1.5, as they are processed after `$ref` but before all other keywords.
These keywords introduce schema transformations, which are not present in the above processing model. All of the other remaining proposals (`$spread`, `$use`, single-level overrides) can be described as limited versions of `$merge` and/or `$patch`, so they would fit in the same place. They all still introduce schema transformations, just with a smaller set of possible transformations.
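For reference, @epoberezkin's ajv-merge-patch extension expresses this roughly as follows (my paraphrase of its syntax; the `$ref` target is invented), with `$merge` applying an RFC 7396 JSON Merge Patch and `$patch` applying an RFC 6902 JSON Patch to the source schema:

```json
{
  "$merge": {
    "source": { "$ref": "base.json#" },
    "with": {
      "properties": {
        "extra": { "type": "boolean" }
      }
    }
  }
}
```

In the processing model above, the transform would run at step 1.5: resolve `source`, apply `with`, and hand the resulting schema object to steps 2 onward.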
It's not clear to me how schema transform keywords work with the idea that `$ref` is delegation rather than inclusion (see #514 for a detailed discussion of these options and why it matters).
[EDIT: @epoberezkin has proposed a slightly different `$merge` syntax that avoids some of these problems, but I'm leaving this part as I originally wrote it to show the progress of the discussion.]
If `$ref` is lazily replaced with its target (with `$id` and `$schema` adjusted accordingly), then transforms are straightforward. However, we currently forbid changing `$schema` while processing a schema document, and merging schema objects that use different `$schema` values seems impossible to do correctly in the general case.

Imposing a restriction of identical `$schema` values seems undesirable, given that a target schema maintainer could change their draft version independent of the source schema maintainer.
On the other hand, if `$ref` is delegation, it is handled by processing its target and "returning" the resulting assertion outcome (and optionally the collected annotations). This works fine with different `$schema` values, but it is not at all clear to me how schema transforms would apply.
@epoberezkin, I see that you have some notes on ajv-merge-patch about this but I'm having a bit of trouble following. Could you add how you think this should work here?
Conclusions
Based on my understanding so far, I prefer deferred keywords as a solution. It does not break any aspect of the existing model; it just extends it by applying the same concepts (assertions and annotations) at a different stage of processing (after collecting the relevant subschemas, instead of processing each relevant schema on its own). It also places a lot of flexibility in the hands of vocabulary designers, which is how JSON Schema is designed to work.
Schema transforms introduce an entirely new behavior to the processing model. They do not seem to work with how we are now conceptualizing `$ref`, although I may well be missing something there. However, if I'm right, that would be the most compelling argument against them.
I also still dislike having arbitrary editing/transform functionality as a part of JSON Schema at all, but that's more of a philosophical objection, and I haven't yet figured out how to articulate it in a convincing way.
I do think that this summarizes the two possible general approaches and defines them in a generic way. Once we choose which to include in our processing model, then picking the exact keywords and behaviors will be much less controversial. Hopefully :-)