Expected behavior on schema validation
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
In OSDU R2, the storage service exposes a schema API. In R3, there will be a dedicated schema service. Both serve a similar purpose: to enable consumers to understand the data. However, the OSDU platform still places a heavy burden on the consumer to validate the data against the schema. This ADR introduces schema validation that aims to remove this burden.
Decision
Going forward, data in OSDU will be validated against the schema. This will not be enforcement (rejecting data that does not meet the provided schema definition), because that would increase ingestion friction. Instead, validation will be asynchronous and follow these steps:
- Each data record must be marked with its corresponding kind. (This is already the case.)
- The data record is stored via the synchronous storage service API. (This is already the case.)
- The data record is asynchronously verified against the schema of its kind. (New feature.)
- The record in the storage service is given a boolean root property schemaConsistent, set to true when the record matches the schema and false when it does not. (New feature.)
- The indexer service indexes the schemaConsistent property. (New feature.)
Records that are schema consistent can contain more data (properties) than what is defined in the schema. In that sense, the associated schema is the "minimal" schema.
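The flow described in this section can be sketched as follows. This is a minimal illustration, not the OSDU implementation: the in-memory STORE and PENDING queue stand in for the storage service and an asynchronous message channel, the schema format (property name mapped to a Python type) is a deliberate simplification, and all function names and the example kind are hypothetical. Treating a record whose kind has no registered schema as not consistent is also an assumption; the ADR does not specify that case.

```python
import queue

# Hypothetical, simplified schema registry: kind -> {property name: expected type}.
SCHEMAS = {
    "example:wellbore:1.0.0": {"name": str, "depth": float},
}

STORE = {}               # stand-in for the storage service
PENDING = queue.Queue()  # stand-in for an asynchronous message channel


def is_schema_consistent(data, schema):
    """Return True when every schema property is present with the expected type.

    Properties that are not mentioned in the schema are ignored, which is
    what makes the schema a "minimal" schema.
    """
    return all(
        name in data and isinstance(data[name], expected)
        for name, expected in schema.items()
    )


def put_record(record_id, kind, data):
    """Synchronous store: the record is accepted without any schema check."""
    STORE[record_id] = {"kind": kind, "data": data}
    PENDING.put(record_id)  # schedule the asynchronous verification


def run_validation_worker():
    """Asynchronous step: verify each pending record and set the root flag."""
    while not PENDING.empty():
        record = STORE[PENDING.get()]
        schema = SCHEMAS.get(record["kind"])
        # Assumption: a kind without a registered schema is flagged as false.
        record["schemaConsistent"] = (
            schema is not None and is_schema_consistent(record["data"], schema)
        )


if __name__ == "__main__":
    kind = "example:wellbore:1.0.0"
    # Extra property "operator" is allowed; the record is still consistent.
    put_record("r1", kind, {"name": "W-1", "depth": 1500.0, "operator": "ACME"})
    # Missing "depth"; the record is stored anyway but flagged as inconsistent.
    put_record("r2", kind, {"name": "W-2"})
    run_validation_worker()
    for record_id, record in STORE.items():
        print(record_id, record["schemaConsistent"])
```

Note that both records are stored regardless of the outcome; only the flag differs. Until the worker has run, a record carries no schemaConsistent property at all, which is the eventual-consistency window that asynchronous verification implies.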
Rationale
It is hard to write a robust application if one cannot rely on the schema of the data. Without platform support, the consumer must perform the check and code defensively to compensate for this uncertainty. It is part of the OSDU platform's responsibility to provide such functionality.
Consequences
Consumers no longer need to perform schema verification themselves and can rely on the schema consistency of any data whose schemaConsistent property is true.
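On the consumer side, relying on the flag could look like the sketch below. It assumes records arrive as dictionaries with schemaConsistent as a root property, as described in the Decision section; the function name and the sample records are hypothetical. Skipping records that do not yet carry the flag is one possible policy, not something this ADR prescribes.

```python
def consistent_records(records):
    """Yield only the records the platform has flagged as schema consistent."""
    for record in records:
        # A missing flag means verification has not run yet, so the record
        # is skipped here rather than trusted.
        if record.get("schemaConsistent") is True:
            yield record


if __name__ == "__main__":
    sample = [
        {"id": "r1", "schemaConsistent": True},
        {"id": "r2", "schemaConsistent": False},
        {"id": "r3"},  # not yet verified
    ]
    for record in consistent_records(sample):
        print(record["id"])
```

A consumer that prefers completeness over strictness could instead process unflagged records with its own defensive checks and reserve the fast path for flagged ones.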
Tradeoff Analysis - Input to decision
Schema can be:
- Not enforced or verified by the platform
  - Current behavior that does not provide value to consumers.
- Enforced by the platform
  - This means rejecting data that does not meet the schema definition.
  - Challenges
    - Increases ingestion friction (sync and async enforcement)
    - Decreases ingestion performance (sync enforcement)
    - Complicates storage and DDMS design (sync enforcement)
    - Complicates usage (async enforcement)
  - Recommendation is not to do this, because the increased ingestion friction results in loss of data for the platform.
- Verified by the platform
  - This means flagging the data as consistent or inconsistent with the provided schema.
  - Challenges
    - Less rigorous about upstream errors (sync and async verification)
    - Decreases ingestion performance (sync verification)
    - Eventual consistency (async verification)
  - Recommendation is to do this, with the main benefit being data retention.
Decision criteria and tradeoffs
- Performance
- Reliability
- Cost of implementation
- Usability