ADR: General purpose batch write DAG operator
Status
- Draft
- Proposed
- Trialing
- Under Review
- Approved
- Retired
Context
A wide variety of volume-based use cases drive how ingestion into the OSDU(TM) data platform occurs. They range from a single record to millions of records, and the data arrives from multiple sources in multiple formats. Additionally, the Storage Service's createOrUpdate API endpoint is configured by default to accept at most 500 records per request. As such, any ingestion workflow must determine how many records it needs to save and, if that number exceeds 500, batch its writes accordingly.
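The batching rule described above can be illustrated with a minimal Python sketch. The function name chunk_records and the constant DEFAULT_BATCH_SIZE are hypothetical; the value 500 is the default limit stated above, and a real deployment would read the limit from the Storage Service's configuration rather than hard-code it.

```python
from typing import Iterator, List, Sequence

# Assumption: 500 is the default limit quoted in this ADR; the actual limit is configurable.
DEFAULT_BATCH_SIZE = 500


def chunk_records(
    records: Sequence[dict], batch_size: int = DEFAULT_BATCH_SIZE
) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most batch_size records, preserving list order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield list(records[start:start + batch_size])


# Example: 1201 records need three requests under a limit of 500.
records = [{"id": f"record-{i}"} for i in range(1201)]
batch_sizes = [len(batch) for batch in chunk_records(records)]
print(batch_sizes)  # [500, 500, 201]
```

A workflow with 500 records or fewer produces a single batch, so the same code path covers both the single-record and the million-record case.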
However, the lowest common denominator across these use cases is a record that will be stored in OSDU via the Storage API. We can therefore design and build a DAG operator that receives a list of records, batches them according to the Storage Service's createOrUpdate configuration, performs the writes, captures the results, and makes them available via logging. This approach spares other ingestion workflows from implementing custom batching, which reduces code duplication and enables a move toward standardization.
Scope
- A single DAG Operator that has an expected set of inputs, outputs, and errors
- The DAG Operator will have the ability to receive a list of records, which it will batch and send to the Storage Service's createOrUpdate API endpoint
- The DAG Operator will write the records in the order provided by the list (starting with position 0, assuming a zero-based list)
- The DAG Operator will log the ID of each record and its outcome (success, error) using the XCom logging style used by Manifest Ingestion
- The DAG Operator will not handle Surrogate Keys (or should it?)
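The scope items above (ordered writes, batching, and a per-record outcome) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the operator's implementation: batch_write and the write_batch callable are hypothetical names, write_batch stands in for whatever client performs the createOrUpdate request, and the shape of each outcome dictionary is an assumption rather than the XCom format used by Manifest Ingestion.

```python
import logging
from typing import Callable, Dict, List, Sequence

logger = logging.getLogger("batch_write_sketch")

# Hypothetical: a callable that sends one batch and returns the IDs the service accepted.
WriteBatch = Callable[[List[dict]], List[str]]


def batch_write(
    records: Sequence[dict], write_batch: WriteBatch, batch_size: int = 500
) -> List[Dict[str, str]]:
    """Write records in list order, one batch at a time; return one outcome per record."""
    outcomes: List[Dict[str, str]] = []
    for start in range(0, len(records), batch_size):
        batch = list(records[start:start + batch_size])
        try:
            accepted = set(write_batch(batch))
            error = None
        except Exception as exc:
            # Assumption: a failed request marks every record in that batch as an error.
            accepted = set()
            error = str(exc)
        for record in batch:
            record_id = record.get("id", "<no id>")
            if record_id in accepted:
                outcome = {"id": record_id, "status": "success"}
            else:
                reason = error or "not accepted by the service"
                outcome = {"id": record_id, "status": "error", "reason": reason}
            logger.info("record %s: %s", record_id, outcome["status"])
            outcomes.append(outcome)
    return outcomes


def fake_write_batch(batch: List[dict]) -> List[str]:
    """Stand-in for the real service call; fails any batch that contains record-3."""
    if any(r["id"] == "record-3" for r in batch):
        raise RuntimeError("simulated failure")
    return [r["id"] for r in batch]


records = [{"id": f"record-{i}"} for i in range(5)]
outcomes = batch_write(records, fake_write_batch, batch_size=2)
statuses = [o["status"] for o in outcomes]
print(statuses)  # ['success', 'success', 'error', 'error', 'success']
```

In an Airflow operator, a list like outcomes could be returned from execute so that it is pushed to XCom, but whether that matches the Manifest Ingestion logging style would need to be checked against that implementation. The sketch continues with later batches after a failure; whether the operator should instead stop at the first failed batch is a design choice this ADR leaves open.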
Decision
- Create a common DAG Operator that can batch and write records to the Storage Service's createOrUpdate API endpoint.
Rationale
- This approach will standardize the writing step of ingestion, provide batching for the Storage Service's limit on createOrUpdate, and reduce code duplication by creating a reusable DAG Operator.
Consequences
- No consequences as the DAG Operator is optional. This ADR does not suggest making the use of the generic batch operator a requirement for DAG implementations.
When to revisit
- N/A
Tradeoff Analysis - Input to decision
- No tradeoffs as leveraging the DAG Operator is optional. Other ingestion workflows may opt to exclude it from their DAG.
Decision timeline
Decision ready to be made.