ADR: Processing large volumes of data
Introduction
The Schema Upgrade service assists in updating the records of the OSDU platform to align with newer versions of the schema. All records within the OSDU platform are designed to conform to a particular schema type. Whenever this schema is updated to incorporate additional features or attributes, the schema version is incremented. To accommodate this change, the records on the OSDU platform need to be upgraded to comply with the updated schema version.
The ability to upgrade the schema of a large number of records efficiently is necessary for the service to be useful.
Current Problem
- The current approach works well with a limited set of records
- If we have more than 10K records to migrate, this approach is expected to time out or take hours to upgrade the records
- If something fails during the upgrade, the process does not resume from where it left off
- There is no way to track how many records were upgraded and how many of them failed
- This approach can overload the Search and Storage services
Proposed Design
The short-term solution is to use the search-with-cursor endpoint of the Search service. This avoids the 10k limit, although it can have issues of its own (see osdu/platform/system/search-service#157).
The better long-term solution is to split the Schema Upgrade service into two pods: one for the API and one for batch workers. This should improve scalability.
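A minimal sketch of cursor-based paging is shown below, assuming the Search service's query_with_cursor path and request/response shape; the service URL, data partition id, token handling, and JSON assembly are placeholders, not the final implementation.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: fetch one page of record ids from the Search service using a cursor.
// SEARCH_URL, the partition id, and the raw-string JSON handling are assumptions.
public class CursorSearchSketch {
    private static final String SEARCH_URL =
            "https://osdu.example.com/api/search/v2/query_with_cursor";
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static String fetchPage(String kind, String cursor, String token) throws Exception {
        String cursorField = cursor == null ? "" : ", \"cursor\": \"" + cursor + "\"";
        String body = "{\"kind\": \"" + kind + "\", \"limit\": 1000" + cursorField + "}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(SEARCH_URL))
                .header("Authorization", "Bearer " + token)
                .header("data-partition-id", "osdu")
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        // The response body carries the results plus a cursor to pass into
        // the next call; paging ends when no further results come back.
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```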
Schema Upgrade API Pod:
- There will be two additional endpoints: one for triggering a batch schema upgrade (1) and another for checking batch job status (2). A possible shape is sketched below.
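The following is a hypothetical sketch of the two endpoints (1) and (2) as a Spring controller (OSDU services are Spring-based); the paths, payloads, and in-line stub behavior are assumptions, not the final API.

```java
import java.util.UUID;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

// Hypothetical endpoint shapes; request/response bodies are stubbed.
@RestController
@RequestMapping("/api/schema-upgrade/v1")
public class BatchUpgradeController {

    // (1) Trigger a batch schema upgrade; returns a batch id the caller can poll.
    @PostMapping("/batch")
    public ResponseEntity<String> triggerBatchUpgrade(@RequestBody String upgradeRequest) {
        String batchId = UUID.randomUUID().toString();
        // here: enqueue the job for the batch workers
        return ResponseEntity.accepted().body(batchId);
    }

    // (2) Check batch job status by id, backed by the Helper Database (7).
    @GetMapping("/batch/{batchId}/status")
    public ResponseEntity<String> getBatchStatus(@PathVariable String batchId) {
        // here: look up the batch and per-record statuses in the Helper Database
        return ResponseEntity.ok(
                "{\"batchId\": \"" + batchId + "\", \"status\": \"IN_PROGRESS\"}");
    }
}
```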
Batch Workers Pod:
- Batch Workers will be asynchronous jobs.
- The first one (4) will fetch the ids of records to update, divide them into batch items, and add them to the queue (5). The batch size should be configurable and provided at the infrastructure level (e.g. in Terraform or Helm); see the partitioning sketch after this list.
- The second one (6) will perform all the transformations and record updates. In case of 5xx errors when communicating with the Storage service, the worker should trigger re-processing with exponential back-off (see the retry sketch after this list). In case of 4xx errors from the Storage service, Schema Upgrade internal errors, or re-processing failure, the batch of failed record ids will be pushed to a dead letter queue (8).
- The batch id and the list of record ids, along with their statuses, will be stored in a Helper Database (7), which will allow users to check job status at any time. The Helper Database should be implemented using the CSP document store (e.g. DynamoDB for AWS, or PostgreSQL as in the Community Implementation); a possible data model is sketched after this list. See the discussion below in "Open Issues".
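A minimal sketch of the partitioning step in worker (4), assuming the batch size arrives from configuration (e.g. a value injected from Helm or Terraform) and that the queue client is abstracted behind a Consumer; the class and parameter names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of worker (4): split record ids into batches of a configured size
// and push each batch onto the queue (5). The Consumer stands in for
// whatever queue client the CSP provides (SQS, Pub/Sub, RabbitMQ, ...).
public class BatchPartitioner {

    static void enqueueInBatches(List<String> recordIds,
                                 int batchSize,
                                 Consumer<List<String>> queue) {
        for (int from = 0; from < recordIds.size(); from += batchSize) {
            int to = Math.min(from + batchSize, recordIds.size());
            queue.accept(new ArrayList<>(recordIds.subList(from, to)));
        }
    }
}
```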
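The retry policy for worker (6) could look roughly like the sketch below: retry on 5xx responses from the Storage service with exponential back-off, and route the batch to the dead letter queue (8) on 4xx responses, internal errors, or retry exhaustion. The StorageCall interface and the two exception types are assumptions standing in for the real Storage client and its error signaling.

```java
import java.util.List;
import java.util.function.Consumer;

// Sketch of worker (6) error handling; exception types stand in for the
// real Storage client's 5xx/4xx signaling.
public class UpgradeRetrySketch {

    interface StorageCall { void run() throws ServerException, ClientException; }
    static class ServerException extends Exception {}  // stands for a 5xx response
    static class ClientException extends Exception {}  // stands for a 4xx response

    static void upgradeBatch(List<String> batch,
                             StorageCall call,
                             Consumer<List<String>> deadLetterQueue,
                             int maxAttempts) throws InterruptedException {
        long backoffMillis = 500;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                call.run();
                return;                            // success, nothing else to do
            } catch (ServerException e) {          // 5xx: retry with back-off
                if (attempt == maxAttempts) break; // retries exhausted
                Thread.sleep(backoffMillis);
                backoffMillis *= 2;                // exponential back-off
            } catch (ClientException | RuntimeException e) {
                break;                             // 4xx or internal error: no retry
            }
        }
        deadLetterQueue.accept(batch);             // push failed batch to DLQ (8)
    }
}
```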
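One possible shape for the Helper Database (7), sketched as JPA entities for the PostgreSQL variant mentioned above; the table, column, and status names are assumptions, not an agreed data model.

```java
import javax.persistence.*;

// Hypothetical Helper Database model: one row per batch job, plus one row
// per record tracking its individual upgrade status.
@Entity
@Table(name = "batch_job")
class BatchJob {
    @Id
    String batchId;
    String status;            // e.g. PENDING, IN_PROGRESS, COMPLETED, FAILED
}

@Entity
@Table(name = "batch_record_status")
class BatchRecordStatus {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    Long id;
    String batchId;           // references batch_job.batch_id
    String recordId;
    String status;            // e.g. QUEUED, UPGRADED, FAILED
}
```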
Nice to have features (3):
- TBD whether we need a separate endpoint for batch rollback, or whether similar functionality can be achieved by reusing the batch schema upgrade.
- There is also an option to create a cron job for checking batch job statuses, e.g. for email notifications.
Open Issues
- We need to determine who will be responsible for investigating why certain records end up in the DLQ, and design tools accordingly, e.g. expose the DLQ through an API or send notifications.
- Many of the data structures in use in the schema upgrade service are stored as platform resources. This means:
  - they have a schema (reviewed and approved)
  - they are ingested or loaded via the Storage service
  - they are indexed by the Indexer
  - they are searchable by platform users
The current thinking is that the Helper Database (7) should exist solely to support the upgrade service, and that job status records should NOT be loaded into the Storage service. This guarantees that transactional support is available and that custom SQL queries can be run, neither of which would be available if the records were platform resources. It also enforces encapsulation of information that should not be exposed beyond the boundary of the schema upgrade service. This follows the pattern used by the Entitlements service and its database, as shown in the bare metal implementation diagram: https://community.opengroup.org/osdu/platform/deployment-and-operations/infra-gcp-provisioning/-/raw/master/img/OSDU%20Bare%20Metal%20Architecture.png.
The other service using this pattern is the Legal service, where legal tags are stored outside the platform, although currently in a non-SQL database (MongoDB and/or DynamoDB). See:
https://community.opengroup.org/osdu/platform/ci-cd-pipelines/-/blob/master/cloud-providers/aws-mongodb-global.yml?ref_type=heads
https://community.opengroup.org/osdu/platform/security-and-compliance/legal/-/blob/maste[…]ess/mongodb/repository/LegalTagRepositoryMongoDBImpl.java