Refactor queryRecordsInBatch to broadly support varying batch sizes
## Type of change
- Bug Fix
- Feature
## Please provide link to gitlab issue or ADR (Architecture Decision Record)
- #577 (closed)
## Does this introduce a change in the core logic?
- [YES/NO]
## Does this introduce a breaking change?
- [YES/NO]
## What is the current behavior?
- The code depends on `batchSize` being either less than 1000 or a multiple of 1000.
## What is the new/expected behavior?
- `batchSize` may be any value greater than 0, and the batches will adjust accordingly (see the sketch below).
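For illustration, here is a minimal sketch of how an arbitrary `batchSize` now maps onto OSDU Search's 1000-record query cap. `SubBatchPlanner` and `planLimits` are hypothetical names for illustration only, not code from this MR:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not the actual Transformer code.
class SubBatchPlanner {
    private static final int MAX_QUERY_LIMIT = 1000; // OSDU Search cap per query

    /** Splits a batch of any positive size into query limits of at most 1000. */
    static List<Integer> planLimits(int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be greater than 0");
        }
        List<Integer> limits = new ArrayList<>();
        for (int remaining = batchSize; remaining > 0; ) {
            int limit = Math.min(remaining, MAX_QUERY_LIMIT);
            limits.add(limit);
            remaining -= limit;
        }
        return limits;
    }

    public static void main(String[] args) {
        System.out.println(planLimits(1005)); // [1000, 5]
        System.out.println(planLimits(500));  // [500]
        System.out.println(planLimits(3000)); // [1000, 1000, 1000]
    }
}
```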
## Any other useful information
### Batch Size - What is it?
- `batchSize` dictates the number of records read from OSDU before they are turned over for processing by the Transformer.
- Lifecycle of a batch (a sketch follows this list):
  - Ingestion from OSDU
    - Ingestion happens through OSDU Search, which is capped at a limit of 1000. So if `batchSize` is greater than 1000, we must sub-batch the ingestion queries until the total number of ingested records meets the `batchSize`.
    - The sub-batching is represented by Search#queryRecordsInBatch, whereas the larger batch lifecycle is captured by FeatureCacheSynchronizerHelper#synchronizeInBatch.
  - Process all records (conversion to GeoJSON, etc.)
  - Load records into the Ignite cache
- `batchSize` must be set at the Transformer level, but can optionally be set on a per-kind level.
  - If `batchSize` is set on a kind, it overrides the `batchSize` set by the Transformer.
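As a rough sketch of this lifecycle and the per-kind override: aside from the method names `queryRecordsInBatch` and `synchronizeInBatch`, the interfaces and signatures below are illustrative assumptions, not the actual Transformer code:

```java
import java.util.List;

// Illustrative sketch of the batch lifecycle; aside from the names
// queryRecordsInBatch and synchronizeInBatch, everything here is assumed.
class BatchLifecycleSketch {

    record Record(String id, String geoJson) {}

    interface Search {
        // Ingests up to batchSize records, sub-batching queries of <= 1000 each.
        List<Record> queryRecordsInBatch(String kind, int batchSize);
    }

    /** A per-kind batchSize, when set, overrides the Transformer-level value. */
    static int resolveBatchSize(Integer kindBatchSize, int transformerBatchSize) {
        return kindBatchSize != null ? kindBatchSize : transformerBatchSize;
    }

    static void synchronizeInBatch(Search search, String kind, int batchSize) {
        List<Record> batch;
        do {
            // 1. Ingestion from OSDU (sub-batched inside queryRecordsInBatch)
            batch = search.queryRecordsInBatch(kind, batchSize);
            // 2. Process all records (conversion to GeoJSON, etc.)
            // 3. Load the processed records into the Ignite cache
        } while (batch.size() == batchSize); // a short batch means OSDU is exhausted
    }
}
```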
### Example
- Configuration: `batchSize` is 1005
- Batch lifecycle: FeatureCacheSynchronizerHelper#synchronizeInBatch will call getData() on a kind with the specified `batchSize` of 1005
  - Per-batch sub-batching: Search#queryRecordsInBatch will attempt to ingest 1005 records from OSDU, but must do so with a max limit of 1000 per query (see the sketch after this list).
    - It will first make a query to retrieve 1000 records.
    - It will then use the resulting cursor to retrieve the next 5 records.
    - If the cursor has expired, it will query with an offset of 1000 and a limit of 5 to retrieve the next 5 records.
- The batch has now been collected and is processed in bulk.
- The next batch lifecycle of 1005 is started.
- These batches continue until there are no more records to ingest.
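A sketch of the sub-batching and the offset fallback described above. The client methods `queryWithCursor`/`queryWithOffset` and the exception type are assumed stand-ins for the real OSDU Search calls:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Search#queryRecordsInBatch; the query methods and types are
// illustrative stand-ins for the real OSDU Search client API.
abstract class QueryRecordsInBatchSketch {
    static final int MAX_QUERY_LIMIT = 1000; // OSDU Search cap per query

    record CursorPage(List<String> results, String cursor) {}
    static class CursorExpiredException extends Exception {}

    abstract CursorPage queryWithCursor(String kind, int limit, String cursor)
            throws CursorExpiredException;
    abstract List<String> queryWithOffset(String kind, int limit, int offset);

    List<String> queryRecordsInBatch(String kind, int batchSize) {
        List<String> records = new ArrayList<>(batchSize);
        String cursor = null;       // null on the first sub-batch query
        boolean cursorValid = true;
        while (records.size() < batchSize) {
            int limit = Math.min(batchSize - records.size(), MAX_QUERY_LIMIT);
            List<String> page;
            if (cursorValid) {
                try {
                    CursorPage cp = queryWithCursor(kind, limit, cursor);
                    cursor = cp.cursor();
                    page = cp.results();
                } catch (CursorExpiredException e) {
                    cursorValid = false; // fall back to offset-based paging
                    continue;            // retry this sub-batch with an offset
                }
            } else {
                // e.g. offset 1000, limit 5 for the tail of a 1005-record batch
                page = queryWithOffset(kind, limit, records.size());
            }
            records.addAll(page);
            if (page.size() < limit) break; // OSDU has no more records
        }
        return records;
    }
}
```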
### Cursor Expiration
- During my testing, GLAB OSDU Search was very slow, taking over 40 seconds to query 1000 Wellbore records.
- With a `batchSize` over 1000, I found the cursor would frequently expire, and our code did not have a fallback.