ADR: Enhance Index Augmenter to support functions for data processing/calculation
The Index Augmenter was first proposed by @gehrmann in "ADR: Configurable Index Extensions and De-Normalizations" and was first delivered in M18. It went through several iterations and has been stable since M20. It has become a common way to de-normalize data in the OSDU index, enriching both search capability and data preview.
The original Index Augmenter focuses on data de-normalization. In this ADR, we propose extending it with a function framework that allows developers to add functions that process data from one or more source properties and generate aggregated results, further supporting search and data preview.
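To make the idea of a function framework more concrete, here is a minimal sketch of one possible shape: a registry that maps a function name to a callable, plus a helper that resolves the source properties and passes their values to that callable. Every name here (`register`, `apply_function`, `get_path`, the `Concat` function, and the sample property names) is hypothetical and chosen for illustration only; the ADR does not specify the actual interfaces.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry: function name -> callable that receives the values
# of its source properties and returns one derived value.
AugmenterFunction = Callable[..., Any]
_REGISTRY: Dict[str, AugmenterFunction] = {}


def register(name: str) -> Callable[[AugmenterFunction], AugmenterFunction]:
    """Decorator that adds a function to the registry under `name`."""
    def decorator(func: AugmenterFunction) -> AugmenterFunction:
        _REGISTRY[name] = func
        return func
    return decorator


def get_path(record: Dict[str, Any], path: str) -> Any:
    """Resolve a dotted property path such as 'data.Name'; None if missing."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def apply_function(name: str, record: Dict[str, Any], source_paths: List[str]) -> Any:
    """Look up `name` and call it with the values found at `source_paths`."""
    func = _REGISTRY[name]
    return func(*(get_path(record, p) for p in source_paths))


@register("Concat")  # illustrative only, not a function proposed by this ADR
def concat(*values: Any) -> str:
    """Join all non-missing source values with a single space."""
    return " ".join(str(v) for v in values if v is not None)


# Usage: derive one value from two source properties (sample data is made up).
record = {"data": {"Name": "Well-1", "Field": "North"}}
derived = apply_function("Concat", record, ["data.Name", "data.Field"])
```

A registry keyed by name keeps the configuration declarative: a rule only needs to name a function and list its source paths, and new functions can be added without touching the dispatch code.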
Furthermore, under this new function framework, we will implement three functions that GIS workflows need in order to filter by object size:
- Extent: Compute the extent of a shape in WGS84 geodetic coordinates.
- Len: Compute the (total) length of a polyline or multi-polyline shape in WGS84 geodetic coordinates.
- Area: Compute the (total) area of a polygon or multi-polygon shape in WGS84 geodetic coordinates.
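The three computations can be sketched as follows. This is a rough illustration under stated assumptions, not the proposed implementation: it treats "extent" as a bounding box, takes positions as GeoJSON-style (longitude, latitude) pairs in degrees, and approximates the Earth as a sphere with the mean radius instead of using the WGS84 ellipsoid. Length uses the haversine formula, and area applies the shoelace formula in a Lambert cylindrical equal-area projection. Shapes that cross the antimeridian are not handled, and all function names are made up for this example.

```python
import math
from typing import Sequence, Tuple

Position = Tuple[float, float]  # (longitude, latitude) in degrees
EARTH_RADIUS_M = 6371008.8  # mean Earth radius; spherical approximation (assumption)


def extent(points: Sequence[Position]) -> Tuple[float, float, float, float]:
    """Bounding box as (min_lon, min_lat, max_lon, max_lat)."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    return (min(lons), min(lats), max(lons), max(lats))


def haversine_m(a: Position, b: Position) -> float:
    """Great-circle distance in metres between two positions."""
    lon1, lat1, lon2, lat2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, h)))


def length_m(lines: Sequence[Sequence[Position]]) -> float:
    """Total length in metres of a multi-polyline (a list of polylines)."""
    return sum(haversine_m(p, q) for line in lines for p, q in zip(line, line[1:]))


def ring_area_m2(ring: Sequence[Position]) -> float:
    """Area of one closed ring (first position repeated at the end).

    Projects to Lambert cylindrical equal-area (x = R*lon, y = R*sin(lat))
    and applies the shoelace formula; edges are straight in the projection.
    """
    xs = [EARTH_RADIUS_M * math.radians(lon) for lon, _ in ring]
    ys = [EARTH_RADIUS_M * math.sin(math.radians(lat)) for _, lat in ring]
    twice_area = sum(xs[i] * ys[i + 1] - xs[i + 1] * ys[i]
                     for i in range(len(ring) - 1))
    return abs(twice_area) / 2


def area_m2(polygons: Sequence[Sequence[Sequence[Position]]]) -> float:
    """Total area of a multi-polygon: each outer ring minus its holes."""
    total = 0.0
    for rings in polygons:
        total += ring_area_m2(rings[0]) - sum(ring_area_m2(h) for h in rings[1:])
    return total


# Usage: a 1-degree by 1-degree box at the equator.
box = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
box_extent = extent(box)
box_perimeter_m = length_m([box])
box_area_m2 = area_m2([[box]])
```

A production implementation would more likely rely on a geodesic library that works on the WGS84 ellipsoid; the spherical formulas above are only meant to show what each function takes as input and returns.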
The driving use case for this ADR is the desire to add these attributes to the following kinds:
- osdu:wks:work-product-component--SeismicTraceData
- osdu:wks:work-product-component--SeismicBinGrid
- osdu:wks:work-product-component--SeismicLineGeometry
- osdu:wks:master-data--Wellbore
- osdu:wks:work-product-component--WellboreTrajectory
For clarity:
The proposed feature is opt-in and is only enabled when a client adds augmentation configuration rules to define attributes for a given kind. This configuration is applied per partition. M27 itself does not ship with any preconfigured augmentation attributes.
Once an augmented attribute is configured, it is populated only for new data or when existing data is updated or reindexed. This follows the same mechanism already used by the rest of the augmentation system.
If reindexing is required, the recommended practice is to use force-clean=false and, for large datasets, to throttle the process over time by using the reindex-per-record-ID API.
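Throttling over time can be sketched as a loop that sends record IDs in small batches with a pause between them. This is an illustration only: `send_batch` is a hypothetical placeholder for whatever client code calls the reindex-per-record-ID API (its URL and payload are not given here), and the batch size and pause length are arbitrary values that would need tuning per environment.

```python
import time
from typing import Callable, Iterator, List, Sequence


def batches(record_ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `record_ids` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(record_ids), batch_size):
        yield list(record_ids[start:start + batch_size])


def throttled_reindex(
    record_ids: Sequence[str],
    send_batch: Callable[[List[str]], None],  # hypothetical API call wrapper
    batch_size: int = 100,                    # arbitrary default (assumption)
    pause_seconds: float = 5.0,               # arbitrary default (assumption)
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send record IDs batch by batch, pausing between batches.

    Returns the number of record IDs handed to `send_batch`.
    """
    sent = 0
    for index, batch in enumerate(batches(record_ids, batch_size)):
        if index > 0:
            sleep(pause_seconds)  # spread the indexing load over time
        send_batch(batch)
        sent += len(batch)
    return sent
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays, for example by supplying a function that only records the requested pauses.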