Fix the search and indexing performance issues when the geometry of the document is large
Background:
Today, the geometries (also called shapes) in the indexed records are not decimated. The geometry data can be large, reaching tens if not hundreds of MB. The geometry in the search index is used to support spatial queries, data preview, and data discovery.
However, large geometries in the indexed records can significantly degrade the performance of retrieving search results and prevent the results from being used efficiently in some utilities, such as a GIS map. In O&G applications, the GIS map is a critical component that users rely on to render the shapes in a given region as a data discovery tool. It may need to retrieve and render thousands or even millions of shapes from the OSDU index. If tens of thousands of shapes have to be retrieved and rendered, the performance won't be good enough even if the shapes are decimated. On the other hand, it is unnecessary to show the full detail of the shapes when tens of thousands of indexed records are returned from the search.
Proposal:
We propose to decimate the original shape attribute and, if it exists, the shape attribute "data.VirtualProperties.DefaultLocation.Wgs84Coordinates" by implementing the Ramer–Douglas–Peucker algorithm for the following GeoJSON geometry types:
- LineString
- MultiLineString
- Polygon
- MultiPolygon
Regarding the shape attribute "data.VirtualProperties.DefaultLocation", please refer to the ADR "Common discovery within and across kinds".
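To make the proposal concrete, here is a minimal Python sketch of Ramer–Douglas–Peucker decimation applied to the four geometry types listed above. It is an illustration only and not the indexer implementation: the function names (`rdp`, `decimate_geometry`) are hypothetical, distances are measured as plain planar distances in degrees, the distance is taken to the segment rather than the infinite line (a common variant), and a ring that would collapse below four positions is left unchanged.

```python
import math
from typing import List, Sequence

Point = Sequence[float]  # GeoJSON position: [longitude, latitude, ...]


def _distance_to_segment(p: Point, a: Point, b: Point) -> float:
    """Planar distance from p to the segment a-b, in input units (degrees)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    if dx == 0 and dy == 0:
        # Degenerate segment (e.g. a closed ring whose first == last point).
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Project p onto the segment and clamp the projection to its end points.
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def rdp(points: Sequence[Point], epsilon: float) -> List[Point]:
    """Ramer-Douglas-Peucker; iterative so long lines cannot overflow the stack."""
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]
    while stack:
        first, last = stack.pop()
        max_dist, index = 0.0, -1
        for i in range(first + 1, last):
            d = _distance_to_segment(points[i], points[first], points[last])
            if d > max_dist:
                max_dist, index = d, i
        # Keep the farthest point only if it deviates more than the tolerance.
        if index != -1 and max_dist > epsilon:
            keep[index] = True
            stack.append((first, index))
            stack.append((index, last))
    return [p for p, k in zip(points, keep) if k]


def _decimate_ring(ring: Sequence[Point], epsilon: float) -> List[Point]:
    """Decimate a closed ring; fall back to the original if it would collapse."""
    simplified = rdp(ring, epsilon)
    return simplified if len(simplified) >= 4 else list(ring)


def decimate_geometry(geometry: dict, epsilon: float = 0.0001) -> dict:
    """Return a decimated copy of a GeoJSON geometry; other types pass through."""
    kind, coords = geometry["type"], geometry["coordinates"]
    if kind == "LineString":
        new_coords = rdp(coords, epsilon)
    elif kind == "MultiLineString":
        new_coords = [rdp(line, epsilon) for line in coords]
    elif kind == "Polygon":
        new_coords = [_decimate_ring(r, epsilon) for r in coords]
    elif kind == "MultiPolygon":
        new_coords = [[_decimate_ring(r, epsilon) for r in poly] for poly in coords]
    else:
        return geometry
    return {**geometry, "coordinates": new_coords}
```

For example, `decimate_geometry({"type": "LineString", "coordinates": [[0, 0], [1, 0.00001], [2, 0], [3, 1]]})` drops the second position, because it deviates from the line between its neighbours by less than the tolerance. Points and Multi-Point geometries are returned untouched, which matches the list of types in the proposal.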
Performance Evaluation:
We evaluated the performance of a prototype that decimates the original shape attribute and the shape attribute "data.VirtualProperties.DefaultLocation.Wgs84Coordinates", using some seismic 2D surveys. The tolerance (epsilon) is about 10 meters, which is about 0.0001 degree at the equator.
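The relation between the tolerance in meters and in degrees can be checked with a short calculation. The sketch below assumes a spherical Earth with the mean radius of about 6,371 km; the helper name `meters_to_degrees` is hypothetical and is not part of the prototype. On that model one degree spans roughly 111 km, so 10 meters is about 0.00009 degree and 0.0001 degree is roughly 11 meters at the equator.

```python
import math
from typing import Tuple

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical model is an assumption


def meters_to_degrees(meters: float, latitude_deg: float = 0.0) -> Tuple[float, float]:
    """Return (delta_latitude, delta_longitude) in degrees for a ground distance."""
    meters_per_degree = math.pi * EARTH_RADIUS_M / 180.0
    delta_lat = meters / meters_per_degree
    # A degree of longitude shrinks with the cosine of the latitude.
    delta_lon = meters / (meters_per_degree * math.cos(math.radians(latitude_deg)))
    return delta_lat, delta_lon


d_lat, d_lon = meters_to_degrees(10.0)         # at the equator
_, d_lon_60 = meters_to_degrees(10.0, 60.0)    # at 60 degrees latitude
```

Because a degree of longitude gets shorter away from the equator, a fixed epsilon in degrees corresponds to a smaller east-west ground distance at higher latitudes; at 60 degrees latitude, 0.0001 degree of longitude is only about half as long as at the equator.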
The information of the test dataset and summary of the test report are attached below:
Summary:
- The decimation of the shape attributes significantly improves the end-to-end search performance (search and data retrieval from Elasticsearch to the test client).
- The extra overhead of the decimation during indexing is offset by the time saved on Elasticsearch indexing of the geo-shapes. The test result indicates that decimation reduced the overall indexing time by 58%.