[Deployment and Operations] Define Service Level Indicators
Define a set of SLIs to inform where in the architecture/technology we want to capture metrics or add additional logging. This starts by:
- identifying the critical user journeys for OSDU,
- evaluating the reliability and risks associated with each using both systems analysis and historical data
- defining the metrics (indicators) that would allow us to measure reliability
- identifying (or engineering in) points in the system that would allow us to calculate the SLIs, either through metrics (directly measured) or logging
- ...
Examples:
- Ingestion latency (time from loading data to when it is available)
  XX.X% of the Wellbores loaded should be available for search within XX minutes, as measured from the time the "Ingestion Service" is triggered to the time the metadata is indexed
- Search performance (time from query to first results availability)
  XX.X% of searches should return first values within X seconds, as measured from the point the search service is invoked to the point the first records are returned
- Service availability (how available/reliable key services should be)
  XX.X% of the service requests should be returned with an HTTP 200, as measured at the API Gateway
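
Each example above reduces to the same calculation: the share of "good" events out of all events in a measurement window. A minimal sketch of that calculation follows, assuming the latencies and status codes have already been collected from metrics or logs. The function names, the 10-minute threshold, and the sample values are hypothetical and for illustration only; they are not proposed targets or existing OSDU code.

```python
from typing import Sequence


def fraction_within(latencies: Sequence[float], threshold: float) -> float:
    """Share of events whose latency is at or below the threshold.

    Latencies and threshold must use the same unit (e.g. minutes).
    """
    if not latencies:
        raise ValueError("no events to evaluate")
    good = sum(1 for value in latencies if value <= threshold)
    return good / len(latencies)


def availability(status_codes: Sequence[int]) -> float:
    """Share of requests answered with HTTP 200."""
    if not status_codes:
        raise ValueError("no requests to evaluate")
    good = sum(1 for code in status_codes if code == 200)
    return good / len(status_codes)


# Made-up sample data, for illustration only.
ingestion_minutes = [4.0, 7.5, 12.0, 3.2]      # trigger -> indexed, per Wellbore
gateway_statuses = [200, 200, 500, 200, 200]   # as seen at the API Gateway

ingestion_sli = fraction_within(ingestion_minutes, threshold=10.0)
availability_sli = availability(gateway_statuses)

print(f"ingestion SLI: {ingestion_sli:.1%}")        # 3 of 4 within threshold
print(f"availability SLI: {availability_sli:.1%}")  # 4 of 5 returned HTTP 200
```

The same `fraction_within` helper would cover the search performance example by passing time-to-first-result values and a threshold in seconds.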