[Deployment and Operations] Define Service Level Indicators

Define a set of SLIs to inform where in the architecture/technology we want to capture metrics or add additional logging. This starts by:

identifying the critical user journeys for OSDU,
evaluating the reliability and risks associated with each, using both systems analysis and historical data,
defining the metrics (indicators) that would allow us to measure reliability,
identifying (or engineering in) points in the system that would allow us to calculate the SLIs, either through metrics (directly measured) or logging
...

Examples:

Ingestion latency (time from loading data to when it is available)
XX.X% of the Wellbores loaded should be available for search within XX minutes as measured from the time when the "Ingestion Service" is triggered to the time the metadata is indexed
Search performance (time from query until the first results are available)
XX.X% of searches should return first values within X seconds, as measured from the point the search service is invoked to the point the first records are returned
Service availability (how available/reliable key services should be)
XX.X% of service requests should be returned with an HTTP 200, as measured at the API Gateway
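
The examples above all reduce to a ratio of "good" events to total events over a measurement window. A minimal sketch of that calculation follows, assuming the latencies and status codes have already been collected from metrics or logs. The function names, the sample data, and the thresholds and targets used here are hypothetical placeholders; the actual XX.X values are still to be defined.

```python
from typing import Iterable


def fraction_within(durations_s: Iterable[float], threshold_s: float) -> float:
    """Share of events (0.0 to 1.0) whose duration is at or under the threshold.

    Fits the latency-style examples: ingestion latency, search performance.
    """
    values = list(durations_s)
    if not values:
        raise ValueError("no events in the measurement window")
    good = sum(1 for d in values if d <= threshold_s)
    return good / len(values)


def availability(status_codes: Iterable[int]) -> float:
    """Share of requests (0.0 to 1.0) answered with HTTP 200."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for c in codes if c == 200)
    return good / len(codes)


def meets_target(sli: float, target: float) -> bool:
    """True when the measured SLI is at or above the target ratio."""
    return sli >= target


if __name__ == "__main__":
    # Hypothetical sample data; real values would come from metrics or logs.
    ingestion_latencies_s = [120.0, 300.0, 95.0, 1800.0]
    gateway_status_codes = [200, 200, 500, 200, 200]

    # Placeholder threshold (10 minutes) and target (99%), not agreed values.
    ingestion_sli = fraction_within(ingestion_latencies_s, threshold_s=600.0)
    availability_sli = availability(gateway_status_codes)

    print("ingestion SLI:", ingestion_sli, meets_target(ingestion_sli, 0.99))
    print("availability SLI:", availability_sli, meets_target(availability_sli, 0.99))
```

Expressing each SLI as a good/total ratio keeps the three examples comparable and makes it clear which two timestamps (or which status field) have to be captured at each measurement point.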

Edited Apr 09, 2020 by Raj Kannan