To deliver the requirements and techniques for managing the OSDU data platform in an enterprise setting, in particular handling audit logs, collecting operational metrics/SLIs, and providing tracing for performance and auditability.
- Original list of metrics collected by Action Pack pre-Mercury - [here](https://gitlab.opengroup.org/osdu/community-import/program-activities/-/blob/master/Reporting_Dashboarding/Design%20documentation/OperationsDashFullList.xlsx)
- Mercury preship team reported issues - [here](https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/pre-shipping/home/-/issues)
- for more details - please review the [**README here**](https://community.opengroup.org/osdu/platform/deployment-and-operations/audit-and-metrics/-/blob/master/README.md)
- top-level issue for requirements - [issue detail](https://community.opengroup.org/osdu/platform/deployment-and-operations/audit-and-metrics/-/issues/1)
- operator feedback on requirements - captured [here](OperatorFeedback)
| SLI | Definition |
| --- | --- |
| Latency & performance | The % of requests that complete faster than a target |
| Freshness | The % of data that is up to date |
| Correctness | The % of requests that return the correct result |
| Durability | The % of recorded data that can be read back successfully |
Some of these examples were already provided in the Ops Procedures under SLIs.
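As a rough illustration of how ratio-style SLIs like these can be computed from request logs, the sketch below uses a hypothetical per-request record; the field names, thresholds, and sample values are assumptions for illustration, not part of the OSDU platform.

```python
from dataclasses import dataclass

# Hypothetical per-request record; field names are illustrative only.
@dataclass
class Request:
    latency_ms: float   # observed end-to-end latency
    correct: bool       # response matched the expected result (e.g. from a probe)

def ratio(numerator: int, denominator: int) -> float:
    """Return a percentage, guarding against an empty evaluation window."""
    return 100.0 * numerator / denominator if denominator else 0.0

def latency_sli(requests, target_ms):
    """% of requests that complete faster than the latency target."""
    return ratio(sum(r.latency_ms < target_ms for r in requests), len(requests))

def correctness_sli(requests):
    """% of requests that return the correct result."""
    return ratio(sum(r.correct for r in requests), len(requests))

# Illustrative window of three requests against a 5-second latency target.
window = [Request(1200, True), Request(6300, True), Request(900, False)]
print(f"latency SLI:     {latency_sli(window, 5000):.1f}%")
print(f"correctness SLI: {correctness_sli(window):.1f}%")
```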
**Common metrics include:**
- **Latency**: every service
- Service call latency (request/response) for every service. Given the variability of the cloud environment, these are typically evaluated as percentiles. Example: 95% of calls to the search service should return an initial result in less than 5 seconds (a percentile-calculation sketch follows this list).
- **Throughput**: every service involved in a multi-record, batch-oriented activity such as Indexing, Storage, Ingestion, etc.
- Example metric: re-indexing should be able to sustain 1 million records per hour
- **Freshness**: this typically involves multiple services in a workflow, including Ingestion, Storage, Indexing, Consumption zones, etc.
- Example: 80% of records ingested into the system should be indexed and searchable within [small minutes]
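A minimal sketch of how the percentile-style latency SLO above might be evaluated over a window of observed search latencies; the nearest-rank method, the 5-second target, and the sample values are assumptions chosen for illustration.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    ordered = sorted(samples)
    rank = min(len(ordered), max(1, round(pct / 100.0 * len(ordered))))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical search-service latencies collected over an evaluation window.
search_latencies_ms = [850, 1200, 4300, 980, 5600, 760, 2100, 1500, 3100, 940]
p95 = percentile(search_latencies_ms, 95)
meets_slo = p95 < 5000  # SLO: 95% of search calls return an initial result in < 5 s
print(f"p95 = {p95} ms, SLO met: {meets_slo}")
```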
**Common scalability metrics include:**
These ensure the system scales, i.e. performance does not degrade with content size or activity level.
- **Content**
- Number of kinds and number of records per kind. This is to ensure that Searching and Storage performance do not degrade as more content is added to the system.
- Example: search should have similar latency whether there are 50,000 records or 50,000,000 records.
- **Concurrency**
- How many activities are happening at the same time.
- Example: is search and delivery performance impacted during ingestion (while storing and indexing new records)?
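One way to make the concurrency check concrete is to compare the search latency distribution with and without an ingestion run in progress. The sketch below is an assumed approach with made-up sample values and an assumed 25% tolerance, not an OSDU-defined test.

```python
from statistics import quantiles

def p95(samples):
    """Approximate 95th-percentile latency (ms) using statistics.quantiles."""
    return quantiles(samples, n=20)[-1]

# Hypothetical search latencies: a quiet system vs. the same queries during bulk ingestion.
quiet = [850, 920, 1100, 780, 990, 1050, 870, 940, 1010, 890]
during_ingestion = [1600, 2100, 1900, 2400, 1750, 2250, 1820, 2050, 1980, 2300]

# Assumed tolerance: flag a concurrency problem if p95 under load is >25% worse than baseline.
TOLERANCE = 1.25
base, load = p95(quiet), p95(during_ingestion)
print(f"baseline p95 = {base:.0f} ms, during ingestion p95 = {load:.0f} ms")
print("concurrency impact detected" if load > TOLERANCE * base else "within tolerance")
```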
**Reliability / availability metrics**
- Error Rates for both infrastructure and platform services
- These should be captured even if the service recovers through a retry; they can be indicators of future problems or introduce variability in performance.
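To illustrate why retried attempts still matter, the sketch below counts errors at the attempt level, so a call that eventually succeeds after a retry still contributes to the error rate; the attempt record and sample data are assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical record of a single attempt against a service (one logical call may span several).
@dataclass
class Attempt:
    service: str
    failed: bool

def error_rate(attempts, service):
    """Fraction of attempts against a service that failed, counting retried attempts."""
    relevant = [a for a in attempts if a.service == service]
    return sum(a.failed for a in relevant) / len(relevant) if relevant else 0.0

# Example: the storage call succeeded overall, but only after one failed attempt.
attempts = [
    Attempt("storage", failed=True),   # first attempt failed ...
    Attempt("storage", failed=False),  # ... retry succeeded
    Attempt("search", failed=False),
]
print(f"storage attempt error rate: {error_rate(attempts, 'storage'):.0%}")
```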
#### Initial Recommended KPIs
- General
- Uptime for each service, as measured by non-500 responses from that service (a counting sketch follows this list).
- Data flow in
- Ingestion failures – number of ingestion service requests that end in failure over time
- Ingestion workflow throughput – number of data items processed through the ingestion, workflow, DAG (parsers), storage, entitlement flow of data. Index/Search is out of band from this sequence due to loose coupling through events.
- Storage service performance for creation of new records/versions, measured by the % of requests that complete in less than a given time
- Data correctness for FoR conversions, measured by the number of successful requests over time (regular and probe)
- Elastic cluster uptime and the number of failures from the indexer service to refresh indexes over time
- DDMS service-specific – latency for requests to get or create data items via the DDMS API (GetLog vs. CreateLog, for example)
- Data flow out
- Search service performance for queries as measured by response latency over time (normalized to number of entities returned/total).
- Maximum size of notification and indexer queues over time – larger queues can result in failed indexing
- Storage retrieval performance for providing records/versions, measured by the % of requests that complete in less than a given time
- Number of enrichment jobs failing over time (workflow DAGs for WKS/OSDU canonical model creation)
- Security
- Number of 4xx errors reported by services over a given period of time. This can be an indication of the system being hit without authorization, or of valid requests failing authorization.
- Though technically not a metric, we need a way to alert operations/SRE about certificate expiry, key rotation, and related secrets management.
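A small counting sketch for two of the KPIs above: deriving a per-service availability figure from response status codes (the "non-500" uptime measure) and counting 4xx responses as the security signal. The log format, sample data, and alert threshold are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical (service, HTTP status) samples scraped from access logs over a window.
responses = [
    ("search", 200), ("search", 200), ("search", 500),
    ("storage", 201), ("storage", 403), ("storage", 401),
    ("indexer", 200), ("indexer", 200),
]

def availability(samples, service):
    """% of responses from a service that are not 500s (per the uptime KPI above)."""
    codes = [status for svc, status in samples if svc == service]
    return 100.0 * sum(status != 500 for status in codes) / len(codes) if codes else 0.0

def client_error_counts(samples):
    """Count of 4xx responses per service over the window (security signal)."""
    return Counter(svc for svc, status in samples if 400 <= status < 500)

ALERT_4XX_THRESHOLD = 2  # assumed threshold for raising an operations alert
for svc in ("search", "storage", "indexer"):
    print(f"{svc}: availability {availability(responses, svc):.1f}%")
for svc, count in client_error_counts(responses).items():
    if count >= ALERT_4XX_THRESHOLD:
        print(f"ALERT: {svc} returned {count} 4xx responses in the window")
```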