|
|
To deliver the requirements and techniques for managing the OSDU data platform in an enterprise setting, in particular handling audit logs, operational metrics/SLIs, and providing tracing for performance and auditability.
|
|
|
|
|
|
#### Strongly Recommended Read
|
|
|
- Operator requirements [here](https://community.opengroup.org/osdu/platform/deployment-and-operations/audit-and-metrics/-/wikis/OperatorFeedback)
|
|
|
- Observability workshop operator feedback: [here](https://docs.google.com/document/d/1_mUi3PYi7goUxRCc0zJAETBBwxd3lb8Jgo1egAk2gwQ/edit?usp=sharing)
|
|
|
- Original list of metrics collected by Action Pack pre-Mercury - [here](https://gitlab.opengroup.org/osdu/community-import/program-activities/-/blob/master/Reporting_Dashboarding/Design%20documentation/OperationsDashFullList.xlsx)
|
|
|
- Mercury preship team reported issues - [here](https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/pre-shipping/home/-/issues)
|
|
|
- For more details, please review the [**README here**](https://community.opengroup.org/osdu/platform/deployment-and-operations/audit-and-metrics/-/blob/master/README.md)
|
|
|
- Top-level issue for requirements - [issue detail](https://community.opengroup.org/osdu/platform/deployment-and-operations/audit-and-metrics/-/issues/1)
|
|
|
- Operator feedback on requirements - captured [here](OperatorFeedback)
|
|
|
|
|
|
### Organization
|
|
|
|
|
|
Project Lead: @nidhifotedar
|
|
|
Project Lead: Srini R (LTI)
|
|
|
Project Devs: TBD
|
|
|
Project SMEs: @stephenwhitley @rbouter
|
|
|
Prior Leads: @nidhifotedar @stephenwhitley @rbouter @rveraart
|
|
|
|
|
|
**Voting Committers:**
|
|
|
|
The Maintainer Committers are individuals within the Audit and Metrics services.
|
|
- Ghania [Microsoft]
|
|
|
- TBD
|
|
|
|
|
|
**Subcommittee Contacts**
|
|
|
|
|
|
- Nidhi Fotedar, BP
|
|
|
- Robbert Veraart, CGI
|
|
|
- Paco Hope, AWS
|
|
|
- Stephen Whitley, SLB
|
|
|
|
|
|
|
|
|
**Contributors**
|
|
|
|
|
|
As many as we can get, but primarily the systems engineers / developers that have been assigned from each of the cloud providers and Schlumberger.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### Project Context and Details
|
|
|
Metrics typically come in the following form (from the Ops Procedures doc):
|
|
|
| Metric                | Description                                                  |
|-----------------------|--------------------------------------------------------------|
| availability          | the % of successful responses                                |
| latency & performance | the % of requests that complete faster than a target         |
| freshness             | the % of data that is up to date                             |
| correctness           | the % of requests that return the correct result             |
| durability            | the % of recorded data that can be read successfully         |
|
|
|
|
|
|
Some of these examples were already provided in the Ops Procedures under SLIs.
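As a minimal sketch of how the SLI percentages in the table above could be computed, the hypothetical helpers below illustrate availability and latency/performance; the function names and signatures are assumptions for illustration, not part of any OSDU service API:

```python
# Illustrative SLI calculations; names are hypothetical, not OSDU APIs.

def availability_sli(successful: int, total: int) -> float:
    """Availability: the % of successful responses."""
    return 100.0 * successful / total if total else 100.0


def latency_sli(durations_ms: list[float], target_ms: float) -> float:
    """Latency & performance: the % of requests faster than a target."""
    if not durations_ms:
        return 100.0
    fast = sum(1 for d in durations_ms if d < target_ms)
    return 100.0 * fast / len(durations_ms)
```

Freshness, correctness, and durability follow the same "good events over total events" shape, differing only in what counts as a good event.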
|
|
|
|
|
|
**Common metrics include:**
|
|
|
|
|
|
- **Latency**: Every Service
|
|
|
- Service call latency (request/response) for every service. Given the variability of the cloud environment, these are typically evaluated as percentiles. Example: 95% of calls to the search service should return an initial result in less than 5 seconds.
|
|
|
- **Throughput**: Every service involved in a multi-record, batch-oriented activity such as Indexing, Storage, Ingestion, etc.
|
|
|
- Example metric: Re-Indexing should be able to sustain 1 million records per hour
|
|
|
- **Freshness**: This typically involves multiple services in a workflow. This can include Ingestion, Storage, Indexing, Consumption zones etc.
|
|
|
- Example: 80% of records ingested into the system should be indexed and searchable within [small minutes]
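The percentile targets above can be checked with a simple nearest-rank calculation; this is a sketch, and the 95%/5-second search target is taken from the example above, not a normative threshold:

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile over a window of observed latencies."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def meets_search_target(latencies_s: list[float]) -> bool:
    """Example target: 95% of search calls return in under 5 seconds."""
    return percentile(latencies_s, 95) < 5.0
```

Monitoring stacks usually compute these percentiles from histograms rather than raw samples, but the semantics are the same.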
|
|
|
|
|
|
**Common scalability metrics include:**
|
|
|
These metrics ensure scalability of the system, i.e. that performance does not degrade with content size or activity.
|
|
|
|
|
|
- **Content**
|
|
|
- Number of kinds and number of records per kind. This is to ensure that Searching and Storage performance do not degrade as more content is added to the system.
|
|
|
- Example: search should have similar latency whether there are 50,000 records or 50,000,000 records.
|
|
|
- **Concurrency**
|
|
|
- How many activities are happening at the same time
|
|
|
- Example: Is search and delivery performance impacted during ingestion (while storing and indexing new records)?
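A minimal sketch of checking the concurrency example above: compare median search latency with and without concurrent ingestion load. The function and the 1.5x degradation budget are illustrative assumptions, not a stated requirement:

```python
import statistics


def concurrency_impact(baseline_s: list[float], under_load_s: list[float],
                       max_ratio: float = 1.5) -> tuple[float, bool]:
    """Compare median latency with and without concurrent ingestion.

    Returns (ratio, ok): ratio is the median latency under load divided by
    the baseline median; ok is True while degradation stays within max_ratio.
    """
    ratio = statistics.median(under_load_s) / statistics.median(baseline_s)
    return ratio, ratio <= max_ratio
```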
|
|
|
|
|
|
**Reliability / availability metrics**
|
|
|
- Error Rates for both infrastructure and platform services
|
|
|
- These should be captured even if the service recovers through retry; they can be indicators of future problems or sources of variability in performance.
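To make the "count errors even when the retry succeeds" point concrete, here is a hypothetical tracker; the class and its method names are assumptions for illustration only:

```python
from collections import Counter


class ErrorRateTracker:
    """Count every failed attempt, including those later recovered by a
    retry, so transient errors still surface as an early-warning signal."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, service: str, attempts: int, succeeded: bool) -> None:
        # If the call ultimately succeeded after N attempts, N-1 failed.
        failed = attempts - 1 if succeeded else attempts
        self._counts[(service, "errors")] += failed
        self._counts[(service, "attempts")] += attempts

    def error_rate(self, service: str) -> float:
        attempts = self._counts[(service, "attempts")]
        errors = self._counts[(service, "errors")]
        return errors / attempts if attempts else 0.0
```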
|
|
|
|
|
|
#### Initial Recommended KPIs
|
|
|
- General
|
|
|
- Uptime for each of the services as measured by non-500 responses from the services.
|
|
|
- Data flow in
|
|
|
- Ingestion failures – number of ingestion service requests that end in failure over time
|
|
|
- Ingestion workflow throughput – number of data items processed through the ingestion, workflow, DAG (parsers), storage, and entitlement flow. Index/Search is out of band from this sequence due to loose coupling through events.
|
|
|
- Storage service performance for creation of new records/versions, measured by the % of requests that complete in less than a given time
|
|
|
- Data correctness for FoR conversions measured by number of successful requests over time (regular and probe)
|
|
|
- Elastic cluster uptime and number of failures from indexer service to refresh indexes over time
|
|
|
- DDMS service-specific – latency for requests to get or create data items via the DDMS API (GetLog vs CreateLog, for example)
|
|
|
- Data flow out
|
|
|
- Search service performance for queries as measured by response latency over time (normalized to number of entities returned/total).
|
|
|
- Maximum size of notification and indexer queues over time – larger queues can result in failed indexing
|
|
|
- Storage retrieval performance for providing records/versions, measured by the % of requests that complete in less than a given time
|
|
|
- Number of enrichment jobs failing over time (workflow DAGs for WKS/OSDU canonical model creation)
|
|
|
- Security
|
|
|
- Number of 40x errors reported by services over a given period of time. This can indicate the system being hit without authorization, or valid requests failing authorization.
|
|
|
- Though technically not a metric, we need a way to alert operations/SRE about certificate expiry, key rotation, and related secrets management.
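Two of the KPIs above reduce to simple counts over response status codes; the sketch below is illustrative only, and treats "non-500 responses" as any status below 500 and "40x errors" as codes 400-409, both of which are interpretive assumptions:

```python
def uptime_pct(status_codes: list[int]) -> float:
    """General KPI: uptime measured as the share of non-5xx responses."""
    if not status_codes:
        return 100.0
    ok = sum(1 for c in status_codes if c < 500)
    return 100.0 * ok / len(status_codes)


def auth_error_count(status_codes: list[int]) -> int:
    """Security KPI: number of 40x errors over a reporting period."""
    return sum(1 for c in status_codes if 400 <= c <= 409)
```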
|
|
|
|
|
|
|
|
|
**Ways of Working**
|
|
|
|