Data Platform Metrics and Tracing Services
OSDU PMC Project Information
To deliver the requirements and techniques for managing the OSDU data platform in an enterprise setting, in particular handling audit logs and operational metrics/SLIs, and providing tracing for performance and auditability.
Strongly Recommended Read
- Operator requirements here
- Observability workshop operator feedback: here
- Original list of metrics collected by Action Pack pre-Mercury - here
- Mercury preship team reported issues - here
- For more details, please review the Readme here
- top-level issue for requirements - issue detail
- operator feedback on requirements - captured here
The following are the voting committers for the Audit and Metrics services project, representing the resource commitment of Schlumberger and the Cloud Service Providers.
The Maintainer Committers are individuals within the Audit and Metrics services project with the authority to approve Pull Requests and commit to Master.
- Logesh [LTI]
- Ghania [Microsoft]
As many as we can get, but primarily the systems engineers/developers who have been assigned from each of the cloud providers and Schlumberger.
- Logesh [LTI]
- Ghania [Microsoft]
- Core services
Project Context and Details
Typically, metrics come in the following form (from the Ops Procedures doc):
| SLI | Definition |
| --- | --- |
| Availability | The % of successful responses |
| Latency & performance | The % of requests that complete faster than a target |
| Freshness | The % of data that is up to date |
| Correctness | The % of requests that return the correct result |
| Durability | The % of recorded data that can be read successfully |
Some of these examples were already provided in the Ops Procedures under SLIs.
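Each of these SLIs reduces to a ratio of good events to total events over a measurement window. A minimal Python sketch of the first two, assuming a simple logged-request shape (the field and function names are illustrative, not part of OSDU):

```python
from dataclasses import dataclass

@dataclass
class RequestEvent:
    """One logged service call; field names are illustrative only."""
    status_code: int
    latency_seconds: float

def availability_sli(events: list[RequestEvent]) -> float:
    """Availability: the % of successful (non-5xx) responses."""
    if not events:
        return 100.0
    good = sum(1 for e in events if e.status_code < 500)
    return 100.0 * good / len(events)

def latency_sli(events: list[RequestEvent], target_seconds: float) -> float:
    """Latency & performance: the % of requests faster than the target."""
    if not events:
        return 100.0
    fast = sum(1 for e in events if e.latency_seconds < target_seconds)
    return 100.0 * fast / len(events)
```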
Common metrics include:
Latency: Every Service
- Service call latency (Request/Response) for every service. Given the variability of the cloud environment, these are typically evaluated in percentiles. Example: 95% of calls to the search service should return an initial result in less than 5 seconds
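The percentile evaluation in that example could be checked like this (a sketch; the 5-second target comes from the example above, the sample data is made up):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of latency samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Example: 95% of search calls should return an initial result in < 5 seconds.
search_latencies = [0.8, 1.2, 2.5, 3.9, 4.4, 4.8]  # illustrative samples
p95 = percentile(search_latencies, 95)
print(f"p95 = {p95:.1f}s, SLO met: {p95 < 5.0}")
```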
Throughput: Every service involved in a multi-record, batch oriented activity such as Indexing, Storage, Ingestion etc.
- Example metric: Re-Indexing should be able to sustain 1 million records per hour
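A sustained-throughput check for the re-indexing example might look like this (a sketch; the counters are assumed to come from the batch run):

```python
def records_per_hour(records_processed: int, elapsed_seconds: float) -> float:
    """Convert a batch run's counters into a sustained records-per-hour rate."""
    return records_processed / elapsed_seconds * 3600.0

# Example: did a re-indexing run sustain 1 million records per hour?
rate = records_per_hour(records_processed=2_600_000, elapsed_seconds=2.5 * 3600)
print(f"{rate:,.0f} records/hour, target met: {rate >= 1_000_000}")
```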
Freshness: This typically involves multiple services in a workflow. This can include Ingestion, Storage, Indexing, Consumption zones etc.
- Example: 80% of records ingested into the system should be indexed and searchable within [small minutes]
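Freshness can be evaluated by joining ingestion and indexing timestamps per record. A sketch, assuming both timestamps are available keyed by record id (the 80% target and window echo the example above; everything else is assumed):

```python
from datetime import datetime, timedelta

def freshness_sli(ingested_at: dict[str, datetime],
                  indexed_at: dict[str, datetime],
                  window: timedelta) -> float:
    """% of ingested records that became searchable within the window."""
    if not ingested_at:
        return 100.0
    fresh = sum(
        1 for record_id, t_in in ingested_at.items()
        if record_id in indexed_at and indexed_at[record_id] - t_in <= window
    )
    return 100.0 * fresh / len(ingested_at)

# e.g. freshness_sli(ingested, indexed, timedelta(minutes=5)) >= 80.0
```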
Common scalability metrics, which ensure that the system scales (i.e., performance does not degrade with size or activity), include:
- Number of kinds and number of records per kind. This is to ensure that Searching and Storage performance do not degrade as more content is added to the system.
- Example: Search should have similar latency whether there are 50,000 records or 50,000,000 records.
- How many activities are happening at the same time
- Example: Is search and delivery performance impacted during ingestion (while storing and indexing new records)?
Reliability / availability metrics
- Error Rates for both infrastructure and platform services
- These should be captured even if the service recovers through a retry; they can be indicators of future problems or sources of variability in performance.
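To keep these signals, every failed attempt should increment the error counter even when a later retry succeeds. A minimal sketch (the counter and the retried callable are assumptions; in practice the counter would come from a metrics library):

```python
import time

error_count = 0  # stand-in for a real metrics-library counter

def call_with_retries(operation, max_attempts: int = 3, backoff_seconds: float = 0.5):
    """Run operation, recording every failed attempt even if a retry recovers."""
    global error_count
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            error_count += 1  # record the failure even though we may recover
            if attempt == max_attempts:
                raise
            time.sleep(backoff_seconds * attempt)
```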
Initial Recommended KPIs
- Uptime for each of the services, as measured by non-500 responses from the service.
- Data flow in
- Ingestion failures – the number of ingestion service requests that fail over time
- Ingestion workflow throughput – the number of data items processed through the ingestion, workflow, DAG (parsers), storage, and entitlement flow. Index/Search is out of band from this sequence due to loose coupling through events.
- Storage service performance for creation of new records/versions, measured by the % of requests that complete in less than a given time
- Data correctness for FoR conversions measured by number of successful requests over time (regular and probe)
- Elastic cluster uptime and number of failures from indexer service to refresh indexes over time
- DDMS Service specific – latency for requests to get or create data items in DDMS API (GetLog vs CreateLog for example)
- Data flow out
- Search service performance for queries as measured by response latency over time (normalized to number of entities returned/total).
- Maximum size of notification and indexer queues over time – larger queues can result in failed indexing
- Storage retrieval performance for providing records/versions, measured by the % of requests that complete in less than a given time
- Number of enrichment jobs failing over time (workflow DAGs for WKS/OSDU canonical model creation)
- Number of 40x errors reported by services over a given period of time. These can indicate the system being hit without authorization, or valid requests failing on authorization.
- Though technically not a metric, we need a way to alert operations/SRE around certificate expiry, key rotation, and related secrets management; see the sketch below.
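One way to feed such an alert is a periodic probe of each service endpoint's TLS certificate; a sketch, where the host name and the 30-day threshold are assumptions:

```python
import socket
import ssl
from datetime import datetime, timezone

ALERT_THRESHOLD_DAYS = 30  # assumed alerting window

def days_until_cert_expiry(host: str, port: int = 443) -> float:
    """Days remaining before the TLS certificate presented by host expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # getpeercert() formats 'notAfter' like 'Jun 18 12:00:00 2025 GMT'
    not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    delta = not_after.replace(tzinfo=timezone.utc) - datetime.now(timezone.utc)
    return delta.total_seconds() / 86400.0

if __name__ == "__main__":
    remaining = days_until_cert_expiry("search.osdu.example.com")  # hypothetical host
    if remaining < ALERT_THRESHOLD_DAYS:
        print(f"ALERT: certificate expires in {remaining:.0f} days")
```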
Useful Pre-read Material
| Document Description | Location Link |
| --- | --- |
| Introduction to the LTI CloudEnsure Platform and metrics supported | LTI_CloudEnsure_Platform.pdf |
| Summary slides from the Action Pack on Operator Ops procedures conducted in 2019 | Operator Ops Procedures Workshop |
| Operator feedback summary document on observability and DR (disaster recovery) circa 2019 | Operator Feedback - DR. Observability.docx |
| Kick-off call with the LTI team on the reporting and dashboarding project for metrics – 18th Jun 2021 | LTI_OSDU_Reporting_and_Dashboard_KPI_Kick-off Meeting_18th Jun 2021.pdf |
| Stephen's curated list that contains the categorization of the original metrics (KPI full list) spreadsheet | OperationsDashFullList - KPI Classification - SW.xlsx |
Ways of Working
See the Wiki.
Key Project Activities
- Please read the Readme.txt file