To deliver the requirements and techniques for managing the OSDU data platform in an enterprise setting, in particular handling audit logs, collecting operational metrics/SLIs, and providing tracing for performance and auditability.
- Original list of metrics collected by Action Pack pre-Mercury - [here](https://gitlab.opengroup.org/osdu/community-import/program-activities/-/blob/master/Reporting_Dashboarding/Design%20documentation/OperationsDashFullList.xlsx)
- Mercury preship team reported issues - [here](https://gitlab.opengroup.org/osdu/subcommittees/ea/projects/pre-shipping/home/-/issues)
- for more details - please review the [**README here**](https://community.opengroup.org/osdu/platform/deployment-and-operations/audit-and-metrics/-/blob/master/README.md)
- top-level issue for requirements - [issue detail](https://community.opengroup.org/osdu/platform/deployment-and-operations/audit-and-metrics/-/issues/1)
- operator feedback on requirements - captured [here](OperatorFeedback)
| SLI | Definition |
| --- | --- |
| Latency & performance | The % of requests that complete faster than a target |
| Freshness | The % of data that is up to date |
| Correctness | The % of requests that return the correct result |
| Durability | The % of recorded data that can be read back successfully |
Some of these examples were already provided in the Ops Procedures under SLIs.
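As a rough illustration of how ratio-style SLIs like these can be computed from request logs, the sketch below uses a hypothetical per-request record; the field names, thresholds, and sample values are assumptions for illustration, not part of the OSDU platform.

```python
from dataclasses import dataclass

# Hypothetical per-request record; field names are illustrative only.
@dataclass
class Request:
    latency_ms: float   # observed end-to-end latency
    correct: bool       # response matched the expected result (e.g. from a probe)

def ratio(numerator: int, denominator: int) -> float:
    """Return a percentage, guarding against an empty evaluation window."""
    return 100.0 * numerator / denominator if denominator else 0.0

def latency_sli(requests, target_ms):
    """% of requests that complete faster than the latency target."""
    return ratio(sum(r.latency_ms < target_ms for r in requests), len(requests))

def correctness_sli(requests):
    """% of requests that return the correct result."""
    return ratio(sum(r.correct for r in requests), len(requests))

# Illustrative window of three requests against a 5-second latency target.
window = [Request(1200, True), Request(6300, True), Request(900, False)]
print(f"latency SLI:     {latency_sli(window, 5000):.1f}%")
print(f"correctness SLI: {correctness_sli(window):.1f}%")
```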
**Common metrics include:**
- **Latency**: every service
- Service call latency (request/response) for every service. Given the variability of the cloud environment, these are typically evaluated as percentiles. Example: 95% of calls to the search service should return an initial result in less than 5 seconds (a percentile-calculation sketch follows this list).
- **Throughput**: every service involved in a multi-record, batch-oriented activity such as Indexing, Storage, Ingestion, etc.
- Example metric: re-indexing should be able to sustain 1 million records per hour
- **Freshness**: this typically involves multiple services in a workflow, including Ingestion, Storage, Indexing, Consumption zones, etc.
- Example: 80% of records ingested into the system should be indexed and searchable within [small minutes]
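A minimal sketch of how the percentile-style latency SLO above might be evaluated over a window of observed search latencies; the nearest-rank method, the 5-second target, and the sample values are assumptions chosen for illustration.

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    ordered = sorted(samples)
    rank = min(len(ordered), max(1, round(pct / 100.0 * len(ordered))))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical search-service latencies collected over an evaluation window.
search_latencies_ms = [850, 1200, 4300, 980, 5600, 760, 2100, 1500, 3100, 940]
p95 = percentile(search_latencies_ms, 95)
meets_slo = p95 < 5000  # SLO: 95% of search calls return an initial result in < 5 s
print(f"p95 = {p95} ms, SLO met: {meets_slo}")
```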
**Common scalability metrics include:**
These ensure the system scales, i.e. performance does not degrade with content size or activity level.
- **Content**
- Number of kinds and number of records per kind. This is to ensure that Searching and Storage performance do not degrade as more content is added to the system.
- Example: search should have similar latency whether there are 50,000 records or 50,000,000 records.
- **Concurrency**
- How many activities are happening at the same time.
- Example: is search and delivery performance impacted during ingestion (while storing and indexing new records)?
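One way to make the concurrency check concrete is to compare the search latency distribution with and without an ingestion run in progress. The sketch below is an assumed approach with made-up sample values and an assumed 25% tolerance, not an OSDU-defined test.

```python
from statistics import quantiles

def p95(samples):
    """Approximate 95th-percentile latency (ms) using statistics.quantiles."""
    return quantiles(samples, n=20)[-1]

# Hypothetical search latencies: a quiet system vs. the same queries during bulk ingestion.
quiet = [850, 920, 1100, 780, 990, 1050, 870, 940, 1010, 890]
during_ingestion = [1600, 2100, 1900, 2400, 1750, 2250, 1820, 2050, 1980, 2300]

# Assumed tolerance: flag a concurrency problem if p95 under load is >25% worse than baseline.
TOLERANCE = 1.25
base, load = p95(quiet), p95(during_ingestion)
print(f"baseline p95 = {base:.0f} ms, during ingestion p95 = {load:.0f} ms")
print("concurrency impact detected" if load > TOLERANCE * base else "within tolerance")
```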
**Reliability / availability metrics**
- Error Rates for both infrastructure and platform services
- These should be captured even if the service recovers through a retry; they can be indicators of future problems or introduce variability in performance.
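To illustrate why retried attempts still matter, the sketch below counts errors at the attempt level, so a call that eventually succeeds after a retry still contributes to the error rate; the attempt record and sample data are assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical record of a single attempt against a service (one logical call may span several).
@dataclass
class Attempt:
    service: str
    failed: bool

def error_rate(attempts, service):
    """Fraction of attempts against a service that failed, counting retried attempts."""
    relevant = [a for a in attempts if a.service == service]
    return sum(a.failed for a in relevant) / len(relevant) if relevant else 0.0

# Example: the storage call succeeded overall, but only after one failed attempt.
attempts = [
    Attempt("storage", failed=True),   # first attempt failed ...
    Attempt("storage", failed=False),  # ... retry succeeded
    Attempt("search", failed=False),
]
print(f"storage attempt error rate: {error_rate(attempts, 'storage'):.0%}")
```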
#### Initial Recommended KPIs
- General
- Uptime for each service, as measured by non-500 responses from that service (a counting sketch follows this list).
- Data flow in
- Ingestion failures – number of ingestion service requests that end in failure over time
- Ingestion workflow throughput – number of data items processed through the ingestion, workflow, DAG (parsers), storage, entitlement flow of data. Index/Search is out of band from this sequence due to loose coupling through events.
- Storage service performance for creation of new records/versions, measured by the % of requests that complete in less than a given time
- Data correctness for FoR conversions, measured by the number of successful requests over time (regular and probe)
- Elastic cluster uptime and the number of failures from the indexer service to refresh indexes over time
- DDMS service-specific – latency for requests to get or create data items via the DDMS API (GetLog vs. CreateLog, for example)
- Data flow out
- Search service performance for queries as measured by response latency over time (normalized to number of entities returned/total).
- Maximum size of notification and indexer queues over time – larger queues can result in failed indexing
- Storage retrieval performance for providing records/versions, measured by the % of requests that complete in less than a given time
- Number of enrichment jobs failing over time (workflow DAGs for WKS/OSDU canonical model creation)
- Security
- Number of 4xx errors reported by services over a given period of time. This can be an indication of the system being hit without authorization, or of valid requests failing authorization.
- Though technically not a metric, we need a way to alert operations/SRE about certificate expiry, key rotation, and related secrets management.
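A small counting sketch for two of the KPIs above: deriving a per-service availability figure from response status codes (the "non-500" uptime measure) and counting 4xx responses as the security signal. The log format, sample data, and alert threshold are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical (service, HTTP status) samples scraped from access logs over a window.
responses = [
    ("search", 200), ("search", 200), ("search", 500),
    ("storage", 201), ("storage", 403), ("storage", 401),
    ("indexer", 200), ("indexer", 200),
]

def availability(samples, service):
    """% of responses from a service that are not 500s (per the uptime KPI above)."""
    codes = [status for svc, status in samples if svc == service]
    return 100.0 * sum(status != 500 for status in codes) / len(codes) if codes else 0.0

def client_error_counts(samples):
    """Count of 4xx responses per service over the window (security signal)."""
    return Counter(svc for svc, status in samples if 400 <= status < 500)

ALERT_4XX_THRESHOLD = 2  # assumed threshold for raising an operations alert
for svc in ("search", "storage", "indexer"):
    print(f"{svc}: availability {availability(responses, svc):.1f}%")
for svc, count in client_error_counts(responses).items():
    if count >= ALERT_4XX_THRESHOLD:
        print(f"ALERT: {svc} returned {count} 4xx responses in the window")
```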