# Inputs from Operators

The following inputs were collected by the Operations Workstream team leads (@nidhifotedar, @robbert verrall) to collate the requirements for logging, auditing, metrics, and traceability in the system.

## Context and Classification

The inputs collected span the OSDU platform, the service provider, the cloud provider, and the operator/customer, so it is important to categorize them and distill the ones that the platform is responsible for.

As a recap, the platform is responsible for SLI and telemetry information per the shared responsibility model below.

#### Shared Responsibility Model

| RESPONSIBILITY | PLATFORM | PROVIDER | CLOUD | CUSTOMER |
| :------------- | :------: | :------: | :---: | :------: |
| SERVICE LEVEL INDICATORS | Responsible | Accountable | Consulted | Consulted |
| SYSTEM TELEMETRY | Responsible | Accountable | Responsible | Consulted |
| CENTRALIZED LOGGING | Consulted | Accountable | Consulted | Responsible |
| ALERTS & NOTIFICATIONS | Consulted | Accountable | Consulted | Responsible |
| DASHBOARDS | Consulted | Accountable | Consulted | Responsible |
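
As an illustration of the SLI responsibility above, an availability SLI can be computed from request outcomes over a time window. This is a minimal sketch: the success criterion (HTTP status below 500) and the function name are assumptions for illustration, not a platform-mandated SLI definition.

```python
# Illustrative sketch of an availability SLI: the fraction of requests in a
# window that did not fail server-side. The success criterion (status < 500)
# is an assumption for illustration only.

def availability_sli(status_codes):
    """Return the fraction of requests that did not fail server-side."""
    if not status_codes:
        return 1.0  # no traffic in the window: treat it as fully available
    ok = sum(1 for code in status_codes if code < 500)
    return ok / len(status_codes)

# Example window of HTTP status codes: two server errors out of eight calls.
window = [200, 200, 503, 200, 201, 500, 200, 204]
print(f"availability: {availability_sli(window):.3f}")  # 6/8 = 0.750
```

Tracked over successive windows, such a value feeds directly into SLO erosion analysis.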

#### Categorization

1. Disaster recovery and associated backup
   - This is about data protection, realistic RTO/RPO targets, and measures for backup/recovery and integrity protection. The persona that will consume this is an **IT/systems engineer**.
   - Recommendation: initiate this as a separate sub-project for operations, as the focus here is on data preservation rather than metrics collection.
2. System availability and health metrics, auditability/tracing
   - This is about ensuring that we can see how services perform: their availability over time (or downtimes) and how that impacts SLOs, and their performance/latency, so we can detect erosion over time or under concurrent load and identify how we can improve the software to meet these SLOs. The persona that will consume this is an **IT/systems engineer**.
   - The project is currently framed towards this category and the cross-cutting aspects that the platform should instrument for operational readiness.
3. Data management statistics or platform utilization reporting
   - This is about gleaning insights on the jobs, data types, sources, and formats being ingested, stored, and accessed over time. The intent here is to understand how the platform is being used and its utilization rate. As Sun Maria (@sunl) rightly points out, the persona that will consume this is a **user/data manager**. There is some potential synergy here with the data reporting/BI workstream as well, so we could query such meta-information from OSDU and provide these reports.
   - Recommendation: keep this distinct from the operations metrics intended for the IT/systems engineer at this stage.
4. Cloud provider or infrastructure-related metrics
   - This category captures infrastructure-level concerns: compute/storage utilization, cost monitoring, CSP-provided dashboards/log administration tools, etc. The persona that will consume this is an **IT/systems engineer**.
   - Recommendation: delegate these requirements to the cloud service provider for consideration, documenting how these aspects can be covered by the OSDU service provider and/or the operator/customer.
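
The auditability/tracing need in category 2 can be illustrated with a structured log record that carries a correlation id through the system, so a single call can be followed across services. This is a minimal sketch; the field names and JSON shape are illustrative assumptions, not an OSDU log schema.

```python
# Illustrative sketch of a structured (JSON) audit log record with a
# correlation id. Field names are assumptions for illustration, not an
# OSDU-defined schema.
import json
import logging
import uuid
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("audit")

def audit_record(service, action, user, correlation_id=None):
    """Build one audit entry; reuse correlation_id across downstream calls."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "action": action,
        "user": user,
        # the same correlation id is propagated to every downstream service,
        # letting a flow be reconstructed from per-service logs
        "correlation_id": correlation_id or str(uuid.uuid4()),
    }

record = audit_record("storage", "record.read", "user@example.com")
logger.info(json.dumps(record))
```

Because the record is structured rather than free text, any CSP-provided or industry-standard log backend can index and filter on the correlation id.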

## Categorized Requirements

| Category | Operator Feedback | Comments |
|----------|-------------------|----------|
| 2 | Central logs viewable from anywhere. Will customer X be able to see logs for the parts of the environment that only impact customer Y? | No, each partition of OSDU should keep its logs distinct to protect privacy and confidentiality. |
| 2 | System to log for the different shared environments centrally and also provide APIs for external distributed apps (e.g., ELK) to consume. | Kibana or cloud-native monitoring tools handle consumption. The metrics project should support the ability to enable any of these per provider/operator choice. |
| 2 | A very important factor, I think, is to clearly delineate what the OSDU common code/architecture/data definitions/components will provide from what CSPs and marketplace players will/can provide. | Infrastructure monitoring and tools for visualization/administrative access will come from CSPs. The platform provides logs, metrics, and traces in a standard form to enable CSP-provided tooling or industry-standard tooling (if the operator chooses that route). |
| 2 | This will probably vary based on what we define as system health. Is it performance, cost, security breaches, or inconsistency in some of the data stored in the data platform? | Availability, reliability, usability/performance, and cyber-security are the items in scope. Cost is another interesting system metric, but it is perhaps best delegated to the CSP: since the platform itself is free, it is the infrastructure utilization that needs cost monitoring. |
| 2 | System health information must be captured and logged, but it doesn't need to be centralized as long as views can be created for the system operations team to act on. Central logs might be useful but are not a requirement. Having the option to export logs is probably more useful than having a system configured to force all logs into one repository. | I assume central logs here refers to centralizing logs from the infrastructure and OSDU services. If this means each service can log to its own logging backend, and aggregating these to collect metrics and trace through services is an external concern, that will pose a burden on the deployment/operations team. |
| 2 | It is easier for data correlation if the data can be centralized. Service health information that can be easily seen and communicated out is also important for early identification of trouble spots. | Agreed |
| X | We can ingest data from any source as long as there is an API, but having the data centralized definitely makes it easier. | Not sure this ties to the audit, metrics, and tracing objectives. |
| 2 | How do we work with logs from different/multiple CSPs? | The goal of this project is to come up with a standardized way to transmit logs, metrics, and tracing information, for example through OpenCensus/OpenTelemetry. This ensures compatibility with CSPs without being tied to a single one. |
| 2 | (Auto-)scaling of resources based on load/use (based on performance and cost monitoring/reporting). | This is a corrective action based on monitoring latency/performance metrics of the system. We will focus on the metrics first; actions can be subsequent steps. |
| 3 | Data quality QC and lineage. (Auto-)tiering of data storage used (based on usage frequency monitoring/reporting). | Auto-tiering could be the corrective action; usage metrics on data items from a data manager perspective can be considered in category 3. |
| 4 | Entitlements/RBAC reviews, etc. (based on internal security requirements), system updates/patching, etc. | Infrastructure-level monitoring is required. From a cyber-security perspective, logs on entitlement changes and traceability of calls for a particular user/identity over time can help address this. |
| 4 | Hardware warning/error messages, network errors, filesystem usage levels (file counts and total size). Node outages, overloaded nodes. | Good point. Should consider infrastructure-level monitoring/metrics, including usage levels, cost, etc., to be blended in. Adding as category 4 for CSP-provided metrics. |
| 2 | Data usage and access logs | Traceability of a flow through the system can help with access auditing. |
| 2 | Attempts to move or copy data | The system is stateless, but a trigger of a larger workflow like ingestion or delivery, and the associated flow through other services, can be captured as part of tracing. |
| X | Entitlement usage or non-usage (defunct) | Not sure this ties to the audit, metrics, and tracing objectives. Perhaps a data manager persona (3) concern? |
| 4 | Cost monitoring to prevent excessive logging by a particular service. | Infrastructure-level monitoring is required. It is better to change the time window for log retention than to turn off capabilities. From a cyber-security perspective, logs are required to assist with non-repudiation, so traceability of calls and auditability are key. |
| 2, 3, 4 | Huge topic. Heartbeat, API monitoring, performance monitoring, storage monitoring, data quality monitoring, event monitoring, security monitoring, etc. | Multiple concerns conflated, as the feedback says. API usage/tracing and performance metrics are included. The data management concern of quality monitoring is category 3; infrastructure monitoring of storage/events is category 4. |
| 4 | Visibility into the data so that the customer's platform support groups, which typically get the complaints, can answer questions quickly instead of needing to submit a ticket to the CSP. | The platform will provide logs, metrics, and tracing information. Collection, reporting, and access/administration will be through CSP tools. |
| 2 | Central for the quickest action, and distributed for focused deep dives/resolution efforts. | Ok |
| 2 | A centralized log system would be convenient for pattern detection (AI/DL/analytics); it is easier with a one-stop shop for logs. Filtering options and performance would be key if centralized. | Agreed |
| 2 | The central system should be visible to all users and not just admins. | There is a balance to be struck with privacy and confidentiality concerns before logs can be made broadly visible. |
| 2 | Would be great to have a central system showing predictive trends, outage areas, etc. | Agreed |
| 4 | Search capability across the platform | Unclear if this means search of data or of logs. For the former there is a service in OSDU; for the latter we rely on infrastructure/CSP tools. |
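
The table above points to OpenCensus/OpenTelemetry as a candidate for standardized transmission of traces across CSPs. The following is a simplified, dependency-free sketch of the core idea behind such tracing: parent and child spans share a trace id, so one request can be followed through services regardless of which backend collects the data. In practice the OpenTelemetry SDK provides this; all class and field names here are illustrative.

```python
# Simplified sketch of trace propagation (OpenTelemetry-style): every span in
# a flow shares the same trace id, while parent/child links reconstruct the
# call chain. Illustrative only; real deployments would use the
# OpenTelemetry SDK rather than this hand-rolled Span class.
import time
import uuid

class Span:
    def __init__(self, name, trace_id=None, parent_id=None):
        self.name = name
        self.trace_id = trace_id or uuid.uuid4().hex  # shared across the flow
        self.span_id = uuid.uuid4().hex[:16]          # unique per operation
        self.parent_id = parent_id
        self.start = time.time()
        self.end = None

    def child(self, name):
        # child spans inherit the trace id, linking the whole call chain
        return Span(name, trace_id=self.trace_id, parent_id=self.span_id)

    def finish(self):
        self.end = time.time()
        return self

# One ingestion request fanning out to a storage call: both spans carry the
# same trace id, so any collector can stitch the flow back together.
ingest = Span("ingestion.request")
store = ingest.child("storage.put")
store.finish()
ingest.finish()
print(store.trace_id == ingest.trace_id)  # True: same flow
```

Because only ids and timestamps need to cross service boundaries, this model stays CSP-neutral: each provider's collector can ingest the same span data.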
\ No newline at end of file |