OSDU Software issues
https://community.opengroup.org/groups/osdu/-/issues

https://community.opengroup.org/osdu/platform/security-and-compliance/legal/-/issues/36
Search api missing in legal service (2024-02-16, Dadong Zhou)
Is it possible to add the search capability for legal tags based on the legaltag attributes including the ones in extensionProperties? Thanks.

https://community.opengroup.org/osdu/platform/data-flow/ingestion/external-data-sources/core-external-data-workflow/-/issues/16
Fetch-and-Ingest to refresh token from scheduled job (2023-02-06, Debasis Chatterjee)
With input from Farid (Katalyst) -
The tester creates secrets for various items and sets up CSRE and CSDJ records at 10:00 am local time.
He/she schedules fetch-and-ingest to run every day at 11:00 pm.
Assuming the Data Provider (e.g. Katalyst) has set the token duration to, say, 10 hours, the 13-hour gap means the token would expire by the time fetch-and-ingest starts.
What are the alternatives?
For example, is there a way to avoid creating a secret for the refresh token at setup time?
Instead, the token would be generated every time fetch-and-ingest kicks in (per the cron job schedule), as sketched below.
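A minimal illustrative sketch of that alternative, assuming a client-credentials style token endpoint; the helper name, URL, and parameter names are hypothetical and not part of this issue:

```
import requests

def fetch_access_token(token_url: str, client_id: str, client_secret: str) -> str:
    """Request a fresh access token at job start instead of relying on a stored refresh token."""
    resp = requests.post(
        token_url,
        data={
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]

# Called at the top of each scheduled fetch-and-ingest run, so the token lifetime
# starts counting from the 11:00 pm run rather than from the 10:00 am setup.
```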
@AshishSaxenaAccenture - Please review and discuss with the DEV team. Later we can deliberate together on a solution, if needed.
Thank you

https://community.opengroup.org/osdu/platform/data-flow/ingestion/home/-/issues/53
ADR - Checkpointing mechanism for Airflow Tasks (2023-07-05, harshit aggarwal)
## Status
- [X] Proposed
- [ ] Trialing
- [ ] Under review
- [ ] Approved
- [ ] Retired
## Introduction
### Resiliency in DAGs
Because Airflow is a distributed environment and tasks can execute on different host machines, tasks may be killed or hit unexpected crashes/failures. To mitigate and handle such failure scenarios, there are general guidelines for Airflow and for DAGs, such as:
- Making task logic idempotent so that tasks can be retried safely without getting into an inconsistent state
- Making tasks atomic, i.e. each task is responsible for one operation that can be re-run independently
- Enabling retries at DAG task level in case of failure
All of the above resiliency guidelines are mostly followed in the community ingestion DAGs such as manifest ingestion and csv-ingestion. However, even though the DAG tasks involving ingestion are completely safe to retry, a blanket retry can re-ingest the same data records into the system, resulting in data duplication or additional versions of the same records. Also, in future we might onboard very compute-intensive workloads orchestrated by Airflow on big-data systems like Spark; a generic framework for checkpointing and state preservation would significantly improve system reliability and performance.
### Storage API Behavior
The ingestion tasks in Airflow invoke Storage APIs to ingest data records into the system. The Storage service PUT record API is idempotent in nature: if a record with the same id is ingested again, the call returns 200 and a new version of the data record is created. If a record id is not provided, the Storage service generates random ids and assigns them to the data records.
The Storage service also does not perform any checksum operations on record data fields to prevent data duplication, because of performance concerns:
- Computing checksums for each record can be a very costly operation
- Even if checksums are present, on-the-fly checksum retrieval and comparison is very tricky to handle
## Problem Statement
A DAG retry will result in re-ingestion, and end users will see multiple versions of the same records they ingested. In use cases where even a single DAG run can ingest hundreds of thousands of records, a plain re-ingestion of these records costs:
- Additional resource consumption in terms of storage like blob storage, database, Elastic etc.
- Additional resource consumption in terms of computation as well, since the records will get re-indexed through the search and indexing pipeline.
Also, in future we might onboard very compute-intensive workloads orchestrated by Airflow via big-data systems like Spark, where a generic checkpointing and state-preservation framework would significantly improve system reliability and performance; this ADR, however, focuses mainly on ingestion tasks.
From the above commentary, there is a requirement to preserve the state of an ingestion task using markers. These markers can be record ids, batch ids, etc., depending on the DAG logic, so that in case of failures and subsequent retries the tasks can resume from the last checkpointed state and avoid re-ingestion scenarios.
Below are some mechanisms/options to store and retrieve this state. We discuss some potential markers [below](https://community.opengroup.org/osdu/platform/data-flow/ingestion/home/-/issues/53#checkpointing-markers), but these markers are very DAG-specific and the implementation can be left to DAG authors.
## Proposals
### Approach 1
**Leverage Airflow Xcoms for storing state**
XCom is a built-in Airflow feature that allows tasks to exchange task metadata or store small amounts of data. XCom uses the Airflow metadata database (PostgreSQL/MySQL) to persist the information. XCom entries are defined by a key, a value, and a timestamp.
XComs can be "pushed", meaning sent by a task, or "pulled", meaning received by a task. When an XCom is pushed, it is stored in the Airflow metadata database and made available to all other tasks.
### Solution Details
We can use the out-of-the-box XCom support provided by Airflow to save state markers for the records being ingested in each DAG run; Airflow exposes APIs to perform read/write operations with XCom and persist the data in the metadata database.
**1. DAGs with Python Operators**
![image](/uploads/8b50b97501203cc1ad6c1c8f35ca278c/image.png)
### Flow Details
- A workflow like manifest ingestion is triggered by the user by invoking the Workflow service; the Workflow service will call Airflow to initiate the DAG run
- The tasks (Airflow Operators) in the DAGs will begin ingestion. XCom will be queried to check whether any state was persisted for the run; if state is found, the task will resume from the last ingested batch of records, otherwise all records will be ingested. Records are ingested by invoking the Storage service PUT Records API. The task will also keep saving the record ids or relevant state markers (like batch ids) to XCom as key-value pairs; the XCom values are stored in PostgreSQL. The state object won't have a fixed schema, for generalization; these XCom entries are grouped at run_id and task_id level
```
{
  "key": "RunId+TaskId",
  "value": {
    "state": {
      "fieldA": "",
      "fieldB": ""
    }
  }
}
```
- The XCom values can be queried by run_id and task_id by the task run instance, and payloads will be skipped or ingested appropriately to prevent duplication of data
**2. DAGs with K8s Pod Operators**
Python DAGs can directly leverage Airflow modules to perform XCom operations, but non-Python tasks and k8s pod operators need an HTTP endpoint to interact with XCom.
![image](/uploads/9d4a60e8a2c08dcc4f2ed547f3669fb9/image.png)
#### Details on the flow
- For non-Python DAGs and DAGs involving the k8s pod operator, REST endpoints will be required to access XCom
- New APIs will be onboarded in the Workflow service to facilitate XCom interaction
- The executing tasks will call APIs in the Workflow service for Airflow XCom interactions
- The rest of the flow remains the same as in the previous case
#### Pros
- Easy to implement as the solution utilizes native capabilities of airflow to persist data
- Minimal changes required in workflow service, only two new APIs need to be exposed to perform read/write with xcom using airflow APIs
#### Cons
- **No support for out of the box REST API support for write operations with Airflow Xcom**
> Airflow only exposes APIs to list and get XCom entries, not write/push APIs. It is feasible to write custom APIs in Airflow using plugins, but this requires additional effort and POC validation, and some managed Airflow offerings might not support it.
- **Xcom size constraints and usability**
> XCom comes with a big limitation of size constraints: XCom should not be used for passing large data sets between tasks. The size limit for an XCom entry is determined by the metadata database being used, which makes the solution platform-dependent. Supported sizes:
> 1. Postgres: 1 Gb
> 1. SQLite: 2 Gb
> 1. MySQL: 64 Kb
> The above limits are good enough for a Postgres backend, but the recommendation from the Airflow community is not to rely on XCom even when the data fits within the maximum allowable limit, as large XCom entries can significantly degrade task execution times and UI responsiveness.
- **Xcoms are not recommended in implementation of other ADRs as well**
> Due to the inherent constraints of XCom, we introduced the notion of manifest-by-reference to avoid passing large manifests between Airflow tasks; the same issue can surface in the case of large markers.
### Approach 2
#### Using External storage for persisting records
We can leverage external storage like Azure Blob Storage or AWS S3 to persist the record id information in the system.
**Details**
- To facilitate access to external storage by the DAGs, new endpoints can be added in Workflow to abstract direct infrastructure access.
- The new APIs are supposed to be internal; new entitlements can be introduced for restricted access
- DAGs are expected to save only small amounts of data as state markers, so there are no storage constraints as such. Even if record ids are persisted as state markers, given that the maximum batch size supported by the Storage service is 500, the typical payload size will be around 25-50 KB. (A rough sketch of such endpoints follows the pros and cons below.)
![image](/uploads/d157d7a446d902dafbbd6116293974f8/image.png)
#### Pros
- No inherent constraints which were posed in Xcom
- Generic REST interface which can be used by all Dags
- Each CSP can extend and use their own backend infrastructure
#### Cons
- Minor - Additional CSP implementation required, along with onboarding new APIs in Workflow
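To make Approach 2 concrete, a rough sketch of how a DAG task might call such endpoints; the paths, payload shape, and base URL are hypothetical, since this ADR does not define the API contract:

```
import requests

WORKFLOW_URL = "https://example.osdu.host/api/workflow/v1"  # hypothetical base URL

def save_checkpoint(run_id: str, task_id: str, state: dict, token: str) -> None:
    # Hypothetical internal endpoint that writes the state blob to external storage (Blob/S3).
    requests.put(
        f"{WORKFLOW_URL}/workflowRun/{run_id}/tasks/{task_id}/state",
        json={"state": state},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    ).raise_for_status()

def load_checkpoint(run_id: str, task_id: str, token: str) -> dict:
    resp = requests.get(
        f"{WORKFLOW_URL}/workflowRun/{run_id}/tasks/{task_id}/state",
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    return resp.json().get("state", {}) if resp.status_code == 200 else {}
```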
### Checkpointing markers
These markers are essentially metadata that can be saved and retrieved by the DAG to resume processing in case of retries. The exact markers will depend on the DAG logic and implementation, hence the framework is generic and schema-free and can save arbitrary objects.
For typical OSDU DAGs like manifest ingestion or the CSV parser, the tasks performed during ingestion are schema validations, file metadata validations, CRS and unit validations, referential checks via the Search query API, etc. Apart from the actual data record ingestion, all other steps are read-only operations and do not change the state of the system in any way. Record ingestion is the only write operation that changes the state of the system, hence markers that save the state of ingested records are a good candidate. For instance, these state markers can be:
- Records ids of the data records ingested
- Batch number of the records ingested (Batching logic should be predictable to resume safely)
- Unique monotonically increasing integer values assigned to each data record
All of the above can be decent marker choices to implement and demonstrate the utility of this framework for OSDU DAGs like CSV ingestion.
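As an illustration of how such markers could be used, a minimal sketch of batch-level resume logic; the helper callables (load_state, save_state, put_records) stand in for whichever persistence option from the proposals above is chosen:

```
def ingest_with_checkpoint(records, batch_size, load_state, save_state, put_records):
    """Resume ingestion from the last successfully ingested batch.

    load_state/save_state persist {"last_batch_ingested": n} via XCom or the
    external-storage endpoints discussed above; put_records calls Storage PUT.
    """
    state = load_state() or {"last_batch_ingested": -1}
    batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    for number, batch in enumerate(batches):
        if number <= state["last_batch_ingested"]:
            continue  # already ingested in a previous try; skip to avoid duplicate versions
        put_records(batch)
        state["last_batch_ingested"] = number
        save_state(state)  # checkpoint only after the batch succeeded
```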
### References
- https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/67
- https://airflow.apache.org/docs/apache-airflow/stable/concepts/xcoms.html
- https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#tag/XCom
- https://docs.astronomer.io/learn/airflow-passing-data-between-tasks
- https://docs.astronomer.io/learn/custom-xcom-backends
### FAQs

https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/well-delivery/well-delivery/-/issues/16
New "consumption" API to show combined information from multiple Drilling Reports (2023-02-02, Debasis Chatterjee)
This refers to OperationsReports entity and its sub-types such as GasReading, PumpOp, OperationsActivity.
https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/E-R/master-data/OperationsReport.1.2.0.md
I had a preliminary discussion with @openai and Stuart about this requirement.
Also see this deck for an understanding.
[Andrei-expected-flow.pptx](/uploads/3010f7d5b476931b55e23c2357377959/Andrei-expected-flow.pptx)
There are some API services available to deal with this data type.
GET /operationsReports/v1/byTimeRange/{start-time}/{end-time}
But these return only IDs, and the end user (or the vendor's application) then has to go the extra mile to extract the required information from multiple records.
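For illustration, a rough sketch of the client-side work that is needed today; the base URLs, the response shape of the byTimeRange call, and the GasReadings field name are placeholders/assumptions, not the actual API contract:

```
import requests

HEADERS = {"Authorization": "Bearer <token>", "data-partition-id": "opendes"}
WELL_DELIVERY = "https://example.osdu.host/api/well-delivery"  # placeholder host
STORAGE = "https://example.osdu.host/api/storage/v2"           # placeholder host

# 1. The existing API returns only the OperationsReport record ids for the time range.
report_ids = requests.get(
    f"{WELL_DELIVERY}/operationsReports/v1/byTimeRange/2023-01-01/2023-01-31",
    headers=HEADERS, timeout=30).json()

# 2. The client must fetch every record and extract the requested sub-type itself.
gas_readings = []
for record_id in report_ids:
    record = requests.get(f"{STORAGE}/records/{record_id}", headers=HEADERS, timeout=30).json()
    gas_readings.extend(record.get("data", {}).get("GasReadings", []))

# A new "consumption" API would return this packaged view in a single call.
```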
What is needed is a user-friendly (new) service returning "packaged" information about the requested sub-type from multiple Drilling Reports (hence multiple OperationsReport records) for the requested time range.

https://community.opengroup.org/osdu/platform/system/storage/-/issues/156
ADR: Recover a soft deleted record in storage (2023-09-11, Abhishek Nanda)
Ability to recover a soft deleted record in storage service
# Decision Title
## Status
- [X] Proposed
- [ ] Trialing
- [ ] Under review
- [ ] Approved
- [ ] Retired
## Context & Scope
The storage service provides two ways to delete a record. One is to logically (soft) delete the record, in which case a record with the same id can be revived later because its version history is maintained; the other is to purge the record, in which case the record's version history is deleted too. After either type of deletion, the record cannot be accessed using the Storage or Search service.
Today there is no easy way to query or recover the soft-deleted records. Providing admin-only APIs will help admins to search, view and recover the soft-deleted data if required.
# Tradeoff Analysis - Input to decision
Today users have to maintain the soft-deleted record IDs on their own. Below is the workaround available today to attempt recovery of such records (a rough sketch follows the steps):
1. Recreate the record with existing id and random/empty data and meta blocks. This will mark the record as active.
2. Fetch all versions of the record.
3. Fetch the latest version prior to the one just created to get back the actual record data and meta blocks.
4. Recreate the record using the response to create a new version of the record with the appropriate data.
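A sketch of that workaround against the Storage API; the host, record id, kind, acl, and legal values are placeholders, and the stub payload must match the original record's kind and compliance blocks:

```
import requests

STORAGE = "https://example.osdu.host/api/storage/v2"  # placeholder host
HEADERS = {"Authorization": "Bearer <token>", "data-partition-id": "opendes"}

record_id = "opendes:doc:example-123"  # soft-deleted record to recover

# 1. Recreate the record with the existing id and an empty data block to mark it active again.
stub = {
    "id": record_id,
    "kind": "opendes:wks:dataset--File.Generic:1.0.0",  # placeholder kind
    "acl": {"viewers": ["data.default.viewers@opendes.example.com"],
            "owners": ["data.default.owners@opendes.example.com"]},
    "legal": {"legaltags": ["opendes-example-tag"], "otherRelevantDataCountries": ["US"]},
    "data": {},
}
requests.put(f"{STORAGE}/records", json=[stub], headers=HEADERS, timeout=30).raise_for_status()

# 2. Fetch all versions of the record.
versions = requests.get(f"{STORAGE}/records/versions/{record_id}",
                        headers=HEADERS, timeout=30).json()["versions"]

# 3. Fetch the latest version prior to the stub just created to get the real data/meta blocks.
previous = sorted(versions)[-2]
original = requests.get(f"{STORAGE}/records/{record_id}/{previous}",
                        headers=HEADERS, timeout=30).json()

# 4. Recreate the record from that response so the newest version carries the actual data again.
requests.put(f"{STORAGE}/records", json=[original], headers=HEADERS, timeout=30).raise_for_status()
```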
## Decision
Create 3 new APIs as below
1. Fetch deleted records (accessible to _users.datalake.admins_) -> This will fetch a list of records. Since the list can be very long, it should return a maximum of 100 records and support from/to deletion-date filters along with pagination.
![image](/uploads/ca34cf94f3184fba05d2ade6bb502a90/image.png)
2. Recover deleted records by id (accessible to _users.datalake.admins_) -> This will take a list of record ids (max 500) that are to be recovered and return the list of record ids that succeeded as well as failed.
![image](/uploads/ae448c5fb9ed5803101aeba51a4fd7b4/image.png)
3. Recover deleted records by metadata filters (currently only fromDeletedDate and toDeletedDate are supported) (accessible to _users.datalake.admins_) -> This will take filter criteria for the records that are to be recovered and return the list of record ids that succeeded as well as failed. (An illustrative call shape for the recovery APIs is sketched below.)
![image](/uploads/2b1d373eed8513e166fba784be4b3250/image.png)
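Purely for illustration, a hypothetical call shape for the recover-by-id API based on the description above; the actual path and field names come from the attached Open API spec, not from this sketch:

```
import requests

STORAGE = "https://example.osdu.host/api/storage/v2"  # placeholder host
HEADERS = {"Authorization": "Bearer <token>", "data-partition-id": "opendes"}

# Hypothetical endpoint: submit up to 500 soft-deleted record ids for recovery.
resp = requests.post(
    f"{STORAGE}/records/recover",
    json={"recordIds": ["opendes:doc:example-1", "opendes:doc:example-2"]},
    headers=HEADERS,
    timeout=30,
)
result = resp.json()
print("recovered:", result.get("succeededRecordIds"))
print("failed:", result.get("failedRecordIds"))
```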
## Consequences
1. This will help users to bulk recover deleted records in a single go.
2. The APIs will help prevent having garbage record versions that had to be created just to make the record active.
3. This will help users to fetch a list of soft deleted records which was not possible earlier.
Open API spec for the service
[storage-recover-swagger.yaml](/uploads/396cc62881dfe5f075f0e987f0313472/storage-recover-swagger.yaml)

https://community.opengroup.org/osdu/platform/system/storage/-/issues/154
Storage service stale in-memory cache leads to inconsistency. (2023-02-15, Nikhil Singh [Microsoft])
We recently uncovered a bug in storage service due to local cache getting stale. The flow can be understood by the following steps.
1. Deletion of a legal tag via legal service delete API --> response 204 No content after successful deletion
2. Storage service API call made at https://**********/api/storage/v2/push-handlers/legaltag-changed?token=*** --> goes to pod P1 of the storage service --> updates the record compliance for all records associated with the tag deleted in step 1 --> removes the deleted tag from the local cache of pod P1.
3. Storage PUT call to create a record with the deleted legal tag --> goes to pod P2 of storage --> the cache still has that legal tag --> returns 201 Created.
At step 3, all calls going to pod P1 return "Invalid legal tag", but API calls landing on other pods successfully create these records.
The service ITs are failing in a transient manner due to this issue. (Milestone: M17 - Release 0.20)

https://community.opengroup.org/osdu/platform/data-flow/ingestion/csv-parser/csv-parser/-/issues/79
Error diagnostics - need to improve significantly (2022-12-13, Debasis Chatterjee)
You may start off by checking here.
https://community.opengroup.org/osdu/platform/pre-shipping/-/tree/main/R3-M14/AWS-M14/Ingestion%20DAG%20CSV
For each and every problem, I did not get a suitable clue from the error log.
1. Problem in data: ELEVATION has a non-numeric value.
2. Problem in schema: TVD, Latitude, Longitude missed "type=string".
3. At times when the file is missing (incorrect sequence in the collection), it gives a fatal error instead of saying clearly "Unable to get the CSV file".
This caused a situation where the record gets created and we can see all its properties from the Storage service, but none from the Search service.
Nearly impossible to figure out for the average Data Loader (user).
Next, imagine we are ingesting 1000 rows from a source CSV and a problem occurs in row 253 and row 455.
The user's expectation is that the CSV ingestion program should pinpoint and clearly indicate the row number and the type of problem that caused the failure.
cc @chad , @tdixon

https://community.opengroup.org/osdu/platform/home/-/issues/52
ADR - Release management change for Core Libraries (2024-02-13, Rene von Borstel [EPAM])
## Decision Title
Release management change for Core Libraries
## Status
- [x] Proposed
- [ ] Approved
- [ ] Implementing (incl. documenting)
- [ ] Testing
- [ ] Released
## Purpose
Change the release process for Core Libraries to reduce the impact on the code tagging process for milestone releases.
## Problem statement
Right before the code freeze, the core libraries and all services are upgraded for that milestone. If there is a major upgrade, e.g. a Spring Boot update or the Jackson library moving from an older to a newer version, and the services have been working against an older version of the core libraries, we will most of the time see compile-time or runtime errors across all the services. That impacts the stability of the system, because development work has to stop in order to sanitize the release branch so that the services are up and running and all items are passing, which is additional overhead.
## Proposed solution
Core libraries are not shipped to customers; they are used internally within the OSDU community. Hence, they do not need to follow the milestone versioning.
We can avoid the above-mentioned overhead of upgrading library versions in services at every release by adopting the following versioning strategy for Core Libraries:
- Create independent versioning of Core Libraries.
- Do not cut a release branch at every release.
- Follow this versioning strategy while rolling out new versions of Core Libraries.
- Major Version
- Create a new major version when the release contains Backward incompatible changes in Interfaces or Model classes.
- For example: `id` in the `Record` class is changed to `recordId`.
- Minor Version
- Use a minor version when additional methods are added to interfaces or new fields are added to model classes
- Changes in versions of dependencies - Spring Boot, Jackson, etc.
- Patch Version
- Increment patch version when Bug fixes or Security patches are applied to the Library.
With this approach we avoid patching core libraries right before the release and thereby reduce the time spent on stabilizing the services during the code tagging process.
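As a simple illustration of the strategy above (not part of the ADR; the function and change-type names are made up), the version-bump decision could look like:

```
def next_version(current: str, change: str) -> str:
    """Pick the next Core Library version from the type of change."""
    major, minor, patch = map(int, current.split("."))
    if change == "breaking":        # backward-incompatible change in interfaces or model classes
        return f"{major + 1}.0.0"
    if change == "feature":         # new methods/fields, or dependency upgrades (Spring Boot, Jackson)
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"   # bug fixes or security patches

# e.g. next_version("1.4.2", "feature") -> "1.5.0"
```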
## Consequences
- We retire the -rc* versioning strategy. We no longer create release candidates in Core Libraries.
- Every commit on the Core Library will end up creating a new version depending on the type of the change.
## Target Release
M14
## Owner
Please contact @krveduru

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-dags/-/issues/107
Process manifest task of Osdu_ingest DAG calls the Storage Service for the Dataset Id in WorkProductComponent irrespective of the outcome of validate referential integrity, resulting in creation of new record version (2022-12-12, Meghnath Saha)
It has been observed that the process manifest task of the Osdu_ingest DAG calls the Storage Service for the DatasetId in the WorkProductComponent irrespective of the outcome of the validate-referential-integrity step, resulting in the creation of a new record version. Also, in dataload_r3.py the FileId updated in the WorkProductComponent is **file_id:file_version**. As a result, the following challenges are encountered.
1. If the WorkProduct and WorkProductComponent are not processed by the Airflow Manifest Ingestion DAG due to a failure in referential-integrity validation, then the file source information used in the first attempt cannot be used to reprocess the manifest, because the file version in the file source JSON is no longer the latest and the referential-integrity validation fails when it is reused. As a result, ingestion of the WPC is tightly coupled to the upload of Datasets, which generates the File Source information used by the dataload_r3.py script to replace the surrogate key in the WPC manifest.
2. open-test-data/rc--3.0.0/4-instances/TNO/work-products/markers/*.json and open-test-data/rc--3.0.0/4-instances/TNO/work-products/markers_1_1_0/*.json use the same dataset, as do open-test-data/rc--3.0.0/4-instances/TNO/work-products/'well logs'/*.json and open-test-data/rc--3.0.0/4-instances/TNO/work-products/'well logs_1_1_0'/*.json. The same is the case with the manifests for Volve. Because of the current behavior of the DAG and dataload_r3.py described above, the same File Source information generated from the upload of Datasets cannot be reused. As a workaround, the dataset files in s3://osdu-seismic-test-data/r1/data/provided/markers/ are copied into two different directories, markers and markers_1_1_0, so that the files are uploaded separately, generating unique FileIds. Similarly for well logs.
@anujgupta @shrikgar @sukanta.bhattacharjee @aparial FYI

https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-cpp-lib/-/issues/16
Using VCPKG manifest file (2023-06-28, Pavel Kisliak)
Can we think about starting to use [VCPKG manifests](https://vcpkg.readthedocs.io/en/latest/specifications/manifests/)?
I see that VCPKG is already used for the Windows build, and I hope this can be unified, because VCPKG is available for Windows/Linux/Mac.
In addition to helping get a first build faster, there are other benefits:
- It helps avoid interference with globally installed libraries across different projects.
- The manifest allows pinning specific versions of third-party libraries.
- It reduces CMake complexity by removing constructs like "if (WIN32) else" for linking dependencies.
- It is a step toward publishing 'seismic-store-cpp-lib' to VCPKG.
As a starting point, I've prepared a branch with a [VCPKG manifest](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-cpp-lib/-/commit/9ee956e09612cb4f69d174147db2d040040656ef).
I have not yet added the libraries **aws-sdk-cpp** and **google-cloud-cpp** because the linking in the CMake still needs to be fixed.
How to build:
```
cmake -B build -DCMAKE_TOOLCHAIN_FILE=~/vcpkg/scripts/buildsystems/vcpkg.cmake -DVCPKG_FEATURE_FLAGS=versions
cmake --build build --config Release
```
[Please correct the path to your installed VCPKG.]
The VCPKG approach should work on all platforms, but currently there are a few issues that need to be fixed.
To make life easier with Visual Studio, I've also prepared a "CMakeSettings.json" file that allows you to just use the "Open Folder" command from VS and build without any other configuration steps.
(VCPKG just needs to be installed and its path added to the VCPKG_ROOT environment variable.)
One thing to keep in mind: VCPKG does not officially support dynamic linkage on Linux, which is related to system-provided libraries ([more info](https://github.com/microsoft/vcpkg/issues/15006)). However, a community-supported triplet exists, which can be used at our own risk.
As I am new to OSDU and to this particular library, please point out any other impediments you see.
Edited 1/26/2023: Btw, the same work was done for [Reservoir/Open-ETP-server](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/reservoir/open-etp-server/-/issues/30)

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/73
Indexer fails to correctly parse properties with special characters (2022-08-23, An Ngo)
For example:
```
"SpatialArea": {
"Wgs84Coordinates": {
"features": [
{
"geometry": {
"type": "Point",
"coordinates": [
2.2863,
61.198685
]
},
"properties": {
"id": "a:b"
},
"type": "Feature"
}
],
"type": "FeatureCollection"
}
}
```
Indexer fails to parse the properties id whose value contains a colon.

https://community.opengroup.org/osdu/platform/system/storage/-/issues/128
Data store location is not appended to legal tag ORDC of record (2022-10-26, An Ngo)
Upon creating a record, the data store location/country is expected to be appended to the ORDC (Other relevant data countries) list.
This is not the current behavior.
```
"otherRelevantDataCountries": [
"VN"
]
```
Here, "VN" was provided when creating the record. Upon record creation, the system is supposed to append "US" (US environment partition), "BE" (EU) or "NL" (WEU), etc..https://community.opengroup.org/osdu/platform/deployment-and-operations/infra-gcp-provisioning/-/issues/12On-Prem OSDU Reference Implementation with MongoDB support2022-06-09T12:49:52ZRobert OberhoferOn-Prem OSDU Reference Implementation with MongoDB supportWe would like to propose the support of MongoDB as the Meta-Data Storage implementation for the OSDU On-Prem reference deployment. As a contributing member of OSDU (since late 2019), MongoDB would be happy to support this effort.

https://community.opengroup.org/osdu/platform/deployment-and-operations/infra-gcp-provisioning/-/issues/12
On-Prem OSDU Reference Implementation with MongoDB support (2022-06-09, Robert Oberhofer)
We would like to propose the support of MongoDB as the Meta-Data Storage implementation for the OSDU On-Prem reference deployment. As a contributing member of OSDU (since late 2019), MongoDB would be happy to support this effort.
The benefits of using MongoDB in this context are:
- Native storage and query of JSON objects
- Ability to directly manipulate JSON objects at the field level
- Document model supports flexible representation of data-types, e.g. Key-value, Graph (useful for Entitlements), Tabular, Geo Spatial, Time-Series
- Document sizes of up to 16 MB are supported for consistent performance
- Ability to shield operational workloads from analytics queries ([workload isolation](https://www.mongodb.com/docs/manual/core/workload-isolation/))
- Flexible deployment options (on-prem, hybrid, cloud)
- Scaling from lightweight single cluster to multi-region clusters with replication
- Available as open-source (community), supported (Enterprise Advanced) and SaaS (e.g. MongoDB Atlas, available across all major cloud providers, e.g. Google, Azure, AWS)
@Nieten , @Kateryna_Kurach , @jgschmitz1965
[Robert Oberhofer, MongoDB]

https://community.opengroup.org/osdu/platform/data-flow/ingestion/energistics/witsml-parser/-/issues/59
Support of V1.4 data files (2022-05-07, Debasis Chatterjee)
Time and again we are hearing that most data files in the real world still use V1.4. So, there is a need to support the older version 1.4 over and above the current support for V2.0.
See the note from TotalEnergies - 6-May-2022.
TotalEnergies – access to representative test data in WITSML V2.0 format (that is what is supported by the Parser today).
- (most of) our suppliers still use 1.4.1 so we do not have WITSML v2.0 data available
cc - @epeysson , @chad , @Keith_Wall , @jean_francois.rainaud

https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/57
Utilizing Standard Pipelines (2023-03-24, David Diederich)
I'd like this project to consider merging your CI pipeline work with the osdu/platform/ci-cd-pipelines> project, and utilizing more jobs via includes rather than local CI config.
### Some Reasons to Consider
**Copy/paste code is hard to keep maintained**
Most of your CI logic appears to have started as a copy/paste from the main repository, anyway.
But keeping it local means that developers need to update changes in multiple places, and when they're working on the improvements they don't have your use case in mind.
This included some recent developments to get the dev2 environment going, but it also includes the changes to the FOSSA scanning -- you're still using an older, unmaintained image for the scanning.
And, when I did the changes, I worked test examples for maven and pip, the two supported build systems.
If npm had been there, I would have had it in mind.
**You miss new pipeline developments**
I'm moving pieces of the release management scripts into the pipeline to make more aspects of the tagging process happen automatically from branch creation.
For now, it's only dependency scanning data, but upgrades are planned to do more stages from there.
The GitLab Ultimate scanners check for security vulnerabilities, and the InfoSec team utilizes these results to plan their work.
These scanners aren't running on your project, but would be if the appropriate CI configuration were included -- or at least, we'd see what needs to be improved for those scanners to function if they don't work out of the box.
**Your improvements aren't available to others**
Any improvements you make to the CI process after you've copied it remains in your local repository.
Others could benefit from having this available in a common location.
Supporting another language gives future OSDU projects more capabilities right at the start.
You'd even get to define the basic processes for these.
### Open to Discussion
I'd like to hear more about how the custom pipelines came to be, and if they are serving a need that can't be generalized.
For steps that are truly custom and unique to your project, it makes sense to have them as local CI config files.
If we do decide to start using more of the standard pipeline logic, I think we'll need to implement it slowly, a piece at a time.
Of course, if you think a big bang MR is better, I'd consider that, too.
Thank you in advance for your thoughts.

https://community.opengroup.org/osdu/platform/system/storage/-/issues/120
Inconsistent behavior of storage PUT when skipdupes is passed as true (2022-08-26, Mandar Kulkarni)
Storage PUT API has an optional query parameter called [skipdupes](https://community.opengroup.org/osdu/platform/system/storage/-/blob/master/docs/tutorial/StorageService.md#using-skipdupes)
Current behavior of storage PUT API to update existing record is:
If skipdupes is passed as true and the data and meta blocks in the input request are the same as the existing record content, then the record update is skipped.
When skipdupes is passed as true, the record update is also skipped in the scenario where the user has passed different legal, acl, or tags block content in the input request but the data and meta block content is the same as that of the existing record.
(This happens because when skipdupes is passed as true, the storage service compares only data and meta blocks of the incoming and existing records and not all the blocks in the record.)
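For illustration, a sketch of the update call in question; the host, record id, kind, and block contents are placeholders:

```
import requests

STORAGE = "https://example.osdu.host/api/storage/v2"  # placeholder host
HEADERS = {"Authorization": "Bearer <token>", "data-partition-id": "opendes"}

record = {
    "id": "opendes:doc:example-123",
    "kind": "opendes:wks:dataset--File.Generic:1.0.0",
    "acl": {"viewers": ["data.new.viewers@opendes.example.com"],   # changed acl
            "owners": ["data.new.owners@opendes.example.com"]},
    "legal": {"legaltags": ["opendes-new-tag"],                    # changed legal tag
              "otherRelevantDataCountries": ["US"]},
    "data": {"Name": "unchanged"},                                 # same data block as stored
}

# With skipdupes=true the service compares only the data/meta blocks, so this
# update is skipped even though the acl and legal blocks differ from the stored record.
resp = requests.put(f"{STORAGE}/records", params={"skipdupes": "true"},
                    json=[record], headers=HEADERS, timeout=30)
print(resp.status_code, resp.json())
```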
Expected behavior:
If skipdupes is passed as true, both the data and meta blocks should be compared, but if the data block is the same while the legal, acl, or tags blocks differ, then the record should still be updated. To keep the behavior in sync with the PATCH API, the record version should not be incremented when only the tags, legal, or acl blocks are changed.

https://community.opengroup.org/osdu/platform/system/schema-service/-/issues/95
x-osdu-indexing changes are breaking (2022-10-13, Thomas Gehrmann [slb])
# Context:
Indexing hints in the OSDU schemas are considered decorations and are not taken into account when schema versions are validated for 'breaking changes'.
Downstream indexing changes from any state to any other state are considered breaking changes:
* Breaking changes for the indexer: changes from `flattened` to `nested` require the re-indexing of the kind in
question.
* Consuming applications must use a different query syntax.
# How it's done today:
The process depends on human interaction (assuming OSDU well-known schemas here, but this is no different for custom
schemas):
* Stakeholders ask for an indexing behavior change, OSDU Data Definition reacts by changing the `x-osdu-indexing`
extension tag values in the schema.
* OSDU Data Definition Release notes identify the kinds, which are to be re-indexed.
* In M10 virtually all kinds had to be re-indexed
* In M11 type `reference-data--QualityDataRuleSet` requires re-indexing
* During deployment the records for the affected kinds must be re-indexed.
# Issue with current design:
Upon deployment of a new milestone (or custom schemas),
1. for all involved data-partitions, delete the index for the changed kind and trigger re-indexing. This can take - depending on the number of records per kind - a very long time and cause serious down-time.
2. Applications have no good way of understanding that the query syntax has changed. Applications may no longer find
data if they depended on queries into data structures affected by the change.
# Proposal:
## `PUBLISHED` Schema Status
1. For schemas with state `PUBLISHED`, treat changes to the `x-osdu-indexing` extension tag values in the schema as **_breaking changes_**.
2. Breaking changes require an incremented major schema version number.
3. Schema Validation Changes during schema creation:
   * Changes to the `x-osdu-indexing` extension tag values in `PUBLISHED` schemas with the same major schema version number will be **_rejected_**, i.e. the attempted registration of such a schema will fail with an error (a simplified sketch of this check follows below).
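A simplified, illustrative sketch of that validation rule (the helper names are made up; this is not the Schema service implementation):

```
def collect_indexing_hints(schema, path=""):
    """Map property path -> x-osdu-indexing value for every occurrence in the schema."""
    hints = {}
    if isinstance(schema, dict):
        if "x-osdu-indexing" in schema:
            hints[path] = schema["x-osdu-indexing"]
        for key, value in schema.items():
            hints.update(collect_indexing_hints(value, f"{path}/{key}"))
    elif isinstance(schema, list):
        for i, item in enumerate(schema):
            hints.update(collect_indexing_hints(item, f"{path}[{i}]"))
    return hints

def reject_if_indexing_changed(existing_schema, incoming_schema):
    # Same major version: any change to x-osdu-indexing values is a breaking change.
    if collect_indexing_hints(existing_schema) != collect_indexing_hints(incoming_schema):
        raise ValueError("x-osdu-indexing changed; an incremented major schema version is required")
```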
## `DEVELOPMENT` Schema Status
1. The validation for `DEVELOPMENT` status schemas for incremental versions on top of or between existing minor or patch
versions follows the same rules as for `PUBLISHED` schemas. Attempts to change the `x-osdu-indexing` extension tag
values will be **_rejected_** by the Schema service.
2. For 'single' version schemas in `DEVELOPMENT`, the updates of the `x-osdu-indexing` extension tag values are
permitted.
* It is the responsibility of the schema authors to communicate the impact to deployment and consumers. This is
expected to be acceptable during the development phase.
CC @nthakur @ChrisZhang @chad @pbehede (Milestone: M12 - Release 0.15)

https://community.opengroup.org/osdu/platform/security-and-compliance/entitlements/-/issues/104
Unclear Data Model for Groups, Users, Partitions and Roles (2022-03-31, Michael van der Haven)
The group model, the members in groups, and the partitions and roles seem to be a bit unclear. For example, it is possible to add groups as sub-groups to other groups.
It is not clear though what the effect is supposed to be on members in the parent group: are they automatically a member of the sub groups as well?
The use of the partitions is also a bit confusing: partition membership is added on a per-user basis, but what if you'd like to add a group to a partition? It is also not really clear what happens when you want to add the same user to multiple partitions in the same group with different roles.
It almost feels like group membership, which seems to matter most in the context of ACLs on the data, is being mixed with role membership, which is important for the APIs.
A clear document of the current situation would really help, because explaining the model to current customers is very hard.

https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-sdutil/-/issues/17
sdutil to handle source SegY file that is already in cloud location (2023-03-30, Debasis Chatterjee)
Imagine the SegY file is already in a cloud location. How do we get sdutil to use it as input instead of the user's own desktop and local disk?
If we can achieve that, we can have a successful end to end workflow for Seismic.
Step 1: The Data Loader uploads the SegY to a cloud location and then creates WP, WPC, Dataset, etc. records using manifest-based ingestion. This is how people work today without the Seismic DDMS.
Step 2: Run sdutil directly off the SegY file (which is already in the cloud). Next, run the converter to ZGY or VDS as needed.

https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/139
osdu_ingest - Make dataset optional inside "Data" block of JSON payload (2022-02-03, Debasis Chatterjee)
In the current structure, the "Data" block expects all 3 sections - work-product, work-product component, and then Dataset.
During a recent discussion with @todaiks and @Kateryna_Kurach, it transpired that the work-product component may simply refer to an existing Dataset record (created in a previous step), so we do not want to spend double the effort dealing with the Dataset record.
See collection 29 from Platform Validation. Steps 4c and 5a.
https://community.opengroup.org/osdu/platform/testing/-/blob/master/Postman%20Collection/29_CICD_Setup_Ingestion/R3%20Full%20manifest-based%20ingestion.postman_collection.json
![osdu_ingest-Postman](/uploads/b5b956bf197bd8ce06542fda59697a0b/osdu_ingest-Postman.PNG)