ADR - Project & Workflow Services - Application Integration
Decision Title
This ADR focuses on how applications would integrate with Project & Workflow Services (PWS).
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
The completion of a workflow requires the sharing of data across a number of applications. At present, data sharing between applications can lead to the following issues:
- The latest data sets for an in-progress project are scattered across various unmanaged storage spaces.
- Data exported by users to an unmanaged storage space to work on is often left outside the managed data store, causing unmanaged growth of storage usage.
- Data saved in personal unmanaged storage space is not available to other users.
- The ability to add notes and annotations to interpreted data, so that it can be referenced in the future, is missing from current solutions.
- The owner, lineage, audit trail and status of data in an unmanaged storage space are often unknown.
As applications drive the creation of data, PWS must provide the methods for the application to interact directly with PWS functionality.
This ADR addresses how applications will integrate with PWS.
Decision
Collaboration Service, Collaboration Context, Core Services and DDMSs
- A Namespace (Collaboration Context) will have a 1:1 relationship with a project.
- A collaboration context is composed of a Project ID, Namespace and a minimum set of ACLs and Legal Tags (see the sketch after this list).
- A collaboration context will be generated by the Collaboration Service.
- Data Platform Core APIs services including Search, Storage, Index and Notifications will be extended to take into account the Collaboration Service and Collaboration Context (for example, the Storage API will be extended to leverage the Shadow Record pattern to write to the Collaboration Project Data Collection).
- DDMSs will need to be updated to take into account the Collaboration Service and Collaboration Context.
- Notifications will be triggered by the Collaboration Service and the logic for when and how the notifications will be triggered is TBC.
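To make the concepts above concrete, the sketch below shows one possible shape of a collaboration context and how it could be passed to collaboration-aware core services. All field names, the header format and the token/endpoint conventions are illustrative assumptions, not the agreed contract.

```python
import uuid

# Illustrative shape of a collaboration context as described above:
# a Project ID, a Namespace, and a minimum set of ACLs and Legal Tags.
# All field names are assumptions for illustration only.
collaboration_context = {
    "projectId": "osdu:project:demo-field-study",
    "namespace": str(uuid.uuid4()),          # 1:1 with the project
    "acl": {
        "viewers": ["data.project.demo.viewers@example.com"],
        "owners": ["data.project.demo.owners@example.com"],
    },
    "legal": {"legaltags": ["example-public-usa-dataset"]},
}

# A core service extended for collaboration awareness might accept the
# context as a request header or parameter, for example (hypothetical):
headers = {
    "Authorization": "Bearer <token>",
    "data-partition-id": "opendes",
    "x-collaboration": f"id={collaboration_context['namespace']},application=example-app",
}
```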
Application interactions
- Applications will use the Collaboration Context to read/write data to a Collaboration Project Data Collection (CPDC).
- Applications are expected to write data back to a Collaboration Project Data Collection when WIP data is ready to share with project team members or when the data is ready to be published. The expectation is not for an application to write every change or update to the Collaboration Project Data Collection.
- If an application uses a data store outside of OSDU (i.e., works offline), the data only needs to be sent to a Collaboration Project Data Collection when it is ready to be published or shared.
- When an application reads data from a Collaboration Project Data Collection, it must be able to read and maintain the metadata associated with the file.
- When an application writes data back to a Collaboration Project Data Collection, it must be able to 1/ maintain the existing metadata, 2/ add new metadata to the file, 3/ maintain the lineage, and 4/ maintain the legal tags.
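The following sketch illustrates these interactions under stated assumptions: a hypothetical Storage endpoint, an assumed x-collaboration header carrying the Collaboration Context, and illustrative field names (ancestry, Remarks). It is a minimal example of reading from and writing back to a CPDC while preserving metadata, lineage and legal tags, not a definitive client implementation.

```python
import copy
import requests

BASE = "https://osdu.example.com/api/storage/v2"   # hypothetical endpoint
HEADERS = {
    "Authorization": "Bearer <token>",
    "data-partition-id": "opendes",
    "x-collaboration": "id=<collaboration-context-id>,application=example-app",  # assumed header
}

def read_record(record_id: str) -> dict:
    """Read a record from the CPDC, including its existing metadata."""
    resp = requests.get(f"{BASE}/records/{record_id}", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

def write_back(source: dict, new_data: dict, note: str) -> dict:
    """Write an updated record back, keeping metadata, lineage and legal tags."""
    record = copy.deepcopy(source)
    record["data"].update(new_data)                       # 2/ add new metadata
    record.setdefault("ancestry", {}).setdefault("parents", []).append(
        source["id"]                                      # 3/ maintain lineage
    )
    record["data"]["Remarks"] = note                      # annotation kept with the data
    # 1/ existing metadata and 4/ legal tags are carried over unchanged in `record`
    resp = requests.put(f"{BASE}/records", json=[record], headers=HEADERS)
    resp.raise_for_status()
    return resp.json()
```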
Business Process Overview - Block Diagram
Before we get into the details of each block of the diagram, it is important to define Work in Progress (WIP) data, as it is the core of the design.
Definition of Work in Progress data
The OSDU P&WS service uses a different approach by clearly defining System of Record (SoR) data and Work in Progress data and keeping both in the same storage system. Differentiation between SoR data and WIP data is achieved through the addition and completeness of metadata for newly generated data in a technical workflow. How is this supposed to work?
Referring to the numbering in Figure 1, the following steps illustrate the proposed approach:
- Every Collaboration Project (CP) starts with an Initiation phase (yellow box 1) that includes the generation of an empty Collaboration Project Data Collection (CPDC - 1.2).
- A technical workflow starts with a selection of input data from the System of Record (2.1) that is added to the CPDC.
- Newly generated data in a workflow is by default initially Work in Progress (WIP) data. (In a technical workflow it should be very easy and frictionless to generate new data. Not all data generated, however, will be useful to keep and store as a record. Therefore, we should not automatically define new data as a record.)
- Applications that generate this WIP data interact via the CP Data Management Service. Multiple versions can be created without inhibiting the creativity of the users. The CP Data Management Service offers full CRUD (Create, Read, Update, Delete) functionality for WIP data (2.3). It is expected that applications add lineage and other relevant metadata to WIP data automatically, but the CP Publication services (3) will offer this functionality as well.
- The publication services are an essential part of the P&WS concept. When WIP data is ready to be passed on to other workflows or applications, the requirement is that this can only happen after the selected WIP data is first declared a record (3.1.1). This is to ensure that every workflow uses the authoritative SoR.
- Generated WIP data has to become authoritative by adding assurance metadata as the mechanism for this differentiation. Assurance metadata defines the trust level. OSDU allows this to be generic as well as specific, with additional assurance labels stating for what purposes the data can and cannot be used (3.1.3).
This means that the P&WS must impose minimum metadata requirements before WIP data can be published to the SoR. With services provided to connect with the assurance metadata framework to label WIP data as SoR data, applications have the option to use the P&WS as a mechanism to enhance data labelling before it is ingested into OSDU, provided they connect to a defined Collaboration Project.
After selected WIP data has been assured and published to the SoR, there will still be other WIP data in the CPDC. At any time it will be possible to delete WIP data, but if a user does not do so, a non-record disposal (NRD) mechanism could be developed as well. The P&WS will not allow any functionality to delete SoR data; in line with OSDU data principles, that data is immutable by default.
Based on the initial definition of the project duration, it will be possible to set up a time-bound notification indicating when WIP data can be deleted or purged. When the duration is extended, this notification is adjusted as well. Alternatively, company policies can set this NRD time window.
1 Initialise Collaboration Project
A Project Admin triggers this process. There are mainly two sub-processes here, and OSDU needs to support them by providing APIs that enable these sub-processes.
1.1 Create Collaboration Project (CP)
The CP is a data type that comprises the top-level configuration information about the Collaboration Project. The key configuration information is below:
- Default ACLs and Legal Tags (LTs). During the lifetime of the CP, thousands of temporary datasets may be created. Assigning an ACL and LT individually to each of them would be tedious and, as most of this data is temporary, probably overkill. Therefore, a default ACL and LT are specified at the CP level. Then, when temp data is created, these ACLs/LTs are auto-assigned to that data unless an ACL/LT is explicitly supplied during data creation.
- Scope, Objective, Timeline: these are standard characteristics of projects in general, and are also relevant to a CP, as they help define why the CP exists
- Status: can be Open or Closed.
- CP ACLs. In contrast to the above, this ACL contains the users that are allowed to access the CP. For instance, there would be a certain set of users allowed to create data within the CP, and another set of users allowed to manage the CP itself, such as triggering the Publish process or deleting data. Currently, we need at least a “Project Admin” ACL and a “Project Contributor” ACL.
In this process step (1.1), we need APIs that support the creation of this CP.
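A hedged sketch of such an API call is shown below. The PWS_BASE endpoint, the request body field names and the response shape are all assumptions for illustration; the real contract is to be defined as part of this ADR's implementation.

```python
import requests

PWS_BASE = "https://osdu.example.com/api/pws/v1"   # hypothetical PWS endpoint

collaboration_project = {
    "name": "Demo Field Study",
    "scope": "Re-interpretation of Block 11 seismic",
    "objective": "Deliver an updated velocity model",
    "timeline": {"start": "2024-01-01", "end": "2024-06-30"},
    "status": "Open",
    # Defaults auto-assigned to WIP data created without an explicit ACL/LT
    "defaultAcl": {
        "viewers": ["data.demo.viewers@example.com"],
        "owners": ["data.demo.owners@example.com"],
    },
    "defaultLegalTags": ["example-private-default"],
    # ACLs governing the CP itself
    "projectAdminAcl": "users.demo.project-admin@example.com",
    "projectContributorAcl": "users.demo.project-contributor@example.com",
}

response = requests.post(
    f"{PWS_BASE}/projects",
    json=collaboration_project,
    headers={"Authorization": "Bearer <token>", "data-partition-id": "opendes"},
)
response.raise_for_status()
project_id = response.json()["projectId"]   # assumed response shape
```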
1.2 Create Collaboration Project Data Collection (CPDC)
This process creates the CPDC which is the key container for data within a CP. It consists of two sections:
- References to SoR: this is a list of record references to data from the SoR. The CP is not allowed to modify these records, though it can generate new versions of them and store those within the WIP data section.
- WIP Data: also known as temp data, this is the set of references to data that is created within the CP. Normally these data references should only be visible within the same CP and not outside it.
In this process step (1.2), APIs should support the creation of the CPDC along with the two sections mentioned. These sections will continuously be modified (i.e. records added into them) as the project goes on.
It is expected that a single CP will have only a single CPDC – however, that assumption should not be baked into the design/implementation as a limitation.
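For illustration only, a CPDC could be represented along the following lines; the field names are assumptions and the record IDs are made up.

```python
# Illustrative shape of a CPDC with its two sections. Field names are assumptions.
cpdc = {
    "projectId": "osdu:project:demo-field-study",
    # Section 1: read-only references to System of Record data
    "sorReferences": [
        "opendes:work-product-component--SeismicTraceData:abc123:1618328400000",
    ],
    # Section 2: WIP ("temp") data created within the CP, visible only inside it
    "wipData": [
        "opendes:work-product-component--SeismicHorizon:wip-001",
    ],
}
```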
2 Execution
Most of the CP’s lifecycle will be spent in this process. There are three subprocesses, which may occur in any order through the course of the CP. To support this, the existing APIs in OSDU (Storage, Search, Notification) need to be modified.
2.1 Add SoR References
During the CP’s lifetime, Project Contributors may add new SoR References (including references to records in a DDMS) at any time. To avoid additional complexity, removal of SoR References is not considered right now. It is assumed that an SoR reference, once added, remains in the CPDC until the end of the CP.
If another Open project has already added the same SoR reference, then the IDs of those CP(s) should be included in the API response. Alternatively, a notification can be raised to flag this fact. Project teams may then choose to coordinate as needed, though this part will (currently) lie outside the knowledge/control of OSDU.
This API is only callable while the CP is in the Open state; calling it on a Closed CP throws an error.
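A minimal sketch of such a call, assuming a hypothetical /sor-references endpoint, an assumed alsoReferencedBy response field for overlapping Open CPs, and a 409 status when the CP is Closed:

```python
import requests

PWS_BASE = "https://osdu.example.com/api/pws/v1"   # hypothetical endpoint
HEADERS = {"Authorization": "Bearer <token>", "data-partition-id": "opendes"}

def add_sor_reference(project_id: str, record_id: str) -> list[str]:
    """Add an SoR reference to the CPDC; return the IDs of other Open CPs
    that already reference the same record (assumed response field)."""
    resp = requests.post(
        f"{PWS_BASE}/projects/{project_id}/sor-references",
        json={"recordId": record_id},
        headers=HEADERS,
    )
    if resp.status_code == 409:
        raise RuntimeError("Collaboration Project is Closed; references cannot be added")
    resp.raise_for_status()
    return resp.json().get("alsoReferencedBy", [])
```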
2.2 Search CPDC
It is important that functionality exists to catalog and search all the WIP data in a CP. The current approach is that the CP Data Management Service includes a cataloguing function for all this WIP data.
The standard Search API needs to be modified to take the ID of the CPDC as an optional parameter. If this parameter is supplied, the Search API needs to narrow its search scope to include only the contents of the CPDC (both of its sections).
If the parameter is not supplied, then the search will include only the contents of the SoR. Any records that are in the WIP section of any CPDC will be excluded from the search scope.
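A hedged sketch of the modified search call; the collaborationProjectDataCollectionId parameter name is an assumption, and the endpoint and body mirror a typical OSDU search request for illustration only.

```python
import requests

SEARCH_URL = "https://osdu.example.com/api/search/v2/query"   # illustrative endpoint
HEADERS = {"Authorization": "Bearer <token>", "data-partition-id": "opendes"}

def search(query: str, cpdc_id: str | None = None) -> dict:
    """Search the SoR by default; narrow the scope to a single CPDC when cpdc_id is given."""
    body = {"kind": "*:*:*:*", "query": query, "limit": 100}
    if cpdc_id is not None:
        body["collaborationProjectDataCollectionId"] = cpdc_id   # assumed optional parameter
    resp = requests.post(SEARCH_URL, json=body, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()
```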
2.3 Manage WIP Data
The standard Storage API needs to be modified to take the ID of the CPDC as an optional parameter. If this parameter is supplied, the Storage API needs to do the following:
- For write API requests (only callable by CP Data Contributors), write record references into the CPDC and populate ACLs and LTs. If an ACL/LT was supplied in the API request, apply it; if not, pick the default ACL/LT from the CP config and apply it to the record. This API is only callable while the CP is in the Open state; calling it on a Closed CP throws an error.
- For read API requests, the current status quo behaviour should still work, as it is simply a read of the record based on the ACL/LT mentioned on the record
- There should also be a Hard Deletion API that hard deletes a WIP record and its associated data files from blob storage, as long as no descendant data records exist. If descendants exist, the hard delete should throw an error. This API is only callable while the CP is in the Open state; calling it on a Closed CP throws an error.
- Some operators may prefer an “archival” to a colder storage tier instead of a hard delete; this choice should be possible via the API.
If the CPDC parameter is not supplied, then the current status quo behaviour will apply.
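The sketch below illustrates the two behaviours described above (default ACL/LT assignment and hard deletion guarded by a descendant check). The record and CP config shapes are assumptions; a real implementation would live inside the extended Storage/PWS services.

```python
def resolve_acl_and_legal(record: dict, cp_config: dict) -> dict:
    """Apply the CP defaults when the caller did not supply an ACL / Legal Tag."""
    record.setdefault("acl", cp_config["defaultAcl"])
    record.setdefault("legal", {"legaltags": cp_config["defaultLegalTags"]})
    return record

def hard_delete(record_id: str, wip_records: list[dict], cp_status: str) -> None:
    """Hard delete a WIP record only if the CP is Open and no descendants exist."""
    if cp_status != "Open":
        raise PermissionError("Collaboration Project is Closed; deletion is not allowed")
    descendants = [
        r["id"] for r in wip_records
        if record_id in r.get("ancestry", {}).get("parents", [])
    ]
    if descendants:
        raise ValueError(f"{record_id} has descendants {descendants}; hard delete refused")
    # ... remove the record reference from the CPDC and its files from blob storage
    #     (or move them to a colder storage tier if the operator prefers archival) ...
```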
3 Publish
Once all the needed work in a workflow is accomplished, we are left with artifacts in the WIP section of the CPDC, which either need to be published (immediately or at a later date) or discarded. The “Ref to SoR” part of the CPDC is already in the SoR, and nothing needs to be done for those records.
The Publish process can occur several times during the life of a project.
3.1 Prepare to publish
3.1.1 Select the WIP datasets which need to be published.
3.1.2 The Publish Service needs to recursively identify the predecessor records of the selected WIP datasets.
3.1.3 Each of those datasets needs to go through an assurance and QC process. Assurance will be done using the Assurance Model by a separate app; from a PWS perspective, we only need to know whether the record(s) are assured or not. The outcome of this process is a “Ready-To-Publish List” (see the sketch after this list).
3.1.4 Because these datasets will end up in the SoR, we need to assign the right ACLs and legal tags to them. This is needed because, while they were part of the WIP section, only a default or simple ACL or legal tag may have been attached.
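A minimal sketch of steps 3.1.2 and 3.1.3, assuming that lineage is carried in an ancestry.parents list of plain record IDs and that assurance status is exposed through an illustrative QualityAssuranceStatus field:

```python
def collect_predecessors(record_id: str, records_by_id: dict[str, dict]) -> set[str]:
    """Identify a selected WIP record plus all of its predecessor records (3.1.2).
    `ancestry.parents` is assumed to hold plain record IDs in this sketch."""
    selected, stack = set(), [record_id]
    while stack:
        rid = stack.pop()
        if rid in selected or rid not in records_by_id:
            continue                      # skip already-visited or non-WIP records
        selected.add(rid)
        stack.extend(records_by_id[rid].get("ancestry", {}).get("parents", []))
    return selected

def ready_to_publish(selected: set[str], records_by_id: dict[str, dict]) -> list[str]:
    """Keep only records whose (assumed) assurance field marks them as assured (3.1.3)."""
    return [
        rid for rid in selected
        if records_by_id[rid].get("data", {}).get("QualityAssuranceStatus") == "Assured"
    ]
```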
3.2 Detect conflicts
There are two kinds of datasets in WIP: 1/ newly created datasets with no linkage to the SoR, and 2/ modifications of SoR datasets that create a new version of them, or WIP datasets derived from an SoR dataset.
For the first kind, as they are new datasets, there won’t be any conflicts and they are ready to be published to the SoR.
For the second kind, there can be a few implications:
- We started with an SoR dataset and modified it while working on it in a workflow that is part of our collaboration project. This created a new WIP item in our CPDC. However, the parent SoR dataset may also have been picked up and modified in another collaboration project, creating a new WIP item in their CPDC. This results in a conflict between the two WIP items, as they are derived from the same parent, and it needs to be resolved.
- There can also be a scenario where we started with an SoR dataset and modified it in our collaboration project, creating a new WIP item in the CPDC, but in the meantime the parent SoR dataset is modified outside our collaboration project and a new version becomes available in the SoR. Again, there is a conflict for our WIP item that needs to be resolved before it can be published as a newer version of the parent SoR dataset.
So, as an outcome of this step, there will be a “Conflicts List” containing the WIP items that need to be resolved before they can be published to the SoR.
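The following sketch shows one way the Conflicts List could be derived for the two scenarios above. The derivedFrom structure and the inputs (latest SoR versions, other CPs referencing the same parent) are assumptions for illustration.

```python
def detect_conflicts(wip_items: list[dict], sor_versions: dict[str, int],
                     other_cp_parents: dict[str, list[str]]) -> list[dict]:
    """Build the Conflicts List for WIP items derived from an SoR parent.
    `sor_versions` maps SoR record IDs to their latest version; `other_cp_parents`
    maps SoR record IDs to other Open CPs that also derived WIP data from them.
    All structures are assumptions for illustration."""
    conflicts = []
    for item in wip_items:
        parent = item.get("derivedFrom")          # e.g. {"id": ..., "version": ...}
        if not parent:
            continue                              # newly created data: no conflict possible
        reasons = []
        if sor_versions.get(parent["id"], parent["version"]) != parent["version"]:
            reasons.append("parent SoR record has a newer version")
        if other_cp_parents.get(parent["id"]):
            reasons.append(f"also modified in CP(s): {other_cp_parents[parent['id']]}")
        if reasons:
            conflicts.append({"wipId": item["id"], "reasons": reasons})
    return conflicts
```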
3.3 Resolve Conflicts
There can be different ways of dealing with the conflicts list:
- Ignore the conflict and publish. We need to decide whether this should be allowed, or whether a guardrail should be in place to block publishing of datasets which have conflicts.
- Use notifications to inform the involved parties about conflicts and resolve them manually.
- Use tools/automation scripts to resolve conflicts automatically when possible. We do not foresee this for the initial MVP releases; if it is really needed in the future, we can work on it. Eventually, each conflict needs to be resolved so that these items are also ready to be published to the SoR.
3.4 Update SoR
Once all the items are ready to be published, they are published to the SoR by removing the namespace tags from those items.
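A minimal sketch of that promotion step, assuming the namespace tag is carried in an illustrative collaborationNamespace field on each record:

```python
def publish_to_sor(ready_to_publish: list[dict], namespace: str) -> list[dict]:
    """Promote assured WIP records to the SoR by stripping the CP namespace tag.
    The tag placement (`collaborationNamespace`) is an assumption for illustration."""
    published = []
    for record in ready_to_publish:
        record = dict(record)
        record.pop("collaborationNamespace", None)   # removing the tag makes it SoR data
        published.append(record)
    return published
```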
4 Project Closure
This is triggered by the Project Admin when the CP is to be closed. It consists of three sub-processes, which need APIs:
4.1 Publish data to SoR
This is already covered in Section 3.
4.2 Delete remaining WIP Data
After all required data has been published via sub-process 4.1, the next step is to hard delete all remaining WIP data. This should reuse the same hard deletion API mentioned in sub-process 2.3.
4.3 Update Project Config
Clean up the CP by updating the status of the project to “Closed” and generating a CP Closed notification. Also empty the CPDC by either removing all WIP and SoR References, setting them to a status such as “Closed” so that other CPs no longer consider them for conflicts, or setting the CPDC to be read-only.
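A hedged orchestration sketch of the closure sequence; the pws client and all of its methods are hypothetical stand-ins for the APIs described in sections 2 and 3.

```python
def close_project(project_id: str, pws) -> None:
    """Orchestrate closure of a Collaboration Project (4.1-4.3).
    `pws` is a hypothetical client exposing the APIs described earlier in this ADR."""
    pws.publish_remaining_wip(project_id)            # 4.1 publish required data to the SoR
    for wip_id in pws.list_wip(project_id):          # 4.2 hard delete remaining WIP data
        pws.hard_delete(project_id, wip_id)
    pws.update_project_status(project_id, "Closed")  # 4.3 update config and notify
    pws.send_notification(project_id, event="CPClosed")
    pws.set_cpdc_read_only(project_id)
```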
Appendix A: When to update PWS
Applications do not necessarily have to save every single record/data update into PWS. For reasons of efficiency, cost, and performance, applications or users may choose to ingest data to PWS only when the data has attained a certain level of maturity and/or is ready to be shared with other users/applications.
Appendix B: Offline Applications
For the purposes of this section, an “Offline” App is one that participates in a business workflow and handles temporary data, but does not interact with the PWS or with OSDU as a SoR/System of Engagement. As a result, PWS has no knowledge of or control over the data or processing done by such applications.
This situation can exist for various reasons, such as the app not supporting integration with OSDU (due to technical or other limitations) or because OSDU does not have the necessary data types or APIs for this.
Such applications can and will function outside the purview of OSDU and PWS. It will be a decision of the operators whether and how such data can be pushed back to OSDU. If the operator chooses to do so, they would use the same APIs described above to manage that integration, and OSDU/PWS would only be aware of the workflows as long as the data is managed within PWS. To facilitate this process, operators can consider an Anti-Corruption Layer or Façade Layer which mediates between the application and OSDU; these in turn may be custom built or marketplace solutions. This layer would need to adhere to OSDU standards (e.g., REST API integration) to integrate with PWS.
Below is a diagram which depicts how an anti-corruption layer can facilitate the communication between an offline app and P&WS:
Anti-Corruption Layer components
Facade
Offline applications can be of different architecture types and may have different ways of storing and sharing their application data. So, we need a facade layer which wraps the functionality of the offline apps and helps them connect to the anti-corruption layer.
Adapter
The adapter component is responsible for converting the offline app domain model to the OSDU domain model. This includes schema mapping. It can also add missing metadata to the application datasets before they can be added to the CPDC. So, the adapter is both application- and OSDU-aware and helps convert application data into a form acceptable to P&WS. The adapter may need a translator component for schema mapping or other translations.
Translator
The translator component can be used to translate offline application domain model concepts to OSDU model concepts. This can include schema mapping and other needed translations.
API Mapper
The API Mapper layer is used to call the needed OSDU APIs from the anti-corruption layer, so we do not implement any new APIs in the anti-corruption layer for OSDU.
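The sketch below shows how the four components could fit together; all class names, method names and the pws_client are illustrative assumptions rather than a prescribed design.

```python
class Translator:
    """Translates offline-app domain concepts to OSDU concepts (schema mapping)."""
    def __init__(self, field_map: dict[str, str]):
        self.field_map = field_map

    def translate(self, app_record: dict) -> dict:
        return {self.field_map.get(k, k): v for k, v in app_record.items()}


class Adapter:
    """Converts offline-app data to OSDU records and fills in missing metadata."""
    def __init__(self, translator: Translator, default_legal: list[str]):
        self.translator = translator
        self.default_legal = default_legal

    def to_osdu_record(self, app_record: dict) -> dict:
        data = self.translator.translate(app_record)
        return {"kind": "example:wks:work-product--Document:1.0.0",   # illustrative kind
                "legal": {"legaltags": self.default_legal},
                "data": data}


class ApiMapper:
    """Calls the existing PWS/OSDU APIs; no new APIs are implemented in this layer."""
    def __init__(self, pws_client):
        self.pws_client = pws_client          # hypothetical client for the APIs above

    def add_to_cpdc(self, project_id: str, record: dict) -> str:
        return self.pws_client.write_wip_record(project_id, record)


class Facade:
    """Single entry point the offline application talks to."""
    def __init__(self, adapter: Adapter, api_mapper: ApiMapper):
        self.adapter = adapter
        self.api_mapper = api_mapper

    def share(self, project_id: str, app_record: dict) -> str:
        return self.api_mapper.add_to_cpdc(project_id, self.adapter.to_osdu_record(app_record))
```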
Sequence Diagram
TO_DO...
- add next layer of details to the diagram e.g., API calls
- draft diagrams for 1/ publish to SoR and 2/ deletion of data from the collaboration context