ADR: Worker Service for Wellbore Bulk Data Access
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
Currently, as of M16, Wellbore DDMS is experiencing performance challenges with WellLog operations involving large bulk data (>1 GB), especially on data reading. It was also observed that Wellbore DDMS requires a significant amount of memory in comparison to the amount of data manipulated to serve incoming requests. See issues #21 and #27.
Wellbore DDMS is composed of a single, general-purpose main service, which is responsible for handling both client-facing API requests and data access operations to the underlying bulk data store. In turn, the bulk data management implementation in WDDMS relies heavily on Dask.
For instance, the data associated with a large WellLog dataset stored in Wellbore DDMS is not located in a single parquet file, but distributed across several distinct parquet files. When a request is received to retrieve the bulk data associated with a specific subset of WellLog curves, with or without the optional reference range, Dask is used to process all the parquet files across which the queried data is stored, and to extract the cropped data corresponding to the selected curves and range from the WellLog dataset. All operations in this workflow are executed end to end in the same container for a given request.
Though the single-service approach and Dask capabilities make for a simple and straightforward deployment, previous analysis identified that this pairing places considerable limitations on Wellbore DDMS performance and scalability.
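As an illustration only, the sketch below shows how such a cropped read can be expressed with Dask; the path, curve names, and reference range are hypothetical and do not reflect the actual WDDMS storage layout. It also assumes the reference curve is the dataset index.

```python
import dask.dataframe as dd

# Hypothetical layout: one WellLog dataset split across several parquet files.
dataset_path = "gs://bulk-store/welllog-1234/*.parquet"  # placeholder URI

# Read only the requested curves (columns) from every partition.
ddf = dd.read_parquet(dataset_path, columns=["MD", "GR", "RHOB"])

# Crop to the requested reference range (reference curve assumed to be the index).
cropped = ddf.loc[1500.0:2500.0]

# Trigger the actual I/O and transformation work on the local Dask workers.
frame = cropped.compute()
```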
Trade-off Analysis
The standard Python framework already offers good support for I/O-bound operations (see asyncio); it is, however, more complex to deal with CPU-bound operations and data transformations, and Dask brings a first answer to that. For instance, when reading and writing large WellLog datasets, Dask provides a concise and straightforward way to reconcile data from multiple parquet files.
Nevertheless, while Dask appears to be a good solution for heavy computation, in most of the data query/filter scenarios supported by WDDMS, Wellbore DDMS is primarily constrained by I/O operations rather than by data transformation. Additionally, Dask proved inefficient when handling many queries involving smaller amounts of data, as its minimum required memory footprint does not scale down with these smaller data volumes.
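For context, here is a minimal sketch of how I/O-bound work can already be overlapped with plain asyncio; the fetch_blob coroutine is a hypothetical stand-in for a call to the underlying blob store, not an existing WDDMS function.

```python
import asyncio

async def fetch_blob(blob_name: str) -> bytes:
    # Hypothetical stand-in for an async call to the underlying blob store.
    await asyncio.sleep(0.1)  # simulated network latency
    return b""

async def fetch_all(blob_names: list[str]) -> list[bytes]:
    # Overlap the downloads instead of waiting for each one sequentially.
    return await asyncio.gather(*(fetch_blob(name) for name in blob_names))

if __name__ == "__main__":
    asyncio.run(fetch_all(["part-0.parquet", "part-1.parquet", "part-2.parquet"]))
```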
The Dask cluster is implemented as a process-based local cluster, which also brings several issues:
- Dask workers are internal to the pods and therefore cannot be shared with other WDDMS service instances.
- Scaling and resource requests are done indirectly through WDDMS, not through the Dask workers.
- Dask workers are actually process forks of WDDMS, which leads to unnecessary memory usage even at startup or when idle.
Finally, we spotted several memory leaks within Dask, and several memory-management-related issues are open in Dask's GitHub repository.
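To make the last points concrete, below is a minimal sketch, assuming the distributed scheduler, of how a process-based local cluster is typically started inside each pod; the worker count and memory limit are illustrative values, not the actual WDDMS settings.

```python
from dask.distributed import Client, LocalCluster

# Each WDDMS pod starts its own local cluster: the workers are forked
# processes of the service itself and are invisible to other pods.
cluster = LocalCluster(
    n_workers=2,            # illustrative value, not the real WDDMS setting
    threads_per_worker=1,
    processes=True,         # process-based workers (forks of the service)
    memory_limit="2GB",     # per-worker reservation, held even when idle
)
client = Client(cluster)
```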
Decision
Dask remains a great tool, but it does not fit the needs of WDDMS. Therefore, Dask will be removed and replaced by a new dedicated service responsible for bulk data access only, called the wddms bulk data worker service.
The wddms bulk data worker service will be specialized in bulk I/O and bulk data manipulation (transformation, filtering), while the WDDMS main service will keep all domain knowledge and responsibilities, such as metadata manipulation and consistency rules, but will delegate bulk data operations to the wddms bulk worker service.
The wddms bulk worker service will not use Dask at all. This means that the current bulk data access layer in WDDMS will not be moved as-is into the new dedicated service, but will be reworked and tailored to WDDMS specific needs.
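As a rough illustration of what a Dask-free access layer could look like, the sketch below reads and crops a multi-file WellLog dataset with pyarrow alone; the path, curve names, and range are hypothetical, and the real implementation will be tailored to WDDMS needs rather than copied from this example.

```python
import pyarrow.dataset as ds

# Hypothetical multi-file WellLog dataset: each parquet file is one chunk.
dataset = ds.dataset("/data/welllog-1234/", format="parquet")

# Read only the requested curves, cropped to the requested reference range,
# without pulling the whole dataset into memory first.
table = dataset.to_table(
    columns=["MD", "GR", "RHOB"],
    filter=(ds.field("MD") >= 1500.0) & (ds.field("MD") <= 2500.0),
)
frame = table.to_pandas()
```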
The image below illustrates, side by side, how scaling and workload distribution occur in the current and target designs. In the current implementation, an incoming request to retrieve a large amount of data is limited to the Dask worker resources of a single WDDMS pod, even though Dask workers from other WDDMS instances might be available. In the target design, by contrast, all the processing capacity of the wddms bulk worker instances is available to any WDDMS instance. That arrangement unlocks better scaling, as it is applied directly to the bulk data workers when needed.
Security Implications
In the current design, the authorization (ACL/policy) checks and the bulk data access operations in WDDMS are performed in the same service instance. Bulk data will only be served to valid users entitled to access the associated work product component record.
The changes proposed in this ADR separate the data access control layer, located in the main WDDMS service, from the bulk data access itself, located in the new wddms bulk worker service. See below the changes in communication patterns in the current vs. target design diagrams.
Allowing users or other services to directly access the wddms bulk worker service endpoint would permit bypassing the data access control checks in the main WDDMS service.
Therefore, with the new topology, additional deployment configuration settings will be required to preserve compliant and secure data access control in WDDMS:
- the wddms bulk worker service must not be accessible from the external network
- the wddms bulk worker service will only accept requests from WDDMS main service instances
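Purely as an illustration of the second constraint, and not a decided mechanism (network-level isolation remains the primary control), the sketch below shows how a worker endpoint could additionally verify a deployment-injected internal token; FastAPI is assumed only because WDDMS is a Python service, and the header name, token handling, and route are hypothetical.

```python
from fastapi import Depends, FastAPI, Header, HTTPException

app = FastAPI()

# Hypothetical shared secret injected into both deployments via configuration.
INTERNAL_TOKEN = "replace-with-secret-from-deployment-config"

def require_internal_caller(x_internal_token: str = Header(default="")) -> None:
    # Reject any request that does not carry the token known to the main service.
    if x_internal_token != INTERNAL_TOKEN:
        raise HTTPException(status_code=403, detail="internal endpoint only")

@app.get("/bulk/{record_id}", dependencies=[Depends(require_internal_caller)])
async def read_bulk(record_id: str):
    # Placeholder handler standing in for the actual bulk read logic.
    return {"record_id": record_id}
```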