Nick, I've been offline for the last week and a half; I'll get you my new email.
Yes, it's linked in the text now.
@Kateryna_Kurach @Siarhei_Khaletski @shrikgar this is a working doc to develop ideas around performance benchmarking.
The purpose of this WIP issue is to plan how we are going to benchmark ingestion performance. We need to address the performance of the ingestion mechanisms. Expected performance in a production environment is in excess of 33k records per minute - wells, well logs, trajectories, etc.
Issues:
- Performance changes since the M6 load testing issue
Testing the ingestion of 500-, 1,000-, and 50,000-record manifests.
Uses synthetic manifests to perform basic testing of the ingestion. Load testing is run by pre-shipping for each release, one release in arrears, which means MX is tested during the development cycle of M(X+1). A spreadsheet showing the pass/fail status and timing per CSP is provided by pre-shipping at the conclusion of the test.
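As a sketch of what generating a synthetic manifest could look like (the record shape, `kind` string, and field names below are illustrative assumptions, not the actual OSDU manifest schema - only the record count matters for the load test):

```python
def make_synthetic_manifest(n_records, kind="osdu:wks:master-data--Well:1.0.0"):
    """Build a manifest containing n_records placeholder records.

    The layout is a simplified stand-in for a real OSDU manifest;
    for basic load testing only the record count is significant.
    """
    return {
        "kind": kind,
        "MasterData": [
            {"id": f"synthetic-well-{i}", "kind": kind,
             "data": {"FacilityName": f"Well {i}"}}
            for i in range(n_records)
        ],
    }

# The three manifest sizes used in the basic load test.
for size in (500, 1000, 50_000):
    manifest = make_synthetic_manifest(size)
    print(size, len(manifest["MasterData"]))
```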
Additional information regarding run time and latency of the Airflow scheduler can be found in the Airflow console's Gantt chart. This data provides a view of where the performance bottlenecks might be.
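The per-task timings behind the Gantt chart are also exposed by the Airflow 2 stable REST API (`GET /api/v1/dags/{dag_id}/dagRuns/{run_id}/taskInstances`), so the bottleneck analysis can be scripted. A minimal sketch, where the field names follow the Airflow API but the sample response fragment is made up:

```python
from datetime import datetime

def task_durations(task_instances):
    """Return (task_id, seconds) pairs sorted slowest-first.

    task_instances is the list returned by the Airflow REST endpoint;
    start_date/end_date are ISO-8601 timestamps.
    """
    rows = []
    for ti in task_instances:
        if ti.get("start_date") and ti.get("end_date"):
            start = datetime.fromisoformat(ti["start_date"])
            end = datetime.fromisoformat(ti["end_date"])
            rows.append((ti["task_id"], (end - start).total_seconds()))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Made-up response fragment for illustration.
sample = [
    {"task_id": "validate", "start_date": "2021-06-01T10:00:00+00:00",
     "end_date": "2021-06-01T10:00:30+00:00"},
    {"task_id": "store_records", "start_date": "2021-06-01T10:00:30+00:00",
     "end_date": "2021-06-01T10:05:30+00:00"},
]
print(task_durations(sample))  # store_records dominates this run
```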
Assets:
The other items (beyond the basic load testing) give insights into the sensitivity of the performance to the Airflow configuration. This may be carried out if resources are available.
Today teams are loading 8-10 million records, including validations, outside of the manifest or CSV ingestion mechanisms, in ~5 hours, for a rate of about 33k records per minute.
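For reference, the quoted target rate follows directly from those numbers:

```python
records = 10_000_000   # upper end of the 8-10 million range
minutes = 5 * 60       # ~5-hour load window
rate = records / minutes
print(f"{rate:,.0f} records/minute")  # roughly the 33k/min target above
```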
There is a use case for application developers to run the Airflow part of the ingestion locally (with core services accessible through a REST API). This local mode can be used to better profile the performance of just the Airflow component, agnostic of the CSP deployment.
Standalone installation instructions are here.
@abhijeet_sawant is this an issue with the code or is it a deployment issue? @debasisc does this issue need to go to preshipping?
@kibattul @Siarhei_Khaletski For Option 3, what are the requirements for security? Is this "XCOM" data expected to be persistent, etc.? These are questions the manifest-by-reference work has to solve too, so perhaps talk with that team. There might be a common solution here we can use in both cases.
Can you please help me understand how this ADR got approved?
@kibattul I've marked it unapproved given your comments. The concept of passing references to manifests through XCOM instead of manifests themselves has been agreed upon in our weekly meetings. The concept of using the dataset service is being tested to see if it's viable. We haven't seen evidence of that yet.
The Dataset service is used to save persistent data.
You're right, unlike other platform data, the manifests themselves are not necessarily intended to be persistent. There may be some issues with using the Dataset service for this (e.g. cleanup), but we haven't established its viability from a performance standpoint yet; that's the work that's progressing.
@Siarhei_Khaletski A reasonable requirement for the ingestion is that we have "programmatic access" to the status of an ingestion job. I suggest we do the simplest thing first, and so I favor starting with option 1, XCOM through the workflow service.
Question: does Option 1 give us access to real-time updates? Can we poll Airflow to tell us which records are being processed?
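In the spirit of Option 1's programmatic access, polling for job status could look something like the sketch below. The endpoint path and status values are assumptions modeled on the OSDU workflow service, not verified against a specific release:

```python
import json
import time
import urllib.request

TERMINAL_STATUSES = {"finished", "failed"}  # assumed terminal states

def is_terminal(status):
    return status in TERMINAL_STATUSES

def get_run_status(base_url, workflow_name, run_id, token):
    # Endpoint path is an assumption modeled on the OSDU workflow service.
    url = f"{base_url}/api/workflow/v1/workflow/{workflow_name}/workflowRun/{run_id}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("status")

def poll_until_done(base_url, workflow_name, run_id, token, interval=10):
    """Poll until the workflow run reaches a terminal state."""
    while True:
        status = get_run_status(base_url, workflow_name, run_id, token)
        if is_terminal(status):
            return status
        time.sleep(interval)
```

Note this only surfaces run-level status; per-record progress is exactly the real-time-updates question above and would need something richer than Option 1 appears to expose.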
The manifest "by reference" project being developed is intended to get away from having "huge" XCOMs. XCOM isn't intended to be an efficient store of large data.
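To illustrate the by-value vs by-reference difference with an in-memory stand-in for the XCOM store (real code would use Airflow's `ti.xcom_push`/`xcom_pull`, and the dataset-service record id shown is made up):

```python
import json

xcom = {}  # stand-in for Airflow's XCOM table

def push_manifest_by_value(manifest):
    # Anti-pattern: the full manifest payload lands in the XCOM store.
    xcom["manifest"] = json.dumps(manifest)

def push_manifest_by_reference(manifest_id):
    # Preferred: only a small reference (e.g. a dataset-service record id).
    xcom["manifest_ref"] = manifest_id

big_manifest = {"MasterData": [{"id": f"well-{i}"} for i in range(50_000)]}
push_manifest_by_value(big_manifest)
push_manifest_by_reference("dataset--File.Generic:manifest-123")  # id is illustrative

# By-value stores ~1 MB of JSON; by-reference stores a few dozen bytes.
print(len(xcom["manifest"]), len(xcom["manifest_ref"]))
```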
Fields that can be identified as the source of a failure should be returned. Publish what went wrong, not what went right, because the latter doesn't require user intervention.
The user will need to be able to take some action to fix an issue: enough information to identify the file that has encountered an error, the reason for the error, and, if possible, the fields which are generating the error.
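A sketch of the kind of failure report that would satisfy this (the structure and field names are my own illustration, not an agreed format):

```python
from dataclasses import dataclass, field

@dataclass
class IngestionError:
    file: str                                   # which file failed
    reason: str                                 # why it failed
    fields: list = field(default_factory=list)  # offending fields, when known

def report_failures(results):
    """Keep only the failures -- successes need no user intervention."""
    return [r for r in results if isinstance(r, IngestionError)]

# Illustrative mixed results: a success (just a file name) and a failure.
results = [
    "well_001.json",
    IngestionError("well_002.json", "schema validation failed",
                   ["SpudDate", "VerticalMeasurement.Depth"]),
]
for err in report_failures(results):
    print(f"{err.file}: {err.reason} ({', '.join(err.fields)})")
```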
@chad what's your take on this? You're working on the AdminUI. My POV: for a first pass, keep it simple; it's more important that we have the programmatic access.
Meta information such as units (m, ft, g/cm^3, coordinate reference frame) is required for the ingestion of data. This information is crucial for normalizing the frame of reference.
For date time information, ideally this data is in UTC or has meta information available to transform it to UTC. Currently the schema service is throwing out records that don't conform to this requirement.
However, there is a large body of data already in the environment where this information is not available, so we need to make date time the exception and waive the requirement to provide UTC information. A counterpoint to waiving this requirement is activities like active drilling, where correct date times in UTC are required.
E.g. should the ingestion provide warnings that date time meta information isn't available? Or should we have a flag or field in the record to allow the user to waive the requirement, etc.?
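One way the waiver could look in code, treating a missing timezone as a warning instead of throwing the record out (the `allow_naive` flag and function name are illustrative, not existing ingestion code):

```python
from datetime import datetime, timezone

def normalize_datetime(value, allow_naive=False):
    """Return (datetime, warning).

    Values carrying a UTC offset are converted to UTC. Naive values
    are rejected unless the caller waives the requirement, in which
    case they pass through with a warning rather than being dropped.
    """
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        if not allow_naive:
            raise ValueError(f"no timezone info on {value!r}")
        return dt, "date time meta information missing; cannot normalize to UTC"
    return dt.astimezone(timezone.utc), None

print(normalize_datetime("2021-03-01T08:00:00+02:00"))              # converted to UTC
print(normalize_datetime("2021-03-01T08:00:00", allow_naive=True))  # passes with warning
```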
This issue is in line with this approved [ADR].
The following MR [Link] introduces changes in the DAG repo structure to support packaged DAGs.
This is the new proposed structure for the CSV parser:
airflowdags
├── osdu_csv_parser
│   ├── __init__.py
│   └── xyz.py
└── csv_ingestion_all_steps.py
As of now only Azure has made changes in their pipeline to honor this new structure; other CSPs should make changes in their pipelines to support it as well.
Once all CSP pipelines have moved to the new structure, the existing airflowdags/dags and airflowdags/plugins folders will be cleaned up.
@hmarkovic for your attention.
@harshit283 you've approved, then unapproved, the MR - was that intentional?
@todaiks I think it's fine to leave the test results in the opengroup repos. Thanks for providing the link.
@Wibben It's a misnomer; it's "load testing" of the performance work that was done by EPAM for M7, i.e. the LRU caching etc.
@ChrisZhang The MR is still open.
@Kateryna_Kurach @shrikgar @kibattul @Wibben Hi all, to help keep track of the Airflow 2+ adoption by CSPs, can you please update your Airflow 2+ status here.
This issue is a place to track adoption of Airflow 2+ by the various CSPs.
Running Airflow 2+ with the experimental API is backward compatible with the current workflow services, and it provides potential performance improvements, particularly around the scheduler. Please update your current status here.
AWS - M9 timeline
Azure - M10 timeline
IBM - M9 timeline
GCP - M8 timeline
@shrikgar @harshit283 @Siarhei_Khaletski will this get merged before the M8 code freeze this Friday?
@shrikgar IBM is taking over this issue; is it correct to tag it for the M9 release?