M7 Manifest-based ingestion - Load Testing
Definitions
Load Testing - Determining the number of records or manifests that can be processed at a time.
Background
Throughout M7, EPAM has delivered a number of performance improvements, alongside work on configuration issues and similar fixes. We expect this to have significantly improved the capacity of manifest-based ingestion, but we don't have a specific figure.
The process of load testing should be repeatable, with the expectation it will be applied to the upcoming Airflow 2+ changes.
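One way to keep runs repeatable is to wrap every test in the same small harness that reports whether the job succeeded and how long it took. The following is a minimal sketch, not an existing tool: `LoadTestResult`, `run_load_test`, and the `job` callable are hypothetical names, and `job` is assumed to submit the manifests and block until the run finishes, raising an exception on failure.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoadTestResult:
    name: str
    passed: bool          # binary pass/fail
    wall_time_s: float    # elapsed wall-clock time in seconds
    error: str = ""


def run_load_test(name: str, job: Callable[[], None]) -> LoadTestResult:
    """Run `job` once and record pass/fail plus elapsed wall-clock time."""
    start = time.perf_counter()
    try:
        # Hypothetical: `job` submits manifests and waits for the run to end.
        job()
        passed, error = True, ""
    except Exception as exc:  # any exception counts as a failed run
        passed, error = False, repr(exc)
    return LoadTestResult(name, passed, time.perf_counter() - start, error)


if __name__ == "__main__":
    # Placeholder job; a real test would call the ingestion workflow here.
    result = run_load_test("placeholder job", lambda: None)
    print(result.name, "PASS" if result.passed else "FAIL", result.wall_time_s)
```

Because the harness only depends on a callable, the same wrapper could be reused unchanged when the job is pointed at a different Airflow version.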
Requirements
We need the "5000 manifest test" (@debasisc @todaiks) to be re-run on the M7 release. The result should be a binary pass/fail and the wall-time for executing the job. For completeness, Table 1 shows a set of recommended test cases that we believe should ultimately be automated and runnable by the QA group.
| Test | Issue | AWS | Azure | GCP | IBM |
|---|---|---|---|---|---|
| The "5000 manifest" test | Our current baseline | | | | |
| 1 manifest with 5,000 records | | | | | |
| 1 manifest with 20,000 records | | | | | |
| 1 manifest with 50,000 records | Limit on the size of the request body | | | | |
| 50K manifests in multiple requests, not simultaneously | Airflow 1.X doesn’t allow sending multiple requests (fixed in Airflow 2.0) | | | | |
| Chunks of 50, 1000 DAG runs | (1) max_active_runs (50) limitation; (2) Workflow Service limitation: Java heap error (Issue 64); (3) Storage Service stores no more than 500 records/s | | | | |
| Chunks of 1000 | See above | | | | |
| 50 DAG runs | | | | | |
| Launch several different DAGs simultaneously | | | | | |
| Ingest the Volve data | To promote adoption | | | | |
| Ingest the TNO data | To promote adoption | | | | |
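
The chunked test cases and the Storage Service limit quoted in the table can be related with some simple arithmetic. The sketch below assumes that the chunked rows refer to splitting a 50,000-record payload into fixed-size chunks, one DAG run per chunk; that reading is an assumption, as are the helper names `chunk_records` and `min_storage_seconds` and the placeholder record shape. The estimate is only a lower bound that treats the 500 records/s limit as the sole bottleneck.

```python
from typing import Iterator, List, Sequence

# Limit quoted in the table above; treated here as the only bottleneck.
STORAGE_LIMIT_RECORDS_PER_S = 500


def chunk_records(records: Sequence[dict], chunk_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `chunk_size` records."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(records), chunk_size):
        yield list(records[start:start + chunk_size])


def min_storage_seconds(total_records: int,
                        limit: int = STORAGE_LIMIT_RECORDS_PER_S) -> float:
    """Lower bound on wall-time if storage throughput is the only constraint."""
    return total_records / limit


# Placeholder records for illustration; these are not real manifests.
records = [{"id": f"record-{i}"} for i in range(50_000)]

chunks = list(chunk_records(records, 50))
print(len(chunks))                   # 1000 chunks of 50 records each
print(min_storage_seconds(50_000))   # 100.0 seconds at 500 records/s
```

Under these assumptions, any run that ingests 50,000 records cannot finish in under 100 seconds, which gives a sanity check for reported wall-times.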