Performance testing in R3 M11 - Need to determine the maximum payload size allowed during ingestion using the Osdu_ingest DAG
All,
For R3 M11, I performed performance load testing using the Osdu_ingest DAG running in Airflow v2.0. The environment I used was IBM Pre-ship R3 M11. Here is the summary:
As expected, we can see that when batch_upload is used, the time required to ingest the data goes down (a performance gain).
Some observations on the process used: there is a difference between the Python scripts used to generate the payloads for ingestion and for batch_upload. The script that generates the ingestion payload produces records of kind "opendes:wks:master-data--Organisation:1.0.0", so when a user specifies 5 records, it generates 5 Organisation records. The script that generates the batch_upload payload produces records of kinds "osdu:wks:master-data--Organisation:1.0.0" and "osdu:wks:reference-data--ContractorType:1.0.0", so when a user specifies 5 records, it generates 5 records of each kind. It therefore actually generates twice the number of records specified, as the sketch below illustrates.
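For illustration only, here is a minimal Python sketch of the two behaviours. These are not the actual community scripts; the function names and record shapes are assumptions made to show the counting effect:

```python
# Hypothetical sketch of the two payload generators, to show why the
# batch_upload generator emits twice the requested record count.

def generate_ingest_payload(n):
    # Ingestion script behaviour: n records, all of kind Organisation.
    return [{"kind": "opendes:wks:master-data--Organisation:1.0.0",
             "data": {"OrganisationName": f"Org-{i}"}} for i in range(n)]

def generate_batch_upload_payload(n):
    # batch_upload script behaviour: n Organisation records plus
    # n ContractorType records, i.e. 2*n records for a requested count of n.
    orgs = [{"kind": "osdu:wks:master-data--Organisation:1.0.0",
             "data": {"OrganisationName": f"Org-{i}"}} for i in range(n)]
    contractors = [{"kind": "osdu:wks:reference-data--ContractorType:1.0.0",
                    "data": {"Name": f"ContractorType-{i}"}} for i in range(n)]
    return orgs + contractors

print(len(generate_ingest_payload(5)))        # 5 records
print(len(generate_batch_upload_payload(5)))  # 10 records
```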
At present, to establish the performance benchmark, we are using the number of records, probably because it is convenient to tell users that ingesting a certain number of wells, for example, takes x amount of time. But the well record size may vary from one user environment to another, so performance numbers derived from a record count may not hold true in all situations.
How much one can ingest in a single job is ultimately bounded by the size of the payload in KB, so I think we should use payload size in KB to establish the benchmark. The number of records that fit in a payload then depends on the size of the individual records, as the sketch below shows.
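As a rough illustration of benchmarking by payload size, the following Python sketch serializes a batch of generated records (the record shape here is an assumption, not a real well record) and reports the total size in KB and the average KB per record:

```python
import json

# Illustrative records; real record sizes will differ per environment.
records = [{"kind": "osdu:wks:master-data--Organisation:1.0.0",
            "data": {"OrganisationName": f"Org-{i}", "Description": "x" * 200}}
           for i in range(5000)]

payload = json.dumps({"records": records})
size_kb = len(payload.encode("utf-8")) / 1024

print(f"{len(records)} records -> {size_kb:.1f} KB "
      f"({size_kb / len(records):.2f} KB per record)")
```

The same record count can therefore map to very different payload sizes, which is why a KB-based benchmark travels better across environments.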
I have done the testing in the IBM environment, but the test for 50,000 records in batch_upload seems to be failing in all environments. I do not know where the size limit is coming from (the REST API, the network, Airflow, or the DAG implementation), nor do I know whether that size is configurable.
It is important for us to understand where that limitation comes from and whether it is a hard limit or a configurable one. The Python script should honor that limit and generate multiple data/payload files, each containing a safe number of records, to avoid failures. One possible approach is sketched below.
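As a possible approach (a sketch, not a tested implementation), the generator could split records into multiple payload files under a size budget. MAX_PAYLOAD_KB below is a placeholder; the real limit still needs to be identified and confirmed:

```python
import json

MAX_PAYLOAD_KB = 4096  # placeholder for the as-yet-unknown payload limit

def write_chunked_payloads(records, prefix="payload"):
    """Write records into payload_0.json, payload_1.json, ... each kept
    under MAX_PAYLOAD_KB so a single oversized file never gets submitted."""
    chunk, chunk_kb, file_no = [], 0.0, 0
    for rec in records:
        rec_kb = len(json.dumps(rec).encode("utf-8")) / 1024
        # Start a new file once adding this record would exceed the budget.
        if chunk and chunk_kb + rec_kb > MAX_PAYLOAD_KB:
            with open(f"{prefix}_{file_no}.json", "w") as f:
                json.dump({"records": chunk}, f)
            chunk, chunk_kb, file_no = [], 0.0, file_no + 1
        chunk.append(rec)
        chunk_kb += rec_kb
    if chunk:  # flush the final partial chunk
        with open(f"{prefix}_{file_no}.json", "w") as f:
            json.dump({"records": chunk}, f)
```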