Performance review of main ingestion functions' improvements from M6 to M10

GCP results

I tested the main functions of Manifest Based Ingestion on my local machine across the M6 to M10 releases.

Results are provided in the following table.

Function	Manifest	M10_optimized (sec)	M9 (sec)	M8 (sec)	M7 (sec)	M6 (sec)
schema_validator.ensure_manifest_validity	LogCurveType (42917 records)	113	1453	1453	1453	1453
	LogCurveType (800 records)	2.6	25.38	25.38	25.38	25.38
	WorkProduct	2.5	2.685	2.895	2.895	2.895
manifest_integrity_validator.ensure_integrity	LogCurveType (42917 records)	14.94	15.07	14.67	40.2	**
	LogCurveType (800 records)	5.494	4.677	5.82	5.141	3751
	WorkProduct	0.0013	0.001	0.001446	0.001852	0.001781
single_manifest_processor.process_manifest	LogCurveType (42917 records)		2056*	**	**	**
	LogCurveType (800 records)		43.18*	439.3	439.3	439.3
	WorkProduct		2.544	2.454	2.887	2.6

*Sent batches of 400 records to Storage Service

**Could not run this test in a reasonable time (it may take more than 24 h)

Performance improvements throughout M6-M10 releases.

M10 (?)

Analysis of the previous releases revealed some bottlenecks. The slowest part of Manifest Ingestion, besides Process Manifest, was Schema Validation. Further research showed that the common way of calling jsonschema.validate carries a lot of overhead, because it creates validator classes and instances on every schema validation.

The solution was to create a jsonschema validator once for each unique schema and reuse it for all corresponding records. This approach is roughly 10 times faster than calling jsonschema.validate for every record.

E.g., M9 Schema Validation of 42917 LogCurveType records was 1453 seconds, and it is 113.1(!) seconds on M10 release.
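
The validator reuse described above could look roughly like the sketch below. It is an illustration, not the ingestion code itself: `ValidatorCache`, `split_valid` and the choice of the serialized schema as cache key are assumptions made for this example. Only `validator_for`, `check_schema` and `is_valid` come from the jsonschema package.

```python
import json
from typing import Any, Callable, Dict, Iterable, List, Tuple


def build_jsonschema_validator(schema: dict) -> Any:
    """Build one reusable validator (needs the third-party jsonschema package)."""
    from jsonschema.validators import validator_for

    validator_cls = validator_for(schema)  # choose the draft class from "$schema"
    validator_cls.check_schema(schema)     # check the schema itself only once
    return validator_cls(schema)


class ValidatorCache:
    """Hypothetical helper: keeps one validator per unique schema."""

    def __init__(self, build: Callable[[dict], Any] = build_jsonschema_validator):
        self._build = build
        self._validators: Dict[str, Any] = {}

    def get(self, schema: dict) -> Any:
        # Assumption: the serialized schema is a good enough cache key.
        key = json.dumps(schema, sort_keys=True)
        if key not in self._validators:
            self._validators[key] = self._build(schema)
        return self._validators[key]

    def __len__(self) -> int:
        return len(self._validators)


def split_valid(
    records: Iterable[dict], schema: dict, cache: ValidatorCache
) -> Tuple[List[dict], List[dict]]:
    """Return (valid, invalid) records, reusing the cached validator."""
    validator = cache.get(schema)
    valid: List[dict] = []
    invalid: List[dict] = []
    for record in records:
        (valid if validator.is_valid(record) else invalid).append(record)
    return valid, invalid
```

With this shape, the cost of building the validator is paid once per schema, while each record only pays for the validation itself.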

M9

In the previous releases, each of the Manifest's records was saved in Storage Service one by one, which caused a lot of requests to Storage.

Storing the Manifest's records with Storage Service's Batch Saving (up to 500 records per request) avoids these extra requests.

E.g., M8 manifest processing of 800 LogCurveType records took 439 seconds, meanwhile M9 manifest processing with batches of 400 records took 43 seconds.
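
A minimal sketch of the batching idea follows. The `save_batch` callable stands in for the actual Storage Service call and is hypothetical, as are the helper names. The 500-record limit and the batch size of 400 are taken from the text above.

```python
from typing import Callable, Iterator, Sequence

MAX_BATCH_SIZE = 500  # upper bound of Batch Saving mentioned in the text


def chunked(records: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` records."""
    if not 1 <= size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch size must be between 1 and {MAX_BATCH_SIZE}")
    for start in range(0, len(records), size):
        yield records[start:start + size]


def save_in_batches(
    records: Sequence[dict],
    save_batch: Callable[[Sequence[dict]], None],  # hypothetical Storage call
    size: int = 400,
) -> int:
    """Send the records batch by batch and return the number of requests made."""
    requests = 0
    for batch in chunked(records, size):
        save_batch(batch)
        requests += 1
    return requests
```

Under these assumptions, 800 records are sent in 2 requests instead of 800.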

M8

Improved Manifest Integrity Validation performance by sending batches of external OSDU Ids of all the Manifest's records to Search Service. Previously, these Ids were searched one by one, which caused extra calls to Search Service.

E.g., M7 manifest integrity check of 42917 LogCurveType records took 40.2 seconds, meanwhile M8 manifest integrity check of the same Manifest took 14.67 seconds.

M7

Improved Manifest Integrity Validation performance by first extracting all external references into a single set of unique Ids and only then searching for them in OSDU Search Service. This significantly reduced the number of requests to Search Service. Earlier, each Manifest record's external references were searched separately, so Search Service was called with the same requests many times.

E.g., M6 manifest integrity check of 800 LogCurveType records took 3751 seconds, meanwhile M7 manifest integrity check of the same Manifest took 5.141 seconds.
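
The deduplication step can be sketched as below. The record layout (a `"references"` list per record) and the `search_ids` callable are assumptions made for illustration; the real validator extracts references from the Manifest's records and queries Search Service.

```python
from typing import Callable, Iterable, List, Set


def collect_external_ids(records: Iterable[dict]) -> Set[str]:
    """Gather every external reference of all records into one set of unique Ids."""
    unique_ids: Set[str] = set()
    for record in records:
        # Assumption: references sit under a hypothetical "references" key.
        unique_ids.update(record.get("references", []))
    return unique_ids


def find_missing_ids(
    records: Iterable[dict],
    search_ids: Callable[[List[str]], Set[str]],  # hypothetical Search call
) -> Set[str]:
    """Search each unique Id only once and return the Ids that were not found."""
    unique_ids = collect_external_ids(records)
    found = search_ids(sorted(unique_ids))
    return unique_ids - found
```

When many records point to the same few reference Ids, as LogCurveType records tend to, the number of searched Ids shrinks from the total number of references to the number of distinct ones.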

AWS results

M12

In the validation and integrity check stages, the Manifest by Reference implementation performs at best marginally faster than the current implementation. If the ADR’s design of adding a POST request to the stage were accepted, these marginal improvements might actually turn into slowdowns.
For the process_manifest step, the reference implementation was roughly 9x slower than the existing implementation on the 4kb Manifest (24.7 sec versus 2.73 sec).

Function	Manifest	Original Manifest Ingestion (avg sec)	Manifest By Reference (avg sec)
schema_validator.ensure_manifest_validity	4kb Manifest	2.85	3.12
	128kb Manifest	6	6
	4mb Manifest	6.12	5.9
manifest_integrity_validator.ensure_integrity	4kb Manifest	2.85	2.98
	128kb Manifest	4	3.5
	4mb Manifest	4.27	4
single_manifest_processor.process_manifest	4kb Manifest	2.73	24.7
	128kb Manifest	4	25
	4mb Manifest	2.9	N/A**
Total time	4kb Manifest	8.43	30.8
	128kb Manifest	14	34.5
	4mb Manifest	13.29	N/A**
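
The slowdown factors can be recomputed from the table with a few lines of arithmetic. The snippet only uses the averages listed above; the 4mb Manifest is left out because no Manifest By Reference timing is available for it.

```python
# (original avg sec, by-reference avg sec), copied from the table above
process_manifest = {"4kb Manifest": (2.73, 24.7), "128kb Manifest": (4.0, 25.0)}
total_time = {"4kb Manifest": (8.43, 30.8), "128kb Manifest": (14.0, 34.5)}


def slowdown(original: float, by_reference: float) -> float:
    """How many times slower the by-reference run is than the original."""
    return by_reference / original


for label, rows in (("process_manifest", process_manifest), ("total", total_time)):
    for manifest, (original, by_reference) in rows.items():
        print(f"{label:17} {manifest:15} {slowdown(original, by_reference):.2f}x")
```

The process_manifest step is about 9.05x slower for the 4kb Manifest and 6.25x slower for the 128kb Manifest, while the total time grows by roughly 3.65x and 2.46x respectively.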

Edited Aug 11, 2022 by Okoun-Ola Fabien Houeto