
Closed
Created Dec 21, 2021 by Yan Sushchynski (EPAM)@Yan_SushchynskiMaintainer

Performance review of main ingestion functions' improvements from M6 to M10

GCP results

I tested the main functions of Manifest Based Ingestion on my local machine across the M6 to M10 releases.

Results are provided in the following table.

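| Function | Manifest | M10_optimized (sec) | M9 (sec) | M8 (sec) | M7 (sec) | M6 (sec) |
|---|---|---|---|---|---|---|
| schema_validator.ensure_manifest_validity | LogCurveType (42917 records) | 113 | 1453 | 1453 | 1453 | 1453 |
| schema_validator.ensure_manifest_validity | LogCurveType (800 records) | 2.6 | 25.38 | 25.38 | 25.38 | 25.38 |
| schema_validator.ensure_manifest_validity | WorkProduct | 2.5 | 2.685 | 2.895 | 2.895 | 2.895 |
| manifest_integrity_validator.ensure_integrity | LogCurveType (42917 records) | 14.94 | 15.07 | 14.67 | 40.2 | ** |
| manifest_integrity_validator.ensure_integrity | LogCurveType (800 records) | 5.494 | 4.677 | 5.82 | 5.141 | 3751 |
| manifest_integrity_validator.ensure_integrity | WorkProduct | 0.0013 | 0.001 | 0.001446 | 0.001852 | 0.001781 |
| single_manifest_processor.process_manifest | LogCurveType (42917 records) | | 2056* | ** | ** | ** |
| single_manifest_processor.process_manifest | LogCurveType (800 records) | | 43.18* | 439.3 | 439.3 | 439.3 |
| single_manifest_processor.process_manifest | WorkProduct | | 2.544 | 2.454 | 2.887 | 2.6 |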

*Sent batches of 400 records to Storage Service

**This test could not be completed in a reasonable time (it may run for more than 24 hours)
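To make the release-to-release gains easier to compare, the speedup factors can be computed directly from the timings reported in this issue. The snippet below is only a convenience sketch: the dictionary keys and the `speedup` helper are names made up for illustration, and the numbers are copied from the table and the per-release notes.

```python
# Timings in seconds as reported in this issue:
# (older release, older time, newer release, newer time).
TIMINGS = {
    "schema validation, 42917 LogCurveType records": ("M9", 1453.0, "M10", 113.0),
    "schema validation, 800 LogCurveType records": ("M9", 25.38, "M10", 2.6),
    "process manifest, 800 LogCurveType records": ("M8", 439.3, "M9", 43.18),
    "integrity check, 42917 LogCurveType records": ("M7", 40.2, "M8", 14.67),
    "integrity check, 800 LogCurveType records": ("M6", 3751.0, "M7", 5.141),
}


def speedup(old_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the newer release is."""
    return old_seconds / new_seconds


if __name__ == "__main__":
    for name, (old_rel, old, new_rel, new) in TIMINGS.items():
        print(f"{name}: {old_rel} -> {new_rel} is {speedup(old, new):.1f}x faster")
```

The largest single gain by far is the M6 to M7 integrity check on 800 records, which is more than two orders of magnitude.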

Performance improvements throughout M6-M10 releases.

M10 (?)

After analyzing the previous releases, some bottlenecks were found. The slowest part of Manifest Ingestion, besides Process Manifest, was Schema Validation. Further investigation showed that the common way of calling jsonschema.validate carries a lot of overhead, because it sets up validator classes and instances again on every schema validation.

The solution was to create a jsonschema validator once for each unique schema and reuse it for all corresponding records. This approach is roughly 10 times faster than calling jsonschema.validate for every record.

E.g., on M9, Schema Validation of 42917 LogCurveType records took 1453 seconds; on the M10 release it takes 113.1 (!) seconds.
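A minimal sketch of the validator-reuse idea is shown below. It is not the code from the ingestion repository: `ValidatorCache`, `build_jsonschema_validator` and `find_invalid_records` are hypothetical names, and the sketch assumes each record carries a `kind` field that identifies its schema. The only jsonschema calls used are `validator_for`, `check_schema` and `is_valid`.

```python
from typing import Any, Callable, Dict, List


def build_jsonschema_validator(schema: dict) -> Any:
    """Build a reusable validator with the third-party jsonschema package."""
    from jsonschema.validators import validator_for  # imported lazily

    validator_cls = validator_for(schema)  # pick the draft from "$schema"
    validator_cls.check_schema(schema)     # validate the schema itself, once
    return validator_cls(schema)


class ValidatorCache:
    """Hypothetical helper: one validator per schema kind, reused for all records."""

    def __init__(
        self,
        load_schema: Callable[[str], dict],
        make_validator: Callable[[dict], Any] = build_jsonschema_validator,
    ) -> None:
        self._load_schema = load_schema
        self._make_validator = make_validator
        self._validators: Dict[str, Any] = {}

    def get(self, kind: str) -> Any:
        # The expensive setup happens only the first time a kind is seen.
        if kind not in self._validators:
            self._validators[kind] = self._make_validator(self._load_schema(kind))
        return self._validators[kind]


def find_invalid_records(records: List[dict], cache: ValidatorCache) -> List[dict]:
    """Return the records that fail validation against their own schema.

    Assumes every record has a "kind" key naming its schema.
    """
    return [r for r in records if not cache.get(r["kind"]).is_valid(r)]
```

With this shape, a manifest of 42917 records that all share one schema builds a single validator instead of going through the setup 42917 times; `load_schema` stands in for whatever fetches the schema for a given kind.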

M9

In the previous releases, each Manifest record was saved to Storage Service one by one, which caused a lot of requests to Storage.

Manifest records are now stored using Storage Service's Batch Saving (up to 500 records per request), which avoids the extra requests to Storage.

E.g., M8 manifest processing of 800 LogCurveType records took 439 seconds, whereas M9 manifest processing with batches of 400 records took 43 seconds.
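The batching step can be sketched roughly as follows. This is an illustration under assumptions, not the actual ingestion code: `iter_batches` and `save_in_batches` are hypothetical helpers, and `save_batch` stands in for whatever call sends one batch to Storage Service. The 500-record limit and the batch size of 400 are the figures quoted above.

```python
from typing import Callable, Iterator, List, Sequence

MAX_BATCH_SIZE = 500  # upper limit quoted above for Storage Service batch saving


def iter_batches(records: Sequence[dict], batch_size: int = 400) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most batch_size records."""
    if not 0 < batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    for start in range(0, len(records), batch_size):
        yield list(records[start:start + batch_size])


def save_in_batches(
    records: Sequence[dict],
    save_batch: Callable[[List[dict]], None],
    batch_size: int = 400,
) -> int:
    """Call save_batch once per batch and return the number of requests made."""
    requests = 0
    for batch in iter_batches(records, batch_size):
        save_batch(batch)  # hypothetical: one request to Storage per batch
        requests += 1
    return requests
```

For the 800-record manifest this means 2 requests instead of 800, and for the 42917-record manifest 108 requests instead of 42917.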

M8

Improved Manifest Integrity Validation performance by sending the external OSDU Ids of all Manifest records to Search Service in batches. Previously, these Ids were searched one by one, which caused extra calls to Search Service.

E.g., M7 manifest integrity check of 42917 LogCurveType records took 40.2 seconds, whereas M8 manifest integrity check of the same Manifest took 14.67 seconds.
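One way to picture the batched lookup is the sketch below. It is an assumption-laden illustration: `find_missing_ids` is a hypothetical helper, `search_ids` stands in for a single Search Service request that returns the Ids it found, and the batch size of 100 is arbitrary because the issue does not state the real one.

```python
from typing import Callable, Iterable, List, Set


def find_missing_ids(
    ids: Iterable[str],
    search_ids: Callable[[List[str]], Iterable[str]],
    batch_size: int = 100,  # arbitrary value chosen for illustration
) -> Set[str]:
    """Return the Ids that the search backend does not know about.

    search_ids is called once per batch instead of once per Id.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    unique_ids = sorted(set(ids))
    missing: Set[str] = set()
    for start in range(0, len(unique_ids), batch_size):
        batch = unique_ids[start:start + batch_size]
        found = set(search_ids(batch))  # hypothetical: one request per batch
        missing.update(set(batch) - found)
    return missing
```

Records that reference any Id in the returned set can then be treated as failing the integrity check.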

M7

Improved Manifest Integrity Validation performance by first extracting all external references into a single set of unique Ids and only then searching for them in OSDU Search Service. This significantly reduced the number of requests to Search Service. Earlier, each Manifest record's external references were searched separately, so Search Service was called with the same requests many times.

E.g., M6 manifest integrity check of 800 LogCurveType records took 3751 seconds, whereas M7 manifest integrity check of the same Manifest took 5.141 seconds.
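The deduplication step might look something like the sketch below. The helper names are hypothetical, and `is_reference` is a caller-supplied predicate: the `ref:` prefix used in the usage example is a placeholder only and is not the real OSDU Id format.

```python
from typing import Any, Callable, Iterable, Iterator, Set


def iter_strings(node: Any) -> Iterator[str]:
    """Yield every string value nested anywhere inside dicts and lists."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def collect_unique_references(
    records: Iterable[dict],
    is_reference: Callable[[str], bool],
) -> Set[str]:
    """Gather the external references of all records into one set of unique Ids."""
    return {
        value
        for record in records
        for value in iter_strings(record)
        if is_reference(value)
    }


if __name__ == "__main__":
    # Placeholder data: three records that all point at the same two references.
    records = [
        {"id": f"rec-{i}", "data": {"unit": "ref:unit-m", "curves": ["ref:type-gr"]}}
        for i in range(3)
    ]
    refs = collect_unique_references(records, lambda s: s.startswith("ref:"))
    print(sorted(refs))
```

When many records share the same few references, as is plausible for a LogCurveType manifest, the number of Ids to look up stays small no matter how many records the manifest contains.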

Edited May 10, 2022 by Okoun-Ola Fabien Houeto