Yes @vineethguna, I remember that discussion. At that time as well I tried to address the majority of the traffic rather than the few with this solution (and we missed putting a cap in place then). If we remove checksum generation altogether it will impact almost 80% of incoming requests, but if we only start skipping it for files larger than 5 GB it would impact about 20%.
Having said that, we must definitely plan (rethink and design, as discussed earlier) checksum generation for larger files. That can take its own time to go through the community and arrive at a working solution (one that also works for files registered through the Dataset service). We can also brainstorm on the schema changes @gehrmann is proposing above, to see whether partial checksum generation is feasible and whether it would benefit us or not.
Right now, with this issue, my aim is simply not to block anyone from creating metadata when they register a very large file.
Paresh Behede (9e30dc50) at 12 Oct 09:30
Bypass checksum generation for file size more than 5 gbs
@chad No, nothing has been planned on that part; we need to think through the design of that piece. Maybe we can check whether someone from the community can help here?
@chad The pipeline is failing with the error below; can someone from AWS look into it?
GitLab returned: HTTPError: Unprocessable Entity { "base": [ "aws_bootstrap job: undefined need: aws-update-ecs" ] }
Changes and additions are as explained in the ChangeReport: https://gitlab.opengroup.org/osdu/subcommittees/data-def/work-products/schema/-/blob/master/E-R/ChangeReport.md#snapshot-2022-07-22-towards-m14. The schema scope is PUBLISHED (= read-only), except for two experimental 'placeholders': Reservoir and ReservoirSegment.
Closes #113
Paresh Behede (99701187) at 26 Sep 10:11
M14 snapshot schema repo SHA 1076d1347c3c1903dc09c01c69db5ce8eafd5b...
@gehrmann At least I don't think we have a pressing need for this change to be done as part of M14.
Ok nice to know about this new change.
Hey @gehrmann, I don't think we need any schema changes as such; we might just change the logic that populates that attribute to run asynchronously instead.
merged
This issue tracks the integration of the schema bootstrapping resources delivered by OSDU Data Definitions for Milestone 13, release 0.16.
Closes #110
DD schema repo SHA 8b96be9fb00a80374db4ab9129db16f68e9c8e11, 2022-07-22
Paresh Behede (f6e6102f) at 10 Aug 10:03
Merge branch '110-osdu-dd-m13-delivery' into 'master'
... and 1 more commit
Yes, I am good; I have approved it. Do you want me to merge?
@Yan_Sushchynski I think this would deviate from the status-producing ADR. If we need per-record status from the manifest DAG, we should start emitting status from the manifest ingestor the way we implemented it for the CSV ingestor. Emitting from the manifest, or saving a summary in Dataset, does not seem like a right and consistent approach. You can refer to the CSV Ingestor code to see which status messages it produces.
Closes #110
DD schema repo SHA 8b96be9fb00a80374db4ab9129db16f68e9c8e11, 2022-07-22
@Yan_Sushchynski The Workflow Service is already emitting status with a specific stage and status.
When a user calls the POST /metadata API endpoint to register a file on the data platform, the File service, before saving it as a record, generates a checksum of the file provided in the request to help duplicate detection in downstream workflows.
Because this is a blocking HTTP call, checksum calculation takes quite a long time if the file is huge (more than 3-5 GB), and the HTTP POST call hangs and never responds.
In our tests, checksum generation and metadata registration take about 2 minutes for a 5 GB file.
We experienced this when one of the users tried uploading a 14 GB file.
Though the percentage of such huge files being uploaded is quite low, we still need to allow their metadata to be registered, and to enable that we must bypass the checksum generation logic for such file sizes.
By doing this we still enable duplicate detection (by calculating and saving the file checksum in the storage record) for the majority of uploaded files, roughly 95%, and we skip it for the remaining 5% of requests.
To cover that remaining 5% of file requests, we can consider an asynchronous way to calculate the checksum and update the storage record later.
@shrikgar Can you please check why the IBM bootstrapping is failing here?