Storage fails to deactivate a large number of records upon legal tag expiration
When a large number of records is associated with a legalTag that expires, running the cron job causes availability issues and inconsistent results in terms of record searchability.
Observations:
LegalTag cron job update issue:
Scenario: A large number of records (in the six-digit range) is associated with a legalTag, i.e. each record's metadata carries a particular legalTag (call it lt1) in its legal.legaltags section. lt1 is set to expire soon.
Event: lt1 expires
**Action 1**: The cron job updateLegalTagStatus is triggered periodically. It picks up the legalTags whose state has changed (valid to invalid, or invalid to valid), publishes this information to the SB topic 'legaltags' and the EG topic 'legaltagschangedtopic', and updates the legalTag's state in CosmosDB.
The EG topic 'legaltagschangedtopic' has an event subscription that forwards to the SB topic 'legaltagschangedtopiceg', which in turn has a subscription 'eg_sb_legaltagssubscription'.
**Action 2**: The Storage service pulls legalTag update events from 'eg_sb_legaltagssubscription' and updates the records associated with lt1: it marks each record's metadata with the active/inactive record status and publishes the change to SB and EG for the indexer-queue to consume.
Expected outcome: All records associated with lt1 become inactive and are no longer returned by the Storage and Search APIs.
Actual outcome: Only some records associated with lt1 become inactive and unsearchable; the remaining records associated with lt1 are still returned by the Storage and Search APIs.
Issue: Not all records are pulled by the Storage service in Action 2, so many records never change their state even though the legalTag is now invalid.
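The failure mode in Action 2 can be sketched as follows. This is an illustrative Python simulation, not the actual Storage code (the real service and its CosmosDB/SB interactions are not shown here, and names like `update_legal_tag_records` are hypothetical): a single unbounded fetch-and-update pass over all records leaves the data in a mixed state as soon as the job dies partway through.

```python
# Illustrative simulation of the Action 2 failure mode (hypothetical names;
# the real Storage service talks to CosmosDB and Service Bus).

class JobInterrupted(Exception):
    """Stands in for a pod restart, a network error, or CosmosDB throttling."""

def update_legal_tag_records(records, fail_after=None):
    """Mark every record of the expired tag inactive, in one unbounded pass."""
    processed = 0
    for record in records:
        if fail_after is not None and processed >= fail_after:
            # The job dies mid-run and no checkpoint of progress exists.
            raise JobInterrupted("interrupted; progress is lost")
        record["status"] = "inactive"
        processed += 1
    return processed

# 100,000 records associated with the expired legalTag lt1.
records = [{"id": i, "status": "active"} for i in range(100_000)]

try:
    update_legal_tag_records(records, fail_after=40_000)
except JobInterrupted:
    pass

inactive = sum(1 for r in records if r["status"] == "inactive")
active = len(records) - inactive
print(inactive, active)  # 40000 updated, 60000 still active and searchable
```

Because the legalTag itself has already been flipped to invalid in CosmosDB (Action 1), nothing in the pipeline notices that the remaining records were never processed.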
Observed behavior/possible improvements:
- The context of the legalTag change (active to inactive, or inactive to active) is not considered by Storage when fetching records to update. Storage fetches ALL records for that legalTag with the query `SELECT * FROM c WHERE ARRAY_CONTAINS(c.metadata.legal.legaltags, lt1)`. With a large number of records this is a long operation, and we observed throttling on CosmosDB while it ran.
- No way to retry. Because the Legal service updates the legalTag status in CosmosDB, running the updateLegalTagStatus job again will not pick up this legalTag. To retry, we have to manually change the status of the legalTag and run the cron job again, at which point we hit the issue above: Storage tries to process ALL records again.
- What happens when the Storage job is interrupted, e.g. by a pod restart (high CPU utilization), a network error, or a CosmosDB error? Retrying the whole job doesn't help much.
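One possible improvement along these lines is to make the job resumable: process the records page by page and persist a checkpoint after each page, so a retry resumes where the previous run stopped instead of re-reading ALL records. The sketch below is hypothetical (`fetch_page`, the checkpoint dict, and the integer page token are stand-ins; in the real service the token would be a CosmosDB continuation token and the checkpoint would live in durable storage):

```python
# Sketch of resumable page-by-page processing with checkpointing.
# All names are hypothetical; this is not the actual Storage implementation.

def fetch_page(records, token, page_size=100):
    """Return one page of record indices plus a token for the next page."""
    start = token or 0
    end = min(start + page_size, len(records))
    next_token = end if end < len(records) else None
    return list(range(start, end)), next_token

def run_job(records, checkpoint, fail_after_pages=None):
    """Process pages, persisting the continuation token after each one."""
    pages_done = 0
    token = checkpoint.get("token")
    while True:
        page, next_token = fetch_page(records, token)
        for i in page:
            records[i]["status"] = "inactive"  # idempotent update
        checkpoint["token"] = next_token       # persist progress
        token = next_token
        pages_done += 1
        if token is None:
            return True                        # all records processed
        if fail_after_pages is not None and pages_done >= fail_after_pages:
            return False                       # simulated interruption

records = [{"id": i, "status": "active"} for i in range(1_000)]
checkpoint = {"token": None}

# First run dies after 3 pages; the retry resumes from the checkpoint
# rather than re-fetching every record for the legalTag.
run_job(records, checkpoint, fail_after_pages=3)
run_job(records, checkpoint)

print(all(r["status"] == "inactive" for r in records))  # True
```

Because each per-record update is idempotent, re-processing a partially completed page after a crash is harmless, and the checkpoint bounds how much work an interrupted run repeats.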