Implement scale architecture (Fetch Process)
The requirement here is to decouple the Fetch process from the Ingestion process within the EDS framework. This will be achieved using the message queue implementations provided by the various cloud service providers (CSPs). The proposed process is as follows:
- Define a schedule that queries the supplier for records that have changed since the last run.
- When the schedule is triggered, perform the following steps:
  - Retrieve from the platform the Connected Source Registry records that conform to the ConnectedSourceRegistry schema. These provide all the information needed to identify the supplier endpoint, as well as the service credentials needed to connect to it.
  - Retrieve from the platform the data job records that conform to the ConnectedSourceDataJob schema. These contain the last time the job was run and the data criteria needed to retrieve the data. Within the criteria, make sure a limit is specified on the number of records that can be retrieved in a single call.
  - Query the supplier endpoint using the criteria. The preference is for the client to support cursors, as the Elasticsearch implementation does. If it does, use the cursor to make repeated search calls until all records have been retrieved. If not, retrieve the count of matching records and page through the supplier's endpoint iteratively.
  - For every iteration (whether using client-side or server-side cursors), assign the results to a JSON structure. Using a dictionary, split out and store the individual records in the JSON object.
  - For each CSP's implementation, call the message queue client and publish each record to the message queue.
  - These records will be picked up by the "Ingest" process when it is triggered.
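The fetch steps above can be sketched roughly as follows. This is a minimal illustration, not the EDS implementation: the supplier callables (`cursor_search`, `offset_search`, `count`), the `id` field on each record, and the `limit` key in the criteria are all assumptions made for the example, and `publish` stands in for whichever CSP message queue client is used.

```python
import json
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Record = Dict[str, object]


def fetch_with_cursor(
    cursor_search: Callable[[dict, Optional[str]], Tuple[List[Record], Optional[str]]],
    criteria: dict,
) -> Iterator[List[Record]]:
    """Yield pages until the (hypothetical) supplier stops returning a cursor."""
    cursor = None
    while True:
        records, cursor = cursor_search(criteria, cursor)
        if records:
            yield records
        if cursor is None:
            break


def fetch_with_count(
    count: Callable[[dict], int],
    offset_search: Callable[[dict, int, int], List[Record]],
    criteria: dict,
) -> Iterator[List[Record]]:
    """Fallback when no cursor is available: get the total, then page by offset."""
    total = count(criteria)
    limit = criteria["limit"]  # assumed key holding the per-call record limit
    for offset in range(0, total, limit):
        yield offset_search(criteria, offset, limit)


def publish_pages(pages: Iterator[List[Record]], publish: Callable[[str], None]) -> int:
    """Split each page into individual records and publish them one by one."""
    published = 0
    for page in pages:
        # Dictionary keyed by record id (assumed field) holding the page's records.
        by_id = {str(rec["id"]): rec for rec in page}
        for rec in by_id.values():
            publish(json.dumps(rec))
            published += 1
    return published


# Fake supplier used only to exercise the sketch.
DATA = [{"id": f"rec-{i}", "value": i} for i in range(5)]


def fake_count(criteria: dict) -> int:
    return len(DATA)


def fake_offset_search(criteria: dict, offset: int, limit: int) -> List[Record]:
    return DATA[offset:offset + limit]


def fake_cursor_search(criteria: dict, cursor: Optional[str]):
    start = int(cursor) if cursor else 0
    end = start + criteria["limit"]
    next_cursor = str(end) if end < len(DATA) else None
    return DATA[start:end], next_cursor


queue: List[str] = []  # stands in for the CSP message queue
sent = publish_pages(fetch_with_cursor(fake_cursor_search, {"limit": 2}), queue.append)
```

Publishing one message per record, rather than one per page, keeps the Ingest side free to process records independently, at the cost of more queue calls.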
As indicated above, message queues will need to be created based on each CSP's offering. A Message Bus interface will need to be built that generically defines the methods exposed by the different cloud-specific classes. The methods included in the interface will be simple: EnvironmentSetup, Publish Messages, Retrieve Messages
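One possible shape for such an interface is sketched below. The class name `MessageBus`, the snake_case method names, and their signatures are assumptions chosen for illustration; only the three method concepts come from the description above. The in-memory class is a stand-in for testing, where a real cloud-specific class would wrap the Service Bus, SQS, or Pub/Sub client instead.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Deque, Iterable, List, Optional


class MessageBus(ABC):
    """Generic contract that each cloud-specific class would implement."""

    @abstractmethod
    def environment_setup(self) -> None:
        """Create or connect to the queue (credentials, topic or queue names, etc.)."""

    @abstractmethod
    def publish_messages(self, messages: Iterable[str]) -> int:
        """Publish each message and return how many were sent."""

    @abstractmethod
    def retrieve_messages(self, max_messages: int) -> List[str]:
        """Return up to max_messages messages, oldest first."""


class InMemoryMessageBus(MessageBus):
    """Hypothetical local implementation, useful for tests of Fetch and Ingest."""

    def __init__(self) -> None:
        self._queue: Optional[Deque[str]] = None

    def _require_queue(self) -> Deque[str]:
        if self._queue is None:
            raise RuntimeError("call environment_setup() first")
        return self._queue

    def environment_setup(self) -> None:
        if self._queue is None:
            self._queue = deque()

    def publish_messages(self, messages: Iterable[str]) -> int:
        queue = self._require_queue()
        count = 0
        for message in messages:
            queue.append(message)
            count += 1
        return count

    def retrieve_messages(self, max_messages: int) -> List[str]:
        queue = self._require_queue()
        batch: List[str] = []
        while queue and len(batch) < max_messages:
            batch.append(queue.popleft())
        return batch
```

With this split, the Fetch process depends only on `publish_messages` and the Ingest process only on `retrieve_messages`, so neither needs to know which CSP is behind the interface.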
Below are references to the CSPs' implementations:
Microsoft Azure (Service Bus)
Amazon Web Services (Amazon SQS)
Google Cloud Platform (Pub/Sub)