Reindex API - performance, scalability and reliability issues
Recent issues on the Schema/Search backend require us to re-index a significant number of kinds/indices. The specific issues are:
- M10 schema hint changes in the Schema service.
- Geoshape queries break when the Elasticsearch server is upgraded from 7.8.1 to 7.17.x (confirmed by the Elasticsearch Support team; no public issue is available).
The current implementation of the Reindex API (per kind) has serious performance, scalability and reliability issues. It does not work at all for kinds with a few million records. This is blocking us from adopting the M10 (now M11) schema updates. The issues with the API are:
- API throughput is low: it can only re-index 250K-300K records per hour. For a partition with 100 million records, a full re-index takes roughly 14 to 17 days of continuous running.
- It is not resilient: if the operation fails partway through, we have to start over.
- The Reindex operation has no transparency: we cannot tell how much progress has been made.
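As a rough check of the throughput figure in the list above, the duration can be estimated from the record count and the hourly rate. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes a constant rate with no failures or restarts, which the resilience issue above suggests is optimistic.

```python
def reindex_duration_days(records: int, records_per_hour: int) -> float:
    """Days of continuous running needed to re-index `records` at a fixed hourly rate."""
    return records / records_per_hour / 24


# Figures taken from the issue description: 100 million records,
# observed throughput between 250K and 300K records per hour.
partition_records = 100_000_000
for rate in (250_000, 300_000):
    days = reindex_duration_days(partition_records, rate)
    print(f"{rate:,} records/hour -> {days:.1f} days")
# prints:
# 250,000 records/hour -> 16.7 days
# 300,000 records/hour -> 13.9 days
```

Because a failed run has to start over, the real elapsed time can be a multiple of this estimate.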
In addition, these issues prevent us from recovering the Search service in disaster recovery scenarios. In that case we would use the ReindexAll API, which uses the Reindex API (per kind) behind the scenes, so we run into all of the issues above.