Benchmarking Elasticsearch cross-cluster replication

sorangutan

One of the highly anticipated features of Elasticsearch 6.7 and 7.0 was [cross-cluster replication](https://www.elastic.co/guide/en/elastic-stack-overview/current/xpack-ccr.html) (CCR), or the ability to replicate indices across clusters, e.g., for disaster recovery planning or geographically distributed deployments. Since this feature introduces a lot of changes to the Elasticsearch code base, we need our users to be confident about its performance, resilience and stability.

While our standard Elasticsearch benchmarking tool, [Rally](https://esrally.readthedocs.io/en/stable/), provided a lot of capabilities that we needed, it was necessary to expand it in several areas:

- Ability to communicate with [more than one cluster](https://esrally.readthedocs.io/en/stable/command_line_reference.html#id2)
- Collecting metrics (this is performed using [telemetry devices](https://esrally.readthedocs.io/en/latest/telemetry.html#recovery-stats) in Rally) from the new Elasticsearch [ccr-stats](https://www.elastic.co/guide/en/elasticsearch/reference/current/ccr-get-stats.html) and [recovery](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-recovery.html) API endpoints.

With these features added, we defined a topology involving a leader cluster in a geographical region (Europe) far from the following cluster (North America), ensuring a network latency of at least 100ms:

![Benchmarking Topology][1]

The benchmarking scenarios used to evaluate different aspects of CCR were:

## Ensure follower is always able to catch up with different types of load

We picked three completely different load scenarios representing [very small](https://github.com/elastic/rally-tracks/tree/master/geopoint#example-document), [medium](https://github.com/elastic/rally-tracks/tree/master/http_logs#example-document) and [very large](https://github.com/elastic/rally-tracks/tree/master/pmc#example-document) document sizes.
With these we were able to [tune default values](https://www.elastic.co/guide/en/elasticsearch/reference/current/ccr-put-follow.html#_default_values_2) for following indices. For example, to avoid long garbage collection (GC) pauses with large document sizes, we changed the default value of the CCR parameter [max_read_request_size](https://github.com/elastic/elasticsearch/commit/ddda2d419cb49c8df0df70b8ce8d4c8b7322ef02) from unlimited to 32MB.

## Ensure remote recovery is performing optimally

Similarly, we wanted to be sure that bootstrapping a follower index with [remote recovery](https://www.elastic.co/guide/en/elastic-stack-overview/current/remote-recovery.html) performs well in most cases with the default configuration. With the same topology, we used the [medium doc size dataset](https://github.com/elastic/rally-tracks/tree/master/http_logs) in different scenarios, including indexing it entirely first and then enabling CCR on the following cluster, as well as indexing it up to a certain percentage and then enabling CCR while indexing continued.

In the very first scenario we observed that recovery took too long, so we added the ability to [fetch chunks from different files in parallel](https://www.elastic.co/blog/this-week-in-elasticsearch-and-apache-lucene-2019-02-08) and optimized the [default values](https://www.elastic.co/guide/en/elasticsearch/reference/current/ccr-settings.html) of the remote recovery settings.

### … but what about network compression?

![data compression picture][3]

Remote clusters can be optionally configured to [compress requests](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-remote-clusters.html#configuring-remote-clusters) on the transport layer.
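
As an illustration, the per-cluster compression option can be set through the cluster settings API. The sketch below only builds the request body and sends nothing. The alias `leader`, the seed address and the helper name `remote_cluster_settings` are placeholders invented for this example. The setting names follow the remote clusters documentation linked above, so check them against the Elasticsearch version you run.

```python
import json


def remote_cluster_settings(alias, seeds, compress):
    """Build a body for PUT /_cluster/settings that registers a remote cluster.

    `alias` and `seeds` are hypothetical placeholders; `transport.compress`
    toggles compression for requests sent to that remote cluster.
    """
    prefix = f"cluster.remote.{alias}"
    return {
        "persistent": {
            f"{prefix}.seeds": list(seeds),
            f"{prefix}.transport.compress": bool(compress),
        }
    }


if __name__ == "__main__":
    # Example values only; replace with the real alias and transport address.
    body = remote_cluster_settings("leader", ["leader-node:9300"], compress=True)
    print(json.dumps(body, indent=2))
```

The resulting JSON would be sent to the cluster that connects to the remote one (in a CCR setup, the following cluster).
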
We also evaluated whether transport compression can reduce the time taken for remote recovery and [found that](https://gist.github.com/dliappis/2e61295744d7e2f95a63394d771a86d4#gistcomment-2848138), without tweaking a number of settings to saturate the network, enabling compression actually makes the recovery process slower due to increased CPU usage.

## Stability

Another aspect we looked at was CCR stability over a period of ten days in different scenarios. Keeping the same network topology, we used [data](https://github.com/elastic/rally-eventdata-track#elasticlogs_bulk_source) modelled on the Filebeat nginx module format.

One scenario concerned leader indices managed using [index lifecycle management (ILM)](https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html) with [CCR auto-follow](https://www.elastic.co/guide/en/elasticsearch/reference/current/ccr-put-auto-follow-pattern.html) configured: a total of 17bn docs were indexed and the index rolled over 341 times, with each rolled-over index containing ~52 million docs and 20GB worth of data (not including replicas).

We also ran the same ten-day scenario with random restarts of the following cluster, with a five-minute pause period before starting it again, to ensure all data gets recovered as expected.

## Try benchmarking CCR with Rally yourself!

We’ve prepared [an easy-to-use recipe](https://esrally.readthedocs.io/en/latest/recipes.html#testing-rally-against-ccr-clusters-using-a-remote-metric-store) based on Docker to spin up a local CCR environment for tests that pull metrics and ship them to an Elasticsearch metrics store. Enjoy!
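
When experimenting with such an environment, one way to check whether a follower keeps up is to compare the leader and follower global checkpoints reported by the ccr-stats API mentioned earlier. The following is a minimal sketch, assuming the `follow_stats` part of the response carries per-shard `leader_global_checkpoint` and `follower_global_checkpoint` fields. The function name and the sample values are made up for illustration, and this is not how Rally's telemetry device is implemented.

```python
def follower_lag(follow_stats):
    """Return {(follower_index, shard_id): lag in operations}.

    `follow_stats` is assumed to look like the "follow_stats" object of a
    GET /_ccr/stats response: {"indices": [{"index": ..., "shards": [...]}]}.
    """
    lag = {}
    for index in follow_stats.get("indices", []):
        for shard in index.get("shards", []):
            key = (index["index"], shard["shard_id"])
            # Operations acknowledged on the leader but not yet on the follower.
            lag[key] = (
                shard["leader_global_checkpoint"]
                - shard["follower_global_checkpoint"]
            )
    return lag


# Hand-written sample, not real benchmark output.
sample = {
    "indices": [
        {
            "index": "follower_index",
            "shards": [
                {
                    "shard_id": 0,
                    "leader_global_checkpoint": 1024,
                    "follower_global_checkpoint": 768,
                }
            ],
        }
    ]
}

if __name__ == "__main__":
    for (index_name, shard_id), ops in follower_lag(sample).items():
        print(f"{index_name}[{shard_id}] is {ops} operations behind the leader")
```

A lag that stays near zero under load, or returns to zero shortly after indexing stops, suggests the follower is catching up.
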

[1]: https://images.contentstack.io/v3https://www.elastic.co/assets/bltefdd0b53724fa2ce/blt880747bfe374506d/5cc054f943283e8d640eaa4e/CrossRegionLink.png
[2]: https://images.contentstack.io/v3https://www.elastic.co/assets/bltefdd0b53724fa2ce/blt85676a8c9b87d6c9/5cbdb114f8d0b9d818246e7e/CCR_Regions_Transparent.png
[3]: https://images.contentstack.io/v3https://www.elastic.co/assets/bltefdd0b53724fa2ce/blt1021763a1731757b/5cbeff5dfcad0dd81e21b835/press-1332506_1920.jpg

https://www.elastic.co/blog/benchmarking-elasticsearch-cross-cluster-replication