How Kenna Security Speeds Up Elasticsearch Indexing at Scale - Part 1



    Everyone wants their Elasticsearch cluster to index and search faster, but optimizing both at scale takes planning. In 2015, Kenna's cluster held 500 million documents, a million of which were processed every day. At the time, our poorly configured Elasticsearch cluster was the least stable piece of our infrastructure and could barely keep up as our data size grew. Today, our cluster holds 4 billion documents and we process over 200 million of them a day, with ease.

    Building a cluster to meet all of our indexing and searching demands was not easy, but with a lot of persistence and a few "OH CRAP!" moments, we got it done and learned a hell of a lot along the way. In this two-part blog, I will share the techniques we used while building our Elasticsearch cluster at Kenna to get us to where we are today. Before I dive into all the Elasticsearch fun, however, I first want to tell you a little bit about Kenna.

    What is Kenna Security?

    Kenna helps Fortune 500 companies manage their cybersecurity risk. The average company has 60 thousand assets; an asset is basically anything with an IP address. The average company also has 24 million vulnerabilities; a vulnerability is a way an asset can be hacked. With all of this data, it can be extremely difficult for companies to know what they need to focus on and fix first. That's where Kenna comes in. At Kenna, we take all of that data and run it through our proprietary algorithms. Those algorithms then tell our clients which vulnerabilities pose the biggest risk to their infrastructure, so they know what to fix first.

    We initially store all of this data in MySQL, which is our source of truth. From there, we index the asset and vulnerability data into Elasticsearch:

    [Diagram: data flow from MySQL into Elasticsearch]

    Elasticsearch is what allows our clients to organize their data in any way they need, which makes search speed a top priority. At the same time, the data in MySQL is constantly changing, so fast indexing is just as important in order to keep Elasticsearch in sync with it.

    Part 1: Speeding up Indexing at Scale

    Let's start with how we scaled our indexing capacity. One of the best things we did early on in running Elasticsearch was to set a refresh interval suited to our use case from the beginning.

    Refresh Interval

    By default, Elasticsearch uses a one-second refresh interval. Refreshing an index consumes resources that could otherwise go toward indexing, so one of the easiest ways to speed up indexing is to increase your refresh interval. Increasing it decreases the number of refreshes an index performs, freeing up resources for indexing. We have ours set at 30 seconds.

    PUT /my_index/_settings
    {
        "index" : {
            "refresh_interval": "30s"
        }
    }
    

    Client data updates are handled by background jobs, so waiting an extra 30 seconds for data to become searchable is not a big deal. If 30 seconds is too long for your use case, try 10 or 20. Anything greater than one second will get you indexing gains.

    Once we had our refresh interval set it was smooth sailing — until we got our first big client. This client had 200,000 assets and over 100 million vulnerabilities. After getting their data into MySQL, we started to index it into Elasticsearch but it was slow going. After a couple days of indexing we had to tell this client that it was going to take two weeks to index all of their data. Obviously, this wasn't going to work long term. The solution was bulk processing.

    Bulk Processing

    We found that indexing 1000 documents in each bulk request gave us the biggest performance boost. The size you want to make your batches will depend on your document size. For tips on finding your optimal batch size, check out Elastic's suggestions for bulk processing.

    [Diagram: indexing documents individually vs. indexing in a single batch using a bulk request]
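    To make this concrete, here is a minimal sketch of a bulk request; the assets index name and the document fields are hypothetical, purely for illustration:

    # Hypothetical index and fields, for illustration only
    POST /_bulk
    { "index" : { "_index" : "assets", "_id" : "1" } }
    { "ip_address" : "10.0.0.1", "vulnerability_count" : 12 }
    { "index" : { "_index" : "assets", "_id" : "2" } }
    { "ip_address" : "10.0.0.2", "vulnerability_count" : 3 }

    Each document is an action line followed by a source line, and a batch of 1000 documents is just 1000 of those pairs in one request body. Elasticsearch processes that single request far more efficiently than 1000 individual index requests.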

    Bulk processing alone got us through a solid year of growth. Then, this past year, MySQL got an overhaul and suddenly Elasticsearch couldn't keep up. During peak indexing periods, node CPU would max out and we started getting a bunch of 429 (TOO_MANY_REQUESTS) errors. This is when we became aware of threads and the role they play in indexing.

    Route Your Documents

    When you index a batch of documents, the number of threads needed to complete the request depends on how many shards those documents belong to. Let's look at two batches of documents, where each batch contains documents spread across four shards:

    [Diagram: two batches of documents, each containing documents from four different shards]

    It's going to take four threads to index each batch, because each request has to talk to four shards. An easy way to decrease the number of threads you need for each request is to group your documents by shard. Going back to our example, if you group the documents by shard so that each batch only touches two shards, you cut the number of threads needed to execute each request in half.
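    One way to do that grouping (a sketch of the general technique, not necessarily Kenna's exact implementation) is to give each document an explicit routing value and build every bulk request out of documents that share one. Elasticsearch sends all documents with the same routing value to the same shard, so the entire request can be handled by a single shard's indexing threads:

    # Hypothetical index, routing value, and fields, for illustration only
    POST /_bulk
    { "index" : { "_index" : "assets", "_id" : "17", "routing" : "client_42" } }
    { "ip_address" : "10.1.1.5", "vulnerability_count" : 8 }
    { "index" : { "_index" : "assets", "_id" : "18", "routing" : "client_42" } }
    { "ip_address" : "10.1.1.6", "vulnerability_count" : 2 }

    One trade-off: once a document is indexed with explicit routing, fetching it by ID requires the same routing value, so a routing scheme like this has to be applied consistently at both index and retrieval time.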



    https://www.elastic.co/blog/how-kenna-security-speeds-up-elasticsearch-indexing-at-scale-part-1

