
Delete by query causing fielddata cache spike leading to 429 #2550


This ticket is the result of two weeks of experiments.

I'll try to include all the information, because the cause might be something wrong with how RestHighLevelClient performs deleteByQuery.
For two weeks I have been betting that it had to be a problem on my side or on the Elastic cluster (performance, configuration), but after several experiments I have no explanation left and need to present this to you.

First of all, I have prior experience with Elastic and I am aware that updates and deletes are expensive operations; this issue is not about that.

CONTEXT

  • This is a Spring Boot microservice running on Java 11, using spring-data-elasticsearch 4.2.11 to run operations against an Elasticsearch cluster.
  • We are in pre-launch experiments, and I have an environment that mirrors our production traffic but gives me total control over it.
  • We have a lot of ingest operations, a lot of query operations, a significant rate of update operations, and few delete operations.

We are using RestHighLevelClient configured like this:

  public RestHighLevelClient elasticsearchClient() {
    // Compatibility headers: a 7.x client talking to the cluster in REST API
    // compatibility mode.
    final HttpHeaders compatibilityHeaders = new HttpHeaders();
    compatibilityHeaders.add("Accept", "application/vnd.elasticsearch+json;compatible-with=7");
    compatibilityHeaders.add("Content-Type", "application/vnd.elasticsearch+json;"
      + "compatible-with=7");
    final ClientConfiguration clientConfiguration = ClientConfiguration.builder()
      .connectedTo(eshostname + ":" + esport)
      .usingSsl()
      .withBasicAuth(username, password)
      .withDefaultHeaders(compatibilityHeaders)
      .build();
    return RestClients.create(clientConfiguration).rest();
  }

As mentioned, we run many ingest and query operations. For example:

    final BoolQueryBuilder boolQuery = QueryBuilders
      .boolQuery()
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_1, s1))
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_2, s2))
      .filter(QueryBuilders.rangeQuery(SEARCH_FIELD_3).lte(s3));
    final NativeSearchQuery nsq = new NativeSearchQuery(boolQuery);
    nsq.addSort(Sort.by(Direction.DESC, CREATED_SEARCH_FIELD));
    nsq.setMaxResults(size);
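
That query is then executed through ElasticsearchRestTemplate; a minimal sketch (Document and INDEX_NAME are placeholders for our entity class and index name):

    // Run the NativeSearchQuery built above against our index.
    final SearchHits<Document> hits = elasticsearchRestTemplate
      .search(nsq, Document.class, IndexCoordinates.of(INDEX_NAME));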

We also run updateByQuery operations, like this:

    final BoolQueryBuilder boolQuery = QueryBuilders.boolQuery()
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_1, s1))
      .filter(QueryBuilders.rangeQuery(SEARCH_FIELD_3).lt(s3))
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_2, s2));
    final NativeSearchQuery nsq = new NativeSearchQuery(boolQuery);
    return UpdateQuery.builder(nsq)
      .withScriptType(ScriptType.INLINE)
      .withScript(UPDATE_SCRIPT)
      .withParams(UPDATE_PARAMS)
      .build();
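
The resulting UpdateQuery is then handed to the template; a minimal sketch, again with INDEX_NAME as a placeholder:

    // Execute the update-by-query built above.
    elasticsearchRestTemplate.updateByQuery(updateQuery, IndexCoordinates.of(INDEX_NAME));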

The update script looks like this:

"ctx._source.FIELD_4 = params.FIELD_4; ctx._source.FIELD_5 = params.FIELD_5; ctx._source.FIELD_6 = params.FIELD_6; ctx._source.FIELD_3 = params.FIELD_3"

Finally, we run deleteByQuery operations with the same query as the update operations (with no script in that case, of course).
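
The delete path goes through ElasticsearchRestTemplate.delete; a minimal sketch (Document and INDEX_NAME are placeholders, matching the delete frame in the stack trace below):

    // Same bool query as the updates, but handed to delete() with no script.
    final BoolQueryBuilder boolQuery = QueryBuilders.boolQuery()
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_1, s1))
      .filter(QueryBuilders.rangeQuery(SEARCH_FIELD_3).lt(s3))
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_2, s2));
    elasticsearchRestTemplate.delete(
      new NativeSearchQuery(boolQuery), Document.class, IndexCoordinates.of(INDEX_NAME));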

ISSUE

All operations run like a charm except deleteByQuery. The moment deleteByQuery is enabled (even though deletes are just a fraction of the traffic, and even though there are far more update operations), the cluster starts to get into trouble. ALL delete operations time out, although the documents are in fact removed from the cluster. The fielddata cache starts to grow significantly, which makes GC usage and GC duration spike, which in turn makes the CPU spike, and finally the [parent] circuit breaker is triggered and the cluster starts responding 429 TOO MANY REQUESTS to our operations.

This happens regardless of the size of the delete query's result set; delete queries matching just 1 or 2 documents cause the same effect.
Please remember that the number of delete queries is small.
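
To watch the fielddata growth while reproducing this, the node stats can be polled through the low-level client that backs RestHighLevelClient (a hypothetical monitoring snippet, not part of the service code):

    // Poll fielddata cache usage per node while deletes are enabled.
    final Request statsRequest = new Request("GET", "/_nodes/stats/indices/fielddata");
    final Response statsResponse = restHighLevelClient.getLowLevelClient().performRequest(statsRequest);
    System.out.println(EntityUtils.toString(statsResponse.getEntity()));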

This only happens with deletes. If I replace the deletes with updates (using the same query and a script that updates four fields), the cluster is stable. This alone is very weird to me, since updates are expected to be more expensive than deletes.

NOTE: If I bypass spring-data-elasticsearch for the delete operations and instead use a Feign client that sends the POST HTTP requests directly, without the RestHighLevelClient, then the cluster is stable. This leads me to think that there may be something wrong with the deletes that RestHighLevelClient is sending. It feels like something is not being closed (connection timeout).
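
For reference, the bypass looks roughly like this (a sketch using Spring Cloud OpenFeign; the interface and property names are hypothetical, the point being that it is a plain POST to _delete_by_query):

    // Hypothetical Feign client: a plain POST to the _delete_by_query endpoint,
    // bypassing RestHighLevelClient entirely.
    @FeignClient(name = "elasticsearch", url = "${elasticsearch.url}")
    public interface ElasticDeleteClient {
      @PostMapping(value = "/{index}/_delete_by_query", consumes = MediaType.APPLICATION_JSON_VALUE)
      String deleteByQuery(@PathVariable("index") String index, @RequestBody String query);
    }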

Here are some screenshots:

Timeout exception on ALL delete operations

org.springframework.dao.DataAccessResourceFailureException: 5,000 milliseconds timeout on connection http-outgoing-222 [ACTIVE]; nested exception is java.lang.RuntimeException: 5,000 milliseconds timeout on connection http-outgoing-222 [ACTIVE]
	at org.springframework.data.elasticsearch.core.ElasticsearchExceptionTranslator.translateExceptionIfPossible(ElasticsearchExceptionTranslator.java:75)
	at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.translateException(ElasticsearchRestTemplate.java:402)
	at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.execute(ElasticsearchRestTemplate.java:385)
	at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.delete(ElasticsearchRestTemplate.java:224)
	at com.xxx.xxx.service.xxx.deleteByQuery(xxx.java:380)

Metrics when deletes are enabled
(we disable updates at the same time, so 100% of the spikes are related to deletes)
[screenshot: cluster metrics showing the fielddata cache, GC, and CPU spikes]
