Startpagina > English, java, work > How sharding in elasticsearch makes scoring a little less accurate and what to do about it

How sharding in elasticsearch makes scoring a little less accurate and what to do about it

Currently I’m using a small dataset (about 3500 records) on ElasticSearch and saw some strange scoring. Hits that should have exactly the same score had _almost_ the same score. Almost the same is kind of a problem since we sort our data on score and name. After some researching it appeared sharding was the issue here. With only one shard the problem is solved. If you want to know how I found out and what alternatives you have please stay tuned.

updated @ Sep 13: Zachary Tong pointed me to an article with a much better solution (read: the right solution ;-) ). I added the second to last paragraph to explain it.

Introduction

First thing to know is that when elasticsearch finds a document it gives this hit a score. This is the same ‘magic’ score you might know from Lucene.

Suppose we have a data set with 3 ships : Abis Dover, Abis Antwerpen and Mineral Antwerpen.

When you search in Lucene for ‘antwerpen’ the ‘Abis Antwerpen’ and ‘Mineral Antwerpen’ will have exactly the same score. When you do this in elasticsearch you might get the same score or you might get almost the same score. Since we’re in computer science the word ‘might’ and ‘almost’ really bugs us. This is a problem in the application I’m working on since we order results on score and then name of the ship.

Let’s set up a new index in elasticsearch and add our three ships. Note that I create two shards, the default is five, but this makes it easier to explain later on :

#optional: delete index
#curl -XDELETE 'http://localhost:9200/ships'

curl -XPUT 'http://localhost:9200/ships/' -d '
index:
number_of_shards: 2
'

curl -XPOST 'http://localhost:9200/ships/ship' -d '{ "name" : "Abis Antwerpen" }'

curl -XPOST 'http://localhost:9200/ships/ship/' -d '{ "name" : "Abis Dover" }'

curl -XPOST 'http://localhost:9200/ships/ship/' -d '{ "name" : "Mineral Antwerpen" }'

Now query for antwerpen:

curl -XGET 'http://localhost:9200/ships/ship/_search' -d '{
"query": {
"query_string": { "query": "antwerpen" }
}
}
'

In the results look for the _score field. It might be the same, it might be different. Just try it a few times to see what happens (uncomment the delete command in the previous insert). Note that the id’s are generated by elasticsearch (later I will explain why this is important).

Now enable explain on our queries and see what happens:
curl -XGET 'http://localhost:9200/ships/ship/_search' -d '{
"explain" : true,
"query": {
"query_string": { "query": "antwerpen" }
}
}
'

I won’t go into details, but when you look closely the difference in score can be explained by idf(docFreq=1, maxDocs=1) where maxDocs can also be 2. Because we use two shards there are two datasets where the query is executed. It’s also possible that you get the exact same scores, then all three documents will be in the same shard. To control this seemingly random behaviour you can add routing to your documents (see next section).
The difference in score is quite big here since one shard contains twice as many documents. With larger sets you will see the results converge. When you use two shards and have an even number of documents equally distributed then the score will be exactly the same (this almost never happens!).

Document routing

Routing determines in which shard your document will be stored. The default behaviour is routing by _id. When you assign an id yourself you will see more consistent behaviour when you repeat the import (because the id’s will be the same and not some random generated string).
Another option is to use routing. When you implement this correctly you will have a better document distribution and better performance (you only need to query the shards where the right data can exist).
At the MongoDB documentation there is a nice document about choosing a shard key.
These considerations apply to routing in elasticsearch as well.

dfs_query_then_fetch

Very shortly after I published this article Zachary Tong from elasticsearch pointed me to another solution. For our use case an even better one:
curl -XGET 'http://localhost:9200/ships/ship/_search?search_type=dfs_query_then_fetch' -d '{
"query": {
"query_string": { "query": "antwerpen" }
}
}
'

When dfs_query_then_fetch is enabled the document count is precalculated and problem solved :-)
Really read the conclusion of the elasticsearch-article carefully, because it’s not totally free (so I didn’t write this article for nothing ;-) ).

Conclusion

The simple fix was using one shard because there is a small dataset and it wont’ really hurt performance (it might even increase it). With larger sets the score differences will converge. When you want to sort on score you have do some grouping on scores that are very close to each other.
Please note that as we speak it’s not possible to change the number of shards after indexing unless you re-index all your data. The elasticsearch guys are working on this, so when you read this it might work. More about this and sharding can be seen in a video with Shay Banon at Berlin Buzzwords, earlier this year.

It’s also a good idea to add custom routing. You can use custom routing when querying. With this method only the shards that might have your data will be queried. When elasticsearch knows for sure it isn’t in a certain shard it won’t be queried. This is comparable with how MongoDB works (http://docs.mongodb.org/manual/sharding/) and will boost performance with the right routing.

About these ads
Categorieën:English, java, work Tags:,
  1. 11 september 2013 om 15:53

    Hi Jeroen. For small datasets, you can fix this scoring peculiarity by changing to DFS_query_then_fetch. This pre-calculates a global Doc Frequency for your terms so that there is no longer a per-shard disparity in scoring.

    Check out this article for more details: http://www.elasticsearch.org/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch/

  2. Jeroen van Wilgenburg
    11 september 2013 om 20:08

    Thanks for the tip! I’ll try it out and update the article.

  1. 16 december 2013 om 10:41

Geef een reactie

Vul je gegevens in of klik op een icoon om in te loggen.

WordPress.com logo

Je reageert onder je WordPress.com account. Log uit / Bijwerken )

Twitter-afbeelding

Je reageert onder je Twitter account. Log uit / Bijwerken )

Facebook foto

Je reageert onder je Facebook account. Log uit / Bijwerken )

Google+ photo

Je reageert onder je Google+ account. Log uit / Bijwerken )

Verbinden met %s

Volg

Ontvang elk nieuw bericht direct in je inbox.

%d bloggers op de volgende wijze: