Diacritics in ElasticSearch – Fix it in two places and also watch your encoding while you’re at it

A few weeks ago I solved an issue with ElasticSearch where some documents with diacritics didn’t show up in the search results. At least, I thought I did. There are quite a few places where it can go wrong. In this article I’ll explain these places and how to fix the issues.

Introduction

The Dutch language uses some diacritics, though not as many as German, so the problems with them took a while to show up.

ElasticSearch basically does two things: indexing and searching. During indexing you already have to think about how you’re going to search your indexed documents. The right thing to do is to strip off all diacritics and make all characters lower case, both when indexing and when searching. These two steps will solve almost all your problems. The third step is searching with wildcards, where things act a bit differently. The final step is ensuring the right character encoding (this has nothing to do with ElasticSearch, but I’ll mention it anyway since you might run into the same problems).

I’ll show the problem and solution with some examples. To try the examples in this article yourself, I suggest using elasticsearch-head, a web front end for ElasticSearch that makes development a lot easier.

The samples can be run from the command line if you have curl installed (or something similar).

Storing and searching tokens without diacritics

First we have to create an index. I picked famous tennis players since there aren’t that many triathletes with diacritics in their names ;-)
Let’s see what happens when we insert several players named José and André (with and without diacritics):


curl -XPUT 'http://localhost:9200/tennis/player/1' -d '{ "name" : "José Acasuso", "country" : "Argentina", "birth" : "1982"}'
curl -XPUT 'http://localhost:9200/tennis/player/2' -d '{ "name" : "Jose-Luis Clerc", "country" : "Argentina", "birth" : "1958"}'
curl -XPUT 'http://localhost:9200/tennis/player/3' -d '{ "name" : "José Higueras", "country" : "Spain", "birth" : "1953"}'
curl -XPUT 'http://localhost:9200/tennis/player/4' -d '{ "name" : "Andre Agassi", "country" : "United States", "birth" : "1970"}'
curl -XPUT 'http://localhost:9200/tennis/player/5' -d '{ "name" : "André Sá", "country" : "Brazil", "birth" : "1978"}'
curl -XPUT 'http://localhost:9200/tennis/player/6' -d '{ "name" : "Andrés Gómez", "country" : "Ecuador", "birth" : "1960"}'

Now open elasticsearch-head, go to the ‘structured query’ tab and enter the following parameters:
Search tennis for documents where must player.name query string jose

[Screenshot: search-01]

Make sure you use query_string when testing, since a term query isn’t routed through the search_analyzer.
As you can see, only one Jose appears although there are three.
When you search for José (with an accent on the e), the other two results appear. Same story with Andre.
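
To make the mismatch concrete, here is a minimal Python sketch of an inverted index that lowercases tokens but keeps diacritics. It is not how ElasticSearch is implemented; the function names (analyze, build_index, search) are made up for illustration, and the word-splitting regex is only a rough stand-in for the standard tokenizer.

```python
import re
from collections import defaultdict

# Three of the sample names from the curl commands above, keyed by document id.
PLAYERS = {
    1: "José Acasuso",
    2: "Jose-Luis Clerc",
    3: "José Higueras",
}

def analyze(text):
    """Lowercase and split on non-word characters; diacritics are kept."""
    return re.findall(r"\w+", text.lower())

def build_index(docs):
    """Map every token to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, name in docs.items():
        for token in analyze(name):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

def search(index, query):
    """Return ids of documents that contain every analyzed query token."""
    hits = [set(index.get(token, [])) for token in analyze(query)]
    return sorted(set.intersection(*hits)) if hits else []

index = build_index(PLAYERS)
print(search(index, "jose"))  # [2]
print(search(index, "José"))  # [1, 3]
```

Because "jose" and "josé" are stored as two different tokens, each query only finds the documents whose spelling matches exactly.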

To fix this you have to apply settings that change the index and search process. We have to delete the index and create a new one:


curl -XDELETE 'http://localhost:9200/tennis'

curl -XPUT 'http://localhost:9200/tennis' -d '{
  "index": {
    "analysis": {
      "analyzer": {
        "index_analyzer": {
          "filter": [
            "standard",
            "lowercase",
            "asciifolding"
          ],
          "tokenizer": "standard"
        },
        "search_analyzer": {
          "filter": [
            "standard",
            "lowercase",
            "asciifolding"
          ],
          "tokenizer": "standard"
        },
        "sortable": {
          "filter": "lowercaseFilter",
          "tokenizer": "keyword",
          "type": "custom"
        }
      },
      "filter": {
        "lowercaseFilter": {
          "type": "lowercase"
        }
      },
      "tokenizer": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}'

Then tell the mapping which index analyzer and search analyzer to use:


curl -XPUT 'http://localhost:9200/tennis/player/_mapping' -d '{
  "player": {
    "_all": {
      "enabled": true,
      "index_analyzer": "index_analyzer",
      "search_analyzer": "search_analyzer"
    },
    "properties": {
      "birth": {
        "type": "string"
      },
      "country": {
        "type": "string"
      },
      "id": {
        "include_in_all": false,
        "type": "string"
      },
      "name": {
        "index_analyzer": "index_analyzer",
        "search_analyzer": "search_analyzer",
        "type": "string"
      }
    }
  }
}'

Make sure you also re-index the data: deleting the index removed the documents, and anything indexed before these settings were applied was analyzed the wrong way.

The most important thing here is the asciifolding filter. It took me a while to figure out what to do, but asciifolding converts characters that are not among the first 127 ASCII characters into their ASCII equivalents where one exists, so é becomes e (http://www.elasticsearch.org/guide/reference/index-modules/analysis/asciifolding-tokenfilter/).
The standard filter and lowercase filter are default behaviour, but I’m afraid a custom analyzer has to list them explicitly again.
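
As a rough illustration of what the tokenize, lowercase and fold chain does to a name, here is a small Python sketch. It only approximates asciifolding by decomposing characters with Unicode NFKD and dropping the combining marks; the real filter works differently and handles more characters (letters without a Unicode decomposition pass through this sketch unchanged). The names fold_to_ascii and analyze are made up for this example.

```python
import re
import unicodedata

def fold_to_ascii(text):
    """Strip combining marks after NFKD decomposition (é -> e, ó -> o)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze(text):
    """Approximate the analyzer chain: tokenize, lowercase, fold."""
    return [fold_to_ascii(token) for token in re.findall(r"\w+", text.lower())]

print(analyze("Andrés Gómez"))  # ['andres', 'gomez']
```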

Now it’s possible to search for jose and they all show up in the results:

[Screenshot: search-02]

Note that both the index_analyzer and search_analyzer use the same configuration. If you only configure the index_analyzer, searching for José (with diacritics) returns no results: the indexed tokens have been folded to jose, while the query term is still josé.
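
The effect of analyzing only one side can be sketched in a few lines of Python. This is a simplified model and not ElasticSearch code: fold is an NFKD-based approximation of lowercase plus asciifolding, and lowercase_only and matches are hypothetical helpers.

```python
import unicodedata

def fold(token):
    """Lowercase and strip diacritics (approximation of lowercase + asciifolding)."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def lowercase_only(token):
    """An analyzer that lowercases but leaves diacritics alone."""
    return token.lower()

INDEXED_NAMES = ["José", "Jose", "André"]

def matches(query, index_analyzer, search_analyzer):
    """True if the analyzed query equals one of the analyzed indexed tokens."""
    indexed_tokens = {index_analyzer(name) for name in INDEXED_NAMES}
    return search_analyzer(query) in indexed_tokens

# Folding only at index time: the query keeps its accent and finds nothing.
print(matches("José", fold, lowercase_only))  # False
# Folding on both sides: query and index agree on "jose".
print(matches("José", fold, fold))            # True
```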

Also make sure you include the _all field. _all is used when searching across all fields and should have the same behaviour as the individual fields.

Searching with wildcards

I ran into some problems with wildcard search. I cannot reproduce the problem properly, but if you run into it, it’s fixable by setting the analyze_wildcard property to true. In Java it’s the analyzeWildcard method, and a query with curl would look like this (credits go to Shay Banon):


curl localhost:9200/tennis/player/_search -d '{
  "query" : {
    "query_string" : {
      "query" : "jose*",
      "analyze_wildcard" : true
    }
  }
}'

And if you find out how to reproduce it, please mention it in the comments and I’ll include that in the article, thanks :-)
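
One plausible way to hit this, offered as an assumption rather than a confirmed reproduction: wildcard terms in a query_string query are not passed through the analyzer unless analyze_wildcard is set, so a pattern that still contains a diacritic is compared against tokens that were folded at index time. The Python sketch below models only that idea; wildcard_search and the token table are hypothetical, and fold is an NFKD-based approximation of the analyzer.

```python
import fnmatch
import unicodedata

def fold(text):
    """Lowercase and strip diacritics (approximation of the analyzer chain)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Tokens as they would be stored after folding at index time (assumed).
INDEXED_TOKENS = {
    1: ["jose", "acasuso"],
    2: ["jose", "luis", "clerc"],
    3: ["jose", "higueras"],
}

def wildcard_search(pattern, analyze_wildcard):
    """Match a wildcard pattern against stored tokens, optionally folding it first."""
    if analyze_wildcard:
        pattern = fold(pattern)
    return sorted(
        doc_id
        for doc_id, tokens in INDEXED_TOKENS.items()
        if any(fnmatch.fnmatchcase(token, pattern) for token in tokens)
    )

print(wildcard_search("josé*", analyze_wildcard=False))  # []
print(wildcard_search("josé*", analyze_wildcard=True))   # [1, 2, 3]
```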

Encoding issues

There were some issues where the search query was part of the URL and some characters got messed up. Most solutions point to the configuration of your web server, but it was a Spring thing. You can solve it by adding a CharacterEncodingFilter to your web.xml:
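    <filter>
        <filter-name>CharacterEncodingFilter</filter-name>
        <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
        <init-param>
            <param-name>encoding</param-name>
            <param-value>UTF-8</param-value>
        </init-param>
        <init-param>
            <param-name>forceEncoding</param-name>
            <param-value>true</param-value>
        </init-param>
    </filter>
    <!-- Not in the original snippet: a filter only takes effect once it is mapped.
         The /* pattern is an assumption; adjust it to your application. -->
    <filter-mapping>
        <filter-name>CharacterEncodingFilter</filter-name>
        <url-pattern>/*</url-pattern>
    </filter-mapping>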

Watch the forceEncoding parameter: some browsers don’t include encoding information. For more information about encoding, read this Stack Overflow question.

What I wrote is basically a summary of that.
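
To show what "messed up" typically looks like, here is a small Python sketch (not Spring code) that percent-encodes a query as UTF-8 and then decodes it once as UTF-8 and once as ISO-8859-1. That the server side falls back to ISO-8859-1 is an assumption about the setup; it is a common default when no encoding is specified.

```python
from urllib.parse import quote, unquote

query = "José"

# A browser sending UTF-8 percent-encodes the two bytes of "é" separately.
encoded = quote(query)
print(encoded)  # Jos%C3%A9

# Decoding with the same encoding restores the original text.
right = unquote(encoded, encoding="utf-8")

# Decoding with ISO-8859-1 turns each byte into its own character.
wrong = unquote(encoded, encoding="iso-8859-1")
print(wrong)  # JosÃ©
```

A query decoded the wrong way never reaches ElasticSearch as José, so no analyzer setting can repair it afterwards.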

Conclusion

The most important lesson I learned is to create proper test cases. Search both with and without diacritics to make sure the proper results appear. Another lesson is to search for terms that sound related to your problem. In my case that was ascii folding: it sounded like something I could use, and in the end it was the solution.

  1. Nataly
    7 October 2013 at 11:38

    Hi! I’m new to elasticsearch and I faced almost the same problem and your solution was very useful. Thanks :)
    But in my case there is another problem – I have a string field (which can contain diacritics) and two modes of search: in the first mode we should ignore diacritics (like in the second part of your example) and in the second mode we shouldn’t (like in the beginning of your example).
    You wrote “Also make sure you include the _all field. _all is used when searching for all fields and should have the same behaviour as the individual field.” So is it possible to solve my problem without making two different indexes for the different search modes?

  2. Jeroen van Wilgenburg
    8 October 2013 at 18:33

    That’s an interesting problem. The first thing I can think of is indexing your data to two different fields (i.e. when you have a ‘name’ field, index it to both ‘name’ and ‘name_d’, with _d for diacritics) with two different index and search analyzers. This won’t work with the _all field though.

    I think I’d go for two indexes if you really need the _all field. Creating multiple indexes isn’t as bad as it sounds ;-)

    Please let me know if you found another solution and good luck with finding one!
