What is Elasticsearch
Elasticsearch is a search engine built on top of Apache Lucene. Its most common use case is full-text search, although its capabilities are much more extensive. I hope this tutorial will show you the most important basics of this tool.
Setup
We will need Docker installed. Let's prepare a docker-compose.yml file with a basic Elasticsearch setup:
version: "3.5"
services:
  elasticsearch:
    image: elasticsearch:7.10.2
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"
Use this command to start the service:
docker-compose up -d
It might take around a minute for the service to start. Check whether it is running with:
curl -XGET localhost:9200
The response from this endpoint should look something like this:
{
  "name" : "89e77e9abe2d",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "MLOf806ySVeHCWl9I3z0BA",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
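You can also query the cluster health endpoint. On a single-node setup the status will usually be yellow, because replica shards cannot be allocated to another node; that is fine for local development:
curl -s -XGET localhost:9200/_cluster/health | jq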
In case of problems:
docker-compose stop
docker-compose up # this will show docker output directly
Creating an index
First of all, we need to create an index in Elasticsearch:
curl -XPUT localhost:9200/my-index-1
The response should contain "acknowledged":true. That means our first index my-index-1 was created.
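In Elasticsearch 7.x the full response should look similar to this:
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "my-index-1"
}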
Now let's see the list of all indices:
curl -XGET localhost:9200/_cat/indices
This list should include our newly created index my-index-1.
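Tip: the _cat APIs accept a ?v parameter that adds column headers, which makes the output easier to read:
curl -XGET 'localhost:9200/_cat/indices?v'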
In case of problems, this command will delete the index so you can recreate it:
# curl -XDELETE localhost:9200/my-index-1
Mapping the data
First we need to add a mapping to the index. The mapping tells Elasticsearch how to interpret the data we will send there. We will also need a command-line tool called jq: https://stedolan.github.io/jq/download/
jq -n '
{
  "properties": {
    "name": { "type": "text" }
  }
}
' | curl -s -XPUT localhost:9200/my-index-1/_mapping -H "Content-Type: application/json" -d @- | jq
What happens here: we send a JSON body:
{
  "properties": {
    "name": { "type": "text" }
  }
}
to the endpoint:
/my-index-1/_mapping
and then we pipe the response through jq again to pretty-print it.
Our mapping is:
"name": { "type": "text" }
That means our documents will contain only one field, called name.
To check the mapping, use:
curl -XGET localhost:9200/my-index-1 | jq
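The response should echo our mapping back, roughly like this:
"mappings": {
  "properties": {
    "name": {
      "type": "text"
    }
  }
}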
Adding sample data
Let's put the first document in the index:
jq -n '
{
  "name": "John Rambo"
}
' | curl -s -XPUT localhost:9200/my-index-1/_doc/1 -H "Content-Type: application/json" -d @-
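The response should confirm the write with "result": "created" and look similar to this:
{
  "_index": "my-index-1",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}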
Now let’s list all the documents we have there:
jq -n '
{
  "query": {
    "match_all": {}
  }
}
' | curl -s -XPOST localhost:9200/my-index-1/_search -H "Content-Type: application/json" -d @- | jq
In the response we’ll see:
"hits": [
{
"_index": "my-index-1",
"_type": "_doc",
"_id": "1",
"_score": 1,
"_source": {
"name": "John Rambo"
}
}
]
That means our first document is there!
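If you only need the number of documents rather than their contents, the _count endpoint is a quicker check:
curl -s -XGET localhost:9200/my-index-1/_count | jq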
Let’s check how our document was indexed:
curl -s -XGET 'localhost:9200/my-index-1/_termvectors/1?fields=name' | jq
The response will include:
"terms": {
"john": {
"term_freq": 1,
"tokens": [
{
"position": 0,
"start_offset": 0,
"end_offset": 4
}
]
},
"rambo": {
"term_freq": 1,
"tokens": [
{
"position": 1,
"start_offset": 5,
"end_offset": 10
}
]
}
}
As we can see, John Rambo was divided into two terms: john and rambo. Note that they were also lowercased.
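This is the work of the default standard analyzer: it splits the text on word boundaries and lowercases the tokens. You can use the _analyze API to see how any string would be tokenized for our name field:
jq -n '
{
  "field": "name",
  "text": "John Rambo"
}
' | curl -s -XPOST localhost:9200/my-index-1/_analyze -H "Content-Type: application/json" -d @- | jq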
Let's try to search for one of the terms:
jq -n '
{
  "query": {
    "terms": {
      "name": [ "john" ]
    }
  }
}
' | curl -s -XPOST localhost:9200/my-index-1/_search -H "Content-Type: application/json" -d @- | jq
We'll see our document there!
But what if we want to search for something smaller, like jo or joh? If we try that now, we will get empty results, because only the full terms john and rambo exist in the index. We need to change the mapping.
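For example, this query comes back with an empty hits list, because the exact term jo does not exist in the index:
jq -n '
{
  "query": {
    "terms": {
      "name": [ "jo" ]
    }
  }
}
' | curl -s -XPOST localhost:9200/my-index-1/_search -H "Content-Type: application/json" -d @- | jq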
Let’s delete the index:
curl -XDELETE localhost:9200/my-index-1
Ngram Tokenizer
Let’s create a new index:
jq -n '
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 4
        }
      },
      "analyzer": {
        "ngram_tokenizer_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "term_vector": "yes",
        "analyzer": "ngram_tokenizer_analyzer"
      }
    }
  }
}
' | curl -s -XPUT localhost:9200/my-index-1 -H "Content-Type: application/json" -d @- | jq
And now let’s add our example again:
jq -n '
{
  "name": "John Rambo"
}
' | curl -s -XPUT localhost:9200/my-index-1/_doc/1 -H "Content-Type: application/json" -d @-
Let's see how our document was indexed:
curl -s -XGET 'localhost:9200/my-index-1/_termvectors/1?fields=name' | jq
"terms": {
" Ram": {
"term_freq": 1
},
"John": {
"term_freq": 1
},
"Ramb": {
"term_freq": 1
},
"ambo": {
"term_freq": 1
},
"hn R": {
"term_freq": 1
},
"n Ra": {
"term_freq": 1
},
"ohn ": {
"term_freq": 1
}
}
As we can see, our John Rambo example was split into 4-letter terms, including the space and preserving the original case. This happened because we added an ngram tokenizer to the mapping:
"ngram_tokenizer": {
"type": "ngram",
"min_gram": 4,
"max_gram": 4
}
Unfortunately, those settings are not very useful: the grams cross word boundaries and keep the original case. Let's tweak the parameters a bit: delete the old index, then use a filter of type "edge_ngram" with lengths between 2 and 20, combined with the standard tokenizer and a lowercase filter:
jq -n '
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ngram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "term_vector": "yes",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
' | curl -s -XPUT localhost:9200/my-index-1 -H "Content-Type: application/json" -d @- | jq
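Note the "search_analyzer": "standard" line: we want ngrams at index time only, while the query text should stay intact at search time. You can check what the custom analyzer produces with the _analyze API, using the analyzer name we just defined:
jq -n '
{
  "analyzer": "ngram_analyzer",
  "text": "John Rambo"
}
' | curl -s -XPOST localhost:9200/my-index-1/_analyze -H "Content-Type: application/json" -d @- | jq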
Let's put our document in the index again (same command as before) and check the term vectors:
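jq -n '
{
  "name": "John Rambo"
}
' | curl -s -XPUT localhost:9200/my-index-1/_doc/1 -H "Content-Type: application/json" -d @-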
curl -s -XGET 'localhost:9200/my-index-1/_termvectors/1?fields=name' | jq
The response will include:
"terms": {
  "jo": {
    "term_freq": 1
  },
  "joh": {
    "term_freq": 1
  },
  "john": {
    "term_freq": 1
  },
  "ra": {
    "term_freq": 1
  },
  "ram": {
    "term_freq": 1
  },
  "ramb": {
    "term_freq": 1
  },
  "rambo": {
    "term_freq": 1
  }
}
Now it's much easier to search: we can use partial terms like ram or ramb.
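For example, a match query for the partial term ram now finds John Rambo. The query text goes through the standard search analyzer, so it stays a single lowercase term that matches one of the indexed ngrams:
jq -n '
{
  "query": {
    "match": { "name": "ram" }
  }
}
' | curl -s -XPOST localhost:9200/my-index-1/_search -H "Content-Type: application/json" -d @- | jq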
Searching for multiple terms
Let’s put a second document to the index:
jq -n '
{
  "name": "John Bon Jovi"
}
' | curl -s -XPUT localhost:9200/my-index-1/_doc/2 -H "Content-Type: application/json" -d @-
And now try to search:
jq -n '
{
  "query": {
    "terms": {
      "name": [ "jo", "ram" ]
    }
  }
}
' | curl -s -XPOST localhost:9200/my-index-1/_search -H "Content-Type: application/json" -d @- | jq
The response should look like this:
"hits": [
{
"_index": "my-index-1",
"_type": "_doc",
"_id": "1",
"_score": 1,
"_source": {
"name": "John Rambo"
}
},
{
"_index": "my-index-1",
"_type": "_doc",
"_id": "2",
"_score": 1,
"_source": {
"name": "John Bon Jovi"
}
}
]
As we can see, we get two results, because both names contain a word starting with jo, so the term jo is in the index for both documents.
Bool queries
We can use more sophisticated queries:
jq -n '
{
  "query": {
    "bool": {
      "must": {
        "term": { "name": "john" }
      },
      "should": {
        "term": { "name": "jovi" }
      },
      "must_not": {
        "term": { "name": "test" }
      }
    }
  }
}
' | curl -s -XPOST localhost:9200/my-index-1/_search -H "Content-Type: application/json" -d @- | jq
Now we can see both documents, but in a different order. Try to guess why ;)
Of course, this is just a glimpse of what Elasticsearch can do.
I hope my tutorial helped you with some basic steps :)