What is Elasticsearch

Elasticsearch is used primarily for full-text search, although its capabilities are much more extensive. I hope this tutorial will show you the most important basics of this tool.

Setup

We will need Docker installed. Let’s prepare a docker-compose.yml file with a basic Elasticsearch setup:

version: "3.5"
services:
  elasticsearch:
    image: elasticsearch:7.2.0
    environment:
      - discovery.type=single-node
    ports:
      - "9200:9200"

Use this command to start the service:

docker-compose up -d

It might take around a minute for the service to start. Check whether it is running with this command:

curl -XGET localhost:9200

The response should look something like this (this sample was captured from a 7.10.2 node, so your version fields will reflect the image tag you run):

{
  "name" : "89e77e9abe2d",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "MLOf806ySVeHCWl9I3z0BA",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
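Since startup can take a while, the check above could be wrapped in a small polling loop. The sketch below is only an illustration: wait_for_es is a made-up helper name, and the default URL, attempt count and delay are assumptions you would adjust to your setup.

```shell
# wait_for_es URL TRIES DELAY
# Hypothetical helper: polls URL with curl until it answers or TRIES attempts are used up.
wait_for_es() {
  url="${1:-localhost:9200}"   # endpoint to poll (assumed default from this tutorial)
  tries="${2:-30}"             # maximum number of attempts
  delay="${3:-2}"              # seconds to sleep between attempts
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -s hides progress output, -f treats HTTP errors as failures,
    # --max-time bounds each attempt so the loop cannot hang
    if curl -sf --max-time 5 "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

It could then be used like this: wait_for_es localhost:9200 30 2 && echo "Elasticsearch is up"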

In case of problems:

docker-compose stop
docker-compose up # this will show docker output directly

Creating an index

First of all, we need an index in Elasticsearch:

curl -XPUT localhost:9200/my-index-1

The response should contain "acknowledged":true.
That means our first index, my-index-1, was created.

Now let’s list all indices:

curl -XGET localhost:9200/_cat/indices

This list should include our newly created index, my-index-1.

In case of problems, this will delete the index so you can recreate it:

# curl -XDELETE localhost:9200/my-index-1

Mapping the data

First we need to add a mapping to the index. A mapping tells Elasticsearch which fields our documents have and how each of them should be indexed. We also need a command-line tool called jq: https://stedolan.github.io/jq/download/

jq -n '
 {
  "properties": {
    "name": { "type": "text" }
  }
 }
' | curl -s -XPUT localhost:9200/my-index-1/_mapping -H "Content-Type: application/json" -d @- | jq

What happens here: we send these JSON-formatted parameters:

{
  "properties": {
    "name": { "type": "text" }
  }
 }

to endpoint:

/my-index-1/_mapping

And then we pipe the response through jq again to pretty-print it.

Our mapping will be:

"name": { "type": "text" }

That means our documents will have a single mapped field called name.

To check the mapping use:

curl -XGET localhost:9200/my-index-1 | jq

Adding sample data

Let’s put the first document into the index:

jq -n '
 {
   "name": "John Rambo"
 }
' | curl -s -XPUT localhost:9200/my-index-1/_doc/1 -H "Content-Type: application/json" -d @-

Now let’s list all the documents we have there:

jq -n '
 {
    "query": {
        "match_all": {}
    }
 }
' | curl -s -XPOST localhost:9200/my-index-1/_search -H "Content-Type: application/json" -d @- | jq

In the response we’ll see:

"hits": [
      {
        "_index": "my-index-1",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "name": "John Rambo"
        }
      }
    ]

That means our first document is there!

Let’s check how our document was indexed:

curl -s -XGET 'localhost:9200/my-index-1/_termvectors/1?fields=name' | jq

Response will include:

"terms": {
        "john": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 4
            }
          ]
        },
        "rambo": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 5,
              "end_offset": 10
            }
          ]
        }
      }

As we can see, John Rambo was split into two lowercase terms: john and rambo.
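To get a feel for why the terms come out this way, here is a rough shell imitation of what the default analyzer does to simple ASCII text: lowercase everything and split it into words. This is only a sketch. The real standard analyzer follows Unicode text segmentation rules and also deals with punctuation, and rough_standard_analyze is a made-up helper name.

```shell
# rough_standard_analyze TEXT
# Hypothetical helper: lowercases TEXT and prints one space-separated word per line.
# It only approximates the standard analyzer for plain ASCII words.
rough_standard_analyze() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n'
}

# Prints "john" and "rambo" on separate lines
rough_standard_analyze "John Rambo"
```

Keep the lowercasing in mind: the terms query compares its input with the stored terms as-is, so the search terms need to be lowercase as well.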

Let’s try to search for one of the terms:

jq -n '
{
  "query": {
    "terms": {
      "name": [ "john" ]
    }
  }
}
' | curl -s -XPOST localhost:9200/my-index-1/_search -H "Content-Type: application/json" -d @- | jq

We’ll see our document there!

But what if we want to search for something shorter, such as Jo or Joh?

If we try that now, we will get empty results. We need to change the mapping.

Let’s delete the index:

curl -XDELETE localhost:9200/my-index-1

Ngram Tokenizer

Let’s create a new index:

jq -n '
 {
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "tokenizer": {
            "ngram_tokenizer": {
               "type": "ngram",
               "min_gram": 4,
               "max_gram": 4
            }
         },
         "analyzer": {
            "ngram_tokenizer_analyzer": {
               "type": "custom",
               "tokenizer": "ngram_tokenizer"
            }
         }
      }
   },
   "mappings": {
     "properties": {
       "name": {
         "type": "text",
         "term_vector": "yes",
         "analyzer": "ngram_tokenizer_analyzer"
       }
     }
   }
}
' | curl -s -XPUT localhost:9200/my-index-1 -H "Content-Type: application/json" -d @- | jq

And now let’s add our example again:

jq -n '
 {
   "name": "John Rambo"
 }
' | curl -s -XPUT localhost:9200/my-index-1/_doc/1 -H "Content-Type: application/json" -d @-

Let’s see how our document was indexed:

curl -s -XGET 'localhost:9200/my-index-1/_termvectors/1?fields=name' | jq

The response will include:

"terms": {
        " Ram": {
          "term_freq": 1
        },
        "John": {
          "term_freq": 1
        },
        "Ramb": {
          "term_freq": 1
        },
        "ambo": {
          "term_freq": 1
        },
        "hn R": {
          "term_freq": 1
        },
        "n Ra": {
          "term_freq": 1
        },
        "ohn ": {
          "term_freq": 1
        }
      }

As we can see, our John Rambo example was split into 4-character terms, with spaces and letter case preserved. This happened because we added an ngram tokenizer to the index settings and used it in the mapping:

"ngram_tokenizer": {
               "type": "ngram",
               "min_gram": 4,
               "max_gram": 4
            }

Unfortunately, those settings are not very useful: only exact 4-character fragments such as Ramb can be found, and the terms are case-sensitive.
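The sliding window behind those terms can be imitated in a few lines of shell. This is just a sketch for intuition, assuming min_gram and max_gram are equal; ngrams is a made-up helper and has nothing to do with how Elasticsearch implements the tokenizer internally.

```shell
# ngrams TEXT N
# Hypothetical helper: prints every N-character substring of TEXT, one per line,
# keeping spaces and letter case just like the term vectors above.
ngrams() {
  printf '%s\n' "$1" | awk -v n="$2" '{
    for (i = 1; i + n - 1 <= length($0); i++)
      print substr($0, i, n)
  }'
}

# Prints 7 fragments, from "John" to "ambo"
ngrams "John Rambo" 4
```

A 10-character input yields 10 - 4 + 1 = 7 fragments, which matches the seven terms in the response.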

Let’s tweak the parameters a bit. Delete the old index (using the DELETE command shown earlier), then create it again with a token filter of type edge_ngram that produces n-grams with lengths between 2 and 20:

jq -n '
 {
   "settings": {
      "number_of_shards": 1,
      "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "edge_ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "ngram_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "ngram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
     "properties": {
       "name": {
         "type": "text",
         "term_vector": "yes",
         "analyzer": "ngram_analyzer",
         "search_analyzer": "standard"
       }
     }
   }
}
' | curl -s -XPUT localhost:9200/my-index-1 -H "Content-Type: application/json" -d @- | jq

Let’s put our document into the index again (using the same PUT command as before) and check the term vectors:

curl -s -XGET 'localhost:9200/my-index-1/_termvectors/1?fields=name' | jq

The response will include:

"terms": {
        "jo": {
          "term_freq": 1
        },
        "joh": {
          "term_freq": 1
        },
        "john": {
          "term_freq": 1
        },
        "ra": {
          "term_freq": 1
        },
        "ram": {
          "term_freq": 1
        },
        "ramb": {
          "term_freq": 1
        },
        "rambo": {
          "term_freq": 1
        }
      }

Now it’s much easier to search: we can use terms like ram or ramb.
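The combination of standard tokenizer, lowercase filter and edge_ngram filter can be approximated in shell as well. Again, this is only a sketch for plain ASCII input: edge_ngrams is a made-up helper, and splitting on spaces is a simplification of what the standard tokenizer really does.

```shell
# edge_ngrams TEXT LO HI
# Hypothetical helper: lowercases TEXT, splits it on spaces and prints each
# word's prefixes of length LO..HI, one per line.
edge_ngrams() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n' |
    awk -v lo="$2" -v hi="$3" '{
      for (n = lo; n <= hi && n <= length($0); n++)
        print substr($0, 1, n)
    }'
}

# Prints: jo joh john ra ram ramb rambo (one per line)
edge_ngrams "John Rambo" 2 20
```

Every prefix starts at the beginning of a word, so ram can be found but amb cannot.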

Searching for multiple terms

Let’s put a second document into the index:

jq -n '
 {
   "name": "John Bon Jovi"
 }
' | curl -s -XPUT localhost:9200/my-index-1/_doc/2 -H "Content-Type: application/json" -d @-

And now try to search:

jq -n '
{
  "query": {
    "terms": {
      "name": [ "jo", "ram" ]
    }
  }
}
' | curl -s -XPOST localhost:9200/my-index-1/_search -H "Content-Type: application/json" -d @- | jq

The response should look like this:

"hits": [
      {
        "_index": "my-index-1",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "name": "John Rambo"
        }
      },
      {
        "_index": "my-index-1",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "name": "John Bon Jovi"
        }
      }
    ]

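As one can see, we have 2 results, because both documents contain the indexed term jo (a prefix of John in both documents, and of Jovi in the second one). A terms query matches a document if at least one of the listed terms is found.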
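The "at least one term" behaviour can be modelled with a toy shell function. This is only a mental model under simplifying assumptions: matches_any is a made-up helper, the term lists are written out by hand from the edge n-gram settings above, and Elasticsearch itself looks terms up in an inverted index rather than looping like this.

```shell
# matches_any INDEXED_TERMS QUERY_TERM...
# Hypothetical helper: succeeds if at least one query term is exactly equal
# to one of the space-separated indexed terms.
matches_any() {
  indexed="$1"; shift
  for q in "$@"; do
    for t in $indexed; do
      [ "$q" = "$t" ] && return 0
    done
  done
  return 1
}

doc1="jo joh john ra ram ramb rambo"       # terms for "John Rambo"
doc2="jo joh john bo bon jo jov jovi"      # terms for "John Bon Jovi"

matches_any "$doc1" jo ram && echo "doc 1 matches"
matches_any "$doc2" jo ram && echo "doc 2 matches"
matches_any "$doc2" Jo || echo "no match for Jo: the comparison is exact"
```

The last line shows why the query uses lowercase jo: the terms query does not analyze its input, so it has to match the stored lowercase terms exactly.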

Bool queries

We can use more sophisticated queries:

jq -n '
{
  "query": {
    "bool": {
      "must": {
        "term": { "name": "john" }
      },
      "should": {
        "term": { "name": "jovi" }
      },
      "must_not": {
        "term": { "name": "test" }
      }
    }
  }
}
' | curl -s -XPOST localhost:9200/my-index-1/_search -H "Content-Type: application/json" -d @- | jq

Now we can see both documents, but in a different order. Can you guess why? ;)

Of course this is just a glimpse of things that Elasticsearch can do.

I hope my tutorial helped you with some basic steps :)