Elasticsearch Find Similar Documents Tutorial (with Examples)

By Minh Vu

Updated Dec 28, 2023

In this tutorial, we will learn how to find similar documents in Elasticsearch using the more_like_this query.

The more_like_this query finds documents that are similar to a given set of documents.

This can be useful when you want to query similar blog posts, products, or any other documents.

For example, on this site (dminhvu.com), I use the more_like_this query to find blog posts similar to the one you are reading; they appear in the "You Might Also Like" section on the right side of the page.

How to Use the more_like_this Query in Elasticsearch to Find Similar Documents

The more_like_this query can be used in 3 ways:

  1. Find similar documents to a given text.
  2. Find similar documents to a given document or set of documents.
  3. Find similar documents to a mixed set of documents and text.

Let's discover the syntax and usage of each of these ways.

I will use the following documents for the examples throughout this tutorial:

documents
{
  "id": 1,
  "title": "How to Find Similar Documents in Elasticsearch",
  "description": "In this tutorial, we will learn how to find similar documents in Elasticsearch using the more_like_this query."
}
{
  "id": 2,
  "title": "Partial Update in Elasticsearch Guide (with Examples)",
  "description": "Learn how to perform partial updates in Elasticsearch 8.x to update only specific fields in a document using the Update API."
}
{
  "id": 3,
  "title": "Logstash Input from JSON File",
  "description": "Learn how to parse logs from a JSON file in Logstash using the multiline codec plugin."
}
{
  "id": 4,
  "title": "Python Extract Year from Date",
  "description": "Learn how to extract year from date in Python, including string date, date object, timestamp, and current year."
}
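
If you want to load these sample documents yourself, one option is the _bulk API. The sketch below only builds the newline-delimited request body; it assumes the index is called posts (as in the queries that follow) and that each document's _id equals its id field. Sending the body (for example as POST posts/_bulk with the Content-Type application/x-ndjson) is left to the HTTP client of your choice.

```python
import json

# The four sample posts used throughout this tutorial.
docs = [
    {"id": 1,
     "title": "How to Find Similar Documents in Elasticsearch",
     "description": "In this tutorial, we will learn how to find similar documents in Elasticsearch using the more_like_this query."},
    {"id": 2,
     "title": "Partial Update in Elasticsearch Guide (with Examples)",
     "description": "Learn how to perform partial updates in Elasticsearch 8.x to update only specific fields in a document using the Update API."},
    {"id": 3,
     "title": "Logstash Input from JSON File",
     "description": "Learn how to parse logs from a JSON file in Logstash using the multiline codec plugin."},
    {"id": 4,
     "title": "Python Extract Year from Date",
     "description": "Learn how to extract year from date in Python, including string date, date object, timestamp, and current year."},
]


def to_bulk_ndjson(docs):
    """Build the newline-delimited body expected by the _bulk API."""
    lines = []
    for doc in docs:
        # Action line: index the document under its own id (an assumption).
        lines.append(json.dumps({"index": {"_id": str(doc["id"])}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # A _bulk body must end with a newline.
    return "\n".join(lines) + "\n"


payload = to_bulk_ndjson(docs)
print(payload)
```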

1. Find Similar Documents to a Given Text

The syntax of the more_like_this query to find similar documents to a given text is as follows:

query
GET <index>/_search
{
  "query": {
    "more_like_this": {
      "fields": ["field1", "field2", ...],
      "like": "text to find similar documents to",
      "min_term_freq": 1,
      "max_query_terms": 10
    }
  }
}

where:

  • <index>: the name of the index to search.
  • fields: the fields whose text is analyzed and compared.
  • like: the text to find similar documents to.
  • min_term_freq: the minimum number of times a term must occur in the input to be used; terms below this frequency are ignored. Defaults to 2.
  • max_query_terms: the maximum number of query terms that will be selected. Higher values can improve accuracy at the cost of slower queries. Defaults to 25.
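
To get a feel for what min_term_freq and max_query_terms do, here is a deliberately simplified Python illustration. It is not how Elasticsearch is implemented: the function name select_terms is made up for this example, tokenization is a plain regular expression, and terms are ranked by raw frequency only, whereas the real query ranks terms by tf-idf and applies further filters such as min_doc_freq.

```python
import re
from collections import Counter


def select_terms(text, min_term_freq=1, max_query_terms=10):
    """Rough illustration of how the input is narrowed down to query terms."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    # Drop terms that occur fewer than min_term_freq times in the input.
    kept = [(term, n) for term, n in counts.items() if n >= min_term_freq]
    # Most frequent first; ties are broken alphabetically to stay deterministic.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    # Keep at most max_query_terms terms.
    return [term for term, _ in kept[:max_query_terms]]


text = "python extract year from date"
print(select_terms(text))
# ['date', 'extract', 'from', 'python', 'year']
print(select_terms(text + " python date", min_term_freq=2))
# ['date', 'python']
```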

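For example, to find similar documents to the text "python extract year from date", we can use the following query. Because the sample index holds only four documents, min_doc_freq is set to 1; with its default of 5, terms that appear in fewer than five documents are ignored, which on this index means every term:

query
GET posts/_search
{
  "query": {
    "more_like_this": {
      "fields": ["title", "description"],
      "like": "python extract year from date",
      "min_term_freq": 1,
      "min_doc_freq": 1,
      "max_query_terms": 10
    }
  }
}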

The above query will return the following results:

response
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "posts",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.2876821,
        "_source": {
          "id": 4,
          "title": "Python Extract Year from Date",
          "description": "Learn how to extract year from date in Python, including string date, date object, timestamp, and current year."
        }
      }
    ]
  }
}

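As you can see, the query returns the document with id = 4 because its title contains every term of the input text ("python", "extract", "year", "from", "date"). Note that more_like_this matches on individual terms, not on the exact phrase.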

2. Find Similar Documents to a Given Document or Set of Documents

You can also pass the more_like_this query one or more indexed documents instead of free text. The syntax is as follows:

query
GET <index>/_search
{
  "query": {
    "more_like_this": {
      "fields": ["field1", "field2", ...],
      "like": [
        {
          "_index": "<index>",
          "_id": "<id>"
        },
        {
          "_index": "<index>",
          "_id": "<id>"
        },
        ...
      ],
      "min_term_freq": 1,
      "max_query_terms": 10
    }
  }
}

where:

  • <index>: the name of the index to search.
  • _index: the name of the index that holds the input document.
  • _id: the id of the input document. Input documents are excluded from the search results unless the include parameter is set to true.

For example, to find similar documents to the document with id = 1 in the posts index, we can use the following query (min_doc_freq is set to 1 because the sample index holds fewer documents than the default threshold of 5):

query
GET posts/_search
{
  "query": {
    "more_like_this": {
      "fields": ["title", "description"],
      "like": [
        {
          "_index": "posts",
          "_id": "1"
        }
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1,
      "max_query_terms": 10
    }
  }
}

The above query will return the following results:

response
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "posts",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "id": 2,
          "title": "Partial Update in Elasticsearch Guide (with Examples)",
          "description": "Learn how to perform partial updates in Elasticsearch 8.x to update only specific fields in a document using the Update API."
        }
      }
    ]
  }
}

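As you can see, the top hit is the document with id = 2, which shares the term "elasticsearch" with the document with id = 1. The input document itself (id = 1) is excluded from the results. Only the top hit is shown here; hits.total reports 3 matching documents.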
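
To turn such a response into a list of related posts (for something like a "You Might Also Like" box), you only need the hits.hits array. The helper below is a minimal sketch, not the code this site actually runs; the function name related_posts is hypothetical, and the response is a trimmed copy of the one above.

```python
def related_posts(response):
    """Return (id, title, score) for every hit in a search response."""
    return [
        (hit["_id"], hit["_source"]["title"], hit["_score"])
        for hit in response["hits"]["hits"]
    ]


# Trimmed copy of the response shown above.
response = {
    "hits": {
        "total": {"value": 3, "relation": "eq"},
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "posts",
                "_id": "2",
                "_score": 0.2876821,
                "_source": {
                    "id": 2,
                    "title": "Partial Update in Elasticsearch Guide (with Examples)",
                },
            }
        ],
    }
}

for post_id, title, score in related_posts(response):
    print(post_id, title, score)
```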

3. Find Similar Documents to a Mixed Set of Documents and Text

To find similar documents to a mixed set of documents and text, you can use the more_like_this query as follows:

query
GET <index>/_search
{
  "query": {
    "more_like_this": {
      "fields": ["field1", "field2", ...],
      "like": [
        {
          "_index": "<index>",
          "doc": {
            "some_field": "some value",
            "another_field": "some text to find similar documents to",
            ...
          }
        },
        {
          "_index": "<index>",
          "_id": "<id>"
        },
        ...
      ],
      "min_term_freq": 1,
      "max_query_terms": 10
    }
  }
}

where:

  • <index>: the name of the index to search.
  • _index: the name of the index that holds the input document (or, for an artificial document, the index it is compared against).
  • _id: the id of an indexed document to find similar documents to. Input documents are excluded from the search results unless the include parameter is set to true.
  • doc: an artificial document, that is, field values supplied inline instead of being fetched from the index. The field names some_field and another_field are placeholders; use the names of the fields listed in fields.

For example, to find similar documents to the document with id = 1 in the posts index and the text "python extract year from date", we can use the following query (min_doc_freq is set to 1 because the sample index holds fewer documents than the default threshold of 5):

query
GET posts/_search
{
  "query": {
    "more_like_this": {
      "fields": ["title", "description"],
      "like": [
        {
          "_index": "posts",
          "_id": "1"
        },
        {
          "_index": "posts",
          "doc": {
            "title": "python extract year from date",
            "description": "python extract year from date"
          }
        }
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1,
      "max_query_terms": 10
    }
  }
}

The above query will return the following results:

response
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "posts",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "id": 2,
          "title": "Partial Update in Elasticsearch Guide (with Examples)",
          "description": "Learn how to perform partial updates in Elasticsearch 8.x to update only specific fields in a document using the Update API."
        }
      },
      {
        "_index": "posts",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.2876821,
        "_source": {
          "id": 4,
          "title": "Python Extract Year from Date",
          "description": "Learn how to extract year from date in Python, including string date, date object, timestamp, and current year."
        }
      }
    ]
  }
}

As you can see, the query returns the documents with id = 2 and id = 4: the first shares the term "elasticsearch" with the document with id = 1, and the second matches the terms of the artificial document ("python extract year from date").
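
If you build this request from application code, the mixed like list is easy to assemble programmatically. The following is a minimal sketch using plain dictionaries; the function name more_like_this_query and its parameters are hypothetical, and how you send the resulting body to GET posts/_search is up to your client.

```python
import json


def more_like_this_query(index, fields, like_ids=(), like_docs=(), **options):
    """Build a more_like_this request body from document ids and artificial documents."""
    # References to documents that already exist in the index.
    like = [{"_index": index, "_id": str(doc_id)} for doc_id in like_ids]
    # Artificial documents supplied inline.
    like += [{"_index": index, "doc": doc} for doc in like_docs]
    return {
        "query": {
            "more_like_this": {"fields": list(fields), "like": like, **options}
        }
    }


body = more_like_this_query(
    "posts",
    ["title", "description"],
    like_ids=[1],
    like_docs=[{
        "title": "python extract year from date",
        "description": "python extract year from date",
    }],
    min_term_freq=1,
    # The sample index has only four documents, below the default min_doc_freq of 5.
    min_doc_freq=1,
    max_query_terms=10,
)
print(json.dumps(body, indent=2))
```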

Conclusion

In this tutorial, we have learned how to find similar documents in Elasticsearch using the more_like_this query.

  • The more_like_this query finds documents that are similar to a given set of documents.
  • The more_like_this query can be used in 3 ways:
    • Find similar documents to a given text.
    • Find similar documents to a given document or set of documents.
    • Find similar documents to a mixed set of documents and text.
  • The more_like_this query is useful when you want to query similar blog posts, products, or any other documents.