
Elasticsearch

"Elasticsearch: full-text search, aggregations, mapping, bulk indexing, Node.js client, relevance tuning"

Core Philosophy

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. Its core tenets are:

  • Schema flexibility — fields can be dynamically mapped or explicitly defined. Explicit mappings give you control over how text is analyzed and how fields are indexed.
  • Query DSL — a powerful JSON-based query language supports full-text search, structured filters, aggregations, geo queries, and more in a single request.
  • Distributed by design — indices are split into shards and replicated across nodes. Scaling is horizontal.
  • Near-real-time — documents are searchable within one second of indexing by default (configurable refresh interval).
  • Aggregation engine — beyond search, Elasticsearch performs analytics (bucketing, metrics, pipelines) directly on indexed data.

Elasticsearch suits workloads ranging from site search to log analytics to security information and event management (SIEM).

Setup

Install the official Node.js client and connect:

// npm install @elastic/elasticsearch

import { Client } from "@elastic/elasticsearch";

const client = new Client({
  node: "http://localhost:9200",
  auth: { username: "elastic", password: "changeme" },
  // For Elastic Cloud:
  // cloud: { id: "deployment:abc123..." },
  // auth: { apiKey: "base64key" },
});

// Verify connection
const info = await client.info();
console.log(`Connected to Elasticsearch ${info.version.number}`);

Create an index with explicit mappings:

async function createProductIndex() {
  await client.indices.create({
    index: "products",
    body: {
      settings: {
        number_of_shards: 1,
        number_of_replicas: 1,
        analysis: {
          analyzer: {
            product_analyzer: {
              type: "custom",
              tokenizer: "standard",
              filter: ["lowercase", "asciifolding", "edge_ngram_filter"],
            },
          },
          filter: {
            edge_ngram_filter: {
              type: "edge_ngram",
              min_gram: 2,
              max_gram: 15,
            },
          },
        },
      },
      mappings: {
        properties: {
          name: {
            type: "text",
            analyzer: "product_analyzer",
            search_analyzer: "standard",
            fields: { keyword: { type: "keyword" } },
          },
          description: { type: "text" },
          price: { type: "float" },
          categories: { type: "keyword" },
          brand: { type: "keyword" },
          rating: { type: "float" },
          created_at: { type: "date" },
          location: { type: "geo_point" },
          in_stock: { type: "boolean" },
        },
      },
    },
  });
}

Key Techniques

Bulk Indexing

async function bulkIndex(products: Product[]) {
  const operations = products.flatMap((doc) => [
    { index: { _index: "products", _id: doc.id } },
    doc,
  ]);

  const { errors, items } = await client.bulk({
    refresh: true, // convenient in examples; avoid forcing a refresh on every bulk call in production
    operations,
  });

  if (errors) {
    const failedItems = items.filter((item) => item.index?.error);
    console.error(`${failedItems.length} documents failed`, failedItems);
  }
}

// Stream large datasets with a helper. Note: helpers.bulk returns stats
// (total, failed, ...), not an error list; collect failures via onDrop.
async function bulkIndexStream(products: AsyncIterable<Product>) {
  const dropped: unknown[] = [];
  const { total, failed } = await client.helpers.bulk({
    datasource: products,
    onDocument(doc: Product) {
      return { index: { _index: "products", _id: doc.id } };
    },
    onDrop(doc) {
      dropped.push(doc); // documents that failed after all retries
    },
    refreshOnCompletion: true,
  });

  console.log(`Indexed ${total} documents, ${failed} failed`);
  if (dropped.length > 0) console.error("Dropped documents:", dropped);
}
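
The 1,000-5,000 batch-size guidance can be applied by splitting a large array before calling `bulkIndex`. A minimal sketch; `chunk` is a hypothetical helper, not part of the client, and the batch size is a starting point to tune:

```typescript
// Sketch: split a large document array into bulk-sized batches.
function chunk<T>(docs: T[], size = 1000): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < docs.length; i += size) {
    batches.push(docs.slice(i, i + size));
  }
  return batches;
}

// Usage: for (const batch of chunk(allProducts, 2000)) await bulkIndex(batch);
```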

Full-Text Search with Filters

async function searchProducts(query: string, filters: ProductFilters = {}) {
  const must: any[] = [];
  const filterClauses: any[] = [];

  if (query) {
    must.push({
      multi_match: {
        query,
        fields: ["name^3", "description"],
        type: "best_fields",
        fuzziness: "AUTO",
      },
    });
  }

  if (filters.categories?.length) {
    filterClauses.push({ terms: { categories: filters.categories } });
  }
  if (filters.minPrice !== undefined || filters.maxPrice !== undefined) {
    filterClauses.push({
      range: {
        price: {
          ...(filters.minPrice !== undefined && { gte: filters.minPrice }),
          ...(filters.maxPrice !== undefined && { lte: filters.maxPrice }),
        },
      },
    });
  }
  if (filters.inStock !== undefined) {
    filterClauses.push({ term: { in_stock: filters.inStock } });
  }

  const response = await client.search<Product>({
    index: "products",
    body: {
      query: {
        bool: {
          must: must.length > 0 ? must : [{ match_all: {} }],
          filter: filterClauses,
        },
      },
      highlight: {
        fields: {
          name: { number_of_fragments: 0 },
          description: { fragment_size: 150, number_of_fragments: 3 },
        },
        pre_tags: ["<mark>"],
        post_tags: ["</mark>"],
      },
      from: filters.offset ?? 0,
      size: filters.limit ?? 20,
    },
  });

  return {
    hits: response.hits.hits.map((h) => ({
      ...h._source!,
      _score: h._score,
      _highlight: h.highlight,
    })),
    total:
      typeof response.hits.total === "number"
        ? response.hits.total
        : response.hits.total?.value ?? 0,
  };
}

interface ProductFilters {
  categories?: string[];
  minPrice?: number;
  maxPrice?: number;
  inStock?: boolean;
  offset?: number;
  limit?: number;
}

Aggregations

async function getProductAggregations(query?: string) {
  const response = await client.search({
    index: "products",
    body: {
      size: 0, // no hits, only aggregations
      query: query ? { match: { name: query } } : { match_all: {} },
      aggs: {
        categories: {
          terms: { field: "categories", size: 20 },
        },
        brands: {
          terms: { field: "brand", size: 10 },
        },
        price_ranges: {
          range: {
            field: "price",
            ranges: [
              { key: "budget", to: 25 },
              { key: "mid", from: 25, to: 100 },
              { key: "premium", from: 100 },
            ],
          },
        },
        avg_rating: {
          avg: { field: "rating" },
        },
        price_stats: {
          stats: { field: "price" },
        },
      },
    },
  });

  return response.aggregations;
}
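
Terms aggregations come back as `{ key, doc_count }` buckets; a common next step is flattening them into a facet-count map for a UI. A sketch, assuming the bucket shape above (`toFacetCounts` is a hypothetical helper, not a client API):

```typescript
// Sketch: flatten a `terms` aggregation's buckets into facet counts.
interface TermsBucket {
  key: string | number;
  doc_count: number;
}

function toFacetCounts(buckets: TermsBucket[]): Record<string, number> {
  return Object.fromEntries(buckets.map((b) => [String(b.key), b.doc_count]));
}

// Usage against the response above (a cast is typically needed because the
// client types aggregation results as unions):
// const aggs = await getProductAggregations();
// const categoryCounts = toFacetCounts((aggs?.categories as any).buckets);
```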

Relevance Tuning with Function Score

async function boostedSearch(query: string) {
  return client.search<Product>({
    index: "products",
    body: {
      query: {
        function_score: {
          query: {
            multi_match: {
              query,
              fields: ["name^3", "description"],
              fuzziness: "AUTO",
            },
          },
          functions: [
            {
              // Boost highly rated products
              field_value_factor: {
                field: "rating",
                factor: 1.2,
                modifier: "log1p",
                missing: 1,
              },
            },
            {
              // Boost in-stock items
              filter: { term: { in_stock: true } },
              weight: 2,
            },
            {
              // Decay older products
              gauss: {
                created_at: {
                  origin: "now",
                  scale: "30d",
                  decay: 0.5,
                },
              },
            },
          ],
          score_mode: "multiply",
          boost_mode: "multiply",
        },
      },
    },
  });
}

Index Aliases for Zero-Downtime Reindexing

async function reindex() {
  const newIndex = `products_${Date.now()}`;

  // Create the new index with the same mappings
  const current = await client.indices.get({ index: "products" });
  const { mappings, settings } = Object.values(current)[0];

  await client.indices.create({
    index: newIndex,
    body: {
      mappings,
      settings: {
        index: { number_of_shards: settings?.index?.number_of_shards },
      },
    },
  });

  // Reindex data
  await client.reindex({
    body: {
      source: { index: "products" },
      dest: { index: newIndex },
    },
    wait_for_completion: true,
  });

  // Swap alias atomically
  await client.indices.updateAliases({
    body: {
      actions: [
        { remove: { index: "products_*", alias: "products" } },
        { add: { index: newIndex, alias: "products" } },
      ],
    },
  });
}
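
One caveat the snippet glosses over: an alias cannot share a name with a concrete index, so this pattern only works if "products" has been an alias from the start (e.g. create products_v1, then alias it as products). The swap actions can be factored into a small pure helper (a sketch; `aliasSwapActions` is hypothetical):

```typescript
// Sketch: build the atomic alias-swap actions used in reindex() above.
// Note: the alias name must never collide with a concrete index name.
type AliasAction =
  | { remove: { index: string; alias: string } }
  | { add: { index: string; alias: string } };

function aliasSwapActions(
  alias: string,
  oldIndexPattern: string,
  newIndex: string,
): AliasAction[] {
  return [
    { remove: { index: oldIndexPattern, alias } },
    { add: { index: newIndex, alias } },
  ];
}

// Usage: await client.indices.updateAliases({
//   body: { actions: aliasSwapActions("products", "products_*", newIndex) },
// });
```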

Best Practices

  • Always use explicit mappings. Dynamic mapping is convenient for prototyping but causes type conflicts and wasted storage in production.
  • Use bulk for indexing. Single-document indexing is orders of magnitude slower. Batch sizes of 1,000-5,000 documents work well.
  • Put filters in the filter context. Filter clauses are cacheable and skip scoring, making them faster than must for non-text conditions.
  • Use keyword sub-fields for aggregations and sorting. Text fields are analyzed and cannot be used for exact terms aggregations.
  • Set number_of_replicas: 0 during initial bulk loads, then increase replicas afterward. This speeds up indexing significantly.
  • Use index aliases to decouple application code from physical index names. This enables zero-downtime reindexing and blue-green deployments.
  • Monitor shard sizes. Keep shards between 10 and 50 GB. Over-sharding wastes resources; under-sharding limits parallelism.
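
The replica and refresh advice above can be combined into one settings toggle applied around a load. A sketch (`bulkLoadSettings` is a hypothetical helper; the setting names are real index settings, and the `putSettings` usage in the comments is assumed from the client version in Setup):

```typescript
// Sketch: index settings to apply before and after a large bulk load.
// "start" disables refresh and replicas for throughput; "done" restores them.
function bulkLoadSettings(phase: "start" | "done", replicas = 1) {
  return phase === "start"
    ? { refresh_interval: "-1", number_of_replicas: 0 }
    : { refresh_interval: "1s", number_of_replicas: replicas };
}

// Usage with the client from Setup:
// await client.indices.putSettings({ index: "products", settings: bulkLoadSettings("start") });
// ...run bulk indexing batches...
// await client.indices.putSettings({ index: "products", settings: bulkLoadSettings("done") });
```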

Anti-Patterns

  • Using Elasticsearch as a primary database. It is not ACID-compliant. Always keep a source-of-truth store and treat Elasticsearch as a derived index.
  • Mapping everything as text. Numeric, date, keyword, and boolean fields should use their native types for correct filtering, sorting, and aggregation.
  • Deep pagination with from + size. Beyond 10,000 results this is rejected by default (index.max_result_window). Use search_after, ideally with a point-in-time (PIT), for deep pagination; reserve the scroll API for one-off bulk exports.
  • Creating one index per user or per tenant. This leads to thousands of small indices and shard explosion. Use filtered aliases or a tenant ID field instead.
  • Not handling bulk errors. The bulk API returns a 200 even when individual documents fail. Always inspect the errors flag and items array.
  • Running unscoped match_all queries in production. They can return massive result sets and stress the cluster. Always set a reasonable size.
  • Ignoring the refresh_interval. Calling refresh=true on every index operation kills performance. Use the default 1-second interval or batch refreshes.
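
The search_after alternative mentioned above works as keyset pagination: feed each page's last hit `sort` values into the next request. A sketch under the products mapping from Setup; `PageRequest` and `nextPageRequest` are illustrative helpers, not client APIs:

```typescript
// Sketch: keyset pagination with search_after instead of deep from/size.
// Requires a deterministic sort that includes a unique tiebreaker field.
interface PageRequest {
  size: number;
  sort: Array<Record<string, "asc" | "desc">>;
  search_after?: Array<string | number>;
}

const firstPage: PageRequest = {
  size: 100,
  sort: [{ created_at: "desc" }, { "name.keyword": "asc" }],
};

function nextPageRequest(
  prev: PageRequest,
  lastHitSort: Array<string | number> | undefined,
): PageRequest | null {
  // No sort values means the previous page was empty: stop paginating.
  return lastHitSort ? { ...prev, search_after: lastHitSort } : null;
}

// Usage (client from Setup assumed):
// let page: PageRequest | null = firstPage;
// while (page) {
//   const res = await client.search({ index: "products", body: page });
//   page = nextPageRequest(page, res.hits.hits.at(-1)?.sort as any);
// }
```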
