Stemming

Orama can analyze the input and perform a stemming operation, which allows the engine to perform more optimized queries, as well as save indexing space.

What is stemming?

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.

Read more: Wikipedia
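
To make the idea of conflation concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy written for illustration only: `toyStem`, `conflate`, and the suffix list are hypothetical names and rules, not Orama's stemmer or any published stemming algorithm.

```javascript
// Toy suffix-stripping stemmer, for illustration only.
// This is NOT the algorithm Orama uses; the suffix list is an assumption.
const SUFFIXES = ['ing', 'ed', 'es', 's']

function toyStem(word) {
  const lower = word.toLowerCase()
  for (const suffix of SUFFIXES) {
    // Keep at least three characters so short words such as "is" stay intact.
    if (lower.endsWith(suffix) && lower.length - suffix.length >= 3) {
      return lower.slice(0, -suffix.length)
    }
  }
  return lower
}

// Group words by stem: words that share a stem are conflated.
function conflate(words) {
  const groups = new Map()
  for (const word of words) {
    const stem = toyStem(word)
    if (!groups.has(stem)) groups.set(stem, [])
    groups.get(stem).push(word)
  }
  return groups
}

const groups = conflate([
  'connect', 'connected', 'connecting', 'connects',
  'argued', 'argues', 'arguing',
])

for (const [stem, forms] of groups) {
  console.log(`${stem} <- ${forms.join(', ')}`)
}
```

Seven surface forms collapse into two stems, so an index keyed by stem needs fewer entries, and a query for "connecting" can match a document that contains "connected". The stem "argu" is not a valid word, which shows that a stem only has to be shared by related words.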

WARNING

Note that as of Orama 1.0.0 only the English stemmer is shipped with Orama. Other languages are published in the @orama/stemmers package, which must be installed manually.

When stemming is enabled, Orama uses the English language analyzer by default. To override this, import a stemmer for another language and set the language property at database initialization:

javascript
import { create } from '@orama/orama'
import { stemmer, language } from '@orama/stemmers/italian'

// create() is awaited here, on the assumption that it returns a Promise (as in Orama 1.x)
const db = await create({
  schema: {
    author: 'string',
    quote: 'string',
  },
  components: {
    tokenizer: {
      stemming: true,
      language,
      stemmer
    },
  },
})

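Right now, Orama supports 29 languages. Except where noted, each one has a stemmer; all stemmers other than the English one are installed through the @orama/stemmers package: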

  • Arabic
  • Armenian
  • Bulgarian
  • Chinese (Mandarin - stemmer not supported)
  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Greek
  • Hindi
  • Hungarian
  • Indonesian
  • Irish
  • Italian
  • Nepali
  • Norwegian
  • Portuguese
  • Romanian
  • Russian
  • Sanskrit
  • Serbian
  • Slovenian
  • Spanish
  • Swedish
  • Tamil
  • Turkish
  • Ukrainian