Stop-words
Stop words are words that are typically filtered out before natural-language text is processed. They are usually the most common words in a language (articles, prepositions, pronouns, conjunctions, and so on) and add little information to the text. Examples of English stop words include “the”, “a”, “an”, “so”, and “what”.
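To make the idea concrete, here is a minimal sketch of stop-word filtering on a list of tokens. The `removeStopWords` helper is hypothetical and written only for illustration; it is not Orama's internal implementation, and it assumes the text has already been split into tokens.

```typescript
// Hypothetical helper for illustration only; not part of Orama.
// Drops every token that appears in the stop-word list (case-insensitive).
function removeStopWords(tokens: string[], stopWords: readonly string[]): string[] {
  // A Set gives constant-time membership checks.
  const blocked = new Set(stopWords.map((word) => word.toLowerCase()))
  return tokens.filter((token) => !blocked.has(token.toLowerCase()))
}

const tokens = 'the quick brown fox jumps over a lazy dog'.split(' ')
const filtered = removeStopWords(tokens, ['the', 'a', 'an', 'so', 'what'])

console.log(filtered)
// [ 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog' ]
```

Only "the" and "a" are dropped here; the remaining tokens carry most of the meaning of the sentence.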
Orama supports stop-words removal via the @orama/stopwords package. Note that as of Orama 1.0.7, stop-words are shipped in this separate package:
npm install @orama/stopwords
Enabling stop-words removal
By default, Orama does not remove any stop-words, as this is intended to be an explicit action by the user. To enable stop-words removal, use the stopWords property when creating a new Orama instance:
import { create } from '@orama/orama'

const db = await create({
  schema: {
    author: 'string',
    quote: 'string',
  },
  components: {
    tokenizer: {
      stopWords: ['foo', 'bar'], // Enable custom stop-words
    }
  }
})
Using the default stop-words list
By installing the @orama/stopwords package, you can use the default stop-words list for a given language:
import { create } from '@orama/orama'
import { stopwords as englishStopwords } from '@orama/stopwords/english'

const db = await create({
  schema: {
    author: 'string',
    quote: 'string',
  },
  components: {
    tokenizer: {
      stopWords: englishStopwords,
    }
  }
})
Using the default stop-words list is the recommended, and most efficient, way to enable stop-words removal.
Extending the default stop-words list
You can customize the default stop-words list by adding or removing words. The following example adds two words to the Italian list:
import { create } from '@orama/orama'
import { stopwords as italianStopwords } from '@orama/stopwords/italian'

const db = await create({
  schema: {
    author: 'string',
    quote: 'string',
  },
  components: {
    tokenizer: {
      stopWords: [...italianStopwords, 'ciao', 'buongiorno']
    }
  }
})
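Removing words from a default list can be done with plain array operations before the list is passed to stopWords. The sketch below is one possible approach, not an Orama API: `customizeStopWords` is a hypothetical helper, and `baseStopWords` is a short stand-in for a list imported from @orama/stopwords, which is assumed here to be an array of strings.

```typescript
// Hypothetical stand-in for a list imported from @orama/stopwords
// (assumed to be an array of strings).
const baseStopWords: string[] = ['il', 'la', 'di', 'e', 'non']

// Hypothetical helper: returns a new list with `add` appended and `remove`
// filtered out, without duplicates and without mutating the base list.
function customizeStopWords(
  base: readonly string[],
  add: readonly string[] = [],
  remove: readonly string[] = []
): string[] {
  const excluded = new Set(remove)
  const merged = new Set([...base, ...add]) // Set drops duplicate entries
  return Array.from(merged).filter((word) => !excluded.has(word))
}

const customStopWords = customizeStopWords(baseStopWords, ['ciao', 'buongiorno'], ['non'])

console.log(customStopWords)
// [ 'il', 'la', 'di', 'e', 'ciao', 'buongiorno' ]
```

The resulting array could then be passed as the stopWords value in the tokenizer configuration, in the same way as the lists in the examples above.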
Supported languages
Orama currently supports stop-words removal for 28 languages:
- Arabic
- Armenian
- Bulgarian
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hindi
- Hungarian
- Indonesian
- Irish
- Italian
- Nepali
- Norwegian
- Portuguese
- Romanian
- Russian
- Sanskrit
- Serbian
- Slovenian
- Spanish
- Swedish
- Tamil
- Turkish
- Ukrainian