Analyze
Analyze text and return the tokens it generates.
Request
POST /api/_analyze
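The request body carries the text to analyze and, optionally, the analyzer to apply. A minimal sketch (the standard analyzer here is an assumption; the response below corresponds to this input):

{
    "analyzer": "standard",
    "text": "50 first dates"
}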
Response
{
    "tokens": [
        {
            "end_offset": 2,
            "keyword": false,
            "position": 1,
            "start_offset": 0,
            "token": "50",
            "type": "Numeric"
        },
        {
            "end_offset": 8,
            "keyword": false,
            "position": 2,
            "start_offset": 3,
            "token": "first",
            "type": "AlphaNumeric"
        },
        {
            "end_offset": 14,
            "keyword": false,
            "position": 3,
            "start_offset": 9,
            "token": "dates",
            "type": "AlphaNumeric"
        }
    ]
}
Use a specific analyzer
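A sketch of selecting an analyzer by name (the keyword analyzer is an arbitrary pick from the supported list below):

{
    "analyzer": "keyword",
    "text": "50 first dates"
}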
Use a specific tokenizer
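A sketch of running only a tokenizer, without a full analyzer (whitespace is an arbitrary pick from the supported list below):

{
    "tokenizer": "whitespace",
    "text": "50 first dates"
}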
Use a specific tokenizer and filters
{
    "tokenizer": "standard",
    "char_filter": ["html"],
    "token_filter": ["camel_case"],
    "text": "50 first dates"
}
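Here the char_filter runs over the raw text before tokenization, and each token_filter is then applied to the tokens the tokenizer emits.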
Supported analyzers
- standard
- simple
- keyword
- web
- regexp
- stop
- whitespace
Language analyzers
| Language | Shortened form |
|---|---|
| arabic | ar |
| CJK (Chinese, Japanese, Korean) | cjk |
| sorani | ckb |
| danish | da |
| german | de |
| english | en |
| spanish | es |
| persian | fa |
| finnish | fi |
| french | fr |
| hindi | hi |
| hungarian | hu |
| italian | it |
| dutch | nl |
| norwegian | no |
| portuguese | pt |
| romanian | ro |
| russian | ru |
| swedish | sv |
| turkish | tr |
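For example, to analyze German text with the German analyzer (a sketch, assuming the shortened form from the table is accepted as the analyzer name):

{
    "analyzer": "de",
    "text": "Füße und Hände"
}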
Chinese analyzers
- gse_standard
- gse_search
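A sketch of segmenting Chinese text (gse_standard is presumably backed by the gse segmentation library):

{
    "analyzer": "gse_standard",
    "text": "今天天气很好"
}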
Supported tokenizers
- character
- char_group
- ngram
- edge_ngram
- exception
- letter
- simple
- lower_case
- path_hierarchy
- regexp
- single
- keyword
- standard
- web
- whitespace
- gse_standard
- gse_search
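For instance, the path_hierarchy tokenizer emits a path together with each of its ancestor paths (a sketch; assuming the usual Lucene-style behavior, this would yield /usr, /usr/local, and /usr/local/bin):

{
    "tokenizer": "path_hierarchy",
    "text": "/usr/local/bin"
}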
Supported token filters
- apostrophe
- camel_case
- lower_case
- upper_case
- dict
- ngram
- edge_ngram
- elision
- keyword
- length
- porter
- reverse
- regexp
- shingle
- trim
- stop
- truncate
- unicodenorm
- unique
- gse_stop
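Token filters can be chained, with each filter consuming the previous one's output. A sketch that lower-cases and then deduplicates tokens (expected to collapse the three case variants into a single dates token):

{
    "tokenizer": "standard",
    "token_filter": ["lower_case", "unique"],
    "text": "Dates DATES dates"
}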
Language token filters
| Language | token_filter |
|---|---|
| arabic | arabic_normalization / ar_normalization / arabic_stemmer / ar_stemmer |
| cjk | cjk_bigram / cjk_width |
| sorani | sorani_normalization / ckb_normalization / sorani_stemmer / ckb_stemmer |
| danish | danish_stemmer / da_stemmer |
| german | german_normalization / de_normalization / german_stemmer / de_stemmer / german_light_stemmer / de_light_stemmer |
| english | english_possessive_stemmer / en_possessive_stemmer / english_stemmer / en_stemmer |
| spanish | spanish_stemmer / es_stemmer / spanish_light_stemmer / es_light_stemmer |
| persian | persian_normalization / fa_normalization |
| finnish | finnish_stemmer / fi_stemmer |
| french | french_elision / fr_elision / french_stemmer / fr_stemmer / french_light_stemmer / fr_light_stemmer / french_minimal_stemmer / fr_minimal_stemmer |
| irish | irish_elision / ga_elision |
| hindi | hindi_normalization / hi_normalization / hindi_stemmer / hi_stemmer |
| hungarian | hungarian_stemmer / hu_stemmer |
| indic | indic_normalization / in_normalization |
| italian | italian_elision / it_elision / italian_stemmer / it_stemmer / italian_light_stemmer / it_light_stemmer |
| dutch | dutch_stemmer / nl_stemmer |
| norwegian | norwegian_stemmer / no_stemmer |
| portuguese | portuguese_light_stemmer / portuguese_stemmer / pt_light_stemmer |
| romanian | romanian_stemmer / ro_stemmer |
| russian | russian_stemmer / ru_stemmer |
| swedish | swedish_stemmer / sv_stemmer |
| turkish | turkish_stemmer / tr_stemmer |
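A sketch applying the English stemmer after lower-casing (typically reducing running to run and dates to date):

{
    "tokenizer": "standard",
    "token_filter": ["lower_case", "english_stemmer"],
    "text": "Running dates"
}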
Supported character filters
- ascii_folding
- html
- zero_width_non_joiner
- regexp
- mapping
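For example, ascii_folding transliterates accented characters before tokenization (a sketch; café should come out as the token cafe):

{
    "tokenizer": "standard",
    "char_filter": ["ascii_folding"],
    "text": "café"
}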