Understanding tokenization

If a property is both indexed and tokenized, as defined on the property, only the tokens generated from its string representation go into the index. For example, with the standard configuration and an ‘en’ locale, a d:text property is tokenized using the AlfrescoStandardAnalyser.
‘The fox’s nose was wet and black, just like all the other foxes’ noses. The fox liked to run and jump. It was not as brown or quick as the other foxes but it was very good at jumping. It jumped over all the dogs it could find, as quickly as it could, including the lazy, blackish dog.’
AlfrescoStandardAnalyser
PorterSnowballAnalyser

EnglishSnowballAnalyser

ItalianSnowballAnalyser

The VerbatimAnalyser produces just one token, of type VERBATIM, starting at position 0 and finishing at 283.
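The verbatim behaviour described above can be sketched in a few lines. This is a minimal illustration, not Alfresco's implementation: the `Token` tuple and the `verbatim_analyse` function are hypothetical names, and the end offset is assumed to be the length of the input string.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical token record: text, type, and character offsets."""
    text: str
    type: str
    start: int
    end: int


def verbatim_analyse(text: str) -> List[Token]:
    # The whole input becomes a single token; nothing is split,
    # lower-cased, or removed. The end offset is assumed to be len(text).
    return [Token(text, "VERBATIM", 0, len(text))]


sample = "The fox liked to run and jump."
tokens = verbatim_analyse(sample)
```

Because the whole value is one token, a search only matches if the query reproduces the stored string in full.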
Some tokenizers exclude certain words, such as ‘the’, as tokens. These are known as stop words. So if you search for the tokens ‘The’ or ‘the’, you will find nothing.
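To see why a stop word can never match, consider the following toy index. It is only a sketch: the `STOP_WORDS` set, `tokenize`, and `search` are hypothetical and far simpler than any real analyser, which ships its own stop-word list per language.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "as", "it", "was"}


def tokenize(text):
    """Lower-case the text, split on non-alphanumeric characters, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


# Only the surviving tokens go into the index.
index = set(tokenize("The fox liked to run and jump."))


def search(term):
    # A term matches only if it exists as a token in the index.
    return term.lower() in index
```

Here `search("fox")` succeeds, while `search("The")` and `search("the")` both fail, because ‘the’ was dropped before indexing.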
Tokenizers are language-specific, and so are the stop words and the actual tokens generated, as they may be specific to the tokenizer. The tokens may not always be what you expect, particularly if the tokenizer uses stemming.
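A deliberately naive suffix stripper shows how stemming can produce tokens you might not expect. This is not the Porter or Snowball algorithm; `naive_stem` and its suffix list are assumptions made purely for illustration.

```python
def naive_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["jumping", "jumped", "foxes", "noses", "quickly"]
stems = [naive_stem(w) for w in words]
```

With this sketch, ‘jumping’ and ‘jumped’ both reduce to ‘jump’, which is helpful, but ‘noses’ reduces to ‘nos’, which is not a word anyone would search for. Real stemmers are more careful, yet the same principle applies: the indexed token is the stem, not the original word.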
On the search side, all phrase queries are tokenized.
e.g. TEXT:"The quick Brown fox"
Term queries are also tokenized (post version 2.1).
e.g. TEXT:Brown
Wildcard queries are lower-cased but not tokenized.
e.g. TEXT:BRO*
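The three query-side cases above (phrase, term, wildcard) can be contrasted with a small sketch. The `analyse` and `prepare_query` functions and the one-word stop list are hypothetical; the real QueryParser is considerably more involved, and this sketch does not model wildcards inside phrases.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the"}


def analyse(text):
    """Toy analyser: lower-case, split on non-alphanumerics, drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def prepare_query(kind, value):
    """Return what would be matched against the index for each query kind."""
    if kind == "wildcard":
        # Wildcard queries are lower-cased but bypass the analyser.
        return value.lower()
    if kind in ("phrase", "term"):
        # Phrase and term queries go through the same analyser as the index.
        return analyse(value)
    raise ValueError(f"unknown query kind: {kind}")


phrase = prepare_query("phrase", "The quick Brown fox")
term = prepare_query("term", "Brown")
wildcard = prepare_query("wildcard", "BRO*")
```

In this sketch the phrase loses its stop word and is lower-cased, the term is lower-cased, and the wildcard pattern is lower-cased but otherwise passed through untouched.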
(Wildcards (* and ?) are supported in phrases post version 2.1.)
TEXT:"BRO*" will probably tokenize correctly and integrate with most stemming tokenizers, so long as BRO generates the stems required. The * should be ignored and removed by the analyser but understood by the QueryParser. If a particular analyser does not ignore * for stemming, then phrase wildcards will not work.
Last updated: 2020-02-10 16:37   Author: 凌云文档