If a property is both indexed and tokenized, as defined on the property, only the tokens generated from its string representation go into the index. For example, with the standard configuration and an ‘en’ locale, a d:text property is tokenized using the AlfrescoStandardAnalyser. Consider the following text:
‘The fox’s nose was wet and black, just like all the other foxes’ noses. The fox liked to run and jump. It was not as brown or quick as the other foxes but it was very good at jumping. It jumped over all the dogs it could find, as quickly as it could, including the lazy, blackish dog.’
For this text, the VerbatimAnalyser produces just one token of type VERBATIM, starting at position 0 and finishing at 283.
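As a rough illustration of what "one token covering the whole value" means, here is a minimal Python sketch. It is not the Alfresco or Lucene implementation; the `Token` tuple and the `verbatim_tokens` function are hypothetical names introduced only for this example.

```python
from collections import namedtuple

# Hypothetical token record: the term text, its start and end offsets, and a type label.
Token = namedtuple("Token", ["term", "start", "end", "type"])

def verbatim_tokens(text):
    """Return the whole input as a single token, with no splitting or filtering."""
    return [Token(term=text, start=0, end=len(text), type="VERBATIM")]

tokens = verbatim_tokens("The fox liked to run and jump.")
print(len(tokens))                    # a single token
print(tokens[0].start, tokens[0].end) # offsets span the entire string
```

Because the value is kept whole, a search only matches such a property when it matches the entire stored string.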
Some tokenizers exclude certain words, such as ‘the’, from the tokens they produce. These are known as stop words. So if you try to search for the tokens ‘The’ or ‘the’, you will find nothing.
Tokenizers are language-specific, and so are the stop words and the tokens they generate. The tokens may not always be what you expect, particularly if the tokenizer uses stemming.
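The effect of stop words on the index can be sketched with a toy tokenizer in Python. This is only an illustration under stated assumptions: the `STOP_WORDS` set, `tokenize` and `build_index` are hypothetical, and the real stop-word list and token rules depend on the analyser configured for the locale.

```python
import re

# Hypothetical stop-word list; a real analyser ships its own, per language.
STOP_WORDS = {"the", "a", "an", "and", "as", "at", "but", "it", "not", "or", "was"}

def tokenize(text):
    """Lower-case the text, split it into alphanumeric words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_index(docs):
    """Map each token to the set of document ids whose text produced it."""
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index

docs = {1: "The fox liked to run and jump."}
index = build_index(docs)

print(index.get("fox", set()))  # document 1 is found
print(index.get("the", set()))  # empty: 'the' never reached the index
```

Since ‘the’ is discarded before indexing, no lookup for it can succeed, whatever its case.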
On the search side, all phrase queries will be tokenized.
e.g. TEXT:"The quick Brown fox"
Term queries are also tokenized (post version 2.1).
e.g. TEXT:Brown
Wildcard queries are converted to lower case but not tokenized.
e.g. TEXT:BRO*
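The three query-side rules above (phrase and term queries are tokenized, wildcard queries are only lower-cased) can be sketched as follows. This is a simplified Python model, not the actual QueryParser: `analyse`, `prepare_query` and the stop-word set are hypothetical stand-ins for the configured analyser.

```python
import re

# Hypothetical stop-word list standing in for the configured analyser's list.
STOP_WORDS = {"the", "a", "an", "and"}

def analyse(text):
    """Toy analyser: lower-case, split into words, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def prepare_query(query_text, kind):
    """Return the terms that would be matched against the index for each query kind."""
    if kind in ("phrase", "term"):
        # Phrase and term queries go through the same analysis as indexed text.
        return analyse(query_text)
    if kind == "wildcard":
        # Wildcard queries bypass the analyser and are only lower-cased.
        return [query_text.lower()]
    raise ValueError("unknown query kind: " + kind)

print(prepare_query("The quick Brown fox", "phrase"))  # ['quick', 'brown', 'fox']
print(prepare_query("Brown", "term"))                  # ['brown']
print(prepare_query("BRO*", "wildcard"))               # ['bro*']
```

Running the query text through the same analysis as the indexed text is what lets ‘Brown’ match the indexed token ‘brown’; the wildcard form skips analysis so that the * survives.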
(Wildcards (* and ?) are supported in phrases post version 2.1.)
TEXT:"BRO*" will probably tokenize correctly and work with most stemming tokenizers, so long as BRO generates the stems required. The * should be ignored and removed by the analyser but understood by the QueryParser. If a particular analyser does not ignore * when stemming, phrase wildcards will not work.
