In the ref guide (and also on the web site, which copies this section) it says:
The standard tokenizer splits words at punctuation characters and hyphens while keeping email addresses and internet hostnames intact.
That was traditionally the case, but the behavior has changed on the Lucene side: e-mail addresses are now actually split into multiple tokens. In the SO answer I recommended using {ClassicTokenizer}, which retains the traditional behavior. We should either recommend that here or show a custom tokenizer with the required behavior (see the sketch below).
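For reference, here is a minimal sketch illustrating the difference on a sample e-mail address (the sample text and class name are made up; it assumes a recent Lucene with analyzers-common on the classpath, and note that the ClassicTokenizer import package differs between Lucene releases):

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
// In newer Lucene releases ClassicTokenizer lives in org.apache.lucene.analysis.classic instead.
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerComparison {

    // Print every token the given tokenizer produces for the text.
    static void printTokens(Tokenizer tokenizer, String text) throws Exception {
        tokenizer.setReader(new StringReader(text));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println("  " + term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }

    public static void main(String[] args) throws Exception {
        String text = "Contact john.doe@example.com for details";

        // UAX#29-based StandardTokenizer splits the e-mail address apart.
        System.out.println("StandardTokenizer:");
        printTokens(new StandardTokenizer(), text);

        // ClassicTokenizer (the pre-3.1 StandardTokenizer behavior) keeps it as one token.
        System.out.println("ClassicTokenizer:");
        printTokens(new ClassicTokenizer(), text);
    }
}
{code}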