In the ref guide (and also on the web site, which copies this section) it says:
The standard tokenizer splits words at punctuation characters and hyphens while keeping email addresses and internet hostnames intact.
That was traditionally the case, but the behavior has changed on the Lucene side: e-mail addresses are now actually split into multiple tokens. In the SO answer I recommended using {ClassicTokenizer}, which retains the traditional behavior. We should either recommend that here or show a custom tokenizer with the required behavior (see the sketch below).
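For reference, here is a minimal sketch illustrating the difference on a sample e-mail address (the sample text and class name are made up; it assumes a recent Lucene with analyzers-common on the classpath, and note that the ClassicTokenizer import package differs between Lucene releases):

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
// In newer Lucene releases ClassicTokenizer lives in org.apache.lucene.analysis.classic instead.
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerComparison {

    // Print every token the given tokenizer produces for the text.
    static void printTokens(Tokenizer tokenizer, String text) throws Exception {
        tokenizer.setReader(new StringReader(text));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println("  " + term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }

    public static void main(String[] args) throws Exception {
        String text = "Contact john.doe@example.com for details";

        // UAX#29-based StandardTokenizer splits the e-mail address apart.
        System.out.println("StandardTokenizer:");
        printTokens(new StandardTokenizer(), text);

        // ClassicTokenizer (the pre-3.1 StandardTokenizer behavior) keeps it as one token.
        System.out.println("ClassicTokenizer:");
        printTokens(new ClassicTokenizer(), text);
    }
}
{code}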