Hi Yoseph Stephen, let me give you some insight into the internals, as you made some very valid points, but the proposed custom DirectoryProvider won't work.
There are two main problems:
- all backend operations (like writing changes to the Directory) happen on a different thread, so a ThreadLocal isn't an option.
- depending on the backend configuration, the same IndexWriter might be kept open to batch multiple writes, but directoryProvider.getDirectory() is invoked only once.
So while I agree we should work on a better solution, that is not the right path.
The ShardIdentifierProvider based solution however will work (there's a sketch of it further below), and has some nice consequences. First off, Lucene's scoring formulas include factors like how frequent each term/keyword is across the whole index, so having a separate index per tenant is a property you might actually want, to keep the scores fully independent across tenants.

I wrote that documentation section about sharding, and what I meant primarily is that having too many shards makes queries less efficient, as people normally then need to aggregate results from multiple shards (but you wouldn't do that in this case). Also, each shard has its own Directory instance, which keeps some file handles open, so having too many Directory instances might exhaust the total number of file handles available to your process/system. On Linux servers you can raise that limit to get more file handles, so it's not a blocking issue, but it makes things a bit more complex. And obviously file handles take a bit of memory too, so it would be good to avoid them when not strictly needed.
But your proposal would also require a Directory instance per tenant, so you'd have the same drawbacks.
One more limitation of a per-tenant sharding strategy is that you wouldn't be able to do additional sharding within each tenant.. but considering we'd already have a large number of `Directory` instances I wouldn't recommend that anyway, so I think that's an acceptable limitation.
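To make that more concrete, here is a rough sketch of what a per-tenant ShardIdentifierProvider could look like against the 5.0 API; the `tenant_id` field name, the entity mapping and the way new tenants are discovered are all assumptions, so treat it as a starting point rather than a tested implementation, and double-check the exact interface methods against the version you're on:

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Filter;
import org.hibernate.search.spi.BuildContext;
import org.hibernate.search.store.ShardIdentifierProvider;

/**
 * Sketch: one shard per tenant, the shard name being the tenant id itself.
 * Assumes every indexed document carries a "tenant_id" field.
 */
public class TenantShardIdentifierProvider implements ShardIdentifierProvider {

    // Needs to be thread-safe: indexing happens on the backend worker threads.
    private final Set<String> knownTenants =
            Collections.newSetFromMap( new ConcurrentHashMap<String, Boolean>() );

    @Override
    public void initialize(Properties properties, BuildContext buildContext) {
        // Ideally seed the known tenant ids here (e.g. from the database),
        // so queries run before the first write still see every shard.
    }

    @Override
    public String getShardIdentifier(Class<?> entityType, Serializable id, String idAsString, Document document) {
        String tenantId = document.get( "tenant_id" ); // assumed field name
        knownTenants.add( tenantId );
        return tenantId;
    }

    @Override
    public Set<String> getShardIdentifiersForQuery(Filter customFilter) {
        // Without a shard-selecting filter every tenant shard has to be searched;
        // see the "filter in a sharded environment" link below to narrow this down.
        return getAllShardIdentifiers();
    }

    @Override
    public Set<String> getAllShardIdentifiers() {
        return Collections.unmodifiableSet( knownTenants );
    }
}
```

You'd then point the index at it with something like `hibernate.search.default.sharding_strategy = com.example.TenantShardIdentifierProvider` (see the sharding chapter of the reference documentation for the exact property name per index).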
Regarding the `tenantId` token: you're right, and I wasn't suggesting to do that. I was more wondering how you are handling the separation today, assuming that you're using a `tenantId` token. Note that with a Filter the drawback wouldn't be too bad, as the filter instances are cacheable and you'd end up searching only the subset of the index which relates to your tenant.. but these filters take memory, caching by LRU implies that sometimes you won't have a cache hit, and each write on the index invalidates the cached filters.
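Just to make the Filter option concrete, a minimal sketch (the `tenant_id` field name, the filter name and the parameter name are only examples):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.hibernate.search.annotations.Factory;
import org.hibernate.search.annotations.Key;
import org.hibernate.search.filter.FilterKey;
import org.hibernate.search.filter.StandardFilterKey;

/**
 * Sketch of a filter factory restricting results to a single tenant,
 * assuming each document has a "tenant_id" keyword field.
 */
public class TenantFilterFactory {

    private String tenantId;

    // Injected via enableFullTextFilter("tenant").setParameter("tenantId", ...)
    public void setTenantId(String tenantId) {
        this.tenantId = tenantId;
    }

    @Key
    public FilterKey getKey() {
        // One cache entry per tenant id
        StandardFilterKey key = new StandardFilterKey();
        key.addParameter( tenantId );
        return key;
    }

    @Factory
    public Filter getFilter() {
        return new QueryWrapperFilter( new TermQuery( new Term( "tenant_id", tenantId ) ) );
    }
}
```

You'd register it on the entity with `@FullTextFilterDef(name = "tenant", impl = TenantFilterFactory.class)` (caching is on by default) and enable it at query time with `fullTextQuery.enableFullTextFilter("tenant").setParameter("tenantId", ...)`.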
Note that if you were to experiment with the custom sharding strategy, you'd have the option to narrow down query execution to the specific shard only, using this special "filter" definition: http://docs.jboss.org/hibernate/search/5.0/reference/en-US/html_single/#query-filter-shard
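For completeness, wiring that up would look roughly like this (the entity, filter name and parameter name are again just examples); the sharding strategy then has to inspect the enabled filter to pick the matching shard, as described in that chapter:

```java
import java.util.List;

import org.hibernate.search.FullTextQuery;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.annotations.FullTextFilterDef;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.filter.ShardSensitiveOnlyFilter;

// ShardSensitiveOnlyFilter doesn't restrict any documents: it only carries the
// parameters down to the sharding strategy so the non-matching shards are skipped.
@Indexed
@FullTextFilterDef(name = "tenant_shard", impl = ShardSensitiveOnlyFilter.class)
public class Invoice {

    // ... indexed fields, including the tenant id ...

    static List<?> findForTenant(FullTextSession session, org.apache.lucene.search.Query luceneQuery, String tenantId) {
        FullTextQuery query = session.createFullTextQuery( luceneQuery, Invoice.class );
        // Only the shard(s) selected by the sharding strategy for this tenant get searched
        query.enableFullTextFilter( "tenant_shard" ).setParameter( "tenantId", tenantId );
        return query.list();
    }
}
```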