An NRT-based backend is orders of magnitude more efficient than a non-NRT one, as it avoids a full flush cycle to the index for each {{LuceneBackendQueueTask}} applied to the index, and only triggers such a flush to the Directory when specifically needed (for example when the in-memory buffers are running out of space).
But since changes are buffered in memory rather than being pushed to the store right away, an application which requires synchronous changes and uses the Infinispan Directory to replicate changes to other application nodes can't use NRT: the changes would not be propagated synchronously, and would only become visible eventually, when a flush event is triggered.
Consider that the Infinispan Directory can provide exceptional read efficiency, but is not efficient at all at handling:
- read locks, as they need to be emulated with multiple RPCs performing optimistic CAS operations
- index metadata replacements, which are also based on distributed CAS operations
- segment writes: a single segment write might trigger up to 8 different index metadata writes
- segment merges: one out of 10 segment writes (by default, or as configured via the merge factor) will trigger a rewrite of all segments
- {{IndexReader}} refreshes, as the load from any refresh increases contention on the above CAS operations
- the fact that an {{IndexReader}} refresh is triggered by the visibility of a flush event
Combining these points, it's clear that each additional {{IndexReader}} refresh operation and each {{IndexWriter}} commit have a compounding effect: they worsen the amount of network traffic and make it less likely that the spinning CAS operations will ever complete. The most important observation is that the frequency of {{IndexReader}} refresh operations depends linearly on the frequency of commits: a refresh is directly caused by a previous commit. So simply controlling the frequency of {{IndexWriter}} commit events will directly control the frequency of {{IndexReader}} refresh operations, and therefore has the potential to dramatically increase the write throughput of the {{InfinispanDirectoryProvider}}.
So the core of the idea is this: for two changesets A and B, arriving on behalf of two parallel user transactions TXA and TXB, we would normally apply the following sequence of operations:
# TXA enqueues A, and blocks waiting for it.
# TXB enqueues B, and blocks waiting for it.
# IndexWriter thread takes A from the queue
# IndexWriter thread applies the A changes to the index
# IndexWriter thread commits and flushes the index
# TXA is notified and is unblocked
# IndexWriter thread takes B from the queue
# IndexWriter thread applies the B changes to the index
# IndexWriter thread commits and flushes the index
# TXB is notified and is unblocked
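A minimal sketch of this per-changeset strategy, assuming a {{BlockingQueue}} of changesets and a {{CountDownLatch}} per changeset to unblock the waiting transaction (hypothetical names; the real backend processes {{LuceneBackendQueueTask}} instances and is more involved):

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.lucene.index.IndexWriter;

// Hypothetical sketch: a changeset carries its index changes and a latch
// which the user transaction awaits after enqueueing.
class Changeset {
   final CountDownLatch done = new CountDownLatch(1);
   void applyTo(IndexWriter writer) { /* a few index writes: nanoseconds */ }
}

class SequentialWriterLoop {
   final BlockingQueue<Changeset> queue = new LinkedBlockingQueue<>();

   // Current strategy: one commit (and one notification) per changeset.
   void run(IndexWriter writer) throws Exception {
      while (true) {
         Changeset c = queue.take();   // take the next changeset from the queue
         c.applyTo(writer);            // apply its changes: nanoseconds
         writer.commit();              // commit & flush: the expensive step
         c.done.countDown();           // notify and unblock the waiting TX
      }
   }
}
{code}

The user thread simply does {{queue.put(changeset)}} followed by {{changeset.done.await()}}.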
The above is the current strategy, and it releases the waiting threads as soon as possible. Let's assume that each write takes a few nanoseconds, and each commit takes 200ms; actual figures will vary across systems and configurations, but won't be too far from these orders of magnitude. So for the sake of this design we can approximate the cost of each write as nothing, while the commit time is significant.
The latency of the above described scheme will then be approximately:
- TXA will return in 1 commit (plus a nanosecond)
- TXB will return in 2 commits (plus a couple of nanoseconds)
If we now reorder the commit sequence to actually perform something like this:
# TXA enqueues A, and blocks waiting for it.
# TXB enqueues B, and blocks waiting for it.
# IndexWriter thread takes A&B from the queue
# IndexWriter thread applies the A&B changes to the index (maintaining order)
# IndexWriter thread commits and flushes the index
# TXA is notified and is unblocked
# TXB is notified and is unblocked
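The same sketch, reordered to drain and batch (reusing the hypothetical {{Changeset}} from above; {{drainTo}} grabs whatever else is already queued without blocking):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import org.apache.lucene.index.IndexWriter;

class BatchingWriterLoop {
   // Reordered strategy: apply all waiting changesets in arrival order,
   // commit once, then unblock all the waiting transactions together.
   void run(IndexWriter writer, BlockingQueue<Changeset> queue) throws Exception {
      List<Changeset> batch = new ArrayList<>();
      while (true) {
         batch.add(queue.take());      // wait for at least one changeset (A)
         queue.drainTo(batch);         // grab everything else already queued (B)
         for (Changeset c : batch) {
            c.applyTo(writer);         // apply in order: nanoseconds each
         }
         writer.commit();              // a single commit covers the whole batch
         for (Changeset c : batch) {
            c.done.countDown();        // TXA and TXB are unblocked together
         }
         batch.clear();
      }
   }
}
{code}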
At this point the measured latency will be:
- TXA will return in 1 commit (plus a couple of nanoseconds)
- TXB will return in 1 commit (plus a couple of nanoseconds)
Compared to the previous results, TXA was delayed by some nanoseconds, but the order of magnitude of its response time is not affected. TXB, however, cut its waiting time in half. More importantly, its speed is increased by a factor of N, where N in this case is 2 as we have only two transactions writing; with N at 100 you'd have a 100x performance improvement.
h4. So what's the throughput that we get out of this pattern?
What matters most is to cap the _target frequency_ of commits to a specific upper bound, and we want to cap this frequency at exactly the maximum throughput that the Infinispan storage can actually sustain. So if the IndexWriter thread is designed as a busy loop which always takes all of the N waiting changesets and applies them, then when the commit is slow there will be more changes in the queue waiting for the next iteration of the write loop, making N larger and so scaling up the throughput: the slower the storage, the larger the batches.
The upper bound is of course determined by the speed of the writes themselves - the ones that we previously considered negligible as they operate in the nanoseconds range; they will stack up and eventually become the limiting factor. The implementation also needs to take into account at least two exceptional cases (see the sketch after this list):
- what is the IndexWriter loop going to do when no changes are scheduled?
- do we need to cap the maximum size of batches?
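One possible answer to both questions, assuming the {{BlockingQueue}} based sketches above: block on {{take()}} when idle, and use the bounded {{drainTo}} overload to cap the batch ({{MAX_BATCH}} is a hypothetical tuning knob):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

class BoundedBatching {
   static final int MAX_BATCH = 1000; // hypothetical cap, to be tuned

   // Replaces the unbounded drain in the batching loop above.
   static List<Changeset> nextBatch(BlockingQueue<Changeset> queue)
         throws InterruptedException {
      List<Changeset> batch = new ArrayList<>();
      // Idle case: take() parks the writer thread until work arrives,
      // so the loop never spins and never commits an empty batch.
      batch.add(queue.take());
      // Cap case: drain at most MAX_BATCH - 1 additional changesets,
      // bounding memory usage and the wait of the first enqueued TX.
      queue.drainTo(batch, MAX_BATCH - 1);
      return batch;
   }
}
{code}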
Ultimately, the latency of each user thread is still going to be close to the time the store needs to perform a single commit operation, as at least one commit needs to happen before the synchronous user thread is unblocked. The benefit is that different threads won't "pile up" changes, creating ever larger wait times; the response time is expected to be quite predictable, whatever the load.
Systems with a high degree of parallelism will benefit from this, and the performance should converge to the performance you would have without ever doing a commit; however, if the frequency of commits is approaching zero, it also means that the average latency of each operation will get significantly higher. Still, in such situations, assuming we are for example stacking up a million changesets between each commit, this solution would be approximately a million times faster than the existing design (a million would not be realistic of course, as it implies a million parallel requests).