Lucene 5 is coming: pitfalls to consider
by Sanne Grinovero
Hi all,
the Hibernate Search branch upgrading to Apache Lucene 5.2.x is almost
ready; alongside the many nice efficiency improvements, there are
however some drawbacks to be aware of.
# API changes
The API changes are not too bad, and definitely an improvement. I'll
provide a detailed list as usual in the Hibernate Search migration
guide - for now, suffice it to say that it's an easy upgrade for
end users, as long as they were just creating Query instances and not
using the more powerful and complex stuff.
# Sorting
Sorting on a field will require an UninvertingReader to wrap the
cached IndexReaders, and the uninverting process is very inefficient.
On top of that, its result is not cacheable, so it will need to be
repeated on each index, for each query that is executed.
In short, I expect performance of sorted queries to be quite degraded
in our first milestone using Lucene 5, and we'll have to discuss how
to fix this.
Needless to say, fixing this is a blocking requirement before we can
consider the migration complete.
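To illustrate, this is roughly what the uninverting wrap looks like in
plain Lucene 5 - the index path and the "age" field are made-up examples:

```java
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.uninverting.UninvertingReader;
import org.apache.lucene.uninverting.UninvertingReader.Type;

public class SortWithUninverting {
   public static void main(String[] args) throws Exception {
      Directory directory = FSDirectory.open(Paths.get("/some/index/path"));

      // Declare which fields must be uninverted, and to which type:
      Map<String, Type> mapping = new HashMap<>();
      mapping.put("age", Type.INTEGER);

      // The wrap call itself is cheap, but the uninverting work it
      // triggers is expensive and its result is not cached across queries:
      DirectoryReader reader = DirectoryReader.open(directory);
      DirectoryReader uninverted = UninvertingReader.wrap(reader, mapping);

      IndexSearcher searcher = new IndexSearcher(uninverted);
      TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10,
            new Sort(new SortField("age", SortField.Type.INT)));
      System.out.println(hits.totalHits);
   }
}
```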
Sorting will not need an UninvertingReader if the target field has
been indexed as DocValues, but that implies:
- we'll need an explicit, upfront (indexing time) flag to be set
- we'll need to detect whether the indexing options match what the
runtime query needs, so we can skip the uninverting process
This is mostly a job for Hibernate Search, but in terms of user
experience it means you have to mark fields for "sortability"
explicitly; will we need to extend the protobuf schema?
Please make sure we'll just have to hook into existing metadata; we
can't fix this after the API freeze.
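For illustration, at the Lucene level the upfront flag boils down to
indexing a DocValues twin of the field; the "price" field here is a
made-up example:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class SortableFieldExample {
   static Document makeDocument(long price) {
      Document doc = new Document();
      // The regular indexed field, used for searching:
      doc.add(new LongField("price", price, Store.NO));
      // Its DocValues twin: this is the upfront, indexing-time flag that
      // makes the field sortable without any uninverting at query time.
      doc.add(new NumericDocValuesField("price", price));
      return doc;
   }

   static Sort sortByPrice() {
      // No UninvertingReader needed for a field indexed this way:
      return new Sort(new SortField("price", SortField.Type.LONG));
   }
}
```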
# Filters
We had some clever bitset-level optimisations to merge multiple Filter
instances and save memory when caching them. I had to drop that code
and resort to the more traditional Lucene stack for filtering, as the
new design no longer deals with in-heap structures but iterates over
off-heap chunks of data.
I couldn't measure the performance impact yet; it's a significantly
different approach and while it sounds promising on paper, we'll need
some help testing this. The Lucene team can generally be trusted to go
in the better direction, but we'll have to verify if we're using it in
the best way.
# Analyzers
It is no longer possible to override the field->analyzer mapping at
runtime. We did expose this feature as a public API and I found a way
to still do it, but it comes with a performance price tag.
We'll soon deprecate this feature; if you can, start making sure
there's no need for this in Infinispan, as at some point in the near
future we'll have to drop it, with no replacement.
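For context, plain Lucene still supports a per-field analyzer mapping
as long as it is fixed when the Analyzer is constructed, rather than
overridden per query at runtime; a minimal sketch (the "isbn" field is
a made-up example):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class FixedPerFieldAnalyzers {
   static Analyzer build() {
      // The field -> analyzer mapping is decided once, at construction
      // time, and can no longer be swapped per query:
      Map<String, Analyzer> perField = new HashMap<>();
      perField.put("isbn", new KeywordAnalyzer());
      return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
   }
}
```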
# Index encoding
As usual the index encoding evolves, and the easy solution is to
rebuild the index. Lucene 5 no longer ships with backwards-compatible
decoders, but these are available as separate dependencies. If you
feel the need to be able to read existing indexes, we should include
these.
(I'm including these as private dependencies in the Hibernate Search modules).
Thanks,
Sanne
Redis infinispan cache store
by Simon Paulger
Hi,
I'm interested in developing Infinispan integration with Redis for use in
JBoss. Before working on JBoss, I first need to add the capability to
Infinispan itself.
Is this an enhancement that the Infinispan community would be interested in?
Regards,
Simon
Shared vs Non-Shared CacheStores
by Sanne Grinovero
I would like to propose a clear-cut separation between our shared and
non-shared CacheStores,
in all terms such as:
- Configuration options
- Integration contracts (Split the CacheStore SPI)
- Implementations
- Terminology, to avoid any further confusion around valid
configurations and sensible architectures
We have loads of examples of users who got into trouble by configuring
one incorrectly, but there are also plenty of efficiency improvements
we could take advantage of by clearly splitting the integration points
and the implementations into two categories.
Not least, it's a very common and dangerous pitfall to assume that
Infinispan is able to restore a consistent state after a DIST cluster
which passivated into non-shared CacheStore instances has been
stopped, or even a REPL cluster whose nodes don't shut down at the
exact same time (and "exact same time" is a strange concept, at the
least). We need to clarify the different options, trade-offs and their
consequences, to users and to ourselves: a clearly defined use case
will avoid bugs and simplify the implementations.
# The purpose of each
I think that people should use a non-shared (local?) CacheStore for
the sole purpose of expanding the storage capacity of each single
node: be it because you don't have enough memory at all, or because
you prefer some extra safety margin because your estimates are
complex, or because we live in a real world where the hashing function
might not be perfect in practice. I hope we all agree that Infinispan
should take such situations with at worst a graceful performance
degradation, rather than complaining by sending OOMs to the admin and
setting the service on strike.
A Shared CacheStore is useful for very different purposes; primarily
to implement a Cache on some other service - for example your (single,
shared) RDBMS, a slow (or expensive) webservice your organization has
to call frequently, etc. It's also useful as a write-through cache
on a similar service, maybe internal but not able to handle the high
variation of load spikes which Infinispan can handle better.
Finally, a great use case is to have a consistent backup of all your
data-grid content, possibly in some "reference" form such as JPA
mapped entities.
# Benefits of a Non-Shared
A non-shared CacheStore implementor should be able to take advantage
of *its purpose*; among the big ones I see:
- Exclusive usage -> locking of a specific entry can be handled at
data-container level, which can simplify quite some internal code.
- Reliability -> since a clustered node needs to wipe its state at
reboot (after a crash), it's much simpler to code any such CacheStore
to avoid any form of disk sync or persistence guarantees.
- Encoding format -> this can be controlled entirely by Infinispan,
with no need to keep factors like rolling-upgrade-compatible encodings
in mind. JBoss Marshalling would be good enough, or some
implementations might not need to serialize at all.
Our non-shared CacheStore implementation(s) could take advantage of
lower-level, more complex code optimisations and interfaces, as users
would rarely want to customize one of these; the use case of mapping
data to a shared service, on the other hand, needs a more user-friendly
SPI so as to keep it simple to plug in custom stores: custom data
formats, custom connectors, and some help in implementing concurrency
correctly.
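To make this concrete, here's a purely hypothetical sketch of how the
two contracts could diverge - all names are made up, this is not an
existing Infinispan SPI:

```java
// Hypothetical non-shared (local) contract: exclusive access is assumed,
// Infinispan fully controls the encoding, and state is wiped at restart
// after a crash, so no durability guarantees are required.
public interface LocalStore {
   void write(int segment, byte[] key, byte[] value);
   byte[] read(int segment, byte[] key);
   void clear();
}

// Hypothetical shared contract: user-friendly and typed, meant for custom
// mappings to an external shared service such as an RDBMS or a webservice.
public interface SharedStore<K, V> {
   void write(K key, V value);
   V load(K key);
   boolean delete(K key);
}
```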
Proper transaction integration for the CacheStore has been on our
wishlist for some time too. I suspect that accepting that we have been
mixing up two different things under the same name so far would make it
simpler to implement further improvements such as transactions: the
way to do such a thing is very different in each of these use cases,
so it would help at least to implement it on a subset first - or maybe
only on one, if it turns out there's no need for such things in the
context of the local-only-dedicated "swapfile".
# Mixed types should be killed
I'm aware that some of our current implementations _could_ work both as
shared or non-shared, for example the JDBC or JPACacheStore or the
Remote CacheStore... but in most cases it doesn't make much sense. Why
would you ever want to use the JPACacheStore if not to share data with
a _shared_ database?
We should take such options away, and by doing so focus on the use
cases which actually matter and simplify the implementations and
improve the configuration validations.
If ever a compelling storage technology is identified which we'd like to
offer as an option for both shared or non-shared, I would still
recommend to make two different implementations, as there certainly are
different requirements and assumptions when coding such a thing.
Not least, I would very much like to see a default local CacheStore:
picking one for local "emergency swapping" should be a no-brainer for
users; we could set one up by default and not bother newcomers with
complex choices.
If we simplify the requirements of such a thing, it should be easy to
write one on standard Java NIO2 APIs and get rid of the complexities of
maintaining the native integration with things like LevelDB, not least
the inefficiency of making such native calls from Java.
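To give an idea, a trivially simplified sketch of such a store on plain
NIO2 APIs - one file per key, purely illustrative, ignoring segmenting,
indexing and size management:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Purely illustrative local "swapfile" store: since the state is wiped
// at restart after a crash, no fsync or crash-consistency is needed.
public class NioLocalStore {
   private final Path root;

   public NioLocalStore(Path root) throws IOException {
      this.root = Files.createDirectories(root);
   }

   public void write(String key, byte[] value) throws IOException {
      Files.write(root.resolve(key), value);
   }

   public byte[] read(String key) throws IOException {
      Path file = root.resolve(key);
      return Files.exists(file) ? Files.readAllBytes(file) : null;
   }

   public void delete(String key) throws IOException {
      Files.deleteIfExists(root.resolve(key));
   }
}
```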
Then as a second step we should attack the other use case, backups:
from a *purpose-driven perspective* I'd then see us revive the Cassandra
integration, obviously as a shared-only option.
Cheers,
Sanne
Special cache types and their configuration (or lack of)
by Tristan Tarrant
Hi all,
I wanted to bring attention to some discussion that has happened in the
context of Radim's work on simplified code for specific cache types [1].
In particular, Radim proposes adding explicit configuration options
(i.e. a new simple-cache cache type) to the programmatic/declarative API
to ensure that a user is aware of the limitations of the resulting cache
type (no interceptors, no persistence, no tx, etc).
My opinion is that we should aim for "less" configuration and not
"more", and that optimizations such as these should get enabled
implicitly when the parameters allow it, i.e. when the configuration
code detects it can use a "simple" cache.
Also, this choice should happen at cache construction time, and not
dynamically at cache usage time.
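Something along these lines, purely as a sketch - the exact conditions
(and whether these Configuration accessors cover them all) are exactly
what we'd need to agree on:

```java
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.transaction.TransactionMode;

public class SimpleCacheDetection {
   // Hypothetical detection logic: at cache construction time, inspect
   // the Configuration and pick the "simple" implementation only when
   // nothing rules it out.
   static boolean canUseSimpleCache(Configuration cfg) {
      return cfg.clustering().cacheMode() == CacheMode.LOCAL
            && cfg.transaction().transactionMode() == TransactionMode.NON_TRANSACTIONAL
            && !cfg.persistence().usesStores()
            && cfg.customInterceptors().interceptors().isEmpty();
   }
}
```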
WDYT ?
Tristan
[1] https://github.com/infinispan/infinispan/pull/3577
--
Tristan Tarrant
Infinispan Lead
JBoss, a division of Red Hat
Question about Hibernate ORM 5.0 + Infinispan 8.0...
by Scott Marlow
Hi,
I heard that Infinispan 8.0 may soon be integrated into WildFly 10.0.
If that happens, how does that impact Hibernate ORM 5.0 which currently
integrates with Infinispan 7.2.1.Final? Does Hibernate ORM 5.0 need any
changes to integrate with Infinispan 8.0?
Thanks,
Scott
Strict Expiration
by William Burns
This is a necro of [1].
With Infinispan 8.0 we are adding clustered expiration, which includes
a clustered expiration event as well. Unfortunately expiration events
currently occur multiple times (if numOwners > 1) and at different
times across nodes in a cluster. This makes coordinating a single
cluster-wide expiration event quite difficult.
To work around this I am proposing that the expiration of an entry is
performed solely by the owner of the key that has expired. This would
fix the issue of having multiple events, and the event can be raised
while holding the lock for the given key, so concurrent modifications
would not be an issue.
The problem arises when you have other nodes that have expiration set
but expire at different times. Max idle is the biggest offender here,
as a read on one owner only refreshes that owner's timestamp, meaning
the other owners would not be updated and would expire the entry
prematurely. To have expiration work properly in this case you would
need coordination between the owners to see if anyone has a higher
value. This requires blocking, and would have to be done while
accessing a key that is expired, to be sure whether expiration
happened or not.
The linked dev listing proposed instead to only expire an entry via the
reaper thread and not on access. In this case a read will return a
non-null value until the entry is fully expired, possibly increasing
hit ratios.
There are quite a few real benefits to this:
1. Cluster cache reads would be much simpler and wouldn't have to block to
verify the object exists or not since this would only be done by the reaper
thread (note this would have only happened if the entry was expired
locally). An access would just return the value immediately.
2. Each node only expires entries it owns in the reaper thread reducing how
many entries they must check or remove. This also provides a single point
where events would be raised as we need.
3. A lot of code can now be removed and made simpler as it no longer has to
check for expiration. The expiration check would only be done in 1 place,
the expiration reaper thread.
The main issue with this proposal, as the other listing mentions, is
when user code expects the value to be gone after expiration for
correctness. I would say this use case is not as compelling for
maxIdle, especially since we never supported it properly. And in the
case of lifespan the user could very easily store the expiration time
in the object itself and check it after a get, as pointed out in the
other thread and sketched below.
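A sketch of that workaround, purely illustrative and not an Infinispan
API - the application embeds the deadline in the stored value and checks
it after each get:

```java
public class TimestampedValue<V> {
   final V value;
   final long expiresAtMillis;

   public TimestampedValue(V value, long lifespanMillis) {
      this.value = value;
      this.expiresAtMillis = System.currentTimeMillis() + lifespanMillis;
   }

   /** Returns the value, or null once the lifespan has elapsed. */
   public V getIfNotExpired() {
      return System.currentTimeMillis() < expiresAtMillis ? value : null;
   }
}
```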
[1]
http://infinispan-developer-list.980875.n3.nabble.com/infinispan-dev-stri...
Development process and handling of PRs
by Tristan Tarrant
Hi all,
there is something about our current development model which I feel is
holding us back a little. This is caused by a number of issues:
- Handling Pull Requests: we are really slow at doing this. When issuing
a PR, a developer expects at least one review to happen within the next
half-day at most. Instead, requests sit in the queue for days (or weeks)
before they even get considered. I don't expect everybody to just drop
what they are doing and review immediately, but at least be a bit more
responsive.
- It seems like we're always aiming for the perfect PR. Obviously a PR
should have zero failures, but we should be a bit more iterative about
the way we make changes. This is probably also a consequence of the
above: why should I break up my PR into small chunks, if it takes so
long to review each one and the cumulative delay is detrimental to my
progress? I like what Pedro has done for his locking changes.
- We're afraid of changes, but that's what a development phase is for,
especially for a new major release. We should be a bit more aggressive
with trying things out. A PR can be merged even if there are some
concerns (obviously not from a fundamental design POV), and it can be
refined in later steps.
This is what I would like to see in Beta2:
- The functional API (I can take care of rebasing the PR)
- The management console
- The query grouping/aggregation stuff
- anything else we can merge soon
I would like to release Wednesday at the latest, so please do your best
to help in achieving this goal.
Tristan
--
Tristan Tarrant
Infinispan Lead
JBoss, a division of Red Hat