[
https://issues.jboss.org/browse/ISPN-9690?page=com.atlassian.jira.plugin....
]
Galder Zamarreño updated ISPN-9690:
-----------------------------------
Comment: was deleted
(was: A chat discussion:
{code}
William Burns: @Galder responded to your email, however we can talk here maybe
William Burns: actually @Galder I can see not requiring transactions when the cache mode
is SIMPLE, LOCAL or INVALIDATION
William Burns: but that is not an option currently, would have to be added
Galder: Hmmm, we want this for DIST
William Burns: yeah then we would need two phase commit like I mentioned in my email
William Burns: since back up owner may reject a write, but primary may be fine
William Burns: unless we want to relax that requirement
Galder: Hmmmm
Galder: Assuming this is DIST, can't we start rejecting before the limit is reached?
William Burns: the way it currently works is each owner checks if their current size + new
entry size > max size and if any of the owners reject, we throw the
ContainerFullException
Galder: I mean: could an owner guess that adding something might end up topping non-owners
and reject it?
Dan: @Galder @William Burns we could split the limit per segment and throw an exception
once a segment is full
Galder: @William Burns Do you understand what I mean?
Galder: @Dan Don't think that'd solve the problem
Galder: You could still have issues where the primary owner is OK but the backup owner
ends up failing due to segment full?
William Burns: @Galder do you mean that each node has an idea of how full another owner
is, updated every so often in the background?
Galder: Or is the segment size fixed?
Galder: @William Burns Maybe... Or some other way that does not required communication
Galder: Segment based CH means we have more or less the same amount of data everywhere...
Galder: Or at least relatively similar number of entries in each node
Galder: In essence, could each primary owner probabilistically decide that doing a put
would topple others?
Dan: @Galder yeah, the number of entries is typically within 10% of the average
Galder: There you go, 10%
Galder: I mean with EXCEPTION strategy only
Dan: But that's for all the entries on a node, a segment always has the same entries
on all the owners that have it
Galder: Yeah but each node has different segments
Galder: That's where the variance comes in, right?
Dan: yes, the segments are different
Dan: sometimes by more than 10% :)
Galder: @William Burns I'm not necessarily suggesting each node keeps track of others,
i'm more suggesting that given how full a node is, it could decide that others are
likely full...
Dan: I was thinking of computing a fixed limit per segment, but you're right, that
depends on how many nodes there are in the cluster
William Burns: So we are thinking of some fixed percentage, that if a node is actually
current > (max *(1-%)) it would throw the exception and not consult other nodes. In
that case I would say why even have this %, just let the user configure the max directly
knowing this is the behavior
William Burns: and that some nodes can technically go over
William Burns: this is also very problematic if you have a lot of writes to same nodes
William Burns: ie. 3 nodes A, B, C if majority of writes are always to A and B and not C,
C could get OOM
Galder: CH should take care of too many writes in some nodes
Galder: But true, some keys could be written more
Dan: @William Burns do you mean only throw the exception on the primary owner, and accept
any write on the backup?
William Burns: well that is what Galder is after, so we don't have to use
transactions
William Burns: is my assumption
Galder:
So we are thinking of some fixed percentage, that if a node is actually current > (max
*(1-%)) it would throw the exception and not consult other nodes. In that case I would say
why even have this %, just let the user configure the max directly knowing this is the
behavior
But the user already configures the max, doesn't it?
William Burns: yes and we use transactions to ensure max is never reach on any node
Galder: Right, but using transactions for this seems too heavyweight
Dan: @Galder transactions are the only way to make it precise, so we only insert an entry
if all the owners have room for it
William Burns: unfortunately without some sort of two phase operation we can some nodes
updated and some others - and I figured don't reinvent the wheel
Dan: Any non-2PC solution will have to either allow some inserts to fail on some nodes and
succeed on others, or have some nodes go over the configured max
William Burns: now I am open to possibly allowing others to go over, but we have to make
sure the user knows this to possibly lower the max
Galder:
Any non-2PC solution will have to either allow some inserts to fail on some nodes and
succeed on others, or have some nodes go over the configured max
Would that be fine if partition handling was enabled?
Dan: @William Burns I've been dreaming about lite transactions that don't need a
xid and all that baggage ever since Mircea removed support for mixing tx and non-tx
operations in the same cache :)
Dan: @Galder
Would that be fine if partition handling was enabled?
Depends on what you mean by "fine" :)
William Burns: @Dan yeah tbh I was thinking that would be nice for this use case
Galder: So when a node fails due to memory issues, we switch the readiness probe to say
node not healthy... then the user gives it more memory to the pods... and one at the time
they get restarted... partition handling would kick in and address partial replications
Galder: ^ I might be day dreaming :see_no_evil:
William Burns: hrmm, I think just regular rebalance should repair it
William Burns: don't need partition handling
William Burns: assuming we haven't lost >= numOwners
William Burns: and then nothing can save it :D
Galder: Exactly
Galder: TBH, the above should be done by the autoscaler
Dan:
assuming we haven't lost >= numOwners
Yeah, that's the big assumption -- if one node is shut down because it wanted to
insert too many entries, those entries will be redirected somewhere else
Galder: Rather than manually by someone
Dan: So an exception in one node may trigger the restart of the entire cluster
William Burns: yeah actually the more I think about it losing that node could lose data
then
William Burns: since state transfer writes may be blocked
William Burns: if this fills up another node
Dan: @Galder does OS/K8S keep pods around if they're not healthy?
Galder: Hmmmm, actually, I think it tries to restart them until they're healthy
Dan: Yeah, that's what I thought
William Burns: this is the age old issue of eviction without a store can lose data on a
rehash, irrespective of numOwners
William Burns: @Galder will you have a store?
Galder: Actually
Galder: @Dan Depends on the config...
Galder: @Dan Depends how many times the readiness probe fails, at which point it'd
restart it
Galder:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-livene...
Galder:
@Galder will you have a store?
Not by default
William Burns: so in that case I would say we should err on the side of making sure no
node has too much as much as possible
Galder: Ok
Galder: What if you had a store?
William Burns: if we had a store then partial updates should be fine
William Burns: ie. primary writes the value and backup ignores it
Galder: But the stores would also have partial udpates?
William Burns: but we would have to implement writing to a store but not memory
Galder: exactly
Galder: Ok
Galder: I'll create a JIRA and summarise the discussion for now
Galder: I have to head out in a few
William Burns: alright
William Burns: we could even optimize it for shared stores
William Burns: :D
William Burns: actually that is interesting...
William Burns: we could have a store mode with shared where we never write the entry to
memory on backups
Galder: Local stores are preferable
Galder: One less thing to worry about
William Burns: yeah agreed for this use case, I was just thinking more in general
Galder: @Dan @William Burns We want the default cache in data grid service to use
EXCEPTION strategy, so that data is preserved (as opposed to the cache service where it
evicts data, it's a cache), so would you make the default cache transactional? Or have
a local file based store instead?
Galder: I'd go for the latter, just wondering if you see any case where the former
would be better
Galder: Actually, with the current state you still need transactions because we don't
write to store if memory full
William Burns: yeah latter is only way to not lose data
William Burns: former may be faster? would have to confirm
Galder: Yeah, but local file based persistence alone is not enough today to make EXCEPTION
strategy work?
William Burns: no, tbh this seems like a different eviction mode at this point with a
store
William Burns: cause you would never get an EXCEPTION
William Burns: tbh, I don't know if this is worth it
Galder: true
William Burns: this is essentially regular eviction with a store
William Burns: just changes what elements are in memory
Galder: True...
William Burns: yeah now I remember before that I thought EXCEPTION based eviction and
stores don't make that much sense together
{code})
EXCEPTION eviction strategy should not require transactions
-----------------------------------------------------------
Key: ISPN-9690
URL:
https://issues.jboss.org/browse/ISPN-9690
Project: Infinispan
Issue Type: Enhancement
Reporter: Galder Zamarreño
Priority: Major
Attachments: ispn-9690-zulip-chat.log
Unfortunately we need 2 phase commit to guarantee consistency. Imagine the case where one
of the owners says the write is okay and another says no, there is no way without two
phase commit to guarantee that either all or none of the writes are completed.
One possibility would be node to deny a write if it expects it would result in other
nodes running out of memory. However, this could still fail if some keys store more data
than others. It would require Infinispan to calculate some probabilistic method of
deciding when a node would run out of memory.
Another way would be to have a local (or shared) persistent store attached. In that case,
if a backup owner will run out of memory if storing data, it would not store it in memory
but store it to the persistent layer. If the node is restarted with new memory settings,
the persistent stores would be consistent and the rebalance would put the data back in
memory.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)