[
https://issues.jboss.org/browse/ISPN-9690?page=com.atlassian.jira.plugin....
]
Galder Zamarreño edited comment on ISPN-9690 at 11/7/18 11:06 AM:
------------------------------------------------------------------
A chat discussion:
{code}
William Burns: @Galder responded to your email, however we can talk here maybe
William Burns: actually @Galder I can see not requiring transactions when the cache mode
is SIMPLE, LOCAL or INVALIDATION
William Burns: but that is not an option currently, would have to be added
Galder: Hmmm, we want this for DIST
William Burns: yeah then we would need two phase commit like I mentioned in my email
William Burns: since a backup owner may reject a write, but the primary may be fine
William Burns: unless we want to relax that requirement
Galder: Hmmmm
Galder: Assuming this is DIST, can't we start rejecting before the limit is reached?
William Burns: the way it currently works is each owner checks if their current size + new
entry size > max size and if any of the owners reject, we throw the
ContainerFullException
Galder: I mean: could an owner guess that adding something might end up toppling non-owners
and reject it?
Dan: @Galder @William Burns we could split the limit per segment and throw an exception
once a segment is full
Galder: @William Burns Do you understand what I mean?
Galder: @Dan Don't think that'd solve the problem
Galder: You could still have issues where the primary owner is OK but the backup owner
ends up failing due to segment full?
William Burns: @Galder do you mean that each node has an idea of how full another owner
is, updated every so often in the background?
Galder: Or is the segment size fixed?
Galder: @William Burns Maybe... Or some other way that does not require communication
Galder: Segment based CH means we have more or less the same amount of data everywhere...
Galder: Or at least relatively similar number of entries in each node
Galder: In essence, could each primary owner probabilistically decide that doing a put
would topple others?
Dan: @Galder yeah, the number of entries is typically within 10% of the average
Galder: There you go, 10%
Galder: I mean with EXCEPTION strategy only
Dan: But that's for all the entries on a node, a segment always has the same entries
on all the owners that have it
Galder: Yeah but each node has different segments
Galder: That's where the variance comes in, right?
Dan: yes, the segments are different
Dan: sometimes by more than 10% :)
Galder: @William Burns I'm not necessarily suggesting each node keeps track of others,
i'm more suggesting that given how full a node is, it could decide that others are
likely full...
Dan: I was thinking of computing a fixed limit per segment, but you're right, that
depends on how many nodes there are in the cluster
William Burns: So we are thinking of some fixed percentage, such that if a node's current
size is actually > (max * (1 - %)) it would throw the exception and not consult other nodes.
In that case I would say why even have this %, just let the user configure the max directly
knowing this is the behavior
William Burns: and that some nodes can technically go over
William Burns: this is also very problematic if you have a lot of writes to the same nodes
William Burns: ie. with 3 nodes A, B, C, if the majority of writes always go to A and B and
not C, C could get an OOM
Galder: CH should take care of too many writes in some nodes
Galder: But true, some keys could be written more
Dan: @William Burns do you mean only throw the exception on the primary owner, and accept
any write on the backup?
William Burns: well that is what Galder is after, so we don't have to use
transactions
William Burns: is my assumption
Galder:
So we are thinking of some fixed percentage, such that if a node's current size is actually
> (max * (1 - %)) it would throw the exception and not consult other nodes. In that case I
would say why even have this %, just let the user configure the max directly knowing this is
the behavior
But the user already configures the max, don't they?
William Burns: yes and we use transactions to ensure the max is never reached on any node
Galder: Right, but using transactions for this seems too heavyweight
Dan: @Galder transactions are the only way to make it precise, so we only insert an entry
if all the owners have room for it
William Burns: unfortunately without some sort of two phase operation we can have some nodes
updated and some others not - and I figured don't reinvent the wheel
Dan: Any non-2PC solution will have to either allow some inserts to fail on some nodes and
succeed on others, or have some nodes go over the configured max
William Burns: now I am open to possibly allowing others to go over, but we have to make
sure the user knows this to possibly lower the max
Galder:
Any non-2PC solution will have to either allow some inserts to fail on some nodes and
succeed on others, or have some nodes go over the configured max
Would that be fine if partition handling was enabled?
Dan: @William Burns I've been dreaming about lite transactions that don't need a
xid and all that baggage ever since Mircea removed support for mixing tx and non-tx
operations in the same cache :)
Dan: @Galder
Would that be fine if partition handling was enabled?
Depends on what you mean by "fine" :)
William Burns: @Dan yeah tbh I was thinking that would be nice for this use case
Galder: So when a node fails due to memory issues, we switch the readiness probe to say the
node is not healthy... then the user gives more memory to the pods... and one at a time
they get restarted... partition handling would kick in and address partial replications
Galder: ^ I might be day dreaming :see_no_evil:
William Burns: hrmm, I think just regular rebalance should repair it
William Burns: don't need partition handling
William Burns: assuming we haven't lost >= numOwners
William Burns: and then nothing can save it :D
Galder: Exactly
Galder: TBH, the above should be done by the autoscaler
Dan:
assuming we haven't lost >= numOwners
Yeah, that's the big assumption -- if one node is shut down because it wanted to
insert too many entries, those entries will be redirected somewhere else
Galder: Rather than manually by someone
Dan: So an exception in one node may trigger the restart of the entire cluster
William Burns: yeah actually the more I think about it losing that node could lose data
then
William Burns: since state transfer writes may be blocked
William Burns: if this fills up another node
Dan: @Galder does OS/K8S keep pods around if they're not healthy?
Galder: Hmmmm, actually, I think it tries to restart them until they're healthy
Dan: Yeah, that's what I thought
William Burns: this is the age-old issue that eviction without a store can lose data on a
rehash, irrespective of numOwners
William Burns: @Galder will you have a store?
Galder: Actually
Galder: @Dan Depends on the config...
Galder: @Dan Depends how many times the readiness probe fails, at which point it'd
restart it
Galder:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-livene...
Galder:
@Galder will you have a store?
Not by default
William Burns: so in that case I would say we should err on the side of making sure, as much
as possible, that no node holds too much
Galder: Ok
Galder: What if you had a store?
William Burns: if we had a store then partial updates should be fine
William Burns: ie. primary writes the value and backup ignores it
Galder: But the stores would also have partial updates?
William Burns: but we would have to implement writing to a store but not memory
Galder: exactly
Galder: Ok
Galder: I'll create a JIRA and summarise the discussion for now
Galder: I have to head out in a few
William Burns: alright
William Burns: we could even optimize it for shared stores
William Burns: :D
William Burns: actually that is interesting...
William Burns: we could have a store mode with shared where we never write the entry to
memory on backups
Galder: Local stores are preferable
Galder: One less thing to worry about
William Burns: yeah agreed for this use case, I was just thinking more in general
Galder: @Dan @William Burns We want the default cache in data grid service to use
EXCEPTION strategy, so that data is preserved (as opposed to the cache service where it
evicts data, it's a cache), so would you make the default cache transactional? Or have
a local file based store instead?
Galder: I'd go for the latter, just wondering if you see any case where the former
would be better
Galder: Actually, with the current state you still need transactions because we don't
write to the store if memory is full
William Burns: yeah latter is only way to not lose data
William Burns: former may be faster? would have to confirm
Galder: Yeah, but local file based persistence alone is not enough today to make EXCEPTION
strategy work?
William Burns: no, tbh this seems like a different eviction mode at this point with a
store
William Burns: cause you would never get an EXCEPTION
William Burns: tbh, I don't know if this is worth it
Galder: true
William Burns: this is essentially regular eviction with a store
William Burns: just changes what elements are in memory
Galder: True...
William Burns: yeah now I remember before that I thought EXCEPTION based eviction and
stores don't make that much sense together
{code}
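As a rough sketch of the current per-owner check discussed above (class and method names are illustrative, not the actual Infinispan implementation):
{code}
public class BoundedContainer {

    private final long maxSizeBytes;
    private long currentSizeBytes;

    public BoundedContainer(long maxSizeBytes) {
        this.maxSizeBytes = maxSizeBytes;
    }

    // Each owner runs this check independently; if any owner rejects,
    // the write fails with the ContainerFullException mentioned in the chat.
    public synchronized void addEntry(long entrySizeBytes) {
        if (currentSizeBytes + entrySizeBytes > maxSizeBytes) {
            throw new IllegalStateException(
                "Container full: cannot fit entry of " + entrySizeBytes + " bytes");
        }
        currentSizeBytes += entrySizeBytes;
    }
}
{code}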
EXCEPTION eviction strategy should not require transactions
-----------------------------------------------------------
Key: ISPN-9690
URL:
https://issues.jboss.org/browse/ISPN-9690
Project: Infinispan
Issue Type: Enhancement
Reporter: Galder Zamarreño
Priority: Major
Unfortunately we need 2 phase commit to guarantee consistency. Imagine the case where one
of the owners says the write is okay and another says no; without two phase commit there is
no way to guarantee that either all or none of the writes are completed.
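A minimal sketch of the kind of two-phase capacity check this implies; the interfaces and names here are illustrative assumptions rather than Infinispan APIs:
{code}
import java.util.ArrayList;
import java.util.List;

interface Owner {
    boolean prepare(long entrySizeBytes); // reserve room, false if it does not fit
    void commit(long entrySizeBytes);     // make the reserved entry visible
    void rollback(long entrySizeBytes);   // release the reservation
}

class TwoPhaseWrite {

    // Either every owner applies the write, or none of them does.
    static boolean write(List<Owner> owners, long entrySizeBytes) {
        List<Owner> prepared = new ArrayList<>();
        for (Owner owner : owners) {
            if (!owner.prepare(entrySizeBytes)) {
                // One owner rejected: undo the reservations and fail the whole write.
                prepared.forEach(o -> o.rollback(entrySizeBytes));
                return false;
            }
            prepared.add(owner);
        }
        prepared.forEach(o -> o.commit(entrySizeBytes));
        return true;
    }
}
{code}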
One possibility would be for a node to deny a write if it expects the write would result in
other nodes running out of memory. However, this could still fail if some keys store more
data than others; it would require Infinispan to apply some probabilistic method of
deciding when a node would run out of memory.
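A possible shape for such a check, run on the primary owner only; the safety margin value and all names are illustrative assumptions, not Infinispan code:
{code}
class PrimaryOnlyCheck {

    private final long maxSizeBytes;
    private final double safetyMargin; // e.g. 0.10 for the ~10% variance mentioned in the chat
    private long currentSizeBytes;

    PrimaryOnlyCheck(long maxSizeBytes, double safetyMargin) {
        this.maxSizeBytes = maxSizeBytes;
        this.safetyMargin = safetyMargin;
    }

    // Reject early, before the local limit is reached, assuming the other
    // owners hold roughly the same amount of data (within the margin).
    synchronized boolean tryAdd(long entrySizeBytes) {
        long effectiveMax = (long) (maxSizeBytes * (1.0 - safetyMargin));
        if (currentSizeBytes + entrySizeBytes > effectiveMax) {
            return false; // likely to push another owner over its limit
        }
        currentSizeBytes += entrySizeBytes;
        return true;
    }
}
{code}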
Another way would be to have a local (or shared) persistent store attached. In that case,
if storing the data would make a backup owner run out of memory, it would not store the
entry in memory but would write it to the persistent layer. If the node is restarted with
new memory settings, the persistent stores would be consistent and the rebalance would put
the data back in memory.
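A rough sketch of how a backup owner could apply that fallback, using an entry-count limit and an in-memory map as a stand-in for the real persistent store (all names hypothetical):
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BackupOwner {

    private final int maxEntriesInMemory;
    private final Map<String, byte[]> memory = new ConcurrentHashMap<>();
    private final Map<String, byte[]> store = new ConcurrentHashMap<>(); // stand-in for a real store

    BackupOwner(int maxEntriesInMemory) {
        this.maxEntriesInMemory = maxEntriesInMemory;
    }

    void backupWrite(String key, byte[] value) {
        // Always keep the store consistent so a restart plus rebalance can recover the data.
        store.put(key, value);
        if (memory.size() < maxEntriesInMemory) {
            memory.put(key, value);
        }
        // Otherwise skip the in-memory write instead of throwing, as described above.
    }
}
{code}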
--
This message was sent by Atlassian Jira
(v7.12.1#712002)