On 4/19/13 11:51 AM, Sanne Grinovero wrote:
> TBH I'm not understanding which problem this thread is about :)
> Surely network partitions are a problem, but there are many forms of
> "partition", and many different opinions about what "acceptable"
> behaviour the grid should implement, which largely depend on the
> assumptions the client application is making.
This thread is not about providing different application data merge
policies, which is also something that needs to be done, but more likely
in the context of eventual consistency in Infinispan. This thread is
about a *primary partition* approach which says that only members in a
primary partition are allowed to make progress, and others aren't.
'No progress' needs to be defined, but so far I think we've agreed on a
notion of read-only members, which do provide access to data, but don't
allow modifications. One could also think about not even allowing reads,
as the data might be stale; perhaps this could be made configurable.
The idea is that if only a primary partition can make progress, and only
*one* primary partition exists in the system at any time, then we can
simply overwrite the data of minority partitions with data from the
primary partition on a merge.
So a primary partition approach is not about how to merge (possibly
conflicting) data after a cluster split heals, but about how to
*prevent* conflicting data in the first place.
If you think about this in CAP terminology, we sacrifice availability in
favor of consistency.
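To make the read-only idea a bit more concrete, here is a rough sketch of
how a JGroups view listener could flip a node into read-only mode when it
finds itself outside the majority. This is not Infinispan code; the class
name, the statically configured expected cluster size and the simple
majority rule are assumptions for illustration only:

import org.jgroups.JChannel;
import org.jgroups.Receiver;
import org.jgroups.View;

// Sketch: reject writes while this node is not in the primary (majority) partition.
public class PrimaryPartitionGuard implements Receiver {
    private final int expectedClusterSize; // assumed to be known/configured up front
    private volatile boolean readOnly;

    public PrimaryPartitionGuard(int expectedClusterSize) {
        this.expectedClusterSize = expectedClusterSize;
    }

    @Override
    public void viewAccepted(View newView) {
        // Primary partition = strictly more than half of the expected membership.
        readOnly = newView.size() <= expectedClusterSize / 2;
        System.out.println("view " + newView + " -> " + (readOnly ? "READ-ONLY" : "READ-WRITE"));
    }

    // Called before every modification; reads would still be served.
    public void checkWriteAllowed() {
        if (readOnly)
            throw new IllegalStateException("writes rejected: node is in a minority partition");
    }

    public static void main(String[] args) throws Exception {
        JChannel ch = new JChannel();                  // default protocol stack
        ch.setReceiver(new PrimaryPartitionGuard(4));  // e.g. a 4-node cluster
        ch.connect("partition-demo");
        Thread.sleep(60_000);                          // observe view changes
        ch.close();
    }
}

Whether reads should also be rejected (the staleness concern above) would
just be another flag checked in the same place.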
> Since we seem to be discussing a case in which the minority group is
> expected to flip into read-only mode, could we step back and describe:
> - why this is an acceptable solution for some class of applications?
> - what kind of potential network failure we want to take compensating
>   actions for?
There *are* no compensating actions: we avoid creating divergent branches
of the same key in the first place, contrary to eventual consistency,
which is all about merging conflicting branches of a key.
> I'm not an expert on how people physically wire up single nodes, racks
> and rooms to allow for our virtual connections, but let's assume that
> all nodes are connected with a single "cable" between each other, or,
> if in practice multiple cables are actually used, could we rely on
> system configuration to guarantee packets can find alternative routes
> if one wire is eaten by mice?
In my experience, dropped packets due to this almost never happen. Most
partitions (in Infinispan/JGroups) occur because of the following things:
- A maxed-out thread pool at one or more members, which leads to missed
  heartbeat acks and false suspicions
- A misconfigured switch / firewall, especially if members are in
  different subnets
- Buggy firmware in the switch, e.g. dropping multicasts every now and
  then (IGMP snooping)
- Small packet queues, so packets get discarded
- GC pauses inhibiting heartbeats for some time
> It seems important to me to define what level of network failure we
> want to address; for example, are we assuming we don't deal with cases
> in which nodes can talk to one group but not vice versa?
We cannot guarantee this won't happen. As a matter of fact, I've seen
this in practice, and MERGE{2,3} contain code that deals with asymmetric
partitions.
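Once MERGE{2,3} does install a MergeView after the split heals, application
code can at least tell whether it sat in the minority and needs to have its
data overwritten. A rough sketch, again against the JGroups Receiver API;
the "largest subgroup is the primary partition" rule and the
wipeAndResyncFromPrimary() hook are assumptions for illustration, not what
Infinispan actually does:

import org.jgroups.MergeView;
import org.jgroups.Receiver;
import org.jgroups.View;

// Sketch: after a merge, members that were not in the primary partition discard
// their local data and re-sync from the primary.
public class MergeAwareListener implements Receiver {
    private volatile View lastView;

    @Override
    public void viewAccepted(View newView) {
        if (newView instanceof MergeView) {
            MergeView merge = (MergeView) newView;
            // Assumption: treat the largest subgroup as the primary partition.
            View primary = merge.getSubgroups().stream()
                    .max((a, b) -> Integer.compare(a.size(), b.size()))
                    .orElse(newView);
            boolean wasInPrimary = lastView != null &&
                    primary.getMembers().containsAll(lastView.getMembers());
            if (!wasInPrimary)
                wipeAndResyncFromPrimary();
        }
        lastView = newView;
    }

    // Hypothetical hook: drop local state and fetch it from a member of the primary partition.
    private void wipeAndResyncFromPrimary() {
        System.out.println("minority member after merge: discarding local data, re-syncing from primary");
    }
}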
> If the effect of a network failure is a completely isolated group, can
> we assume Hot Rod clients can't reach them either?
Partitions may include some clients and some cluster nodes; we cannot
assume a split cleanly separates clients from server nodes.
Unfortunately... :-)
--
Bela Ban, JGroups lead (http://www.jgroups.org)