Hi - these questions relate to establishing the safe operation of buddy replication under
AS 4.0.5.
Selection of Buddies
Say we have nodes a1, a2 and a3 and they are booted in that order. What we see is that
when a2 starts it forms a buddy pair with a1. Then when a3 starts, a1 becomes the backup
for a3. So in this scenario a1 is backing up two nodes and a3 is backing up zero nodes,
which leaves memory utilization unbalanced across the nodes. (We now have logging around
the session replication listener to analyse this behaviour.)
This seems to be broken. Is this the way buddy replication should select buddies, or do we
have a config problem somewhere? What we need is for each node to carry the same backup
overhead (memory, CPU etc.) so that the load of providing replication is spread evenly
across the cluster.
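To make the expectation concrete, here is a toy model of the two assignments. It is not JBoss Cache code: `ring_buddies` and `backup_load` are names made up for illustration, the "expected" ring simply assumes each node is backed up by the next node in boot order, and the "observed" mapping is our reading of the logs (assuming the a1/a2 pair is mutual).

```python
from collections import Counter

def ring_buddies(nodes):
    """Hypothetical ideal: each node is backed up by the next node in boot order."""
    return {node: nodes[(i + 1) % len(nodes)] for i, node in enumerate(nodes)}

def backup_load(backup_of):
    """Count how many nodes each node is backing up.

    backup_of maps a data owner to the node holding its backup copy.
    """
    load = Counter({node: 0 for node in backup_of})  # include nodes backing up nothing
    load.update(backup_of.values())
    return dict(load)

nodes = ["a1", "a2", "a3"]

# What we hoped for: a ring, so every node backs up exactly one other node.
expected = ring_buddies(nodes)

# What we see (our interpretation): a1 backs up both a2 and a3, a3 backs up nobody.
observed = {"a1": "a2", "a2": "a1", "a3": "a1"}

print("expected:", backup_load(expected))
print("observed:", backup_load(observed))
```

In the ring every node carries one backup; in the observed mapping a1 carries two and a3 none, which is the imbalance described above.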
Failover Operation
Having read the JBoss documentation I still need to understand more about the basic
operation of failover. Currently we have no replication, so if an app server node goes
down we lose ~25% of users but the other 75% stay pretty much operational. In practice, we
will have a cluster of 6 app servers and during peak times we would see 2000-3000 users
per node, 18K concurrent users in all.
What I am concerned about with buddy replication is that if a node goes down we could
take other nodes down as well, as they have to rapidly take over the work of the node that
failed (a sort of domino effect). After reading the doc I still don't have a solid
understanding of how this process works or the risks we might have.
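As a rough back-of-the-envelope check on the domino concern, the sketch below works out the per-node user count after failures. It assumes the peak figures above (6 nodes, 18K users) and that the load balancer spreads the failed node's users evenly over the survivors, which is a simplification; `users_per_survivor` is a hypothetical helper, and the cost of replication itself is not modelled.

```python
def users_per_survivor(total_users, nodes, failed):
    """Users each surviving node carries, assuming an even spread after failover."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return total_users / survivors

TOTAL_USERS = 18000  # peak concurrent users across the cluster
NODES = 6

baseline = TOTAL_USERS / NODES
for failed in range(3):
    per_node = users_per_survivor(TOTAL_USERS, NODES, failed)
    increase = per_node / baseline - 1
    print(f"{failed} failed: {per_node:.0f} users per node ({increase:+.0%})")
```

Under these assumptions one failure takes each survivor from 3000 to 3600 users, and a second failure takes it to 4500, so each node would need that much headroom before replication overhead is even counted.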
Assume a2 backs up a1, a3 backs up a2 and a1 backs up a3. This is buddy replication with
one backup buddy. All nodes are fronted by an F5 load balancer that provides sticky
sessions and will redirect a user to a random node if the node with its original session
fails.
So what, in detail, happens if a1 goes down? After the failure of a1 the F5 will direct
some of a1's users to a2 and some to a3.
1) How does the cluster determine who is the new primary owner of a1's session data?
Hopefully it will decide to use a2 since it already has a copy of a1's session cache.
2) For users directed to a3 by the F5 - how does a3 now populate its session cache to
service those newly arriving users?
3) I assume the cluster also now picks a new buddy for a3 since it lost its buddy a1. In
this case it will have to be a2 since there are no other nodes. So the question is: what is
the impact (network, CPU etc.) on a2 and a3 while the new buddy relationship between them
is established? What we are worried about is that both a2 and a3 now suddenly have a large
group of new users to support as well as taking the resource hit of replicating each
other's session state.
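The scenario behind the three questions above can be sketched as a small model. This only restates our assumptions about what should happen; it does not describe what JBoss actually does, `fail_node` is a hypothetical helper, and re-forming a ring among the survivors is our guess at the reassignment rule.

```python
def fail_node(backup_of, failed):
    """Model the loss of one node in a one-buddy ring.

    backup_of maps a data owner to the node holding its backup copy.
    Returns (node holding the failed node's sessions,
             surviving owners that lost their backup,
             assumed new owner -> backup mapping).
    """
    holder = backup_of[failed]
    orphaned = sorted(o for o, b in backup_of.items() if b == failed and o != failed)
    survivors = [n for n in backup_of if n != failed]
    if len(survivors) < 2:
        new_ring = {}  # nobody left to act as a buddy
    else:
        # Assumption: survivors simply re-form a ring in their existing order.
        new_ring = {n: survivors[(i + 1) % len(survivors)]
                    for i, n in enumerate(survivors)}
    return holder, orphaned, new_ring

# a2 backs up a1, a3 backs up a2, a1 backs up a3.
ring = {"a1": "a2", "a2": "a3", "a3": "a1"}
holder, orphaned, new_ring = fail_node(ring, "a1")

print("holds a1's sessions:", holder)
print("lost their backup:", orphaned)
print("assumed new buddies:", new_ring)
```

In this model a2 already holds a1's sessions, a3 is the node that has to find a new buddy, and the only possible new arrangement is a2 and a3 backing each other up, which is exactly when both are also absorbing a1's users.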
Failover Best Practices
What are the buddy replication 'best practices' that we should follow to provide
safe and reliable failover in a heavily loaded cluster?