We brought down one of the Solaris machines (P1, the co-ordinator) to check the view on
all machines.
As expected, the co-ordinator role moved to one of the RHEL machines and P1 was removed
from all views, but the dead RHEL members were not removed from the view.
Please find the DEBUG messages from jgroups.log below:
org.jgroups.protocols.pbcast.GMS --> new=[172.16.11.200:32790], suspected=[],
leaving=[], new view: [172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210,
172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513,
172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790]
| org.jgroups.protocols.pbcast.GMS --> mcasting view {[172.16.11.20:35858|259]
[172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918,
172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533,
172.16.11.200:32790]} (9 mbrs)
| org.jgroups.protocols.UDP --> sending msg to null (src=172.16.11.20:35858), headers
are {NAKACK=[MSG, seqno=3782], GMS=GmsHeader[VIEW]: view=[172.16.11.20:35858|259]
[172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918,
172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533,
172.16.11.200:32790], UDP=[channel_name=ProvCache-LABS]}
| org.jgroups.protocols.UDP --> message is [dst: 224.7.8.9:45567, src:
172.16.11.20:35858 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW]:
view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210,
172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513,
172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], NAKACK=[MSG, seqno=3782],
UDP=[channel_name=ProvCache-LABS]}
| org.jgroups.protocols.pbcast.GMS --> view=[172.16.11.20:35858|259]
[172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204, 172.16.11.10:51918,
172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533,
172.16.11.13:38538, 172.16.11.200:32790]
| org.jgroups.protocols.pbcast.GMS --> [local_addr=172.16.11.20:35858] view is
[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204,
172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520,
172.16.11.13:38533, 172.16.11.200:32790]
| org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src:
172.16.11.12:40087 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]:
view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210,
172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513,
172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA,
seqno=1], UDP=[channel_name=ProvCache-LABS]}
| org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src:
172.16.11.10:51918 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]:
view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210,
172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513,
172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA,
seqno=1], UDP=[channel_name=ProvCache-LABS]}
| org.jgroups.protocols.UDP --> sending msg to 172.16.11.20:35858
(src=172.16.11.20:35858), headers are {GMS=GmsHeader[VIEW_ACK]:
view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210,
172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513,
172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790],
UDP=[channel_name=ProvCache-LABS], UNICAST=[UNICAST: DATA, seqno=1]}
| org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src:
172.16.11.20:35858 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]:
view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210,
172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513,
172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA,
seqno=1], UDP=[channel_name=ProvCache-LABS]}
| org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src:
172.16.11.11:51210 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]:
view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210,
172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513,
172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA,
seqno=1], UDP=[channel_name=ProvCache-LABS]}
| org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src:
172.16.11.191:37204 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]:
view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210,
172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513,
172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA,
seqno=1], UDP=[channel_name=ProvCache-LABS]}
| org.jgroups.protocols.pbcast.GMS --> failed to collect all ACKs (11) for view
[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204,
172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520,
172.16.11.13:38533, 172.16.11.200:32790] after 2000ms, missing ACKs from
[172.16.11.13:38513, 172.16.11.13:38515, 172.16.11.13:38520, 172.16.11.13:38533]
(received=[172.16.11.11:51210, 172.16.11.20:35858, 172.16.11.191:37204,
172.16.11.12:40087, 172.16.11.10:51918]), local_addr=172.16.11.20:35858
| org.jgroups.protocols.UDP --> sending msg to 172.16.11.200:32790
(src=172.16.11.20:35858), headers are {GMS=GmsHeader[JOIN_RSP]: join_rsp=view:
[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, 172.16.11.191:37204,
172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513, 172.16.11.13:38520,
172.16.11.13:38533, 172.16.11.200:32790], digest: 172.16.11.11:51210: [0 : 0],
172.16.11.13:38513: [0 : 0], 172.16.11.10:51918: [4481 : 4482], 172.16.11.12:40087: [0 :
0], 172.16.11.13:38520: [0 : 0], 172.16.11.200:32790: [0 : 0], 172.16.11.20:35858: [3781 :
3782], 172.16.11.13:38533: [0 : 0], 172.16.11.191:37204: [3685 : 3686],
UDP=[channel_name=ProvCache-LABS], UNICAST=[UNICAST: DATA, seqno=1]}
| org.jgroups.protocols.UDP --> message is [dst: 172.16.11.20:35858, src:
172.16.11.200:32790 (3 headers), size = 0 bytes], headers are {GMS=GmsHeader[VIEW_ACK]:
view=[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210,
172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, 172.16.11.13:38513,
172.16.11.13:38520, 172.16.11.13:38533, 172.16.11.200:32790], UNICAST=[UNICAST: DATA,
seqno=2], UDP=[channel_name=ProvCache-LABS]}
org.jgroups.protocols.UDP --> message is [dst: 224.7.8.9:45567, src: 172.16.11.12:40087
(2 headers), size = 0 bytes], headers are {UDP=[channel_name=ProvCache-LABS], FD=[FD:
SUSPECT (suspected_mbrs=[172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533],
from=172.16.11.12:40087)]}
| org.jgroups.protocols.FD --> [SUSPECT] suspect hdr is [FD: SUSPECT
(suspected_mbrs=[172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533],
from=172.16.11.12:40087)]
| org.jgroups.protocols.VERIFY_SUSPECT --> verifying that 172.16.11.13:38513 is dead
| org.jgroups.protocols.UDP --> sending msg to 172.16.11.13:38513
(src=172.16.11.10:51918), headers are {VERIFY_SUSPECT=[VERIFY_SUSPECT: ARE_YOU_DEAD],
UDP=[channel_name=ProvCache-LABS]}
| org.jgroups.protocols.VERIFY_SUSPECT --> diff=2034, mbr 172.16.11.13:38513 is dead
(passing up SUSPECT event)
| org.jgroups.protocols.VERIFY_SUSPECT --> diff=2034, mbr 172.16.11.13:38533 is dead
(passing up SUSPECT event)
| org.jgroups.protocols.VERIFY_SUSPECT --> diff=2034, mbr 172.16.11.13:38520 is dead
(passing up SUSPECT event)
| org.jgroups.protocols.pbcast.GMS --> processing [SUSPECT(172.16.11.13:38513),
SUSPECT(172.16.11.13:38533), SUSPECT(172.16.11.13:38520)]
| org.jgroups.blocks.RequestCorrelator --> suspect=172.16.11.13:38513
| org.jgroups.blocks.RequestCorrelator --> suspect=172.16.11.13:38533
| org.jgroups.blocks.RequestCorrelator --> suspect=172.16.11.13:38520
| org.jgroups.protocols.pbcast.GMS --> suspected members=[172.16.11.13:38513,
172.16.11.13:38533, 172.16.11.13:38520], suspected_mbrs=[172.16.11.13:38513,
172.16.11.13:38533, 172.16.11.13:38520]
|
As per these logs, the co-ordinator identifies the dead members correctly but doesn't
update the view properly. Please advise us on how to overcome this.
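
To make the mismatch concrete, here is a small Python sketch that lists the members which are suspected yet still present in view 259. It is only a log-checking aid and not part of JGroups; the helper name parse_view is made up for illustration, and the two input strings are copied from the log excerpt above.

```python
import re

# Matches ip:port addresses such as 172.16.11.13:38513.
ADDR = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}:\d+")


def parse_view(text):
    """Return the member list of a view printed as '[creator|id] [m1, m2, ...]'."""
    # Everything after the first '] [' is the member list; the part before it
    # is the view id (creator address and sequence number).
    _, _, body = text.partition("] [")
    return ADDR.findall(body)


# View 259 as multicast by the co-ordinator (copied from the GMS log lines).
view_259 = (
    "[172.16.11.20:35858|259] [172.16.11.20:35858, 172.16.11.11:51210, "
    "172.16.11.191:37204, 172.16.11.10:51918, 172.16.11.12:40087, "
    "172.16.11.13:38513, 172.16.11.13:38520, 172.16.11.13:38533, "
    "172.16.11.200:32790]"
)

# Suspected members as reported by GMS at the end of the excerpt.
suspected_line = (
    "suspected members=[172.16.11.13:38513, 172.16.11.13:38533, "
    "172.16.11.13:38520]"
)

members = parse_view(view_259)
suspected = set(ADDR.findall(suspected_line))

# Members that GMS has marked as suspected but that are still listed in the view.
stale = [m for m in members if m in suspected]

print("members in view:", len(members))
print("suspected but still in view:", stale)
```

All three suspected addresses are on 172.16.11.13, and all three are still listed in view 259, which is the behaviour we are asking about.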