Bela Ban resolved JGRP-706.
---------------------------
Resolution: Done
Applied patch.
with dynamic keys in ENCRYPT, a node taking over coordinator can
block itself from taking over the view
-------------------------------------------------------------------------------------------------------
Key: JGRP-706
URL: http://jira.jboss.com/jira/browse/JGRP-706
Project: JGroups
Issue Type: Bug
Affects Versions: 2.6.1
Environment: Linux test1a 2.6.18-8.el5 #1 SMP Thu Mar 15 19:46:53 EDT 2007 x86_64
x86_64 x86_64 GNU/Linux
CentOS.
Reporter: Gray Watson
Assigned To: Bela Ban
Fix For: 2.6.3, 2.7
Attachments: ENCRYPT.java.patch, ENCRYPT.patch
This is with JGroups 2.6.1 on Linux 2.6.18, in a cluster of 10-20 nodes with several
channels on each node.
Everything runs smoothly until we kill the coordinator (of one or more channels) plus the
next few nodes in the view, all at about the same time. SUSPECTs happen, and then the
highest-ranking remaining node decides to become coordinator. It tries to mcast a new
view but never gets any view ACKs, so nobody else sees the updated view and the channel
remains unusable. This happens only sometimes, and usually on just some of several
channels, even though they share the same stack config.
Enabling TRACE logging revealed that ENCRYPT.down() is queueing the new view: we use
dynamically generated keys, and apparently the view changes have flipped queue_down to
'true'. We see this, over and over:
2008-03-02 10:54:47,970 [TRACE] GMS mcasting view {[10.193.48.119:33274|693] [10.193.48.119:33274, 10.193.48.115:52962, 10.192.226.30:45130, 10.193.66.111:43645]} (4 mbrs)
2008-03-02 10:54:47,970 [TRACE] ENCRYPT queueing down message as no session key established[dst: <null>, src: <null> (1 headers), size=0 bytes]
2008-03-02 10:54:49,972 [WARN] GMS failed to collect all ACKs (4) for view [10.193.48.119:33274|693] [10.193.48.119:33274, 10.193.48.115:52962, 10.192.226.30:45130, 10.193.66.111:43645] after 2000ms, missing ACKs from [10.193.48.119:33274, 10.193.48.115:52962, 10.192.226.30:45130, 10.193.66.111:43645] (received=[]), local_addr=10.193.48.119:33274
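
The queueing behaviour in the trace can be illustrated with a small stand-alone model. This
is only a sketch and is not taken from the JGroups source: the class EncryptGateSketch and
all of its methods are hypothetical names, and it assumes that a new session key is only
established after the new view has been installed. Under that assumption, the view message
the new coordinator sends is held behind the same flag that only the view could clear, which
would match the missing ACKs above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical, simplified model of the reported behaviour.
// None of these names come from the real ENCRYPT protocol.
class EncryptGateSketch {
    // Mirrors the queue_down flag mentioned in the report.
    private boolean queueDown = false;
    // Messages held back while no session key is available.
    private final Queue<String> downQueue = new ArrayDeque<>();
    // Messages that actually left the node.
    final List<String> sent = new ArrayList<>();

    // Assumption: losing the coordinator invalidates the session key.
    void onCoordinatorLost() {
        queueDown = true;
    }

    // Assumption: a new key arrives only after the new view is installed.
    void onSessionKeyEstablished() {
        queueDown = false;
        while (!downQueue.isEmpty()) {
            sent.add(downQueue.poll());
        }
    }

    // Returns true if the message was sent, false if it was queued.
    boolean down(String msg) {
        if (queueDown) {
            downQueue.add(msg);
            return false;
        }
        sent.add(msg);
        return true;
    }

    int queued() {
        return downQueue.size();
    }

    public static void main(String[] args) {
        EncryptGateSketch node = new EncryptGateSketch();
        node.onCoordinatorLost();
        // The new coordinator tries to mcast its view, but the gate holds it.
        boolean delivered = node.down("VIEW 693");
        System.out.println("view sent immediately: " + delivered);
        System.out.println("queued messages: " + node.queued());
    }
}
```

In this model nothing ever calls onSessionKeyEstablished(), because that call depends on
the view that is sitting in the queue. Any fix would need to let the node taking over as
coordinator get its own view message past the gate.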