]
Bela Ban commented on JGRP-1376:
--------------------------------
To reproduce:
- Use attached encrypt.xml
- Start first Draw: draw -props ./encrypt.xml -name A
- Set discard to true in the GUI
- Start second Draw: the 2 draw instances will not be able to find each other
- Now select "Stop discarding"
- The 2 instances find each other, but the merge handling is incorrect:
91053 [TRACE] ENCRYPT: header received ENCRYPT [type=0 version="15 bytes"]
91053 [WARN] ENCRYPT: attempting to use stored cipher as message does not use current
encryption version
91053 [WARN] ENCRYPT: Unable to find a matching cipher in previous key map
91053 [WARN] ENCRYPT: Unrecognised cipher discarding message
ENCRYPT race condition hangs cluster channel
--------------------------------------------
Key: JGRP-1376
URL:
https://issues.jboss.org/browse/JGRP-1376
Project: JGroups
Issue Type: Bug
Affects Versions: 3.0
Reporter: Manuel Dominguez Sarmiento
Assignee: Bela Ban
Fix For: 3.0
Attachments: encrypt.xml
We have two nodes on a channel. When starting both simultaneously (we spawn threads from
a third control server), sometimes the JGroups channel hangs on both servers and we get
the following logs forever:
[WARN ] 2011-10-15 17:19:17 [ENCRYPT-Incoming-2,null] - attempting to use stored cipher
as message does not uses current encryption version
[WARN ] 2011-10-15 17:19:17 [ENCRYPT-Incoming-2,null] - Unable to find a matching cipher
in previous key map
[WARN ] 2011-10-15 17:19:17 [ENCRYPT-Incoming-2,null] - Unrecognised cipher discarding
message
[WARN ] 2011-10-15 17:19:17 [ENCRYPT-Incoming-3,null] - attempting to use stored cipher
as message does not uses current encryption version
[WARN ] 2011-10-15 17:19:17 [ENCRYPT-Incoming-3,null] - Unable to find a matching cipher
in previous key map
[WARN ] 2011-10-15 17:19:17 [ENCRYPT-Incoming-3,null] - Unrecognised cipher discarding
message
We drilled down to the code in ENCRYPT, and we believe this is related to issue JGRP-706.
That issue introduced a fix for a different problem when using dynamic keys which seems to
be causing problems when starting nodes simultaneously. This did not happen to us before
we moved to a simulatenous multi-node start scenario, so it seems to indicate there is
some kind of race condition which is triggered only when the nodes start at almost the
same time (they run the same hardware, OS, JVM, etc. so startup times are almost
identical)
This is the ENCRYPT config:
<ENCRYPT
sym_init="128"
sym_algorithm="AES"
asym_init="512"
asym_algorithm="RSA"
/>
The tipoff that something is wrong here, and what led us to study the patch introduced by
JGRP-706, is that we have two servers / nodes: one is named gargamel, and the other
azrael. The odd thing is that we got these logs in gargamel:
[INFO ] 2011-10-15 17:19:13 [ENCRYPT-Incoming-3,null] - handling tmp-view up:
MergeView::[azrael-8126|2] [azrael-8126, gargamel-29063], subgroups=[[azrael-8126|0]
[azrael-8126], [gargamel-29063|1] [gargamel-29063]]
[INFO ] 2011-10-15 17:19:13 [ENCRYPT-Incoming-3,null] - I have become key server
azrael-8126
There some kind of identity-crisis going on here - how can gargamel-29063 have become
azrael-8126? The preceding log line indicates that the patch introduced by JGRP-706 is at
play here. A TMP_VIEW event triggers the node into becoming key server. handleViewChange()
is invoked with parameter makeServer=true which triggers this line:
becomeKeyServer(tmpKeyServer);
where tmpKeyServer is azrael-8126
We are not really knowledgeable of what should actually happen here, in order for
everything to work properly, including the situation described by JGRP-706. However, it
seems that in any case, if the local node is to become key server, then the method
argument should probably be invoked becomeKeyServer(local_addr) - although I'm not
sure whether this is correct, either.
After this happens, we immediately start to get the log errors described at the beginning
of this report.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: