]
Bela Ban updated JGRP-2239:
---------------------------
Fix Version/s: 4.0.11 (was: 4.0.10)
AUTH + ASYM_ENCRYPT causes problem with re-joining cluster (MERGE)
------------------------------------------------------------------
Key: JGRP-2239
URL: https://issues.jboss.org/browse/JGRP-2239
Project: JGroups
Issue Type: Bug
Affects Versions: 4.0.6
Environment: Infinispan 9.1.1 + JGroups 4.0.6.Final + Vert.x 3.5.0
Reporter: Boris Sh
Assignee: Bela Ban
Fix For: 4.0.11
Hello,
I am using the following configuration:
{code:xml}
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns="urn:org:jgroups"
        xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
    <UDP />
    <PING />
    <MERGE3 />
    <FD />
    <VERIFY_SUSPECT />
    <ASYM_ENCRYPT encrypt_entire_message="true" sym_keylength="128"
                  sym_algorithm="AES/ECB/PKCS5Padding" asym_keylength="2048"
                  asym_algorithm="RSA" />
    <pbcast.NAKACK2 />
    <UNICAST3 />
    <pbcast.STABLE />
    <FRAG2 />
    <AUTH auth_class="org.jgroups.auth.X509Token" auth_value="auth"
          keystore_path="keystore.jks" keystore_password="pwd"
          cert_alias="alias"
          cipher_type="RSA" />
    <pbcast.GMS />
</config>
{code}
I have 7 services, but I will show logs for only two of them, the coordinator and one
random node; all the other nodes behave similarly.
Initially, when these nodes join the cluster, everything is fine.
The server is a shared machine with a slow CPU and a slow HDD, so sometimes, when other
applications are busy with their tasks, my whole cluster can freeze for 3-5 minutes.
During or at the end of such a freeze, a service may log the following:
{code:java}
org.jgroups.protocols.FD up
WARNING: node-26978: I was suspected by node-27291; ignoring the SUSPECT message and
sending back a HEARTBEAT_ACK
WARNING: node-26978: unrecognized cipher; discarding message from node-27291
org.jgroups.protocols.Encrypt handleEncryptedMessage
WARNING: node-26978: unrecognized cipher; discarding message from node-27291
org.jgroups.protocols.Encrypt handleEncryptedMessage
WARNING: node-26978: unrecognized cipher; discarding message from node-36734
org.jgroups.protocols.Encrypt handleEncryptedMessage
{code}
So the node was kicked out of the cluster because it became "suspect", but the
node does not accept that. The cluster coordinator has already changed the shared secret
(symmetric) key, so in the subsequent logs of this node I see "unrecognized cipher".
In the cluster coordinator's logs I see the following:
{code:java}
INFO: ISPN100000: Node node-26978 joined the cluster
****
WARN: node-27291: unrecognized cipher; discarding message from node-26978
org.jgroups.logging.Slf4jLogImpl error
ERROR: key requester node-26978 is not in current view [***]; ignoring key request
org.jgroups.logging.Slf4jLogImpl warn
WARN: node-27291: unrecognized cipher; discarding message from node-26978
INFO: ISPN000093: Received new, MERGED cluster view for channel ISPN:
MergeView::[node-26978|8] (7) [node-26978, node-12721, node-17625, node-45936, node-56674,
node-36734, node-27291], 2 subgroups: [node-27291|7] (6) [node-27291, node-12721,
node-17625, node-45936, node-56674, node-36734], [node-27291|6] (7) [node-27291,
node-26978, node-12721, node-17625, node-45936, node-56674, node-36734]
{code}
My understanding of what happened:
For example, I have 3 nodes {A, B, C} in the cluster. The cluster freezes for a few
minutes, so node {C} becomes suspected and is kicked out of the cluster by the coordinator.
For some reason {C} ignores that fact. Later, once the cluster is responsive again, it starts
ignoring messages from {C}, because it is using ASYM_ENCRYPT and the shared secret key has
been regenerated by the coordinator. Also, for some reason the MERGE operation doesn't work,
and {C} cannot rejoin the cluster, so the cluster now has 2 subgroups that don't
communicate with each other. I don't fully understand why this happens.
How I temporarily worked around this issue: I changed ASYM_ENCRYPT to SYM_ENCRYPT, and now
any node can come back to the cluster successfully after a freeze, as the key doesn't
change.
Also, I didn't test it, but I think change_key_on_leave="false" would help; however,
this is not the way I want to use ASYM_ENCRYPT.
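The two workarounds above could look roughly like the fragments below. This is only a sketch: each element would take the place of the ASYM_ENCRYPT element in the configuration shown earlier, the keystore file name, password and alias are placeholders (assumptions, not values from my setup), and the second variant is untested, as noted.
{code:xml}
<!-- Variant 1: the shared key is read from a keystore, so it never changes.
     The keystore_name, store_password and alias values are placeholders. -->
<SYM_ENCRYPT encrypt_entire_message="true" sym_algorithm="AES/ECB/PKCS5Padding"
             keystore_name="sym-keystore.jceks" store_password="pwd" alias="symkey" />

<!-- Variant 2 (untested): keep ASYM_ENCRYPT, but do not regenerate the
     shared key when a member leaves or is excluded from the view. -->
<ASYM_ENCRYPT encrypt_entire_message="true" sym_keylength="128"
              sym_algorithm="AES/ECB/PKCS5Padding" asym_keylength="2048"
              asym_algorithm="RSA" change_key_on_leave="false" />
{code}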
So it looks like this is a problem with the AUTH + ASYM_ENCRYPT protocol combination, where
a node in some cases cannot rejoin the cluster.