Bela Ban updated JGRP-2297:
---------------------------
Fix Version/s: 4.0.18
(was: 4.0.17)
Coordinator with ASYM_ENCRYPT in the stack does not leave gracefully
--------------------------------------------------------------------
Key: JGRP-2297
URL: https://issues.jboss.org/browse/JGRP-2297
Project: JGroups
Issue Type: Bug
Affects Versions: 4.0.14
Reporter: Radoslav Husar
Assignee: Bela Ban
Priority: Blocker
Fix For: 4.0.18
The {{ASYM_ENCRYPT_LeaveTest}} is designed to test the graceful leaving of coordinator(s) with
ASYM_ENCRYPT in the stack. However, the test currently passes because MERGE3 is present in
the stack: instead of the coordinator(s) leaving gracefully, the cluster ends up with
inconsistent views that are later resolved by MERGE3.
Here is a run of the test, modified so that only a *single* coordinator leaves:
https://gist.github.com/rhusar/89172882fae60a1f29327c33f2d124db
The problem seems to be with the coordination of the key exchange. In this run, roughly:
1. node 1 is leaving
2. node 2 becomes coordinator and key server
{noformat}
10:55:18.286 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.pbcast.GMS
- 2: installing view [2|10] (9) [2, 3, 4, 5, 6, 7, 8, 9, 10]
...
10:55:18.299 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] DEBUG
org.jgroups.protocols.ASYM_ENCRYPT - 2: I'm the new key server
10:55:18.300 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] DEBUG
org.jgroups.protocols.ASYM_ENCRYPT - 2: created new secret key (version:
AB1E6F44DE947D792A7D05D2E957AC85)
...
10:55:18.300 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] DEBUG
org.jgroups.protocols.ASYM_ENCRYPT - 2: created new secret key (version:
AB1E6F44DE947D792A7D05D2E957AC85)
{noformat}
3. node 9 receives {{FETCH_SECRET_KEY}} but then receives a stale key? It looks like it still
contacts the leaving coordinator, node 1?
{noformat}
10:55:18.319 [SSL_KEY_EXCHANGE-runner-12,ASYM_ENCRYPT_LeaveTest,1] DEBUG
org.jgroups.protocols.SSL_KEY_EXCHANGE - 1: accepted SSL connection from /127.0.0.1:51812;
protocol: TLSv1, cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
...
10:55:18.319 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG
org.jgroups.protocols.SSL_KEY_EXCHANGE - 9: created SSL connection to 2 (/127.0.0.1:2157);
protocol: TLSv1, cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
10:55:18.321 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG
org.jgroups.protocols.SSL_KEY_EXCHANGE - 9: sending up secret key (version:
AF7916A9394F49B085D4F35C4F5A0A3E)
10:55:18.321 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG
org.jgroups.protocols.ASYM_ENCRYPT - 9: ignoring secret key received from key exchange
protocol (version: AF7916A9394F49B085D4F35C4F5A0A3E), as it has already been installed
{noformat}
4. the new coordinator fails to collect all ACKs (since it cannot decipher the stale key?)
{noformat}
10:55:20.307 [jgroups-3,ASYM_ENCRYPT_LeaveTest,2] WARN org.jgroups.protocols.pbcast.GMS
- 2: failed to collect all ACKs (expected=8) for view [2|10] after 2000ms, missing 1 ACKs
from (1) 9
{noformat}
5. node 9 eventually obtains the key, but it has a stale view and still thinks node 1
is the coordinator(?), so it fails to contact it
{noformat}
10:55:20.307 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG
org.jgroups.protocols.ASYM_ENCRYPT - 9: asking key exchange protocol to get secret key
from 2
10:55:20.322 [SSL_KEY_EXCHANGE-runner-26,ASYM_ENCRYPT_LeaveTest,2] DEBUG
org.jgroups.protocols.SSL_KEY_EXCHANGE - 2: accepted SSL connection from /127.0.0.1:51829;
protocol: TLSv1, cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
10:55:20.322 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG
org.jgroups.protocols.SSL_KEY_EXCHANGE - 9: created SSL connection to 2 (/127.0.0.1:2158);
protocol: TLSv1, cipher suite: TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA
10:55:20.322 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG
org.jgroups.protocols.SSL_KEY_EXCHANGE - 9: sending up secret key (version:
AB1E6F44DE947D792A7D05D2E957AC85)
10:55:20.322 [jgroups-3,ASYM_ENCRYPT_LeaveTest,9] DEBUG
org.jgroups.protocols.ASYM_ENCRYPT - 9: installing secret key received from key exchange
protocol (version: AB1E6F44DE947D792A7D05D2E957AC85)
10:55:23.341 [TQ-Bundler-10,ASYM_ENCRYPT_LeaveTest,9] DEBUG org.jgroups.protocols.TCP -
JGRP000034: 9: failure sending message to 1: java.net.ConnectException: Connection refused
(Connection refused)
{noformat}
6. the cluster is later healed by MERGE3
{noformat}
10:55:27.103 [jgroups-27,ASYM_ENCRYPT_LeaveTest,2] DEBUG org.jgroups.protocols.pbcast.GMS
- 2: I will be the merge leader. Starting the merge task. Views: {2=[2|10] (9) [2, 3, 4,
5, 6, 7, 8, 9, 10], 9=[1|9] (10) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
{noformat}
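The divergence in step 6 can be detected mechanically from the merge log line, which may help when checking runs without waiting for MERGE3 to hide it. The following is only a minimal sketch in plain Java and is not part of JGroups or of the test: the class and method names ({{ViewDivergence}}, {{parseViewIds}}, {{staleMembers}}) are hypothetical, and it assumes the {{Views: {...}}} format shown in the log above, joined onto a single line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a JGroups class: reads the per-member views out of a
// GMS merge log line and reports members that are not on the expected view.
class ViewDivergence {
    // Matches one map entry such as "9=[1|9] (10) [1, 2, 3]":
    // group 1 = reporting member, group 2 = view creator, group 3 = view number.
    private static final Pattern ENTRY =
        Pattern.compile("(\\w+)=\\[(\\w+)\\|(\\d+)\\] \\(\\d+\\) \\[[^\\]]*\\]");

    // Returns member -> "creator|number" for every entry found in the line.
    static Map<String, String> parseViewIds(String logLine) {
        Map<String, String> viewIds = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(logLine);
        while (m.find()) {
            viewIds.put(m.group(1), m.group(2) + "|" + m.group(3));
        }
        return viewIds;
    }

    // Returns the members whose view id differs from the expected one.
    static List<String> staleMembers(Map<String, String> viewIds, String expectedViewId) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, String> e : viewIds.entrySet()) {
            if (!e.getValue().equals(expectedViewId)) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        // The map from the log line in step 6, joined onto one line.
        String line = "Views: {2=[2|10] (9) [2, 3, 4, 5, 6, 7, 8, 9, 10], "
                    + "9=[1|9] (10) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}";
        Map<String, String> viewIds = parseViewIds(line);
        // View [2|10] is the one installed by the new coordinator in step 2.
        System.out.println(staleMembers(viewIds, "2|10")); // prints [9]
    }
}
```

For the run above this reports node 9 as the only member still on the old view [1|9], which matches the missing ACK in step 4.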
Another run with MERGE3 omitted from the stack is here:
https://gist.github.com/rhusar/b51aeee03485a607041f9669bbc6e707
Further investigation is ongoing, but this might be related to the graceful leaving of the
coordinator (JGRP-2293) exacerbating the problem with key exchange in ASYM_ENCRYPT.
Scaling down is a typical cloud workflow, and {{ASYM_ENCRYPT}} is the recommended setup for
encryption, which makes this problem critical.