[
https://jira.jboss.org/jira/browse/JGRP-1190?page=com.atlassian.jira.plug...
]
Brian Stansberry updated JGRP-1190:
-----------------------------------
Attachment: JGRP-1190.patch
Attached is a patch with a possible simple solution. I modified
LazyRemovalCache.retainAll(Collection<K> keys, boolean) so it has the semantics of
indicating that whatever is in 'keys' is wanted and should no longer be marked
'removable'. Thus, in my example above, when C2:VIEW comes in, the entry added by
C2:ADD and marked removable by C1:VIEW is no longer marked 'removable'.
There is still a theoretical race: if something causes the cache to be purged before
C2:VIEW comes in, the C2:ADD entry would be lost. But with a reasonable
logical_cache_addr_expiration value I don't think that's a real-world problem.
With this patch, the AS testsuite cases that were failing now pass.
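For reference, here is a minimal sketch of the patched retainAll(keys, boolean) semantics. The class, field, and method names below are illustrative stand-ins, not JGroups' actual LazyRemovalCache code:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only; the real LazyRemovalCache internals and the
// exact meaning of the boolean argument are assumptions based on this report.
public class CacheSketch {
    private static final class Entry {
        final Object val;
        volatile boolean removable;
        Entry(Object val) { this.val = val; }
    }

    private final Map<String, Entry> map = new ConcurrentHashMap<>();

    public void add(String key, Object val) { map.put(key, new Entry(val)); }

    // Mark entries outside 'keys' removable; with the patched semantics,
    // also clear the mark on entries that ARE in 'keys', so a later view
    // from another channel on the shared TP "rescues" its entries.
    public void retainAll(Collection<String> keys, boolean unmarkRetained) {
        for (Map.Entry<String, Entry> e : map.entrySet()) {
            if (!keys.contains(e.getKey()))
                e.getValue().removable = true;
            else if (unmarkRetained)
                e.getValue().removable = false;
        }
    }

    public boolean isRemovable(String key) {
        Entry e = map.get(key);
        return e != null && e.removable;
    }

    public static void main(String[] args) {
        CacheSketch cache = new CacheSketch();
        cache.add("c2-addr", "physical-2");                   // C2:ADD
        cache.retainAll(List.of("c1-addr"), true);            // C1:VIEW
        System.out.println(cache.isRemovable("c2-addr"));     // true: marked removable
        cache.retainAll(List.of("c1-addr", "c2-addr"), true); // C2:VIEW rescues it
        System.out.println(cache.isRemovable("c2-addr"));     // false: mark cleared
    }
}
```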
(Background: the scenario described in this JIRA occurs on AS startup; later, the
testsuite connects and closes a bunch of channels on the same shared TP, creating more
logical addresses. Eventually the cache exceeds its max size and the entry incorrectly
marked removable is discarded, causing failures in later tests when messages need to be
sent to the removed address. For some reason this situation doesn't seem to be
recoverable, even though when TP.sendToSingleMember() discovers a missing physical
address, it sends up an event meant to trigger Discovery to ask for physical addresses.)
Race conditions in logical address caching with shared transport
----------------------------------------------------------------
Key: JGRP-1190
URL:
https://jira.jboss.org/jira/browse/JGRP-1190
Project: JGroups
Issue Type: Bug
Reporter: Brian Stansberry
Assignee: Bela Ban
Fix For: 2.10
Attachments: JGRP-1190.patch
The logical address caching (i.e. TP.logical_addr_cache) is prone to races when the
shared transport is used. JBoss AS's startup simultaneously connects 2 channels on a
shared transport. With 2.10.0-Alpha3 we're seeing problems with addresses for still
healthy members being removed from logical_addr_cache.
An entry is added to the cache when:
1) an event comes down the stack, i.e. from Channel.setAddress(), or
2) a Discovery GET_MBRS_REQ comes in, either from a remote node or from receipt of the
node's own message.
The primary mechanism for marking an entry for removal is a VIEW_CHANGE event coming down
the stack, which results in a retainAll() invocation on the cache; only addresses that
are part of the views of the channels associated with the shared TP are retained.
This can lead to the following kind of race, where C1 and C2 are 2 channels sharing the
TP, ADD is one of the events described above that adds to the cache, and VIEW is a
VIEW_CHANGE event coming down:
1) C1:ADD
2) C2:ADD
3) C1:VIEW --- oops -- whatever C2:ADD added is marked as removable
4) C2:VIEW
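The interleaving above can be sketched as a toy model (the map and methods here are illustrative stand-ins, not TP's actual fields), using the pre-patch, mark-only retainAll semantics:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model of the race: with a mark-only retainAll (pre-patch behavior),
// an entry wrongly marked removable at step 3 is never un-marked at step 4.
public class ViewRace {
    // logical address -> removable? (illustrative stand-in for the cache)
    static final Map<String, Boolean> cache = new HashMap<>();

    static void add(String addr) { cache.put(addr, false); }

    static void retainAll(Set<String> members) {
        for (Map.Entry<String, Boolean> e : cache.entrySet())
            if (!members.contains(e.getKey()))
                e.setValue(true); // marks, but never clears
    }

    public static void main(String[] args) {
        add("c1-addr");                           // 1) C1:ADD
        add("c2-addr");                           // 2) C2:ADD
        retainAll(Set.of("c1-addr"));             // 3) C1:VIEW -- C2 not in the union yet
        retainAll(Set.of("c1-addr", "c2-addr"));  // 4) C2:VIEW -- too late to help
        System.out.println(cache.get("c2-addr")); // true: still marked removable
    }
}
```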
Besides this larger issue, there is also a minor race in TP.handleDownEvent's
handling of address caches when there is a view change:
case Event.VIEW_CHANGE:
    synchronized(members) {
        View view=(View)evt.getArg();
        members.clear();
        if(!isSingleton()) {
            Vector<Address> tmpvec=view.getMembers();
            members.addAll(tmpvec);
        }
        else {
            // add all members from all clusters
            for(Protocol prot: up_prots.values()) {
                if(prot instanceof ProtocolAdapter) {
                    ProtocolAdapter ad=(ProtocolAdapter)prot;
                    Set<Address> tmp=ad.getMembers();
                    members.addAll(tmp);
                }
            }
        }
    }
    // fix for https://jira.jboss.org/jira/browse/JGRP-918
    logical_addr_cache.retainAll(members);
    UUID.retainAll(members);
    break;
The two retainAll calls at the end need to be inside the synchronized block. Otherwise if
TP is a shared transport, two threads can simultaneously be carrying down view changes. T1
proceeds through the synchronized block. Then while it is updating the address caches, T2
enters the sync block and begins manipulating 'members' with the result that the
data passed by T1 to retainAll is incomplete.
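A minimal sketch of the corrected ordering (a fragment only, based on the code quoted above, not a tested patch):

```java
case Event.VIEW_CHANGE:
    synchronized(members) {
        // ... rebuild 'members' from the view(s) as above ...

        // Moved inside the synchronized block: a concurrent view change
        // can no longer mutate 'members' while the caches are purged.
        logical_addr_cache.retainAll(members);
        UUID.retainAll(members);
    }
    break;
```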