Dipak Kothari commented on JGRP-594:
------------------------------------
I have been investigating this (and the NPE in JGRP-594) further and have found the
following:
1) At start-up, infrequently, members start in different groups because the initial
coordinator does not respond in time.
2) The MergeTask identifies the subgroups and initiates the merge.
3) The new view and digest get installed. The application also gets a view-change
callback.
4) The application, on identifying that the view is a MergeView, requests the state
(getState) from one of the members (say X) of the subgroup.
5) X returns its view and digest as they were before it received the MergeView event
(the log shows this event arriving afterwards).
6) The STATE_TRANSFER protocol handles the state response.
7) As FLUSH is not in the protocol stack, it sends a SET_DIGEST event down the stack, which
is picked up by NAKACK. NAKACK resets its digest (the digest it had was the correct merged
version) and replaces it with the one that arrived with the state. However, this digest
does not contain the local member, and so the ERROR messages mentioned above start to appear.
8) A subsequent put in ReplicatedHashMap fails with an NPE in NAKACK as it tries to add the
message to the NakReceiverWindow associated with the local address, which is no longer there.
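To make steps 7 and 8 concrete, here is a minimal toy model in plain Java. It is not JGroups code: DigestResetSketch, ToyNakack and their methods are hypothetical names invented for illustration, and a Long per sender stands in for a NakReceiverWindow. It only shows how replacing the merged digest with a stale one that lacks the local address makes the next send fail. The toy throws an explicit IllegalStateException where the real trace shows an NPE.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of steps 7 and 8. Not JGroups code; all names here are hypothetical. */
public class DigestResetSketch {

    /** Stand-in for NAKACK: one entry per sender, playing the role of a NakReceiverWindow. */
    public static class ToyNakack {
        private final String localAddr;
        // sender address -> highest sequence number seen from that sender
        private final Map<String, Long> windows = new HashMap<>();

        public ToyNakack(String localAddr) {
            this.localAddr = localAddr;
        }

        /** Drops every window and rebuilds from the given digest, as in step 7. */
        public void setDigest(Map<String, Long> digest) {
            windows.clear();
            windows.putAll(digest);
        }

        public boolean hasLocalWindow() {
            return windows.containsKey(localAddr);
        }

        /** Sending needs the local sender's window; it fails if the reset removed it (step 8). */
        public long send() {
            Long highest = windows.get(localAddr);
            if (highest == null) {
                throw new IllegalStateException("no window for local address " + localAddr);
            }
            long seqno = highest + 1;
            windows.put(localAddr, seqno);
            return seqno;
        }
    }

    public static void main(String[] args) {
        ToyNakack nakack = new ToyNakack("A");

        // Correct merged digest: contains the local member A.
        Map<String, Long> merged = new HashMap<>();
        merged.put("A", 10L);
        merged.put("B", 7L);
        nakack.setDigest(merged);
        System.out.println("seqno after merge: " + nakack.send());

        // Stale digest from X, taken before X saw the MergeView: A is missing.
        Map<String, Long> stale = new HashMap<>();
        stale.put("B", 5L);
        nakack.setDigest(stale);

        try {
            nakack.send();
        } catch (IllegalStateException e) {
            System.out.println("send failed: " + e.getMessage());
        }
    }
}
```

In this toy, merging the incoming digest into the existing one instead of replacing it would keep the entry for A. Whether that is the right fix for the real protocol is a separate question.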
So I tried adding the FLUSH protocol. This avoids the digest reset and the NPE, but the
same tests result in state that differs between members and never seems to converge to
the same state across all members of the group.
I have looked at the bug list between 2.5 and 2.6.1 and noticed that some work has
been done in the area of flush, GMS and state transfer, so I was thinking of working on
2.6.1 now rather than 2.5.0. One point though: without FLUSH I had the same issue with
2.6.1 as well, though it only happened once in a 24-hour soak test.
Intermittently, a Null pointer exception is thrown when trying to
remove non-existent entry in ReplicatedHashMap
----------------------------------------------------------------------------------------------------------------
Key: JGRP-594
URL: http://jira.jboss.com/jira/browse/JGRP-594
Project: JGroups
Issue Type: Bug
Affects Versions: 2.5
Environment: Linux
Reporter: Dipak Kothari
Assigned To: Bela Ban
Fix For: 2.7
Attachments: Server2.log
Intermittently, when an entry is removed from a ReplicatedHashMap (where the entry does
not exist) the following exception is thrown:
java.lang.RuntimeException: remove(APMExample.Services.examples.ServerA09) failed
at org.jgroups.blocks.ReplicatedHashMap.remove(ReplicatedHashMap.java:405)
at com.ubs.apm.control.service.nameservice.jgroup.JGroupNameService.unRegisterService(JGroupNameService.java:132)
at com.ubs.apm.control.sensors.ControlSensorManager.cleanup(ControlSensorManager.java:468)
at com.ubs.apm.control.sensors.ControlSensorManager.<init>(ControlSensorManager.java:125)
at com.ubs.apm.control.sensors.ControlSensorManager.<init>(ControlSensorManager.java:106)
at com.ubs.apm.control.example.ManagedServer.init(ManagedServer.java:20)
at com.ubs.apm.control.example.ManagedServer.main(ManagedServer.java:45)
Caused by: java.lang.RuntimeException: failed executing request [req_id=1189617222436
caller=14.64.61.201:6825
14.64.61.201:6838: sender=14.64.61.201:6838, retval=null, received=false,
suspected=false
.... many such lines ...
14.64.61.201:6860: sender=14.64.61.201:6860, retval=null, received=false,
suspected=false
14.64.61.201:6847: sender=14.64.61.201:6847, retval=null, received=false,
suspected=false
request_msg: [dst: <null>, src: 14.64.61.201:6825 (2 headers), size=143 bytes]
rsp_mode: GET_NONE
done: true
timeout: 5000
expected_mbrs: 0 ([14.64.61.201:6815, 14.64.61.201:6816, 14.64.61.201:6824,
14.64.61.201:6825, 14.64.61.201:6826, 14.64.61.201:6827, 14.64.61.201:6828,
14.64.61.201:6829, 14.64.61.201:6830, 14.64.61.201:6831, 14.64.61.201:6833,
14.64.61.201:6834, 14.64.61.201:6835, 14.64.61.201:6836, 14.64.61.201:6837,
14.64.61.201:6838, 14.64.61.201:6839, 14.64.61.201:6842, 14.64.61.201:6843,
14.64.61.201:6844, 14.64.61.201:6845, 14.64.61.201:6846, 14.64.61.201:6847,
14.64.61.201:6848, 14.64.61.201:6849, 14.64.61.201:6850, 14.64.61.201:6851,
14.64.61.201:6852, 14.64.61.201:6853, 14.64.61.201:6854, 14.64.61.201:6855,
14.64.61.201:6856, 14.64.61.201:6857, 14.64.61.201:6858, 14.64.61.201:6859,
14.64.61.201:6860, 14.64.61.201:6861, 14.64.61.201:6862, 14.64.61.201:6863,
14.64.61.201:6864, 14.64.61.201:6865, 14.64.61.201:6866])]
at org.jgroups.blocks.MessageDispatcher.castMessage(MessageDispatcher.java:433)
at org.jgroups.blocks.RpcDispatcher.callRemoteMethods(RpcDispatcher.java:199)
at org.jgroups.blocks.RpcDispatcher.callRemoteMethods(RpcDispatcher.java:167)
at org.jgroups.blocks.RpcDispatcher.callRemoteMethods(RpcDispatcher.java:163)
at org.jgroups.blocks.ReplicatedHashMap.remove(ReplicatedHashMap.java:402)
... 6 more
Caused by: java.lang.RuntimeException: failure adding msg [dst: <null>, src:
14.64.61.201:6825 (2 headers), size=143 bytes] to the retransmit table for
14.64.61.201:6825
at org.jgroups.protocols.pbcast.NAKACK.send(NAKACK.java:636)
at org.jgroups.protocols.pbcast.NAKACK.down(NAKACK.java:438)
at org.jgroups.protocols.pbcast.STABLE.down(STABLE.java:317)
at org.jgroups.protocols.pbcast.GMS.down(GMS.java:782)
at org.jgroups.protocols.pbcast.STATE_TRANSFER.down(STATE_TRANSFER.java:221)
at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:339)
at org.jgroups.JChannel.downcall(JChannel.java:1240)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:752)
at org.jgroups.blocks.RequestCorrelator.sendRequest(RequestCorrelator.java:301)
at org.jgroups.blocks.GroupRequest.doExecute(GroupRequest.java:440)
at org.jgroups.blocks.GroupRequest.execute(GroupRequest.java:190)
at org.jgroups.blocks.MessageDispatcher.castMessage(MessageDispatcher.java:430)
... 10 more
Caused by: java.lang.NullPointerException
at org.jgroups.protocols.pbcast.NAKACK.send(NAKACK.java:632)
... 21 more