Bela Ban resolved JGRP-1134.
----------------------------
Resolution: Done
This solution seems to work: I haven't seen an issue in any of the manual, unit or
performance tests.
Also, UNICAST2 might soon replace UNICAST, so this issue is no longer that important.
UNICAST.down(): move add to retransmitter out of the lock scope
---------------------------------------------------------------
Key: JGRP-1134
URL: https://jira.jboss.org/jira/browse/JGRP-1134
Project: JGroups
Issue Type: Task
Reporter: Bela Ban
Assignee: Bela Ban
Fix For: 2.10
In UNICAST.down(), we acquire a lock per sender to which we send a message:
    entry.lock(); // threads will only sync if they access the same entry
    try {
        seqno=entry.sent_msgs_seqno;
        send_conn_id=entry.send_conn_id;
        hdr=new UnicastHeader(UnicastHeader.DATA, seqno, send_conn_id, seqno == DEFAULT_FIRST_SEQNO);
        msg.putHeader(getName(), hdr);
        entry.sent_msgs.add(seqno, msg); // add *including* UnicastHeader, adds to retransmitter
        entry.sent_msgs_seqno++;
    }
    finally {
        entry.unlock();
    }
The call entry.sent_msgs.add() is costly: it adds the message to the hashmap, but also
to the retransmitter, which schedules a timer task etc.
The temporary solution is to split add() into 2 parts: one adds the message to the hashmap
(fast), the other adds it to the retransmitter (costly). The costly part is moved outside
the lock scope, for example:
    entry.lock(); // threads will only sync if they access the same entry
    try {
        seqno=entry.sent_msgs_seqno;
        send_conn_id=entry.send_conn_id;
        hdr=new UnicastHeader(UnicastHeader.DATA, seqno, send_conn_id, seqno == DEFAULT_FIRST_SEQNO);
        msg.putHeader(getName(), hdr);
        entry.sent_msgs.addToMessages(seqno, msg); // add *including* UnicastHeader, adds to hashmap
        entry.sent_msgs_seqno++;
    }
    finally {
        entry.unlock();
    }
    entry.sent_msgs.addToRetransmitter(seqno, msg); // adds to retransmitter
However, the issue is what happens if the addition to the retransmitter fails (e.g. due
to an OOME): then we'd have a message gap on the receiver!
SOLUTION:
#1 Do the add to the retransmitter in a loop. If there's a failure, sleep a bit and
try again, increasing the sleep time on each failure. Not very nice code, but it works and
doesn't ever *lose* a message. OK, if we get OOMEs, then something's wrong anyway, but
this covers temporary OOMEs.
#2 If there's an issue, set a flag. Next time around, we check the flag. If it is
set, we re-add all messages in the hashmap to the retransmitter. This involves locking the
hashmap and the retransmitter, but that's OK since this case should almost never happen
anyway!
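A minimal sketch of both options, to make the trade-off concrete. Everything here is hypothetical: Retransmitter, Sender, addWithRetry() and readdNeeded are illustration names, not JGroups classes or methods, and the sketch assumes that adding the same seqno to the retransmitter twice is harmless.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch only: none of these names are JGroups API.
class RetransmitterAddSketch {

    /** Hypothetical stand-in for the costly "add to retransmitter" step. */
    interface Retransmitter {
        void add(long seqno, Object msg);
    }

    /** Option #1: retry until the add succeeds, doubling the sleep time up to a cap.
     *  Returns the number of attempts that were needed. */
    static int addWithRetry(Retransmitter r, long seqno, Object msg,
                            long initialSleepMs, long maxSleepMs) {
        long sleepMs = initialSleepMs;
        for (int attempt = 1; ; attempt++) {
            try {
                r.add(seqno, msg);
                return attempt;
            } catch (RuntimeException | OutOfMemoryError e) {
                // possibly a temporary failure: back off below and try again
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while re-adding seqno " + seqno, ie);
            }
            sleepMs = Math.min(sleepMs * 2, maxSleepMs);
        }
    }

    /** Option #2: on failure set a flag; the next send re-adds everything in the map. */
    static final class Sender {
        final Map<Long, Object> messages = new ConcurrentSkipListMap<>();
        final AtomicBoolean readdNeeded = new AtomicBoolean(false);
        private final Retransmitter retransmitter;

        Sender(Retransmitter retransmitter) {
            this.retransmitter = retransmitter;
        }

        void send(long seqno, Object msg) {
            messages.put(seqno, msg);        // fast part (inside entry.lock() in the issue's code)
            addToRetransmitter(seqno, msg);  // costly part, outside the lock
        }

        private void addToRetransmitter(long seqno, Object msg) {
            try {
                if (readdNeeded.getAndSet(false)) {
                    // Rare path: an earlier add failed, so re-add every stored message
                    // (this includes the current one, which is already in the map).
                    synchronized (this) {
                        for (Map.Entry<Long, Object> e : messages.entrySet())
                            retransmitter.add(e.getKey(), e.getValue());
                    }
                } else {
                    retransmitter.add(seqno, msg);
                }
            } catch (RuntimeException | OutOfMemoryError e) {
                readdNeeded.set(true);       // remember the failure for the next send
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        Retransmitter flaky = (seqno, msg) -> {
            if (++calls[0] < 3) throw new OutOfMemoryError("simulated temporary OOME");
        };
        System.out.println("attempts: " + addWithRetry(flaky, 1L, "msg-1", 10, 1000));
    }
}
```

In this sketch #1 blocks the sending thread until the add succeeds, while #2 never blocks but leaves the failed message out of the retransmitter until the next send to the same destination.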