[
https://issues.jboss.org/browse/ISPN-1602?page=com.atlassian.jira.plugin....
]
Bela Ban commented on ISPN-1602:
--------------------------------
Erik, I see addresses which include a siteId. Can I assume you're using RELAY then? If
you are, the weird thing is that I see more than 2 siteIds, which the current RELAY doesn't
support. How come I'm seeing these siteIds?
If you run
    grep "protocols.FD" *.log | grep suspect
you'll see that FD did indeed suspect a few members:
- dht12 suspected dht11 multiple times
- dht12 suspected dht11 because it didn't receive a heartbeat for 24s
- dht11 also suspects dht12...
So dht11 and dht12 suspected each other, and those were the only members ever suspected by
FD. (Note that FD_SOCK did *not* suspect anyone.)
Again, possible causes are:
- GC. Are you sure you really only observed GCs under 500ms? That's kind of hard to believe, as
  GC can take ~1s / GB, and you mentioned the other day that you had 6GB of heap...
- High traffic can cause heartbeats to be lost
- Busy CPU: the CPU is busy processing, and this can slow down the sending of heartbeat-acks,
  too
- Thread pools sized too small
- Thread pools with a rejection policy of "run": this is dangerous because, *if*
  the thread which received a message has to process it (and blocks in app code), it will
  not be able to process other messages!
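To illustrate the last point, here is a minimal sketch. It uses plain java.util.concurrent rather than JGroups itself, on the assumption that a "run" rejection policy behaves like ThreadPoolExecutor.CallerRunsPolicy. The class name CallerRunsDemo, the thread name "receiver" and the pool sizing are made up for the example. It shows that once the pool is saturated, the submitting thread executes the task itself, so it cannot pick up anything else until that task returns.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class CallerRunsDemo {

    /** Returns the name of the thread that ends up running the overflow task. */
    static String threadThatRanOverflowTask() throws InterruptedException {
        // One worker and no queue capacity: a second concurrent task is always rejected.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(),
                new ThreadPoolExecutor.CallerRunsPolicy());

        CountDownLatch workerBusy = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        AtomicReference<String> ranOn = new AtomicReference<>();

        // Hypothetical stand-in for the thread that receives messages.
        Thread receiver = new Thread(() -> {
            // The first task occupies the only worker (stands in for slow app code).
            pool.execute(() -> {
                workerBusy.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            try {
                workerBusy.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            // The pool is saturated, so CallerRunsPolicy runs this task on the
            // submitting thread; the receiver is blocked for as long as it takes.
            pool.execute(() -> ranOn.set(Thread.currentThread().getName()));
            release.countDown();
        }, "receiver");

        receiver.start();
        receiver.join();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return ranOn.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("overflow task ran on: " + threadThatRanOverflowTask());
    }
}
```

The overflow task is reported as having run on the "receiver" thread, not on a pool thread. If that task blocked in application code, the receiver would be stuck with it, which is the situation described above.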
This brings me back to what I said before:
#1 Please use one of the default configurations shipped with JGroups 3.0.1 (e.g.
udp-largecluster.xml), and make only minor modifications (e.g. thread pool sizes)!
#2 Increase the timeouts in FD
#3 If you use RELAY, then be aware that we haven't tested it yet in the context of
Infinispan 5.x and JGroups 3.x! This task will be started soon, but for now the behavior
of rebalancing / state transfer in Infinispan is completely undefined/untested for 5.x/3.x!
Single view change causes stale locks
-------------------------------------
Key: ISPN-1602
URL: https://issues.jboss.org/browse/ISPN-1602
Project: Infinispan
Issue Type: Bug
Components: Core API
Affects Versions: 5.1.0.CR1
Reporter: Erik Salter
Assignee: Dan Berindei
Priority: Critical
Fix For: 5.1.0.CR2
During load testing of 5.1.0.CR1, we're seeing JGroups 3.x drop views. We
know from ISPN-1581 that if the number of view changes is > 3, there could be a stale lock
on a failed commit. However, we're seeing stale locks occur on a single view change.
In the following logs, the affected cluster is erm-cluster-xxxx.
(We also don't know why JGroups 3.x is unstable. We suspected FLUSH and incorrect FD
settings, but we removed them, and we're still getting dropped messages.)
The trace logs (it isn't long at all before the issue occurs) are at:
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht10/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht11/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht12/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht13/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht14/server.l...
http://dl.dropbox.com/u/50401510/5.1.0.CR1/dec08viewchange/dht15/server.l...