Bela Ban wrote:
Manik Surtani wrote:
> Ok - so we could add an extra check into the view change listener to
> force an unblock if a member who initiated a FLUSH dies. We would
> also have to record the address of the member initiating the FLUSH in
> the flushBlockGate.
+1.
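For what it's worth, a minimal sketch of what that check could look like. This is hypothetical, not the actual FLUSH code: the flushBlockGate is modeled as a CountDownLatch and member addresses as plain strings rather than the real JGroups Address type.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of the proposed fix: record the FLUSH initiator
// when blocking, and force an unblock from the view-change listener
// if that member is no longer in the new view.
class FlushGateSketch {
    private final CountDownLatch flushBlockGate = new CountDownLatch(1);
    private volatile String flushInitiator; // would be an Address in JGroups

    // Called when a FLUSH blocks the channel.
    void onBlock(String initiator) {
        flushInitiator = initiator; // remember who initiated the FLUSH
    }

    // Called from the view-change listener with the new membership.
    void onViewChange(List<String> newMembers) {
        // The initiator died (left the view) without completing the FLUSH:
        // open the gate so waiting threads are not stuck forever.
        if (flushInitiator != null && !newMembers.contains(flushInitiator)) {
            flushBlockGate.countDown();
        }
    }

    boolean isUnblocked() {
        return flushBlockGate.getCount() == 0;
    }
}
```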
I'm also puzzled as to why this could have evaded detection for so
long... Have we started to use channels in a different way now, e.g.
concurrent startup?
3 possible factors:
1) The AS now creates/starts a cache when it needs it, i.e. as part of
deploying a clustered webapp or SFSB, rather than at AS start. The
effect is that a testsuite run involves far more starting/stopping of
caches and their associated channels than it used to, so intermittent
failures will show up more often. This is a fairly old change, though;
but it certainly increases the odds of these failures versus, say, the
first half of this year.
2) The AS upgraded to 2.6.5 on October 15. I saw and reported an
intermittent flush failure on October 18. There may be a relationship;
I don't know.
3) In our second week in Brno, I changed the protocol stacks to set
min_threads > 1 on the thread pools. Clebert seems to feel the JBM
failures he started reporting popped up when he changed the stack JBM
tests against to match the AS stack; setting localhost=true and
min_threads > 1 were the most significant changes.
--
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat
brian.stansberry(a)redhat.com