Re: JGroups and concurrent FLUSHes
by Brian Stansberry
Sounds like an overly large change for a micro release on the highly
stable 2.6 branch. JGroups' 2.7 branch seems more appropriate. NBST in
JBC is not stable tech, so I don't like the idea of destabilizing a
highly stable branch to cater to it.
From a technical POV, what's discussed sounds fine. ;)
Bela Ban wrote:
> Copied Brian, who will probably not like this. I guess... :-)
>
> Actually, we should move this discussion over to jbosscache-dev...
> (copied). Please reply to jbosscache-dev from now on
>
> Vladimir Blagojevic wrote:
>> As Bela and I discussed, this option simplifies FLUSH quite a bit but puts
>> the burden and the *freedom* of retry management on application code.
>> My only concern is compatibility if we are going to stick this into
>> 2.6.9. This changes flush semantics quite a bit and perhaps we can
>> talk about this as well.
>>
>> On 2/27/09 9:52 AM, Bela Ban wrote:
>>> Manik and I discussed this over the phone. Some items we came up with:
>>>
>>> * Concurrent partial flushes won't happen because if a state
>>> requester or provider is in the process of transferring state,
>>> it'll reject the new state transfer
>>> * Concurrent total and partial flushes *are* possible: a view change
>>> and a partial state, executed concurrently
>>> o We cannot disable flushing for view changes, because the
>>> user probably wanted this when placing FLUSH into the stack.
>>> OTOH, because JBC NBST requires FLUSH (because of the
>>> partial flushing), we need to be able to disable flushing
>>> for view changes. Hence the previous email.
>>> o If a view change flush is in progress, currently a partial
>>> flush would fail. Manik will change code to make the partial
>>> flush back off and retry, up to a number of times, if
>>> this happens
>>> o If a view change happens, but a partial flush is already in
>>> progress, the view change will fail ! We'll change code such
>>> that the coordinator backs off and retries a number of
>>> times, before giving up.
>>> o Because total flushes and partial flushes are usually very
>>> short, the backing off and retrying mechanism should work
>>> most of the time
>>> o Would establishing total order between total and partial
>>> flushes help ? Vladimir and Bela to investigate
>>> o Vladimir: make sure that if A flushes A,B and C
>>> successfully, but fails for D and E, A only aborts the flush
>>> on A,B and C, but *not* on D or E ! A member cannot abort
>>> flushes started by someone else !
>>>
>>>
>>
>
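For reference, the back-off-and-retry scheme described in the quoted items could
be sketched roughly like this. This is only an illustration: the method name,
retry parameters and BooleanSupplier callback are made up, not actual
JBC/JGroups API.

```java
import java.util.function.BooleanSupplier;

public class FlushRetry {

    /**
     * Attempts a (partial or total) flush, backing off and retrying when a
     * concurrent flush causes the attempt to be rejected.
     *
     * @param tryFlush      returns true if the flush was acquired, false if a
     *                      concurrent flush caused it to be rejected
     * @param maxRetries    number of retries after the first attempt
     * @param backoffMillis initial back-off, doubled after each failed attempt
     */
    public static boolean flushWithRetry(BooleanSupplier tryFlush,
                                         int maxRetries,
                                         long backoffMillis) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (tryFlush.getAsBoolean())
                return true;                      // flush acquired
            try {
                Thread.sleep(backoffMillis);      // back off before retrying
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;                     // treat interruption as giving up
            }
            backoffMillis *= 2;                   // simple exponential back-off
        }
        return false;                             // give up after maxRetries
    }

    public static void main(String[] args) {
        // Simulate a concurrent total flush that is busy for the first two attempts.
        int[] calls = {0};
        boolean ok = flushWithRetry(() -> ++calls[0] >= 3, 5, 10);
        System.out.println("acquired=" + ok + " attempts=" + calls[0]);
        // prints acquired=true attempts=3
    }
}
```

Since total and partial flushes are usually short, a small initial back-off with
a handful of retries should resolve most collisions, as noted above.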
--
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat
brian.stansberry(a)redhat.com
15 years, 9 months
NBST + persistent state
by Manik Surtani
Hi guys
I might have found a potential flaw in the NBST impl that we have.
The way it works, transient state is written to the stream, and when
this is done, the tx log is flushed. When the tx log is small, a
FLUSH is requested and then the remaining tx log is written.
But the flush isn't lifted here - the flush is only lifted once the
state transfer process is complete, and in between these 2 events is
the generation, streaming and storing of persistent state. Which is
potentially much bigger than transient state, and could mean that the
flush is in place for much longer than expected.
Any thoughts here, Jason? Perhaps a solution would be to stream
persistent state before the tx log is flushed? Could lead to wacky
stream multiplexing on the receiver's end though.
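To make the concern concrete, here is a rough sketch of the two orderings; the
phase names are made up for illustration and are not the actual NBST code. It
only shows which phases fall inside the flush window in each case:

```java
import java.util.Arrays;
import java.util.List;

public class FlushWindow {

    // Current ordering: persistent state is generated and streamed while
    // the cluster-wide flush is still in place.
    public static final List<String> CURRENT = Arrays.asList(
        "transient state", "tx log", "FLUSH",
        "remaining tx log", "persistent state", "FLUSH stop");

    // Suggested reordering: stream persistent state before the flush starts.
    public static final List<String> PROPOSED = Arrays.asList(
        "transient state", "persistent state", "tx log", "FLUSH",
        "remaining tx log", "FLUSH stop");

    /** Phases executed while the cluster-wide flush is in place. */
    public static List<String> flushWindow(List<String> phases) {
        return phases.subList(phases.indexOf("FLUSH") + 1,
                              phases.indexOf("FLUSH stop"));
    }

    public static void main(String[] args) {
        System.out.println("current : " + flushWindow(CURRENT));
        System.out.println("proposed: " + flushWindow(PROPOSED));
        // prints:
        // current : [remaining tx log, persistent state]
        // proposed: [remaining tx log]
    }
}
```

The sketch just restates the trade-off: moving persistent state out of the
flush window shortens the window, at the cost of the receiver-side stream
multiplexing mentioned above.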
--
Manik Surtani
Lead, JBoss Cache
http://www.jbosscache.org
manik(a)jboss.org
Stress testing non-blocking state transfer
by Manik Surtani
This is a quick status update for those interested.
We've been running some tests with NBST and overcame initial
weirdnesses with the HTTP session replication stress test, mainly
around config issues (not using MVCC, not using
STREAMING_STATE_TRANSFER in JGroups). These are now tested for in the
JBC configs.
Anyway, the latest run went somewhat well, in that the basic state
transfer worked. The test involved async replication, with 4 nodes.
Here is what happened:
1. 4 nodes started. State transferred accordingly, no problems.
2. Load applied to all servers via a load balancer.
3. Node 4 killed.
4. Node 3 killed.
5. Node 2 was observed to throw an OOM at this point.
We're examining what happened on Node 2 at this stage; stay tuned for more
info.
Cheers
--
Manik Surtani
Lead, JBoss Cache
http://www.jbosscache.org
manik(a)jboss.org
JBC 3.0.3.CR1/GA
by Manik Surtani
Hello everyone
Is there any more feedback on 3.0.3.CR1 that I cut last week? So far
the only one I have got is this:
https://jira.jboss.org/jira/browse/JBCACHE-1479
for which I am testing a fix right now. Are people happy with a
3.0.3.GA being cut over the next few days, essentially consisting of
3.0.3.CR1 + JBCACHE-1479 ?
Cheers
--
Manik Surtani
Lead, JBoss Cache
http://www.jbosscache.org
manik(a)jboss.org
EAP Compatibility Requirements
by Jason T. Greene
Hi Folks,
I am trying to get an understanding about what the specific
compatibility requirements are between EAP major releases, so that we
know what our limitations are with AS community releases.
The release taxonomy for EAP [1] states that for a major EAP release
"Compatibility is not guaranteed but must be managed and upgrades must
be made simple for users through automated tools and documentation."
This however seems to be focused on API and feature-set. What about wire
compatibility for our proprietary protocols? More specifically, is there
any requirement that an EAP4 JMS or Remoting client be able to talk to
an EAP5 JMS or Remoting server?
Such a requirement would mean that Remoting 3 and JBM 2 would have to
have wire level compatibility with their previous major versions before
they would ever be included as the default in an AS release. So there is
a big cost here.
[1] https://docspace.corp.redhat.com/clearspace/docs/DOC-6430
--
Jason T. Greene
JBoss, a division of Red Hat
Multiple Channel.getState() invocations
by Bela Ban
Vladimir, Jason,
what happens when we have multiple Channel.getState() invocations with
STREAMING_STATE_TRANSFER ?
Let's say I invoke Channel.getState() once, and then time out waiting
for the state response. On the state provider, this caused the spawning
of a thread which handles the state request.
Now I call Channel.getState() the 2nd time. Does this mean that
1. there is a 2nd thread spawned by the state provider ?
2. the 2nd Channel.getState() could actually return when the *first*
state requester thread sends back the state ?
If #1 is true, and we spawn 1 thread per state request, then that means
we're traversing the tree *once per thread* ! This is probably not a big
deal in STREAMING_STATE_TRANSFER, because we're only allocating a socket
and output stream, but we're still traversing the tree once per thread
and using up CPU cycles for serialization !
Or is this completely wrong and we're just adding an incoming request to
the request queue if there is already a state transfer request being
processed ? This is what is done in STATE_TRANSFER, don't know about
STREAMING_STATE_TRANSFER. Can you verify what's going on here ?
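A sketch of the queue-instead-of-spawn idea described above (all names are
illustrative, not the actual STREAMING_STATE_TRANSFER code): incoming requests
are only enqueued, and a single traversal of the tree serves every queued
requester, instead of one traversal per spawned thread:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class StateProvider {
    private final Queue<String> pending = new ArrayDeque<>();
    private int traversals = 0;

    /** Incoming state request: just enqueue, never spawn a new thread. */
    public synchronized void onStateRequest(String requester) {
        pending.add(requester);
    }

    /**
     * One iteration of the provider's worker: a single tree traversal
     * (i.e. one serialization pass) serves every queued requester.
     * Returns the number of requesters served.
     */
    public synchronized int serveQueued() {
        if (pending.isEmpty())
            return 0;
        traversals++;                        // serialize the tree ONCE...
        int served = 0;
        while (!pending.isEmpty()) {
            pending.poll();                  // ...and ship it to each requester
            served++;
        }
        return served;
    }

    public synchronized int traversals() { return traversals; }

    public static void main(String[] args) {
        StateProvider p = new StateProvider();
        p.onStateRequest("A");   // first getState()
        p.onStateRequest("A");   // retry after the first one timed out
        p.onStateRequest("B");
        System.out.println("served=" + p.serveQueued()
                           + " traversals=" + p.traversals());
        // prints served=3 traversals=1
    }
}
```

In this scheme a repeated getState() from the same requester costs one queue
entry, not an extra traversal, which is the property in question for
STREAMING_STATE_TRANSFER.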
Jason: we should consider trashing the loop where we call
Channel.getState() on a member and continue the loop if getState() runs
into a timeout. I suggest we set the timeout to 0, or at least to a
large value. This means we'll wait forever for a large state, or return
with rc=false. This would only be the case e.g. when the state cannot be
serialized, i.e. when hitting a non-serializable object.
Thoughts ?
--
Bela Ban
Lead JGroups / Clustering Team
JBoss - a division of Red Hat
Re: NBST failing on my Mac
by Jason T. Greene
Bela Ban wrote:
> So it *does* throw an exception ?
>
> I didn't look in the code, but my TestNG errors on my Mac were caused by
> none of the members being able to provide initial state, and therefore
> the get() assertions after the 2nd node joined, failed.
>
> I would have expected an exception right ? I'll dig into this tomorrow
> again..
>
Moving this to the dev list since Brian mentioned we are skipping folks.
--
Jason T. Greene
JBoss, a division of Red Hat
-tests-sources.jar?
by Galder Zamarreno
Hi,
When doing a "mvn install", I've realised that the following files are
installed in my local repo:
[galder@localhost]~/jboss/code/cache/core-trunk% ls
~/.m2/repository/org/jboss/cache/jbosscache-core/3.1.0-SNAPSHOT -al
total 3816
drwxr-xr-x 2 galder galder 4096 2009-02-10 22:24 .
drwxr-xr-x 5 galder galder 4096 2009-02-11 12:19 ..
-rw-r--r-- 1 galder galder 1137186 2009-02-11 11:57
jbosscache-core-3.1.0-SNAPSHOT.jar
-rw-r--r-- 1 galder galder 22315 2009-02-11 11:57
jbosscache-core-3.1.0-SNAPSHOT.pom
-rw-r--r-- 1 galder galder 1059047 2009-02-11 11:57
jbosscache-core-3.1.0-SNAPSHOT-sources.jar
-rw-r--r-- 1 galder galder 1655882 2009-02-11 11:57
jbosscache-core-3.1.0-SNAPSHOT-tests.jar
-rw-r--r-- 1 galder galder 318 2009-02-11 11:57 maven-metadata-local.xml
What about jbosscache-core-3.1.0-SNAPSHOT-tests-sources.jar?
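Assuming the build uses the standard maven-source-plugin, its test-jar goal
should attach the test sources as a jar with classifier test-sources (so the
file would be named -test-sources.jar rather than -tests-sources.jar);
something like:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-source-plugin</artifactId>
  <executions>
    <execution>
      <id>attach-sources</id>
      <goals>
        <goal>jar</goal>      <!-- -sources.jar, already produced -->
        <goal>test-jar</goal> <!-- adds -test-sources.jar -->
      </goals>
    </execution>
  </executions>
</plugin>
```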
Cheers,
--
Galder Zamarreño
Sr. Software Maintenance Engineer
JBoss, a division of Red Hat
Important! Reverted all commits in trunk after 7632
by Jason T. Greene
All changes in trunk made after 7632 introduce major test regressions.
In order to get things back, each commit needs to be merged [1] into your
local tree, and then only AFTER you fix them and verify there are no
test regressions, commit them.
[1] example: svn up; svn merge -c 7635
--
Jason T. Greene
JBoss, a division of Red Hat