The differences in the control channel config:
UDP.singleton_name. In the AS this instance of the UDP transport protocol is shared
across numerous channels. Giving the transport a name is what allows this to work.
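For instance, the shared transport is declared by naming the protocol instance, roughly
like this (the name here is illustrative, not the actual AS value):

  <UDP singleton_name="shared-udp" ... />

Any channel whose stack declares a UDP transport with the same singleton_name shares that
one transport instance instead of creating its own sockets and threads.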
UDP.mcast_addr and mcast_port. These are just the values for the shared transport rather
than the old JBM-specific ones. I don't see what difference this would make.
UDP.loopback=true. We were seeing inscrutable startup failures for the AS when this was
false on machines that had improperly configured multicast. With true you get startup
failures (nodes can't cluster because multicast doesn't work) but they aren't
inscrutable. I'd consider changing this back if we could somehow establish it's a
cause of whatever your problem is.
UDP.enable_bundling=false. This just makes explicit in the config file the default you
were already getting before; no change in behavior.
UDP.ip_ttl=2. Longstanding AS default value to limit multicast propagation. In most
testsuites, multicast doesn't even need to propagate off the test machine, so I doubt
this is your problem.
UDP.timer.num_threads=12. Left unspecified, this defaults to 4. Since the transport is
meant to be shared between different services in the AS, the number of threads available
to run timer tasks is increased.
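Pulling the transport-level attributes from the last few items into one place, the
relevant slice of the <UDP> element looks roughly like this (the address and port are
placeholders; the rest are the values discussed above):

  <UDP singleton_name="shared-udp"
       mcast_addr="230.0.0.4"
       mcast_port="45688"
       loopback="true"
       enable_bundling="false"
       ip_ttl="2"
       timer.num_threads="12"
       ... />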
UDP.thread_pool.min_threads="20". Old value = 1. With
thread_pool.queue_enabled="true" and
thread_pool.queue_max_size="1000", once that 1 min_thread was carrying a message
up the stack or handling it at the application level, you would have to receive 1000 more
messages and fill the queue before a 2nd thread would be created in the pool to handle to
take a message off the queue. With a shared transport, it's possible those messages
are for completely unrelated services and while the 1 thread is busy say in the session
replication cache, 1000 JBM messages pile up in the queue. You need a larger # of min
threads to ensure threads are available to read the queue. Testing with just one showed
very poor performance in multi-node clusters. I can't see why having more threads
available in a pool would cause a problem.
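In config terms, the receive-pool settings look something like this (max_threads here is
an illustrative value, not one under discussion):

  <UDP ...
       thread_pool.min_threads="20"
       thread_pool.max_threads="200"
       thread_pool.queue_enabled="true"
       thread_pool.queue_max_size="1000"
       ... />

This is standard java.util.concurrent.ThreadPoolExecutor behavior: the pool only grows
past min_threads once the queue is full, which is exactly why min_threads="1" plus a
1000-entry queue produced the pile-up described above.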
UDP.thread_pool.rejection_policy="discard". Was "run". You can hang
the entire cluster with "run", since it allows the single thread that reads
messages off the wire to end up going into code that blocks in NAKACK or UNICAST or even
into arbitrary application code. With multi-node clusters in tests under load, we found
it was quite easy to hang the cluster with "run".
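For reference, these values map onto the standard java.util.concurrent rejection
handlers: "run" is caller-runs, meaning the thread that submitted the task (here, the
one reading off the socket) executes it itself, while "discard" silently drops the
rejected message and relies on the retransmission protocols (NAKACK/UNICAST) to recover
it. Hence:

  <UDP ...
       thread_pool.rejection_policy="discard"
       ... />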
UDP.oob_thread_pool.max_threads="20". See
UDP.thread_pool.min_threads="20" above.
UDP.oob_thread_pool.rejection_policy="run" Was "Run". This is just
consistency in capitalization.
FD.timeout and max_tries. With the old values, it would take 50 seconds to detect a hung
node. That's a long time. It was reasonable in AS 4, where the single-threaded channel
and lack of an OOB thread pool made it quite possible for FD heartbeats to go
unacknowledged for a long time while the single thread was busy doing something else.
With the thread pools and OOB messages, there's no reason FD heartbeats should go
unacknowledged for so long, so we reduced the overall detection time to 30 seconds.
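For the arithmetic: FD suspects a node after roughly timeout x max_tries with no
heartbeat ack, so, purely as an illustration, values like these yield the 30 second
figure (I'm not quoting the exact AS values here):

  <FD timeout="10000" max_tries="3" ... />

(And, e.g., timeout="10000" with max_tries="5" would give the old 50 seconds.)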
GMS.shun="true". Previous value of false makes no logical sense in conjunction
with FD.shun="true". This was discussed on a JBM forum thread a while back.
TBH, I don't see why any of these would cause the error you reported, but until I see
more details I don't really know what the error was.