Discussion (perhaps with myself) re: logs attached to
http://jira.jboss.com/jira/browse/JBPAPP-653.
Couple indications point to a problem with perf02 receiving messages.
1) perf04, which would use perf02 as a buddy no longer logs remoting's
"closing" messages after this:
2008-02-21 08:33:32,828 DEBUG [org.jboss.remoting.transport.socket.SocketWrapper]
ServerSocketWrapper[Socket[addr=/172.17.68.254,port=33698,localport=3873].4d8918] closing
Those messages are logged at the end of an EJB request; not logging means the request
isn't completed yet. Lot's of logging of intermediate steps in request, but no
completion.
The cache is REPL_ASYNC, so that implies to me the requests are blocking in JGroups FC.
(No JBC locking issues reported).
2) On perf03, a similar pattern of no "closing" begins at 08:33:35,130. But then
you get a few of these:
2008-02-21 08:34:01,393 WARN [org.jgroups.protocols.FC] Received two credit requests from
172.17.65.39:40890 without any intervening messages; sending 1999633 credits
Indicates perf02 is not getting FC credits from perf03.
Then this:
2008-02-21 08:34:53,627 WARN [org.jgroups.protocols.FD] I was suspected by
172.17.65.39:40890; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
again indicating non-delivery of heartbeat responses to perf02.FD.
All 3 live nodes begin at roughly 08:33:30 a pattern of logging EJB request activity but
no "closing". Looks like everyone is blocking waiting for FC credits. Any
multicasts in the message mix? Only one that comes to mind is data gravitation. But the
troublesome logging pattern begins nearly a minute after perf01 is stopped, so why would
there still be a lot of data gravitation going on?
Another weird thing: when perf01 restarts, its Tomcat cache (not the EJB3 SFSB one under
load) has trouble joining the group:
2008-02-21 08:34:41,546 WARN [org.jgroups.protocols.pbcast.GMS] join(172.17.64.39:52927)
sent to 172.17.65.39:40883 timed out, retrying
2008-02-21 08:34:48,553 WARN [org.jgroups.protocols.pbcast.GMS] join(172.17.64.39:52927)
sent to 172.17.65.39:40883 timed out, retrying
2008-02-21 08:34:55,559 WARN [org.jgroups.protocols.pbcast.GMS] join(172.17.64.39:52927)
sent to 172.17.65.39:40883 timed out, retrying
Non-responsive node is again perf02, but this is a completely different channel. Just a
CPU overload on perf02?
Perf02's Tomcat cache does finally issue a view with perf01 in it, but only after
logging this:
2008-02-21 08:35:11,784 WARN [org.jgroups.protocols.pbcast.GMS] failed to collect all
ACKs (3) for view [172.17.65.39:40883|5] [172.17.65.39:40883, 172.17.66.39:35267,
172.17.67.39:39896, 172.17.64.39:52927] after 5000ms, missing ACKs from
[172.17.65.39:40883, 172.17.66.39:35267, 172.17.67.39:39896] (received=[]),
local_addr=172.17.65.39:40883
Note it doesn't get an ACK from itself!
Eventually, perf03's sfsb cache channel excludes perf02:
2008-02-21 08:35:03,233 INFO [org.jboss.cache.TreeCache] viewAccepted():
[172.17.66.39:35274|5] [172.17.66.39:35274, 172.17.67.39:39903]
But perf04 doesn't acknowledge the new view:
2008-02-21 08:35:08,202 WARN [org.jgroups.protocols.pbcast.GMS] failed to collect all
ACKs (2) for view [172.17.66.39:35274|5] [172.17.66.39:35274, 172.17.67.39:39903] after
5000ms, missing ACKs from [172.17.67.39:39903] (received=[172.17.66.39:35274]),
local_addr=172.17.66.39:35274
No viewAccepted() with the new view ever appears on the perf04 logs.
perf02 also doesn't know it's been excluded, continuing to send suspect messages
to perf03:
2008-02-21 08:35:11,785 WARN [org.jgroups.protocols.FD] I was suspected by
172.17.65.39:40890; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK
View the original post :
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4138599#...
Reply to the post :
http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&a...