[jboss-jira] [JBoss JIRA] Updated: (JGRP-957) Intermittent cluster stability issues

a C (JIRA) jira-events at lists.jboss.org
Wed Apr 15 06:37:27 EDT 2009


     [ https://jira.jboss.org/jira/browse/JGRP-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

a C updated JGRP-957:
---------------------

    Description: 
We are using JGroups as a notification system between web applications running inside Tomcat or WebLogic Server. In our current test platform all cluster nodes are on the same host, most of them in the same container (Tomcat). Some web applications may have several connections to the cluster.
We use UDP multicast on a LAN; the configuration is nearly the default one.
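For reference, each connection is set up roughly as follows (a simplified sketch against the 2.x API; the config file name and cluster name are placeholders, not our actual values):

    import org.jgroups.JChannel;
    import org.jgroups.Message;
    import org.jgroups.ReceiverAdapter;
    import org.jgroups.View;

    public class NotificationNode {
        public static void main(String[] args) throws Exception {
            // Near-default UDP multicast stack; "udp.xml" stands in for our real configuration.
            JChannel channel = new JChannel("udp.xml");
            channel.setReceiver(new ReceiverAdapter() {
                public void receive(Message msg) {
                    // Application-level notification handling.
                    System.out.println("notification: " + msg.getObject());
                }
                public void viewAccepted(View newView) {
                    // These callbacks produce the "viewAccepted" lines counted below.
                    System.out.println("viewAccepted: " + newView);
                }
            });
            channel.connect("notification-cluster");
        }
    }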

The system seems to work fine, but we regularly have cluster stability issues. Typically a lot of SUSPECT messages are exchanged, many "GMS: address ..." lines are logged on standard output, and the number of viewAccepted events increases dramatically.

As an example, here are the daily counts of viewAccepted callbacks (grep -c viewAccepted */logout.log):
logout.log.2009-03-25:6
logout.log.2009-03-26:51
logout.log.2009-03-27:49
logout.log.2009-03-28:0
logout.log.2009-03-29:2290
logout.log.2009-03-30:64
logout.log.2009-03-31:55
logout.log.2009-04-01:15
logout.log.2009-04-02:433
logout.log.2009-04-03:32
logout.log.2009-04-04:4
logout.log.2009-04-05:5
logout.log.2009-04-06:38
logout.log.2009-04-07:26
logout.log.2009-04-08:30
logout.log.2009-04-09:19
logout.log.2009-04-10:32
logout.log.2009-04-11:5
logout.log.2009-04-12:7
logout.log.2009-04-13:2236
logout.log.2009-04-14:56


We performed several test campaigns, sending and receiving messages over a 2 or 3 day period and checking for message loss, and everything went fine until the problems appeared again. No network issue was detected by our system administrator.
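The loss check in those campaigns was essentially of the following form (a simplified, single-sender sketch with illustrative names; not our actual test code):

    import org.jgroups.JChannel;
    import org.jgroups.Message;
    import org.jgroups.ReceiverAdapter;

    public class LossCheck {
        public static void main(String[] args) throws Exception {
            JChannel channel = new JChannel("udp.xml");   // placeholder config
            channel.setReceiver(new ReceiverAdapter() {
                long expected;                            // single-sender check for brevity
                public void receive(Message msg) {
                    long seqno = (Long) msg.getObject();
                    if (seqno != expected)
                        System.out.println("gap: expected " + expected + ", got " + seqno);
                    expected = seqno + 1;
                }
            });
            channel.connect("loss-test");                 // placeholder cluster name
            for (long i = 0; ; i++) {                     // left running for 2-3 days
                channel.send(new Message(null, null, i)); // multicast a numbered message
                Thread.sleep(1000);
            }
        }
    }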

Another typical problem is that members send NOT_MEMBER messages, causing stacks to shut down (should I say channels to close?): [Received NOT_MEMBER event from null I'm being shunned; exiting]. The shun option is not set (nor is the channel's auto-reconnect option), and nevertheless in some cases the stack starts up again (CloserThread - reconnecting to group ...) and in other cases it does not. Please note that when the stack does not start up automatically, it is impossible to connect to the channel manually (we always receive a ChannelClosedException).


Typically:
[sip at bipro tmusadmin]$ grep -c NOT_MEMBER jgroup.log*
jgroup.log:0
jgroup.log.2009-03-30:3
jgroup.log.2009-03-31:0
jgroup.log.2009-04-01:0
jgroup.log.2009-04-02:1370
jgroup.log.2009-04-07:0
jgroup.log.2009-04-10:0
jgroup.log.2009-04-11:11
jgroup.log.2009-04-12:9
jgroup.log.2009-04-13:587
jgroup.log.2009-04-14:0
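
For reference, these are the two settings mentioned above, as we understand the 2.x API; neither of them is enabled in our setup (illustrative sketch only):

    import org.jgroups.Channel;
    import org.jgroups.JChannel;

    public class ReconnectOptions {
        public static void main(String[] args) throws Exception {
            // Shunning itself is a protocol-level flag in the 2.x stack, e.g.:
            //   <FD timeout="2500" max_tries="5" shun="true"/>
            //   <pbcast.GMS join_timeout="3000" shun="true"/>
            JChannel channel = new JChannel("udp.xml");            // placeholder config
            // Automatic reconnect after a shun is a channel option (not set in our case):
            channel.setOpt(Channel.AUTO_RECONNECT, Boolean.TRUE);
            channel.connect("notification-cluster");               // placeholder cluster name
        }
    }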

A suggestion would be greatly appreciated.

Sorry for the size of the logs!




> Intermittent cluster stability issues
> -------------------------------------
>
>                 Key: JGRP-957
>                 URL: https://jira.jboss.org/jira/browse/JGRP-957
>             Project: JGroups
>          Issue Type: Bug
>    Affects Versions: 2.7
>         Environment: jdk 1.5
>            Reporter: a C
>            Assignee: Bela Ban
>             Fix For: 2.8
>
>         Attachments: jgroups-logs.zip
>
>

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://jira.jboss.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
