[jboss-user] [JBoss Cache: Core Edition] - jbosscache/jgroups replicated cache not working

joachimm do-not-reply at jboss.com
Mon Jun 22 11:44:24 EDT 2009


Using jbosscache-core 3.1.0.GA on jdk 1.6.

We have a shared map of users that are logged in to our site.  This is used by our back-end services to send alerts to users if they are _not_ logged in.  Each front-end web app has a node id, and each user can have multiple sessions per node:

/loggedInUsers/userId/nodeId/sessionId
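
On login we register the session under that path, roughly like this (a simplified sketch, not our exact code; the "createdAt" attribute is illustrative):

  | import org.jboss.cache.Cache;
  | import org.jboss.cache.Fqn;
  | 
  | public class SessionRegistry {
  |     // Sketch: register a session under /loggedInUsers/userId/nodeId/sessionId
  |     public static void registerSession(Cache<String, Object> cache, String userId,
  |                                        String nodeId, String sessionId) {
  |         Fqn fqn = Fqn.fromElements("loggedInUsers", userId, nodeId, sessionId);
  |         cache.put(fqn, "createdAt", System.currentTimeMillis());
  |     }
  | }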

We hook into the servlet session listener lifecycle to remove sessions from the cache when they time out or when users log off.  If a /userId/ node has no remaining children, it is removed and the user is considered logged off the site.
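
In case the shape of it helps, here is a simplified sketch of that listener (how the cache reference is wired and the "userId" session attribute are illustrative, not our exact code):

  | import javax.servlet.http.HttpSessionEvent;
  | import javax.servlet.http.HttpSessionListener;
  | import org.jboss.cache.Cache;
  | import org.jboss.cache.Fqn;
  | import org.jboss.cache.Node;
  | 
  | public class LoggedInUserCleanupListener implements HttpSessionListener {
  |     private Cache<String, Object> cache;  // obtained elsewhere (e.g. JNDI/DI)
  |     private String nodeId;                // this front end's node id
  | 
  |     public void sessionDestroyed(HttpSessionEvent se) {
  |         String userId = (String) se.getSession().getAttribute("userId");
  |         if (userId == null) return;
  |         // remove this session's node...
  |         cache.removeNode(Fqn.fromElements("loggedInUsers", userId, nodeId,
  |                                           se.getSession().getId()));
  |         // ...and the /userId/ node if no sessions remain on any front end
  |         Node<String, Object> user = cache.getNode(Fqn.fromElements("loggedInUsers", userId));
  |         if (user != null && user.getChildren().isEmpty()) {
  |             cache.removeNode(user.getFqn());
  |         }
  |     }
  | 
  |     public void sessionCreated(HttpSessionEvent se) { /* registration happens at login */ }
  | }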

We are currently deployed with a single front end and a single back end, both on the same linux machine.

We have a number of problems:


  | * the front-end and back-end cache state diverge; for example, the front end often has fewer entries than the back end
  | 
  | * the cache does not always reflect the correct state: sessions persist after they expire in the servlet container, and /userId/ nodes with no remaining children are not cleaned up
  | 
  | * the eviction policy does not clean up stale entries: we've defined a time-to-live of 30 minutes to match our session timeout on the front end, yet we see entries in the cache that are many hours old
  | 
  | * we see membership warnings in the logs periodically, even though the processes are still running and everything is on a single machine
  | 

In short, it is not working.  I'm hoping there is a configuration issue that someone can spot, or a different way of configuring JGroups that might help.  I've included the config at the end.

I also have a couple of questions:


  | * is there a way to get the last modified/accessed time of an entry through the API without touching the entry and thereby increasing its TTL?
  | 
  | * is there a way to remove an Fqn via JMX?
  | 
  | * is there work the application has to do with @NodeEvicted or other annotations to ensure that the cache state is the same on all nodes? (I sort of thought this was the point of JBoss Cache.) See the sketch just after this list.
  | 
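
To make the third question concrete, this is the kind of listener I mean (a minimal sketch using the notification annotations; the println is just illustrative):

  | import org.jboss.cache.notifications.annotation.CacheListener;
  | import org.jboss.cache.notifications.annotation.NodeEvicted;
  | import org.jboss.cache.notifications.event.NodeEvictedEvent;
  | 
  | @CacheListener
  | public class EvictionObserver {
  |     @NodeEvicted
  |     public void nodeEvicted(NodeEvictedEvent event) {
  |         if (!event.isPre()) {  // fire after the eviction has happened
  |             System.out.println("evicted: " + event.getFqn());
  |         }
  |     }
  | }

(Registered via cache.addCacheListener(new EvictionObserver()).)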


Thanks for any help you can provide.   --Joachim
      

Here are the errors we are seeing periodically:

  | server.log.2009-06-18:2009-06-18 22:44:32,279 WARN [org.jgroups.protocols.FD] - <I was suspected by 127.0.0.1:59480; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK>
  | server.log.2009-06-18:2009-06-18 22:44:32,279 WARN [org.jgroups.protocols.FD] - <I was suspected by 127.0.0.1:48930; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK>
  | server.log.2009-06-18:2009-06-18 22:44:32,795 WARN [org.jgroups.protocols.pbcast.GMS] - <I (127.0.0.1:35383) am not a member of view [127.0.0.1:51683|6] [127.0.0.1:51683, 127.0.0.1:57843, 127.0.0.1:43144, 127.0.0.1:33294, 127.0.0.1:48930]; discarding view>
  | server.log.2009-06-18:2009-06-18 22:44:32,971 WARN [org.jgroups.protocols.pbcast.GMS] - <I (127.0.0.1:51031) am not a member of view [127.0.0.1:59480|16] [127.0.0.1:59480]; discarding view>
  | server.log.2009-06-18:2009-06-18 22:44:53,056 WARN [org.jgroups.protocols.pbcast.GMS] - <Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [127.0.0.1:51683, 127.0.0.1:35383]>
  | server.log.2009-06-18:2009-06-18 22:45:03,648 WARN [org.jgroups.protocols.pbcast.GMS] - <Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [127.0.0.1:59480, 127.0.0.1:51031]>
  | server.log.2009-06-18:2009-06-18 22:45:03,648 WARN [org.jgroups.protocols.pbcast.GMS] - <merge was supposed to be cancelled at merge participant 127.0.0.1:51031 (merge_id=[127.0.0.1:51031|1245365098648]), but it is not since merge ids do not match>
  | server.log.2009-06-18:2009-06-18 22:45:08,605 WARN [org.jgroups.protocols.pbcast.GMS] - <Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [127.0.0.1:51683, 127.0.0.1:35383]>
  | server.log.2009-06-18:2009-06-18 22:45:08,609 WARN [org.jgroups.protocols.pbcast.GMS] - <merge was supposed to be cancelled at merge participant 127.0.0.1:35383 (merge_id=[127.0.0.1:35383|1245365103604]), but it is not since merge ids do not match>
  | server.log.2009-06-18:2009-06-18 22:45:11,269 WARN [org.jgroups.protocols.pbcast.GMS] - <GMS flush by coordinator at 127.0.0.1:35383 failed>
  | server.log.2009-06-18:2009-06-18 22:45:11,269 WARN [org.jgroups.protocols.pbcast.GMS] - <Since flush failed at 127.0.0.1:35383 rejected merge to 127.0.0.1:35383, merge_id=[127.0.0.1:35383|1245365088055]>
  | server.log.2009-06-18:2009-06-18 22:45:11,269 ERROR [org.jgroups.protocols.pbcast.GMS] - <merge_id ([127.0.0.1:35383|1245365088055]) or this.merge_id (null) is null (sender=127.0.0.1:35383).>
  | server.log.2009-06-18:2009-06-18 22:45:23,157 WARN [org.jgroups.protocols.pbcast.GMS] - <GMS flush by coordinator at 127.0.0.1:51031 failed>
  | server.log.2009-06-18:2009-06-18 22:45:23,157 WARN [org.jgroups.protocols.pbcast.GMS] - <resume([127.0.0.1:51031|1245365098648]) does not match [127.0.0.1:51031|1245365122421]>
  | server.log.2009-06-18:2009-06-18 22:45:23,157 WARN [org.jgroups.protocols.pbcast.GMS] - <Since flush failed at 127.0.0.1:51031 rejected merge to 127.0.0.1:51031, merge_id=[127.0.0.1:51031|1245365098648]>
  | server.log.2009-06-18:2009-06-18 22:45:23,169 ERROR [org.jgroups.protocols.pbcast.GMS] - <this.merge_id ([127.0.0.1:51031|1245365122421]) is different from merge_id ([127.0.0.1:51031|1245365098648])>
  | server.log.2009-06-18:2009-06-18 22:45:24,101 WARN [org.jgroups.protocols.pbcast.GMS] - <Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [127.0.0.1:51683, 127.0.0.1:35383]>
  | server.log.2009-06-18:2009-06-18 22:45:26,006 WARN [org.jgroups.protocols.pbcast.GMS] - <GMS flush by coordinator at 127.0.0.1:35383 failed>
  | server.log.2009-06-18:2009-06-18 22:45:26,006 WARN [org.jgroups.protocols.pbcast.GMS] - <Since flush failed at 127.0.0.1:35383 rejected merge to 127.0.0.1:35383, merge_id=[127.0.0.1:35383|1245365103604]>
  | server.log.2009-06-18:2009-06-18 22:45:26,006 ERROR [org.jgroups.protocols.pbcast.GMS] - <merge_id ([127.0.0.1:35383|1245365103604]) or this.merge_id (null) is null (sender=127.0.0.1:35383).>
  | server.log.2009-06-18:2009-06-18 22:45:27,422 WARN [org.jgroups.protocols.pbcast.GMS] - <Merge aborted. Merge leader did not get MergeData from all subgroup coordinators [127.0.0.1:59480, 127.0.0.1:51031]>
  | server.log.2009-06-18:2009-06-18 22:45:27,422 WARN [org.jgroups.protocols.pbcast.GMS] - <merge was supposed to be cancelled at merge participant 127.0.0.1:51031 (merge_id=[127.0.0.1:51031|1245365122421]), but it is not since merge ids do not match>
  | server.log.2009-06-18:2009-06-18 22:45:33,806 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:51031 not found in xmit_table>
  | server.log.2009-06-18:2009-06-18 22:45:33,806 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
  | server.log.2009-06-18:2009-06-18 22:45:33,806 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:59480 not found in xmit_table>
  | server.log.2009-06-18:2009-06-18 22:45:33,806 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
  | server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:35383 not found in xmit_table>
  | server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
  | server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:51683 not found in xmit_table>
  | server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
  | server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:57843 not found in xmit_table>
  | server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
  | server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:43144 not found in xmit_table>
  | server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
  | server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:33294 not found in xmit_table>
  | server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
  | server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <sender 127.0.0.1:48930 not found in xmit_table>
  | server.log.2009-06-18:2009-06-18 22:45:36,062 ERROR [org.jgroups.protocols.pbcast.NAKACK] - <range is null>
  | server.log.2009-06-19:2009-06-19 03:47:12,927 WARN [org.jgroups.protocols.pbcast.GMS] - <I (127.0.0.1:40908) am not a member of view [127.0.0.1:59480|18] [127.0.0.1:59480]; discarding view>
  | server.log.2009-06-19:2009-06-19 03:47:12,931 WARN [org.jgroups.protocols.FD] - <I was suspected by 127.0.0.1:59480; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK>
  | server.log.2009-06-19:2009-06-19 03:47:12,951 WARN [org.jgroups.protocols.pbcast.GMS] - <I (127.0.0.1:44101) am not a member of view [127.0.0.1:51683|8] [127.0.0.1:51683, 127.0.0.1:57843, 127.0.0.1:43144, 127.0.0.1:33294, 127.0.0.1:48930]; discarding view>
  | server.log.2009-06-19:2009-06-19 03:47:12,995 WARN [org.jgroups.protocols.FD] - <I was suspected by 127.0.0.1:48930; ignoring the SUSPECT message and sending back a HEARTBEAT_ACK>


Here's the config:


  | <jbosscache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:jboss:jbosscache-core:config:3.0">
  | 
  |    <!-- Configure the TransactionManager -->
  |    <transaction transactionManagerLookupClass="org.jboss.cache.transaction.GenericTransactionManagerLookup"/>
  | 
  |    <clustering mode="replication" clusterName="loggedInUsers">
  |       <!-- JGroups protocol stack properties. -->
  |       <!-- changed addr and port (to isolate channels) -->
  |       <jgroupsConfig>
  |          <UDP discard_incompatible_packets="true" enable_bundling="false" enable_diagnostics="false" ip_ttl="2"
  |               loopback="false" max_bundle_size="64000" max_bundle_timeout="30" mcast_addr="228.10.10.11"
  |               mcast_port="45589" mcast_recv_buf_size="25000000" mcast_send_buf_size="640000"
  |               oob_thread_pool.enabled="true" oob_thread_pool.keep_alive_time="10000" oob_thread_pool.max_threads="4"
  |               oob_thread_pool.min_threads="1" oob_thread_pool.queue_enabled="true" oob_thread_pool.queue_max_size="10"
  |               oob_thread_pool.rejection_policy="Run" thread_naming_pattern="pl" thread_pool.enabled="true"
  |               thread_pool.keep_alive_time="30000" thread_pool.max_threads="25" thread_pool.min_threads="1"
  |               thread_pool.queue_enabled="true" thread_pool.queue_max_size="10" thread_pool.rejection_policy="Run"
  |               tos="8" ucast_recv_buf_size="20000000" ucast_send_buf_size="640000" use_concurrent_stack="true"
  |               use_incoming_packet_handler="true"/>
  |          <PING num_initial_members="3" timeout="2000"/>
  |          <MERGE2 max_interval="30000" min_interval="10000"/>
  |          <FD_SOCK/>
  |          <FD max_tries="5" shun="true" timeout="10000"/>
  |          <VERIFY_SUSPECT timeout="1500"/>
  |          <pbcast.NAKACK discard_delivered_msgs="true" gc_lag="0" retransmit_timeout="300,600,1200,2400,4800"
  |                         use_mcast_xmit="false"/>
  |          <UNICAST timeout="300,600,1200,2400,3600"/>
  |          <pbcast.STABLE desired_avg_gossip="50000" max_bytes="400000" stability_delay="1000"/>
  |          <pbcast.GMS join_timeout="5000" print_local_addr="true" shun="false" view_ack_collection_timeout="5000"
  |                      view_bundling="true"/>
  |          <FRAG2 frag_size="60000"/>
  |          <pbcast.STREAMING_STATE_TRANSFER/>
  |          <pbcast.FLUSH timeout="0"/>
  |       </jgroupsConfig>
  | 
  |       <sync replTimeout="20000"/>
  |       <!-- Alternatively, to use async replication, comment out the element above and uncomment the element below.  -->
  |        <!--<async />-->
  | 
  |    </clustering>
  | 
  |     <eviction wakeUpInterval="60000">
  |        <region name="/loggedInUsers" algorithmClass="org.jboss.cache.eviction.LRUAlgorithm" eventQueueSize="200000">
  |           <property name="timeToLive" value="1800001" />
  |        </region>
  |     </eviction>
  | 
  | </jbosscache>
  | 
  | 
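
For completeness, we build the cache from that file roughly like this (the filename here is illustrative):

  | import org.jboss.cache.Cache;
  | import org.jboss.cache.CacheFactory;
  | import org.jboss.cache.DefaultCacheFactory;
  | 
  | public class CacheBootstrap {
  |     public static void main(String[] args) {
  |         CacheFactory<String, Object> factory = new DefaultCacheFactory<String, Object>();
  |         // second argument "true" creates and starts the cache in one step
  |         Cache<String, Object> cache = factory.createCache("loggedInUsers-cache.xml", true);
  |         System.out.println("members: " + cache.getMembers());
  |     }
  | }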

Thanks!

View the original post : http://www.jboss.org/index.html?module=bb&op=viewtopic&p=4239139#4239139

Reply to the post : http://www.jboss.org/index.html?module=bb&op=posting&mode=reply&p=4239139



More information about the jboss-user mailing list