[jboss-jira] [JBoss JIRA] (WFLY-5437) Occasional stale data with remote EJB invocations and sync cache
Richard Janík (JIRA)
issues at jboss.org
Fri Oct 9 03:19:00 EDT 2015
[ https://issues.jboss.org/browse/WFLY-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Janík updated WFLY-5437:
--------------------------------
Description:
Hi,
we're occasionally getting stale data on remote EJB invocations (a number counter returns a value 1 lower than expected; see example). -This is usually preceded (by ~6 seconds) by a cluster-wide rebalance after a node is brought back from the dead.-
- 2000 clients; stale data is uncommon
- requests from a single client are separated by a 4-second window.
An example of stale data:
{code}
2015/08/28 13:01:35:442 EDT [WARN ][Runner - 640] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data: <org.jboss.smartfrog.loaddriver.RequestProcessingException: Response serial does not match. Expected: 86, received: 85, runner: 640.>
{code}
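For context, the failing check boils down to a client that invokes a remote stateful counter bean and compares each returned serial against the one it expects next. The sketch below is hypothetical (the real test uses clusterbench and the SmartFrog load driver; `CounterBean`, `Runner`, and `sample` are illustrative names, not the actual code) and shows how a replica whose session state lags by one write produces exactly this kind of mismatch:

```java
// Hypothetical sketch of the counter contract and the client-side serial
// check. Plain Java, no EJB container; the real bean would be a
// @Stateful @Remote session bean whose state is replicated by Infinispan.
public class SerialCheckSketch {

    /** Server side: a stateful counter incremented on every invocation. */
    static class CounterBean {
        private int serial;
        int increment() { return ++serial; }
    }

    /** Client side: each runner tracks the serial it expects next. */
    static class Runner {
        private final int id;
        private int expected = 1;
        Runner(int id) { this.id = id; }

        /** Returns null on success, or an error message on stale data. */
        String sample(int received) {
            if (received != expected) {
                return "Response serial does not match. Expected: " + expected
                        + ", received: " + received + ", runner: " + id + ".";
            }
            expected++;   // the next invocation must return one more
            return null;
        }
    }

    public static void main(String[] args) {
        CounterBean primary = new CounterBean();
        Runner runner = new Runner(640);
        for (int i = 0; i < 85; i++) {
            runner.sample(primary.increment());   // 85 successful invocations
        }
        // Failover: the replica missed replication of the last increment,
        // so its copy of the counter state is one behind the primary's.
        CounterBean staleReplica = new CounterBean();
        for (int i = 0; i < 84; i++) {
            staleReplica.increment();             // replica serial stuck at 84
        }
        System.out.println(runner.sample(staleReplica.increment()));
        // prints: Response serial does not match. Expected: 86, received: 85, runner: 640.
    }
}
```

With a SYNC cache the write should have been replicated before the invocation returned, which is why a one-behind response like this is suspicious even under failover.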
And a link to our jobs if you're interested:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-7x-failover-ejb-ejbremote-jvmkill-repl-sync/5/console-perf17/consoleText
This behavior has so far been observed in the jvmkill and undeploy scenarios, on REPL-SYNC and DIST-SYNC caches.
was:
Hi,
we're occasionally getting stale data on remote EJB invocations (a number counter returns a value 1 lower than expected; see example). This is usually preceded (by ~6 seconds) by a cluster-wide rebalance after a node is brought back from the dead.
- 2000 clients; stale data is uncommon
- requests from a single client are separated by a 4-second window.
An example of stale data:
{code}
2015/08/28 12:45:11:868 EDT [WARN ][Runner - 553] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data: <org.jboss.smartfrog.loaddriver.RequestProcessingException: Response serial does not match. Expected: 87, received: 86, runner: 553.>
{code}
Server side log excerpt about rebalance:
{code}
[JBossINF] 12:45:02,780 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-1,ee,perf21) ISPN000094: Received new cluster view for channel web: [perf21|7] (4) [perf21, perf20, perf18, perf19]
[JBossINF] 12:45:02,781 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-1,ee,perf21) ISPN000094: Received new cluster view for channel ejb: [perf21|7] (4) [perf21, perf20, perf18, perf19]
[JBossINF] 12:45:03,660 INFO [org.infinispan.CLUSTER] (remote-thread--p3-t6) ISPN000310: Starting cluster-wide rebalance for cache repl, topology CacheTopology{id=12, rebalanceId=5, currentCH=ReplicatedConsistentHash{ns = 60, owners = (3)[perf21: 20, perf20: 20, perf18: 20]}, pendingCH=ReplicatedConsistentHash{ns = 60, owners = (4)[perf21: 15, perf20: 15, perf18: 15, perf19: 15]}, unionCH=null, actualMembers=[perf21, perf20, perf18, perf19]}
[JBossINF] 12:45:03,660 INFO [org.infinispan.CLUSTER] (remote-thread--p4-t16) ISPN000310: Starting cluster-wide rebalance for cache dist, topology CacheTopology{id=16, rebalanceId=7, currentCH=DefaultConsistentHash{ns=80, owners = (3)[perf21: 27+26, perf20: 26+28, perf18: 27+26]}, pendingCH=DefaultConsistentHash{ns=80, owners = (4)[perf21: 20+20, perf20: 20+20, perf18: 20+20, perf19: 20+20]}, unionCH=null, actualMembers=[perf21, perf20, perf18, perf19]}
[JBossINF] 12:45:03,663 INFO [org.infinispan.CLUSTER] (remote-thread--p4-t19) ISPN000310: Starting cluster-wide rebalance for cache clusterbench-ee7.ear.clusterbench-ee7-web-default.war, topology CacheTopology{id=16, rebalanceId=7, currentCH=DefaultConsistentHash{ns=80, owners = (3)[perf21: 27+26, perf20: 26+28, perf18: 27+26]}, pendingCH=DefaultConsistentHash{ns=80, owners = (4)[perf21: 20+20, perf20: 20+20, perf18: 20+20, perf19: 20+20]}, unionCH=null, actualMembers=[perf21, perf20, perf18, perf19]}
[JBossINF] 12:45:03,664 INFO [org.infinispan.CLUSTER] (remote-thread--p4-t18) ISPN000310: Starting cluster-wide rebalance for cache clusterbench-ee7.ear.clusterbench-ee7-web-passivating.war, topology CacheTopology{id=16, rebalanceId=7, currentCH=DefaultConsistentHash{ns=80, owners = (3)[perf21: 27+26, perf20: 26+28, perf18: 27+26]}, pendingCH=DefaultConsistentHash{ns=80, owners = (4)[perf21: 20+20, perf20: 20+20, perf18: 20+20, perf19: 20+20]}, unionCH=null, actualMembers=[perf21, perf20, perf18, perf19]}
[JBossINF] 12:45:03,759 INFO [org.infinispan.CLUSTER] (remote-thread--p4-t18) ISPN000336: Finished cluster-wide rebalance for cache clusterbench-ee7.ear.clusterbench-ee7-web-passivating.war, topology id = 16
[JBossINF] 12:45:03,820 INFO [org.infinispan.CLUSTER] (remote-thread--p3-t7) ISPN000336: Finished cluster-wide rebalance for cache repl, topology id = 12
[JBossINF] 12:45:03,832 INFO [org.infinispan.CLUSTER] (remote-thread--p4-t18) ISPN000336: Finished cluster-wide rebalance for cache dist, topology id = 16
[JBossINF] 12:45:03,958 INFO [org.infinispan.CLUSTER] (remote-thread--p3-t3) ISPN000310: Starting cluster-wide rebalance for cache clusterbench-ee7.ear/clusterbench-ee7-ejb.jar, topology CacheTopology{id=12, rebalanceId=5, currentCH=ReplicatedConsistentHash{ns = 60, owners = (3)[perf21: 20, perf20: 20, perf18: 20]}, pendingCH=ReplicatedConsistentHash{ns = 60, owners = (4)[perf21: 15, perf20: 15, perf18: 15, perf19: 15]}, unionCH=null, actualMembers=[perf21, perf20, perf18, perf19]}
[JBossINF] 12:45:04,760 INFO [org.infinispan.CLUSTER] (remote-thread--p4-t18) ISPN000336: Finished cluster-wide rebalance for cache clusterbench-ee7.ear.clusterbench-ee7-web-default.war, topology id = 16
[JBossINF] 12:45:06,331 INFO [org.infinispan.CLUSTER] (remote-thread--p3-t9) ISPN000336: Finished cluster-wide rebalance for cache clusterbench-ee7.ear/clusterbench-ee7-ejb.jar, topology id = 12
{code}
And a link to our jobs if you're interested:
https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-7x-failover-ejb-ejbremote-jvmkill-repl-async/6
This behavior has been observed in the jvmkill and undeploy scenarios, on REPL-SYNC, REPL-ASYNC, and DIST-SYNC caches.
Summary: Occasional stale data with remote EJB invocations and sync cache (was: Stale data after cluster wide rebalance with remote EJB invocations)
After disregarding async caches, I can no longer find any evidence that this is tied to a cluster-wide rebalance. The number of stale data reports is very low (4 occurrences in the linked job), and they happen rarely. I have changed the description accordingly.
> Occasional stale data with remote EJB invocations and sync cache
> -----------------------------------------------------------------
>
> Key: WFLY-5437
> URL: https://issues.jboss.org/browse/WFLY-5437
> Project: WildFly
> Issue Type: Bug
> Components: Clustering
> Reporter: Richard Janík
> Assignee: Paul Ferraro
>
> Hi,
> we're occasionally getting stale data on remote EJB invocations (a number counter returns a value 1 lower than expected; see example). -This is usually preceded (by ~6 seconds) by a cluster-wide rebalance after a node is brought back from the dead.-
> - 2000 clients; stale data is uncommon
> - requests from a single client are separated by a 4-second window.
> An example of stale data:
> {code}
> 2015/08/28 13:01:35:442 EDT [WARN ][Runner - 640] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data: <org.jboss.smartfrog.loaddriver.RequestProcessingException: Response serial does not match. Expected: 86, received: 85, runner: 640.>
> {code}
> And a link to our jobs if you're interested:
> https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-7x-failover-ejb-ejbremote-jvmkill-repl-sync/5/console-perf17/consoleText
> This behavior has so far been observed in the jvmkill and undeploy scenarios, on REPL-SYNC and DIST-SYNC caches.
--
This message was sent by Atlassian JIRA
(v6.4.11#64026)