[infinispan-issues] [JBoss JIRA] (ISPN-1965) Some entries not available during view change

Thu Apr 5 05:44:47 EDT 2012

    [ https://issues.jboss.org/browse/ISPN-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681974#comment-12681974 ] 

RH Bugzilla Integration commented on ISPN-1965:
-----------------------------------------------

Dan Berindei <dberinde at redhat.com> made a comment on [bug 808623|https://bugzilla.redhat.com/show_bug.cgi?id=808623]

Misha, after reading it again I think it could be a little clearer. So here's another attempt:

In rare circumstances, when a node leaves the cluster, instead of going directly to a new cluster view that contains everyone but the leaver, the cluster splits into two partitions which then merge after a short amount of time. During this time, at least some nodes will not have access to all the data that previously existed in the cache. After the merge, all the nodes will again have access to all the data, but changes made during the split may be lost or be visible only to a part of the cluster.

Normally, when the view changes because of a join or a leave, the cache data is rebalanced across the new cluster members. However, if numOwners or more nodes leave in quick succession, keys for which all owners have left will be lost. The same thing happens during a network split - regardless of how the partitions form, there will be at least one partition that doesn't have all the data (assuming cluster size > numOwners).
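
To make the reasoning above concrete, here is a minimal sketch for the 4-node, numOwners=2 setup from the original report. It is not Infinispan's consistent hash: the `OWNERS` table and the helpers `reachable`, `lost_after_leave` and `two_way_splits` are hypothetical, and the sketch assumes that every pair of nodes owns at least one key, which is roughly what happens once the cache holds enough keys.

```python
from itertools import combinations

NODES = ["node0001", "node0002", "node0003", "node0004"]
NUM_OWNERS = 2

# Assumption (not Infinispan's real hashing): one key per possible owner group,
# so every NUM_OWNERS-sized group of nodes owns at least one key.
OWNERS = {
    f"key{i}": set(group)
    for i, group in enumerate(combinations(NODES, NUM_OWNERS))
}


def reachable(partition):
    """Keys that have at least one owner inside the given partition."""
    members = set(partition)
    return {key for key, owners in OWNERS.items() if owners & members}


def lost_after_leave(leavers):
    """Keys whose owners all left before a rebalance could copy them."""
    gone = set(leavers)
    return {key for key, owners in OWNERS.items() if owners <= gone}


def two_way_splits(nodes):
    """Yield each split of the nodes into two non-empty partitions once."""
    first, rest = nodes[0], nodes[1:]
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            side_a = {first, *extra}
            yield side_a, set(nodes) - side_a


if __name__ == "__main__":
    print("lost if node0001 leaves:", lost_after_leave(["node0001"]))
    print("lost if node0001 and node0002 leave:",
          lost_after_leave(["node0001", "node0002"]))
    for side_a, side_b in two_way_splits(NODES):
        missing_a = set(OWNERS) - reachable(side_a)
        missing_b = set(OWNERS) - reachable(side_b)
        print(sorted(side_a), "misses", sorted(missing_a), "|",
              sorted(side_b), "misses", sorted(missing_b))
```

Under these assumptions a single leaver loses nothing, two simultaneous leavers lose the key they jointly owned, and every two-way split of the four nodes leaves at least one side without some key.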

While there are multiple partitions, each one can make changes to the data independently, so a remote client will see inconsistencies in the data. When merging, JBoss Data Grid does not attempt to resolve these inconsistencies, so different nodes may hold different values even after the merge.
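
The divergence described above can be pictured with a toy model. This is only a sketch: the `replicas` dictionary and the `put` helper are hypothetical stand-ins for per-node copies of an entry, not Infinispan APIs, and the "merge" is modelled as simply restoring the full view without touching any data.

```python
# Both nodes own key "k" and start with the same value.
replicas = {
    "node0002": {"k": "v0"},
    "node0003": {"k": "v0"},
}


def put(partition, key, value):
    """Apply a write only on the nodes visible from the writer's partition."""
    for node in partition:
        replicas[node][key] = value


# During the split, each partition accepts its own write for the same key.
put({"node0002"}, "k", "written-in-partition-A")
put({"node0003"}, "k", "written-in-partition-B")

# Merge without reconciliation: the view is whole again, the copies are untouched.
merged_view = {"node0002", "node0003"}
values_after_merge = {replicas[node]["k"] for node in merged_view}

if __name__ == "__main__":
    print(values_after_merge)
```

After the merge the two owners still hold different values for the same key, so which one a client reads depends on which node answers.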

> Some entries not available during view change
> ---------------------------------------------
>
>                 Key: ISPN-1965
>                 URL: https://issues.jboss.org/browse/ISPN-1965
>             Project: Infinispan
>          Issue Type: Bug
>    Affects Versions: 5.1.3.FINAL
>            Reporter: Michal Linhard
>            Assignee: Dan Berindei
>
> In the 4-node, dist-mode, num-owners=2 elasticity test
> http://www.qa.jboss.com/~mlinhard/hyperion/run44-elas-dist/
> there is a period of roughly 90 seconds during which clients get null responses to GET
> requests on entries that should exist in the cache.
> first occurrence:
> hyperion1139.log 05:31:01,202 286.409
> last occurrence:
> hyperion1135.log 05:32:45,441 390.648
> total occurrence count (across all 19 driver nodes):
> 152241
> (this doesn't mean it happens for 152K keys, because each key is retried after
> an erroneous attempt)
> Data doesn't seem to be lost, because these errors cease after a while and
> the number of entries returns to normal (see cache_entries.csv).
> This happens approximately in the period between the killing of node0001 and the
> formation of cluster {node0002 - node0004} (and shortly after).
