[JBoss JIRA] (ISPN-4995) ClusteredGet served for non-member of CH
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-4995?page=com.atlassian.jira.plugin.... ]
Dan Berindei reassigned ISPN-4995:
----------------------------------
Assignee: Dan Berindei
> ClusteredGet served for non-member of CH
> ----------------------------------------
>
> Key: ISPN-4995
> URL: https://issues.jboss.org/browse/ISPN-4995
> Project: Infinispan
> Issue Type: Bug
> Components: Core, State Transfer
> Reporter: Radim Vansa
> Assignee: Dan Berindei
> Priority: Critical
>
> When nodes accept a ClusteredGetCommand from a node that is not a member of the CH, it can happen that when one thread does
> {code}
> put(K1, V1);
> put(K2, V2)
> {code}
> and another gets
> {code}
> get(K2) -> V2
> get(K1) -> V0 (some old value)
> {code}
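A minimal, plain-Java model of this interleaving (illustrative only, not Infinispan code; the two maps stand in for a node that missed the backup write and a node that is up to date):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the anomaly: one "owner" missed the write of V1 to K1 but is
// still allowed to serve reads for it, so a reader that already observed the
// newer write to K2 can then observe the older value of K1.
public class StaleReadSketch {
    static Map<String, String> stalePartitionedNode = new HashMap<>(); // missed put(K1, V1)
    static Map<String, String> upToDateNode = new HashMap<>();

    static String read(String key) {
        // K2 is served by the up-to-date owner, K1 by the node outside the CH.
        return key.equals("K1") ? stalePartitionedNode.get(key) : upToDateNode.get(key);
    }

    public static void main(String[] args) {
        stalePartitionedNode.put("K1", "V0"); // old value; the put of V1 never arrived
        upToDateNode.put("K1", "V1");
        upToDateNode.put("K2", "V2");         // second put, replicated normally

        System.out.println("get(K2)=" + read("K2")); // V2 (new value)
        System.out.println("get(K1)=" + read("K1")); // V0 (old value)
    }
}
```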
> edg-perf01, 02 and 03 share this view and topology:
> {code}
> 04:40:08,714 TRACE [org.jgroups.protocols.FD_SOCK] (INT-8,edg-perf01-63779) edg-perf01-63779: i-have-sock: edg-perf02-45117 --> 172.18.1.3:37476 (cache is {edg-perf01-63779=172.18.1.1:40099, edg-perf02-45117=172.18.1.3:37476})
> 04:40:08,715 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p2-t6) Received new cluster view: 8, isCoordinator = true, becameCoordinator = false
> 04:40:11,203 DEBUG [org.infinispan.topology.LocalTopologyManagerImpl] (transport-thread--p2-t1) Updating local consistent hash(es) for cache testCache: new topology = CacheTopology{id=16, rebalanceId=4, currentC
> H=DefaultConsistentHash{ns = 512, owners = (3)[edg-perf02-45117: 171+170, edg-perf03-6264: 171+171, edg-perf01-63779: 170+171]}, pendingCH=null, unionCH=null, actualMembers=[edg-perf02-45117, edg-perf03-6264, edg-perf01-63779]}
> {code}
> Later, edg-perf02 and edg-perf03 get new view and install a new topology, where edg-perf01 does not exist:
> {code}
> 04:41:13,681 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-2,edg-perf03-6264) ISPN000093: Received new, MERGED cluster view for channel default: MergeView::[edg-perf02-45117|9] (3) [edg-perf02-45117, edg-perf03-6264, edg-perf04-10989], 1 subgroups: [edg-perf04-10989|7] (1) [edg-perf04-10989]
> 04:41:13,681 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p2-t22) Received new cluster view: 9, isCoordinator = false, becameCoordinator = false
> 04:41:13,760 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (remote-thread--p3-t32) Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=testCache, type=CH_UPDATE, sender=edg-perf02-45117, joinInfo=null, topologyId=18, rebalanceId=4, currentCH=DefaultConsistentHash{ns = 512, owners = (2)[edg-perf02-45117: 256+85, edg-perf03-6264: 256+86]}, pendingCH=null, availabilityMode=AVAILABLE, actualMembers=[edg-perf02-45117, edg-perf03-6264], throwable=null, viewId=9}[sender=edg-perf02-45117]
> {code}
> After that, edg-perf04 writes to {{key_00000000000020DB}}, which is currently owned only by edg-perf03 - this key serves as K1 in the example above. It is not backed up to edg-perf01, but edg-perf01 still considers itself an owner of this key, as it has not received any new view (this is a log from edg-perf03):
> {code}
> 04:41:30,884 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (remote-thread--p3-t45) edg-perf03-6264 invoking PutKeyValueCommand{key=key_00000000000020DB, value=[33 #4: 0, 169, 284, 634, ], flags=[SKIP_CACHE_LOAD, SKIP_REMOTE_LOOKUP], putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true} to recipient list [edg-perf03-6264] with options RpcOptions{timeout=60000, unit=MILLISECONDS, fifoOrder=true, totalOrder=false, responseFilter=null, responseMode=SYNCHRONOUS, skipReplicationQueue=false}
> {code}
> Later, edg-perf04 writes to another key, {{stressor_33}} (K2 in the example), a value with operationId=650 (the previous value's operationId is 600), which is replicated to edg-perf02 and edg-perf03.
> Now a merge view with all 4 nodes is installed:
> {code}
> 04:41:31,258 INFO [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-2,edg-perf01-63779) ISPN000093: Received new, MERGED cluster view for channel default: MergeView::[edg-perf01-63779|10] (4) [edg-perf01-63779, edg-perf03-6264, edg-perf02-45117, edg-perf04-10989], 6 subgroups: [edg-perf02-45117|7] (2) [edg-perf02-45117, edg-perf03-6264], [edg-perf01-63779|4] (2) [edg-perf01-63779, edg-perf02-45117], [edg-perf02-45117|9] (3) [edg-perf02-45117, edg-perf03-6264, edg-perf04-10989], [edg-perf03-6264|4] (2) [edg-perf03-6264, edg-perf04-10989], [edg-perf01-63779|8] (3) [edg-perf01-63779, edg-perf02-45117, edg-perf03-6264], [edg-perf01-63779|6] (1) [edg-perf01-63779]
> 04:41:31,258 TRACE [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p2-t2) Received new cluster view: 10, isCoordinator = true, becameCoordinator = false
> {code}
> edg-perf01 now issues a remote get to edg-perf02 for key stressor_33 and receives the correct answer (operationId=650):
> {code}
> 04:41:32,494 TRACE [org.infinispan.remoting.rpc.RpcManagerImpl] (BackgroundOps-Checker-1) Response(s) to ClusteredGetCommand{key=stressor_33, flags=null} is {edg-perf02-45117=SuccessfulResponse{responseValue=ImmortalCacheValue {value=LastOperation{operationId=650, seed=0000A15A4C2DD25A}}} }
> {code}
> However, when edg-perf01 reads {{key_00000000000020DB}}, it loads the old value from its local data container, as no CH update/rebalance has happened so far:
> {code}
> 04:41:32,496 TRACE [org.infinispan.partitionhandling.impl.PartitionHandlingManagerImpl] (BackgroundOps-Checker-1) Checking availability for key=key_00000000000020DB, status=AVAILABLE
> 04:41:32,497 ERROR [org.radargun.stages.cache.background.LogChecker] (BackgroundOps-Checker-1) Missing operation 634 for thread 33 on key 8411 (key_00000000000020DB)
> 04:41:32,499 DEBUG [org.radargun.service.InfinispanDebugable] (BackgroundOps-Checker-1) Debug info for key testCache key_00000000000020DB: owners=edg-perf01-63779, edg-perf03-6264, local=true, uncertain=false, container.key_00000000000020DB=ImmortalCacheEntry[key=key_00000000000020DB, value=[33 #3: 0, 169, 284, ], created=-1, isCreated=false, lastUsed=-1, isChanged=false, expires=-1, isExpired=false, canExpire=false, isEvicted=true, isRemoved=false, isValid=false, lifespan=-1, maxIdle=-1], segmentId=173
> {code}
> Note that this was found on branch https://github.com/infinispan/infinispan/pull/3062/files while trying to fix ISPN-4949.
--
This message was sent by Atlassian JIRA
(v6.3.8#6338)
[JBoss JIRA] (ISPN-3811) Initial ST leaves node as member without data after MERGE
by RH Bugzilla Integration (JIRA)
[ https://issues.jboss.org/browse/ISPN-3811?page=com.atlassian.jira.plugin.... ]
RH Bugzilla Integration commented on ISPN-3811:
-----------------------------------------------
Tristan Tarrant <ttarrant(a)redhat.com> changed the Status of [bug 1040046|https://bugzilla.redhat.com/show_bug.cgi?id=1040046] from NEW to ON_QA
> Initial ST leaves node as member without data after MERGE
> ---------------------------------------------------------
>
> Key: ISPN-3811
> URL: https://issues.jboss.org/browse/ISPN-3811
> Project: Infinispan
> Issue Type: Bug
> Components: State Transfer
> Affects Versions: 6.0.0.Final
> Reporter: Radim Vansa
> Assignee: Dan Berindei
> Priority: Blocker
> Labels: testsuite_stability
> Fix For: 7.0.0.Final
>
>
> Under certain circumstances, JGroups can issue a MERGE view when a node is joining the cache. The new node joins the cluster, and all nodes have the same cache topology (not containing the joiner yet).
> During the merge, the CHs are joined (through CHFactory.union) and, as all nodes report the same topology/hash, the resulting hash is identical. However, the joiner is added to the members list, and it can therefore finish the initial state transfer even though no data has been assigned to it.
> Later, the coordinator starts a rebalance and the node begins to receive some data, but the thread that started the cluster manager (and that should wait until the cluster becomes properly replicated through the initial ST) has already been released.
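A toy model of the situation described above (plain Java, not Infinispan's actual CHFactory.union): merging two hashes with identical segment ownership while adding the joiner to the member list yields a "member" that owns no segments, so its initial state transfer has nothing to wait for and completes immediately:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

public class UnionSketch {
    // Union of the member lists; segment ownership is left untouched because
    // both sides of the merge report the same hash.
    static List<String> unionMembers(List<String> a, List<String> b) {
        LinkedHashSet<String> merged = new LinkedHashSet<>(a);
        merged.addAll(b);
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        // Identical segment -> owners mapping on both sides of the merge.
        Map<Integer, List<String>> owners = Map.of(
                0, List.of("A", "B"),
                1, List.of("B", "A"));

        List<String> members = unionMembers(List.of("A", "B"), List.of("A", "B", "joiner"));
        boolean joinerOwns = owners.values().stream().anyMatch(o -> o.contains("joiner"));
        System.out.println("members=" + members + ", joiner owns any segment: " + joinerOwns);
        // The joiner is a member but owns nothing, so its initial ST is a no-op.
    }
}
```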
[JBoss JIRA] (ISPN-4851) Make SyncConsistentHashFactory the default CH factory
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-4851?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-4851:
-------------------------------
Status: Open (was: New)
> Make SyncConsistentHashFactory the default CH factory
> -----------------------------------------------------
>
> Key: ISPN-4851
> URL: https://issues.jboss.org/browse/ISPN-4851
> Project: Infinispan
> Issue Type: Feature Request
> Components: Configuration, Core
> Affects Versions: 7.0.0.CR1
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Fix For: 7.1.0.Alpha1
>
>
> With ISPN-4682 fixed, SyncConsistentHashFactory should be good enough to be the default. It still allows for more variation in the number of owned segments per node (+/-10% owned segments and +/-20% for primary-owned segments), but that should be acceptable for most purposes.
> The major advantage of SCHF is that it depends only on the cache members and not on the order they joined. Users expect a key to map to the same node in all caches (as long as the caches have the same members).
> One downside of SCHF, especially for testing, is that the segment ownership differs between test runs (being based on the random address assigned to each node). However, most tests that depend on key ownership should use {{ControlledConsistentHashFactory}} anyway.
> We also need to verify that the number of segments moved by SCHF is comparable to the number of segments moved by DefaultConsistentHashFactory (ISPN-3729).
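A sketch of the join-order-independence property described above, not SyncConsistentHashFactory's real algorithm: if segment ownership is derived purely from the member set (here via highest-random-weight, a.k.a. rendezvous, hashing), every cache with the same members maps a segment to the same primary owner regardless of the order in which the members joined:

```java
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;

public class RendezvousSketch {
    // The owner of a segment is the member with the highest (member, segment)
    // hash -- a function of the member set only, not of join order.
    static String primaryOwner(int segment, Collection<String> members) {
        return members.stream()
                .max(Comparator.comparingInt((String m) -> Objects.hash(m, segment)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<String> joinOrder1 = List.of("A", "B", "C");
        List<String> joinOrder2 = List.of("C", "A", "B");
        for (int seg = 0; seg < 8; seg++) {
            String o1 = primaryOwner(seg, joinOrder1);
            String o2 = primaryOwner(seg, joinOrder2);
            if (!o1.equals(o2)) {
                throw new AssertionError("ownership depended on join order");
            }
        }
        System.out.println("identical owners for both join orders");
    }
}
```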
[JBoss JIRA] (ISPN-4988) TopologyAwareDistAsyncFuncTest fails with SIGSEGV exception with Azul JDK
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-4988?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-4988:
-------------------------------
Status: Open (was: New)
> TopologyAwareDistAsyncFuncTest fails with SIGSEGV exception with Azul JDK
> -------------------------------------------------------------------------
>
> Key: ISPN-4988
> URL: https://issues.jboss.org/browse/ISPN-4988
> Project: Infinispan
> Issue Type: Bug
> Components: Test Suite - Core
> Affects Versions: 7.0.1.Final
> Reporter: Vitalii Chepeliuk
> Assignee: Dan Berindei
> Labels: testsuite_stability
>
> {noformat}
> Test suite progress: tests succeeded: 2602, failed: 0, skipped: 0.
> 2014-11-16 04:38:48,750 WARN [Configurator] (testng-TopologyAwareDistAsyncFuncTest) JGRP000014: TP.loopback has been deprecated: enabled by default
> 2014-11-16 04:38:48,750 WARN [Configurator] (testng-TopologyAwareDistAsyncFuncTest) JGRP000014: TP.physical_addr_max_fetch_attempts has been deprecated: will be ignored
> 2014-11-16 04:38:48,829 WARN [TCP] (testng-TopologyAwareDistAsyncFuncTest) JGRP000046: bundler_type=old has been removed; using sender-sends-with-timer
> 2014-11-16 04:38:48,945 WARN [Configurator] (testng-TopologyAwareDistAsyncFuncTest) JGRP000014: TP.loopback has been deprecated: enabled by default
> 2014-11-16 04:38:48,945 WARN [Configurator] (testng-TopologyAwareDistAsyncFuncTest) JGRP000014: TP.physical_addr_max_fetch_attempts has been deprecated: will be ignored
> 2014-11-16 04:38:49,023 WARN [TCP] (testng-TopologyAwareDistAsyncFuncTest) JGRP000046: bundler_type=old has been removed; using sender-sends-with-timer
> 2014-11-16 04:38:49,249 WARN [Configurator] (testng-TopologyAwareDistAsyncFuncTest) JGRP000014: TP.loopback has been deprecated: enabled by default
> 2014-11-16 04:38:49,249 WARN [Configurator] (testng-TopologyAwareDistAsyncFuncTest) JGRP000014: TP.physical_addr_max_fetch_attempts has been deprecated: will be ignored
> 2014-11-16 04:38:49,326 WARN [TCP] (testng-TopologyAwareDistAsyncFuncTest) JGRP000046: bundler_type=old has been removed; using sender-sends-with-timer
> Signum: [11] - Exiting due to unhandled SIGSEGV exception.
> 0: rip=0x0000000020d9eb48 @rip=[0xffffffffffffffff] (hotspot_os_backtrace_callback+40) [gcc frame, calls gcc]
> 1: rip=0x00007f526459da3f @rip=[0x0000440000bffb38] (os_backtrace+31) [gcc frame, calls gcc]
> 2: rip=0x0000000020d99ef5 @rip=[0x0000440000bffba8] (jvm_unexpected_exception_handler+165) [gcc frame, calls gcc]
> 3: rip=0x00007f526459cf52 @rip=[0x0000440000bffc58] (jvm_unexpected_exception_handler_wrapper+82) [gcc frame, calls gcc]
> 4: rip=0x0000000020838c30 @rip=[0x0000440000bffc78] (GPGC_GCManagerMark::process_mutator_stack(HeapRefBuffer*)+112) [gcc frame, calls gcc]
> 5: rip=0x0000000020876355 @rip=[0x0000440000bffcd8] (void GPGC_MarkAlgorithm::drain_stacks<GPGC_GCManagerOldStrong>(GPGC_GCManagerOldStrong*)+149) [gcc frame, calls gcc]
> 6: rip=0x000000002087748c @rip=[0x0000440000bffda8] (void GPGC_MarkAlgorithm::drain_and_steal_stacks<GPGC_GCManagerOldStrong>(GPGC_GCManagerOldStrong*)+28) [gcc frame, calls gcc]
> 7: rip=0x00000000209ba1ed @rip=[0x0000440000bffe68] (PGCTaskThread::run()+589) [gcc frame, calls gcc]
> 8: rip=0x00007f526459fc49 @rip=[0x0000440000bfff88] (alternate_stack_create+153) [gcc frame, calls gcc]
> #
> # An unexpected error has been detected by Java Runtime Environment:
> #
> # Segmentation fault (0xb) at pc=0x20838c30, pid=25165, tid=25171
> #
> # Java VM: Zing 64-Bit Tiered VM (1.7.0-zing_5.10.1.0-b9-product-azlinuxM-X86_64, mixed mode)
> # Problematic frame:
> # C [libjvm.so+0x438c30] GPGC_GCManagerMark::process_mutator_stack(HeapRefBuffer*)+0x70
> #
> # An error report file with more information is saved as:
> # /qa/hudson_workspace/workspace/jdg-63-ispn-testsuite-rhel-azul/c6098cff/infinispan/core/hs_err_pid25165.log
> 2014-11-16 04:38:49,693 WARN [Configurator] (testng-TopologyAwareDistAsyncFuncTest) JGRP000014: TP.loopback has been deprecated: enabled by default
> 2014-11-16 04:38:49,693 WARN [Configurator] (testng-TopologyAwareDistAsyncFuncTest) JGRP000014: TP.physical_addr_max_fetch_attempts has been deprecated: will be ignored
> #
> # If you would like to submit a bug report, please visit:
> # http://www.azulsystems.com/support/
> #
> {noformat}
> More info from the Jenkins jobs here:
> https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/jdg-63-ispn-testsuit...
> https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/jdg-63-ispn-testsuit...
[JBoss JIRA] (ISPN-3421) Transaction is sometimes not applied on all owners if originator dies during commit
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-3421?page=com.atlassian.jira.plugin.... ]
Dan Berindei reassigned ISPN-3421:
----------------------------------
Assignee: Dan Berindei
> Transaction is sometimes not applied on all owners if originator dies during commit
> -----------------------------------------------------------------------------------
>
> Key: ISPN-3421
> URL: https://issues.jboss.org/browse/ISPN-3421
> Project: Infinispan
> Issue Type: Bug
> Components: State Transfer
> Affects Versions: 5.2.7.Final
> Reporter: Erik Salter
> Assignee: Dan Berindei
> Priority: Critical
>
> There's a hole in the state transfer mechanism that can occur when a node leaving the cluster was creating entries and was only able to replicate the data to some of the nodes.
> The problem occurs when the segment ownership of a surviving node doesn't change after the rebalance. Since state transfer does not request state for segments the node already owns, the cache can be left in a state where a key is resident on fewer than numOwners nodes. In addition, the missing copies could be on any subset of the primary OR backup nodes.
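A toy model of that gap (plain Java, not Infinispan code; node names are illustrative): the dying originator committed the key to only one of the surviving owners, and because the survivors' segment ownership is unchanged by the rebalance, neither requests state for the segment, so the key stays on fewer than numOwners nodes:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StateTransferGapSketch {
    // State transfer only pulls segments a node did NOT already own, so nodes
    // whose ownership is unchanged never fetch the missing key.
    static int copiesAfterRebalance(Set<String> copies,
                                    List<String> ownersBefore,
                                    List<String> ownersAfter) {
        Set<String> result = new HashSet<>(copies);
        for (String node : ownersAfter) {
            if (!ownersBefore.contains(node)) {
                result.add(node); // only new owners request the segment's state
            }
        }
        return result.size();
    }

    public static void main(String[] args) {
        int numOwners = 2;
        int copies = copiesAfterRebalance(
                Set.of("A"),                     // only A received the commit
                List.of("A", "B", "originator"), // owners before the crash
                List.of("A", "B"));              // owners after; unchanged for A and B
        System.out.println("copies=" + copies + ", numOwners=" + numOwners);
        // copies stays below numOwners: B never fetches the key.
    }
}
```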
[JBoss JIRA] (ISPN-4908) Clustered cache with FileStore (shared=false) is inconsistent after restarting one node if entries are deleted during restart
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-4908?page=com.atlassian.jira.plugin.... ]
Dan Berindei updated ISPN-4908:
-------------------------------
Status: Open (was: New)
> Clustered cache with FileStore (shared=false) is inconsistent after restarting one node if entries are deleted during restart
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: ISPN-4908
> URL: https://issues.jboss.org/browse/ISPN-4908
> Project: Infinispan
> Issue Type: Bug
> Environment: Clustered REPL cache, preloaded, no eviction/expiration
> Reporter: Wolf-Dieter Fink
> Assignee: William Burns
>
> If a cache instance with a cache store is down and the cache is changed before the instance comes back and rejoins the cluster, the cache can become inconsistent.
> If entries are deleted during the downtime:
> - the FileStore with the stale objects is loaded first if preload=true
> - the local entries are updated with new and changed objects from the cluster
> - entries removed from the cluster are not seen and are therefore not deleted
> After the complete sync, (only) this instance will have stale objects.
> From a consistency and performance perspective, the FileStore should be pruned on cluster join by default in this case.
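A hedged sketch of the pruning the last sentence suggests, using Infinispan's programmatic configuration API (an illustrative fragment, not a tested fix; check the builder names against the version in use): purging the non-shared store on startup discards potentially stale local state, which the node then repopulates via state transfer from the cluster.

```java
// Illustrative fragment, not a tested fix: drop the local store's contents on
// startup so deleted entries cannot be resurrected by preload after a restart.
Configuration cfg = new ConfigurationBuilder()
   .clustering().cacheMode(CacheMode.REPL_SYNC)
   .persistence()
      .addSingleFileStore()
         .shared(false)
         .preload(false)          // nothing useful to preload once purged
         .purgeOnStartup(true)    // discard potentially stale local state
   .build();
```

The trade-off is a full re-transfer of state on every restart, which is why it may only be appropriate for non-shared stores in replicated caches like the one described here.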
[JBoss JIRA] (ISPN-3811) Initial ST leaves node as member without data after MERGE
by Dan Berindei (JIRA)
[ https://issues.jboss.org/browse/ISPN-3811?page=com.atlassian.jira.plugin.... ]
Dan Berindei resolved ISPN-3811.
--------------------------------
Assignee: Dan Berindei
Fix Version/s: 7.0.0.Final
Resolution: Done
I believe I have actually fixed this with the partition handling work: we no longer use {{CHF.union}} after a merge, and we have special precautions for ReplicatedCHF.
> Initial ST leaves node as member without data after MERGE
> ---------------------------------------------------------
>
> Key: ISPN-3811
> URL: https://issues.jboss.org/browse/ISPN-3811
> Project: Infinispan
> Issue Type: Bug
> Components: State Transfer
> Affects Versions: 6.0.0.Final
> Reporter: Radim Vansa
> Assignee: Dan Berindei
> Priority: Blocker
> Labels: testsuite_stability
> Fix For: 7.0.0.Final
>
>
> Under certain circumstances, JGroups can issue a MERGE view when a node is joining the cache. The new node joins the cluster, and all nodes have the same cache topology (not containing the joiner yet).
> During the merge, the CHs are joined (through CHFactory.union) and, as all nodes report the same topology/hash, the resulting hash is identical. However, the joiner is added to the members list, and it can therefore finish the initial state transfer even though no data has been assigned to it.
> Later, the coordinator starts a rebalance and the node begins to receive some data, but the thread that started the cluster manager (and that should wait until the cluster becomes properly replicated through the initial ST) has already been released.