[JBoss JIRA] (ISPN-9588) JGroups fails to install new cluster view after coordinator leave
by Galder Zamarreño (Jira)
[ https://issues.jboss.org/browse/ISPN-9588?page=com.atlassian.jira.plugin.... ]
Galder Zamarreño updated ISPN-9588:
-----------------------------------
Fix Version/s: 10.0.0.Beta1
(was: 10.0.0.Alpha2)
> JGroups fails to install new cluster view after coordinator leave
> -----------------------------------------------------------------
>
> Key: ISPN-9588
> URL: https://issues.jboss.org/browse/ISPN-9588
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 9.4.0.CR3
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 10.0.0.Beta1
>
> Attachments: ISPN-9127_ScatteredCrashInSequenceTest-infinispan-core.log.gz, ISPN-9127_ThreeWaySplitAndMergeTest-infinispan-core.log.gz
>
>
> The core test suite normally shuts down cache managers in the reverse order of their start, so that the coordinator stops last. This should be just an optimization, to reduce the number of coordinator changes, and with it the test suite duration and log size.
> Unfortunately, the few tests that stop the coordinator first are routinely failing to stop the cluster properly, usually when there are 2 nodes left in the cluster the 2nd node doesn't receive the view:
> {noformat}
> 10:08:01,656 DEBUG (jgroups-26,Test-NodeB-20661:[]) [GMS] Test-NodeB-20661: installing view [Test-NodeC-25266|33] (3) [Test-NodeC-25266, Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,862 TRACE (testng-Test:[]) [GMS] Test-NodeC-25266: sending LEAVE request to Test-NodeA-8936
> 10:08:01,863 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: members are (3) Test-NodeC-25266,Test-NodeA-8936,Test-NodeB-20661, coord=Test-NodeA-8936: I'm the new coordinator
> 10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: joiners=[], suspected=[], leaving=[Test-NodeC-25266], new view: [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: mcasting view [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,900 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: installing view [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:02,245 TRACE (testng-Test:[]) [JGroupsTransport] Test-NodeB-20661 sending request 140 to Test-NodeC-25266: CacheTopologyControlCommand{cache=org.infinispan.CONFIG, type=LEAVE, sender=Test-NodeB-20661, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=33}
> 10:12:02,247 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Error sending the leave request for cache org.infinispan.CONFIG to coordinator
> org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 140 from 7d5d965d-eacc-fd47-2ab8-5cec95f4c10b
> {noformat}
> Sometimes the new coordinator gets stuck in a merge loop, keeps trying to re-include the stopped node and fails:
> {noformat}
> 10:06:21,484 DEBUG (jgroups-24,Test-NodeC-57941:[]) [GMS] Test-NodeC-57941: installing view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
> 10:06:21,486 DEBUG (jgroups-25,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
> 10:06:22,070 TRACE (testng-Test:[]) [GMS] Test-NodeD-37696: sending LEAVE request to Test-NodeB-39137
> 10:06:22,078 TRACE (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:22,080 DEBUG (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:30,887 DEBUG (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: I will be the merge leader. Starting the merge task. Views: {Test-NodeB-39137=[Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941], Test-NodeC-57941=[Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]}
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] suspending message garbage collection
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] Test-NodeB-39137: resume task started, max_suspend_time=220000
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge task Test-NodeB-39137::7 started with 2 participants
> 10:06:30,888 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: sending MERGE_REQ to [Test-NodeD-37696, Test-NodeB-39137]
> 10:06:36,041 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: collected 1 merge response(s) in 5153 ms
> 10:06:36,041 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:06:36,043 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: consolidated view=MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> consolidated digest=Test-NodeB-39137: [15 (16)], Test-NodeC-57941: [5 (5)]
> 10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing merge view [Test-NodeB-39137|10] (2 members) in 1 coords
> 10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge Test-NodeB-39137::7 took 5155 ms
> 10:06:36,043 TRACE (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:49,877 DEBUG (MergeTask-229,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:03,755 DEBUG (MergeTask-342,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:17,703 DEBUG (MergeTask-447,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:31,547 DEBUG (MergeTask-568,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:45,381 DEBUG (MergeTask-688,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:59,196 DEBUG (MergeTask-806,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:13,031 DEBUG (MergeTask-932,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:26,885 DEBUG (MergeTask-1061,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:40,697 DEBUG (MergeTask-1197,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:54,510 DEBUG (MergeTask-1330,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:08,324 DEBUG (MergeTask-1463,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:22,135 DEBUG (MergeTask-1596,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:35,953 DEBUG (MergeTask-1729,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:49,770 DEBUG (MergeTask-1857,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:03,609 DEBUG (MergeTask-1986,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:17,420 DEBUG (MergeTask-2117,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:31,252 DEBUG (MergeTask-2248,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:45,099 DEBUG (MergeTask-2383,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:58,916 DEBUG (MergeTask-2516,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:11:12,743 DEBUG (MergeTask-2648,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> {noformat}
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
7 years, 4 months
[JBoss JIRA] (ISPN-9699) Cluster member owning no data
by Galder Zamarreño (Jira)
[ https://issues.jboss.org/browse/ISPN-9699?page=com.atlassian.jira.plugin.... ]
Galder Zamarreño updated ISPN-9699:
-----------------------------------
Fix Version/s: 10.0.0.Beta1
(was: 10.0.0.Alpha2)
> Cluster member owning no data
> -----------------------------
>
> Key: ISPN-9699
> URL: https://issues.jboss.org/browse/ISPN-9699
> Project: Infinispan
> Issue Type: Feature Request
> Components: Core
> Affects Versions: 9.4.1.Final
> Reporter: Thomas Segismont
> Assignee: Katia Aresti
> Priority: Major
> Fix For: 10.0.0.Beta1, 9.4.4.Final
>
>
> Currently, you can set {{capacity-factor}} to zero on a cache if you don't want a node to own any segment.
> If you want a node to own no data at all, you could set this property on all declared caches. But this wouldn't work for internal caches anyway (locks, counters).
> It would be nice to have a global switch.
> This would be useful when your app needs to be a cluster member for discovery/membership but is deployed mostly for processing. Indeed, when such nodes are added/removed from the cluster:
> * there would be no data loss
> * the cluster would be back in healthy state faster as no data would be moved around.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
7 years, 4 months
[JBoss JIRA] (ISPN-9693) Make MarshalledEntryImpl private and utilise MarshalledEntryFactory instead
by Galder Zamarreño (Jira)
[ https://issues.jboss.org/browse/ISPN-9693?page=com.atlassian.jira.plugin.... ]
Galder Zamarreño updated ISPN-9693:
-----------------------------------
Fix Version/s: 10.0.0.Beta1
(was: 10.0.0.Alpha2)
> Make MarshalledEntryImpl private and utilise MarshalledEntryFactory instead
> ---------------------------------------------------------------------------
>
> Key: ISPN-9693
> URL: https://issues.jboss.org/browse/ISPN-9693
> Project: Infinispan
> Issue Type: Enhancement
> Components: Core, Loaders and Stores
> Affects Versions: 9.4.1.Final
> Reporter: Ryan Emerson
> Assignee: Ryan Emerson
> Priority: Major
> Fix For: 10.0.0.Beta1
>
>
> Currently MarshalledEntryImpl is used throughout both our production and test code. Instead, we should utilise a MarshalledEntryFactory to create all MarshalledEntry instances in production and where appropriate in test classes. Furthermore, we should prevent MarshalledEntryImpl instances being created directly by users.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
7 years, 4 months
[JBoss JIRA] (ISPN-9640) Reduce verbosity of hotrod cache configurations
by Galder Zamarreño (Jira)
[ https://issues.jboss.org/browse/ISPN-9640?page=com.atlassian.jira.plugin.... ]
Galder Zamarreño updated ISPN-9640:
-----------------------------------
Fix Version/s: 10.0.0.Beta1
(was: 10.0.0.Alpha2)
> Reduce verbosity of hotrod cache configurations
> -----------------------------------------------
>
> Key: ISPN-9640
> URL: https://issues.jboss.org/browse/ISPN-9640
> Project: Infinispan
> Issue Type: Enhancement
> Components: Hot Rod
> Affects Versions: 9.4.0.Final
> Reporter: Ryan Emerson
> Priority: Major
> Fix For: 10.0.0.Beta1
>
>
> Currently when defining a cache configuration via `RemoteCacheManagerAdmin::createCache` and `XMLStringConfiguration` it's necessary for the xml to include the `<infinispan><cache-container>..<infinispan/><cache-container/>` elements. This is unnecessarily verbose, and it should be possible for a user to simply provide the cache config XML, e.g. `<distributed-cache name="example"/>`
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
7 years, 4 months
[JBoss JIRA] (ISPN-9632) PersistenceManager should be able to start if store's are unavailable
by Galder Zamarreño (Jira)
[ https://issues.jboss.org/browse/ISPN-9632?page=com.atlassian.jira.plugin.... ]
Galder Zamarreño updated ISPN-9632:
-----------------------------------
Fix Version/s: 10.0.0.Beta1
(was: 10.0.0.Alpha2)
> PersistenceManager should be able to start if store's are unavailable
> ---------------------------------------------------------------------
>
> Key: ISPN-9632
> URL: https://issues.jboss.org/browse/ISPN-9632
> Project: Infinispan
> Issue Type: Enhancement
> Components: Loaders and Stores
> Affects Versions: 9.4.0.Final
> Reporter: Ryan Emerson
> Assignee: Ryan Emerson
> Priority: Major
> Fix For: 10.0.0.Beta1
>
>
> Currently it is necessary for a store to become available before `connection-attempts` are exceeded, otherwise cache startup fails. It should be possible for the PM to still start without the store being available, with the store's unavailability logged and operations requiring persistence still throwing StoreUnavailableException messages unless `SKIP_CACHE_STORE`/`SKIP_CACHE_LOAD` flags are present.
> Cache startup should still fail if preload=true as this behaviour is not possible if the store is unavailable.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
7 years, 4 months