[JBoss JIRA] (ISPN-9588) JGroups fails to install new cluster view after coordinator leave
by Tristan Tarrant (Jira)
[ https://issues.jboss.org/browse/ISPN-9588?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-9588:
----------------------------------
Sprint: Sprint 10.0.0.Alpha1, Sprint 10.0.0.Alpha2, Sprint 10.0.0.Alpha0, Sprint 10.0.0.Beta1, DataGrid Sprint #31, DataGrid Sprint #32 (was: Sprint 10.0.0.Alpha1, Sprint 10.0.0.Alpha2, Sprint 10.0.0.Alpha0, Sprint 10.0.0.Beta1, DataGrid Sprint #31)
> JGroups fails to install new cluster view after coordinator leave
> -----------------------------------------------------------------
>
> Key: ISPN-9588
> URL: https://issues.jboss.org/browse/ISPN-9588
> Project: Infinispan
> Issue Type: Bug
> Components: Core, Test Suite - Core
> Affects Versions: 9.4.0.CR3
> Reporter: Dan Berindei
> Assignee: Dan Berindei
> Priority: Major
> Fix For: 9.4.7.Final, 10.0.0.Beta1
>
> Attachments: ISPN-9127_ScatteredCrashInSequenceTest-infinispan-core.log.gz, ISPN-9127_ThreeWaySplitAndMergeTest-infinispan-core.log.gz, ThreeWaySplitAndMergeTest.clearContent.log.gz
>
>
> The core test suite normally shuts down cache managers in the reverse order of their start, so that the coordinator stops last. This should be just an optimization, to reduce the number of coordinator changes, and with it the test suite duration and log size.
> Unfortunately, the few tests that stop the coordinator first are routinely failing to stop the cluster properly, usually when there are 2 nodes left in the cluster the 2nd node doesn't receive the view:
> {noformat}
> 10:08:01,656 DEBUG (jgroups-26,Test-NodeB-20661:[]) [GMS] Test-NodeB-20661: installing view [Test-NodeC-25266|33] (3) [Test-NodeC-25266, Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,862 TRACE (testng-Test:[]) [GMS] Test-NodeC-25266: sending LEAVE request to Test-NodeA-8936
> 10:08:01,863 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: members are (3) Test-NodeC-25266,Test-NodeA-8936,Test-NodeB-20661, coord=Test-NodeA-8936: I'm the new coordinator
> 10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: joiners=[], suspected=[], leaving=[Test-NodeC-25266], new view: [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,896 TRACE (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: mcasting view [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:01,900 DEBUG (jgroups-6,Test-NodeA-8936:[]) [GMS] Test-NodeA-8936: installing view [Test-NodeA-8936|34] (2) [Test-NodeA-8936, Test-NodeB-20661]
> 10:08:02,245 TRACE (testng-Test:[]) [JGroupsTransport] Test-NodeB-20661 sending request 140 to Test-NodeC-25266: CacheTopologyControlCommand{cache=org.infinispan.CONFIG, type=LEAVE, sender=Test-NodeB-20661, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=33}
> 10:12:02,247 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Error sending the leave request for cache org.infinispan.CONFIG to coordinator
> org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 140 from 7d5d965d-eacc-fd47-2ab8-5cec95f4c10b
> {noformat}
> Sometimes the new coordinator gets stuck in a merge loop, keeps trying to re-include the stopped node and fails:
> {noformat}
> 10:06:21,484 DEBUG (jgroups-24,Test-NodeC-57941:[]) [GMS] Test-NodeC-57941: installing view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
> 10:06:21,486 DEBUG (jgroups-25,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing view [Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]
> 10:06:22,070 TRACE (testng-Test:[]) [GMS] Test-NodeD-37696: sending LEAVE request to Test-NodeB-39137
> 10:06:22,078 TRACE (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:22,080 DEBUG (jgroups-10,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing view [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:30,887 DEBUG (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: I will be the merge leader. Starting the merge task. Views: {Test-NodeB-39137=[Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941], Test-NodeC-57941=[Test-NodeD-37696|8] (3) [Test-NodeD-37696, Test-NodeB-39137, Test-NodeC-57941]}
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] suspending message garbage collection
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [STABLE] Test-NodeB-39137: resume task started, max_suspend_time=220000
> 10:06:30,888 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge task Test-NodeB-39137::7 started with 2 participants
> 10:06:30,888 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: sending MERGE_REQ to [Test-NodeD-37696, Test-NodeB-39137]
> 10:06:36,041 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: collected 1 merge response(s) in 5153 ms
> 10:06:36,041 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:06:36,043 TRACE (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: consolidated view=MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> consolidated digest=Test-NodeB-39137: [15 (16)], Test-NodeC-57941: [5 (5)]
> 10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: installing merge view [Test-NodeB-39137|10] (2 members) in 1 coords
> 10:06:36,043 DEBUG (MergeTask-108,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge Test-NodeB-39137::7 took 5155 ms
> 10:06:36,043 TRACE (jgroups-24,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: mcasting view MergeView::[Test-NodeB-39137|10] (2) [Test-NodeB-39137, Test-NodeC-57941], 1 subgroups: [Test-NodeB-39137|9] (2) [Test-NodeB-39137, Test-NodeC-57941]
> 10:06:49,877 DEBUG (MergeTask-229,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:03,755 DEBUG (MergeTask-342,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:17,703 DEBUG (MergeTask-447,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:31,547 DEBUG (MergeTask-568,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:45,381 DEBUG (MergeTask-688,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:07:59,196 DEBUG (MergeTask-806,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:13,031 DEBUG (MergeTask-932,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:26,885 DEBUG (MergeTask-1061,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:40,697 DEBUG (MergeTask-1197,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:08:54,510 DEBUG (MergeTask-1330,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:08,324 DEBUG (MergeTask-1463,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:22,135 DEBUG (MergeTask-1596,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:35,953 DEBUG (MergeTask-1729,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:09:49,770 DEBUG (MergeTask-1857,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:03,609 DEBUG (MergeTask-1986,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:17,420 DEBUG (MergeTask-2117,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:31,252 DEBUG (MergeTask-2248,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:45,099 DEBUG (MergeTask-2383,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:10:58,916 DEBUG (MergeTask-2516,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> 10:11:12,743 DEBUG (MergeTask-2648,Test-NodeB-39137:[]) [GMS] Test-NodeB-39137: merge leader Test-NodeB-39137 did not get responses from all 2 partition coordinators; missing responses from 1 members, removing them from the merge
> {noformat}
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 5 months
[JBoss JIRA] (ISPN-9382) Jenkins release pipeline should handle multiple branches and reversioning
by Tristan Tarrant (Jira)
[ https://issues.jboss.org/browse/ISPN-9382?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-9382:
----------------------------------
Sprint: Sprint 9.4.0.Beta1, Sprint 9.4.0.CR1, Sprint 9.4.0.CR3, Sprint 10.0.0.Alpha1, Sprint 10.0.0.Alpha2, Sprint 9.4.0.Final, Sprint 10.0.0.Alpha0, Sprint 10.0.0.Beta1, DataGrid Sprint #31, DataGrid Sprint #32 (was: Sprint 9.4.0.Beta1, Sprint 9.4.0.CR1, Sprint 9.4.0.CR3, Sprint 10.0.0.Alpha1, Sprint 10.0.0.Alpha2, Sprint 9.4.0.Final, Sprint 10.0.0.Alpha0, Sprint 10.0.0.Beta1, DataGrid Sprint #31)
> Jenkins release pipeline should handle multiple branches and reversioning
> -------------------------------------------------------------------------
>
> Key: ISPN-9382
> URL: https://issues.jboss.org/browse/ISPN-9382
> Project: Infinispan
> Issue Type: Enhancement
> Components: Build
> Reporter: Tristan Tarrant
> Assignee: Tristan Tarrant
> Priority: Major
> Fix For: 10.0.0.Final
>
>
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 5 months
[JBoss JIRA] (ISPN-9827) The Hot Rod client should apply topology changes after the response has been processed
by Tristan Tarrant (Jira)
[ https://issues.jboss.org/browse/ISPN-9827?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-9827:
----------------------------------
Sprint: Sprint 10.0.0.Beta1, DataGrid Sprint #31, DataGrid Sprint #32 (was: Sprint 10.0.0.Beta1, DataGrid Sprint #31)
> The Hot Rod client should apply topology changes after the response has been processed
> --------------------------------------------------------------------------------------
>
> Key: ISPN-9827
> URL: https://issues.jboss.org/browse/ISPN-9827
> Project: Infinispan
> Issue Type: Bug
> Affects Versions: 10.0.0.Alpha2, 9.4.4.Final
> Reporter: Tristan Tarrant
> Assignee: Tristan Tarrant
> Priority: Major
>
> The Hot Rod client applies topology changes to the connection pool while it is processing the header. This could cause the channel on which it is reading to be closed before the response has been completely processed. Topology changes should be applied when the operation completes.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 5 months
[JBoss JIRA] (ISPN-2445) Safe shutdown - transfer data to other nodes
by Tristan Tarrant (Jira)
[ https://issues.jboss.org/browse/ISPN-2445?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-2445:
----------------------------------
Sprint: Sprint 9.4.0.Beta1, Sprint 9.4.0.CR1, Sprint 9.4.0.CR3, Sprint 10.0.0.Alpha1, Sprint 10.0.0.Alpha2, Sprint 10.0.0.Beta1, DataGrid Sprint #31, DataGrid Sprint #32 (was: Sprint 9.4.0.Beta1, Sprint 9.4.0.CR1, Sprint 9.4.0.CR3, Sprint 10.0.0.Alpha1, Sprint 10.0.0.Alpha2, Sprint 10.0.0.Beta1, DataGrid Sprint #31)
> Safe shutdown - transfer data to other nodes
> --------------------------------------------
>
> Key: ISPN-2445
> URL: https://issues.jboss.org/browse/ISPN-2445
> Project: Infinispan
> Issue Type: Feature Request
> Components: Core
> Reporter: Matej Lazar
> Assignee: Dan Berindei
> Priority: Major
>
> Distributed cache should provide an configuration option to block shutdown until data from shutting down node is transferred to other nodes.
> This feature is useful if Infinispan is used as persistent storage.
> Without this option, in a cloud environment, where server node is deleted on scale down event, users are forced to use external storage, if they don't want to loose data.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 5 months
[JBoss JIRA] (ISPN-10429) Hot Rod client doesn't retry on socket timeout
by Tristan Tarrant (Jira)
[ https://issues.jboss.org/browse/ISPN-10429?page=com.atlassian.jira.plugin... ]
Tristan Tarrant updated ISPN-10429:
-----------------------------------
Sprint: DataGrid Sprint #31, DataGrid Sprint #32 (was: DataGrid Sprint #31)
> Hot Rod client doesn't retry on socket timeout
> ----------------------------------------------
>
> Key: ISPN-10429
> URL: https://issues.jboss.org/browse/ISPN-10429
> Project: Infinispan
> Issue Type: Bug
> Components: Hot Rod
> Affects Versions: 10.0.0.Beta4, 9.4.15.Final
> Reporter: Tristan Tarrant
> Assignee: Tristan Tarrant
> Priority: Major
> Fix For: 10.0.0.Beta5
>
>
> The Hot Rod client registers a timeout handler for operations using the socket timeout. However when this timeout is hit, the retry logic is not invoked causing requests to dead servers to fail immediately without retry.
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 5 months
[JBoss JIRA] (ISPN-8535) Rest API redesign
by Tristan Tarrant (Jira)
[ https://issues.jboss.org/browse/ISPN-8535?page=com.atlassian.jira.plugin.... ]
Tristan Tarrant updated ISPN-8535:
----------------------------------
Sprint: Sprint 10.0.0.Alpha1, Sprint 10.0.0.Alpha2, Sprint 10.0.0.Beta1, DataGrid Sprint #30, DataGrid Sprint #31, DataGrid Sprint #32 (was: Sprint 10.0.0.Alpha1, Sprint 10.0.0.Alpha2, Sprint 10.0.0.Beta1, DataGrid Sprint #30, DataGrid Sprint #31)
> Rest API redesign
> -----------------
>
> Key: ISPN-8535
> URL: https://issues.jboss.org/browse/ISPN-8535
> Project: Infinispan
> Issue Type: Feature Request
> Components: Server
> Affects Versions: 9.2.0.Final
> Reporter: Gustavo Fernandes
> Assignee: Gustavo Fernandes
> Priority: Major
> Fix For: 10.0.0.Final
>
>
> Infinispan REST API deals with only one resource, (the cache) and maps all operations on the cache using HTTP verbs and request parameters. The API assumes the URI is related to the cache making it hard to add new kinds of resources without causing ambiguous references.
> Since we now have other types of entities, such as counters, scripts, templates, etc, and each one of them can involve different operations, we should make the API more "Restful" by using more than one resource and collections of resources, plus HTTP verbs and operations on them. Examples:
> * Create a cache: POST /rest/caches
> * Delete a cache: DELETE /rest/caches/myCache
> * Create a template: POST /rest/templates
> * Delete a template: DELETE /rest/templates/myTemplate
> * Create an entry: POST /rest/caches/myCache/1
> * Create a counter: POST /rest/counters
> * Get a counter value: GET /rest/counters/mycounter
> * Increment a counter: GET /rest/counters/mycounter?action=increment
> * Search a cache: GET /rest/caches/myCache?action=search
> * Create a script: POST /rest/scripts/
> * Get a script source: GET /rest/scritps/myScript
> * Execute a script: GET /rest/scripts/myScript?action=execute¶m1=foo
> * Stop individual servers and the full cluster
>
--
This message was sent by Atlassian Jira
(v7.12.1#712002)
5 years, 5 months