[JBoss JIRA] (ISPN-9157) Improve RELAY/RELAY2 JGroups protocol selection
by Galder Zamarreño (JIRA)
Galder Zamarreño created ISPN-9157:
--------------------------------------
Summary: Improve RELAY/RELAY2 JGroups protocol selection
Key: ISPN-9157
URL: https://issues.jboss.org/browse/ISPN-9157
Project: Infinispan
Issue Type: Enhancement
Components: Cross-Site Replication
Reporter: Galder Zamarreño
When using x-site replication, you can choose between the RELAY and RELAY2 JGroups protocols. That this choice exists is neither obvious nor well documented.
As an example, with SYNC x-site replication and RELAY2 it's easy to end up in this situation: if I define site A to have site B as backup, and site B is simply not up yet, why should a put in site A fail? In a normal cluster, if I have node X and node Y is not up yet, a put on node X does not fail. To make this work, you need to configure RELAY instead of RELAY2.
The RELAY2 design doc is very explicit that it does not want to expose a virtual view, that sites should be configured statically, and that sites which are not up are considered unreachable (https://github.com/belaban/JGroups/blob/master/doc/design/RELAY2.txt).
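The difference in behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `SiteModel`, `Membership`, `STATIC_SITES` and `VIRTUAL_VIEW` are hypothetical names made up for illustration and are not part of the Infinispan or JGroups APIs. It assumes, following the description above, that a statically configured backup site that is not up makes a synchronous put fail, whereas with a virtual view a site that has not joined yet is simply absent.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: none of these types exist in Infinispan or JGroups.
final class SiteModel {
    // Hypothetical names for the two ways of deciding which backup sites must answer.
    enum Membership { STATIC_SITES, VIRTUAL_VIEW }

    private final Membership membership;
    private final Set<String> configuredBackups = new LinkedHashSet<>();
    private final Set<String> sitesUp = new LinkedHashSet<>();
    private final Map<String, String> localData = new HashMap<>();

    SiteModel(Membership membership, String... backups) {
        this.membership = membership;
        for (String backup : backups) {
            configuredBackups.add(backup);
        }
    }

    // Marks a backup site as running.
    void siteUp(String site) {
        sitesUp.add(site);
    }

    // Synchronous put: throws if a statically configured backup site is not up.
    void put(String key, String value) {
        for (String backup : configuredBackups) {
            boolean up = sitesUp.contains(backup);
            if (!up && membership == Membership.STATIC_SITES) {
                throw new IllegalStateException("Site " + backup + " is unreachable");
            }
            // VIRTUAL_VIEW: a site that has not joined is not in the view, so it is skipped.
        }
        localData.put(key, value);
    }

    String get(String key) {
        return localData.get(key);
    }
}
```

With `STATIC_SITES`, a put on site A fails while site B is down; with `VIRTUAL_VIEW`, the same put succeeds locally, much like a put on node X succeeds before node Y has joined a normal cluster.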
--
This message was sent by Atlassian JIRA
(v7.5.0#75005)
7 years, 10 months
[JBoss JIRA] (ISPN-9111) Internal caches should be replicated across sites
by Galder Zamarreño (JIRA)
[ https://issues.jboss.org/browse/ISPN-9111?page=com.atlassian.jira.plugin.... ]
Galder Zamarreño updated ISPN-9111:
-----------------------------------
Description:
Given a cache manager, we should look for all enabled x-site locations and add those sites as SYNC backups for the protobuf metadata cache. Without this data, the user has to implement their own code to make sure the data is added in each site, which is troublesome.
Using the SYNC/FAIL combo turns out to be very buggy. In the initial test created, only one site was up and the other was not. The put call to replicate the metadata was failing (as a result of ISPN-9113), but this was going under the radar (more tests needed!), and the call ended up waiting for the replication timeout to expire.
Even after the replication timeout happened, the put call completed fine. This is because invocation batching was enabled for the protobuf metadata cache, which means update failures would not make the cache operations fail.
was:
Given a cache manager, we should look for all enabled x-site locations and add those sites as SYNC backups for the protobuf metadata cache. Without this data, the user has to implement their own code to make sure the data is added in each site, which is troublesome.
Using the SYNC/FAIL combo turns out to be very buggy. In the initial test created, only one site was up and the other was not. The put call to replicate the metadata was failing (as a result of ISPN-9113), but this was going under the radar (more tests needed!), and the call ended up waiting for the replication timeout to expire.
Even after the replication timeout happened, the put call completed fine. This is because invocation batching was enabled for the protobuf metadata cache, which means update failures would not make the cache operations fail. It’s unclear whether this is a bug in invocation batching itself, or whether it’s the combination of invocation batching being enabled and the location where x-site backup replication is called. This can easily be reproduced by modifying JGroupsTransport.ChannelCallbacks.up to throw a runtime exception when dealing with the SITE_UNREACHABLE event, and then executing ProtobufMetadataXSiteStateTransferTest.
> Internal caches should be replicated across sites
> -------------------------------------------------
>
> Key: ISPN-9111
> URL: https://issues.jboss.org/browse/ISPN-9111
> Project: Infinispan
> Issue Type: Enhancement
> Components: Cross-Site Replication, Remote Querying
> Reporter: Galder Zamarreño
> Assignee: Galder Zamarreño
> Labels: redhat-summit-18
>
> Given a cache manager, we should look for all enabled x-site locations and add those sites as SYNC backups for the protobuf metadata cache. Without this data, the user has to implement their own code to make sure the data is added in each site, which is troublesome.
> Using the SYNC/FAIL combo turns out to be very buggy. In the initial test created, only one site was up and the other was not. The put call to replicate the metadata was failing (as a result of ISPN-9113), but this was going under the radar (more tests needed!), and the call ended up waiting for the replication timeout to expire.
> Even after the replication timeout happened, the put call completed fine. This is because invocation batching was enabled for the protobuf metadata cache, which means update failures would not make the cache operations fail.
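The symptom described above, a put that completes fine even though the backup update failed, can be sketched as follows. This is a toy illustration under a loud assumption: it simply models a batching wrapper that catches and records the failure instead of propagating it. Where the failure is actually lost in Infinispan is not established here, and `BatchingSketch` and `BackupSender` are hypothetical names that are not Infinispan classes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only; this is not how Infinispan implements invocation batching.
final class BatchingSketch {
    // Hypothetical stand-in for the call that ships an update to a backup site.
    interface BackupSender {
        void sendToBackup(String key, String value);
    }

    private final BackupSender sender;
    private final boolean batching;
    // Failures that were caught and never reached the caller.
    final List<String> swallowed = new ArrayList<>();

    BatchingSketch(BackupSender sender, boolean batching) {
        this.sender = sender;
        this.batching = batching;
    }

    void put(String key, String value) {
        if (!batching) {
            // Without batching, a backup failure propagates to the caller.
            sender.sendToBackup(key, value);
            return;
        }
        try {
            // Assumption: with batching, the backup update runs inside a wrapper
            // that does not rethrow, so the caller never sees the failure.
            sender.sendToBackup(key, value);
        } catch (RuntimeException e) {
            swallowed.add(e.getMessage());
        }
    }
}
```

A test that only checks whether `put` returns normally passes in the batching case, which is how such a failure can go under the radar.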
[JBoss JIRA] (ISPN-9113) SITE_UNREACHABLE not handled by JGroupsTransport
by Galder Zamarreño (JIRA)
[ https://issues.jboss.org/browse/ISPN-9113?page=com.atlassian.jira.plugin.... ]
Galder Zamarreño updated ISPN-9113:
-----------------------------------
Labels: redhat-summit-18 (was: )
> SITE_UNREACHABLE not handled by JGroupsTransport
> ------------------------------------------------
>
> Key: ISPN-9113
> URL: https://issues.jboss.org/browse/ISPN-9113
> Project: Infinispan
> Issue Type: Bug
> Components: Cross-Site Replication
> Affects Versions: 9.2.2.Final, 9.3.0.Alpha1
> Reporter: Galder Zamarreño
> Assignee: Dan Berindei
> Labels: redhat-summit-18
> Fix For: 9.3.0.Beta1
>
>
> If a user defines a site with SYNC x-site replication and the site is unavailable, requests will time out instead of failing quickly. See below for a summary of the discussion:
> {code}
> Galder: @Dan Berindei @Pedro Ruivo @Bela Ban Any thoughts on my dev thread?
> Galder: In essence, @Dan Berindei you made ChannelCallbacks implement UpHandler, but JChannel.invokeCallback()
> won't pass any events to the receiver instance variable because it doesn't check whether the receiver is an UpHandler //cc @Tristan
> Bela Ban: @Galder yes, this won't work
> Bela Ban: @Galder You need to call RELAY2.setRouterStatusListener() directly
> Bela Ban: Implementing this as part of Receiver won't help
> Galder: Don't think RELAY2.setRouterStatusListener() is what I want - I can see RELAY2.handleMessage() passing up the
> stack Event.SITE_UNREACHABLE though, the problem is that there's no handler for that
> Bela Ban: @Galder Yes, but the SITE_UNREACHABLE event is only handled by RequestCorrelator, not by JChannel
> Galder: Ah ok, let me check what that does
> Bela Ban: The thing is that RequestCorrelator is not used anymore (AFAIK), as Infinispan moved from
> MessageDispatcher to JChannel
> Galder: the RequestCorrelator is never called
> Galder: exactly
> Bela Ban: So this is a regression caused by that move then
> Galder: Yeah, that's my feeling too. That's why I was asking Dan about the move to make ChannelCallbacks class an
> UpHandler, because I noticed that happened when the move to JChannel happened
> Galder: The impact of this is the following: if any site is in SYNC and the site is unreachable, you'd get a timeout eventually
> instead of an immediate failure
> Galder: I'm trying to implement auto x-site state transfer for protobuf metadata cache and I cannot do it until this is fixed
> Galder: I'll see if I can get something working with ASYNC, but ASYNC is not a good option for protobuf metadata.
> If a node does not successfully get it, it won't be able to work properly
> {code}
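The difference between timing out and failing fast, as discussed above, can be sketched in plain Java. This is a minimal sketch under stated assumptions, not the actual JGroupsTransport fix: `PendingSiteRequests`, `send` and `siteUnreachable` are hypothetical names, and the only real API used is `java.util.concurrent.CompletableFuture` (`orTimeout` requires Java 9 or later). The idea is that pending synchronous requests to a site are completed exceptionally as soon as an unreachable notification for that site arrives, rather than being left to hit the replication timeout.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical tracker for pending cross-site requests; not Infinispan's actual code.
final class PendingSiteRequests {
    private static final class Pending {
        final String site;
        final CompletableFuture<Void> future;

        Pending(String site, CompletableFuture<Void> future) {
            this.site = site;
            this.future = future;
        }
    }

    private final Map<Long, Pending> pending = new ConcurrentHashMap<>();
    private long nextId; // guarded by 'this'

    // Registers a request to 'site'. Without any other signal, the future
    // fails with a TimeoutException after 'timeoutMillis'.
    synchronized CompletableFuture<Void> send(String site, long timeoutMillis) {
        long id = nextId++;
        CompletableFuture<Void> future =
                new CompletableFuture<Void>().orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
        pending.put(id, new Pending(site, future));
        // Drop the bookkeeping entry however the request ends.
        future.whenComplete((ignored, error) -> pending.remove(id));
        return future;
    }

    // Called when the transport learns that 'site' is unreachable:
    // fail the matching requests immediately instead of waiting for the timeout.
    void siteUnreachable(String site) {
        for (Pending p : pending.values()) {
            if (p.site.equals(site)) {
                p.future.completeExceptionally(
                        new IllegalStateException("Site " + site + " is unreachable"));
            }
        }
    }

    int pendingCount() {
        return pending.size();
    }
}
```

If nothing ever calls `siteUnreachable`, which corresponds to the event not being handled, callers only learn about the failure when the timeout fires.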
[JBoss JIRA] (ISPN-9111) Internal caches should be replicated across sites
by Galder Zamarreño (JIRA)
[ https://issues.jboss.org/browse/ISPN-9111?page=com.atlassian.jira.plugin.... ]
Galder Zamarreño updated ISPN-9111:
-----------------------------------
Description:
Given a cache manager, we should look for all enabled x-site locations and add those sites as SYNC backups for the protobuf metadata cache. Without this data, the user has to implement their own code to make sure the data is added in each site, which is troublesome.
Using the SYNC/FAIL combo turns out to be very buggy. In the initial test created, only one site was up and the other was not. The put call to replicate the metadata was failing (as a result of ISPN-9113), but this was going under the radar (more tests needed!), and the call ended up waiting for the replication timeout to expire.
Even after the replication timeout happened, the put call completed fine. This is because invocation batching was enabled for the protobuf metadata cache, which means update failures would not make the cache operations fail. It’s unclear whether this is a bug in invocation batching itself, or whether it’s the combination of invocation batching being enabled and the location where x-site backup replication is called. This can easily be reproduced by modifying JGroupsTransport.ChannelCallbacks.up to throw a runtime exception when dealing with the SITE_UNREACHABLE event, and then executing ProtobufMetadataXSiteStateTransferTest.
was:
Given a cache manager, we should look for all enabled x-site locations and add those sites as SYNC backups for the protobuf metadata cache. Without this data, the user has to implement their own code to make sure the data is added in each site, which is troublesome.
> Internal caches should be replicated across sites
> -------------------------------------------------
>
> Key: ISPN-9111
> URL: https://issues.jboss.org/browse/ISPN-9111
> Project: Infinispan
> Issue Type: Enhancement
> Components: Cross-Site Replication, Remote Querying
> Reporter: Galder Zamarreño
> Assignee: Galder Zamarreño
> Labels: redhat-summit-18
>
> Given a cache manager, we should look for all enabled x-site locations and add those sites as SYNC backups for the protobuf metadata cache. Without this data, the user has to implement their own code to make sure the data is added in each site, which is troublesome.
> Using the SYNC/FAIL combo turns out to be very buggy. In the initial test created, only one site was up and the other was not. The put call to replicate the metadata was failing (as a result of ISPN-9113), but this was going under the radar (more tests needed!), and the call ended up waiting for the replication timeout to expire.
> Even after the replication timeout happened, the put call completed fine. This is because invocation batching was enabled for the protobuf metadata cache, which means update failures would not make the cache operations fail. It’s unclear whether this is a bug in invocation batching itself, or whether it’s the combination of invocation batching being enabled and the location where x-site backup replication is called. This can easily be reproduced by modifying JGroupsTransport.ChannelCallbacks.up to throw a runtime exception when dealing with the SITE_UNREACHABLE event, and then executing ProtobufMetadataXSiteStateTransferTest.
[JBoss JIRA] (ISPN-9111) Internal caches should be replicated across sites
by Galder Zamarreño (JIRA)
[ https://issues.jboss.org/browse/ISPN-9111?page=com.atlassian.jira.plugin.... ]
Galder Zamarreño updated ISPN-9111:
-----------------------------------
Summary: Internal caches should be replicated across sites (was: Backup sites should be implicitly set on protobuf metadata cache configuration)
> Internal caches should be replicated across sites
> -------------------------------------------------
>
> Key: ISPN-9111
> URL: https://issues.jboss.org/browse/ISPN-9111
> Project: Infinispan
> Issue Type: Enhancement
> Components: Cross-Site Replication, Remote Querying
> Reporter: Galder Zamarreño
> Assignee: Galder Zamarreño
> Labels: redhat-summit-18
>
> Given a cache manager, we should look for all enabled x-site locations and add those sites as SYNC backups for the protobuf metadata cache. Without this data, the user has to implement their own code to make sure the data is added in each site, which is troublesome.