TopologySafe Map / Reduce


Emmanuel Bernard

Thursday, 9 October 2014

7:18 a.m.

Pedro and I have been having discussions with the LEADS guys about their experience of Map / Reduce, especially around stability during topology changes. This ties into the .size() thread you guys have been exchanging on (I could only read part of it).

On the requirements, theirs are pretty straightforward and, I think, what most users would expect. They are fine with inconsistencies for entries created/updated/deleted between the M/R start and end. They are *not* fine with seeing the same key/value several times during the M/R execution. AFAIK this can happen when a topology change occurs.

Here is a proposal. Why not run the M/R job per segment rather than per node? The point is that segments are stable across topology changes. The M/R tasks would then be about iterating over the keys in a given segment.

The M/R request would send the task per segment to each node where the segment is primary. (We can imagine interesting things like sending it to one of the backups for workload optimization purposes, or sending it to both primary and backups and doing comparisons.) The M/R requester would be in an interesting situation. It could detect that a segment M/R never returns and trigger a new computation on a node other than the one initially targeted.

One tricky question around that is when the M/R job stores data in an intermediary state. We need some way to expose the user indirectly to segments so that we can evict per-segment intermediary caches in case of failure or retry.

But before getting ahead of ourselves, what do you think of the general idea? Even without a retry framework, this approach would be more stable than our current per-node approach during topology changes and would improve dependability.

Emmanuel
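To make the "segments are stable" argument concrete, here is a minimal Java sketch. It is not Infinispan's consistent hash: `NUM_SEGMENTS`, `segmentOf` and the round-robin `primaryOwner` rule are hypothetical stand-ins. It only illustrates that a key's segment does not depend on cluster membership, while the node that is primary for that segment can change when a node joins.

```java
import java.util.List;

// Illustrative only: the hash and the ownership rule are stand-ins,
// not Infinispan's consistent hash implementation.
class SegmentStability {
    static final int NUM_SEGMENTS = 8; // hypothetical; fixed for the cache's lifetime

    // A key's segment depends only on the key and the segment count.
    static int segmentOf(Object key) {
        return Math.floorMod(key.hashCode(), NUM_SEGMENTS);
    }

    // The primary owner of a segment depends on the current members (toy round-robin rule).
    static String primaryOwner(int segment, List<String> members) {
        return members.get(segment % members.size());
    }

    public static void main(String[] args) {
        List<String> before = List.of("A", "B");
        List<String> after = List.of("A", "B", "C"); // node C joins
        for (String key : List.of("k1", "k2", "k3", "k4")) {
            int segment = segmentOf(key); // same value before and after the join
            System.out.println(key + " -> segment " + segment
                    + ", primary before=" + primaryOwner(segment, before)
                    + ", primary after=" + primaryOwner(segment, after));
        }
    }
}
```

A per-segment task can therefore be identified by its segment number alone, and re-sent to whichever node is primary after a rehash.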


William Burns

Thursday, 9 October

7:40 a.m.

...

There is no way we can fix this without adding a very strict isolation level like SERIALIZABLE.

...

They are *not* fine with seeing the same key/value several times for the duration of the M/R execution. This AFAIK can happen when a topology change occurs.

This can happen if it was processed on one node and then rehash migrates the entry to another and runs it there.

...

Here is a proposal. Why not run the M/R job not per node but rather per segment? The point is that segments are stable across topology changes. The M/R tasks would then be about iterating over the keys in a given segment. The M/R request would send the task per segments on each node where the segment is primary.

This is exactly what the iterator does today but also watches for rehashes to send the request to a new owner when the segment moves between nodes.

...

(We can imagine interesting things like sending it to one of the backups for workload optimization purposes or sending it to both primary and backups and doing comparisons). The M/R requester would be in an interesting situation. It could detect that a segment M/R never returns and trigger a new computation on another node than the one initially sent. One tricky question around that is when the M/R job stores data in an intermediary state. We need some sort of way to expose the user indirectly to segments so that we can evict per segment intermediary caches in case of failure or retry.

This was one place I was thinking I would need to take special care to look into when doing a conversion like this.

...

But before getting ahead of ourselves, what do you think of the general idea? Even without retry framework, this approach would be more stable than our current per node approach during topology changes and improve dependability.

Doing it solely based on segment would remove the possibility of having duplicates. However without a mechanism to send a new request on rehash it would be possible to only find a subset of values (if a segment is removed while iterating on it).

...

Emmanuel _______________________________________________ infinispan-dev mailing list infinispan-dev(a)lists.jboss.org https://lists.jboss.org/mailman/listinfo/infinispan-dev

Dan Berindei

8:41 a.m.

On Thu, Oct 9, 2014 at 3:40 PM, William Burns <mudokonman(a)gmail.com> wrote:

...

> Actually this was something I was hoping to get to, possibly in the near future. I already have to do https://issues.jboss.org/browse/ISPN-4358, which will require rewriting parts of the distributed entry iterator. In doing so I was planning on breaking this out into a more generic framework where you could run a given operation by segment, guaranteeing it is only run once per entry. In doing so I was thinking I could try to move M/R on top of this to allow it to also be resilient to rehash events.

...

> > One tricky question around that is when the M/R job store data in an intermediary state. We need some sort of way to expose the user indirectly to segments so that we can evict per segment intermediary caches in case of failure or retry.
>
> This was one place I was thinking I would need to take special care to look into when doing a conversion like this.

I'd rather not expose this to the user. Instead, we could split the intermediary values for each key by the source segment, and do the invalidation of the retried segments in our M/R framework (e.g. when we detect that the primary owner at the start of the map/combine phase is not an owner at all at the end). I think we have another problem with the publishing of intermediary values not being idempotent. The default configuration for the intermediate cache is non-transactional, and retrying the put(delta) command after a topology change could add the same intermediate values twice. A transactional intermediary cache should be safe, though, because the tx won't commit on the old owner until the new owner knows about the tx.

...


Pedro Ruivo

4:16 p.m.

On 10/09/2014 04:41 PM, Dan Berindei wrote:

...

> I think we have another problem with the publishing of intermediary values not being idempotent. The default configuration for the intermediate cache is non-transactional, and retrying the put(delta) command after a topology change could add the same intermediate values twice. A transactional intermediary cache should be safe, though, because the tx won't commit on the old owner until the new owner knows about the tx.

Can you elaborate on it?

Anyway, I think the retry mechanism should solve it. If we detect a topology change (during the iteration of segment _i_) and segment _i_ has moved, then we can cancel the iteration, remove all the intermediate values generated from segment _i_ and restart (on the new primary owner).

...


Dan Berindei

Friday, 10 October

2:03 a.m.

On Fri, Oct 10, 2014 at 12:16 AM, Pedro Ruivo <pedro(a)infinispan.org> wrote:

...

> can you elaborate on it?

Say we have a cache with numOwners=2, owners(k) = [A, B].

- C will become the primary owner of k, but for now owners(k) = [A, B, C]
- O sends put(delta) to A (the primary)
- A sends put(delta) to B, C
- B sees a topology change (owners(k) = [C, B]), doesn't apply the delta and replies with an OutdatedTopologyException
- C applies the delta
- A resends put(delta) to C (new primary)
- C sends put(delta) to B, applies the delta again

I think it could be solved with versions, I just wanted to point out that we don't do that now.
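The double application on C can be modelled with a toy Java sketch. This is not Infinispan code: `DeltaReplay` and its methods are hypothetical names. The scenario above mentions versions as a possible fix; the sketch uses a per-command id as a simpler stand-in for that idea, so a retried command that carries the same id is applied only once.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of one owner's copy of an intermediate value; not Infinispan code.
class DeltaReplay {
    final List<String> values = new ArrayList<>();
    final Set<String> appliedCommandIds = new HashSet<>(); // hypothetical dedup state

    // Non-idempotent behaviour: every delivery appends the delta.
    void applyDelta(List<String> delta) {
        values.addAll(delta);
    }

    // Possible fix: remember which command ids were applied and skip repeats.
    void applyDeltaOnce(String commandId, List<String> delta) {
        if (appliedCommandIds.add(commandId)) {
            values.addAll(delta);
        }
    }

    public static void main(String[] args) {
        List<String> delta = List.of("v1");

        DeltaReplay plainC = new DeltaReplay();
        plainC.applyDelta(delta); // C applies the delta forwarded by A
        plainC.applyDelta(delta); // A retries with C as the new primary
        System.out.println("plain: " + plainC.values); // prints "plain: [v1, v1]"

        DeltaReplay dedupC = new DeltaReplay();
        dedupC.applyDeltaOnce("cmd-1", delta);
        dedupC.applyDeltaOnce("cmd-1", delta); // the retry carries the same id
        System.out.println("deduplicated: " + dedupC.values); // prints "deduplicated: [v1]"
    }
}
```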

...

anyway, I think the retry mechanism should solve it. If we detect a topology change (during the iteration of segment _i_) and the segment _i_ is moved, then we can cancel the iteration, remove all the intermediate values generated in segment _i_ and restart (on the primary owner).

The problem is that the intermediate keys aren't in the same segment: we want the reduce phase to access only keys local to the reducing node, and keys in different input segments can yield values for the same intermediate key. So like you say, we'd have to retry on every topology change in the intermediary cache, not just the ones affecting segment _i_. There's another complication: in the scenario above, O may only get the topology update with owners(k) = [C, B] after the map/combine phase completed. So the originator of the M/R job would have to watch for topology changes seen by any node, and invalidate/retry any input segments that could have been affected. All that without slowing down the no-topology-change case too much...

...


Vladimir Blagojevic

10:13 a.m.

On 2014-10-10, 3:03 AM, Dan Berindei wrote:

...

If we have to retry for all segments on every topology change, then I am not sure why it would make sense to work on this optimization and topology handling mechanism at all. We have to handle the cases where one node might have completed the map phase and inserted deltas, while another has only started inserting deltas, and a third is still doing the map phase and has not inserted any deltas at all. The same goes for the reduce portion.

It seems to me that in the end any algorithm we come up with will not be much better than: detect topology change, retry map/reduce job.

Vladimir

Dan Berindei

11:06 a.m.

On Fri, Oct 10, 2014 at 6:13 PM, Vladimir Blagojevic <vblagoje(a)redhat.com> wrote:

...


Initially that was my thinking as well. But if the originator invokes the map/combine phase for only one segment at a time, it will have to retry only one segment per cluster node, not all the segments. And each node would write to separate keys in the intermediate cache, making it easy to clean up only one node's work. So it would still be worth it, as usually numSegments >> clusterSize. Plus we don't need this broad retry strategy if the intermediate cache is transactional (I think).

The biggest downside I see is that it would be horribly slow if the cache store doesn't support efficient iteration of a single segment. So we might want to implement a full retry strategy as well, if some cache stores can't support that.

Cheers
Dan

Emmanuel Bernard

Wednesday, 15 October

11:21 a.m.

On 10 Oct 2014, at 18:06, Dan Berindei <dan.berindei(a)gmail.com> wrote:

...

The biggest downside I see is that it would be horribly slow if the cache store doesn't support efficient iteration of a single segment. So we might want to implement a full retry strategy as well, if some cache stores can't support that.

My understanding from a discussion with Pedro (in a hard, cold and sinister place, but that’s another story) is that *today* M/R is kind of horrible anyway for global cache stores, which have to do the per-node key filtering dance. So it’s not significantly worse. Plus, I said we should do the work per segment, but in reality, if you send the map work for 5 segments to the same node, you can optimize and do a single loop while still making it look like separate pieces of work.
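A minimal Java sketch of that single-loop optimization, under stated assumptions: `SinglePassMapper`, `segmentOf` and `NUM_SEGMENTS` are hypothetical names, the hash is a stand-in, and `toUpperCase()` stands in for the user's mapper. The node iterates its local entries once and keeps the output of each requested segment in its own bucket, so each segment can still be acknowledged or discarded independently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: one pass over local data serving several per-segment map tasks.
class SinglePassMapper {
    static final int NUM_SEGMENTS = 8; // hypothetical

    static int segmentOf(String key) {
        return Math.floorMod(key.hashCode(), NUM_SEGMENTS); // stand-in hash
    }

    // One loop over the local entries; results stay separated per requested segment.
    static Map<Integer, List<String>> mapSegments(Map<String, String> localEntries,
                                                  Set<Integer> requestedSegments) {
        Map<Integer, List<String>> perSegment = new HashMap<>();
        for (int segment : requestedSegments) {
            perSegment.put(segment, new ArrayList<>());
        }
        for (Map.Entry<String, String> entry : localEntries.entrySet()) {
            List<String> out = perSegment.get(segmentOf(entry.getKey()));
            if (out != null) { // skip entries whose segment was not requested
                out.add(entry.getValue().toUpperCase()); // stand-in for the user's Mapper
            }
        }
        return perSegment;
    }

    public static void main(String[] args) {
        Map<String, String> local = Map.of("a", "x", "b", "y", "c", "z");
        // "a", "b", "c" hash to segments 1, 2, 3 with the stand-in hash above.
        System.out.println(mapSegments(local, Set.of(1, 3)));
    }
}
```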

Emmanuel Bernard

Friday, 10 October

10:49 a.m.

When wrestling with the subject, here is what I had in mind.

The M/R coordinator node sends the M task per segment to the node where the segment is primary. Each "per-segment" M task is executed and is offered a way to push intermediary results into a temp cache. The intermediary results are stored with a composite key [intermKey-i, seg-j].

The M/R coordinator waits for all M tasks to return. If one does not (timeout, rehash), the following happens:

- delete the [intermKey-i, seg-j] entries of the failed segment j (that operation could be handled by the new per-segment M task before the map task is effectively started)
- ship the M task for segment j to the new primary owner of segment j

When all M tasks have returned, the Reduce phase will read all [intermKey-i, *] keys and reduce them. Note that if the reduction phase is itself distributed, we could apply the same key per segment and shipping split for these.

Again the tricky part is to expose the ability to write to intermediary caches per segment without exposing segments per se, as well as to let someone see a concatenated view of intermKey-i from all segment subkeys during reduction.

Thoughts?

Dan, I did not quite get what alternative approach you wanted to propose. Care to respin it for a slow brain? :)

Emmanuel
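The composite-key idea can be sketched in a few lines of Java. This is an illustration under assumptions, not the M/R framework's code: `PerSegmentIntermediate` and its methods are hypothetical, a plain `HashMap` stands in for the intermediate cache, and `Map.entry(intermKey, segment)` plays the role of the [intermKey-i, seg-j] key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: intermediate values keyed by [intermKey, sourceSegment].
class PerSegmentIntermediate {
    final Map<Map.Entry<String, Integer>, List<String>> cache = new HashMap<>();

    // Called by the framework's map/combine task, never by user code.
    void emit(String intermKey, int sourceSegment, String value) {
        cache.computeIfAbsent(Map.entry(intermKey, sourceSegment), k -> new ArrayList<>())
             .add(value);
    }

    // Before retrying a segment, drop everything the failed attempt wrote for it.
    void discardSegment(int sourceSegment) {
        cache.keySet().removeIf(key -> key.getValue() == sourceSegment);
    }

    // Reduce-side view: all values of intermKey, concatenated across source segments.
    List<String> valuesFor(String intermKey) {
        List<String> all = new ArrayList<>();
        cache.forEach((key, values) -> {
            if (key.getKey().equals(intermKey)) {
                all.addAll(values);
            }
        });
        return all;
    }

    public static void main(String[] args) {
        PerSegmentIntermediate store = new PerSegmentIntermediate();
        store.emit("word", 0, "1");
        store.emit("word", 1, "1"); // partial output of an attempt that then fails
        store.discardSegment(1);    // cleanup before the retry on the new primary
        store.emit("word", 1, "1"); // output of the retried task
        System.out.println(store.valuesFor("word").size()); // prints 2
    }
}
```

Because the segment is part of the key, discarding a failed attempt touches only that segment's entries, and the reducer never needs to know the values were stored separately.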

Dan Berindei

Monday, 13 October

3:45 a.m.

On Fri, Oct 10, 2014 at 6:49 PM, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

...

When wrestling with the subject, here is what I had in mind. The M/R coordinator node sends the M task per segment on the node where the segment is primary.

What's M? Is it just a shorthand for "map", or is it a new parameter that controls the number of map/combine tasks sent at once?

...

Each "per-segment" M task is executed and is offered the way to push intermediary results in a temp cache.

Just to be clear, the user-provided mapper and combiner don't know anything about the intermediary cache (which doesn't have to be temporary, if it's shared by all M/R tasks). They only interact with the Collector interface. The map/combine task on the other hand is our code, and it deals with the intermediary cache directly.

...

The intermediary results are stored with a composite key [intermKey-i, seg-j]. The M/R coordinator waits for all M tasks to return. If one does not (timeout, rehash), the following happens:

We can't allow map tasks to time out, or they will keep writing to the intermediate cache in parallel with the retried tasks. So the originator has to wait for a response from each node to which it sent a map task.

...

Note that if the reduction phase is itself distributed, we could apply the same key per segment and shipping split for these.

Sure, we have to retry reduce tasks when the primary owner changes, and it makes sense to retry as little as possible.

...

Again the tricky part is to expose the ability to write to intermediary caches per segment without exposing segments per se as well as let someone see a concatenated view of intermKey-i from all segments subkeys during reduction.

Writing to and reading from the intermediate cache is already abstracted from user code (in the Mapper and Reducer interfaces). So we don't need to worry about exposing extra details to the user.

...

Thoughts? Dan, I did not quite get what alternative approach you wanted to propose. Care to respin it for a slow brain? :)

I think where we differ is that I don't think user code needs to know about how we store the intermediate values and what we retry, as long as their mappers/combiners/reducers don't have side effects.

Otherwise I was thinking along the same lines: send 1 map/combine task for each segment (maybe with a cap on the number of segments being processed at the same time on each node), split the intermediate values per input segment, cancel+retry each map task if the topology changes and the executing node is no longer an owner. If the reduce phase is distributed, run 1 reduce task per segment as well, and cancel+retry the reduce task if the executing node is no longer an owner.

I had some ideas about assigning each map/combine phase a UUID and making the intermediate keys [intermKey, seg, mctask] to allow the originator to retry a map/combine task without waiting for the previous one to finish, but I don't think I mentioned that before :)

There are also some details that I'm worried about:

1) If the reduce phase is distributed, and the intermediate cache is non-transactional, any topology change in the intermediate cache will require us to retry all the map/combine tasks that were running at the time on any node (even if some nodes did not detect the topology change yet). So it would make sense to limit the number of map/combine tasks that are processed at one time, in order to limit the number of tasks we retry (OR require the intermediate cache to be transactional).

2) Running a separate map/combine task for each segment is not really an option until we implement the segment-aware data container and cache stores. Without that change, it will make everything much slower, because of all the extra iterations for each segment.

3) And finally, all this will be overkill when the input cache is small, and the time needed to process the data is comparable to the time needed to send all those extra RPCs.

So I'm thinking it might be better to adopt Vladimir's suggestion to retry everything if we detect a topology change in the input and/or intermediate cache at the end of the M/R task, at least in the first phase.

Cheers
Dan
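The "retry everything" fallback could look roughly like the following Java sketch. The names are hypothetical (`RetryOnTopologyChange`, `run`, the `topologyId` supplier), and it assumes a single topology id that the originator can read before and after the job. As noted in the thread, a real implementation would have to account for topology changes seen by any node and would also need to clear the intermediate cache between attempts.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;
import java.util.function.Supplier;

// Illustrative only: rerun the whole job if the topology changed while it ran.
class RetryOnTopologyChange {
    static <R> R run(Supplier<R> job, IntSupplier topologyId, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int before = topologyId.getAsInt();
            R result = job.get();
            if (topologyId.getAsInt() == before) {
                return result; // no topology change observed during this attempt
            }
            // A real implementation would clear the intermediate results here.
        }
        throw new IllegalStateException("topology kept changing after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        AtomicInteger topology = new AtomicInteger(1);
        AtomicInteger runs = new AtomicInteger();
        String result = run(() -> {
            if (runs.incrementAndGet() == 1) {
                topology.incrementAndGet(); // simulate a rehash during the first attempt
            }
            return "done after " + runs.get() + " runs";
        }, topology::get, 3);
        System.out.println(result); // prints "done after 2 runs"
    }
}
```

This trades wasted work on a rehash for a very cheap no-topology-change path, which matches the concern about small input caches and extra RPCs.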

...

Emmanuel

On Fri 2014-10-10 10:03, Dan Berindei wrote:
> > > I'd rather not expose this to the user. Instead, we could split the intermediary values for each key by the source segment, and do the invalidation of the retried segments in our M/R framework (e.g. when we detect that the primary owner at the start of the map/combine phase is not an owner at all at the end).
> > >
> > > I think we have another problem with the publishing of intermediary values not being idempotent. The default configuration for the intermediate cache is non-transactional, and retrying the put(delta) command after a topology change could add the same intermediate values twice. A transactional intermediary cache should be safe, though, because the tx won't commit on the old owner until the new owner knows about the tx.
> >
> > can you elaborate on it?
>
> say we have a cache with numOwners=2, owners(k) = [A, B]
> C will become the primary owner of k, but for now owners(k) = [A, B, C]
> O sends put(delta) to A (the primary)
> A sends put(delta) to B, C
> B sees a topology change (owners(k) = [C, B]), doesn't apply the delta and replies with an OutdatedTopologyException
> C applies the delta
> A resends put(delta) to C (new primary)
> C sends put(delta) to B, applies the delta again
>
> I think it could be solved with versions, I just wanted to point out that we don't do that now.
>
> > anyway, I think the retry mechanism should solve it. If we detect a topology change (during the iteration of segment _i_) and the segment _i_ is moved, then we can cancel the iteration, remove all the intermediate values generated in segment _i_ and restart (on the primary owner).
>
> The problem is that the intermediate keys aren't in the same segment: we want the reduce phase to access only keys local to the reducing node, and keys in different input segments can yield values for the same intermediate key. So like you say, we'd have to retry on every topology change in the intermediary cache, not just the ones affecting segment _i_.
>
> There's another complication: in the scenario above, O may only get the topology update with owners(k) = [C, B] after the map/combine phase completed. So the originator of the M/R job would have to watch for topology changes seen by any node, and invalidate/retry any input segments that could have been affected. All that without slowing down the no-topology-change case too much...
>
> > But before getting ahead of ourselves, what do you think of the general idea? Even without retry framework, this approach would be more stable than our current per node approach during topology changes and improve dependability.
> > >
> > > Doing it solely based on segment would remove the possibility of having duplicates. However without a mechanism to send a new request on rehash it would be possible to only find a subset of values (if a segment is removed while iterating on it).
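As a rough illustration of the "could be solved with versions" remark in the quoted scenario, the sketch below makes applying a delta idempotent by remembering which delta versions were already applied. This is only an assumption about what such a scheme might look like: `VersionedValues` and `applyDelta` are hypothetical names, the version is a plain `long` chosen by the sender, and nothing here reflects how Infinispan's put(delta) command actually works.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical holder for the values of one intermediate key on one owner.
final class VersionedValues {
    private final List<String> values = new ArrayList<>();
    private final Set<Long> appliedVersions = new HashSet<>();

    // Applies the delta unless a delta with the same version was applied before.
    // Returns true if the delta was applied, false if it was a duplicate.
    synchronized boolean applyDelta(long deltaVersion, List<String> delta) {
        if (!appliedVersions.add(deltaVersion)) {
            return false; // resent command after a topology change: ignore it
        }
        values.addAll(delta);
        return true;
    }

    // Copy of the current values, safe to hand to a reducer.
    synchronized List<String> snapshot() {
        return new ArrayList<>(values);
    }
}
```

With this guard, the second put(delta) that C receives in the scenario would be dropped instead of adding the same intermediate values twice. A real implementation would also have to bound the set of remembered versions, which this sketch does not attempt.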

Emmanuel Bernard

Wednesday, 15 October Wed, 15 Oct

11:41 a.m.

On 13 Oct 2014, at 10:45, Dan Berindei <dan.berindei(a)gmail.com> wrote:

...

On Fri, Oct 10, 2014 at 6:49 PM, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

> When wrestling with the subject, here is what I had in mind. The M/R coordinator node sends the M task per segment on the node where the segment is primary.

What's M? Is it just a shorthand for "map", or is it a new parameter that controls the number of map/combine tasks sent at once?

M is short for Map. Sorry.

...

> Each "per-segment" M task is executed and is offered the way to push intermediary results in a temp cache.

Just to be clear, the user-provided mapper and combiner don't know anything about the intermediary cache (which doesn't have to be temporary, if it's shared by all M/R tasks). They only interact with the Collector interface. The map/combine task on the other hand is our code, and it deals with the intermediary cache directly.

Interesting. Evangelos, do you actually use the Collector interface or explicit intermediary caches in your approach? If it's the Collector interface, I guess it's easier to hide that sharding business.

...

> The intermediary results are stored with a composite key [intermKey-i, seg-j]. The M/R coordinator waits for all M tasks to return. If one does not (timeout, rehash), the following happens:

We can't allow map tasks to time out, or they will keep writing to the intermediate cache in parallel with the retried tasks. So the originator has to wait for a response from each node to which it sent a map task.

OK. I guess the originator can see that a node is out of the cluster though and act accordingly.

...

> - delete [intermKey-i, seg-i] (that operation could be handled by the new per-segment M before the map task is effectively started)
> - ship the M task for that segment-i to the new primary owner of segment-i
>
> When all M tasks are received the Reduce phase will read all [intermKey-i, *] keys and reduce them. Note that if the reduction phase is itself distributed, we could apply the same key per segment and shipping split for these.

Sure, we have to retry reduce tasks when the primary owner changes, and it makes sense to retry as little as possible.

> Again the tricky part is to expose the ability to write to intermediary caches per segment without exposing segments per se as well as let someone see a concatenated view of intermKey-i from all segments subkeys during reduction.

Writing to and reading from the intermediate cache is already abstracted from user code (in the Mapper and Reducer interfaces). So we don't need to worry about exposing extra details to the user.

> Thoughts?
>
> Dan, I did not quite get what alternative approach you wanted to propose. Care to respin it for a slow brain? :)

I think where we differ is that I don't think user code needs to know about how we store the intermediate values and what we retry, as long as their mappers/combiners/reducers don't have side effects.
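The delete-and-reship steps discussed above could be sketched roughly as follows. Everything here is an assumption made for illustration: `SegmentRetry`, `segmentsToRetry` and `purgeSegment` are hypothetical names, node names are plain strings, the intermediate cache is an ordinary `Map`, and the composite key [intermKey, seg] is modelled with `SimpleImmutableEntry`. The retry rule used is the one from the thread: a segment is retried when the node that executed its map task is no longer an owner.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SegmentRetry {

    // Returns the segments whose map task must be re-run because the node that
    // executed it is no longer among the current owners of that segment.
    static List<Integer> segmentsToRetry(Map<Integer, String> executingNode,
                                         Map<Integer, List<String>> currentOwners) {
        List<Integer> retry = new ArrayList<>();
        for (Map.Entry<Integer, String> e : executingNode.entrySet()) {
            List<String> owners = currentOwners.get(e.getKey());
            if (owners == null || !owners.contains(e.getValue())) {
                retry.add(e.getKey());
            }
        }
        return retry;
    }

    // Removes every intermediate entry [intermKey, seg] produced from the given
    // input segment, so that a retried map task starts from a clean slate.
    static <V> void purgeSegment(Map<SimpleImmutableEntry<String, Integer>, V> intermediate,
                                 int segment) {
        intermediate.keySet().removeIf(k -> k.getValue() == segment);
    }
}
```

The originator would call `purgeSegment` for each segment returned by `segmentsToRetry` and then ship the map task for that segment to its new primary owner. As noted in the thread, this is only safe once the old task is known to have stopped writing.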

Right but my understanding from the LEADS guys was that they had side effects on their M/Rs. Waiting for Evangelos to speak up.

...

Nice touch, that fixes the rogue node / timeout problem.

...

There are also some details that I'm worried about: 1) If the reduce phase is distributed, and the intermediate cache is non-transactional, any topology change in the intermediate cache will require us to retry all the map/combine tasks that were running at the time on any node (even if some nodes did not detect the topology change yet). So it would make sense to limit the number of map/combine tasks that are processed at one time, in order to limit the amount of tasks we retry (OR require the intermediate cache to be transactional).

I am not fully following that. What matters in the end it seems is for the originator to detect a topology change and discard things accordingly, no? If the other nodes are slaves of that originator for the purpose of that M/R, we are good.

...

2) Running a separate map/combine task for each segment is not really an option until we implement the segment-aware data container and cache stores. Without that change, it will make everything much slower, because of all the extra iterations for each segment.

See my other email about physically merging down the per segment work into a per node work when you ship that work.

...

3) And finally, all this will be overkill when the input cache is small, and the time needed to process the data is comparable to the time needed to send all those extra RPCs. So I'm thinking it might be better to adopt Vladimir's suggestion to retry everything if we detect a topology change in the input and/or intermediate cache at the end of the M/R task, at least in the first phase.

You half lost me, but I think that with my proposal to physically merge the RPC calls per node instead of per segment, that problem would be alleviated.

Emmanuel

Evangelos Vazaios

Thursday, 13 November Thu, 13 Nov

11:51 p.m.

I am really sorry for the ridiculously late response. I will briefly describe our 1st year approach and our current approach.

1st year approach

During the first year, we used Infinispan MR to implement our operators. Most of our operators were Map-only (for example project, filter) and for these we did not use the intermediate cache. For all the other operators (join, group by) we used the Collector interface. Our reducers always returned null and the actual output was written to another cache, because we had a workflow of operators.

Current approach

At the moment we do not use MR; we have replaced it with two dist calls, one for the map phase and another for the reduce phase. The intermediate data are stored in a cache (Cache<K,List<V>>). At some point we would like to change to a delta-aware cache. We changed from MR to dist calls because we want to run MR tasks across multiple micro-clouds, and synchronizing mappers and reducers would be more complicated than monitoring the execution of independent dist calls (1 for each micro-cloud). The intermediate data are written to an ensemble cache (a LEADS cache), which spans multiple micro-clouds.

In general, I find it quite useful to be able to "consistently" (without missing data that are already inside) iterate over the values of a cache.

On Wed, Oct 15, 2014 at 7:41 PM, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

...

On 13 Oct 2014, at 10:45, Dan Berindei <dan.berindei(a)gmail.com> wrote:

On Fri, Oct 10, 2014 at 6:49 PM, Emmanuel Bernard <emmanuel(a)hibernate.org> wrote:

> When wrestling with the subject, here is what I had in mind.
>
> The M/R coordinator node sends the M task per segment on the node where the segment is primary.

What's M? Is it just a shorthand for "map", or is it a new parameter that controls the number of map/combine tasks sent at once?

M is short for Map. Sorry.

> Each "per-segment" M task is executed and is offered the way to push intermediary results in a temp cache.

Just to be clear, the user-provided mapper and combiner don't know anything about the intermediary cache (which doesn't have to be temporary, if it's shared by all M/R tasks). They only interact with the Collector interface. The map/combine task on the other hand is our code, and it deals with the intermediary cache directly.

Interesting. Evangelos, do you actually use the Collector interface or explicit intermediary caches in your approach? If it's the Collector interface, I guess it's easier to hide that sharding business.

We use explicit caches, but should that functionality become available, we could possibly revert to Infinispan MR.
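For readers unfamiliar with the Cache<K,List<V>> layout mentioned earlier in this message, here is a small sketch of what appending intermediate values to such a cache can look like. It is an illustration under assumptions, not the LEADS code: `ListValuedStore` is a hypothetical name and a `ConcurrentHashMap` stands in for the explicit intermediate cache.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Stand-in for an explicit intermediate cache of type Cache<K, List<V>>.
final class ListValuedStore<K, V> {
    private final ConcurrentMap<K, List<V>> cache = new ConcurrentHashMap<>();

    // Appends one value for the key. The whole list is copied and rewritten on
    // every append, which is the cost a delta-aware value would avoid.
    void append(K key, V value) {
        cache.merge(key, Collections.singletonList(value), (existing, added) -> {
            List<V> merged = new ArrayList<>(existing);
            merged.addAll(added);
            return Collections.unmodifiableList(merged);
        });
    }

    // Returns the values collected so far for the key, or an empty list.
    List<V> get(K key) {
        return cache.getOrDefault(key, Collections.<V>emptyList());
    }
}
```

Because each append replaces the entire list, the amount of data written grows with the list size, which is one plausible reason for wanting a delta-aware cache instead.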

...

> The intermediary results are stored with a composite key [intermKey-i, seg-j].
> The M/R coordinator waits for all M tasks to return. If one does not (timeout, rehash), the following happens:

We can't allow map tasks to time out, or they will keep writing to the intermediate cache in parallel with the retried tasks. So the originator has to wait for a response from each node to which it sent a map task.

OK. I guess the originator can see that a node is out of the cluster though and act accordingly.

> - delete [intermKey-i, seg-i] (that operation could be handled by the new per-segment M before the map task is effectively started)
> - ship the M task for that segment-i to the new primary owner of segment-i
>
> When all M tasks are received the Reduce phase will read all [intermKey-i, *] keys and reduce them.
> Note that if the reduction phase is itself distributed, we could apply the same key per segment and shipping split for these.

Sure, we have to retry reduce tasks when the primary owner changes, and it makes sense to retry as little as possible.

> Again the tricky part is to expose the ability to write to intermediary caches per segment without exposing segments per se as well as let someone see a concatenated view of intermKey-i from all segments subkeys during reduction.

Writing to and reading from the intermediate cache is already abstracted from user code (in the Mapper and Reducer interfaces). So we don't need to worry about exposing extra details to the user.

> Thoughts?
>
> Dan, I did not quite get what alternative approach you wanted to propose. Care to respin it for a slow brain? :)

I think where we differ is that I don't think user code needs to know about how we store the intermediate values and what we retry, as long as their mappers/combiners/reducers don't have side effects.

Right but my understanding from the LEADS guys was that they had side effects on their M/Rs. Waiting for Evangelos to speak up.

Should that be available for MapReduce, and the underlying ensemble cache can correctly handle one of the strategies described above, we might be able to change back to Infinispan MR.

...

Otherwise I was thinking on the same lines: send 1 map/combine task for each segment (maybe with a cap on the number of segments being processed at the same time on each node), split the intermediate values per input segment, cancel+retry each map task if the topology changes and the executing node is no longer an owner. If the reduce phase is distributed, run 1 reduce task per segment as well, and cancel+retry the reduce task if the executing node is no longer an owner.

I had some ideas about assigning each map/combine phase a UUID and making the intermediate keys [intermKey, seg, mctask] to allow the originator to retry a map/combine task without waiting for the previous one to finish, but I don't think I mentioned that before :)

Nice touch, that fixes the rogue node / timeout problem.

There are also some details that I'm worried about:

1) If the reduce phase is distributed, and the intermediate cache is non-transactional, any topology change in the intermediate cache will require us to retry all the map/combine tasks that were running at the time on any node (even if some nodes did not detect the topology change yet). So it would make sense to limit the number of map/combine tasks that are processed at one time, in order to limit the amount of tasks we retry (OR require the intermediate cache to be transactional).

I am not fully following that. What matters in the end it seems is for the originator to detect a topology change and discard things accordingly, no? If the other nodes are slaves of that originator for the purpose of that M/R, we are good.

2) Running a separate map/combine task for each segment is not really an option until we implement the segment-aware data container and cache stores. Without that change, it will make everything much slower, because of all the extra iterations for each segment.

See my other email about physically merging down the per segment work into a per node work when you ship that work.

3) And finally, all this will be overkill when the input cache is small, and the time needed to process the data is comparable to the time needed to send all those extra RPCs. So I'm thinking it might be better to adopt Vladimir's suggestion to retry everything if we detect a topology change in the input and/or intermediate cache at the end of the M/R task, at least in the first phase.

It would also be overkill to restart the entire MR task if the volume of data is large. I would propose a solution that uses the distributed iterator and does not miss data whenever a topology change happens.

...

You half lost me, but I think that with my proposal to physically merge the RPC calls per node instead of per segment, that problem would be alleviated.

Emmanuel

Cheers, Evangelos

Pedro Ruivo

Thursday, 9 October Thu, 9 Oct

4:13 p.m.

On 10/09/2014 03:40 PM, William Burns wrote:

...

Snapshot Isolation should be fine, but I don't wanna get into a discussion about it right now :)

...

True. I think the retry mechanism is the best approach. Another alternative would be to implement a Map<K,V> getBySegment(int) operation that goes remote if the segment is not local.
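A getBySegment(int) operation of the kind suggested here might look like the sketch below. It is purely illustrative: `SegmentReader` is a hypothetical class, `remoteFetch` is a stand-in for whatever RPC would ask a remote owner for the segment, and `segmentOf` is a toy hash-modulo function rather than Infinispan's consistent hash.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.IntFunction;

final class SegmentReader<K, V> {
    private final Map<K, V> localData;                 // entries stored on this node
    private final Set<Integer> localSegments;          // segments this node owns
    private final int numSegments;
    private final IntFunction<Map<K, V>> remoteFetch;  // stand-in for a remote call

    SegmentReader(Map<K, V> localData, Set<Integer> localSegments,
                  int numSegments, IntFunction<Map<K, V>> remoteFetch) {
        this.localData = localData;
        this.localSegments = localSegments;
        this.numSegments = numSegments;
        this.remoteFetch = remoteFetch;
    }

    // Toy segment function (assumption); the real consistent hash is different.
    int segmentOf(Object key) {
        return Math.floorMod(key.hashCode(), numSegments);
    }

    // Returns all entries of the segment, reading locally when this node owns
    // the segment and delegating to the remote fetch otherwise.
    Map<K, V> getBySegment(int segment) {
        if (!localSegments.contains(segment)) {
            return remoteFetch.apply(segment);
        }
        Map<K, V> result = new HashMap<>();
        for (Map.Entry<K, V> e : localData.entrySet()) {
            if (segmentOf(e.getKey()) == segment) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```

A per-segment M/R task built on such an operation would see each segment exactly once regardless of where it currently lives, at the price of pulling a whole segment over the network when it is not local.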

...



participants (6)

Dan Berindei
Emmanuel Bernard
Evangelos Vazaios
Pedro Ruivo
Vladimir Blagojevic
William Burns
