<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On Aug 4, 2014, at 9:17 PM, Summers Pittman &lt;<a href="mailto:supittma@redhat.com">supittma@redhat.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">On Mon 04 Aug 2014 01:19:17 PM EDT, Randall Hauch wrote:<br><blockquote type="cite">Perhaps we’re looking at this from different perspectives. It’s great<br>that you guys are trying to better understand DS so that you can<br>compare it to other techniques, including OT. That certainly needs to<br>be done. I guess I was looking at DS from the perspective of how a<br>data service might need to implement it, knowing that the choice of<br>how data sync is ultimately done will be influenced in part by how the<br>data service would implement each approach and the impact on<br>scalability and performance. Perhaps it’s too early to provide my<br>thoughts along those lines.<br><br>On Aug 4, 2014, at 9:34 AM, Summers Pittman &lt;<a href="mailto:supittma@redhat.com">supittma@redhat.com</a><br>&lt;<a href="mailto:supittma@redhat.com">mailto:supittma@redhat.com</a>&gt;&gt; wrote:<br><br><blockquote type="cite">I am speaking from the perspective of the algorithm and less from my<br>opinions of the way the system should work as a whole.<br><br>On 08/01/2014 03:02 PM, Randall Hauch wrote:<br><blockquote type="cite">I’ve really enjoyed learning about what AeroGear has been doing with<br>data sync. This is a tough problem, but finding a solution is really<br>important. 
Both data sync POCs appear to use Differential<br>Synchronization, or DS [1]. I was not familiar with the paper until<br>today, but after reading it I do have a few questions/comments. Bear<br>with me; this is a long post.<br><br>DS is clearly targeted for use within a collaborative document<br>editor, where there are multiple clients concurrently editing the<br>same document, and at any one time there are a relatively small<br>number of documents being edited; you can get a feel for this by<br>looking at figures 5 and 7 in the paper [1]: look at the amount of<br>server memory and CPU required to perform DS on just one document<br>being edited by a half-dozen clients. Also, in a collaborative<br>document editor, clients are often continually making changes even<br>as they attempt to synchronize with the server.<br></blockquote>It doesn't actually make any claims about CPU or memory usage. &nbsp;A<br>shadow document is needed for each connection. &nbsp;For documents which<br>are infrequently edited, the shadow doc can easily be frozen to disk<br>until an edit comes in.<br></blockquote><br>Sure, you don’t have to keep it in memory. But it does have to be in<br>memory to do anything with it in an efficient way. And, yes, you<br>certainly can build a data service that uses this technique. My point<br>was that having multiple copies of a document being edited will reduce<br>the scalability of the data service compared to other techniques.<br><br><blockquote type="cite"><blockquote type="cite"><br>(It’s interesting that Google Docs, and Google Wave before it,<br>appear to use Operational Transformation [2] rather than DS. OT<br>might also make it easier to implement undo/redo, which works really<br>well in Google Docs.)<br></blockquote>That is probably because OT, Docs, and Apache Wave are all older than<br>Diff-sync. 
&nbsp;OT is also a much more complicated algorithm in my<br>experience (and from browsing around on Wikipedia)<br><blockquote type="cite"><br>An MBaaS or any other database-like service is very different. It<br>has to host multiple applications (i.e., databases), each with<br>multiple collections containing potentially millions of entities<br>(e.g., JSON documents). The entities themselves are more<br>fine-grained and smaller than collaborative documents (though<br>probably a bit coarser-grained and larger than a single record in an<br>RDBMS). Many clients might be reading and updating lots of documents<br>at once, and the data service has to coordinate those changes. A<br>single batch update from one client might request changes to dozens<br>of entities. And the clients can/will always wait for confirmation<br>that the server made the requested changes before continuing (unless<br>the client is offline); or at a minimum can enqueue the requested<br>changes.<br></blockquote>Two quick things. &nbsp;A document is just a collection of entities and<br>can be structured to reduce this problem (especially if we are faking<br>it on an RDBMS with particularly sadistic abuses to an ORM).<br></blockquote><br>Yes, a document might be a JSON document that is an aggregate of<br>multiple objects, and not just a flat map of key-value pairs. The use<br>of aggregate data structures and denormalization are some of the ways<br>that eventually-consistent data stores work. The goal is to reduce the<br>scope of a set of operations to a single aggregate. Other data stores<br>(like graph and hierarchical databases) require strong consistency and<br>transactions because operations necessarily span multiple objects. 
But<br>limiting operations to a single aggregate is also quite constraining<br>w/r/t app development, since you can’t always denormalize all data to<br>separate aggregates.<br><br>So even if a collection (in the MongoDB sense) contains documents that<br>are aggregates of multiple “entities” (in the Hibernate sense of the<br>word), my point still stands that generally any given JSON document<br>will still be smaller than a document used in a collaborative document<br>editor. Also, I would not be surprised if the sheer number of<br>documents in a MongoDB collection is orders of magnitude larger than<br>the number of documents stored by a collaborative editor app.<br><br><blockquote type="cite">Clients don't have to wait for the edits to be merged on the server<br>and the nature of diff-sync gives us batching for free.<br></blockquote><br>Hmm… even if you could do it this way, do you not want to be able to<br>give feedback to the user that the changes might not have been<br>accepted/persisted?<br><br>Do you have some scenarios that describe the kinds of applications<br>you’re considering? I’m wondering if I’m envisioning a different kind<br>of app.<br><br><blockquote type="cite"><blockquote type="cite"><br>Given these characteristics, using DS within the data service might<br>be extremely expensive in terms of CPU and memory<br></blockquote>or it might not be. &nbsp;We need data, use cases, etc. to test and see<br>what happens.<br><blockquote type="cite">, and difficult for a DS-based service to implement all of the<br>features necessary.<br></blockquote>Which features? &nbsp;Features of the algorithm or features of the<br>application? &nbsp;The algorithm is really REALLY simple for what we get<br>out of it.<br></blockquote><br>I was referring to features of the data service, and especially how<br>the data service’s implementation can satisfy the difficult<br>non-functional requirements like scalability and performance. 
While<br>the algorithm might be really simple, that doesn’t mean implementing<br>it on the server is efficient. What I’ve read so far makes me think<br>that it could very well be less efficient and scalable than other<br>techniques used in data services.<br><br><blockquote type="cite"><blockquote type="cite">First, the data service doesn’t really know which entities are<br>being “edited”; instead, connected clients read entities, make<br>changes locally, then request the service make those changes.<br></blockquote>I disagree. &nbsp;The service knows which documents the client has a<br>connection to/active session for. &nbsp;It most certainly knows which<br>entities are being edited.<br></blockquote><br>I guess I was hoping that the client can manipulate documents locally<br>without having to coordinate that with the server. Again, I’m<br>concerned about server scalability.<br><br><blockquote type="cite"><blockquote type="cite">Secondly, every time a change comes in, to compute the diff the<br>service would have to read the persisted entity; this not only is<br>inefficient, but this also makes it more difficult to scale and<br>handle the concurrency, consistency, atomicity, and serializability<br>guarantees.<br></blockquote>See earlier comment about sadistic abuses of an ORM. &nbsp;Yes, we have to<br>be aware of the RDB underneath the sync server, but I don't think<br>this is a problem with the algorithm.<br></blockquote><br>I agree, it’s not a problem with the algorithm. It’s a problem insofar<br>as it would mandate what the server has to support.<br><br><blockquote type="cite"><blockquote type="cite">Thirdly, what would the data service need to do when a client<br>connects and asks for the changes since it was last connected?<br></blockquote>Send it the diff of the client's server-side shadow and the server's<br>current document. &nbsp;This diff will get sent to the client, merged with<br>the client's shadow, and the diff of that will get sent back to the<br>server. 
&nbsp;Repeat until the client is in sync.<br><blockquote type="cite">The data service might be able to quickly find out which entities<br>were modified since then, but computing the diffs (relative to the<br>time the client last connected) for all of those changed entities<br>would be very complicated.<br></blockquote>It isn’t.<br></blockquote><br>Perhaps I should have said “expensive” rather than “complicated”.<br><br><blockquote type="cite"><blockquote type="cite">It may be easier and better for the data service to record the<br>individual changes (edits) made by each transaction, and then to use<br>that information to compute the effective diffs from some period of<br>time. In fact, these recorded edits might also be useful to<br>implement other features within the data service; see CQRS [3] and [4].<br><br>What is really required by the client when trying to synchronize its<br>data after being disconnected? Assuming the client can say which<br>subset of entities it’s interested in when it reconnects (via some<br>criteria in a subscription), does the client want:<br><br>1. the new versions of those entities that changed;<br></blockquote>No<br></blockquote><br>This is actually what a number of MBaaS offerings do, although it’s<br>often hidden by the client SDKs. It may not be ideal because it places<br>more work onto the client SDK, but the benefit is that a good portion<br>of the work is done on the client, and the load on the server is<br>reduced (and scalability increased). It’s also trivially easy for the<br>data service to implement.<br><br><blockquote type="cite"><blockquote type="cite">2. the deltas in the entities; and/or<br></blockquote>Yes<br><blockquote type="cite"><br>3. 
all of the events describing the individual changes made to all<br>&nbsp;&nbsp;&nbsp;of those entities?<br></blockquote>No.<br><blockquote type="cite"><br>It may not matter for clients that don’t allow local offline<br>changes, but what might the preferred approach be for clients that<br>do allow offline changes? Option #1 is clearly the easiest from the<br>perspective of the data service, but options #2 and #3 can certainly<br>be handled. With option #1, can the client do something like DS and<br>maintain copies of each original (unmodified) entity so that it can<br>compute the differences? Does this (perhaps with a journal of edits<br>made while offline) provide enough info for the client to properly<br>merge the local changes, or does the client really need the<br>individual events in #3 so that it can, for example, know that some<br>local changes were made to now-out-of-date data?<br></blockquote>Except in the case of a merge error, the algorithm handles long<br>offline periods with edits just fine. &nbsp;If there is a merge error the<br>user/application will have to manually merge the documents somehow.<br><br>One of the things to keep in mind is that on mobile devices the radio is<br>the most expensive thing you can control as an application. &nbsp;Any<br>decision we make should err toward sending as little data as<br>possible as few times as possible.<br></blockquote><br>I completely agree.<br><br><blockquote type="cite"><blockquote type="cite"><br>Will the same option work for online notifications? 
After all, it’d<br>be great if the same mechanism was used for data-sync, offline<br>(push) notifications, and online notifications (events).<br></blockquote>I don't understand your question.<br></blockquote><br>Only that it seems beneficial that the same mechanism be used for<br>“events” (while connected) and both online and offline data-sync.<br><br><br>_______________________________________________<br>aerogear-dev mailing list<br><a href="mailto:aerogear-dev@lists.jboss.org">aerogear-dev@lists.jboss.org</a><br>https://lists.jboss.org/mailman/listinfo/aerogear-dev<br></blockquote><br>I don't think it is too early for service-minded people to point out<span class="Apple-converted-space">&nbsp;</span><br>the big honking problems we will face. &nbsp;At the very least it helps us<span class="Apple-converted-space">&nbsp;</span><br>figure out the use cases we will support for the initial releases as the<span class="Apple-converted-space">&nbsp;</span><br>POC morphs into an actual project.<br></div></blockquote><div><br></div><div>I actually think this is a very good point (as have been other points on both sides). &nbsp;We will rely on people like you, Randall, to assist us in understanding the impact and to advocate for the deeper server-side needs. &nbsp;At the same time, the client-side APIs and developer experience will also need to be reviewed and taken into consideration.</div><div><br></div><div>I think it is important to remember that all of these are POCs, and I think with a problem domain as complex as this, before we get to actual cross-project implementation we need to develop and flesh out specs, including all of these various points, risks, etc. 
&nbsp;This would be across data services, LiveOak and AeroGear, and possibly others.</div><br><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><br>From a mobile client developer perspective, this algorithm feels very<span class="Apple-converted-space">&nbsp;</span><br>easy to understand and it handles a lot of annoying corner cases very<span class="Apple-converted-space">&nbsp;</span><br>well (e.g. offline, operation batching, conflict handling, document<span class="Apple-converted-space">&nbsp;</span><br>synchronization, collaboration, etc.). &nbsp;Additionally, in my<span class="Apple-converted-space">&nbsp;</span><br>experiments, plugging in different diff-merge-sync operations allows<span class="Apple-converted-space">&nbsp;</span><br>for the general framework to be adapted to other use cases and data<span class="Apple-converted-space">&nbsp;</span><br>types. &nbsp;(i.e. 
if the server is just batching a list of changed Object<span class="Apple-converted-space">&nbsp;</span><br>IDs instead of the actual changes for binary files).<br><br>One of the things we need to do is determine which problems we are<span class="Apple-converted-space">&nbsp;</span><br>going to solve/support (single user multi device sync, multi user<span class="Apple-converted-space">&nbsp;</span><br>collaboration, binary files etc), which we are going to short circuit<span class="Apple-converted-space">&nbsp;</span><br>(limiting numbers of collaborators, limiting file size, etc), and which<span class="Apple-converted-space">&nbsp;</span><br>we are just going to document as best practices to use or to avoid.<br><br>Maybe I'm just stuck seeing my hammer (DS) and everything is nails and<span class="Apple-converted-space">&nbsp;</span><br>if someone can give a demo of something else (OT, etc) then I will be<span class="Apple-converted-space">&nbsp;</span><br>really interested to compare.<br><br><br>--<br>Summers Pittman<br><blockquote type="cite"><blockquote type="cite">Phone:404 941 4698<br>Java is my crack.<br></blockquote></blockquote><br>_______________________________________________<br>aerogear-dev mailing list<br><a href="mailto:aerogear-dev@lists.jboss.org">aerogear-dev@lists.jboss.org</a><br><a href="https://lists.jboss.org/mailman/listinfo/aerogear-dev">https://lists.jboss.org/mailman/listinfo/aerogear-dev</a></div></blockquote></div><br></body></html>