[aerogear-dev] Use of Differential Synchronization for data sync

Wed Aug 6 02:27:06 EDT 2014

>I think it is important to remember that all of these are POC's and I
think with a problem domain as complex as this, before we get to actual
cross-project implementation we need to develop and flesh out specs,
including all of these various points, risks, etc...
I agree with this.

I think limiting the scope of this for now and enabling the underlying
choice of storage/data service to be flexible will allow us to cater for
the complex enterprise as well as the simpler use cases which are
important too if we want people to start using this and trying it out.

On 6 August 2014 00:34, Jay Balunas <jbalunas at redhat.com> wrote:

>
> On Aug 4, 2014, at 9:17 PM, Summers Pittman <supittma at redhat.com> wrote:
>
> On Mon 04 Aug 2014 01:19:17 PM EDT, Randall Hauch wrote:
>
> Perhaps we’re looking at this from different perspectives. It’s great
> that you guys are trying to better understand DS so that you can
> compare it to other techniques, including OT. That certainly needs to
> be done. I guess I was looking at DS from the perspective of how a
> data service might need to implement it, knowing that the choice of
> how data sync is ultimately done will be influenced in part by how the
> data service would implement each approach and the impact on
> scalability and performance. Perhaps it’s too early to provide my
> thoughts along those lines.
>
> On Aug 4, 2014, at 9:34 AM, Summers Pittman <supittma at redhat.com> wrote:
>
> I am speaking from the perspective of the algorithm and less from my
> opinions of the way the system should work as a whole.
>
> On 08/01/2014 03:02 PM, Randall Hauch wrote:
>
> I’ve really enjoyed learning about what AeroGear has been doing with
> data sync. This is a tough problem, but finding a solution is really
> important. Both data sync POCs appear to use Differential
> Synchronization, or DS [1]. I was not familiar with the paper until
> today, but after reading it I do have a few questions/comments. Bear
> with me; this is a long post.
>
> DS is clearly targeted for use within a collaborative document
> editor, where there are multiple clients concurrently editing the
> same document, and at any one time there are a relatively small
> number of documents being edited; you can get a feel for this by
> looking at figures 5 and 7 in the paper [1] — look at the amount of
> server memory and CPU required to perform DS on just one document
> being edited by a half-dozen clients. Also, in a collaborative
> document editor, clients are often continually making changes even
> as they attempt to synchronize with the server.
>
> It doesn't actually make any claims about CPU or memory usage.  A
> shadow document is needed for each connection.  For documents which
> are infrequently edited, the shadow doc can easily be frozen to disk
> until an edit comes in.
>
>
> Sure, you don’t have to keep it in memory. But it does have to be in
> memory to do anything with it in an efficient way. And, yes, you
> certainly can build a data service that uses this technique. My point
> was that having multiple copies of a document being edited will reduce
> the scalability of the data service compared to other techniques.
>
>
> (It’s interesting that Google Docs, and Google Wave before it,
> appear to use Operational Transformation [2] rather than DS. OT
> might also make it easier to implement undo/redo, which works really
> well in Google Docs.)
>
> That is probably because OT, Docs, and Apache Wave are all older than
> Diff-sync.  OT is also a much more complicated algorithm in my
> experience (and from browsing around on Wikipedia).
>
>
> An MBaaS or any other database-like service is very different. It
> has to host multiple applications (i.e., databases), each with
> multiple collections containing potentially millions of entities
> (e.g., JSON documents). The entities themselves are more
> fine-grained and smaller than collaborative documents (though
> probably a bit coarser-grained and larger than a single record in a
> RDBMS). Many clients might be reading and updating lots of documents
> at once, and the data service has to coordinate those changes. A
> single batch update from one client might request changes to dozens
> of entities. And the clients can/will always wait for confirmation
> that the server made the requested changes before continuing (unless
> the client is offline); or at a minimum can enqueue the requested
> changes.
>
> Two quick things.  A document is just a collection of entities and
> can be structured to reduce this problem (especially if we are faking
> it on a RDBMS with particularly sadistic abuses to an ORM).
>
>
> Yes, a document might be a JSON document that is an aggregate of
> multiple objects, and not just a flat map of key-value pairs. The use
> of aggregate data structures and denormalization are some of the ways
> that eventually-consistent data stores work. The goal is to reduce the
> scope of a set of operations to a single aggregate. Other data stores
> (like graph and hierarchical databases) require strong consistency and
> transactions because operations necessarily span multiple objects. But
> limiting operations to a single aggregate is also quite constraining
> w/r/t app development, since you can’t always denormalize all data to
> separate aggregates.
>
> So even if a collection (in the MongoDB sense) contains documents that
> are aggregates of multiple “entities” (in the Hibernate sense of the
> word), my point still stands that generally any given JSON document
> will still be smaller than a document used in a collaborative document
> editor. Also, I would not be surprised if the sheer number of
> documents in a MongoDB collection is orders of magnitude larger than
> the number of documents stored by a collaborative editor app.
>
> Clients don't have to wait for the edits to be merged on the server
> and the nature of diff-sync gives us batching for free.
>
>
> Hmm… even if you could do it this way, do you not want to be able to
> give feedback to the user that the changes might not have been
> accepted/persisted?
>
> Do you have some scenarios that describe the kinds of applications
> you’re considering? I’m wondering if I’m envisioning a different kind
> of app.
>
>
> Given these characteristics, using DS within the data service might
> be extremely expensive in terms of CPU and memory
>
> or it might not be.  We need data, use cases, etc to test and see
> what happens.
>
> , and difficult for a DS-based service to implement all of the
> features necessary.
>
> Which features?  Features of the algorithm or features of the
> application?  The algorithm is really REALLY simple for what we get
> out of it.
>
>
> I was referring to features of the data service, and especially how
> the data service’s implementation can satisfy the difficult
> non-functional requirements like scalability and performance. While
> the algorithm might be really simple, that doesn’t mean implementing
> it on the server is efficient. What I’ve read so far makes me think
> that it could very well be less efficient and scalable than other
> techniques used in data services.
>
> First, the data service doesn’t really know which entities are
> being “edited”; instead, connected clients read entities, make
> changes locally, then request the service make those changes.
>
> I disagree.  The service knows documents which the client has a
> connection to/active session for.  It most certainly knows which
> entities are being edited.
>
>
> I guess I was hoping that the client can manipulate documents locally
> without having to coordinate that with the server. Again, I’m
> concerned about server scalability.
>
> Secondly, every time a change comes in, to compute the diff the
> service would have to read the persisted entity; this not only is
> inefficient, but this also makes it more difficult to scale and
> handle the concurrency, consistency, atomicity, and serializability
> guarantees.
>
> See earlier comment about sadistic abuses of an ORM.  Yes we have to
> be aware of the RDB underneath the sync server, but I don't think
> this is a problem with the algorithm.
>
>
> I agree, it’s not a problem with the algorithm. It’s a problem insofar
> as it would mandate what the server has to support.
>
> Thirdly, what would the data service need to do when a client
> connects and asks for the changes since it was last connected?
>
> Send it the diff of the client's server-side shadow and the server's
> current document.  This diff will get sent to the client, merged with
> the client's shadow, and the diff of that will get sent back to the
> server.  Repeat until the client is in sync.
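To make the reconnect cycle described above concrete, here is a minimal Python sketch. It is not the AeroGear POC code and not the full algorithm from the paper: it assumes flat JSON-like dicts, an exact (non-fuzzy) patch, no version numbers or backup shadows, and a hypothetical `Peer` class that pairs one working document with one shadow. A real server would keep one shadow per connected client next to a shared document.

```python
_MISSING = object()  # sentinel meaning "key not present"


def diff(old, new):
    """Delta that turns flat dict `old` into flat dict `new`."""
    return {
        "set": {k: v for k, v in new.items() if old.get(k, _MISSING) != v},
        "delete": [k for k in old if k not in new],
    }


def patch(doc, delta):
    """Return a copy of `doc` with `delta` applied."""
    out = dict(doc)
    out.update(delta["set"])
    for key in delta["delete"]:
        out.pop(key, None)
    return out


def is_empty(delta):
    return not delta["set"] and not delta["delete"]


class Peer:
    """One end of a connection: a working document plus its shadow."""

    def __init__(self, doc):
        self.doc = dict(doc)
        self.shadow = dict(doc)

    def outgoing(self):
        # Diff the shadow against the working copy, then advance the shadow.
        delta = diff(self.shadow, self.doc)
        self.shadow = dict(self.doc)
        return delta

    def incoming(self, delta):
        # Apply the remote edits to both the shadow and the working copy.
        self.shadow = patch(self.shadow, delta)
        self.doc = patch(self.doc, delta)


def sync(client, server, max_rounds=10):
    """Exchange diffs until a full round trip carries no edits."""
    for rounds in range(1, max_rounds + 1):
        down = server.outgoing()
        client.incoming(down)
        up = client.outgoing()
        server.incoming(up)
        if is_empty(down) and is_empty(up):
            return rounds
    raise RuntimeError("peers did not converge")


# The server changed one field while the client was offline,
# and the client changed a different field locally.
server = Peer({"title": "a", "done": False})
client = Peer({"title": "a", "done": False})
server.doc["title"] = "b"
client.doc["done"] = True
rounds = sync(client, server)
```

Only the changed fields cross the wire in each direction. If both sides had changed the same field, this simplified patch would let the last applied edit win, which is exactly the kind of merge policy that still needs to be decided.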
>
> The data service might be able to quickly find out which entities
> were modified since then, but computing the diffs (relative to the
> time the client last connected) for all of those changed entities
> would be very complicated.
>
> It isn’t.
>
>
> Perhaps I should have said “expensive” rather than “complicated”.
>
> It may be easier and better for the data service to record the
> individual changes (edits) made by each transaction, and then to use
> that information to compute the effective diffs from some period of
> time. In fact, these recorded edits might also be useful to
> implement other features within the data service; see CQRS [3] and [4].
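As a rough illustration of the "record the edits, then compute the effective diff" idea above, here is a small Python sketch. The `Edit` and `EditLog` names, the per-field granularity, and the use of a sequence number as the "last seen" marker are all assumptions made for the example, not something taken from an existing data service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    seq: int          # monotonically increasing sequence number
    entity_id: str
    field: str
    value: object     # assumption: None means "field removed"


class EditLog:
    """Append-only record of individual edits (hypothetical)."""

    def __init__(self):
        self._edits = []

    def record(self, entity_id, field, value):
        edit = Edit(len(self._edits) + 1, entity_id, field, value)
        self._edits.append(edit)
        return edit.seq

    def effective_diff(self, since_seq):
        """Collapse every edit after `since_seq` into one delta per entity."""
        result = {}
        for edit in self._edits:
            if edit.seq > since_seq:
                # Later edits to the same field overwrite earlier ones.
                result.setdefault(edit.entity_id, {})[edit.field] = edit.value
        return result


log = EditLog()
log.record("e1", "title", "a")
last_seen = log.record("e1", "title", "b")
log.record("e1", "title", "c")
log.record("e2", "done", True)
delta = log.effective_diff(last_seen)
```

A client that reconnects with `last_seen` gets one collapsed delta per changed entity, without the service having to keep a shadow copy per connection. The cost moves to storing and scanning the log instead.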
>
> What is really required by the client when trying to synchronize its
> data after being disconnected? Assuming the client can say which
> subset of entities it’s interested in when it reconnects (via some
> criteria in a subscription), does the client want:
>
> 1. the new versions of those entities that changed;
>
> No
>
>
> This is actually what a number of MBaaS offerings do, although it’s
> often hidden by the client SDKs. It may not be ideal because it places
> more work onto the client SDK, but the benefit is that a good portion
> of the work is done on the client, and the load on the server is
> reduced (and scalability increased). It’s also trivially easy for the
> data service to implement.
>
> 2. the deltas in the entities; and/or
>
> Yes
>
>
> 3. all of the events describing the individual changes made to all
>    of those entities?
>
> No.
>
>
> It may not matter for clients that don’t allow local offline
> changes, but what might the preferred approach be for clients that
> do allow offline changes? Option 1 is clearly the easiest from the
> perspective of the data service, but options #2 and #3 can certainly
> be handled. With option #1, can the client do something like DS and
> maintain copies of each original (unmodified) entity so that it can
> compute the differences? Does this (perhaps with a journal of edits
> made while offline) provide enough info for the client to properly
> merge the local changes, or does the client really need the
> individual events in #3 so that it can, for example, know that some
> local changes were made to now-out-of-date data?
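One way to picture option #1 on the client is a three-way merge: the client keeps the original (unmodified) copy of each entity as a base, and compares it with its local copy and the new server version. The Python sketch below assumes flat dicts and a hypothetical policy of keeping the server value on conflict while reporting the field to the application. It is one possible approach, not a description of what any SDK actually does.

```python
_ABSENT = object()  # sentinel meaning "field not present"


def three_way_merge(base, local, remote):
    """Merge flat dicts; return (merged, conflicting_fields)."""
    merged, conflicts = {}, []
    for key in set(base) | set(local) | set(remote):
        b = base.get(key, _ABSENT)
        l = local.get(key, _ABSENT)
        r = remote.get(key, _ABSENT)
        if l == r:
            value = l            # both sides agree
        elif l == b:
            value = r            # only the server changed this field
        elif r == b:
            value = l            # only the client changed this field
        else:
            conflicts.append(key)
            value = r            # assumption: server wins, app is told
        if value is not _ABSENT:
            merged[key] = value
    return merged, sorted(conflicts)


base = {"title": "a", "done": False, "note": "x"}
local = {"title": "a", "done": True, "note": "y"}    # offline edits
remote = {"title": "b", "done": False, "note": "z"}  # new server version
merged, conflicts = three_way_merge(base, local, remote)
```

The `conflicts` list is how the client learns that a local change was made against out-of-date data. Here that works without the individual events from option #3, at the price of storing a base copy per entity on the device.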
>
> Except in the case of a merge error, the algorithm handles long
> offline periods with edits just fine.  If there is a merge error the
> user/application will have to manually merge the documents somehow.
>
> One of the things to keep in mind is on mobile devices the radio is
> the most expensive thing you can control as an application.  Any
> decision we make should err toward only sending as little data as
> possible as few times as possible.
>
>
> I completely agree.
>
>
> Will the same option work for online notifications? After all, it’d
> be great if the same mechanism was used for data-sync, offline
> (push) notifications, and online notifications (events).
>
> I don't understand your question.
>
>
> Only that it seems beneficial that the same mechanism be used for
> “events” (while connected) and both online and offline data-sync.
>
>
> _______________________________________________
> aerogear-dev mailing list
> aerogear-dev at lists.jboss.org
> https://lists.jboss.org/mailman/listinfo/aerogear-dev
>
>
> I don't think it is too early for service minded people to point out
> the big honking problems we will face.  At the very least it helps us
> figure out the use cases we will support for the initial releases as the
> POC morphs into an actual project.
>
>
> I actually think this is a very good point (as are other points on both
> sides).  We will rely on people like you Randall to assist us in
> understanding the impact and advocate for the deeper server-side needs.
> At the same time, the client-side APIs and developer experience will
> also need to be reviewed and taken into consideration.
>
> I think it is important to remember that all of these are POC's and I
> think with a problem domain as complex as this, before we get to actual
> cross-project implementation we need to develop and flesh out specs,
> including all of these various points, risks, etc...  This would be across
> data services, liveoak and aerogear and possibly others.
>
>
> From a mobile client developer perspective, this algorithm feels very
> easy to understand and it handles a lot of annoying corner cases very
> well (e.g. offline, operation batching, conflict handling, document
> synchronization, collaboration, etc.).  Additionally, in my
> experiments, plugging in different diff-merge-synch operations allows
> for the general framework to be adapted to other use cases and data
> types.  (i.e. if the server is just batching a list of changed Object
> IDs instead of the actual changes for binary files).
>
> One of the things we need to do is determine which problems we are
> going to solve/support (single user multi device sync, multi user
> collaboration, binary files etc), which we are going to short circuit
> (limiting numbers of collaborators, limiting file size, etc), and which
> we are just going to document as best practices to use or to avoid.
>
> Maybe I'm just stuck seeing my hammer (DS) and everything is nails and
> if someone can give a demo of something else (OT, etc) then I will be
> really interested to compare.
>
>
> --
> Summers Pittman
>
> Phone:404 941 4698
> Java is my crack.
>
>