Re: [aerogear-dev] Use of Differential Synchronization for data sync

Monday, 4 August 2014

I am speaking from the perspective of the algorithm and less from my 
opinions of the way the system should work as a whole.

On 08/01/2014 03:02 PM, Randall Hauch wrote:
...
 I've really enjoyed learning about what AeroGear has been doing with
 data sync. This is a tough problem, but finding a solution is really 
 important. Both data sync POCs appear to use Differential 
 Synchronization, or DS [1]. I was not familiar with the paper until 
 today, but after reading it I do have a few questions/comments. Bear 
 with me; this is a long post.

 DS is clearly targeted for use within a collaborative document editor, 
 where there are multiple clients concurrently editing the same 
 document, and at any one time there are a relatively small number of 
 documents being edited; you can get a feel for this by looking at 
 figures 5 and 7 in the paper [1] --- look at the amount of server 
 memory and CPU required to perform DS on just one document being 
 edited by a half-dozen clients. Also, in a collaborative document 
 editor, clients are often continually making changes even as they 
 attempt to synchronize with the server.

The paper doesn't actually make any claims about CPU or memory usage.
A shadow document is needed for each connection. For documents which
are infrequently edited, the shadow doc can easily be frozen to disk
until an edit comes in.
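
As a rough illustration of the freeze-to-disk idea, here is a minimal
Python sketch. The ShadowStore class, its method names, and the
one-JSON-file-per-connection layout are all made up for this example;
none of it comes from the paper or from the AeroGear POCs.

```python
import json
import os
import tempfile


class ShadowStore:
    """Hypothetical store: one shadow per (client, document) connection."""

    def __init__(self, directory):
        self.directory = directory
        self.live = {}  # (client_id, doc_id) -> shadow held in memory

    def _path(self, key):
        client_id, doc_id = key
        return os.path.join(self.directory, f"{client_id}__{doc_id}.json")

    def put(self, client_id, doc_id, shadow):
        self.live[(client_id, doc_id)] = shadow

    def freeze(self, client_id, doc_id):
        """Write an idle shadow to disk and drop it from memory."""
        key = (client_id, doc_id)
        shadow = self.live.pop(key)
        with open(self._path(key), "w") as handle:
            json.dump(shadow, handle)

    def get(self, client_id, doc_id):
        """Return the shadow, thawing it from disk if it was frozen."""
        key = (client_id, doc_id)
        if key not in self.live:
            with open(self._path(key)) as handle:
                self.live[key] = json.load(handle)
        return self.live[key]


store = ShadowStore(tempfile.mkdtemp())
store.put("client-1", "doc-42", {"title": "Draft"})
store.freeze("client-1", "doc-42")        # no edits for a while
shadow = store.get("client-1", "doc-42")  # an edit comes in
```

Only shadows for documents with recent edits stay in memory; how long
to wait before freezing would be a tuning decision.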
...

 (It's interesting that Google Docs, and Google Wave before it, appear 
 to use Operational Transformation [2] rather than DS. OT might also 
 make it easier to implement undo/redo, which works really well in 
 Google Docs.)

That is probably because OT, Docs, and Apache Wave are all older than
diff-sync. OT is also a much more complicated algorithm in my
experience (and from browsing around on Wikipedia).
...

 An MBaaS or any other database-like service is very different. It has 
 to host multiple applications (i.e., databases), each with multiple 
 collections containing potentially millions of entities (e.g., JSON 
 documents). The entities themselves are more fine-grained and smaller 
 than collaborative documents (though probably a bit coarser-grained 
 and larger than a single record in a RDBMS). Many clients might be 
 reading and updating lots of documents at once, and the data service 
 has to coordinate those changes. A single batch update from one client 
 might request changes to dozens of entities. And the clients can/will 
 always wait for confirmation that the server made the requested 
 changes before continuing (unless the client is offline); or at a 
 minimum can enqueue the requested changes.

Two quick things. A document is just a collection of entities and can
be structured to reduce this problem (especially if we are faking it on
an RDBMS with particularly sadistic abuses of an ORM). Clients don't
have to wait for the edits to be merged on the server, and the nature
of diff-sync gives us batching for free.
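
To show what I mean by batching, here is a small Python sketch. It
assumes a "document" is a dict of entity id to entity, and the diff
format (an "upsert" map plus a "delete" list) is invented for this
example. Diff-sync proper diffs text; this is only the same idea
applied to entities.

```python
import copy


def diff(shadow, doc):
    """Collect every entity that differs from the shadow into one batch."""
    return {
        "upsert": {eid: ent for eid, ent in doc.items()
                   if shadow.get(eid) != ent},
        "delete": [eid for eid in shadow if eid not in doc],
    }


shadow = {"a": {"n": 0}, "b": {"n": 0}, "c": {"n": 0}}
doc = copy.deepcopy(shadow)

# The client keeps editing locally without waiting on the server.
doc["a"]["n"] = 1
doc["a"]["n"] = 2      # a second edit to the same entity
doc["d"] = {"n": 9}    # a new entity
del doc["c"]           # a removed entity

# One diff against the shadow carries all of it in a single message.
batch = diff(shadow, doc)
```

The two edits to "a" collapse into one entry, because the diff only
compares the end state against the shadow.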
...

 Given these characteristics, using DS within the data service might be 
 extremely expensive in terms of CPU and memory

Or it might not be. We need data, use cases, etc. to test and see what
happens.
...
 , and difficult for a DS-based service to implement all of the 
 features necessary.

Which features? Features of the algorithm or features of the
application? The algorithm is really REALLY simple for what we get out
of it.
...
 First, the data service doesn't really know which entities are
 being "edited"; instead, connected clients read entities, make changes
 locally, then request the service make those changes.

I disagree. The service knows which documents the client has a
connection to or an active session for. It most certainly knows which
entities are being edited.
...
 Secondly, every time a change comes in, to compute the diff the 
 service would have to read the persisted entity; this not only is 
 inefficient, but this also makes it more difficult to scale and handle 
 the concurrency, consistency, atomicity, and serializability guarantees.

See earlier comment about sadistic abuses of an ORM. Yes, we have to be
aware of the RDB underneath the sync server, but I don't think this is
a problem with the algorithm.
...
 Thirdly, what would the data service need to do when a client connects
 and asks for the changes since it was last connected?

Send it the diff of the client's server-side shadow and the server's
current document. This diff will get sent to the client, merged with
the client's shadow, and the diff of that will get sent back to the
server. Repeat until the client is in sync.
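
The reconnect exchange above can be sketched in a few lines of Python.
This assumes documents are flat dicts and that a patch simply sets or
removes keys; the function names and the per-key patching are
simplifications made up for this example, whereas real diff-sync
applies fuzzy text patches.

```python
_MISSING = object()


def diff(old, new):
    """Return the edits (changed keys, removed keys) turning old into new."""
    changed = {k: v for k, v in new.items() if old.get(k, _MISSING) != v}
    removed = [k for k in old if k not in new]
    return changed, removed


def patch(doc, edits):
    """Apply edits to a copy of doc."""
    changed, removed = edits
    result = {k: v for k, v in doc.items() if k not in removed}
    result.update(changed)
    return result


def sync_round(server, client):
    """One server->client->server exchange; True once both sides agree."""
    edits = diff(server["shadow"], server["doc"])
    server["shadow"] = dict(server["doc"])
    client["shadow"] = patch(client["shadow"], edits)
    client["doc"] = patch(client["doc"], edits)

    edits = diff(client["shadow"], client["doc"])
    client["shadow"] = dict(client["doc"])
    server["shadow"] = patch(server["shadow"], edits)
    server["doc"] = patch(server["doc"], edits)
    return server["doc"] == client["doc"]


common = {"title": "a", "body": "x"}  # state at the last sync
server = {"doc": {"title": "b", "body": "x", "tags": 1},
          "shadow": dict(common)}
client = {"doc": {"title": "a", "body": "y"},  # edited while offline
          "shadow": dict(common)}

for _ in range(10):  # repeat until the client is in sync
    if sync_round(server, client):
        break
```

Note that the server never has to work out "changes since time T". The
shadow it kept for that client already is the state at the last sync.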
...
 The data service might be able to quickly find out which entities were
 modified since then, but computing the diffs (relative to the time the
 client last connected) for all of those changed entities would be very
 complicated.

It isn't.
...
 It may be easier and better for the data service to record the 
 individual changes (edits) made by each transaction, and then to use 
 that information to compute the effective diffs from some period of 
 time. In fact, these recorded edits might also be useful to implement 
 other features within the data service; see CQRS [3] and [4].

 What is really required by the client when trying to synchronize its 
 data after being disconnected? Assuming the client can say which 
 subset of entities it's interested in when it reconnects (via some 
 criteria in a subscription), does the client want:

  1. the new versions of those entities that changed;

No.
...

  2. the deltas in the entities; and/or

Yes.
...

  3. all of the events describing the individual changes made to all of
     those entities?

No.
...

 It may not matter for clients that don't allow local offline changes, 
 but what might the preferred approach be for clients that do allow 
 offline changes? Option 1 is clearly the easiest from the perspective 
 of the data service, but options #2 and #3 can certainly be handled. 
 With option #1, can the client do something like DS and maintain 
 copies of each original (unmodified) entity so that it can compute the 
 differences? Does this (perhaps with a journal of edits made while 
 offline) provide enough info for the client to properly merge the 
 local changes, or does the client really need the individual events in 
 #3 so that it can, for example, know that some local changes were made 
 to now-out-of-date data?

Except in the case of a merge error, the algorithm handles long offline
periods with edits just fine. If there is a merge error, the
user/application will have to manually merge the documents somehow.
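
For what a merge error could look like in code, here is a hedged Python
sketch. It assumes flat dict documents and an edit format that records
the old and new value for each key, with None standing in for "absent".
The MergeConflict exception and both function names are hypothetical.

```python
class MergeConflict(Exception):
    """Raised with the keys that need a manual merge."""


def diff(shadow, doc):
    """Record (old, new) for every key that changed; None means absent."""
    keys = set(shadow) | set(doc)
    return {k: (shadow.get(k), doc.get(k))
            for k in keys if shadow.get(k) != doc.get(k)}


def patch(doc, edits):
    """Apply edits, refusing any key that changed underneath us."""
    result = dict(doc)
    conflicts = []
    for key, (old, new) in edits.items():
        current = result.get(key)
        if current == old or current == new:
            if new is None:
                result.pop(key, None)
            else:
                result[key] = new
        else:
            conflicts.append(key)
    if conflicts:
        raise MergeConflict(sorted(conflicts))
    return result


base = {"title": "Draft", "owner": "ann"}
offline_client = {"title": "Final", "owner": "ann"}
edits = diff(base, offline_client)

# The server changed a different key meanwhile: merges cleanly.
merged = patch({"title": "Draft", "owner": "bob"}, edits)

# The server changed the same key meanwhile: needs a manual merge.
try:
    patch({"title": "Review", "owner": "ann"}, edits)
    conflict_keys = []
except MergeConflict as error:
    conflict_keys = error.args[0]
```

What to do with the conflicting keys is up to the application; the
sketch only shows where the hand-off would happen.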

One of the things to keep in mind is that on mobile devices the radio
is the most expensive thing you can control as an application. Any
decision we make should err toward sending as little data as possible,
as few times as possible.
...

 Will the same option work for online notifications? After all, it'd be 
 great if the same mechanism was used for data-sync, offline (push) 
 notifications, and online notifications (events).

I don't understand your question.
...

 Finally, the data sync APIs of the data service should support the use 
 of local client storage, but it should not require it.

 Best regards,

 Randall

 [1] http://research.google.com/pubs/pub35605.html
 [2] http://en.wikipedia.org/wiki/Operational_transformation
 [3] http://www.infoq.com/presentations/Events-Are-Not-Just-for-Notifications
 [4] http://martinfowler.com/bliki/CQRS.html

 _______________________________________________
 aerogear-dev mailing list
 aerogear-dev(a)lists.jboss.org
 https://lists.jboss.org/mailman/listinfo/aerogear-dev 

-- 
Summers Pittman
...
>Phone:404 941 4698
>Java is my crack. 
