Use of Differential Synchronization for data sync

Friday, 1 August 2014

I’ve really enjoyed learning about what AeroGear has been doing with data sync. This is a
tough problem, but finding a solution is really important. Both data sync POCs appear to
use Differential Synchronization, or DS [1]. I was not familiar with the paper until
today, but after reading it I do have a few questions/comments. Bear with me; this is a
long post.

DS is clearly targeted for use within a collaborative document editor, where there are
multiple clients concurrently editing the same document, and at any one time there are a
relatively small number of documents being edited; you can get a feel for this by looking
at figures 5 and 7 in the paper [1] — look at the amount of server memory and CPU required
to perform DS on just one document being edited by a half-dozen clients. Also, in a
collaborative document editor, clients are often continually making changes even as they
attempt to synchronize with the server. 

(It’s interesting that Google Docs, and Google Wave before it, appear to use Operational
Transformation [2] rather than DS. OT might also make it easier to implement undo/redo,
which works really well in Google Docs.)

An MBaaS or any other database-like service is very different. It has to host multiple
applications (i.e., databases), each with multiple collections containing potentially
millions of entities (e.g., JSON documents). The entities themselves are more fine-grained
and smaller than collaborative documents (though probably a bit coarser-grained and larger
than a single record in an RDBMS). Many clients might be reading and updating lots of
documents at once, and the data service has to coordinate those changes. A single batch
update from one client might request changes to dozens of entities. And the clients
can/will always wait for confirmation that the server made the requested changes before
continuing (unless the client is offline); or at a minimum can enqueue the requested
changes.

Given these characteristics, using DS within the data service might be extremely expensive
in terms of CPU and memory, and a DS-based service might find it difficult to implement
all of the necessary features. First, the data service doesn’t really know which entities
are being “edited”; instead, connected clients read entities, make changes locally, then
request that the service make those changes. Secondly, every time a change comes in, the
service would have to read the persisted entity to compute the diff; this is not only
inefficient, but also makes it more difficult to scale and to provide the concurrency,
consistency, atomicity, and serializability guarantees. Thirdly, what would the data
service need to do when a client connects and asks for the changes since it was last
connected? The data service might be able to quickly find out which entities were modified
since then, but computing the diffs (relative to the time the client last connected) for
all of those changed entities would be very complicated. It may be easier and better for
the data service to record the individual changes (edits) made by each transaction, and
then to use that information to compute the effective diffs from some period of time. In
fact, these recorded edits might also be useful to implement other features within the
data service; see CQRS [3] and [4].
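
To make the edit-recording idea a bit more concrete, here is a minimal Python sketch. It
assumes entities are flat JSON-like dicts and that each transaction gets a monotonically
increasing number; the names `Edit`, `EditLog`, `record`, and `effective_diff` are purely
illustrative and do not come from AeroGear or the DS paper. It ignores field deletions,
persistence, and concurrency.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Edit:
    txn: int                  # transaction number (assumed monotonically increasing)
    entity_id: str
    changes: Dict[str, Any]   # field name -> new value


@dataclass
class EditLog:
    edits: List[Edit] = field(default_factory=list)
    _next_txn: int = 1

    def record(self, batch: Dict[str, Dict[str, Any]]) -> int:
        """Record one transaction that may touch several entities."""
        txn = self._next_txn
        self._next_txn += 1
        for entity_id, changes in batch.items():
            self.edits.append(Edit(txn, entity_id, dict(changes)))
        return txn

    def effective_diff(self, since_txn: int) -> Dict[str, Dict[str, Any]]:
        """Fold every edit made after since_txn into one net change per entity."""
        net: Dict[str, Dict[str, Any]] = {}
        for e in self.edits:  # edits are stored in transaction order
            if e.txn > since_txn:
                # A later edit to the same field overwrites the earlier one.
                net.setdefault(e.entity_id, {}).update(e.changes)
        return net


log = EditLog()
t1 = log.record({"todo:1": {"title": "Buy milk", "done": False}})
t2 = log.record({"todo:1": {"done": True}, "todo:2": {"title": "Call Bob"}})
t3 = log.record({"todo:2": {"title": "Call Alice"}})

# A client that last synchronized at t1 only needs the net effect of t2 and t3.
diff_since_t1 = log.effective_diff(t1)
print(diff_since_t1)
```

With a log like this the service never has to re-read and diff whole entities: a
reconnecting client passes the last transaction number it saw and receives one folded
change set per entity.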

What is really required by the client when trying to synchronize its data after being
disconnected? Assuming the client can say which subset of entities it’s interested in when
it reconnects (via some criteria in a subscription), does the client want:
1. the new versions of those entities that changed;
2. the deltas in the entities; and/or
3. all of the events describing the individual changes made to all of those entities?

It may not matter for clients that don’t allow local offline changes, but what might the
preferred approach be for clients that do allow offline changes? Option #1 is clearly the
easiest from the perspective of the data service, but options #2 and #3 can certainly be
handled. With option #1, can the client do something like DS and maintain copies of each
original (unmodified) entity so that it can compute the differences? Does this (perhaps
with a journal of edits made while offline) provide enough info for the client to properly
merge the local changes, or does the client really need the individual events in #3 so
that it can, for example, know that some local changes were made to now-out-of-date data?
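
As a rough illustration of what a client could do with just the new entity versions plus
its own saved originals, here is a small field-level three-way merge in Python. This is a
sketch under assumptions, not DS itself: entities are flat dicts, and the helper names
`changed_fields` and `three_way_merge` are hypothetical. A field counts as a conflict
when both sides changed it to different values, which is one way to detect that a local
change was made against out-of-date data.

```python
from typing import Any, Dict, Set, Tuple

Entity = Dict[str, Any]
_MISSING = object()  # sentinel for "field not present"


def changed_fields(base: Entity, other: Entity) -> Set[str]:
    """Fields whose value differs between base and other (including adds/removes)."""
    keys = set(base) | set(other)
    return {k for k in keys if base.get(k, _MISSING) != other.get(k, _MISSING)}


def three_way_merge(base: Entity, local: Entity, remote: Entity) -> Tuple[Entity, Set[str]]:
    """Merge local offline changes onto the server's version.

    base   -- the unmodified copy the client saved when it last synchronized
    local  -- the client's current copy, including offline edits
    remote -- the new version received from the server
    Returns the merged entity and the set of conflicting field names.
    """
    local_changed = changed_fields(base, local)
    remote_changed = changed_fields(base, remote)
    conflicts = {
        k for k in local_changed & remote_changed
        if local.get(k, _MISSING) != remote.get(k, _MISSING)
    }
    merged = dict(remote)  # start from the server's version
    for k in local_changed - conflicts:
        if k in local:
            merged[k] = local[k]   # apply the non-conflicting local edit
        else:
            merged.pop(k, None)    # the field was removed locally
    return merged, conflicts


base = {"title": "Buy milk", "done": False, "notes": ""}
local = {"title": "Buy oat milk", "done": False, "notes": "2 litres"}
remote = {"title": "Buy milk", "done": True, "notes": "ask Sam"}

merged, conflicts = three_way_merge(base, local, remote)
print(merged, conflicts)
```

In this sketch conflicting fields keep the server's value and are reported back so the
application can decide what to do with them. Whether that is enough, or whether the
client needs the individual events to resolve such conflicts sensibly, is exactly the
open question.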

Will the same option work for online notifications? After all, it’d be great if the same
mechanism were used for data-sync, offline (push) notifications, and online notifications
(events).

Finally, the data sync APIs of the data service should support the use of local client
storage, but they should not require it.

Best regards,

Randall

[1] http://research.google.com/pubs/pub35605.html
[2] http://en.wikipedia.org/wiki/Operational_transformation
[3] http://www.infoq.com/presentations/Events-Are-Not-Just-for-Notifications
[4] http://martinfowler.com/bliki/CQRS.html
