Perhaps we’re looking at this from different perspectives. It’s great that you guys are
trying to better understand DS so that you can compare it to other techniques, including
OT. That certainly needs to be done. I guess I was looking at DS from the perspective of
how a data service might need to implement it, knowing that the choice of how data sync is
ultimately done will be influenced in part by how the data service would implement each
approach and the impact on scalability and performance. Perhaps it’s too early to provide
my thoughts along those lines.
On Aug 4, 2014, at 9:34 AM, Summers Pittman <supittma(a)redhat.com> wrote:
I am speaking from the perspective of the algorithm and less from my
opinions of the way the system should work as a whole.
On 08/01/2014 03:02 PM, Randall Hauch wrote:
> I’ve really enjoyed learning about what AeroGear has been doing with data sync. This
is a tough problem, but finding a solution is really important. Both data sync POCs appear
to use Differential Synchronization, or DS [1]. I was not familiar with the paper until
today, but after reading it I do have a few questions/comments. Bear with me; this is a
long post.
>
> DS is clearly targeted for use within a collaborative document editor, where there
are multiple clients concurrently editing the same document, and at any one time there are
a relatively small number of documents being edited; you can get a feel for this by
looking at figures 5 and 7 in the paper [1] — look at the amount of server memory and CPU
required to perform DS on just one document being edited by a half-dozen clients. Also, in
a collaborative document editor, clients are often continually making changes even as they
attempt to synchronize with the server.
The paper doesn't actually make any claims about CPU or memory usage. A shadow document is
needed for each connection. For documents which are infrequently edited, the shadow doc
can easily be frozen to disk until an edit comes in.
Sure, you don’t have to keep it in memory. But it does have to be in memory to do anything
with it in an efficient way. And, yes, you certainly can build a data service that uses
this technique. My point was that having multiple copies of a document being edited will
reduce the scalability of the data service compared to other techniques.
>
> (It’s interesting that Google Docs, and Google Wave before it, appear to use
Operational Transformation [2] rather than DS. OT might also make it easier to implement
undo/redo, which works really well in Google Docs.)
That is probably because OT, Docs, and Apache Wave are all older than Diff-sync. OT is
also a much more complicated algorithm in my experience (and from browsing around on
Wikipedia).
>
> An MBaaS or any other database-like service is very different. It has to host
multiple applications (i.e., databases), each with multiple collections containing
potentially millions of entities (e.g., JSON documents). The entities themselves are more
fine-grained and smaller than collaborative documents (though probably a bit
coarser-grained and larger than a single record in an RDBMS). Many clients might be reading
and updating lots of documents at once, and the data service has to coordinate those
changes. A single batch update from one client might request changes to dozens of
entities. And the clients can/will always wait for confirmation that the server made the
requested changes before continuing (unless the client is offline); or at a minimum can
enqueue the requested changes.
Two quick things. A document is just a collection of entities and can be structured to
reduce this problem (especially if we are faking it on an RDBMS with particularly sadistic
abuses to an ORM).
Yes, a document might be a JSON document that is an aggregate of multiple objects, and not
just a flat map of key-value pairs. The use of aggregate data structures and
denormalization are some of the ways that eventually-consistent data stores work. The goal
is to reduce the scope of a set of operations to a single aggregate. Other data stores
(like graph and hierarchical databases) require strong consistency and transactions
because operations necessarily span multiple objects. But limiting operations to a single
aggregate is also quite constraining w/r/t app development, since you can’t always
denormalize all data to separate aggregates.
So even if a collection (in the MongoDB sense) contains documents that are aggregates of
multiple “entities” (in the Hibernate sense of the word), my point still stands that
generally any given JSON document will still be smaller than a document used in a
collaborative document editor. Also, I would not be surprised if the sheer number of
documents in a MongoDB collection is orders of magnitude larger than the number of
documents stored by a collaborative editor app.
Clients don't have to wait for the edits to be merged on the
server and the nature of diff-sync gives us batching for free.
Hmm… even if you could do it this way, do you not want to be able to give feedback to the
user that the changes might not have been accepted/persisted?
Do you have some scenarios that describe the kinds of applications you’re considering? I’m
wondering if I’m envisioning a different kind of app.
>
> Given these characteristics, using DS within the data service might be extremely
expensive in terms of CPU and memory
or it might not be. We need data, use cases, etc to test and see what happens.
> , and difficult for a DS-based service to implement all of the features necessary.
Which features? Features of the algorithm or features of the application? The algorithm
is really REALLY simple for what we get out of it.
I was referring to features of the data service, and especially how the data service’s
implementation can satisfy the difficult non-functional requirements like scalability and
performance. While the algorithm might be really simple, that doesn’t mean implementing it
on the server is efficient. What I’ve read so far makes me think that it could very well
be less efficient and scalable than other techniques used in data services.
> First, the data service doesn’t really know which entities are
being “edited”; instead, connected clients read entities, make changes locally, then
request the service make those changes.
I disagree. The service knows which documents the client has a connection to or an active
session for. It most certainly knows which entities are being edited.
I guess I was hoping that the client can manipulate documents locally without having to
coordinate that with the server. Again, I’m concerned about server scalability.
> Secondly, every time a change comes in, to compute the diff the
service would have to read the persisted entity; this not only is inefficient, but this
also makes it more difficult to scale and handle the concurrency, consistency, atomicity,
and serializability guarantees.
See earlier comment about sadistic abuses of an ORM. Yes we have to be aware of the RDB
underneath the sync server, but I don't think this is a problem with the algorithm.
I agree, it’s not a problem with the algorithm. It’s a problem insofar as it would mandate
what the server has to support.
> Thirdly, what would the data service need to do when a client
connects and asks for the changes since it was last connected?
Send it the diff of the client's server-side shadow and the server's current document.
This diff will get sent to the client, merged with the client's shadow, and the diff of
that will get sent back to the server. Repeat until the client is in sync.
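
The reconnect round trip described above can be sketched as follows. This is only a minimal illustration, assuming entities are flat JSON-like maps and that a diff is simply the set of changed keys; the DS paper itself works on text with fuzzy patching. All names here (diff, apply_patch, server_send, client_receive, DELETED) are hypothetical and are not taken from either POC.

```python
from typing import Any

Doc = dict[str, Any]

# Hypothetical sentinel marking a key that the patch removes.
DELETED = object()


def diff(old: Doc, new: Doc) -> Doc:
    """Return the keys that changed from old to new; removed keys map to DELETED."""
    patch = {k: v for k, v in new.items() if old.get(k, DELETED) != v}
    patch.update({k: DELETED for k in old if k not in new})
    return patch


def apply_patch(doc: Doc, patch: Doc) -> Doc:
    """Return a copy of doc with the patch applied."""
    out = dict(doc)
    for key, value in patch.items():
        if value is DELETED:
            out.pop(key, None)
        else:
            out[key] = value
    return out


def server_send(server_doc: Doc, server_shadow: Doc) -> tuple[Doc, Doc]:
    """Diff this client's server-side shadow against the current document.

    Returns the patch to send and the advanced shadow.
    """
    patch = diff(server_shadow, server_doc)
    return patch, apply_patch(server_shadow, patch)


def client_receive(client_doc: Doc, client_shadow: Doc, patch: Doc) -> tuple[Doc, Doc, Doc]:
    """Merge the server patch on the client and compute the reply patch.

    The reply contains whatever the client changed locally (for example
    while offline) that the server has not seen yet.
    """
    new_shadow = apply_patch(client_shadow, patch)
    new_doc = apply_patch(client_doc, patch)
    reply = diff(new_shadow, new_doc)
    return new_doc, new_shadow, reply
```

In this sketch the server would apply the reply to both its document and its shadow for that client, and the exchange repeats until both diffs come back empty. Note that a key changed on both sides is simply overwritten by the server's value here; conflict handling is deliberately left out.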
> The data service might be able to quickly find out which entities were modified since
then, but computing the diffs (relative to the time the client last connected) for all of
those changed entities would be very complicated.
It isn’t.
Perhaps I should have said “expensive” rather than “complicated”.
> It may be easier and better for the data service to record the
individual changes (edits) made by each transaction, and then to use that information to
compute the effective diffs from some period of time. In fact, these recorded edits might
also be useful to implement other features within the data service; see CQRS [3] and [4].
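
As a rough illustration of the recorded-edits idea, the sketch below keeps an append-only log of field-level edits and collapses everything after a given sequence number into one delta per entity. It assumes last-writer-wins per field and a single global sequence number; the Edit type and effective_diff function are hypothetical names, not part of any existing data service.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Edit:
    seq: int        # assumed: monotonically increasing transaction sequence number
    entity_id: str  # which entity the transaction touched
    field: str      # which field was written
    value: Any      # the value written


def effective_diff(log: list[Edit], since_seq: int) -> dict[str, dict[str, Any]]:
    """Collapse all edits with seq > since_seq into one delta per entity.

    Later edits to the same field overwrite earlier ones, so a client that
    last synced at since_seq receives only the net change.
    """
    delta: dict[str, dict[str, Any]] = {}
    for edit in sorted(log, key=lambda e: e.seq):
        if edit.seq > since_seq:
            delta.setdefault(edit.entity_id, {})[edit.field] = edit.value
    return delta
```

With such a log the service would not need a per-client shadow to answer "what changed since I last connected"; the client only has to remember the last sequence number it saw. Deletions and log compaction are not handled in this sketch.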
>
> What is really required by the client when trying to synchronize its data after being
disconnected? Assuming the client can say which subset of entities it’s interested in when
it reconnects (via some criteria in a subscription), does the client want:
> the new versions of those entities that changed;
No
This is actually what a number of MBaaS offerings do, although it’s often hidden by the
client SDKs. It may not be ideal because it places more work onto the client SDK, but the
benefit is that a good portion of the work is done on the client, and the load on the
server is reduced (and scalability increased). It’s also trivially easy for the data
service to implement.
> the deltas in the entities; and/or
Yes
> all of the events describing the individual changes made to all of those entities?
No.
>
> It may not matter for clients that don’t allow local offline changes, but what might
the preferred approach be for clients that do allow offline changes? Option #1 is clearly
the easiest from the perspective of the data service, but options #2 and #3 can certainly
be handled. With option #1, can the client do something like DS and maintain copies of
each original (unmodified) entity so that it can compute the differences? Does this
(perhaps with a journal of edits made while offline) provide enough info for the client to
properly merge the local changes, or does the client really need the individual events in
#3 so that it can, for example, know that some local changes were made to now-out-of-date
data?
Except in the case of a merge error, the algorithm handles long offline periods with
edits just fine. If there is a merge error the user/application will have to manually
merge the documents somehow.
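
To illustrate what a merge error could look like when the client keeps the original (unmodified) copy of an entity, here is a small field-level three-way merge. This is a swapped-in technique for illustration, not the fuzzy patching that DS uses, and it assumes flat JSON-like maps. The names three_way_merge, MergeConflict, and ABSENT are hypothetical.

```python
from typing import Any

Doc = dict[str, Any]

# Hypothetical sentinel for "key not present in this version".
ABSENT = object()


class MergeConflict(Exception):
    """Raised when local and remote changed the same field differently."""


def three_way_merge(base: Doc, local: Doc, remote: Doc) -> Doc:
    """Merge local and remote changes made relative to a common base copy."""
    merged: Doc = {}
    conflicts: list[str] = []
    for key in base.keys() | local.keys() | remote.keys():
        b = base.get(key, ABSENT)
        l = local.get(key, ABSENT)
        r = remote.get(key, ABSENT)
        if l == r:
            winner = l          # both sides agree (or neither changed it)
        elif l == b:
            winner = r          # only the remote side changed it
        elif r == b:
            winner = l          # only the local side changed it
        else:
            conflicts.append(key)  # both changed it, to different values
            continue
        if winner is not ABSENT:
            merged[key] = winner
    if conflicts:
        # The user or application has to resolve these fields manually.
        raise MergeConflict(sorted(conflicts))
    return merged
```

Edits to different fields merge cleanly no matter how long the client was offline; only fields changed on both sides to different values surface as an error for the user or application to resolve.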
One of the things to keep in mind is on mobile devices the radio is the most expensive
thing you can control as an application. Any decision we make should err toward
sending as little data as possible, as few times as possible.
I completely agree.
>
> Will the same option work for online notifications? After all, it’d be great if the
same mechanism was used for data-sync, offline (push) notifications, and online
notifications (events).
I don't understand your question.
Only that it seems beneficial that the same mechanism be used for “events” (while
connected) and both online and offline data-sync.