> I think it is important to remember that all of these are POCs and, with a problem domain as complex as this, before we get to actual cross-project implementation we need to develop and flesh out specs, including all of these various points, risks, etc...
I agree with this. 

I think limiting the scope of this for now, and enabling the underlying choice of storage/data service to be flexible, will allow us to cater for the complex enterprise cases as well as the simpler ones, which are important too if we want people to start using this and trying it out.

On 6 August 2014 00:34, Jay Balunas <jbalunas@redhat.com> wrote:

On Aug 4, 2014, at 9:17 PM, Summers Pittman <supittma@redhat.com> wrote:

On Mon 04 Aug 2014 01:19:17 PM EDT, Randall Hauch wrote:
Perhaps we’re looking at this from different perspectives. It’s great
that you guys are trying to better understand DS so that you can
compare it to other techniques, including OT. That certainly needs to
be done. I guess I was looking at DS from the perspective of how a
data service might need to implement it, knowing that the choice of
how data sync is ultimately done will be influenced in part by how the
data service would implement each approach and the impact on
scalability and performance. Perhaps it’s too early to provide my
thoughts along those lines.

On Aug 4, 2014, at 9:34 AM, Summers Pittman <supittma@redhat.com> wrote:

I am speaking from the perspective of the algorithm and less from my
opinions of the way the system should work as a whole.

On 08/01/2014 03:02 PM, Randall Hauch wrote:
I’ve really enjoyed learning about what AeroGear has been doing with
data sync. This is a tough problem, but finding a solution is really
important. Both data sync POCs appear to use Differential
Synchronization, or DS [1]. I was not familiar with the paper until
today, but after reading it I do have a few questions/comments. Bear
with me; this is a long post.

DS is clearly targeted for use within a collaborative document
editor, where there are multiple clients concurrently editing the
same document, and at any one time there are a relatively small
number of documents being edited; you can get a feel for this by
looking at figures 5 and 7 in the paper [1] — look at the amount of
server memory and CPU required to perform DS on just one document
being edited by a half-dozen clients. Also, in a collaborative
document editor, clients are often continually making changes even
as they attempt to synchronize with the server.
The paper doesn't actually make any claims about CPU or memory usage.
A shadow document is needed for each connection.  For documents which
are infrequently edited, the shadow doc can easily be frozen to disk
until an edit comes in.
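
To make the freeze-to-disk idea concrete, here is a minimal sketch of
a per-connection shadow store.  None of these names (ShadowStore,
ShadowDao, etc.) come from the POCs; they are invented for
illustration only:

    // Hypothetical sketch: one shadow per (client, document) pair.
    // Shadows for idle documents are frozen to disk and reloaded
    // when the next edit arrives.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    record Shadow(String docId, long serverVersion, long clientVersion,
                  String text) {}

    interface ShadowDao {            // disk/DB persistence, details TBD
        Shadow load(String key);
        void save(String key, Shadow shadow);
    }

    class ShadowStore {
        private final Map<String, Shadow> active = new ConcurrentHashMap<>();
        private final ShadowDao dao;

        ShadowStore(ShadowDao dao) { this.dao = dao; }

        // Thaw on demand: only actively edited shadows stay in memory.
        Shadow shadowFor(String clientId, String docId) {
            return active.computeIfAbsent(clientId + ":" + docId, dao::load);
        }

        // Freeze an idle shadow back to disk until the next edit comes in.
        void freeze(String clientId, String docId) {
            String key = clientId + ":" + docId;
            Shadow s = active.remove(key);
            if (s != null) dao.save(key, s);
        }
    }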

Sure, you don’t have to keep it in memory. But it does have to be in
memory to do anything with it in an efficient way. And, yes, you
certainly can build a data service that uses this technique. My point
was that having multiple copies of a document being edited will reduce
the scalability of the data service compared to other techniques.


(It’s interesting that Google Docs, and Google Wave before it,
appear to use Operational Transformation [2] rather than DS. OT
might also make it easier to implement undo/redo, which works really
well in Google Docs.)
That is probably because OT, Docs, and Apache Wave are all older than
diff-sync.  OT is also a much more complicated algorithm, in my
experience (and from browsing around on Wikipedia).

An MBaaS or any other database-like service is very different. It
has to host multiple applications (i.e., databases), each with
multiple collections containing potentially millions of entities
(e.g., JSON documents). The entities themselves are more
fine-grained and smaller than collaborative documents (though
probably a bit coarser-grained and larger than a single record in a
RDBMS). Many clients might be reading and updating lots of documents
at once, and the data service has to coordinate those changes. A
single batch update from one client might request changes to dozens
of entities. And the clients can/will always wait for confirmation
that the server made the requested changes before continuing (unless
the client is offline); or at a minimum can enqueue the requested
changes.
Two quick things.  A document is just a collection of entities and
can be structured to reduce this problem (especially if we are faking
it on an RDBMS with particularly sadistic abuses of an ORM).

Yes, a document might be a JSON document that is an aggregate of
multiple objects, and not just a flat map of key-value pairs. The use
of aggregate data structures and denormalization are some of the ways
that eventually-consistent data stores work. The goal is to reduce the
scope of a set of operations to a single aggregate. Other data stores
(like graph and hierarchical databases) require strong consistency and
transactions because operations necessarily span multiple objects. But
limiting operations to a single aggregate is also quite constraining
w/r/t app development, since you can’t always denormalize all data to
separate aggregates.
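
As a made-up example of the kind of aggregate I mean (the names and
shape are purely illustrative, not from either POC):

    // An "aggregate" in this sense: the order and its line items
    // travel together in one JSON document, so a single-document
    // write covers what would be a multi-row transaction in an RDBMS.
    class AggregateExample {
        static final String ORDER_AGGREGATE = """
            {
              "orderId": "o-123",
              "customer": { "id": "c-9", "name": "Ada" },
              "lines": [
                { "sku": "a-1", "qty": 2 },
                { "sku": "b-7", "qty": 1 }
              ],
              "status": "PENDING"
            }
            """;
    }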

So even if a collection (in the MongoDB sense) contains documents that
are aggregates of multiple “entities” (in the Hibernate sense of the
word), my point still stands that generally any given JSON document
will still be smaller than a document used in a collaborative document
editor. Also, I would not be surprised if the sheer number of
documents in a MongoDB collection is orders of magnitude larger than
the number of documents stored by a collaborative editor app.

Clients don't have to wait for the edits to be merged on the server
and the nature of diff-sync gives us batching for free.

Hmm… even if you could do it this way, do you not want to be able to
give feedback to the user that the changes might not have been
accepted/persisted?

Do you have some scenarios that describe the kinds of applications
you’re considering? I’m wondering if I’m envisioning a different kind
of app.


Given these characteristics, using DS within the data service might
be extremely expensive in terms of CPU and memory
or it might not be.  We need data, use cases, etc. to test and see
what happens.
, and difficult for a DS-based service to implement all of the
features necessary.
Which features?  Features of the algorithm or features of the
application?  The algorithm is really REALLY simple for what we get
out of it.

I was referring to features of the data service, and especially how
the data service’s implementation can satisfy the difficult
non-functional requirements like scalability and performance. While
the algorithm might be really simple, that doesn’t mean implementing
it on the server is efficient. What I’ve read so far makes me think
that it could very well be less efficient and less scalable than
other techniques used in data services.

First, the data service doesn’t really know which entities are
being “edited”; instead, connected clients read entities, make
changes locally, then request the service make those changes.
I disagree.  The service knows documents which the client has a
connection to/active session for.  It most certainly knows which
entities are being edited.

I guess I was hoping that the client can manipulate documents locally
without having to coordinate that with the server. Again, I’m
concerned about server scalability.

Secondly, every time a change comes in, to compute the diff the
service would have to read the persisted entity; this is not only
inefficient, but also makes it more difficult to scale and to
handle the concurrency, consistency, atomicity, and serializability
guarantees.
See my earlier comment about sadistic abuses of an ORM.  Yes, we have
to be aware of the RDB underneath the sync server, but I don't think
this is a problem with the algorithm.

I agree, it’s not a problem with the algorithm. It’s a problem insofar
as it would mandate what the server has to support.

Thirdly, what would the data service need to do when a client
connects and asks for the changes since it was last connected?
Send it the diff of the client's server-side shadow and the server's
current document.  This diff will get sent to the client, merged with
the client's shadow, and the diff of that will get sent back to the
server.  Repeat until the client is in sync.
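
In rough pseudo-Java, that loop looks something like this.  SyncServer
and the diff/patch helpers are stand-ins I made up (something
diff-match-patch shaped), not the POC's actual API:

    interface SyncServer {
        // Returns diff(client's server-side shadow, current server doc).
        String pullEdit(String docId, String clientId);
        // Accepts diff(client shadow, client working copy) from the client.
        void pushEdit(String docId, String clientId, String edit);
    }

    class ResyncLoop {
        static void resync(SyncServer server, String docId, String clientId,
                           StringBuilder workingCopy, StringBuilder shadow) {
            while (true) {
                String serverEdit = server.pullEdit(docId, clientId);
                boolean dirty = !workingCopy.toString().contentEquals(shadow);
                if (serverEdit.isEmpty() && !dirty) {
                    return;                      // both sides drained: in sync
                }
                applyPatch(shadow, serverEdit);      // exact patch into shadow
                fuzzyMerge(workingCopy, serverEdit); // best-effort into live copy
                String clientEdit = diff(shadow.toString(), workingCopy.toString());
                server.pushEdit(docId, clientId, clientEdit);
                shadow.setLength(0);
                shadow.append(workingCopy);          // shadows now agree
            }
        }

        // Stubs standing in for a real diff/patch library:
        static void applyPatch(StringBuilder doc, String edit) { /* ... */ }
        static void fuzzyMerge(StringBuilder doc, String edit) { /* ... */ }
        static String diff(String from, String to) { return ""; /* ... */ }
    }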
The data service might be able to quickly find out which entities
were modified since then, but computing the diffs (relative to the
time the client last connected) for all of those changed entities
would be very complicated.
It isn’t.

Perhaps I should have said “expensive” rather than “complicated”.

It may be easier and better for the data service to record the
individual changes (edits) made by each transaction, and then to use
that information to compute the effective diffs over some period of
time. In fact, these recorded edits might also be useful to
implement other features within the data service; see CQRS [3] and [4].
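
To sketch what I mean (every name here is invented for the example):

    // An append-only edit journal, so "what changed since time T"
    // becomes a range scan plus a fold rather than a diff of full
    // document copies.
    import java.time.Instant;
    import java.util.List;
    import java.util.stream.Collectors;

    record EditRecord(String entityId, long txId, Instant at, String patch) {}

    interface EditJournal {
        void append(EditRecord edit);                        // on every committed tx
        List<EditRecord> since(String entityId, Instant t);  // ordered by txId
    }

    class DiffSince {
        // Naive composition: concatenate the recorded patches in order.
        // A real service would collapse them into one effective diff.
        static String effectiveDiff(EditJournal journal, String entityId,
                                    Instant t) {
            return journal.since(entityId, t).stream()
                          .map(EditRecord::patch)
                          .collect(Collectors.joining("\n"));
        }
    }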

What is really required by the client when trying to synchronize its
data after being disconnected? Assuming the client can say which
subset of entities it’s interested in when it reconnects (via some
criteria in a subscription), does the client want:

1. the new versions of those entities that changed;
No

This is actually what a number of MBaaS offerings do, although it’s
often hidden by the client SDKs. It may not be ideal because it places
more work onto the client SDK, but the benefit is that a good portion
of the work is done on the client, and the load on the server is
reduced (and scalability increased). It’s also trivially easy for the
data service to implement.

2. the deltas in the entities; and/or
Yes

3. all of the events describing the individual changes made to all
   of those entities?
No.

It may not matter for clients that don’t allow local offline
changes, but what might the preferred approach be for clients that
do allow offline changes? Option 1 is clearly the easiest from the
perspective of the data service, but options #2 and #3 can certainly
be handled. With option #1, can the client do something like DS and
maintain copies of each original (unmodified) entity so that it can
compute the differences? Does this (perhaps with a journal of edits
made while offline) provide enough info for the client to properly
merge the local changes, or does the client really need the
individual events in #3 so that it can, for example, know that some
local changes were made to now-out-of-date data?
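
Roughly what I'm imagining for option #1, with every name here
hypothetical:

    // The client keeps a pristine copy of each entity as fetched, so
    // diffs can be computed locally after offline edits.
    import java.util.HashMap;
    import java.util.Map;

    interface Differ { String diff(String from, String to); }

    class OfflineEntityCache {
        private final Map<String, String> pristine = new HashMap<>(); // as last fetched
        private final Map<String, String> working  = new HashMap<>(); // local edits

        void onFetched(String id, String json) {
            pristine.put(id, json);
            working.put(id, json);
        }

        void editLocally(String id, String json) {
            working.put(id, json);       // offline edits accumulate here
        }

        // On reconnect: one diff per locally modified entity.
        Map<String, String> pendingDiffs(Differ differ) {
            Map<String, String> diffs = new HashMap<>();
            working.forEach((id, current) -> {
                String base = pristine.get(id);
                if (base != null && !base.equals(current)) {
                    diffs.put(id, differ.diff(base, current));
                }
            });
            return diffs;
        }
    }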
Except in the case of a merge error, the algorithm handles long
offline periods with edits just fine.  If there is a merge error the
user/application will have to manually merge the documents somehow.

One of the things to keep in mind is on mobile devices the radio is
the most expensive thing you can control as an application.  Any
decision we make should err toward sending as little data as
possible, as few times as possible.

I completely agree.


Will the same option work for online notifications? After all, it’d
be great if the same mechanism was used for data-sync, offline
(push) notifications, and online notifications (events).
I don't understand your question.

Only that it seems beneficial that the same mechanism be used for
“events” (while connected) and both online and offline data-sync.



I don't think it is too early for service-minded people to point out 
the big honking problems we will face.  At the very least it helps us 
figure out the use cases we will support for the initial releases as 
the POC morphs into an actual project.

I actually think this is a very good point (as are other points on both sides).  We will rely on people like you, Randall, to assist us in understanding the impact and to advocate for the deeper server-side needs.  At the same time, the client-side APIs and developer experience will also need to be reviewed and taken into consideration.

I think it is important to remember that all of these are POCs and, with a problem domain as complex as this, before we get to actual cross-project implementation we need to develop and flesh out specs, including all of these various points, risks, etc...  This would be across data services, LiveOak, and AeroGear, and possibly others.


From a mobile client developer perspective, this algorithm feels very 
easy to understand, and it handles a lot of annoying corner cases very 
well (e.g. offline use, operation batching, conflict handling, document 
synchronization, collaboration, etc.).  Additionally, in my 
experiments, plugging in different diff-merge-sync operations allows 
the general framework to be adapted to other use cases and data 
types (e.g. the server just batching a list of changed object IDs 
instead of the actual changes for binary files).

One of the things we need to do is determine which problems we are 
going to solve/support (single-user multi-device sync, multi-user 
collaboration, binary files, etc.), which we are going to short-circuit 
(limiting the number of collaborators, limiting file size, etc.), and 
which we are just going to document as best practices to use or to avoid.

Maybe I'm just stuck holding my hammer (DS) and seeing everything as 
nails; if someone can give a demo of something else (OT, etc.) then I 
will be really interested to compare.


--
Summers Pittman
Phone: 404 941 4698
Java is my crack.



_______________________________________________
aerogear-dev mailing list
aerogear-dev@lists.jboss.org
https://lists.jboss.org/mailman/listinfo/aerogear-dev