Re: [aerogear-dev] Differential Synchronization: Shadow/Backup Revision Control and its Garbage Collection

Friday, 5 September 2014

This does sounds quite interesting, Lukáš. If AeroGear uses Differential Synchronization,
this approach might make it more efficient for the server to implement the behavior.

But I’m still bothered by the thought that AeroGear’s Data Sync feature uses any algorithm
that requires the server to maintain client-specific state. 

First of all, servers that maintain no client state across requests will *always* scale
much better than servers that maintain client-specific state across requests. Maintaining
client-specific state requires extra work, consumes more memory, and complicates
clustering, failover, and fault tolerance.

Secondly, if a server is maintaining client-specific state, how long does that client
state have to be maintained before assuming the client is no longer there? What happens
when a client on a mobile device loses its connection for a short period of time and then
reconnects? What if load balancing cause the client to reconnect to a different process in
the cluster? Bottom line is that it adds quite a bit of complexity.

Finally, I completely understand the benefits of using Differential Synchronization where
many clients are collaborating on a single document. That’s the “collaborative editor”
scenario, and surely there are apps that need this functionality. I’m just not convinced
that this is the most common use case. In fact, I think it’s relatively uncommon, and I
don’t think LiveOak will support it in the near term. (BTW, please tell me if the whole
purpose of AeroGear's Data Sync feature is to satisfy this and only this scenario. If
so, then I apologize for being a distraction.)

My understanding of the Data Sync feature, though, is that it is should be applicable to
other scenarios. IMO, the far more common scenario is:

The server manages (many) millions of small entities, each of which is a domain-specific
aggregate JSON document. Examples of different kinds of entities include: customers,
books, insurance claims, catalog items, restaurants, purchases, tasks, etc. A single
customer entity would aggregate most/all the information about the customer, including the
name, phone numbers, addresses, profile information, domain-specific privileges (e.g.,
authorized to log issues or call support), etc. Note that these aggregate entities do not
correspond to the entities in a traditional RDBMS/JPA/Hibernate app, where you’d have much
finer-grained records (separate tables for customer, customer addresses, customer phone
numbers, customer privileges, etc.).
At any point in time there may be 1Ks or 10Ks of clients reading dozens of
documents/entities and sending changes to some of them (likely in batch operations). IOW,
any one document might be concurrently modified by a relatively small number of clients.

I think it’s possible to support Data Sync in this scenario while maintaining no
client-specific state on the server (other than while processing a single request).

The Data Sync functionality of the SDK used in the clients would work with local
persistence and enable the app to function online or offline. This includes making it
possible to edit entities and create new entities while offline, though it does not matter
to Data Sync which subset of entities is persisted locally. If the client goes offline and
then back online, the Data Sync functionality in the SDK would figure out which of the
local entities the client has changed, and request from the server the latest revisions
for each of them. Data Sync then merges the local changes onto the latest revisions
(perhaps asking the user to manually resolve anything that cannot be automatically
resolved), computes the new effective changes, and then sends a single batch request to
the server to apply them. Some, all or none changes are accepted, and likely the
application then has to decide whether to continue the process again, revert to the
server’s versions, etc.  (The SDK would also make use of subscriptions/notifications so
that it can be told immediately when entities are changed.)

What the batch operations entail is dependent upon the needs of the client app and
capabilities of the server. If a server supports batch operations that contain the entire
entity, then the server will almost certainly need to put revision identifiers in each
entity sent to the client, so that when the client sends back a modified representation
entity, the server can tell whether that document has since been modified by another
client. If so, the server might rely upon client preferences (see below) to know whether
to merge the changes (using a simple algorithm or more useful application-specific logic),
overwrite, or fail to update the entity.

On the other hand, the server might support batch operations with partial updates, where
the client only sends for each entity only the set of fields that are to be changed or
removed. Conflicts and merging with this approach are easier and less-likely to be
application specific, though client preferences (see below) might still dictate whether or
not to merge based upon revision numbers. It also might make it easier for Data Sync,
since the SDK simply has to record for each modified entity only those modified fields (no
matter how long the client was offline). To keep the SDK simple, a best practice might be
to ensure clients always modify together those fields that must be consistent. For
example, in a contacts application, if a user modifies the first name of a contact, the
app might tell the SDK to modify the first name and last name fields, even though the user
didn’t modify the last name. When synchronized, both the first name and last name fields
would be sent to the server as a single update. This helps ensure that even when multiple
clients are concurrently updating the same set of fields to different values, the result
of applying all of those changes will exactly match the state as set by one of the
clients.

It’s also likely that for batch operations to work well for many different kinds of
applications, the server may support multiple policies that specify for a given batch
operation the atomicity, consistency, and isolation guarantees. One policy might ensure
that the entire batch either completely succeeds only when there are no revision
conflicts, otherwise it completely fails. Another policy might be eventually-consistent in
the sense that the changes in every batch operations will eventually be applied, though
this may require the server to have application-specific conflict resolution logic. And
there are policies that are somewhere in-between. For example, imagine a server that
supports “public” collections whose entities can be read/updated by any user, and
“private” collections that expose only the entities that are readable and editable by that
user. The likelihood of a concurrent update conflict on a private entity is quite small,
since it’s possible only when different changes made on different devices are submitted at
exactly the same time. On the other hand, the likelihood of a concurrent update conflict
on a public entity is much higher. The server may allow updates to both public and private
entities within a single batch operation, and a save policy that might update all private
entities atomically (they all succeed or they all fail) while updates to *each* public
entity is atomic (some might succeed while others might fail).

Any given server will likely support only a few of these policies. But either way, the
server has to report back to the client the outcome of the batch request: which entities
were updated, which were not, and the reason why each of the rejected updates failed
(because other entity changes were rejected, because of a conflict, etc.). The SDK’s Data
Sync functionality would use those results to know how to update the local persisted
representations, and to start trying to resolve any failures.

The kinds of requests needed to support data sync in this scenario are fairly basic, so
the server doesn’t have to be that complicated. Best of all, the server is not required to
maintain any client-specific state.

Thoughts? Am I way off-base?

Best regards,

Randall

On Sep 4, 2014, at 10:38 AM, Lukáš Fryč <lukas.fryc(a)gmail.com&gt; wrote:

...
 Hey guys,

 TL;DR: there are ways how to make the memory footprint of Differential Synchronization
(DS) very low; if we assume that JSON patches are reversible and we create cumulative
patches from series of individual patches, we can use (smart, garbage-collactable)
revision control for storage of documents that server need to maintain.

 ----

 I had an idea in my head that for couple of days as I was becoming familiar with how
Differential Synchronization (DS) works and studying Dan's and Luke's impl.

 I share the concern that Randall expressed in earlier thread - the DS in its pure version
doesn't scale for huge amount of users when connected to one server.

 Sure, the algorithm can be scaled trivially by adding new nodes that uses very same
algorithm as is used for client<->server. But that doesn't mean it is something
that we should do regularly to get more memory available.

 Instead, I was thinking about limiting a memory that server needs to maintain in any
point in time.

 In pure DS, server have to maintain all copies of documents that clients ever requested.
Even worse, it has to maintain two copies - "shadow" and "backup",

 So, in worst case, 2 x N copies of document will be maintained by the server for N
clients.

 -----

 The DS algorithm doesn't tell us how the "shadow" and "backup"
documents should be stored.

 We can transparently plug in a storage that will be clever about how to use
"backups" and "shadow".

 -----

 CONCEPTS:

 1) REVISION CONTROL

 In order to limit a number of documents we need to maintain in server's memory, we
can come with a system where just last known state is remembered in full version. Together
with last known version, we remember also history of the document for each revision as it
was patched by clients.

 But server doesn't have to maintain full history of the document, it maintains just
revisions that are known to its clients (that revision is known by DS for each client out
of the box).

 2) GARBAGE COLLECTION

 Server trivially knows what versions are mantained by client, so it can get rid of those
revisions that are no longer needed.

 First, it can cut all the history that is older than the oldest version of document
throughout all its clients.

 Secondly, it can accumulate the serious of patches between states so that it doesn't
have to maintain full history, but just revisions that is known to any of those clients.

 ----

 DIAGRAM:

 I've attached a picture for illustration of the revision control / garbage
collection, it does not illustrate algorithm itself (Dan did awesome job describing DS
here [3]).

 In the attached diagram, you can see four clients with different revisions:

 Client 1: revision X
 Client 2: revision X+3
 Client 3: revision X+3
 Client 4: revision X+4

 If any of the clients reaches server, server can reconstruct the document (shadow and
backup) for given client revision and the DS algorithm then can operate as normally
(that's why it is transparent from algorithmic point of view).

 If server comes to garbage collection, it can scan through available client revisions. As
it identifies, that the oldest known revision is X, it can remove patch for X-1 and less
as they are no longer needed.

 Later, it can even more optimize, and accumulate patches so that the only information he
knows is how to get to the revision for each participating client. In the picture, he can
compute cumulative patch for getting from state X to state X+3, as X+1 and X+2 themselves
are not needed.

 (Patch accumulation is actually something that Google Wave does in its Operational
Transformation implementation, just there it is called cumulative Operations).

 ----

 Then it comes to caching strategies, you can maintain caches of things like last recently
used revisions (they don't need to be computed each time). You can store some
documents and their revision history to the disk (fully or partially), etc.

 ----

 THE USE CASE:

 1000 employees maintain one document - Contact List - on their smartphones

 400 of those employees has mobile data plan, so they want to synchronize almost
immediately as they are notified about changes. The rest of the devices is connected
rather sparely.

 If one employee changes data, 600 devices are notified about the change immediately (some
on data plan, some through Wifi, etc.) Those clients all have version say X+3, because
they are synchronized proactively.

 These clients reach the server in say <10 seconds after that employee did the change.

 First client reaches server and ask for version X+3 (it probably is still in the memory,
but if it isn't, it is deconstructed and placed into MRU cache). All of the others
reaches the server and re-synchronize themselves against the one copy of backup/shadow
that is in the memory already.

 The other clients will reach the server later, as they are connecting to Wifi, say in 5
minutes to 24 hours. They will have even older versions, such as X. Those clients will
require more computation, but at the end the server can resolve them and through their
revisions out of the window.

 At the end, there are clients that didn't connected for days, maybe months - I
suggest we don't actually use DS here and fallback to e.g. slow synchronization (show
me what you have, I will show you mine), because maintaining full history for longer
period of days may be too resource demanding (configurable?).

 Obviously, 1000 is not much, but this can scale to pretty decent numbers if we use hybrid
approach (fallback to other sync strategies if revision is already forgotten by a
server).

 ----

 WORST CASE:

 we still have to maintain all the document revisions, but practically it won't happen
:-)

 ----

 ASSUMPTIONS:

 1. patches are reversible (we are able to compute reverse patch for each patch sent by
client)
 2. patches are cumulative (we are able to compute aggregated patch from series of
patches)

 For 1, it seems many of JSON-patch libraries actually implement it (Java, JavaScript) [1,
2]

 For 2, it again seems some libraries implement it and if not, we can implement it or even
use brute-force - compute a patch between two documents

 ----

 It's not the only way how to make it scale, but it does solve the problem pretty
independently of the DS algorithm and still leaves the algorithm clean and simple.

 Cheers,

 ~ Lukas

 [1] https://github.com/fge/json-patch
 [2] https://github.com/benjamine/jsondiffpatch
 <DS-revision-control.png>_______________________________________________
 aerogear-dev mailing list
 aerogear-dev(a)lists.jboss.org
 https://lists.jboss.org/mailman/listinfo/aerogear-dev 

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

Re: [aerogear-dev] Differential Synchronization: Shadow/Backup Revision Control and its Garbage Collection