[keycloak-dev] Import proposal

Stian Thorgersen sthorger at redhat.com
Wed Nov 11 09:54:54 EST 2015


On 11 November 2015 at 15:51, Marek Posolda <mposolda at redhat.com> wrote:

> On 11/11/15 15:36, Stian Thorgersen wrote:
>
>
>
> On 11 November 2015 at 15:23, Marek Posolda <mposolda at redhat.com> wrote:
>
>> On 11/11/15 09:01, Stian Thorgersen wrote:
>>
>>
>>
>> On 10 November 2015 at 16:11, Marek Posolda <mposolda at redhat.com> wrote:
>>
>>> On 09/11/15 14:09, Stian Thorgersen wrote:
>>>
>>>
>>>
>>> On 9 November 2015 at 13:35, Sebastien Blanc <sblanc at redhat.com> wrote:
>>>
>>>> That would be really nice indeed !
>>>> But are the markers files not enough, instead of also having a table in
>>>> the DB ?
>>>>
>>>
>>> We need a way to prevent multiple nodes in a cluster from importing the
>>> same file. For example on Kerberos you end up spinning up multiple
>>> instances of the same Docker image.
>>>
>>> I bet you meant 'Kubernetes' :-)
>>>
>>
>> Yup
>>
>>
>>>
>>>
>>> +1 for the improvements. Besides those I think that sooner or later we
>>> will need to solve long-running export+import where you want to import
>>> 100.000 users.
>>>
>>
>> +1
>>
>>
>>>
>>> As I mentioned in another mail a few weeks ago, we can have:
>>>
>>> 1) Table with the progress (51.000 users already imported, around 49.000
>>> remaining etc.)
>>>
>>
>> We would still need to split into multiple files in either case. Having a
>> single json file with 100K users is probably not going to perform very
>> well. So what I proposed would actually work for long-running import as
>> well. If each file has a manageable number of users (say ~5 min to import)
>> then each file will be marked as imported or failed. At least for now I
>> don't think we should do smaller batches than one file. As long as one file
>> is imported within the same TX it's an all-or-nothing import.
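The per-file, all-or-nothing behaviour described above can be sketched roughly like this (a minimal sketch with hypothetical names, not Keycloak's actual import API; the "transaction" is simulated by staging changes and committing them only if the whole file succeeds):

```java
import java.util.*;

// Sketch of per-file, all-or-nothing import. Names are hypothetical.
public class FileImport {
    // Result recorded per file: the whole file either imports or fails.
    public enum Outcome { IMPORTED, FAILED }

    // Import every user from one file in one "transaction": if any entry
    // is invalid, nothing from the file is kept.
    public static Outcome importFile(List<String> users, Set<String> store) {
        Set<String> staged = new LinkedHashSet<>();
        for (String user : users) {
            if (user == null || user.isEmpty()) {
                return Outcome.FAILED;   // rollback: staged changes are discarded
            }
            staged.add(user);
        }
        store.addAll(staged);            // commit only after the whole file succeeded
        return Outcome.IMPORTED;
    }
}
```

The point of the sketch is the commit-at-the-end shape: a failure halfway through a file leaves the store untouched, which is what makes the per-file imported/failed marker meaningful.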
>>
>>
>>
>>> 2) Concurrency and dividing the work among cluster nodes (Node1 will
>>> import 50.000 users and node2 another 50.000 users)
>>>
>>
>> This would be solved as well. Each node picks up a file that's not
>> processed yet, marks it in the DB, and then processes it.
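The "pick up and mark" step only works if the claim is atomic. A minimal sketch of that claim semantics (the DB table is simulated here with a `ConcurrentMap`; a real implementation would use an `INSERT` against a unique constraint on the file checksum):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: atomically claiming a file for import. In a real deployment the
// claims map would be a DB table keyed by the file checksum.
public class ImportClaim {
    // key = file checksum, value = id of the node that claimed it
    private final ConcurrentMap<String, String> claims = new ConcurrentHashMap<>();

    // Returns true only for the first node that claims the file;
    // every later attempt by any node sees the existing claim and backs off.
    public boolean tryClaim(String fileChecksum, String nodeId) {
        return claims.putIfAbsent(fileChecksum, nodeId) == null;
    }
}
```

With this shape, two nodes scanning the same directory at startup cannot both import the same file: whichever node's insert wins does the work, the other skips it.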
>>
>>
>>> 3) Failover (Import won't be completely broken if cluster node crashes
>>> after import 90.000, but can continue on other cluster nodes)
>>>
>>> I think the stuff I did recently for pre-loading offline sessions at
>>> startup could be reused for this stuff too and it can handle (2) and (3) .
>>> Also it can handle parallel import triggered from more cluster nodes.
>>>
>>> For example: currently if you start Kubernetes with 2 cluster nodes, both
>>> nodes will start to import the same file at the same time, because the
>>> import triggered by node1 is not yet finished before node2 is started, so
>>> there is no DB record yet saying the file is already imported. With the
>>> stuff I did, just the coordinator (node1) will start the import. Node2
>>> will wait until the import triggered by node1 is finished, but at the same
>>> time it can "help" to import some users (pages) if the coordinator asks it
>>> to do so. This impl is based on the infinispan distributed executor
>>> service:
>>> http://infinispan.org/docs/5.3.x/user_guide/user_guide.html#_distributed_execution_framework
>>>
>>
>> The DB record needs to be created before a node tries to import a file,
>> including a timestamp for when it started the import. It should then be
>> updated once the import is completed, with the result. Using the
>> distributed execution framework sounds like a good idea though. How do you
>> prevent scheduling the same job multiple times? For example, if all nodes
>> scan the import folder on startup and simply import everything they find,
>> there will be multiple copies of the same job. Not really a big deal, as
>> the first thing the job should do is check if there's a record in the DB
>> already.
>>
>> With the distributed executor, it's the cluster coordinator which decides
>> which node will import what. It will send messages to cluster nodes like
>> "Hey, please import the file testrealm-users-3.json with checksum
>> abcd123".
>>
>> After a node finishes the job, it notifies the coordinator and the
>> coordinator will insert a DB record and mark the file as finished. So
>> there is no DB record inserted before a node starts the import, because
>> the whole coordination is handled by the coordinator. Also, the same file
>> will never be imported multiple times by different cluster nodes.
>>
>> The only exception would be if a cluster node crashes before the import is
>> finished. Then the file needs to be reimported by another cluster node,
>> but that's the case with DB locks as well.
>>
>> IMO the DB locks approach doesn't handle a cluster node crash well. For
>> example, when node2 crashes unexpectedly while it's importing the file
>> testrealm-users-3.json, the DB lock is held by this node, so other cluster
>> nodes can't start importing the file (until a timeout occurs).
>>
>> On the other hand, the distributed executor approach may have issues if
>> the content of the standalone/import directory is inconsistent among
>> cluster nodes. However it can be solved: each node will need to send
>> checksums of the files it has, and the coordinator will need to ensure
>> that the file with checksum "abcd123" is assigned only to a node which has
>> this file.
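Identifying files by a content checksum, as described above, could be done with a plain JDK digest (a sketch; SHA-256 is an assumption here, the thread doesn't name an algorithm):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: compute a content checksum so the coordinator can assign a file
// only to nodes that actually have that exact content.
public class FileChecksum {
    public static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

Keying the import table on the content checksum (rather than the filename) is what lets the coordinator detect that two nodes hold different versions of "the same" file.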
>>
>
> With Docker/Kubernetes all nodes would have the same files. At least
> initially. Would be nice if we could come up with a solution where you can
> just drop an additional file onto any node and have it imported.
>
> Exactly, I was thinking about Docker too. Here we don't have any issue at
> all.
>
> The main question here is: do we want to support the scenario where
> various cluster nodes have different content? As I mentioned, the
> distributed coordinator can handle it: each cluster node will send the
> checksums of the files it has, and the coordinator will only assign to a
> node the checksums which it has.
>

That would be a nice addition IMO


>
> However, regardless of the distributed executor approach or the DB locks
> approach, there may still be issues. For example:
> 1) The file testrealm.json with checksum "abc" is triggered for import on
> node1
> 2) At the same time, an admin makes some minor change in this file on
> node2 and saves it. This means that the checksum of the file on node2 will
> change to "def"
> 3) Node2 will trigger the import of that file. So we have both node1 and
> node2 importing the same file concurrently, because the previously
> retrieved lock was for the "abc" checksum, but now the checksum is "def"
>
> This problem exists with both the DB lock and DistributedExecutor
> approaches though...
>

Maybe a better approach is to elect a single node that can perform imports
and only allow one import at a time?


>
>
> Marek
>
>
>
>>
>>
>> Marek
>>
>>
>>
>>>
>>>
>>> Marek
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>> On Mon, Nov 9, 2015 at 1:20 PM, Stian Thorgersen <sthorger at redhat.com> wrote:
>>>>
>>>>> Currently we support importing a complete realm definition using the
>>>>> import/export feature. Issues with the current approach are:
>>>>>
>>>>> * Only complete realms - not possible to add to an existing realm
>>>>> * No good feedback on whether the import was successful or not
>>>>> * Use of system properties to initiate the import is not very user
>>>>> friendly
>>>>> * Not very elegant for provisioning. For example, a Docker image that
>>>>> wants to bundle some initial setup ends up always running the import of
>>>>> a realm, which is skipped if the realm exists
>>>>>
>>>>> To solve this I've come up with the following proposal:
>>>>>
>>>>> Allow dropping representations to be imported into
>>>>> 'standalone/import'. This should support creating a new realm as well as
>>>>> importing into an existing realm. When importing into an existing realm
>>>>> we will have an import strategy that is used to configure what happens
>>>>> if a resource exists (user, role, identity provider, user federation
>>>>> provider). The import strategies are:
>>>>>
>>>>> * Skip - existing resources are skipped
>>>>> * Fail - if any resource exists nothing is imported
>>>>> * Overwrite - any existing resources are deleted
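The three strategies above could look roughly like this (a sketch with hypothetical names, not the final API; resources are modelled as a simple name-to-value map):

```java
import java.util.*;

// Sketch of the proposed import strategies. Names are illustrative.
public class ImportStrategyDemo {
    public enum Strategy { SKIP, FAIL, OVERWRITE }

    // Merge imported resources into existing ones according to the strategy.
    // Returns a new map; on FAIL nothing is applied at all.
    public static Map<String, String> merge(Map<String, String> existing,
                                            Map<String, String> imported,
                                            Strategy strategy) {
        Map<String, String> result = new LinkedHashMap<>(existing);
        for (Map.Entry<String, String> e : imported.entrySet()) {
            if (result.containsKey(e.getKey())) {
                switch (strategy) {
                    case SKIP:      continue;  // keep the existing resource
                    case FAIL:      throw new IllegalStateException(
                                        "Resource exists: " + e.getKey());
                    case OVERWRITE: break;     // fall through and replace it
                }
            }
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

Because FAIL throws before the merged result is returned, an import that hits an existing resource leaves the realm unchanged, matching "nothing is imported".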
>>>>>
>>>>> The directory will be scanned at startup, but there will also be an
>>>>> option to monitor this directory at runtime.
>>>>>
>>>>> To prevent a file being imported multiple times (and also to make sure
>>>>> only one node in a cluster imports it) we will have a table in the
>>>>> database that records which files were imported, from which node, the
>>>>> date, and the result (including a list of which resources were imported
>>>>> and which were not, plus a stack trace if applicable). The primary key
>>>>> will be the checksum of the file. We will also add marker files (<json
>>>>> file>.imported or <json file>.failed). The contents of the marker files
>>>>> will be a json object with the date imported, the outcome (including a
>>>>> stack trace if applicable), as well as a complete list of which
>>>>> resources were successfully imported and which were not.
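A marker file as described above might look something like this (the field names and values are purely illustrative, not a proposed format):

```json
{
  "file": "testrealm-users-3.json",
  "dateImported": "2015-11-11T09:54:54Z",
  "outcome": "FAILED",
  "imported": ["user-1", "user-2"],
  "skipped": [],
  "error": "stack trace, if applicable"
}
```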
>>>>>
>>>>> The files will also allow resolving system properties and environment
>>>>> variables. For example:
>>>>>
>>>>> {
>>>>>     "secret": "${env.MYCLIENT_SECRET}"
>>>>> }
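Resolving `${env.NAME}` placeholders like the one above could be a simple text pass before the JSON is parsed (a sketch; the lookup function is injected so the example does not depend on real environment variables, and the placeholder syntax is taken from the proposal):

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: substitute ${env.NAME} placeholders in an import file's text
// before parsing it as JSON.
public class PlaceholderResolver {
    private static final Pattern ENV =
        Pattern.compile("\\$\\{env\\.([A-Za-z0-9_]+)\\}");

    public static String resolve(String text, Function<String, String> env) {
        Matcher m = ENV.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = env.apply(m.group(1));
            // Unresolved variables become empty strings in this sketch;
            // a real implementation might fail the import instead.
            m.appendReplacement(sb,
                Matcher.quoteReplacement(value == null ? "" : value));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

In production the injected lookup would simply be `System::getenv` (plus a second pattern for system properties).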
>>>>>
>>>>> This will be very convenient, for example with Docker, as it would be
>>>>> easy to create a Docker image that extends ours to add a few clients
>>>>> and users.
>>>>>
>>>>> It will also be convenient for examples as it will make it possible to
>>>>> add the required clients and users to an existing realm.
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> keycloak-dev mailing list
>>>>> keycloak-dev at lists.jboss.org
>>>>> https://lists.jboss.org/mailman/listinfo/keycloak-dev
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>

