[infinispan-issues] [JBoss JIRA] (ISPN-2156) Benchmark and blog about a fast method of loading data into Infinispan

Thu Jul 19 06:06:07 EDT 2012

     [ https://issues.jboss.org/browse/ISPN-2156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manik Surtani updated ISPN-2156:
--------------------------------

            Summary: Benchmark and blog about a fast method of loading data into Infinispan    (was: Benchmark and blog about a fast method of loading that into Infinispan  )
        Description: 
To summarise:
When batch-loading a set of data into a distributed cache, inserting batches of keys that map to the same node should significantly increase performance.
Why?
During the prepare phase, each node receives the 
complete list of modifications in that transaction, not only the 
modifications pertaining to it.
E.g. say we have the following key->node mapping:
{code}
k1 -> A
k2 -> B
k3 -> C
{code}
Where k1, k2 and k3 are keys; A, B and C are nodes.
If Tx1 writes (k1, k2, k3), then during the prepare A, B and C will each 
receive the same package containing all the modifications, namely 
(k1, k2, k3). There are several reasons for this (apparently) 
unoptimized approach: the prepare is serialized only once, and recovery 
information is handled better.
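
To make the cost concrete, here is a minimal sketch (plain Java, not Infinispan code) that counts how many modification copies a single prepare delivers. It assumes one owner per key and that every node involved in the transaction receives the full modification list, as described above. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;

public class PrepareTraffic {
    /**
     * Counts (modification, recipient) deliveries for one prepare,
     * assuming every involved node receives every modification.
     */
    static int deliveries(Map<String, String> keyToNode) {
        int involvedNodes = new HashSet<>(keyToNode.values()).size();
        return involvedNodes * keyToNode.size();
    }

    public static void main(String[] args) {
        // The key->node mapping from the example above.
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("k1", "A");
        mapping.put("k2", "B");
        mapping.put("k3", "C");
        // 3 nodes each receive all 3 modifications.
        System.out.println(deliveries(mapping)); // prints 9
    }
}
```

Only 3 of those 9 deliveries are for a key the receiving node actually owns; the other 6 are the redundant traffic.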

Now if you group transactions/batches based on key distribution, the amount of redundant traffic is significantly reduced, and that translates into better performance, especially when the dataset 
you're inserting is quite large.
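
The grouping step could be sketched roughly as follows. This is an illustration only: `OwnerBatcher`, `groupByOwner` and the `ownerOf` lookup are hypothetical names, and the lookup is stubbed with a plain map. In a real cluster the key-to-node lookup would have to come from Infinispan's own distribution information, and each resulting batch would be written in its own transaction. The extra key `k4` is added purely to show two keys landing on the same node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class OwnerBatcher {
    /** Groups keys into one batch per owning node (ownerOf is an assumed lookup). */
    static <K> Map<String, List<K>> groupByOwner(Collection<K> keys,
                                                 Function<K, String> ownerOf) {
        Map<String, List<K>> batches = new TreeMap<>();
        for (K key : keys) {
            batches.computeIfAbsent(ownerOf.apply(key), n -> new ArrayList<>()).add(key);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Stub for the cluster's key->node mapping.
        Map<String, String> owners = new HashMap<>();
        owners.put("k1", "A");
        owners.put("k2", "B");
        owners.put("k3", "C");
        owners.put("k4", "A");

        Map<String, List<String>> batches =
                groupByOwner(Arrays.asList("k1", "k2", "k3", "k4"), owners::get);

        // Each batch would be loaded in a separate transaction touching one node.
        for (Map.Entry<String, List<String>> e : batches.entrySet()) {
            System.out.println(e.getKey() + " <- " + e.getValue());
        }
    }
}
```

With one batch per node, each prepare carries only modifications that the receiving node owns, which is the reduction in redundant traffic this issue proposes to benchmark.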

This JIRA is basically about benchmarking and blogging about this approach.
An entry in the FAQ would be helpful as well.

  was:
To summarise:
When using distributed caches, when we need to batch-load a set of data into the cluster inserting bathes of keys that map to the same node should significantly increase the performance.
Why?
during the prepare phase each node receives the 
complete list of modifications in that transaction and not only the 
modification pertaining to it.
E.g. say we have the following key->node mapping:
{code}
k1 -> A
k2 -> B
k3 -> C
{code}
Where k1, k2 and k3 are keys; A, B and C are nodes.
If Tx1 writes (k1,k2,k3) then during the prepare A,B and C will receive 
the the same package containing all the modification - namely (k1, 
k2,k3). There are several reasons for doing this (apparently) 
unoptimized approach: serialize the prepare only once, better handling 
of recovery information.

Now if you group transactions/batches base on key distribution the amount of redundant traffic is significantly reduced - and that translates in better performance especially when the datasets 
you're inserting is quite high.

This JIRA is basically about benchmarking and blogging about this approach.
A entry in the FAQ would be helpful as well.

    Forum Reference: http://lists.jboss.org/pipermail/infinispan-dev/2012-July/010968.html  (was: http://lists.jboss.org/pipermail/infinispan-dev/2012-July/010968.html)

> Benchmark and blog about a fast method of loading data into Infinispan  
> ------------------------------------------------------------------------
>
>                 Key: ISPN-2156
>                 URL: https://issues.jboss.org/browse/ISPN-2156
>             Project: Infinispan
>          Issue Type: Task
>            Reporter: Mircea Markus
>            Assignee: Vladimir Blagojevic
>              Labels: docs
>             Fix For: 5.2.0.FINAL
>
>
> To summarise:
> When batch-loading a set of data into a distributed cache, inserting batches of keys that map to the same node should significantly increase performance.
> Why?
> During the prepare phase, each node receives the 
> complete list of modifications in that transaction, not only the 
> modifications pertaining to it.
> E.g. say we have the following key->node mapping:
> {code}
> k1 -> A
> k2 -> B
> k3 -> C
> {code}
> Where k1, k2 and k3 are keys; A, B and C are nodes.
> If Tx1 writes (k1, k2, k3), then during the prepare A, B and C will each 
> receive the same package containing all the modifications, namely 
> (k1, k2, k3). There are several reasons for this (apparently) 
> unoptimized approach: the prepare is serialized only once, and recovery 
> information is handled better.
> Now if you group transactions/batches based on key distribution, the amount of redundant traffic is significantly reduced, and that translates into better performance, especially when the dataset 
> you're inserting is quite large.
> This JIRA is basically about benchmarking and blogging about this approach.
> An entry in the FAQ would be helpful as well.
