[infinispan-dev] Major version cleaning

Sanne Grinovero sanne at infinispan.org
Tue Feb 21 10:29:27 EST 2017


On 21 February 2017 at 13:20, Radim Vansa <rvansa at redhat.com> wrote:
> On 02/21/2017 09:39 AM, Sanne Grinovero wrote:
>> On 21 February 2017 at 07:37, Dan Berindei <dan.berindei at gmail.com> wrote:
>>> On Mon, Feb 20, 2017 at 8:02 PM, Sanne Grinovero <sanne at infinispan.org> wrote:
>>>> -1 to batch removal
>>>>
>>>> Frankly I've always been extremely negative about the fact that
>>>> batches are built on top of transactions. It's easy to find several
>>>> iterations of rants of mine on this mailing list, especially fierce
>>>> debates with Mircea. So I'd welcome a separation of these features.
>>>>
>>>> Yet, removing batching seems very wrong. I disagree that people should
>>>> use Transactions instead; for many use cases it's overkill, and for
>>>> others - and this is the main pain point I always had with the current
>>>> design - it might make sense to have a batch (or more than one) within
>>>> a transaction.
>>>> I have had practical problems with needing to "flush" a single batch
>>>> within a transaction as the size of the combined elements was getting
>>>> too large. That doesn't imply at all that I'm ready to commit.
>>>>
>>> WDYM by "flush" here? I have a feeling this is nothing like our
>>> batching ever was...
>> I'm just listing points about why incorporating the batch concept with
>> transactions is not practical.
>>
>> I mean that when one has to write very large amounts of data, you need to
>> break it up into smaller batches; *essentially* to flush the first batch
>> before starting the second one.
>> So that is unrelated to atomicity and consistency, as in practice one has
>> to split one business operation into multiple batches.
>
> You haven't explained what "flush" means. Since you separate that from
> atomicity/consistency, I assume that batches on non-tx cache are just
> ordered putOrRemoveAll operations, immediately visible on flush without
> any atomicity. If you want batches in transactions, you probably mean
> that the changes are not visible until tx commits. So you are worried
> that all data modified within this transaction won't fit into single
> node memory? Or is there more to that? What's a "nested batch", then?
> I could see some benefits in removing repeatable-reads cached values.

OK, my bad: I thought "flush" was a well-understood concept, but you're right
that it deserves a bit of elaboration for a non-typical storage engine.

In my mind, it means "make sure that buffers are sent to their destination",
and this doesn't have to relate directly to a commit. I just want the data
entry to be shipped over to the heap which will ultimately contain it.

We have real use cases in various pieces of Infinispan code I wrote over the
years in which we need to write to multiple keys.
For the sake of example, let's use the Lucene Directory case, in which each
Lucene chunk is encoded as 3 different Infinispan entries (let's not debate
why, or we get heavily off topic; trust me that it's a reasonable example of
a business use case - it's probably simpler than most other cases).

So I want to write a first chunk, in our code that looks like:

startBatch
put(chunk1/A, [some large value])
put(chunk1/B, [some small metadata])
put(chunk1/C, [some small metadata])
endBatch

There is no reason to use a transaction; in fact we had to disable
transactions, as some of these entries can be large.
There is also no reason for the batch, other than optimising the latency.

You also want to be able to write 2 chunks, and still be mindful of latency:

startBatch
put(chunk1/A,..)
put(chunk1/B,..)
put(chunk1/C,..)
endBatch

startBatch
put(chunk2/A,..)
put(chunk2/B,..)
put(chunk2/C,..)
endBatch

Again, you don't want to combine entries from chunk1 with chunk2, as they are
very large. The data belonging to chunk2 is not yet produced when we write
chunk1, so the explicit "flush" is helpful to keep the total heap usage below
some threshold; chunk sizes are configurable.
Also, it's not important in which order A, B and C are written, but we
definitely want to make sure that no entry from chunk2 is visible before any
entry of chunk1, so we prefer these batches to be separate and sequential.
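To make the pattern concrete, here is a self-contained sketch with a toy
stand-in for the cache (this is not the real Infinispan API; it only models
the buffer-and-flush behaviour described above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for a cache exposing a batching API. This is NOT the real
// Infinispan API; it only models "flush" as shipping buffered entries to
// their destination, so writer-side memory stays bounded by one batch.
class BufferingCache {
    private final Map<String, byte[]> store = new LinkedHashMap<>();
    private final Map<String, byte[]> batchBuffer = new LinkedHashMap<>();

    void startBatch() {
        batchBuffer.clear();
    }

    void put(String key, byte[] value) {
        batchBuffer.put(key, value);
    }

    // "Flush": ship the buffered entries to the store and release the
    // buffer. No atomicity is implied, only ordering between batches.
    void endBatch() {
        store.putAll(batchBuffer);
        batchBuffer.clear();
    }

    int size() {
        return store.size();
    }
}

public class ChunkWriter {
    public static void main(String[] args) {
        BufferingCache cache = new BufferingCache();
        byte[] large = new byte[1024]; // stands in for a large chunk body
        byte[] meta = new byte[16];    // stands in for small metadata

        // Two sequential batches: chunk2 is only buffered after chunk1 has
        // been flushed, so entries never accumulate across chunks.
        for (String chunk : new String[] { "chunk1", "chunk2" }) {
            cache.startBatch();
            cache.put(chunk + "/A", large);
            cache.put(chunk + "/B", meta);
            cache.put(chunk + "/C", meta);
            cache.endBatch();
        }
        System.out.println(cache.size()); // prints 6
    }
}
```

The point of the sketch is only that the writer holds at most one batch in
its buffer at a time, while ordering between batches is preserved.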

Finally, such operations are usually *internal*, triggered by some adaptation
framework which could need to be coupled to user transactions.
So you actually want to be able to do:

- Start Tx1

startBatch
put(chunk1/A,..)
put(chunk1/B,..)
put(chunk1/C,..)
endBatch

startBatch
put(chunk2/A,..)
put(chunk2/B,..)
put(chunk2/C,..)
endBatch

- Commit Tx1

That is a very reasonable expectation for anyone using a database, yet
Infinispan can't do this, as it would be "nesting".
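For illustration only, here is a toy model of what batches within a
transaction could mean semantically (again, not Infinispan code: a batch
flush ships entries out of the writer's buffer into the transaction's
staging area, while visibility still waits for the commit):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of batches inside a transaction; NOT Infinispan API. A batch
// flush ships entries off the writer's buffer into tx staging, but nothing
// becomes visible until the surrounding transaction commits.
class TxBatchingStore {
    private final Map<String, byte[]> committed = new LinkedHashMap<>();
    private final Map<String, byte[]> txStaging = new LinkedHashMap<>();
    private final Map<String, byte[]> batchBuffer = new LinkedHashMap<>();

    void startBatch() {
        batchBuffer.clear();
    }

    void put(String key, byte[] value) {
        batchBuffer.put(key, value);
    }

    // Flush the batch: the writer's buffer is released...
    void endBatch() {
        txStaging.putAll(batchBuffer);
        batchBuffer.clear();
    }

    // ...but only the commit makes the whole unit of work visible.
    void commitTx() {
        committed.putAll(txStaging);
        txStaging.clear();
    }

    int visibleSize() {
        return committed.size();
    }
}

public class TxBatchDemo {
    public static void main(String[] args) {
        TxBatchingStore store = new TxBatchingStore();
        // - Start Tx1
        store.startBatch();
        store.put("chunk1/A", new byte[1024]);
        store.endBatch();
        store.startBatch();
        store.put("chunk2/A", new byte[1024]);
        store.endBatch();
        System.out.println(store.visibleSize()); // prints 0: nothing visible yet
        // - Commit Tx1
        store.commitTx();
        System.out.println(store.visibleSize()); // prints 2
    }
}
```

Whether staging lives in one node's memory or is distributed is exactly the
open design question; the sketch only separates "flush" from "commit".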

In the specific case of the Lucene Directory, I don't expect people to want
to do this with transactions, and we definitely need an efficient way to do
batching without the performance penalty of transactions.
Hibernate OGM would be a better example of a framework which needs to write
a sequence of (hopefully batched) operations and wrap them all in a single
transaction.

>
>>>> @Pedro: the fact that some code is broken today is not relevant when
>>>> there's a need for such features. As you suggest, it had bad premises
>>>> (building it on TX), so we should address that, but not throw it all out.
>>>>
>>> Infinispan never created nested batches:
>> I know. I'm not describing the current implementation, I'm describing use
>> cases which are not addressed, or in which Infinispan is clumsy to use,
>> to suggest improvements.
>> And I'm not asking to have nested batches. Again, just pointing
>> out practical problems with the current batch design.
>>
>> It should be possible to run an efficient batch of operations within a
>> transaction.
>> Or a sequence of batches, all within the same transaction.
>> N.B. you could see that as a "nested batch" in the current twisted idea that
>> a batch is a transaction - which is what I'm arguing against.
>>
>>
>>> calling startBatch() when a
>>> batch was already associated with the current thread just incremented
>>> a refcount, and only the final endBatch() did any work. OTOH running a
>>> batch within a transaction always worked very much like suspending the
>>> current transaction, starting a new one, and committing it on
>>> endBatch(). So the only real difference between batching and using
>>> DummyTransactionManager is that batching is limited to one cache's
>>> operations, while DummyTransactionManager supports multiple resources.
>>>
>>>
>>>> @Bela is making spot-on objections based on use cases, which need to
>>>> be addressed in some way. The "atomical" operations still don't work
>>>> right[1] in Infinispan and that's a big usability problem.
>>>>
>>> Batching never was about sending updates asynchronously. We have
>>> putAllAsync() for that, which doesn't need transactions, and it's even
>>> slightly more efficient without transactions.
>> If you think this way, it sounds like you give no value to application
>> performance: people need more than a couple of put operations.
>
> Removes?
>
> I am really asking out of curiosity; I hope these questions don't sound
> ironic.
>
> R.
>
>
>>
>>> And atomical operations have bugs, yes, but I'm not sure how
>>> implementing a new kind of batching that isn't based on transactions
>>> would help with that.
>>>
>>>
>>>> +1 to remove async TX
>>>>
>>>> I actually like the concept but the API should be different.. might as
>>>> well remove this for now.
>>>>
>>>> +1 to remove the Tree module
>>>>
>>>> I personally never used it, as you all advised against it, yet it's been
>>>> often requested by users; sometimes explicitly, sometimes not, but people
>>>> often have needs which would be solved by a good Tree module.
>>>> I understand the reasons you all want to remove it: it's buggy. But I
>>>> believe the real reason is that it should have been built on top of a
>>>> proper batch API, and using real MVCC. In that case it wouldn't have
>>>> been buggy, nor too hard to maintain, and might have attracted way
>>>> more interest as well.
>>> I think the fact that we haven't been able to build a "proper" batch
>>> API using real MVCC yet is a proof to the contrary...
>>>
>>>
>>>> I'd remove it as a temporary measure: delete the bad stuff, but
>>>> hopefully it could be reintroduced built on better principles in some
>>>> future?
>>>>
>>>> Thanks,
>>>> Sanne
>>>>
>>>> [1] - "right" as in user expectations and actual practical use, which is
>>>> currently different from the twisted definition of "right" that the team
>>>> has been using as an excuse ;-) I'll also prove that it doesn't work
>>>> right according to your own twisted specs, when I find the time to
>>>> finish some tests..
>>>>
>>>>
>>>> On 20 February 2017 at 16:48, Pedro Ruivo <pedro at infinispan.org> wrote:
>>>>>
>>>>> On 20-02-2017 16:12, Bela Ban wrote:
>>>>>>
>>>>>> On 20/02/17 17:06, Tristan Tarrant wrote:
>>>>>>> Hi guys, we discussed about this a little bit in the past and this
>>>>>>> morning on IRC. Here are some proposed removals:
>>>>>>>
>>>>>>> - Remove the async transactional modes, as they are quite pointless
>>>>>>> - Remove batching: users should use transactions
>>>>>> How do you make a bunch of modifications and send them asynchronously if
>>>>>> both batching *and* async TXs are getting removed?
>>>>> We are not removing features, we are removing broken code.
>>>>>
>>>>> Batching is using transactions, and async transactions don't make sense
>>>>> since Infinispan has to report to the TransactionManager.
>>>>>
>>>>> Our current async-tx is broken in that it starts to commit and reports
>>>>> OK to the transaction manager; if there is a topology change or a
>>>>> conflict, you end up with inconsistent data.
>>>>>
>>>>> So why do we keep this code around?
>>>>>
>>>>> You can still move a transaction commit to another thread if you don't
>>>>> want to wait for it:
>>>>>
>>>>> tm.begin();
>>>>> doWork();
>>>>> Transaction tx = tm.suspend();
>>>>> new Thread(() -> {
>>>>>     tm.resume(tx);
>>>>>     tm.commit();
>>>>> }).start();
>>>>>
>>>>> The best thing I can think of is to keep the batching API and
>>>>> re-implement it to provide an endBatchAsync() that will do the above.
>>>>>
>>>>>> So if someone wants to apply a unit of work *atomically* (either all
>>>>>> modifications within that unit of work are applied, or none), what
>>>>>> alternatives exist?
>>>>>>
>>>>>>> - Remove the tree module: it doesn't work properly, and uses batching
>>>>>>>
>>>>>>> Please cast your votes
>>>>>>>
>>>>>>> Tristan
>>>>>>>
>>>>> _______________________________________________
>>>>> infinispan-dev mailing list
>>>>> infinispan-dev at lists.jboss.org
>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
>
> --
> Radim Vansa <rvansa at redhat.com>
> JBoss Performance Team
>

