Hi Galder
Sorry I'm replying so late
On Thu, May 19, 2011 at 2:02 PM, Galder Zamarreño <galder(a)redhat.com> wrote:
> Hi all,
> Re: https://issues.jboss.org/browse/ISPN-1102
> First of all, thanks to Dan for his suggestion of reservoir sampling + percentiles, a
> very good suggestion :). So, I'm looking into this and Trustin's
> http://docs.jboss.org/netty/3.2/api/org/jboss/netty/channel/AdaptiveRecei...
> but in this email I wanted to discuss the reservoir sampling mechanism
> (http://en.wikipedia.org/wiki/Reservoir_sampling).
> The way I look at it, to implement this you'd keep track of the N buffer sizes used so
> far, choose K samples from them via reservoir sampling, and then take the 90th
> percentile of those K samples.
> Calculating the percentile is easy with those K samples stored in an ordered collection.
> Now, my problem with this is that reservoir sampling is an O(n) operation, and you would
> not want to be doing that for each buffer request that comes in.
The way I see it the R algorithm is only O(n) because you have to do a
random(1, i) call for each request i.
Because random(1, i) tends to return bigger and bigger values with
every request, the probability of actually modifying the collection of
samples becomes smaller and smaller. So it would be ok to compute the
buffer size estimate on every modification because those modifications
are very rare after startup.
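To make this concrete, here's a rough Java sketch of Algorithm R plus the percentile read; the class and method names are just illustrative, not an actual Infinispan API:

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch, not real Infinispan code: Algorithm R keeps K
// samples out of the N buffer sizes seen so far, each with probability K/N.
class Reservoir {
    private final int[] samples;
    private final Random random = new Random();
    private int seen; // i: how many buffer sizes have been offered

    Reservoir(int k) {
        samples = new int[k];
    }

    /** Offer the next observed buffer size; returns true if the reservoir changed. */
    boolean offer(int bufferSize) {
        seen++;
        if (seen <= samples.length) {
            samples[seen - 1] = bufferSize; // fill phase: keep everything
            return true;
        }
        // random(1, i): as i grows, r < K becomes less and less likely,
        // which is why modifications become rare after startup.
        int r = random.nextInt(seen); // uniform in [0, seen)
        if (r < samples.length) {
            samples[r] = bufferSize;
            return true;
        }
        return false;
    }

    /** 90th percentile of the current samples. */
    int percentile90() {
        int[] sorted = Arrays.copyOf(samples, Math.min(seen, samples.length));
        Arrays.sort(sorted);
        return sorted[(int) (0.9 * (sorted.length - 1))];
    }
}
```

Recomputing percentile90() only when offer() returns true matches the point above: the sort cost is paid less and less often as i grows.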
> One option I can think of is that, instead of ever letting a user thread calculate
> this, the user thread could just feed the buffer size collection (a concurrent
> collection), and we could have a background thread that, periodically or based on some
> threshold, calculates the reservoir sample + percentile, and that is what's used as the
> next buffer size. My biggest problem here is the concurrent collection in the middle.
> You could have a priority queue ordered by buffer size, but it's unbounded. The
> concurrent collection does not need to be ordered though (the reservoir sampling could
> do that), but you want the collection bounded. And if it's bounded and the limit is
> hit, you would not want it to block, but instead overwrite values: remove the last
> element and insert again. You only care about the last N relevant buffer sizes...
Based on my assumption that modifications become very rare after
startup I was thinking of using a fixed size array with a synchronized
block around each access (we'd probably need synchronization for the
Random, too). Note that you must use an array or a list, because you
have to replace the element at index random(1, i), not the
smallest/greatest/some other element.
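Something along these lines is what I have in mind; a rough sketch only, with illustrative names, where one lock covers both the fixed-size array and the Random:

```java
import java.util.Random;

// Sketch of the synchronized variant: a shared fixed-size array (bounded
// memory by construction) with every access, including the Random, guarded
// by the instance lock. Names are illustrative, not real Infinispan code.
class SharedReservoir {
    private final int[] samples; // fixed size, so no unbounded growth
    private final Random random = new Random(); // also guarded by "this"
    private int seen;

    SharedReservoir(int k) {
        samples = new int[k];
    }

    synchronized void offer(int bufferSize) {
        seen++;
        if (seen <= samples.length) {
            samples[seen - 1] = bufferSize; // fill phase
        } else {
            int r = random.nextInt(seen); // random(1, i)
            if (r < samples.length) {
                samples[r] = bufferSize; // replace the chosen index
            }
        }
    }

    synchronized int[] snapshot() {
        return samples.clone(); // for the percentile computation
    }
}
```

Since replacements are rare after startup, the lock is mostly uncontended and held only for a few instructions.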
> Another option that would avoid the use of a concurrent collection would be to
> calculate this per thread and store it in a thread local. The calculation could still
> be done every X requests in the client thread, or it could be sent to a separate thread
> by wrapping it in a Callable and keeping the Future as a thread local, which you could
> query the next time the thread wants to marshall something.
True, per-thread statistics could remove the need for synchronization
completely. But using a separate thread to do the computations
would require synchronization again, and I don't think there's enough
work to be done to justify it.
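A self-contained sketch of the per-thread version, with the recalculation done inline in the client thread every X requests (all names, the cadence, and the starting estimate are assumptions for illustration):

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch: one sizer per client thread, kept in a ThreadLocal
// next to the thread-local Marshaller, so no locks are needed at all.
class PerThreadSizer {
    private static final int K = 50;            // reservoir size (assumed)
    private static final int RECALC_EVERY = 64; // recalc cadence (assumed)

    static final ThreadLocal<PerThreadSizer> LOCAL =
            ThreadLocal.withInitial(PerThreadSizer::new);

    private final int[] samples = new int[K];
    private final Random random = new Random();
    private int seen;
    private int estimate = 512; // starting guess for the buffer size

    void record(int bufferSize) {
        seen++;
        if (seen <= K) {
            samples[seen - 1] = bufferSize; // fill phase
        } else {
            int r = random.nextInt(seen);
            if (r < K) samples[r] = bufferSize;
        }
        // Still on the client thread: cheap because K is small and the
        // recalculation is infrequent.
        if (seen % RECALC_EVERY == 0) recalc();
    }

    private void recalc() {
        int n = Math.min(seen, K);
        int[] sorted = Arrays.copyOf(samples, n);
        Arrays.sort(sorted);
        estimate = sorted[(int) (0.9 * (n - 1))]; // 90th percentile
    }

    int nextBufferSize() {
        return estimate;
    }
}
```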
> I feel a bit more inclined towards the latter option, even though it limits the
> calculation to be per-thread, for several reasons:
> - We already have org.jboss.marshalling.Marshaller and org.jboss.marshalling.Unmarshaller
> instances as thread locals, and these have proven to perform well.
> - We could tap into this setup to maintain not only the marshaller but the buffer size
> too.
> - It could be extended further to avoid creating buffers all the time and instead
> reuse them as long as the size is constant. After a size recalculation we'd ditch the
> buffer and create a new one.
I totally agree, combining adaptive size with buffer reuse would be
really cool. I imagine when passing the buffer to JGroups we'd still
make an arraycopy, but we'd get rid of a lot of arraycopy calls to
resize the buffer when the average object size is > 500 bytes. At the
same time, if a small percentage of the objects are much bigger than
the rest, we wouldn't reuse those huge buffers so we wouldn't waste
too much memory.
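The reuse part could be as simple as this (again a sketch with made-up names, per thread so there's no sharing): hand out the same byte[] while the estimate is unchanged, and ditch it on recalculation, so a rare huge buffer is never retained.

```java
// Illustrative sketch of per-thread buffer reuse tied to the size estimate.
class ReusableBufferSource {
    private byte[] cached;
    private int cachedFor = -1; // estimate the cached buffer was sized for

    byte[] acquire(int estimatedSize) {
        if (cached != null && cachedFor == estimatedSize) {
            return cached; // same estimate: no allocation, no resize arraycopy
        }
        // First call, or the estimate changed: ditch the old buffer and
        // allocate a fresh one of the new size.
        cached = new byte[estimatedSize];
        cachedFor = estimatedSize;
        return cached;
    }
}
```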
> Thoughts?
> --
> Galder Zamarreño
> Sr. Software Engineer
> Infinispan, JBoss Cache
_______________________________________________
infinispan-dev mailing list
infinispan-dev(a)lists.jboss.org
https://lists.jboss.org/mailman/listinfo/infinispan-dev