[
https://issues.jboss.org/browse/ISPN-2037?page=com.atlassian.jira.plugin....
]
Sanne Grinovero commented on ISPN-2037:
---------------------------------------
{quote}
<vblagoje> I just pinged you sannegrinovero regarding cache loaders in map reduce.
Once you get a chance provide as much details as you can so I can finish that one off
quickly
<sannegrinovero> vblagoje, just read your ISPN-2037 .. good timing ! let's talk
about it?
<jbossbot> jira [ISPN-2037] Map/Reduce tasks should process entries in the
CacheLoader as well [Open (Unresolved) Feature Request, Critical, Vladimir Blagojevic]
https://issues.jboss.org/browse/ISPN-2037
<vblagoje> ok sannegrinovero
<vblagoje> sure, have time now
<vblagoje> ?
<sannegrinovero> yes
<sannegrinovero> so the problem is that you have to load all entries from the
cacheloader
<sannegrinovero> but you don't actually know which keys exist
<vblagoje> aha i see
<vblagoje> side question, does keySet return stuff in cache loader?
<sannegrinovero> so you have to load "Select * from database", which will
very likely kill you by OOM
<sannegrinovero> I don't know that
<vblagoje> hmmm, how are these caches configured to begin with? any eviction,
passivation, shared/single etc?
<vblagoje> shared/single cache loader
<vblagoje> sannegrinovero, who is most familiar with cacheloaders on our team?
<sannegrinovero> mmarkus is the leader I think.. ttarrant accumulated some
experience too.
<sannegrinovero> sorry was away, back now.
<sannegrinovero> Does it matter how caches are configured?
<sannegrinovero> for sure if it's a single shared cacheloader, you'll have
the problem that you only want to load entries for the current node, skipping all the
rest.
<sannegrinovero> otherwise you definitely kill the purpose of map reduce.. loading
the whole database from a single slow store N times where N is the number of nodes in the
cluster... very bad :D
<sannegrinovero> vblagoje, ^
<vblagoje> so basically the problem is when map command arrives to node where it is
executed and it needs to load all local keys, right now it does not load keys in cache
loader - you say, is that right?
<mmarkus> vblagoje: anything I can help with?
<vblagoje> this is the case of map/reduce when input keys are not specified -
basically do map/reduce on all keys
<sannegrinovero> vblagoje, I'm not sure. maybe it works but I don't think
so. we definitely need to have tests with Map/Reduce && cacheloader interactions.
<sannegrinovero> yes, and for Query I never have the keys. I think even for
map/reduce it's the most common case (to not have the set of keys)
<sannegrinovero> on the other hand, with non-shared cacheloaders it's even
trickier.
<mmarkus> vblagoje: keySet ignores the cache loader interceptor
<vblagoje> yeah ok
<vblagoje> aha mmarkus, thanks dude; how do we get them?
<mmarkus> vblagoje: only returns what's in memory
<mmarkus> you need a way to get all the keys from a cache loader?
<vblagoje> exactly
<sannegrinovero> this is strongly related guys:
https://community.jboss.org/wiki/CacheLoaderAndCacheStoreSPIRedesign#comm...
<mmarkus> vblagoje: that's a hard one man
<sannegrinovero> I had sketched an API proposal on the mailing list
<sannegrinovero> which manik had +1'ed .. searching for a link
<sannegrinovero> vblagoje:
http://lists.jboss.org/pipermail/infinispan-dev/2012-May/010760.html
<vblagoje> mmarkus, sanne , maybe we can use loadAllKeys on CacheLoader interface
and then slowly load values for each of those keys
<mmarkus> vblagoje: thinking about it from a JDBC perspective
<sannegrinovero> vblagoje, you are the map/reduce API master :) the proposal I
described has a similar API in concept, that the visitor "demands" collection of
entries from the CacheLoader implementation, in blocks.
<sannegrinovero> so the details of how the blocks of data are loaded are delegated
to the implementation, which is important
<vblagoje> you mean this Processor API sanne?
<sannegrinovero> but the flow control is handled by your consumer
<mmarkus> sannegrinovero: +1. I was about to say the same thing :)
<sannegrinovero> mmarkus, nice :)
<sannegrinovero> vblagoje, yes
<mmarkus> sannegrinovero: indeed, it was ur idea :)
<sannegrinovero> guys am I the only one wasting my dev days by writing on design on
the ML :D ?
<vblagoje> No, we all are hahaha
<sannegrinovero> admittedly this was assigned to Manik ;)
<mmarkus> sannegrinovero: you do have a fair share though :)
<vblagoje> there are so many things to do no one has time to review anyone else's
design
<vblagoje> so Manik was going to do these Processor callbacks?
<sannegrinovero> yea I realize that, I admit I was hooked to this as I need it.
<sannegrinovero> no vblagoje I think I moved it to your plate, but as you can see
from [1] he was driving the proposal of a new API
<sannegrinovero>
https://community.jboss.org/wiki/CacheLoaderAndCacheStoreSPIRedesign
<vblagoje> ok, sanne, but how can we do this if it is planned for 6.0 and you need
this yesterday?
<sannegrinovero> 6?? I'm not sure.
<sannegrinovero> let me read Manik's email again
<vblagoje> that is what the document says
<vblagoje> at the top of
https://community.jboss.org/wiki/CacheLoaderAndCacheStoreSPIRedesign
<sannegrinovero> right, and he doesn't mention it in the todo mails. Not sure
that's a mistake?
<sannegrinovero> Or maybe he means that you should make it work on the current
CacheLoader SPI
<sannegrinovero> which basically means, very dumbly load it all in memory, and
improve later on.
<sannegrinovero> vblagoje, proposal: focus on creating some good tests which cover
map/reduce examples on non-trivial data, both with shared/non shared cacheloaders,
passivation/no passivation, etc. and make it work without bothering too much about OOM and
efficiency at the CacheLoader level.
<sannegrinovero> So that will help define exactly what is best to have at SPI later
for 6.0
<vblagoje> ok sannegrinovero, I looked at Manik's proposal, this is all very
very rough sketches
<sannegrinovero> and in terms of efficiency/performance you focus on what you have
done so far (not considering cacheloaders), but add cacheloaders only as *functional*
tests.
<vblagoje> but
<vblagoje> we can make this work without some new API redesign - I think
<sannegrinovero> just forwarded Manik's last comment by email ;)
<vblagoje> yes, sannegrinovero, we should not make API changes now
<vblagoje> but lets make it work somehow
<sannegrinovero> right. Just make sure to document it, it's better to warn
people than to disappoint.
<sannegrinovero> I mean in terms of maturity, I guess you're going to advertise
your new Map/Reduce as it's maturing quickly, but the CacheLoader integration
can't be considered usable until that design is fixed.
<vblagoje> i have to think about how can this be done; and could use some help
there
<vblagoje> for example: i think we need to use cacheLoader.loadAllKeys and then use
that to load values and pass them to map reduce
<vblagoje> use keys, to load
<vblagoje> load values
<sannegrinovero> yes, seems the only way we can do it with the current SPI.
<vblagoje> but can we use raw reference to cache loader outside of cache loader
interceptor
<vblagoje> these are some of the questions I have
<vblagoje> should it be done this way? If not, then how?
<vblagoje> the only person I think might help here is mmarkus; but I am afraid to
ask him for any help as my tab in his pub is very very long
<vblagoje> hahaha
<sannegrinovero> :)
<vblagoje> nothing; I'll play with it until we figure out something
sannegrinovero
<sannegrinovero> vblagoje, one could think of some hacks here and there, but the
fact remains that with this API it's too limited to do it properly. Then let's not
do it properly, just correct and document the limitation. I wouldn't bother too much,
unless you get a genius intuition.
<vblagoje> ok let me see sannegrinovero
<vblagoje> i thought this is life and death critical to you?
<vblagoje> i mean this impl
<sannegrinovero> I mean, let's keep it clean. just load all keys, and iterate on
them. Maybe you can filter on the keys: for DIST, you keep only the keys locally owned and
ignore the others.
<sannegrinovero> vblagoje, I'll explain the use case to you.
<vblagoje> yeah something like that
<sannegrinovero> the index containing all data is corrupted, or no longer valid
because of an upgrade, or the disks containing it are on fire.
<sannegrinovero> so indexes are lost.
<sannegrinovero> you have to re-index ALL data stored in the grid, so to rebuild the
indexes and be able to find your objects again.
<sannegrinovero> Imagine you stored your items for sale,
<vblagoje> ok
<sannegrinovero> at this point if you don't load the stuff from the cacheloaders
(even those for which the keys are not in memory)
<sannegrinovero> those items for sale are lost forever :(
<sannegrinovero> But you can take out indexing from the example, and think of a
Map/Reduce task on all your items.
<sannegrinovero> It's definitely no fun if Infinispan "forgets" to
process 90% of the data you have.
<sannegrinovero> there are two main use cases:
<sannegrinovero> 1) memory is not enough - very likely you need to offload
not-so-hot elements to disks/cassandra/whatever
<sannegrinovero> 2) you powered down some nodes, and reboot them. data is in the
cacheloaders, but you don't have the keys.
<vblagoje> ok got it; this is pretty crucial then
<sannegrinovero> so the M/R api is of no use in real world if it doesn't process
passivated entries as well.
<sannegrinovero> which is why I thought of you as the best person to think about it
;-)
<sannegrinovero> yea, simply put, I don't think either M/R or indexing is of any
use without this.
<vblagoje> what if we can get M/R to load and process keys from cache loaders - as a
first target of this task and then once a nice new API is in place we'll just adjust
M/R?
<sannegrinovero> sounds like the best plan.
<vblagoje> just a sec
<vblagoje> ok, so lets do that sannegrinovero; when do you need this by? working and
tested?
<sannegrinovero> vblagoje, it's not me needing it, but you ;) as I said, M/R is
not going to be used on real world applications until you have it. Same for Query, we need
it to make Query good enough to be ready.
<vblagoje> hahaha, italian school of diplomacy
<vblagoje> sure
<sannegrinovero> so Manik listed priorities. Query and Map/Reduce are both highly
requested, this is blocking both..
<vblagoje> yeah makes sense; I'll work on this full force then and after it is
done back to M/R
<sannegrinovero> so this is a priority, unless it interferes with cross-data center
or NBST which are even more important.
<vblagoje> ok sannegrinovero, nuff for today, i am fried and I cannot believe you
are not asleep yet :-(
<vblagoje> lets talk soon
<sannegrinovero> vblagoje, cool :) I'll paste this on the JIRA.
<vblagoje> ok, deal
{quote}
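A rough sketch of the interim approach agreed on above (load all keys from the store, keep only the locally owned ones, then load values one at a time and hand them to the mapper) might look like the following. Everything named here is hypothetical and for illustration only: SimpleStore, mapAllEntries, storeOf and the locallyOwned predicate are stand-ins, not the Infinispan CacheLoader SPI or the Map/Reduce API. Note that the full key set is still held in memory, which is the limitation discussed in the chat.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;
import java.util.function.Predicate;

// Hypothetical stand-in for the store side of a cache; NOT the Infinispan SPI.
interface SimpleStore<K, V> {
    Set<K> loadAllKeys(); // named after the loadAllKeys idea from the chat
    V load(K key);        // loads one value, or returns null if the key is absent
}

class StoreAwareMapInput {

    // Feeds every locally owned entry, whether in memory or only in the store,
    // to the mapper. Returns the number of entries handed to the mapper.
    static <K, V> int mapAllEntries(Map<K, V> inMemory,
                                    SimpleStore<K, V> store,
                                    Predicate<K> locallyOwned,
                                    BiConsumer<K, V> mapper) {
        int processed = 0;
        // 1) Entries already in memory.
        for (Map.Entry<K, V> e : inMemory.entrySet()) {
            if (locallyOwned.test(e.getKey())) {
                mapper.accept(e.getKey(), e.getValue());
                processed++;
            }
        }
        // 2) Keys that exist only in the store; values are loaded one by one,
        //    so only the key set (not all values) is held in memory at once.
        for (K key : store.loadAllKeys()) {
            if (inMemory.containsKey(key) || !locallyOwned.test(key)) {
                continue; // already processed, or owned by another node
            }
            V value = store.load(key);
            if (value != null) {
                mapper.accept(key, value);
                processed++;
            }
        }
        return processed;
    }

    // Map-backed store, used here only to make the sketch runnable.
    static <K, V> SimpleStore<K, V> storeOf(Map<K, V> backing) {
        return new SimpleStore<K, V>() {
            public Set<K> loadAllKeys() { return new HashSet<>(backing.keySet()); }
            public V load(K key) { return backing.get(key); }
        };
    }

    public static void main(String[] args) {
        Map<Integer, String> memory = new HashMap<>();
        memory.put(2, "two");
        Map<Integer, String> disk = new HashMap<>();
        disk.put(2, "two");
        disk.put(3, "three");
        disk.put(4, "four");

        List<String> mapped = new ArrayList<>();
        // Assumption for the demo: this node "owns" the even keys.
        int count = mapAllEntries(memory, storeOf(disk),
                k -> k % 2 == 0, (k, v) -> mapped.add(k + "=" + v));
        System.out.println(count + " entries mapped: " + mapped);
    }
}
```

The ownership filter is what keeps a single shared store from being fully processed N times across N nodes; how ownership is actually determined for DIST is left open here.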
Map/Reduce tasks should process entries in the CacheLoader as well
------------------------------------------------------------------
Key: ISPN-2037
URL: https://issues.jboss.org/browse/ISPN-2037
Project: Infinispan
Issue Type: Feature Request
Reporter: Sanne Grinovero
Assignee: Vladimir Blagojevic
Priority: Critical
Fix For: 5.2.0.FINAL