[infinispan-dev] Extend GridFS

Tue Jul 12 04:17:48 EDT 2011

On 7/12/11 2:58 AM, Yuri de Wit wrote:
> Hi Galder,
>
> Thanks for your reply. Let me continue this discussion here first to
> validate my thinking before I create any issues in JIRA (forgive me
> for the lengthy follow up).
>
> First of all, thanks for this wonderful project! I started looking
> into Ehcache as the default caching implementation, but found it
> lacking on some key features when using JGroups. My guess is that all
> the development there is going towards the Terracotta distribution
> instead of JGroups. Terracotta does seem like a wonderful product,
> but I was hoping to stick to a JGroups-based caching impl. So I was
> happy to have found Infinispan.

Yes, I guess Terracotta (the company) has no interest in using something 
other than Terracotta (the product) in ehcache, let alone supporting a 
competitor...

However, I heard that recently the JGroups plugin for ehcache (which 
used to be terrible) was updated by some outside contributor...

> I need to create a distributed cache that loads data from the file
> system. It's a tree of folders/files containing mostly metadata info
> that seldom changes, but does change. Our mid-term goal is to move the
> metadata away from the file system and into a database, but that is
> not feasible now due to a tight deadline and the risks of refactoring
> too much of the code base.
>
> So I was happy to see the GridFilesystem implementation in Infinispan
> and the fact that clustered caches can be lazily populated (the
> metadata tree in the FS can be large and having all nodes in the
> cluster preloaded with all the data would not work for us).

Note that I wrote GridFS as a prototype in JGroups, and then Manik 
copied it over to Infinispan. Code quality is beta and not all methods 
have been implemented. In short, GridFS needs some work before it can 
be used in production!

> However,
> it defines its own persistence scheme with specific file names and
> serialized buckets, which would require us to have a cache-aside
> strategy to read our metadata tree and populate the GridFilesystem
> with it.
>
> What I am looking for is to be able to plug into the GridFilesystem a
> new FileCacheStore that can load directly from an existing directory
> tree, transparently. This would automatically lazy-load FS
> content across the cluster without having to pre-populate the
> GridFilesystem programmatically.

Interesting... I guess that loader would have to know the mapping of 
files to chunks, e.g. if a file is 10K and the chunk size is 2K, then a 
get("/home/bela/dump.txt.#3") would mean 'read the 3rd chunk of 
/home/bela/dump.txt from the file system and return it', unless it's 
already in the local cache.

This requires that your loader knows the chunk size and the 
mapping/naming between files and chunks...
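
To make the read side concrete, here is a minimal sketch of how such a 
loader could turn a chunk key into a byte range of the real file. 
ChunkReader is a hypothetical helper, not an Infinispan or JGroups 
class, and it assumes the ".#N" suffix is a 1-based chunk number and 
that the chunk size is handed to the loader up front:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, not part of Infinispan or JGroups.
final class ChunkReader {
    private final int chunkSize; // must match the chunk size GridFS is configured with

    ChunkReader(int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        this.chunkSize = chunkSize;
    }

    /** Reads the chunk named by a key such as "/home/bela/dump.txt.#3"; null if past EOF. */
    byte[] read(String key) throws IOException {
        int sep = key.lastIndexOf(".#");
        if (sep < 0) throw new IllegalArgumentException("not a chunk key: " + key);
        String path = key.substring(0, sep);
        int chunkNumber = Integer.parseInt(key.substring(sep + 2)); // assumption: 1-based
        if (chunkNumber < 1) return null;
        long offset = (long) (chunkNumber - 1) * chunkSize;
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            long remaining = file.length() - offset;
            if (remaining <= 0) return null; // chunk starts at or beyond end of file
            byte[] chunk = new byte[(int) Math.min(chunkSize, remaining)];
            file.seek(offset);
            file.readFully(chunk);
            return chunk; // the last chunk of a file may be shorter than chunkSize
        }
    }
}
```

A real cache loader would only call something like read() on a cache 
miss, so chunks are still pulled in lazily.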

Hmm. Perhaps the mapping can be more intuitive? Maybe instead of the 
chunk number, the suffix should incorporate the offset (in bytes), e.g. 
/home/bela/dump.txt.#6000 ?
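
A rough sketch of what byte-offset keys could look like. OffsetKey is a 
made-up name, and the 0-based chunk index and the 2000-byte chunk size 
in the usage line are illustrative assumptions only:

```java
// Hypothetical key type, not part of Infinispan or JGroups.
final class OffsetKey {
    final String path;
    final long offset; // byte position in the real file where the chunk starts

    OffsetKey(String path, long offset) {
        if (offset < 0) throw new IllegalArgumentException("offset must be >= 0");
        this.path = path;
        this.offset = offset;
    }

    /** Builds the key for a chunk, given an assumed 0-based chunk index. */
    static OffsetKey forChunk(String path, int chunkIndex, int chunkSize) {
        return new OffsetKey(path, (long) chunkIndex * chunkSize);
    }

    /** Parses a key such as "/home/bela/dump.txt.#6000". */
    static OffsetKey parse(String key) {
        int sep = key.lastIndexOf(".#");
        if (sep < 0) throw new IllegalArgumentException("not a chunk key: " + key);
        return new OffsetKey(key.substring(0, sep), Long.parseLong(key.substring(sep + 2)));
    }

    String toKey() {
        return path + ".#" + offset;
    }
}
```

For example, OffsetKey.forChunk("/home/bela/dump.txt", 3, 2000).toKey() 
gives "/home/bela/dump.txt.#6000". With this naming the loader can seek 
straight to the offset in the key; it still needs the chunk size to 
know how many bytes to read.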

Also, a put() on the cache loader would have to update the real file, 
and *not* store a chunk named "/home/bela/dump.txt.#3"...
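
The write side could be sketched along these lines. Again, ChunkWriter 
is a hypothetical helper rather than an existing Infinispan API, and it 
assumes 1-based chunk numbers in the key and a known chunk size:

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper, not part of Infinispan or JGroups.
final class ChunkWriter {
    private final int chunkSize;

    ChunkWriter(int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        this.chunkSize = chunkSize;
    }

    /** Writes a chunk into the real file at the position its key implies. */
    void write(String key, byte[] chunk) throws IOException {
        int sep = key.lastIndexOf(".#");
        if (sep < 0) throw new IllegalArgumentException("not a chunk key: " + key);
        String path = key.substring(0, sep); // the real file, without the ".#N" suffix
        int chunkNumber = Integer.parseInt(key.substring(sep + 2)); // assumption: 1-based
        if (chunkNumber < 1) throw new IllegalArgumentException("bad chunk number: " + chunkNumber);
        if (chunk.length > chunkSize) throw new IllegalArgumentException("chunk larger than chunk size");
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            file.seek((long) (chunkNumber - 1) * chunkSize);
            file.write(chunk); // overwrites in place; extends the file if writing at the end
        }
    }
}
```

This sketch ignores truncation (a file that shrinks) and concurrent 
writers, both of which a real store would have to handle.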

-- 
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat