[hibernate-dev] Contributing to OGM / Cassandra

Fri Aug 22 12:25:44 EDT 2014

On Thu 2014-08-07  9:10, John Worrell wrote:
> Hi Emmanuel et al.,
> 
> My apologies for the long radio silence. I've taken a look at the code-base
> on Jon Halliday's repo, and have set up a nick on freenode - #jlesinge.

No worries, I was on holiday.
And your email was one of the lucky few that I had to delay as it required
thinking ;)

> 
> On the time-series question I was wondering how you envisaged the data
> stored: I tend to think of a single row under a primary key with an
> object-instance per column. Now what we have typically done (generally the
> data has been immutable) is to store the data serialized as a blob (JSON or
> XML), but I understand you do not favour this approach. With this sort of
> model I imagine the collection is then all the objects stored in the row,
> and the challenge is to page through the objects in the row.

Actually it is one of the valid strategies.
If I understand you correctly, you want to create:

- one row per time series generating object (say a thermometer)
- the column names of that row would be the timestamps of the times in
  question
- the value would be a JSON structure containing the data recorded at
  that specific time.

That is one of the valid approaches. But I think we need to support
several:

- simple column if the data is literally a single element (temperature)
- JSON structure for more complex data per time event
- key pointing to the detailed data somewhere else in the cluster

The last one would be done in two phases: you load all the keys
matching the time range you are interested in, and then do a multiget of
sorts to load the data.
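
To make the three options concrete, here is a rough sketch in Python. It
is not OGM or Cassandra driver code: WideRow, load_details and the sample
data are made-up names, and a sorted in-memory list stands in for the
ordered columns of a row. It only illustrates the range slice over
timestamp columns and the two-phase load of option 3.

```python
import json
from bisect import bisect_left, bisect_right


class WideRow:
    """Hypothetical stand-in for one row per time-series source.

    Column names are timestamps and are kept sorted, which is what makes
    a range slice cheap.
    """

    def __init__(self):
        self._timestamps = []  # sorted column names
        self._values = {}      # timestamp -> column value

    def put(self, ts, value):
        if ts not in self._values:
            self._timestamps.insert(bisect_left(self._timestamps, ts), ts)
        self._values[ts] = value

    def slice(self, start, end, limit=None):
        """Return (timestamp, value) pairs with start <= ts <= end, oldest first."""
        lo = bisect_left(self._timestamps, start)
        hi = bisect_right(self._timestamps, end)
        keys = self._timestamps[lo:hi]
        if limit is not None:
            keys = keys[:limit]  # lets the caller decide how much to load
        return [(ts, self._values[ts]) for ts in keys]


def load_details(row, store, start, end):
    """Option 3: read the keys in the time range, then fetch the details."""
    keys = [key for _, key in row.slice(start, end)]  # phase 1: keys only
    return [store[key] for key in keys]               # phase 2: "multiget"


# Option 1: the column value is the single measurement.
temperature = WideRow()
for ts, value in [(1000, 21.5), (1060, 21.7), (1120, 22.0)]:
    temperature.put(ts, value)

# Option 2: the column value is a JSON structure.
events = WideRow()
events.put(1000, json.dumps({"temp": 21.5, "humidity": 40}))

# Option 3: the column value is a key to data stored elsewhere.
detail_store = {"evt-a": {"temp": 21.5}, "evt-b": {"temp": 21.7}, "evt-c": {"temp": 22.0}}
pointers = WideRow()
for ts, key in [(1000, "evt-a"), (1060, "evt-b"), (1120, "evt-c")]:
    pointers.put(ts, key)
```

The limit argument is the hook for paging: a caller would ask for the
next page starting just after the last timestamp it received.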

It seems DataStax tends to recommend the first or second option
(denormalization FTW).

I am not sure, but there is also the notion of super columns, which are
groupings of columns that might also address our composite problem,
assuming they can be used for dynamic column families.

http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
http://planetcassandra.org/blog/post/getting-started-with-time-series-data-modeling/
http://www.datastax.com/docs/1.0/ddl/column_family

> 
> An approach we have often taken is to create multiple copies of data in
> different (obviously works well only for immutable objects) or better to

Yes, that is a feature that I would like OGM to automate for the user.
The user declaratively defines the denormalization approaches they want
and the engine does the persistence.
Next, the query engine uses that knowledge to find the best path (or the
only possible path in the case of Cassandra :) )
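
A rough sketch of the idea in Python, purely for illustration. Layout and
DenormalizingStore are hypothetical names, not an OGM API, and the
"planner" here is deliberately naive: it only accepts a layout whose key
is exactly the set of filtered properties.

```python
from dataclasses import dataclass, field


@dataclass
class Layout:
    """One declared denormalized copy, keyed by the listed properties."""
    name: str
    key: tuple
    rows: dict = field(default_factory=dict)


class DenormalizingStore:
    def __init__(self, layouts):
        self.layouts = layouts

    def persist(self, entity):
        # The engine writes one copy per declared layout.
        for layout in self.layouts:
            row_key = tuple(entity[prop] for prop in layout.key)
            layout.rows.setdefault(row_key, []).append(entity)

    def plan(self, filters):
        # Pick the layout whose key matches the filtered properties, if any.
        for layout in self.layouts:
            if set(layout.key) == set(filters):
                return layout
        return None

    def find(self, **filters):
        layout = self.plan(filters)
        if layout is None:
            raise LookupError("no declared layout can serve this query")
        row_key = tuple(filters[prop] for prop in layout.key)
        return list(layout.rows.get(row_key, []))


store = DenormalizingStore([
    Layout("by_sensor", ("sensor",)),
    Layout("by_city_day", ("city", "day")),
])
reading = {"sensor": "t1", "city": "Paris", "day": "2014-08-22", "temp": 21.5}
store.persist(reading)
```

With this, store.find(sensor="t1") is served from by_sensor, while a
query on day alone has no path and is rejected.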

> create a table of keys to a main table where in either approach the
> row-keys are effectively a foreign-key and there is a column per object
> associated through the foreign-key. Another approach though might be to use
> a column with type list (or set, or map) to contain keys to the associated
> objects - this would be a little like the extensions Oracle have for
> mapping 1-* associations, though with the caveat that a column of
> collection type may only contain 64k elements. I wondered if some thought
> had been given to this strategy (which I must admit I have not yet used
> myself).

I am not aware of that approach.

> 
> It seems very likely that different mapping strategies should be
> specifiable, but then I have still to understand how these might fit with
> Teiid.

Forget Teiid for now. We will likely start with the HQL->Walker and do
our own proto query engine before layering Teiid.

> 
> Can I ask about assumptions: is it fair to assume that for Cassandra, OGM
> will target only CQL 3 (which means Cassandra 2 or maybe 1.2)? This would
> certainly make life simpler.

Yes that's fine.

> 
> An issue I don't see addressed is the choice of consistency-level (read or
> write) and I wondered if there was a plan for this? Assumptions can be made
> on a per table basis, but, certainly for ad hoc queries, it is important
> I think to have the flexibility to specify on a per-query basis.

That's planned. We have an option system that allows entity /
property overriding of a global setting. While not implemented yet, we
will also have the ability to override settings per session / query.
That was the plan all along.
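
As a sketch of how such layered overrides could resolve (Python, for
illustration only; resolve_option and the option name are made up, and
the precedence order query > session > property > entity > global is my
assumption, not something the option system defines here):

```python
from collections import ChainMap


def resolve_option(name, query=None, session=None, prop=None, entity=None, global_=None):
    """Return the most specific value set for an option.

    Assumed precedence, most specific first:
    query, session, property, entity, global.
    """
    scopes = [query, session, prop, entity, global_]
    # ChainMap searches its mappings in order and returns the first hit.
    chain = ChainMap(*[scope or {} for scope in scopes])
    return chain[name]  # KeyError if no scope defines the option


global_opts = {"read_consistency": "QUORUM"}
entity_opts = {"read_consistency": "ONE"}

default_level = resolve_option("read_consistency", global_=global_opts)
entity_level = resolve_option(
    "read_consistency", entity=entity_opts, global_=global_opts
)
query_level = resolve_option(
    "read_consistency",
    query={"read_consistency": "ALL"},
    entity=entity_opts,
    global_=global_opts,
)
```

Here default_level is "QUORUM", entity_level is "ONE" and query_level is
"ALL", so an ad hoc query can still pick its own level.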

> 
> Those are my thoughts so far... I'll see about doing a POC of some of what
> I have described above

Thanks :)

> 
> Cheers,
> 
> John
> 
> 
> On Mon, Jul 21, 2014 at 4:48 PM, John Worrell <jlesinge at gmail.com> wrote:
> 
> > Hi Emmanuel,
> >
> > I'll take a look at what is there, and I'll get up and running on IRC.
> >
> > I'll particularly look at the time-series issue - non-trivial I think.
> >
> > Cheers,
> >
> > John
> >
> >
> > On Mon, Jul 21, 2014 at 1:06 PM, Emmanuel Bernard <emmanuel at hibernate.org>
> > wrote:
> >
> >> Hi John,
> >>
> >> I thought I had replied to you on Friday but apparently the email never
> >> went through :/
> >>
> >> That is good news :)
> >> Jonathan worked on a Cassandra prototype but had to drop it due to other
> >> duties. He pushed everything at
> >> https://github.com/jhalliday/hibernate-ogm/tree/jonathan_cassandra
> >>
> >> Have a look at what he has done and come and ask Gunnar, Davide or me
> >> any questions. There are a bunch of moving pieces. We are mostly on
> >> freenode’s #hibernate-dev ( you need a freenode login
> >> http://freenode.net/faq.shtml#nicksetup ). If you are allergic to IRC,
> >> let me know and we will find alternatives.
> >>
> >> The most interesting challenge will be to see how we can map time series
> >> into a collection and make sure we let the user decide how much he wants to
> >> load.
> >>
> >> Emmanuel
> >>
> >> On 16 Jul 2014, at 13:17, John Worrell <jlesinge at gmail.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > I'm interested in contributing to the Cassandra module of Hibernate-OGM -
> >> > what would be the best way to go about this?
> >> >
> >> > Thanks,
> >> >
> >> > John
> >> > _______________________________________________
> >> > hibernate-dev mailing list
> >> > hibernate-dev at lists.jboss.org
> >> > https://lists.jboss.org/mailman/listinfo/hibernate-dev
> >>
> >>
> >