[JBoss JIRA] Closed: (JBMAIL-36) Support (optional) CRC or long hashkey generation for bodies

Thursday, 14 December 2006


     [ http://jira.jboss.com/jira/browse/JBMAIL-36?page=all ]

Andrew Oliver closed JBMAIL-36.
-------------------------------

    Resolution: Won't Fix
      Assignee:     (was: Andrew Oliver)

...
 Support (optional) CRC or long hashkey generation for bodies
 ------------------------------------------------------------

                 Key: JBMAIL-36
                 URL: http://jira.jboss.com/jira/browse/JBMAIL-36
             Project: JBoss Mail ** Closed - moved to http://buni.org **
          Issue Type: Sub-task
            Reporter: Andrew Oliver
            Priority: Critical
   Original Estimate: 1 week
          Time Spent: 1 day, 30 minutes
  Remaining Estimate: 3 days, 7 hours, 30 minutes

 The M3 Message Store prevents bodies from being stored multiple times and allows messages
to stream directly to the DB.  For large messages, a line-by-line hash should be computed
as the body streams in.  If it matches the hash of an existing body, the Mailbox entry is
reassigned to the existing mailstore entry and the new body is deleted (this optimizes for
disk size but costs performance).
 Example.  
 1. Assume that the following 64 MB stream (minus headers) comes in twice, once for each
of two mails (meaning we're sending the same file):
 body line                          CRC/checksum/whatever
 XXXXXXXXX...XXXXXXXXXXXXXXXXXXX    123456
 YYYYYYYYY...YYYYYYYYYYYYYYYYYYY    654321
 ZZZZZZZZZ...ZZZZZZZZZZZZZZZZZZZ    321654
 ...............................    ......
 XXXXXXXXX...XXXXXXXXXXXXXXXXXXX    123456
 YYYYYYYYY...YYYYYYYYYYYYYYYYYYY    654321
 ZZZZZZZZZ...ZZZZZZZZZZZZZZZZZZZ    321654
 cumulative checksum, with a collision chance of no more than 1/50,000,000:
 12341235125132412512
 If a "select body_id from bodies where checksum='12341235125132412512'"
returns more than one result, then the new body is deleted and the mailbox entry is
assigned to the older of the two.
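
 The flow in the example can be sketched roughly as follows.  This is only an
illustration under stated assumptions, not the M3 implementation: SHA-256 stands in for
the unspecified "CRC/checksum/whatever", an in-memory SQLite table stands in for the real
bodies table, and the names open_store, cumulative_checksum and store_body are
hypothetical.  Only the checksum is stored here; a real store would write the body while
hashing it.

```python
import hashlib
import sqlite3


def open_store():
    """Create an in-memory stand-in for the bodies table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table bodies (body_id integer primary key autoincrement, checksum text)"
    )
    return conn


def cumulative_checksum(lines):
    """Fold each body line into one running digest without buffering the whole body."""
    digest = hashlib.sha256()  # assumption: any sufficiently wide hash would do
    for line in lines:
        digest.update(line)  # lines keep their terminators, so boundaries are hashed too
    return digest.hexdigest()


def store_body(conn, lines):
    """Insert the new body, then collapse it onto an older body with the same checksum."""
    checksum = cumulative_checksum(lines)
    new_id = conn.execute(
        "insert into bodies (checksum) values (?)", (checksum,)
    ).lastrowid
    rows = conn.execute(
        "select body_id from bodies where checksum = ? order by body_id",
        (checksum,),
    ).fetchall()
    if len(rows) > 1:
        # Duplicate: delete the new body and hand back the older one for the mailbox entry.
        conn.execute("delete from bodies where body_id = ?", (new_id,))
        return rows[0][0]
    return new_id
```

 Calling store_body twice with the same lines returns the same body_id both times and
leaves a single row in the table.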
 So the idea above is what matters; the algorithm and method suggestions are not (I don't
know my posterior from my elbow when it comes to efficient binary similarity detection;
I'm just pretty sure that's not to be done by direct matching on content!).
 It is important that minor revisions not cause collisions.  The 1/50,000,000 target for
collisions should not be taken to mean that if you send me a doc, I edit it and send it
back, it is okay for the store to drop my edits.  It means that, for this to be a viable
algorithm, if I upload the text of a speech and you upload a completely different speech
and somehow the checksum comes out just right, we could have that 1/50,000,000 chance of
two very different documents getting the same checksum; a minor revision to either should
then resolve it.
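
 As a rough check on the numbers above (a sketch, not a recommendation): assuming
checksum values are spread evenly, the 1/50,000,000 target needs a checksum at least 26
bits wide, so a 32-bit CRC already clears it for a pair of unrelated documents.  The
sample text and the variable names below are made up for illustration.

```python
import math
import zlib

TARGET = 50_000_000  # the 1/50,000,000 figure from the description

# Smallest checksum width (in bits) whose value space holds at least TARGET values.
min_bits = math.ceil(math.log2(TARGET))

# A minor revision: two adjacent bytes change, the length stays the same.
original = b"Four score and seven years ago our fathers brought forth\r\n"
revised = original.replace(b"fathers", b"mothers")

crc_original = zlib.crc32(original)
crc_revised = zlib.crc32(revised)

print(min_bits)                     # 26
print(crc_original != crc_revised)  # True
```

 The second result is not luck: CRC-32 always detects a changed run of up to 32 bits
between two messages of the same length, which is the "minor revisions must not collide"
property for small localized edits.  Larger or scattered edits fall back to the
probabilistic bound.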
 It is also important that proper boundaries be created (no chance that one time we include
fuzz surrounding the body and another time we don't).
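
 One way to pin the boundaries down, sketched under assumptions that are not part of this
issue: treat everything after the first blank line as the body, and normalize line
endings before hashing so the same body always hashes the same way.  The function name
body_checksum and the choice of SHA-256 are hypothetical.

```python
import hashlib


def body_checksum(raw_message):
    """Hash only the body: everything after the first blank line, line endings normalized."""
    # Assumption: CRLF and LF deliveries of the same body should count as duplicates.
    normalized = raw_message.replace(b"\r\n", b"\n")
    # The first empty line separates headers from body; headers never reach the hash.
    _headers, _separator, body = normalized.partition(b"\n\n")
    return hashlib.sha256(body).hexdigest()
```

 With this, two mails that carry the same body but different headers or line endings get
the same checksum, while any change inside the body changes it.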
-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://jira.jboss.com/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
