Hi Luca,
Thank you for your contribution. FYI, I have planned it for the next minor
version of eXo JCR, i.e. JCR 1.14, since it is now too late for JCR 1.12.
On Wed, Mar 31, 2010 at 1:13 AM, Luca Stancapiano <
l.stancapiano(a)sourcesense.com> wrote:
I used CyberNeko to parse the properties inside the HTML code. My code
parses the meta tags inside the HTML file, for example:
<meta name="DC.creator" xml:lang="it" content="creator_brand" />
<meta name="DC.date" xml:lang="it" content="2010-01-11" />
<meta name="DC.contributor" xml:lang="it" content="contributor_brand" />
<meta name="DC.language" xml:lang="it" content="it" />
<meta name="DC.subject" xml:lang="it" content="subject_brand" />
<meta name="DC.description" xml:lang="it" content="description_brand" />
<meta name="DC.title" xml:lang="it" content="title_brand" />
<meta name="DC.publisher" xml:lang="it" content="publisher_brand" />
<meta name="DC.resourceType" xml:lang="it" content="text/html/brand" />
<meta name="DC.format" xml:lang="it" content="text/html" />
<meta name="DC.identifier" xml:lang="it" content="identifier_brand" />
<meta name="DC.source" xml:lang="it" content="source_brand" />
<meta name="DC.relation" xml:lang="it" content="relation_brand" />
<meta name="DC.coverage" xml:lang="it" content="coverage_brand" />
<meta name="DC.rights" xml:lang="it" content="rights_brand" />
Here is a part of the code:
....
DOMFragmentParser parser = new DOMFragmentParser();
HTMLDocument document = new HTMLDocumentImpl();
DocumentFragment fragment = document.createDocumentFragment();
parser.parse(new InputSource(is), fragment);
setProperties(fragment);
.....
public void setProperties(Node node) {
    if (node instanceof org.apache.html.dom.HTMLMetaElementImpl) {
        // Not every <meta> tag carries both attributes (e.g. http-equiv tags),
        // so check for null before reading the attribute text.
        Node nameAttr = node.getAttributes().getNamedItem("name");
        Node contentAttr = node.getAttributes().getNamedItem("content");
        if (nameAttr != null && contentAttr != null) {
            String name = nameAttr.getTextContent();
            String value = contentAttr.getTextContent();
            if (name.contains(DCMetaData.CONTRIBUTOR.getName()))
                props.put(DCMetaData.CONTRIBUTOR, value);
            if (name.contains(DCMetaData.DESCRIPTION.getName()))
                props.put(DCMetaData.DESCRIPTION, value);
            if (name.contains(DCMetaData.DATE.getName()))
                props.put(DCMetaData.DATE, value);
            if (name.contains(DCMetaData.CREATOR.getName()))
                props.put(DCMetaData.CREATOR, value);
            if (name.contains(DCMetaData.SUBJECT.getName()))
                props.put(DCMetaData.SUBJECT, value);
            if (name.contains(DCMetaData.PUBLISHER.getName()))
                props.put(DCMetaData.PUBLISHER, value);
            if (name.contains(DCMetaData.TITLE.getName()))
                props.put(DCMetaData.TITLE, value);
            if (name.contains(DCMetaData.LANGUAGE.getName()))
                props.put(DCMetaData.LANGUAGE, value);
        }
    }
    // Walk the rest of the tree depth-first.
    Node child = node.getFirstChild();
    while (child != null) {
        setProperties(child);
        child = child.getNextSibling();
    }
}
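The name-to-property mapping above can also be written as a lookup instead of a chain of `contains` checks. The following is only a sketch under stated assumptions: `DcMetaMapping` and its `DcField` enum are hypothetical stand-ins for the patch's `DCMetaData` constants (whose `getName()` values are not shown in this thread), and it assumes the meta names use a `DC.` prefix as in the example tags. It matches the part after the prefix exactly and ignores case, so an unrelated name that merely contains "date" or "title" is not picked up.

```java
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class DcMetaMapping {

    /** Hypothetical stand-in for the DCMetaData constants used in the patch. */
    public enum DcField {
        CONTRIBUTOR, DESCRIPTION, DATE, CREATOR, SUBJECT, PUBLISHER, TITLE, LANGUAGE
    }

    /** Returns the field for a meta name such as "DC.creator", or null if unmapped. */
    public static DcField fieldFor(String metaName) {
        if (metaName == null) {
            return null;
        }
        String name = metaName.trim().toLowerCase(Locale.ROOT);
        if (!name.startsWith("dc.")) {
            return null; // assumption: only "DC."-prefixed names are Dublin Core
        }
        String suffix = name.substring(3);
        for (DcField field : DcField.values()) {
            if (field.name().toLowerCase(Locale.ROOT).equals(suffix)) {
                return field;
            }
        }
        return null; // e.g. DC.rights, which this enum does not list
    }

    /** Keeps only the name/content pairs that map to a known field. */
    public static Map<DcField, String> collect(Map<String, String> metaTags) {
        Map<DcField, String> props = new EnumMap<>(DcField.class);
        for (Map.Entry<String, String> tag : metaTags.entrySet()) {
            DcField field = fieldFor(tag.getKey());
            if (field != null) {
                props.put(field, tag.getValue());
            }
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("DC.creator", "creator_brand");
        tags.put("DC.title", "title_brand");
        tags.put("DC.rights", "rights_brand");
        System.out.println(collect(tags));
    }
}
```

In the real reader, `collect` would be fed from the `name` and `content` attributes found while walking the DOM, as `setProperties` does above.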
I've seen that you used HTMLParser in your code. I have no preference about
the parser tool; Neko was simply the first one I thought of. It would be
better to use only one parser.
For the images I used the javax.imageio package. You put the properties
into the image with some graphics software. The Java code extracts the
properties like this:
.....
ImageInputStream imgInput = ImageIO.createImageInputStream(is);
ImageReader imageReader = ImageIO.getImageReaders(imgInput).next();
imageReader.setInput(imgInput);
IIOMetadata metaData = imageReader.getImageMetadata(0);
Node fragment =
    metaData.getAsTree(IIOMetadataFormatImpl.standardMetadataFormatName);
setProperties(fragment);
......
public void setProperties(Node node) {
    if (node instanceof javax.imageio.metadata.IIOMetadataNode) {
        if (node.getNodeName().equals("TextEntry")) {
            String name =
                node.getAttributes().getNamedItem("keyword").getNodeValue();
            String value =
                node.getAttributes().getNamedItem("value").getNodeValue();
            if (name.toLowerCase().contains(DCMetaData.CONTRIBUTOR.getName()))
                props.put(DCMetaData.CONTRIBUTOR, value);
            if (name.toLowerCase().contains(DCMetaData.DATE.getName()))
                props.put(DCMetaData.DATE, value);
            if (name.toLowerCase().contains(DCMetaData.CREATOR.getName()))
                props.put(DCMetaData.CREATOR, value);
            if (name.toLowerCase().contains(DCMetaData.LANGUAGE.getName()))
                props.put(DCMetaData.LANGUAGE, value);
        }
    }
    Node child = node.getFirstChild();
    while (child != null) {
        setProperties(child);
        child = child.getNextSibling();
    }
}
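To try the image path without a graphics tool, a PNG with a text entry can be produced and read back using only the JDK. This is a sketch, not part of the patch: the class `PngTextEntries` and its method names are made up for illustration, and it assumes the JDK's built-in PNG plugin, whose native metadata format (`javax_imageio_png_1.0`) stores keyword/value pairs in `tEXtEntry` nodes and exposes them as `TextEntry` nodes in the standard metadata tree.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageTypeSpecifier;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.metadata.IIOMetadataFormatImpl;
import javax.imageio.metadata.IIOMetadataNode;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class PngTextEntries {

    /** Writes a 1x1 PNG carrying one tEXt keyword/value pair. */
    public static byte[] writePngWithText(String keyword, String value) throws IOException {
        ImageIO.setUseCache(false); // keep everything in memory, no temp files
        BufferedImage image = new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB);
        ImageWriter writer = ImageIO.getImageWritersByFormatName("png").next();
        IIOMetadata metadata = writer.getDefaultImageMetadata(
                ImageTypeSpecifier.createFromRenderedImage(image), writer.getDefaultWriteParam());

        // Build the native PNG metadata tree: root -> tEXt -> tEXtEntry.
        IIOMetadataNode entry = new IIOMetadataNode("tEXtEntry");
        entry.setAttribute("keyword", keyword);
        entry.setAttribute("value", value);
        IIOMetadataNode text = new IIOMetadataNode("tEXt");
        text.appendChild(entry);
        IIOMetadataNode root = new IIOMetadataNode("javax_imageio_png_1.0");
        root.appendChild(text);
        metadata.mergeTree("javax_imageio_png_1.0", root);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
            writer.setOutput(out);
            writer.write(null, new IIOImage(image, null, metadata), null);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }

    /** Reads all TextEntry keyword/value pairs from the standard metadata tree. */
    public static Map<String, String> readTextEntries(byte[] imageBytes) throws IOException {
        Map<String, String> entries = new LinkedHashMap<>();
        try (ImageInputStream in = ImageIO.createImageInputStream(new ByteArrayInputStream(imageBytes))) {
            ImageReader reader = ImageIO.getImageReaders(in).next();
            try {
                reader.setInput(in);
                IIOMetadata metadata = reader.getImageMetadata(0);
                collect(metadata.getAsTree(IIOMetadataFormatImpl.standardMetadataFormatName), entries);
            } finally {
                reader.dispose();
            }
        }
        return entries;
    }

    private static void collect(Node node, Map<String, String> entries) {
        if ("TextEntry".equals(node.getNodeName())) {
            NamedNodeMap attrs = node.getAttributes();
            Node keyword = attrs.getNamedItem("keyword");
            Node value = attrs.getNamedItem("value");
            if (keyword != null && value != null) {
                entries.put(keyword.getNodeValue(), value.getNodeValue());
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, entries);
        }
    }

    /** Convenience wrapper: write then read, with unchecked exceptions. */
    public static Map<String, String> roundTrip(String keyword, String value) {
        try {
            return readTextEntries(writePngWithText(keyword, value));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `PngTextEntries.roundTrip("Creator", "creator_brand")` should return the pair that was written, which could then be matched against the Dublin Core names as in `setProperties` above. Other image formats expose different metadata, so the `TextEntry` nodes may be absent there.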
I published the complete patch in JIRA.
On Tue, Mar 30, 2010 at 7:05 PM, Patrice Lamarque <
patrice.lamarque(a)gmail.com> wrote:
> It looks interesting.
> How did you map image/HTML properties to Dublin Core?
>
> Can you give an example?
>
> On Tue, Mar 30, 2010 at 7:00 PM, Luca Stancapiano <
> l.stancapiano(a)sourcesense.com> wrote:
>
>> Hi... I created a task to extract the Dublin Core properties from an
>> image and from HTML code, similar to the
>> org.exoplatform.services.document.impl.MSExcelDocumentReader. I have a
>> patch for it. Could it be useful?
>>
>> The task is:
>>
>> https://jira.jboss.org/jira/browse/EXOJCR-624
>>
>> _______________________________________________
>> gatein-dev mailing list
>> gatein-dev(a)lists.jboss.org
>> https://lists.jboss.org/mailman/listinfo/gatein-dev
>>
>>
>
>
> --
> Patrice Lamarque
> Product Manager
> eXo Platform
>
>
--
Nicolas Filotto
JCR Product Manager
Project Manager
eXo Platform SAS
nicolas.filotto(a)exoplatform.com
+33 (0)6 31 32 92 19