[gatein-dev] Dublin Core properties from an image and a HTML code

Patrice Lamarque patrice.lamarque at gmail.com
Wed Mar 31 01:23:33 EDT 2010


Thank you, Luca. I voted for your issue.

On Wed, Mar 31, 2010 at 1:13 AM, Luca Stancapiano <
l.stancapiano at sourcesense.com> wrote:

> I used CyberNeko to parse the properties inside the HTML code. My code
> reads the meta tags inside the HTML file, for example:
>
> <meta name="DC.creator" xml:lang="it" content="creator_brand" />
> <meta name="DC.date" xml:lang="it" content="2010-01-11" />
> <meta name="DC.contributor" xml:lang="it" content="contributor_brand" />
> <meta name="DC.language" xml:lang="it" content="it" />
> <meta name="DC.subject" xml:lang="it" content="subject_brand" />
> <meta name="DC.description" xml:lang="it" content="description_brand" />
> <meta name="DC.title" xml:lang="it" content="title_brand" />
> <meta name="DC.publisher" xml:lang="it" content="publisher_brand" />
> <meta name="DC.resourceType" xml:lang="it" content="text/html/brand" />
> <meta name="DC.format" xml:lang="it" content="text/html" />
> <meta name="DC.identifier" xml:lang="it" content="identifier_brand" />
> <meta name="DC.source" xml:lang="it" content="source_brand" />
> <meta name="DC.relation" xml:lang="it" content="relation_brand" />
> <meta name="DC.coverage" xml:lang="it" content="coverage_brand" />
> <meta name="DC.rights" xml:lang="it" content="rights_brand" />
>
> Here is part of the code:
>
> ....
>                 DOMFragmentParser parser = new DOMFragmentParser();
>                 HTMLDocument document = new HTMLDocumentImpl();
>                 DocumentFragment fragment = document.createDocumentFragment();
>                 parser.parse(new InputSource(is), fragment);
>                 setProperties(fragment);
> .....
>
>
>     // Recursively walk the DOM and copy Dublin Core <meta> values into props.
>     public void setProperties(Node node) {
>         if (node instanceof org.apache.html.dom.HTMLMetaElementImpl) {
>             Node nameAttr = node.getAttributes().getNamedItem("name");
>             Node contentAttr = node.getAttributes().getNamedItem("content");
>             // Skip <meta> tags that lack name or content (e.g. http-equiv ones)
>             // instead of failing with a NullPointerException.
>             if (nameAttr != null && contentAttr != null) {
>                 String name = nameAttr.getTextContent();
>                 String value = contentAttr.getTextContent();
>                 if (name.contains(DCMetaData.CONTRIBUTOR.getName()))
>                     props.put(DCMetaData.CONTRIBUTOR, value);
>                 if (name.contains(DCMetaData.DESCRIPTION.getName()))
>                     props.put(DCMetaData.DESCRIPTION, value);
>                 if (name.contains(DCMetaData.DATE.getName()))
>                     props.put(DCMetaData.DATE, value);
>                 if (name.contains(DCMetaData.CREATOR.getName()))
>                     props.put(DCMetaData.CREATOR, value);
>                 if (name.contains(DCMetaData.SUBJECT.getName()))
>                     props.put(DCMetaData.SUBJECT, value);
>                 if (name.contains(DCMetaData.PUBLISHER.getName()))
>                     props.put(DCMetaData.PUBLISHER, value);
>                 if (name.contains(DCMetaData.TITLE.getName()))
>                     props.put(DCMetaData.TITLE, value);
>                 if (name.contains(DCMetaData.LANGUAGE.getName()))
>                     props.put(DCMetaData.LANGUAGE, value);
>             }
>         }
>         // Visit every child so nested <meta> elements are found too.
>         Node child = node.getFirstChild();
>         while (child != null) {
>             setProperties(child);
>             child = child.getNextSibling();
>         }
>     }
>
> I've seen that you used HTMLParser in your code. I have no preference about
> the parser; Neko was simply the first one I thought of. It would be better to
> use only one parser.
>
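The meta-tag mapping quoted above can be tried without any third-party parser. What follows is only a minimal sketch under stated assumptions: it swaps CyberNeko for the JDK's own XML parser, so it only handles well-formed XHTML, and the class name DcMetaSketch and the DC_ELEMENTS list are hypothetical stand-ins for the DCMetaData constants, whose definition is not shown in this thread. It matches the part after the "DC." prefix exactly rather than using contains().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class DcMetaSketch {
    // Hypothetical element names; the real DCMetaData constants are not shown here.
    static final String[] DC_ELEMENTS = {
        "contributor", "description", "date", "creator",
        "subject", "publisher", "title", "language"
    };

    // Parses well-formed XHTML (assumption: no tag soup) and returns the DC properties found.
    static Map<String, String> extract(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            collect(root, props);
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    // Depth-first walk: record <meta name="DC.x" content="..."> for every known x.
    static void collect(Node node, Map<String, String> props) {
        if (node instanceof Element && "meta".equalsIgnoreCase(node.getNodeName())) {
            Element meta = (Element) node;
            // getAttribute returns "" for a missing attribute, so no null check is needed.
            String name = meta.getAttribute("name").toLowerCase(Locale.ROOT);
            if (name.startsWith("dc.") && meta.hasAttribute("content")) {
                String key = name.substring("dc.".length());
                for (String element : DC_ELEMENTS) {
                    if (element.equals(key)) {
                        props.put(element, meta.getAttribute("content"));
                    }
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, props);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head>"
                + "<meta name=\"DC.creator\" xml:lang=\"it\" content=\"creator_brand\" />"
                + "<meta name=\"DC.title\" xml:lang=\"it\" content=\"title_brand\" />"
                + "<meta name=\"DC.format\" xml:lang=\"it\" content=\"text/html\" />"
                + "</head><body/></html>";
        System.out.println(extract(page));
    }
}
```

DC.format is deliberately left out of the hypothetical list, mirroring the quoted code, which stores only eight of the fifteen meta names. For real-world HTML that is not well-formed, a lenient parser such as Neko would still be needed.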
>
> For the images I used the javax.imageio package. You put the properties in the
> image with some graphics software. The Java code then extracts the
> properties like this:
>
> .....
>                     ImageInputStream imgInput = ImageIO.createImageInputStream(is);
>                     ImageReader imageReader = ImageIO.getImageReaders(imgInput).next();
>                     imageReader.setInput(imgInput);
>                     IIOMetadata metaData = imageReader.getImageMetadata(0);
>                     Node fragment = metaData.getAsTree(IIOMetadataFormatImpl.standardMetadataFormatName);
>                     setProperties(fragment);
> ......
>
>
>     // Recursively walk the metadata tree and copy Dublin Core text entries into props.
>     public void setProperties(Node node) {
>         if (node instanceof javax.imageio.metadata.IIOMetadataNode
>                 && node.getNodeName().equals("TextEntry")) {
>             Node keywordAttr = node.getAttributes().getNamedItem("keyword");
>             Node valueAttr = node.getAttributes().getNamedItem("value");
>             // Skip entries that lack a keyword or a value instead of failing
>             // with a NullPointerException.
>             if (keywordAttr != null && valueAttr != null) {
>                 // Lower-case the keyword once instead of once per comparison.
>                 String name = keywordAttr.getNodeValue().toLowerCase();
>                 String value = valueAttr.getNodeValue();
>                 if (name.contains(DCMetaData.CONTRIBUTOR.getName()))
>                     props.put(DCMetaData.CONTRIBUTOR, value);
>                 if (name.contains(DCMetaData.DATE.getName()))
>                     props.put(DCMetaData.DATE, value);
>                 if (name.contains(DCMetaData.CREATOR.getName()))
>                     props.put(DCMetaData.CREATOR, value);
>                 if (name.contains(DCMetaData.LANGUAGE.getName()))
>                     props.put(DCMetaData.LANGUAGE, value);
>             }
>         }
>         // Visit every child so nested TextEntry nodes are found too.
>         Node child = node.getFirstChild();
>         while (child != null) {
>             setProperties(child);
>             child = child.getNextSibling();
>         }
>     }
>
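The image path quoted above can be exercised end to end with the JDK alone. This is a minimal sketch, not the patch attached to the issue: the class name ImageDcSketch, the helper pngWithText and the DC_ELEMENTS list are hypothetical, and PNG is assumed as the image format. It first writes a 1x1 PNG carrying one text entry, then reads the entry back through the standard metadata tree. Unlike the quoted snippet, it checks hasNext() before asking for an ImageReader, so an unrecognised format yields an empty result rather than a NoSuchElementException.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageTypeSpecifier;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.metadata.IIOMetadataFormatImpl;
import javax.imageio.metadata.IIOMetadataNode;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class ImageDcSketch {
    static final String FORMAT = IIOMetadataFormatImpl.standardMetadataFormatName;
    // Hypothetical element names; the real DCMetaData constants are not shown here.
    static final String[] DC_ELEMENTS = {"contributor", "date", "creator", "language"};

    // Test helper (assumption: PNG): builds a 1x1 image with a single text entry.
    static byte[] pngWithText(String keyword, String value) {
        ImageIO.setUseCache(false); // keep everything in memory, no temp files
        BufferedImage img = new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB);
        ImageWriter writer = ImageIO.getImageWritersByFormatName("png").next();
        try {
            IIOMetadata meta = writer.getDefaultImageMetadata(
                    ImageTypeSpecifier.createFromRenderedImage(img), null);
            IIOMetadataNode entry = new IIOMetadataNode("TextEntry");
            entry.setAttribute("keyword", keyword);
            entry.setAttribute("value", value);
            IIOMetadataNode text = new IIOMetadataNode("Text");
            text.appendChild(entry);
            IIOMetadataNode root = new IIOMetadataNode(FORMAT);
            root.appendChild(text);
            meta.mergeTree(FORMAT, root);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (ImageOutputStream ios = ImageIO.createImageOutputStream(out)) {
                writer.setOutput(ios);
                writer.write(null, new IIOImage(img, null, meta), null);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
    }

    // Reads the standard metadata tree of the first image and collects DC text entries.
    static Map<String, String> extract(byte[] image) {
        ImageIO.setUseCache(false);
        Map<String, String> props = new LinkedHashMap<>();
        try (ImageInputStream in = ImageIO.createImageInputStream(new ByteArrayInputStream(image))) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                return props; // unknown format: nothing to extract
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                collect(reader.getImageMetadata(0).getAsTree(FORMAT), props);
            } finally {
                reader.dispose();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Depth-first walk over TextEntry nodes, matching keywords case-insensitively.
    static void collect(Node node, Map<String, String> props) {
        if (node instanceof Element && "TextEntry".equals(node.getNodeName())) {
            Element entry = (Element) node;
            String keyword = entry.getAttribute("keyword").toLowerCase(Locale.ROOT);
            for (String element : DC_ELEMENTS) {
                if (keyword.contains(element)) {
                    props.put(element, entry.getAttribute("value"));
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, props);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(pngWithText("DC.creator", "creator_brand")));
    }
}
```

Other formats expose text differently, so whether a given file shows its properties as TextEntry nodes depends on the reader plugin and on how the graphics software stored them.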
> I published the complete patch in JIRA.
>
> On Tue, Mar 30, 2010 at 7:05 PM, Patrice Lamarque <
> patrice.lamarque at gmail.com> wrote:
>
>> It looks interesting.
>> How did you map the image/HTML properties to Dublin Core?
>>
>> Can you give an example?
>>
>> On Tue, Mar 30, 2010 at 7:00 PM, Luca Stancapiano <
>> l.stancapiano at sourcesense.com> wrote:
>>
>>> Hi... I created a task to extract the Dublin Core properties from an
>>> image and from HTML code, similar to the
>>> org.exoplatform.services.document.impl.MSExcelDocumentReader. I have a
>>> patch for it. Could it be useful?
>>>
>>> The task is:
>>>
>>> https://jira.jboss.org/jira/browse/EXOJCR-624
>>>
>>> _______________________________________________
>>> gatein-dev mailing list
>>> gatein-dev at lists.jboss.org
>>> https://lists.jboss.org/mailman/listinfo/gatein-dev
>>>
>>>
>>
>>
>> --
>> Patrice Lamarque
>> Product Manager
>> eXo Platform
>>
>>
>


-- 
Patrice Lamarque
Product Manager
eXo Platform