Dublin Core properties from an image and a HTML code - gatein-dev - Jboss List Archives

Dublin Core properties from an image and a HTML code


Luca Stancapiano

Tuesday, 30 March 2010

noon

Hi... I created a task to extract the Dublin Core properties from an image and from HTML code, similar to org.exoplatform.services.document.impl.MSExcelDocumentReader. I have a patch for it. Could it be useful? The task is: https://jira.jboss.org/jira/browse/EXOJCR-624


Patrice Lamarque

Tuesday, 30 March

1:05 p.m.

It looks interesting. How did you map image/HTML properties to Dublin Core? Can you give an example?

-- Patrice Lamarque Product Manager eXo Platform


Luca Stancapiano

6:13 p.m.

I used CyberNeko to parse the properties inside the HTML code. My code parses the meta tags inside the HTML file, for example:

    <meta name="DC.creator" xml:lang="it" content="creator_brand" />
    <meta name="DC.date" xml:lang="it" content="2010-01-11" />
    <meta name="DC.contributor" xml:lang="it" content="contributor_brand" />
    <meta name="DC.language" xml:lang="it" content="it" />
    <meta name="DC.subject" xml:lang="it" content="subject_brand" />
    <meta name="DC.description" xml:lang="it" content="description_brand" />
    <meta name="DC.title" xml:lang="it" content="title_brand" />
    <meta name="DC.publisher" xml:lang="it" content="publisher_brand" />
    <meta name="DC.resourceType" xml:lang="it" content="text/html/brand" />
    <meta name="DC.format" xml:lang="it" content="text/html" />
    <meta name="DC.identifier" xml:lang="it" content="identifier_brand" />
    <meta name="DC.source" xml:lang="it" content="source_brand" />
    <meta name="DC.relation" xml:lang="it" content="relation_brand" />
    <meta name="DC.coverage" xml:lang="it" content="coverage_brand" />
    <meta name="DC.rights" xml:lang="it" content="rights_brand" />

Here is part of the code:

    ....
    DOMFragmentParser parser = new DOMFragmentParser();
    HTMLDocument document = new HTMLDocumentImpl();
    DocumentFragment fragment = document.createDocumentFragment();
    parser.parse(new InputSource(is), fragment);
    setProperties(fragment);
    .....

    public void setProperties(Node node) {
        if (node instanceof org.apache.html.dom.HTMLMetaElementImpl) {
            String name = node.getAttributes().getNamedItem("name").getTextContent();
            String value = node.getAttributes().getNamedItem("content").getTextContent();
            if (name.contains(DCMetaData.CONTRIBUTOR.getName()))
                props.put(DCMetaData.CONTRIBUTOR, value);
            if (name.contains(DCMetaData.DESCRIPTION.getName()))
                props.put(DCMetaData.DESCRIPTION, value);
            if (name.contains(DCMetaData.DATE.getName()))
                props.put(DCMetaData.DATE, value);
            if (name.contains(DCMetaData.CREATOR.getName()))
                props.put(DCMetaData.CREATOR, value);
            if (name.contains(DCMetaData.SUBJECT.getName()))
                props.put(DCMetaData.SUBJECT, value);
            if (name.contains(DCMetaData.PUBLISHER.getName()))
                props.put(DCMetaData.PUBLISHER, value);
            if (name.contains(DCMetaData.TITLE.getName()))
                props.put(DCMetaData.TITLE, value);
            if (name.contains(DCMetaData.LANGUAGE.getName()))
                props.put(DCMetaData.LANGUAGE, value);
        }
        Node child = node.getFirstChild();
        while (child != null) {
            setProperties(child);
            child = child.getNextSibling();
        }
    }

I've seen that you used HTMLParser in your code. I have no preference about the parser tool; Neko was simply the first one I thought of. It would be better to use only one parser.

For the images I used the javax.imageio package. You put the properties into the image with some graphics software. The Java code extracts the properties like this:

    .....
    ImageInputStream imgInput = ImageIO.createImageInputStream(is);
    ImageReader imageReader = ImageIO.getImageReaders(imgInput).next();
    imageReader.setInput(imgInput);
    IIOMetadata metaData = imageReader.getImageMetadata(0);
    Node fragment = metaData.getAsTree(IIOMetadataFormatImpl.standardMetadataFormatName);
    setProperties(fragment);
    ......

    public void setProperties(Node node) {
        if (node instanceof javax.imageio.metadata.IIOMetadataNode) {
            if (node.getNodeName().equals("TextEntry")) {
                String name = node.getAttributes().getNamedItem("keyword").getNodeValue();
                String value = node.getAttributes().getNamedItem("value").getNodeValue();
                if (name.toLowerCase().contains(DCMetaData.CONTRIBUTOR.getName()))
                    props.put(DCMetaData.CONTRIBUTOR, value);
                if (name.toLowerCase().contains(DCMetaData.DATE.getName()))
                    props.put(DCMetaData.DATE, value);
                if (name.toLowerCase().contains(DCMetaData.CREATOR.getName()))
                    props.put(DCMetaData.CREATOR, value);
                if (name.toLowerCase().contains(DCMetaData.LANGUAGE.getName()))
                    props.put(DCMetaData.LANGUAGE, value);
            }
        }
        Node child = node.getFirstChild();
        while (child != null) {
            setProperties(child);
            child = child.getNextSibling();
        }
    }

I published the complete patch in JIRA.
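For readers who want to try the meta-tag walk described in this message without the NekoHTML dependency, here is a minimal sketch using only JDK classes. It is not the patch attached to EXOJCR-624. The class name `DcMetaTagSketch`, the `DC_ELEMENTS` array (a stand-in for the `DCMetaData` constants) and the use of the JDK XML parser are all assumptions made for illustration, so the input must be well-formed XHTML, whereas NekoHTML also accepts tag soup. The sketch also skips `<meta>` tags that have no `name` attribute, which would raise a `NullPointerException` in a direct `getNamedItem("name")` lookup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class DcMetaTagSketch {
    // Hypothetical stand-in for the DCMetaData constants used in the thread.
    static final String[] DC_ELEMENTS = {
        "contributor", "description", "date", "creator",
        "subject", "publisher", "title", "language"
    };

    /** Recursively collects <meta name="DC.xxx" content="..."> pairs into props. */
    static void collect(Node node, Map<String, String> props) {
        if (node instanceof Element && "meta".equalsIgnoreCase(node.getNodeName())) {
            Element meta = (Element) node;
            // getAttribute returns "" for a missing attribute, so a
            // <meta http-equiv="..."> without a name is simply skipped.
            String name = meta.getAttribute("name").toLowerCase(Locale.ROOT);
            if (name.startsWith("dc.")) {
                String key = name.substring("dc.".length());
                for (String element : DC_ELEMENTS) {
                    if (element.equals(key)) {
                        props.put(element, meta.getAttribute("content"));
                    }
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, props);
        }
    }

    /** Parses well-formed XHTML text and returns the recognised DC properties. */
    static Map<String, String> extract(String xhtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> props = new LinkedHashMap<>();
            collect(root, props);
            return props;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head>"
            + "<meta http-equiv=\"Content-Type\" content=\"text/html\"/>"
            + "<meta name=\"DC.creator\" xml:lang=\"it\" content=\"creator_brand\"/>"
            + "<meta name=\"DC.date\" content=\"2010-01-11\"/>"
            + "<meta name=\"DC.format\" content=\"text/html\"/>"
            + "</head><body/></html>";
        System.out.println(extract(page));
    }
}
```

One design difference from the code in the message: the sketch matches the element name exactly after the `DC.` prefix instead of using `contains`, so a tag named `DC.dateSubmitted`, for instance, would not be stored as the date. Whether the exact match or the substring match is preferable depends on how the meta tags are authored.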

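The javax.imageio approach from the message above can be exercised end to end with a small round trip. The sketch below is an illustration under stated assumptions, not the code from the patch: the class `ImageTextEntrySketch`, its method names and the `DC_KEYS` set are hypothetical, and it assumes a PNG file, where the text keywords live in tEXt chunks. It first writes a 1x1 PNG with one text entry (standing in for the graphics software mentioned in the message), then reads the `TextEntry` nodes back from the standard metadata tree.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageTypeSpecifier;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.metadata.IIOMetadataFormatImpl;
import javax.imageio.metadata.IIOMetadataNode;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;
import org.w3c.dom.Node;

class ImageTextEntrySketch {
    // Hypothetical stand-in for the four DCMetaData constants used for images.
    static final Set<String> DC_KEYS =
        new HashSet<>(Arrays.asList("contributor", "date", "creator", "language"));

    /** Builds a 1x1 PNG carrying one tEXt chunk, purely as sample input. */
    static byte[] pngWithText(String keyword, String value) {
        ImageIO.setUseCache(false); // keep the demo in memory, no temp files
        ImageWriter writer = ImageIO.getImageWritersByFormatName("png").next();
        try {
            BufferedImage image = new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB);
            IIOMetadata metadata = writer.getDefaultImageMetadata(
                ImageTypeSpecifier.createFromRenderedImage(image), null);
            // Native PNG metadata tree: root -> tEXt -> tEXtEntry(keyword, value)
            IIOMetadataNode entry = new IIOMetadataNode("tEXtEntry");
            entry.setAttribute("keyword", keyword);
            entry.setAttribute("value", value);
            IIOMetadataNode text = new IIOMetadataNode("tEXt");
            text.appendChild(entry);
            IIOMetadataNode root = new IIOMetadataNode("javax_imageio_png_1.0");
            root.appendChild(text);
            metadata.mergeTree("javax_imageio_png_1.0", root);

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
                writer.setOutput(out);
                writer.write(null, new IIOImage(image, null, metadata), null);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            writer.dispose();
        }
    }

    /** Reads the image metadata and keeps text entries whose keyword is a DC key. */
    static Map<String, String> readDublinCore(byte[] data) {
        try (ImageInputStream in = ImageIO.createImageInputStream(new ByteArrayInputStream(data))) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IllegalArgumentException("no ImageReader for this input");
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                Node root = reader.getImageMetadata(0)
                    .getAsTree(IIOMetadataFormatImpl.standardMetadataFormatName);
                Map<String, String> props = new LinkedHashMap<>();
                collect(root, props);
                return props;
            } finally {
                reader.dispose();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Recursive walk over the tree, null-safe on missing attributes. */
    static void collect(Node node, Map<String, String> props) {
        if ("TextEntry".equals(node.getNodeName()) && node.getAttributes() != null) {
            Node keyword = node.getAttributes().getNamedItem("keyword");
            Node value = node.getAttributes().getNamedItem("value");
            if (keyword != null && value != null) {
                String key = keyword.getNodeValue().toLowerCase(Locale.ROOT);
                if (DC_KEYS.contains(key)) {
                    props.put(key, value.getNodeValue());
                }
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, props);
        }
    }

    public static void main(String[] args) {
        System.out.println(readDublinCore(pngWithText("Creator", "creator_brand")));
    }
}
```

Other formats may expose their text differently, or not at all, through the standard metadata tree, so a reader written this way should treat an empty result as normal rather than as an error.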


Patrice Lamarque

Wednesday, 31 March

12:23 a.m.

Thank you, Luca. I voted for your issue.

-- Patrice Lamarque Product Manager eXo Platform


Nicolas Filotto

1:40 a.m.

Hi Luca, thank you for your contribution. FYI, I have planned it for the next minor version of eXo JCR, i.e. JCR 1.14, since it is now too late for JCR 1.12.

-- Nicolas Filotto JCR Product Manager Project Manager eXo Platform SAS nicolas.filotto(a)exoplatform.com +33 (0)6 31 32 92 19
