[teiid-issues] [JBoss JIRA] (TEIID-3733) Add support for web scraping

Wed Sep 30 16:41:00 EDT 2015

    [ https://issues.jboss.org/browse/TEIID-3733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113991#comment-13113991 ] 

Steven Hawkins commented on TEIID-3733:
---------------------------------------

Note that the jsoup example extracts based upon a jsoup selector (http://jsoup.org/apidocs/org/jsoup/select/Selector.html), which is a CSS-like selector syntax.  This is somewhat idiomatic to jsoup, and the results are simply the set of selected elements; component information such as inner_text, tag name, id, etc. is returned in the result.  For any usage scenario more logic would be needed to transform the result, and this would not handle tabular data well.  At best, assuming that you could somewhat easily identify a single HTML table to extract, you would read the rows, then for each row use the jsoup extraction again to extract the columns, and then a pivot would be needed.  However, that may not work well in practice unless the table is regular; missing or spanning values would likely be an issue.
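
To illustrate the row-then-column extraction and why irregular tables are a problem, here is a minimal sketch. It is not the jsoup translator: it swaps in Python's standard html.parser module so that it runs without extra dependencies, and the names TableRows, extract_rows, REGULAR and SPANNED are hypothetical, made up for this example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell strings per <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # step 1: a new row
        elif tag in ("td", "th") and self.rows:
            self._cell = []       # step 2: a new column value within that row

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def extract_rows(html):
    """Return the table as a list of rows, each a list of cell texts."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# A regular table: every row has the same number of cells.
REGULAR = ("<table><tr><th>id</th><th>name</th></tr>"
           "<tr><td>1</td><td>a</td></tr></table>")

# An irregular table: one cell spans both columns.
SPANNED = ("<table><tr><th>id</th><th>name</th></tr>"
           "<tr><td colspan='2'>n/a</td></tr></table>")

regular_rows = extract_rows(REGULAR)
spanned_rows = extract_rows(SPANNED)

# The header row tells us how many columns each data row should have.
ragged = [row for row in spanned_rows if len(row) != len(spanned_rows[0])]
```

With the regular table each data row lines up with the header, so mapping cells to columns is straightforward.  With the spanned table the data row yields a single cell, so there is no reliable way to decide which column the value belongs to without additional logic.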

> Add support for web scraping
> ----------------------------
>
>                 Key: TEIID-3733
>                 URL: https://issues.jboss.org/browse/TEIID-3733
>             Project: Teiid
>          Issue Type: Feature Request
>          Components: Misc. Connectors
>            Reporter: Van Halbert
>            Assignee: Steven Hawkins
>
> Add support for web scraping.
> Here's one from CA using JSoup - https://github.com/rokhmanov/teiid-translators/blob/master/translator-scrape/src/main/java/com/rokhmanov/teiid/translator/scrape/

--
This message was sent by Atlassian JIRA
(v6.4.11#64026)