KNIME and the Web – Extract, Test, Automate
KNIME Spring Summit, Berlin, 25.02.2016
Philipp Katz
Our Background
• Three former PhD students at TU Dresden (me, Klemens Muthmann, David Urbansky)
• Computer Science, Information Extraction
• After PhD, each of us founded a startup
CYFACE (fancy logo under construction)
Palladian Nodes
Palladian?
• Java-based toolkit for information retrieval started in 2009
• Palladian KNIME nodes since 2011
• Used in commercial and academic projects
• Available from KNIME Community Contributions download site
The Palladian Nodes
• Text classification
• Content extraction
• Date extraction
• Named entity recognition
• Geo data extraction
• Web page, image, news search
• HTML, RSS, Atom parsing
• Ranking value retrieval
• Evaluation metrics
Access Web APIs
• Web Searcher
• Ranking Services
Text Classification
• Very simple: one learner, one predictor
• n-gram features and Naïve Bayes scoring
• Optimized for large amounts of training data
• Learner is now streamable; Predictor to follow soon
• Competitive accuracy for many use cases
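The idea of n-gram features with Naïve Bayes scoring can be illustrated with a minimal sketch. This is not the Palladian implementation: the class name `NgramNaiveBayes`, the use of character trigrams, and the Laplace smoothing are all assumptions made for illustration. The `learn` method only updates counts per row, which hints at why a learner of this kind lends itself to streaming.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Tiny multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram -> count
        self.doc_counts = Counter()               # label -> number of documents
        self.vocabulary = set()

    def learn(self, text, label):
        """Update the counts with one labeled document (one row at a time)."""
        grams = char_ngrams(text, self.n)
        self.ngram_counts[label].update(grams)
        self.doc_counts[label] += 1
        self.vocabulary.update(grams)

    def predict(self, text):
        """Return the label with the highest log-probability score."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, counts in self.ngram_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            total = sum(counts.values())
            for gram in char_ngrams(text, self.n):
                # Laplace smoothing keeps unseen n-grams from zeroing the score.
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


classifier = NgramNaiveBayes(n=3)
classifier.learn("the weather is nice today", "en")
classifier.learn("the quick brown fox", "en")
classifier.learn("das wetter ist heute schön", "de")
classifier.learn("der schnelle braune fuchs", "de")
```

With this toy training set, `classifier.predict("the brown weather")` scores the English label highest because most of its trigrams were only seen in the English rows.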
Geographic Data
• In the works for a while; added after last year's summit due to popular demand
• New: Nodes for IP and address lookup
• New: Use local gazetteer as source for location extraction node
Geographic Data
• Extract and disambiguate locations from unstructured text, visualize them on the map
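A rough sketch of gazetteer-based extraction with a naive disambiguation step might look as follows. Everything here is an assumption for illustration: the `GAZETTEER` entries are a hand-made toy table with approximate coordinates and populations, matching is limited to single-word names, and "pick the most populous candidate" is a simple baseline heuristic, not the method used by the Palladian nodes.

```python
import re

# Hypothetical toy gazetteer: lowercased name -> candidate locations.
# Coordinates and populations are rough, illustrative values.
GAZETTEER = {
    "dresden": [
        {"name": "Dresden", "country": "DE", "lat": 51.05, "lon": 13.74, "population": 550000},
        {"name": "Dresden", "country": "US", "lat": 36.29, "lon": -88.71, "population": 3000},
    ],
    "berlin": [
        {"name": "Berlin", "country": "DE", "lat": 52.52, "lon": 13.40, "population": 3500000},
        {"name": "Berlin", "country": "US", "lat": 44.47, "lon": -71.18, "population": 10000},
    ],
}


def extract_locations(text, gazetteer=GAZETTEER):
    """Find gazetteer names in the text and pick one candidate per mention.

    Returns a list of (character offset, location dict) tuples.
    """
    found = []
    for match in re.finditer(r"[A-Za-zÄÖÜäöüß]+", text):
        candidates = gazetteer.get(match.group().lower())
        if candidates:
            # Baseline disambiguation: prefer the most populous candidate.
            best = max(candidates, key=lambda c: c["population"])
            found.append((match.start(), best))
    return found


mentions = extract_locations("The summit moved from Dresden to Berlin.")
```

Each returned tuple carries latitude and longitude, which is the information a map view would need to plot the mention.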
HTTP and HTML
• New: Support for cookies, headers, and further HTTP methods besides GET
• New: Sending arbitrary byte stream content, form-encoding of table data
• New: OAuth signing for HTTP requests
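To make the request features above concrete, here is a small sketch using only the Python standard library. It form-encodes one table row, attaches cookies and extra headers, and builds a POST request without sending it. The helper name `build_form_post`, the URL, and the sample row are hypothetical; OAuth signing is not covered here.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url, row, cookies=None, headers=None):
    """Build (but do not send) a form-encoded POST request from one table row."""
    # Column names become form field names, cell values become field values.
    body = urlencode(row).encode("utf-8")
    request = Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    if cookies:
        cookie_value = "; ".join(f"{name}={value}" for name, value in cookies.items())
        request.add_header("Cookie", cookie_value)
    for name, value in (headers or {}).items():
        request.add_header(name, value)
    return request


# Hypothetical example row and endpoint.
request = build_form_post(
    "https://example.com/submit",
    {"name": "Ada Lovelace", "city": "Dresden"},
    cookies={"session": "abc"},
    headers={"Accept": "application/json"},
)
```

Passing the resulting object to `urllib.request.urlopen` would actually send it; the sketch stops before that step so it has no network side effects.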
Selenium Nodes
Selenium?
• “Selenium automates browsers.”
• The Selenium Nodes allow you to simulate a real web browser with KNIME
• Use a KNIME workflow to describe actions and extract all the data you need
Use Cases
• Data extraction
• Task automation
• Web application testing
Browser Support
• Local installations
• Headless “browsers” (PhantomJS, jBrowserDriver)
• Remotely running
• Connect to Selenium servers or VMs on your local network to simulate a variety of operating systems or browsers
• Use cloud services such as BrowserStack or SauceLabs, which provide ready-to-use Selenium instances (even iOS and Android)
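As an illustration of what connecting to a remote browser involves at the protocol level, the sketch below builds the "New Session" request defined by the W3C WebDriver specification. This is an assumption about the wire format, not a description of how the Selenium Nodes talk to a server; the server URL is hypothetical, and some server setups expose the endpoint under a different base path. The request is only constructed, not sent.

```python
import json
from urllib.request import Request


def new_session_request(server_url, browser_name, platform_name=None):
    """Build a W3C WebDriver 'New Session' request for a remote server."""
    always_match = {"browserName": browser_name}
    if platform_name:
        always_match["platformName"] = platform_name
    payload = {"capabilities": {"alwaysMatch": always_match}}
    return Request(
        server_url.rstrip("/") + "/session",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


# Hypothetical server address on the local network.
session_request = new_session_request("http://selenium.example:4444/", "firefox", "linux")
```

Switching the simulated browser or operating system then comes down to changing the requested capabilities, while the rest of the workflow stays the same.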
Example Workflow
Node Overview
• Configure, start, and quit web browsers
• Navigate
• Locate Elements (using attributes, XPath, or CSS)
• Interact with Elements (click, input text, select, submit, …)
• Highlight elements
• Take screenshots
• Extract data (page source, text content, attributes, …)
• Execute JavaScript
• Execute Selenium script
• Waiting and synchronization
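The locate-then-extract pattern from the node overview can be sketched without a browser. The example below is a stand-in, not Selenium: it parses a small, well-formed page with Python's `xml.etree.ElementTree`, which supports only a limited XPath subset, whereas the Selenium Nodes work on the live DOM of a running browser. The page content, link targets, and the function name `extract_links` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page.
PAGE = """<html><body>
  <form id="search"><input name="q" type="text" /></form>
  <ul>
    <li><a class="result" href="/nodes/palladian">Palladian Nodes</a></li>
    <li><a class="result" href="/nodes/selenium">Selenium Nodes</a></li>
  </ul>
</body></html>"""


def extract_links(page_source, xpath=".//a[@class='result']"):
    """Locate elements by XPath, then extract text content and an attribute."""
    root = ET.fromstring(page_source)
    rows = []
    for element in root.findall(xpath):
        rows.append({
            "text": "".join(element.itertext()).strip(),
            "href": element.get("href"),
        })
    return rows


links = extract_links(PAGE)

# Locating a single element by attribute works the same way.
search_box = ET.fromstring(PAGE).find(".//input[@name='q']")
```

Each located element becomes one output row, which mirrors how a table-based workflow would carry the extracted data to downstream nodes.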
Outlook
• More sample workflows
• Documentation, how-tos, …
• Workflow import and export for Selenium Scripts