Developing an improved focused crawler for the IDEAL project
Ward Bonnefond, Chris Menzel, Zack Morris, Suhas Patel, Tyler Ritchie, Mark Tedesco, Franklin Zheng
IDEAL project
Integrating Digital Event Archiving and Library
Find webpages related to an event (e.g., a natural disaster)
Store the retrieved webpages locally for parsing and analysis
Enhanced focused crawler
Extract key words and key concepts (e.g., date, location, type of disaster)
Construct trees based on these words and concepts
Develop an algorithm to compare different trees and their relationships
Make this process accessible via a web application
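The tree-building step above can be pictured as a small nested structure keyed by concept. This is only a sketch: the `EventNode` class, the `build_event_tree` and `render` helpers, the three-level layout, and the sample values are all assumptions made for illustration, not the project's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EventNode:
    """A labeled node in a hypothetical event tree."""
    label: str
    children: List["EventNode"] = field(default_factory=list)


def build_event_tree(event_name: str, concepts: Dict[str, List[str]]) -> EventNode:
    """Build a three-level tree: event -> concept -> extracted values."""
    root = EventNode(event_name)
    # Sort concepts and values so the same input always yields the same tree.
    for concept in sorted(concepts):
        concept_node = EventNode(concept)
        # Lowercase and de-duplicate values before sorting them.
        for value in sorted({v.lower() for v in concepts[concept]}):
            concept_node.children.append(EventNode(value))
        root.children.append(concept_node)
    return root


def render(node: EventNode, depth: int = 0) -> str:
    """Return an indented text view of the tree, one node per line."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.append(render(child, depth + 1))
    return "\n".join(lines)


# Made-up sample values, used only to exercise the helpers.
tree = build_event_tree(
    "event",
    {
        "type": ["Hurricane"],
        "location": ["New York", "New Jersey"],
        "date": ["2012-10-29"],
    },
)
print(render(tree))
```

Sorting the concepts and values is a design choice that keeps two trees built from the same words in the same child order, which matters if the later comparison treats children as an ordered sequence.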
Project components
1. Tree construction and visual representation
2. Event representation (i.e. key words and key concepts) versus actual event (i.e. webpage)
3. Integrating updated modules into the existing focused crawler
Original Implementation
Start with a list of seed URLs
Web crawler crawls through the list of URLs
Outputs a score for each URL based on keyword matches
Searches the webpage for other URLs
Adds any good URLs found to the list
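The loop above can be sketched roughly as follows. This is not the project's code: the `crawl` and `keyword_score` names, the scoring formula (fraction of keywords present), the threshold, and the rule that links are followed only from pages scoring at or above that threshold are all assumptions, since the slides do not say what makes a URL "good". The `fetch` callable is injected so the sketch runs against in-memory pages instead of the network.

```python
import re
from collections import deque
from typing import Callable, Dict, Iterable, List

# Naive link extraction for the sketch; a real crawler would use an HTML parser.
LINK_RE = re.compile(r'href="(http[^"]+)"')


def keyword_score(text: str, keywords: Iterable[str]) -> float:
    """Fraction of keywords that appear in the text (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    if not wanted:
        return 0.0
    lowered = text.lower()
    return sum(1 for k in wanted if k in lowered) / len(wanted)


def crawl(
    seeds: List[str],
    keywords: List[str],
    fetch: Callable[[str], str],
    threshold: float = 0.5,
    max_pages: int = 50,
) -> Dict[str, float]:
    """Score every fetched page; expand links only from pages that score well."""
    frontier = deque(seeds)
    scores: Dict[str, float] = {}
    while frontier and len(scores) < max_pages:
        url = frontier.popleft()
        if url in scores:
            continue  # already visited
        html = fetch(url)
        scores[url] = keyword_score(html, keywords)
        # Assumption: a page's outgoing links are "good" if the page itself scored well.
        if scores[url] >= threshold:
            for link in LINK_RE.findall(html):
                if link not in scores:
                    frontier.append(link)
    return scores


# Made-up in-memory pages standing in for the web.
PAGES = {
    "http://example.com/a": (
        "<p>Hurricane flooding in the city</p>"
        '<a href="http://example.com/b">b</a>'
        '<a href="http://example.com/c">c</a>'
    ),
    "http://example.com/b": "<p>Hurricane relief update</p>",
    "http://example.com/c": "<p>Cooking recipes</p>",
}

scores = crawl(
    ["http://example.com/a"],
    ["hurricane", "flooding"],
    lambda url: PAGES.get(url, ""),
)
print(scores)
```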
Current Progress
Front-End
User can enter multiple seed URLs into a textbox and submit them to the Python bundle
The Python bundle returns scored webpages, which are then displayed on the front-end webpage
Back-end
Halfway through creating an event tree from online articles
Type of storm can be retrieved from the title of an article
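As an illustration of pulling the storm type out of a title, the sketch below matches the title against a small vocabulary. The `STORM_TYPES` list and the `storm_type_from_title` function are hypothetical; the slides do not describe how the project actually performs this extraction.

```python
import re
from typing import Optional

# Hypothetical vocabulary; the real project may use a different or larger list.
STORM_TYPES = (
    "tropical storm",
    "hurricane",
    "tornado",
    "typhoon",
    "cyclone",
    "blizzard",
)

# Whole-word, case-insensitive match on any vocabulary entry.
_STORM_RE = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in STORM_TYPES) + r")\b",
    re.IGNORECASE,
)


def storm_type_from_title(title: str) -> Optional[str]:
    """Return the first storm type named in the title, lowercased, or None."""
    match = _STORM_RE.search(title)
    return match.group(1).lower() if match else None


print(storm_type_from_title("Hurricane makes landfall on the coast"))
print(storm_type_from_title("Local election results announced"))
```

Because the pattern requires word boundaries, plural or inflected forms such as "tornadoes" are not matched; handling those would need stemming or extra vocabulary entries.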
Future Work
Finish producing the event tree
Compare it with the tree provided by the user to determine article relevance
Build the GUI for displaying the event tree for a specific event
Finish the UI for the webpage
Projected Implementation
Start with a list of seed URLs
Web-crawler crawls through list of URLs
Outputs a score for each URL based on tree-edit distance
Searches the webpage for other URLs
Adds any good URLs found to the list
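To make the tree-edit-distance scoring concrete, here is a minimal sketch of one simplified variant. It is a top-down distance over ordered trees in which a node can be relabeled and whole subtrees can be inserted or deleted, each node costing 1. It is not the general tree edit distance, and the slides do not say which algorithm the project uses. The tuple-based tree shape, the `tree_distance` and `relevance` names, the unit costs, and the normalization are all assumptions.

```python
from typing import List, Tuple

# Assumed tree shape for the sketch: (label, [child trees]).
Tree = Tuple[str, List["Tree"]]


def size(tree: Tree) -> int:
    """Number of nodes in the tree."""
    _, children = tree
    return 1 + sum(size(c) for c in children)


def tree_distance(a: Tree, b: Tree) -> int:
    """Top-down edit distance: relabel a node, or insert/delete a whole subtree."""
    relabel = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    # dp[i][j] = cost of turning the first i children of a into the first j of b.
    dp = [[0] * (len(cb) + 1) for _ in range(len(ca) + 1)]
    for i in range(1, len(ca) + 1):
        dp[i][0] = dp[i - 1][0] + size(ca[i - 1])  # delete subtree
    for j in range(1, len(cb) + 1):
        dp[0][j] = dp[0][j - 1] + size(cb[j - 1])  # insert subtree
    for i in range(1, len(ca) + 1):
        for j in range(1, len(cb) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(ca[i - 1]),
                dp[i][j - 1] + size(cb[j - 1]),
                dp[i - 1][j - 1] + tree_distance(ca[i - 1], cb[j - 1]),
            )
    return relabel + dp[-1][-1]


def relevance(a: Tree, b: Tree) -> float:
    """Map the distance to a score where 1.0 means identical trees."""
    return 1.0 - tree_distance(a, b) / (size(a) + size(b))


# Made-up trees: a user-provided target and two candidate pages.
target = ("event", [("type", [("hurricane", [])]), ("location", [("new york", [])])])
page = ("event", [("type", [("hurricane", [])]), ("location", [("new jersey", [])])])
other = ("event", [("type", [("recipe", [])])])

print(tree_distance(target, page), relevance(target, page))
print(tree_distance(target, other), relevance(target, other))
```

Dividing by the combined node count is one possible way to turn the distance into a per-URL score; the distance under this cost model can never exceed that count, so the score stays between 0 and 1.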
Current Back-End Example
Current Front-End Example
Questions?