Introduction of vertical crawler
![Page 1: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/1.jpg)
Introduction of Crawler
Speaker: Jinglun
![Page 2: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/2.jpg)
Target
• Understand the concept of a crawler
• Design and implement a crawler with good performance and flexibility
![Page 3: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/3.jpg)
Agenda
• Crawler Introduction
• Design
• More Challenges
• Source Code (if time permits)
![Page 4: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/4.jpg)
What Is a Crawler?
![Page 5: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/5.jpg)
What Is a Crawler?
• Web crawler
• Vertical crawler
![Page 6: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/6.jpg)
Example
• Get mappings from source code to yicu
  – Step 1: Find product links from codesearch
  – Step 2: Find source code links from svn
  – Step 3: Find the mappings in the source code
![Page 7: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/7.jpg)
Requirements
• Get mappings from source code to yicu
![Page 8: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/8.jpg)
Format Requirements

| | Features | Qualities | Constraints |
|---|---|---|---|
| Business | Find which source code uses our libs | 1. High performance; 2. Flexibility; 3. Easy maintenance | 1. Thousands of pages; 2. One machine; 3. Several days |
| Users | Only me for now | | |
| Developers | Only me | | |
![Page 9: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/9.jpg)
Architecture
• Two directions
• Two dimensions
![Page 10: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/10.jpg)
Process
![Page 11: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/11.jpg)
Components
![Page 12: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/12.jpg)
Layers
View
Control
Module
General Components
Business Bus
![Page 13: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/13.jpg)
Layers
• Crawler (View and Control)
• Downloader, Extractor (Module)
• Storage (General Components)
• The Business Bus layer is not used
![Page 14: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/14.jpg)
Other views
• Running view
• Deploy view
• Data view
• Develop view
• …
![Page 16: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/16.jpg)
Develop View
• Trunk/
  – Src/
  – Test/
  – Bin/
![Page 17: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/17.jpg)
Review Design

| | Features | Qualities | Constraints |
|---|---|---|---|
| Business | Find which source code uses our libs | 1. High performance; 2. Flexibility; 3. Easy maintenance | 1. Thousands of pages; 2. One machine; 3. Several days |
| Users | Only me for now | | |
| Developers | Only me | | |
![Page 18: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/18.jpg)
Solutions
• Crawler
• Downloader
• Storage
• Extractor
![Page 19: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/19.jpg)
Solutions
• Crawler
• Downloader (One API)
• Storage
• Extractor
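The slide does not say what the downloader's "One API" looks like. One plausible reading, sketched in Python rather than the deck's PHP-style pseudocode, is a single callable that maps a URL to a response holding the header, body, and transfer info that the Storage slides mention. The names `Response`, `Downloader`, and `make_fake_downloader` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Response:
    """One download result: header, body, and transfer info (assumed shape)."""
    url: str
    header: str
    body: str
    info: Dict[str, object] = field(default_factory=dict)


# The single entry point: any callable that turns a URL into a Response.
Downloader = Callable[[str], Response]


def make_fake_downloader(pages: Dict[str, str]) -> Downloader:
    """Build an in-memory downloader so the crawler can be tested offline."""
    def download(url: str) -> Response:
        if url not in pages:
            return Response(url, "HTTP/1.1 404 Not Found", "", {"http_code": 404})
        return Response(url, "HTTP/1.1 200 OK", pages[url], {"http_code": 200})
    return download
```

Because the crawler depends only on the callable, a real HTTP client, a disk cache reader, or a test double can be swapped in without touching the other components.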
![Page 20: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/20.jpg)
Crawler
DFS or BFS?
![Page 21: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/21.jpg)
Crawler
DFS or BFS?
BFS:
1. The deeper a page is, the less important it tends to be
2. Many paths lead to any given page
3. Simple to extend to a distributed crawler
4. More efficient to develop
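A breadth-first crawl can be sketched as a queue plus a set of already-seen URLs, shown here in Python as an illustration rather than the speaker's implementation. The `get_links` callback and the `max_depth` cutoff are assumptions: the first stands in for "download and extract links", and the second reflects the point that deeper pages matter less.

```python
from collections import deque
from typing import Callable, Iterable, List


def bfs_crawl(seeds: Iterable[str],
              get_links: Callable[[str], List[str]],
              max_depth: int) -> List[str]:
    """Visit pages level by level and return them in visit order."""
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append((seed, 0))

    order = []
    while queue:
        url, depth = queue.popleft()      # FIFO order is what makes this BFS
        order.append(url)
        if depth >= max_depth:
            continue                      # visit the page but do not expand it
        for link in get_links(url):
            if link not in seen:          # many paths may lead to the same page
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# A tiny in-memory link graph standing in for real pages.
GRAPH = {"a": ["b", "c"], "b": ["d", "c"], "c": ["d"], "d": ["e"], "e": []}
```

With this graph, `bfs_crawl(["a"], lambda u: GRAPH.get(u, []), max_depth=2)` returns `["a", "b", "c", "d"]`: page `e` sits at depth 3 and is never reached.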
![Page 22: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/22.jpg)
Crawler
    foreach ($seed_urls as $url) {
        $page = GetPage($url);      // download or read file
        Save($page);                // keep the raw page
        $meta_info = Parse($page);  // extract the meta info
        Save($meta_info);
    }
![Page 23: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/23.jpg)
Storage
• What should be stored?
  – URLs?
  – Meta info?
  – HTTP response (header, body, curl_info)?
• How should it be stored?
  – MySQL?
  – File system?
![Page 24: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/24.jpg)
Storage
Hash table
• Index
• Physical store
![Page 25: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/25.jpg)
Storage
Hash table
• Index
  – md5
• Physical store
  – Header file, body file, curl_info file
![Page 26: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/26.jpg)
Storage
• Basedir/
  – header_093e3575e895287cf471e6d5f5028446 body_093e3575e895287cf471e6d5f5028446 info_093e3575e895287cf471e6d5f5028446
  – header_66e7c612cf23049fa731c831bcee9048 body_66e7c612cf23049fa731c831bcee9048 info_66e7c612cf23049fa731c831bcee9048
  – …
  – meta_info.txt
  – failed_download_url.txt
  – failed_extract_page.txt
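The layout above can be approximated with a small file-backed store, sketched here in Python. The slides only say "md5" for the index, so hashing the URL to get the key is an assumption, as are the `PageStore` class, its method names, and the choice of JSON for the info file.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict


class PageStore:
    """File-system hash table: md5 key as the index, three files per page."""

    def __init__(self, basedir: str) -> None:
        self.basedir = Path(basedir)
        self.basedir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def key(url: str) -> str:
        # Assumption: the index key is the md5 hex digest of the URL.
        return hashlib.md5(url.encode("utf-8")).hexdigest()

    def _path(self, part: str, url: str) -> Path:
        return self.basedir / f"{part}_{self.key(url)}"

    def save(self, url: str, header: str, body: str, info: Dict) -> None:
        self._path("header", url).write_text(header, encoding="utf-8")
        self._path("body", url).write_text(body, encoding="utf-8")
        self._path("info", url).write_text(json.dumps(info), encoding="utf-8")

    def has(self, url: str) -> bool:
        # Lets the crawler read from disk instead of downloading again.
        return self._path("body", url).exists()

    def load_body(self, url: str) -> str:
        return self._path("body", url).read_text(encoding="utf-8")
```

A lookup costs one hash and one file-system call, and a rerun can skip any URL for which `has` is true, which fits the "download or read file" comment in the crawler loop.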
![Page 27: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/27.jpg)
Extractor
• DOM tree
• Regular expressions
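The two extraction styles can be compared on the simple task of pulling link targets out of a page. This Python sketch is illustrative only: the standard library's `html.parser` is an event-based parser, not a full DOM tree builder, so it stands in for the structure-aware approach, and the `HREF_RE` pattern is a deliberately simple assumption rather than a robust HTML matcher.

```python
import re
from html.parser import HTMLParser
from typing import List


class LinkParser(HTMLParser):
    """Structure-aware extraction: collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_by_parser(html: str) -> List[str]:
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Text-level extraction: quick to write, but blind to document structure.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def links_by_regex(html: str) -> List[str]:
    return HREF_RE.findall(html)
```

On well-formed markup both functions agree. They diverge on input such as `<!-- <a href="/old">x</a> -->`, where the parser skips the comment and the regular expression still reports `/old`.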
![Page 28: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/28.jpg)
Review Design

| | Features | Qualities | Constraints |
|---|---|---|---|
| Business | Find which source code uses our libs | 1. High performance; 2. Flexibility; 3. Easy maintenance | 1. Thousands of pages; 2. One machine; 3. Several days |
| Users | Only me for now | | |
| Developers | Only me | | |
![Page 29: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/29.jpg)
Performance Issue?
• Multiple threads?
• Multiple processes?
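The slide leaves the question open. As one possible answer for the download step, which mostly waits on the network, a thread pool is sketched below in Python. The `download` callback, the `workers` default, and the idea of returning failed URLs separately (mirroring `failed_download_url.txt`) are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple


def download_all(urls: Iterable[str],
                 download: Callable[[str], str],
                 workers: int = 8) -> Tuple[Dict[str, str], List[str]]:
    """Download URLs concurrently; return (bodies by URL, failed URLs)."""
    ok: Dict[str, str] = {}
    failed: List[str] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(download, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                ok[url] = future.result()
            except Exception:
                # One bad page should not stop the crawl; record it for retry.
                failed.append(url)
    return ok, failed
```

Threads are usually enough while the work is I/O-bound. If parsing becomes the bottleneck, CPU-bound extraction is the part that would benefit from multiple processes.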
![Page 30: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/30.jpg)
Data analysis
• Vim
• Shell (awk, sed)
• Regular expressions
![Page 31: Introduction of vertical crawler](https://reader035.vdocuments.us/reader035/viewer/2022081514/5564bdb5d8b42a565b8b4608/html5/thumbnails/31.jpg)
More Challenges
• Distributed crawling
• Noise
• Duplication
• Quick updates
• Concurrency and performance
• …
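Of these challenges, duplication is the easiest to illustrate in a few lines. A common first step is to reduce each URL to a canonical form before checking whether it was already seen. The Python sketch below is an assumption about how that might be done, not something the slides specify; `canonical_url` is a hypothetical helper, and it deliberately ignores harder cases such as query-parameter order, user info in the URL, or identical content served from different URLs.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_url(url: str) -> str:
    """Normalize case, default ports, empty paths, and fragments."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is not None and DEFAULT_PORTS.get(scheme) != port:
        host = f"{host}:{port}"          # keep only non-default ports
    path = parts.path or "/"             # "http://x" and "http://x/" are the same page
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

Storing `canonical_url(u)` in the seen set, instead of the raw string, keeps `HTTP://Example.com:80/a#top` and `http://example.com/a` from being crawled twice.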