How to scrape content from the web for a location-based mobile app
![Page 1: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/1.jpg)
Scraping content from the web for a location-based mobile app
![Page 2: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/2.jpg)
Nguyen Hong Diep, founder, magik.vn
![Page 3: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/3.jpg)
Summary
1. Web Scraping
– Definitions
– Added value
– Analysis of a sample case
2. Scrapy Framework
– Overview
– Architecture
– A simple Scrapy program
3. Build an automatic scraping system for location-based apps
– Extract LatLng from an address
– Extract phone numbers
– Real-time updates, running continuously 24/7
– Prevent duplicate data
– Deploy without a dedicated server or VPS
![Page 4: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/4.jpg)
Web crawler
Internet bot that systematically browses the World Wide Web,
typically for web indexing.
Source: wikipedia.org
![Page 5: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/5.jpg)
Scrape
Crawl websites and extract structured data from pages.
Source: wikipedia.org
![Page 6: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/6.jpg)
Added Value?
![Page 7: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/7.jpg)
giamua.com – “groupon”
![Page 8: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/8.jpg)
baomoi.com
![Page 9: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/9.jpg)
Added Value?
Same user experience, but
more content than
![Page 10: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/10.jpg)
oizoioi.vn: price comparison for electronics
![Page 11: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/11.jpg)
Added Value?
Make new knowledge from many pieces of information
Wisdom
Knowledge
Information
Data
DIKW Hierarchy
![Page 12: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/12.jpg)
Nha Tro Tot
![Page 13: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/13.jpg)
![Page 14: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/14.jpg)
Added Value?
The smartphone revolution: a new platform
needs a new user experience
Source: www.widexconnect.ca
![Page 15: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/15.jpg)
And more
Source: Laban.vn
![Page 16: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/16.jpg)
Analysis of a sample case
Need:
(1) Collect [homes for sale] records from the Web
(2) from many websites in Vietnam
(3) as soon as they are posted
(4) continuously, 24/7
![Page 17: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/17.jpg)
Step 1: Listing sources
![Page 18: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/18.jpg)
Step 2: build general database
![Page 19: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/19.jpg)
Step 3: Ctrl+C, Ctrl+V
• For every site:
– Find the link to the page that lists the latest records.
– For every record:
• Check whether it is a new record.
– If it is, copy & paste its fields into a new record in my DB.
![Page 20: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/20.jpg)
Step 3: Ctrl+C, Ctrl+V
![Page 21: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/21.jpg)
Step 3: Let’s Scrapy
![Page 22: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/22.jpg)
Scrapy Framework
• Overview
• Architecture
• XPath
• Make a simple Scrapy program
![Page 23: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/23.jpg)
• Scrapy is a fast high-level screen scraping and web crawling framework.
• Open-source, 100% Python => Portable
![Page 24: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/24.jpg)
Scrapy’s github info
• From 2008
• Stats
![Page 25: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/25.jpg)
Architecture
Source: http://doc.scrapy.org/en/0.12/topics/architecture.html
![Page 26: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/26.jpg)
XPath
Navigate through elements and attributes
in an XML document.
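As a rough illustration of the definition above, the sketch below runs two XPath-style queries with Python's standard library. The markup, class names and field names are invented for this example, and `xml.etree.ElementTree` only supports a subset of XPath and only parses well-formed XML, so it stands in here for the fuller selectors a scraping framework provides.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing markup, invented for this illustration.
PAGE = """
<html>
  <body>
    <div class="listing">
      <a href="/home/1">House in District 1</a>
      <span class="phone">0912 345 678</span>
    </div>
    <div class="listing">
      <a href="/home/2">Apartment in District 7</a>
      <span class="phone">0987 654 321</span>
    </div>
  </body>
</html>
"""


def extract_listings(markup):
    """Return one dict per <div class="listing"> found in the markup."""
    root = ET.fromstring(markup.strip())
    rows = []
    # ".//div[@class='listing']" selects matching <div> elements at any depth.
    for div in root.findall(".//div[@class='listing']"):
        link = div.find("a")                      # first <a> child
        phone = div.find("span[@class='phone']")  # child <span> filtered by attribute
        rows.append({
            "title": link.text,
            "url": link.get("href"),
            "phone": phone.text,
        })
    return rows


listings = extract_listings(PAGE)
```

The path expression navigates elements (`div`, `a`, `span`), while the bracketed predicate filters on an attribute, which are the two ideas named on the slide.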
![Page 27: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/27.jpg)
Simple Scrapy Program
• (1) Pick a website – http://www.mininova.org/today
• (2) Define the data you want to scrape
![Page 28: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/28.jpg)
Simple Scrapy Program (cont.)
• (3) Write a Spider to extract the data
![Page 29: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/29.jpg)
![Page 30: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/30.jpg)
Simple Scrapy Program (cont.)
(4) Run the spider to extract the data
(5) Review scraped data
![Page 31: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/31.jpg)
Build an automatic scraping system for location-based apps
• Extract LatLng from an address
• Extract phone numbers
• Real-time updates, running continuously 24/7
• Prevent duplicate data
• Deploy without a dedicated server or VPS
![Page 32: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/32.jpg)
Extract LatLng from address
• Use the Google Geocoding API
• https://maps.googleapis.com/maps/api/geocode/json?address=xxx&sensor=true_or_false&key=API_KEY
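A minimal sketch of how such a request could be built and its answer read, using only the standard library. It makes no network call: `SAMPLE_BODY` is a hand-written response that mirrors the `status` / `results[0].geometry.location` shape of the JSON output, and the address and helper names are made up for illustration. The `sensor` parameter from the slide is left out on the assumption that current versions of the API no longer require it.

```python
import json
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build the request URL; urlencode escapes spaces, commas and non-ASCII text."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def parse_latlng(body):
    """Return (lat, lng) from a geocoding JSON body, or None if nothing was found."""
    data = json.loads(body)
    if data.get("status") != "OK" or not data.get("results"):
        return None
    location = data["results"][0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written sample response (not real API output), trimmed to the fields used here.
SAMPLE_BODY = json.dumps({
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 10.7769, "lng": 106.7009}}}],
})

url = build_geocode_url("1 Le Loi, Quan 1, Ho Chi Minh", "API_KEY")
latlng = parse_latlng(SAMPLE_BODY)
```

In a real spider the URL would be fetched and the response body passed to `parse_latlng`; returning `None` for a non-OK status lets the caller store the record without coordinates instead of failing.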
![Page 33: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/33.jpg)
Extract LatLng from address (cont.)
![Page 34: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/34.jpg)
Extract LatLng from address (cont.)
![Page 35: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/35.jpg)
Extract Phone Number
• libphonenumber’s Python port (the phonenumbers package).
• Sample
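The sample on the slide is not preserved in this transcript. As a dependency-free stand-in, the sketch below pulls Vietnamese-looking numbers out of free text with a regular expression and normalizes them to a +84 form. This is not libphonenumber: the pattern, the assumed digit counts (9 or 10 digits after the prefix) and the function name are illustrative assumptions, and the real library validates numbers far more carefully.

```python
import re

# Assumed shape: a "+84" or "0" prefix followed by 9 or 10 digits,
# each optionally preceded by a space, dot or dash.
PHONE_PATTERN = re.compile(r"(?<!\d)(?:\+84|0)(?:[ .-]?\d){9,10}(?!\d)")


def extract_phone_numbers(text):
    """Return the numbers found in text, normalized to '+84' followed by digits."""
    numbers = []
    for match in PHONE_PATTERN.finditer(text):
        raw = match.group()
        digits = re.sub(r"\D", "", raw)  # keep digits only
        # Drop the country code "84" or the national trunk prefix "0".
        national = digits[2:] if raw.startswith("+") else digits[1:]
        numbers.append("+84" + national)
    return numbers


found = extract_phone_numbers("Lien he: 0912 345 678 hoac +84 28 3822 1234")
```

The lookbehind and lookahead keep the pattern from matching inside longer digit runs such as prices, which is a common source of false positives in listing text.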
![Page 36: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/36.jpg)
“Real time” update and continuous 24/7.
• Task Scheduler (Windows)
• Cron jobs (Linux)
![Page 37: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/37.jpg)
Prevent duplicate data
• Write a middleware that ignores items that already exist: IgnoreExistsMiddleWare
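The middleware itself is not shown in the transcript. The sketch below illustrates only the underlying idea, in plain Python rather than Scrapy's middleware API: fingerprint each record and skip any record whose fingerprint has been seen before. The choice of the source URL as the identifying key (with title plus phone as a fallback) and all names here are assumptions.

```python
import hashlib


class SeenItems:
    """Remembers fingerprints of records that were already stored."""

    def __init__(self):
        self._fingerprints = set()

    @staticmethod
    def fingerprint(item):
        # Assumption: the source URL identifies a listing; otherwise
        # fall back to title + phone.
        key = item.get("url") or "{}|{}".format(
            item.get("title", ""), item.get("phone", "")
        )
        return hashlib.sha1(key.strip().encode("utf-8")).hexdigest()

    def is_new(self, item):
        """Return True the first time an item is seen, False afterwards."""
        fp = self.fingerprint(item)
        if fp in self._fingerprints:
            return False
        self._fingerprints.add(fp)
        return True


def drop_duplicates(items, seen):
    """Keep only the items that `seen` has not recorded yet."""
    return [item for item in items if seen.is_new(item)]


seen = SeenItems()
batch = [
    {"url": "http://example.com/home/1"},
    {"url": "http://example.com/home/1 "},  # same listing, stray whitespace
    {"url": "http://example.com/home/2"},
]
fresh = drop_duplicates(batch, seen)
```

Because the scraper runs repeatedly, the set of fingerprints would need to live in the database or on disk between runs; the in-memory set here only keeps the example short.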
![Page 38: How to scraping content from web for location-based mobile app](https://reader034.vdocuments.us/reader034/viewer/2022052619/5552bef1b4c905920f8b4723/html5/thumbnails/38.jpg)
Without a dedicated server or VPS
• Problem: my server side runs on cPanel web hosting, so I can’t deploy Scrapy there.
• Solution:
– Build a web service for syncing new record data:
• /get_head_revision
• /sync
– Scrapy runs on my PC, then syncs with the server.
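One way the two endpoints could fit together is sketched below, with an in-memory class standing in for the hosted web service. The slides do not describe the protocol, so everything beyond the endpoint names is an assumption: here the head revision is simply the number of records the server holds, and the PC keeps its scraped records in an append-only list.

```python
class SyncServer:
    """In-memory stand-in for the web service on the shared hosting."""

    def __init__(self):
        self.records = []  # the record at index i belongs to revision i + 1

    def get_head_revision(self):
        # Stand-in for a request to /get_head_revision.
        return len(self.records)

    def sync(self, base_revision, new_records):
        # Stand-in for a request to /sync: refuse uploads from a stale client.
        if base_revision != self.get_head_revision():
            raise ValueError("client is out of date; fetch the head revision again")
        self.records.extend(new_records)
        return self.get_head_revision()


def push_new_records(server, local_records):
    """Upload only the records the server does not have yet; return the new head."""
    head = server.get_head_revision()
    pending = local_records[head:]
    if not pending:
        return head
    return server.sync(head, pending)


server = SyncServer()
local = ["record-1", "record-2"]
first_head = push_new_records(server, local)
local.append("record-3")
second_head = push_new_records(server, local)
```

Asking for the head revision first means each run uploads only the delta, which keeps requests small enough for shared hosting and makes a repeated run harmless.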