Tutorial on Web Scraping in Python
Scraping Data from the Web using Scrapy & Beautiful Soup
Nithish Raghunandanan
PyData Munich | 8th November 2017
About Me
● MSc Informatics student at the Technical University of Munich
○ Focus on Data Science & Software Engineering
● Student Employee at KI labs, part of KI Group
● Love to play with different technologies
● Connect
○ nithishr1
○ @nithishr
What is Scraping?
● Extract data from web pages
● Store the data in structured formats
● Useful when the data is not available directly or via APIs
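The two steps above (extract, then store in a structured format) can be sketched with the Python standard library alone. This is a minimal illustration, not the code from the talk: the HTML fragment, the `business` class name, and the helper names are all assumptions made up for the example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<ul>
  <li class="business">Cafe Alpha</li>
  <li class="business">Bakery Beta</li>
  <li class="ad">Sponsored link</li>
</ul>
"""


class BusinessParser(HTMLParser):
    """Collects the text of every <li class="business"> element."""

    def __init__(self):
        super().__init__()
        self.in_business = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "li" and dict(attrs).get("class") == "business":
            self.in_business = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_business = False

    def handle_data(self, data):
        if self.in_business and data.strip():
            self.names.append(data.strip())


def extract_names(html):
    """Step 1: extract the data from the page."""
    parser = BusinessParser()
    parser.feed(html)
    return parser.names


def to_csv(names):
    """Step 2: store the data in a structured format (CSV text)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name"])
    for name in names:
        writer.writerow([name])
    return buffer.getvalue()


names = extract_names(SAMPLE_HTML)
csv_text = to_csv(names)
```

In practice a library such as Beautiful Soup would likely replace the hand-written parser class; the standard-library version is used here only to keep the sketch dependency-free.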
Use Cases
Tools for Scraping
● Scrapy
○ Python framework to extract data from web pages
● Beautiful Soup
○ Python library to parse HTML/XML documents
● Alternatives
○ Selenium
○ Requests
○ Octoparse
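As a rough illustration of what "parse HTML/XML documents" means for Beautiful Soup, the sketch below pulls names and links out of an inline HTML string. The markup, class names, and URLs are invented for the example, and it assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup; a real page would be downloaded first.
SAMPLE_HTML = """
<div class="listing">
  <h2 class="name">Cafe Alpha</h2>
  <a class="site" href="https://example.com/alpha">Website</a>
</div>
<div class="listing">
  <h2 class="name">Bakery Beta</h2>
  <a class="site" href="https://example.com/beta">Website</a>
</div>
"""

# "html.parser" is the parser bundled with Python, so no extra
# parser library is needed for this sketch.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

listings = []
for block in soup.find_all("div", class_="listing"):
    name = block.find("h2", class_="name").get_text(strip=True)
    url = block.find("a", class_="site")["href"]
    listings.append({"name": name, "url": url})
```

Beautiful Soup only parses; something else (Scrapy, Requests) has to fetch the page. That split is presumably why the talk pairs it with Scrapy.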
Scraping 101
● Spider
○ A bot that downloads web pages
● robots.txt
○ File on the server that specifies access limits for bots
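To make the robots.txt idea concrete, here is a small sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `my-spider` user agent, and the URLs are assumptions for illustration; a real spider would fetch the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a (hypothetical) spider may fetch specific URLs.
allowed_public = parser.can_fetch("my-spider", "https://example.com/listings")
allowed_private = parser.can_fetch("my-spider", "https://example.com/private/data")

# Seconds the site asks bots to wait between requests, if declared.
delay = parser.crawl_delay("my-spider")
```

Scrapy can perform a comparable check itself when its `ROBOTSTXT_OBEY` setting is enabled, so a manual check like this is mainly useful outside Scrapy.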
Pitfalls in Crawling
● JavaScript-heavy websites
○ Splash plugin
○ Selenium
● Default settings are not very friendly to website owners
○ Built-in AutoThrottle extension
● CAPTCHAs
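For the point about default settings, a politer Scrapy configuration might look like the fragment below. The setting names are Scrapy's own, but every value and the user-agent string are illustrative assumptions, not recommendations from the talk.

```python
# settings.py fragment for a Scrapy project (all values are illustrative).

# Respect the site's robots.txt rules.
ROBOTSTXT_OBEY = True

# Let the AutoThrottle extension adapt the delay to server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per server

# Baseline politeness limits.
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Identify the bot so site owners can get in touch (hypothetical contact URL).
USER_AGENT = "example-bot (+https://example.com/contact)"
```

Suitable values depend on the target site; the general idea is to start conservatively and only speed up if the server handles the load well.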
Why Yellow Pages?
Email Marketing for Customer Acquisition
Email Marketing for Customer Acquisition
Initial Approach
● Buy Email Lists
● Send via 3rd Parties
● Poor Quality
○ Non transparent
○ Generic emails
● Expensive
Crawling
● Scrapy + Beautiful Soup
● Over 500k Emails
● Quality Improvement
○ Categorized into segments
○ Targeted emails
● Cheap
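The "categorized into segments" step could look roughly like the sketch below: extract email addresses from scraped text and group them by business category. The records, category names, regular expression, and `segment_emails` helper are hypothetical; the talk's actual pipeline is in the linked repository and may differ.

```python
import re
from collections import defaultdict

# Deliberately simple pattern; real-world email matching has more edge cases.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical records as a spider might yield them.
SCRAPED_LISTINGS = [
    {"category": "restaurants",
     "text": "Cafe Alpha, contact: info@cafe-alpha.example"},
    {"category": "restaurants",
     "text": "Bistro Gamma, mail INFO@cafe-alpha.example or chef@bistro-gamma.example"},
    {"category": "bakeries",
     "text": "Bakery Beta: hello@bakery-beta.example"},
    {"category": "bakeries",
     "text": "No email listed"},
]


def segment_emails(listings):
    """Group de-duplicated, lower-cased emails by listing category."""
    segments = defaultdict(set)
    for listing in listings:
        for email in EMAIL_PATTERN.findall(listing["text"]):
            segments[listing["category"]].add(email.lower())
    # Sort for stable output.
    return {category: sorted(emails) for category, emails in segments.items()}


segments = segment_emails(SCRAPED_LISTINGS)
```

Grouping by category is what would allow targeted rather than generic emails, which is the quality improvement the slide points to.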
Connect
Nithish Raghunandanan
● nithishr1
● @nithishr
● www.ki-labs.com
Resources
● Scrapy Guide
○ https://doc.scrapy.org/en/latest/intro/tutorial.html
● Beautiful Soup Guide
○ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
● Crawling Etiquette
○ https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/
● Code
○ https://github.com/nithishr/meetup_scraping