[320] web 1: selenium · page a page b page c page d how to scrape a webpage graph? how to scrape a...
TRANSCRIPT
![Page 1: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/1.jpg)
[320] Web 1: SeleniumTyler Caraza-Harter
![Page 2: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/2.jpg)
Page A
Page B
Page C
Page D
how to scrape a webpage graph?
how to scrape a complicated page?
![Page 3: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/3.jpg)
Review Document Object Model
![Page 4: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/4.jpg)
url: http://domain/rsrc.html HTTP Response
HTTP/1.0 200 OK Content-Type: text/html; charset=utf-8 Content-Length: 74
<html> <head><script>...</script></head> <body> <h1>Welcome</h1> <a href="about.html">About</a> <a href="contact.html">Contact</a> </body> </html>
Browser
What does a web browser do when it gets some HTML in an HTTP response?
![Page 5: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/5.jpg)
url: http://domain/rsrc.html HTTP Response<html> <body> <h1>Welcome</h1> <a href="about.html">About</a> <a href="contact.html">Contact</a> </body> </html> HTTP/1.0 200 OK
Content-Type: text/html; charset=utf-8 Content-Length: 74
<html> <head><script>...</script></head> <body> <h1>Welcome</h1> <a href="about.html">About</a> <a href="contact.html">Contact</a> </body> </html>
![Page 6: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/6.jpg)
url: http://domain/rsrc.html HTTP Response
before displaying a page, the browser uses HTML to generate a Document
Object Model(DOM Tree)
HTTP/1.0 200 OK Content-Type: text/html; charset=utf-8 Content-Length: 74
<html> <head><script>...</script></head> <body> <h1>Welcome</h1> <a href="about.html">About</a> <a href="contact.html">Contact</a> </body> </html>
![Page 7: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/7.jpg)
url: http://domain/rsrc.html HTTP Response
html
body
ah1 a
vocab: elements
HTTP/1.0 200 OK Content-Type: text/html; charset=utf-8 Content-Length: 74
<html> <head><script>...</script></head> <body> <h1>Welcome</h1> <a href="about.html">About</a> <a href="contact.html">Contact</a> </body> </html>
![Page 8: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/8.jpg)
url: http://domain/rsrc.html HTTP Response
html
body
ah1 a
attr: hrefattr: href
Elements may contain• attributes
HTTP/1.0 200 OK Content-Type: text/html; charset=utf-8 Content-Length: 74
<html> <head><script>...</script></head> <body> <h1>Welcome</h1> <a href="about.html">About</a> <a href="contact.html">Contact</a> </body> </html>
![Page 9: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/9.jpg)
url: http://domain/rsrc.html HTTP Response
html
body
ah1 a
attr: hrefattr: href
ContactAboutWelcome
Elements may contain• attributes• text
HTTP/1.0 200 OK Content-Type: text/html; charset=utf-8 Content-Length: 74
<html> <head><script>...</script></head> <body> <h1>Welcome</h1> <a href="about.html">About</a> <a href="contact.html">Contact</a> </body> </html>
![Page 10: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/10.jpg)
url: http://domain/rsrc.html HTTP Response
html
ah1
attr: hrefattr: href
ContactAboutWelcome
Elements may contain• attributes• text• other elements
body
a
parent
child
HTTP/1.0 200 OK Content-Type: text/html; charset=utf-8 Content-Length: 74
<html> <head><script>...</script></head> <body> <h1>Welcome</h1> <a href="about.html">About</a> <a href="contact.html">Contact</a> </body> </html>
![Page 11: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/11.jpg)
url: http://domain/rsrc.html HTTP Response
html
ah1
attr: hrefattr: href
ContactAboutWelcome
body
a
parent
child
HTTP/1.0 200 OK Content-Type: text/html; charset=utf-8 Content-Length: 74
<html> <head><script>...</script></head> <body> <h1>Welcome</h1> <a href="about.html">About</a> <a href="contact.html">Contact</a> </body> </html>
table
JavaScript (if there's an engine to execute it) may directly edit the DOM!
original .html file doesn't change, but the result is equivalent
![Page 12: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/12.jpg)
url: http://domain/rsrc.html HTTP Response
Welcome
About Contact
browser renders (displays) the DOM tree, based on original file
and any JavaScript changes
HTTP/1.0 200 OK Content-Type: text/html; charset=utf-8 Content-Length: 74
<html> <head><script>...</script></head> <body> <h1>Welcome</h1> <a href="about.html">About</a> <a href="contact.html">Contact</a> </body> </html>
![Page 13: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/13.jpg)
Web Scraping: Simple and Complicated
![Page 14: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/14.jpg)
requests vs. Selenium
computer 2(Virtual Machine)
IP address: 18.216.110.65
computer 1(laptop)
index.html, please [GET]
<html> <body> <img src="A.png"> <b>Hello</b> <script src="B.js"> </script> </body> </html>
requests module- can fetch .html, .js, .etc file
Selenium- can fetch .html, .js, .etc file- can run a .js file in browser- can grab HTML version of DOM after JavaScript has modified it
Jupyter : import requests
r=requests.get(...)
Web Server
![Page 15: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/15.jpg)
requests vs. Selenium
computer 2(Virtual Machine)
IP address: 18.216.110.65
computer 1(laptop)
index.html, please [GET]
<html> <body> <img src="A.png"> <b>Hello</b> <script src="B.js"> </script> </body> </html>
requests module- can fetch .html, .js, .etc file
Selenium- can fetch .html, .js, .etc file- can run a .js file in browser- can grab HTML version of DOM after JavaScript has modified it
from selenium import webdriverdriver=webdriver.Chrome()
chromedriver
Web Server
note: Selenium is most commonly used for testing websites, but it works
great for tricky scraping too
![Page 16: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/16.jpg)
Tricky Pages
https://tyler.caraza-harter.com/cs320/f20/materials/lec-16/scrape.html
![Page 17: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/17.jpg)
Install
![Page 18: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/18.jpg)
Selenium Install
computer 1(laptop)
from selenium import webdriverdriver=webdriver.Chrome()
chromedriver
https://chromedriver.chromium.org/downloads
wget https://chromedriver.storage.googleapis.com/85.0.4183.87/chromedriver_linux64.zipunzip chromedriver_linux64.zipecho $PATHmv chromedriver ~/.local/bin/
sudo apt install chromium-browser
pip3 install selenium
trh@instance-1:/tmp$ chromium-browser --version Chromium 85.0.4183.83 Built on Ubuntu... trh@instance-1:/tmp$ chromedriver --version ChromeDriver 85.0.4183.87 (...)
Check...
![Page 19: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/19.jpg)
Why Drivers?
Python Java Ruby JavaScript
Python module for Selenium
Java module for Selenium
Ruby module for Selenium
JavaScript mod for Selenium
Chrome Driver Firefox Driver Edge Driver
![Page 20: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/20.jpg)
![Page 21: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/21.jpg)
Examples
![Page 22: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/22.jpg)
Starter Codefrom selenium import webdriver from selenium.webdriver.chrome.options import Options from selenium.common.exceptions import NoSuchElementException
options = Options() #options.headless = True b = webdriver.Chrome(options=options)
b.get(????)
print(b.page_source)
try: elem = b.find_element_by_id(element_id) print("found it") except NoSuchElementException: print("couldn't find it")
b.close()
open browser window
go to a URL
get HTML for current page (including JavaScript changes)
search for id=???? attributes
no such element
![Page 23: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/23.jpg)
Example 1a: Late Loading Table (page1.html)
https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page1.html
added after 1 second
![Page 24: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/24.jpg)
Example 1b: Headless Mode and Screenshotsfrom selenium import webdriver from selenium.webdriver.chrome.options import Options from selenium.common.exceptions import NoSuchElementException
options = Options() options.headless = True b = webdriver.Chrome(options=options)
b.get(????)
from IPython.core.display import Image b.save_screenshot("out.png") Image("out.png")
b.close()
![Page 25: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/25.jpg)
Example 2: Auto-Clicking Buttonsfrom selenium import webdriver from selenium.webdriver.chrome.options import Options from selenium.common.exceptions import NoSuchElementException
options = Options() options.headless = True b = webdriver.Chrome(options=options)
b.get(????)
btn = b.find_element_by_id("BTN_ID") btn.click()
b.close()
auto clickhttps://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page2.html
![Page 26: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/26.jpg)
Example 3: Entering Passwordsfrom selenium import webdriver from selenium.webdriver.chrome.options import Options from selenium.common.exceptions import NoSuchElementException
options = Options() options.headless = True b = webdriver.Chrome(options=options)
b.get(????)
pw = b.find_element_by_id("pw") pw.send_keys("fido")
b.close()
https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page3.html
![Page 27: [320] Web 1: Selenium · Page A Page B Page C Page D how to scrape a webpage graph? how to scrape a complicated page?](https://reader034.vdocuments.us/reader034/viewer/2022052100/6039986564b34c71e02207cb/html5/thumbnails/27.jpg)
Example 4: Many Queries
https://tyler.caraza-harter.com/cs320/s20/materials/lec-19/page4.html