![Page 1: Scraping in Python - Charlotte J. Lloyd · Scraping Process // Battle Plan u 1. Surveillance u Evaluate the page, learn the terrain. u 2. Plan of Attack u Brainstorm ways to approach](https://reader033.vdocuments.us/reader033/viewer/2022042914/5f4fd04488c95601cb5f1490/html5/thumbnails/1.jpg)
Scraping in Python
WORKSHOP 2 | CREATOR: CHARLOTTE LLOYD
Outline
I. Introduction to Python
II. Python Three Ways
III. Scraping Recap
IV. Workshop Example
V. Verify Data
VI. Celebration, Back-slapping
Introduction to Python
PART I
What is Python?
• general purpose
• high-level
• interpreted (not compiled)
• name is related to Monty Python
Very Popular Language
Check out the full infographic: http://blog.datacamp.com/wp-content/uploads/2015/05/R-vs-Python-216-2.png
Less Popular in Data Analysis
Check out the full infographic: http://blog.datacamp.com/wp-content/uploads/2015/05/R-vs-Python-216-2.png
Great Beginner Language
Check out the full infographic: http://blog.datacamp.com/wp-content/uploads/2015/05/R-vs-Python-216-2.png
Packages for Python
• Packages are bits of code that other people have built to extend Python functionality.
• If you install a package you will be able to use the additional commands that package has defined.
• Over 100,000 publicly listed packages, famously including:
  • numpy
  • scikit-learn
  • pandas
Python Three Ways
PART II
What is Anaconda?
• Anaconda is an “installation” of Python that includes:
  • package management
  • environment management
  • python distribution
• Anaconda pre-installs over 100 packages
Three Major Ways to Use Python
1. Command Line
2. “IDE”
3. Notebook
1. “Command Line” Python
A. Run an interactive session in a shell
  1. In Terminal (Mac) or PowerShell (PC):
    1. type “python”
    2. type “2+2”
  2. In qtconsole (Anaconda Navigator): [do nothing; the session is already running]
    • try typing “2+2”
B. Run a script (file)
  1. In Terminal (Mac) or PowerShell (PC): type “python file.py”
  2. In qtconsole (Anaconda Navigator): type “%load file.py” to pull the file's code into the prompt, then press Enter to run it (or type “%run file.py” to run the file directly)
2. Python in IDEs
• IDE (“integrated development environment”)
  • Spyder (provided in Anaconda)
  • PyCharm
  • Xcode (Macs)
• Write code (esp. multiple files) and easily execute within the IDE.
• Activity: Write a “helloworld” program in Spyder. Execute in both Spyder and Terminal/PowerShell.
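A minimal sketch of what the “helloworld” activity might look like. The file name `helloworld.py` and the `greet` function are illustrative assumptions, not taken from the workshop materials.

```python
# helloworld.py  (hypothetical file name for the activity)

def greet(name):
    """Return a greeting for the given name."""
    return "Hello, " + name + "!"


if __name__ == "__main__":
    # This part runs when the file is executed directly,
    # e.g. with `python helloworld.py` or the Run button in Spyder.
    print(greet("world"))
```

In Spyder, open the file and press Run. In Terminal or PowerShell, change to the folder that holds the file and type `python helloworld.py`.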
3. Python Notebooks
• web-based “interactive computational environment”
• very visual, very cool
• segmented into small cells of executable code
Hands-on Demo
• Open Anaconda Navigator. Open the Jupyter Notebook.
• Navigate to “handypy.ipynb” and open.
• Topics to be covered:
  • integers, floats, and strings
  • lists
  • for and while loops
  • conditionals
  • functions
  • reading and writing csv files
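The topics above can be previewed in one short sketch. This is not the content of “handypy.ipynb”; every variable name and value below is made up for illustration, and the CSV is written to an in-memory buffer so the example runs without touching any files.

```python
import csv
import io

# Integers, floats, and strings
count = 3              # int
price = 2.5            # float
label = "apples"       # str
total = count * price  # int * float gives a float

# Lists
fruits = ["apple", "banana", "cherry"]

# A for loop with a conditional: keep names longer than five letters
long_names = []
for fruit in fruits:
    if len(fruit) > 5:
        long_names.append(fruit)

# A while loop: count down from 3 to 1
countdown = []
n = 3
while n > 0:
    countdown.append(n)
    n -= 1

# A function
def describe(quantity, item):
    """Build a short description such as '3 apples'."""
    return str(quantity) + " " + item

# Writing CSV (for a real file, use open("fruits.csv", "w", newline=""))
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["fruit", "letters"])
for fruit in fruits:
    writer.writerow([fruit, len(fruit)])

# Reading CSV back: every cell comes back as a string
buffer.seek(0)
rows = list(csv.reader(buffer))
```

Note that the numbers written to the CSV come back as strings when read, so convert them with `int()` or `float()` before doing arithmetic.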
Scraping Recap
PART III
Programming Philosophy
• Concepts are key.
• Syntax is secondary.
• Stack Overflow is your friend.
What is scraping?
Scraping Process // Battle Plan
1. Surveillance
  • Evaluate the page, learn the terrain.
2. Plan of Attack
  • Brainstorm ways to approach the enemy.
3. Write code
  • Be willing to change your strategy if you encounter obstacles or see another “weakness” to exploit.
4. Emerge bloodied, yet victorious.
  • Verify the data before all that syntax evaporates from your short-term memory.
Workshop Example
PART IV
GOAL
• Scrape all text in the table as well as URLs to download files.
• Save data as a csv file that preserves the table format.
• Save URLs on separate lines in a txt file.
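The two saving steps in the goal could be sketched as follows, assuming the table has already been scraped into a list of rows and the links into a list of strings. The function names, the sample data, and the output file names `table.csv` and `urls.txt` are illustrative assumptions; the demo writes into a temporary folder so it does not clutter the working directory.

```python
import csv
import os
import tempfile

def save_table(rows, csv_path):
    """Write a list of rows (each a list of cell strings) to a CSV file."""
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

def save_urls(urls, txt_path):
    """Write one URL per line to a plain text file."""
    with open(txt_path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url + "\n")

# Made-up data standing in for the scraped table and links
example_rows = [
    ["Date", "Title"],
    ["2016-01-05", "January briefing, part 1"],
]
example_urls = [
    "http://example.com/files/jan.pdf",
    "http://example.com/files/feb.pdf",
]

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "table.csv")
txt_path = os.path.join(out_dir, "urls.txt")
save_table(example_rows, csv_path)
save_urls(example_urls, txt_path)
```

Using `csv.writer` rather than joining cells with commas by hand keeps the table format intact even when a cell contains a comma, as the second sample row does.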
Package: BeautifulSoup
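A rough sketch of how BeautifulSoup can pull table text and links out of HTML. It is not the code from “workshop2.ipynb”: the HTML snippet and `base_url` are made up, and the example assumes the `beautifulsoup4` package is installed (it is imported as `bs4`).

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page
html = """
<table>
  <tr><th>Date</th><th>File</th></tr>
  <tr><td>2016-01-05</td><td><a href="files/jan.pdf">January slides</a></td></tr>
  <tr><td>2016-02-09</td><td><a href="files/feb.pdf">February slides</a></td></tr>
</table>
"""
# Hypothetical address the HTML was fetched from
base_url = "http://example.com/reports/index.html"

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# One list of cell texts per table row, header row included
rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all(["th", "td"])
    rows.append([cell.get_text(strip=True) for cell in cells])

# The href of every link inside the table
urls = [a["href"] for a in table.find_all("a", href=True)]

# Relative links need the page address to become downloadable URLs
full_urls = [urljoin(base_url, u) for u in urls]
```

For a live page, the `html` string would come from a download of the page rather than a literal; the parsing steps stay the same.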
Hands-on Demo
• Open Anaconda Navigator. Open the Jupyter Notebook.
• Navigate to “workshop2.ipynb” and open.
• http://www.goes-r.gov/users/2016-OCONUS.html
Downloading Files from urls.txt
• Terminal
  • for i in `cat urls.txt`; do curl -O "$i"; done
• PowerShell (courtesy of Ryann & Keith!)
  • foreach ($file in Get-Content urls.txt) {echo "downloading $file"; curl -O $file}
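The same loop could also be written in Python with only the standard library. This is an alternative sketch, not part of the workshop materials; the function name `download_all` is an assumption, and the demo at the bottom uses a local `file://` URL so that it runs without a network connection.

```python
import os
import posixpath
import shutil
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

def download_all(url_file, dest_dir):
    """Download every URL listed (one per line) in url_file into dest_dir."""
    with open(url_file, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    saved = []
    for url in urls:
        # Name the local file after the last part of the URL path, like curl -O
        name = posixpath.basename(urlparse(url).path)
        target = os.path.join(dest_dir, name)
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        saved.append(target)
    return saved

# Offline demo: "download" a local file through a file:// URL
src_dir = tempfile.mkdtemp()
dest_dir = tempfile.mkdtemp()
sample = Path(src_dir) / "sample.txt"
sample.write_text("hello", encoding="utf-8")

url_list = Path(src_dir) / "urls.txt"
url_list.write_text(sample.as_uri() + "\n", encoding="utf-8")

saved = download_all(str(url_list), dest_dir)
```

With a real `urls.txt` of `http://` links, the call would be `download_all("urls.txt", ".")` to save into the current folder.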
Verify Data
PART V
Make notes, be organized
• This kind of piecemeal code is hard to come back to later, so if you’ll need it again, organize it and write yourself notes.
• SYNTAX WILL LEAK OUT OF YOUR BRAIN FASTER THAN ALL THE OTHER IMPORTANT THINGS THAT YOU HAVE ALREADY FORGOTTEN YOU WERE SUPPOSED TO REMEMBER. PLEASE BELIEVE ME THAT JUST A FEW SHORT WEEKS FROM NOW YOU WILL NOT KNOW WHY YOU DID THAT THING YOU DID OR WHAT THAT VARIABLE MEANS OR WHAT THAT FUNCTION DOES AND WHY YOU APPARENTLY PUT THAT THERE. AND WHY ISN’T THIS PACKAGE UPDATE WORKING WITH MY OLD CODE AND DO I EVEN HAVE THE CORRECT THINGS INSTALLED TO MAKE THIS ALL WORK AGAIN?
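One way to leave a note for your future self is to record which versions the code ran with. The sketch below is a suggestion rather than part of the workshop; the function name and the package list are assumptions, and it relies on `importlib.metadata`, which is in the standard library from Python 3.8 onward.

```python
import sys
from importlib import metadata

def describe_environment(package_names):
    """Return lines recording the Python version and each package's version."""
    lines = ["python " + sys.version.split()[0]]
    for name in package_names:
        try:
            lines.append(name + " " + metadata.version(name))
        except metadata.PackageNotFoundError:
            # Recorded rather than raised, so the note is still written
            lines.append(name + " (not installed)")
    return lines


if __name__ == "__main__":
    # Package names here are examples; list whatever your script imports
    for line in describe_environment(["beautifulsoup4", "pandas"]):
        print(line)
```

Pasting the printed lines into a comment at the top of the script, or saving them next to the data, makes it easier to rebuild a working setup later.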