scraping with geb
SERGIO DEL AMO @SDELAMO
GROOVYCALAMARI.COM
A “weekly” curated email newsletter full of interesting, relevant links about the Groovy Ecosystem
WHAT CAN YOU DO?
▸ PRICE-MONITORING WEBBOTS
▸ IMAGE-CAPTURING WEBBOTS
▸ LINK VERIFICATION WEBBOTS
▸ WEBBOTS THAT SEND EMAIL
▸ WEBBOTS THAT CONVERT A WEBSITE INTO AN API
▸ SNIPERS
EXAMPLE 1 CREW SCRAPING
HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_GR8CONF
java -jar gebwebbot_gr8conf-all.jar crew|sponsors destinationFolder outputFilename sqlite|plist|csv phantomjs.binary.path
Define the interesting parts of your pages in a concise, maintainable and extensible manner
GEB PAGES
EXAMPLE 2 SPONSORS SCRAPING
HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_GR8CONF
GEB EXAMPLE GRADLE
HTTPS://GITHUB.COM/GEB/GEB-EXAMPLE-GRADLE
The following commands will launch the tests with the individual browsers:
./gradlew chromeTest
./gradlew firefoxTest
./gradlew phantomJsTest
To run with all, you can run:
./gradlew test
MARCIN ERDMANN
EXAMPLE 3 PAGINATION
HTTPS://GITHUB.COM/SDELAMO/WEBBOT_GEB_MEETUP_MEMBERS
DYNAMIC URL
http://www.meetup.com/es-ES/madrid-gug/members/49149882/
BASE URL: http://www.meetup.com/
MEETUP GROUP SLUG: madrid-gug
MEMBER ID: 28938802
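The three parts above can be joined back into a member URL. The following is a minimal sketch, not code from the linked repository: `memberUrl` is a hypothetical helper name, and the `/<slug>/members/<id>/` layout is assumed from the dynamic URL shown on the slide (without the `es-ES` locale segment).

```groovy
// Hypothetical helper: joins base URL, group slug and member id into one URL.
String memberUrl(String baseUrl, String groupSlug, long memberId) {
    // Strip trailing slashes so the parts are joined with exactly one separator.
    def base = baseUrl.replaceAll('/+$', '')
    "${base}/${groupSlug}/members/${memberId}/"
}

println memberUrl('http://www.meetup.com/', 'madrid-gug', 28938802)
```

Iterating over a list of member ids and calling the helper for each one gives the webbot its list of pages to visit.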
SPLIT LOAD BETWEEN WEBBOTS
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
31 32 33 34 35
36 37 38 39 40
41 42 43 44 45
46 47 48 49 50
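def ids = 1..50
def webbotIndex = 3
def webbotsInParallel = 6
int total = ids.size()
// Round up: rounding down would leave ids 49 and 50 in a seventh chunk that no webbot picks up
def sublistsSize = (total + webbotsInParallel - 1).intdiv(webbotsInParallel)
// The chunk of ids this webbot is responsible for
def s = ids.collate(sublistsSize)[webbotIndex]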
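As a sketch of how the split could be checked, the same idea can be wrapped in a helper and run for every webbot index, confirming that each id is assigned exactly once. The helper name `idsFor` is hypothetical, and the ceiling division is an assumption made here so that no ids are left over when the total is not a multiple of the number of webbots.

```groovy
// Hypothetical helper: returns the ids assigned to one webbot.
List idsFor(List ids, int webbotIndex, int webbotsInParallel) {
    if (!ids) return []
    // Ceiling division, so the last ids still land in one of the chunks.
    int chunkSize = (ids.size() + webbotsInParallel - 1).intdiv(webbotsInParallel)
    def chunks = ids.collate(chunkSize)
    // A webbot whose index has no chunk gets an empty list instead of null.
    webbotIndex < chunks.size() ? chunks[webbotIndex] : []
}

def allIds = (1..50).toList()
def assigned = (0..<6).collect { idsFor(allIds, it, 6) }

// Every id appears once, in order, across the six webbots.
assert assigned.flatten() == allIds
println assigned[3]
```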
USER AGENT SPOOFING
HTTPS://GITHUB.COM/SDELAMO/GEBWEBBOT_USERAGENT
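To illustrate the idea only: the sketch below sets a custom `User-Agent` header with the plain JDK `java.net` API. It is not the Geb or PhantomJS configuration used in the repository above, and the user agent string is an assumed example value.

```groovy
// Assumption: an example desktop-browser user agent string; pick one that fits your case.
def userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0'

// openConnection() only creates the connection object; no request is sent yet.
def connection = new URL('http://www.meetup.com/madrid-gug').openConnection()

// Replace the default Java user agent with the spoofed one.
connection.setRequestProperty('User-Agent', userAgent)

println connection.getRequestProperty('User-Agent')
```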
STEALTH MEANS SIMULATING HUMAN PATTERNS
▸ BE KIND TO YOUR RESOURCES
▸ RUN YOUR WEBBOTS DURING BUSY HOURS
▸ DON’T RUN YOUR WEBBOTS AT THE SAME TIME EACH DAY
▸ DON’T RUN YOUR WEBBOT ON HOLIDAYS AND WEEKENDS
▸ USE RANDOM, INTRA-FETCH DELAYS
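Some of the guidelines above can be sketched in plain Groovy. The helper names, the 2 to 10 second delay bounds and the 9:00 to 17:59 "busy hours" window are all assumptions chosen for illustration; holidays are not handled and would need a calendar of your own.

```groovy
import java.time.DayOfWeek
import java.time.LocalDateTime

// Hypothetical bounds: a random pause of 2 to 10 seconds between fetches.
long randomDelayMillis(Random random, long minMillis = 2000, long maxMillis = 10000) {
    minMillis + (long) (random.nextDouble() * (maxMillis - minMillis))
}

// Hypothetical policy: weekdays only, inside the assumed busy-hours window.
boolean isGoodTimeToRun(LocalDateTime now) {
    boolean weekend = now.dayOfWeek in [DayOfWeek.SATURDAY, DayOfWeek.SUNDAY]
    !weekend && (now.hour in (9..17))
}

def random = new Random()
if (isGoodTimeToRun(LocalDateTime.now())) {
    // A real webbot would fetch a page here, then sleep for the random delay.
    println "Next fetch in ${randomDelayMillis(random)} ms"
} else {
    println 'Outside the allowed window, not running'
}
```

Adding a random offset to the daily start time as well would cover the guideline about not running at the same time each day.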