Ruby Robots
DESCRIPTION
A talk on building web robots in the Ruby programming language using rest-client, Nokogiri, and Mechanize, given at the RS on Rails event.
TRANSCRIPT
![Page 1: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/1.jpg)
Ruby Robots
http://www.flickr.com/photos/flysi/183272970
Daniel Cukier @danicuki
![Page 2: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/2.jpg)
![Page 3: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/3.jpg)
Relatives
• spiders
• crawlers
• bots
![Page 4: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/4.jpg)
Why robot?
![Page 5: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/5.jpg)
http://www.flickr.com/photos/nhankamer/5016628611
![Page 6: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/6.jpg)
require 'anemone'
Anemone.crawl(url) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end
http://www.cantora.mus.br/
http://www.cantora.mus.br/fotos
http://www.cantora.mus.br/?locale=en
http://www.cantora.mus.br/?locale=pt-BR
http://www.cantora.mus.br/musicas
http://www.cantora.mus.br/videos
http://www.cantora.mus.br/agenda
http://www.cantora.mus.br/novidades
http://www.cantora.mus.br/musicas/baixar
http://www.cantora.mus.br/visitors/baixar
http://www.cantora.mus.br/social
http://www.cantora.mus.br/fotos?locale=pt-BR
http://www.cantora.mus.br/musicas?locale=en
http://www.cantora.mus.br/fotos?locale=en
![Page 7: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/7.jpg)
![Page 8: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/8.jpg)
XPath
<html>
...
<div class="bla">
  <a>legal</a>
</div>
...
</html>
html_doc = Nokogiri::HTML(html)
info = html_doc.xpath("//div[@class='bla']/a")
info.text
=> "legal"
![Page 9: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/9.jpg)
XPath
<table id="super">
  <tr> <td>L1C1</td> <td>L1C2</td> </tr>
  <tr> <td>L2C1</td> <td>L2C2</td> </tr>
  <tr> <td>L3C1</td> <td>L3C2</td> </tr>
</table>
>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//table[@id='super']/tr")
>> info.size
=> 3
>> info[0].xpath("td").size
=> 2
>> info[2].xpath("td")[1].text
=> "L3C2"
![Page 10: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/10.jpg)
rest-client
![Page 11: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/11.jpg)
http://www.flickr.com/photos/amortize/766738216
GET
GET
![Page 12: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/12.jpg)
http://www.flickr.com/photos/abbeychristine/223898960
![Page 13: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/13.jpg)
Good bot
/robots.txt
User-agent: *
Disallow:
http://www.flickr.com/photos/temily/5645585162
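A good bot reads /robots.txt before crawling. Below is a minimal sketch of such a check, assuming only `User-agent` and `Disallow` lines matter (no `Allow`, wildcards, or `Crawl-delay`). The method names `disallowed_prefixes` and `allowed?` are hypothetical and not part of any library.

```ruby
# Collect the Disallow prefixes that apply to the given agent.
# Simplified: handles only User-agent and Disallow fields.
def disallowed_prefixes(robots_txt, agent = "*")
  prefixes = []
  applies = false
  in_agent_lines = false # true while reading consecutive User-agent lines
  robots_txt.each_line do |line|
    # Strip comments, then split "Field: value" on the first colon only.
    field, value = line.sub(/#.*/, "").strip.split(":", 2).map(&:strip)
    next if value.nil?
    case field.downcase
    when "user-agent"
      applies = false unless in_agent_lines
      applies ||= (value == "*" || value.casecmp?(agent))
      in_agent_lines = true
    when "disallow"
      in_agent_lines = false
      # An empty Disallow value means nothing is forbidden.
      prefixes << value if applies && !value.empty?
    else
      in_agent_lines = false
    end
  end
  prefixes
end

# True when no applicable Disallow prefix matches the path.
def allowed?(robots_txt, path, agent = "*")
  disallowed_prefixes(robots_txt, agent).none? { |prefix| path.start_with?(prefix) }
end
```

With the file shown on the slide (`User-agent: *` followed by an empty `Disallow:`), every path is allowed.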
![Page 14: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/14.jpg)
![Page 15: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/15.jpg)
http://www.flickr.com/photos/nephelim/5632618462
![Page 16: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/16.jpg)
![Page 17: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/17.jpg)
maxRowsList=16
WTF?
![Page 18: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/18.jpg)
![Page 19: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/19.jpg)
![Page 20: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/20.jpg)
>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 16
AHA!!!
http://.../artistas?maxRowsList=1600&filter=Recentes
>> body = RestClient.get(url)
>> json = JSON.parse(body)
>> content = json["Content"]
>> content.size
=> 1600
http://.../artistas?maxRowsList=1600000&filter=Recentes
>> content.size
=> 9154
Bingo!!!
![Page 21: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/21.jpg)
>> json["Content"].map {|c| c["ProfileUrl"]}
=> ["caravella", "tomleite", "jeffersontavares", "rodrigoaraujo", "jorgemendes", "bossapunk", "daviddepiro", "freetools", "ironia", "tiagorosa", "outprofile", "lucianokoscky", "bandateatraldecarona", "tlounge", "almanaque", "razzyoficial", "cretinosecanalhas", "cincorios", "ninoantunes", "caiocorsalette", "alinedelima", "thelio", "grupodomdesamba", "ladoz", "alexandrepontes", "poeiradgua", "betimalu", "leonardobessa", "kamaross", "marcusdocavaco", "atividadeinformal", "angelkeys", "locojohn", "forcamusic", "tiaguinhoabreu", "marcelonegrao", "jstonemghiphop", "uniaoglobal", "bandaefex", "severarock", "manitu", "sasso", "kakka", "xsopretty", "belepoke", "caixaazul", "wknd", "bandastarven", "bleiamusic", "3porcentoaocubo", "lucianoterra", "hipnoia", "influencianegra", "bandaursamaior", "mariafreitas", "jessejames", "vagnerrockxe", "stageo3", "lemoneight", "innocence", "dinda", "marcelocapela", "paulocamoeseoslusiadas", "magnussrock", "bandatheburk", "mercantes", "bandaturnerock", "flaviasaolli", "tonysagga", "thiagoponde", "centeio", "grupodeubranco", "bocadeleao", "eusoueliascardan", "notoriaoficial", "planomasterrock", "rofgod", "dreemonphc", "chicobrant", "osz", "bandalightspeed", "cavernadenarnia", "sergiobenevenuto", "viniciusdeoliveira", ...]
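The page-size trick above can be wrapped in two small helpers. This is only a sketch using the standard library. The method names are hypothetical, and the base URL is passed in because the real host is elided in the slides. The `Content` and `ProfileUrl` keys are the ones shown in the irb session.

```ruby
require "json"
require "uri"

# Build the listing URL with an explicit page size.
# `base` is supplied by the caller; the slides do not show the real host.
def listing_url(base, max_rows, filter = "Recentes")
  "#{base}?#{URI.encode_www_form(maxRowsList: max_rows, filter: filter)}"
end

# Pull the profile slugs out of a JSON response body.
def profile_urls(body)
  JSON.parse(body).fetch("Content", []).map { |entry| entry["ProfileUrl"] }
end
```

The body passed to `profile_urls` would be whatever `RestClient.get(listing_url(...))` returns, as in the session above.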
![Page 22: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/22.jpg)
email?
phone?
![Page 23: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/23.jpg)
![Page 24: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/24.jpg)
>> html = RestClient.get("http://.../robomacaco")
>> html_doc = Nokogiri::HTML(html)
>> info = html_doc.xpath("//span[@class='name']")
>> info.text
=> "[email protected] DE JANEIRO - RJ - Brasil21 9675-0199"
![Page 25: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/25.jpg)
![Page 26: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/26.jpg)
cookies
cookies = {}
c = "s_nr=12954999; s_v19=12978609471; ... __utmc=206845458"
# split on the first "=" only, since cookie values may contain "="
cook = c.split(";").map {|i| i.strip.split("=", 2)}
cook.each {|u| cookies[u[0]] = u[1]}
RestClient.get(url, :cookies => cookies)
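The same cookie parsing can be packaged as a reusable helper. This is a sketch only, and `parse_cookie_header` is a hypothetical name. It splits each pair on the first `=` so values that themselves contain `=` survive, and it skips empty fragments.

```ruby
# Turn a browser "Cookie" header string into a Hash of name => value.
def parse_cookie_header(header)
  header.split(";").each_with_object({}) do |pair, jar|
    name, value = pair.strip.split("=", 2)
    next if name.nil? || name.empty? # skip empty fragments such as a trailing ";"
    jar[name] = value.to_s
  end
end
```

The resulting Hash can be passed as `:cookies => jar` in the `RestClient.get` call shown on the slide.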
![Page 27: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/27.jpg)
Proxies
![Page 28: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/28.jpg)
http://www.ip-adress.com/proxy_list
![Page 29: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/29.jpg)
![Page 30: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/30.jpg)
![Page 31: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/31.jpg)
![Page 32: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/32.jpg)
>> response = RestClient.get(url)
>> html_doc = Nokogiri::HTML(response)
>> table = html_doc.xpath("//table[@class='proxylist']")
>> lines = table.children
>> lines.shift # drop the header row
>> lines[1].text
=> "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"
IP WTF?
![Page 33: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/33.jpg)
<script type="text/javascript"> z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;</script>
![Page 34: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/34.jpg)
JAVASCRIPT = RUBY
http://www.flickr.com/photos/drics/4266471776/
![Page 35: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/35.jpg)
<script type="text/javascript"> z=5;i=8;x=4;l=1;o=9;q=6;n=3;u=2;k=7;r=0;</script>
>> script = html_doc.xpath("//script")[1]
>> eval script.text
>> z
=> 5
>> i
=> 8
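Running `eval` on text downloaded from a third-party site executes whatever that site sends. A more cautious alternative, which is not what the slides do, is to read the assignments with a regular expression. The sketch below assumes the script only contains single-letter variables assigned single digits, as in the example. `js_digit_table` is a hypothetical name.

```ruby
# Parse "z=5;i=8;..." into { "z" => 5, "i" => 8, ... } without eval.
def js_digit_table(script_text)
  script_text.scan(/\b([a-z])\s*=\s*(\d)\s*;/i)
             .map { |name, digit| [name, digit.to_i] }
             .to_h
end
```

Unlike `eval`, this also works outside irb, where local variables created inside `eval` are not visible to the surrounding code.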
![Page 36: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/36.jpg)
>> lines[1].text
=> "208.52.144.55 document.write(\":\"+i+r+i+r) anonymous proxy server-2 minutes ago United States"
>> digits = lines[1].text.split(")")[0].split("+")
=> ["208.52.144.55 document.write(\":\"", "i", "r", "i", "r"]
>> digits.shift
>> digits
=> ["i", "r", "i", "r"]
>> port = digits.map {|c| eval(c)}.join("")
=> "8080"
>> server = lines[1].text.split[0]
=> "208.52.144.55"
Voilà
RestClient.proxy = "http://#{server}:#{port}"
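The server and port extraction can be combined into one helper that avoids `eval`. This sketch assumes the variable-to-digit mapping is already available as a Hash (for the example page, `i` is 8 and `r` is 0) and that each row has the shape shown above. `proxy_url` is a hypothetical name.

```ruby
# Build "http://host:port" from one proxy-list row.
# `digits` maps JavaScript variable names to digits, e.g. { "i" => 8, "r" => 0 }.
def proxy_url(row_text, digits)
  server = row_text.split.first
  # Capture the argument of document.write(...), e.g. ":"+i+r+i+r
  expression = row_text[/document\.write\((.*?)\)/, 1]
  return nil if server.nil? || expression.nil?
  # Drop the leading ":" literal, then look each variable up in the table.
  port = expression.split("+").drop(1).map { |name| digits.fetch(name.strip).to_s }.join
  "http://#{server}:#{port}"
end
```

The returned string can be assigned to `RestClient.proxy` as on the slide.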
![Page 37: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/37.jpg)
mechanize
require 'mechanize'
agent = Mechanize.new
site = "http://www.cantora.mus.br"
# fetch the download page and fill in the visitor form
page = agent.get("#{site}/baixar")
form = page.form
form['visitor[name]'] = 'daniel'
form['visitor[email]'] = "[email protected]"
page = agent.submit(form)
# follow every link that points to a track and save the file
tracks = page.links.select { |l| l.href =~ /track/ }
tracks.each do |t|
  file = agent.get("#{site}#{t.href}")
  file.save
end
![Page 38: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/38.jpg)
protection techniques
javascript
text as image
captcha
don’t be naive
![Page 39: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/39.jpg)
captcha
YES you can!
prove you are not a robot
![Page 40: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/40.jpg)
3 steps
1. Download image
2. Filter image
3. Run OCR software
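The three steps can be sketched as below. This is an assumption-heavy illustration, not the author's code. It assumes the ImageMagick `convert` and Tesseract `tesseract` command-line tools are installed, picks a plain grayscale-plus-threshold filter, and uses hypothetical method and file names. Real captchas usually need a filter tuned to their noise.

```ruby
require "open-uri"
require "open3"

# Step 1: download the captcha image (image_url is supplied by the caller).
def download_image(image_url, path)
  File.binwrite(path, URI.open(image_url, &:read))
  path
end

# Step 2: ImageMagick command that converts to grayscale and thresholds to black and white.
def filter_command(input, output)
  ["convert", input, "-colorspace", "Gray", "-threshold", "50%", output]
end

# Step 3: Tesseract command that prints the recognized text to standard output.
def ocr_command(image)
  ["tesseract", image, "stdout"]
end

# Run all three steps and return the recognized text.
def solve_captcha(image_url, workdir = ".")
  raw = download_image(image_url, File.join(workdir, "captcha.png"))
  clean = File.join(workdir, "captcha_clean.png")
  system(*filter_command(raw, clean)) or raise "image filtering failed"
  text, status = Open3.capture2(*ocr_command(clean))
  raise "OCR failed" unless status.success?
  text.strip
end
```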
![Page 41: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/41.jpg)
Good Luck!
![Page 42: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/42.jpg)
scaling
http://www.flickr.com/photos/liquene/3330714590
![Page 43: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/43.jpg)
clouds
$ knife ec2 server create
![Page 44: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/44.jpg)
threads + queues
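One way to combine threads and queues is a fixed pool of workers pulling jobs from a shared `Queue`. The sketch below is an assumption about what the slide has in mind. `process_in_parallel` is a hypothetical name, and the block stands in for whatever fetch-and-parse step the robot performs. It relies on `Queue#close` (Ruby 2.3+), and items must not be `nil` or `false`.

```ruby
# Apply the block to every item using a fixed number of worker threads.
# Returns [item, result] pairs in completion order.
def process_in_parallel(items, worker_count = 4, &job)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # pop returns nil once the closed queue is empty

  results = Queue.new
  workers = Array.new(worker_count) do
    Thread.new do
      while (item = queue.pop)
        results << [item, job.call(item)]
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end
```

For example, `process_in_parallel(urls, 8) { |u| RestClient.get(u) }` would fetch pages with eight workers, assuming rest-client is loaded.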
![Page 45: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/45.jpg)
![Page 46: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/46.jpg)
In this crazy programmer's life
All sorts of situations come my way
Suddenly a client, with a hefty offer
To grab information from a site
You're crazy, I don't do that kind of crime
If you want, I have some friends down south
Do it for me and I'll pay you with this cool jewel

I'll give you a ruby
For you to steal
With your robot

Want to build a robot?
Just use Ruby
Just use Ruby
To build a robot
http://www.flickr.com/photos/jobafunky/5572503988
![Page 47: Ruby Robots](https://reader033.vdocuments.us/reader033/viewer/2022052310/5563ed92d8b42a2a3a8b5738/html5/thumbnails/47.jpg)
Thank you
Daniel Cukier @danicuki