Mining the Web for Information Using Hadoop
TRANSCRIPT
Slide 1
Photo: Someday Soon (Flickr)
Mining the Web with Hadoop
Steve Watt, Emerging Technologies @ HP
Slide 2
Photo: timsnell (Flickr)
Slide 3
Gathering Data
Data Marketplaces
Slide 4
Slide 5
Slide 6
Gathering Data
Apache Nutch (Web Crawler)
Slide 7
Tech Bubble?
What does the Data Say?
Photo: Pascal Terjan (Flickr)
Slide 8
Slide 9
Slide 10
Using Apache Nutch
Identify optimal seed URLs for a seed list & crawl to a depth of 2.
For example:
http://www.crunchbase.com/companies?c=a&q=private_held
http://www.crunchbase.com/companies?c=b&q=private_held
http://www.crunchbase.com/companies?c=c&q=private_held
http://www.crunchbase.com/companies?c=d&q=private_held
. . .
Crawl data is stored in sequence files in the segments directory on HDFS.
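The seed list above is just the CrunchBase listing page for each starting letter, so it can be generated rather than typed by hand. A minimal sketch (the URL pattern is the one shown above; Nutch expects the seed file to contain one URL per line):

```python
import string

# Generate one CrunchBase listing URL per starting letter for Nutch's
# seed file; the "?c=<letter>&q=private_held" pattern is from the slide.
BASE = "http://www.crunchbase.com/companies?c={letter}&q=private_held"

def build_seed_list():
    """Return one listing URL per letter a-z."""
    return [BASE.format(letter=c) for c in string.ascii_lowercase]

# Write the result to the seed file, one URL per line:
# print("\n".join(build_seed_list()))
```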
Slide 11
ALSO
Slide 12
Making the Data STRUCTURED
• Retrieving HTML
• Preliminary filtering on URL
• Populate a Company POJO, then write it out tab-delimited (\t)
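The filter, parse, and emit steps can be sketched outside Hadoop. This is an illustrative stand-in, not the actual MapReduce job: the URL pattern and the HTML regexes are hypothetical, and the real job read Nutch sequence files and extracted many more fields:

```python
import re

# Keep only CrunchBase company pages, pull a couple of fields out of
# the HTML, and emit one tab-delimited record per page. The regexes
# are hypothetical; the real job parsed the actual CrunchBase markup.
COMPANY_URL = re.compile(r"^http://www\.crunchbase\.com/company/")

def to_record(url, html):
    """Return a tab-delimited company record, or None if filtered out."""
    if not COMPANY_URL.match(url):
        return None  # preliminary filtering on URL
    name = re.search(r"<h1[^>]*>([^<]+)</h1>", html)
    city = re.search(r'class="city"[^>]*>([^<]+)<', html)
    if not (name and city):
        return None
    return "\t".join([name.group(1).strip(), city.group(1).strip()])
```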
Slide 13
| Company | City | State | Country | Sector | Round | Day | Month | Year | Amount | Investors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InfoChimps | Austin | TX | USA | Enterprise | Angel | 14 | 9 | 2010 | 350000 | Stage One Capital |
| InfoChimps | Austin | TX | USA | Enterprise | A | 7 | 11 | 2010 | 1200000 | DFJ Mercury |
| MassRelevance | Austin | TX | USA | Enterprise | A | 20 | 12 | 2010 | 2200000 | Floodgate, AV, etc. |
| Masher | Calabasas | CA | USA | Games_Video | Seed | 0 | 2 | 2009 | 175000 | |
| Masher | Calabasas | CA | USA | Games_Video | Angel | 11 | 8 | 2009 | 300000 | Tech Coast Angels |

The result? Tab-delimited structured data…
Note: I dropped the ZipCode because it didn’t occur consistently.
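Once the data is tab-delimited, downstream aggregations like the per-city totals on the following slides become trivial. A small sketch using two of the rows above:

```python
from collections import defaultdict

# Toy aggregation over the tab-delimited output above: total dollars
# raised per city. The two sample rows are copied from the table.
SAMPLE = (
    "InfoChimps\tAustin\tTX\tUSA\tEnterprise\tAngel\t14\t9\t2010\t350000\tStage One Capital\n"
    "InfoChimps\tAustin\tTX\tUSA\tEnterprise\tA\t7\t11\t2010\t1200000\tDFJ Mercury\n"
)

def totals_by_city(tsv_text):
    """Sum the Amount column (index 9) keyed by City (index 1)."""
    totals = defaultdict(int)
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        totals[fields[1]] += int(fields[9])
    return dict(totals)

# totals_by_city(SAMPLE) -> {"Austin": 1550000}
```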
Slide 14
Time to Analyze/Visualize the Data…
Step 1: Select the right visual encoding for your questions.
Let’s start by asking questions & seeing what we can learn from some simple bar charts…
Slide 15
Total Tech Investments by Year
Slide 16
Total Tech Investments by Year
Slide 17
Investment Funding by Sector
Slide 18
Total Investments By Zip Code for all Sectors
$7.3 Billion in San Francisco
$2.9 Billion in Mountain View
$1.2 Billion in Boston
$1.7 Billion in Austin
Slide 19
Slide 20
Total Investments By Zip Code for Consumer Web
$1.2 Billion in Chicago
$600 Million in Seattle
$1.7 Billion in San Francisco
Slide 21
Total Investments By Zip Code for BioTech
$1.3 Billion in Cambridge
$528 Million in Dallas
$1.1 Billion in San Diego
Slide 22
HP Confidential
Geospatial Encoding of Data
Slide 23
Steve’s Not-So-Excellent Adventure
• Let’s try a choropleth encoding of the distribution of investment income by county
• Wait, what is GeoJSON?
• OK, each GeoJSON county is mapped to some code
• Each county code has a value that corresponds to a palette color
• So what are these codes? FIPS codes? But Google returns 3- and 5-digit codes?!
• I found a 5-digit code list, and it has a LOT of codes in it. I’m going to assume it’s correct, because there is no way I can manually verify all of them
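The 3- vs. 5-digit confusion has a simple explanation: a 5-digit county FIPS code is the 2-digit state code concatenated with a 3-digit county code. A sketch (the sample codes are real FIPS values):

```python
# A 5-digit county FIPS code is the 2-digit state code followed by a
# 3-digit county code -- hence both 3- and 5-digit lists in the wild.
# Keep the result as a string, or leading zeros are silently lost.
def county_fips(state_code: int, county_code: int) -> str:
    return f"{state_code:02d}{county_code:03d}"

# California is state 06; Santa Clara County is county 085 within it:
# county_fips(6, 85) -> "06085"
```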
Slide 24
Generating Investment Income By County
FIPS = LOAD 'data/fips.txt' USING PigStorage('\t') AS (City, State, FIPSCode);
Amt = LOAD 'data/equity.txt' USING PigStorage('\t') AS (City, State, Amount);
AmtGroup = GROUP Amt BY (City, State);
SumGroup = FOREACH AmtGroup GENERATE FLATTEN(group) AS (City, State), SUM(Amt.Amount) AS Total;
JoinGroup = JOIN SumGroup BY (City, State), FIPS BY (City, State);
Final = FOREACH JoinGroup GENERATE FIPS::FIPSCode, SumGroup::Total;
RESULT:
51234	5000000
16234	1234000
(...)
ALWAYS, ALWAYS check your output…
Slide 25
But wait, why are there duplicate records?
Apparently some cities can actually belong to two counties… I guess I’ll pick one.
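The “pick one” fix can be sketched as keeping the first FIPS code seen per (city, state) pair; the tuple layout here is illustrative:

```python
# After the join, a city that spans two counties produces two records.
# The slide's pragmatic fix -- pick one -- sketched as "keep the first
# FIPS code seen" per (city, state); the tuple layout is illustrative.
def dedupe(rows):
    """rows: iterable of (city, state, fips, amount) tuples."""
    seen, out = set(), []
    for city, state, fips, amount in rows:
        if (city, state) in seen:
            continue  # a second county for the same city: drop it
        seen.add((city, state))
        out.append((city, state, fips, amount))
    return out
```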
Slide 26
Yay, no duplicates. Let’s visualize this!
• Wait, what happened to California?
• Aaargh, I stored the FIPS codes in Pig as ints instead of chararrays, which trimmed off the leading zero. OK, I add them back. Voilà! We have California.
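The fix amounts to re-padding the codes to five digits before matching them against the GeoJSON. A one-liner sketch:

```python
# California's state FIPS is "06"; loaded as an int it becomes 6, so
# "6085" matches no GeoJSON county. Re-pad to five digits as a string:
def restore_fips(code: int) -> str:
    return f"{code:05d}"

# restore_fips(6085) -> "06085"
```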
Slide 27
On Error Checking…
• Crowd-sourced data has LOADS of errors in it that can actually influence your results. You need a good system that helps identify those errors:
• Santa Clara, Ca
• Santa, Clara
• Santa, Clara CA
• Track (count) input and output records. Examine the results. Something fishy?
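The count-in/count-out check, plus a best-effort normalization of the messy city/state strings above, might look like the following; the parsing rules are illustrative only:

```python
# Best-effort (city, state) normalization for messy crowd-sourced
# strings like those above, plus the in/out record counts the slide
# recommends tracking. The parsing rules are illustrative only.
def normalize(raw):
    parts = raw.replace(",", " ").split()
    if len(parts) < 2:
        return None          # unparseable: drop, but it stays counted
    *city, state = parts
    if len(state) != 2:
        return None          # no trailing 2-letter state code
    return (" ".join(city).title(), state.upper())

def run(records):
    out = [n for n in map(normalize, records) if n is not None]
    # If input and output counts diverge sharply, something is fishy.
    print(f"in={len(records)} out={len(out)}")
    return out
```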
Slide 28