new building mini-google in ruby - 123seminarsonly.com · 2011. 11. 30. · building mini-google in...

Building Mini-Google in Ruby @igrigorik #railsconf http://bit.ly/railsconf-pagerank Building Mini-Google in Ruby Ilya Grigorik @igrigorik


Building Mini-Google in Ruby

Ilya Grigorik

@igrigorik


postrank.com/topic/ruby

The slides… Twitter My blog


Ruby + Math
Optimization

PageRank

Indexing
Examples
Misc Fun


PageRank
PageRank + Ruby

Indexing
Examples
Tools

+ Optimization


Consume with care… everything that follows is based on released / public domain info


Search-engine graveyard
Google did pretty well…


Search pipeline
50,000-foot view

Query: Ruby

Results

1. Crawl 2. Index 3. Rank


Bah Fun Interesting


circa 1997-1998

CPU Speed: 333 MHz
RAM: 32-64 MB

Index: 27,000,000 documents
Index refresh: once a month~ish
PageRank computation: several days

Laptop CPU: 2.1 GHz
VM RAM: 1 GB
1-million-page web: ~10 minutes


Creating & Maintaining an Inverted Index
DIY and the gotchas within


Building an Inverted Index

require 'set'

pages = {
  "1" => "it is what it is",
  "2" => "what is it",
  "3" => "it is a banana"
}

index = {}

pages.each do |page, content|
  content.split(/\s/).each do |word|
    if index[word]
      index[word] << page
    else
      # Set.new expects an enumerable, so wrap the page id in an array
      index[word] = Set.new([page])
    end
  end
end

{
  "it"     => #<Set: {"1", "2", "3"}>,
  "a"      => #<Set: {"3"}>,
  "banana" => #<Set: {"3"}>,
  "what"   => #<Set: {"1", "2"}>,
  "is"     => #<Set: {"1", "2", "3"}>
}


Word => [Document]


Querying the index

# query: "what is banana"
p index["what"] & index["is"] & index["banana"]
# > #<Set: {}>

# query: "a banana"
p index["a"] & index["banana"]
# > #<Set: {"3"}>

# query: "what is"
p index["what"] & index["is"]
# > #<Set: {"1", "2"}>
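The intersections above generalize to a query with any number of terms. A minimal sketch, assuming the same index contents as on the slide; `search` and `SAMPLE_INDEX` are hypothetical names introduced here for illustration, not part of the talk:

```ruby
require 'set'

# Index contents from the slide, rebuilt here so the sketch is self-contained.
SAMPLE_INDEX = {
  "it"     => Set["1", "2", "3"],
  "is"     => Set["1", "2", "3"],
  "what"   => Set["1", "2"],
  "a"      => Set["3"],
  "banana" => Set["3"]
}

# Hypothetical helper: AND-query over whitespace-separated terms.
# A term that is missing from the index matches no documents.
def search(index, query)
  postings = query.split.map { |word| index.fetch(word, Set.new) }
  return Set.new if postings.empty?

  postings.reduce { |hits, docs| hits & docs }
end

p search(SAMPLE_INDEX, "what is")
p search(SAMPLE_INDEX, "a banana")
p search(SAMPLE_INDEX, "what is banana")
```

Using `fetch` with an empty-set default keeps unknown words from raising on `nil & set`.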


What order?

[1, 2] or [2,1]


Hmmm?

PDF, HTML, RSS?
Lowercase / Upcase?
Compact Index?
Stop words?
Persistence?
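Two of these gotchas, case and stop words, can be handled with a small normalization step before indexing. A rough sketch under stated assumptions: the stop-word list is illustrative only, and `tokenize` and `build_index` are hypothetical helper names, not from the talk:

```ruby
require 'set'

# Illustrative stop-word list; a real one would be much longer.
STOP_WORDS = Set["a", "is", "it", "the"]

# Hypothetical helper: lowercase, keep alphanumeric runs, drop stop words.
def tokenize(text)
  text.downcase.scan(/[a-z0-9]+/).reject { |word| STOP_WORDS.include?(word) }
end

# Same inverted index as before, built from normalized tokens.
def build_index(pages)
  index = Hash.new { |hash, word| hash[word] = Set.new }
  pages.each do |page, content|
    tokenize(content).each { |word| index[word] << page }
  end
  index
end

normalized_index = build_index(
  "1" => "It is what it is",
  "2" => "What is it?",
  "3" => "It is a BANANA!"
)

p normalized_index.keys
```

The index shrinks to the two informative words, and "What" and "what" now land in the same posting set.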


Ferret is a high-performance, full-featured text search engine library written for Ruby


require 'ferret'
include Ferret

index = Index::Index.new()

index << {:title => "1", :content => "it is what it is"}
index << {:title => "2", :content => "what is it"}
index << {:title => "3", :content => "it is a banana"}

index.search_each('content:"banana"') do |id, score|
  puts "Score: #{score}, #{index[id][:title]} "
end

> Score: 1.0, 3


Hmmm?


class Ferret::Analysis::Analyzer
class Ferret::Analysis::AsciiLetterAnalyzer
class Ferret::Analysis::AsciiLetterTokenizer
class Ferret::Analysis::AsciiLowerCaseFilter
class Ferret::Analysis::AsciiStandardAnalyzer
class Ferret::Analysis::AsciiStandardTokenizer
class Ferret::Analysis::AsciiWhiteSpaceAnalyzer
class Ferret::Analysis::AsciiWhiteSpaceTokenizer
class Ferret::Analysis::HyphenFilter
class Ferret::Analysis::LetterAnalyzer
class Ferret::Analysis::LetterTokenizer
class Ferret::Analysis::LowerCaseFilter
class Ferret::Analysis::MappingFilter
class Ferret::Analysis::PerFieldAnalyzer
class Ferret::Analysis::RegExpAnalyzer
class Ferret::Analysis::RegExpTokenizer
class Ferret::Analysis::StandardAnalyzer
class Ferret::Analysis::StandardTokenizer
class Ferret::Analysis::StemFilter
class Ferret::Analysis::StopFilter
class Ferret::Analysis::Token
class Ferret::Analysis::TokenStream
class Ferret::Analysis::WhiteSpaceAnalyzer
class Ferret::Analysis::WhiteSpaceTokenizer

class Ferret::Search::BooleanQuery
class Ferret::Search::ConstantScoreQuery
class Ferret::Search::Explanation
class Ferret::Search::Filter
class Ferret::Search::FilteredQuery
class Ferret::Search::FuzzyQuery
class Ferret::Search::Hit
class Ferret::Search::MatchAllQuery
class Ferret::Search::MultiSearcher
class Ferret::Search::MultiTermQuery
class Ferret::Search::PhraseQuery
class Ferret::Search::PrefixQuery
class Ferret::Search::Query
class Ferret::Search::QueryFilter
class Ferret::Search::RangeFilter
class Ferret::Search::RangeQuery
class Ferret::Search::Searcher
class Ferret::Search::Sort
class Ferret::Search::SortField
class Ferret::Search::TermQuery
class Ferret::Search::TopDocs
class Ferret::Search::TypedRangeFilter
class Ferret::Search::TypedRangeQuery
class Ferret::Search::WildcardQuery


ferret.davebalmain.com/trac


Ranking Results
0-60 with PageRank…


Naïve: Term Frequency

index.search_each('content:"the brown cow"') do |id, score|
  puts "Score: #{score}, #{index[id][:title]} "
end

> Score: 0.827, 3
> Score: 0.523, 5
> Score: 0.125, 4

Relevance?

Term     Doc 3   Doc 5   Doc 4
the        4       3       5
brown      1       3       1
cow        1       4       1
Score      6      10       7


TF-IDF
Term Frequency * Inverse Document Frequency

Skew

Term     Doc 3   Doc 5   Doc 4
the        4       3       5
brown      1       3       1
cow        1       4       1

Total # of documents: 10

Term     # of docs containing the term
the        6
brown      3
cow        4

Score = TF * IDF

TF = # occurrences / # words
IDF = ln(# docs / # docs with W)


TF-IDF

Total # of documents: 10
# words in document: 10

Doc # 3 score for ‘the’: 4/10 * ln(10/6) = 0.204
Doc # 3 score for ‘brown’: 1/10 * ln(10/3) = 0.120
Doc # 3 score for ‘cow’: 1/10 * ln(10/4) = 0.092

Score = 0.204 + 0.120 + 0.092 = 0.416
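The worked example for Doc # 3 can be reproduced in a few lines. This is a sketch that only re-does the arithmetic above (natural log for the IDF, as in the example); `tf_idf` and the constant names are hypothetical:

```ruby
# TF-IDF for a single term in a single document.
def tf_idf(occurrences, words_in_doc, docs_with_term, total_docs)
  tf  = occurrences.to_f / words_in_doc
  idf = Math.log(total_docs.to_f / docs_with_term) # natural log, as in the worked example
  tf * idf
end

TOTAL_DOCS     = 10
WORDS_IN_DOC_3 = 10
DOC_3_COUNTS   = { "the" => 4, "brown" => 1, "cow" => 1 }
DOCS_WITH_TERM = { "the" => 6, "brown" => 3, "cow" => 4 }

doc_3_scores = DOC_3_COUNTS.map { |term, count|
  [term, tf_idf(count, WORDS_IN_DOC_3, DOCS_WITH_TERM[term], TOTAL_DOCS)]
}.to_h

doc_3_scores.each { |term, score| puts format("%-5s %.3f", term, score) }
puts format("total %.3f", doc_3_scores.values.sum)
```

Note how ‘the’ still contributes the most for this document, but far less than its raw count of 4 would suggest, because it appears in 6 of the 10 documents.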


Frequency Matrix

         W1   W2   …   WN
Doc 1    15   23   …
Doc 2    24   12   …
…        …    …    …
Doc K

Size = N * K * size of Ruby object

Ouch.

Pages = N = 10,000
Words = K = 2,000
Ruby Object = 20+ bytes

Footprint = 384 MB


NArray
http://narray.rubyforge.org/

NArray is a Numerical N-dimensional Array class (implemented in C)

NArray.new(typecode, size, ...)   # create new NArray. initialize with 0.
NArray.byte(size,...)             # 1 byte unsigned integer
NArray.sint(size,...)             # 2 byte signed integer
NArray.int(size,...)              # 4 byte signed integer
NArray.sfloat(size,...)           # single precision float
NArray.float(size,...)            # double precision float
NArray.scomplex(size,...)         # single precision complex
NArray.complex(size,...)          # double precision complex
NArray.object(size,...)           # Ruby object


PageRank
the google juice

Links as votes

Problem: link gaming


Random Surfer
powerful abstraction

Follow link from page he/she is currently on: P = 0.85

Teleport to a random location on the web: P = 0.15
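The random-surfer abstraction can be simulated directly. Below is a rough Monte Carlo sketch, not how PageRank is computed later in the talk: the three-page web, the page names, and the `surf` helper are all hypothetical, and a page with no out-links is assumed to always teleport.

```ruby
# Hypothetical three-page web: page => pages it links to.
TINY_WEB = {
  "A" => ["B", "C"],
  "B" => ["C"],
  "C" => ["C"]
}

FOLLOW_PROBABILITY = 0.85 # otherwise teleport (P = 0.15)

# Simulate one surfer for `steps` moves and return the fraction of
# moves that ended on each page.
def surf(links, steps, rng)
  pages  = links.keys
  visits = Hash.new(0)
  page   = pages.sample(random: rng)

  steps.times do
    out_links = links[page]
    page = if rng.rand < FOLLOW_PROBABILITY && !out_links.empty?
             out_links.sample(random: rng) # follow a link
           else
             pages.sample(random: rng)     # teleport anywhere
           end
    visits[page] += 1
  end

  visits.transform_values { |count| count.to_f / steps }
end

visit_shares = surf(TINY_WEB, 100_000, Random.new(42))
visit_shares.sort.each { |page, share| puts format("%s %.3f", page, share) }
```

The longer the surfer runs, the closer the visit shares get to the page probabilities that the rest of the talk derives analytically.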


Surfin’
rinse & repeat, ad nauseam

On Page P, clicks on link to K (P = 0.85)

On Page K, clicks on link to M (P = 0.85)

On Page M, teleports to X (P = 0.15)


Analyzing the Web Graph
extracting PageRank

[Figure: web graph of pages N, M, K, and X, annotated with P = 0.6, P = 0.15, P = 0.20, and P = 0.05]


What is PageRank?
It’s a scalar!


What is PageRank?
it’s a probability!

[Figure: the web graph of pages N, M, K, and X, annotated with P = 0.6, P = 0.15, P = 0.20, and P = 0.05]

Higher Pr, Higher Importance?


Teleportation?
sci-fi fans, … ?


Reasons for teleportation
enumerating edge cases

1. No in-links!
2. No out-links!
3. Isolated Web


Exploring Graphs
gratr.rubyforge.com

• Breadth First Search
• Depth First Search
• A* Search
• Lexicographic Search
• Dijkstra’s Algorithm
• Floyd-Warshall
• Triangulation and Comparability detection

require 'gratr/import'

dg = Digraph[1,2, 2,3, 2,4, 4,5, 6,4, 1,6]

dg.directed?   # true
dg.vertex?(4)  # true
dg.edge?(2,4)  # true
dg.vertices    # [5, 6, 1, 2, 3, 4]

Graph[1,2,1,3,1,4,2,5].bfs # [1, 2, 3, 4, 5]
Graph[1,2,1,3,1,4,2,5].dfs # [1, 2, 5, 3, 4]
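For readers without gratr installed, the breadth-first traversal shown above can be sketched in plain Ruby. This is an illustrative re-implementation over an adjacency hash, not gratr's own code; `adjacency` and `bfs` are hypothetical helpers:

```ruby
# Edges of the undirected graph Graph[1,2, 1,3, 1,4, 2,5] from the slide.
SAMPLE_EDGES = [[1, 2], [1, 3], [1, 4], [2, 5]]

# Build vertex => neighbours, adding each edge in both directions.
def adjacency(edges)
  graph = Hash.new { |hash, vertex| hash[vertex] = [] }
  edges.each do |a, b|
    graph[a] << b
    graph[b] << a
  end
  graph
end

# Breadth-first traversal: visit a vertex, then its neighbours, level by level.
def bfs(graph, start)
  visited = [start]
  queue   = [start]
  until queue.empty?
    vertex = queue.shift
    graph[vertex].each do |neighbour|
      next if visited.include?(neighbour)

      visited << neighbour
      queue << neighbour
    end
  end
  visited
end

p bfs(adjacency(SAMPLE_EDGES), 1) # => [1, 2, 3, 4, 5]
```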


Teleportation
probabilities

P(T) = 0.15 / # of pages

[Figure: five pages, each labeled P(T) = 0.03]


PageRank: Simplified Mathematical Def’n
cause that’s how we roll

Assume the web is N pages big.
Assume that the probability of teleportation (t) is 0.15, and of following a link (s) is 0.85.
Assume that the teleportation probability (E) is uniform.
Assume that you start on any random page (uniform distribution L).

$L = \begin{bmatrix} 1/N \\ \vdots \\ 1/N \end{bmatrix}, \qquad T = \begin{bmatrix} 0.15/N \\ \vdots \\ 0.15/N \end{bmatrix}$

Then after one step, the probability you’re on page X is:

$L \cdot (sG + tE)$

$L \cdot (0.85 \cdot G + 0.15 \cdot E)$


G = The Link Graph
ginormous and sparse

      1   2   …   …   N
1     1   0   …   …   0
2     0   1   …   …   1
…     …   …   …   …   …
…     …   …   …   …   …
N     0   1   …   …   1

Link Graph
No link from 1 to N

Huge!


G as a dictionary
more compact…

# Page => Links to…
{
  "1" => [25, 26],
  "2" => [1],
  "5" => [123, 2],
  "6" => [67, 1]
}
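To feed a link dictionary like this into the matrix math, it has to be expanded into link-following probabilities. A minimal sketch, assuming each page splits its vote equally among its out-links; the three-page web and the `transition_matrix` helper are hypothetical, chosen small enough to print:

```ruby
# Hypothetical 3-page web in dictionary form: page => pages it links to.
SMALL_LINKS = {
  1 => [2, 3],
  2 => [3],
  3 => [3]
}

# Build the N x N matrix G where G[to][from] is the probability of
# following a link from page `from` to page `to` (each column sums to 1).
def transition_matrix(links)
  pages    = links.keys.sort
  position = pages.each_with_index.to_h
  g = Array.new(pages.size) { Array.new(pages.size, 0.0) }

  links.each do |from, targets|
    targets.each do |to|
      g[position[to]][position[from]] += 1.0 / targets.size
    end
  end
  g
end

transition_matrix(SMALL_LINKS).each { |row| p row }
```

The dictionary stays the storage format; the dense matrix is only practical for small webs.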


Computing PageRank
the tedious way

Follow link from page he/she is currently on.

Teleport to a random location on the web.

Page K
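One reading of "the tedious way" is to repeat the surfer's step until the page probabilities stop changing (often called power iteration; that name is not from the slides). A sketch under that assumption, with a hypothetical 3-page link matrix whose columns hold the link-following probabilities:

```ruby
# Hypothetical link matrix: LINK_MATRIX[to][from] is the probability of
# following a link from page `from` to page `to` (columns sum to 1).
LINK_MATRIX = [
  [0.0, 0.0, 0.0],
  [0.5, 0.0, 0.0],
  [0.5, 1.0, 1.0]
]

# Repeat the surfer's step (follow with probability s, teleport with
# probability t = 1 - s) until the distribution over pages settles.
def pagerank_by_iteration(g, s: 0.85, tolerance: 1e-10, max_steps: 1_000)
  n = g.size
  t = 1.0 - s
  rank = Array.new(n, 1.0 / n) # start on a uniformly random page

  max_steps.times do
    updated = Array.new(n) do |to|
      followed = (0...n).sum { |from| g[to][from] * rank[from] }
      s * followed + t / n # uniform teleportation
    end
    change = updated.zip(rank).sum { |a, b| (a - b).abs }
    rank = updated
    break if change < tolerance
  end

  rank
end

iterated_rank = pagerank_by_iteration(LINK_MATRIX)
p iterated_rank.map { |x| x.round(3) }
```

Each pass costs one matrix-vector product, which is why this scales to sparse graphs where inverting a matrix would not.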


Computing PageRank
in one swoop

$q = t\,(I - sG)^{-1} E = \begin{bmatrix} P_1 \\ \vdots \\ P_n \end{bmatrix}$

where $I$ is the identity matrix.

Don’t trust me! Verify it yourself!
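One way to verify it, sketched here under the assumption that $q$ and $E$ are column vectors (with $E$ the uniform vector whose entries are $1/N$) and that $G$ holds link-following probabilities with columns summing to 1: at steady state, taking one more step leaves $q$ unchanged.

```latex
\begin{aligned}
q &= s\,G\,q + t\,E \\
q - s\,G\,q &= t\,E \\
(I - sG)\,q &= t\,E \\
q &= t\,(I - sG)^{-1}E
\end{aligned}
```

The inverse exists because $s < 1$: every eigenvalue of $sG$ has magnitude at most $s$, so $I - sG$ has no zero eigenvalue.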


Enough hand-waving, dammit!
show me the code


Birth of EM-Proxy
flash of the obvious

Hot, Fast, Awesome


http://rb-gsl.rubyforge.org/

Click there! … Give yourself a weekend.


http://ruby-gsl.sourceforge.net/


PageRank in Ruby
6 lines, or less

require "gsl"
include GSL

# INPUT: link structure matrix (NxN)
# OUTPUT: pagerank scores
def pagerank(g)
  raise if g.size1 != g.size2                  # Verify NxN

  i = Matrix.I(g.size1)                        # identity matrix
  p = (1.0/g.size1) * Matrix.ones(g.size1,1)   # teleportation vector

  # Constants…
  s = 0.85                                     # probability of following a link
  t = 1-s                                      # probability of teleportation

  t*((i-s*g).invert)*p                         # PageRank!
end
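If installing GSL is a hurdle, the same closed form can be tried with Ruby's standard `matrix` library. This is a sketch of an alternative, not the talk's implementation; `pagerank_stdlib` is a hypothetical name, and the input is the matrix from the "All roads lead to K" example later in the deck:

```ruby
require 'matrix' # Ruby standard library (a bundled gem since Ruby 3.1)

# Same closed form, q = t * (I - s*G)^-1 * p, without GSL.
def pagerank_stdlib(g, s: 0.85)
  raise ArgumentError, "link matrix must be square" unless g.square?

  n = g.row_count
  t = 1.0 - s
  identity = Matrix.identity(n)
  teleport = Matrix.column_vector(Array.new(n, 1.0 / n)) # uniform teleportation

  ((identity - g * s).inverse * teleport * t).column(0).to_a
end

all_roads = Matrix[[0, 0, 0], [0.5, 0, 0], [0.5, 1, 1]]
stdlib_rank = pagerank_stdlib(all_roads)
p stdlib_rank.map { |x| x.round(3) }
```

The explicit inverse is fine for toy matrices; for anything large, a solver or the iterative approach is the usual choice.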


Ex: Circular Web
testing intuition…

pagerank(Matrix[[0,0,1], [0,0,1], [1,0,0]])
> [0.33, 0.33, 0.33]

[Figure: pages N, K, and X, each with P = 0.33]


Ex: All roads lead to K
testing intuition…

pagerank(Matrix[[0,0,0], [0.5,0,0], [0.5,1,1]])
> [0.05, 0.07, 0.87]

[Figure: pages N, K, and X, with P = 0.05, P = 0.07, and P = 0.87]


PageRank + Ferret
awesome search, ftw!


require 'ferret'
include Ferret

index = Index::Index.new()

# Store PageRank
index << {:title => "1", :content => "it is what it is", :pr => 0.05 }
index << {:title => "2", :content => "what is it", :pr => 0.07 }
index << {:title => "3", :content => "it is a banana", :pr => 0.87 }

[Figure: pages 1, 2, and 3, with P = 0.05, P = 0.07, and P = 0.87]


# TF-IDF Search
index.search_each('content:"world"') do |id, score|
  puts "Score: #{score}, #{index[id][:title]} (PR: #{index[id][:pr]})"
end

puts "*" * 50

# PageRank FTW!
sf_pr = Search::SortField.new(:pr, :type => :float, :reverse => true)

index.search_each('content:"world"', :sort => sf_pr) do |id, score|
  puts "Score: #{score}, #{index[id][:title]}, (PR: #{index[id][:pr]})"
end

# Others:
# Score: 0.267119228839874, 3 (PR: 0.87)
# Score: 0.17807948589325, 1 (PR: 0.05)
# Score: 0.17807948589325, 2 (PR: 0.07)
# ***********************************
# Google:
# Score: 0.267119228839874, 3, (PR: 0.87)
# Score: 0.17807948589325, 2, (PR: 0.07)
# Score: 0.17807948589325, 1, (PR: 0.05)

Search*: Graphs are ubiquitous!
PageRank is a general-purpose hammer

PageRank + Social Graph: GitHub

Username        GitCred
==============================
37signals       10.00
imbriaco         9.76
why              8.74
rails            8.56
defunkt          8.17
technoweenie     7.83
jeresig          7.60
mojombo          7.51
yui              7.34
drnic            7.34
pjhyett          6.91
wycats           6.85
dhh              6.84

http://bit.ly/3YQPU

PageRank + Social Graph: Twitter

Hmm…

Analyze the social graph:
- Filter messages by ‘TwitterRank’
- Suggest users by ‘TwitterRank’
- …
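A minimal sketch of what a ‘TwitterRank’ could look like, assuming a follow counts as a link from follower to followee. The graph, the messages, and the 0.2 cut-off are invented for illustration, and the power iteration is written in plain Ruby rather than with RB-GSL to keep the example self-contained.

```ruby
# Hypothetical follower graph: each key follows the users in its array,
# so an edge points from follower to followee (a "vote" for the followee).
FOLLOWS = {
  "alice" => ["bob", "carol"],
  "bob"   => ["carol"],
  "carol" => ["alice"],
  "dave"  => ["carol"]
}.freeze

# Plain power iteration; damping 0.85 leaves 0.15 for teleportation.
def pagerank(graph, damping: 0.85, iterations: 50)
  nodes = (graph.keys + graph.values.flatten).uniq
  n = nodes.size.to_f
  rank = nodes.to_h { |node| [node, 1.0 / n] }

  iterations.times do
    # Every node starts with its share of the teleportation mass.
    nxt = nodes.to_h { |node| [node, (1.0 - damping) / n] }
    nodes.each do |node|
      out = graph.fetch(node, [])
      if out.empty?
        # Dangling node: spread its rank evenly over all nodes.
        nodes.each { |other| nxt[other] += damping * rank[node] / n }
      else
        out.each { |other| nxt[other] += damping * rank[node] / out.size }
      end
    end
    rank = nxt
  end
  rank
end

TWITTER_RANK = pagerank(FOLLOWS)

# Suggest users: highest rank first.
SUGGESTED = TWITTER_RANK.sort_by { |_, r| -r }.map(&:first)

# Filter messages: keep those whose author clears a made-up threshold.
MESSAGES = [
  { user: "carol", text: "PageRank is a hammer" },
  { user: "dave",  text: "first!" }
].freeze
FILTERED = MESSAGES.select { |msg| TWITTER_RANK[msg[:user]] >= 0.2 }

puts SUGGESTED.inspect
puts FILTERED.map { |msg| msg[:text] }.inspect
```

Nobody follows dave, so he keeps only his teleportation share and his message is filtered out, while carol, followed by three of the four users, ranks first.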

PageRank + Product Graph: E-commerce

Link items purchased in the same cart, then run PageRank on the resulting graph.
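One way to build that graph, sketched under assumptions: the carts are invented, co-purchase links are treated as undirected, and the number of shared carts is used as an edge weight (the slide does not say whether edges are weighted).

```ruby
# Hypothetical order history: each inner array is one shopping cart.
CARTS = [
  %w[laptop mouse sleeve],
  %w[laptop mouse],
  %w[laptop charger],
  %w[phone charger]
].freeze

# Link every pair of items that shared a cart. The graph is undirected,
# so each pair is recorded in both directions; the count is the edge weight.
def copurchase_graph(carts)
  graph = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  carts.each do |cart|
    cart.uniq.combination(2) do |a, b|
      graph[a][b] += 1
      graph[b][a] += 1
    end
  end
  graph
end

# Weighted power iteration: a node passes rank along each edge in
# proportion to that edge's weight.
def weighted_pagerank(graph, damping: 0.85, iterations: 50)
  nodes = graph.keys
  n = nodes.size.to_f
  rank = nodes.to_h { |node| [node, 1.0 / n] }
  iterations.times do
    nxt = nodes.to_h { |node| [node, (1.0 - damping) / n] }
    graph.each do |node, edges|
      total = edges.values.sum.to_f
      edges.each { |other, weight| nxt[other] += damping * rank[node] * weight / total }
    end
    rank = nxt
  end
  rank
end

PRODUCT_GRAPH = copurchase_graph(CARTS)
PRODUCT_RANK = weighted_pagerank(PRODUCT_GRAPH)
TOP_PRODUCT = PRODUCT_RANK.max_by { |_, r| r }.first

PRODUCT_RANK.sort_by { |_, r| -r }.each { |item, r| puts format("%-8s %.3f", item, r) }
```

Items that only ever appear alone in a cart never enter the graph, so they get no rank at all; whether that is acceptable depends on the catalogue.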

PageRank = Powerful Hammer: use it!

Personalization: how would you do it?

PageRank + Personalization: customize the teleportation vector

$$T = \begin{bmatrix} \frac{0.15}{N} \\ \vdots \\ \frac{0.15}{N} \end{bmatrix}$$

The teleportation distribution doesn’t have to be uniform!

yahoo.com is my homepage!
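A rough sketch of the idea, assuming the same 0.85 damping factor (0.15 teleportation) and a toy link graph invented for the example. The only change from ordinary PageRank is that the teleportation mass is distributed by a caller-supplied vector instead of evenly; `personalized_pagerank` and the page names other than yahoo.com are hypothetical.

```ruby
# Hypothetical link graph: page => pages it links to.
LINKS = {
  "yahoo.com" => ["news.com"],
  "news.com"  => ["blog.com", "yahoo.com"],
  "blog.com"  => ["news.com"],
  "other.com" => ["news.com"]
}.freeze

# teleport maps each page to the probability of jumping there;
# the probabilities are expected to sum to 1.
def personalized_pagerank(graph, teleport, damping: 0.85, iterations: 100)
  nodes = graph.keys
  rank = nodes.to_h { |node| [node, 1.0 / nodes.size] }
  iterations.times do
    nxt = nodes.to_h { |node| [node, (1.0 - damping) * teleport.fetch(node, 0.0)] }
    graph.each do |node, out|
      out.each { |target| nxt[target] += damping * rank[node] / out.size }
    end
    rank = nxt
  end
  rank
end

# Uniform vector: every page gets 1/N of the teleportation mass.
UNIFORM = LINKS.keys.to_h { |page| [page, 1.0 / LINKS.size] }
# Personalized vector: every random jump lands on my homepage.
MY_HOMEPAGE = { "yahoo.com" => 1.0 }.freeze

GLOBAL_RANK = personalized_pagerank(LINKS, UNIFORM)
MY_RANK = personalized_pagerank(LINKS, MY_HOMEPAGE)

LINKS.each_key do |page|
  puts format("%-10s global %.3f  personal %.3f", page, GLOBAL_RANK[page], MY_RANK[page])
end
```

With all teleportation aimed at yahoo.com, its rank rises and a page with no inbound links (other.com here) drops to zero, because it can no longer be reached by a random jump.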

Gaming PageRank: for fun and profit (I don’t endorse it)

Make pages with links!

http://bit.ly/pagerank-spam

Questions?

Slides: http://bit.ly/railsconf-pagerank

Ferret: http://bit.ly/ferret
RB-GSL: http://bit.ly/rb-gsl

PageRank on Wikipedia: http://bit.ly/wp-pagerank
Gaming PageRank: http://bit.ly/pagerank-spam

Michael Nielsen’s lectures on PageRank: http://michaelnielsen.org/blog