Bloom Filters, Minhashes, and Other Random Stuff Brian Brubach University of Maryland, College Park StringBio 2018, University of Central Florida


Bloom Filters, Minhashes, and Other Random Stuff

Brian Brubach
University of Maryland, College Park

StringBio 2018, University of Central Florida

What?

• Probabilistic
• Space-efficient
• Fast
• Not exact

Why?

• Data deluge / Big data / Massive data
• Millions or billions of sequences
• Human genome: 3 Gbp
  • 1 giga base pair (Gbp) = 1 billion characters
• Microbiome sample of 1.6 billion 100 bp reads generated in 10.8 days (Caporaso, et al., 2012)
• Medium data, but on a laptop
  • Lots of bioinformatics happens here
• Beyond scalability of BWT, FM-index, etc.

(Berger, Daniels, and Yu, 2016)

Curse of Dimensionality

• Sequences are compared in high dimensional space
• Comparing $N$ sequences takes $N^2$ time
• Computing edit distance between two sequences of length $n$ takes $n^2$ time
  • Allegedly

Curse of Dimensionality

• ATGATCGAGGCTATGCGACCGATCGATCGATTCGTA
• ATGATGGAGGCTATGGGAACGATCGATCGACTCGTA
• ATGATCGAGGCTATGCCACCGATCGAACGATTCGTA
• ATCATCGAGGCTATGCGACCGTTCGATCGATTCCTA
• GTGATCGTGGCTATGCGACCGATCGATCGATTCGTC
• ATGATCGAGGCTATGCCACCGATCGAACGATTCGTA
• ATGATCCAGGCTATGCGACCGATCGATGCATTCGTA

Why Stay in High Dimensions?

• $4^{100}$ possible DNA strings of length 100
• $4^{15} \approx 1$ billion reads

$k$-mers of a Sequence

• All substrings of length $k$
• Canonical: lexicographically smallest among forward and reverse complement
  • Forget this for now

Sequence: ATCTGAGGTCAC
All 7-mers: ATCTGAG, TCTGAGG, CTGAGGT, TGAGGTC, GAGGTCA, AGGTCAC
Reverse complement of ATCTGAGGTCAC: GTGACCTCAGAT
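The definitions on this slide can be sketched in Python. This is a minimal illustration rather than code from the talk; the names `kmers`, `reverse_complement`, and `canonical` are assumptions chosen for the example.

```python
# Illustrative helpers; the names are assumptions, not from the talk.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def kmers(seq, k):
    """Return all substrings of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))

seq = "ATCTGAGGTCAC"
print(kmers(seq, 7))
print(reverse_complement(seq))
```

Run on the slide's sequence, this reproduces the six 7-mers and the reverse complement listed above.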

Hash function

String → Hash Magic → Random integer in $\{1, \ldots, m\}$

• Will assume idealized model of hashing for this talk
• Lots of research in this area

Bloom Filter Example Problem

• Store a large set of $N$ $k$-mers
• Query $k$-mers against it for exact matches
• Want speed and space-efficiency
• How can we address this with hashing?
  • Put $k$-mers in hash table
  • At least $2Nk$ bits for data plus table overhead
• What if we just store one bit at each hash for presence/absence?
  • Simple Bloom filter, potentially suboptimal

Bloom Filter

• Probabilistic data structure
• Fast and space-efficient
• False positives, but no false negatives
• Insert and contains, but no delete
• Due to Burton Howard Bloom in 1970
  • Gave example of automatic hyphenation
  • Identify the 10% of words that require special hyphenation rules

Bloom Filter

• $N$ items to store: $x_1, x_2, \ldots, x_N$
• $m$-bit vector
• $d$ hash functions: $h_1, h_2, \ldots, h_d$
• Insert($x$): set bits $h_1(x), h_2(x), \ldots, h_d(x)$ to 1
• Contains($y$):
  • Yes if bits $h_1(y), h_2(y), \ldots, h_d(y)$ are 1
  • No if any are 0
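The insert and contains operations above can be sketched as a small Python class. This is only an illustration under stated assumptions: the class name `BloomFilter` is hypothetical, and the $d$ hash functions are simulated by salting BLAKE2b with the function index, which is a convenience for the example and not something the talk prescribes.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m-bit vector and d hash functions (illustrative)."""

    def __init__(self, m, d):
        self.m = m
        self.d = d
        self.bits = bytearray(m)  # one byte per bit, for readability only

    def _positions(self, item):
        # Assumption: simulate d hash functions by salting BLAKE2b with i.
        for i in range(self.d):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def contains(self, item):
        # Yes only if every probed bit is 1; a single 0 proves absence.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1000, d=3)
bf.insert("ATCTGAG")
print(bf.contains("ATCTGAG"))  # True: inserted items are never missed
```

A production filter would pack the bits and use a faster non-cryptographic hash; the byte-per-bit layout here trades space for clarity.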

Bloom Filter Example

• $m = 10$, $d = 3$, hash functions: $h_1, h_2, h_3$
• Start (empty): 0 0 0 0 0 0 0 0 0 0
• Insert($x_1$): set bits $h_1(x_1), h_2(x_1), h_3(x_1)$: 0 1 1 0 0 0 1 0 0 0
• Insert($x_2$): set bits $h_1(x_2), h_2(x_2), h_3(x_2)$: 0 1 1 0 0 0 1 1 0 1
• Contains($x_2$): check bits $h_1(x_2), h_2(x_2), h_3(x_2)$
• Contains($y$): check bits $h_1(y), h_2(y), h_3(y)$: False Positive!

False Positive probability

• Pr[one hash misses a bit]
  • $1 - \frac{1}{m}$
• Pr[one insertion misses a bit]
  • $\left(1 - \frac{1}{m}\right)^{d}$
• Pr[all insertions miss a bit]
  • $\left(1 - \frac{1}{m}\right)^{dN}$
• Pr[a single bit flipped to 1]
  • $1 - \left(1 - \frac{1}{m}\right)^{dN} \approx 1 - e^{-dN/m}$
• False positive probability (assuming independence)
  • $\left(1 - e^{-dN/m}\right)^{d}$
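As a quick numerical check of the derivation, the exact expression and its exponential approximation can be compared directly. The function names and the sizes `m`, `d`, `n` below are hypothetical values picked for illustration, and both functions rely on the same independence assumption as the slide.

```python
import math

def fp_exact(m, d, n):
    """(1 - (1 - 1/m)^(d*n))^d, assuming independent bits."""
    return (1 - (1 - 1 / m) ** (d * n)) ** d

def fp_approx(m, d, n):
    """(1 - e^(-d*n/m))^d, the large-m approximation."""
    return (1 - math.exp(-d * n / m)) ** d

# Hypothetical sizes chosen only for illustration.
m, d, n = 10_000, 7, 1_000
print(fp_exact(m, d, n), fp_approx(m, d, n))
```

For these sizes the two values agree closely, at a little under 1%.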

Optimal parameters

• False positive rate $p \approx \left(1 - e^{-dN/m}\right)^{d}$
• False positives minimized at $d = \frac{m}{N} \ln 2$
• Bits per item $\frac{m}{N} \approx -\frac{\log_2 p}{\ln 2} \approx -1.44 \log_2 p$
  • Approximate: assuming asymptotic, independence, and integrality of $d$
  • $p = 0.01$ needs 9.59 bits per item
  • $p = 0.001$ needs 14.38 bits per item
• Number of hashes $d \approx -\log_2 p$
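The bits-per-item figures on this slide follow directly from the formulas. A small sketch, with hypothetical function names, that evaluates them:

```python
import math

def bits_per_item(p):
    """m/N ~ -log2(p) / ln 2 at the optimal number of hashes."""
    return -math.log2(p) / math.log(2)

def optimal_hashes(p):
    """d ~ -log2(p); a real filter rounds this to an integer."""
    return -math.log2(p)

for p in (0.01, 0.001):
    print(p, round(bits_per_item(p), 2), round(optimal_hashes(p), 2))
```

This gives 9.59 bits per item for $p = 0.01$ and 14.38 for $p = 0.001$, matching the slide, with roughly 7 and 10 hash functions respectively.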

Properties

• Insert and check in $O(d)$ time
  • Independent of number of items inserted
• Fast and parallel to compute hashes
• Can do union and intersection with OR and AND of bit vectors
• Can estimate $N$ if unknown
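The set operations and the size estimate can be sketched with Python integers standing in for bit vectors. Two assumptions are worth flagging: both filters must share the same $m$ and hash functions, and the estimator below is simply the approximation "fraction of set bits $\approx 1 - e^{-dN/m}$" solved for $N$, not a formula given in the talk.

```python
import math

def union(bits_a, bits_b):
    """Bitwise OR: the Bloom filter of the union of the two sets."""
    return bits_a | bits_b

def intersection(bits_a, bits_b):
    """Bitwise AND: keeps every common item, possibly with extra bits set."""
    return bits_a & bits_b

def estimate_n(bits, m, d):
    """Solve (fraction of set bits) ~ 1 - e^(-d*N/m) for N.

    Undefined when every bit is set (the logarithm diverges).
    """
    set_bits = bin(bits).count("1")
    return -(m / d) * math.log(1 - set_bits / m)

a = 0b0110001000  # hypothetical 10-bit filters stored as Python ints
b = 0b0000001101
print(bin(union(a, b)), bin(intersection(a, b)))
print(estimate_n(a, m=10, d=3))
```

With three of ten bits set and $d = 3$, the estimate comes out slightly above one item, which is consistent with a single insertion.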

Endless Variations

• Deletions
• Counting
• Bloomier filters: storing values
• Cache optimizations
• Distance sensitive: is $x$ close to the set

$k$-mer Bloom Filter

• Can we do better if we know the items are $k$-mers from a genome?
• Observation: the “items” are overlapping substrings from a 4-letter alphabet
• After getting positive,
  • Check all 4 preceding $k$-mers and all 4 following $k$-mers
  • One must be in the set for a true positive
  • False positive next to another positive less likely
• Can reduce false positives or space
  • (Pellow, Filippova, and Kingsford, 2017)

Example: for ATCC, the preceding $k$-mers are xATC and the following $k$-mers are TCCx
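The neighbor check can be sketched as a wrapper around any membership test. This is a simplified illustration of the idea on the slide, not the method of the cited paper: the function names are hypothetical, a Python set stands in for the Bloom filter, and edge cases (for example a set holding a single $k$-mer) are ignored.

```python
def neighbors(kmer, alphabet="ACGT"):
    """The 4 possible preceding and 4 possible following k-mers."""
    before = [x + kmer[:-1] for x in alphabet]
    after = [kmer[1:] + x for x in alphabet]
    return before + after

def contains_with_neighbors(kmer, contains):
    """Accept a positive only if some overlapping neighbor is also positive.

    `contains` is any membership test, e.g. a Bloom filter query.
    """
    if not contains(kmer):
        return False
    return any(contains(nb) for nb in neighbors(kmer))

stored = {"ATCC", "TCCG", "CCGA"}  # stand-in for a Bloom filter
print(contains_with_neighbors("TCCG", stored.__contains__))
print(contains_with_neighbors("ATCC", stored.__contains__))
```

Each query costs up to eight extra lookups, which is the price paid for the lower false positive rate.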

Bio Applications

• Pan-genome storage
  • Bloom filter trie (Holley, Wittler, and Stoye, 2015)
• Short-read RNA-seq database
  • Split Sequence Bloom tree (Solomon and Kingsford, 2016)
• Succinct de Bruijn graphs
  • Probabilistic de Bruijn graph (Pell, et al., 2011)
  • Exact version (Chikhi and Rizk, 2012)
    • Human genome: 3 Gbp, $k = 27$, 3.7 GB, 13.2 bits per vertex

Locality Sensitive Hashing (LSH)

• What do we typically want to avoid when hashing?
  • Collisions!
• Approximate nearest neighbors: towards removing the curse of dimensionality (Indyk and Motwani, 1998)
  • Idea: get similar elements to hash together
  • “Its key ingredient is the notion of locality-sensitive hashing which may be of independent interest; …”

Comparing Two Sequences

• Mash: fast genome and metagenome distance estimation using MinHash (Ondov et al., 2016)
• Let $A$ and $B$ be two DNA sequences to compare
• Construct $k$-mer sets $A$ and $B$
  • Assume $|A| = |B|$ for now (not true)
• Compare the sets somehow
  • Not faster yet, but we’ll get there…

Jaccard Index

• Similarity between sets $A$ and $B$
  • $\frac{|A \cap B|}{|A \cup B|}$
• Correlated with Average Nucleotide Identity (ANI)
  • Empirical support, but debatable
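Computed exactly, the Jaccard index of two $k$-mer sets looks like the sketch below. The function names and the choice $k = 7$ are assumptions for illustration; the two input strings are taken from the example sequences shown earlier in the talk.

```python
def kmer_set(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two sets (taken as 1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Two of the example sequences shown earlier, with a hypothetical k = 7.
A = kmer_set("ATGATCGAGGCTATGCGACCGATCGATCGATTCGTA", 7)
B = kmer_set("ATGATGGAGGCTATGGGAACGATCGATCGACTCGTA", 7)
print(jaccard(A, B))
```

This exact computation touches every $k$-mer of both sequences, which is the cost that sketching is meant to avoid.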

Jaccard Index: $\frac{|A \cap B|}{|A \cup B|}$

• What would you do if you were studying a population? Sample!

[Venn diagram: set $A$ = people who like peanut butter, set $B$ = people who like jelly]

Sketch

• Small “fingerprint” of a data point (string)

[Venn diagram of sets $A$ and $B$]

Warm-up: Naïve Sketch

• Sample each string independently (don’t want to do $N^2$ sketches for comparing all pairs of $N$ strings)
• Small overlap

[Venn diagram of sets $A$ and $B$]

Minhashing/Bottom-$d$ Sketch

• On the resemblance and containment of documents (Broder, 1997)
  • For comparing documents
• Hash each $k$-mer in a sequence
• Sketch $S(A)$: smallest $d$ hash values in $A$
  • Or take min for each of $d$ different hash functions
• Use same hash function for $S(A)$ and $S(B)$
  • Lets us sketch each string, but “simulate” sketching the union $S(A \cup B)$
• Canonical $k$-mers, $A$ and $B$ could be reverse complements
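The alternative mentioned above, taking the minimum under each of $d$ different hash functions, can be sketched as follows. The function names are hypothetical, and salting BLAKE2b with the index `i` is only an assumed way to obtain $d$ hash functions for the example.

```python
import hashlib

def h(kmer, i):
    """i-th hash function, simulated by salting BLAKE2b (an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8,
                             salt=i.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(kmers, d):
    """For each of d hash functions, keep the minimum hash over the set."""
    return [min(h(x, i) for x in kmers) for i in range(d)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of hash functions whose minima agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = {"ATCTGAG", "TCTGAGG", "CTGAGGT"}
B = {"TCTGAGG", "CTGAGGT", "TGAGGTC"}
print(estimate_jaccard(minhash_signature(A, 64), minhash_signature(B, 64)))
```

Here the true Jaccard index is 2/4 = 0.5, and the printed estimate should land near that value. This variant hashes every $k$-mer $d$ times, which is why the bottom-$d$ form with a single hash function is usually cheaper.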

Minhashing/Bottom-$d$ Sketch

• Sample smallest $d = 6$ hashes of $k$-mers in each set

[Venn diagram of sets $A$ and $B$ labeled with hash values. In $A$ only: 10, 3, 9. In both: 2, 4, 7. In $B$ only: 1, 5, 8. The final build of the slide adds the annotation “This can’t happen”.]

Comparing sketches

• Jaccard estimate $j$
  • $\frac{|A \cap B|}{|A \cup B|} \approx \frac{|S(A \cup B) \cap S(A) \cap S(B)|}{|S(A \cup B)|}$
• Get $S(A \cup B)$ by merge sort operation in $O(d)$ time
  • Merge until $d$ unique hashes seen
  • Count number of matches $c = 3$
  • $j = \frac{c}{d}$
• Error of estimate is $\epsilon = \frac{1}{\sqrt{d}}$

$S(A)$: 2, 3, 4, 7, 9, 10
$S(A \cup B)$: 1, 2, 3, 4, 5, 7
$S(B)$: 1, 2, 4, 5, 7, 8
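The merge step can be sketched as a two-pointer walk over the sorted sketches. The function name is hypothetical, and the sketch assumes both inputs are sorted and hold exactly $d$ distinct hashes each.

```python
def estimate_jaccard(sketch_a, sketch_b, d):
    """Merge two sorted bottom-d sketches and count shared hashes.

    Assumes each sketch is sorted and holds exactly d distinct hashes.
    Stops once d unique hashes of the union have been seen: O(d) time.
    """
    i = j = seen = matches = 0
    while seen < d and i < len(sketch_a) and j < len(sketch_b):
        if sketch_a[i] == sketch_b[j]:
            matches += 1
            i += 1
            j += 1
        elif sketch_a[i] < sketch_b[j]:
            i += 1
        else:
            j += 1
        seen += 1
    # Any union hashes not yet visited come from one sketch only,
    # so they cannot add matches.
    return matches / d

S_A = [2, 3, 4, 7, 9, 10]
S_B = [1, 2, 4, 5, 7, 8]
print(estimate_jaccard(S_A, S_B, d=6))  # 0.5
```

On the slide's example the merge visits 1, 2, 3, 4, 5, 7, finds the three matches 2, 4, and 7, and returns $3/6 = 0.5$.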

Building Bottom-$d$ Sketch

• Takes $O(n \log d)$ time
  • Traverse string, hashing $k$-mers
  • Keep sorted list of smallest $d$
  • Check each new hash against max in list
  • $O(\log d)$ time to insert if necessary
• Actually expected time $O(n + d \log d \log n)$
  • Because Pr[$i$th hash gets inserted in list] $= \frac{d}{i}$
  • So effectively linear
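One way to realize the steps above is with a max-heap in place of the sorted list, which gives the same $O(\log d)$ insertion. This is a sketch under assumptions: the function names are hypothetical, and BLAKE2b is a stand-in hash rather than the one used by any particular tool.

```python
import hashlib
import heapq

def hash_kmer(kmer):
    """64-bit hash of a k-mer (BLAKE2b is a stand-in, not the talk's choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little")

def bottom_d_sketch(seq, k, d):
    """Sorted list of the d smallest distinct k-mer hashes in seq."""
    heap = []        # max-heap of the current smallest hashes, stored negated
    members = set()  # hashes currently in the heap, to skip duplicates
    for i in range(len(seq) - k + 1):
        hv = hash_kmer(seq[i:i + k])
        if hv in members:
            continue
        if len(heap) < d:
            heapq.heappush(heap, -hv)
            members.add(hv)
        elif hv < -heap[0]:  # smaller than the current maximum
            evicted = -heapq.heappushpop(heap, -hv)
            members.discard(evicted)
            members.add(hv)
    return sorted(-x for x in heap)

print(bottom_d_sketch("ATCTGAGGTCAC", k=7, d=3))
```

Most hashes fail the comparison against the current maximum and cost only constant time, which is the intuition behind the near-linear expected running time.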

Minhash parameters

• Probability some $k$-mer $x$ appears in a random genome of length $n$
  • $\Pr[x \in A] \approx 1 - \left(1 - |\Sigma|^{-k}\right)^{n}$
  • Alphabet size $|\Sigma| = 4$
• For $k = 16$, $n = 3$ Gbp:
  • Probability of a given 16-mer in a genome is $\approx 0.5$
  • $\approx 25\%$ of 16-mers expected to be shared between two random 3 Gbp genomes
• Too short $k$-mers can overestimate Jaccard, especially for distant genomes
• Very long could underestimate, but less of an issue
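The two figures for $k = 16$ can be checked numerically. The function name is hypothetical; `log1p` and `expm1` are used only to keep precision when $|\Sigma|^{-k}$ is tiny.

```python
import math

def kmer_probability(k, n, sigma=4):
    """Pr[x in A] ~ 1 - (1 - sigma^-k)^n for a random genome of length n."""
    # log1p/expm1 keep precision when sigma^-k is tiny.
    return -math.expm1(n * math.log1p(-sigma ** -k))

p = kmer_probability(k=16, n=3_000_000_000)
print(round(p, 2), round(p * p, 2))  # 0.5 0.25
```

The second number is the chance that a given 16-mer occurs in both of two independent random genomes, which is where the 25% figure comes from.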

Minhash parameters

• Value of $k$ to achieve a desired probability $q$ of seeing a given $k$-mer in sequence length $n$
  • $k \approx \log_{|\Sigma|} \frac{n(1-q)}{q}$
• 5 Mbp genome, $q = 0.01$, $k \approx 14$
• 3 Gbp genome, $q = 0.01$, $k \approx 19$
• Mash default: $k = 21$ and $s = 1000$ (sketch size, called $d$ above)
  • 8 kB per sketch
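The formula can be evaluated directly for the two genome sizes on the slide. The function name is hypothetical, and the value is left unrounded so the approximation is visible.

```python
import math

def kmer_size(n, q, sigma=4):
    """k ~ log_sigma(n (1 - q) / q), left unrounded."""
    return math.log(n * (1 - q) / q, sigma)

print(round(kmer_size(5_000_000, 0.01), 1))      # 14.4
print(round(kmer_size(3_000_000_000, 0.01), 1))  # 19.1
```

Rounded to the nearest integer these give the $k \approx 14$ and $k \approx 19$ quoted above.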

Mash distance

• Mash distance based on Jaccard estimate $j$
  • $-\frac{1}{k} \ln \frac{2j}{1+j}$
• Based on Poisson error model
• Implicitly uses average size of the two sets, penalizing sets of different size
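The distance formula translates into a few lines of Python. The function name is hypothetical, and returning 1.0 when no hashes are shared is an assumption made here so the function is defined for $j = 0$.

```python
import math

def mash_distance(j, k):
    """D = -(1/k) * ln(2j / (1 + j)), written as ln((1 + j) / (2j)) / k."""
    if j <= 0:
        return 1.0  # assumption: cap the distance when no hashes are shared
    return math.log((1 + j) / (2 * j)) / k

print(mash_distance(1.0, 21))  # 0.0: identical sketches
print(mash_distance(0.5, 21))
```

The distance shrinks as $k$ grows for a fixed $j$, because the same fraction of shared $k$-mers implies fewer mutations per base when the $k$-mers are longer.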

Some related works

• Assembly overlaps
  • Assembling large genomes with single-molecule sequencing and locality-sensitive hashing (Berlin et al., 2015)
• Containment for different size sets
  • Improving MinHash Via the Containment Index with Applications to Metagenomic Analysis (Koslicki and Zabeti, 2017)

Implementation

• MurmurHash3
• Open Bloom Filter Library
• Mash

Other Random Stuff

Fruit Fly Brains

• Locality Sensitive Hashing (LSH)
  • A neural algorithm for a fundamental computing problem (Dasgupta, Stevens, and Navlakha, 2017)
• Bloom filters
  • (Dasgupta, Sheehan, Stevens, and Navlakha, upcoming)
  • Have 3 special properties
    • Continuous-valued novelty
    • Distance sensitivity
    • Time sensitivity

Thanks!