writing a search engine. how hard could it be?

WRITING A SEARCH ENGINE. HOW HARD COULD IT BE? ANTHONY BROWN @BRUINBROWN93

Upload: anthony-brown

Post on 15-Apr-2017

Category:

Software


TRANSCRIPT

Page 1: Writing a Search Engine. How hard could it be?

WRITING A SEARCH ENGINE. HOW HARD COULD IT BE?

ANTHONY BROWN @BRUINBROWN93

Page 2: Writing a Search Engine. How hard could it be?

ABOUT

ABOUT ME

▸ Consultant at Compositional IT

▸ F# dev for ~3 years now

▸ Interested in Big Data, IoT, Cloud and Distributed Systems

Page 3: Writing a Search Engine. How hard could it be?

COMPOSITIONAL IT

FUNCTIONAL FIRST. CLOUD READY. @COMPOSITIONALIT

Page 4: Writing a Search Engine. How hard could it be?

HOW HARD COULD IT BE?

Every software developer ever

INTRODUCTION

Page 5: Writing a Search Engine. How hard could it be?

IT’S ONLY AN OPERATING SYSTEM, ALL IT DOES IS RUN PROGRAMS!

Everybody when Windows blue screens

INTRODUCTION

Page 6: Writing a Search Engine. How hard could it be?

IT’S ONLY A MULTIPLAYER ONLINE VIDEO GAME!

Anybody playing a game when lag spikes hit

INTRODUCTION

Page 7: Writing a Search Engine. How hard could it be?

IT’S ONLY 2 LINES OF JAVASCRIPT

Backend developer needing to make a small API change

INTRODUCTION

Page 8: Writing a Search Engine. How hard could it be?

DUDE. HOLD MY BEER.

Drunk people 10 seconds before making a terrible mistake

Page 9: Writing a Search Engine. How hard could it be?

SATURDAY MORNING. PLANS CANCELLED.

Page 10: Writing a Search Engine. How hard could it be?

WHAT NEXT? HIT UP GOOGLE.

Page 11: Writing a Search Engine. How hard could it be?

WHAT TO DO IN LONDON THIS WEEKEND?

Page 14: Writing a Search Engine. How hard could it be?

WRITING A SEARCH ENGINE. HOW HARD COULD IT BE?

Page 15: Writing a Search Engine. How hard could it be?

WRITING A SEARCH ENGINE WITH AZURE AND F# IN A WEEKEND.

Page 16: Writing a Search Engine. How hard could it be?

BUT FIRST.

Page 17: Writing a Search Engine. How hard could it be?

THIS WAS A WEEKEND PROJECT.

Page 18: Writing a Search Engine. How hard could it be?

YOU SHOULD EXPECT: - HACKY CODE.

Page 19: Writing a Search Engine. How hard could it be?

YOU SHOULD EXPECT: - DEMOS TO FAIL.

Page 20: Writing a Search Engine. How hard could it be?

YOU SHOULD NOT EXPECT: - A DEEP DIVE INTO SEARCH ENGINE TECH.

Page 21: Writing a Search Engine. How hard could it be?

SEARCH ENGINE BACKGROUND

CONSTRAINTS

▸ Not a priority

▸ Can’t cost more than £85 per month

▸ No operations investment

▸ Limit to the weekend

Page 22: Writing a Search Engine. How hard could it be?

BACKGROUND

EVERYTHING I KNOW ABOUT HOW SEARCH ENGINES WORK

Page 23: Writing a Search Engine. How hard could it be?

THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE

SERGEY BRIN AND LARRY PAGE

Page 25: Writing a Search Engine. How hard could it be?

IT’S 2016. THE WEB’S CHANGED. A LOT.

Page 26: Writing a Search Engine. How hard could it be?

WHAT’S NEW? + SCALE

Page 27: Writing a Search Engine. How hard could it be?

WHAT’S NEW? + USERS

Page 28: Writing a Search Engine. How hard could it be?

WHAT’S NEW? + GLOBALISATION

Page 29: Writing a Search Engine. How hard could it be?

WHAT’S NEW? + CLOUD

Page 30: Writing a Search Engine. How hard could it be?

WHAT’S NEW? + PLATFORM AS A SERVICE

Page 31: Writing a Search Engine. How hard could it be?

WHAT’S NEW? - INFRASTRUCTURE

Page 32: Writing a Search Engine. How hard could it be?

WHAT’S NEW? - PERSONAL HOSTING

Page 33: Writing a Search Engine. How hard could it be?

SEARCH ENGINE BACKGROUND

WHAT’S IMPORTANT?

▸ Search

▸ Scraping

▸ PageRank

Page 34: Writing a Search Engine. How hard could it be?

SEARCH IMPLEMENTATION

HOW TO FIND A NEEDLE IN A HAYSTACK

▸ Take all of your documents

▸ Record all of the words which occur within each document

▸ Invert that index

▸ The result is a list of all words and the documents they appear in

▸ For every word in the search query, look up its list of documents, then return the documents which appear in all of those lists
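
The steps above can be sketched in a few lines. This is a minimal illustration in Python rather than the F# used in the talk, and it is not the author's code: the function names and the sample documents are assumptions made up for the example, and tokenisation is deliberately naive.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Intersect the document sets, one per query word.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, for illustration only.
documents = {
    "a.html": "Things to do in London this weekend",
    "b.html": "London weather forecast",
    "c.html": "Weekend markets in London",
}
index = build_inverted_index(documents)
print(sorted(search(index, "london weekend")))  # ['a.html', 'c.html']
```

A real engine would add ranking, stemming and stop-word handling on top of this, which is part of why the talk hands the job to a managed service.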

Page 35: Writing a Search Engine. How hard could it be?

SOUNDS EASY RIGHT? I DON’T CARE ABOUT IT.

Page 36: Writing a Search Engine. How hard could it be?

AZURE SEARCH

MANAGED SEARCH AS A SERVICE

Page 37: Writing a Search Engine. How hard could it be?

AZURE SEARCH

WHAT DOES AZURE SEARCH GIVE US?

▸ Hosted Search as a Service

▸ HTTP API for indexing and retrieving documents

▸ Ability to scale out (more replicas, more indexes)

▸ Free basic tier
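
As a rough idea of what the HTTP API looks like from a client, the sketch below builds (but does not send) a query request in Python. The URL shape and the `api-key` header follow the Azure Search REST API; the service name, index name, key and the chosen `api-version` value are placeholders assumed for the example, not details from the talk.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(service, index, query, api_key, api_version="2016-09-01"):
    """Build a GET request for the Search Documents endpoint without sending it."""
    params = urlencode({"search": query, "api-version": api_version})
    url = f"https://{service}.search.windows.net/indexes/{index}/docs?{params}"
    return Request(url, headers={"api-key": api_key})


# Hypothetical service name, index name and key.
request = build_search_request("my-search", "pages", "london weekend", "<query-key>")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would execute it against a real service and return a JSON body of matching documents.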

Page 38: Writing a Search Engine. How hard could it be?

AZURE SEARCH IN THE AZURE PORTAL.

Page 39: Writing a Search Engine. How hard could it be?

BOOSTING DEMO.

Page 40: Writing a Search Engine. How hard could it be?

WE HAVE SEARCH. WHAT NEXT?

Page 41: Writing a Search Engine. How hard could it be?

INDEXING DATA

WHAT IS A CRAWLER

▸ Autonomously find every web page on the internet

▸ Pull the content from that web page and index it

▸ Read the links on that page and index those links

▸ Recursively process until every page on the internet has been reached
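
The crawl loop described above can be sketched as follows. This is an assumption-laden illustration in Python, not the talk's F# code: it replaces recursion with an explicit queue, caps the number of pages, and uses a made-up in-memory link table (`fake_web`) in place of real HTTP fetches.

```python
from collections import deque


def crawl(seed, fetch_links, max_pages=1000):
    """Breadth-first crawl from `seed`, visiting each URL at most once."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)  # stand-in for "pull the content and index it"
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Hypothetical in-memory "web" so the sketch runs without network access.
fake_web = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}
print(crawl("a", lambda url: fake_web.get(url, [])))  # ['a', 'b', 'c', 'd']
```

The `frontier` queue and the `seen` set are the two pieces of state that stop fitting on one machine once the crawl grows, which is where a distributed queue comes in.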

Page 42: Writing a Search Engine. How hard could it be?

THE PROBLEM? THE INTERNET’S PRETTY BIG.

Page 43: Writing a Search Engine. How hard could it be?

AZURE SERVICE BUS

DISTRIBUTED MESSAGE QUEUES

Page 44: Writing a Search Engine. How hard could it be?

INDEXING DATA

WHAT DOES AZURE SERVICE BUS GIVE US?

▸ Scalable durable queues and topics with guaranteed availability

▸ .NET APIs to communicate with the service bus

▸ Free basic tier

Page 45: Writing a Search Engine. How hard could it be?

WORKING WITH A SERVICE BUS QUEUE.

Page 46: Writing a Search Engine. How hard could it be?

WE NEED TO BE GOOD CITIZENS. WE DON’T WANT TO DDOS A SINGLE WEBSITE DURING CRAWLING.
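
One common way to stay polite is to enforce a minimum delay between requests to the same host. The Python sketch below shows that idea only; the class name, the two-second delay and the example URLs are assumptions, and the talk does not say which throttling scheme was actually used.

```python
from urllib.parse import urlsplit


class PolitenessPolicy:
    """Track when each host may next be fetched, given a minimum delay."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url, now):
        """Seconds to wait before fetching `url` if the current time is `now`."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, now) - now)

    def record_fetch(self, url, now):
        """Note that `url` was fetched at time `now`."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.min_delay


policy = PolitenessPolicy(min_delay_seconds=2.0)
policy.record_fetch("https://example.com/a", now=100.0)
print(policy.wait_time("https://example.com/b", now=100.5))  # 1.5 (same host)
print(policy.wait_time("https://example.org/", now=100.5))   # 0.0 (other host)
```

Taking the clock as a parameter keeps the policy easy to test; a worker would pass `time.monotonic()` and sleep for the returned duration.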

Page 47: Writing a Search Engine. How hard could it be?

SERVICE BUS PROVIDES SUPPORT FOR MESSAGE DE-DUPLICATION BASED ON THE MESSAGE ID, WHICH CAN BE DERIVED FROM THE CONTENT.
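
Service Bus duplicate detection compares each message's MessageId, so one way to get content-based de-duplication is to derive that id from the URL being queued. The Python sketch below illustrates the idea with an in-memory stand-in; `message_id_for` and `DedupQueue` are hypothetical names, and the real service only remembers ids for a configured time window.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def message_id_for(url):
    """Derive a stable id from a URL so repeated enqueues look identical."""
    parts = urlsplit(url)
    # Normalise: lower-case scheme and host, drop the fragment, default path to "/".
    normalised = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        "",
    ))
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class DedupQueue:
    """In-memory stand-in for a queue with duplicate detection enabled."""

    def __init__(self):
        self.seen_ids = set()  # the real service forgets ids after a time window
        self.messages = []

    def send(self, url):
        """Enqueue `url` unless a message with the same id was already accepted."""
        message_id = message_id_for(url)
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.messages.append(url)
        return True


queue = DedupQueue()
print(queue.send("https://Example.com"))       # True
print(queue.send("https://example.com/#top"))  # False, same page after normalising
```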

Page 48: Writing a Search Engine. How hard could it be?

WE DON’T WANT TO SCRAPE THROUGH EVERY WEB PAGE IN THE WORLD.

Page 49: Writing a Search Engine. How hard could it be?

WE DON’T WANT TO INDEX: - GOOGLE SEARCH QUERIES

Page 50: Writing a Search Engine. How hard could it be?

WE DON’T WANT TO INDEX: - PROTECTED CONTENT

Page 51: Writing a Search Engine. How hard could it be?

WE DON’T WANT TO INDEX: - IRRELEVANT CONTENT

Page 52: Writing a Search Engine. How hard could it be?

DEALING WITH THE ROBOTS.TXT FILE

WRITING BASIC PARSERS IN F#

Page 53: Writing a Search Engine. How hard could it be?

BEING A WELL BEHAVED SCRAPER

WHAT IS ROBOTS.TXT?

▸ Text file standard for telling web crawlers which paths they may and may not scrape

▸ Advisory only - compliance is voluntary, so crawlers can ignore the robots.txt file

▸ Simple file stored at the root of the web server

Page 54: Writing a Search Engine. How hard could it be?

AN EXAMPLE ROBOTS.TXT FILE.

Page 55: Writing a Search Engine. How hard could it be?

SIMPLE PARSING WITH F#.
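
The slide's F# code is not part of this transcript. As a rough stand-in, here is a minimal Python sketch of the same kind of line-oriented parsing. It is an assumption about the approach, not the author's parser: it only understands `User-agent` and `Disallow`, ignores `Allow`, wildcards and `Crawl-delay`, and the sample file and the `weekendbot` agent name are invented.

```python
def parse_robots(text):
    """Parse robots.txt text into {user_agent: [disallowed path prefixes]}."""
    rules = {}
    current_agents = []
    last_was_agent = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not last_was_agent:
                current_agents = []  # a new group of rules starts here
            current_agents.append(value.lower())
            rules.setdefault(value.lower(), [])
            last_was_agent = True
        else:
            if field == "disallow" and value:
                for agent in current_agents:
                    rules[agent].append(value)
            last_was_agent = False
    return rules


def is_allowed(rules, user_agent, path):
    """True unless a Disallow prefix for this agent (or for '*') matches the path."""
    prefixes = rules.get(user_agent.lower(), rules.get("*", []))
    return not any(path.startswith(prefix) for prefix in prefixes)


# Hypothetical robots.txt content.
EXAMPLE = """
User-agent: *
Disallow: /search
Disallow: /private/

User-agent: weekendbot
Disallow:
"""
rules = parse_robots(EXAMPLE)
print(is_allowed(rules, "otherbot", "/search?q=london"))  # False
print(is_allowed(rules, "otherbot", "/about"))            # True
print(is_allowed(rules, "WeekendBot", "/private/page"))   # True
```

Python's standard library also ships `urllib.robotparser`, which covers more of the format than this sketch.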

Page 56: Writing a Search Engine. How hard could it be?

HTML AND INFORMATION RETRIEVAL

QUERYING HTML DOCUMENTS WITH HTML AGILITY PACK

Page 57: Writing a Search Engine. How hard could it be?

WE HAVE AN HTML FILE. WE NEED THE CONTENT OUT OF IT.

Page 58: Writing a Search Engine. How hard could it be?

INFORMATION RETRIEVAL FROM HTML DOCUMENTS

WORKING WITH THE HTML AGILITY PACK

▸ Provides a simple query layer over HTML documents

▸ Works with both well-formed and malformed HTML

▸ Provides XPath support over the document

▸ Allows for querying for individual properties and elements

Page 59: Writing a Search Engine. How hard could it be?

EXTRACTING LINKS FROM AN HTML DOCUMENT

Page 60: Writing a Search Engine. How hard could it be?

EXTRACTING ALL OF THE CONTENT FROM AN HTML DOCUMENT
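
The code from these two slides is not in the transcript. The sketch below shows both extractions in Python using the standard library's `html.parser`, swapped in for the HTML Agility Pack and its XPath queries. The class name, the sample markup and the choice to skip `script` and `style` content are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collect absolute link targets and visible text from an HTML document."""

    SKIPPED = {"script", "style"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url):
    """Return (list of absolute links, visible text joined by spaces)."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


# Hypothetical page content.
sample = """<html><head><title>Weekend</title><style>p {color: red}</style></head>
<body><p>Things to do in <a href="/london">London</a>.</p>
<a href="https://example.org/events">Events</a></body></html>"""
links, text = extract(sample, "https://example.com/guide")
print(links)
print(text)
```

The links feed the crawl queue and the link graph, while the text is what gets sent to the search index.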

Page 61: Writing a Search Engine. How hard could it be?

WE NOW HAVE A WEB SCRAPER. WE NEED TO RUN THE WEB SCRAPER.

Page 62: Writing a Search Engine. How hard could it be?

AZURE WEBJOBS

SIMPLE HOSTING OF LONG RUNNING PROCESSES

Page 63: Writing a Search Engine. How hard could it be?

AZURE WEB JOBS

WHAT ARE WEB JOBS?

▸ A means of hosting basic executables in the cloud

▸ Provides simplified deployment and monitoring

▸ Pricing per minute of usage

Page 64: Writing a Search Engine. How hard could it be?

WE NOW HAVE A SEARCH ENGINE. KIND OF.

Page 65: Writing a Search Engine. How hard could it be?

SEARCH IS A RECOMMENDATION PROBLEM.

Page 66: Writing a Search Engine. How hard could it be?

HOW DO WE RECOMMEND CONTENT TO USERS?

Page 67: Writing a Search Engine. How hard could it be?

PAGERANK

FINDING THE MOST INFLUENTIAL SITES ON THE INTERNET

Page 68: Writing a Search Engine. How hard could it be?

PAGERANK

WHAT IS PAGERANK?

▸ Stanford’s patented algorithm

▸ Helps you find the most influential websites on the internet

▸ Websites with lots of links to them, especially links from other influential websites, are more influential
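
To make the idea concrete, here is a small single-machine sketch of PageRank by power iteration, written in Python. It is not the distributed MBrace job from the talk: the damping factor of 0.85, the iteration count and the four-page link graph are assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * ranks[page] / n
                for other in pages:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical link graph: three pages link to "c", and "c" links back to "a".
graph = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page "c" comes out on top because it has the most inbound links, and "a" follows because its single inbound link comes from the most influential page.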

Page 69: Writing a Search Engine. How hard could it be?

THE PROBLEM? THERE ARE LOTS OF WEBSITES ON THE INTERNET.

Page 70: Writing a Search Engine. How hard could it be?

THERE ARE EVEN MORE LINKS BETWEEN WEBSITES.

Page 71: Writing a Search Engine. How hard could it be?

WE HAVE A HUGE LINK GRAPH. WE NEED TO PROCESS THAT GRAPH.

Page 72: Writing a Search Engine. How hard could it be?

BIG DATA PROCESSING WITH MBRACE AND CLOUDFLOWS.

Page 73: Writing a Search Engine. How hard could it be?

WE HAVE A QUERY WHICH NEEDS TO RUN DAILY. WE NEED TO ORCHESTRATE IT.

Page 74: Writing a Search Engine. How hard could it be?

AZURE FUNCTIONS + AZURE RESOURCE MANAGER

USING AZURE FUNCTIONS FOR DEVOPS

Page 75: Writing a Search Engine. How hard could it be?

DEVOPS

WHAT IS AZURE RESOURCE MANAGER?

▸ Declarative way of describing Azure infrastructure

▸ REST APIs to deploy infrastructure template files

▸ APIs to see current deployment status

Page 76: Writing a Search Engine. How hard could it be?

DEVOPS

WHAT IS AZURE FUNCTIONS?

▸ Lightweight scripting of Azure web jobs

▸ Allows for running scripts in response to certain events

▸ Billing based on number of function invocations

Page 77: Writing a Search Engine. How hard could it be?

DEVOPS

USING AZURE FUNCTIONS FOR DEVOPS

▸ Set up a timer-triggered Azure Function

▸ Deploy an MBrace cluster through Azure Resource Manager

▸ Send an event when the job completes

▸ Second Azure Function for deleting the MBrace cluster

Page 78: Writing a Search Engine. How hard could it be?

AZURE FUNCTIONS AND AZURE RESOURCE MANAGER.

Page 79: Writing a Search Engine. How hard could it be?

WE NOW HAVE EVERYTHING IN PLACE FOR A SEARCH ENGINE. NOBODY CAN ACCESS IT THOUGH.

Page 80: Writing a Search Engine. How hard could it be?

AZURE FUNCTIONS

SERVERLESS WEB APIS WITH AZURE FUNCTIONS

Page 81: Writing a Search Engine. How hard could it be?

AZURE FUNCTIONS CAN OPERATE ON HTTP REQUESTS.

Page 82: Writing a Search Engine. How hard could it be?

NO LONG TERM HOSTING COSTS.

Page 83: Writing a Search Engine. How hard could it be?

AZURE FUNCTIONS HTTP API DEMO.

Page 84: Writing a Search Engine. How hard could it be?

DONE. SEARCH ENGINE COMPLETE.

Page 85: Writing a Search Engine. How hard could it be?

HTTP API

AZURE SEARCH

LINK DATABASE

PAGERANK

CLUSTER ORCHESTRATOR

AZURE SERVICE BUS

INDEXER

PAGERANK IMPORTER

PAGERANK SCORE STORE

Page 86: Writing a Search Engine. How hard could it be?

PLENTY OF ROOM FOR IMPROVEMENTS.

Page 87: Writing a Search Engine. How hard could it be?

CACHING SEARCH QUERIES.

Page 88: Writing a Search Engine. How hard could it be?

QUERY AUTO COMPLETE.

Page 89: Writing a Search Engine. How hard could it be?

SEARCH A GIVEN DOMAIN.

Page 90: Writing a Search Engine. How hard could it be?

MULTIPLE LANGUAGE SUPPORT.

Page 91: Writing a Search Engine. How hard could it be?

SUPPORT FOR OTHER DOCUMENT TYPES.

Page 92: Writing a Search Engine. How hard could it be?

BETTER INFORMATION RETRIEVAL ALGORITHMS.

Page 93: Writing a Search Engine. How hard could it be?

WHAT’S NEXT FOR IT? NOTHING.

Page 94: Writing a Search Engine. How hard could it be?

PRODUCTISING A GOOGLE COMPETITOR IS BASICALLY IMPOSSIBLE.

Page 95: Writing a Search Engine. How hard could it be?

IN SUMMARY

WRAPPING UP & KEY TAKEAWAYS

Page 96: Writing a Search Engine. How hard could it be?

AZURE + F# = <3

Page 97: Writing a Search Engine. How hard could it be?

AZURE MAKES HARD INFRASTRUCTURE PROBLEMS SIMPLE.

Page 98: Writing a Search Engine. How hard could it be?

F# MAKES HARD SOFTWARE PROBLEMS SIMPLE.

Page 99: Writing a Search Engine. How hard could it be?

TOGETHER THEY MAKE HARD PROBLEMS SIMPLE.

Page 100: Writing a Search Engine. How hard could it be?

IT’S NOT GOOGLE. BUT IT TOOK 1 DEV 2 DAYS.

Page 101: Writing a Search Engine. How hard could it be?

CLOUD IS THE EPITOME OF BUSINESS AGILITY

Page 103: Writing a Search Engine. How hard could it be?

COMPOSITIONAL IT

FUNCTIONAL FIRST. CLOUD READY.

Page 104: Writing a Search Engine. How hard could it be?

Q&A.