writing a search engine. how hard could it be?
WRITING A SEARCH ENGINE. HOW HARD COULD IT BE?
ANTHONY BROWN @BRUINBROWN93
ABOUT
ABOUT ME
▸ Consultant at Compositional IT
▸ F# dev for ~3 years now
▸ Interested in Big Data, IoT, Cloud and Distributed Systems
COMPOSITIONAL IT
FUNCTIONAL FIRST. CLOUD READY. @COMPOSITIONALIT
HOW HARD COULD IT BE?
Every software developer ever
INTRODUCTION
IT’S ONLY AN OPERATING SYSTEM, ALL IT DOES IS RUN PROGRAMS!
Everybody when Windows blue screens
INTRODUCTION
IT’S ONLY A MULTIPLAYER ONLINE VIDEO GAME!
Anybody playing a game when lag spikes hit
INTRODUCTION
IT’S ONLY 2 LINES OF JAVASCRIPT
Backend developer needing to make a small API change
INTRODUCTION
DUDE. HOLD MY BEER.
Drunk people 10 seconds before making a terrible mistake
SATURDAY MORNING. PLANS CANCELLED.
WHAT NEXT? HIT UP GOOGLE.
WHAT TO DO IN LONDON THIS WEEKEND?
WRITING A SEARCH ENGINE. HOW HARD COULD IT BE?
WRITING A SEARCH ENGINE WITH AZURE AND F# IN A WEEKEND.
BUT FIRST.
THIS WAS A WEEKEND PROJECT.
YOU SHOULD EXPECT: - HACKY CODE.
YOU SHOULD EXPECT: - DEMOS TO FAIL.
YOU SHOULD NOT EXPECT: - A DEEP DIVE INTO SEARCH ENGINE TECH.
SEARCH ENGINE BACKGROUND
CONSTRAINTS
▸ Not a priority
▸ Can’t cost more than £85 per month
▸ No operations investment
▸ Limit to the weekend
BACKGROUND
EVERYTHING I KNOW ABOUT HOW SEARCH ENGINES WORK
THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE
SERGEY BRIN LARRY PAGE
IT’S 2016. THE WEB’S CHANGED. A LOT.
WHAT’S NEW? + SCALE
WHAT’S NEW? + USERS
WHAT’S NEW? + GLOBALISATION
WHAT’S NEW? + CLOUD
WHAT’S NEW? + PLATFORM AS A SERVICE
WHAT’S NEW? - INFRASTRUCTURE
WHAT’S NEW? - PERSONAL HOSTING
SEARCH ENGINE BACKGROUND
WHAT’S IMPORTANT?
▸ Search
▸ Scraping
▸ Page rank
SEARCH IMPLEMENTATION
HOW TO FIND A NEEDLE IN A HAYSTACK
▸ Take all of your documents
▸ Record all of the words which occur within each document
▸ Invert that index
▸ You now have a list of all words and the documents each one appears in
▸ For all words in the search query, find the documents which appear in the inverted index entry of every word
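The steps above can be sketched in a few lines of F#. This is a minimal in-memory illustration, not how Azure Search works internally; the names `tokenize`, `buildIndex` and `search`, and the sample documents, are made up for the example.

```fsharp
open System

/// Split text into a set of lowercase words.
let tokenize (text: string) =
    text.ToLowerInvariant().Split([| ' '; '.'; ','; '!'; '?' |], StringSplitOptions.RemoveEmptyEntries)
    |> Set.ofArray

/// Build the inverted index: word -> set of document ids containing that word.
let buildIndex (docs: (string * string) list) : Map<string, Set<string>> =
    docs
    |> List.collect (fun (docId, text) -> [ for w in tokenize text -> (w, docId) ])
    |> List.groupBy fst
    |> List.map (fun (word, pairs) -> word, pairs |> List.map snd |> Set.ofList)
    |> Map.ofList

/// Documents which contain every word of the query.
let search (index: Map<string, Set<string>>) (query: string) =
    let postings =
        tokenize query
        |> Set.toList
        |> List.map (fun w -> index |> Map.tryFind w |> Option.defaultValue Set.empty)
    match postings with
    | [] -> Set.empty
    | _ -> Set.intersectMany postings

// Hypothetical sample documents.
let docs =
    [ "a.html", "Things to do in London this weekend"
      "b.html", "London weather this week"
      "c.html", "Weekend trips from Paris" ]

let index = buildIndex docs
printfn "%A" (search index "london weekend")
```

Only `a.html` contains both query words, so it is the only result. A real engine would also rank the matches, which is the part the rest of the talk hands over to Azure Search and PageRank.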
SOUNDS EASY RIGHT? I DON’T CARE ABOUT IT.
AZURE SEARCH
MANAGED SEARCH AS A SERVICE
AZURE SEARCH
WHAT DOES AZURE SEARCH GIVE US?
▸ Hosted Search as a Service
▸ HTTP API for indexing and retrieving documents
▸ Ability to scale out (more replicas, more indexes)
▸ Free basic tier
AZURE SEARCH IN THE AZURE PORTAL.
BOOSTING DEMO.
WE HAVE SEARCH. WHAT NEXT?
INDEXING DATA
WHAT IS A CRAWLER
▸ Autonomously find every web page on the internet
▸ Pull the content from that web page and index it
▸ Read the links on that page and index those links
▸ Recursively process until every page on the internet has been reached
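The crawl loop described above is a breadth-first walk with a queue and a set of already-seen URLs. The sketch below runs against a tiny made-up in-memory "web" (`fakeWeb`) instead of real HTTP requests; `linksOf` and `crawl` are hypothetical names, and the `maxPages` cap is an assumption added so the loop always stops.

```fsharp
open System.Collections.Generic

/// Hypothetical stand-in for "fetch a page and extract its links".
let fakeWeb =
    dict [ "a", [ "b"; "c" ]
           "b", [ "a"; "d" ]
           "c", [ "d" ]
           "d", [] ]

let linksOf (url: string) =
    match fakeWeb.TryGetValue url with
    | true, links -> links
    | _ -> []

/// Breadth-first crawl from a seed URL, visiting each URL at most once.
let crawl (seed: string) (maxPages: int) =
    let frontier = Queue<string>()
    let seen = HashSet<string>()
    let visited = ResizeArray<string>()
    frontier.Enqueue seed
    seen.Add seed |> ignore
    while frontier.Count > 0 && visited.Count < maxPages do
        let url = frontier.Dequeue()
        visited.Add url // this is where the page content would be indexed
        for link in linksOf url do
            // HashSet.Add returns false for URLs we have already queued.
            if seen.Add link then frontier.Enqueue link
    List.ofSeq visited

printfn "%A" (crawl "a" 10)
```

In the talk's design the in-process `Queue` is replaced by a durable queue, so that several workers can share one frontier.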
THE PROBLEM? THE INTERNET’S PRETTY BIG.
AZURE SERVICE BUS
DISTRIBUTED MESSAGE QUEUES
INDEXING DATA
WHAT DOES AZURE SERVICE BUS GIVE US?
▸ Scalable durable queues and topics with guaranteed availability
▸ .NET APIs to communicate with the service bus
▸ Free basic tier
WORKING WITH A SERVICE BUS QUEUE.
WE NEED TO BE GOOD CITIZENS. WE DON’T WANT TO DDOS A SINGLE WEBSITE DURING CRAWLING.
SERVICE BUS PROVIDES SUPPORT FOR MESSAGE DE-DUPLICATION BASED ON THE MESSAGE ID. DERIVE THE ID FROM THE CONTENT (THE URL).
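Service Bus duplicate detection compares message IDs, so one way to avoid queuing the same page twice is to derive the ID from the URL. The sketch below only computes such an ID; `messageIdFor` is a hypothetical helper, and sending the message and enabling duplicate detection on the queue are left out.

```fsharp
open System
open System.Security.Cryptography
open System.Text

/// Derive a stable message ID from a URL.
/// The fragment (#...) is dropped because it does not change which page is fetched.
let messageIdFor (url: string) =
    let key = Uri(url).GetLeftPart(UriPartial.Query)
    use sha = SHA256.Create()
    sha.ComputeHash(Encoding.UTF8.GetBytes(key))
    |> Array.map (fun b -> b.ToString("x2"))
    |> String.concat ""

// Two spellings of the same page map to the same ID.
printfn "%b" (messageIdFor "http://example.com/a#top" = messageIdFor "http://example.com/a")
```

Hashing keeps the ID at a fixed 64 hex characters however long the URL is.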
WE DON’T WANT TO SCRAPE THROUGH EVERY WEB PAGE IN THE WORLD.
WE DON’T WANT TO INDEX: - GOOGLE SEARCH QUERIES
WE DON’T WANT TO INDEX: - PROTECTED CONTENT
WE DON’T WANT TO INDEX: - IRRELEVANT CONTENT
DEALING WITH THE ROBOTS.TXT FILE
WRITING BASIC PARSERS IN F#
BEING A WELL BEHAVED SCRAPER
WHAT IS ROBOTS.TXT?
▸ Text file standard for telling web scrapers which paths they may and may not scrape
▸ Opt-in - crawlers can ignore the robots.txt file
▸ Simple file stored at the root of the web server
AN EXAMPLE ROBOTS.TXT FILE.
SIMPLE PARSING WITH F#.
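As an illustration of the kind of parser meant here (not the code shown in the talk), the sketch below reads `User-agent`, `Allow` and `Disallow` lines and answers whether a path may be fetched. It assumes plain prefix matching with the longest matching rule winning, and ignores wildcards, `Crawl-delay` and `Sitemap`. The names `Rule`, `parseRobots` and `isAllowed`, the agent name `mybot` and the example file are made up.

```fsharp
open System

type Rule =
    | Allow of string
    | Disallow of string

/// Collect the rules from every group that applies to `agent` (or to "*").
let parseRobots (agent: string) (robotsTxt: string) : Rule list =
    let splitField (line: string) =
        match line.IndexOf(':') with
        | -1 -> None
        | i -> Some(line.Substring(0, i).Trim().ToLowerInvariant(), line.Substring(i + 1).Trim())
    let step (applies, inAgentBlock, rules) (line: string) =
        match splitField line with
        | Some("user-agent", ua) ->
            let matches = ua = "*" || String.Equals(ua, agent, StringComparison.OrdinalIgnoreCase)
            // Consecutive User-agent lines share one group of rules.
            ((if inAgentBlock then applies || matches else matches), true, rules)
        | Some("allow", path) when applies -> (applies, false, Allow path :: rules)
        | Some("disallow", path) when applies && path <> "" -> (applies, false, Disallow path :: rules)
        | _ -> (applies, false, rules)
    let (_, _, rules) =
        robotsTxt.Split([| '\n'; '\r' |], StringSplitOptions.RemoveEmptyEntries)
        |> Array.map (fun (l: string) -> (l.Split('#').[0]).Trim()) // strip comments
        |> Array.filter (fun l -> l <> "")
        |> Array.fold step (false, false, [])
    List.rev rules

/// The longest matching prefix wins; on a tie Allow beats Disallow.
let isAllowed (rules: Rule list) (path: string) =
    let matching =
        rules
        |> List.choose (fun rule ->
            match rule with
            | Allow p when path.StartsWith(p, StringComparison.Ordinal) -> Some(p.Length, true)
            | Disallow p when path.StartsWith(p, StringComparison.Ordinal) -> Some(p.Length, false)
            | _ -> None)
    match matching with
    | [] -> true
    | _ -> matching |> List.max |> snd

// Hypothetical example file.
let example =
    String.concat "\n"
        [ "User-agent: *"
          "Disallow: /search"
          "Allow: /search/about"
          "Disallow: /private/   # keep out" ]

let rules = parseRobots "mybot" example
printfn "%b" (isAllowed rules "/search?q=london")
printfn "%b" (isAllowed rules "/search/about")
```

With the example file, `/search?q=london` is refused while `/search/about` is allowed, because the `Allow` rule is the longer match.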
HTML AND INFORMATION RETRIEVAL
QUERYING HTML DOCUMENTS WITH HTML AGILITY PACK
WE HAVE AN HTML FILE. WE NEED THE CONTENT OUT OF IT.
INFORMATION RETRIEVAL FROM HTML DOCUMENTS
WORKING WITH THE HTML AGILITY PACK
▸ Provides a simple query layer over HTML documents
▸ Works with well formatted and poorly formatted HTML
▸ Provides XPath support over the document
▸ Allows for querying for individual properties and elements
EXTRACTING LINKS FROM AN HTML DOCUMENT
EXTRACTING ALL OF THE CONTENT FROM AN HTML DOCUMENT
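Both extractions can be sketched with the HTML Agility Pack's XPath support. This is an illustrative F# script rather than the code from the talk. It assumes it runs as an `.fsx` script with a recent .NET SDK (for the `#r "nuget: ..."` reference), and `parse`, `extractLinks` and `extractText` are hypothetical helper names.

```fsharp
#r "nuget: HtmlAgilityPack"

open HtmlAgilityPack

let parse (html: string) =
    let doc = HtmlDocument()
    doc.LoadHtml(html)
    doc

/// The href of every anchor tag. SelectNodes returns null when nothing matches.
let extractLinks (doc: HtmlDocument) =
    match doc.DocumentNode.SelectNodes("//a[@href]") with
    | null -> []
    | nodes -> [ for n in nodes -> n.GetAttributeValue("href", "") ]

/// The visible text of the body, skipping script and style contents.
let extractText (doc: HtmlDocument) =
    match doc.DocumentNode.SelectNodes("//body//text()") with
    | null -> ""
    | nodes ->
        nodes
        |> Seq.filter (fun n -> n.ParentNode.Name <> "script" && n.ParentNode.Name <> "style")
        |> Seq.map (fun n -> HtmlEntity.DeEntitize(n.InnerText).Trim())
        |> Seq.filter (fun t -> t <> "")
        |> String.concat " "

let page = parse """<html><body><h1>Hi</h1><a href="/a">A</a><a href="/b">B</a></body></html>"""
printfn "%A" (extractLinks page)
printfn "%s" (extractText page)
```

The extracted links are relative here; a crawler would resolve them against the page URL before queuing them.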
WE NOW HAVE A WEB SCRAPER. WE NEED TO RUN THE WEB SCRAPER.
AZURE WEBJOBS
SIMPLE HOSTING OF LONG RUNNING PROCESSES
AZURE WEB JOBS
WHAT ARE WEB JOBS?
▸ A means of hosting basic executables in the cloud
▸ Provides simplified deployment and monitoring
▸ Pricing per minute of usage
WE NOW HAVE A SEARCH ENGINE. KIND OF.
SEARCH IS A RECOMMENDATION PROBLEM.
HOW DO WE RECOMMEND CONTENT TO USERS?
PAGE RANK
FINDING THE MOST INFLUENTIAL SITES ON THE INTERNET
PAGE RANK
WHAT IS PAGE RANK?
▸ Stanford’s patented algorithm
▸ Helps you find the most influential websites on the internet
▸ Websites with lots of links to them are more influential
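The idea can be sketched as a simple power iteration over a link graph. This is a small single-machine illustration, not the MBrace job from the talk. It assumes every link target is itself a key in the graph, uses the commonly quoted damping factor of 0.85, and spreads the rank of pages without outgoing links evenly over all pages. The name `pageRank` and the sample graph are made up.

```fsharp
/// PageRank by power iteration over a graph given as page -> outgoing links.
let pageRank (damping: float) (iterations: int) (graph: Map<string, string list>) =
    let pages = graph |> Map.toList |> List.map fst
    let n = float pages.Length
    let initial = pages |> List.map (fun p -> p, 1.0 / n) |> Map.ofList
    let step (ranks: Map<string, float>) =
        // Rank held by pages with no outgoing links is shared between all pages.
        let dangling =
            pages |> List.sumBy (fun p -> if List.isEmpty graph.[p] then ranks.[p] else 0.0)
        // Every page passes an equal share of its rank to each page it links to.
        let received =
            pages
            |> List.collect (fun p ->
                let outs = graph.[p]
                [ for target in outs -> (target, ranks.[p] / float outs.Length) ])
            |> List.groupBy fst
            |> List.map (fun (target, shares) -> target, shares |> List.sumBy snd)
            |> Map.ofList
        pages
        |> List.map (fun p ->
            let incoming = received |> Map.tryFind p |> Option.defaultValue 0.0
            p, (1.0 - damping) / n + damping * (incoming + dangling / n))
        |> Map.ofList
    Seq.fold (fun ranks _ -> step ranks) initial (seq { 1 .. iterations })

// Hypothetical four-page web: c is linked to by a, b and d.
let graph =
    Map.ofList [ "a", [ "b"; "c" ]
                 "b", [ "c" ]
                 "c", [ "a" ]
                 "d", [ "c" ] ]

let ranks = pageRank 0.85 50 graph
ranks |> Map.iter (fun page rank -> printfn "%s %.3f" page rank)
```

In this graph `c` ends up with the highest score because three pages link to it, and `d` the lowest because nothing links to it.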
THE PROBLEM? THERE ARE LOTS OF WEBSITES ON THE INTERNET.
THERE ARE EVEN MORE LINKS BETWEEN WEBSITES.
WE HAVE A HUGE LINK GRAPH. WE NEED TO PROCESS THAT GRAPH.
BIG DATA PROCESSING WITH MBRACE AND CLOUDFLOWS.
WE HAVE A QUERY WHICH NEEDS TO RUN DAILY. WE NEED TO ORCHESTRATE IT.
AZURE FUNCTIONS + AZURE RESOURCE MANAGER
USING AZURE FUNCTIONS FOR DEVOPS
DEVOPS
WHAT IS AZURE RESOURCE MANAGER?
▸ Declarative way of describing Azure infrastructure
▸ REST APIs to deploy infrastructure template files
▸ APIs to see current deployment status
DEVOPS
WHAT IS AZURE FUNCTIONS?
▸ Lightweight scripting of Azure web jobs
▸ Allows for running scripts in response to certain events
▸ Billing based on number of function invocations
DEVOPS
USING AZURE FUNCTIONS FOR DEVOPS
▸ Set up a timer triggered Azure Function
▸ Deploy an MBrace cluster through Azure Resource Manager
▸ Send an event when the job completes
▸ Second Azure Function for deleting the MBrace cluster
AZURE FUNCTIONS AND AZURE RESOURCE MANAGER.
WE NOW HAVE EVERYTHING IN PLACE FOR A SEARCH ENGINE. NOBODY CAN ACCESS IT THOUGH.
AZURE FUNCTIONS
SERVERLESS WEB APIS WITH AZURE FUNCTIONS
AZURE FUNCTIONS CAN OPERATE ON HTTP REQUESTS.
NO LONG TERM HOSTING COSTS.
AZURE FUNCTIONS HTTP API DEMO.
DONE. SEARCH ENGINE COMPLETE.
HTTP API
AZURE SEARCH
LINK DATABASE
PAGERANK
CLUSTER ORCHESTRATOR
AZURE SERVICE BUS
INDEXER
PAGERANK IMPORTER
PAGERANK SCORE STORE
PLENTY OF ROOM FOR IMPROVEMENTS.
CACHING SEARCH QUERIES.
QUERY AUTO COMPLETE.
SEARCH A GIVEN DOMAIN.
MULTIPLE LANGUAGE SUPPORT.
SUPPORT FOR OTHER DOCUMENT TYPES.
BETTER INFORMATION RETRIEVAL ALGORITHMS.
WHAT’S NEXT FOR IT? NOTHING.
PRODUCTISING A GOOGLE COMPETITOR IS BASICALLY IMPOSSIBLE.
IN SUMMARY
WRAPPING UP & KEY TAKEAWAYS
AZURE + F# = <3
AZURE MAKES HARD INFRASTRUCTURE PROBLEMS SIMPLE.
F# MAKES HARD SOFTWARE PROBLEMS SIMPLE.
TOGETHER THEY MAKE HARD PROBLEMS SIMPLE.
IT’S NOT GOOGLE. BUT IT TOOK 1 DEV 2 DAYS.
CLOUD IS THE EPITOME OF BUSINESS AGILITY
COMPOSITIONAL IT
FUNCTIONAL FIRST. CLOUD READY.
Q&A.