seznam.cz
![Page 1: Seznam.cz](https://reader038.vdocuments.us/reader038/viewer/2022110215/56816932550346895de0826c/html5/thumbnails/1.jpg)
Seznam.cz
The number one Czech Internet company
Jiří Materna, Head of Research
![Page 2: Seznam.cz](https://reader038.vdocuments.us/reader038/viewer/2022110215/56816932550346895de0826c/html5/thumbnails/2.jpg)
www.seznam.cz
What is Seznam.cz
An Internet portal with dozens of high-quality services:
• Web search (a web search engine that competes successfully with Google)
• Specialized search (Czech companies, e-shops)
• E-mail (the most popular free e-mail in the Czech market)
• News (covering business, politics, lifestyle, sport, weather, TV schedules, etc.)
• Entertainment (online video and music streaming)
• Online maps (more detailed than Google Maps)
• Sklik.cz (advertising system)
• And others…
@JiriMaterna
![Page 3: Seznam.cz](https://reader038.vdocuments.us/reader038/viewer/2022110215/56816932550346895de0826c/html5/thumbnails/3.jpg)
Seznam.cz in numbers
• More than 1,000 employees
• Revenue: 2.8 billion CZK (108 million EUR)
• 2.4 million people visit Seznam.cz every day
• 1.5 billion crawled web pages
  - 45 % English
  - 37 % Czech
  - 7.7 % Slovak
  - 2.3 % German
  - 8 % others
• 500 queries per second in peak hours
![Page 4: Seznam.cz](https://reader038.vdocuments.us/reader038/viewer/2022110215/56816932550346895de0826c/html5/thumbnails/4.jpg)
Search engine architecture
![Page 5: Seznam.cz](https://reader038.vdocuments.us/reader038/viewer/2022110215/56816932550346895de0826c/html5/thumbnails/5.jpg)
Query Expander
• Query understanding
• Graph representation:
  - AND
  - OR
  - optional
  - other relations
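To make the graph representation more concrete, here is a minimal sketch of a query expressed as AND, OR, and optional nodes. The class names, the `SYNONYMS` table, and the `OPTIONAL_WORDS` set are hypothetical and chosen only for illustration; the slides do not describe how the Query Expander is actually implemented.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Term:
    word: str

    def matches(self, doc_terms: Set[str]) -> bool:
        return self.word in doc_terms


@dataclass
class And:
    children: List[object]

    def matches(self, doc_terms: Set[str]) -> bool:
        return all(child.matches(doc_terms) for child in self.children)


@dataclass
class Or:
    children: List[object]

    def matches(self, doc_terms: Set[str]) -> bool:
        return any(child.matches(doc_terms) for child in self.children)


@dataclass
class Opt:
    child: object

    def matches(self, doc_terms: Set[str]) -> bool:
        # An optional node never filters a document out; in a real system
        # it would presumably only influence ranking.
        return True


# Hypothetical expansion data, not taken from the slides.
SYNONYMS = {"cheap": ["inexpensive"], "flights": ["flight"]}
OPTIONAL_WORDS = {"the", "a", "in"}


def expand(query: str) -> And:
    """Turn a raw query string into an AND of OR/optional/term nodes."""
    nodes = []
    for word in query.lower().split():
        if word in OPTIONAL_WORDS:
            nodes.append(Opt(Term(word)))
            continue
        variants = [Term(word)] + [Term(s) for s in SYNONYMS.get(word, [])]
        nodes.append(Or(variants) if len(variants) > 1 else variants[0])
    return And(nodes)


graph = expand("cheap flights in Prague")
print(graph.matches({"inexpensive", "flight", "prague"}))
```

Here the word "in" becomes an optional node, while "cheap" and "flights" each become an OR over the word and its assumed synonyms.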
![Page 6: Seznam.cz](https://reader038.vdocuments.us/reader038/viewer/2022110215/56816932550346895de0826c/html5/thumbnails/6.jpg)
Search aggregators
• Deduplication
• Document sub-results
• SERP restrictions
• Caching
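As a rough illustration of three of these responsibilities, the sketch below merges result lists, deduplicates by URL, applies one possible SERP restriction (a cap on results per site), and caches the final page per query. The result shape, the `max_per_site` rule, and all function names are assumptions made for this example; document sub-results are not covered.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical result shape: (url, site, score).
Result = Tuple[str, str, float]


def aggregate(result_lists: List[List[Result]],
              max_per_site: int = 2,
              limit: int = 10) -> List[Result]:
    # Deduplication: keep the best-scoring copy of each URL.
    best: Dict[str, Result] = {}
    for results in result_lists:
        for url, site, score in results:
            if url not in best or score > best[url][2]:
                best[url] = (url, site, score)

    ranked = sorted(best.values(), key=lambda r: r[2], reverse=True)

    # SERP restriction (assumed rule): at most max_per_site results per site.
    per_site: Dict[str, int] = defaultdict(int)
    page: List[Result] = []
    for result in ranked:
        if per_site[result[1]] >= max_per_site:
            continue
        per_site[result[1]] += 1
        page.append(result)
        if len(page) == limit:
            break
    return page


_cache: Dict[str, List[Result]] = {}


def cached_aggregate(query: str,
                     fetch: Callable[[str], List[List[Result]]]) -> List[Result]:
    # Caching: compute the page once per distinct query string.
    if query not in _cache:
        _cache[query] = aggregate(fetch(query))
    return _cache[query]
```

A production cache would also need expiry and size limits, which are left out here to keep the sketch short.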
![Page 7: Seznam.cz](https://reader038.vdocuments.us/reader038/viewer/2022110215/56816932550346895de0826c/html5/thumbnails/7.jpg)
Ranking
• RC-Rank: boosted regression oblivious trees
• Hundreds of features
• Our own quality measure
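For readers unfamiliar with oblivious trees, the sketch below shows how a boosted ensemble of them can be evaluated. In an oblivious tree every node on the same level tests the same feature against the same threshold, so a leaf is found by concatenating one bit per level. This illustrates the general technique only; the internals of RC-Rank are not given in the slides, and the class and parameter names are hypothetical. Training is not shown.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ObliviousTree:
    feature_idx: List[int]    # one feature index per level
    thresholds: List[float]   # one threshold per level
    leaf_values: List[float]  # 2 ** depth leaf values

    def predict(self, x: Sequence[float]) -> float:
        leaf = 0
        for feature, threshold in zip(self.feature_idx, self.thresholds):
            # Each level contributes one bit of the leaf index.
            leaf = (leaf << 1) | (1 if x[feature] > threshold else 0)
        return self.leaf_values[leaf]


def predict_ensemble(trees: List[ObliviousTree],
                     x: Sequence[float],
                     learning_rate: float = 1.0) -> float:
    # Boosting: the final score is the (scaled) sum of all tree outputs.
    return sum(learning_rate * tree.predict(x) for tree in trees)


trees = [
    ObliviousTree([0, 1], [0.5, 10.0], [0.0, 1.0, 2.0, 3.0]),
    ObliviousTree([1], [5.0], [-0.5, 0.5]),
]
print(predict_ensemble(trees, [0.7, 3.0]))
```

Because the leaf lookup is a handful of comparisons and bit shifts, this structure is cheap to evaluate even with many trees and hundreds of features.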
![Page 8: Seznam.cz](https://reader038.vdocuments.us/reader038/viewer/2022110215/56816932550346895de0826c/html5/thumbnails/8.jpg)
Index & Indexer
• Indexing: complete, daily, fresh
• Data structures:
  - word barrel: stores the inverted index
  - document barrel: stores document features
  - title barrel: stores processed web page content and metadata
  - others: query site barrel, site barrel, link barrel, qds barrel, query url barrel, …
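The split between a word barrel and a document barrel can be pictured with a toy in-memory index. The attribute names below borrow the barrel names from the slide, but the layout (plain dictionaries and sets) and the `TinyIndex` class are assumptions for illustration only; the real barrels are presumably compact on-disk structures.

```python
from collections import defaultdict
from typing import Dict, List, Set


class TinyIndex:
    def __init__(self) -> None:
        # Inverted index: term -> ids of documents containing it.
        self.word_barrel: Dict[str, Set[int]] = defaultdict(set)
        # Per-document features used later for ranking.
        self.document_barrel: Dict[int, dict] = {}

    def add(self, doc_id: int, text: str, features: dict) -> None:
        for word in text.lower().split():
            self.word_barrel[word].add(doc_id)
        self.document_barrel[doc_id] = features

    def search(self, query: str) -> List[int]:
        words = query.lower().split()
        if not words:
            return []
        # AND semantics: a document must contain every query word.
        postings = [self.word_barrel.get(word, set()) for word in words]
        return sorted(set.intersection(*postings))


index = TinyIndex()
index.add(1, "Prague city maps", {"lang": "en"})
index.add(2, "Brno city guide", {"lang": "en"})
print(index.search("city maps"))
```

Keeping features in a separate structure means the inverted index stays small, while ranking can still look up everything it needs by document id.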
![Page 9: Seznam.cz](https://reader038.vdocuments.us/reader038/viewer/2022110215/56816932550346895de0826c/html5/thumbnails/9.jpg)
Downloader & document database
• Hadoop, Giraph, YARN
• 50 million documents every day
• 1.5 billion documents stored out of 50 billion known documents
• Duplicate detection
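The slides do not say how duplicates are detected. One simple approach, shown here purely as an assumed stand-in, is to fingerprint the normalized page text and store only the first URL seen for each fingerprint. This catches exact duplicates after whitespace and case normalization; near-duplicate detection would need a different technique. The `DocumentStore` class and `fingerprint` function are hypothetical names.

```python
import hashlib
import re
from typing import Dict


def fingerprint(text: str) -> str:
    # Collapse whitespace and lowercase so trivial differences do not matter.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DocumentStore:
    def __init__(self) -> None:
        # fingerprint -> URL of the first document seen with that content
        self._canonical: Dict[str, str] = {}

    def add(self, url: str, text: str) -> str:
        """Store a document and return the canonical URL for its content."""
        fp = fingerprint(text)
        if fp not in self._canonical:
            self._canonical[fp] = url
        return self._canonical[fp]


store = DocumentStore()
print(store.add("http://a.example/", "Hello   World"))
print(store.add("http://b.example/", "hello world\n"))
```

Both calls return the first URL, because the second page differs only in whitespace and letter case.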
![Page 10: Seznam.cz](https://reader038.vdocuments.us/reader038/viewer/2022110215/56816932550346895de0826c/html5/thumbnails/10.jpg)
Possible models of cooperation
• Joint projects
• Providing our technology
• Sharing data (MetaCentrum)
• …
![Page 11: Seznam.cz](https://reader038.vdocuments.us/reader038/viewer/2022110215/56816932550346895de0826c/html5/thumbnails/11.jpg)
Jiří Materna, Head of Research
Thank you for your attention.