Hadoop based ETL and Solr based semantic search

Post on 20-Jun-2015

DESCRIPTION

Slides from the Big Data Meetup talk by Károly Kása (CTO, Precognox) http://www.meetup.com/Big-Data-Meetup-Budapest/events/158288052/

TRANSCRIPT

HADOOP BASED ETL AND SOLR BASED SEMANTIC SEARCH BEHIND JOBMONITOR.HU

KÁROLY KÁSA, DEVELOPMENT MANAGER

PRECOGNOX IS A JAVA SHOP SPECIALIZED IN INTELLIGENT, LINGUISTIC-BASED SEARCH, TEXT MINING AND BIG DATA SOLUTIONS

Breathing search technologies for more than 10 years, and committed to quality craftsmanship as an agile team.

JOBMONITOR.HU

850 crawled job category pages

> 100k up-to-date job ads

> 150k search requests per day

Precognox Infoharvester ETL tool + Apache Solr based Precognox search

FORMER ARCHITECTURE OF INFOHARVESTER

Crawler Client (Extract)

Scheduler web application server

Validator/data transformer (Transform)

XML file storage (Load)

Requests available: "full" and "diff"
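
The slides name the two request types but do not show how they differ. A minimal sketch, assuming a "full" request returns every stored ad and a "diff" request returns only what changed since the previous crawl. The snapshot shape (ad URL mapped to ad fields) and the function names are hypothetical and not taken from Infoharvester.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape: ad URL -> ad fields (assumption, not from the slides).
Snapshot = Dict[str, Dict[str, str]]
Operation = Tuple[str, str, Dict[str, str]]


def full_request(current: Snapshot) -> List[Operation]:
    """Return every ad in the current snapshot as an 'upsert' operation."""
    return [("upsert", url, ad) for url, ad in sorted(current.items())]


def diff_request(previous: Snapshot, current: Snapshot) -> List[Operation]:
    """Return only new or changed ads as 'upsert' and vanished ads as 'delete'."""
    ops: List[Operation] = []
    for url, ad in sorted(current.items()):
        if previous.get(url) != ad:  # new ad, or its fields changed
            ops.append(("upsert", url, ad))
    for url in sorted(previous.keys() - current.keys()):  # ad no longer crawled
        ops.append(("delete", url, {}))
    return ops
```

Under this assumption a consumer such as a search indexer applies a "diff" response incrementally and falls back to "full" when it has no earlier state.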

Not a big data problem: we had a scalability problem

• XML file based storage and processing was slow

• the memory capacity of the single main server was not scalable

• while the crawler machines did not use their resources

INFOHARVESTER HADOOP EDITION

Precognox Infoharvester

Data Mining

Precognox Search

Search Server, Search Index

HBase
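
The slides do not show how the Hadoop edition divides the work. A minimal sketch, assuming a MapReduce-style job in which the map step validates and normalizes each crawled ad and the reduce step keeps the newest version per URL. It is simulated in plain Python and uses no Hadoop or HBase API. The field names (`url`, `title`, `crawled_at`) and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Ad = Dict[str, str]  # hypothetical ad record (assumption, not from the slides)


def map_ad(raw: Ad) -> Iterator[Tuple[str, Ad]]:
    """Map step: validate one raw ad and emit (url, normalized ad)."""
    url = raw.get("url", "").strip()
    title = raw.get("title", "").strip()
    if not url or not title:
        return  # validation failed: emit nothing for this record
    yield url, {
        "url": url,
        "title": " ".join(title.split()),  # collapse repeated whitespace
        "crawled_at": raw.get("crawled_at", ""),  # ISO date string assumed
    }


def reduce_ads(url: str, versions: List[Ad]) -> Ad:
    """Reduce step: keep the most recently crawled version of one ad."""
    return max(versions, key=lambda ad: ad["crawled_at"])


def run_job(raw_ads: Iterable[Ad]) -> Dict[str, Ad]:
    """Simulate the shuffle phase locally by grouping mapped values by key."""
    groups: Dict[str, List[Ad]] = defaultdict(list)
    for raw in raw_ads:
        for key, value in map_ad(raw):
            groups[key].append(value)
    return {url: reduce_ads(url, versions) for url, versions in groups.items()}
```

Because each map call only looks at one record, the validation work could in principle run on many machines in parallel, which is the kind of scaling the single main server could not provide.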

„SIDE EFFECTS”

Job ads historical data since last year

Search statistics since 2011

• Statistics

• Now-casting prediction

KÁROLY KÁSA, Development Manager, Precognox

karoly.kasa@precognox.com

Kereső világ

http://kereses.blog.hu - A blog about big data, search and text mining

Hungarian Natural Language Processing Meetup: http://www.meetup.com/Hungarian-nlp/

www.precognox.com
