Hadoop-based ETL and Solr-based semantic search
Post on 20-Jun-2015
HADOOP-BASED ETL AND SOLR-BASED SEMANTIC SEARCH BEHIND JOBMONITOR.HU
KÁROLY KÁSA, DEVELOPMENT MANAGER
PRECOGNOX IS A JAVA SHOP SPECIALIZED IN INTELLIGENT, LINGUISTIC-BASED SEARCH, TEXT MINING AND BIG DATA SOLUTIONS
Breathing search technologies for more than 10 years, and committed to quality craftsmanship as an agile team.
JOBMONITOR.HU
850 crawled job category pages
> 100k up-to-date job ads
> 150k search requests per day
Precognox Infoharvester ETL tool + Apache Solr-based Precognox search
FORMER ARCHITECTURE OF INFOHARVESTER
• Crawler client (Extract)
• Scheduler web application server
• Validator/data transformer (Transform)
• XML file storage (Load)
Requests available: "full" and "diff"
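The slide names the two request types without describing them. A minimal sketch of one plausible reading follows: a "full" request returns every stored ad, while a "diff" request returns only what changed since the previous snapshot. The record shape, the function names, and the "upsert"/"delete" operation labels are assumptions made for illustration, not Infoharvester's actual interface.

```python
from typing import Dict, List

# Hypothetical record shape: job-ad ID -> ad fields.
Snapshot = Dict[str, Dict[str, str]]


def full_response(current: Snapshot) -> List[dict]:
    """Return every ad in the current snapshot as an 'upsert' entry."""
    return [{"op": "upsert", "id": ad_id, "ad": ad}
            for ad_id, ad in sorted(current.items())]


def diff_response(previous: Snapshot, current: Snapshot) -> List[dict]:
    """Return only the changes needed to turn `previous` into `current`."""
    changes = []
    for ad_id, ad in sorted(current.items()):
        if previous.get(ad_id) != ad:  # ad is new or was modified
            changes.append({"op": "upsert", "id": ad_id, "ad": ad})
    for ad_id in sorted(previous.keys() - current.keys()):  # ad disappeared
        changes.append({"op": "delete", "id": ad_id})
    return changes
```

Under this reading, a consumer such as a search indexer would apply a "diff" response incrementally and fall back to a "full" response when it needs to rebuild from scratch.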
Not a big data problem: we had a scalability problem
• XML file-based storage and processing was slow
• the memory capacity of the single main server could not scale
• meanwhile, the crawler machine's resources went unused
INFOHARVESTER HADOOP EDITION
[Architecture diagram. Components: Precognox Infoharvester, HBase, Data Mining, Precognox Search (Search Server, Search Index)]
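The transcript preserves only the component names of this slide. As a rough illustration of why a map-style design can scale where a single-server XML pipeline could not, the sketch below validates and transforms each crawled record independently and emits a row key plus columns, the general shape a key-value store such as HBase expects. The required fields, the SHA-1 row-key scheme, the "ad" column family, and all function names are hypothetical and are not taken from the talk.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Optional, Tuple

REQUIRED_FIELDS = ("url", "title")  # hypothetical validation rule

Row = Tuple[str, Dict[str, str]]


def row_key(url: str) -> str:
    """Derive a stable row key from the ad URL (assumed scheme)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def transform(record: Dict[str, str]) -> Optional[Row]:
    """Validate and normalize one crawled record; return None if invalid."""
    if any(not record.get(field, "").strip() for field in REQUIRED_FIELDS):
        return None
    # "family:qualifier" column names, with a made-up "ad" family.
    columns = {f"ad:{name}": value.strip() for name, value in record.items()}
    return row_key(record["url"].strip()), columns


def map_records(records: Iterable[Dict[str, str]]) -> Iterator[Row]:
    """Map step: no record depends on another, so the input can be
    split across any number of workers."""
    for record in records:
        row = transform(record)
        if row is not None:
            yield row
```

Because `transform` keeps no state between records, the same function could run on many machines at once, which is the property the former single-server design lacked.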
"SIDE EFFECTS"
Job ads historical data since last year
Search statistics since 2011
• Statistics
• Now-casting prediction
KÁROLY KÁSA, Development Manager, Precognox
karoly.kasa@precognox.com
Kereső világ
http://kereses.blog.hu - A blog about big data, search and text mining
Hungarian Natural Language Processing Meetup: http://www.meetup.com/Hungarian-nlp/
www.precognox.com