introducing hydra – an open source document processing framework
DESCRIPTION
Presented by Joel Westberg, Findwise AB - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012 This presentation will detail the document-processing framework called Hydra that has been developed by Findwise. It is intended as a description of the framework and the problem it aims to solve. We will first discuss the need for scalable document processing, outlining that there is a missing link between the open source chain to bridge the gap between source system and the search engine, then will move on to describe the design goals of Hydra, as well as how it has been implemented to meet those demands on flexibility, robustness and ease of use. This session will end by discussing some of the possibilities that this new pipeline framework can offer, such as freely seamlessly scaling up the solution during peak loads, metadata enrichment as well as proposed integration with Hadoop for Map/Reduce tasks such as page rank calculations.TRANSCRIPT
© FINDWISE 2012
Introducing Hydra !An Open Source Document Processing Framework!
Joel Westberg
Tänk på följande: • Skriv ej långa texter • Fly?a ej textrutor • Skapa luBiga bilder • Ändra ej typsni? (HelveGca och Calibri finns på MOSSEN)
• Founded in 2005
• Offices in Sweden, Denmark, Norway and Poland
• 80 employees (April 2012)
• Our objecGve is to be a leading provider of Findability soluGons uGlising
the full potenGal of search technology to create customer business value
About Findwise
The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again.
Tänk på följande: • Skriv ej långa texter • Fly?a ej textrutor • Skapa luBiga bilder • Ändra ej typsni? (HelveGca och Calibri finns på MOSSEN)
Technology independent! CreaGng search-‐driven Findability soluGons based on market-‐leading commercial and open source search technology plaYorms:
ü Autonomy IDOL ü MicrosoB (SharePoint and FAST Search products)
ü Google GSA
ü IBM ICA/OmniFind
ü LucidWorks
ü Apache Lucene/Solr
ü ElasGc Search ü and more…
Generic Search Architecture !
Connecting source to search!
Garbage in, garbage out. But what about unstructured data in?
• Flat data is richer than it appears
• Don’t discard informaGon too soon!
The unstructured structured data paradox
Example: News arGcles
Plain text that contains invaluable metadata for search, such as:
• Title
• Author byline
• Lead paragraph
Enrichment and structuring possibilities!
• Enrich your documents with metadata, to power your search • Language detection
• Sentiment analysis
• Headline extraction
• Regular expression matching and extraction
• Filter out unwanted documents
• Collect staGsGcs
• Export to Staging environments
Classic Pipeline!
Classic Architecture!
The Hydra Architecture!
Main Design Objectives!Scalability
• Horizontally scalable central repository
• Independent processing nodes
Failiure tolerant • Failiure of a stage affects only a single document
• Failiure of a node affects at most n documents
• Failiures can be automaticly detected
Robustness • Independent stages
Development ease • Debug stages from IDE against actual data
• Allow test driven pipeline development
The Hydra Architecture!
Writing a Stage - Example!@Stage(descripGon="This is a Simple Writer")
public class SimpleWriter extends AbstractProcessStage {
@Parameter(descripGon="Name of field to write value to")
private String field;
@Parameter(descripGon="Value to write")
private Object value;
@Override
public void process(LocalDocument doc) throws ProcessExcepGon {
doc.putContentField(field, value);
}
@Override
public void init() throws RequiredArgumentMissingExcepGon {
if(field==null) throw new RequiredArgumentMissingExcepGon("field is missing");
}
}
Hadoop/Big Data integration!
Usecases for document enrichment • Pagerank • AnalyGcs Hadoop & Map/Reduce advantages • Huge scalability • Ability to work on enGre document set at once
Hadoop & Map/Reduce drawbacks • Batch processing • Time-‐to-‐index
Hadoop/Big Data integration!
Blue – First round of indexing only Red – Second round of indexing Purple – All documents
Future Configuration UI!
Open Source initiative!
• Other commi?ers
• The role of Findwise
For more informaNon:
• h?p://www.findwise.com/hydra
• h?p://findwise.github.com/Hydra
• Email: [email protected]
Questions?!
Joel Westberg [email protected]
Thankyou!!