kelly boccia abi natarajan konstantin livitski senthil anand subbanan meyyappan 1
TRANSCRIPT
Agenda
• Business Requirements: Client Overview, Business Problem, Business Goal, Solution and Scope
• Technical Specification: System Context, Architecture Overview, Components & Modules, Security Model, Document indexing, Search Explained
• Implementation Plan: Resource & Costs, Development Environment, Production Environment, Success Criteria
• Prototype
• Q&A
2
Multi-National Manufacturing & Sales Corporation
Business Growth - Multiple Applications - Multiple Repositories
Business Problem
3
Business Goal
Organize Intellectual Capital and Assets
Accessibility - Connect knowledge workers securely to relevant information
Productivity - Increase productivity and reduce re-work by leveraging knowledge and expertise
Client Overview
4
Security Model
• Integrated with GLOCO's existing security infrastructure
• Any access requires authentication
• To follow a link in search results, a user may need additional authorization for repository access
9
Document indexing
• Document is anything that a search result can point at
• Documents are external to the search engine
• Documents include text and metadata
• Lucene sees each document as a set of named fields
10
How search works
• Lucene sees each document as a set of named fields
• A record is created for each document to store some fields
o URL is usually a stored field
• The main index is keyed by search term (i.e. inverted)
o Typical text fields are tokenized, filtered, and stemmed into terms
o Indexed fields may be discarded after processing
o For each term, a list of document IDs is stored to help locate records
o Also stores frequency and proximity
• Search involves retrieval of document IDs by term, and stored fields by the document ID
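The flow above can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's API: the names TinyIndex, analyze, and STOP_WORDS, the crude plural-stripping "stemmer", and the example URLs are all assumptions made for the sketch.

```python
import re
from collections import defaultdict

# Hypothetical stop-word filter; a real analyzer is configurable.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is"}


def analyze(text):
    """Tokenize, filter, and crudely stem text into (position, term) pairs."""
    terms = []
    for position, token in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
        if token in STOP_WORDS:
            continue
        # Naive stand-in for a stemmer: strip a trailing plural "s".
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        terms.append((position, token))
    return terms


class TinyIndex:
    def __init__(self):
        self.stored = {}                   # doc ID -> stored fields (the "record")
        self.postings = defaultdict(dict)  # term -> {doc ID: [positions]}

    def add(self, url, body):
        doc_id = len(self.stored)
        # Only the URL is stored; the body is discarded after processing.
        self.stored[doc_id] = {"url": url}
        for position, term in analyze(body):
            # The position list gives both frequency (its length) and proximity.
            self.postings[term].setdefault(doc_id, []).append(position)
        return doc_id

    def search(self, query):
        """Return URLs of documents that contain every query term."""
        terms = [term for _, term in analyze(query)]
        if not terms:
            return []
        ids = None
        for term in terms:
            matching = set(self.postings.get(term, {}))
            ids = matching if ids is None else ids & matching
        # Second lookup: stored fields by document ID.
        return [self.stored[doc_id]["url"] for doc_id in sorted(ids)]
```

A search is two lookups: term to document IDs in the inverted index, then document ID to the stored URL. The body text itself never needs to be kept in the index.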
11
Resource / Cost Plan
• 21 weeks total effort
• 13 member team including GLOCO and Innova
• INNOVA supports full SDLC with phases - Solution Outline, High Level Design, Detailed Design, Build / Test / Deploy and Post Production Support
12
SLATES - Development Environment
• Developer workstations host virtual images
• Developer workstations share the development search servers
• Fully configured environment for unit testing and development
13
SLATES - QA / Test and Production
• Sticky load balancer to remember the serving Tomcat
• Each search server to hold multiple instances
• Shared / cached network storage to share index
• Similar configuration for both QA and Production environments
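The "sticky" behaviour can be illustrated with a minimal Python sketch. It assumes the balancer keeps a session-to-server table and hands new sessions out round-robin; the class name StickyBalancer and the backend names are hypothetical, and a production balancer would more likely rely on a session cookie than an in-memory table.

```python
import itertools


class StickyBalancer:
    """Remember which backend served a session and keep routing it there."""

    def __init__(self, backends):
        self._round_robin = itertools.cycle(backends)
        self._assigned = {}  # session ID -> backend name

    def route(self, session_id):
        # New sessions get the next backend in turn; known sessions stick.
        if session_id not in self._assigned:
            self._assigned[session_id] = next(self._round_robin)
        return self._assigned[session_id]
```

For example, with backends "tomcat-1" and "tomcat-2", every request carrying the same session ID lands on the Tomcat instance that served its first request.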
14
Success Criteria and Benchmarks
Most important project success criteria are:
• 10% time and resource savings on certain R&D activities
• 75% positive feedback on user surveys
• 50% of the target user group are actively using the system
• 5% of available documents have user-defined tags
15
Thank you!
Innova would like to thank:
Zoya Kinstler
Jeff Parker
Basem Naseim
Valar Jayaprakash
Classmates Harvard University Extension School
24
Index Growth
• Index size is a percentage of the document corpus size
• Maintenance trade-off:
o Expensive segment merges - load all segments, write a new one
o Fragmented index is expensive to query - must read all segments
• Lucene index segments are write-once - helps with concurrency
• Updates are done as delete - re-add. Updates should be batched
o Direct tagging is inefficient
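The write-once and delete - re-add behaviour can be sketched as a toy model in Python. This is not Lucene's implementation: SegmentedIndex, flush, live_docs, and merge are hypothetical names, and a segment here is simply a read-only mapping from document ID to text.

```python
from types import MappingProxyType


class SegmentedIndex:
    def __init__(self):
        self.segments = []    # read-only mappings: doc ID -> text
        self.deleted = set()  # (segment number, doc ID) tombstones

    def flush(self, batch):
        """Write one batch of adds/updates as a single new segment."""
        # An update is a delete of the old copy followed by a re-add.
        for seg_no, segment in enumerate(self.segments):
            for doc_id in batch:
                if doc_id in segment:
                    self.deleted.add((seg_no, doc_id))
        # Existing segments are never modified; a new one is appended.
        self.segments.append(MappingProxyType(dict(batch)))

    def live_docs(self):
        """Querying a fragmented index must read every segment."""
        docs = {}
        for seg_no, segment in enumerate(self.segments):
            for doc_id, text in segment.items():
                if (seg_no, doc_id) not in self.deleted:
                    docs[doc_id] = text
        return docs

    def merge(self):
        """Expensive: load all segments, write a single new one."""
        merged = self.live_docs()
        self.segments = [MappingProxyType(merged)]
        self.deleted.clear()
```

In this model, flushing one document at a time (as direct tagging would) creates one segment per change, which is why batching updates keeps both query cost and merge frequency down.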
28
Scalability
(Source: Mark Miller, "Scaling Lucene and Solr", Lucid Imagination, 2010)
• Query volume is scaled by replication
• Index size and indexing load are scaled by sharding
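A minimal Python sketch of the two scaling axes, under stated assumptions: documents are routed to a shard by a CRC32 hash of their ID, each shard keeps identical replicas, and queries rotate across replicas. The Shard and Cluster classes are hypothetical and only model the routing, not Solr's distributed search.

```python
import itertools
import zlib


class Shard:
    """One slice of the index, copied onto several replicas."""

    def __init__(self, replica_count):
        self.replicas = [dict() for _ in range(replica_count)]
        self._next = itertools.cycle(range(replica_count))

    def index(self, doc_id, text):
        # Every replica of the shard receives the same document.
        for replica in self.replicas:
            replica[doc_id] = text

    def search(self, term):
        # Replication: each query is served by the next replica in turn.
        replica = self.replicas[next(self._next)]
        return [d for d, text in replica.items() if term in text.lower().split()]


class Cluster:
    def __init__(self, shard_count, replica_count):
        self.shards = [Shard(replica_count) for _ in range(shard_count)]

    def shard_for(self, doc_id):
        # Sharding: a stable hash of the ID picks exactly one shard.
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)

    def index(self, doc_id, text):
        self.shards[self.shard_for(doc_id)].index(doc_id, text)

    def search(self, term):
        # A query fans out to every shard and the hits are combined.
        hits = []
        for shard in self.shards:
            hits.extend(shard.search(term))
        return sorted(hits)
```

Adding shards shrinks the slice of the index each server holds and indexes; adding replicas spreads the query load for a slice over more servers.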
29
Phase 1 - Work Break Down Chart
• 21 weeks total effort
• 13 member team including GLOCO and Innova
• INNOVA supports full SDLC with phases - Solution Outline, High Level Design, Detailed Design, Build / Test / Deploy and Post Production Support
30