Apache ManifoldCF @ Linux Day 2012

Apache ManifoldCF

Upload: piergiorgio-lucidi

Post on 22-May-2015

Category: Technology

DESCRIPTION

An overview of Apache ManifoldCF with an introduction to repositories and search servers. Includes an overview of the latest improvements and new features.

TRANSCRIPT

Page 1: Apache ManifoldCF @ Linux Day 2012

Apache ManifoldCF

Page 2: Apache ManifoldCF @ Linux Day 2012

About me
● Open Source ECM Specialist at Sourcesense
● Author and Technical Reviewer at Packt Publishing
  ○ Alfresco 3 Web Services (2010)
  ○ GateIn Cookbook (2012)
● Alfresco Community (nickname OpenPj)
  ○ Alfresco Wiki Gardener
  ○ Top 10 supporter (English and Italian)
  ○ Moderator of the Italian forum
● PMC Member at the Apache Software Foundation
● JBoss Community
  ○ Content editor for jboss.org
  ○ Project Leader and Committer for PortletSwap

Page 3: Apache ManifoldCF @ Linux Day 2012

Overview
● The story
● What is ManifoldCF?
  ○ What is a repository?
  ○ What is a search server?
● Why ManifoldCF?
● Architecture
● The growing path
  ○ The 0.3-incubating version
  ○ The 0.4-incubating version
  ○ The 0.5-incubating version
  ○ The 0.6 version (graduated ^__^)
  ○ What's new in the 1.0.1 version
● The book: ManifoldCF in Action
● Demo
● Resources

Page 4: Apache ManifoldCF @ Linux Day 2012

The story

The original ManifoldCF code base was granted by MetaCarta, Inc. to the Apache Software Foundation in December 2009. The MetaCarta effort represented more than five years of successful development and testing in multiple, challenging enterprise environments. The project graduated as an Apache Top Level Project in July 2012.

^__^

Page 5: Apache ManifoldCF @ Linux Day 2012

What is ManifoldCF?

● Open Source crawler
  ○ schedule jobs to create indexes
    ■ get contents from repositories
    ■ push contents to search servers

[Diagram: Repository 1, Repository 2, Repository 3 → Apache ManifoldCF → Search Server 1, Search Server 2, Search Server 3]

Page 6: Apache ManifoldCF @ Linux Day 2012

What is ManifoldCF?

● Open Source crawler
  ○ schedule jobs to create indexes
    ■ get contents from repositories
    ■ push contents to search servers
● Out of the box, it is distributed as J2EE web apps
  ○ REST API
  ○ Authority Service
  ○ Crawler UI
● Can be embedded in any Java application
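
The "get contents from repositories, push contents to search servers" flow can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `InMemoryRepository`, `InMemorySearchServer` and `run_job` are hypothetical names, not part of ManifoldCF, which is a Java framework with its own API.

```python
from typing import Dict, List, Tuple

Document = Tuple[str, str]  # (document id, text content)


class InMemoryRepository:
    """Hypothetical stand-in for a content repository."""

    def __init__(self, documents: Dict[str, str]) -> None:
        self._documents = documents

    def fetch_all(self) -> List[Document]:
        # "get contents from repositories"
        return list(self._documents.items())


class InMemorySearchServer:
    """Hypothetical stand-in for a search server."""

    def __init__(self) -> None:
        self.indexed: Dict[str, str] = {}

    def push(self, doc_id: str, content: str) -> None:
        # "push contents to search servers"
        self.indexed[doc_id] = content


def run_job(repository: InMemoryRepository, server: InMemorySearchServer) -> int:
    """Copy every document from one repository to one search server."""
    pushed = 0
    for doc_id, content in repository.fetch_all():
        server.push(doc_id, content)
        pushed += 1
    return pushed


repo_a = InMemoryRepository({"a1": "annual report", "a2": "meeting notes"})
repo_b = InMemoryRepository({"b1": "design document"})
search_server = InMemorySearchServer()

# Two jobs, one per repository, both feeding the same search server.
total_pushed = run_job(repo_a, search_server) + run_job(repo_b, search_server)
```

In this sketch each job pairs a single repository with a single search server; several jobs together give the many-to-many picture from the slide.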

Page 7: Apache ManifoldCF @ Linux Day 2012

What is a repository?

● Open Source crawler
  ○ schedule jobs to create indexes
    ■ get contents from repositories
    ■ push contents to search servers

[Diagram: Repository 1, Repository 2, Repository 3 → Apache ManifoldCF → Search Server 1, Search Server 2, Search Server 3]

Page 8: Apache ManifoldCF @ Linux Day 2012

What is a repository?
● a central place to put and get contents
● contents are kept in an organized way
  ○ ER model is the old way
  ○ Node graph
    ■ properties (metadata)
    ■ associations
    ■ renditions
● base component of Enterprise Content Management (ECM) systems
● from the Latin repositorium
  ○ table of service
  ○ vessel
  ○ chamber
  ○ where to keep and find your things!!!

Page 9: Apache ManifoldCF @ Linux Day 2012

Enterprise Content Management

Enterprise content management (ECM) is a formalized means of organizing and storing an organization's documents, and other content, that relate to the organization's processes. The term encompasses strategies, methods, and tools used throughout the lifecycle of the content.

Wikipedia: http://en.wikipedia.org/wiki/Enterprise_content_management

Page 10: Apache ManifoldCF @ Linux Day 2012

Enterprise Content Management

[Diagram: ECM built from a Repository, Enterprise services, Workflows and processes (BPM), and Users + groups (LDAP, IDM)]

Page 11: Apache ManifoldCF @ Linux Day 2012

What is a repository? - You use it!!!
● Some simple examples:
  ○ SMTP servers
  ○ Google Drive
  ○ Dropbox
● Some Open Source repository implementations:
  ○ exoJCR
  ○ Apache Jackrabbit
● Some Open Source ECM systems for critical usage:
  ○ Alfresco
  ○ Nuxeo
  ○ Hippo

Page 12: Apache ManifoldCF @ Linux Day 2012

What is a repository? - Decoration

[Diagram: a Repository with its Indexes]
● apply metadata
● retrieve content using metadata
● Query languages: CMIS, JCR SQL, XPath, Lucene, Full Text (Google style)
● APIs: CMIS, JCR, REST, SOAP, IMAP, EMAIL, FTP

Page 13: Apache ManifoldCF @ Linux Day 2012

What is a repository? - Architecture

[Diagram, layers from top to bottom: APIs (CMIS, REST, FTP, WebDAV, IMAP); Model; Content Store and Indexes; Storage]

Page 14: Apache ManifoldCF @ Linux Day 2012

What is a repository? - Repo Model

● a different point of view on how to manage data
  ○ no more relational databases (ER)
● repositories offer you an API!
● based on the JCR Repository Model (JSR-283)
  ○ workspaces
  ○ identifiers
  ○ users
  ○ nodes and node types (contents)
    ■ properties and property types
    ■ associations (shared nodes)

Page 15: Apache ManifoldCF @ Linux Day 2012

What is a repository? - Repo Model

● A node is a generic piece of content stored in a repository
  ○ type
  ○ properties
  ○ associations
  ○ binary streams (optional)
    ■ renditions
    ■ text document
    ■ video
    ■ image
    ■ . . .

Page 16: Apache ManifoldCF @ Linux Day 2012

What is a repository? - Repo Model

[Diagram: a Node has a Type, Properties (metadata: name, description, mimetype, tags, categories) and Renditions (Binary 1, Binary 2, Binary 3)]
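
The node model described on these slides (type, properties, associations, optional binary streams) can be sketched as a small Python data structure. This is only an illustration under stated assumptions: the `Node` class and its field names are hypothetical and do not come from JCR, CMIS or ManifoldCF.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical repository node: a type, metadata, links and binaries."""

    node_type: str
    properties: Dict[str, object] = field(default_factory=dict)   # metadata
    associations: List["Node"] = field(default_factory=list)      # links to other nodes
    renditions: Dict[str, bytes] = field(default_factory=dict)    # optional binary streams

    def has_binary(self) -> bool:
        # Binary streams are optional, so a node may carry none at all.
        return bool(self.renditions)


report = Node(
    node_type="document",
    properties={
        "name": "report.pdf",
        "description": "Quarterly report",
        "mimetype": "application/pdf",
        "tags": ["finance"],
        "categories": ["reports"],
    },
    renditions={"original": b"%PDF", "thumbnail": b"\x89PNG"},
)

# A folder-like node that only holds metadata and an association.
folder = Node(node_type="folder", properties={"name": "Reports"}, associations=[report])
```

Keeping properties as a free-form dictionary reflects the idea that new node types are mostly a matter of configuration rather than schema changes.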

Page 17: Apache ManifoldCF @ Linux Day 2012

What is a repository? - Repo Model

[Diagram: a Repository contains Workspace 1, Workspace 2 and Workspace 3; a workspace holds a tree of nodes (A, B, C, D, E, G) under a Root node]

Page 18: Apache ManifoldCF @ Linux Day 2012

Why use a repository?
● adding new node types only requires adding configuration
● you can scale out easily
● storing very large amounts of data
● storing simple data structures, such as simple JSON documents
● looking up data by keys rather than using queries
● searching for data based upon relevance rather than criteria
● evolving schemas and/or data structures
● caching data in-memory for performance
● giving up consistency guarantees for increased availability

Page 19: Apache ManifoldCF @ Linux Day 2012

Why use a repository?
● Standard API
  ○ Content Management Interoperability Services (CMIS)
  ○ Java Content Repository (JCR)
● Hierarchical structure
● Transaction support
● Versioning
● Locking
● Observation
● References
● Navigation services
  ○ parents
  ○ children
  ○ associated
● Search services

Page 20: Apache ManifoldCF @ Linux Day 2012

What is a search server?

● Open Source crawler
  ○ schedule jobs to create indexes
    ■ get contents from repositories
    ■ push contents to search servers

[Diagram: Repository 1, Repository 2, Repository 3 → Apache ManifoldCF → Search Server 1, Search Server 2, Search Server 3]

Page 21: Apache ManifoldCF @ Linux Day 2012

What is a search server?

A search server is an application that allows users to find repository contents quickly using:

● keywords (full text search)
● content fields
● tags
● categories
● ranking

The information it keeps is stored as indexes.
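
To make "the information it keeps is stored as indexes" concrete, here is a minimal sketch of an inverted index with a naive ranking, written in Python. It is an assumption-laden toy: `TinyIndex` is a hypothetical class, and real search servers such as Apache Solr use far richer analysis and scoring.

```python
from collections import defaultdict
from typing import Dict, List


class TinyIndex:
    """Hypothetical inverted index: word -> {document id -> occurrences}."""

    def __init__(self) -> None:
        self._postings: Dict[str, Dict[str, int]] = defaultdict(dict)

    def add(self, doc_id: str, text: str) -> None:
        # Split on whitespace and count how often each word occurs per document.
        for word in text.lower().split():
            self._postings[word][doc_id] = self._postings[word].get(doc_id, 0) + 1

    def search(self, query: str) -> List[str]:
        # Score each document by the total occurrences of the query words,
        # then return ids by descending score (ties broken by id).
        scores: Dict[str, int] = defaultdict(int)
        for word in query.lower().split():
            for doc_id, count in self._postings.get(word, {}).items():
                scores[doc_id] += count
        return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


index = TinyIndex()
index.add("d1", "search server indexes")
index.add("d2", "search search engine")
```

A keyword lookup then never touches the original documents: `index.search("search")` only reads the postings built at ingestion time.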

Page 22: Apache ManifoldCF @ Linux Day 2012

What is a search server?

[Diagram: search server components: REST API, Indexes, Storage]

Page 23: Apache ManifoldCF @ Linux Day 2012

Why ManifoldCF?

● Reliability
● Incremental
● Multi repositories
● Security model
● Monitoring

Page 24: Apache ManifoldCF @ Linux Day 2012

Why ManifoldCF? - Reliability

Job scheduling and configuration are stored in the database to maintain the state of all executions.

[Diagram: the Pull Agent Daemon sits between the Repository and the Search Server and keeps configuration and scheduling in the Database]
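
The idea of keeping job state in a database so that executions survive restarts can be sketched as follows. This is a hedged illustration, not ManifoldCF's schema: the `job_state` table, its columns and the helper functions are hypothetical, and SQLite is used here only because it ships with Python.

```python
import sqlite3
from typing import Optional, Tuple


def open_state_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store that remembers job state."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_state ("
        "job_name TEXT PRIMARY KEY, status TEXT NOT NULL, last_doc TEXT)"
    )
    return conn


def save_state(conn: sqlite3.Connection, job_name: str,
               status: str, last_doc: Optional[str]) -> None:
    # The 'with' block commits on success, so the state outlives the process
    # when the store is backed by a file.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO job_state VALUES (?, ?, ?)",
            (job_name, status, last_doc),
        )


def load_state(conn: sqlite3.Connection,
               job_name: str) -> Optional[Tuple[str, Optional[str]]]:
    # Returns (status, last_doc), or None when the job is unknown.
    return conn.execute(
        "SELECT status, last_doc FROM job_state WHERE job_name = ?",
        (job_name,),
    ).fetchone()


conn = open_state_store()
save_state(conn, "nightly-crawl", "running", "doc-17")
state_while_running = load_state(conn, "nightly-crawl")
save_state(conn, "nightly-crawl", "done", None)
state_when_done = load_state(conn, "nightly-crawl")
```

With a file path instead of `:memory:`, a restarted process could call `load_state` and continue from the last recorded document.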

Page 25: Apache ManifoldCF @ Linux Day 2012

Why ManifoldCF? - Incremental

Jobs can be optionally configured to re-visit contents incrementally

[Diagram: a Repository with nodes N1, N2 and N4, crawled by Apache ManifoldCF]
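
One way to picture incremental re-visiting is to compare a version marker per document with the one recorded on the previous crawl. The sketch below assumes such a marker exists; `incremental_pass` and the version strings are hypothetical and only illustrate the principle, not ManifoldCF's actual bookkeeping.

```python
from typing import Dict, List


def incremental_pass(current: Dict[str, str],
                     seen: Dict[str, str]) -> Dict[str, List[str]]:
    """Decide what to re-index and what to delete.

    current: document id -> version marker found in the repository now
    seen:    document id -> version marker recorded by the previous pass
             (updated in place so the next pass can compare against it)
    """
    to_index = sorted(d for d, version in current.items() if seen.get(d) != version)
    to_delete = sorted(d for d in seen if d not in current)
    for doc_id in to_delete:
        del seen[doc_id]
    seen.update(current)
    return {"index": to_index, "delete": to_delete}


seen_versions: Dict[str, str] = {}

# First pass: everything is new.
first = incremental_pass({"N1": "v1", "N2": "v1", "N3": "v1", "N4": "v1"}, seen_versions)

# Second pass: only N1, N2 and N4 changed, so N3 is skipped.
second = incremental_pass({"N1": "v2", "N2": "v2", "N3": "v1", "N4": "v2"}, seen_versions)

# Third pass: N3 disappeared from the repository.
third = incremental_pass({"N1": "v2", "N2": "v2", "N4": "v2"}, seen_versions)
```

The payoff is that an unchanged repository costs a comparison per document rather than a full re-ingestion.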

Page 26: Apache ManifoldCF @ Linux Day 2012

Why ManifoldCF? - Multi repositories

Jobs can retrieve contents from the following repositories:

● CMIS-compliant
● Alfresco
● IBM FileNet
● EMC Documentum
● Microsoft SharePoint
● OpenText LiveLink
● Autonomy Meridio
● Memex Patriarch
● Windows Share/DFS
● Generic JDBC
● Generic Filesystem
● Generic RSS and Web

Page 27: Apache ManifoldCF @ Linux Day 2012

Why ManifoldCF? - Multi repositories

Jobs can ingest contents into the following search servers:

● Apache Solr
● ElasticSearch
● OpenSearchServer
● MetaCarta GTS

Page 28: Apache ManifoldCF @ Linux Day 2012

Why ManifoldCF? - Security model

[Diagram: the Pull Agent Daemon retrieves per-content ACLs from Repository 1, 2 and 3 and passes doc access tokens to the Search Server; the Authority Service retrieves user access tokens from Authority 1, 2 and 3; the Search Server returns user-specific search results]
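
The matching of user access tokens against doc access tokens can be sketched in a few lines of Python. This is a simplified assumption of how such filtering might work: `visible_results`, the `allow` field and the token strings are hypothetical and are not taken from ManifoldCF or any search server plugin.

```python
from typing import Dict, Iterable, List


def visible_results(hits: List[Dict], user_tokens: Iterable[str]) -> List[str]:
    """Keep only the hits whose doc access tokens overlap the user's tokens."""
    tokens = set(user_tokens)
    visible = []
    for hit in hits:
        if tokens & set(hit.get("allow", [])):
            visible.append(hit["id"])
    return visible


# Doc access tokens were stored next to each document at ingestion time.
hits = [
    {"id": "doc-1", "allow": ["group:staff"]},
    {"id": "doc-2", "allow": ["group:hr"]},
    {"id": "doc-3", "allow": ["group:staff", "group:hr"]},
]

# User access tokens would come from an authority at query time.
alice_tokens = ["user:alice", "group:staff"]
bob_tokens = ["user:bob", "group:hr"]
```

The same query therefore yields different result lists per user, which is what "user-specific search results" means on the slide.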

Page 29: Apache ManifoldCF @ Linux Day 2012

Why ManifoldCF? - Monitoring

The Crawler UI allows you to:
● configure jobs and connectors
● monitor job execution
● monitor content ingestion
  ○ status reports
    ■ document status
    ■ queue status
  ○ history reports
    ■ simple history
    ■ maximum activity
    ■ maximum bandwidth
    ■ result histogram

Page 30: Apache ManifoldCF @ Linux Day 2012

Architecture

● Pull Agent Daemon
  ○ Jobs
    ■ Repository Connectors
    ■ Output Connectors
    ■ Authority Connectors

Page 31: Apache ManifoldCF @ Linux Day 2012

Architecture

● Pull Agent Daemon (the core service)
  ○ Jobs (execute the ingestion tasks)
    ■ Repository Connectors (retrieve contents)
    ■ Output Connectors (ingest contents)
    ■ Authority Connectors (retrieve ACLs)
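
The three connector roles listed above can be sketched as abstract interfaces plus a function that wires them together. The class and method names below are hypothetical Python names chosen for this illustration; they are not ManifoldCF's Java connector interfaces.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Optional, Tuple


class RepositoryConnector(ABC):
    @abstractmethod
    def fetch(self) -> Iterable[Tuple[str, str]]:
        """Retrieve contents as (document id, content) pairs."""


class OutputConnector(ABC):
    @abstractmethod
    def ingest(self, doc_id: str, content: str, acl: List[str]) -> None:
        """Ingest one document into a search server."""


class AuthorityConnector(ABC):
    @abstractmethod
    def acl_for(self, doc_id: str) -> List[str]:
        """Retrieve the ACL of one document."""


class DictRepository(RepositoryConnector):
    def __init__(self, documents: Dict[str, str]) -> None:
        self._documents = documents

    def fetch(self) -> Iterable[Tuple[str, str]]:
        return list(self._documents.items())


class ListOutput(OutputConnector):
    def __init__(self) -> None:
        self.received: List[Tuple[str, str, List[str]]] = []

    def ingest(self, doc_id: str, content: str, acl: List[str]) -> None:
        self.received.append((doc_id, content, acl))


class StaticAuthority(AuthorityConnector):
    def __init__(self, acls: Dict[str, List[str]]) -> None:
        self._acls = acls

    def acl_for(self, doc_id: str) -> List[str]:
        return self._acls.get(doc_id, [])


def run_job(repository: RepositoryConnector, output: OutputConnector,
            authority: Optional[AuthorityConnector] = None) -> None:
    # The authority is optional: without it, documents are ingested with an empty ACL.
    for doc_id, content in repository.fetch():
        acl = authority.acl_for(doc_id) if authority else []
        output.ingest(doc_id, content, acl)


output = ListOutput()
run_job(
    DictRepository({"d1": "hello", "d2": "world"}),
    output,
    StaticAuthority({"d1": ["group:staff"]}),
)
```

Because the job only depends on the abstract roles, supporting a new repository or search server means adding one connector class and leaving the job logic untouched.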

Page 32: Apache ManifoldCF @ Linux Day 2012

Architecture

[Diagram: the Pull Agent Daemon with its Database, connected to Repository 1, 2 and 3 and to Search Server 1, 2 and 3, alongside the Authority Service]

Page 33: Apache ManifoldCF @ Linux Day 2012

Architecture - Job

A job is a unit of ingestion work that consists of:
○ verbal description
○ repository connection
  ■ authority connection (optional)
○ metadata mapping
○ output connection (search server)
○ crawling model
○ scheduling information (on demand or time ranges)
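
The parts of a job listed above map naturally onto a small record type. The following Python sketch is an assumption made for illustration: the `Job` class, its field names and the example values are hypothetical and do not mirror ManifoldCF's job definition format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Job:
    """Hypothetical job record holding the parts listed on the slide."""

    description: str
    repository_connection: str
    output_connection: str
    authority_connection: Optional[str] = None                        # optional
    metadata_mapping: Dict[str, str] = field(default_factory=dict)    # source field -> index field
    crawling_model: str = "run to completion"                         # illustrative value
    schedule: List[Tuple[int, int]] = field(default_factory=list)     # (start hour, end hour)

    def is_on_demand(self) -> bool:
        # No time ranges means the job is only started on demand.
        return not self.schedule

    def may_run_at(self, hour: int) -> bool:
        # An on-demand job has no time restriction; otherwise the hour
        # must fall inside one of the configured ranges.
        return self.is_on_demand() or any(start <= hour < end for start, end in self.schedule)

    def map_metadata(self, source: Dict[str, str]) -> Dict[str, str]:
        # Rename mapped fields and pass unmapped ones through unchanged.
        return {self.metadata_mapping.get(key, key): value for key, value in source.items()}


nightly = Job(
    description="Index the document library every night",
    repository_connection="cmis-repository",
    output_connection="search-server",
    metadata_mapping={"cmis:name": "title"},
    schedule=[(22, 24), (0, 6)],
)
mapped = nightly.map_metadata({"cmis:name": "report.pdf", "author": "pj"})
```

Here the nightly window is expressed as two ranges because it crosses midnight; a job created without a schedule would report `is_on_demand()` as true.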

Page 34: Apache ManifoldCF @ Linux Day 2012

Architecture - Job

[Diagram: a Job links a Repository to a Search Server. The Job holds the verbal description, crawling model and scheduling. The Repository Connector runs the query to retrieve contents; the Authority Connector retrieves the content ACLs; the Output Connector handles metadata mapping and content ingestion]

Page 39: Apache ManifoldCF @ Linux Day 2012

What's new in the 1.0.1 version
● Microsoft SharePoint 2010 support
● JDBC Connector now manages metadata
● CMIS Connector upgraded to OpenCMIS 0.7.0
● Several bugfixes

Page 40: Apache ManifoldCF @ Linux Day 2012

The book: ManifoldCF in Action

ManifoldCF in Action, by Karl Wright, is published by Manning. Karl is the original developer and the principal committer of Apache ManifoldCF. The book is available at the following site: http://www.manning.com/wright

Page 41: Apache ManifoldCF @ Linux Day 2012

DEMO

Page 43: Apache ManifoldCF @ Linux Day 2012

Thank you for your attention!

^__^

http://www.open4dev.com