Apache ManifoldCF

Overview

● The story
● What is ManifoldCF?
● Why ManifoldCF?
● Architecture
● The 0.3-incubating version
● The 0.4-incubating version
● What's new in the 0.5-incubating version
● The book: ManifoldCF in Action
● Demo
● Resources

The story

The original ManifoldCF code base was granted by MetaCarta Inc. to the Apache Software Foundation in December 2009. The MetaCarta effort represented more than five years of successful development and testing in multiple challenging enterprise environments. The project entered the Apache Incubator because its community was not yet diverse enough; it is now moving towards graduation.

What is ManifoldCF?

● Open Source crawler
  ○ schedule jobs to create indexes
    ■ get contents from repositories
    ■ push contents to search servers
● Out of the box, it is distributed as J2EE web apps
  ○ REST API
  ○ Authority Service
  ○ Crawler UI
● Can be embedded in any Java application

Why ManifoldCF?

● Reliability
● Incremental
● Multi repositories
● Security model
● Monitoring

Why ManifoldCF? - Reliability

Job scheduling and configuration are stored in the database, which maintains the state of all executions

Why ManifoldCF? - Incremental

Jobs can optionally be configured to revisit contents incrementally
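One way to picture incremental revisiting is as a comparison of version strings: a document is ingested again only when its current version differs from the one recorded on the previous run. The sketch below is a simplified illustration in plain Java, not ManifoldCF code; the class `IncrementalCrawlSketch` and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of incremental crawling; not part of the ManifoldCF API.
class IncrementalCrawlSketch {
    // Version string recorded for each document the last time it was ingested.
    private final Map<String, String> lastSeenVersions = new HashMap<>();

    // Returns the ids of documents that are new or changed, and records their versions.
    List<String> crawl(Map<String, String> currentVersions) {
        List<String> toIngest = new ArrayList<>();
        for (Map.Entry<String, String> entry : currentVersions.entrySet()) {
            String previous = lastSeenVersions.get(entry.getKey());
            if (!entry.getValue().equals(previous)) {
                toIngest.add(entry.getKey());
                lastSeenVersions.put(entry.getKey(), entry.getValue());
            }
        }
        return toIngest;
    }
}
```

On a second call with an unchanged map, `crawl` returns an empty list, which is the behaviour an incremental job aims for.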

Why ManifoldCF? - Multi repositories

Jobs can retrieve contents from the following repositories:
● CMIS-compliant
● Alfresco
● IBM FileNet
● EMC Documentum
● Microsoft SharePoint
● OpenText LiveLink
● Autonomy Meridio
● Memex Patriarch
● Windows Share/DFS
● Generic JDBC
● Generic Filesystem
● Generic RSS and Web

Why ManifoldCF? - Multi repositories

Jobs can ingest contents into the following search servers:
● ElasticSearch
● OpenSearchServer
● Apache Solr
● MetaCarta GTS

Why ManifoldCF? - Security model

Retrieve per-content ACLs
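To show why per-content ACLs matter at search time, the following sketch filters a document by comparing a user's access tokens with the tokens stored alongside the document. It assumes ACLs are expressed as sets of allow and deny tokens; the class `AclSketch` and the method `canRead` are hypothetical and only illustrate the idea.

```java
import java.util.Set;

// Hypothetical model of per-document ACL filtering; names are illustrative only.
class AclSketch {
    // A document is readable when the user holds at least one allow token and no deny token.
    static boolean canRead(Set<String> userTokens, Set<String> allowTokens, Set<String> denyTokens) {
        boolean denied = userTokens.stream().anyMatch(denyTokens::contains);
        boolean allowed = userTokens.stream().anyMatch(allowTokens::contains);
        return allowed && !denied;
    }
}
```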

Why ManifoldCF? - Monitoring

The Crawler UI allows you to:
● configure jobs and connectors
● monitor job execution
● monitor content ingestion
  ○ status reports
    ■ document status
    ■ queue status
  ○ history reports
    ■ simple history
    ■ maximum activity
    ■ maximum bandwidth
    ■ result histogram

Architecture

● Pull Agent Daemon (the core service)
  ○ Jobs (execute the ingestion tasks)
    ■ Repository Connectors (retrieve contents)
    ■ Output Connectors (ingest contents)
    ■ Authority Connectors (retrieve ACLs)
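The division of labour between the three connector types can be sketched with three small interfaces and a loop that plays the part of a job. The roles follow the labels on the slide; the interface and method names are hypothetical and far simpler than the real ManifoldCF connector interfaces.

```java
import java.util.List;

// Hypothetical, simplified roles; the real ManifoldCF connector interfaces are richer.
class ArchitectureSketch {
    interface RepositoryConnector { List<String> fetchDocuments(); }               // retrieve contents
    interface AuthorityConnector { List<String> aclFor(String document); }         // retrieve ACLs
    interface OutputConnector { void ingest(String document, List<String> acl); }  // ingest contents

    // One job run: pull each document, look up its ACL, push both to the output.
    static int runJob(RepositoryConnector repository, AuthorityConnector authority, OutputConnector output) {
        int ingested = 0;
        for (String document : repository.fetchDocuments()) {
            output.ingest(document, authority.aclFor(document));
            ingested++;
        }
        return ingested;
    }
}
```

Because each role is a single-method interface, any of the three can be swapped independently, which mirrors how a job combines one repository connection with one output connection.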

Architecture - Job

A job is a unit of ingestion work that consists of:
  ○ verbal description
  ○ repository connection
    ■ authority connection (optional)
  ○ metadata mapping
  ○ output connection (search server)
  ○ crawling model
  ○ scheduling information (on demand or time ranges)
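The parts of a job listed above map naturally onto a small value object. The class below is a hypothetical sketch for illustration, not a ManifoldCF class; the field types (plain strings, a map for the metadata mapping) are assumptions.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical value object mirroring the job parts listed above; not a ManifoldCF class.
final class JobSketch {
    enum Schedule { ON_DEMAND, TIME_RANGES }

    final String description;
    final String repositoryConnection;
    final Optional<String> authorityConnection; // optional, as on the slide
    final Map<String, String> metadataMapping;  // source field -> index field (assumed shape)
    final String outputConnection;              // the search server
    final String crawlingModel;
    final Schedule schedule;

    JobSketch(String description, String repositoryConnection, String authorityConnection,
              Map<String, String> metadataMapping, String outputConnection,
              String crawlingModel, Schedule schedule) {
        this.description = description;
        this.repositoryConnection = repositoryConnection;
        this.authorityConnection = Optional.ofNullable(authorityConnection);
        this.metadataMapping = Map.copyOf(metadataMapping);
        this.outputConnection = outputConnection;
        this.crawlingModel = crawlingModel;
        this.schedule = schedule;
    }
}
```

Passing `null` for the authority connection yields an empty `Optional`, which models a job that crawls without security information.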

The 0.3-incubating version

● CMIS Repository Connector
● OpenSearchServer Output Connector
● Scripting Language
● New Maven build process
● Several bug fixes

The 0.4-incubating version

● Alfresco Connector
● JDBC Connector now supports MySQL
● CMIS Connector upgraded to OpenCMIS 0.5.0
● Several bug fixes

What's new in the 0.5-incubating version

● Apache Velocity for connectors UI templates
● ElasticSearch Output Connector
● CMIS Connector upgraded to OpenCMIS 0.6.0
● Prebuild connector support: just add jars and go!
● New Japanese localization
● Several bug fixes

The book: ManifoldCF in Action

ManifoldCF in Action, by Karl Wright, published by Manning.
Karl is the original developer and the principal committer of Apache ManifoldCF.
The book is available at the following site: http://www.manning.com/wright

Resources

Homepage: http://incubator.apache.org/connectors

Download page: http://incubator.apache.org/connectors/download.html

Thank you for your attention!
