
Content Management Using Hadoop

We live in a world of cloud computing, best-of-breed applications and BYOX (bring your own things). Companies are opening up to the idea of providing freedom and choice of technology and tools. Freedom to use the tools and applications of one's choice shortens the learning curve and promotes focus on innovation and efficiency. But this freedom comes at a cost. Enterprises need strong technology infrastructure and processes to support a variety of applications, tools and platforms while ensuring security, privacy and compliance.

Our experience working with the publishing industry has let us observe this bittersweet truth first hand. In the publishing world, content is generated by many internal and external contributors. In most cases it is impossible to enforce a single content management system and a single ideation-to-publish process, so companies end up with a large amount of content being generated from discrete systems in various formats. The content that accumulates is typically large in volume, unstructured, arriving in waves and inconsistent.

Learn more about Portals and Content Management Services.

For efficient and consistent publishing of quality content, it is very important to have a set of common formats in place for content and digital asset management. Common formats promote efficiency, modularity, standardization and reuse of content and other digital assets. Big data platforms like Hadoop can come in handy for publishing firms that need to apply a layer of common formats and processes on top of the large amount of unstructured content they accumulate from discrete systems and individuals. The Hadoop ecosystem provides the technology platform required to handle large volumes of unstructured content and to support an enterprise-scale publishing process.

At Jade Global, we have created a reference architecture, based on the Hadoop ecosystem, to support and enhance the publishing process. It draws on our experience working with companies dealing with large amounts of unstructured content from discrete systems. The architecture covers the most commonly sought-after functions of the publishing process, such as aggregation, filtering, curation, classification, indexing, standardization, modularization and workflow. There are many more Hadoop ecosystem components with potential usefulness for content management and publishing, but the reference architecture covers the most commonly used functions. It is also possible to slice the ecosystem and implement each function separately on top of Hadoop Core.


Core Functions of the Reference Architecture:

Aggregate:

Flume agents and sinks are very efficient at collecting unstructured data from discrete systems. In a typical configuration, each source system is assigned a dedicated Flume agent, which is configured to collect data in a format that the source system is capable of providing. The beauty of Flume is that it supports various formats, so there is no need to change the source systems. At Jade Global, our team can also create custom Flume connectors to collect data from unsupported proprietary systems. The function of the Flume sink is to apply filters to incoming data and store it in the Hadoop Distributed File System. A sink can be used to filter out data that is not needed further in the publishing process, or it can perform simple transformation functions before storing content.
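A Flume agent itself is wired up through a properties file; the kind of filtering described above is commonly written in Java as a custom interceptor (or sink) on the agent's path to HDFS. The sketch below is a minimal, hypothetical interceptor that keeps only events carrying a recognisable content-type header; the header name and allowed values are assumptions for illustration, not part of any particular client implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Hypothetical Flume interceptor: drop events whose "content-type" header is
// not on a small allow-list, so only publishable formats reach the HDFS sink.
public class ContentTypeFilterInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // no state to set up in this sketch
  }

  @Override
  public Event intercept(Event event) {
    Map<String, String> headers = event.getHeaders();
    String type = headers.get("content-type");   // assumed header name
    if (type != null && (type.contains("xml") || type.contains("html") || type.contains("text"))) {
      return event;   // keep publishable content
    }
    return null;      // returning null tells Flume to drop the event
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> kept = new ArrayList<>();
    for (Event e : events) {
      Event out = intercept(e);
      if (out != null) {
        kept.add(out);
      }
    }
    return kept;
  }

  @Override
  public void close() {
    // nothing to release
  }

  // Builder class referenced from the agent's properties file.
  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new ContentTypeFilterInterceptor();
    }

    @Override
    public void configure(Context context) {
      // the allow-list could be made configurable here
    }
  }
}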

Storage:

The Hadoop Distributed File System (HDFS) provides reliable, high-performance storage for structured and unstructured data. Because of its high-performance access and support for unstructured data, HDFS is perfectly suited to storing unstructured content from various source systems. Jade's Hadoop team specializes in installing, administering, configuring and maintaining Hadoop Core components like HDFS.
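For reference, landing a content file in HDFS looks roughly like the following sketch using the standard Java FileSystem API; the namenode URI and target path are placeholders, not values from a real deployment.

import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write one piece of aggregated content into HDFS.
public class HdfsContentWriter {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS would normally come from core-site.xml; set here for clarity
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

    try (FileSystem fs = FileSystem.get(conf);
         OutputStream out = fs.create(new Path("/content/raw/article-0001.xml"))) {
      out.write("<article>example body</article>".getBytes(StandardCharsets.UTF_8));
    }
  }
}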

Standardize:

MapReduce is Hadoop's data analysis, manipulation and programming engine. It delivers high-performance data transformation capability with relatively little programming effort. With MapReduce's ability to read, analyze and transform large volumes of unstructured data at high speed, it becomes the powerhouse for standardizing content into the format the enterprise publishing process requires. Jade's specialists have experience developing MapReduce-based standardization processes, including removing unnecessary content (such as CSS styling and HTML tags), changing content from proprietary formats to industry-standard open formats, consolidating content files by type of content, modularizing content for future reuse, and identifying and cleaning up duplicates. Our passion and drive to explore better ways to transform unstructured data continue to deliver new ways to optimize MapReduce for our clients.
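As a flavour of what one standardization step can look like, here is a minimal map-only sketch that strips HTML tags from raw content lines before they are stored in a common format. The regex-based cleanup is deliberately naive and purely illustrative; a production job would use a real HTML parser and emit a structured record instead of plain text.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only standardization sketch: remove markup and collapse whitespace.
public class StripMarkupMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Text cleaned = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String withoutTags = line.toString()
        .replaceAll("<[^>]*>", " ")   // drop HTML/CSS markup tags
        .replaceAll("\\s+", " ")      // collapse whitespace
        .trim();
    if (!withoutTags.isEmpty()) {
      cleaned.set(withoutTags);
      context.write(new Text(String.valueOf(offset.get())), cleaned);
    }
  }
}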

Machine Learning:

Mahout is a high-speed, highly scalable machine learning platform that runs on top of Hadoop. The most common use cases of Mahout in the publishing process include automatic classification of content segments, identifying search tags for content segments, and automatically generating metadata for content. Automatic content classification over large amounts of unstructured data using Mahout can bring huge efficiency and standardization benefits to enterprises.
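As a rough sketch of the classification use case, the snippet below scores one content segment against a naive Bayes model trained earlier with Mahout's training job. The model path, dictionary size and the way term indices and weights are produced are assumptions for illustration; a real pipeline would reuse the dictionary and TF-IDF vectors created during training.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Sketch: classify one content segment with a pre-trained Mahout model.
public class ContentClassifierSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Load a model produced by a separately run Mahout training job
    // (model path is a placeholder).
    NaiveBayesModel model =
        NaiveBayesModel.materialize(new Path("/models/content-naive-bayes"), conf);
    StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

    // Build a sparse term-frequency vector for one content segment.
    // Term indices would normally come from the dictionary built at training time.
    Vector segment = new RandomAccessSparseVector(10000);
    segment.set(42, 3.0);    // e.g. "hadoop" appeared three times
    segment.set(977, 1.0);   // e.g. "publishing" appeared once

    // Score the segment against every trained label and pick the best one.
    Vector scores = classifier.classifyFull(segment);
    int bestLabelIndex = scores.maxValueIndex();
    System.out.println("Predicted label index: " + bestLabelIndex);
  }
}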

Search and Metadata:

As with standardization, MapReduce can run high-speed search indexing and metadata creation jobs over huge amounts of data. At Jade Global, we have devised highly efficient MapReduce-based processes to generate search indexes from various types of open and proprietary sources. We also specialize in automatically identifying custom metadata, based on a company's requirements, from unstructured, discrete content sources, and we assist our clients in installing, administering, configuring and maintaining the HBase database used to store content metadata and other transaction information. HBase is a Hadoop-based, column-oriented NoSQL database that delivers the convenience of a relational database with high scalability and very fast performance.
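To make the metadata-storage step concrete, here is a minimal sketch of persisting generated metadata for one content item into HBase through the standard client API. The table name ("content_meta"), column family ("meta") and row-key scheme are illustrative assumptions, not a prescribed schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: store content metadata as one row in an HBase table.
public class MetadataWriterSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("content_meta"))) {

      Put row = new Put(Bytes.toBytes("article-0001"));   // row key per content item
      row.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("title"),
          Bytes.toBytes("Content Management Using Hadoop"));
      row.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("tags"),
          Bytes.toBytes("hadoop,publishing,content"));

      table.put(row);
    }
  }
}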

Advantages of the Reference Architecture for the Publishing Process

1. Freedom and Productivity: Implementation of this reference architecture allows authors and contributors to use the platform of their choice for ideation, authoring and packaging content. Because the reference architecture includes a standardization process, the organization does not need to compromise on security, privacy and standards compliance while allowing discrete systems to generate content.

2. Common Formats and Processes: The reference architecture is designed to support common publishing processes and formats. With high-speed standardization support, the reference architecture and the Hadoop ecosystem allow organizations to define and enforce best practices and processes for publishing. They also allow continuous optimization of the publishing process and formats to keep up with changing business and technology needs.

3. Automation: The reference architecture and the Hadoop ecosystem enable organizations to automate large portions of the content publishing process, with the flexibility of human intervention as needed. Everything from content aggregation, standardization, classification, indexing and search optimization through to open-standard publishing can be automated using the Hadoop Oozie workflow engine; a minimal job-submission sketch appears after this list.

4. Open Format Publishing: This architecture promotes publication of content in open, industry-standard formats to achieve the flexibility of publishing to multiple platforms such as web, print, mobile, social, or even content resellers. This allows publishing businesses to explore non-traditional revenue streams and innovative ways to deliver content.

5. Time to Market: Automation, standardization, process focus and high-speed processing of large amounts of data enable businesses to publish content at a fast pace. In today's competitive world of content publishing, every second spent from ideation to publishing is critical to the success of content delivery, its popularity and the revenue it generates. The reference architecture and the Hadoop ecosystem enable enterprises to achieve best-in-class efficiency for the publishing process.
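The automation mentioned in point 3 is typically expressed as an Oozie workflow definition (an XML file stored in HDFS) and started through the Oozie client. The sketch below shows only the submission side; the Oozie URL, application path and user name are placeholders, and the workflow.xml that wires the aggregation, standardization, classification, indexing and publishing actions together is assumed to already exist in HDFS.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;

// Sketch: launch the end-to-end publishing workflow via the Oozie client API.
public class PublishingWorkflowLauncher {

  public static void main(String[] args) throws Exception {
    OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie");

    Properties jobConf = oozie.createConfiguration();
    jobConf.setProperty(OozieClient.APP_PATH,
        "hdfs://namenode.example.com:8020/workflows/publishing/workflow.xml");
    jobConf.setProperty("user.name", "publishing");

    // Submit and start the workflow; Oozie returns a job id for monitoring.
    String jobId = oozie.run(jobConf);
    System.out.println("Started publishing workflow: " + jobId);
  }
}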