metadata extraction and content transformation
DESCRIPTION
In this session, we will look first at the rich metadata that documents in your repository have, how to control the mapping of this on to your content model, and some of the interesting things this can deliver. We'll then move on to the content transformation and rendition services, and see how you can easily and powerfully generate a wide range of media from the content you already have.
TRANSCRIPT
![Page 1: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/1.jpg)
1
Metadata Extraction and Content Transformations
Nick Burch
Software Engineer, Alfresco
twitter: @gagravarr
![Page 2: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/2.jpg)
2
Introduction – 3 Content Related Services
Covering
• Uses
• Interfaces
• Calling the Services
• Java & JavaScript APIs
• Demos
• Extensions
• Apache Tika
• Metadata Extractor
• Content Transformer
• Renditions
![Page 3: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/3.jpg)
3
The Metadata Extractor Service
What, How, Why?
• For a given piece of content, returns the Metadata held within that
• Document Metadata is converted into the content model
• Typically used with uploaded binary files
• Upload a PDF, extract out the Title and Description, save these as the properties on the Alfresco Node
• Powered internally by a number of different extractors
• Service picks the appropriate extractor for you
• Since Alfresco 3.4, makes heavy use of Apache Tika
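The "service picks the appropriate extractor for you" step can be pictured as a lookup keyed by mime type. What follows is only a minimal sketch in plain Java: `ExtractorRegistrySketch`, `SimpleExtractor` and the toy plain-text extractor are hypothetical names made up for illustration, and none of them are Alfresco classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: none of these types are Alfresco classes.
final class ExtractorRegistrySketch {

    /** Stand-in for an extractor: turns raw bytes into name/value metadata. */
    interface SimpleExtractor {
        Map<String, String> extract(byte[] content);
    }

    private final Map<String, SimpleExtractor> byMimeType = new HashMap<>();

    void register(String mimeType, SimpleExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    /** Picks the extractor registered for the mime type; empty map if none is registered. */
    Map<String, String> extract(String mimeType, byte[] content) {
        SimpleExtractor extractor = byMimeType.get(mimeType);
        if (extractor == null) {
            return new LinkedHashMap<>();
        }
        return extractor.extract(content);
    }

    public static void main(String[] args) {
        ExtractorRegistrySketch registry = new ExtractorRegistrySketch();
        // Toy "plain text" extractor: the first line is treated as the title.
        registry.register("text/plain", content -> {
            Map<String, String> metadata = new LinkedHashMap<>();
            String text = new String(content, StandardCharsets.UTF_8);
            metadata.put("title", text.split("\\R", 2)[0]);
            return metadata;
        });
        byte[] upload = "Quarterly Report\nBody text".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("text/plain", upload));
    }
}
```

The caller only names the mime type; which extractor runs is the registry's decision, which is the point the slide makes.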
![Page 4: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/4.jpg)
4
The Content Transformation Service
What, How, Why?
• Transforms content from one format to another
• Driven by mime types, source and destination
• Used to generate plain text versions for indexing
• Used to generate SWF versions for preview
• Used to generate PDF versions for web download
• Powered by a large number of different transformers
• Transformers can be linked together, eg .doc -> .pdf via Open Office, then .pdf -> .swf via pdf2swf
• Since Alfresco 3.4, makes heavy use of Apache Tika
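Linking transformers together amounts to finding a path between a source and a destination mime type. The sketch below illustrates that idea with a breadth-first search over registered source/target pairs. It is an assumption-laden illustration, not how Alfresco itself selects or chains transformers, and `TransformerChainSketch` is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not an Alfresco class.
final class TransformerChainSketch {

    // source mime type -> target mime types reachable in a single step
    private final Map<String, List<String>> direct = new LinkedHashMap<>();

    void register(String source, String target) {
        direct.computeIfAbsent(source, k -> new ArrayList<>()).add(target);
    }

    /** Breadth-first search for the shortest chain; returns an empty list when none exists. */
    List<String> findChain(String source, String target) {
        Map<String, String> cameFrom = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(source, null);
        queue.add(source);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(target)) {
                // Walk back from the target to the source, then reverse.
                List<String> chain = new ArrayList<>();
                for (String step = target; step != null; step = cameFrom.get(step)) {
                    chain.add(step);
                }
                Collections.reverse(chain);
                return chain;
            }
            for (String next : direct.getOrDefault(current, Collections.emptyList())) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        TransformerChainSketch transformers = new TransformerChainSketch();
        transformers.register("application/msword", "application/pdf");            // eg via Open Office
        transformers.register("application/pdf", "application/x-shockwave-flash"); // eg via pdf2swf
        System.out.println(transformers.findChain("application/msword", "application/x-shockwave-flash"));
    }
}
```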
![Page 5: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/5.jpg)
5
The Rendition Service
What, How, Why?
• Can turn content from one kind to another
• Or can just alter some content as-is
• Used to manipulate images, eg crop and resize
• Used to generate HTML .docx previews in Web Quick Start
• Often uses the Content Transformation Service
• Replaced the Thumbnail Service
• Renditions are actions
![Page 6: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/6.jpg)
6
Apache Tika
Apache Tika – http://tika.apache.org/
• Apache Project which started in 2006
• Grew out of the Lucene community, now widely used
• Provides detection of files – eg this binary blob is really a Word file
• Plain text, HTML and XHTML versions of a wide range of different file formats
• Consistent Metadata from different files
• Tika hides the complexity of the different formats, and presents a simple, powerful API
• Easy to use and extend
![Page 7: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/7.jpg)
7
Metadata Extractor Service
![Page 8: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/8.jpg)
8
Alfresco 3.3 - Supported Formats
File Formats supported out of the box
• PDF
• Word, PowerPoint, Excel
• HTML
• Open Document Formats (OpenOffice)
• RFC822 Email
• Outlook .msg Email
![Page 9: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/9.jpg)
9
Alfresco 3.4 - Supported Formats – Page 1
File Formats supported out of the box, Page 1
• Audio – WAV, RIFF, MIDI
• DWG (CAD)
• Epub
• RSS and ATOM Feeds
• True Type Fonts
• HTML
• Images – JPEG, GIF, PNG, TIFF, Bitmap (including EXIF where found)
• iWork (Keynote, Pages etc)
• RFC822 mbox Mail
![Page 10: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/10.jpg)
10
Alfresco 3.4 - Supported Formats – Page 2
File Formats supported out of the box, Page 2
• Microsoft Outlook .msg Email
• Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
• Microsoft Office (OOXML) – Word, PowerPoint, Excel
• MP3 (id3 v1 and v2)
• CDF (Scientific Data)
• Open Document Format (Open Office)
• Old-style Open Office (.sxw etc)
• PDF
![Page 11: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/11.jpg)
11
Alfresco 3.4 - Supported Formats – Page 3
File Formats supported out of the box, Page 3
• Zip and Tar archives
• RDF
• Plain Text
• FLV Video
• XML
• Java class files
And I probably forgot one...!
![Page 12: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/12.jpg)
12
Calling Apache Tika
// Get a content detector, and an auto-selecting Parser
TikaConfig config = TikaConfig.getDefaultConfig();
ContainerAwareDetector detector = new ContainerAwareDetector(
    config.getMimeRepository()
);
Parser parser = new AutoDetectParser(detector);

// We'll only want the plain text contents
ContentHandler handler = new BodyContentHandler();

// Tell the parser what we have
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, filename);

// Have it processed
parser.parse(input, handler, metadata, new ParseContext());
![Page 13: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/13.jpg)
13
Metadata Extractor – Java Use
// Look up the registry bean (note Alfresco's "Extracter" spelling)
MetadataExtracterRegistry registry = (MetadataExtracterRegistry) context.getBean("metadataExtracterRegistry");

// Ask for the extractor that handles Excel files
MetadataExtracter extractor = registry.getExtracter("application/vnd.ms-excel");

// Extract from the node's content into a property map
Map<QName, Serializable> properties = new HashMap<QName, Serializable>();
ContentReader reader = contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
extractor.extract(reader, properties);
System.err.println(properties);
![Page 14: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/14.jpg)
14
Metadata Extractor – JavaScript Use
JavaScript
var action = actions.create("extract-metadata");
action.execute(document);
• Full access is not directly available
• You can’t get at the raw properties
• You can, however, trigger extraction and saving to the node easily
• Available via an action
![Page 15: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/15.jpg)
15
Metadata Extractor – Geo Content Model
<aspect name="cm:geographic">
  <title>Geographic</title>
  <properties>
    <property name="cm:latitude">
      <title>Latitude</title>
      <type>d:double</type>
    </property>
    <property name="cm:longitude">
      <title>Longitude</title>
      <type>d:double</type>
    </property>
  </properties>
</aspect>
![Page 16: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/16.jpg)
16
Metadata Extractor – Geo Mapping
# Namespaces
namespace.prefix.cm=http://www.alfresco.org/model/content/1.0

# Geo Mappings
geo\:lat=cm:latitude
geo\:long=cm:longitude

# Normal Mappings
author=cm:author
title=cm:title
description=cm:description
created=cm:created
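The mapping lines use `java.util.Properties` syntax, which is why the colon in `geo\:lat` is escaped: an unescaped `:` would end the key. The sketch below shows how such a file could be read and applied to raw extracted values. It is an illustration under assumptions, not Alfresco's own mapping code; `MappingSketch` and the sample values are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical illustration only: not Alfresco's mapping implementation.
final class MappingSketch {

    /** Loads mapping lines in java.util.Properties syntax, so "geo\:lat" becomes the key "geo:lat". */
    static Properties loadMapping(String text) {
        Properties mapping = new Properties();
        try {
            mapping.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return mapping;
    }

    /** Renames raw metadata keys to model property names; unmapped keys are dropped. */
    static Map<String, String> apply(Properties mapping, Map<String, String> raw) {
        Map<String, String> mapped = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String property = mapping.getProperty(entry.getKey());
            if (property != null) {
                mapped.put(property, entry.getValue());
            }
        }
        return mapped;
    }

    public static void main(String[] args) {
        Properties mapping = loadMapping(
                "geo\\:lat=cm:latitude\n"
              + "geo\\:long=cm:longitude\n"
              + "title=cm:title\n");
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("geo:lat", "51.5");      // sample values, made up for the example
        raw.put("geo:long", "-0.12");
        raw.put("software", "ignored");  // no mapping, so it is dropped
        System.out.println(apply(mapping, raw));
    }
}
```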
![Page 17: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/17.jpg)
17
Demo: Geo Tagged Image in Share
![Page 18: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/18.jpg)
18
Content Transformation Service
![Page 19: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/19.jpg)
19
Supported Transformations
Transformations Supported in Alfresco v3.4
• Plain Text, HTML & XHTML for all Apache Tika supported text and document formats (around 30 file formats)
• PDF to Image
• PDF to SWF (for preview)
• Office File Formats to PDF (via Open Office, using JODConverter in Enterprise)
• Plain Text and XML to PDF
• Zip listing to Text
• Image to other Images (via ImageMagick)
![Page 20: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/20.jpg)
20
Content Transformer and Tika
Handlers
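// Option 1: plain text. Pass the handler to parser.parse(...), then read the text
ContentHandler handler = new BodyContentHandler();
String text = handler.toString();

// Option 2: XML / XHTML. Serialise the SAX events through a TransformerHandler
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
StringWriter sw = new StringWriter();
handler.setResult(new StreamResult(sw));
// Pass the handler to parser.parse(...), then read the serialised output
String text = sw.toString();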
• Tika generates HTML-like SAX events as it parses
• Uses Java SAX API
• Events can be captured or transformed
• Body Content Handler used for plain text
• HTML and XHTML available
• Can customise with your own handler, with XSLT or with E4X from JavaScript
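To make "customise with your own handler" concrete, here is a minimal sketch of a SAX handler that keeps only the `h1` headings from a stream of XHTML events. It uses the JDK's own SAX parser on a literal string as a stand-in for Tika's output, so that it runs without Tika on the classpath. `HeadingCollector` is a hypothetical name, and the sketch matches on `qName` because the JDK parser is not namespace-aware by default, whereas Tika's events are namespaced XHTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: a custom handler that captures <h1> text.
final class HeadingCollector extends DefaultHandler {

    private final List<String> headings = new ArrayList<>();
    private StringBuilder current; // non-null only while inside an <h1>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("h1")) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("h1") && current != null) {
            headings.add(current.toString());
            current = null;
        }
    }

    List<String> getHeadings() {
        return headings;
    }

    /** Parses an XHTML string with the JDK SAX parser and returns its <h1> texts. */
    static List<String> collect(String xhtml) {
        try {
            HeadingCollector handler = new HeadingCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getHeadings();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<html><body><h1>Hello, World!</h1><p>Body</p></body></html>"));
    }
}
```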
![Page 21: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/21.jpg)
21
Content Transformer – Java Use
// Look up the registry bean
ContentTransformerRegistry registry = (ContentTransformerRegistry) context.getBean("contentTransformerRegistry");

// Ask for a transformer from Excel to CSV
ContentTransformer transformer = registry.getTransformer("application/vnd.ms-excel", "text/csv", new TransformationOptions());

// Read from the source node, write to the destination node
ContentReader reader = contentService.getReader(sourceNodeRef, ContentModel.PROP_CONTENT);
ContentWriter writer = contentService.getWriter(destNodeRef, ContentModel.PROP_CONTENT, true);
transformer.transform(reader, writer);
![Page 22: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/22.jpg)
22
Content Transformer – JavaScript Use
JavaScript
var action = actions.create("transform");
// Transform into the same folder
action.parameters["destination-folder"] = document.parent;
action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] = document.name + "transformed";
action.parameters["mime-type"] = "text/html";
// Execute
action.execute(document);
• Full access is not directly available
• You can’t control which property is transformed, it’s always Content
• You can control where the transformed version goes
• Triggering the transformation is easier than in Java
• Available via an action
![Page 23: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/23.jpg)
23
Custom Tika Parsers - Interface
Interface
public interface Parser {
    Set<MediaType> getSupportedTypes(ParseContext context);

    void parse(InputStream stream, ContentHandler handler,
               Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException;
}
• The Tika Parser interface is quite simple
• Need to provide a list of supported mime types, so that auto-detection can work
• Accept an input stream, populate the Metadata object, and fire SAX events to the supplied handler
• That’s it!
![Page 24: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/24.jpg)
24
Custom Tika Parser – Hello World Parser
public class HelloWorldParser implements Parser {
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        Set<MediaType> types = new HashSet<MediaType>();
        types.add(MediaType.parse("hello/world"));
        return types;
    }

    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context) throws SAXException {
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.startElement("h1");
        xhtml.characters("Hello, World!");
        xhtml.endElement("h1");
        xhtml.endDocument();

        metadata.set("hello", "world");
        metadata.set("title", "Hello World!");
    }
}
![Page 25: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/25.jpg)
25
Custom Command Line Transformer
<bean id="transformer.worker.helloWorldCMD"
      class="org.alfresco.repo.content.transform.RuntimeExecutableContentTransformerWorker">
  <property name="mimetypeService"><ref bean="mimetypeService"/></property>
  <property name="transformCommand">
    <bean class="org.alfresco.util.exec.RuntimeExec">
      <property name="commandsAndArguments">
        <map>
          <entry key=".*">
            <list>
              <value>/bin/bash</value>
              <value>-c</value>
              <value>/bin/echo 'Hello World - ${source}' > ${target}</value>
            </list>
          </entry>
        </map>
      </property>
      <property name="errorCodes"><value>1,127</value></property>
    </bean>
  </property>
  <property name="explicitTransformations">
    <list>
      <bean class="org.alfresco.repo.content.transform.ExplictTransformationDetails">
        <property name="sourceMimetype"><value>text/plain</value></property>
        <property name="targetMimetype"><value>hello/world</value></property>
      </bean>
    </list>
  </property>
</bean>

<bean id="transformer.helloWorldCMD"
      class="org.alfresco.repo.content.transform.ProxyContentTransformer"
      parent="baseContentTransformer">
  <property name="worker"><ref bean="transformer.worker.helloWorldCMD"/></property>
</bean>
![Page 26: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/26.jpg)
26
Custom Transformer – Demo
JS Code
var action = actions.create("transform");
action.parameters["destination-folder"] = document.parent;
action.parameters["assoc-type"] = "{http://www.alfresco.org/model/content/1.0}contains";
action.parameters["assoc-name"] = document.name + "HW";

if (document.mimetype == "hello/world") {
    action.parameters["mime-type"] = "text/plain";
} else {
    action.parameters["mime-type"] = "hello/world";
}

action.execute(document);
• Use our Command Line transformer to generate a “hello/world” version
• Use our Tika transformer to turn this back into plain text
• Uses the JavaScript API to access the content transformation service
![Page 27: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/27.jpg)
27
Demo 2: Excel to Plain Text, CSV and HTML
![Page 28: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/28.jpg)
28
Rendition Service
![Page 29: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/29.jpg)
29
Standard Rendition Engines
Renditions Supported in Alfresco v3.4
• reformat – access to the Content Transformation Service
• image – crop, resize, etc
• freemarker – runs a Freemarker Template against the content of the node
• html – turns .docx files into clean HTML + images
• xslt – runs an XSLT Transformation against the content of the node, XML content nodes only!
• composite – executes several renditions in series, eg reformat followed by image crop
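The composite engine's "several renditions in series" behaviour is ordinary function composition: each step receives the previous step's output. The sketch below shows that shape on a toy `{width, height}` pair. `CompositeSketch` and both steps are hypothetical and stand in for real rendering engines; this is not the Alfresco implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration only: steps stand in for rendering engines.
final class CompositeSketch<T> {

    private final List<UnaryOperator<T>> steps = new ArrayList<>();

    /** Appends a step; steps run in the order they were added. */
    CompositeSketch<T> then(UnaryOperator<T> step) {
        steps.add(step);
        return this;
    }

    /** Feeds the output of each step into the next one. */
    T render(T source) {
        T result = source;
        for (UnaryOperator<T> step : steps) {
            result = step.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy steps on {width, height}: scale to width 100, then crop to a square.
        CompositeSketch<int[]> composite = new CompositeSketch<int[]>()
                .then(d -> new int[] {100, d[1] * 100 / d[0]})
                .then(d -> {
                    int side = Math.min(d[0], d[1]);
                    return new int[] {side, side};
                });
        int[] out = composite.render(new int[] {400, 200});
        System.out.println(out[0] + "x" + out[1]); // prints 50x50
    }
}
```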
![Page 30: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/30.jpg)
30
Persisted vs Transient Definitions
For your more complicated renditions
• To run a rendition, first create a rendition definition for a given rendering engine
• Then set all the parameters against it
• Finally execute it against a node
• For very complicated / common renditions, you can save the definition to the data dictionary
• It can then be retrieved and run
• Rendition Service provides support to create, load, save and execute definitions
![Page 31: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/31.jpg)
31
Rendition Service – Calling From Java
Load, Edit, Save, Run
// Retrieve the existing Rendition Definition
QName renditionName = QName.createQName(
    NamespaceService.CONTENT_MODEL_1_0_URI, "myRendDefn");
RenditionDefinition renditionDef = renditionService.loadRenditionDefinition(renditionName);

// Make some changes.
renditionDef.setParameterValue(AbstractRenderingEngine.PARAM_MIME_TYPE, MimetypeMap.MIMETYPE_PDF);
renditionDef.setParameterValue(RenditionService.PARAM_ORPHAN_EXISTING_RENDITION, true);

// Persist the changes.
renditionService.saveRenditionDefinition(renditionDef);

// Run the Rendition
ChildAssociationRef assoc = renditionService.render(sourceNode, renditionDef);
![Page 32: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/32.jpg)
32
Rendition Service – Calling From JavaScript
Create, Run, List
var renditionDef = renditionService.createRenditionDefinition("cm:cropResize", "imageRenderingEngine");
renditionDef.parameters["destination-path-template"] = "/Company Home/Cropped Images/${name}.jpg";
renditionDef.parameters["isAbsolute"] = true;
renditionDef.parameters["xsize"] = 50;
renditionDef.parameters["ysize"] = 50;

renditionService.render(nodeRef, renditionDef);

var renditions = renditionService.getRenditions(nodeRef);
![Page 33: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/33.jpg)
33
Rendition Service – More Calling Options
Actions, Rules, CMIS
• Renditions are Actions, but normally hidden ones
• They won’t show up in Share when defining Rules, or in Explorer for running a Custom Action
• Solution – create a JS Script, or some custom Java
• Use this from your Rule / to run as an Action
• No dedicated REST API, but Renditions are available through CMIS
• More details available in the CMIS talks!
![Page 34: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/34.jpg)
34
Custom Rendition Engines
When a composite just isn’t enough
• Rendition Engines are a special kind of Action Executor
• This delivers lots of flexibility, and means anyone who can write Custom Actions already knows enough to write Custom Rendition Engines!
• org.alfresco.repo.rendition.executer.AbstractRenderingEngine provides a helpful superclass
• To learn more about Custom Actions and Custom Action Executors, see Neil McErlean’s talk
![Page 35: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/35.jpg)
35
Demo 1: Crop and Resize an Image
(Using Share Rules)
![Page 36: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/36.jpg)
36
Demo 2: Video Rendition
![Page 37: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/37.jpg)
37
Demo 3: Word .docx -> HTML & Images
(Using Web Quick Start)
![Page 38: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/38.jpg)
38
Any Questions?
![Page 39: Metadata Extraction and Content Transformation](https://reader033.vdocuments.us/reader033/viewer/2022061202/547c19c4b4af9fda158b5068/html5/thumbnails/39.jpg)
39
Learn More
wiki.alfresco.com
forums.alfresco.com
blogs.alfresco.com/wp/nickb/
twitter: @AlfrescoECM @Gagravarr