Newcastle University: Content migration made easy

T4 Import Processes Mike Sales & Paul Thompson 20/11/2014

Upload: terminalfour

Post on 15-Jul-2015


TRANSCRIPT

Page 1: Newcastle University: Content migration made easy

T4 Import Processes

Mike Sales & Paul Thompson, 20/11/2014

Page 2: Newcastle University: Content migration made easy

Import 150 websites on www.ncl.ac.uk during 2015…

Page 3: Newcastle University: Content migration made easy
Page 4: Newcastle University: Content migration made easy

We have about half a dozen template types:

All of these ‘top task’ boxes use different HTML depending on their templates, but they’re going to be the same content type in T4…

Page 5: Newcastle University: Content migration made easy

The process…

Map all of those UNIX accounts onto a new server…

wwwclone.ncl.ac.uk

Content has a header and footer designed to tidy up the content…

• Some string and regular expression finds and replaces.

• PHP HTML tidy

• DOM manipulations and transformations.

Output on another server designed to be spider friendly.

Page 6: Newcastle University: Content migration made easy

The Find/Replace Bit…

/* STRING PARSING BIT */

// convert classes to the ones we want...

$output = str_replace(array("hpBox","hpPadding"),"toptask",$output);

$output = str_replace(array("studentProfile-grid"),"studentprofile",$output);

$output = str_replace('<div class="clear">&nbsp;</div>',"",$output);

Classes are standardized. H3s become H1s. Purely visual and unnecessary HTML fixes are removed.

Page 7: Newcastle University: Content migration made easy

The HTML Tidy Bit…

/* HTML TIDY BIT */

$tidy_config = array(

'clean' => true,

'output-xhtml' => true,

'show-body-only' => true,

'wrap' => 0,

);

$tidy = tidy_parse_string($output, $tidy_config, 'UTF8');

$tidy->cleanRepair();

Page 8: Newcastle University: Content migration made easy

The DOM Manipulation Bit…

/* DOM MANIPULATION BIT */

$dom = new DOMDocument();

$dom->loadHTML("<div id='parseme'>".(String)$tidy."</div>");

$xpath = new DOMXPath($dom);

$nodeCollection=array();

$currentCollector=null;

Header groups are wrapped in appropriate DIVs. HTML4 block elements are replaced with HTML5 ones.

Page 9: Newcastle University: Content migration made easy

The DOM Manipulation Bit…

/* REMOVE COMMENTS */

foreach ($xpath->query('//comment()') as $comment) {

$comment->parentNode->removeChild($comment);

}

/* CONVERT CLASSES TO THE REAL THINGS*/

foreach ($xpath->query("//p[@class = 'pullquote']") as $item)

{

$blockquote=$dom->createElement("blockquote");

$item->parentNode->insertBefore($blockquote,$item);

$blockquote->appendChild($item);

}

/* REMOVE UNWANTED THINGS */

foreach ($xpath->query("//div[@id = 'tickerTape']") as $node) {

$node->parentNode->removeChild($node);

}

Page 10: Newcastle University: Content migration made easy

Wwwclone…

Page 11: Newcastle University: Content migration made easy

Wwwclone…

Page 12: Newcastle University: Content migration made easy

Wwwclone…

Page 13: Newcastle University: Content migration made easy

Wwwclone…

Page 14: Newcastle University: Content migration made easy

Wwwclone…

Database-driven content shouldn’t be imported, but we import its primary key into a Staff List object using a ‘web-obj’.

Page 15: Newcastle University: Content migration made easy
Page 16: Newcastle University: Content migration made easy

Database-driven content shouldn’t be imported, but we import its primary key into a Staff List object using a ‘web-obj’:

<h1><t4 type="content" name="List Title" output="normal" modifiers="" /></h1>

<p><t4 type="content" name="List Description" output="normal" modifiers="" /></p>

<t4 type="web-obj"
    method="http"
    start-url="http://webdata.ncl.ac.uk/code/myimpact/cmsws/t4_stafflist.php?em=$template.Email Addresses$"
    link-match="/code/myimpact/cmsws/"
    link-match-import-method="subsection"
    parse-body="true"
    use-title-attribute="true"
    start-tag="<!-- START STAFF -->"
    end-tag="<!-- END STAFF -->" />

Page 17: Newcastle University: Content migration made easy
Page 18: Newcastle University: Content migration made easy

Now for Mike Sales…

Note that the template currently implemented on our development server is not yet complete; the real design, by www.theroundhouse.co.uk, is live at www.ncl.ac.uk/postgraduate

Page 19: Newcastle University: Content migration made easy

Static HTML Site Import Tool

Page 20: Newcastle University: Content migration made easy

• Consists of a handler for the UI/configuration pages and a separate servlet which contains the import code.

• The handler communicates with the servlet via AJAX to provide progress updates (in the form of a jQuery dial widget) and to prevent the client browser timing out.


Page 21: Newcastle University: Content migration made easy

The handler UI takes the following user input –

• URL of the site to be imported

• Title for the imported site root

• Destination publishing channel

• Media library title

• List of users to be added


Page 22: Newcastle University: Content migration made easy

Content mappings are set within the handler configuration. This allows HTML element(s) to be entered and the target T4 content type/element(s) to be specified.

The following slides detail the import process …


Page 23: Newcastle University: Content migration made easy

The live site is crawled using Crawler4j. The crawl collects the following metadata from all links which return HTTP 200/OK response codes:

• Page URL

• Page path (as above but without protocol and domain)

• Page title

The web crawler class runs independently, so all link metadata is added to a temporary JSON file (using JSON.simple).
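As a rough sketch of the metadata described above, the following Java fragment models one crawled link and derives the page path from its URL. The `CrawledLink` class and its `fromResponse` helper are hypothetical names used for illustration; they are not part of Crawler4j or of the import tool, and writing the result to JSON is left out.

```java
import java.net.URI;
import java.util.Optional;

/** Hypothetical holder for the three values collected per link. */
final class CrawledLink {
    final String url;    // full page URL
    final String path;   // URL without protocol and domain
    final String title;  // page title

    private CrawledLink(String url, String path, String title) {
        this.url = url;
        this.path = path;
        this.title = title;
    }

    /** Keeps a link only when the response was HTTP 200/OK. */
    static Optional<CrawledLink> fromResponse(String url, int status, String title) {
        if (status != 200) {
            return Optional.empty();
        }
        String path = URI.create(url).getPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // a bare domain maps to the site root
        }
        return Optional.of(new CrawledLink(url, path, title));
    }
}
```

For example, a 200 response for `http://www.ncl.ac.uk/postgraduate/courses/` (an illustrative URL) would yield the path `/postgraduate/courses/`, while a 404 response would be dropped.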


Page 24: Newcastle University: Content migration made easy

Once the crawl is complete, the servlet reads in and parses the JSON collected during the crawl.

Each link in the JSON is then iterated. For each link, the following is done:

• The link structure is built – the path from the URL is split into an array (with forward-slash as the delimiter)

• Each element of the path is created as a section in T4 using a loop (various checks are made to ensure the section doesn’t already exist)

• Sections are created using HierarchyManager. The section title is obtained from the JSON array. The last element of the path array is used for the section URI
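The steps above can be sketched in plain Java. This is only an illustration under stated assumptions: `Section` and `SectionBuilder` are hypothetical in-memory stand-ins, and no HierarchyManager calls are shown because the slides do not give its API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a T4 section. */
final class Section {
    final String uri;   // path element used as the section URI
    String title;       // title taken from the crawl metadata
    final Map<String, Section> children = new LinkedHashMap<>();

    Section(String uri, String title) {
        this.uri = uri;
        this.title = title;
    }
}

final class SectionBuilder {
    /**
     * Walks the path one element at a time, creating each section only if
     * it does not already exist, and returns the deepest section.
     */
    static Section ensurePath(Section root, String path, String pageTitle) {
        Section current = root;
        for (String part : path.split("/")) {
            if (part.isEmpty()) {
                continue; // skip empty pieces from leading or doubled slashes
            }
            Section child = current.children.get(part);
            if (child == null) {
                // Placeholder title until the page for this section is seen.
                child = new Section(part, part);
                current.children.put(part, child);
            }
            current = child;
        }
        if (current != root) {
            current.title = pageTitle; // the last path element gets the page title
        }
        return current;
    }
}
```

Calling `ensurePath` twice with overlapping paths reuses the sections already created, which mirrors the "doesn’t already exist" check in the slide.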


Page 25: Newcastle University: Content migration made easy

Next, all page content is loaded into a jsoup document and cleaned against a whitelist (valid HTML elements can be defined within the handler configuration page).


Page 26: Newcastle University: Content migration made easy

Media content (images, documents, etc.) is selected from the jsoup document and iterated:

Media items are created –

• The image/document URL is used to create a temporary file

• Metadata is taken from the HTML tag and is used to populate the Title and Description fields

• Image variants are created (any number of these can be specified within the handler configuration page)

• Newly-created Media elements are added to the cache

• Using the MediaFilter tag broker, image tags within the jsoup document are converted to T4 media links (i.e. they have a span wrapper added and the URL points to the media library).


Page 27: Newcastle University: Content migration made easy

Content elements are created from the current jsoup document –

• A loop is used to iterate over each content type (using the mappings defined in the handler configuration page - for example, ‘div.news-item’ is loaded into the Body Text element of the News content type).

• ContentManager is used to create the content items while ContentHierarchy is used to add them to the current section (the last section created in the previous stage).

• Newly-created content items are added to the cache.
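One possible way to hold the content mappings so that the import loop can work through them content type by content type is sketched below. `ContentMapping` and `MappingConfig` are hypothetical names, and the calls to ContentManager and ContentHierarchy are omitted because the slides do not show them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical mapping from an HTML selector to a T4 content type element. */
final class ContentMapping {
    final String selector;     // e.g. "div.news-item"
    final String contentType;  // e.g. "News"
    final String element;      // e.g. "Body Text"

    ContentMapping(String selector, String contentType, String element) {
        this.selector = selector;
        this.contentType = contentType;
        this.element = element;
    }
}

final class MappingConfig {
    /** Groups mappings by content type, preserving configuration order. */
    static Map<String, List<ContentMapping>> byContentType(List<ContentMapping> mappings) {
        Map<String, List<ContentMapping>> grouped = new LinkedHashMap<>();
        for (ContentMapping m : mappings) {
            grouped.computeIfAbsent(m.contentType, k -> new ArrayList<>()).add(m);
        }
        return grouped;
    }
}
```

With the slide’s example, `new ContentMapping("div.news-item", "News", "Body Text")` would land in the "News" group, and the import loop would then create one News item per matching element.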


Page 28: Newcastle University: Content migration made easy

After the structure and all content/media elements are created and approved, the channel content is reset (no full cache rebuild necessary!)

Finally, users are created and added to the site, which involves –

• Importing users directly from LDAP

• Adding the users to the Moderator context, the relevant content editing groups and Grouper

• Adding the users to the imported site root section


Page 29: Newcastle University: Content migration made easy

Drag and Drop Media Uploader

Page 30: Newcastle University: Content migration made easy

Uses a handler for the UI/configuration pages and a separate servlet which contains the upload processing code. The Plupload drag and drop component communicates with the servlet via AJAX.

The Media Library hierarchy is displayed in the left column. HierarchyViewer is used to display the Media Library structure – this is so that users only see the sections they have write access to.


Page 31: Newcastle University: Content migration made easy

The Plupload component is displayed in the upper-right of the screen. Files can be dragged and dropped here. When the Start Upload button is clicked, the servlet does the following:

• Creates a temporary file for each item in the upload queue

• If a Zip file is encountered, the archive is extracted to a temporary folder

• Gets the specified Title and Alt tags for each upload (except for files extracted from a Zip – instead the filename is used as the title)

• For each item, creates a Media object with the temporary File object, title, alt and any image variant sizes specified in the handler configuration page

• Adds the Media items to the specified category

• Performs a ‘media.add’ cache update with the IDs of the newly-created elements
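The Zip handling in the list above could look roughly like this, using only `java.util.zip` and `java.nio.file`. `ZipUploads` is a hypothetical helper, not the uploader’s actual code; creating the Media objects and the cache update are left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

final class ZipUploads {
    /**
     * Extracts every file in the archive to a fresh temporary folder and
     * returns extracted path -> title, where the title is the file name.
     */
    static Map<Path, String> extract(InputStream zipData) throws IOException {
        Path tempDir = Files.createTempDirectory("media-upload")
                .toAbsolutePath().normalize();
        Map<Path, String> titles = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                Path target = tempDir.resolve(entry.getName()).normalize();
                // Reject entries such as "../x" that would escape the temp folder.
                if (!target.startsWith(tempDir)) {
                    throw new IOException("Entry escapes the temp folder: " + entry.getName());
                }
                Files.createDirectories(target.getParent());
                Files.copy(zip, target); // reads up to the end of the current entry
                titles.put(target, target.getFileName().toString());
            }
        }
        return titles;
    }
}
```

Each returned entry gives the temporary file and the title to use for it, matching the rule that files extracted from a Zip take their filename as the title.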


Page 32: Newcastle University: Content migration made easy

User Portal / Dashboard

Page 33: Newcastle University: Content migration made easy

This is a handler which replaces the Welcome page and reuses the Widget/Widget Collection feature to allow different user levels to be presented with dashboards relevant to their activities.

For example, administrators could see widgets relating to server performance and usage, while moderators could see recently updated content, a help blog, news widget, etc.


Page 34: Newcastle University: Content migration made easy

Widgets are created as normal and consist of HTML/JS that makes use of the T4 web services API.

Within the handler configuration page, Widgets can be added to collections (Header and Left/Centre/Right-columns) using a drag-and-drop jQuery UI component – this also supports reordering of Widgets.


Page 35: Newcastle University: Content migration made easy

Any questions?