Intro

This memo describes the steps needed to configure and run the processing of a language resource. It is intended for internal use only.

Architecture overview

Main components

Three main components are involved in the processing of language resources:

The Resource Server (hereafter RS) manages information about resources, their status and associated files.
The Workflow Server (hereafter WS) processes resource input files into output files that are loaded into the Virtuoso server. The WS is implemented using Oozie and Hadoop.
The processing components provided by DERI and other participants.
Data and Processing Flow

The following diagram shows the communication between the WS and the RS while a resource is processed:

[Figure: RS-WS-coop_v3.png]

The flow:

The flow is started by the administrator with an HTTP call to the RS REST API. The call URL contains the resource ID as a parameter. Example: POST /resources/48957c5d-456c-4d7a-abc9-3062c91dafdd/processed
The first processing step is done by the RS. It downloads the resource input file and uploads it to the SCP server under the name ${resource_id}.ext
The RS then selects the flow by resource type, sets the flow properties and starts the flow using the WS API of Oozie.
Oozie executes the flow, which contains data-moving steps and the execution of the resource processing components. The penultimate step in the flow is the loading of the data into the Virtuoso server, which is done by the miniLoader Java action.
The last step in the Oozie flow notifies the RS about the Virtuoso load status. The RS then notifies LRPMA about the processing status.
Processing set up overview

The whole processing is configured by the following steps:

resource type definition
registration of resource
definition of workflow
Processing set up

Definition of the resource type

First, it is necessary to create a resource type using the resource server. A resource type is created with an HTTP POST request, so it can be done either with a command-line HTTP tool such as curl or with a REST client. The following text includes screenshots from the Postman REST client for illustration. The request parameters are also listed as text because they are easier to read (and to copy and paste). The HTTP header Content-Type should be set to application/json.
The resource server address is http://54.201.101.125:9999. Suppose that it is necessary to process resources provided by Paradigma ltd. These resources contain a lexicon, so the result of processing will be one graph.

Request
POST http://54.201.101.125:9999/resourcestypes

Example body
{ "id": "paradigma", "description": "type intended for processing of resources provided by Paradigma ", "graphsSuffixes": ["lexicon"] }

Example response
{ "id": "paradigma" }

The resource type defines which workflow is used for processing the resource, and the resource type ID is used as the name of the subfolder on HDFS for the Oozie workflow.
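As a hedged illustration of the request above, the following Python sketch builds the same POST with the standard library only. The helper name build_resource_type_request is made up for this example, and the request is only constructed, not sent.

```python
import json
import urllib.request

RS_BASE_URL = "http://54.201.101.125:9999"  # RS address given in this memo


def build_resource_type_request(type_id, description, graph_suffixes,
                                base_url=RS_BASE_URL):
    """Build (but do not send) the POST request that creates a resource type."""
    body = {
        "id": type_id,
        "description": description,
        "graphsSuffixes": list(graph_suffixes),
    }
    return urllib.request.Request(
        f"{base_url}/resourcestypes",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


type_request = build_resource_type_request(
    "paradigma",
    "type intended for processing of resources provided by Paradigma",
    ["lexicon"],
)
print(type_request.get_method(), type_request.full_url)
print(type_request.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(type_request)
```

Keeping the construction separate from the sending makes it possible to inspect the URL, header and body before anything reaches the server.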
Registration of the resource

The language resource should be registered in the resource server. Normally this is done via the LRPMA, but it is possible to do it manually for test purposes using the resource server REST API.

Request
POST http://54.201.101.125:9999/resources

Example body
{ "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0", "resourceType": "paradigma", "downloadUri": "scp://[email protected]/home/ubuntu/ParadigmaData/hotel_ca_tricks.csv", "credentials": "-----BEGIN RSA PRIVATE KEY----- ..., "language": "ca", "domain": "hotel", "provider": "Paradigma ltd", "licence": "LRGPL", "graphNamesPrefix": "http://www.eurosentiment.com/hotel/ca/lexicon/paradigma/" }

Example response
{ "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0" }
Definition of Workflow

The processing steps are defined by an XML workflow file that should be copied to the Hadoop Distributed File System, to the location configured in the resource server configuration. The flow consists of actions. Every action defines the next action to run in case of its success. Properties populated by the resource server are used in the workflow definition XML files.

Properties of flows populated by the Resource Server

Properties calculated or retrieved from the resource properties:

Property                             Description
rsresourceid                         ID of the resource
rsgraphprefix                        prefix for graphs; see the miniLoader Java action description below
rsgraphsufix0, [rsgraphsufix1]...    graph suffixes, one for each file produced by the flow
rsdomain                             domain of the processed resource
rslanguage                           language of the processed resource
rsprovider                           provider
rslicense                            license
oozie.wf.application.path            ${hdfs-folder-uri}/${resourceTypeId}, where hdfs-folder-uri is specified in conf.properties of the RS and resourceTypeId is a property of the resource on the RS
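The table above can be read as a mapping from a registered resource to flow properties. The sketch below shows one way such a mapping could look. It is not the RS implementation: the correspondence between the JSON fields of the registration body (for example graphNamesPrefix) and the property names is an assumption, the function name is invented, and the hdfs-folder-uri value is taken from the hadoop commands later in this memo.

```python
def build_flow_properties(resource, graph_suffixes, hdfs_folder_uri):
    """Derive the resource-specific flow properties (assumed field mapping)."""
    properties = {
        "rsresourceid": resource["id"],
        "rsgraphprefix": resource["graphNamesPrefix"],
        "rsdomain": resource["domain"],
        "rslanguage": resource["language"],
        "rsprovider": resource["provider"],
        "rslicense": resource["licence"],
        # ${hdfs-folder-uri}/${resourceTypeId}
        "oozie.wf.application.path":
            f"{hdfs_folder_uri.rstrip('/')}/{resource['resourceType']}",
    }
    # One suffix per output file: rsgraphsufix0, rsgraphsufix1, ...
    for index, suffix in enumerate(graph_suffixes):
        properties[f"rsgraphsufix{index}"] = suffix
    return properties


example_resource = {
    "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0",
    "resourceType": "paradigma",
    "language": "ca",
    "domain": "hotel",
    "provider": "Paradigma ltd",
    "licence": "LRGPL",
    "graphNamesPrefix": "http://www.eurosentiment.com/hotel/ca/lexicon/paradigma/",
}
flow_props = build_flow_properties(
    example_resource, ["lexicon"], "/user/ubuntu/nuig-flows"
)
# The workflow snippets combine prefix and suffix as ${rsgraphprefix}${rsgraphsufix0}.
graph_name = flow_props["rsgraphprefix"] + flow_props["rsgraphsufix0"]
print(flow_props["oozie.wf.application.path"])
print(graph_name)
```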
The resource server also copies properties from its configuration file conf/job.properties to the flow properties. This can be used for properties common to all flows, such as:

Property                Description
nameNode                HDFS name node address
jobTracker              MapReduce job tracker address
queueName               MapReduce job queue name
user.name               user used to run the Oozie flow
inputfolder             folder where downloaded resource files are stored
rspfilesdir             folder for processed files
rsvirtuosoloadfolder    absolute path to the folder where files for loading are stored
rsvirtuosohost          hostname or address of the Virtuoso server
rsvirtuosojdbcport      JDBC port
rsvirtuosojdbcuser      JDBC user
rsvirtuosojdbcpasswd    JDBC password
rsprocessedurl          URL to which the result of the Virtuoso load is sent
Example:
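The listing below is only a hedged sketch of how common properties and resource-specific properties might be combined; it is not the actual RS configuration. All values in the sample are hypothetical placeholders, the parser handles only simple key=value lines (no line continuations or other separators), and letting resource-specific values win over common ones is an assumption.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        properties[key.strip()] = value.strip()
    return properties


# Hypothetical job.properties content; hosts and paths are placeholders.
SAMPLE_JOB_PROPERTIES = """
# properties common to all flows
nameNode=hdfs://namenode.example:8020
jobTracker=jobtracker.example:8021
queueName=default
inputfolder=/home/ubuntu/input/
"""

common_properties = parse_properties(SAMPLE_JOB_PROPERTIES)
resource_properties = {"rsresourceid": "48957c5d-456c-4d7a-abc9-3062c91dafE0"}

# Assumption: resource-specific properties override common ones on a key clash.
merged_properties = {**common_properties, **resource_properties}
for key in sorted(merged_properties):
    print(f"{key}={merged_properties[key]}")
```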
Configuring Actions

Workflows usually contain the following sequence:

Moving the data to a place where it can be reached by the first processing component
Processing by the first component
Moving the data to a place where it can be reached by the second processing component
Processing by the second component
...
Loading into the Virtuoso triple store
Moving the resource file to the processing components

The following snippet shows an example configuration of the first step in the flow, which moves the resource file to a folder where it can be picked up by a processing component.
ubuntu@ptwf ${moveScriptPath} -onlyCopy ${inputfolder}${rsresourceid}* ubuntu@ptnuig:/home/ubuntu/data/${rsresourceid}.csv
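The values above are the text content of an Oozie ssh action. As a rough sketch, the following Python code rebuilds an action of that shape with the standard library. The element names follow the Oozie ssh action schema, but the action name, the transition targets and the split of the values into command and arguments are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def build_ssh_action(name, host, command, args, ok_to, error_to):
    """Return an Oozie <action> element that wraps an ssh action."""
    action = ET.Element("action", {"name": name})
    ssh = ET.SubElement(action, "ssh", {"xmlns": "uri:oozie:ssh-action:0.1"})
    ET.SubElement(ssh, "host").text = host
    ET.SubElement(ssh, "command").text = command
    for arg in args:
        # The ssh action takes one <args> element per argument.
        ET.SubElement(ssh, "args").text = arg
    ET.SubElement(action, "ok", {"to": ok_to})
    ET.SubElement(action, "error", {"to": error_to})
    return action


move_action = build_ssh_action(
    name="move2processing",  # hypothetical action name
    host="ubuntu@ptwf",
    command="${moveScriptPath}",
    args=[
        "-onlyCopy",
        "${inputfolder}${rsresourceid}*",
        "ubuntu@ptnuig:/home/ubuntu/data/${rsresourceid}.csv",
    ],
    ok_to="process",  # hypothetical name of the next action
    error_to="fail",  # hypothetical name of the kill node
)
print(ET.tostring(move_action, encoding="unicode"))
```

The ok and error transitions are what implement the rule that every action names the next action to run on success.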
Configuring processing

The following XML snippet shows an example of processing by the Lemon Marl generator.
ubuntu@ptnuig ~/bin/runLemonMarlGeneratorParadigma.sh /home/ubuntu/data/${rsresourceid}.csv /home/ubuntu/data/outputs/${rsresourceid}.ttl ${rsdomain} ${rslanguage} ${rsgraphprefix}${rsgraphsufix0}
Moving data to Virtuoso Server

The following XML snippet shows an action that moves the output of the previous step to the Virtuoso server.
ubuntu@ptnuig ${moveScriptPath} /home/ubuntu/data/outputs/${rsresourceid}.ttl ${virtuosoUser}@${rsvirtuosohost}:${rsvirtuosoloadfolder}${rsresourceid}.ttl
Load data to the Virtuoso Server

The following XML snippet shows an example configuration of the miniLoader component, which loads the processed resource files into the Virtuoso server.
${jobTracker} ${nameNode} mapred.job.queue.name ${queueName} com.sindice.miniloader.Miniloader ${rsvirtuosohost} ${rsvirtuosojdbcport} ${rsvirtuosojdbcuser} ${rsvirtuosojdbcpasswd} ${rsvirtuosoloadfolder}${rsresourceid}.ttl ${rsgraphprefix}${rsgraphsufix0}
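The values above match the fields of an Oozie java action. The sketch below rebuilds such an action in Python as an illustration only. The action name load2virtuoso is inferred from the wf:actionData call used in the notification step, the transition targets are hypothetical, and the capture-output element is included because Oozie needs it before wf:actionData can read the action's output.

```python
import xml.etree.ElementTree as ET

MINILOADER_ARGS = [
    "${rsvirtuosohost}",
    "${rsvirtuosojdbcport}",
    "${rsvirtuosojdbcuser}",
    "${rsvirtuosojdbcpasswd}",
    "${rsvirtuosoloadfolder}${rsresourceid}.ttl",
    "${rsgraphprefix}${rsgraphsufix0}",
]


def build_miniloader_action(ok_to, error_to):
    """Return an Oozie <action> element that runs the miniLoader main class."""
    action = ET.Element("action", {"name": "load2virtuoso"})
    java = ET.SubElement(action, "java")
    ET.SubElement(java, "job-tracker").text = "${jobTracker}"
    ET.SubElement(java, "name-node").text = "${nameNode}"
    configuration = ET.SubElement(java, "configuration")
    queue_property = ET.SubElement(configuration, "property")
    ET.SubElement(queue_property, "name").text = "mapred.job.queue.name"
    ET.SubElement(queue_property, "value").text = "${queueName}"
    ET.SubElement(java, "main-class").text = "com.sindice.miniloader.Miniloader"
    for arg in MINILOADER_ARGS:
        ET.SubElement(java, "arg").text = arg
    # Makes the action's output available to wf:actionData('load2virtuoso').
    ET.SubElement(java, "capture-output")
    ET.SubElement(action, "ok", {"to": ok_to})
    ET.SubElement(action, "error", {"to": error_to})
    return action


# Transition targets are hypothetical names.
load_action = build_miniloader_action(ok_to="notify", error_to="fail")
print(ET.tostring(load_action, encoding="unicode"))
```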
Notifying the resource server

The last step notifies the RS that the data was loaded into the Virtuoso server.
${jobTracker} ${nameNode} curl -H Content-Type:application/json -X POST -d ${wf:actionData('load2virtuoso')['miniloader_json4rs']} ${rsprocessedurl}${rsresourceid}/processed
Copy the configuration to the HDFS

The property hdfs-folder-uri in the RS configuration file conf.properties defines the path where the configuration should be stored.
The resource type ID (paradigma) is part of the HDFS path, so it is first necessary to check whether the folder exists.
If the folder for the given resource type does not exist yet, it is necessary to create it.

Now it is necessary to copy the workflow and the required jars. In this case only the miniloader jar is required, and it should be copied to the lib subfolder.
hadoop fs -put workflow.xml /user/ubuntu/nuig-flows/paradigma/
hadoop fs -put ~/virtuoso-miniloader-0.0.1-SNAPSHOT.jar /user/ubuntu/nuig-flows/paradigma/lib
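The check, create and copy steps can be collected into one list of commands. The sketch below only builds the command lines and does not run them; the function name is invented, and the -p flag of hadoop fs -mkdir is assumed to be available (Hadoop 2 or later).

```python
def hdfs_deploy_commands(flows_root, resource_type_id, workflow_file, jar_files):
    """Return the hadoop fs command lines for deploying one workflow."""
    target = f"{flows_root.rstrip('/')}/{resource_type_id}"
    commands = [
        # Exit status 0 means the folder already exists.
        ["hadoop", "fs", "-test", "-d", target],
        # Creates the workflow folder and its lib subfolder.
        ["hadoop", "fs", "-mkdir", "-p", f"{target}/lib"],
        ["hadoop", "fs", "-put", workflow_file, f"{target}/"],
    ]
    for jar_file in jar_files:
        commands.append(["hadoop", "fs", "-put", jar_file, f"{target}/lib"])
    return commands


deploy_commands = hdfs_deploy_commands(
    "/user/ubuntu/nuig-flows",
    "paradigma",
    "workflow.xml",
    ["virtuoso-miniloader-0.0.1-SNAPSHOT.jar"],
)
for command in deploy_commands:
    print(" ".join(command))
# Each entry could be executed with subprocess.run(command, check=True).
```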
Processing Resources

Processing is started by an HTTP POST request to the RS with an empty body.
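A minimal sketch of that request in Python follows, assuming the RS address and the /resources/{id}/processed path given earlier in this memo. The helper name is made up, and the request is only built, not sent.

```python
import urllib.request

RS_BASE_URL = "http://54.201.101.125:9999"  # RS address given in this memo


def build_start_request(resource_id, base_url=RS_BASE_URL):
    """Build (but do not send) the POST request that starts processing."""
    url = f"{base_url}/resources/{resource_id}/processed"
    # An empty byte string gives a POST without any payload.
    return urllib.request.Request(url, data=b"", method="POST")


start_request = build_start_request("48957c5d-456c-4d7a-abc9-3062c91dafdd")
print(start_request.get_method(), start_request.full_url)
# To actually send it: urllib.request.urlopen(start_request)
```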
The status of the processing can be monitored in the Oozie web console:
Clicking the line of the running job opens the detail window.
When processing has finished, all steps should have status OK.
When the resource has been processed successfully, it is possible to make a SPARQL request to verify the content.
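One simple verification is to count the triples in the loaded graph. The sketch below builds such a query and a GET URL for it. The graph name is assumed to be the graph prefix followed by the graph suffix, as in the workflow snippets, and the endpoint address is a hypothetical placeholder, since this memo does not give the SPARQL endpoint.

```python
from urllib.parse import urlencode


def count_triples_query(graph_iri):
    """Return a SPARQL query that counts all triples in one named graph."""
    return (
        "SELECT (COUNT(*) AS ?triples) "
        f"WHERE {{ GRAPH <{graph_iri}> {{ ?s ?p ?o }} }}"
    )


def sparql_get_url(endpoint, query):
    """Return the URL of a SPARQL protocol GET request for the query."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Assumption: graph name = graphNamesPrefix + graph suffix.
verify_graph = "http://www.eurosentiment.com/hotel/ca/lexicon/paradigma/lexicon"
verify_query = count_triples_query(verify_graph)
# Hypothetical endpoint; replace with the real Virtuoso SPARQL endpoint.
verify_url = sparql_get_url("http://virtuoso.example:8890/sparql", verify_query)
print(verify_query)
print(verify_url)
```

A count of zero after a run reported as OK would suggest that the data went to a different graph name than expected.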
Appendix A: example of whole flow definition
ubuntu@ptwf ${moveScriptPath} -onlyCopy ${inputfolder}${rsresourceid}* ubuntu@ptnuig:/home/ubuntu/data/${rsresourceid}.csv
ubuntu@ptnuig ~/bin/runLemonMarlGeneratorParadigma.sh /home/ubuntu/data/${rsresourceid}.csv /home/ubuntu/data/outputs/${rsresourceid}.ttl ${rsdomain} ${rslanguage} ${rsgraphprefix}${rsgraphsufix0}
ubuntu@ptnuig ${moveScriptPath} /home/ubuntu/data/outputs/${rsresourceid}.ttl ${virtuosoUser}@${rsvirtuosohost}:${rsvirtuosoloadfolder}${rsresourceid}.ttl
${jobTracker} ${nameNode} mapred.job.queue.name ${queueName} com.sindice.miniloader.Miniloader ${rsvirtuosohost} ${rsvirtuosojdbcport} ${rsvirtuosojdbcuser} ${rsvirtuosojdbcpasswd} ${rsvirtuosoloadfolder}${rsresourceid}.ttl ${rsgraphprefix}${rsgraphsufix0}
${jobTracker} ${nameNode} curl -H Content-Type:application/json -X POST -d ${wf:actionData('load2virtuoso')['miniloader_json4rs']} ${rsprocessedurl}${rsresourceid}/processed
${jobTracker} ${nameNode} mkdir ${rspfilesdir}/${rsresourceid}
${jobTracker} ${nameNode} mv ${rsvirtuosoloadfolder}${rsresourceid}.ttl ${rspfilesdir}/${rsresourceid}
SSH action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]