
Amazon Web Services – Plagiarism Application

Danijel Novaković

January 31st, 2012

Supervisor: Prof. Amin Anjomshoaa

Outline

– Scenario
– Used Tools
  The AWS Toolkit for Eclipse
  Karmasphere Studio for Amazon
  Apache PDFBox
– Realization of Scenario Steps
– Final Conclusions & Personal Opinion

Scenario

The sample PDF files are stored in an S3 bucket under the following endpoint: http://exercise2.ws2011.s3-website-eu-west-1.amazonaws.com/ (a user must first be authenticated in order to access these files).

The file list is read and the file names are stored in an Amazon SQS queue for further processing.

An Amazon EC2 instance processes the queued items and extracts the paragraphs of each file as text. The results should be stored in a second Amazon S3 bucket.

As the next step, Elastic MapReduce should be applied to the resulting data of the previous step. The MapReduce process should simply perform a word count and, for each paragraph, calculate the ten most frequent words. The results should then be stored in SimpleDB.

Finally, some sample queries should be provided that receive keywords and return the list of paragraphs that best match those keywords.

Scenario (overview diagram; figure not included in the transcript)

Used Tools I

The AWS Toolkit for Eclipse
– An open source plug-in for the Eclipse Java IDE that makes it easier for developers to develop, debug, and deploy Java applications using Amazon Web Services.
– With the AWS Toolkit for Eclipse, you'll be able to get started faster and be more productive when building AWS applications.
– Features: the AWS SDK for Java, AWS Explorer, AWS Elastic Beanstalk deployment and debugging, and support for multiple AWS accounts.
– http://aws.amazon.com/eclipse/

Used Tools II

Karmasphere Studio for Amazon
– A graphical environment that supports the complete lifecycle of developing for Amazon Elastic MapReduce, including prototyping, developing, testing, debugging, deploying, and optimizing Hadoop jobs.
– By simplifying development, Karmasphere Studio increases the productivity of developers, saving time and effort.
– Comes in versions compatible with Eclipse.
– Two different licensing models:
  License Included (the Karmasphere software is licensed by AWS)
  Bring-Your-Own (designed for customers who prefer to use an existing Karmasphere license)
– http://aws.amazon.com/elasticmapreduce/karmasphere/
– http://karmasphere.com/ksc/karmasphere-studio-for-amazon.html

Used Tools III

Apache PDFBox
– A Java PDF library
– An open source Java tool for working with PDF documents
– Used here for PDF-to-text extraction
– http://pdfbox.apache.org/
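The PDFTextParser helper that appears in the later code slides is never shown in the deck. A minimal sketch of what it might look like against the PDFBox 1.x API of the time (the class and method names come from the slides; the method bodies are assumptions):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFTextParser {

    // Extracts the plain text of a PDF file.
    public String pdftoText(String pdfFilePath) throws IOException {
        PDDocument document = PDDocument.load(new File(pdfFilePath));
        try {
            return new PDFTextStripper().getText(document);
        } finally {
            document.close();
        }
    }

    // Writes the extracted text to a file.
    public void writeTexttoFile(String text, String outputFilePath) throws IOException {
        FileWriter writer = new FileWriter(outputFilePath);
        try {
            writer.write(text);
        } finally {
            writer.close();
        }
    }
}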


Scenario – part I


import java.io.File;
import java.util.List;
import java.util.UUID;

import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.CreateQueueRequest;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.DeleteQueueRequest;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
import com.amazonaws.services.sqs.model.SendMessageRequest;


AmazonS3 s3 = new AmazonS3Client(new PropertiesCredentials(
        MainClass.class.getResourceAsStream("AwsCredentials.properties")));
AmazonSQS sqs = new AmazonSQSClient(new PropertiesCredentials(
        MainClass.class.getResourceAsStream("AwsCredentials.properties")));

String inputBucketName = "exercise2.ws2011";
String mainBucketName = "introduction.to.cloud.computing";
String vFolderWithParagrapfsName = "pdf.extracted.paragraph";
String queueName = "myQueue01" + UUID.randomUUID();
int numberOfSentMessages = 0;

// create the SQS queue that will hold one message per PDF file
CreateQueueRequest createQueueRequest = new CreateQueueRequest(queueName);
String myQueueUrl = sqs.createQueue(createQueueRequest).getQueueUrl();


// list all sample PDFs in the input bucket and enqueue one message per file
ObjectListing objectListing = s3.listObjects(new ListObjectsRequest().withBucketName(inputBucketName));
for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
    String fileName = objectSummary.getKey();
    sqs.sendMessage(new SendMessageRequest(myQueueUrl, fileName));
    numberOfSentMessages++;
}


Scenario – part II


// create the second bucket for the extracted paragraphs
s3.createBucket(mainBucketName);

// create a virtual folder in the created bucket (S3 has no real folders;
// a placeholder object whose key ends in "/" serves as the folder marker)
String tmpFileName = "tmpFile.txt";
boolean successfullyCreated = new File(tmpFileName).createNewFile();
File tmpFile = new File(tmpFileName);
s3.putObject(new PutObjectRequest(mainBucketName, vFolderWithParagrapfsName + "/", tmpFile));


Scenario – part III


ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl);
int totalNumberOfReceivedMessages = 0;
int numberOfReceivedMessages = 0;

// process the queued items: download each PDF, extract its text, and store the
// paragraphs in the second bucket (pdfDir is a local temp directory; its
// definition is elided on the slides)
while (numberOfSentMessages != totalNumberOfReceivedMessages) {
    List<Message> messages = sqs.receiveMessage(receiveMessageRequest).getMessages();
    numberOfReceivedMessages = messages.size();
    for (Message message : messages) {
        String fileName = message.getBody();
        String messageReceiptHandle = message.getReceiptHandle();
        sqs.deleteMessage(new DeleteMessageRequest(myQueueUrl, messageReceiptHandle));
        // download the PDF via a pre-signed URL and extract its text with PDFBox
        String sURL = s3.generatePresignedUrl(inputBucketName, fileName, null).toString();
        downloadFromUrl(sURL, pdfDir + "/" + fileName);
        PDFTextParser pdfTextParserObj = new PDFTextParser();
        String pdfToText = pdfTextParserObj.pdftoText(pdfDir + "/" + fileName);
        // fileName2, the name of the extracted text file, is defined in code elided on the slides
        pdfTextParserObj.writeTexttoFile(pdfToText, pdfDir + "/" + fileName2);
        // . . . for each extracted paragraph (fileName3 and paragrafContent come
        // from the elided paragraph-splitting code):
        s3.putObject(new PutObjectRequest(mainBucketName,
                vFolderWithParagrapfsName + "/" + fileName3, paragrafContent));
    }
    totalNumberOfReceivedMessages += numberOfReceivedMessages;
}
sqs.deleteQueue(new DeleteQueueRequest(myQueueUrl));
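The downloadFromUrl helper used above, and again in part IV, is not included in the deck. A plausible sketch (the name and parameters come from the slides; the implementation is an assumption) that simply streams the pre-signed URL to a local file:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

public static void downloadFromUrl(String sURL, String localFilePath) throws IOException {
    File localFile = new File(localFilePath);
    localFile.getParentFile().mkdirs(); // make sure the temp directory exists
    InputStream in = new BufferedInputStream(new URL(sURL).openStream());
    OutputStream out = new FileOutputStream(localFile);
    try {
        // copy the stream to the local file in 4 KB chunks
        byte[] buffer = new byte[4096];
        int bytesRead;
        while ((bytesRead = in.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead);
        }
    } finally {
        in.close();
        out.close();
    }
}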


Scenario – part III (Results)

(Results screenshot not included in the transcript.)

Scenario – part IV


import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import com.amazonaws.auth.PropertiesCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ListObjectsRequest;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
import com.amazonaws.services.simpledb.model.Attribute;
import com.amazonaws.services.simpledb.model.BatchPutAttributesRequest;
import com.amazonaws.services.simpledb.model.CreateDomainRequest;
import com.amazonaws.services.simpledb.model.DeleteAttributesRequest;
import com.amazonaws.services.simpledb.model.DeleteDomainRequest;
import com.amazonaws.services.simpledb.model.Item;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
import com.amazonaws.services.simpledb.model.ReplaceableItem;
import com.amazonaws.services.simpledb.model.SelectRequest;


AmazonS3 s3 = new AmazonS3Client(new PropertiesCredentials(
        ExecuteJobs.class.getResourceAsStream("AwsCredentials.properties")));
AmazonSimpleDB sdb = new AmazonSimpleDBClient(new PropertiesCredentials(
        ExecuteJobs.class.getResourceAsStream("AwsCredentials.properties")));

// domain name in Amazon SimpleDB
String domainName = "IntroductionToCloudComputing";
String mainBucketName = "introduction.to.cloud.computing";
String vFolderWithParagrapfsName = null;
String pdfDir = "pdfTemp" + UUID.randomUUID();

sdb.createDomain(new CreateDomainRequest(domainName));


ObjectListing objectListing = s3.listObjects(new ListObjectsRequest().withBucketName(mainBucketName));
HadoopJob hj = new HadoopJob();
File tmpFile = null;
for (S3ObjectSummary objectSummary : objectListing.getObjectSummaries()) {
    if (objectSummary.getSize() > 0) {
        // it is a file read from Amazon S3, not a folder
        // (code shown below)
    } else {
        // the zero-byte object is the virtual folder marker; strip the trailing "/"
        vFolderWithParagrapfsName = objectSummary.getKey().substring(0,
                objectSummary.getKey().length() - 1);
    }
}


if (objectSummary.getSize() > 0) { // it is a file read from Amazon S3, not a folder
    String fileName = objectSummary.getKey();
    String sURL = s3.generatePresignedUrl(mainBucketName, fileName, null).toString();
    fileName = fileName.substring(vFolderWithParagrapfsName.length() + 1);
    String dTmpFilePath = pdfDir + "/" + fileName;
    downloadFromUrl(sURL, dTmpFilePath); // the paragraphs of the paper are stored in this file

    // for each paragraph n of the file (the paragraph loop itself is elided on the
    // slides; n, shorterStmp, and hadoopOutputFilePath come from that elided code):
    {
        // run the word-count Hadoop job on the paragraph
        hj.doMyJob(pdfDir + "/" + "temp.txt",
                pdfDir + "/output" + "/" + fileName.substring(0, fileName.indexOf(".txt"))
                + "/" + fileName.substring(0, fileName.indexOf(".txt")) + "_" + n);
        // read back the job output and keep the ten most frequent words
        int numberOfWords = 10;
        MyArray array = getTopWords(hadoopOutputFilePath, numberOfWords);
        // store the paragraph and its top words in SimpleDB
        sdb.batchPutAttributes(new BatchPutAttributesRequest(domainName,
                createSampleData(fileName.substring(0, fileName.indexOf("_Paragraphs.txt")),
                        shorterStmp, n, numberOfWords, array)));
    }
}
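getTopWords and the MyArray type it returns are also never shown in the deck. Assuming the Hadoop output is the usual TextOutputFormat file with one "word<TAB>count" line per word, a sketch could look like this (both class and method are hypothetical reconstructions):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class MyArray {
    private final List<String> keys = new ArrayList<String>();
    private final List<Integer> counts = new ArrayList<Integer>();

    void add(String key, int count) { keys.add(key); counts.add(count); }
    String getKey(int i) { return keys.get(i); }
    int getNumberOfAppearances(int i) { return counts.get(i); }
}

static MyArray getTopWords(String hadoopOutputFilePath, int numberOfWords) throws IOException {
    List<String[]> entries = new ArrayList<String[]>();
    BufferedReader reader = new BufferedReader(new FileReader(hadoopOutputFilePath));
    String line;
    while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t"); // TextOutputFormat separates key and value with a tab
        if (parts.length == 2) {
            entries.add(parts);
        }
    }
    reader.close();
    // sort by descending count
    Collections.sort(entries, new Comparator<String[]>() {
        public int compare(String[] a, String[] b) {
            return Integer.parseInt(b[1]) - Integer.parseInt(a[1]);
        }
    });
    MyArray result = new MyArray();
    for (int i = 0; i < numberOfWords && i < entries.size(); i++) {
        result.add(entries.get(i)[0], Integer.parseInt(entries.get(i)[1]));
    }
    return result;
}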

Scenario – part IV (Mapper)

public class HadoopMapper extends Mapper<Object, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        // emit (word, 1) for every token of the input line
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens())
        {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

Scenario – part IV (Reducer)

public class HadoopReducer<Key> extends Reducer<Key, IntWritable, Key, IntWritable>
{
    private IntWritable result = new IntWritable();

    public void reduce(Key key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        // sum the counts emitted by the mapper for this word
        int sum = 0;
        for (IntWritable val : values)
        {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

Scenario – part IV (Driver)

public static void initJob(Job job)
{
    job.setJobName("wordcount");
    job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
    job.setMapperClass(HadoopMapper.class);
    job.setMapOutputKeyClass(org.apache.hadoop.io.Text.class);
    job.setMapOutputValueClass(org.apache.hadoop.io.IntWritable.class);
    job.setReducerClass(HadoopReducer.class);
    job.setOutputKeyClass(org.apache.hadoop.io.Text.class);
    job.setOutputValueClass(org.apache.hadoop.io.IntWritable.class);
    job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);
}


public void doMyJob(String inputFileName, String outputFolderName) throws Exception
{
    Job job = new Job();
    initJob(job);
    /* Tell the TaskTracker which jar contains the job classes */
    job.setJarByClass(HadoopJob.class);
    /* This is an example of how to set input and output. */
    FileInputFormat.setInputPaths(job, inputFileName);
    Path p = new Path(outputFolderName);
    FileOutputFormat.setOutputPath(job, p);
    /* And finally, we submit the job and wait for it to finish. */
    job.submit();
    job.waitForCompletion(true);
}


Scenario – part V


private static List<ReplaceableItem> createSampleData(String fileName, String paragraphContent,
        int paragraphNumber, int numberOfWords, MyArray array) throws IOException
{
    List<ReplaceableItem> sampleData = new ArrayList<ReplaceableItem>();
    // one SimpleDB item per paragraph: the paper name, the paragraph text,
    // and one attribute per top word (the word is the attribute name,
    // its frequency is the attribute value)
    sampleData.add(new ReplaceableItem(fileName + "_Paragraf_" + paragraphNumber).withAttributes(
        new ReplaceableAttribute("Paper", fileName + ".pdf", true),
        new ReplaceableAttribute("Paragraph_Content", paragraphContent, true),
        new ReplaceableAttribute(array.getKey(0), String.valueOf(array.getNumberOfAppearances(0)), true),
        new ReplaceableAttribute(array.getKey(1), String.valueOf(array.getNumberOfAppearances(1)), true),
        new ReplaceableAttribute(array.getKey(2), String.valueOf(array.getNumberOfAppearances(2)), true),
        …. ));
    return sampleData;
}
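The query code for scenario step 5 does not appear in the transcript. Since each paragraph's top words are stored as SimpleDB attribute names, one plausible sketch is to select the items on which a keyword exists as an attribute (the query shape is an assumption; ranking across several keywords would need additional client-side scoring):

import java.util.List;

import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.model.Item;
import com.amazonaws.services.simpledb.model.SelectRequest;

// Sketch: return the paragraphs whose stored top-ten words include the given keyword.
// sdb and domainName are the client and domain created in part IV.
static List<Item> findParagraphsByKeyword(AmazonSimpleDB sdb, String domainName, String keyword) {
    String selectExpression =
            "select * from `" + domainName + "` where `" + keyword + "` is not null";
    return sdb.select(new SelectRequest(selectExpression)).getItems();
}

For example, findParagraphsByKeyword(sdb, domainName, "cloud") would return every item (paragraph) whose top-ten words include "cloud"; the paragraph text itself is then available in the item's Paragraph_Content attribute.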


Scenario – part V (Final Results)

(Screenshot of a sample query and its results; image not included in the transcript.)

Final Conclusions & Personal Opinion

Amazon Web Services (AWS) is a collection of remote computing services that together make up a cloud computing platform.

The importance and advantages of cloud computing are proven in everyday practice.

Amazon Simple Storage Service (S3)
– A folder structure within buckets is not completely supported.

Amazon Simple Queue Service (SQS)
– Better suited to systems that send a large number of messages.

Amazon SimpleDB
– Subject to service limits (http://thecloudtutorial.com/amazonsimpledb.html).

Thank you!
