mendeley: crowdsourcing and recommending research on a large scale

Post on 28-May-2015

DESCRIPTION

I was invited to be the keynote speaker at a special track on Recommendation, Data Sharing, and Research Practices in Science 2.0 at the I-KNOW 2011 conference (http://i-know.tugraz.at/) on 2011/09/07. The talk presents the challenges involved in crowdsourcing the world's largest research catalogue and then building a recommendation service on top of it that scales to serve millions of users.

TRANSCRIPT

Mendeley: crowdsourcing and recommending research on a large scale

Kris Jack, PhD
Data Mining Team Lead

➔ what is mendeley?

➔ crowdsourcing on a large scale

➔ recommendations on a large scale

➔ data for you

Summary

Mendeley is...

...a startup company

...going to change the way that we do research...

Mendeley provides tools to help users...

...organise their research

...collaborate with one another

...discover new research

➔ crowdsourcing on a large scale

Last.fm works like this:

1) Install “Audioscrobbler”

2) Listen to music

3) Last.fm builds your music profile and recommends music that you might also like

and it’s the world’s largest open music database!

Last.fm            Mendeley

music libraries    research libraries
artists            researchers
songs              papers
genres             disciplines

Screenshot taken from www.mendeley.com on 04/09/11

Mendeley is the world’s largest crowdsourced research catalogue!

Catalogue Crowdsourcing: System Requirements

assimilate research artefacts into catalogue in real time (PDFs + citation metadata)

recognise duplicate and non-duplicate artefacts in noisy input

[Diagram: articles → catalogue generator → catalogue]

Main types of input:

→ article PDFs
→ article metadata (e.g. reference)

Main sources of input:

→ Mendeley Desktop
→ Mendeley Web Importer
→ External catalogue imports (e.g. arXiv)
→ External catalogue lookups (e.g. CrossRef)

Aims:

→ Cluster documents together
→ Generate catalogue entries

Process:

→ Filehash check (SHA-1)
→ Identifier check (e.g. PubMed id)
→ Document fingerprint (full text)
→ Metadata similarity check
→ Update individual article page
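One way to read the process above is as a cascade that tries the cheapest, most exact check first and only falls back to fuzzier checks when nothing matches. The sketch below is an illustration under stated assumptions, not Mendeley's implementation: `make_doc`, `Catalogue`, the record layout, the word-hash "fingerprint" and the 0.9 title-similarity threshold are all hypothetical stand-ins. Only the order of the checks comes from the slide.

```python
import hashlib
from difflib import SequenceMatcher


def make_doc(title, pdf_bytes=None, full_text=None, identifiers=()):
    """Build the record used for matching (hypothetical layout)."""
    fingerprint = None
    if full_text is not None:
        # Stand-in fingerprint: SHA-1 of the lower-cased alphanumeric words.
        words = "".join(c.lower() if c.isalnum() else " " for c in full_text).split()
        fingerprint = hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()
    return {
        "title": title,
        "sha1": hashlib.sha1(pdf_bytes).hexdigest() if pdf_bytes is not None else None,
        "fingerprint": fingerprint,
        "identifiers": set(identifiers),
    }


def title_similarity(a, b):
    """Character-level similarity in [0, 1]; a crude metadata comparison."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class Catalogue:
    def __init__(self, threshold=0.9):
        self.entries = []           # one dict of sets per catalogue entry (cluster)
        self.threshold = threshold  # assumed cut-off for the metadata check

    def find(self, doc):
        """Return (entry index, stage that matched), or (None, "new")."""
        checks = [
            ("filehash", lambda e: doc["sha1"] is not None and doc["sha1"] in e["sha1s"]),
            ("identifier", lambda e: bool(doc["identifiers"] & e["identifiers"])),
            ("fingerprint", lambda e: doc["fingerprint"] is not None
                and doc["fingerprint"] in e["fingerprints"]),
            ("metadata", lambda e: any(
                title_similarity(doc["title"], t) >= self.threshold for t in e["titles"])),
        ]
        # Each stage is tried against every entry before moving to the next stage.
        for stage, matches in checks:
            for index, entry in enumerate(self.entries):
                if matches(entry):
                    return index, stage
        return None, "new"

    def add(self, doc):
        """Cluster the document into an existing entry or create a new one."""
        index, stage = self.find(doc)
        if index is None:
            self.entries.append(
                {"sha1s": set(), "identifiers": set(), "fingerprints": set(), "titles": set()})
            index = len(self.entries) - 1
        entry = self.entries[index]
        if doc["sha1"] is not None:
            entry["sha1s"].add(doc["sha1"])
        if doc["fingerprint"] is not None:
            entry["fingerprints"].add(doc["fingerprint"])
        entry["identifiers"] |= doc["identifiers"]
        entry["titles"].add(doc["title"])
        return index, stage
```

Calling `Catalogue().add(make_doc(...))` for each incoming artefact returns the entry it was clustered into and the stage that matched. The final "update individual article page" step would hang off that return value and is not modelled here.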

Catalogue with:

→ article metadata
→ aggregated statistics
→ support for recommendations, etc.

➔ recommendations on a large scale

Article Recommendation: System Requirements

generate personal article recommendations for users (i.e. “here are some articles that may interest you”)

update recommendations every 24 hours

Input: User libraries

Output: Recommend 10 articles to each user

Recommendation through collaborative filtering

Article is in library or not (i.e. binary input)

Various similarity metrics (e.g. cooccurrence, loglikelihood, tanimoto)

16 months ago
Test: 10-fold cross validation, 50,000 user libraries
Results: <0.025 precision at 10

10 months ago (i.e. + 6 months)
Test: 10-fold cross validation, 50,000 user libraries
Results: ~0.1 precision at 10

10 months ago (i.e. + 6 months)
Test: release to a subset of users
Results: ~0.4 precision at 10
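To make the binary-input setup concrete, here is a minimal item-based sketch of two of the similarity metrics named above, cooccurrence and Tanimoto (log-likelihood is left out for brevity). It assumes each library is a plain set of article ids. The function names and the scoring rule (sum the similarities to every article the user already owns) are illustrative choices, not the production recommender described in the talk.

```python
from collections import defaultdict


def item_readers(libraries):
    """Invert {user: set(articles)} into {article: set(users)}."""
    readers = defaultdict(set)
    for user, articles in libraries.items():
        for article in articles:
            readers[article].add(user)
    return dict(readers)


def cooccurrence(readers_a, readers_b):
    """Number of libraries that contain both articles."""
    return len(readers_a & readers_b)


def tanimoto(readers_a, readers_b):
    """Shared libraries divided by libraries containing either article."""
    union = len(readers_a | readers_b)
    return len(readers_a & readers_b) / union if union else 0.0


def recommend(libraries, user, similarity, n=10):
    """Rank articles the user does not own by summed similarity to owned ones."""
    readers = item_readers(libraries)
    owned = libraries[user]
    scores = {}
    for candidate, candidate_readers in readers.items():
        if candidate in owned:
            continue
        scores[candidate] = sum(
            similarity(readers[article], candidate_readers) for article in owned)
    # Highest score first; ties broken by article id so the output is stable.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [article for article, score in ranked if score > 0][:n]
```

With `n=10` this yields the "recommend 10 articles to each user" output from the requirements slide. Swapping `cooccurrence` for `tanimoto` changes only the ranking, which is what makes it easy to compare metrics under the same test.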

[Figure: Article Recommendation Acceptance Rates. Y-axis: acceptance rate (i.e. accept/reject clicks); x-axis: number of months live]

Article Recommendation: System Requirements

generate personal article recommendations for users (i.e. “here are some articles that may interest you”)

update recommendations every 24 hours

1 million users!

days!

How to scale up?

Test: 10-fold cross validation, 50,000 user libraries

So, results comparable to non-distributed recommender

Completely distributed, so can easily run on EC2 within 24 hours...
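For readers unfamiliar with the figures quoted in the tests, precision at 10 is the fraction of the 10 recommended articles that turn out to be relevant, here taken to mean articles held out from the user's library. The sketch below shows that measure plus one simple way to cut a library into 10 folds. The exact evaluation protocol is not given in the slides, so the round-robin split and the function names are assumptions.

```python
def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommendations that appear in `relevant`.

    Divides by k even if fewer than k items were recommended, so a short
    list is penalised rather than rewarded.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for article in recommended[:k] if article in relevant)
    return hits / k


def k_folds(items, k=10):
    """Yield (train, test) sets; every item is held out in exactly one fold."""
    ordered = sorted(items)
    for fold in range(k):
        test = set(ordered[fold::k])   # every k-th item, starting at `fold`
        train = set(ordered) - test
        yield train, test
```

In a cross-validation run, the recommender would see only `train` for each fold, and `precision_at_k` would be computed against `test` and averaged over folds and users.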

[Figure: Article Recommendation Precision Across User Library Sizes (using cooccurrence). Y-axis: precision at 10 articles; x-axis: number of articles in user library]

How will real users react?

➔ data for you

Public Data

library readership

library stars

Obtain from: http://dev.mendeley.com/datachallenge

user libraries

50,000 libraries

4,848,724 articles

3,652,285 unique articles

Mendeley's API

www.mendeley.com
