Post on 24-Dec-2015

Improving Software Package Search Quality

Dan Fingal

and

Jamie Nicolson

The Problem

Search engines for software packages typically perform poorly

Tend to search project name and blurb

Don’t take quality metrics into account

Poor user experience

Improvements in Indexing

More text relating to the package

Every package is associated with a website that contains much more detailed information about it

We spidered those sites and stripped away the HTML for indexing
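
The HTML-stripping step can be sketched as follows. This is a minimal illustration using Python's standard `html.parser` module, not the authors' actual spider; the class name `TextExtractor` and the function `strip_html` are hypothetical, and skipping `<script>` and `<style>` contents is an assumption about what "stripping" should cover.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed markup.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the plain text of an HTML page, ready for indexing."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Joining chunks with a space can split a word that is interrupted by inline markup, which is usually acceptable when the text is only going to be tokenized for an index.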

Improvements in Quality Metrics

Software download sites have additional data in their repositories

Popularity: Frequency of download

Rating: User supplied feedback

Vitality: How active development is

Version: How stable the package is

System Architecture

Pulled data from Freshmeat.net and Gentoo.org into a MySQL database

Spidered and extracted text for associated homepages

Text indexed with Lucene, using Porter Stemming and a Stop Words list

Similarity metrics as found in CS276A
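
The slides do not say which similarity metric was used. As a rough illustration only, the sketch below assumes tf-idf weighting with cosine similarity and a tiny illustrative stop-word list. It omits Porter stemming and does not reproduce Lucene's actual scoring; all function names are hypothetical.

```python
import math
from collections import Counter

# Illustrative stop-word list (an assumption, not the list used in the project).
STOP_WORDS = {"a", "an", "and", "for", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, docs):
    """Rank document names by tf-idf cosine similarity to the query.

    docs: {package_name: text}
    """
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    def weigh(tokens):
        tf = Counter(t for t in tokens if t in df)
        return {t: count * math.log(n / df[t]) for t, count in tf.items()}

    query_vec = weigh(tokenize(query))
    scores = {name: cosine(query_vec, weigh(tokens))
              for name, tokens in tokenized.items()}
    return sorted(scores, key=scores.get, reverse=True)
```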

Ranking Function

Lucene returns an ordered list of documents (packages)

Copy this list and order by the various quality metrics

Scaled Footrule Aggregation combines lists into one results list

Scaled Footrule and Lucene field parameters can be varied
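
The "copy and reorder" step might look like the sketch below. The function name `metric_rankings`, the metric dictionary layout, and the choice to treat a missing score as 0 are all assumptions made for illustration.

```python
def metric_rankings(lucene_results, metrics):
    """Return the text ranking plus one reordered copy per quality metric.

    lucene_results: package names, best text match first.
    metrics: {metric_name: {package_name: score}}, higher score = better.
    """
    rankings = {"text": list(lucene_results)}
    for name, scores in metrics.items():
        # sorted() is stable, so packages with equal scores keep
        # their relative Lucene order.
        rankings[name] = sorted(
            lucene_results,
            key=lambda package: scores.get(package, 0.0),
            reverse=True,
        )
    return rankings
```

Each resulting list contains the same packages, so the lists can be passed directly to a rank aggregation step.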

Scaled Footrule Aggregation

Scaled footrule distance between two rankings of an item is the absolute difference between its normalized positions

E.g., one list puts an item 30% down, another puts it 70% down. Scaled footrule is |.3 − .7| = .4

Scaled footrule distance between two lists is the sum over the items in common

Scaled footrule of a candidate aggregate list and the input rankings is the sum of the distances between the candidate and each input ranking

A minimum-cost maximum matching over the bipartite graph between elements and rank positions finds the aggregate list that minimizes this distance
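
The definitions above can be sketched in a few lines. To stay self-contained, this sketch finds the best aggregate by brute force over all permutations, which is only practical for very short lists; it is a stand-in for the minimum-cost matching formulation, not an implementation of it. The optional `weights` argument (one weight per input ranking) is an assumption about how the parameters enter the cost.

```python
from itertools import permutations


def normalized_positions(ranking):
    """Map each item to its rank divided by the list length."""
    n = len(ranking)
    return {item: (i + 1) / n for i, item in enumerate(ranking)}


def scaled_footrule(candidate, ranking):
    """Sum of normalized-position differences over the items in common."""
    cand = normalized_positions(candidate)
    other = normalized_positions(ranking)
    return sum(abs(cand[x] - other[x]) for x in cand if x in other)


def total_distance(candidate, rankings, weights=None):
    """Weighted sum of distances from the candidate to each input ranking."""
    weights = weights or [1.0] * len(rankings)
    return sum(w * scaled_footrule(candidate, r)
               for w, r in zip(weights, rankings))


def aggregate(rankings, weights=None):
    """Brute-force search for the ordering with the smallest total distance."""
    items = sorted(set().union(*rankings))
    best = min(permutations(items),
               key=lambda c: total_distance(c, rankings, weights))
    return list(best)
```

Because the total cost is a sum of independent per-item, per-position costs, the same minimum can be found as an assignment problem, which is what makes the matching approach tractable for long lists.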

Measuring Success

Created a gold corpus mapping 50 queries to relevant packages

One “best” answer per query

Measured recall within the first 5 results (0 or 1 for each query)

Compared results with search on packages.gentoo.org, freshmeat.net, and google.com
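
The evaluation measure is simple enough to state as code. In this sketch, `gold` maps each query to its single best answer and `search` is any function returning an ordered list of package names; both names are hypothetical.

```python
def recall_at_k(gold, search, k=5):
    """Fraction of gold queries whose best answer appears in the top k results.

    gold: {query: best_package_name}
    search: function taking a query and returning an ordered list of names.
    """
    hits = sum(1 for query, answer in gold.items()
               if answer in search(query)[:k])
    return hits / len(gold)
```

Each query contributes 0 or 1, so a score of 37/50 is reported as 0.74.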

Sample Gold Queries

“C compiler”: GCC

“financial accounting”: GNUCash

“wireless sniffer”: Kismet

“videoconferencing software”: GnomeMeeting

“wiki system”: Tiki CMS/Groupware

Training Ranking Parameters

Boils down to choosing optimal weights for each individual ranking metric

Finding an analytic solution seemed too hairy, so we used local hill-climbing with random restarts
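
A generic version of hill-climbing with random restarts is sketched below. The step size, restart count, starting range, and the rule of moving to the best single-weight change are all assumptions; the slides do not describe these details. In the setting above, `score` would be recall on the gold corpus for a given weight vector.

```python
import random


def hill_climb(score, n_params, restarts=10, step=0.1, max_steps=200, seed=0):
    """Maximize score(weights) by local search from several random starts."""
    rng = random.Random(seed)
    best_w, best_s = None, float("-inf")
    for _ in range(restarts):
        # Random restart: begin from a fresh point in [0, 1)^n.
        w = [rng.random() for _ in range(n_params)]
        s = score(w)
        for _ in range(max_steps):
            # Neighbors differ from w in one weight, nudged up or down.
            neighbors = []
            for i in range(n_params):
                for delta in (step, -step):
                    cand = list(w)
                    cand[i] = max(0.0, cand[i] + delta)  # keep weights non-negative
                    neighbors.append(cand)
            cand = max(neighbors, key=score)
            cand_s = score(cand)
            if cand_s <= s:
                break  # no neighbor improves: local optimum reached
            w, s = cand, cand_s
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s
```

Because recall over 50 queries changes in discrete jumps, many neighbors score the same and a climb can stall on a plateau, which is one reason random restarts help.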

Training Results

Uniform parameters gave a recall of 29/50 (58%)

Best hill-climbing parameters gave recall of 37/50 (74%)

We're testing on the training set, but with only five parameters the risk of overfitting seems low

Popularity was the highest-weighted feature

Comparing to Google

Google is a general search engine, not specific to software search

How to help Google find software packages?

Append “Linux Software” to queries

How to judge the results?

Package name appears in page?

Package name appears in search results?

System Performance on Gold Corpus

Recall in Top 5 Results

Gentoo: 0.26

Freshmeat: 0.38

Google: 0.58

PackageMiner (untrained): 0.58

PackageMiner (trained): 0.74

Room for Improvement

Gold corpus

Are these queries really representative?

Are our answers really the best?

Did we choose samples we knew would work well with our method?

Training Process

Could probably come up with better training algorithms

Any questions?