TRANSCRIPT
Improving Software Package Search Quality
Dan Fingal
and
Jamie Nicolson
The Problem
Search engines for software packages typically perform poorly
Tend to search only the project name and blurb
Don't take quality metrics into account
Poor user experience
Improvements in Indexing
More text relating to the package
Every package is associated with a website that contains much more detailed information about it
We spidered those sites and stripped away the HTML for indexing (sketched below)
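A minimal sketch of the HTML-stripping step, using only Python's standard library; fetching, encoding, and link handling are omitted, and this is illustrative rather than the exact pipeline described above.

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collects visible text, skipping script and style blocks."""
        SKIP = {"script", "style"}

        def __init__(self):
            super().__init__()
            self._skip_depth = 0
            self.chunks = []

        def handle_starttag(self, tag, attrs):
            if tag in self.SKIP:
                self._skip_depth += 1

        def handle_endtag(self, tag):
            if tag in self.SKIP and self._skip_depth > 0:
                self._skip_depth -= 1

        def handle_data(self, data):
            if self._skip_depth == 0 and data.strip():
                self.chunks.append(data.strip())

    def strip_html(html_source):
        parser = TextExtractor()
        parser.feed(html_source)
        return " ".join(parser.chunks)

    # Example: text from a package homepage, ready to be indexed.
    print(strip_html("<html><body><h1>GCC</h1><p>The GNU Compiler Collection.</p></body></html>"))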
Improvements in Quality Metrics
Software download sites have additional data in their repositories
Popularity: frequency of download
Rating: user-supplied feedback
Vitality: how active development is
Version: how stable the package is
System Architecture
Pulled data from Freshmeat.net and Gentoo.org into a MySQL database
Spidered and extracted text for associated homepages
Text indexed with Lucene, using Porter stemming and a stop-word list (normalization sketched below)
Similarity metrics as found in CS276A
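A rough sketch of the index-time text normalization (Porter stemming plus stop-word removal), reproduced here with NLTK's PorterStemmer rather than Lucene's own analyzer; the tokenizer and the abbreviated stop-word list are stand-ins for the real configuration.

    import re
    from nltk.stem import PorterStemmer  # assumes NLTK is installed

    STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to", "with"}  # abbreviated stand-in list
    stemmer = PorterStemmer()

    def analyze(text):
        """Lowercase, tokenize, drop stop words, and Porter-stem the rest."""
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

    # e.g. the blurb of a compiler package, as it would be fed to the index
    print(analyze("The GNU Compiler Collection includes front ends for C and C++"))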
Ranking Function
Lucene returns an ordered list of documents (packages)
Copy this list and reorder it by each of the quality metrics (sketched below)
Scaled Footrule Aggregation combines lists into one results list
Scaled Footrule and Lucene field parameters can be varied
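A sketch of how the per-metric input lists could be derived from a single Lucene result set; the hit dictionaries and field names (popularity, rating, vitality) are illustrative stand-ins for the Freshmeat/Gentoo metadata.

    # Each Lucene hit carries the package name plus its quality metadata.
    hits = [
        {"name": "gcc",   "lucene_score": 2.1, "popularity": 9800, "rating": 8.9, "vitality": 7.5},
        {"name": "tcc",   "lucene_score": 1.7, "popularity": 1200, "rating": 7.1, "vitality": 2.0},
        {"name": "clang", "lucene_score": 1.5, "popularity": 4500, "rating": 8.2, "vitality": 9.1},
    ]

    def ranking_by(hits, key):
        """Return package names ordered best-first by the given metric."""
        return [h["name"] for h in sorted(hits, key=lambda h: h[key], reverse=True)]

    # One input ranking per signal; these are what scaled footrule aggregation combines.
    rankings = [ranking_by(hits, k) for k in ("lucene_score", "popularity", "rating", "vitality")]
    for r in rankings:
        print(r)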
Scaled Footrule Aggregation
Scaled footrule distance between two rankings of an item is the difference in its normalized position
E.g., one list puts an item 30% down, another puts it 70% down. Scaled footrule is |.3 − .7| = .4
Scaled footrule distance between two lists is the sum over the items in common
Scaled footrule of a candidate aggregate list and the input rankings is the sum of the distances between the candidate and each input ranking
A minimum-cost maximum matching over a bipartite graph between elements and rank positions finds the optimal aggregate ranking (sketched below)
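A compact sketch of footrule-optimal aggregation following the description above, using SciPy's linear_sum_assignment for the bipartite matching step; this is an illustration, not the project's actual code.

    import numpy as np
    from scipy.optimize import linear_sum_assignment  # min-cost assignment (Hungarian-style)

    def footrule_aggregate(rankings):
        """Combine several rankings into one list minimizing total
        scaled footrule distance to the inputs."""
        items = sorted(set.union(*(set(r) for r in rankings)))
        n = len(items)
        # cost[i][j] = total scaled footrule distance if item i is placed at position j
        cost = np.zeros((n, n))
        for i, item in enumerate(items):
            for j in range(n):
                for r in rankings:
                    if item in r:
                        cost[i, j] += abs(r.index(item) / len(r) - j / n)
        rows, cols = linear_sum_assignment(cost)  # min-cost perfect matching
        order = sorted(zip(cols, (items[i] for i in rows)))
        return [item for _, item in order]

    print(footrule_aggregate([
        ["gcc", "clang", "tcc"],   # e.g. Lucene text relevance
        ["gcc", "tcc", "clang"],   # e.g. popularity
        ["clang", "gcc", "tcc"],   # e.g. vitality
    ]))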
Measuring Success
Created a gold corpus of 50 queries to relevant packages
One “best” answer per query
Measured recall within the first 5 results, scored 0 or 1 for each query (sketched below)
Compared results with search on packages.gentoo.org, freshmeat.net, and google.com
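A small sketch of the recall-at-5 scoring over the gold corpus; the query/answer pairs follow the sample gold queries on the next slide, and search_fn and the toy engine are placeholders for the systems being compared.

    GOLD = {
        "C compiler": "gcc",
        "financial accounting": "gnucash",
        "wireless sniffer": "kismet",
    }

    def recall_at_5(search_fn, gold):
        """Score 1 per query whose gold package appears in the first 5 results."""
        hits = sum(1 for query, answer in gold.items()
                   if answer in [p.lower() for p in search_fn(query)[:5]])
        return hits / len(gold)

    # Toy engine standing in for PackageMiner / Freshmeat / Gentoo search.
    def toy_search(query):
        return ["gcc", "make", "gdb"] if "compiler" in query else ["somepkg"]

    print(recall_at_5(toy_search, GOLD))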
Sample Gold Queries
“C compiler”: GCC
“financial accounting”: GNUCash
“wireless sniffer”: Kismet
“videoconferencing software”: GnomeMeeting
“wiki system”: Tiki CMS/Groupware
Training Ranking Parameters
Boils down to choosing optimal weights for each individual ranking metric
Finding an analytic solution seemed too hairy, so we used local hill-climbing with random restarts (sketched below)
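A generic sketch of hill climbing with random restarts over the five metric weights; the step size, restart count, and the toy objective below are illustrative, with the real objective being recall on the 50-query gold corpus.

    import random

    def hill_climb(objective, n_weights=5, restarts=10, steps=200, step_size=0.1):
        """Local hill climbing with random restarts over a weight vector."""
        best_w, best_score = None, float("-inf")
        for _ in range(restarts):
            w = [random.random() for _ in range(n_weights)]  # random starting point
            score = objective(w)
            for _ in range(steps):
                candidate = [max(0.0, wi + random.uniform(-step_size, step_size)) for wi in w]
                c_score = objective(candidate)
                if c_score > score:  # keep only improving moves
                    w, score = candidate, c_score
            if score > best_score:
                best_w, best_score = w, score
        return best_w, best_score

    # objective(w) would run the whole ranking pipeline with these weights
    # and return gold-corpus recall; a toy stand-in is used here.
    print(hill_climb(lambda w: -sum((wi - 0.5) ** 2 for wi in w)))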
Training Results
Uniform parameters gave a recall of 29/50 (58%)
Best hill-climbing parameters gave recall of 37/50 (74%)
We're testing on the training set, but with only five parameters the risk of overfitting seems low
Popularity was the highest-weighted feature
Comparing to Google
Google is a general search engine, not specific to software search
How to help Google find software packages? Append “Linux Software” to queries
How to judge the results? Package name appears in page? Package name appears in search results?
System Performance on Gold Corpus (recall in top 5 results):
Gentoo: 0.26
Freshmeat: 0.38
Google: 0.58
PackageMiner (untrained): 0.58
PackageMiner (trained): 0.74
Room for Improvement
Gold corpus:
Are these queries really representative?
Are our answers really the best?
Did we choose samples we knew would work well with our method?
Training process:
Could probably come up with better training algorithms
Any questions?