Rob Lambert, NIKHEF LHCb Core Soft, 5th March 2014 1
Subverting subversion? (or “An LHCb ‘commitment’ poll”?)
Rob Lambert
Remember… it’s just for fun!
Aims
Impersonal:
  Mine the data available from svn
  Gauge the number of individuals contributing to core software
  Gauge something about contributions to core software
Personal:
  I’ve been in this collaboration for 8.5 years (and probably not much longer)
  How much have I actually contributed to the codebase?
Things you notice on SVN
1. Release managers commit much more often
2. There are some projects which have skewed statistics
Tagging
  svn cp, from trunk to tags
  Svn often “blames” the copier
  Necessary work, but doesn’t change the code
DBASE/PARAM, Stripping, Erasmus/Urania
  Data files: no diffs or millions of diffs
  “Analysis-specific” software
  Single users with vastly different usage habits
  Legacy code, legacy data
  Not all core software
List of metrics
I need metrics which can avoid these issues
1. svn blame
2. svn blame without comments!
3. Total number of commits
4. Total number of touched files (sum over all commits)
5. Total number of changed lines (sum over all commits)
Collect statistics by:
a) Project
b) User
Remember, this is all public information anyway (and just for fun)
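A minimal sketch of how metrics 3 and 4 could be tallied per user, assuming the history has been dumped with `svn log --xml -v`. The sample XML, the user names and the helper `tally` are invented for illustration; this is not the SvnPollTools code, and the rule used to skip tags and copies is only one possible reading of the slide.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written sample in the shape produced by `svn log --xml -v`.
# Revisions, authors and paths are hypothetical.
SAMPLE_LOG = """<log>
<logentry revision="101">
<author>alice</author>
<paths>
<path action="M" kind="file">/Project/trunk/src/Algo.cpp</path>
<path action="M" kind="file">/Project/trunk/src/Algo.h</path>
</paths>
<msg>fix a bug</msg>
</logentry>
<logentry revision="102">
<author>bob</author>
<paths>
<path action="A" kind="dir" copyfrom-path="/Project/trunk" copyfrom-rev="101">/Project/tags/v1r0</path>
</paths>
<msg>tag v1r0</msg>
</logentry>
<logentry revision="103">
<author>alice</author>
<paths>
<path action="M" kind="file">/Project/trunk/doc/release.notes</path>
</paths>
<msg>release notes</msg>
</logentry>
</log>"""


def tally(log_xml):
    """Return (commits per author, touched files per author)."""
    commits, files = Counter(), Counter()
    for entry in ET.fromstring(log_xml).iter("logentry"):
        author = entry.findtext("author", default="(no author)")
        touched = [
            p.text
            for p in entry.iter("path")
            if "/tags/" not in p.text            # assumption: tags live under /tags/
            and p.get("copyfrom-path") is None   # skip svn cp entries
            and p.get("kind") != "dir"           # count files, not directories
        ]
        if not touched:
            continue  # pure tagging or copy commit: not counted at all
        commits[author] += 1
        files[author] += len(touched)
    return commits, files


commits, files = tally(SAMPLE_LOG)
print(commits.most_common())
print(files.most_common())
```

Collecting by project as well would need a rule for mapping a path to a project (for example its first path component), which is also where a commit can end up attributed to the wrong project.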
Basic Totals
My survey draws from the following data
So, on average we’d have 8k blames per person
But there are major differences in the different metrics
Before I can rank contributions, I need to examine the metrics
Stat                 Core Soft  Entire Repo
Committers           361        495
Total lines blamed   2.8 M      7.9 M
Commits              88 k       125 k
Total diffs          57 M       112 M
Total files changed  0.5 M      0.7 M
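The “8k blames per person” figure follows from the core-soft column (2.8 M blamed lines over 361 committers). A small sketch of the same division for every row, using only the rounded totals quoted in the table; the dictionary keys and the helper `per_committer` are names made up here.

```python
# Rounded totals as quoted on the slide.
TOTALS = {
    "Core Soft": {"committers": 361, "lines_blamed": 2.8e6, "commits": 88e3,
                  "diffs": 57e6, "files_changed": 0.5e6},
    "Entire Repo": {"committers": 495, "lines_blamed": 7.9e6, "commits": 125e3,
                    "diffs": 112e6, "files_changed": 0.7e6},
}


def per_committer(stats):
    """Divide every total by the number of committers."""
    n = stats["committers"]
    return {key: value / n for key, value in stats.items() if key != "committers"}


for scope, stats in TOTALS.items():
    averages = per_committer(stats)
    print(scope, {key: round(value) for key, value in averages.items()})
```

Because the inputs are rounded to two significant figures, the averages are only good to about the same precision.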
Methods have different strengths and weaknesses (further details in the backup slides)
To overcome all problems, I must combine the metrics (Don’t need the most basic blame, combine all others)
Comparing metrics
Methods compared: blame, blame - comment, commits, files, diffs
Criteria: Historical; Current; Ignore tags; Ignore copies; Include comments; Include documentation; Ignore whitespace; Project name reliable?; Include tests (qmt); Include .ref?; Suppress long .ref diffs?; Scales with contribution?; Scales as # projects?; Scales with complexity?; Includes clean-ups?
Metrics
[Plot] Blames against blames - comment (x: blames - comment; y: blames)
f(x) = 1.38254397482164 x − 443.815370839477, R² = 0.976246064286745
97% correlated, don’t use both!
[Plot] Commits against blames - comment (x: blames - comment; y: commits)
f(x) = 0.0205507257539513 x + 125.198568106126, R² = 0.190659707654524
20% correlated, need to use both!
[Plot] Files against diffs (x: diffs/2; y: files)
f(x) = 0.0143535384063346 x + 127.011190305812, R² = 0.891450803039921
89% correlated, but measure different things! Weight down slightly
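For a straight-line fit with an intercept, the R² of the fit equals the squared Pearson correlation of the two metrics. A sketch of that calculation in plain Python; the helper `r_squared` is a name chosen here, and the three data points are just the three users listed on the “Combining the metrics” slide, so the result will not reproduce the values quoted for the full set of committers.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of a least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# cattanem, jonrob, mvesteri (values as printed on the slide)
blames = [46559, 192877, 113553]
blame_comment = [35337, 132495, 99524]
commits = [15439, 6567, 153]

print("blames vs blame-comment:", r_squared(blame_comment, blames))
print("commits vs blame-comment:", r_squared(blame_comment, commits))
```

Two metrics with R² close to 1 carry nearly the same information, which is the argument for dropping one of them; a low R² means each adds something the other does not.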
Combining the metrics
The metrics are very disparate in magnitude and spread
1. Take a safe logarithm to alleviate the scale problem
2. Scale that down to the max of the metric
3. Average over historical metrics, weight down historical by 2/3
4. Re-exponentiate to return a reasonable spread, express as %
name      files  commits  blames  blame-comment  lines/2
cattanem  80877  15439    46559   35337          4425280.0
jonrob    52054  6567     192877  132495         4652467.5
mvesteri  1258   153      113553  99524          197790.0
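$$s^{\mathrm{user}}_{\mathrm{metric}} = \log\left(\max\left(1,\; x^{\mathrm{user}}_{\mathrm{metric}}\right)\right)$$
$$s^{\max}_{\mathrm{metric}} = \max\left(s^{\mathrm{first}}_{\mathrm{metric}}, \ldots, s^{\mathrm{last}}_{\mathrm{metric}}\right)$$
$$m^{\mathrm{user}} = \frac{\sum_{\mathrm{metrics}} w_{\mathrm{metric}}\; s^{\mathrm{user}}_{\mathrm{metric}} / s^{\max}_{\mathrm{metric}}}{\sum_{\mathrm{metrics}} w_{\mathrm{metric}}}$$
$$\mathrm{GPA}^{\mathrm{user}} = 100 \times \exp\left[6\, m^{\mathrm{user}}\right] / \exp\left[6\right]$$
Here x is the raw value of a metric for a user and w is the weight given to that metric.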
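The four steps for combining the metrics can be sketched in a few lines of Python. This is a reading of the slide, not the SvnPollTools code: it assumes weight 1 for blame - comment and weight 2/3 for each historical metric (commits, files, diffs), uses the diffs/2 values as tabulated, and takes the per-metric maximum over only the three users included here (rows copied from the top-20 table with exclusions). Under those assumptions the scores come out close to the tabulated 76.1, 72.6 and 60.3. The names `WEIGHTS`, `USERS`, `safe_log` and `gpa` are made up for the example.

```python
import math

# Assumed weights: "weight down historical by 2/3".
WEIGHTS = {"blame_comment": 1.0, "commits": 2 / 3, "files": 2 / 3, "diffs": 2 / 3}

# Rows copied from the top-20 table with exclusions ("diffs" is the diffs/2 column).
USERS = {
    "jonrob": {"files": 52054, "commits": 6567, "diffs": 4652468.0, "blame_comment": 132495},
    "cattanem": {"files": 80877, "commits": 15439, "diffs": 4425280.0, "blame_comment": 35337},
    "ibelyaev": {"files": 21721, "commits": 3683, "diffs": 804127.0, "blame_comment": 250903},
}


def safe_log(x):
    """Step 1: logarithm that cannot go below zero or fail on x = 0."""
    return math.log(max(1.0, x))


def gpa(users, weights, k=6.0):
    # Step 2: largest log-score per metric (fall back to 1.0 if every value is <= 1).
    s_max = {m: max(safe_log(u[m]) for u in users.values()) or 1.0 for m in weights}
    total_weight = sum(weights.values())
    scores = {}
    for name, u in users.items():
        # Step 3: weighted average of the scaled log-scores.
        mean = sum(w * safe_log(u[m]) / s_max[m] for m, w in weights.items()) / total_weight
        # Step 4: re-exponentiate and express as a percentage.
        scores[name] = 100.0 * math.exp(k * mean) / math.exp(k)
    return scores


for name, score in sorted(gpa(USERS, WEIGHTS).items(), key=lambda item: -item[1]):
    print(f"{name:10s} {score:5.1f}")
```

A user who holds the maximum in every metric would score exactly 100, and the factor 6 in the exponent sets how quickly the score falls away from that.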
Pause for thought?
Was I lying to myself, and on my CV? Let’s find out!
LHCb has around 10 application experts, plus one per subdetector
LHCb core-soft developer base consists of around 100 developers.
Other Stats
Top 20 experts created 66% of the code
20 people: 66%; 341 people: 33%
[Plot] Histogram of GPA (x: GPA; y: frequency, # people)
Other Stats
Top 105 developers created the rest of the code
236 people: 5%; 125 people: 95%
[Plot] Cumulative contribution against # of people (x: # of people; y: cumulative contribution)
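A cumulative-contribution curve of this kind can be built by sorting a per-user score in descending order and accumulating its share of the total. The slides do not say which quantity was summed, so the scores below are purely hypothetical and the helpers `cumulative_share` and `people_needed` are names invented for this sketch.

```python
from itertools import accumulate


def cumulative_share(scores):
    """Fraction of the total held by the top 1, top 2, ... contributors."""
    ordered = sorted(scores, reverse=True)
    total = sum(ordered)  # assumed non-zero
    return [running / total for running in accumulate(ordered)]


def people_needed(scores, fraction):
    """Smallest number of top contributors whose combined share reaches `fraction`."""
    shares = cumulative_share(scores)
    for count, share in enumerate(shares, start=1):
        if share >= fraction:
            return count
    return len(shares)


# Hypothetical per-user scores, not taken from the poll.
example_scores = [5, 50, 10, 30, 5]
print(cumulative_share(example_scores))
print(people_needed(example_scores, 0.66))
```

Reading the curve at a given number of people gives statements such as “the top N created X% of the code”.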
The big reveal
Ranking
The top 20, with exclusions
(i.e. no Erasmus/Urania/obsolete/PARAM/DBASE/Stripping)
#### name      GPA   files  commits  blames  diffs/2    blame - comment
-----------------------------------------------------------------------
   1 jonrob    76.1  52054  6567     192877  4652468.0  132495
   2 cattanem  72.6  80877  15439    46559   4425280.0  35337
   3 ibelyaev  60.3  21721  3683     408993  804127.0   250903
   4 marcocle  57.5  69126  1904     71563   4262422.0  56901
   5 frankm    56.0  16768  3421     305320  609614.5   237179
   6 mcoombes  35.5  13586  363      79560   1692002.0  64921
   7 jpalac    34.5  10193  6181     32880   117944.0   24316
   8 robbep    33.7  5759   1247     82471   460789.0   60787
   9 pkoppenb  33.5  14345  4086     19768   306867.0   13475
  10 rlambert  32.8  10268  2710     19527   614237.5   14723
  11 graven    32.5  9702   3937     35308   112972.5   26306
  12 gcorti    30.3  7578   3198     22404   240781.5   16415
  13 odescham  28.0  5977   1261     49687   125332.5   37930
  14 wouter    27.8  3324   1718     64833   81623.5    53539
  15 hmdegaud  26.9  5348   1454     20019   259001.5   18926
  16 truf      26.5  4753   1218     46148   79648.5    41436
  17 jost      24.0  2000   799      86178   78254.5    61586
  18 barrand   22.4  4791   1644     4271    978583.5   2893
  19 ocallot   22.3  2914   746      45950   63892.5    34804
  20 mneedham  22.0  3945   940      28740   62806.5    21351
The medallists
Ranking (2)
The top 20, no exclusions
Ranking is pretty stable due to combined metric
#### name      GPA   files  commits  blames  diffs / 2  blame-comment
---------------------------------------------------------------------
   1 cattanem  69.1  92881  17699    99870   5770542.0  88375
   2 jonrob    65.3  57520  7525     211669  9503448.0  147087
   3 ibelyaev  54.4  28234  4990     536087  1254729.0  340985
   4 marcocle  52.9  74768  2305     146540  4437568.0  131435
   5 frankm    44.8  18365  3745     305320  675987.0   237179
   6 pkoppenb  42.7  29061  8060     64013   815972.0   52924
   7 pseyfert  41    7882   1019     838184  715032.5   827245
   8 graven    38.2  62388  4332     35361   693566.0   26330
   9 diegoms   38    11082  1669     118837  2845016.0  108698
  10 robbep    36.1  15393  2663     109255  709928.5   83123
  11 rlambert  34.6  11407  3082     84333   663746.0   71803
  12 jpalac    32.3  13881  7465     39626   234795.5   30467
  13 gcorti    31.6  12007  4303     34411   766920.5   25353
  14 mcoombes  29.1  13886  418      86968   1750572.0  69999
  15 liblhcb   25.4  19902  54       166164  1271841.0  166163
  16 phunt     23.9  4400   351      181278  258571.0   156539
  17 ocallot   23.6  6605   1420     46338   185653.5   35192
  18 mkarbach  23.4  3385   516      155646  179205.0   143672
  19 odescham  23.3  6361   1357     54241   133133.5   41528
  20 wouter    22.8  3447   1805     70385   86388.5    57510
The medallists
Interesting outliers
Metric                Paul Seyfert    Vanya Belyaev   Thomas Bird
GPA (exclusions)      (#93) : 6.5 %   (#3) : 54.4 %   (#217) : 2.0 %
GPA (inclusive)       (#9) : 36.7 %   (#3) : 60.3 %   (#37) : 17.8 %
Blamed (inclusive)    (#1) : 0.8 M    (#2) : 0.5 M    (#4) : 0.30 M
Comments (inclusive)  (#22) : 11 K    (#1) : 0.2 M    (#2) : 0.14 M
#1 or 2 (it’s cool that the computing project leader is the leader!)
(15% of all code)
Looks like we have around 1/3 comments
[Plot] Histogram of the proportion of blamed lines which were comments (x: proportion, 0 to 1; y: frequency, # contributors)
Comments
(Vanya sits here)
(Paul sits here)
Make your own metric?
You could use a standard tool: svnplot, svnstat, statsvn, plotsvn, mypy-svn-stat …
Most standard tools are based on pysvn and sqlite, and as Marco/Ben know, I gave up on trying to have Lbscripts, sqlite3 and pysvn working at the same time
Of course, I’ve committed my code into SVN … SvnPollTools
https://svnweb.cern.ch/trac/lhcb/browser/packages/trunk/SvnPollTools
Complete lists as csv files here
Other possible improvements:
  Ignore all ref files and test directories also in the line diffs (probably won’t significantly change the outcome)
  Add LHCbDirac’s svn (will add Joel and Philippe et al.)
  Add Gaudi’s svn (will make Marco Cle #1 probably…)
  Multiply by the call graph?
Conclusions
LHCb “core applications” appear to have
  ~20 experts, who contributed 2/3 of the total codebase
  ~100 developers who contributed 1/3 of the codebase
  ~250 contributors whose core soft contribution is very tiny
Interesting observations I can also make over dinner:
  Years of service vs GPA
  GPA vs institute affiliation
  Permanence of current position…
And in the end, well, at least I know I helped
End
Backups are often required
(1) blamed
Inclusions: .h .cpp .icpp .xml .xsd .dec .py
Exclusions: blank lines; tags and branches
Pros:
  Measures a real contribution to the currently used code
  Measures current activity
  Eliminates “tagging” activity
Cons:
  Can mis-attribute entire lines thanks to a single small modification
  No attribution for past activity
  Not all lines are equal in contribution
  Includes comment lines
  Incorrect attribution of svn cp command
(2) blamed - comments
Inclusions: .h .cpp .icpp .xml .xsd .dec .py
Exclusions: blank lines; tags and branches; comments!
Pros:
  Measures a real contribution to the currently used code
  Measures current activity
  Ignores comment lines
  Eliminates “tagging” activity
Cons:
  Can mis-attribute entire lines thanks to a single small modification
  No attribution for past activity
  Not all lines are equal in contribution
  Comments -> important documentation!
  Incorrect attribution of svn cp command
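A sketch of the two blame metrics, counted from text in the shape of plain `svn blame` output (revision, author, then the line itself). The sample lines, the regular expression and the helper `count_blame` are assumptions made for illustration; the comment test only looks at the start of the line, so block comments and trailing comments are not handled, and the comment prefixes must be chosen per file type.

```python
import re
from collections import Counter

# Assumed layout of one blame line: "<revision> <author> <content>".
BLAME_LINE = re.compile(r"^\s*(\S+)\s+(\S+)(?: (.*))?$")

# Hypothetical blame output for a small C++ file.
SAMPLE_BLAME = "\n".join([
    '   101      alice #include "Algo.h"',
    '   101      alice // constructor',
    '   205        bob Algo::Algo() {}',
    '   205        bob ',
    '   310      alice   // tidy up',
    '   310      alice int Algo::run() { return 0; }',
])


def count_blame(blame_text, comment_prefixes=("//",)):
    """Return (blamed lines, blamed lines minus comments) per author."""
    blamed, code_only = Counter(), Counter()
    for line in blame_text.splitlines():
        match = BLAME_LINE.match(line)
        if match is None:
            continue
        author = match.group(2)
        content = (match.group(3) or "").strip()
        if not content:
            continue  # blank lines are excluded from both metrics
        blamed[author] += 1
        if not content.startswith(comment_prefixes):
            code_only[author] += 1
    return blamed, code_only


blamed, code_only = count_blame(SAMPLE_BLAME)
print(dict(blamed))
print(dict(code_only))
```

For Python files one would pass `comment_prefixes=("#",)` instead, which is why the prefix is a parameter rather than a fixed rule.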
(3) #commits
Inclusions: Everything
Exclusions: tags
Pros:
  Measures historical contribution to the codebase
  Eliminates “tagging” activity
  Eliminates pure svn-cp commits
Cons:
  Not all commits are equal in contribution
  Also sensitive to whitespace changes
  No subdivision for massive commits
  No accounting for comments
  Can mis-attribute to wrong project
(4) #files
Inclusions: Everything
Exclusions: tags
Pros:
  Measures historical contribution to the codebase
  Eliminates “tagging” activity
  Eliminates pure svn-cp commits
  Ranks different commits
  Includes documentation changes
Cons:
  No value judgement on the files themselves
  Also sensitive to whitespace changes
  No subdivision for massive commits
  No accounting for comments
  Can mis-attribute to wrong project
(5) #diffs
Inclusions: Everything
Exclusions: tags; whitespace diffs
Pros:
  Measures historical contribution to the codebase
  Eliminates “tagging” activity
  Eliminates pure svn-cp commits
  Ranks different commits
  Includes documentation changes
  Eliminates whitespace changes
Cons:
  No value judgement on changes that were made
  No accounting for comments
  Can mis-attribute to wrong project
  Takes a long time to poll svn
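A sketch of counting changed lines in a unified diff while ignoring whitespace-only changes. The sample diff and the helper `changed_lines` are invented for the example. Halving the count, as in the “diffs/2” columns, is presumably because a modified line shows up once as a removal and once as an addition, but the slides do not spell that out. Asking svn itself to ignore whitespace (`svn diff -x -w`) would be an alternative to the normalisation done here.

```python
from collections import Counter

# Hypothetical unified diff: one real change and one re-indented line.
SAMPLE_DIFF = "\n".join([
    "--- Algo.cpp\t(revision 101)",
    "+++ Algo.cpp\t(revision 102)",
    "@@ -1,4 +1,4 @@",
    " int run() {",
    "-  int x=1;",
    "+  int x = 2;",
    "-  return x;",
    "+    return x;",
    " }",
])


def changed_lines(diff_text):
    """Count added plus removed lines, ignoring whitespace-only differences."""
    removed, added = Counter(), Counter()
    for line in diff_text.splitlines():
        if line.startswith(("---", "+++", "@@")):
            continue  # file headers and hunk markers (crude: see note below)
        if not line.startswith(("-", "+")):
            continue  # context line
        key = "".join(line[1:].split())  # drop all whitespace
        if not key:
            continue  # blank line added or removed
        (removed if line[0] == "-" else added)[key] += 1
    # A line removed and re-added with only whitespace changed cancels out.
    common = removed & added
    return sum((removed - common).values()) + sum((added - common).values())


total = changed_lines(SAMPLE_DIFF)
print(total, total / 2)
```

The header test is deliberately crude: a removed line whose own text starts with `--` would be mistaken for a file header, so a more careful parser would track hunk boundaries instead.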
Correlations
[Plot] GPA against files (x: files; y: GPA)
f(x) = 0.00112048936312839 x + 4.86136456006535, R² = 0.607656019875388
[Plot] GPA against commits (x: commits; y: GPA)
f(x) = 0.00711953374680521 x + 4.3052659797783, R² = 0.617984658577826
[Plot] GPA against blames - comment (x: blames - comment; y: GPA)
f(x) = 0.00031312750049335 x + 4.24244084616698, R² = 0.539661711051237
[Plot] GPA against diffs (x: diffs / 2; y: GPA)
f(x) = 1.59313991347065E-05 x + 4.76027043649553, R² = 0.521285505848452