chemometrics the newsletter

the Chemometrics (special interest group) Newsletter, Jan 2011, Issue 1

chemometrics (’ke•mo’me•triks) n. 1. (as sing.) Scientific discipline within analytical chemistry, mathematical or statistical methods as applied to chemical data. [formal def., e.g., see International Chemometrics Society (ICS)] 2. (gen., collective term) any procedure (or collection of procedures) that helps chemists perform well-designed experiments, better understand data and rapidly interpret data with higher confidence. 3. (more gen., unattrib.) Something all chemists [should - sic?] do. 4. (inform., humour?) what chemometricians do [oft. attrib’d: Svante Wold, ca. 1975]; linked with def. (informal, humour?) Chemometricians: people who drink beer and steal ideas from statisticians.




Welcome and Introductions

Welcome to the first edition of the new Chemometrics Newsletter.

The RSC Chemometrics Special Interest Group has been around for a while. During its lifetime it has grown steadily and is now an international body with nearly 200 members.

However, there was no newsletter and this is one of the first things we felt we should do something about.

We are currently planning to issue the newsletter three or four times a year as a means of keeping us all up-to-date with the activities of the group and its members. We also intend to use the newsletter as a means of disseminating information about new techniques, meetings and other work that we think might be of interest to you. If you want to get more involved with either the RSC at large or the Special Interest Group in particular, why not start by signing up to MyRSC (http://my.rsc.org)? The Group’s own pages, currently under review, are also at http://www.rsc.org/chemometrics.

We hope the newsletter is of interest, and welcome any feedback. We would be delighted to receive any suggestions, ideas or submissions for future editions of the newsletter.

mailto:[email protected] January 2011

Newsletter Miscellaneous

The Chemometrics Newsletter is produced by the Royal Society of Chemistry (RSC) Chemometrics Special Interest Group (SIG) Committee and freely distributed to members of the Special Interest Group.

Newsletter Editor: Karl Ropkins, University of Leeds. mailto:[email protected]

RSC Chemometrics SIG: http://www.rsc.org/chemometrics

Contact RSC: Royal Society of Chemistry, Thomas Graham House, Science Park, Milton Road, Cambridge CB4 0WF. Tel +44(0)1223 420066 (switchboard)

Contents and submissions: We actively encourage readers to submit articles to the newsletter.

In line with the newsletter style, submitted articles should be concise, should give an overview of an area of likely interest to those associated with chemometrics and, wherever possible, should include links to and references for sources of further information for those interested in finding out more.

This newsletter does not accept or publish advertisements, but articles, any suggestions for future articles and any feedback regarding the newsletter or related matters should be emailed to the editor.

Note: The RSC is a Registered Charity (Number: 207980).

Contents

Welcome and Introductions
Newsletter Miscellaneous
Something All Chemists Do...
Lessons Learnt: Charles Joseph Minard
Resources
Signature Source Assignment
Puzzle

Royal Society Of Chemistry Chemometrics Special Interest Group


Something All Chemists Do...

Maybe saying this in a Chemometrics Newsletter is just preaching to the converted, but it perhaps needs to be said: Chemometrics is something all chemists do.

Whether it is something as mundane as calculating confidence limits on a routine instrument calibration, as challenging as balancing the environmental and commercial costs of modern production lines or as rewarding as interpreting novel data collected at the forefronts of analytical science or international concerns, we all use statistics. Some of us love it, some accept it as a necessity, a tool to be used much like a mass spectrometer or oscilloscope, and some, to be honest, would rather it were not the case.

Now, before you do anything else, spare a thought for this last group. After all, we are all chemists. We all take pride in our science. It is easy to see how needing some test - perhaps something that you do not really see the relevance of or even fully understand the point of - to validate all your own hard work could sometimes seem like an admission of failure. It is also easy to appreciate how daunting other scientific disciplines can be for new or occasional users, and, more importantly, how easy it is to become disillusioned after being misled by an impressive but ill-posed statistic.

On-going developments in monitoring technologies, instrumental automation and computer processing power all mean that larger and more diverse datasets are becoming ever more commonplace, and statistical techniques that help us make sense of it all are rapidly becoming not just more useful but more often essential. So, now, possibly more so than ever before, it is important that we make good chemometrics highly accessible to all our colleagues.

So, what are we doing to promote chemometrics?

International Year of Chemistry

2011 is the International Year of Chemistry (IYC 2011).

It is a great opportunity for us to highlight the many positive contributions that chemistry has made, and there are numerous events in many countries that you may be interested in getting involved with. If you want to find out more about IYC 2011, find out what is going on near you or even organise an event of your own, check out http://www.chemistry2011.org/ for more details.

IYC 2011 also means it is a particularly good time for all of us to be engaging with both the larger scientific community and the general public, talking about the role of chemometrics within chemistry and statistics within science, and, most importantly, demonstrating the value of the work we do.

Events

We are currently organising a conference that is intended to provide attendees from the larger research community with an overview of Chemometrics and some of its applications.

This meeting is intended to be the first in a programme of events that we are currently in the process of coordinating. Where possible we are aiming to host events jointly with other RSC groups and external parties. So, these events will not just be a chance for us all to meet up, they will also be a means of networking with others that you might not otherwise meet.

Details of these events will be posted on MyRSC (http://my.rsc.org/home), the Chemometrics Special Interest Group webpages (http://www.rsc.org/chemometrics) and in the newsletter, and we strongly encourage you to get involved wherever possible. Likewise, if you have any ideas for meeting themes or new audiences you would like to engage with, please get in contact.


Engaging Young Researchers

The RSC Education Division is involved in numerous initiatives, and works to promote chemistry to students in school, throughout further education and even beyond into continued professional development (CPD). They provide support to students and teachers alike, and run a broad range of events that have enabled hundreds of students to gain a better appreciation of the chemical sciences.

http://www.rsc.org/Membership/Networking/InterestGroups/EducationDivision/

ChemNet is a very successful component of the Education Division’s activities. It is a specialist network that provides 16-18 year old chemistry students with a forum to engage with other young scientists and learn more about their career options.

http://www.rsc.org/Membership/Networking/ChemNet/

ChemNet is just one of many initiatives run by the Education Division. There are also lecture series intended to stimulate students and teachers alike, support programmes for chemistry teachers looking to develop their skills and knowledge, and even science clubs for primary school children.

www.rsc.org/teachers
http://www.rsc.org/Education/SchoolStudents/SchoolEvents.asp
http://www.rsc.org/Membership/Networking/InterestGroups/Analytical/SchoolsComp/

Also see: http://www.rsc.org/Chemsoc/Activities/IYC/GetInvolved/

However, all of these activities need a huge amount of support. So, if you are a budding ChemNet Ambassador, if you think you can help teachers to bring their classes alive or keep up-to-date, or if you feel that you have that very special balance of expertise, enthusiasm and dedication that can make chemometrics fun for the next generation of researchers, why not get in contact with the Education Division?

mailto:[email protected]

Showcasing Good Chemometrics.

Through the events, newsletter and discussion we will be identifying what we feel are the very best examples of good chemometric practices.

For example, this newsletter includes articles on Charles Joseph Minard and his inspirational work on data visualisation, and on the application of a probabilistic approach to signature source assignment.

The intention is to provide a series of concise articles that will give the casual reader an overview of a topic but also to provide links and references for those interested in learning more. We hope you will become actively involved in this process, letting us know about your work, the work of others that interests you and, perhaps most importantly, the areas that you feel need further development or more careful consideration.

Working with Industry

In the current financial climate it is particularly important that we all make best use of the data we collect. Be it experimental design, routine data handling and processing or more detailed analysis and interpretation, there is always room for improvement and often lessons to be learnt for the other sectors. So, through networking, training and partnership agreements, we are looking at ways of increasing academic, consultancy and industrial collaborations that will allow us all to work more effectively.

So, if you are in an industrial sector where you think chemometrics could improve productivity, or provide better measures of performance, if you have a chemometric technique with potential industrial applications that you would like to showcase, or if you have a case study that demonstrates the added value gained from a strong collaboration, why not get in contact with the RSC Industry Technology Forum Executive (ITFEX)?

http://www.rsc.org/AboutUs/Governance/BoardsandCommittees/SEIB/ITFEX/


Getting More Involved.

Finally, one of the best ways to promote chemometrics is to get more involved with the RSC through the Chemometrics Special Interest Group.

If you have not already checked out MyRSC, why not have a look now?

MyRSC is the online professional networking tool hosted by the Royal Society of Chemistry (UK) and it can be used by individuals at all stages of their careers in the chemical sciences to share information and news.

http://my.rsc.org

We at the Chemometrics Special Interest Group would also love to hear from you. If nothing else, it is nice to know a bit about the folks reading the newsletter. So, if you have any feedback, suggestions for future articles or, better yet, would like to contribute something yourself, please do get in contact.

mailto:[email protected]

Lessons Learnt: Charles Joseph Minard

We all accept the value of graphics. Most of us use data visualisation as both the first and last step of data analysis and interpretation. When presenting findings to others we often return time and again to that old cliché “A picture is worth a thousand words”. When reflecting on the work of others, it is more often the well thought-out diagrams and figures, rather than their formulas or words, that we first draw on.

So, it is no surprise that numerous texts are dedicated to data visualisation.

Tufte’s ‘The Visual Display of Quantitative Information’ is undoubtedly one of the most beautifully illustrated and thought-provoking works on the subject. Likewise, Wilkinson’s second edition of ‘The Grammar of Graphics’ is a truly great guide to sound working practices. There are many other good examples, but few are in complete agreement because, fundamentally, good graphics, like good art, is in part a question of personal taste. Therefore, examples that are widely identified as exceptional, especially those that have stood the test of time, are particularly noteworthy. Several names, such as William Playfair and Florence Nightingale, are often cited by modern-day authors, but few would deny that one man, Charles Joseph Minard, was one of the true greats of the field. It is therefore worth taking a little time to consider the man, his work and what lessons we can learn from his example.

Minard

Charles Joseph Minard (1781-1870) was born in Dijon, France. He trained as a scientist, mathematician and civil engineer and worked for much of his early career on various dam, canal and bridge projects which took him throughout Europe. In later life he became first a Superintendent at the École Nationale des Ponts et Chaussées (School of Bridges and Roads) and then Inspector in the Corps des Ponts (Corps of Bridges), before retiring in 1851, after which he dedicated himself to private research.

For more biographic details, see, e.g. the Wikipedia article on Minard or Michael Friendly’s more comprehensive timeline and biography:

http://en.wikipedia.org/wiki/Charles_Joseph_Minard
http://www.math.yorku.ca/SCS/Gallery/minard/minchron.pdf
http://www.math.yorku.ca/SCS/Gallery/minard/biography.pdf

During his working life, Minard pioneered the use of graphics in engineering and statistics. Even today, many of his figures and diagrams are still widely considered to be some of the finest examples of good statistical graphics.


Napoleon’s March on Moscow

Perhaps the most famous example of his work is “Carte figurative des pertes successives en hommes de l’Armée Française dans la campagne de Russie 1812-1813”, which is replicated in Figure 1. It depicts Napoleon’s infamous Russian campaign of 1812.

Figure 1: A Translation of Minard’s Depiction of Napoleon’s March on Moscow
Source: http://www.napoleonic-literature.com/1812/1812.htm

The figure provides six types of information. The beige and black lines show the outward and returning paths of the army, thereby giving direction. The paths themselves provide location, with major features and locations marked to add extra geographical information. The insert below the main figure shows the temperatures faced by the retreating soldiers.

However, most telling of all, the width of the troop movement lines is proportional to the size of Napoleon’s army, one millimeter equalling 10,000 men in Minard’s original. Napoleon set off with 422,000 men, but by the time he reached Moscow, death and desertion had reduced numbers to about 100,000. Three quarters of his forces had been lost before stepping onto the battlefield. Then, in retreat, the survivors fell victim to temperatures as low as −37.5 °C, and nearly half of those that reached the Berezina River were lost during the crossing alone. As a result, only 10,000 returned alive.
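The scale and losses quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, and the troop numbers are simply those quoted in the text.

```python
MEN_PER_MM = 10_000  # scale quoted for Minard's original: 1 mm = 10,000 men

def line_width_mm(men):
    """Width of the troop movement line, in millimetres, for a given army size."""
    return men / MEN_PER_MM

# Army sizes quoted in the text
at_start = 422_000
at_moscow = 100_000
returned = 10_000

width_start = line_width_mm(at_start)       # widest part of the outward path
width_moscow = line_width_mm(at_moscow)     # width on arrival at Moscow
width_return = line_width_mm(returned)      # width at the end of the retreat

# Fraction of the original force lost before reaching Moscow
lost_before_moscow = 1 - at_moscow / at_start
```

On these figures the line narrows from roughly 42 mm to 1 mm, and the fraction lost before Moscow is a little over three quarters.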

Étienne-Jules Marey said that this figure “defies the pen of the historian in its brutal eloquence”. Edward Tufte said very simply that this “may well be the best statistical graphic ever drawn.”

Napoleon’s March Revisited

Many modern authors have revisited his data sources and reworked his diagrams using modern methods. For example, Michael Friendly presents a number of excellent alternative depictions of Carte figurative des pertes successives en hommes de l’Armée Française dans la campagne de Russie 1812-1813 in his papers “Re-Visions of Minard” and “Visions and Re-Visions of Charles Joseph Minard”. Many of these works add depth to the original but few out-class it. In Friendly’s own words, “This work influenced several generations of statisticians and cartographers and still has deep lessons from which we may learn...”

Menno-Jan Kraak’s 3D Revision of Napoleon’s March
(Here the extra dimension provides a timescale)


Minard’s Many Other Works

This example is only one of a large number of his works, and each in its own way is a concise testimony to Minard’s talent. A number of his other works can be seen at, e.g.:

http://www.math.yorku.ca/SCS/Gallery/minbib/index.htm
http://cartographia.wordpress.com/category/charles-joseph-minard/

In its day Minard’s work was considered so iconic amongst policy makers that between approximately 1850 and 1860 all French Ministers of Public Works used one of his figures or charts as the backdrop for their portraits.

Minard’s Influence

As evidence of the strength of Minard’s methods, here is a recent work that could have been drawn by the man himself.

It also just happens to be one of the best examples of data visualisation in its field.

Source: https://www.llnl.gov/str/Energy.html

References

Friendly, M. Re-Visions of Minard. Statistical Computing and Graphics Newsletter, 1999, 11(1).

Friendly, M. Visions and Re-Visions of Charles Joseph Minard. Journal of Educational and Behavioral Statistics, 2002, 27(1), 31-51.

Tufte, E.R. ‘The Visual Display of Quantitative Information.’ Graphics Press, Cheshire, Connecticut, US, 1983.

Wilkinson, L. ‘The Grammar of Graphics. Second Edition.’ Springer, New York, US, 2005.

Some Final Thoughts

So, what fundamental messages can we take away from Minard’s work?

• Firstly, know your message. Identify the fundamental message within your data set, and then check and double check that it is correct. Then make this the centerpiece when you design your figure. An image with a clear and accurate message at its heart will always have an intrinsic value, and, just as surely, a misrepresentation, either deliberate or accidental, will undermine any work.

• Use graphic representation imaginatively. There are multiple ways to present data and just as many ways to convey information. Within any plot or figure, some combinations will be more accessible than others. Try to identify those that work best for your particular data set and analysis, and use these to convey the key message or messages that you are trying to present to the viewer as quickly as possible. If there are many levels of information within your figures, wherever possible try to make the most fundamental the most obvious.

• Recognise the trade-offs that you make. In making any figure you make a series of decisions: what data to include and what to exclude, which included data to emphasize and which to downplay. Minard was famous for his belief in the “tyranny of precise geographical position”. He always favoured clear representation of the data over absolute geographical accuracy and would sometimes slightly modify a coastline or region in order to clarify a message. Identifying the fundamental messages and the peripheral information in a figure helps you make sensible trade-offs like this; just be fair to the viewer and be clear where the trade-offs have been made. For example, in such cases Minard labelled his maps as “cartes figuratives et approximatives.”

Most important of all, learn from the lessons of others. Look at the figures and diagrams that catch your attention and try to identify their strengths.


Resources

There are numerous resources available to help us in our work, and many of these are free to use. In coming issues we will highlight some examples that might be of interest to you.

In this, the first issue of the newsletter, we look at some of the on-line guidance and software associated with the statistical analysis of chemical data on the Royal Society of Chemistry’s own website:

• AMC Technical Papers. The Analytical Methods Committee (AMC) of the RSC Analytical Division provides technical notes, background papers and recommendation statements on different aspects of the statistical analysis of chemical data. These documents are drafted by expert subcommittees and then independently reviewed prior to publication. Subject matters include best practice guidance on both common issues, such as outliers, uncertainty and detection limits, and more specialist topics, such as nanoparticle characterisation, asbestos-containing materials and X-ray fluorescence analysis:

http://www.rsc.org/Membership/Networking/InterestGroups/Analytical/AMC/TechnicalBriefs.asp

• AMC Software. The AMC also provides a range of freely distributed software on an “as is” basis, including a range of Minitab and MS Excel add-ins:

http://www.rsc.org/Membership/Networking/InterestGroups/Analytical/AMC/Software/index.asp

• Chemical Resources. A range of on-line teaching resources and support material is available via the RSC website, specially tailored for in-school and at-home use:

http://www.rsc.org/Education/Teachers/Resources/index.asp
http://www.rsc.org/Education/Teachers/Resources/OnlineResourcesHome.asp

• RSC Library, Virtual Library and ebooks. The RSC provides a range of on-line library resources to its members. The Library and Information Centre is quite possibly Europe’s foremost chemical knowledge source, and includes the Virtual Library collection of ebooks, journals and databases, which can be accessed by RSC members from any internet-connected PC anywhere in the world.

http://www.rsc.org/Library/index.asp
http://www.rsc.org/Library/LICMember/index.asp
http://www.rsc.org/publishing/ebooks/

• ChemSpider. ChemSpider is one of the most recent additions to the RSC website. It is a free-to-access on-line collection of chemical compound data, centralised and structured to provide both text and structure searching facilities. ChemSpider is quite possibly the largest single source of structure-based chemistry information that is readily accessible to the whole research community.

http://www.rsc.org/ChemSpider/
http://www.chemspider.com/


Signature Source Assignment

We often have to quantify the likelihood of a given sample coming from a given source. The most obvious applications are evidentiary, e.g., oil spill, explosive or drug sample source assignment, but this type of question is common to almost all research areas.

Base Case: One Parameter, One Source

In the simplest cases, a measurement is made of the investigated sample and this is compared to analogous measurements made on a population of reference samples collected for the suspected source. This measurement can be anything that is considered potentially diagnostic of the source, e.g. the concentration of a chemical constituent, the value of a physical property or a composite measurement such as a ratio of two or more such measurements. The likelihood that a sample is derived from a given source is then estimated as the likelihood of the measurement obtained for it being part of the population of measurements obtained for a set of samples known to be from that source. We can estimate this using a kernel density function.

Kernel Density Function

$$f(x, h) = \frac{1}{nh} \sum_{i=1}^{n} \phi\left(\frac{x - x_i}{h}\right)$$

Where f is the estimated density, n is the number of binning intervals over which the kernel function is fitted to the data set X = {x1 to xn}, and h and φ are the smoothing term and standard normal density function, respectively.

See, e.g., Silverman (1986), Sheather & Jones (1991) or AMC (2006) for discussions of the selection of optimal smoothing parameters.
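The kernel density function can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the method used for the article: n is taken here as the number of reference measurements, the smoothing term h is set with the simple normal-reference rule of thumb (1.06 × standard deviation × n^(-1/5)) instead of an optimised value, and the function names and reference values are hypothetical.

```python
import math
import statistics

def phi(u):
    """Standard normal density function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def rule_of_thumb_h(xs):
    """Assumed smoothing term: normal-reference rule, 1.06 * s * n^(-1/5)."""
    return 1.06 * statistics.stdev(xs) * len(xs) ** (-0.2)

def kde(x, xs, h):
    """f(x, h) = 1/(n h) * sum over i of phi((x - x_i) / h)."""
    n = len(xs)
    return sum(phi((x - xi) / h) for xi in xs) / (n * h)

# Hypothetical reference measurements for one source
reference = [10.2, 11.5, 12.1, 12.8, 13.0, 13.4, 14.9, 15.3]
h = rule_of_thumb_h(reference)

density_near = kde(13.0, reference, h)  # within the bulk of the reference data
density_far = kde(30.0, reference, h)   # far from every reference value
```

A measurement close to the reference population returns a much higher density than one far outside it, which is the property the source assignment relies on.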

False Positives

In the real world, a single measurement is rarely an unambiguous diagnostic.

Firstly, parameter measurements often vary significantly within a given population. For example, the ratios of key flavour markers in wines vary hugely with climate and weather conditions during the grape growing seasons and artisan practices during wine production.

Secondly, there is often significant overlap between the single parameter measurement ranges of discrete source populations. For example, the trace element composition of two pieces of chinaware produced by different manufacturers can often be highly similar, especially if they source raw materials from nearby or geographically similar regions.

Therefore, we typically measure and compare multiple parameters to minimise the chances of reporting a false positive.

With multiple parameters, the direct comparison of ranges is often a sufficient means of ruling out most false positives. For example, the standard source assignment method used in the Heroin Signature Program (HSP) compares the ratios of selected alkaloids in seized drug samples with those of previously seized and source region assigned samples using just this approach (UN, 1998). However, if two or more sources have significant parameter overlap, this cannot provide any measure of the relative likelihoods. Similarly, if one of the parameter measures is at or just outside the range of the reference set, this ‘boundary’ position cannot be factored into the source assignment using such a crude (non-probabilistic) approach.


Extending the Base Case: Multiple Parameters, Multiple Sources

Bayes’ theorem provides a practical framework to extend the kernel density function to multiple source/multiple parameter measurement questions.

For a relatively simple case, a single parameter measurement (p1) being derived from one of two sources (nominally, A of A and B), we assign the probability (Pr) as:

$$\Pr(A \mid p_1) = \frac{\Pr(p_1 \mid A)\,\Pr(A)}{\Pr(p_1 \mid A)\,\Pr(A) + \Pr(p_1 \mid B)\,\Pr(B)}$$

Where Pr(pn|S) is the probability estimate for the nth parameter measurement, p, as a member of source population S, and Pr(S) is an additional measure of source likelihood.

Depending on available information, Pr(S) can be the proportion of source S samples in the data set or a modification of this based on any prior or external evidence regarding relative proportions of samples from different sources.

The compound likelihood of two discrete parameter measurements, p1 and p2, deriving from source A is then:

$$\Pr(A \mid p_1, p_2) = \frac{\Pr(p_2 \mid A)\,\Pr(A \mid p_1)}{\Pr(p_2 \mid A)\,\Pr(A \mid p_1) + \Pr(p_2 \mid B)\,\Pr(B \mid p_1)}$$

This approach can be iteratively scaled up for use with any (larger) number of sources and/or parameter measurements to provide an estimate of the relative probabilities of different sources.
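The iterative scheme can be sketched as a single update step applied once per parameter, with each posterior becoming the prior for the next measurement. The function names and the numerical values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def update(prior, likelihood):
    """One Bayes step: posterior[S] is proportional to likelihood[S] * prior[S]."""
    weighted = {s: likelihood[s] * prior[s] for s in prior}
    total = sum(weighted.values())
    return {s: w / total for s, w in weighted.items()}

def assign(prior, likelihoods):
    """Apply update() once per parameter measurement, in order."""
    posterior = dict(prior)
    for likelihood in likelihoods:
        posterior = update(posterior, likelihood)
    return posterior

# Hypothetical values for two sources (A, B) and two parameters (p1, p2)
prior = {"A": 0.5, "B": 0.5}    # Pr(S)
pr_p1 = {"A": 0.10, "B": 0.05}  # Pr(p1|S)
pr_p2 = {"A": 0.40, "B": 0.10}  # Pr(p2|S)

after_p1 = update(prior, pr_p1)           # Pr(S|p1)
after_p2 = assign(prior, [pr_p1, pr_p2])  # Pr(S|p1, p2)
```

Because the same `update` function works for any number of dictionary keys, adding further sources or parameters only means extending the inputs.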

The Null Hypothesis

One thing we should always consider is the possibility that the investigated sample does not come from any of the sources represented by the supplied reference samples.

The above framework will provide an estimate of the relative probabilities of all supplied cases. It does not consider others. Nor does it provide an estimate of the actual likelihood of the most likely of the considered cases being the actual source! Therefore, we should always add one additional ‘source’ to our analysis, e.g., for the above A and B source set:

$$\Pr(\mathrm{NULL} \mid p_1) = \frac{\Pr(p_1 \mid \mathrm{NULL})\,\Pr(\mathrm{NULL})}{\Pr(p_1 \mid \mathrm{NULL})\,\Pr(\mathrm{NULL}) + \Pr(p_1 \mid A)\,\Pr(A) + \Pr(p_1 \mid B)\,\Pr(B)}$$

Where Pr(p1|NULL) is the nominal probability threshold, e.g. 1/n from the kernel density function, and Pr(NULL) is the prior probability, often set low for an established data set, e.g. below the lowest Pr(S).
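The effect of the extra NULL ‘source’ is easiest to see with a sample that fits none of the references well. In the sketch below all numbers are hypothetical: the likelihoods describe a sample that is a poor match to both A and B, the NULL likelihood is taken as 1/512, and the NULL prior is set to 0.004.

```python
def posterior(prior, likelihood):
    """Bayes step: posterior[S] is proportional to likelihood[S] * prior[S]."""
    weighted = {s: likelihood[s] * prior[s] for s in prior}
    total = sum(weighted.values())
    return {s: w / total for s, w in weighted.items()}

# Hypothetical sample that matches neither reference source well
likelihood = {"A": 1e-6, "B": 1e-8}

# Without NULL the framework can only rank A against B
without_null = posterior({"A": 0.5, "B": 0.5}, likelihood)

# With NULL: threshold likelihood 1/n (n = 512 assumed) and a low prior
likelihood_null = dict(likelihood, NULL=1 / 512)
prior_null = {"A": 0.498, "B": 0.498, "NULL": 0.004}
with_null = posterior(prior_null, likelihood_null)
```

Without NULL, source A receives almost all of the probability simply because it is the less bad of two poor matches; with NULL included, most of the probability moves to NULL, flagging that neither reference source is a credible origin.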

Further Information

Further discussion of Bayesian theory and its application to chemometric data is presented in the comprehensive two-part review by Armstrong & Hibbert (2009) and Hibbert & Armstrong (2009).

Hibbert et al (2010) also describe the application of such a probabilistic approach to the source region assignment of seized heroin samples.


An Example

A simulated data set of four parameters (p1 to p4) measured for samples taken from four sources (A to D) is presented below.

source  parameter  count  min     median  max
A       p1         101    10.004  15.890  19.928
A       p2         101     0.035   1.016   1.998
A       p3         101     0.102  11.621  19.984
A       p4         101    10.008  12.556  14.884
B       p1         100    10.021  12.546  14.915
B       p2         100     0.005   4.000   7.944
B       p3         100     0.036  10.859  19.986
B       p4         100     3.090   9.859  17.947
C       p1         100     0.001   2.489   4.909
C       p2         100    10.021  13.518  17.886
C       p3         100     5.152   7.811   9.981
C       p4         100     5.055   6.762   8.997
D       p1         100     0.223   8.496  19.973
D       p2         100    10.049  11.189  11.958
D       p3         100     5.072   7.408   9.995
D       p4         100     7.260  24.284  36.988

Summarised statistics (count, min, median and max) for simulated data set of parameter measurements (p1 to p4) from four source populations, A to D.

Plot of simulated data set of parameter measurements (p1 to p4) from four source populations, A to D.

Note, a jitter has been applied to the y-axis of the above plot to highlight the degree of over-plotting. This is one of several methods used to aid the visualisation of densely plotted data.
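Jittering simply adds a small random offset to each plotted position so that coincident points separate visually. A minimal sketch follows, assuming uniform noise; the function name, the offset size and the example values are hypothetical and do not reproduce the settings used for the plot.

```python
import random

def jitter(values, amount, seed=None):
    """Return a copy of values with uniform noise in [-amount, +amount] added."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

# Hypothetical y positions: many points sharing the same two levels
y = [1.0] * 5 + [2.0] * 5
y_jittered = jitter(y, amount=0.2, seed=1)
```

The offset should stay small relative to the spacing between levels so that points remain clearly associated with their original position.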

An additional sample of unknown origin, x (p1 = 14.91, p2 = 1.425, p3 = 15.02, p4 = 12.49), can be compared with this data set using standard ‘within-range’ testing.

          A      B      C      D
p1        TRUE   TRUE   FALSE  TRUE
p2        TRUE   TRUE   FALSE  FALSE
p3        TRUE   TRUE   FALSE  FALSE
p4        TRUE   TRUE   FALSE  TRUE
ALL TRUE  TRUE   TRUE   FALSE  FALSE

Standard TRUE/FALSE ‘within-range’ signature source assignment

This analysis indicates that x is highly unlikely to come from either source C or D, but could come from either source A or B.
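The ‘within-range’ test is straightforward to reproduce from the summary statistics. The sketch below uses the min and max values from the summary table and the measurements for sample x; the variable and function names are hypothetical.

```python
# (min, max) per source and parameter, copied from the summary statistics
ranges = {
    "A": {"p1": (10.004, 19.928), "p2": (0.035, 1.998),
          "p3": (0.102, 19.984), "p4": (10.008, 14.884)},
    "B": {"p1": (10.021, 14.915), "p2": (0.005, 7.944),
          "p3": (0.036, 19.986), "p4": (3.090, 17.947)},
    "C": {"p1": (0.001, 4.909), "p2": (10.021, 17.886),
          "p3": (5.152, 9.981), "p4": (5.055, 8.997)},
    "D": {"p1": (0.223, 19.973), "p2": (10.049, 11.958),
          "p3": (5.072, 9.995), "p4": (7.260, 36.988)},
}

# Sample of unknown origin
x = {"p1": 14.91, "p2": 1.425, "p3": 15.02, "p4": 12.49}

def within_range(sample, source_ranges):
    """TRUE/FALSE per parameter: is the measurement inside [min, max]?"""
    return {p: lo <= sample[p] <= hi for p, (lo, hi) in source_ranges.items()}

table = {s: within_range(x, r) for s, r in ranges.items()}
all_true = {s: all(flags.values()) for s, flags in table.items()}
```

Note that p1 for sample x (14.91) sits only just inside the range for source B (maximum 14.915), which is exactly the kind of ‘boundary’ position that a TRUE/FALSE test cannot weigh.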


However, using the kernel density function and the above framework we can extend this analysis to provide an estimate of relative likelihoods.

Pr(pn|S) is estimated for each parameter|source combination using the kernel density function (Gaussian kernel, n = 512).

      A       B       C       D       NULL*
p1    0.1040  0.0918  0.0003  0.0460  0.0020
p2    0.4244  0.0971  0.0003  0.0003  0.0020
p3    0.0588  0.0483  0.0004  0.0004  0.0020
p4    0.1903  0.0635  0.0004  0.0274  0.0020

Individual probabilities Pr(parameter|source)
* 1/512

Pr(S) is measured on a proportional basis, e.g.:

$$\Pr(A) = \frac{\mathrm{count}(A)}{\mathrm{count}(A) + \mathrm{count}(B) + \mathrm{count}(C) + \mathrm{count}(D)}$$

= 4011601= 0.2509

...and for all sources:

        A       B       C       D       NULL*
Pr(S)   0.2509  0.2484  0.2484  0.2484  0.0040

Reference sample set assigned source distribution
* Pr(NULL) user assumption
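The tabulated priors can be reproduced from the reference sample counts under one assumption that the text does not state explicitly: Pr(NULL) = 0.0040 is reserved first and the remaining 0.9960 is shared between sources A to D in proportion to their counts. The sketch below is an inference from the tabulated numbers, not a description of the author's calculation, and the variable names are hypothetical.

```python
# Reference sample counts per source, from the summary statistics
counts = {"A": 101, "B": 100, "C": 100, "D": 100}
pr_null = 0.004  # user-assumed prior for the NULL 'source'

# Assumption: share the remaining probability in proportion to the counts
total = sum(counts.values())
priors = {s: (1 - pr_null) * c / total for s, c in counts.items()}
priors["NULL"] = pr_null

rounded = {s: round(p, 4) for s, p in priors.items()}
```

Scaling in this way keeps the full set of priors, including NULL, summing to one.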

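Pr(S|pn) is also measured on a proportional basis, e.g.:

$$\Pr(A \mid p_1) = \frac{\Pr(p_1 \mid A)\,\Pr(A)}{\Pr(p_1 \mid A)\,\Pr(A) + \Pr(p_1 \mid B)\,\Pr(B) + \Pr(p_1 \mid C)\,\Pr(C) + \Pr(p_1 \mid D)\,\Pr(D) + \Pr(p_1 \mid \mathrm{NULL})\,\Pr(\mathrm{NULL})}$$

$$\Pr(A \mid p_1) = \frac{0.1040 \times 0.2509}{0.1040 \times 0.2509 + 0.0918 \times 0.2484 + 0.0003 \times 0.2484 + 0.0460 \times 0.2484 + 0.0020 \times 0.0040} = 0.4319$$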

...and for all parameters and (by iteration) all sources:

                   A       B       C       D       NULL
Pr(S|p1)           0.4319  0.3771  0.0014  0.1893  0.0003
Pr(S|p1,p2)        0.8333  0.1664  0.0000  0.0003  0.0000
Pr(S|p1,p2,p3)     0.8589  0.1411  0.0000  0.0000  0.0000
Pr(S|p1,p2,p3,p4)  0.9510  0.0490  0.0000  0.0000  0.0000

Probabilistic signature source assignment

Here, we would conclude that source A was the most likely source for sample x. However, as this probability (95.1%) is above but very close to the 95% confidence level, we could consider measuring additional parameters to further test the robustness of this conclusion.
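The worked example can be retraced by feeding the tabulated Pr(S) and Pr(parameter|source) values through the iterative update. This is a sketch only: the variable names are hypothetical, and because the inputs are the four-decimal values printed in the tables rather than the underlying densities, agreement with the tabulated posteriors is approximate.

```python
sources = ["A", "B", "C", "D", "NULL"]

# Pr(S), as tabulated
prior = dict(zip(sources, [0.2509, 0.2484, 0.2484, 0.2484, 0.0040]))

# Pr(parameter|source), as tabulated (rounded to four decimal places)
likelihoods = {
    "p1": [0.1040, 0.0918, 0.0003, 0.0460, 0.0020],
    "p2": [0.4244, 0.0971, 0.0003, 0.0003, 0.0020],
    "p3": [0.0588, 0.0483, 0.0004, 0.0004, 0.0020],
    "p4": [0.1903, 0.0635, 0.0004, 0.0274, 0.0020],
}

posterior = dict(prior)
history = {}  # posterior after each parameter has been included
for name, values in likelihoods.items():
    weighted = {s: v * posterior[s] for s, v in zip(sources, values)}
    total = sum(weighted.values())
    posterior = {s: w / total for s, w in weighted.items()}
    history[name] = posterior

most_likely = max(posterior, key=posterior.get)
```

Source A comes out as the most likely source at every step. Since the final probability lies so close to 95%, rounding of the inputs alone can shift it slightly, which reinforces the suggestion to measure additional parameters before drawing a firm conclusion.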


References

AMC (Analytical Methods Committee) (2006) ‘Representing Data Distributions with Kernel Density Estimates.’ RSC AMC Technical Brief. Royal Society of Chemistry, Cambridge, UK.

Armstrong N, Hibbert DB (2009) An introduction to Bayesian methods for analyzing chemistry data. Part 1: an introduction to Bayesian theory and methods. Chemometr Intell Lab Syst 97, 194-210.

Hibbert DB, Armstrong N (2009) An introduction to Bayesian methods for analyzing chemistry data. Part II: a review of applications of Bayesian methods in chemistry. Chemometr Intell Lab Syst 97, 211-220.

Hibbert DB, Blackmore D, Li J, Ebrahimi D, Collins M, Vujic S, Gavoyannis P (2010) A probabilistic approach to heroin signatures. Anal Bioanal Chem 396, 765-773.

Sheather SJ, Jones MC (1991) A reliable data-based bandwidth selection method for kernel density estimation. J Roy Stat Soc, B, 683-690.

Silverman BW (1986) ‘Density Estimation.’ London: Chapman and Hall.

UN (United Nations) (1998) Recommended methods for testing opium, morphine and heroin. In: ‘U.N.I.D.C. Programme’, United Nations, New York.

Puzzle

ACROSS 3. Spanish; 7. English; 10. Latvian; 12. Croatian; 13. Estonian; 14. Norwegian.

(clue - the answer to 7 ACROSS is chemometrics)
(bigger clue - all the answers are chemometrics)

DOWN 1. Swedish; 2. Hungarian; 3. Portuguese; 4. Slovenian; 5. Hebrew; 6. Bulgarian; 7. Greek; 8. French; 9. Czech; 11. Sudanese.
