transitive closure, inbreeding coefficients, pedigrees, on line collection and on line presentation

University of Ballarat

Bachelor of Computing Honours thesis

Transitive Closure, Inbreeding Coefficients, Pedigrees, on line Collection and on line Presentation

Author: Charles E. Esson

Supervisor: Gregory L. Simmons

November 3, 2009

I, Charles Esson, declare that this thesis titled, ‘Transitive Closure, Inbreeding Coefficients, Pedigrees, on line Collection and on line Presentation’ and the work presented in it are my own. I confirm that this work has not been submitted for the award of any other degree or diploma in any tertiary institution.

Signed:

Date:


Abstract

An on-line system was developed to collect and display pedigree data, test the developed algorithms and present the results. The system uses a transitive closure, minimum path length, maximum path length and an adjacency list to deliver inbreeding coefficients, pedigree trees, descendant trees and enforcement of pedigree constraints in real time. Real time methods were developed to display an animal’s pedigree and descendants, to calculate the inbreeding coefficient, to enforce the constraints and to incrementally maintain the maximum path length between nodes, along with improved routines to incrementally insert and delete edges from a transitive closure.

Key words: Pedigree, transitive closure, SQL view maintenance, inbreeding coefficient

Acknowledgements

Thank you:

Trish Esson: For the large data set used for testing and for putting up with me.

Andrew James: For the second data set.

Greg Simmons: For supervision, encouragement and assistance.

Sudanthi Wijewickrema: For proof reading.

List of Figures

2.1 Heritability
2.2 Example pedigree
3.1 Restoration of deleted paths
3.2 Three head to tail joins of trusted paths required
3.3 Shortest and longest path
3.4 Node codes
5.1 Animal record structure
6.1 Partially opened Merrrit pedigree display for animal 4461
6.2 Displaying descendants
A.1 Application data input and output
B.1 Intertask communication
B.2 Field select drag and drop
C.1 Public and logged on browsers
C.2 Entry screen with screen areas highlighted
C.3 Public screen: herd data
C.4 Herd edit screen
C.5 Herd setup screen
C.6 Animal screen, display only
C.7 Animal detail screen
C.8 Example animal add file
C.9 Three steps to upload a file
C.10 File download screen
C.11 Example animal download file
D.1 Projecting a 3D point onto a 2D plane

List of Tables

2.1 Hardy-Weinberg law
2.2 Arrangement of animals from longest path to shortest
2.3 A matrix for example pedigree given in figure 2.2 (Mrode, 1996)
3.1 Edges table for pedigree given in figure 2.2
3.2 Transitive closure of example pedigree from figure 2.2

Contents

1 Introduction
  1.1 Previous work
  1.2 Methodology
  1.3 Thesis structure

2 Genetics
  2.1 Early history
  2.2 Two alleles
  2.3 Quantitative genetics
  2.4 Genotype and phenotype
  2.5 Heritability
    2.5.1 Children phenotype - Parent phenotype
    2.5.2 Breeding value - Parent phenotype
    2.5.3 Additive genetic effect
    2.5.4 Correlation
  2.6 Ordinary least squares
  2.7 Generalized least squares
  2.8 Best linear unbiased prediction
  2.9 Predicting animal breeding values
  2.10 Within herd estimated breeding values
    2.10.1 No members in a group
  2.11 Relationship matrix
    2.11.1 Calculating the relationship matrix
  2.12 Calculating the inbreeding coefficient
  2.13 Incrementally maintaining $A^{-1}$
  2.14 Conclusion

3 Graphs and transitive closures
  3.1 Directed acyclic graph
  3.2 Transitive closure
  3.3 Only one transitive closure needs to be maintained
  3.4 Incremental maintenance of a transitive closure
    3.4.1 Insertion of an edge into a transitive closure
    3.4.2 Deletion of an edge from a transitive closure
    3.4.3 Three self joins of “TRUSTY” to recover all paths
    3.4.4 Fragments
    3.4.5 The shortest and longest path
    3.4.6 Building a transitive closure from a depth first search
    3.4.7 Alternative to using a transitive closure
    3.4.8 Building the transitive closure one herd at a time
  3.5 Conclusion

4 Incremental maintenance of a transitive closure using SQL
  4.1 PostgreSQL transitive closure edge insert
    4.1.1 PostgreSQL version 7.1 code
  4.2 PostgreSQL transitive closure edge unlink
  4.3 Conclusion

5 Using the transitive closure, the SQL code
  5.1 Introduction to database table structure
  5.2 Valid birth date
  5.3 Can’t change the sex of an animal with descendants
  5.4 Calculation of the inbreeding coefficient
  5.5 Descendant records
  5.6 Ancestor records
  5.7 Conclusion

6 Displaying pedigrees and descendants
  6.1 Ancestors
  6.2 Descendants
  6.3 Conclusion

7 Thesis conclusion and future research
  7.1 Thesis conclusion
  7.2 Future research
    7.2.1 Can node codes be maintained incrementally?
    7.2.2 Did the application meet its underlying design goal?
    7.2.3 Are production gains increased?
    7.2.4 Will this work affect Lambplan?

References

Appendices

A Code Structure
  A.1 Input values and return values
  A.2 Error codes
  A.3 The table
    A.3.1 base
    A.3.2 db string
    A.3.3 dbt string
    A.3.4 strings
    A.3.5 data convert
    A.3.6 type
    A.3.7 logic
    A.3.8 heading string
    A.3.9 location
  A.4 Data entry and display
    A.4.1 input to db
    A.4.2 test db change
    A.4.3 get data
    A.4.4 output from db

B Ajax
  B.1 File upload progress
  B.2 Yahoo user interface; selecting fields to display

C The Application
  C.1 Overview
  C.2 Login and logoff
  C.3 The herd screens
    C.3.1 Herd edit screen
    C.3.2 Herd setup screen
  C.4 Animal screens
    C.4.1 Public animal setup screen
    C.4.2 Animal setup screen
  C.5 Detail screen
    C.5.1 Detail setup screen
  C.6 Upload screens
    C.6.1 Upload file structure
    C.6.2 Upload sequence
  C.7 Download files

D Linear Regression
  D.1 Calculus
  D.2 Statistics
  D.3 Normal equation

Chapter 1

Introduction


In 2005 the Australian cashmere industry received funding from the Rural Industries Research and Development Corporation (RIRDC) for a sire referencing scheme. Further funding was granted to develop a method to generate estimated breeding values (EBV) across the entire industry. This project resulted in a batch solution developed by Andrew T. James (James, 2009). The industry now feels the EBV project is at a stage where it is appropriate to move to on line collection and presentation of pedigree and phenotype data, direct calculation of some results and display of additional results from the batch process. An ideal system would present inbreeding coefficients and within herd estimated breeding values without running the batch process.

This research effort aims to develop methods that are fast enough to meet the ideal design goals. An online system has been constructed and the developed methods implemented to make sure performance is acceptable.

The key insights documented are:

- If we incrementally maintain a transitive closure of the pedigree along with the maximum path length between nodes, we can quickly obtain a list of all the ancestors and sort them in the order they enter the pedigree.

- The work of Pang, Dong, & Ramamohanarao (2005) can be extended to maintain the maximum path length between nodes.

- An animal’s inbreeding coefficient is only affected by its ancestors. Recent ancestors have a greater effect on the inbreeding coefficient, and therefore the calculation speed can be increased by limiting the minimum path length between nodes to a maximum value.

- When an animal is added, the inverse of the additive genetic relationship matrix between animals ($A^{-1}$) can be maintained incrementally; the entries affected are limited to a subset of all entries, and these entries can be determined using a list of descendants and their ancestors.

- The performance of the SQL incremental edge deletion routine can be improved by creating a table of segments that may be useful in recreating excess edges deleted when suspect (see section 3.4.1) is removed from the old transitive closure. The segments that may be useful are those that start and end on a path that has been deleted.

- If you are maintaining a transitive closure and a minimum path length, you do not have to maintain an edge table.
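As a rough illustration of the first and last insights, the sketch below keeps a transitive closure in memory as a Python dictionary and updates the minimum and maximum path lengths when an edge is inserted. It is not the SQL developed in this thesis: the names `insert_edge` and `ancestors_in_entry_order`, the dictionary layout and the four-animal pedigree are assumptions made for the example, and "the order they enter the pedigree" is taken to mean decreasing maximum path length.

```python
def insert_edge(closure, parent, child):
    """Insert the edge parent -> child into a transitive closure.

    closure maps (ancestor, descendant) to (min_path_len, max_path_len).
    """
    if parent == child or (child, parent) in closure:
        raise ValueError("edge would create a cycle in the pedigree")
    # Every path that ends at parent, plus the zero-length path at parent.
    heads = [(a, lo, hi) for (a, d), (lo, hi) in closure.items() if d == parent]
    heads.append((parent, 0, 0))
    # Every path that starts at child, plus the zero-length path at child.
    tails = [(d, lo, hi) for (a, d), (lo, hi) in closure.items() if a == child]
    tails.append((child, 0, 0))
    # Join each head to each tail through the new edge (length 1).
    for a, lo_a, hi_a in heads:
        for d, lo_d, hi_d in tails:
            new_lo, new_hi = lo_a + 1 + lo_d, hi_a + 1 + hi_d
            old_lo, old_hi = closure.get((a, d), (new_lo, new_hi))
            closure[(a, d)] = (min(old_lo, new_lo), max(old_hi, new_hi))
    return closure


def ancestors_in_entry_order(closure, animal):
    """Ancestors of animal, longest maximum path (oldest) first."""
    rows = [(hi, a) for (a, d), (lo, hi) in closure.items() if d == animal]
    return [a for hi, a in sorted(rows, reverse=True)]


# Hypothetical pedigree: S and D are the parents of C; C and D are the parents of G.
pedigree = {}
for parent, child in [("S", "C"), ("D", "C"), ("C", "G"), ("D", "G")]:
    insert_edge(pedigree, parent, child)

print(pedigree[("D", "G")])                     # (min, max) path length from D to G
print(ancestors_in_entry_order(pedigree, "G"))  # oldest ancestors first
```

A row whose minimum path length is 1 is a direct parent and child pair, which is why no separate edge table is needed in this sketch.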


Testing showed that for the large industry wide herd, incrementally maintaining the pedigree transitive closure in real time was too slow. Methods were therefore developed to speed up the algorithms. The final solution to this problem was to maintain a transitive closure for each user’s herd, and to use a batch process to maintain an industry wide transitive closure. Methods to do this and to decide which data structure to use for display were developed.

1.1 Previous work

The mathematics and methods for calculating inbreeding coefficients and estimated breeding values as a batch process across the entire data set are well developed and solutions have been implemented. Efficiently querying large pedigrees has been looked at by Elliott, Akgul, Ozsoyoglu, & Manilich (2006). Their work used “node codes”, which provide a complete description of the genetic flow from one animal to another for all animals in the herd, and aid the querying of the pedigree structure. However, a depth first search was used to create the codes, which is a time consuming calculation, making the proposed solution unsuitable for incremental calculation.

1.2 Methodology

The design science paradigm seeks to extend the boundaries of human and organizational capabilities by creating new and innovative artifacts. It addresses research through the building and evaluation of artifacts to meet identified needs (Hevner, March, Park, & Ram, 2004). The industry’s identified needs for the artifact were:

- On line collection of pedigree and phenotype data.

- On line presentation of any results that can be generated in real time.

- Presentation of results calculated off line.

Presenting the research as design science, however, underplays the achievements. Methods to calculate the inbreeding coefficients in real time and methods to incrementally maintain $A^{-1}$ were important results, as were improvements to the incremental transitive closure algorithms and methods to incrementally maintain the maximum path length between nodes. As well as meeting the industry needs, the developed program was used to test the success or failure of the proposed solutions and algorithm improvements.


1.3 Thesis structure

To understand Merrrit and how the design goals were reached, several concepts need to be understood and in some cases extended. Some chapters introduce new material, others look at implementation details.

Chapter 2 looks at the genetics and mathematics behind the calculation of inbreeding coefficients and estimated breeding values. It is noted that to calculate the inbreeding coefficient, you need the animal’s ancestors, in the order they entered the pedigree. It is also noted that a list of descendants and their ancestors can be used to incrementally update $A^{-1}$.

Chapter 3 considers incremental maintenance of a transitive closure. A method to maintain the maximum path length is introduced; this is needed to return a list of ancestors in the order needed to calculate the inbreeding coefficient (using the method discussed in chapter 2). Maintaining the minimum path length removes the need to maintain the edge table, as has been done in previous work in this area. Methods to reduce the execution time of incremental edge deletion are also introduced.

Chapter 4 takes the results from chapter 3 and presents the SQL code to implement the edge insertion and deletion solutions.

Chapter 5 looks at how the transitive closure is used by the application to perform the functions the application needs to perform. The SQL code used to maintain the integrity of the pedigree, and to obtain the list of animals required to calculate the inbreeding coefficient, is also introduced.

Chapter 6 discusses the on line display of pedigrees and descendant trees.

Chapter 7 presents the conclusion and further research.

Appendix A looks at the code structure of the developed application, appendix B at AJAX and how it is used in the application, and appendix C documents the application. Appendix D provides an introduction to linear regression; the results are used in chapter 2.


Chapter 2

Genetics


The industry is interested in identifying animals that have the potential to increase the economic returns quickly. Calculating accurate estimated breeding values (EBV) and encouraging their use across the cashmere industry is the underlying aim of this and previous projects. To identify the best animal across multiple herds, we need to use mathematics that removes herd specific effects. The mathematics to do this was developed in the 1960s and 1970s (Henderson, 1975). The techniques developed require an industry wide pedigree and matings across herd boundaries (the sire reference scheme).

When developing breeding strategies, within herd EBVs are useful to the growers. EBVs that take into account the performance of ancestors and descendants are better at estimating an animal’s genetic merit than EBVs based on the animal’s phenotype records alone.

Inbreeding coefficients are an indication of how closely individuals are related and are used by breeders to control their inbreeding and out-breeding programs.

This chapter takes a quick look at the history of genetics and the basic ideas behind quantitative genetics, and then tries to give some insight into heritability. Sections 2.6, 2.7 and 2.8 build towards understanding best linear unbiased prediction (BLUP), a method that uses phenotype records from related animals to better predict an animal’s breeding value. Ordinary least squares assumes there is no relationship between the residual errors; generalized least squares assumes the events are independent but there is a known relationship between the residuals; and best linear unbiased prediction assumes there is a known relationship between events. When predicting breeding values an event is an animal, and the relationship between events is controlled by the genes passed from one generation to the next.

2.1 Early history

Quantitative genetics developed in the early 20th century and resulted from the melding of Mendelian genetics, rediscovered in 1900, and biometrics, a branch of genetics founded by Francis Galton (1869, 1889), a science to which the foundation of most modern statistics can be traced. Galton focused on continuously varying characteristics (those that can’t be easily separated into separate classes). However, the melding of the two ideas did not occur without argument. The death of W.F.R. Weldon in 1906 (one of the key protagonists), along with the publication of key experiments, resulted in the rapid emergence of the multiple factor hypothesis (Lynch & Walsh, 1997). To understand the multiple factor hypothesis we need to understand the factors.


2.2 Two alleles

An allele is one of the alternative DNA sequences at the same physical gene locus. A diploid organism (many plants are not diploids) inherits one sequence from one parent and another sequence from the other. A diploid organism carries two alleles, but only passes one of the inherited alleles to its offspring. In the second simplest case, there are only two different alleles in the population, with these being expressed as three possible combinations.

Genotype     AA       Aa       aa
Frequency    $p^2$    $2pq$    $q^2$

Table 2.1: Hardy-Weinberg law

If the gene frequency of allele type A is $p$ and the frequency of type a is $q$, then the frequencies of the various combinations are as given in table 2.1. To see this, forget about how the alleles are carried by the organism and consider the probability of selecting AA, aa and Aa from a large population (this is the Hardy-Weinberg law restated).

The probability of selecting your first A is $p$, the probability of selecting the second A is also $p$, and therefore the probability of selecting two As in a row is $p^2$. There are two ways to select Aa (if the order doesn’t matter), the probability of either is $pq$, so the probability of an Aa outcome is $2pq$. Similarly, the probability of picking two a alleles in a row is $q^2$.

2.3 Quantitative genetics

If there are four alleles for a gene location in the population with frequencies $p_1$, $p_2$, $p_3$ and $p_4$, the frequencies of the various combinations are given by:

\[ (p_1+p_2+p_3+p_4)^2 = p_1^2+p_2^2+p_3^2+p_4^2+2p_1p_2+2p_1p_3+2p_1p_4+2p_2p_3+2p_2p_4+2p_3p_4 \tag{2.1} \]

If there are $i$ alleles the frequencies of the combinations are given by:

\[ (p_1 + \cdots + p_i)^2 \tag{2.2} \]

If there are two gene locations, the frequencies of the combinations are given by:

\[ \left((p_1 + \cdots + p_i)^2 + (q_1 + \cdots + q_j)^2\right)^2 \tag{2.3} \]

Formula 2.3 can easily be extended to $k$ gene locations. As the number of different alleles increases the number of possible outcomes explodes, well separated classes disappear and the desire to think in terms of alleles fades.
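The single-locus expansions in table 2.1 and equations 2.1 and 2.2 can be checked numerically. The following is a minimal sketch, assuming made-up allele names and frequencies; `genotype_frequencies` is a hypothetical helper written for this example.

```python
from itertools import combinations_with_replacement


def genotype_frequencies(allele_freqs):
    """Expand (p1 + ... + pi)^2 into unordered genotype frequencies."""
    freqs = {}
    alleles = sorted(allele_freqs.items())
    for (a, pa), (b, pb) in combinations_with_replacement(alleles, 2):
        # A homozygote can be drawn one way, a heterozygote two ways.
        freqs[a + b] = pa * pb if a == b else 2 * pa * pb
    return freqs


# Two alleles (table 2.1) with assumed frequencies p = 0.7 and q = 0.3.
two = genotype_frequencies({"A": 0.7, "a": 0.3})
print(two)  # frequencies for AA, Aa and aa

# Four alleles (equation 2.1) with assumed frequencies.
four = genotype_frequencies({"A1": 0.4, "A2": 0.3, "A3": 0.2, "A4": 0.1})
print(len(four), sum(four.values()))  # number of genotypes and their total
```

With four alleles there are already ten distinct genotypes, which illustrates how quickly the number of classes grows.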


A trait that doesn’t have well separated classes is called a quantitative trait. Quantitative traits tend to be normally distributed. To observe this phenomenon, let us simplify the situation and pretend that we have a trait that is affected by many gene locations. Let us further assume that the probability of getting an allele that improves the outcome is $p$ and is equal for all locations, and that the incremental improvements add up. This describes a Bernoulli trial (Rozanov, 1964). Each animal is a separate event, and the herd is the trial.

If the number of gene locations is $n$ and the number of selections that contributed positively to the outcome is $k$, the probability for a particular sequence is:

\[ P(\omega) = p^k(1-p)^{n-k} \tag{2.4} \]

The number of ways we can select the $k$ positive outcomes is given by:

\[ C_k^n = \binom{n}{k} = \frac{n!}{k!(n-k)!} \tag{2.5} \]

The probability distribution for this simplified case is given by:

\[ P_\xi(k) = C_k^n\, p^k(1-p)^{n-k} \tag{2.6} \]

This is the binomial distribution and when $n$ is large, it can be approximated by a normal distribution.
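A quick numerical comparison shows how close the approximation is. This is a sketch only: the values $n = 200$ and $p = 0.5$ are assumptions chosen for illustration, and the normal curve uses the binomial mean $np$ and standard deviation $\sqrt{np(1-p)}$.

```python
from math import comb, exp, pi, sqrt


def binomial_pmf(k, n, p):
    """Equation 2.6: probability of k positive outcomes from n locations."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * sqrt(2 * pi))


n, p = 200, 0.5                       # assumed number of loci and probability
mean, sd = n * p, sqrt(n * p * (1 - p))

# Largest pointwise difference between the two curves over all k.
max_gap = max(
    abs(binomial_pmf(k, n, p) - normal_pdf(k, mean, sd)) for k in range(n + 1)
)
print(max_gap)
```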

Life is a normal distribution; it should however be noted that a normal distribution is not assumed when deriving the equations for ordinary least squares (OLS), generalized least squares (GLS), or best linear unbiased prediction (BLUP).

2.4 Genotype and phenotype

A phenotype can be divided into genotype and environment effects. The genotype is the portion of the phenotype that occurred because of the genes, that is, the portion that is related to previous animals. If we consider an animal as an event, the genes connect events. The environment effects are the portion that varies because of the animal’s environment (nature or nurture).

\[ P\,(\text{phenotype}) = G\,(\text{genotype}) + E\,(\text{environment}) \tag{2.7} \]


As well as additive effects, the genotype includes dominance effects and epistatic effects. These can’t be measured, so they are taken out of the genotype and, along with the measurement errors, lumped together as a residual to get:

\[ P\,(\text{phenotype}) = A\,(\text{additive genetic effects}) + E\,(\text{environment}) + e \tag{2.8} \]

An equation to predict an animal’s phenotype could be written:

\[ z_i\,(\text{phenotype predicted}) = E\,(\text{environment}) + h^2 x_i\,(\text{additive genetic effects}) + e_i \tag{2.9} \]

[Figure 2.1: $b_{O,P} = \frac{1}{2}h^2$. Axes: parent phenotype, offspring phenotype.]

The formula $h^2 x_i = A$ uses the animal’s phenotype record to predict its breeding value. $x_i$ is the deviation from the herd mean phenotype, and $e_i$ is the difference between the predicted breeding value and the actual breeding value.

Best linear unbiased prediction (discussed in section 2.8) uses the relatives’ phenotype records to produce a predicted breeding value that is closer to the real breeding value.

2.5 Heritability

Geneticists talk about broad heritability ($H^2$) and narrow heritability ($h^2$). We can’t measure broad heritability so in a practical sense it is of no interest. There are several definitions of narrow heritability and investigating how they are related helps us understand heritability and breeding values. We will start the exploration with an experiment that can be performed.

2.5.1 Children phenotype - Parent phenotype

Given a herd with known average phenotype ($\mu_{herd}$), and the mating of several sires of known phenotype ($P_{sire}$), each with a different random group of does from the herd, the average phenotype of the offspring ($P_{progeny}$) is expected to be:

\[ P_{progeny} = b_{O,P}\,(P_{sire} - \mu_{parent.herd}) + \mu_{progeny} \tag{2.10} \]


where $b_{O,P}$ is the gradient of the least squares line of best fit (the regression coefficient) of the offspring’s phenotype on the parents’ phenotype. If:

\[ b_{O,P} \equiv \tfrac{1}{2}h^2 \tag{2.11} \]

then:

\[ P_{progeny} = \tfrac{1}{2}h^2\,(P_{sire} - \mu_{parent.herd}) + \mu_{progeny} \tag{2.12} \]

The symbol $h^2$ is used to represent heritability, and 1/2 appears in the above formula because the progeny only gets 1/2 of its genes from the sire.

The regression coefficient can be calculated using linear regression, or as a ratio of the covariance over the variance (see appendix D).

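\[ b_{O,P} = \frac{\sigma_{P_{offspring},P_{parent}}}{\sigma^2_{P_{parent}}} \tag{2.13} \]

Or using a different syntax:

\[ b_{O,P} = \frac{\sigma_{P_{offspring},P_{parent}}}{\sigma_{P_{parent}}\,\sigma_{P_{parent}}} \tag{2.14} \]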

2.5.2 Breeding value - Parent phenotype

Now that we have a definition of heritability, we can get some insight into breeding values. The regression coefficient of an animal’s breeding value on its phenotype is also equal to heritability. Recalling that a regression coefficient is the ratio of the covariance over the variance, we get:

\[ b_{A,P} = \frac{\sigma_{A,P}}{\sigma_P^2} = h^2 \tag{2.15} \]

Taking this and the previous definition leaves us with an animal’s breeding value being twice the influence it is likely to have on its progeny. The influence is halved because one parent only supplies half the genes to an offspring.

2.5.3 Additive genetic effect

Heritability is also defined as the fraction of the phenotype variance ($\sigma_P^2$) that is due to the additive genetic effect. That is:

\[ h^2 = \frac{\sigma_A^2}{\sigma_P^2} \tag{2.16} \]

To prove this we need to use equation 2.15 and show:

\[ \sigma_{A,P} = \sigma_A^2 \tag{2.17} \]


The phenotype measurement is the sum of the inherited attributes and the environment (see equation 2.7).

\[ P = A + E \tag{2.18} \]

\[ \sigma_{[(x+y),(w+z)]} = \sigma_{(x,w)} + \sigma_{(y,w)} + \sigma_{(x,z)} + \sigma_{(y,z)} \tag{2.19} \]

(Lynch & Walsh, 1997, equation 3.10g)

\[ \sigma_{A,P} = \sigma_{[A,(A+E)]} = \sigma_A^2 + \sigma_{A,E} \tag{2.20} \]

If the environment and genetic random variables are orthogonal, $\sigma_{A,E}$ will be zero and:

\[ \sigma_{A,P} = \sigma_A^2 \tag{2.21} \]

The equation $h^2 = \sigma_A^2 / \sigma_P^2$ is important as the variance of the additive genetic effect ($\sigma_A^2$) is used in the BLUP equations and needs to be related back to heritability (see section 2.9).

2.5.4 Correlation

The correlation between an animal’s adjusted phenotype and its breeding value is equal to the square root of the heritability. This leads to some interesting formula manipulation but provides little insight.

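\begin{align*}
\sqrt{h^2} &= \sqrt{\frac{\sigma_A^2}{\sigma_P^2}} && \text{(see equation 2.16)} \\
           &= \frac{\sqrt{\sigma_A^2}}{\sqrt{\sigma_P^2}} \\
h          &= \frac{\sigma_A}{\sigma_P} \\
r_{PA}     &= \frac{\sigma_A^2}{\sigma_P\sigma_A} = \frac{\sigma_A\sigma_A}{\sigma_P\sigma_A} = \frac{\sigma_A}{\sigma_P} = h
\end{align*}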
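The three views of heritability (equations 2.15 and 2.16 and the correlation above) can be checked on a small data set. The sketch below assumes a toy herd of four animals whose additive values `A` and environment effects `E` are chosen to be exactly uncorrelated; the numbers are invented for illustration and population (divide by $n$) moments are used.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def cov(x, y):
    """Population covariance; cov(x, x) is the population variance."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


# Assumed toy data: additive genetic values and environment effects
# constructed so that cov(A, E) is exactly zero.
A = [-1.0, -1.0, 1.0, 1.0]
E = [-2.0, 2.0, -2.0, 2.0]
P = [a + e for a, e in zip(A, E)]         # equation 2.18: P = A + E

var_A, var_P = cov(A, A), cov(P, P)
h2 = var_A / var_P                         # equation 2.16
b_AP = cov(A, P) / var_P                   # equation 2.15
r_PA = cov(A, P) / sqrt(var_P * var_A)     # correlation of P and A

print(cov(A, E), cov(A, P), var_A)         # orthogonality and equation 2.21
print(h2, b_AP, r_PA, sqrt(h2))
```

Because `cov(A, E)` is zero here, `cov(A, P)` equals `var_A`, the regression coefficient equals $h^2$, and the correlation equals $h$.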


2.6 Ordinary least squares

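Ordinary least squares (OLS) looks for a vector $\boldsymbol{\beta}$ to linearly combine measured estimators to provide an estimate, that is:

\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \quad \text{(the actual } \mathbf{y} \text{ values)} \tag{2.22} \]

\[ \hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta} \quad \text{(the estimated } \mathbf{y} \text{ values)} \tag{2.23} \]

where $\mathbf{X}$ is a matrix of the measured independent values for all animals, $\mathbf{y}$ is a vector of the measured dependent values for all animals and $\hat{\mathbf{y}}$ is a vector of estimated dependent values for all animals.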

As an example in this application space, ordinary least squares could be used to create a formula to predict down-weight based on measured down length, measured down diameter, sex and herd.

Ordinary least squares projects a hyperspace where the number of dimensions is equal to the number of samples, onto a space where the number of dimensions equals the number of independent variables used to describe the data. The differences between the samples and the predictions generated using the independent variables are the residuals. The residuals are orthogonal to the projection. The derivation of the normal equations is given in appendix D.3. Using matrix notation, the normal equation is written as:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \tag{2.24} \]

$\hat{\boldsymbol{\beta}}$ gets a hat because the result is the predicted $\boldsymbol{\beta}$ that results from the data set $\mathbf{y}$ and $\mathbf{X}$, not the actual $\boldsymbol{\beta}$. The predicted value is:

\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} \tag{2.25} \]

Ordinary least squares is discussed further in appendix D.
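The normal equation 2.24 can be evaluated directly for a small problem. The following is a minimal sketch in plain Python, assuming an invented data set with an intercept column and one measured variable; `transpose`, `matmul`, `solve` and `ols` are hypothetical helpers written for this example, and `solve` does not guard against a singular matrix.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(X, y):
    """Equation 2.24, solved as (X^T X) beta = X^T y."""
    Xt = transpose(X)
    beta = solve(matmul(Xt, X), matmul(Xt, [[v] for v in y]))
    return [row[0] for row in beta]


# Assumed data: a column of ones (intercept) and one measured variable.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]

beta = ols(X, y)
fitted = [sum(x * b for x, b in zip(row, beta)) for row in X]   # equation 2.25
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# The residuals are orthogonal to every column of X.
print(beta)
print([sum(col[i] * residuals[i] for i in range(len(y))) for col in transpose(X)])
```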

2.7 Generalized least squares

If the residuals ($\boldsymbol{\varepsilon}$) are normally distributed with some relationship, the linear proportion of the relationship between the residual variances can be expressed with the equation:

\[ \mathrm{var}(\boldsymbol{\varepsilon}) = \mathbf{R}\sigma^2 \tag{2.26} \]

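The linear proportion of the relationship between the residual variances can be removed by performing a transform on equation 2.22, and $\boldsymbol{\beta}$ can then be found using ordinary least squares. Lynch & Walsh (1997) transformed the equation with the inverse of the square root of $\mathbf{R}$, where $\mathbf{R}^{1/2}\mathbf{R}^{1/2} = \mathbf{R}$. Faraway (2002) used the Cholesky decomposition, where $\mathbf{S}\mathbf{S}^T = \mathbf{R}$, with equation 2.22 being transformed using $\mathbf{S}^{-1}$:

\[ \mathbf{S}^{-1}\mathbf{y} = \mathbf{S}^{-1}\mathbf{X}\boldsymbol{\beta} + \mathbf{S}^{-1}\boldsymbol{\varepsilon} \tag{2.27} \]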

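The transform works if the residual variances no longer have a linear relationship.

\[ \mathrm{var}(\boldsymbol{\varepsilon}) = \mathbf{R}\sigma^2 = \mathbf{S}\mathbf{S}^T\sigma^2 = \mathbf{S}\sigma^2\mathbf{S}^T \tag{2.28} \]

\begin{align*}
\mathrm{var}(\mathbf{S}^{-1}\boldsymbol{\varepsilon}) &= \mathbf{S}^{-1}\,\mathrm{var}(\boldsymbol{\varepsilon})\,(\mathbf{S}^{-1})^T \\
 &= \mathbf{S}^{-1}\mathbf{S}\sigma^2\mathbf{S}^T(\mathbf{S}^{-1})^T \\
 &= \mathbf{I}\sigma^2\mathbf{I} \\
 &= \mathbf{I}\sigma^2
\end{align*}

Replacing $\mathbf{X}$ and $\mathbf{y}$ in equation 2.24 with the transformed values gives the generalized least squares (GLS) solution:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{R}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{R}^{-1}\mathbf{y} \tag{2.29} \]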
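The equivalence between the transform of equation 2.27 and the closed form of equation 2.29 can be checked numerically. This sketch assumes a small invented $\mathbf{R}$, $\mathbf{X}$ and $\mathbf{y}$; all function names are hypothetical helpers for the example, written in plain Python, and `solve` does not guard against a singular matrix.

```python
from math import sqrt


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def cholesky(r):
    """Lower triangular S with S @ S^T = r (r symmetric positive definite)."""
    n = len(r)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = r[i][j] - sum(s[i][k] * s[j][k] for k in range(j))
            s[i][j] = sqrt(acc) if i == j else acc / s[j][j]
    return s


def gls(X, y, R):
    """Equation 2.29, using solve(R, .) in place of an explicit inverse."""
    Xt = transpose(X)
    beta = solve(matmul(Xt, solve(R, X)), matmul(Xt, solve(R, [[v] for v in y])))
    return [row[0] for row in beta]


def gls_by_whitening(X, y, R):
    """Equation 2.27: transform with S^-1, then ordinary least squares."""
    S = cholesky(R)
    Xw, yw = solve(S, X), solve(S, [[v] for v in y])
    Xwt = transpose(Xw)
    beta = solve(matmul(Xwt, Xw), matmul(Xwt, yw))
    return [row[0] for row in beta]


# Assumed example: three related observations, intercept plus one variable.
R = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.5, 2.9]

print(gls(X, y, R))
print(gls_by_whitening(X, y, R))
```

Both routes give the same coefficients up to rounding, which is the point of the transform.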

2.8 Best linear unbiased prediction

To summarize previous sections: the phenotype of an animal can be estimated using independent variables which can be measured; the herd, the sex and the age when a measurement is taken are examples. The independent measurements are referred to as independent variables, or estimators. The model is obtained by minimizing the differences between the measured dependent variable and the value estimated using the derived equation and the estimators. This is done using ordinary least squares if it is assumed the estimators are independent and have the same random error. If the estimators are not independent (they have some covariance between them) and the relationship is known, and/or the random errors are not the same and the difference in size is known, then generalized least squares can be used to improve the result.

An OLS or GLS regression answers the question: ‘given a set of independent variables what do you expect of the dependent variable?’. A best linear unbiased prediction (BLUP) answers the question: ‘given a series of events which are related and for which the independent variables have been measured, what do you expect the next event to be?’ (this is very similar to the questions asked in Bayesian statistics). If we are predicting the next event in a time series, it is obvious that the events are measurements of points in the series. When used with genetics an event is a generation.

The value predicted for the next event, based on the knowledge of a disturbance, or the knowledge of the relationship between the animals, is obtained by minimizing the residual variance. In other words we are assuming the new sample will not increase the variance. The effect that animal relationships have on the variance is known and is discussed in section 2.11.

The BLUP is easier to follow if it is separated from complex family relationships. We consider a signal affected by a known disturbance. The following derivation is based on Goldberger (1962). In the following derivation (as with the rest of the thesis), lowercase bold symbols are used to represent vectors, matrices are represented by upper case bold symbols and values are typeset normally.

The predicted value for the new event is:

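$$y = \mathbf{x}^T\boldsymbol{\beta} + \varepsilon \tag{2.30}$$

The aim is to find the correlation between $\boldsymbol{\varepsilon}$, the vector of errors in the previous measurements, and $\varepsilon$, the error in the new event. We are hoping that:

$$E(\varepsilon) = 0 \tag{2.31}$$

$$E(\boldsymbol{\varepsilon}\varepsilon) = \boldsymbol{\omega} \tag{2.32}$$

We need a vector of constants that projects $\mathbf{y}$ (the vector of all previous dependent measurements) to a predicted value. That is:

$$p = \mathbf{c}^T\mathbf{y} \tag{2.33}$$

Recall that the vector of y values can be written:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \tag{2.34}$$

And that:

$$E(\boldsymbol{\varepsilon}) = \mathbf{0} \tag{2.35}$$

And let:

$$E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T) = \mathbf{V} \tag{2.36}$$

We want a solution that minimizes:

$$\sigma_p^2 = (p - y)(p - y)^T \tag{2.37}$$

Subject to:

$$E(p - y) = 0 \tag{2.38}$$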


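Using 2.34, equation 2.33 can be written:

$$p = \mathbf{c}^T\mathbf{X}\boldsymbol{\beta} + \mathbf{c}^T\boldsymbol{\varepsilon} \tag{2.39}$$

Using 2.39 and 2.30, the difference $(p - y)$ can be rewritten:

$$p - y = (\mathbf{c}^T\mathbf{X} - \mathbf{x}^T)\boldsymbol{\beta} + \mathbf{c}^T\boldsymbol{\varepsilon} - \varepsilon \tag{2.40}$$

If the prediction is unbiased:

$$\mathbf{c}^T\mathbf{X} - \mathbf{x}^T = \mathbf{0} \tag{2.41}$$

And using 2.40 and setting $\mathbf{c}^T\mathbf{X} - \mathbf{x}^T$ to zero:

$$p - y = \mathbf{c}^T\boldsymbol{\varepsilon} - \varepsilon \tag{2.42}$$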

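And

$$\begin{aligned}
\sigma_p^2 &= (p - y)(p - y)^T \\
&= (\mathbf{c}^T\boldsymbol{\varepsilon} - \varepsilon)(\mathbf{c}^T\boldsymbol{\varepsilon} - \varepsilon)^T \\
&= \mathbf{c}^T\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T\mathbf{c} + \varepsilon^2 - 2\mathbf{c}^T\boldsymbol{\varepsilon}\varepsilon \\
&= \mathbf{c}^T\mathbf{V}\mathbf{c} + \varepsilon^2 - 2\mathbf{c}^T\boldsymbol{\omega}
\end{aligned} \tag{2.43}$$

To minimize 2.43 subject to 2.41 we find the minimum of a Lagrange function, say:

$$\begin{aligned}
\Lambda(\mathbf{c}, \boldsymbol{\lambda}) &= f(\mathbf{c}) - 2(g(\mathbf{c}) - c_0)\boldsymbol{\lambda} \\
g(\mathbf{c}) - c_0 &= \mathbf{c}^T\mathbf{X} - \mathbf{x}^T - \mathbf{0} \\
&= \mathbf{c}^T\mathbf{X} - \mathbf{x}^T \\
f(\mathbf{c}) &= \mathbf{c}^T\mathbf{V}\mathbf{c} + \varepsilon^2 - 2\mathbf{c}^T\boldsymbol{\omega} \\
\Lambda(\mathbf{c}, \boldsymbol{\lambda}) &= \mathbf{c}^T\mathbf{V}\mathbf{c} + \varepsilon^2 - 2\mathbf{c}^T\boldsymbol{\omega} - 2(\mathbf{c}^T\mathbf{X} - \mathbf{x}^T)\boldsymbol{\lambda}
\end{aligned} \tag{2.44}$$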

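Finding the partial derivatives of 2.44, setting them to zero and solving for $\mathbf{c}$ gives us:

$$\mathbf{c} = \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{x} + \mathbf{V}^{-1}\left(\mathbf{I} - \mathbf{X}(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{V}^{-1}\right)\boldsymbol{\omega} \tag{2.45}$$

Using equations 2.33 and 2.45:

$$\begin{aligned}
p &= \mathbf{c}^T\mathbf{y} \\
&= \mathbf{x}^T(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{V}^{-1}\mathbf{y} + \boldsymbol{\omega}^T\mathbf{V}^{-1}\mathbf{y} - \boldsymbol{\omega}^T\mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{V}^{-1}\mathbf{y}
\end{aligned} \tag{2.46}$$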


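Looking back to section 2.7 (generalized least squares) we see that:

$$\boldsymbol{\beta} = (\mathbf{X}^T\mathbf{R}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{R}^{-1}\mathbf{y} \tag{2.47}$$

If the Cholesky Decomposition of $\mathbf{V}$ can be found, we can write:

$$\boldsymbol{\beta} = (\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{V}^{-1}\mathbf{y} \tag{2.48}$$

$$\begin{aligned}
p &= \mathbf{x}^T\boldsymbol{\beta} + \boldsymbol{\omega}^T\mathbf{V}^{-1}\mathbf{y} - \boldsymbol{\omega}^T\mathbf{V}^{-1}\mathbf{X}\boldsymbol{\beta} \\
&= \mathbf{x}^T\boldsymbol{\beta} + \boldsymbol{\omega}^T\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})
\end{aligned} \tag{2.49}$$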

2.9 Predicting animal breeding values

The model of interest has the form:

y = Xβ + Za + e (2.50)

X is a design matrix for the fixed effects: that is, the effects that are not influenced by additive genetic effects. Z is the design matrix used to select breeding values (the animal). If there is only one record per animal, Z will be the identity matrix. The residuals left over when the random effects are removed are contained in the vector e.

If there are two fixed effects, say a doe can be on farm 1 or 2, the prediction equation for the phenotype of doe i is:

$$y_i = \beta_0 + \beta_1 x_{i0} + \beta_2 x_{i1} + a_i + e \tag{2.51}$$

If the animal is on farm 1 then $x_{i0}$ will be one and $x_{i1}$ will be zero.

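For simplicity, the variances and covariances of the animal model are defined as:

$$\begin{aligned}
\mathrm{var}(\mathbf{a}) &= \mathbf{G}\sigma^2 = \mathbf{A}k\sigma^2 = \mathbf{A}\sigma_a^2 \quad \text{where } \mathbf{A} \text{ is the relationship matrix} \\
\mathrm{var}(\mathbf{e}) &= \mathbf{R} = \mathbf{I}\sigma^2 \\
\mathrm{cov}(\mathbf{a}, \mathbf{e}) &= 0 \\
\mathrm{cov}(\mathbf{e}, \mathbf{a}) &= 0
\end{aligned} \tag{2.52}$$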


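G is a symmetric matrix because A is symmetric. From the definitions above, $\mathrm{var}(\boldsymbol{\varepsilon})$ can be calculated:

$$\begin{aligned}
\mathrm{var}(\boldsymbol{\varepsilon}) &= \mathbf{V} \\
&= \mathrm{var}(\mathbf{Z}\mathbf{a} + \mathbf{e}) \quad \text{(see 2.50)} \\
&= \mathbf{Z}\,\mathrm{var}(\mathbf{a})\,\mathbf{Z}^T + \mathrm{var}(\mathbf{e}) + \mathrm{cov}(\mathbf{Z}\mathbf{a}, \mathbf{e}) + \mathrm{cov}(\mathbf{e}, \mathbf{Z}\mathbf{a}) \\
&= \mathbf{Z}\,\mathrm{var}(\mathbf{a})\,\mathbf{Z}^T + \mathrm{var}(\mathbf{e}) + \mathbf{Z}\,\mathrm{cov}(\mathbf{a}, \mathbf{e}) + \mathrm{cov}(\mathbf{e}, \mathbf{a})\,\mathbf{Z}^T \\
&= (\mathbf{Z}\mathbf{G}\mathbf{Z}^T + \mathbf{R})\sigma^2
\end{aligned} \tag{2.53}$$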

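The covariance we are putting the effort into extracting is $\mathrm{cov}(\boldsymbol{\varepsilon}, \varepsilon)$, the covariance between the errors of the previous measurements and the error of the new event. Written as a row vector so that it can be substituted directly into 2.49:

$$\mathrm{cov}(\boldsymbol{\varepsilon}, \varepsilon)^T = \boldsymbol{\omega}^T = \mathbf{z}^T\mathbf{G}\mathbf{Z}^T \tag{2.54}$$

Replacing $\boldsymbol{\omega}$ and $\mathbf{V}$ in equation 2.49 gives:

$$p = \mathbf{x}^T\boldsymbol{\beta} + \mathbf{z}^T\mathbf{G}\mathbf{Z}^T(\mathbf{Z}\mathbf{G}\mathbf{Z}^T + \mathbf{R})^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \tag{2.55}$$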

This is the formula given by Henderson, 1975 as the best linear unbiased predictor of y given G and R.

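In 1950 Henderson provided a set of equations to simultaneously find a and β. These are now referred to as the mixed model equations.

$$\begin{bmatrix}
\mathbf{X}^T\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}^T\mathbf{R}^{-1}\mathbf{Z} \\
\mathbf{Z}^T\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}^T\mathbf{R}^{-1}\mathbf{Z} + \mathbf{G}^{-1}
\end{bmatrix}
\begin{bmatrix} \boldsymbol{\beta} \\ \mathbf{a} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}^T\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{Z}^T\mathbf{R}^{-1}\mathbf{y} \end{bmatrix} \tag{2.56}$$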

In 1959 Henderson proved that β in 2.56 was a solution to 2.47 and in 1963 he proved that a in 2.56 is equal to $\mathbf{G}\mathbf{Z}^T(\mathbf{Z}\mathbf{G}\mathbf{Z}^T + \mathbf{R})^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$ in 2.55 (Henderson, 1975).

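If $\mathbf{R} = \mathbf{I}\sigma_E^2$ and $\mathbf{G} = \mathbf{A}\sigma_A^2$, then multiplying 2.56 through by $\sigma_E^2$ allows it to be rewritten:

$$\begin{bmatrix}
\mathbf{X}^T\mathbf{X} & \mathbf{X}^T\mathbf{Z} \\
\mathbf{Z}^T\mathbf{X} & \mathbf{Z}^T\mathbf{Z} + \dfrac{\sigma_E^2}{\sigma_A^2}\mathbf{A}^{-1}
\end{bmatrix}
\begin{bmatrix} \boldsymbol{\beta} \\ \mathbf{a} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}^T\mathbf{y} \\ \mathbf{Z}^T\mathbf{y} \end{bmatrix} \tag{2.57}$$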


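The variance ratio can be expressed in terms of the heritability:

$$\begin{aligned}
\frac{\sigma_A^2}{\sigma_P^2} &= h^2 \quad \text{(see equation 2.16)} \\
\sigma_E^2 &= \sigma_P^2 - \sigma_A^2 \\
\frac{\sigma_E^2}{\sigma_P^2} &= \frac{\sigma_P^2 - \sigma_A^2}{\sigma_P^2} = 1 - h^2 \\
\frac{\sigma_E^2}{\sigma_A^2} &= \frac{\sigma_E^2/\sigma_P^2}{\sigma_A^2/\sigma_P^2} = \frac{1 - h^2}{h^2}
\end{aligned}$$

and equation 2.57 can be written:

$$\begin{bmatrix}
\mathbf{X}^T\mathbf{X} & \mathbf{X}^T\mathbf{Z} \\
\mathbf{Z}^T\mathbf{X} & \mathbf{Z}^T\mathbf{Z} + \dfrac{1 - h^2}{h^2}\mathbf{A}^{-1}
\end{bmatrix}
\begin{bmatrix} \boldsymbol{\beta} \\ \mathbf{a} \end{bmatrix}
=
\begin{bmatrix} \mathbf{X}^T\mathbf{y} \\ \mathbf{Z}^T\mathbf{y} \end{bmatrix} \tag{2.58}$$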
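As a quick numerical check of the ratio used in equation 2.58, the following minimal sketch evaluates (1 − h²)/h² from assumed variance components. The function name `variance_ratio` and the variance values are illustrative assumptions, not figures from the thesis data.

```python
def variance_ratio(h2):
    """Return sigma_E^2 / sigma_A^2 = (1 - h^2) / h^2 for a heritability h2."""
    if not 0.0 < h2 <= 1.0:
        raise ValueError("heritability must lie in (0, 1]")
    return (1.0 - h2) / h2


# Illustrative (assumed) variance components.
sigma_a2 = 2.0                   # additive genetic variance
sigma_p2 = 8.0                   # phenotypic variance
h2 = sigma_a2 / sigma_p2         # heritability
sigma_e2 = sigma_p2 - sigma_a2   # environmental variance
ratio = variance_ratio(h2)       # multiplier applied to the inverse of A
```

With these assumed values the heritability is 0.25, so the multiplier on the inverse relationship matrix is three: the lower the heritability, the more weight the pedigree information receives.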

As discussed in section 2.4, heritability can be determined in many ways; restricted maximum likelihood (REML) can be used to calculate a value from the data set (Lynch & Walsh, 1997). Selection alters heritability; in other words, for the value of h² to change, a new generation is required and that generation must be selected, which takes time. Heritability figures are therefore taken from the batch process, as real time calculation is not required.

2.10 Within herd estimated breeding values

Before this is discussed, we need to consider what animal model the application is going to support, as there are many options available.

It makes sense to support within herd groups. Male and female is a group classification that comes to mind, as is the animal's age when the phenotype record was created. The mathematics to support multiple phenotype records of the same trait taken over time is a little more complex than the mathematics discussed in the previous sections, as the variance between records has to be determined by other means. Chapter 7, edition 2 of Mrode’s book (Mrode, 2005) looks at the model needed for longitudinal data.

To apply the techniques described in section 2.12, the pedigree must be set up in the order the animals enter the pedigree. However, the A matrix used in equation 2.58 can be set up in any order, as long as the cells within the new row and column contain the correct data. That is, the diagonal cells contain one plus the inbreeding coefficient (see section 2.11) and the off diagonal cells contain the relationship value between the two animals represented by the intersection of the column and row.

The last animal to be added to the pedigree can take the last row in A and, as a result, the last row in X and Z.

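The left matrix in equation 2.58 can be written as:

$$\begin{bmatrix}
\mathbf{X}^T\mathbf{X} & \mathbf{X}^T\mathbf{Z} \\
\mathbf{Z}^T\mathbf{X} & \mathbf{Z}^T\mathbf{Z} + \dfrac{1 - h^2}{h^2}\mathbf{A}^{-1}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{X}^T\mathbf{X} & \mathbf{X}^T\mathbf{Z} \\
\mathbf{Z}^T\mathbf{X} & \mathbf{Z}^T\mathbf{Z}
\end{bmatrix}
+
\begin{bmatrix}
\mathbf{0} & \mathbf{0} \\
\mathbf{0} & \dfrac{1 - h^2}{h^2}\mathbf{A}^{-1}
\end{bmatrix} \tag{2.59}$$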

The diagonal elements of the sub matrix $\mathbf{X}^T\mathbf{X}$ count the number of animals in each group and the off diagonal elements the number in each pair of groups. If the number of groups doesn’t change, the size of this sub matrix remains constant and, to update it, one is added to the appropriate counts.

The number of columns in the sub matrix $\mathbf{Z}^T\mathbf{X}$ is equal to the number of groups and the number of rows is equal to the number of animals. When we add an animal to the end of the matrix, it grows by a row. One is placed in a column if the animal belongs to the group in that column, zero in all others. The matrix $\mathbf{X}^T\mathbf{Z}$ is the transpose of $\mathbf{Z}^T\mathbf{X}$.

The sub matrix $\mathbf{Z}^T\mathbf{Z}$ is a matrix that contains a diagonal entry if a phenotype record is available for the animal and zero if not.

Excluding $\mathbf{A}^{-1}$ from consideration, updating the left hand side of 2.58 involves the addition of a row and column to a large matrix, and a little bit of work in the top left hand corner to get the counts in order.

Matrix $\mathbf{A}^{-1}$ is added to the bottom right hand corner. The incremental updating of $\mathbf{A}^{-1}$ is discussed in section 2.13.

Considering now the right hand side: $\mathbf{X}^T\mathbf{y}$ contains an entry per group, the sum of all y values for the group; adding the new value to the relevant summation will update these values. $\mathbf{Z}^T\mathbf{y}$ selects the appropriate y value for a particular row in the equation set.

When adding the row to the equations we get the last value on the right hand side correct if we set it to the phenotype record, or to zero if there is no phenotype.

While it’s possible to incrementally set up the BLUP equations, we are left with a set of simultaneous equations to solve, the number equaling the number of animals in the herd plus the number of groups.

Providing a screen for the user to initiate the estimated breeding value calculation and using AJAX to provide the user with progress details is the solution. If we are going to solve the simultaneous equations in this manner, we may as well set up the equations from scratch and avoid incrementally updating them. To build the equations from scratch, the animals must be sorted in the order they entered the pedigree. The transitive closure and maximum path length can be used to put the animals in the correct order. We arrange the animals from those with the longest maximum path length between the animal and its latest descendant to those with the shortest.
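The ordering rule can be sketched as follows. This is a minimal illustration that, as an assumption, derives the longest path to a descendant directly from sire and dam records by a memoized search, rather than reading it from the incrementally maintained transitive closure used by the application. The function names and the tuple layout `(animal, sire, dam)` are hypothetical; the data is the example pedigree of figure 2.2.

```python
def longest_depths(pedigree):
    """Map each animal to the longest path (in edges) to any descendant.

    pedigree: iterable of (animal, sire, dam); None marks an unknown parent.
    """
    children = {}
    for animal, sire, dam in pedigree:
        children.setdefault(animal, [])
        for parent in (sire, dam):
            if parent is not None:
                children.setdefault(parent, []).append(animal)

    depth = {}

    def visit(node):
        # An animal with no offspring has depth 0; otherwise it is one more
        # than the deepest of its offspring.
        if node not in depth:
            depth[node] = max((1 + visit(c) for c in children[node]), default=0)
        return depth[node]

    for node in list(children):
        visit(node)
    return depth


def pedigree_order(pedigree):
    """Animals sorted from longest path to a descendant to shortest."""
    depth = longest_depths(pedigree)
    return sorted(depth, key=lambda animal: -depth[animal])


# Example pedigree of figure 2.2, deliberately entered out of order.
EXAMPLE = [(6, 5, 2), (5, 4, 3), (3, 1, 2), (4, 1, None)]
DEPTHS = longest_depths(EXAMPLE)
ORDER = pedigree_order(EXAMPLE)
```

Because every parent has a strictly longer path than any of its offspring, sorting by decreasing depth always places parents ahead of their offspring, whatever order the records were entered in.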

2.10.1 No members in a group

If the number of groups remains constant, it was noted above that updating counts could be used to update $\mathbf{X}^T\mathbf{X}$. With this knowledge it may be tempting to set up the equations with all possible groups. However, if a group has no members the simultaneous equations will not have a unique solution.

2.11 Relationship matrix

In section 2.9 it was mentioned that G is defined as $\mathbf{A}\sigma_A^2$, where A is the relationship matrix. The relationship matrix records the flow of genetic information through the pedigree; that is, it records the relationship between the variances. The matrix has twice the coefficient of co-ancestry ($2\Theta_{ij}$) between animal i and j in the off diagonal cells ij and ji, and the inbreeding coefficient plus one ($F_i + 1$) along the diagonal cells (Lynch & Walsh, 1997, page 763).

The inbreeding coefficient ($F_i$) is the probability that the two gametes within an individual are common by descent. The coefficient of kinship, the coefficient of co-ancestry and the coefficient of consanguinity ($\Theta_{ij}$) are different names for the same thing: the probability that two gametes are common between two individuals. An offspring’s inbreeding coefficient is equal to its parents’ coefficient of kinship (Lynch & Walsh, 1997, page 135). As “common by descent” can only be found within the known relationships, it is assumed that individuals are unrelated when they enter the pedigree (Lynch & Walsh, 1997, page 132). As a result, we only need related individuals to calculate the coefficients.

2.11.1 Calculating the relationship matrix

Example pedigree

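This section uses the example pedigree in figure 2.2 to illustrate the calculations.

Kid   Sire   Dam
3     1      2
4     1      unknown
5     4      3
6     5      2

(Pedigree diagram: founders Unknown, 1 and 2 in the top row; animals 4 and 3 in the second row; animal 5 in the third row; animal 6 in the last row.)

Figure 2.2: Example pedigree (Mrode, 1996)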

Ordering of animals

Calculation of the coefficients requires that the matrix be arranged from the earliest generation to the most recent (parents precede their offspring). This could be done by arranging the records from earliest birth date to most recent, if all birth dates are known. If the maximum path depth is maintained, the list can run from records with the longest path to a descendant to those with the shortest path. For example: an animal with no descendants has a longest path of zero, while a grandparent whose grand-offspring have no offspring will have a longest path of two. Table 2.2 gives the longest path for the example pedigree given in figure 2.2. Note that animal two has a short path of one edge to animal six and a long path of three edges to animal six.

Animal   Longest depth
1        3
2        3
3        2
4        2
5        1
6        0

Table 2.2: Arrangement of animals from longest path to shortest

Relationship between animals

How related the new animal is to the other animals is dependent on the parents. The following subscripts are used:

i = Offspring

s = Sire

d = Dam

j = All other animals

u = Unknown

$$a_{ji} = a_{ij} = 0.5(a_{js} + a_{jd}) \tag{2.60}$$

$$a_{ii} = 1 + 0.5(a_{sd}) \tag{2.61}$$

If a parent of an animal isn’t known, the inbreeding coefficient is set to zero. The inbreeding coefficient for an animal is equal to $F_i = 0.5(a_{sd})$.

animal   1      2      3      4      5      6
1        1.00   0.00   0.50   0.50   0.50   0.25
2        0.00   1.00   0.50   0.00   0.25   0.625
3        0.50   0.50   1.00   0.25   0.625  0.563
4        0.50   0.00   0.25   1.00   0.625  0.313
5        0.50   0.25   0.625  0.625  1.125  0.688
6        0.25   0.625  0.563  0.313  0.688  1.125

Table 2.3: A matrix for example pedigree given in figure 2.2 (Mrode, 1996).

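The $a_{11}$ entry of the A matrix for animal 1 (unknown sire and dam) equals:

$$\begin{aligned}
a_{uu} &= 0 \\
a_{11} &= 1 + 0.5(a_{sd}) = 1 + 0.5(a_{uu}) \\
&= 1
\end{aligned}$$

The entries for animal 2 (unknown sire and dam) equal:

$$\begin{aligned}
a_{12} = a_{21} &= 0.5(a_{1s} + a_{1d}) \\
&= 0.5(a_{1u} + a_{1u}) \\
&= 0 \\
a_{uu} &= 0 \\
a_{22} &= 1 + 0.5(a_{sd}) \\
&= 1 + 0.5(a_{uu}) \\
&= 1
\end{aligned}$$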


The entries for animal 3 (sire equals animal 1, dam equals animal 2) equal:

$$\begin{aligned}
a_{13} = a_{31} &= 0.5(a_{1s} + a_{1d}) \\
&= 0.5(a_{11} + a_{12}) \\
&= 0.5(1 + 0) \\
&= 0.5 \\
a_{23} = a_{32} &= 0.5(a_{2s} + a_{2d}) \\
&= 0.5(a_{21} + a_{22}) \\
&= 0.5(0 + 1) \\
&= 0.5 \\
a_{33} &= 1 + 0.5(a_{sd}) \\
&= 1 + 0.5(a_{12}) \\
&= 1 + 0.5(0) \\
&= 1
\end{aligned}$$

etc.
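The remaining entries follow the same pattern, so the whole matrix can be filled in mechanically. The sketch below applies equations 2.60 and 2.61 row by row; the function name `relationship_matrix` and the `(animal, sire, dam)` tuple layout are illustrative assumptions, and the pedigree must already be ordered so that parents precede their offspring.

```python
def relationship_matrix(pedigree):
    """Build the A matrix with equations 2.60 and 2.61.

    pedigree: list of (animal, sire, dam), parents listed before offspring;
    None marks an unknown parent, whose relationship entries are zero.
    """
    index = {animal: k for k, (animal, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    a = [[0.0] * n for _ in range(n)]
    for i, (_, sire, dam) in enumerate(pedigree):
        s = index.get(sire)  # None when the sire is unknown
        d = index.get(dam)   # None when the dam is unknown
        for j in range(i):
            a_js = a[j][s] if s is not None else 0.0
            a_jd = a[j][d] if d is not None else 0.0
            a[i][j] = a[j][i] = 0.5 * (a_js + a_jd)       # equation 2.60
        a_sd = a[s][d] if s is not None and d is not None else 0.0
        a[i][i] = 1.0 + 0.5 * a_sd                        # equation 2.61
    return a


# Example pedigree of figure 2.2, ordered as in table 2.2.
EXAMPLE = [(1, None, None), (2, None, None), (3, 1, 2),
           (4, 1, None), (5, 4, 3), (6, 5, 2)]
A = relationship_matrix(EXAMPLE)
```

For the example pedigree this reproduces table 2.3 (before rounding to three decimal places); the inbreeding coefficient of animal i is the i-th diagonal entry minus one.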

2.12 Calculating the inbreeding coefficient

To aid the selection of animals for joining, we wish to present to the grower the inbreeding coefficients (these are equal to the diagonal elements of A minus one). Quass, 1976 presented a method to quickly calculate $\mathbf{A}^{-1}$; the ideas presented there can be used to calculate the desired data.

A positive definite symmetric matrix (which A is) can be decomposed as:

A = TDTT (2.62)

Where T is a lower triangular matrix with ones along the diagonal and D is a diagonal matrix. If we define L as $\mathbf{L} = \mathbf{T}\sqrt{\mathbf{D}}$, we have the Cholesky Decomposition (Strang, 2003, page 334). Then:

$$\mathbf{A} = \mathbf{L}\mathbf{L}^T \tag{2.63}$$

L is a lower triangular matrix so:

$$a_{ii} = \sum_{m=1}^{i} l_{im}^2 \tag{2.64}$$


That is, the non-zero entries in row i of L are $l_{i1}$ to $l_{ii}$, and the entries in column i of $\mathbf{L}^T$ are the same values. If we multiply row i by column i we end up with the sum of the entries squared.

Putting it another way:

$$\begin{aligned}
a_{11} &= l_{11}^2 \\
a_{22} &= l_{21}^2 + l_{22}^2 \\
a_{33} &= l_{31}^2 + l_{32}^2 + l_{33}^2
\end{aligned}$$

etc.

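Henderson, 1976 provided rules for calculating the diagonal elements of L.

$$\begin{aligned}
l_{ii} &= \sqrt{d_i} \\
&= \sqrt{0.5 - 0.25(F_s + F_d)} \\
&= \sqrt{0.5 - 0.25(a_{ss} - 1 + a_{dd} - 1)} \\
&= \sqrt{1.0 - 0.25(a_{ss} + a_{dd})} \\
&= \sqrt{1.0 - 0.25\left(\sum_{m=1}^{s} l_{sm}^2 + \sum_{m=1}^{d} l_{dm}^2\right)}
\end{aligned}$$

The off diagonal elements for column j are calculated using:

$$l_{ij} = 0.5(l_{sj} + l_{dj}) \qquad s, d \ge j \tag{2.65}$$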

The diagonal elements of A can be calculated using two vectors that have a length equal to the number of animals in the pedigree. One vector contains the diagonal elements of A, progressively calculated; on completion all the diagonal elements of A will be present. The other vector contains the current column of L, and the column changes as the calculation progresses.


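Using the example pedigree introduced in section 2.11.1:

$$\begin{aligned}
l_{11} &= \sqrt{1.0 - 0.25(a_{ss} + a_{dd})} \quad \text{where } a_{ss} \text{ and } a_{dd} \text{ are unknown} \\
&= \sqrt{1.0 - 0.25(0 + 0)} \\
&= 1 \\
a_{11} &= l_{11}^2 \\
&= 1^2 \\
&= 1 \\
l_{21} &= 0.5(l_{sj} + l_{dj}) \quad \text{where } s \text{ and } d \text{ are unknown} \\
&= 0 \\
l_{31} &= 0.5(l_{sj} + l_{dj}) \quad \text{where } s = 1 \text{ and } d = 2 \\
&= 0.5(l_{11} + l_{21}) \\
&= 0.5(1 + 0) = 0.5
\end{aligned}$$

etc.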

As mentioned above, the animals must be arranged from the first animal in the pedigree to the last; this is dealt with further in section 3.4.5.
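The two-vector calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `inbreeding_coefficients` and the `(animal, sire, dam)` layout are hypothetical, unknown parents contribute zero as in the worked example, and the pedigree is assumed to be ordered with parents before offspring.

```python
from math import sqrt


def inbreeding_coefficients(pedigree):
    """Diagonal of A minus one, computed one column of L at a time.

    pedigree: list of (animal, sire, dam), parents listed before offspring;
    None marks an unknown parent.
    """
    n = len(pedigree)
    index = {animal: k for k, (animal, _, _) in enumerate(pedigree)}
    parents = [(index.get(s), index.get(d)) for _, s, d in pedigree]
    diag = [0.0] * n  # running sums of l_im^2, i.e. the diagonal of A
    col = [0.0] * n   # the current column j of L
    for j in range(n):
        for k in range(n):
            col[k] = 0.0  # entries above the diagonal are zero
        s, d = parents[j]
        a_ss = diag[s] if s is not None else 0.0  # complete, because s < j
        a_dd = diag[d] if d is not None else 0.0
        col[j] = sqrt(1.0 - 0.25 * (a_ss + a_dd))  # diagonal element l_jj
        diag[j] += col[j] ** 2
        for i in range(j + 1, n):
            si, di = parents[i]
            l_sj = col[si] if si is not None else 0.0
            l_dj = col[di] if di is not None else 0.0
            col[i] = 0.5 * (l_sj + l_dj)           # equation 2.65
            diag[i] += col[i] ** 2                 # equation 2.64
    return {animal: diag[k] - 1.0 for animal, k in index.items()}


EXAMPLE = [(1, None, None), (2, None, None), (3, 1, 2),
           (4, 1, None), (5, 4, 3), (6, 5, 2)]
F = inbreeding_coefficients(EXAMPLE)
```

Only two vectors of length n are held at any time, at the cost of an inner loop over the later animals for every column; for the example pedigree animals 5 and 6 come out with an inbreeding coefficient of 0.125, in agreement with the diagonal of table 2.3.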

2.13 Incrementally maintaining A−1

In the previous section it was noted that:

$$\mathbf{A} = \mathbf{T}\mathbf{D}\mathbf{T}^T \tag{2.66}$$

The inverse is equal to:

$$\mathbf{A}^{-1} = (\mathbf{T}^{-1})^T\mathbf{D}^{-1}\mathbf{T}^{-1} \tag{2.67}$$

The inverse of a diagonal matrix is the reciprocal of the diagonal elements. Find D and it is a simple matter to find $\mathbf{D}^{-1}$. $\mathbf{T}^{-1}$ is a triangular matrix with 1 along the diagonal and -0.5 to the right where the columns for an animal meet the row for the parents (Mrode, 1996).

Incrementally calculating the elements of D was discussed in section 2.12. To fully update D, the value for the added animal and all the ancestors of its descendants have to be updated. The set to update is easy to find using the transitive closure discussed in the next chapter.

To update $\mathbf{T}^{-1}$, a row and a column are added to the matrix with the appropriate elements set to -0.5. After updating D and $\mathbf{T}^{-1}$, equation 2.67 can be used to calculate $\mathbf{A}^{-1}$.
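Equation 2.67 can be illustrated with a small from-scratch (not incremental) sketch. It assumes the inbreeding coefficients of the parents are already known, stores $\mathbf{T}^{-1}$ in lower triangular form (animal rows, parent columns), and uses $d_i = 1 - 0.25(a_{ss} + a_{dd})$ with zero for an unknown parent, as in section 2.12. The function name and argument layout are hypothetical.

```python
def inverse_relationship_matrix(pedigree, inbreeding):
    """Return A^-1 = (T^-1)^T D^-1 T^-1 as a list of rows.

    pedigree: list of (animal, sire, dam), parents listed before offspring;
    None marks an unknown parent.
    inbreeding: dict mapping each parent to its inbreeding coefficient F.
    """
    index = {animal: k for k, (animal, _, _) in enumerate(pedigree)}
    n = len(pedigree)
    t_inv = [[0.0] * n for _ in range(n)]
    d_inv = [0.0] * n
    for i, (_, sire, dam) in enumerate(pedigree):
        known = [p for p in (sire, dam) if p is not None]
        t_inv[i][i] = 1.0
        for parent in known:
            t_inv[i][index[parent]] = -0.5
        # d_i = 1 - 0.25 (a_ss + a_dd), with a_pp = 1 + F_p for known parents.
        d_i = 1.0 - 0.25 * sum(1.0 + inbreeding[p] for p in known)
        d_inv[i] = 1.0 / d_i
    # Entry (r, c) of (T^-1)^T D^-1 T^-1.
    return [[sum(t_inv[k][r] * d_inv[k] * t_inv[k][c] for k in range(n))
             for c in range(n)] for r in range(n)]


EXAMPLE = [(1, None, None), (2, None, None), (3, 1, 2),
           (4, 1, None), (5, 4, 3), (6, 5, 2)]
F = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.125, 6: 0.125}
A_INV = inverse_relationship_matrix(EXAMPLE, F)
```

Each animal contributes to at most nine cells of the result (itself and its parents), which is why the inverse can be assembled without ever forming A.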


2.14 Conclusion

Merrrit is being developed as a tool to increase the industry's economic performance. This chapter introduced the theory behind selection and the mathematics needed to remove fixed effects that result from animals being born and bred on different properties. To use the methods described we need the pedigree. In the next chapter we look at representing the pedigree as a graph, the transitive closure and the incremental maintenance of the transitive closure. How the transitive closure is used by the application to meet the design goals is left to chapter 5.


Chapter 3

Graphs and transitive closures


The on line creation of pedigrees by herd owners is a primary project aim. With multiple users creating the pedigree, it is important that the system checks the integrity of the data as it is entered, and that it does not restrict the order of data entry. For example, the parents of an animal must be born before the offspring. If we allow animal data to be entered in any order, the data for multiple offspring can be entered before the parent data. When the parent data is entered we need to be able to find the birth date of the first born offspring and make sure we are setting a reasonable date for the parent’s birth.

In section 2.11.1, it was noted that we can calculate the inbreeding coefficient using an animal’s ancestors if they are arranged in the order they appear in the pedigree. To perform the calculation, we need to be able to find all descendants and arrange them in order.

The pedigree’s transitive closure is the foundation used to perform both of these operations quickly. Putting the animals in the order they appear in the pedigree requires us to maintain the maximum path length between the animal and its descendants.

This chapter describes the transitive closure and the algorithms to incrementally maintain the transitive closure. As performance suitable for the on line maintenance of the pedigree is required, consideration is given to methods that can be used to speed up the incremental maintenance of large graphs (greater than ten thousand nodes).

The chapter also describes an algorithm to incrementally maintain the minimum distance between graph nodes and describes how to extend previous work so that the application can maintain the maximum distance between nodes.

From      To
Unknown   4
4         5
5         6
1         4
1         3
3         5
2         6

Table 3.1: Edges table for pedigree given in figure 2.2

3.1 Directed acyclic graph

A pedigree can be represented by a directed acyclic graph. Geometrically a graph is defined as a set of points (vertices or nodes) in space which are connected by a set of lines (edges). For a graph G, the vertex set is denoted by V and the edge set by E, and we write: G = (V, E).

A simple graph contains no self loops and no parallel edges. As mammals cannot parent themselves and as only one strand of DNA can pass from a parent to child, a mammal pedigree is a simple graph.

A directed acyclic graph is a directed graph with no directed cycles; that is, for any vertex v, there is no nonempty directed path that starts and ends on v. A pedigree is a directed acyclic graph as gene flow is from parents to offspring and, if you exclude time travel, your offspring can’t be your parents.

3.2 Transitive closure

A path from $v_1$ to $v_i$ is a sequence $P = v_1, e_1, v_2, e_2, \cdots, e_{i-1}, v_i$. If the graph is simple (only one edge between vertices), the path can be represented as: $p = v_1, v_2, \cdots, v_i$.

Two nodes are connected if there is a path from one to the other. The transitive closure of a graph G = (V, E) is the graph G+ = (V, E+) such that for all node pairs (v, w) in V there is an edge in E+ if there is a path from v to w in G.

Every directed acyclic graph has a topological ordering; that is, an ordering of the vertices such that the starting node of every edge occurs earlier in the ordering than the ending node of the edge. This is important for the development of proofs for the algorithms used for the addition and removal of edges.

From      To        From   To
Unknown   4         1      4
Unknown   5         1      5
Unknown   6         1      6
4         5         1      3
4         6         1      5
5         6         1      6
                    3      5
                    2      3
                    2      6

Table 3.2: Transitive closure of example pedigree from figure 2.2

3.3 Only one transitive closure needs to be maintained

If you have a directed acyclic graph and there is a path from $v_i$ to $v_j$, and you change the direction of all paths, there will now be a path from $v_j$ to $v_i$. The transitive closure contains a list of paths. Changing the direction of the paths in the original graph will not create new connections between node pairs or destroy connections present. The direction of each path, however, will be reversed. If you store a transitive closure as pairs of nodes that connect parents to all descendants, then all descendants can be found by selecting a particular parent. All ancestors can be found by selecting a particular descendant.
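A minimal sketch of the two selections follows, assuming the closure is held as a set of (from, to) pairs. The pairs listed are the closure of the figure 2.2 pedigree worked out from its sire and dam records; the names `CLOSURE`, `descendants` and `ancestors` are illustrative only.

```python
# Closure of the example pedigree as (ancestor, descendant) pairs.
CLOSURE = {
    ("Unknown", 4), ("Unknown", 5), ("Unknown", 6),
    (1, 3), (1, 4), (1, 5), (1, 6),
    (2, 3), (2, 5), (2, 6),
    (3, 5), (3, 6),
    (4, 5), (4, 6),
    (5, 6),
}


def descendants(tc, animal):
    """Select on the 'from' column: every node the animal has a path to."""
    return {to for (frm, to) in tc if frm == animal}


def ancestors(tc, animal):
    """Select on the 'to' column: every node with a path to the animal."""
    return {frm for (frm, to) in tc if to == animal}
```

For example, `descendants(CLOSURE, 1)` selects animals 3, 4, 5 and 6, while `ancestors(CLOSURE, 5)` selects Unknown, 1, 2, 3 and 4; both queries read the same table, so no reversed closure has to be stored.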

3.4 Incremental maintenance of a transitive closure

Several papers have been published (Guozhu, Leonid, Jianwen, & Limsoon, 1999 and Dong & Su, 1995) that deal with the theoretical possibility of the incremental maintenance of a transitive closure; these also provide example SQL code. This section extends these ideas with the aim of providing a solid background for their use in the application.

3.4.1 Insertion of an edge into a transitive closure

If you insert an edge between a and b, then all nodes that had a path to a will have a path to b and, similarly, all nodes that had a path from b will now have a path from a.

There will now be paths between both sets of nodes; that is, nodes that had a path to a and nodes that had a path from b. There will also be a path between a and b.

When nodes are presented as a pair, they represent a path whose node order gives the direction. Generating the transitive closure (TC) after the insertion of an edge between a and b requires us to generate two sets of nodes. Let us describe any path in the TC as $(node_i, node_j)$, with the path starting at i and ending at j. The new edge being inserted is $(node_a, node_b)$.

The set of nodes that have paths that end at node a is given by:

$$A = \{node_i \mid TC(node_j) = a\} \tag{3.1}$$

The set of nodes with a path that starts at node b is given by:

$$B = \{node_j \mid TC(node_i) = b\} \tag{3.2}$$

The inserted edge starts at a single node, so the set of nodes the inserted edge starts at has one entry, a. Similarly, the set of nodes the inserted edge ends at has one entry, b.

The new paths created when the edge $(node_a, node_b)$ is added can be found using the union of Cartesian products between the node sets described above:

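$$\begin{aligned}
P_1 &= (A \times b) && \text{All the paths that end at } b. \\
P_2 &= (a \times B) && \text{All the paths that start at } a. \\
P_3 &= (A \times B) && \text{Paths that start in set } A \text{ and end in set } B. \\
P_4 &= (a \times b) && \text{The new edge.} \\
PS &= P_1 \cup P_2 \cup P_3 \cup P_4
\end{aligned} \tag{3.3}$$

The new TC is the union between the new paths and the old:

$$TC_{new} = TC \cup PS$$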
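Equations 3.1 to 3.3 translate almost directly into set operations. The sketch below is an in-memory illustration rather than the SQL used by the application; the function name `insert_edge` is hypothetical, and the edge list is taken from the sire and dam records of the figure 2.2 pedigree, inserted in an arbitrary order.

```python
def insert_edge(tc, a, b):
    """Return the transitive closure after inserting the edge (a, b).

    tc: set of (from, to) pairs, one per path in the current closure.
    """
    set_a = {i for (i, j) in tc if j == a}       # equation 3.1: paths ending at a
    set_b = {j for (i, j) in tc if i == b}       # equation 3.2: paths starting at b
    p1 = {(i, b) for i in set_a}                 # A x b
    p2 = {(a, j) for j in set_b}                 # a x B
    p3 = {(i, j) for i in set_a for j in set_b}  # A x B
    p4 = {(a, b)}                                # the new edge
    return tc | p1 | p2 | p3 | p4                # set union removes duplicates


EDGES = [(5, 6), (1, 3), (4, 5), (2, 6), (3, 5), (1, 4), (2, 3), ("Unknown", 4)]
CLOSURE = set()
for parent, child in EDGES:
    CLOSURE = insert_edge(CLOSURE, parent, child)
```

The result does not depend on the order in which the edges arrive, which is what allows herd owners to enter offspring before their parents. A pair that already exists (the intersect set discussed below) is simply absorbed by the union.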

Intersect set

The set of new paths may contain paths that exist in the old TC; that is, there may be an intersect set. It is the finding of this intersect set that creates problems when an edge is deleted.

3.4.2 Deletion of an edge from a transitive closure

Let the original set be $TC_{old}$. Once again let an edge be represented by $(node_i, node_j)$, with the node order giving the direction. The set of all possible paths that could have been created by the insertion of the edge being deleted are the suspect paths. This set can be generated in the same way as set PS was created by formula 3.3 in the previous section.

Paths that are not in the “SUSPECT” set but are in $TC_{old}$ are the “TRUSTY” paths.

“SUSPECT” will contain paths that would be in the intersection set (see section 3.4.1) if the edge was being added to the transitive closure. These can be recovered from “TRUSTY”. There are two cases (discussed in figure 3.1) that need to be considered.

The paths represented by the case shown to the left of figure 3.1 can be recovered by joining tails to heads. The case to the right cannot be recovered unless “TRUSTY” contains all edges other than the edge (a, b). If the shortest path length is maintained, the edges can be kept in “TRUSTY” by making the subtraction of “SUSPECT” from “TRUSTY” conditional on the shortest path length being greater than one.

3.4.3 Three self joins of “TRUSTY” to recover all paths

The results given in Dong & Su, 1995 show that joining “TRUSTY” head to tail three times recovers all paths that should not have been deleted when “SUSPECT” was removed from “TRUSTY”. The case requiring three joins is shown in figure 3.2.

(Two diagrams: the left shows nodes 1, 2, 3, a and b; the right shows nodes 1, 2, a and b.)

Left: After “SUSPECT” has been removed, the following node pairs will still exist within “TRUSTY”: (1, a), (1, 2), (2, 3), along with others. The node pair (1, 3) would have been removed because that pair started and ended a path that also went through a, b. Joining heads to tails will join (1, 2) and (2, 3). The path (1, 3) will be restored.

Right: After “SUSPECT” has been removed, the node pairs (1, b) and (1, 2) will have been removed because node 1 was connected to node b and 2 through a. These will not be recovered using the method to the left unless “TRUSTY” still contains all edges other than the edge between (a, b).

Figure 3.1: Restoration of deleted paths

(Diagram showing nodes 1, 2, 3, a and b.)

After “SUSPECT” has been removed, the paths from 1 and 2 to b and 3 have also been removed as there were paths between these nodes through a and b. The path from any node on the upper side to 2 still exists. The path from 2 to b still exists because it is an edge. The path from b to any node on the down side will still exist. Three head to tail joins of “TRUSTY” will recover the missing path.

Figure 3.2: Three head to tail joins of trusted paths required.

The joining of heads to tails will generate a lot of paths that were not in “SUSPECT”. Because a union removes duplicates, this doesn’t matter.

Let $(node_i, node_j)$ be one path in “TRUSTY” and $(node_v, node_w)$ be another. Then:

$$\begin{aligned}
T_a &= \{node_i \mid T(node_j) = T(node_v)\} \\
T_b &= \{node_w \mid T(node_j) = T(node_v)\} \\
U &= T_a \times T_b \\
U_a &= \{node_i \mid U(node_j) = T(node_v)\} \\
U_b &= \{node_w \mid U(node_j) = T(node_v)\} \\
I &= U_a \times U_b
\end{aligned} \tag{3.4}$$

The new transitive closure equals:

$$TC_{new} = \{TC_{old} - SUSPECT \mid short.path > 1\} \cup I \tag{3.5}$$
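The deletion procedure can be sketched in memory as follows. This is an illustration under stated assumptions, not the application's SQL: the edges are passed in as a separate set (standing in for the rows whose shortest path length is one), “TRUSTY” is taken as the unsuspected paths plus the surviving edges, and the head to tail join is a plain relational join on the shared node. The function names are hypothetical.

```python
def join(left, right):
    """Head to tail join: (i, j) in left and (j, w) in right give (i, w)."""
    starts = {}
    for (v, w) in right:
        starts.setdefault(v, set()).add(w)
    return {(i, w) for (i, j) in left for w in starts.get(j, ())}


def delete_edge(tc, edges, a, b):
    """Return (new closure, new edge set) after deleting the edge (a, b).

    tc: set of (from, to) paths; edges: set of (from, to) edges of the graph.
    """
    set_a = {i for (i, j) in tc if j == a} | {a}   # a and every node reaching a
    set_b = {j for (i, j) in tc if i == b} | {b}   # b and every node reached from b
    suspect = {(i, j) for i in set_a for j in set_b}
    remaining = edges - {(a, b)}
    trusty = (tc - suspect) | remaining            # edges are never discarded
    two = trusty | join(trusty, trusty)            # paths made of up to two pieces
    three = two | join(two, trusty)                # paths made of up to three pieces
    return three, remaining
```

Usage on the example pedigree: deleting the edge from animal 3 to animal 5 makes every path from 1, 2 or 3 to 5 or 6 suspect. The pairs (1, 5) and (1, 6) are rebuilt by the joins because animal 1 still reaches 5 through animal 4, and (2, 6) survives because it is itself an edge.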

3.4.4 Fragments

The intersect set (see section 3.4.1) will only contain paths with start nodes and end nodes that are in “SUSPECT”. “SUSPECT” is generally smaller than the transitive closure. When the pedigree is large, speed becomes an issue and reducing the set size becomes important.

$$\begin{aligned}
S_a &= \{node_i \mid SELECT(node_i)\} \\
S_b &= \{node_j \mid SELECT(node_i)\} \\
F_a &= \{(node_i) \mid T_{new}(node_i) = S_a \ \mathrm{OR}\ T_{new}(node_j) = S_b\} \\
F_b &= \{(node_j) \mid T_{new}(node_i) = S_a \ \mathrm{OR}\ T_{new}(node_j) = S_b\} \\
I &= F_a \times F_b
\end{aligned}$$

(Pedigree diagram: founders Unknown, 1 and 2 in the top row; animals 4 and 3 in the second row; animal 5 in the third row; animal 6 in the last row.)

The longest path between node two and six is three edges; the shortest path between node two and six is one edge.

Figure 3.3: Shortest and longest path

3.4.5 The shortest and longest Path

The utility of the transitive closure is increased if the system maintains the shortest and longest paths between nodes. For example, if the shortest path length is maintained, the number of offspring is equal to the number of paths that start from a node and have a path length of 1.

Several of the algorithms described in the previous sections require the maintenance of an edge table or the shortest path (a path with a shortest distance of one is an edge). The longest path is required when ordering the data for the creation of the vectors needed to calculate the inbreeding coefficient (see section 2.12).

The algorithms for deleting and inserting an edge rely on joining heads to tails. The maximum length of a new path is equal to the sum of the maximum lengths of the joined paths. The minimum path length is equal to the sum of the minimum path lengths of the joined paths. Issues arise when there are multiple paths between vertices and the Cartesian product of nodes results in entries with the same start and end node but different minimum or maximum path lengths.

The tuples now have four fields $(node_{from}, node_{to}, path_{min}, path_{max})$, and are identical only if all fields are the same. If there are multiple paths between vertices with different path lengths, the union operations will leave multiple records for the same path. These need to be reduced to one record using aggregate operators.

Let us describe any path in the TC as $(node_i, node_j, path_{min}, path_{max})$, with the path starting at i and ending at j. The new edge being inserted is $(node_a, node_b, 1, 1)$.

The set of nodes that have paths that end at node a is given by:

$$A = \{node_i, path_{min}, path_{max} \mid TC(node_j) = a\} \tag{3.6}$$

The set of nodes with a path that starts at node b is given by:

$$B = \{node_j, path_{min}, path_{max} \mid TC(node_i) = b\} \tag{3.7}$$

If the transitive closure is correctly constructed, there will be one tuple per node pair and the tuple will contain the maximum and minimum path lengths between the node pair. A will then only contain one tuple per node that is joined to a and that tuple will contain the maximum and minimum lengths to a. Similar arguments apply to set B.

The inserted edge starts at a single node, so the set of nodes the inserted edge starts at has one entry, a; the minimum and maximum path lengths of the inserted edge are both one. Similarly, the set of nodes the inserted edge ends at has one entry, b.

The new paths created when the edge $(node_a, node_b, 1, 1)$ is added can be found using the union of Cartesian products between the node sets described above:

$$\begin{aligned}
P_1 &= (A \times b,\ A_{pathmin} + b_{pathmin},\ A_{pathmax} + b_{pathmax}) \\
&= (A \times b,\ A_{pathmin} + 1,\ A_{pathmax} + 1) \\
P_2 &= (a \times B,\ B_{pathmin} + a_{pathmin},\ B_{pathmax} + a_{pathmax}) \\
&= (a \times B,\ B_{pathmin} + 1,\ B_{pathmax} + 1) \\
P_3 &= (A \times B,\ A_{pathmin} + B_{pathmin} + 1,\ A_{pathmax} + B_{pathmax} + 1) \\
P_4 &= (a, b, 1, 1) \\
PS &= P_1 \cup P_2 \cup P_3 \cup P_4
\end{aligned} \tag{3.8}$$

The new TC is the union between the new paths and the old. There will be multiple entries for each path if there are paths with different lengths.

$$TC_{possible} = TC_{old} \cup PS$$

The common paths with different lengths have to be reduced to one using aggregate functions. The minimum path length will be the minimum path found in the group with a common path, and the maximum path length will be the maximum path found in the group with a common path. The maximum and minimum paths may come from different tuples within the group.
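The insertion with path lengths, including the final aggregation step, can be sketched as below. As an assumption, the closure is held as a dictionary keyed on the node pair, so the "group by" is the dictionary key and the aggregate is a `min`/`max` merge; the function name `insert_edge_with_lengths` is hypothetical and the edges come from the figure 2.2 pedigree.

```python
def insert_edge_with_lengths(tc, a, b):
    """Return the closure after inserting edge (a, b), tracking path lengths.

    tc: dict mapping (from, to) -> (path_min, path_max).
    """
    set_a = {i: lens for (i, j), lens in tc.items() if j == a}  # equation 3.6
    set_b = {j: lens for (i, j), lens in tc.items() if i == b}  # equation 3.7
    candidates = [((a, b), (1, 1))]                                   # P4
    candidates += [((i, b), (lo + 1, hi + 1))
                   for i, (lo, hi) in set_a.items()]                  # P1
    candidates += [((a, j), (lo + 1, hi + 1))
                   for j, (lo, hi) in set_b.items()]                  # P2
    candidates += [((i, j), (lo_a + lo_b + 1, hi_a + hi_b + 1))
                   for i, (lo_a, hi_a) in set_a.items()
                   for j, (lo_b, hi_b) in set_b.items()]              # P3
    new_tc = dict(tc)
    for pair, (lo, hi) in candidates:
        if pair in new_tc:
            # Aggregate step: keep the smallest minimum and largest maximum.
            old_lo, old_hi = new_tc[pair]
            new_tc[pair] = (min(lo, old_lo), max(hi, old_hi))
        else:
            new_tc[pair] = (lo, hi)
    return new_tc


EDGES = [(5, 6), (1, 3), (4, 5), (2, 6), (3, 5), (1, 4), (2, 3), ("Unknown", 4)]
CLOSURE = {}
for parent, child in EDGES:
    CLOSURE = insert_edge_with_lengths(CLOSURE, parent, child)
```

For the example pedigree the pair (2, 6) ends up with a minimum of one edge and a maximum of three, matching figure 3.3, and the number of offspring of an animal is the number of its pairs with a minimum of one.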

3.4.6 Building a transitive closure from a depth first search

In section 3.4.7, it is mentioned that the transitive closure can be generated from node codes. Node codes can be built from a depth first search of an adjacency list. We have chosen to maintain the transitive closure using incremental maintenance. However, the system effectively keeps the adjacency list, as the animal numbers of the parents are kept in the animal record to speed up the generation of animal data. This list could be used to generate the node codes and check the incrementally maintained transitive closure.

3.4.7 Alternative to using a transitive closure

We have chosen to represent the pedigree as a transitive closure, but alternative methods have also been developed.

Adjacency list

An adjacency list has pointers that point to the next record in the chain. An adjacency list is another name for an edge table. The list contains a record for every edge, and the record contains a start node and an end node. The data is complete, but no structure information is present, and hence, to find ancestors and descendants, links have to be followed using multiple queries. As mentioned above, the application effectively maintains an adjacency list, as the sire and dam animal numbers are kept in the animal record.

Node codes

Node codes have been proposed by Elliott, Akgul, Mayes, & Ozsoyoglu, 2007as a method of representing pedigrees. A node code table is larger than atransitive closure with all paths along with the traversed edges recorded in astring. Each node code is recorded in a table along with an animal identity.

The node codes for the example pedigree ( see figure 2.2) are shown infigure 3.4. The foundation animals are given codes 0 to n from left to right.The codes for the offspring of a node are given in sibling order. The nodecodes record all paths that can be taken from the foundation animals to the

42

node, and a separator records the sex of the node traversed. Elliott et al.,2007 uses ”.“,”,“ and ”;“ to donate female, male and don’t know.

Unknown 1 2

4 3

5

6

Animal Codes Animal Codes1 0 5 0.0.02 1 0.1,03 0.1 1,0,0

1,0 6 0.0.0.04 0.0 1,1

Figure 3.4: Node codes

If an offspring is inserted into thedatabase, the node code of the offspringis created by copying the parent nodecodes, appending a separator based onthe parent’s sex and then appending theoffspring code. The insertion of a leaf issimple, but the insertion of an ancestor isa little bit more problematic as the codesfor all offspring have to be altered. El-liott et al., 2007 allocated the codes byperforming a depth first search on a pedi-gree that has been built and encoded us-ing an adjacency list. The developmentof algorithms for incremental additions tothe pedigree and concurrent maintenanceof the node code table would be an inter-esting exercise.

Nodes that have common ancestors have node codes with a common prefix. This fact can be used to generate the inbreeding coefficient using the node codes for a particular animal.

The selection of ancestors and offspring is similar to the finding of the same data using a transitive closure. Offspring are selected by selecting node codes that have a prefix in common with the ancestor of interest. An ancestor may have many node codes but all of these prefixes will be propagated to the children with the path taken from ancestor to offspring encoded in the offspring’s suffix.

All the ancestors can be found by selecting records with node codes that have the same prefix strings as the node codes of the animal of interest.
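The leaf insertion and prefix selection described above can be sketched in Python. This is only an illustration under stated assumptions: the helper names `offspring_codes` and `is_descendant_code` are hypothetical, and the caller is assumed to supply the sibling index.

```python
# Separators as reported for Elliott et al., 2007: female, male, don't know.
SEPARATORS = {"female": ".", "male": ",", "unknown": ";"}


def offspring_codes(parent_codes, parent_sex, sibling_index):
    """Build the codes an offspring inherits from one parent."""
    sep = SEPARATORS[parent_sex]
    return [f"{code}{sep}{sibling_index}" for code in parent_codes]


def is_descendant_code(code, ancestor_code):
    """True if `code` extends `ancestor_code` by at least one separator step."""
    return (code.startswith(ancestor_code)
            and len(code) > len(ancestor_code)
            and code[len(ancestor_code)] in ".,;")


# Animal 3 of figure 3.4 is male and has the codes 0.1 and 1,0.
print(offspring_codes(["0.1", "1,0"], "male", 0))
print(is_descendant_code("0.1,0", "0.1"))
```

The separator check after the prefix matters: without it, a code starting with "10" would wrongly match an ancestor whose code is "1".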

Changing node codes to a transitive closure

As mentioned in the introduction to this section, node codes record all possible paths to a node. From the node codes, we can determine the start node, end node and number of edges. If the maximum paths for all node pairs is determined and placed in a table along with the start nodes and end nodes we will have the transitive closure with the maximum path lengths. Such a table along with the method described in section 2.12 is an alternative method to that described in Elliott et al., 2007 for the calculation of inbreeding coefficients.
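The conversion can be sketched in Python using the codes as they appear in figure 3.4. The function name `closure_with_max_depth` and the dictionary layout are assumptions for illustration; the sketch also assumes that every prefix of a node code is itself the node code of some animal in the table.

```python
import re

# Node codes as listed in figure 3.4: animal -> list of codes.
node_codes = {
    1: ["0"],
    2: ["1"],
    3: ["0.1", "1,0"],
    4: ["0.0"],
    5: ["0.0.0", "0.1,0", "1,0,0"],
    6: ["0.0.0.0", "1,1"],
}


def closure_with_max_depth(node_codes):
    """Return {(from_node, to_node): max_depth} derived from the node codes."""
    code_owner = {code: animal
                  for animal, codes in node_codes.items() for code in codes}
    closure = {}
    for animal, codes in node_codes.items():
        for code in codes:
            # Positions of the separators in this code.
            seps = [m.start() for m in re.finditer(r"[.,;]", code)]
            for i, pos in enumerate(seps):
                ancestor = code_owner[code[:pos]]  # prefix up to this separator
                depth = len(seps) - i              # edges from ancestor to animal
                key = (ancestor, animal)
                closure[key] = max(closure.get(key, 0), depth)
    return closure


closure = closure_with_max_depth(node_codes)
print(closure[(1, 6)], closure[(2, 6)])
```

Each separator in a code marks one edge, so the number of separators after a prefix is the length of that particular path from the ancestor to the animal.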

3.4.8 Building the transitive closure one herd at a time

It was found that edge insertion and deletion times become excessive when the pedigree is greater than ten thousand animals; this was degrading the user experience for large and small herds. The decision was taken to incrementally maintain the transitive closure for a user’s herd and update the links between all herds (the main industry transitive closure) in the background. This section details the steps that need to be taken to support such a strategy.

Which transitive closure should be used?

A user tends to use herd data more often than he or she updates it. The industry wide herd provides industry wide details of an animal’s progeny and ancestors so it is a preferable transitive closure to use. The system will only use the herd transitive closure for animal display if it contains updates that have not been transferred to the full herd.

Updating the main transitive closure

When a link is inserted or deleted from the local herd it is added to a table (the main update table) of updates that are to be inserted or deleted from the main transitive closure. The main transitive closure is up to date when the main update table contains no updates for the local herd.
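The bookkeeping described in the last two subsections can be sketched in Python. The queue `main_update` and every function name below are hypothetical stand-ins for the main update table and the background process; the real application keeps this state in database tables.

```python
from collections import deque

# Hypothetical stand-in for the main update table: queued edge changes.
main_update = deque()


def record_local_change(herd, action, parent, child):
    """Queue an edge insert or delete for later transfer to the main closure."""
    main_update.append({"herd": herd, "action": action,
                        "parent": parent, "child": child})


def main_closure_is_current(herd):
    """The main closure is up to date for a herd with no queued updates."""
    return not any(u["herd"] == herd for u in main_update)


def closure_to_use(herd):
    """Prefer the industry-wide closure unless the herd has pending updates."""
    return "main" if main_closure_is_current(herd) else "herd"


def background_transfer(apply_update):
    """Drain the queue, applying each update to the main closure in order."""
    while main_update:
        apply_update(main_update.popleft())
```

A caller would pass `background_transfer` a function that performs the edge insert or delete on the main transitive closure.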

3.5 Conclusion

It is possible to incrementally maintain the transitive closure, the minimum path length and the maximum path length. Node codes as used by Elliott et al., 2007 were introduced and their relationship to the transitive closure considered. The next section considers the implementation of the algorithms introduced in this section using PostgreSQL. As mentioned previously, chapter 5 will consider how the transitive closure is used within the application.


Chapter 4

Incremental maintenance of a transitive closure using SQL


This chapter looks at using PostgreSQL version 7.1 and version 8.0 to implement the incremental maintenance of a transitive closure. The theory underpinning this code is discussed in chapter 3.

4.1 PostgreSQL transitive closure edge insert

Writing code that utilized the improvements delivered with PostgreSQL version 8 did not improve the performance of the edge insert code significantly, so the version 7.1 code is used in version 8 systems.

As the code maintains the maximum path length, this code extends the work presented by Pang et al., 2005.

4.1.1 PostgreSQL version 7.1 code

TRUNCATE TABLE tc_temp;
TRUNCATE TABLE deltas;

Possible paths created by the insertion of the edge are contained in tc_temp. This includes paths that already exist in the transitive closure, that is, it includes the intersect set. The intersect set is the set of paths that exists in tc_temp and the transitive closure.

INSERT INTO tc_temp (
    SELECT from_node, child_node AS to_node,
           max_depth+1 AS max_depth,
           min_depth+1 AS min_depth
    FROM parent_child_tc WHERE to_node=parent_node
  UNION
    SELECT parent_node AS from_node, to_node,
           max_depth+1 AS max_depth,
           min_depth+1 AS min_depth
    FROM parent_child_tc WHERE from_node=child_node
  UNION
    SELECT TC1.from_node AS from_node, TC2.to_node AS to_node,
           ((TC1.max_depth)+(TC2.max_depth)+1) AS max_depth,
           ((TC1.min_depth)+(TC2.min_depth)+1) AS min_depth
    FROM parent_child_tc AS TC1, parent_child_tc AS TC2
    WHERE TC1.to_node=parent_node AND TC2.from_node=child_node
  UNION
    SELECT parent_node AS from_node, child_node AS to_node,
           1 AS max_depth, 1 AS min_depth
);

Remove the intersect set to leave the new paths, and only the new paths, in deltas.


INSERT INTO deltas (
    SELECT * FROM tc_temp WHERE NOT EXISTS (
        SELECT * FROM parent_child_tc
        WHERE parent_child_tc.from_node=tc_temp.from_node AND
              parent_child_tc.to_node=tc_temp.to_node
    )
);

Insert the new paths into the transitive closure

INSERT INTO parent_child_tc SELECT * FROM deltas;

The paths in tc_temp that were already in the transitive closure can have different path lengths. We make sure the maximum path length is set to the maximum path found in the set with the same start and end node.

UPDATE parent_child_tc SET max_depth=tc_temp.max_depth FROM tc_temp
WHERE parent_child_tc.from_node=tc_temp.from_node AND
      parent_child_tc.to_node=tc_temp.to_node AND
      parent_child_tc.max_depth < tc_temp.max_depth;

And we make sure the minimum path is set to the minimum path.

UPDATE parent_child_tc SET min_depth=tc_temp.min_depth FROM tc_temp
WHERE parent_child_tc.from_node=tc_temp.from_node AND
      parent_child_tc.to_node=tc_temp.to_node AND
      parent_child_tc.min_depth > tc_temp.min_depth;
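The same sequence of steps can be followed in memory with a small Python sketch. It is not the application code: the dictionary `tc`, keyed by `(from_node, to_node)` and holding `(min_depth, max_depth)`, and the function name `insert_edge` are assumptions for illustration, and the pedigree is assumed to be acyclic.

```python
def insert_edge(tc, parent, child):
    """Insert edge parent -> child into tc = {(from, to): (min_depth, max_depth)}."""
    up = [(a, d) for (a, t), d in tc.items() if t == parent]   # paths ending at parent
    down = [(t, d) for (f, t), d in tc.items() if f == child]  # paths starting at child

    # Candidate paths created by the new edge (the role played by tc_temp).
    candidates = {(parent, child): (1, 1)}
    for a, (amin, amax) in up:
        candidates[(a, child)] = (amin + 1, amax + 1)
    for t, (dmin, dmax) in down:
        candidates[(parent, t)] = (dmin + 1, dmax + 1)
    for a, (amin, amax) in up:
        for t, (dmin, dmax) in down:
            candidates[(a, t)] = (amin + dmin + 1, amax + dmax + 1)

    for key, (cmin, cmax) in candidates.items():
        if key in tc:
            # Path already known: widen the recorded minimum and maximum.
            old_min, old_max = tc[key]
            tc[key] = (min(old_min, cmin), max(old_max, cmax))
        else:
            # New path: the role played by the deltas table.
            tc[key] = (cmin, cmax)
    return tc


tc = {}
for p, c in [(1, 3), (2, 3), (1, 4), (4, 5), (3, 5), (1, 5)]:
    insert_edge(tc, p, c)
print(tc[(1, 5)])
```

In this hypothetical pedigree animal 1 reaches animal 5 directly and through animals 3 and 4, so the pair carries both a minimum and a larger maximum path length.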

4.2 PostgreSQL transitive closure edge unlink

The code presented maintains the minimum and maximum path length between nodes. Pang et al., 2005 present code to maintain the minimum path length; the code presented here extends that work in the following ways. The code doesn’t delete edges from the transitive closure when “SUSPECT” is removed; this is possible because the minimum path length is maintained. The code uses “SUSPECT” and the remaining transitive closure after “SUSPECT” is removed (“TRUSTY” in version 7.1) to generate a fragment table that contains all paths that may be useful in the regeneration of lost paths. If many paths don’t go through the deleted edge, this is faster than regenerating the lost paths directly from trusted paths. The code maintains the maximum path length as well as the minimum path length.

The edge being deleted is from parent_node to child_node. The transitive closure table has the columns: from_node, to_node, min_depth and max_depth.


PostgreSQL version 8 code

Truncating the table is faster than creating it. TRUNCATE is a PostgreSQL extension.

TRUNCATE TABLE SUSPECT, fragments, delta ;

Paths that go through the link to be deleted are added to the set “SUSPECT”. Paths in “SUSPECT” will be removed if the minimum path length is greater than one.

INSERT INTO SUSPECT (
    SELECT X.from_node AS from_node, Y.to_node AS to_node
    FROM parent_child_tc AS X, parent_child_tc AS Y
    WHERE X.to_node=parent_node AND Y.from_node=child_node
  UNION
    SELECT X.from_node AS from_node, child_node AS to_node
    FROM parent_child_tc AS X
    WHERE X.to_node=parent_node
  UNION
    SELECT parent_node AS from_node, X.to_node AS to_node
    FROM parent_child_tc AS X
    WHERE X.from_node=child_node
  UNION
    SELECT parent_node, child_node
);
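The contents of “SUSPECT” can be stated compactly in Python. The sketch below is an illustration only; the function name `suspect_paths` is hypothetical, and `tc` is assumed to be any collection of `(from_node, to_node)` pairs (for example the keys of a dictionary holding the closure).

```python
def suspect_paths(tc, parent, child):
    """Return the (from, to) pairs that may pass through edge parent -> child."""
    # Every node that reaches the parent, plus the parent itself.
    starts = {f for (f, t) in tc if t == parent} | {parent}
    # Every node reached from the child, plus the child itself.
    ends = {t for (f, t) in tc if f == child} | {child}
    return {(s, e) for s in starts for e in ends}


# Hypothetical closure for a small pedigree; only the node pairs are needed.
pairs = {(1, 3), (2, 3), (1, 4), (4, 5), (3, 5), (1, 5), (2, 5)}
print(sorted(suspect_paths(pairs, 3, 5)))
```

In this example the pair (1, 5) is suspect even though animal 1 still reaches animal 5 through animal 4, which is why lost paths have to be regenerated after the suspect paths are removed.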

Deleting is faster than generating a new table. We delete from parent_child_tc any path in “SUSPECT” that has a minimum path length greater than one. We also remove the edge that is being deleted.

DELETE FROM parent_child_tc AS tc
USING SUSPECT AS sp
WHERE (
    ((sp.from_node=tc.from_node) AND
     (sp.to_node=tc.to_node) AND
     (tc.min_depth<>1)) OR
    (tc.from_node=parent_node AND tc.to_node=child_node)
);

We didn’t delete anything from the transitive closure if it has a minimum path length equal to one (other than the edge being deleted). This leaves in the transitive closure entries with a minimum path of one, but an unknown maximum path length. To deal with this, we set the maximum path length to one if a record with the same start and end node is in “SUSPECT”. We don’t have to test the minimum path length as the previous code will have removed paths that are in the “SUSPECT” set if the minimum path length was greater than one.

UPDATE parent_child_tc SET max_depth = 1 FROM SUSPECT
WHERE (parent_child_tc.from_node=SUSPECT.from_node) AND
      (parent_child_tc.to_node=SUSPECT.to_node);

Fragment is a subset of the transitive closure that can regenerate paths that should not have been deleted. The idea is simple; if the remaining path in the transitive closure doesn’t have a start and end that is in the deleted set, the path can’t be helpful in regenerating paths that should not have been lost. Finding this subset takes less time than working with the full “TRUSTY” set when regenerating the paths.

INSERT INTO fragments (
    SELECT pc.from_node,pc.to_node,pc.min_depth,pc.max_depth
    FROM parent_child_tc AS pc JOIN SUSPECT AS sp
         ON (sp.from_node=pc.from_node)
  UNION
    SELECT pc.from_node,pc.to_node,pc.min_depth,pc.max_depth
    FROM parent_child_tc AS pc JOIN SUSPECT AS sp
         ON (sp.to_node=pc.to_node)
);

Using the fragment set we construct delta; this is the set of paths that may have been lost that should not have been. This will contain paths that are already in the transitive closure. The additional paths are deleted in the next step. Three self joins of fragments are required to recover all possible paths lost (see section 3.4.3).

INSERT INTO delta (
    SELECT fr1.from_node,fr2.to_node,
           (fr1.min_depth+fr2.min_depth) AS min_depth,
           (fr1.max_depth+fr2.max_depth) AS max_depth
    FROM fragments AS fr1, fragments AS fr2
    WHERE fr1.to_node=fr2.from_node
  UNION
    SELECT fr1.from_node,fr3.to_node,
           (fr1.min_depth+1+fr3.min_depth) AS min_depth,
           (fr1.max_depth+1+fr3.min_depth) AS max_depth
    FROM fragments AS fr1, fragments AS fr2, fragments AS fr3
    WHERE (fr1.to_node=fr2.from_node) AND
          (fr2.to_node=fr3.from_node)
);

All we want is the paths that should not have been lost.

DELETE FROM delta
USING parent_child_tc AS pc
WHERE (
    (delta.from_node=pc.from_node) AND
    (delta.to_node=pc.to_node)
);

Among the lost paths there can be multiple paths between the same pair of nodes, but we are only interested in the shortest and longest path of each group (common start and end node).

INSERT INTO parent_child_tc (
    SELECT a.from_node,a.to_node,min(a.min_depth),max(a.max_depth)
    FROM delta AS a
    GROUP BY a.from_node,a.to_node
);
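When testing an incremental delete on a small pedigree, the result can be compared with a closure recomputed from the remaining edges. The Python sketch below is a brute-force reference for that purpose, not the incremental method described in this chapter; the function name `closure_from_edges` is hypothetical, and the pedigree is assumed to be a small acyclic graph, since every path is enumerated.

```python
def closure_from_edges(edges):
    """Recompute {(from, to): (min_depth, max_depth)} from scratch for a DAG."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    closure = {}

    def walk(start, node, depth):
        # Visit every path leaving `start`, recording its length at each node.
        for child in children.get(node, []):
            d = depth + 1
            old = closure.get((start, child))
            if old is None:
                closure[(start, child)] = (d, d)
            else:
                closure[(start, child)] = (min(old[0], d), max(old[1], d))
            walk(start, child, d)

    for start in list(children):
        walk(start, start, 0)
    return closure


edges = [(1, 3), (2, 3), (1, 4), (4, 5), (3, 5), (1, 5)]
before = closure_from_edges(edges)
after = closure_from_edges([e for e in edges if e != (1, 5)])
print(before[(1, 5)], after[(1, 5)])
```

Removing the direct edge from animal 1 to animal 5 leaves the pair in the closure, because the longer paths through animals 3 and 4 remain.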

PostgreSQL version 7.1 code

Version 7.1 doesn’t support the truncation of multiple tables with one command.

TRUNCATE TABLE SUSPECT;
TRUNCATE TABLE fragments;
TRUNCATE TABLE delta;
TRUNCATE TABLE TRUSTY;
TRUNCATE TABLE tc_new;

Version 7.1 and version 8.0 create “SUSPECT” using the same strategy.

INSERT INTO SUSPECT (
    SELECT X.from_node AS from_node, Y.to_node AS to_node
    FROM parent_child_tc AS X, parent_child_tc AS Y
    WHERE X.to_node=parent_node AND Y.from_node=child_node
  UNION
    SELECT X.from_node AS from_node, child_node AS to_node
    FROM parent_child_tc AS X
    WHERE X.to_node=parent_node
  UNION
    SELECT parent_node AS from_node, X.to_node AS to_node
    FROM parent_child_tc AS X
    WHERE X.from_node=child_node
  UNION
    SELECT parent_node, child_node
);

Version 8 creates the trusted set by deleting “SUSPECT” from the transitive closure. The DELETE statement in version 7.1 is not as well developed, so instead a “TRUSTY” table has to be created. A left outer join is used instead of the NOT EXISTS code used in the code presented by Guozhu et al., 1999 because the 7.1 optimizer is not smart enough to convert the query used by NOT EXISTS to a left outer join. Instead it executes the inner query in a loop, resulting in very poor performance. This is not a universal problem with version 7.1, as the NOT EXISTS clause used in the edge insert code was correctly optimized. The problem disappeared in version 8.

INSERT INTO TRUSTY (
    SELECT parent_child_tc.from_node,parent_child_tc.to_node,
           parent_child_tc.min_depth,parent_child_tc.max_depth
    FROM parent_child_tc LEFT OUTER JOIN SUSPECT ON (
        SUSPECT.from_node=parent_child_tc.from_node AND
        SUSPECT.to_node=parent_child_tc.to_node
    )
    WHERE (SUSPECT.from_node IS null) AND
          (parent_child_tc.min_depth<>1)
  UNION
    SELECT parent_child_tc.from_node,parent_child_tc.to_node,1,1
    FROM parent_child_tc
    WHERE (parent_child_tc.min_depth=1) AND
          (NOT(parent_child_tc.from_node=parent_node AND
               parent_child_tc.to_node=child_node)
          )
);

The version 7.1 fragment set is created in a similar manner to the version 8 set. The version 8 code uses the reduced transitive closure set created by deleting “SUSPECT” from the original transitive closure, while the version 7.1 code uses the “TRUSTY” table created above.

INSERT INTO fragments (
    SELECT pc.from_node,pc.to_node,pc.min_depth,pc.max_depth
    FROM TRUSTY AS pc JOIN SUSPECT AS sp ON
         (sp.from_node=pc.from_node)
  UNION
    SELECT pc.from_node,pc.to_node,pc.min_depth,pc.max_depth
    FROM TRUSTY AS pc JOIN SUSPECT AS sp ON
         (sp.to_node=pc.to_node)
);

Version 8 code created delta and then deleted paths that were already in the transitive closure that remained after “SUSPECT” was removed. Because DELETE is not as well developed, version 7.1 creates a separate table tc_new and then only puts into delta those paths not in “TRUSTY”. The version 7.1 code to create tc_new is similar to the version 8 code that creates the initial delta.

INSERT INTO tc_new (
    SELECT fr1.from_node,fr2.to_node,
           (fr1.min_depth+fr2.min_depth) AS min_depth,
           (fr1.max_depth+fr2.max_depth) AS max_depth
    FROM fragments AS fr1, fragments AS fr2
    WHERE fr1.to_node=fr2.from_node
  UNION
    SELECT fr1.from_node,fr3.to_node,
           (fr1.min_depth+1+fr3.min_depth) AS min_depth,
           (fr1.max_depth+1+fr3.min_depth) AS max_depth
    FROM fragments AS fr1, fragments AS fr2, fragments AS fr3
    WHERE (fr1.to_node=fr2.from_node) AND
          (fr2.to_node=fr3.from_node)
);

In version 8, trusted paths are deleted from delta, while in version 7.1, paths in tc_new that are not in “TRUSTY” are put in delta. Once again, we use a left outer join because NOT EXISTS is not being converted to an outer join by the version 7.1 optimizer.

INSERT INTO delta (
    SELECT tc_new.from_node,tc_new.to_node,
           tc_new.max_depth,tc_new.min_depth
    FROM tc_new LEFT OUTER JOIN TRUSTY ON
         (TRUSTY.from_node=tc_new.from_node AND
          TRUSTY.to_node=tc_new.to_node)
    WHERE TRUSTY.from_node IS null
);


Because DELETE is not as well developed, parent_child_tc is rebuilt from scratch.

TRUNCATE parent_child_tc;

The new parent_child_tc contains “TRUSTY” and a single entry for each path found in delta, with the entry containing the maximum and minimum path lengths within the group.

INSERT INTO parent_child_tc
    (SELECT * FROM TRUSTY)
  UNION (
    SELECT a.from_node,a.to_node,min(a.min_depth),max(a.max_depth)
    FROM delta AS a
    GROUP BY a.from_node,a.to_node
);

4.3 Conclusion

We have considered the PostgreSQL version 7.1 and version 8.0 code needed to implement the algorithms presented in chapter 3. Different algorithms are used for each PostgreSQL version because of PostgreSQL version 7.1 limitations. Different SQL databases offer slightly different versions of SQL, so the code presented here may require modification if different database systems are used. The next chapter looks at how the transitive closure is used within the application, and the SQL code needed to implement the desired functionality.


Chapter 5

Using the transitive closure, the SQL code


This chapter discusses the various ways the transitive closure has been used to meet the application goals.

5.1 Introduction to database table structure

To understand the following code, some of the database table structure needs to be introduced (see figure 5.1). Users of Merrrit are allowed to build pedigrees backwards; that is, the user is allowed to enter data for recent animals and then add data for older animals as it becomes available. Older data is often less accessible because it is stored on obsolete systems, or because it is in paper form requiring additional work to convert it into electronic form. When the system is used in this way, the identification of older animals is discovered when the sire and dam are loaded as data within an animal record. In this situation the only information available is the animal’s identity and sex (a sire is male, a dam female).

When a parent is identified and it is not in the database, an animal_select record is created, but an animal record is not. The animal_select record contains the internal animal identification (created by the system), the tag data used by the farmer to identify the animal, and the sex.

The animal record can identify four other animals: the sire, the biological mother, the birth mother and the raising mother. When embryo transfer has occurred the biological mother and birth mother will be different. The raising mother can be different when offspring are mismothered. The sire, dam, birth mother and raising mother are recorded in the animal record using internal numbers; external identification data is only stored in the select record.

The parent_child_tc contains an entry if there is a genetic path between two animals. The longest path linking the two animals, and the shortest path, are maintained as part of the record.

5.2 Valid birth date

Merrrit allows the user to build the pedigree backwards. Animals are added to the database when input data specifies an unknown animal as a sire or dam. The birth dates are set only if a record giving that data is found. The birth date of a parent has to be earlier than the birth date of descendants. To enforce this restriction, the transitive closure is used to collect the earliest birth date of all descendants. The earliest birth date is obtained with the following code.

[Figure 5.1: diagram of the animal_select table (id, tag details, sex) and the animal table (id, more data)]

Table          Description
animal_select  The animal identification and sex. Created when a record is
               required for a parent or for an animal whose data is being
               entered. When this record is created, an internal animal
               number is allocated by the system (animal id).
animal         All other animal details. This record is only created if the
               animal is referenced in some way other than being a parent.
               That is, not all animals have an animal record, and not all
               animals have a birth date.

Figure 5.1: Animal record structure

SELECT min(an.birth_date) FROM parent_child_tc AS tc
LEFT JOIN animals AS an ON tc.to_node = an.animal
WHERE tc.from_node = 'animal_of_interest' AND
      tc.min_depth = 1;

The PostgreSQL version 7.1 optimizer has problems with the following code. The optimizer fails to note that the select can be converted to a join; the inner select ends up in a loop and the performance is terrible.

SELECT min(birth_date) FROM animals
WHERE (animals.animal IN
    ( SELECT to_node AS animal
      FROM parent_child_tc
      WHERE from_node='animal_of_interest'
    )
);

Merrrit users can also build the database in a forward direction. If this is done, we must check that the new animal is not being born before its parents. The following code selects the latest birth date of the parents.

SELECT max(an.birth_date) FROM parent_child_tc AS tc
LEFT JOIN animals AS an ON tc.from_node = an.animal
WHERE tc.to_node = 'animal_of_interest' AND tc.min_depth = 1;
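How the two query results might be used to accept or reject a proposed birth date can be sketched in Python. The function name `birth_date_is_valid` and its arguments are assumptions for illustration; the lists stand in for the birth dates returned for the offspring and the parents, with `None` for animals that have no recorded birth date.

```python
from datetime import date


def birth_date_is_valid(proposed, offspring_dates, parent_dates):
    """Check a proposed birth date against known offspring and parent dates.

    Unknown dates are passed as None and ignored, mirroring the way the SQL
    min() and max() aggregates skip NULL values.
    """
    known_offspring = [d for d in offspring_dates if d is not None]
    known_parents = [d for d in parent_dates if d is not None]
    if known_offspring and proposed >= min(known_offspring):
        return False  # a parent must be born before its earliest offspring
    if known_parents and proposed <= max(known_parents):
        return False  # an animal must be born after its latest-born parent
    return True


print(birth_date_is_valid(date(2005, 9, 1),
                          [date(2007, 9, 1), None],
                          [date(2003, 9, 1)]))
```

When neither list holds a known date the check passes, which matches a pedigree in which birth dates are filled in only as records become available.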

5.3 Can’t change the sex of an animal with descendants

When creating an animal_select record because the animal was specified as a sire or a dam in an offspring’s record, we set the sex; a dam is female and a buck is male. Input records seen later cannot change the sex without corrupting the pedigree. In other words, if there are no descendants the animal’s sex can be changed; if there are, it cannot.

The following code counts the number of descendants.

SELECT count(*) FROM parent_child_tc AS tc
WHERE tc.from_node='animal_of_interest';


5.4 Calculation of the inbreeding coefficient

To calculate the inbreeding coefficient, we need a vector of ancestors from the first animal to enter the pedigree to the most recent. This was discussed in section 2.12.

The short piece of code below performs this task. In the following code the animal we are interested in is referred to as the animal_of_interest. The transitive closure contains a from_node and a to_node. The from_node is the ancestor node, the to_node is the descendant node. If the to_node equals the animal_of_interest, the from_node gives the animal number of an ancestor. The max_depth data tells how many generations back the animal entered the pedigree. The min_depth data tells how recently the animal appears in the pedigree.

To get the records in order, the result is sorted on the depth field, deepest first. The LEFT OUTER JOIN with the animal table picks up the sire and dam animal records, if they exist. The SELECT after the UNION is picking up the details for the animal we want the information on. An animal does not have a transitive closure entry pointing to itself.

SELECT tc.from_node,an.sire,an.dam,tc.max_depth
FROM parent_child_tc AS tc
LEFT OUTER JOIN animals AS an
     ON tc.from_node = an.animal
WHERE (tc.to_node='animal_of_interest') AND
      (tc.max_depth < 'max_depth')
UNION
SELECT ans.animal AS from_node,an.sire,an.dam,0 AS max_depth
FROM animal_select AS ans
LEFT OUTER JOIN animals AS an
     ON (ans.animal = an.animal)
WHERE ans.animal='animal_of_interest'
ORDER BY max_depth DESC;
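Sorting on max_depth, deepest first, places every ancestor ahead of its own descendants, because the longest path from an ancestor to the animal of interest is always longer than the longest path from any animal that lies between them. The Python sketch below checks this property on a hypothetical result set; the function names and the sample rows are assumptions for illustration.

```python
def order_ancestors(rows):
    """Sort (animal, sire, dam, max_depth) rows deepest first."""
    return sorted(rows, key=lambda row: row[3], reverse=True)


def parents_precede_offspring(ordered):
    """Check that no animal is listed before either of its listed parents."""
    position = {row[0]: i for i, row in enumerate(ordered)}
    for animal, sire, dam, _ in ordered:
        for parent in (sire, dam):
            if parent in position and position[parent] > position[animal]:
                return False
    return True


# Hypothetical rows for animal 6: (animal, sire, dam, max_depth).
rows = [
    (6, 2, 5, 0),
    (5, 3, 4, 1),
    (3, 2, 1, 2),
    (4, None, 1, 2),
    (1, None, None, 3),
    (2, None, None, 3),
]
ordered = order_ancestors(rows)
print([row[0] for row in ordered], parents_precede_offspring(ordered))
```

Animal 2 is both a parent and a great-grandparent of animal 6 in this example, and it is the maximum path length that places it among the earliest entries.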

5.5 Descendant records

To display the descendants, we need data from the descendant records. We could follow the links in the animal records and read in new records as the descendants are discovered (this would require a depth first search). It is faster to use the transitive closure and read the entire set with one database read request. Care must be taken when using this code as memory usage can be a problem if an animal has too many descendants. This issue is discussed further in section 6.2.

SELECT ans.animal,ans.sex,ans.birth_herd,an.sire,an.dam,
       ans.first_tag,ans.second_tag,ans.e_tag,ans.name
FROM parent_child_tc AS tc
LEFT OUTER JOIN animals AS an
     ON (tc.to_node = an.animal)
JOIN animal_select AS ans
     ON (tc.to_node = ans.animal)
WHERE (tc.from_node='animal_of_interest')
UNION
SELECT ans.animal,ans.sex,ans.birth_herd,an.sire,an.dam,
       ans.first_tag,ans.second_tag,ans.e_tag,ans.name
FROM animal_select AS ans
LEFT OUTER JOIN animals AS an
     ON (ans.animal=an.animal)
WHERE ans.animal='animal_of_interest';

5.6 Ancestor records

To display the pedigree, the ancestor animal_select records are required. To obtain this with one database read, the transitive closure is used to select the records.

SELECT ans.animal,ans.sex,ans.birth_herd,an.sire,an.dam,
       ans.first_tag,ans.second_tag,ans.e_tag,ans.name
FROM parent_child_tc AS tc
LEFT OUTER JOIN animals AS an
     ON (tc.from_node = an.animal)
JOIN animal_select AS ans ON (tc.from_node = ans.animal)
WHERE (tc.to_node='animal_of_interest') AND
      (tc.max_depth < 'max_depth')
UNION
SELECT ans.animal,ans.sex,ans.birth_herd,an.sire,an.dam,
       ans.first_tag,ans.second_tag,ans.e_tag,ans.name
FROM animal_select AS ans
LEFT OUTER JOIN animals AS an ON (ans.animal=an.animal)
WHERE ans.animal='animal_of_interest';


5.7 Conclusion

The transitive closure is used to maintain data integrity, to select the records needed to display the pedigree and descendants, and to generate the data structure needed to calculate the inbreeding coefficient. We now move on to a new topic, efficiently displaying the pedigree and descendants using a browser.


Chapter 6

Displaying pedigrees and descendants


A human pedigree aims to display relationships and siblings common to that relationship in birth order (Bennett et al., 1995). This creates complex structures that are difficult to display, and as a result there is a body of literature that considers the problem (Tores & Barillot, 2001). The flow of genetic material is important when displaying animal pedigrees, not relationships. The flow of genetic material to an individual diploid can be represented by a binary tree.

Figure 6.1: Partially opened Merrrit pedigree display for animal 4461

A survey of the animal genetics literature indicates animal pedigrees still tend to use the Greek symbols for Mars (♂) and Venus (♀) to denote male and female, with the standard symbols (circles and squares) used by human geneticists (Schott, 2005) not being favored.

The pedigree displayed by Merrrit revolves around an individual animal. One diagram shows the animal’s ancestors and another its offspring.

6.1 Ancestors

A tree structure is used to display the pedigree, with the animal of interest to the left and the ancestors to the right. Next to each animal node is a link that can be used to display the full animal record. The structure to be displayed is created using PHP classes from the pear package HTML_TreeMenu (Radi & Heyes, 2008). The PHP code that uses these classes links together objects (instances of the PHP class) to describe the desired menu. The printMenu() method then uses the resulting structure to output JavaScript that the browser uses to display the menu. The browser puts together the tree display using icons pointed to when the PHP class objects are added to the PHP structure described above.

Figure 6.1 shows an example pedigree. The arrows embedded in the graphics call JavaScript code when they are clicked, and are used to open and close pedigree branches. When the branch is open, the arrow points into the branch, and when the branch is closed (details not displayed) the arrow points down. The depth of the undisplayed pedigree can be determined using the “depth” data displayed as part of the animal’s URL.

The application code to create the PHP structure used to output the JavaScript is found in the file /class/pedigree.php. The method used to create a node in the tree is recursive, calling itself if the node has ancestors. The recursion returns an object which is linked to a new object that is then returned. In other words, it is a classic depth first search using recursive code. The recursive code also returns how deeply the recursion is nested, and the depth value returned is used as part of the data string sent to the browser as the link text.

To reduce the number of calls to the SQL server, the code uses the transitive closure to select all of an animal’s ancestor records in one read. The depth first search then becomes a memory operation that is reasonably fast.
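The in-memory depth first search can be sketched in Python. This is not the code in /class/pedigree.php and it does not use the HTML_TreeMenu classes; nested dictionaries stand in for the linked PHP objects, and the name `build_node` and the `records` layout (animal mapped to its sire and dam) are assumptions for illustration.

```python
def build_node(animal, records):
    """Return (node, depth) where node is a nested dict describing the pedigree."""
    sire, dam = records.get(animal, (None, None))
    branches = []
    depth = 0
    for parent in (sire, dam):
        if parent is not None:
            # Recurse into the parent; all records are already in memory.
            parent_node, parent_depth = build_node(parent, records)
            branches.append(parent_node)
            depth = max(depth, parent_depth + 1)
    node = {"animal": animal, "depth": depth, "ancestors": branches}
    return node, depth


# Hypothetical ancestor records read in one query: animal -> (sire, dam).
records = {6: (2, 5), 5: (3, 4), 3: (2, 1), 4: (None, 1)}
tree, depth = build_node(6, records)
print(depth, [branch["animal"] for branch in tree["ancestors"]])
```

The depth returned alongside each node is the value that could be shown next to a closed branch to indicate how much pedigree lies behind it.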

The pedigree downloaded to the browser contains all known ancestors, but only the top four branches are displayed open. This reduces the size of the initially displayed tree. If interested, the user can click on the arrow icons to open closed branches.

6.2 Descendants

A tree structure is used to display descendants with the animal of interest on the left and offspring to the right. The initial rendering displays all the offspring (see figure 6.2) but none of the offspring’s descendants. Arrows are added to the graphics if there are not too many offspring; these arrows may be used to open up branches that show the identity of the descendants of the offspring.

Animals early in the pedigree of a large herd can have many offspring. For example, one of the herds used to test the system, which has over 10000 records, had an early buck, KASHMERE NERO, that had considerable influence on that herd and had over 5000 descendants. Using the transitive closure to read all records needed to display all descendants can be dangerous, and if not constrained, the server may not have enough memory allocated to the task for record storage. Further, if communication speeds are moderate and the number of progeny is not restrained, downloading all progeny information to the browser can take considerable time.

The transitive closure is used to determine the number of descendants before the display of descendants is started. If there are too many descendants, the transitive closure is used to read only the offspring records whose minimum path length equals one and the offspring are displayed using the set of records thus read. Using natural breeding methods, the number of children is limited by the number of animals that can be sired in a year, by the number of offspring born per animal (for goats less than 200) and by the number of years the sire is sexually active (for goats less than 10), giving an upper limit of approximately 2000 records. With modern breeding methods, there is no guarantee that this solution will work under all circumstances.

[Figure 6.2, two panels. First panel: the offspring of the animal are initially rendered, with the branches detailing the descendants of the offspring closed. Second panel: opening branches displays additional descendants.]

Figure 6.2: Displaying descendants

If the number of descendants is within reasonable limits, all descendant records are selected using the transitive closure (see section 5.5). As with the code that displays the pedigree, a recursive routine located in the file /class/pedigree.php links class objects together to describe the tree, and the class method printMenu() then uses the structure to output JavaScript code to the browser. The JavaScript code is then used by the browser to display the descendant tree.
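The choice between reading every descendant and reading only the offspring can be sketched in Python. The function name `rows_to_display` and the `limit` parameter are hypothetical; the thesis does not state the threshold used by the application.

```python
def rows_to_display(tc_rows, animal, limit):
    """Pick the closure rows used to draw the descendant tree.

    tc_rows holds (from_node, to_node, min_depth, max_depth) tuples and
    `limit` is a hypothetical threshold on the number of descendants.
    """
    descendants = [r for r in tc_rows if r[0] == animal]
    if len(descendants) <= limit:
        return descendants  # small enough: read the full descendant set
    # Too many descendants: keep only the offspring (minimum path length one).
    return [r for r in descendants if r[2] == 1]


tc_rows = [(1, 3, 1, 1), (1, 4, 1, 1), (1, 5, 1, 2), (1, 6, 2, 3), (2, 3, 1, 1)]
print(len(rows_to_display(tc_rows, 1, 10)), len(rows_to_display(tc_rows, 1, 3)))
```

Counting first costs one inexpensive query on the transitive closure and avoids loading thousands of records that would not be shown in the initial rendering.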

The number of offspring and descendants of each animal node displayed is determined using the transitive closure and the results are included in the data displayed (see figure 6.2).


6.3 Conclusion

This chapter discusses using a browser to display an animal pedigree and the animal’s descendants. The next chapter concludes the thesis and discusses future research.


Chapter 7

Thesis conclusion and future research


7.1 Thesis conclusion

Previous work in this area has used strings that describe the genetic path between animals (node codes) to improve the performance of pedigree searches and the calculation of inbreeding coefficients. This application maintains the minimum and maximum path length, an adjacency list and a transitive closure. The structures used are maintained incrementally. If the above data structures are maintained, it is possible to enforce sensible birth dates in real time, calculate the inbreeding coefficient and present in real time the pedigree and descendant trees.

Incrementally maintaining a transitive closure is viable if the number of animals is less than ten thousand. If there are larger herds than this, then herd transitive closures can be maintained, with the industry herd updates occurring in the background.

It is possible to incrementally update the BLUP equations, but the number and size of simultaneous equations to be solved prevent us from generating results in real time. The solution is to allow the user to initiate the calculation for their herd.

Pedigrees and descendant trees can be displayed by a browser using the pear package HTML_TreeMenu.

7.2 Future research

Designing useful artifacts is complex due to the need for creative advances in domain areas in which existing theory is often insufficient. “As technical knowledge grows, IT is applied to new application areas that were not previously believed to be amenable to IT support” (Markus et al. 2002, p. 180). The resultant IT artifacts extend the boundaries of human problem solving and organizational capabilities by providing intellectual as well as computational tools. Theories regarding their application and impact will follow their development and use. (Hevner et al., 2004)

This thesis focuses on the creative advance in the domain area; the artifact was created to test the ideas. As predicted by Hevner et al., 2004, the creation of the artifact opens up new research possibilities.

7.2.1 Can node codes be maintained incrementally?

Previous work in this area has used node codes to speed up pedigree queries. This work has shown that an adjacency list, minimum path length, maximum path length and a transitive closure can be used to speed up similar queries. We have also shown that the data structures used can be maintained incrementally. A proof that the node codes can or can’t be maintained incrementally would be interesting.

7.2.2 Did the application meet its underlying design goal?

Increasing the use of EBVs across the cashmere industry was the underlying aim of the project. We aimed to make the effort required more rewarding by providing instant feedback, and the process easier to use by offering different methods to build the pedigree tree. It would be interesting to observe how the use of EBVs within the industry changed as a result of this work. The creation of the artifact makes this research possible.

7.2.3 Are production gains increased?

The cashmere industry has been running a national fleece competition for decades, where prizes are awarded for the most valuable fleece. The competition is dominated by two breeders, one of whom has a genetics background and has been using EBVs for many years. The other is an old-fashioned breeder using phenotypes and a feel for what is right. The latter herd has won the most valuable fleece award for almost a decade.

The two herds (one is in Queensland and the other in Victoria) are difficult to compare. It would be interesting to have answers to the following questions.

- Does the provision of EBVs based on best linear unbiased prediction help experienced breeders?

- Will the provision of EBV data as described above alter the decisions of experienced breeders?

- Do experienced breeders and the methodologies described in the thesis pick the same animals as the best animal?

With EBVs now available to the entire industry, the herd belonging to the breeder who has a ’feel’ for the problem could be randomly divided, both methods applied and the outcomes compared.

An argument could be mounted that taking the “feel” out of the selection is advantageous, as a solid outcome can be had by anyone willing to undertake systematic recording and apply the methodology. Whether that argument is reasonable is another possible research question.


7.2.4 Will this work affect Lambplan?

Lambplan and Merinoplan are industry wide programs that use best linear unbiased prediction to calculate EBVs for the lamb meat and merino industries. Lambplan and Merinoplan are run by Sheep Genetics, a joint program created by Meat and Livestock Australia (MLA) and Australian Wool Innovation Limited (AWI). Both systems are based on a centralized database and batch calculations. There is a flat rate of $330 and a charge of $1.65 per animal, with a maximum charge of $2750 per breeder (Fee Schedule, 2009). This is down from the 2004 charge of a $355 flat rate and $2.32 per animal (McNair, 2004). In 2009 the per animal charge only applies to the current drop, and historic data (the very foundation of EBVs based on BLUPs) is stored for free. To use the service, you export your pedigree and phenotype records and receive in return a file containing EBVs and inbreeding coefficients. Your data must be correctly formatted before sending it.

Because of its small size, the cashmere industry was an unattractive market for the current operator. This forced the cashmere industry to develop its own solution, which resulted in the creation of Merrrit, the subject of this thesis. We aimed to develop a system that could be self managed. The techniques developed could be used to self generate within herd EBVs and inbreeding coefficients for any industry, and this could reduce the cost. Industries would still need centralized control to generate across herd EBVs, as it is unlikely any breeder will entrust his full dataset to another.

The author has listened to lamb and merino breeders complain about the cost of their current solutions and the accuracy of the EBVs provided. An understanding of best linear unbiased prediction leads one to ask: do the complaints about the provided EBVs result from poor industry data structure (not enough cross herd links)? This is research that should be undertaken by the Lambplan and Merinoplan providers.

The possibility of extending this solution to other industries results in several interesting research questions:

- How general is the dissatisfaction with Lambplan and Merinoplan EBVs and why has it occurred?

- If Lambplan and Merinoplan users were supplied with tools to display their pedigree and to get inbreeding coefficients and within herd EBVs without waiting for the industry calculation, would their satisfaction with the service improve?

- Industry wide EBVs are only valuable if breeders believe the results. For accurate across industry EBVs you need sound genetic links across herds. Are Lambplan and Merinoplan results questionable because of poor data structure?

- Because cross herd/flock links are difficult and costly to maintain, one needs to ask the question: as a general rule, are breeders looking for within herd EBVs or industry wide EBVs?

- Would greater productivity gains be achieved if costs were reduced by making available online systems for the generation of in herd EBVs?


References

Bennett, R. L., A., S. K., Uhrich, S. B., O’Sullivan, C. K., Resta, R. G., & Lochner-Doyle, D. (1995). Recommendation for standardized human pedigree nomenclature. Journal of Genetic Counseling, 4 (4).

Dong, G., & Su, J. (1995). Incremental and decremental evaluation of transitive closure by first-order queries. Information and Computation, 120 (1), 101–106.

Elliott, B., Akgul, S. F., Mayes, S., & Ozsoyoglu, Z. M. (2007). Efficient evaluation of inbreeding queries on pedigree data. In SSDBM ’07: Proceedings of the 19th international conference on scientific and statistical database management. Washington, DC, USA: IEEE Computer Society.

Elliott, B., Akgul, S. F., Ozsoyoglu, Z. M., & Manilich, E. (2006). A framework for querying pedigree data. In SSDBM ’06: Proceedings of the 18th international conference on scientific and statistical database management (pp. 71–80). Washington, DC, USA: IEEE Computer Society.

Faraway, J. (2002). Practical regression and anova using r. http://cran.r-project.org/.

Fee schedule. (2009). Website. PO Box U254 UNE, Armidale, NSW, 2351: Sheep Genetics. (http:www.sheepgenetics.org.au)

Goldberger, A. S. (1962). Best linear unbiased prediction in the generalized linear regression model. Journal of the American Statistical Association, 57 (298), 369–375.

Guozhu, D., Leonid, L., Jianwen, S., & Limsoon, W. (1999). Maintaining transitive closure of graphs in sql. In Int. J. Information Technology, 5.

Henderson, C. R. (1975). Best linear unbiased estimator and prediction under a selection model. Biometrics, 31 (2), 423–447.

Henderson, C. R. (1976). A simple method for computing the inverse of a numerator relationship matrix used in the prediction of breeding values. Biometrics, 33 (1), 69–83.

Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004, march). Design science in information systems research. MIS Quarterly, 28 (1), 75–105.

James, A. T. (2009). Estimating breeding values for better cashmere - using the australian cashmere. PO Box 4776 KINGSTON ACT 2604: Rural Industries Research and Development Corporation.

Lynch, M., & Walsh, B. (1997). Genetics and analysis of quantitative traits. 23 Plumtree Road, Sunderland, MA 01375 U.S.A: Sinauer Associates, Inc.

McNair, S. (2004). Mla responds to lambplan critics. Website. Stock and Land. (http://sl.farmonline.com.au/news/state/agribusiness-and-general/general/mla-responds-to-lambplan-critics/9187.aspx)

Mrode, R. (1996). Linear models for the prediction of animal breeding values. Wallingford Oxon: CAB International.

Mrode, R. (2005). Linear models for the prediction of animal breeding values (2 ed.). Wallingford Oxon: CAB International.

Pang, C., Dong, G., & Ramamohanarao, K. (2005). Incremental maintenance of shortest distance and transitive closure in first-order logic and sql. ACM Transactions on Database Systems., 30 (3), 698–721.

Quaas, R. L. (1976). Computing the diagonal elements and inverse of a large numerator relationship matrix. Biometrics, 32 (4), 949–953.

Radi, H., & Heyes, R. (2008). Html treemenu. Website. (http://pear.php.net/package/html_treemenu.html)

Rozanov, Y. A. (1964). Probability theory a concise course (R. A. Silverman, Ed.). New York: Dover Publication Inc.

Schott, G. D. (2005). Sex symbols ancient and modern: their origins and iconography on the pedigree. British Medical Journal, 331 (7531), 1509–1510.

Strang, G. (2003). Introduction to linear algebra. Wellesley: Wellesley-Cambridge Press.

Tores, F., & Barillot, E. (2001, February). The art of pedigree drawing: algorithmic aspects. Bioinformatics (Oxford, England), 17 (2), 174–179.

Yahoo user interface. (2009). Website. (http://developer.yahoo.com/yui/)


Appendices


Appendix A

Code Structure


The key to successful large project development is a solid foundation. A large project is completed when the code base is abandoned, not before. Design goals change, algorithms are discovered, users generate new ideas and new team members schooled in the latest buzz words come and go. As the code base grows, the investment increases and the option of starting the project over fades. This chapter documents the foundations. Hopefully we have a foundation flexible enough to support the project over the long term.

A.1 Input values and return values

All method and function inputs are passed by value. The problem inherent in C type languages (that they can only return one value) is dealt with by having functions and methods return an array if multiple data items are required. PHP associative arrays make this methodology very attractive, as each data item can be referred to by its index name.

A.2 Error codes

Functions and methods return data in an associative array entry called error_code if there are problems. The application can use the error_code to control its actions. The methods and functions also return an error_string if there are problems; the error_string is used to tell the user about the problem. The error string should be returned in the user’s language. Code that generates error strings is stored in the language directory (/ENGLISH).
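The return convention of sections A.1 and A.2 can be sketched as follows. The application itself is written in PHP; this sketch uses JavaScript, with a plain object standing in for a PHP associative array. The function name parseIsoDate, the code BAD_DATE and the message text are hypothetical and are not taken from the application.

```javascript
// Hypothetical per-language error strings, mirroring the idea of the /ENGLISH directory.
const ERROR_STRINGS_ENGLISH = {
  BAD_DATE: "The date must be entered as YYYY-MM-DD.",
};

// Returns { data } on success, or { error_code, error_string } on failure.
// The caller branches on error_code and shows error_string to the user.
function parseIsoDate(input, errorStrings) {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(String(input).trim());
  if (!match) {
    return { error_code: "BAD_DATE", error_string: errorStrings.BAD_DATE };
  }
  return {
    data: {
      year: Number(match[1]),
      month: Number(match[2]),
      day: Number(match[3]),
    },
  };
}

const good = parseIsoDate("2009-11-03", ERROR_STRINGS_ENGLISH);
const bad = parseIsoDate("3rd Nov", ERROR_STRINGS_ENGLISH);

if ("error_code" in bad) {
  console.log(bad.error_string); // message intended for the user
}
console.log(good.data);
```

Because several named items come back in one value, adding a new return item later does not change the call signature.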

A.3 The table

The application code is driven by tables; if a table is needed, the table address is always passed into code as a call parameter. The code makes the assumption that index values within a table row will be there and that the values are for a particular use. The code base makes no assumptions on the names of fields in the database or their encoding. To add new data fields to the system, we add the fields to the database table and to the tables that drive the application. Tables can be found in the files:
\ENGLISH\translation_array_animals.php,
\ENGLISH\translation_array_shearing.php,
\ENGLISH\translation_array_x.php and
\ENGLISH\translation_array_herd_standard.php.
The arrays are stored in a language directory as they contain the string to be used as column headings when the data is displayed.

This design breaks the development problem into two parts: code to interpret the tables and the development of the tables. Hopefully the options supported by the table interpreter are broad enough to accommodate most of the changes that will invariably be required as the application ages.

A.3.1 base

Data can be entered into the application using the web interface or via files containing comma separated fields, with the field type being set using headings. The file input code supports quite a variety of data encodings, but only one encoding is needed for the web interface. The one encoding needed is the base encoding.

A.3.2 db string

The name of the field in the base database table (the unjoined table). Which table is the base table depends on the field, and the location is given in the table entry location.

A.3.3 dbt string

There is always a data structure maintained that has the tables joined. This is the field name in the joined table.

A.3.4 strings

As mentioned above, files can be used to enter data into the system. The table entry points to a table of strings that can be used as heading strings in upload files. The heading string selects a table entry to handle the column. The code that selects the table entry does a string match on the table pointed to by this table entry.
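The heading match can be illustrated with a small sketch. It is written in JavaScript rather than the application's PHP, the table contents are invented, and the case-insensitive, whitespace-trimmed comparison is an assumption (the text says only that a string match is done).

```javascript
// Hypothetical driver table: each entry lists the heading strings it accepts.
const ANIMAL_TABLE = [
  { db_string: "tag1", strings: ["tag", "primary tag", "tag 1"] },
  { db_string: "sex", strings: ["sex", "gender"] },
];

// Return the table entry that handles a column heading, or null if none matches.
function findEntryForHeading(table, heading) {
  const wanted = heading.trim().toLowerCase();
  for (const entry of table) {
    if (entry.strings.includes(wanted)) {
      return entry;
    }
  }
  return null;
}

// Map each heading of an uploaded file's first line to the entry for its column.
function mapHeadings(table, headerLine) {
  return headerLine.split(",").map((h) => findEntryForHeading(table, h));
}

const columns = mapHeadings(ANIMAL_TABLE, "Primary Tag, Sex, Colour");
console.log(columns.map((entry) => (entry ? entry.db_string : "(unhandled)")));
```

A column whose heading matches no entry maps to null, so the upload code can decide whether to ignore that column or report an error.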

A.3.5 data convert

This points to an object that inputs and outputs the data from the database. The methods of this object are discussed further in section A.4.


A.3.6 type

Data can be of two types: data used to select the record, and data that is stored in the record. This field identifies which of the two types applies.

A.3.7 logic

The tables control the select logic. If this is set to AND, then the select field is used as an AND condition. If set to OR, then the select field is used as an OR condition. The different tag types (first tag, second tag etc) are set as OR conditions, while the herd is set as an AND condition. Different herds can have the same tag because the animal has to belong to a particular herd. Within a herd, different tag types can be used to identify the animals. If a tag type is to be useful, each animal within the herd has to have a unique tag.
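One way to read the AND/OR setting is as the recipe for a WHERE clause: every AND field must match, and at least one OR field must match. The sketch below shows that reading in JavaScript; the column names, the values and the use of ? placeholders are assumptions, not the application's own query builder.

```javascript
// Hypothetical select fields; `logic` mirrors the AND/OR setting in the driver table.
const selectFields = [
  { column: "herd", logic: "AND", value: "DYN" },
  { column: "tag1", logic: "OR", value: "4001" },
  { column: "tag2", logic: "OR", value: "4001" },
];

// Build a parameterised WHERE clause from the select fields.
function buildWhere(fields) {
  const andFields = fields.filter((f) => f.logic === "AND");
  const orFields = fields.filter((f) => f.logic === "OR");

  const parts = andFields.map((f) => `${f.column} = ?`);
  if (orFields.length > 0) {
    // The OR fields are grouped so they cannot override the herd condition.
    parts.push("(" + orFields.map((f) => `${f.column} = ?`).join(" OR ") + ")");
  }
  const params = [...andFields, ...orFields].map((f) => f.value);
  return { sql: parts.join(" AND "), params };
}

const where = buildWhere(selectFields);
console.log(where.sql);
console.log(where.params);
```

Grouping the tag conditions in parentheses is what lets two herds reuse the same tag number without their animals being confused.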

A.3.8 heading string

The string to be used as the data’s name when displaying the data item. Different languages will have different headings.

A.3.9 location

The location of the data is given by location and db_string.

A.4 Data entry and display

Figure A.1: Application data input and output (diagram labels: input_to_db, test_db_change, get_data, output_from_db, Database)

The tables mentioned above contain a pointer to a class object. The class object methods are used to input and output data. Checking that a date is within an acceptable range is not an application problem but a data input problem, and the data input/output objects look after it. The data entry classes are stored in the file \ENGLISH\data_conversion.php. The input strings used to set database values and the display strings for particular database values are language dependent, and hence the file is located in the language directory.


A.4.1 input to db

This class method is used to convert user input into the database format. As an example, the database stores the sex as ’m’ or ’f’, but an English user inputs ’doe’, ’buck’ and many other options to represent the sex. The input to the method is the user string. As the method has to be able to return an error message or a result, the method returns an associative array with 3 possible index values: data, error_code and error_string. If the input data is successfully converted, then there will be no error_code or error_string fields in the associative array. The code in this method should clean out attempts at SQL injection and cross site scripting.

This method should only concern itself with the field in question, as test_db_change looks after how the data relates to the rest of the database.
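A minimal sketch of such a conversion for the sex field follows, in JavaScript rather than the application's PHP. The synonym list, the code BAD_SEX and the message are hypothetical. Mapping through a fixed list is one simple way to keep unexpected input, including injection attempts, away from the database; it is not claimed to be how the application does it.

```javascript
// Hypothetical English synonyms; an equivalent table would exist per language.
const SEX_SYNONYMS = {
  doe: "f", female: "f", f: "f",
  buck: "m", male: "m", m: "m",
};

// Convert a user string to the database code, or report an error.
function inputToDb(userString) {
  const cleaned = String(userString).trim().toLowerCase();
  // Only strings found in the fixed list are accepted; nothing the user
  // typed is ever passed through to the database unchanged.
  if (Object.prototype.hasOwnProperty.call(SEX_SYNONYMS, cleaned)) {
    return { data: SEX_SYNONYMS[cleaned] };
  }
  return { error_code: "BAD_SEX", error_string: "Sex must be entered as doe or buck." };
}

console.log(inputToDb(" Doe "));
console.log(inputToDb("x'; DROP TABLE animals; --"));
```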

A.4.2 test db change

After the application has worked out which animal the input data is referring to, read the relevant animal records and worked out which fields are being altered, this method is called. The test_db_change method in a class looking after birth dates would check that the birth date is valid; that is, make sure that it is after the birth dates of its ancestors and before those of its progeny. This method has access to and returns the update array, so it has the option of removing or altering the update and the fields updated, or of returning an error message.
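The birth date check can be sketched as below. This is JavaScript, not the application's PHP, and every name is hypothetical. As a simplifying assumption it compares only against the parents and the direct progeny, which is sufficient when all dates already stored are mutually consistent; the application may instead consult the transitive closure.

```javascript
// `update` is the pending update array; `parents` and `progeny` are records
// already read from the database. Dates are ISO strings (YYYY-MM-DD).
function testDbChange(update, parents, progeny) {
  if (!("birth_date" in update)) {
    return { update }; // nothing for this class to check
  }
  const birth = Date.parse(update.birth_date);
  if (Number.isNaN(birth)) {
    return { error_code: "BAD_BIRTH_DATE", error_string: "The birth date is not a valid date." };
  }
  const notAfterParent = parents.some((p) => Date.parse(p.birth_date) >= birth);
  const notBeforeProgeny = progeny.some((c) => Date.parse(c.birth_date) <= birth);
  if (notAfterParent || notBeforeProgeny) {
    return {
      error_code: "BAD_BIRTH_DATE",
      error_string: "The birth date must be after the parents' and before the progeny's.",
    };
  }
  return { update }; // unchanged update array: the change is acceptable
}

const parents = [{ birth_date: "2003-09-01" }, { birth_date: "2002-10-15" }];
const progeny = [{ birth_date: "2007-09-20" }];
const accepted = testDbChange({ birth_date: "2005-09-10" }, parents, progeny);
const rejected = testDbChange({ birth_date: "2001-01-01" }, parents, progeny);
console.log(accepted, rejected);
```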

A.4.3 get data

Some data doesn’t have a one to one relationship to the database. When such data is entered, the test_db_change method works out which fields should be changed and alters the update array accordingly. On a read, this method generates the composite field from the data read from the database. An example is fleece variance. The test micron and standard deviation are stored in the database and the variance is generated from those fields.

A.4.4 output from db

There is a one to one relationship between input to the method and output from the method. Internal database codes are converted to display strings. For example, the sex codes “f” and “m” are converted to “doe” and “buck”.


Appendix B

Ajax


Asynchronous JavaScript and XML (AJAX) is a current buzz word. It is actually all very simple and has a lot to do with HTML and very little to do with XML. A JavaScript application is event driven, and the code is available for execution while a web page is being displayed. JavaScript can place a request to a server that is independent of the page request the user sent to get the page being displayed; this is the asynchronous part. Different browser brands require different code to start the request, and this complicates the JavaScript code a little but nothing more. The replies to the asynchronous request are events that are used to call some JavaScript code. The document that is displayed to the user is structured because it was generated using HTML; the location of the various elements within the structure is documented and is described as the document object model (the DOM). Using the DOM, the programmer can work out which object to modify if he or she wishes to change the data displayed by the object. The JavaScript that receives the reply to the asynchronous request retrieves the response string (which may be encoded as XML data, but then again may not), decodes it and modifies the objects that describe the document. The browser then updates the displayed document.

All that is required on the server side is a page that responds to the asynchronous request. The server side request is just another GET, and the server doesn’t know or care that it is an asynchronous request. Having the server output XML code is optional.

B.1 File upload progress

Figure B.1: Intertask communication (diagram labels: Browser, task1, task2, file, Database; file sent to task 1; asynchronous request to task 2)

Merrrit allows users to upload files that are used to update the database. This option is provided to upload data contained in historic databases and to make it easier to enter large data sets. The files can contain hundreds of records and the update can take several minutes. AJAX is used to provide the user with a progress report. When the database is being updated in this manner, one web server thread is performing the update, and another is responding to the browser’s request for a progress report.

Intertask communication on servers that we don’t control can be a problem, so the application uses a file for communication (see figure B.1). The thread doing the update seeks the communication file to a known location (zero) and writes the input file active line number to the communication file. The thread replying to the browser’s asynchronous request reads the data from the communication file. A Unix system will present the file reading thread with the last value written by the file writing thread.
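The file based hand-off can be sketched with two small functions, one per thread. The sketch uses Node.js (JavaScript) rather than the application's PHP, and the file location is hypothetical. It relies on the line number only ever increasing, so each write at offset zero is at least as long as the one it overwrites.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical location; the application names its file after the session id.
const progressFile = path.join(os.tmpdir(), `progress_${process.pid}.txt`);

// Writer side (the thread doing the update): overwrite the count at offset zero.
function writeProgress(fd, lineNumber) {
  fs.writeSync(fd, `${lineNumber}\n`, 0);
}

// Reader side (the thread answering the asynchronous request): return the
// first line, or "0" if the file does not exist yet or is still empty.
function readProgress(file) {
  try {
    const firstLine = fs.readFileSync(file, "utf8").split("\n")[0];
    return firstLine === "" ? "0" : firstLine;
  } catch (err) {
    return "0";
  }
}

const fd = fs.openSync(progressFile, "w");
for (let line = 1; line <= 12; line++) {
  writeProgress(fd, line); // in the application this happens once per input record
}
const seen = readProgress(progressFile);
fs.closeSync(fd);
fs.unlinkSync(progressFile);
console.log(seen);
```

A plain file needs no shared memory or message queue support from the hosting provider, which is the point of the design.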

The server code is found in the file ajax_upload_animal.php, and the code and the comments pretty much say all that needs to be said.

<?php
$file_path = $_SERVER['SCRIPT_FILENAME'];

// Directory containing this script: everything before the last '/'.
$app_directory = substr(
    $file_path,
    0,
    -(strlen($file_path) - strrpos($file_path, '/'))
);

// This was sent as a hidden field in the
// file screen_upload_add_animal.php etc.
// It is made from the session id which
// is different for each active user.
$ajax_data_file =
    (isset($_REQUEST['ajax_data_file'])) ? $_REQUEST['ajax_data_file'] : '';

// The file read by this code is updated in
// screen_upload_add_animal.php etc. as records
// are added to the database.
// We are responding to an async read using the JavaScript
// found in the file animal_upload.js
function record_count_open($app_directory, $session_id) {
    $hostname = "{$app_directory}/temp/{$session_id}";
    $handle = fopen($hostname, "r");
    return $handle;
}

// The server writes the line at offset zero so
// we only need to read the first line.
function record_count_read($handle) {
    $input = '0';
    if (!$handle) return $input;
    $output = fgets($handle);
    return $output;
}

$handle = record_count_open($app_directory, $ajax_data_file);
$records = record_count_read($handle);
echo $records;
?>

Note that we are outputting only one value and we do not decorate it with XML encoding. The JavaScript code is downloaded to the browser as JavaScript normally is; that is, with a <script></script> tag sent as part of the header.

<script src="animal_upload.js" type="text/javascript"></script>

The JavaScript sent to the browser pretty much documents what needs to be done at the browser end:

function ajaxFunction() {
    var xmlHttp;
    try {
        // Firefox, Opera 8.0+, Safari
        xmlHttp = new XMLHttpRequest();
    } catch (e) {
        // Internet Explorer
        try {
            xmlHttp = new ActiveXObject("Msxml2.XMLHTTP");
        } catch (e) {
            try {
                xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
            } catch (e) {
                // So they don't get a record load update,
                // don't upset the user
                // alert("Your browser does not support AJAX!");
                return false;
            }
        }
    }

    xmlHttp.onreadystatechange = function() {
        if (xmlHttp.readyState == 4) {
            document.ajax_form.records_loaded.value = xmlHttp.responseText;
        }
    }

    var dataFile = './ajax_upload_animal.php?ajax_data_file='
        + document.ajax_form.ajax_data_file.value
        + '&';

    xmlHttp.open("GET", dataFile, true);
    xmlHttp.send(null);
}

This code only contains the code to initiate the request and deal with the response. The code to deal with the response is contained in the method:

xmlHttp.onreadystatechange.

The method writes to the DOM element:

document.ajax_form.records_loaded.value.

The name of the DOM element depends on the document structure. ajax_form and records_loaded are names given in the HTML code sent by the server to display the original page.

As mentioned above, JavaScript code is event driven. The event to initiate the asynchronous request is created when the submit button is clicked. The HTML code is in the file originally sent to display the page and has the following form:

<input type=submit
    class="form"
    size="60"
    name="accept"
    value="submit"
    onclick="window.setInterval('ajaxFunction()', 2000)"
>

The interval event set by the code above will end when a new page is sent to the browser by the server thread processing the submitted file (the user file containing records to update the database). Care needs to be taken with the server thread timeout value, as files that add large numbers of records to large transitive closures can take a long time to process. Using AJAX to provide feedback stops the user panicking but not the server.


B.2 Yahoo user interface; selecting fields to display

“The YUI Library is a set of utilities and controls, written in JavaScript, for building richly interactive web applications using techniques such as DOM scripting, DHTML and AJAX” (Yahoo User Interface, 2009).

The YUI consists of several JavaScript libraries that extend the functionality of JavaScript, and to this is added application specific code written by the application developer. The code written by the application developer can ask the browser to load the JavaScript libraries from the Yahoo server, or the developer can move the required files to a computer under his control and send code to the browser that asks for the libraries to be loaded from there.

Figure B.2: Field select drag and drop

Merrrit is an application used by many different breeders and has support for many different data fields. For example, there are four options for identifying an animal: primary tag, secondary tag, name and e tag. A user will only want the fields he or she uses displayed in the data editing screens. There needs to be a screen the user can use to select the fields of interest. As the Yahoo User Interface offers drag and drop support, drag and drop is an option.

The JavaScript code that looks after drag and drop manipulates the DOM tree but doesn’t send an asynchronous request to the server.

The application specific code is found in the file /class/display_order.php. The list of available fields is created using the table described in section A.3. The options selected for display by the user are sent to the server when the user clicks the submit button. The options selected are encoded in a semicolon separated string which is saved in a herd specific database table.
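The semicolon separated encoding of the selected fields amounts to a join and a split. The sketch below is illustrative JavaScript with invented field names; it assumes that field names never contain a semicolon, and treats an empty string as no fields selected.

```javascript
// Encode the fields in display order as one string for storage.
function encodeFieldOrder(fields) {
  return fields.join(";");
}

// Decode the stored string back into an ordered list of field names.
function decodeFieldOrder(stored) {
  return stored === "" ? [] : stored.split(";");
}

// Hypothetical field names in the order the user dropped them.
const selected = ["primary_tag", "name", "sex", "birth_date"];
const stored = encodeFieldOrder(selected);
const restored = decodeFieldOrder(stored);
console.log(stored);
console.log(restored);
```

Storing the order in a single string keeps the herd specific table to one column, regardless of how many fields the application later supports.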


Appendix C

The Application


C.1 Overview

The application screen is divided into four (see figure C.2):

- A top left icon area.

- A browser area down the left side, which is used to navigate the application.

- A link bar across the top, which has two important items, login and log-off. These links are used by herd owners to log onto and off the application.

- The active display area located at the bottom right.

The application has a public face and a herd owners’ face (see figure C.1). The public face presents screens that can be used to view the data, and the herd owners’ face presents screens to view, edit, and upload data as well as screens to set up the fields to be displayed. When a user enters the application, the browser area displays the public tree, which contains a herd branch. There is a list of herds under the herd branch. Clicking on the herd code displays herd details, and clicking on the arrow to the left of the herd code opens another branch. Within that branch, the user finds screens to display animal data. When entering the application, the active screen area displays an application summary.

C.2 Login and logoff

In the previous section it was mentioned that users get the option of editing data and setting up the herd displays if they are logged on.

To log on, you click the login link found in the link bar that runs across the top of the application screen. When the link is clicked, the browser is replaced with the login screen. The herd and password have to be entered for a successful login. The [give up] button abandons the login attempt.

C.3 The herd screens

If a user clicks on a herd code, herd details are displayed in the active display area. Figure C.3 shows a typical screen. The herd screen is divided into four:

- Text that provides an overview of the herd. The text can be altered by the herd owner using the herd edit screen.

- Data from the herd record. The herd owner can alter which fields are displayed using the herd setup screen and the contents of the fields using the herd edit screen.

- A picture, which the herd owner can upload using the herd edit screen.

- System generated data. For example, the number of animal births registered against the herd.

Figure C.1: Public and logged on browsers. The public may look at the data. Once logged on, the herd owners may edit data, control the fields displayed, upload files to alter data in batch lots, and download data into files on their local machines.

Figure C.2: Entry screen with screen areas highlighted.

Figure C.3: Public screen: herd data

If the herd owner provides a link to his/her web site, the link is presented in the title bar.

C.3.1 Herd edit screen

If the herd owner is logged on, there is an edit branch in the browser tree, and within that branch is a link to the herd edit screen. A representative screen is shown in figure C.4.

The herd edit screen is divided into three parts, and each part can be updated independently.

- Herd details are entered using a text box. HTML tags can be used to add headings and highlight text etc. [upload herd details] must be clicked to transfer data to the server.

- To change the picture displayed, a file must be uploaded, and there is a [Browse] button to select the file. After the correct file has been found, the [upload photo] button must be clicked to upload the new file.

- Fields to edit the herd record. The fields in this set are set up using the herd setup screen. The set of fields that can be displayed in the herd edit screen is larger than the set that can be displayed in the herd screen. For instance, if the herd password is one of the displayed fields, it is displayed in the herd edit screen but not in the herd screen. In the herd setup screen, the fields that are displayed in both the herd edit screen and the herd screen are displayed in a different colour.

Figure C.4: Herd edit screen.

C.3.2 Herd setup screen

If the herd owner is logged on, there is a setup branch in the browser tree, and within that branch is a link to the herd setup screen. An example herd setup screen is shown in figure C.5.

The application has a wide range of data fields to suit the needs of many different users. Users don’t want their screens cluttered up with unused data. The herd setup screen is used by the user to select the fields of interest and the width of the selected fields.

The herd setup screen is divided into two areas: one is used to select the fields to display and the order in which they are displayed, and the other is used to set the field width.

Displayed fields and order

You set the displayed fields and order using a drag and drop application based on the Yahoo User Interface (see section B.2). There are two columns: the first contains available fields that have not been selected for display, and the second column displays the fields selected. The field order in the display column will set the field display order in the edit herd screen, the herd screen and in the field length section of this screen.

Brown fields will be displayed in both the edit herd screen and the herd screen. Purple fields will be displayed in the edit herd screen only.

To move a field from the available column to the display column, move the mouse over the field, left click on it and, holding the mouse button down, drag the field over to the display column and release the button (drop it). Dragging and dropping the fields in the display column alters the order of display. The [Update display] button must be clicked to send the changes to the server.

Figure C.5: Herd setup screen.

Figure C.6: Animal screen, display only

Field width

The server displays the current list of selected fields in the field width section. Different field widths can be set and the [update width] button is used to upload the changes to the server. It is best to select your fields, set the order and click [update display] and then set the field width in the new screen.

C.4 Animal screens

The animal screen comes in a public and private version. The public version is used to display a summary of several animals and the private version can be used to edit the displayed data. When the user is logged onto a herd, the animal screen is found in the browser’s edit branch, and the link brings up the private version.

The first animals displayed and the order of display are determined by the tests and values set above the columns. An example screen is shown in figure C.6.

The fields displayed in the public version of the animal screen are set using the animal public setup screen. The link to this screen is found in the setup branch, and the icon used to represent this screen contains a large yellow mallet. Clicking on the link next to the icon displays a screen that can be used to beat the public animal screen into some sort of order.

The fields displayed to a logged on user are set in the animal setup screen, a link to which can also be found in the setup branch, with an icon that contains a small hammer. Delicate use of this tool results in a private animal screen set up to meet your editing needs.

As mentioned above, the fields above the columns are used to request the display of records from a set starting point and order. You have a choice of two tests: equal to, or greater than or equal to. If you want to display all DYN does with a primary tag above 4000, you would set the column headings up as shown in figure C.6.

There are three buttons and two fields below the displayed data. The [display] button takes you back to the start of the list of records set by the column headings, the [next] button takes you to the next group and the [previous] button takes you back to the previous group. The record field is used to set the number of records displayed on the screen and the offset field is used to set the offset within the record set. The offset field changes when the [next] and [previous] buttons are used.
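The relationship between the record field, the offset field and the three buttons can be sketched as simple arithmetic. This is illustrative JavaScript, not the application's PHP; the function names are hypothetical, and clamping the offset at zero when stepping back from the first group is an assumption.

```javascript
// `state.records` is the number of records per screen; `state.offset` is the
// position of the first displayed record within the selected record set.
function display(state) {
  return { ...state, offset: 0 }; // back to the start of the list
}

function next(state) {
  return { ...state, offset: state.offset + state.records };
}

function previous(state) {
  return { ...state, offset: Math.max(0, state.offset - state.records) };
}

// The group of rows shown for a given state.
function showPage(rows, state) {
  return rows.slice(state.offset, state.offset + state.records);
}

// 25 hypothetical primary tags, 4001 to 4025, shown 10 at a time.
const rows = Array.from({ length: 25 }, (_, i) => 4001 + i);
let state = { records: 10, offset: 0 };
state = next(state);
const secondGroup = showPage(rows, state);
state = previous(previous(state));
console.log(secondGroup, state);
```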

Next to each record is a view button, which takes you to the screen that will display the full animal record (see section C.5).

The private and public versions of the animal screen are almost the same. If you have the right to edit a record, the private version allows you to enter new data using the fields that display the current data. If you have the right to edit the record, a delete button appears next to the record. The view button takes you to a screen that can be used to edit all animal details, including the animal phenotype records.

C.4.1 Public animal setup screen

If the user is logged on, the public animal setup screen has a link in the browser’s setup branch.

To meet the needs of a varied user base, the animal records contain many fields. Take as an example the data used to identify animals. Some breeders give their animals a name, a primary tag, a secondary tag and an electronic tag. The same breeder may only want to make the primary tag public. Another breeder may use only the primary tag, and does not want the private or public animal screen cluttered with fields that are not used. To further complicate matters, different users may want the same data displayed in different ways.

The public animal setup screen is divided into two areas: one is used to select the fields to display and the order in which they are displayed, and the other is used to set the field width.

Displayed fields and order

You set the displayed fields and order using a drag and drop application based on the Yahoo User Interface (see section B.2). There are two columns, the first of which contains available fields that have not been selected for display, and the second displays the fields selected. The field order in the display column will set the field display order in the public animal screen.

To move a field from the available column to the display column, move the mouse over the field, press and hold the left mouse button, drag the field over to the display column and release the button (drop it). Dragging and dropping the fields in the display column alters the order of display. The [Update display] button must be clicked to send the changes to the server. The [Update display] button is located under the available data. As there are many options, you will probably have to use the scroll bar on the right hand side to find it.

Field width

The server displays the current list of selected fields in the field width section. As the display field and order section is quite long, you may need to use the right hand side bar to see this section. Different field widths can be set for each selected field and the [update width] button can be used to upload the changes to the server. It is best to select your fields, set the order and click [update display] and then set the field width in the new screen.

To display many items on one line, it pays to keep the field widths as small as possible.

C.4.2 Animal setup screen

This screen is used to set up the user’s animal screen (the private animal screen). A separate setup screen gives the user the option of using the animal screen to edit and view private data. This screen functions in the same way as the screen described in the previous section.

Figure C.7: Animal detail screen

C.5 Detail screen

The detail screen is used to view an animal’s data, which includes ancestor and descendant information, phenotype records, animal data and calculated values. The screen comes in two forms: the public version displays data, and the private version allows you to edit the displayed data.

Select fields

The detail screen (see figure C.7 for a partial display) is long and divided into several areas. Along the top are fields used to select a record. The fields displayed to select a record are the select fields picked for display in the detail setup screen (discussed later). First tag, second tag, name, e tag and herd are the select fields.

The herd must be set and the system insists that all tag fields are unique within the herd. Therefore, any tag field that has been set can be used to search for an animal.

Animal record

The animal record section displays animal record data. If you are logged on, the data can be edited where displayed and updated by clicking the update button located under this section.

Photograph

A photo of the animal can be added. If the user is not logged on, the photo is displayed to the right of the animal record data. If the user is logged on, the screen contains a field to select a file containing an animal image and a button to initiate a file upload.

Pedigree

The pedigree is displayed next. Chapter 6 discusses the displaying of pedigrees in some detail. The animal identifications displayed in the pedigree are links to detail screens for the animal identified. The tree structure displays how the animals are related.

Descendants

The descendants are displayed next. Chapter 6 discusses the displaying of descendants in some detail. The descendant tree is available for browsing pleasure, but if the number of descendants is too large, only the offspring links are displayed.

Phenotype records

The phenotype records follow the descendants. At the time of writing, only shearing records are implemented.

C.5.1 Detail setup screen

If the user is logged on, the detail setup screen has a link in the browser’s setup branch. The screen is set up in a manner similar to the other setup screens. There are sections for the animal record and supported phenotype records. As with the animal screen, there are different setup screens for the public and private versions of the detail screen.

C.6 Upload screens

The browser displays an upload branch when the herd owner is logged on. These screens are used to upload data files. There are screens to add records, update records and to delete records.

C.6.1 Upload file structure

The files can be comma separated files exported from the user’s current herd recording system, hand crafted files created by the user, or comma separated files downloaded from this system. An example file is shown in figure C.8.

//File to build R.A.Mrode Example
first tag;sire first tag;dam first tag;birth_DD/MM/YY;sex
E3;E1;E2;01/9/99;doe
E4;E1;UNSET;1/12/00;buck
E5;E4;E3;1/12/01;buck
E6;E5;E2;1/12/02;buck

Figure C.8: Example animal add file

Files may contain comments, which start with a // and go to the end of the line. Comments are ignored by the decoder. Comments in hand created files are useful, as they can be used to comment out records that you are unsure of, and add details that are not used by the system. Figure C.8 has a comment telling us that the file is used to create the example used in R. A. Mrode’s book.

The first active line must contain a list of column names separated by the same separator as that used in the rest of the file. The system selects the separator, checking that the separator creates more than one field and that the number of fields found in the first and second active lines is the same. Valid separators are ’^’, ’,’, ’:’, ’;’ and ’tab’. The upload screen displays a list of valid string headings. For a discussion of how they are stored in the application, see section A.3.4.
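The separator check described above can be sketched in a few lines of Python. This is an illustration only: the names `SEPARATORS`, `active_lines` and `detect_separator` are hypothetical, and the order in which the candidate separators are tried is an assumption, as the text does not state it.

```python
SEPARATORS = ["^", ",", ":", ";", "\t"]   # the valid separators listed above


def active_lines(text):
    """Strip // comments and drop lines that are then empty."""
    lines = []
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].rstrip("\r\n")
        if line.strip():
            lines.append(line)
    return lines


def detect_separator(text):
    """Return the first separator that splits the first two active lines
    into the same number of fields, with more than one field in each."""
    lines = active_lines(text)
    if len(lines) < 2:
        raise ValueError("need a heading line and at least one record")
    for sep in SEPARATORS:
        first = lines[0].split(sep)
        second = lines[1].split(sep)
        if len(first) > 1 and len(first) == len(second):
            return sep
    raise ValueError("no valid separator found")


example = """//File to build R.A.Mrode Example
first tag;sire first tag;dam first tag;birth_DD/MM/YY;sex
E3;E1;E2;01/9/99;doe
E4;E1;UNSET;1/12/00;buck
"""
separator = detect_separator(example)
```

Applied to the start of the file in figure C.8, the sketch settles on the semicolon, because it is the only candidate that splits both active lines into more than one field.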

Hand created files may use " or ' to bring the field value set in the previous line to the current line. This speeds up data entry. For example, twins have the same sire and dam, and the tag details from the first kid record can be used for the second.

Figure C.9: Three steps to upload a file

As mentioned above, the number of fields in the first and second active line must be the same, but the last field can be dropped in following lines. It pays to make the last field a comment, as adding or leaving the comment off is then optional.
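The two data entry shortcuts, repeating the previous value with a quote mark and leaving off the last field, can be sketched as follows. The function `expand_records` and the tags K1, K2, S1 and D1 are hypothetical, and the sketch assumes the file has already been split into fields.

```python
def expand_records(rows, width):
    """Fill ditto marks from the previous record and pad a dropped last field.

    rows  -- list of records, each a list of field strings
    width -- number of fields in the heading line
    """
    expanded = []
    previous = None
    for row in rows:
        fields = list(row)
        if len(fields) == width - 1:
            fields.append("")            # the last field was left off
        if len(fields) != width:
            raise ValueError("wrong number of fields: %r" % (row,))
        for i, value in enumerate(fields):
            if value.strip() in ('"', "'"):
                if previous is None:
                    raise ValueError("ditto mark in the first record")
                fields[i] = previous[i]  # copy the value from the line above
        expanded.append(fields)
        previous = fields
    return expanded


# Twin kids: the second record repeats the sire and dam of the first
# and leaves off the trailing comment field.
twins = [
    ["K1", "S1", "D1", "first twin"],
    ["K2", '"', '"'],
]
records = expand_records(twins, 4)
```

Copying from the already expanded previous record means a run of ditto marks down a column all resolve to the last value actually typed.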

C.6.2 Upload sequence

The upload operation proceeds in three stages (see figure C.9):

- A screen to select and upload the file is presented.

- The system decodes the file, and checks that the data types within the fields are correct. If there are errors, the table is presented back to the user, with error fields highlighted and a [reject] button.

- If the data types are correct, the system checks that the proposed data will not violate database integrity tests. For example: it ensures that birth dates are in order, and if we are adding records, that there is no record for the animal present already. If the data doesn’t pass these tests, it is presented back to the user with the faulty records highlighted and a [reject] button. If all tests are passed, the data is presented back with an [accept] and [reject] button.

- If the data is good and the user clicks the [accept] button, the data is loaded into the database.
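The staged checks can be pictured with the Python sketch below. It is not the system’s code: the function names, the record layout and the choice of tests are assumptions, and only one type test (the birth date parses as DD/MM/YY) and one integrity test (no record already exists for the animal) are shown.

```python
from datetime import datetime


def check_types(records):
    """Type stage: indexes of records whose birth date does not parse."""
    bad = []
    for i, rec in enumerate(records):
        try:
            datetime.strptime(rec["birth"], "%d/%m/%y")
        except ValueError:
            bad.append(i)
    return bad


def check_integrity(records, existing_tags):
    """Integrity stage: indexes of records that duplicate an existing animal."""
    return [i for i, rec in enumerate(records)
            if rec["first tag"] in existing_tags]


def review(records, existing_tags):
    """Return the buttons to offer and the records to highlight."""
    bad = check_types(records)
    if not bad:                      # integrity is only tested on clean data
        bad = check_integrity(records, existing_tags)
    buttons = ["reject"] if bad else ["accept", "reject"]
    return buttons, bad


new_records = [
    {"first tag": "E3", "birth": "01/9/99"},
    {"first tag": "E4", "birth": "1/12/00"},
]
clean = review(new_records, set())
duplicate = review(new_records, {"E4"})
```

The [accept] button is only offered when both stages return no faulty records, which mirrors the sequence in the list above.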

C.7 Download files

If the user is logged on, the browser displays a download branch containing download screens. The download screens are used to initiate the downloading of a semicolon separated data file. The first line in the file describes the columns, with the fields separated by a semicolon, and the following lines contain data, one line per record. Users download data files if they wish to use local applications to analyze the data.

The download screen has a column of available fields and a download column that contains a list of fields that have been selected for inclusion in the download file (see figure C.10). The order of the fields in the download column determines the order of the fields in the download file. To drop a field into the download column, drag it from the available column.


Figure C.10: File download screen


If the columns selected for download change, the [update fields] button needs to be clicked. When the updated list is returned from the server, the download can be started by clicking the [download] button.

Breeders with the privilege to do so can download the records from all herds. This privilege is needed if the user is calculating industry wide estimated breeding values.

first_tag;sex;birth_date;birth_herd;sire_first_tag;dam_first_tag;
E1;BUCK;;TST;;;
E2;DOE;;TST;;;
E3;DOE;1999-09-01;TST;E1;E2;
E4;BUCK;2000-12-01;TST;E1;UNSET;
E5;BUCK;2001-12-01;TST;E4;E3;
E6;BUCK;2002-12-01;TST;E5;E2;
UNSET;;DOE;;TST;;;

Figure C.11: Example animal download file
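A downloaded file can be read by a local application with very little code. The Python sketch below uses the standard `csv` module on three records taken from figure C.11; the function name `read_download` is hypothetical, and the handling of the trailing semicolon assumes every line ends with one, as the figure appears to show.

```python
import csv
import io

sample = (
    "first_tag;sex;birth_date;birth_herd;sire_first_tag;dam_first_tag;\n"
    "E1;BUCK;;TST;;;\n"
    "E3;DOE;1999-09-01;TST;E1;E2;\n"
    "E4;BUCK;2000-12-01;TST;E1;UNSET;\n"
)


def read_download(text):
    """Parse a semicolon separated download into a list of dictionaries."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = next(reader)
    if header and header[-1] == "":
        header = header[:-1]         # trailing ';' gives an empty last column
    animals = []
    for row in reader:
        if row:                      # skip blank lines
            animals.append(dict(zip(header, row)))
    return animals


animals = read_download(sample)
sires = {a["first_tag"]: a["sire_first_tag"] for a in animals}
```

The `sires` dictionary maps each animal to its sire tag, which is the kind of lookup a local pedigree or breeding value calculation would start from.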


Appendix D

Linear Regression


Chapter 2 mentioned linear regression and then moved on. This appendix discusses the topic further.

The covariance over the variance gives the regression coefficient, which is the gradient of a line fitted to the data using least squares. Linear regression can be explained using linear algebra, statistics or calculus. All three methods provide some insight.

D.1 Calculus

A line has the formula:

\[ y = bx + a \tag{D.1} \]

To fit the line you alter a and b until the predicted value is as close to the actual value as possible. The difference between the predicted and the actual value is the residual. To find the minimum, you need a formula that tells you how the sum of the squared residuals changes over the data set as a and b vary.

For a point the residual is:

\[ r_i = y_i - f(x_i) \tag{D.2} \]

You can square both sides.

\[ r_i^2 = (y_i - f(x_i))^2 \tag{D.3} \]

This gives a quadratic surface with the sum of the squared residuals on one axis and the values of a and b on the others. A quadratic surface has a minimum or a maximum when the derivative is zero. We want to find the minimum as we alter a and b. We take the partial derivatives with respect to a and b and solve the resulting simultaneous equations.

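\[ \frac{\partial (r^2)}{\partial a} = 2r\,\frac{\partial r}{\partial a} \]
\[ r = y_i - b x_i - a \]
\[ \frac{\partial r}{\partial a} = -1 \]
\[ 0 = 2r\,\frac{\partial r}{\partial a} \]
\[ 0 = -2(y_i - b x_i - a) \tag{D.4} \]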


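\[ \frac{\partial (r^2)}{\partial b} = 2r\,\frac{\partial r}{\partial b} \]
\[ r = y_i - b x_i - a \]
\[ \frac{\partial r}{\partial b} = -x_i \]
\[ 0 = 2r\,\frac{\partial r}{\partial b} \]
\[ 0 = -2(y_i - b x_i - a)(x_i) \tag{D.5} \]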

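Formulas D.4 and D.5 apply to a single point. If we sum over the whole data set and put the result in matrix form, we get:

\[ \begin{bmatrix} \sum x_i^2 & \sum x_i \\ \sum x_i & n \end{bmatrix} \begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} \sum x_i y_i \\ \sum y_i \end{bmatrix} \tag{D.6} \]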

Solving for a

\[ a = \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} \tag{D.7} \]

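Dividing the numerator and denominator by n, and writing \bar{x} = \sum x_i / n and \bar{y} = \sum y_i / n for the means, gives you:

\[ a = \frac{\bar{y} \sum x_i^2 - \bar{x} \sum x_i y_i}{\sum x_i^2 - n\bar{x}^2} \tag{D.8} \]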

Solving for b

\[ b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} \tag{D.9} \]

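Dividing the numerator and denominator by n, and writing \bar{x} and \bar{y} for the means of x and y, gives you:

\[ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \tag{D.10} \]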

It is important to note:

1. That normal distributions have nothing to do with this, as we are simply fitting a function using least squares.

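2. That least squares works out well because the sum of the squared residuals is a quadratic function of a and b, so it has a single minimum, found where the first derivatives are zero.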

3. That b is the gradient of a line and a is where the line cuts through the y axis when x is zero.
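The closed-form solution of the matrix equation D.6 is easy to check numerically. The Python sketch below is illustrative only; the function name `fit_line` and the four data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares intercept a and gradient b from the sums in D.6."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx          # zero when all x values are equal
    if det == 0:
        raise ValueError("x values must not all be equal")
    a = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b


# Four points that lie exactly on y = 2x + 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

For points that lie exactly on a line, the fit recovers that line, so here a is 1 and b is 2.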


D.2 Statistics

Variance was introduced into the lexicon by Fisher in his 1918 paper (Lynch & Walsh, 1997) and is defined as:

\[ \sigma_{xx} = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1} \tag{D.11} \]

Covariance is defined as:

\[ \sigma_{xy} = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{n - 1} \tag{D.12} \]

The regression coefficient is defined as:

\[ b = \frac{\sigma_{xy}}{\sigma_{xx}} \tag{D.13} \]

Using D.11, D.12 and D.13:

\[ b = \frac{\left(\sum x_i y_i - n\bar{x}\bar{y}\right)/(n - 1)}{\left(\sum x_i^2 - n\bar{x}^2\right)/(n - 1)} \tag{D.14} \]

After simplification:

\[ b = \frac{\sum x_i y_i - n\bar{x}\bar{y}}{\sum x_i^2 - n\bar{x}^2} \tag{D.15} \]

Which gets us back to D.10.
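The cancellation of the divisor can be confirmed numerically. In the Python sketch below, the function name `regression_coefficient`, its `divisor_offset` parameter and the data are hypothetical; the formulas are those of D.11, D.12 and D.13.

```python
def regression_coefficient(xs, ys, divisor_offset=1):
    """Covariance over variance; both are divided by n - divisor_offset."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    divisor = n - divisor_offset
    var_x = (sum(x * x for x in xs) - n * mean_x ** 2) / divisor
    cov_xy = (sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y) / divisor
    return cov_xy / var_x


xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.0, 7.0, 9.0]
b_sample = regression_coefficient(xs, ys, 1)       # divide by n - 1
b_population = regression_coefficient(xs, ys, 0)   # divide by n
```

Both calls return the same gradient, because the divisor appears in the numerator and the denominator of the ratio.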

Two important points to note are:

1. The regression coefficient is nothing more than the gradient of the line fitted using least squares.

2. Statisticians tell us to use n − 1 to calculate the sample standard deviation and n for the population standard deviation. All that matters in this application is consistent use, because the value used cancels out.¹

¹ There is good reason to use n − 1 when calculating the standard deviation: we are working with differences, and the number of differences is one less than the number of points. We are still dealing with differences when dealing with the entire population, so mere mortals should still use n − 1; statisticians, being the high priests of science, are communicating with god to obtain prior knowledge, so they can use n.


D.3 Normal equation

Figure D.1: Projecting a 3D point onto a 2D plane

If you think in terms of subspaces, the derivation of the normal equations is straightforward. The matrix equation we wish to solve is:

\[ y = X\beta + e \]

y is a column vector in a hyperspace; the number of rows gives the dimension. Xβ is a column vector that lies in a subspace of the hyperspace. The dimension of the subspace is determined by the number of column vectors in X (figure D.1 shows a plane defined by two column vectors in X; you only need two column vectors because a subspace has to go through zero). e is the column vector you have to add onto Xβ to get back to y. If e is orthogonal to the subspace, and so to every column of X, then e is as short as possible and its dot product with each column of X is zero, or using matrix notation:

\[ X^T e = 0 \]

Looking at figure D.1 we can see:

\[ e = y - X\beta \]

So:

\[ X^T (y - X\beta) = 0 \]

Rearrange and you have the normal equation:

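\[ \begin{aligned}
X^T y - X^T X \beta &= 0 \\
X^T y &= X^T X \beta \\
(X^T X)^{-1} X^T y &= (X^T X)^{-1} X^T X \beta \\
(X^T X)^{-1} X^T y &= I\beta \\
\beta &= (X^T X)^{-1} X^T y
\end{aligned} \]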
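For a straight line, X has a column of x values and a column of ones, so X^T X is the 2 by 2 matrix of D.6 and the normal equation can be solved by hand. The Python sketch below does this and then checks that the residual vector is orthogonal to both columns of X. The function names and the data are hypothetical.

```python
def normal_equations(xs, ys):
    """Solve (X^T X) beta = X^T y for X with columns [x, 1]."""
    n = len(xs)
    # Entries of the 2x2 matrix X^T X and of the vector X^T y.
    sxx = sum(x * x for x in xs)
    sx = sum(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sy = sum(ys)
    det = sxx * n - sx * sx
    # Inverse of the 2x2 matrix applied to X^T y.
    b = (n * sxy - sx * sy) / det
    a = (sxx * sy - sx * sxy) / det
    return b, a


def residuals(xs, ys, b, a):
    """e = y - X beta."""
    return [y - (b * x + a) for x, y in zip(xs, ys)]


xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.0, 7.0, 9.0]
b, a = normal_equations(xs, ys)
e = residuals(xs, ys, b, a)
# X^T e has one entry per column of X: the x column and the ones column.
xt_e = [sum(x * r for x, r in zip(xs, e)), sum(e)]
```

Both entries of `xt_e` are zero to within rounding error, which is the orthogonality condition the derivation starts from.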
