
PREDICTIVE HYBRID RICE BREEDING USING GENOMIC SELECTION

AND ITS INTEGRATION INTO RICE BREEDING PROGRAMS

USING RESEARCH MANAGEMENT APPROACHES

TAMERLANE MARK SIARON NAS

SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

UNIVERSITY OF THE PHILIPPINES LOS BAÑOS

IN PARTIAL FULFILLMENT OF THE

REQUIREMENTS FOR

THE DEGREE OF

DOCTOR OF PHILOSOPHY

(Genetics)

JUNE 2016


BIOGRAPHICAL SKETCH

The author has been a rice researcher at Syngenta, a leading global crop solutions and crop

biotechnology company, serving as Genetics Project and Molecular Breeding Lead since 2014.

Previously, he was involved in hybrid rice breeding at DuPont Pioneer (2006-2014) and at

the International Rice Research Institute (1999-2006), and in fruit crops breeding at the

Institute of Plant Breeding in UPLB (1997-1999). His current interests and research focus

are increasing genetic gain, rice heterotic pools and associated methods such as genomic

selection and phenotypic analysis. He also has an interest in the application of good

leadership and management principles in breeding programs.

He graduated with a B.S. Biology degree major in Genetics from UPLB in 1997.

In 2003, he received his M.Sc. degree in Genetics also from UPLB with a minor in Plant

Breeding, and was a DOST scholar.

The author is the eldest among five children of Mr. Felicito C. Nas and Mrs. Loida

Siaron Nas of Polangui, Albay. He is married to Mrs. Gretchen Ocampo Nas, and blessed

with one daughter, Rebekah Ysabelle.

TAMERLANE MARK S. NAS


ACKNOWLEDGEMENTS

I wish to share this very significant milestone of my career with the following

individuals and institutions, and honor them for their invaluable contributions to this work.

First, I thank my academic advisers, Dr. Jose E. Hernandez, Dr. Merlyn S. Mendioro, Dr.

Consorcia E. Reaño, Dr. Ma. Genaleen Q. Diaz and Dr. Mimosa C. Ocampo, for their

guidance since the beginning of my graduate program and through the conduct of this

research study. Syngenta was very generous in providing financial support in every aspect

of this work. Dr. John de Leon, Dr. Manny Logroño and Dr. Harish Gandhi, my

Syngenta superiors, supported my career development and allowed me to take this

graduate program on top of my responsibilities. Dr. Suresh Kadaru, my Syngenta

colleague, provided tremendous help in many areas of this study, particularly the marker

work. Dr. Nonoy Bandillo of University of Nebraska at Lincoln and Dr. Franco Asoro

of Iowa State University were very thorough in sharing their knowledge on various

strategies in genomic selection and the programming code used in this study. Syngenta’s

Trialing Team provided excellent legwork in conducting the field trials, while the HMU

Team performed a great job in producing the hybrid seeds. My colleagues in Seed Product

Development also deserve a toast to teamwork for creating a rewarding workplace

that was conducive to this study. I want to thank my former team at DuPont Pioneer:

Jahleel Acedo Mendoza, Jomar Punzalan, Gelo Fontanilla, Nerio Camposano, Jessie

Fernandez, Jr. and Herson Arcilla for our shared loyalty to integrity in conducting

research. My former superiors, Dr. Dennis Byron and Mr. Emmanuel Serrano (DuPont

Pioneer), Dr. Sant S. Virmani (IRRI) and Dr. Violeta N. Villegas (IPB), also influenced

my decisions regarding my career by serving as role models and inspirations. Dr. Conrad

Balatero, Dr. Glenn Gregorio, Dr. Bert Collard and Dr. Emma Sales served as mentors

at various points of my professional life. My spiritual family, Victory General Santos,

kept me grounded on what is essential in life. My wife Gretchen and daughter Rebekah

Ysabelle were always there to give their love, support, and understanding. Finally to Jesus

Christ, my Lord and Savior for providing all these wonderful people, to You be the glory

and honor.


TABLE OF CONTENTS

IPR PAGE

TITLE PAGE

APPROVAL PAGE

BIOGRAPHICAL SKETCH

ACKNOWLEDGEMENTS

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF APPENDICES

LIST OF ACRONYMS

ABSTRACT

CHAPTER 1. INTRODUCTION

CHAPTER 2. REVIEW OF LITERATURE

Increasing Genetic Gain

Yield Trialing and Phenotypic Analysis

Achieving High Heritability in Field Trials

Best Linear Unbiased Prediction (BLUP)

Use of General Combining Ability in Hybrid Breeding

Marker-Aided Selection

Current use of markers in rice breeding

Mapping Quantitative Trait Loci


Limitations of Traditional MAS

Linkage disequilibrium-based mapping

Genomic Selection

Statistical Models of Estimating GEBV

Limitations of Stepwise Regression Models

Ridge Regression BLUP

Bayesian Methods

Kernel and Machine Learning Methods

Accuracy of Genomic Selection

Research Management of Plant Breeding Programs

Knowledge, Experience and Skill Requirements from Plant Breeders

Breeding Programs as Part of Meta-Organizations

Structure of Research Organizations

Introducing Change into Breeding Organizations

CHAPTER 3. MATERIALS AND METHODS

Phenotyping and phenotypic analysis

Genotyping

Preparation and Processing of Tissue Samples

Quality Filtering and Reformatting of SNP Markers

Estimation of Genetic Relationships

Implementing Genomic Selection Models

Design of Training and Validation Populations


Procedure for Cross Validation

Comparison of Prediction Accuracies

Optimization of Genomic Selection Parameters

Creating a Genomic Selection Project Proposal

Declaration of Research Funding and Non Conflict of Interest

CHAPTER 4. RESULTS AND DISCUSSION

Quality of Field Trial Data

By-Location Coefficients of Variation

Distributions of Yield, Days to 50% Flowering and Plant Height

Distribution of Hybrids Across Locations

Analysis of Multiple Locations

Variance Components and Computed Trait Heritabilities

Deriving BLUPs: Fitting Linear Mixed Models

Shrinkage Towards the Mean

Deriving General Combining Ability

Marker Coverage and Population Structure

Descriptive Statistics on SNP Marker Data

Genomic Relationships and Principal Components

Evaluation of Genomic Prediction Methods

Genomic BLUP (GBLUP) and Ridge Regression

Effect of Trait Heritability on Prediction Accuracy

Effect of Training Population Size on Prediction Accuracy


Effect of Genomic Selection Model on Prediction Accuracy

Correlations of the Different Genomic Selection Models

Population Structure as Covariate


Whole Model Test for the Generalized Linear Model

Effect Summary and Effect Tests

Prediction Profiles and Application to Breeding Programs

Integrating Genomic Selection into Hybrid Rice Breeding Programs

Assumptions on the Hypothetical Breeding Program

Rationale on Increasing the Effectiveness of Breeding Programs

Objectives of the Project Being Proposed

Stakeholder Analysis

Problem Analysis

Project Planning Matrix

Implementation Schedule

Management Arrangements

Budgetary Requirements

Recommendations for Inbred Rice Breeding Programs

CHAPTER 5. SUMMARY AND CONCLUSION

Usefulness of Genomic Selection

Optimizing Genomic Selection Procedures


Implementing Genomic Selection through a Research Management Approach

LITERATURE CITED

APPENDICES


LIST OF TABLES

TABLE

1    Full factorial design for optimization of genomic selection parameters.
2    Variance components of the phenotypes yield, days to 50% flowering and plant height derived from REML-fitted linear mixed models, and computed heritabilities. Variance components, except genotypic variance, were divided by the number of levels per source of variation.
3    LSD threshold matrix comparing prediction accuracy means of GBLUP and Ridge Regression for GCA of 122 rice parental lines for three traits.
4    HSD threshold matrix of genomic selection accuracy means between all pairs of traits.
5    HSD threshold matrix of prediction accuracies between two cross validation methods or training population size.
6    HSD threshold matrix among prediction accuracy means between all pairs of genomic selection models.
7    Spearman’s rank correlation coefficient between pairs of genomic selection models across all traits.
8    HSD threshold matrix of prediction accuracy means between subpopulations and mixed population.
9    HSD threshold matrix of prediction accuracy means between pairs of genomic selection models using subpopulation prediction.
10   Full factorial design of genomic selection accuracy means, heritability, training population size and genomic selection model used in optimization.
11a  Whole model test for the generalized linear model created to optimize genomic selection.
11b  Goodness of fit test for the generalized linear model created to optimize genomic selection.
12   LogWorth, FDR LogWorth and FDR p-values of main effects and interaction in the generalized linear model.
13   Effects Test of main effects and interactions in the generalized linear model.
14   Operational considerations of a hypothetical hybrid rice breeding program using DH as a means of rapid inbred production.
15   Stakeholder map of the project summarizing the concerns of the most relevant stakeholders in project implementation.
16   Project planning matrix on increasing genetic gain of breeding programs by integrating genomic selection.
17   Project Gantt chart showing milestones in project implementation.
18   Service provided to breeders by breeding program support staff.
19   Cost comparison between breeding programs with and without genomic selection.
20   Projected budget of integrating genomic selection over a ten-year period.

LIST OF FIGURES

FIGURE

1    A general genomic selection scheme applied to a breeding population showing the partitioning of the population into training and prediction sets. Only the training set is phenotyped (e.g. yield trials) instead of the whole population.
2    A diagram of a matrix organization with product delivery managed as projects across functions.
3    Cross validation procedure for ten-fold and three-fold schemes, representing 0.9 and 0.667 proportions of training population size. Each partition was successively used as validation population from the prediction model derived from the training population.
4    Box plots of coefficients of variation of 332 locations grouped into seasons. Connecting letter report was derived from comparison of means using Student’s t-test, α=0.05.
5    Trait x location box plots used in checking location data for quality showing the spread of data points per location. Unexplained data points in each location were discarded.
6    Distributions of yield, days to flowering (DTF) and plant height across locations planted in wet and dry seasons.
7    Heat map showing distribution of genotypes across locations. Male and female parents are also shown, as represented by hybrid progenies and not as tested genotypes.
8    Observed and adjusted values (BLUPs) for yield, days to 50% flowering and plant height of 510 genotypes showing shrinkage of adjusted means toward the analysis mean.
9    General combining ability (GCA) for yield (kg), DTF (days) and plant height (cm) of parental lines.
10   Distribution of {-1,0,1} allele calls in the marker matrix derived from scores from 43,344 SNP loci.
11   Heat map of realized relationship matrix of 122 parental lines showing three main clusters representing one female and two male clusters.
12   Principal component analysis using marker data: (a) highest two principal components in 2D plot, (b) highest three principal components in 3D plot, (c) scree plot showing magnitude of eigenvalues and variance explained by the principal components, and (d) summary table of eigenvalues.
13   Correlation of GEBVs in RR-BLUP using marker data directly (Ridge Regression) and genomic relationships (GBLUP).
14   Variability chart of prediction accuracies per trait in 122 rice parental lines. First and second level factors were interchanged between the two graphs.
15   Box plots of prediction accuracy per trait in 122 rice parental lines and comparison circles based on Tukey’s honest significant difference test.
16   Variability chart of prediction accuracy per cross validation method. The first and second level factors were interchanged between charts.
17   Box plots of prediction accuracy per cross validation method (training population size) and comparison circles based on Tukey’s honest significant difference test.
18   Variability chart of prediction accuracy per genomic selection model. The first and second level factors were interchanged between charts.
19   Box plots of prediction accuracy per genomic selection model and comparison circles based on Tukey’s honest significant difference test.
20   Scatterplot matrices of correlations between pairs of genomic selection models for all traits and each trait individually.
21   Box plots of prediction accuracy of overall means using whole population and subpopulations 2 and 3 jointly, and comparison circles based on Tukey’s honest significant difference test.
22   Variability chart for prediction accuracy showing mean differences for mixed population (All) and subpopulation (Subpop) predictions for each genomic selection model per trait.
23   Box plots of prediction accuracy of GS model means using subpopulations 2 and 3 jointly, and comparison circles based on Tukey’s honest significant difference test.
24   Contour map showing general trend of relationship between genomic selection accuracy, heritability and training population size.
25   Prediction profiles of selected combinations of variables in the genomic selection model: (A) Low heritability and small training population size, (B) Low heritability and large training population size, (C) High heritability and large training population size, and (D) High heritability and small training population size.
26   A scheme for hybrid breeding program using a reciprocal recurrent selection that creates 10,000 new inbreds and 10,000 new hybrids every breeding cycle.
27   Problem tree diagram showing some causes and effects of low rate of genetic gain in breeding programs.
28   Life cycle of products from a breeding program that releases one new hybrid every year. Monitoring of objectively verifiable indicators as described in the project planning matrix is shown by the arrows.
29   Organization structure of the hypothetical breeding program in which genomic selection is to be applied in the project proposal.
30   A breeding scheme with full integration of genomic selection showing the proportions of tested and predicted inbred GCAs.
31   Breeding schemes for inbred rice development with and without genomic selection. Genomic selection can drastically reduce trial plots. In these schemes, testcrossing is not required.

LIST OF APPENDICES

APPENDIX

A    Sample script for deriving BLUPs and GCAs implemented in R.
B    Sample script for predicting phenotypes using RR-BLUP implemented in R.
C    Sample script for predicting phenotypes using Bayesian Ridge Regression implemented in R.
D    Sample script for predicting phenotypes using Bayesian CPi implemented in R.
E    Sample script for predicting phenotypes using Bayesian Lasso implemented in R.

LIST OF ACRONYMS

BC       Backcross
BLB      Bacterial Leaf Blight
BLUE     Best Linear Unbiased Estimates
BLUP     Best Linear Unbiased Prediction
COGS     Cost of Goods
CV       Coefficient of Variation
DA       Department of Agriculture
DH       Doubled Haploids
DNA      Deoxyribonucleic Acid
DTF      Days to 50% flowering
EBV      Empirical Breeding Value
FDR      False Discovery Rate
GBLUP    Genomic Best Linear Unbiased Prediction
GCA      General Combining Ability
GEBV     Genome Estimated Breeding Values
GLM      Generalized Linear Model
HRCP     Hybrid Rice Commercialization Program
LASSO    Least Absolute Shrinkage and Selection Operator
LSD      Least Significant Difference
MAS      Marker Assisted Selection
OFTD     On-Farm Techno-Demo
QTL      Quantitative Trait Loci
REML     Residual Maximum Likelihood
RFLP     Restriction Fragment Length Polymorphism
RIL      Recombinant Inbred Lines
RR-BLUP  Ridge Regression Best Linear Unbiased Prediction
SCA      Specific Combining Ability
SNP      Single Nucleotide Polymorphism
TBV      True Breeding Value

ABSTRACT

NAS, TAMERLANE MARK SIARON. University of the Philippines Los Baños, June

2016. Predictive Hybrid Rice Breeding Using Genomic Selection and Its Integration

into Rice Breeding Programs Using Research Management Approaches.

Major Professor: Jose E. Hernandez, Ph.D.

This is the first research work on genomic selection in hybrid rice. Genomic selection is a

new procedure in crop breeding that uses genotype data from a large set of random markers

across the genome to predict phenotype from marker data alone. The genomic prediction

model is trained on a population with both phenotype and genotype data.

The accuracy of genomic selection in predicting general combining ability (GCA)

for the quantitative traits yield, days to 50% flowering and plant height was assessed. These

traits had computed heritabilities of 0.3130, 0.5036 and 0.5486, respectively. GCAs of 122

parental lines were computed from yield trials and historical data using Best Linear

Unbiased Predictions (BLUPs). Concurrently, the 122 parental lines were fingerprinted

using 60,000 SNP markers, resulting in 43,344 high-quality SNP loci. Principal component

analysis and the realized relationship matrix revealed a population structure with one

female cluster (CMS) and two male clusters (restorers). Four genomic selection models

were explored: Ridge Regression BLUP (RR-BLUP), Bayesian Ridge Regression

(BayesRR), Bayesian C Pi (BayesCPi) and Bayesian Lasso (BayesL). In addition, training

population sizes of 0.667 and 0.9 as proportions of overall population size were used.

Marker densities were not considered as these were not relevant for chip-based marker

platforms.

Overall prediction accuracy was significantly influenced by trait heritability and

training population size, but not by genomic prediction model. Plant height had the highest

mean prediction accuracy (0.59) while yield had the lowest (0.31). The larger training

population size (0.9) had a higher mean prediction accuracy (0.47) than the smaller

training population size (0.667), which had a mean accuracy of 0.36. There were no significant

differences among the prediction accuracy means for the four genomic selection models.

To account for population structure, prediction was done on pooled members of the

two clusters representing restorer lines, excluding the cluster with CMS lines. Mean

accuracy of within-subpopulation prediction was 0.52, compared to 0.45 obtained with

whole-population predictions. Prediction accuracy increased for all traits, particularly yield,

which rose to 0.38.

A generalized linear model was used to create prediction profiles for various

combinations of trait heritability, training population size and genomic selection model.

The prediction profiles can simulate heritability and training population size values not

included in the experiment, which can be used to optimize genomic selection parameters.

We propose the introduction of genomic selection into a crop breeding program

using a research management approach by identifying and analyzing the problem (low

genetic gain), identifying impact to stakeholders, and proposing a project that will

implement genomic selection proofs of concept and training for breeders. Proofs of

concept can be completed within two years while full scale-up of genomic selection to

steady state can be attained in six years. Genomic selection is projected to save 32% of

breeding program resources by substituting DNA fingerprints for phenotype plot data.

CHAPTER 1

INTRODUCTION

Plant breeding programs in many research institutions and companies worldwide

are constantly being upgraded to maximize genetic gain. These programs incorporate

schemes that increase response to selection such as improving the accuracy of phenotyping,

increasing selection differential, and reducing cost and breeding cycle time. Response to

selection has been a key driver of decisions in product development strategies particularly

in the private industry. Decision-makers involved in setting the general direction of plant

breeding programs realize that frontloading investments are necessary to answer key

questions and establish optimized schemes that will ultimately increase genetic gain in

routine breeding processes.

Efforts to implement more efficient breeding programs by addressing the associated

factors of the breeding process are part of a coordinated approach of plant breeding and

other fields such as social sciences to the greater issue of global food security. The global

food security outlook for the 21st century is not very bright. FAO (2009) estimates that the

world’s population will reach 9 billion by 2050, with 90% of the growth occurring in the developing

world. Consequently, cereal production needs to increase by 50%, from 2.1 to 3.0 billion

tons; 80% of the increase in food production is projected to come from increases in yields

and cropping intensity. Plant breeding programs should exert constant intervention by

improving crop yields per unit area and unit time, a task that presents a formidable

challenge to plant breeders.

Conventional plant breeding proved to be a cutting-edge tool of its time in combating

an impending worldwide famine in the middle of the 20th century. The Green Revolution

transformed agriculture worldwide with an increase in world grain production by over

250% from 1950 to 1984 (Kindall and Pimentel, 1994). With the current increasing global

trends in population, shortage of arable land, and decrease in yield gains, plant breeding

will once again address this challenge. But it will do so in a better capacity because plant

breeders have at their disposal new tools such as molecular markers. The challenges now

will not be solved by the same conventional technologies that brought us the Green

Revolution. In rice, molecular markers have been documented to dramatically simplify

breeding (Collard et al., 2008). An excellent example of high-impact marker-enhanced

breeding is the development of submergence tolerant rice varieties (Septiningsih et al.,

2009).

However, molecular markers in rice breeding have been used mostly for major

genes with large effects. Small-effect quantitative trait loci (QTL) governing extremely

complex traits that are agronomically important such as yield are currently being assessed

by phenotyping in multiple locations. This adds constraints to plant breeding research in

developing countries, which is largely underfunded. Phenotyping, or conducting multi-

environment yield trials for example, is perhaps the most expensive component of a

breeding program.

In large agri-biotech companies, a significant percentage of resources and funding

of local breeding programs are devoted to multi-location testing. In corn breeding

programs, breeders evaluate tens of thousands of doubled haploid (DH) lines and

testcrosses every cropping season in multiple locations in a target population of

environments. In a purely phenotype-based evaluation system, this will translate to tens of

thousands of yield plots. The capacity to test genotypes is usually the baseline from which

logistical decisions are made in a local breeding program, such as number of populations

or families, number of inbreds or DHs per family, number of testers and number of

locations. It is therefore important for breeding programs to devise and adopt strategies that

will address the factors that contribute to genetic gain through efficient evaluation of

genotypes.

The general objective of this research was to evaluate the use of genome-wide

markers in augmenting phenotyping of quantitative traits that are most important in hybrid

rice breeding by predicting breeding values, and initially explore the use of genomic

prediction models in a hypothetical breeding program.

The specific objectives of this research were as follows:

1. Investigate the usefulness of whole-genome markers in generating genomic

prediction models for yield, plant height and days to flowering in hybrid rice

breeding by conducting multi-location trials and augmenting the data generated

with existing trial datasets.

2. Compare the adequacy of different statistical models for genomic selection of

quantitative traits across different types of training and validation populations,

and create an optimized model for various combinations of heritability, training

population size and genomic selection model.

3. Propose a research management-based approach in introducing genomic selection

into existing breeding programs.

This research was conducted in various Syngenta locations. Genotyping service

was provided by Syngenta's high-throughput SNP marker facilities located in Toulouse,

France. Field trials were conducted in various locations throughout the Philippines.

CHAPTER 2

REVIEW OF LITERATURE

Increasing Genetic Gain

A breeding program's year-over-year success is usually measured by genetic gain.

Genetic gain (G), broadly defined as the increase in performance through artificial genetic

improvement programs, is positively related to the standardized selection differential

(i), accuracy of selection (rA), broad-sense heritability (H) and phenotypic standard

deviation (σP). In the seed industry, cost (c) and time (t) are considered in the area of

resource management and are used in estimating return on research investment, as shown

in the equation below.

$$G = \frac{i \, r_A \sqrt{H} \, \sigma_P}{t \, c}$$

Improvements in breeding processes typically address these factors. Use of markers

to screen for traits increases the mean of selected individuals and therefore increases

standardized selection differential. In most cases, such as yield trials, testing under

conditions that simulate the target environment ensures high correlation between testing

and target environments and therefore increases phenotyping accuracy. Optimization of

trial designs by improving field techniques to minimize within-location and across-location

errors increases broad sense heritability. A carefully-planned breeding scheme takes into

account the need for necessary variation, and would address phenotypic standard deviation.
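
To make the equation concrete, the following minimal Python sketch evaluates G for a set of hypothetical inputs. The function name `genetic_gain`, its argument names and all numeric values are illustrative assumptions and are not taken from this study. It only shows that, with the other factors held constant, halving the cycle time doubles G.

```python
import math


def genetic_gain(i, r_a, H, sigma_p, t, c):
    """Evaluate G = i * r_A * sqrt(H) * sigma_P / (t * c).

    i: standardized selection differential
    r_a: accuracy of selection
    H: broad-sense heritability (0 to 1)
    sigma_p: phenotypic standard deviation (trait units)
    t: breeding cycle time, c: cost (both must be positive)
    """
    if t <= 0 or c <= 0:
        raise ValueError("cycle time and cost must be positive")
    if not 0 <= H <= 1:
        raise ValueError("heritability must lie between 0 and 1")
    return i * r_a * math.sqrt(H) * sigma_p / (t * c)


# Hypothetical inputs, for illustration only.
baseline = genetic_gain(i=1.755, r_a=0.8, H=0.31, sigma_p=1000.0, t=4.0, c=1.0)
faster = genetic_gain(i=1.755, r_a=0.8, H=0.31, sigma_p=1000.0, t=2.0, c=1.0)

# Ratio of gain after halving the cycle time to the baseline gain.
print(faster / baseline)
```

The same function can be used to compare other levers, for example raising H through better trial design against shortening t, under whatever values a program considers realistic.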

The easiest genetic gain component to manipulate so far is breeding cycle time. In

a hybrid breeding program based on heterotic pools, the most critical stage is to testcross

new inbreds. In many seed companies with an industrialized inbred production system,

new inbreds are generated in a sufficiently rapid manner, hence hybrid crop breeders would

generally focus more on evaluating testcrosses rather than evaluating segregating breeding

populations. To rapidly generate inbreds for testcrossing, a number of approaches have

been incorporated into breeding programs worldwide such as rapid generation advance

(Ikehashi and HilleRisLambers, 1977) and doubled haploid technology (Maluszinski et al.,

2003). A breeding program that uses doubled haploids can provide thousands of

homozygous lines for testcrossing per year.

Yield Trialing and Phenotypic Analysis

Phenotyping has been the cornerstone of the numerous plant breeding success

stories, and will continue to be so in the future. Yield is arguably the most important crop

trait. It is also the most evaluated, yet it has one of the lowest heritabilities (Teich, 1984;

Gomez and Gomez, 1984). This is mainly due to genotype x environment interactions or

GxE (Horner and Frey, 1957; Fox and Rosielle, 1982), spatial trends in the field (Vollmann

et al., 1996) and extraneous errors associated with experimental procedure (Gilmour et al.,

1997).

Fisher (1930) first proposed the decomposition of phenotypic variance in his classic

book “The Genetical Theory of Natural Selection,” a concept further elaborated in succeeding

plant breeding textbooks (Falconer, 1960; Lynch and Walsh, 1998; Bernardo, 2010). The

general model for phenotypic variance is:

$$\sigma_P^2 = \sigma_G^2 + \sigma_{G \times E}^2 + \sigma_e^2$$

The variance due to environment ($\sigma_E^2$) is held at zero in the context of plant breeding

because genotypes are assumed to be tested in similar environments. Genetic variance ($\sigma_G^2$)

can further be decomposed into additive variance ($\sigma_A^2$), dominance variance ($\sigma_D^2$) and

epistatic variance ($\sigma_I^2$).

Achieving High Heritability in Field Trials

Heritability can provide an estimate of the quality of a field trial by comparing the

realized heritability with published values from other experiments. Broad sense heritability

(H) is defined as the proportion of the phenotypic variance that is explained by genetic

variance:

$$H = \frac{\sigma_G^2}{\sigma_G^2 + \sigma_{G \times E}^2 + \sigma_e^2}$$

Heritability is also defined as the regression coefficient of G on P:

$$\beta_{G,P} = \frac{\mathrm{cov}(G, P)}{\sigma_P^2} = \frac{\sigma_G^2}{\sigma_G^2 + \sigma_{G \times E}^2 + \sigma_e^2} = H$$

For most quantitative traits such as yield, plant breeders are more interested in the

narrow sense heritability ($h^2$) because alleles and not genotypes are passed from parent to

progeny (Bernardo, 2010). Narrow sense heritability is the proportion of phenotypic variance

that can be attributed to additive genetic variance, and can be represented as the regression

of breeding values on phenotypic values:

$$\beta_{A,P} = \frac{\mathrm{cov}(A, P)}{\sigma_P^2} = \frac{\sigma_A^2}{\sigma_G^2 + \sigma_{G \times E}^2 + \sigma_e^2} = h^2$$

The variance of phenotypic means across 𝐽 environments and 𝑅 replications is

given by the formula:

$$\sigma_P^2 = \sigma_G^2 + \frac{\sigma_{G \times E}^2}{J} + \frac{\sigma_e^2}{JR}$$

The formula above reinforces the importance of locations and replications in field

trials because they reduce the effects of “masking” variances (Bernardo, 2010). Trials can

be established in multiple locations, in homogeneous environmental conditions and in

several replications. Increasing the number of replications reduces the contribution of $\sigma_e^2$,

while increasing the number of locations reduces the contributions of both $\sigma_{G \times E}^2$ and $\sigma_e^2$.
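
The effect of locations and replications can be illustrated numerically. The sketch below computes the variance of phenotypic means for given J and R and then divides the genetic variance by it. That ratio is the usual heritability on an entry-mean basis, which this section does not state explicitly and is assumed here. The function names and the variance components are hypothetical and chosen only for easy arithmetic.

```python
def variance_of_means(var_g, var_gxe, var_e, n_env, n_rep):
    """Variance of phenotypic means across n_env environments (J)
    and n_rep replications (R): var_G + var_GxE / J + var_e / (J * R)."""
    if n_env < 1 or n_rep < 1:
        raise ValueError("need at least one environment and one replication")
    return var_g + var_gxe / n_env + var_e / (n_env * n_rep)


def entry_mean_heritability(var_g, var_gxe, var_e, n_env, n_rep):
    """Genetic variance as a proportion of the variance of phenotypic means."""
    return var_g / variance_of_means(var_g, var_gxe, var_e, n_env, n_rep)


# Hypothetical variance components, for illustration only.
VAR_G, VAR_GXE, VAR_E = 1.0, 2.0, 4.0

for n_env, n_rep in [(1, 1), (1, 2), (4, 2), (8, 3)]:
    h = entry_mean_heritability(VAR_G, VAR_GXE, VAR_E, n_env, n_rep)
    print(f"J={n_env}, R={n_rep}: H on entry-mean basis = {h:.3f}")
```

Under these assumed components, adding replications within a single location only reduces the residual term, whereas adding locations reduces both the GxE and residual terms, which is the pattern described in the paragraph above.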

Best Linear Unbiased Prediction (BLUP)

BLUP was originally developed in animal breeding but did not gain wide

appreciation in plant breeding until recent years, particularly in annual crops. Piepho and

co-workers (2007) postulate that this is due to the large amount of phenotypic information

generated per genotype, so that best linear unbiased estimates (BLUE) and BLUP may not

necessarily be advantageous. Also in animal breeding, estimation procedures are necessary

due to lack of direct observations, as in the case of selecting bulls for milk yield in dairy

breeding. In fact, the first application of BLUPs was in dairy herds. Another reason is the

relative inaccuracy of genetic variance estimates in plants, due to limited number of

genotypes and more complex covariance structures.

Henderson (1949, 1950) first used linear mixed models in animal breeding by using

correction factors for estimating genetic improvement and predicting breeding values in

dairy herds. These concepts were subsequently proven mathematically as BLUEs

(Henderson et al., 1959) and BLUPs (Henderson, 1963) although it was not until 1973 that

the terms Best Linear Unbiased Estimates and Best Linear Unbiased Predictions were

coined (Henderson, 1973).

BLUP is a method for estimating random effects from the mixed model represented

by the equation:

𝑦 = 𝑋𝛽 + 𝑍𝑢 + 𝑒

where 𝑦 is a vector of phenotypic observations, 𝛽 is a vector of fixed effects, 𝑢 is a vector

of random effects, 𝑋 and 𝑍 are the associated matrices and 𝑒 is a vector of random residuals.

The fixed effects can be estimated by Best Linear Unbiased Estimates (BLUEs). The

random effects, assumed to have the distributions 𝑢~𝑀𝑉𝑁(0, 𝐺) and 𝑒~𝑀𝑉𝑁(0, 𝑅) where

𝑀𝑉𝑁(𝜇, 𝑉) denotes the multivariate normal distribution with mean vector 𝜇 and variance-

covariance matrix 𝑉, can be estimated by BLUPs (Piepho et al., 2007). Variance

components of 𝐺 and 𝑅 are estimated by statistical programs usually by using Residual

Maximum Likelihood (REML) proposed by Bartlett (1937) and first applied on estimating

components of variance in unbalanced data by Patterson and Thompson (1971).

A desirable property of BLUP is shrinkage towards the mean. In a plant breeding context, this shrinkage anticipates the regression of progeny toward the observed mean (Hill and Rosenberger, 1985). Shrinkage increases accuracy by reducing variance, resulting in smaller mean squared errors (MSE). Searle et al. (2006) reported that BLUP generally maximizes the correlation between true and predicted genotypic values, which is aligned with the phenotyping accuracy component of the genetic gain equation. BLUP can also exploit information from relatives by using genetic correlations arising from pedigrees and marker relationships.
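Shrinkage can be illustrated with the simplest case. The sketch below assumes balanced data, a single random genotype factor and known variance components, in which case the BLUP of a genotype effect is its deviation from the grand mean multiplied by a shrinkage factor. The function name and all numbers are hypothetical.

```python
def blup_balanced(genotype_means, var_g, var_e, n_rep):
    """BLUPs of genotype effects for balanced data with one random genotype factor."""
    grand_mean = sum(genotype_means) / len(genotype_means)
    # Shrinkage factor: genetic variance over the variance of a genotype mean.
    shrinkage = var_g / (var_g + var_e / n_rep)
    return [shrinkage * (m - grand_mean) for m in genotype_means]


# Hypothetical genotype means and variance components.
means = [4.0, 5.0, 6.0, 9.0]
blups = blup_balanced(means, var_g=1.0, var_e=2.0, n_rep=2)
print(blups)  # each deviation from the grand mean is halved (shrinkage = 0.5)
```

As the residual variance grows or the number of replications falls, the shrinkage factor moves toward zero and extreme genotype means are pulled more strongly toward the grand mean.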

Use of General Combining Ability in Hybrid Breeding

The concept of general combining ability (GCA) and specific combining ability

(SCA) was originally defined in corn by Sprague and Tatum (1942), in which they defined

GCA as “the average performance of a line in hybrid combinations,” and SCA as cases

wherein “certain combinations do relatively better or worse than would be expected on the

basis of the average performance of the lines involved.” An incomplete diallel that excludes

reciprocal crosses will have this model for the trait to be analyzed:

$$Y_{ijk} = \mu + g_i + g_j + s_{ij} + e_{ijk}$$

where 𝑔𝑖 and 𝑔𝑗 are the GCAs of parents i and j, and 𝑠𝑖𝑗 (= 𝑠𝑗𝑖) is the SCA of i x j cross.

If the parents are drawn from the same distribution, for example absence of heterotic

pooling and no gender limitations (e.g. male sterility), the total phenotypic variance is:

$$\sigma_P^2 = 2\sigma_{GCA}^2 + \sigma_{SCA}^2 + \sigma_e^2$$

However, in established hybrid crops such as corn, distinct heterotic pools exist,

while in hybrid rice, parental lines are usually classified into restorers and cytoplasmic

male-sterile (CMS) lines. The model can more clearly be represented as:

$$Y = \mu + GCA_{male} + GCA_{female} + SCA_{female \times male} + e$$
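The male and female model can be illustrated by decomposing a small table of hybrid means. This is a minimal sketch that assumes a complete factorial (every female crossed to every male) and treats the effects as simple deviations from marginal means. The parent names and values are hypothetical.

```python
from statistics import mean


def combining_abilities(hybrid_means):
    """Split hybrid means keyed by (female, male) into mu, GCA and SCA effects."""
    females = sorted({f for f, _ in hybrid_means})
    males = sorted({m for _, m in hybrid_means})
    mu = mean(hybrid_means.values())
    # GCA: average performance of a parent across its hybrids, relative to mu.
    gca_f = {f: mean(hybrid_means[(f, m)] for m in males) - mu for f in females}
    gca_m = {m: mean(hybrid_means[(f, m)] for f in females) - mu for m in males}
    # SCA: what remains of a hybrid mean after mu and both GCAs are removed.
    sca = {(f, m): y - mu - gca_f[f] - gca_m[m] for (f, m), y in hybrid_means.items()}
    return mu, gca_f, gca_m, sca


# Hypothetical hybrid means for two CMS lines (F1, F2) and two restorers (M1, M2).
table = {("F1", "M1"): 6.0, ("F1", "M2"): 8.0, ("F2", "M1"): 5.0, ("F2", "M2"): 5.0}
mu, gca_f, gca_m, sca = combining_abilities(table)
print(mu, gca_f, gca_m, sca)
```

In practice these effects would be estimated from a mixed model with replicated, often unbalanced data; the marginal-mean version only shows how each hybrid mean is partitioned.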

GCAs are often assumed to be similar to breeding values in hybrid breeding; hence, additive variance is the component of genetic variance most relevant to hybrid breeders. In reciprocal recurrent selection schemes, alleles are continuously accumulated every breeding cycle, which is very important in increasing the rate of genetic gain. The use of GCA and SCA has been further elaborated by Comstock and co-workers (1949).

Marker-Aided Selection

Current Use of Markers in Rice Breeding

Markers have been a key component of rice breeding programs in the last few

decades. Improvements in marker technologies from the first use of isozyme markers

(Tanksley and Rick, 1980) and restriction fragment length polymorphisms or RFLPs

(Beckmann and Soller, 1986) to the current technologies such as single nucleotide

polymorphisms (SNPs) and genotyping by sequencing (GbS) have increased the

popularity, accuracy and usefulness of MAS particularly for simple, monogenic traits, as

well as assessment of germplasm diversity (Virk et al., 1996).

In rice breeding, markers have been generally used to screen for characters

controlled by major genes in inbred parental lines. Such characters include fertility

restoration (Nas et al., 2000; Sattari et al., 2007), bacterial leaf blight (Zhai et al., 2002;

Shanti et al., 2010; Basavaraj et al., 2010), rice blast (Singh et al., 2012) and

thermosensitive genetic male sterility (Nas et al., 2005). The most common traits on which

markers were used in hybrid rice breeding are disease resistance (bacterial leaf blight and

blast). Mackill (2007), Collard et al. (2005) and Collard et al. (2008) provided

comprehensive reviews of the use of markers in rice breeding and associated technologies

in the marker lab that can be translated into the field.

Although markers have been used in diversity analysis, purity assessments, plant

variety protection, hybridity screens and many other applications, mapping and tagging

genes comprise the majority of marker use (Mackill, 2007).

Mapping Quantitative Trait Loci

One of the main uses of molecular markers is the construction of linkage maps,

which are useful in determining regions on chromosomes that contain genes. Chromosome

regions containing genes that control complex traits (polygenic or multifactorial traits) are

called quantitative trait loci (QTL).

Mapping QTLs requires generating bi-parental populations in which the parental lines have contrasting phenotypes for the trait of interest, for example, chalky and non-chalky grains in rice. Mohan and co-workers (1997) suggest a population size of 50 to 250 individuals for preliminary mapping and larger populations for higher-resolution mapping. Ideally, both parental lines are highly homozygous, which is not an issue in self-pollinated crops such as rice. In corn and other cross-pollinated crops, inbreeding depression may present some challenges.

Mapping populations may be derived from various generations. F2 populations are

the easiest to create as these are just selfed seeds harvested from F1 plants. Backcross (BC)

populations are generated by crossing the F1 to one of the parental lines. Recombinant

inbred lines (RILs) are usually obtained by single seed descent from an F2 population for

six or more generations. Doubled haploids (DHs) are essentially similar to RILs except that

they are derived from F1 plants through anther culture or crossing with an inducer genotype.

McCouch and Doerge (1995), and Paterson (1996) outlined the advantages and

disadvantages of these types of mapping populations.

QTL mapping is based on the association between the phenotype and the genotype.

It is therefore important to attain accurate and precise phenotyping because marker-trait

associations will be concluded from this initial phenotyping of the mapping population,

which will then be subsequently applied across breeding populations.

Collard (2005) explains QTL mapping as dividing the mapping population into

different genotypic groups and finding significant differences in phenotype between

groups. Phenotypic means between groups are compared and significant differences would

indicate that the marker locus being used to partition the population is linked to a QTL

controlling the trait of interest.

Some statistical methods to detect QTLs include single marker analysis, simple interval mapping and composite interval mapping (Tanksley, 1993). Single marker analysis can be done by linear regression, wherein the coefficient of determination ($R^2$) from the marker gives the proportion of phenotypic variation explained by the QTL linked to the marker (Collard, 2005). Simple interval mapping simultaneously analyzes intervals between linked markers along chromosomes and is widely considered to be more powerful than single marker analysis (Liu, 1998). Composite interval mapping combines interval mapping with linear regression and includes in the statistical model other markers in addition to the adjacent markers that define an interval (Jansen, 1993).
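Single marker analysis by linear regression can be sketched as follows. The example assumes that marker genotypes are coded as 0, 1 or 2 copies of one parental allele and omits the significance test; the function name and data are hypothetical.

```python
from statistics import mean


def single_marker_regression(marker_scores, phenotypes):
    """Regress phenotype on marker score; return the slope and R-squared."""
    x_bar, y_bar = mean(marker_scores), mean(phenotypes)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(marker_scores, phenotypes))
    sxx = sum((x - x_bar) ** 2 for x in marker_scores)
    syy = sum((y - y_bar) ** 2 for y in phenotypes)
    slope = sxy / sxx              # change in phenotype per allele copy
    r2 = sxy ** 2 / (sxx * syy)    # share of phenotypic variation explained
    return slope, r2


# Hypothetical marker scores and phenotypes for six individuals.
scores = [0, 0, 1, 1, 2, 2]
phenos = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
slope, r2 = single_marker_regression(scores, phenos)
print(slope, round(r2, 3))
```

The same calculation would be repeated for every marker, and markers whose regression passes a chosen significance threshold would be declared linked to a QTL.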

Limitations of Traditional MAS

Dekkers and Hospital (2002) described the current methods in MAS as better suited to genes with major effects than to genes with small effects, which agrees with the review of the technology by Xu and Crouch (2008). MAS is not effective for traits controlled by many genes with small effects. In forward breeding, every locus added to MAS increases the starting population size needed to attain the target population size after selection. For example, a target F2 population of 100 individuals after MAS translates to an initial 400 individuals if a breeder wants to select one homozygote class at a single locus (p = 0.25). Selecting homozygotes at two unlinked loci (p = 0.0625) raises the required starting population to 1,600 individuals to attain a target population size of 100 after MAS.
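The arithmetic in this example generalizes to any number of loci. The sketch below assumes unlinked loci in an F2, so that the probability of the desired homozygote at every locus is 0.25 raised to the number of loci, and it returns an expected (not guaranteed) starting size. The function name is hypothetical.

```python
import math


def starting_population(target_size, n_loci, p_per_locus=0.25):
    """Expected starting population needed to retain target_size plants after MAS."""
    p_all_loci = p_per_locus ** n_loci  # probability of the desired class at every locus
    return math.ceil(target_size / p_all_loci)


for n_loci in (1, 2, 3, 4):
    print(n_loci, starting_population(100, n_loci))
```

Each additional locus multiplies the required population by four, which is why selection for fixed homozygotes quickly becomes impractical beyond a few loci.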

While allele enrichment schemes can manipulate probabilities by including

heterozygotes, these approaches are suitable only for few major genes. Unfortunately,

many traits that are of agronomic importance are controlled by small-effect minor genes

and such traits are important to the success of new crop varieties (Crosbie et al., 2003).

Heffner et al. (2009) cite two primary limitations of MAS: bi-parental populations used in

most QTL studies do not readily translate to breeding applications, and the statistical

methods used are inadequate for polygenic traits controlled by numerous small-effect loci.

Collard (2013, pers. comm.) also points out that statistical methods currently in place for MAS in public rice research institutes are not yet capable of resolving polygenic traits.

The way QTL mapping is performed for MAS may be poorly suited to crop improvement (Jannink et al., 2010). Bi-parental populations may not represent the level of allelic diversity and the phase found in the breeding program, or in other breeding programs that intend to use the QTL mapping results. MAS partitions QTL mapping into two components: QTL identification and estimation of effects. This frequently results in biased estimates of marker effects (Beavis, 1994; Melchinger et al., 1998), and small-effect QTLs may be left out of the model (Lande and Thompson, 1990) because of stringent significance thresholds.

Estimation bias has been demonstrated in a simulation by Beavis (1994), showing

the impact of sampling estimated effects of QTLs from a truncated distribution. In the

study, Beavis showed that the average estimates of phenotypic variances associated with

correctly identified QTL were greatly overestimated with smaller population size (n=100),

slightly overestimated with n=500, and fairly close to the actual magnitude when n=1,000.

This phenomenon has been subsequently called the Beavis effect. A theoretical exploration

of the Beavis effect was performed by Xu (2003), and a statistical explanation has since been put forward that improves the interpretation of QTL mapping results.
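The mechanism behind the Beavis effect can be reproduced with a much simpler simulation than the one in the original study. The sketch below is an assumption-laden illustration, not Beavis's design: it simulates one marker with two genotype classes, a known residual standard deviation of 1, and a fixed z threshold, and it tracks estimated effects rather than variance explained. All names and parameter values are hypothetical.

```python
import random


def mean_detected_effect(true_effect, n, n_experiments=300, z_threshold=2.6, seed=1):
    """Average estimated effect among simulated experiments that pass the threshold."""
    rng = random.Random(seed)
    half = n // 2
    # Standard error of the difference between two group means (residual SD = 1).
    se = (2.0 / half) ** 0.5
    detected = []
    for _ in range(n_experiments):
        mean0 = sum(rng.gauss(0.0, 1.0) for _ in range(half)) / half
        mean1 = sum(rng.gauss(true_effect, 1.0) for _ in range(half)) / half
        estimate = mean1 - mean0
        if abs(estimate) / se > z_threshold:  # keep only "significant" results
            detected.append(estimate)
    return sum(detected) / len(detected) if detected else float("nan")


TRUE_EFFECT = 0.2  # hypothetical QTL effect, in residual SD units
for n in (100, 500, 1000):
    print(n, round(mean_detected_effect(TRUE_EFFECT, n), 3))
```

Because only estimates that exceed the threshold are kept, the average among detected QTLs is drawn from a truncated distribution, and the truncation removes a larger part of the distribution when the population is small.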

Other limitations, such as those mentioned by Xu and Crouch (2008), are not limitations of the technology per se but stem from the lack of emphasis on the applied value of the technology in plant breeding; the authors argued that logistical constraints in applying MAS are rarely addressed in scientific publications.

Linkage Disequilibrium-Based Mapping

Linkage disequilibrium refers to a non-random association between alleles at

different loci (Bernardo, 2010). It is a parameter of a population and it is essentially the

ability to predict the allele at one locus based on the allele state at another locus, hence it

is defined in terms of the correlation of alleles at two loci. Linkage disequilibrium (LD) is

measured as the difference between observed frequency of a gamete in a population and

the product of the frequencies of the corresponding alleles:

$$LD = \rho(A_iB_j) - \rho(A_i)\rho(B_j)$$

where 𝜌(𝐴𝑖𝐵𝑗) is the observed frequency of the 𝐴𝑖𝐵𝑗 gamete; 𝜌(𝐴𝑖) is the frequency of 𝐴𝑖

allele; and 𝜌(𝐵𝑗) is the frequency of 𝐵𝑗 allele. By definition, linked loci are in LD with

each other, but loci from different chromosomes or linkage groups can also be in LD with one

another. LD is influenced by many factors such as recombination events in the pedigree,

selection history, allele frequencies, and random drift among others.
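The LD measure defined above can be computed directly from a sample of gametes. In this sketch each gamete is represented as a pair of alleles, one per locus; the function name and the gamete counts are hypothetical.

```python
def linkage_disequilibrium(gametes, allele_a, allele_b):
    """Observed gamete frequency minus the product of the two allele frequencies."""
    n = len(gametes)
    freq_ab = sum(1 for a, b in gametes if a == allele_a and b == allele_b) / n
    freq_a = sum(1 for a, _ in gametes if a == allele_a) / n
    freq_b = sum(1 for _, b in gametes if b == allele_b) / n
    return freq_ab - freq_a * freq_b


# Hypothetical sample in which the parental gametes AB and ab are over-represented.
sample = [("A", "B")] * 4 + [("A", "b")] + [("a", "B")] + [("a", "b")] * 4
print(round(linkage_disequilibrium(sample, "A", "B"), 3))

# Equal gamete frequencies: the alleles at the two loci are independent.
balanced = [("A", "B"), ("A", "b"), ("a", "B"), ("a", "b")]
print(linkage_disequilibrium(balanced, "A", "B"))
```

A value of zero means that knowing the allele at one locus gives no information about the allele at the other locus.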

Linkage disequilibrium-based mapping or association mapping addresses the non-

relevance of bi-parental populations by utilizing breeding populations (Rafalski, 2002).

Kraakman and co-workers (2004) demonstrated this approach by identifying QTLs for

yield and yield stability in breeding populations of spring barley.

Association mapping allows identification of QTLs in breeding populations that already have extensive phenotypic data across locations and years. The obvious advantage over bi-parental populations is that there is no need to develop dedicated mapping populations, which would impose added investment on the breeding program. Directly using breeding populations also eliminates the need for costly QTL validation experiments because QTL values can be directly used in MAS on the breeding population (Breseghello and Sorrells, 2006).

Recent association mapping studies in various cereal species explored the genetic

architecture of complex traits such as aluminum tolerance (Famoso et al., 2011) and harvest

index (Li et al., 2012) in rice, plant height components and inflorescence architecture in

sorghum (Morris et al., 2013), and Fusarium head blight resistance in wheat (Miedaner et

al., 2011). As with traditional QTL mapping, however, association mapping uses arbitrary significance thresholds that may result in the identification of only a few QTLs with overestimated effects (Beavis, 1998).

Genomic Selection

Genomic selection (also called genome-wide prediction or genome-wide selection) is a procedure that uses genotype data from a large set of random markers across the genome to predict the phenotype from the marker data alone. It was first proposed by Meuwissen and co-workers (2001) as an improvement on the two-stage procedure of Lande and Thompson (1990). While the two-stage procedure requires selecting significant markers from a large set and then combining this information with phenotypic data to create a selection index, Meuwissen's method uses all available data (locus, haplotype, or marker effects) in a single stage to calculate genomic-estimated breeding values (GEBV).

Genomic selection simultaneously estimates all effects included in the model while

traditional MAS first identifies significant QTLs and subsequently estimates their effects.

Genomic selection captures the total genetic variance through markers by fitting both large

and small effect QTLs without significance testing, and jointly analyzes all markers on a

population to explain the total genetic variance (de los Campos et al., 2009).

Figure 1 illustrates a general breeding scheme with genomic selection on a

population derived from a bi-parental cross. The basic concept is prediction of breeding

values for individuals (prediction set or population) with genotype data alone (hence,

genomic-estimated breeding values) based on a model created or "trained" from a separate

set of individuals (hence training set or population) having both genotype and phenotype

data. A subset of progenies from a bi-parental cross can be genotyped and phenotyped to

compose the training population or training set. The training population is used to estimate

parameters for the prediction model, which is then applied to the remainder of the

population using genotype information.

One of the most revolutionary ideas in genomic selection is that phenotyping is no longer used as a means to select individuals, but as a means to train the prediction model. GEBVs are predicted for untested individuals based on their SNP profiles alone, and selections are then made on the GEBVs. Marker effects are valid for the entire population and are stable over generations because each marker represents a small segment of the chromosome. Genome-assisted predictive hybrid breeding is best utilized in well-defined heterotic pools where parental lines of inbred families share co-ancestry, having been derived from a set of founder lines. Here, the prediction model is based on the information obtained from multiple families (as opposed to a single bi-parental population).

Figure 1. A general genomic selection scheme applied to a breeding population showing the partitioning of the population into training and prediction sets. Only the training set is phenotyped (e.g. yield trials) instead of the whole population.

[Figure 1 diagram labels: Training Set; Prediction Set; Breeding Population; Prediction Model; Phenotyping and Genotyping; Genotyping; Calculate GEBVs; Make selections]

Statistical Models of Estimating GEBV

Limitations of Stepwise Regression Models

The availability of high-density genotyping panels pushed the development of genomic selection to predict complex traits (Meuwissen et al., 2001). As discussed, traditional MAS models arbitrarily set marker effects to zero (not significant) or to the full value (significant), which results in overestimation of marker effects. Meuwissen and co-workers (2001) attempted to address this bias by avoiding the selection of "significant" markers during estimation of marker effects and calculation of genetic values. As a result, the number of predictor effects (p) to be estimated becomes larger than the number of observations (n). Least squares is not appropriate for analyzing these "large p, small n" datasets because of insufficient degrees of freedom (reviewed by Lorenz et al., 2011).

To resolve this, several statistical models for genomic selection have been proposed

and used in other crop species. The general model described by Moser et al. (2009) is as

follows:

$$Y_i = g(x_i) + e_i$$

where $Y_i$ is the observed value of the phenotype of individual $i$, $x_i$ is a $1 \times p$ vector of SNP genotypes on individual $i$, $g(x_i)$ is a function relating genotypes to phenotypes or the GEBV, and $e_i$ is the error term. Meuwissen et al. (2001) enumerated several statistical models such as Ridge Regression Best Linear Unbiased Prediction (RR-BLUP) and Bayesian methods (BayesA, BayesB, BayesCπ).

Ridge Regression BLUP

RR-BLUP is one of the first models proposed for genomic selection in bi-parental

crosses (Whittaker et al., 2000). Compared with stepwise regression models in traditional

MAS for which the number of markers cannot be more than the number of observations,

RR-BLUP is not limited by “large p, small n” problems. The basic model for RR-BLUP is

as follows:

$$Y = WGu + e$$

where $u$ is a vector of marker effects with a normal distribution, mean of zero and variance of $I\sigma_u^2$, $G$ is the genotype matrix, and $W$ is the design matrix relating lines to observations ($Y$). Marker effect BLUPs can be estimated as:

$$\hat{u} = (Z'Z + \lambda I)^{-1} Z'Y$$

where $Z = WG$. The ridge parameter $\lambda$ is the ratio between the residual and marker variances, $\lambda = \sigma_e^2 / \sigma_u^2$ (Searle et al., 2006).
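The marker effect estimator can be sketched in a few lines. The example below assumes that the matrix Z is supplied directly, that phenotypes are already centered, and that the ridge parameter is known; it uses a small Gaussian elimination routine instead of a matrix library. Function names and data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def rr_blup(z, y, lam):
    """Marker effects from (Z'Z + lam * I) u = Z'y."""
    p = len(z[0])
    lhs = [[sum(row[i] * row[j] for row in z) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    rhs = [sum(row[i] * yi for row, yi in zip(z, y)) for i in range(p)]
    return solve_linear(lhs, rhs)


# Hypothetical "large p, small n" case: two lines scored at three markers.
z = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [1.0, -1.0]
effects = rr_blup(z, y, lam=1.0)
gebv = [sum(zi * ui for zi, ui in zip(row, effects)) for row in z]
print(effects, gebv)
```

Adding the ridge parameter to the diagonal makes the system solvable even though there are more markers than observations, which ordinary least squares cannot handle.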

RR-BLUP shrinks marker effects toward zero and assumes markers as random

effects with a common variance (Whittaker et al., 2000). Bernardo and Yu (2007) clarified

that common variance does not mean that all markers have the same effects but that the

shrinkage toward zero is equal. This assumption, however, is not realistic; markers do not

have equal variances. Despite the fact that RR-BLUP incorrectly assumes equal marker

variances, it is superior to traditional MAS models (e.g. stepwise regression) because it can

simultaneously estimate marker effects, avoiding biases associated with selecting markers

in a stepwise regression.

Bayesian Methods

RR-BLUP tends to overshrink large effects. Bayesian models have been applied to relax the equal-variance assumption and account for marker effects of different sizes (Hayes, 2007): a separate variance is estimated for each marker and is assumed to follow a specified prior distribution, allowing each marker to be shrunken toward zero to a different degree. The Bayesian approach to analysis takes into account prior knowledge about the parameters before the data are observed, the likelihood of observing the data given certain values of the parameters, and posterior knowledge about the parameters after the data are observed. The resulting estimates are a “compromise” between the data and the prior.

Fernando (2007) explains that prior probabilities quantify beliefs about parameters before the data are analyzed. Parameters are related to the data through the model or “likelihood”, which is the conditional probability density for the data given the parameters. The prior and the likelihood are combined using Bayes' theorem to obtain posterior probabilities, which are conditional probabilities for the parameters given the data. Inferences about parameters are based on the posterior. Bayes' theorem is illustrated by Fernando (2009) as follows:

Let 𝑓(𝜃) denote the prior probability density for θ;

Let 𝑓(𝑦|𝜃) denote the likelihood;

Then the posterior probability of θ is:

$$f(\theta \mid y) = \frac{f(y \mid \theta) f(\theta)}{f(y)} \propto f(y \mid \theta) f(\theta)$$
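The proportionality can be made concrete on a discrete grid of parameter values. The sketch below is a generic illustration rather than a genomic selection model: the parameter is assumed to be a success probability, the prior is uniform over three candidate values, and the data are 3 successes in 4 trials. All names and numbers are hypothetical.

```python
def grid_posterior(thetas, prior, likelihood):
    """Posterior over a discrete grid: prior times likelihood, rescaled to sum to 1."""
    unnormalized = [p * likelihood(t) for t, p in zip(thetas, prior)]
    evidence = sum(unnormalized)  # plays the role of f(y)
    return [u / evidence for u in unnormalized]


thetas = [0.25, 0.50, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]
# Likelihood of 3 successes in 4 trials, up to a constant that cancels out.
posterior = grid_posterior(thetas, prior, lambda t: t ** 3 * (1 - t))
print([round(p, 3) for p in posterior])
```

The normalizing term only rescales the product of prior and likelihood, which is why the posterior is usually written up to proportionality.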

Meuwissen et al. (2001) initially proposed two types of prior distribution for the marker variances. In BayesA, each marker effect $k$ is drawn from a normal distribution with its own variance, $N(0, \sigma_{\beta_k}^2)$. BayesA uses an inverted chi-square distribution with degrees of freedom and scale parameters chosen so that the mean and variance of the distribution match the expected mean and variance of the marker variances (Heffner et al., 2009). BayesA, however, does not permit a value of zero for marker variances. The second type of prior, BayesB, assigns a probability that a marker has no effect at all, offering a more realistic model since some regions in the genome will have no QTLs for a particular trait, and markers in these regions are therefore expected to have zero effects. The Bayesian model can be represented as (Lorenz et al., 2011):

$$GEBV = g(X_i) = \sum_{k=1}^{p} x_{ik} \beta_k \gamma_k$$

where $g(X_i)$ or GEBV is the sum of $p$ marker effects, $x_{ik}$ represents the SNP score for individual $i$ at marker locus $k$, $\beta_k$ is the effect of marker $k$, and $\gamma_k$ is an indicator variable specifying the presence of marker $k$ in the prediction model. The prior distribution of the $\beta_k$ variances in BayesB is a mixture such that $\sigma_{\beta_k}^2 = 0$ with probability $\pi$ and $\sigma_{\beta_k}^2 \sim \chi^{-2}(v, S)$ with probability $(1 - \pi)$. If $\pi = 0$, the model becomes BayesA.
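Once marker effects and indicator variables are available, the GEBV of an individual is a simple weighted sum. The sketch below only evaluates that sum; it does not sample effects or indicators from their posterior distributions, and the SNP coding and values are hypothetical.

```python
def gebv(snp_scores, marker_effects, indicators):
    """Sum of x_ik * beta_k * gamma_k over all markers for one individual."""
    return sum(x * b * g for x, b, g in zip(snp_scores, marker_effects, indicators))


# Hypothetical individual scored at four markers (coded -1, 0, 1).
x_i = [1, -1, 0, 1]
beta = [0.5, 0.2, -0.3, 1.0]

print(gebv(x_i, beta, [1, 1, 0, 0]))  # markers 3 and 4 are excluded from the model
print(gebv(x_i, beta, [1, 1, 1, 1]))  # every marker is included
```

Setting an indicator to zero removes the marker from the prediction regardless of the size of its effect, which is how BayesB allows markers with no effect.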

Lorenz et al. (2011) noted that reasonable values for the parameter π will be unknown in the context of biological organisms. BayesCπ addresses this limitation by estimating π itself, setting a uniform prior distribution between 0 and 1 for the parameter. BayesCπ assumes the marker effect $\beta_k$ is zero when the indicator variable $\gamma_k$ is also zero, and that the prior variance for the effects of all markers for which $\gamma_k = 1$ is equal, or $\beta_k \sim N(0, \hat{\sigma}_\beta^2)$. This approach groups the markers into zero and non-zero effects, from which estimates for marker effect variances are obtained. Bayesian Lasso (Legarra et al., 2011) models the SNP effect $a$ as:

$$p(a \mid \sigma^2, \gamma) = \frac{\gamma}{2\sigma} \exp\left[\frac{-\gamma |a|}{\sigma}\right]$$

Kernel and Machine Learning Methods

Gianola and van Kaam (2008) were the first to apply reproducing kernel Hilbert

spaces (RKHS) regression for genomic selection by combining a classical additive genetic

model with a kernel function, shown as follows:

$$Y = \mu + K_h \alpha + e$$

with prior distributions for marker effects $\alpha \sim N(0, K_h \sigma_\alpha^2)$ and residuals $e \sim N(0, I\sigma_e^2)$. The kernel matrix $K_h$ is defined as:

$$K_h(x_i, x_j) = \exp(-h d_{ij})$$

where $d_{ij}$ is the squared Euclidean distance between individuals $i$ and $j$ derived from marker genotypes and $h$ is defined as $2/d^*$ where $d^*$ is the mean of the Euclidean distances. This method was used by Neves et al. (2012) on mice populations.
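The kernel matrix can be built directly from marker genotypes. The sketch below makes one assumption that the text leaves open: the mean distance used to set h is taken as the mean of the squared Euclidean distances over all distinct pairs of individuals. The function name and genotypes are hypothetical.

```python
import math
from itertools import combinations


def rkhs_kernel(genotypes):
    """Gaussian kernel K[i][j] = exp(-h * d_ij) with h = 2 / mean pairwise distance."""
    n = len(genotypes)
    sq_dist = [[sum((a - b) ** 2 for a, b in zip(gi, gj)) for gj in genotypes]
               for gi in genotypes]
    pairs = list(combinations(range(n), 2))
    d_star = sum(sq_dist[i][j] for i, j in pairs) / len(pairs)  # assumed definition
    h = 2.0 / d_star
    return [[math.exp(-h * sq_dist[i][j]) for j in range(n)] for i in range(n)]


# Hypothetical marker genotypes (coded 0 or 2) for three inbred lines at two markers.
lines = [[0, 0], [0, 2], [2, 2]]
kernel = rkhs_kernel(lines)
for row in kernel:
    print([round(value, 3) for value in row])
```

Individuals with identical marker profiles receive a kernel value of 1, and the value decays toward zero as the marker profiles diverge.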

Machine learning is being explored for situations in which knowledge must be mined from massive amounts of large, noisy, redundant, incomplete and fuzzy data, extracting hidden relationships that exist in these data and do not follow a particular parametric design (Gonzalez-Recio, 2010). An example of a machine learning method is Random Forest, an ensemble learning method for classification (and regression) that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees (Breiman, 2001). Ensembles are combinations of different simple methods or models, resulting in very good predictive ability compared to the individual models used separately. Ensembles have known statistical properties and do not require prior assumptions of the kind used in Bayesian methods. Some of the advantages of Random Forest include not requiring specified inheritance models (e.g. additive, dominance and epistasis), the ability to capture more complex interactions in the data, and a reduction of prediction error by a factor of the number of trees (Breiman, 2001).

Accuracy of Genomic Selection

The accuracy of genomic selection models is usually expressed in terms of the Pearson correlation coefficient between the GEBV predicted by the model and the observed (empirical) phenotypic data (Storlie and Charmet, 2013), i.e. $r(GEBV:EBV)$. Although a great majority of researchers report genomic selection accuracy as $r(GEBV:EBV)$, other researchers such as Lorenz et al. (2011) argue that a correlated error component generated by $G \times E$ will be present in both GEBV and EBV if training and validation data are collected in the same environment, resulting in bias, i.e. overestimated prediction accuracy. Therefore, the correlation with the true breeding value, $r(GEBV:TBV)$, may be obtained by using the assumption:

$$r(GEBV:EBV) = r(GEBV:TBV) \times r(EBV:TBV)$$

which is true if the only component common between GEBV and EBV is the TBV. Specifically, the training and validation data should be obtained from different environments to satisfy the condition that the residuals $e_1$ and $e_2$ should be uncorrelated:

$$GEBV = TBV + e_1$$

$$EBV = TBV + e_2$$

The correlation $r(EBV:TBV)$ is equal to the square root of heritability within the validation set.
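Rearranging the assumption above gives the correlation with the true breeding value as the observed correlation divided by the square root of heritability. The sketch below applies that rearrangement; the function names, the validation data and the heritability value are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def accuracy_vs_true_bv(r_gebv_ebv, heritability):
    """r(GEBV:TBV) implied by r(GEBV:EBV) when r(EBV:TBV) = sqrt(heritability)."""
    return r_gebv_ebv / math.sqrt(heritability)


# Hypothetical predicted and observed values for four validation lines.
predicted = [1.0, 2.0, 3.0, 4.0]
observed = [1.5, 1.0, 3.5, 3.0]
r_obs = pearson_r(predicted, observed)
print(round(r_obs, 3), round(accuracy_vs_true_bv(r_obs, heritability=0.64), 3))
```

The correction is only meaningful when the residuals of GEBV and EBV are uncorrelated, which is why validation data from a different environment are recommended.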

Research Management of Plant Breeding Programs

Plant breeding research programs are becoming increasingly complex due to the

integration of allied sciences in addressing common social issues such as poverty and food

security. Plant breeders are now working much more closely with pathologists,

physiologists, social scientists, statisticians, geneticists, engineers and many others. It is

therefore crucial for plant breeders to be adept not just in genetics, breeding, statistics and

other technical skills, but also in the areas of leadership and management.

Knowledge, Experience and Skill Requirements from Plant Breeders

Repinski and co-workers (2011) discussed what various stakeholders (public and private sectors, and institutes from developing countries) expect of plant breeders in terms of critical knowledge, experience and skills. Knowledge of plant breeding, breeding methodology, quantitative genetics, statistics and experimental design is highly required. Equally important is knowledge of project management, which includes managing personnel and budgets, establishing goals and timelines, and maintaining relationships among multiple support teams within the organization and with external teams. Crucial experiences include field know-how, which includes data collection and analysis, writing scientific reports, mentorship, and oral presentations. Skills identified by the authors as critical are leadership and teamwork. The multi-dimensional competencies required of plant breeders have been proposed by Gepts and Hancock (2006) and Applegate (2002), emphasizing the shift from purely research goals to a more inter-disciplinary approach.

Breeding Programs as Part of Meta-Organizations

Almost all private breeding programs are part of a meta-organization, in this case

the agriculture business organization. Large multinational companies such as Syngenta,

Monsanto, East-West Seeds, DuPont and Bayer all have breeding programs for crops that

fit their strategic direction. One common crop for these companies is hybrid corn.

Companies that also market a wide range of pesticides have breeding programs for rice and

a diverse range of vegetable crops. Monsanto’s seed business is largely based on biotech

traits, hence rice is not a good fit for their overall strategy. Private companies have large

support functions for allied sciences such as pathology, bioinformatics, statistics and other

fields. Substantial parts of large projects that are temporary undertakings, such as building database suites and analysis software, are also outsourced. Logistics support such as

greenhouse and nursery teams, finance, purchase and procurement and legal teams provide

service to R&D teams and other teams as well. The core business functions responsible for

delivering products to the market include production, marketing and sales; there may be

variations among companies but the essential tasks are represented by these three business

functions.

A typical breeding program consists of breeders and assistant breeders, support

scientists for pathology, molecular markers, doubled haploids and other technical areas,

program support for greenhouse and field management, field trial establishment and

maintenance, and database management. A full-fledged breeding program in the private

sector may have an annual funding of USD 250,000-500,000 (Bliss, 2006).

Structure of Research Organizations

Different research organizations have different approaches in organizing their R&D

function, the most common of which is the matrix organization (Galbraith, 1971). A matrix

organization uses teams of employees to accomplish work, in order to take advantage of

the strengths, as well as make up for the weaknesses, of functional and decentralized forms.

A matrix may exist within the meta-organization by grouping products as projects as shown

in Figure 2, and managing these projects across functions.

Figure 2. A diagram of a matrix organization with product delivery managed as projects

across functions.

Within research organizations or R&D, a matrix structure may also exist by

managing specific projects across research logistics service groups. For example, a

breeding organization that aims for resistance to bacterial leaf blight (BLB) in rice hybrids

may manage the project as follows:

1. Breeding crosses and maintenance of breeding populations are done through a team

in charge of field nurseries and hybridization work.

2. Sampling of leaf tissues and subsequent genotyping are accomplished through a

genotyping team.

3. Inbreds are screened for resistance to BLB by a pathology team.

4. Inbreds that are selected are submitted to a research seed production team for

creation of hybrids.

5. Hybrid seeds are handed over to a trialing team for evaluation in several locations

and in BLB hotspots.

6. Hybrids are also given to the pathology team for confirmation of BLB resistance.

Introducing Change into Breeding Organizations

Breeding technologies such as marker-assisted selection have been introduced into

classical breeding programs in the past (Collard et al., 2008). These changes were largely

brought about by the maturity of the technology and numerous research works that confirm

the usefulness of MAS.

Kotter (2012) proposed the following steps in leading change, which have been

annotated in this work with examples from a plant breeding organization:

1. Create a sense of urgency. The need to feed a predicted world population of 9 billion

in 2050 is an urgent scenario that must be addressed by all breeding programs in

public and private sectors. Among breeding programs in the private sector, the

urgency is expressed in preventing lost revenues and market share.

2. Build a guiding coalition. Kotter suggests mobilizing sponsors who are effective

people, drawn from the organization's own ranks, to guide, coordinate and communicate the

planned change. In plant breeding programs implementing genomic selection, a

guiding coalition may be composed of senior members of the group.

3. Form a strategic vision and initiatives. Breeding programs may target an additional

percentage increase in genetic gain and identify initiatives to attain the

vision, such as implementing genomic selection and improving robustness of field

trialing and phenotyping.

4. Enlist a volunteer army. People who are open to new ideas are included in the early

stages of implementing new technologies. These individuals then become

champions for change and their stories serve to inspire buy-in from others.

5. Enable action by removing barriers. Implementation of change requires breeders’

substantial time and effort taken away from routine activities. A genomic selection

proof of concept allows the change being introduced to be managed as a project

separate from the whirlwind of everyday activities and provides focus to the

persons and teams involved.

6. Generate short-term wins. Gain in selection, resources saved and status of proof of

concept experiments must be collected and tracked to energize the team in pushing

the change forward.

7. Sustain acceleration. Results from proof of concept experiments should be adopted

quickly by the organization to stay the course toward the vision and to reinforce how

genomic selection helps attain it.

8. Institute change. Connections between adopting the change and organizational

success must be formally communicated and introduced to ensure that new

behaviors are repeated over the long term.

CHAPTER 3

MATERIALS AND METHODS

Phenotyping and Phenotypic Analysis

Multi-location yield trials of 510 genotypes (experimental hybrids and checks)

were conducted from July 2013 to May 2015 as part of Syngenta’s trialing program,

consisting of 24,415 plots distributed across 332 locations. Since the trials were

incorporated into an established commercial breeding program, experimental design was

highly unbalanced as hybrids were differentially advanced and rejected during the two-

year trialing duration. Trials were conducted in randomized complete block design

(RCBD) with three replications.

Days to 50% flowering or DTF (in days after sowing) was measured according to

the Standard Evaluation System for rice (IRRI, 1980). Flowering date is a reliable measure

of crop maturity. Flowering usually occurs two weeks after heading, and the crop is ready

to harvest after another four weeks. Plant height (in cm) was measured from the base of

the plant to the tip of the primary panicle (IRRI, 1980) of pre-determined plants in the

harvest area of a plot. Plot yield (YLD) was obtained by harvesting the inner rows and

adjusting the weight to 14% moisture to compute for yield in kg per hectare.
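The moisture adjustment and scaling to kg per hectare can be sketched in R as follows. This is a minimal illustration of the standard conversion, not the script used in the trials; the function name adjust_yield, the plot area and the example values are hypothetical.

```r
# Sketch: convert harvested plot weight to yield in kg/ha at 14% moisture.
# The function name, plot area and example values are hypothetical.
adjust_yield <- function(plot_weight_kg, moisture_pct, plot_area_m2,
                         target_moisture_pct = 14) {
  # Rescale the grain weight from the measured to the target moisture content
  adj_weight_kg <- plot_weight_kg * (100 - moisture_pct) / (100 - target_moisture_pct)
  # Scale from the harvested plot area to one hectare (10,000 square meters)
  adj_weight_kg * 10000 / plot_area_m2
}

# Example: 5 kg of grain at 22% moisture harvested from an 8 square meter area
yld_example <- adjust_yield(plot_weight_kg = 5.0, moisture_pct = 22, plot_area_m2 = 8)
```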

Data were checked for quality by eliminating unexplained outliers in the

distributions visualized by running the datasets in JMP® software (SAS Institute). Cleaned

datasets were analyzed in R Version 3.2.4 (R Development Core Team, 2015), and best

linear unbiased predictors (BLUPs) for the traits, including general combining ability

estimates, were computed using the R package lme4 (Bates et al., 2015).

Genotyping

Preparation and Processing of Tissue Samples

Of the 214 parental lines of experimental hybrids, 122 were classified as elite lines

based on historical performance and usage in breeding crosses (not included in this study).

Hence, only these 122 lines were included in genotyping.

Leaf samples were collected from 21-d-old plants sown in Syngenta’s breeding

station in General Santos City, Philippines. Samples were freeze-dried for 48 hours using

a Virtis 12ES lyophilizer (SPS SCIENTIFIC, Gardiner, NY, USA), at −50 °C and 30.0

mTorr pressure and shipped to Syngenta’s high-density genotyping facility in Toulouse,

France. DNA was extracted using a sap (or juice) extractor (MEKU Erich Pollähne

G.m.b.H) and genomic fingerprints were generated using Syngenta’s proprietary 60K SNP

chip. The chip was built specifically for Syngenta’s rice germplasm and can assay 60,000

SNP loci.

Quality Filtering and Reformatting of SNP Markers

Quality filtering was applied to all 60,000 SNP markers. Markers with certainty of

<0.9 were not included. Markers with alleles having <0.05 frequency were also removed

from the dataset as they represent rare alleles which are not useful in genomic selection

although they are valuable in other types of analyses such as diversity and haplotype

analyses. About 43,000 high quality SNPs were selected.

The marker dataset was reformatted to the form required by prediction algorithms by

converting the SNP nucleotide calls into {-1,0,1} where “-1” represents the allele with

lower frequency in the population, “1” represents the allele with the higher frequency, and

“0” refers to a heterozygote call. Naive imputation was used for missing SNP data for all

lines and markers. Naive imputation takes the mean score of the SNP markers per genotype

as values for the missing scores.
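The filtering, recoding and imputation steps can be sketched in base R on a toy dataset. This is an illustration only: the genotype calls, the helper recode_snp and the object names are hypothetical, and missing scores are filled here with the mean score of each locus, which is one possible reading of naive mean imputation.

```r
# Hypothetical toy calls: rows = lines, columns = SNP loci; NA = missing call
calls <- matrix(c("AA", "AA", "GG", "AG", NA,
                  "CC", "TT", "TT", "TT", "TT",
                  "GG", "GG", "GG", "GG", "GG"),
                nrow = 5,
                dimnames = list(paste0("line", 1:5), paste0("snp", 1:3)))

# Recode one locus: 1 = homozygous for the more frequent allele,
# -1 = homozygous for the less frequent allele, 0 = heterozygote
recode_snp <- function(x) {
  alleles <- unlist(strsplit(x[!is.na(x)], ""))
  freq <- table(alleles) / length(alleles)
  major <- names(freq)[which.max(freq)]
  score <- rep(NA_real_, length(x))
  hom <- !is.na(x) & substr(x, 1, 1) == substr(x, 2, 2)
  score[hom] <- ifelse(substr(x[hom], 1, 1) == major, 1, -1)
  score[!is.na(x) & !hom] <- 0
  list(score = score, maf = 1 - max(freq))
}

rec <- lapply(seq_len(ncol(calls)), function(j) recode_snp(calls[, j]))
M <- sapply(rec, function(r) r$score)
dimnames(M) <- dimnames(calls)
maf <- sapply(rec, function(r) r$maf)

# Drop loci whose minor allele frequency is below 0.05
M <- M[, maf >= 0.05, drop = FALSE]

# Mean imputation per locus (assumption, see the note above)
M_imp <- apply(M, 2, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})
```

In this toy example the monomorphic locus snp3 is removed by the frequency filter, and the missing call at snp1 receives a score between the {-1,0,1} values.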

Estimation of Genetic Relationships

Relationships among parental lines were estimated using a realized relationship

matrix proposed by VanRaden (2008) for dairy cattle. A more detailed description on the

use of this matrix is discussed in the Results section. Principal component analysis (PCA)

explaining the genetic relationships was implemented in JMP® to assess the population

structure.

Implementing Genomic Selection Models

Genomic selection was implemented using various statistical models: Ridge

Regression Best Linear Unbiased Prediction (RR-BLUP), Bayesian Ridge Regression

(BRR), BayesCPi and Bayesian LASSO (BL). The rrBLUP package developed by

Endelman (2011) was used to estimate marker effects and breeding values using ridge

regression and genomic BLUPs (GBLUP).

Matrix algebra functions were used to obtain genetic and error variances ($\sigma_g^2$ and

$\sigma_e^2$) in training populations sampled or simulated from the dataset generated. The shrinkage

parameter $\sigma_e^2 / \sigma_g^2$ was included in the mixed model to estimate the marker effects.

Bayesian methods were run as single chains of 2,000 iterations using the BGLR

package (de los Campos and Perez, 2013), with the first 1,000 iterations discarded as burn-in.

Three variations of the models were implemented: Bayesian Ridge Regression (BayesRR),

Bayesian Cπ (BayesCPi) and Bayesian Lasso (BayesL). Descriptions of these models are

discussed in the Results section.

Marker effects computed from the models were used to predict the estimated

genetic values in the validation populations sampled from the dataset. The GEBV

prediction model is

$$\mathrm{GEBV} = M\hat{u}$$

where $M$ is the marker matrix and $\hat{u}$ is the vector of estimated marker effects.
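As a minimal sketch of this prediction step, the base R code below estimates marker effects by ridge regression in closed form and multiplies them by the marker matrix. The simulated data, the object names and the fixed shrinkage value lambda are assumptions for illustration; in the study the shrinkage parameter comes from the estimated variance ratio and the fitting is done by the rrBLUP and BGLR packages.

```r
set.seed(1)
n_lines <- 20
n_markers <- 50

# Simulated marker matrix coded {-1, 0, 1} and simulated phenotypes (hypothetical data)
M <- matrix(sample(c(-1, 0, 1), n_lines * n_markers, replace = TRUE), nrow = n_lines)
u_true <- rnorm(n_markers, sd = 0.3)
y <- as.vector(M %*% u_true + rnorm(n_lines))

lambda <- 1            # assumed shrinkage parameter (error variance / marker variance)
y_c <- y - mean(y)     # center the phenotypes so that no intercept is needed

# Ridge solution for marker effects: (M'M + lambda * I)^{-1} M'y
u_hat <- solve(crossprod(M) + lambda * diag(n_markers), crossprod(M, y_c))

# GEBV = M %*% u_hat
gebv <- as.vector(M %*% u_hat)
```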

Design of Training and Validation Populations

To validate the accuracy of GEBV, the dataset of phenotypic values was divided

into training sets and validation sets. Factors identified in the design of training and

validation populations are population size and prediction model. Training population size

was varied by performing two different cross validation procedures. A 90% training

population size is where 90% of the population is assigned to the training set and the

remaining 10% is assigned to the validation set.

Procedure for Cross Validation

Repeated ten-fold and three-fold cross validation (Fig. 3) were performed on the 122

lines with available phenotypic and genotypic data. These schemes correspond to 90% and 67%

training sets, respectively. The parental lines dataset was split into $n$ partitions, equal to the

number of desired cross-validation folds. Statistical models were fitted on $n - 1$

partitions to create the prediction models, which were then applied to the one remaining

partition. For each round of cross validation, Pearson’s correlation was computed between

the GEBVs predicted from the $n - 1$ training partitions and the observed values in the

held-out validation partition, and the correlations were averaged to obtain the prediction

accuracy of the model.
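The partitioning and accuracy calculation described above can be sketched generically in base R. The helpers make_folds, train_prop, cv_accuracy and ridge_fp, the simulated data and the fixed shrinkage value are all hypothetical; the sketch only illustrates how ten-fold and three-fold schemes yield training proportions of about 0.90 and 0.667 for 122 lines.

```r
# Deal shuffled line indices into k folds of near-equal size
make_folds <- function(n_lines, k, seed = 1) {
  set.seed(seed)
  split(sample(n_lines), rep_len(seq_len(k), n_lines))
}

# Average proportion of lines used for training across folds
train_prop <- function(folds) 1 - mean(lengths(folds)) / sum(lengths(folds))

# Mean correlation between predictions and observed values in each held-out fold
cv_accuracy <- function(y, folds, fit_predict) {
  acc <- sapply(folds, function(val) {
    pred <- fit_predict(train = setdiff(seq_along(y), val), val = val)
    cor(pred, y[val])
  })
  mean(acc)
}

folds10 <- make_folds(122, 10)
folds3 <- make_folds(122, 3)

# Simulated markers and phenotypes (hypothetical), with a simple ridge predictor
set.seed(2)
M <- matrix(sample(c(-1, 1), 122 * 200, replace = TRUE), nrow = 122)
y <- as.vector(M %*% rnorm(200, sd = 0.2) + rnorm(122))
ridge_fp <- function(train, val, lambda = 10) {
  mu <- mean(y[train])
  u_hat <- solve(crossprod(M[train, ]) + lambda * diag(ncol(M)),
                 crossprod(M[train, ], y[train] - mu))
  as.vector(mu + M[val, , drop = FALSE] %*% u_hat)
}

acc10 <- cv_accuracy(y, folds10, ridge_fp)
acc3 <- cv_accuracy(y, folds3, ridge_fp)
```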

Cross validation was performed for every genomic selection model implemented

in rrBLUP and BGLR packages. The sample R code below performs a ten-fold cross

validation for 122 lines, each partition with 12 lines.

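library(BGLR)
# Assumes yldShuff (shuffled lines; column 2 = trait value) and ETA (BGLR predictor list) already exist
correl <- numeric(10)                                    # accuracy per fold
tf.brr <- matrix(NA, nrow = nrow(yldShuff), ncol = 1)    # assumed container for the GEBVs
count <- 1:12   # assumed starting partition; not shown in the original listing
for(i in 1:10)
{
  yldTrain <- yldShuff
  yldTrain[count, 2] <- NA   # mask phenotypes of the validation partition
  modelBRR <- BGLR(y=yldTrain[,2], ETA=ETA, burnIn = 1000,
                   nIter=2000, verbose=FALSE)
  BRRGebvs <- modelBRR$yHat[count]   # predictions for the masked lines
  correl[i] <- cor(BRRGebvs, yldShuff[count, 2])
  tf.brr[count,] <- BRRGebvs
  print(correl[i])
  count <- count + 12   # move to the next 12 lines; lines 121-122 always stay in training
}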

Figure 3. Cross validation procedure for ten-fold and three-fold schemes, representing 0.9

and 0.667 proportions of training population size. Each partition was

successively used as validation population from the prediction model derived

from the training population.

Comparison of Prediction Accuracies

As discussed, accuracy may be defined as the correlation between the empirical

breeding values (EBV) or observed values and the genomic estimated breeding values

(GEBV) in the validation set for each training population design. Accuracies were

compared for each set of means for genomic selection model, training population size, and

trait. Comparison of means was implemented in JMP®. Correlations between pairs of

genomic selection models were obtained from multivariate analysis and were also

implemented in JMP®.


Average correlations per cross validation were taken as genomic selection

accuracies. The lowest and highest trait heritabilities were used in place of nominal traits

so that quantitative optimization can be performed at least for two factors. These accuracies

were obtained from a 2x2x4 full factorial design (Table 1).

Table 1. Full factorial design for optimization of genomic selection parameters.

FACTOR                      NO. OF LEVELS   LEVELS
Heritability                2               0.3130, 0.5486
Training population size    2               0.90, 0.667
Genomic selection model     4               rrBLUP, BayesRR, BayesCPi, BayesL

Optimizations, construction of model, and determination of contributions of main

effects and interactions were implemented in JMP® using a generalized linear model

(GLM) to predict genomic selection accuracy values with varying heritability, training

population size and genomic selection model. Prediction profiles were generated and

recommendations were generalized for the dataset used in the study.

Creating a Genomic Selection Project Proposal

Using a research management perspective, a breeding program using genomic

selection was simulated and proposed as a project for implementation. To compare, a

breeding program without genomic selection was used as baseline. The project proposal

was based on the initial problem of low rate of genetic gain, from which a number of root

causes and ensuing effects were identified. Root causes related to management of a

research organization were identified and addressed using a project approach to integrate

genomic selection.

Declaration of Research Funding and

Non Conflict of Interest

This research study was funded by Syngenta. The author declares no conflict of

interest and that this research was solely for academic purposes in support of Syngenta’s

strategy to develop its employees. While Syngenta data was used, no confidential

information, such as pedigrees, DNA fingerprints, SNP marker names, genomic prediction

models and breeding strategies, was released. This study does not in any way reflect

Syngenta’s breeding strategies, as methods used in this study are publicly available.

CHAPTER 4

RESULTS AND DISCUSSION

Quality of Field Trial Data

By-Location Coefficients of Variation

In any field trial, partitioning of the variance will always include residuals as the

error term. Location errors can be minimized by statistical design such as replications and

accounting for spatial trends (Gomez and Gomez, 1984). Extraneous errors, which are due

to procedures associated with conducting the experiment such as fertilizer application,

harvesting method and measurement method, can be controlled by improvement of

experimental protocols. Errors can be further minimized by removing unexplained outliers

from a field trial dataset, a method commonly referred to as data quality control.

The phenotypic dataset included 332 locations after discarding locations with more

than 20% coefficient of variation for yield after cross referencing the location observations

from the researchers involved. Valid conditions for high CV included disease pressure,

drought, lodging and pest damage. A few locations with CV>20% were included after no

apparent factors were validated that would explain such CV.

Figure 4 shows box plots of CVs of the locations grouped into seasons. Dry season

location CVs are slightly lower than CVs of wet season locations as shown in the

comparison of means. This aligns well with experience that dry season trials are less

exposed to diseases and heavy rains that would induce lodging, and hence would have fewer

experimental errors.

Figure 4. Box plots of coefficients of variation of 332 locations grouped into

seasons. Connecting letter report was derived from comparison of

means using Student’s t-test, α=0.05.

Distributions of Yield, Days to 50% Flowering and Plant Height

The identification of best performing genotypes for the population of environments

or groups of environments is not within the scope of this dissertation, hence analyses of

genotype performance in comparison to checks were not performed. Distributions of trait

measurements were obtained for each location and outliers were eliminated before

performing BLUP analysis. A snapshot of trait distributions across 332 locations is

presented in Figure 5. Examination and, if necessary, elimination of outliers were done on

a per location basis due to the varying location means. Figure 6 shows the distribution of

traits per season. Visual examination of the histograms suggests that observations were

drawn from the normal distribution, which is a characteristic of most biological datasets.

Figure 5. Trait x location box plots used in checking location data for quality showing the

spread of data points per location. Unexplained data points in each location were

discarded.

Figure 6. Distributions of yield, days to flowering (DTF) and plant height across locations

planted in wet and dry seasons.

Distribution of Hybrids Across Locations

Not all 510 hybrids and checks were planted in every location, hence the data is

highly unbalanced, which can be visualized in the heat maps in Figure 7. The heat maps

also show how male and female parents were distributed as parental

lines in the hybrids tested in 332 locations. The combined analysis in succeeding

discussions considered the whole set of locations as the subset of the target population of

environments, hence environmental variance with regard to locations was assumed to be

zero. This does not mean however that there are no differences among locations. In terms

of plant breeding practice, the whole set of locations was held as an orthogonal set. This

assumption will be reinforced by Best Linear Unbiased Prediction (BLUP) analysis, which

predicts the performance of hybrids in similar locations not included in the field trials.

Analysis of Multiple Locations

The highly unbalanced data structure requires prediction of genotype performance

to obtain adjusted means. For this requirement, several steps were taken to obtain BLUPs

for yield, days to flowering and plant height. Main effects and interactions (location,

season, location x season, replication, genotype, male parent, female parent, female x

male, genotype x location, genotype x season, and genotype x location x season) were

fitted in a linear mixed model that also includes effects necessary to derive general

combining ability (GCA) for the traits being analyzed.

Figure 7. Heat map showing distribution of genotypes across locations. Male and female

parents are also shown, as represented by hybrid progenies and not as tested

genotypes.

Variance Components and Computed Trait Heritabilities

Variance components were assessed to partition genotypic, environmental and

genotype x environment variances from which trait heritabilities were computed (Table 2).

The general mixed model equation fitted by Residual Maximum Likelihood (REML) using

lme4 package (Bates et al., 2015) in R was of the form:

trait ~ (1 | location) + (1 | season) + (1 | location:season) + (1 | rep) + (1 | genotype) + (1 | male) + (1 | female) + (1 | female:male) + (1 | genotype:location) + (1 | genotype:season) + (1 | genotype:location:season)

Table 2. Variance components of the phenotypes yield, days to 50% flowering and plant

height derived from REML-fitted linear mixed models, and computed

heritabilities. Variance components, except genotypic variance, were divided by

the number of levels per source of variation.

SOURCE OF VARIATION    YIELD (H=0.313)       DTF (H=0.5036)       PLANT HEIGHT (H=0.5486)
                       df      Variance      df      Variance     df      Variance
Environment
Location (L)           331     1566097.2     327     25.635899    330     40.413
Season (S)             1       747332.3      1       0.215002     1       7.648
L X S                  331     66760.4       327     1.416065     330     15.829
Reps (Environment)     2       664.8         2       0.001114     2       0.001
Genotype (G)           509     318181.7      509     10.164752    509     45.501
Male                   167     24477.4       167     2.196878     167     9.775
Female                 47      10242.9       47      2.860957     47      18.56
Female X Male          440     13662.8       440     0.505497     440     0.001
G X L                  7939    188247.8      7857    1.924406     8025    10.558
G X S                  567     9470.3        567     2.582941     567     3.837
G X S X L              7939    273579.1      7857    3.380668     8025    1.775
Pooled Error                   227158.9              2.129475             21.272

The variance components suggest the presence of genotype x environment (GxE)

interaction, the environment being different locations or seasons. Incorporation of GxE into

the prediction model is not within the scope of this study; hence the overall genomic

prediction analysis was done in two stages: obtaining adjusted means for each genotype

across all environments, and fitting the adjusted values into the prediction model.

Within locations, presence of replication variance components in yield variance

suggests the presence of local variation. Experimental errors which are location parameters

can be minimized by blocking (Gomez and Gomez, 1984). Blocking creates homogeneous

partitions in the field in which the nuisance factors are held constant to increase the

detection of variation in the factor of interest. In the design used in this study, blocking was

implemented on paddy fields separated by bunds or levees. Blocking was done primarily

to counteract variation on fertilization and irrigation among paddy fields. Rows and ranges

were not recorded hence spatial adjustment cannot be performed. Replication did not

significantly contribute to DTF and plant height variance, hence was not included in the

BLUP model for these traits.

Heritabilities were computed from genotypic variance, GxE variances and pooled error

variance. Environmental variances were excluded in the heritability equation because these

are considered orthogonal in the context of plant breeding, i.e. the genotypes are tested in

similar environments. The equation for heritability is

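$$H = \frac{\sigma_g^2}{\sigma_g^2 + \frac{\sigma_{gl}^2}{L} + \frac{\sigma_{gs}^2}{S} + \frac{\sigma_{gls}^2}{LS} + \frac{\sigma_e^2}{n}}$$

where $\sigma_g^2$ is the genotypic variance, $\sigma_{gl}^2$ is the genotype x location variance, $\sigma_{gs}^2$ is the

genotype x season variance, $\sigma_{gls}^2$ is the genotype x location x season variance, and $\sigma_e^2$ is

the pooled error variance.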
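The heritabilities reported in Table 2 can be reproduced from the tabulated variance components with a short R calculation. Because the Table 2 caption states that the components other than genotypic variance were already divided by the number of levels, they are summed directly here. The data frame vc and its column names are hypothetical, and the assignment of values to rows follows the row order of Table 2.

```r
# Variance components from Table 2 (already divided by the number of levels,
# except genotypic variance); column names are hypothetical labels
vc <- data.frame(
  trait = c("YLD", "DTF", "PH"),
  g   = c(318181.7, 10.164752, 45.501),   # genotype
  gl  = c(188247.8, 1.924406, 10.558),    # genotype x location
  gs  = c(9470.3, 2.582941, 3.837),       # genotype x season
  gls = c(273579.1, 3.380668, 1.775),     # genotype x season x location
  e   = c(227158.9, 2.129475, 21.272)     # pooled error
)

# Heritability: genotypic variance over the sum of genotypic, GxE and error terms
vc$H <- with(vc, g / (g + gl + gs + gls + e))
H_rounded <- round(vc$H, 4)
```

The rounded values agree with the heritabilities shown in the Table 2 header (0.313, 0.5036 and 0.5486).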

Deriving BLUPs from Linear Mixed Models

Applications of best linear unbiased predictions have been extensively reviewed by

Robinson (1991), Piepho et al. (2007) and others. In this study, BLUPs are applied to

predict genotype performance as adjusted trait measurements from unbalanced data

(Bernardo, 1995), i.e. genotypes are not planted in all locations, years, seasons and other

combinations of environments.

All variables were assigned as random effects. The ranef function (e.g.

yldr <- ranef(yldblupmodel)) of the lme4 package extracts the conditional modes of the

random variables.

Shrinkage Toward the Mean

BLUP adjustments shrank the yield values toward the analysis mean, which is

consistent with the BLUP concept (Robinson, 1991) and empirical results on phenotypic

BLUPs (Bernardo, 1996a and 1996b; Piepho et al., 2007) and marker-based BLUPs

(Crossa et al., 2010).

The shrinkage mean or BLUP mean for yield and other response variables for a

given level of a random factor (years, location, etc.) is a weighted combination of the

analysis mean, based on the fixed effects, and the ordinary mean for the level of the random

factor. The variety means are calculated as:

$$\bar{y}_i = \frac{\sum_{j=1}^{n} Y_{ij}}{n}$$

for $n$ replications for a single location. The variety effect can be calculated by subtracting

the overall mean $\bar{y}$ from the variety mean:

$$\hat{\tau}_i = \frac{\sum_{j=1}^{n} (Y_{ij} - \bar{y})}{n}$$

which corresponds to the Best Linear Unbiased Estimator or BLUE (Gilmour, 2010).

BLUEs estimate the posterior conditions of a field trial but are not the best predictors of

future performance. Location error variance and genotypic variance are taken from

expected mean square computations and can be represented by the ratio:

$$\gamma = \frac{\sigma_g^2}{\sigma_e^2}$$

and this ratio is incorporated into the BLUE equation above to become:

$$\tilde{\tau}_i = \frac{\sum_{j=1}^{n} (Y_{ij} - \bar{y})}{n + \frac{1}{\gamma}} = \frac{\sum_{j=1}^{n} (Y_{ij} - \bar{y})}{n + \frac{\sigma_e^2}{\sigma_g^2}}$$

The predictor $\tilde{\tau}_i$ now becomes the BLUP. The variance ratio shrinks the treatment effects

when added into the denominator. BLUPs are more likely to represent future results

(Gilmour, 2010) and are more appropriate for two-stage genomic prediction approaches

such as the method used in this study. Figure 8 exhibits the shrinkage (adjusted) mean

overlaid on the observed mean.

Adjustments are therefore smaller if the genetic variance is significantly larger than the

error variance, which again reinforces the importance of reducing the error term through

appropriate experimental designs, plotsmanship and minimizing introduction of extraneous

variability into the trials. BLUPs can theoretically approximate the BLUE values if the

error variance approaches zero.
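The shrinkage behaviour described in this section can be illustrated numerically. The R sketch below applies the BLUE and BLUP expressions to one variety; the function name shrink_effect, the plot values, the overall mean and the variances are hypothetical and are not taken from the trial data.

```r
# Variety effect as BLUE (no shrinkage) or BLUP (shrunk by the variance ratio)
shrink_effect <- function(Y_i, y_bar, sigma2_g, sigma2_e) {
  n <- length(Y_i)
  c(blue = sum(Y_i - y_bar) / n,
    blup = sum(Y_i - y_bar) / (n + sigma2_e / sigma2_g))
}

Y_i <- c(6200, 6500, 6350)   # hypothetical plot yields of one variety (kg/ha)
y_bar <- 6000                # hypothetical overall mean

# Error variance comparable to the genetic variance: clear shrinkage
eff_noisy <- shrink_effect(Y_i, y_bar, sigma2_g = 300, sigma2_e = 200)

# Error variance close to zero: the BLUP approaches the BLUE
eff_clean <- shrink_effect(Y_i, y_bar, sigma2_g = 300, sigma2_e = 1e-9)
```

With these hypothetical inputs the BLUE is 350 kg/ha in both cases, while the BLUP is pulled toward zero (that is, toward the overall mean) only when the error variance is appreciable.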

Figure 8. Observed and adjusted values (BLUPs) for yield, days to 50% flowering and

plant height of 510 genotypes showing shrinkage of adjusted means toward the

analysis mean.

Deriving General Combining Ability

Hybrid crop breeding only makes sense in the context of cycle-over-cycle genetic

gain if general combining ability is the main criterion in using inbreds. In reciprocal

population improvement, new inbreds are testcrossed to testers from a complementing

pool. The testcross performance of the inbreds with the testers is generally interpreted as

the GCA and inbreds with the highest GCAs are advanced or promoted and used in the

next breeding cycle. This procedure is repeated; hence it is also called reciprocal recurrent

selection and inter-population improvement, and is a basic concept in any hybrid breeding

program.

This study uses GCA of 122 parental lines for yield, DTF and plant height as

phenotypic values for the prediction models. GCA was derived from the trait BLUP models

using the following scripts:

yldr <- ranef(yldblupmodel)

yldfgca <- yldr$female #(Yield GCAs of females)

yldmgca <- yldr$male #(Yield GCAs of males)

dtfr <- ranef(dtfblupmodel)

dtffgca <- dtfr$female #(DTF GCAs of females)

dtfmgca <- dtfr$male #(DTF GCAs of males)

plthtr <- ranef(plthtblupmodel)

plthtfgca <- plthtr$female #(Plant height GCAs of females)

plthtmgca <- plthtr$male #(Plant height GCAs of males)

Figure 9 plots the GCAs of the parental lines for the three traits as positive or

negative values which are interpreted as the average contribution of the parental line to

hybrid performance. Since the models also specify that the males and females are random

effects, this assumption can be extended to the set of inbreds from which the set of 122

parental lines are drawn, i.e. a breeding program. Hence, GCA is useful because it is

predictive of the success of a hybrid breeding program.

If this study were a breeding program, the breeder would typically select those with

positive GCA values for yield from the set of inbreds. Depending on the product profile of

a breeding program’s target environment, the breeder can also select parental lines that can

contribute early or late maturity, and short or tall plant height.

Figure 9. General combining ability (GCA) for yield (kg), DTF (days) and

plant height (cm) of 122 parental lines of rice hybrids.

Throughout this manuscript, the terms “yield,” “days to flowering,” and “plant

height” refer to the general combining abilities for these traits.

Marker Coverage and Population Structure

The top 122 parental lines (41 females and 81 males) representing elite lines were

pre-selected for genotyping using a chip-based platform with 60,000 single nucleotide

polymorphisms (SNPs). These lines were represented by several seed sources some of

which have been genotyped previously. There were no lines with significant non-

concordance of allele calls, indicating very low rate of technical error, hence no lines were

discarded. Pre-processing of 60,000 SNP marker alleles on the samples resulted in

retention of 43,344 markers which are co-dominant and without rare alleles (<5% allele

frequency).

Descriptive Statistics on SNP Marker Data

The 43,344 SNP loci used in the study are distributed throughout the twelve rice

chromosomes. The marker x genotype dataset, in which genotypes are assigned as column

names, was transposed as a requirement of the marker matrix needed in the succeeding

analysis steps, a procedure done in R. Matrices in R can accommodate thousands of

columns and limitations are usually set by the computing power of the computer

performing the calculations. Recoding the allele calls to numerical scores {-1,0,1} was also

implemented.

The distribution of SNP scores {-1,0,1} was taken from the marker matrix to

visualize the frequencies of allele calls. The frequency of heterozygotes and imputed calls is

low (Fig. 10). Naive imputation substitutes missing data with the mean value for

the locus. This resulted in marker scores between the {-1,0,1} values, although the imputed

values are negligible in the overall data structure. Accuracy of prediction models can be

negatively impacted by heterozygote calls as well as excessive missing and imputed marker

data. Marker linkage position and LD information, if available, can be used to impute the

actual {-1,0,1} values of missing marker data.

Figure 10. Distribution of {-1,0,1} allele calls in the marker matrix derived

from scores from 43,344 SNP loci.

Genomic Relationships and Principal Components

Additive relationship matrices play an important role in the prediction of breeding

values. The genetic basis of additive relationship matrices lies in the infinitesimal model

wherein breeding value is considered to be the sum of thousands of allele effects. In the

classic infinitesimal model, Fisher (1918) postulated that a quantitative trait is controlled

by an infinite number of loci and each locus has an infinitely small effect. Large numbers

of markers with whole genome coverage can capture genetic similarity with more accuracy

than pedigree-based relationships because the genetic covariances would be based on the

actual proportion of the genome that is identical by descent between any two individuals

(Van-Arendonk et al., 1994). VanRaden (2008) also proposed that whole genome markers

can estimate the proportion of chromosome segments shared by individuals including

identification of genes that are identical in state.

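The genotype x marker matrix M with recoded values {-1,0,1} was transposed

to M’ and the two matrices were multiplied to obtain the MM’ matrix, as illustrated in the

following example:

$$MM' = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & -1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 1 \\ -1 & 0 & -1 \end{bmatrix} = \begin{bmatrix} 2 & 0 & 2 \\ 0 & 0 & 0 \\ 2 & 0 & 3 \end{bmatrix}$$

where the rows of M correspond to Inbred A, Inbred B and Inbred C, respectively.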

In the MM’ product of the two matrices, the diagonal values count the number of

homozygous loci for each inbred. In the example above, Inbred A has two homozygous

loci, Inbred B has none and Inbred C has three homozygous loci. Off-diagonals count the

number of alleles shared by the inbreds. Inbreds A and C share two homozygous loci, while

none is shared between A-B and B-C.
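The worked example can be checked directly in R. The matrix values are those of the example for Inbreds A, B and C; the object names and locus labels are hypothetical.

```r
# Marker matrix from the worked example: rows = inbreds, columns = loci
M <- matrix(c(1, 0, -1,
              0, 0,  0,
              1, 1, -1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Inbred A", "Inbred B", "Inbred C"),
                            paste0("locus", 1:3)))

# MM' product; tcrossprod(M) would give the same result
MMt <- M %*% t(M)

# Diagonal elements count the homozygous (non-zero) loci of each inbred
n_homozygous <- rowSums(M != 0)
```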

The matrix was then centered and scaled so that rarer alleles are given more weight

and to standardize the mean of the diagonal elements to 1 + 𝑓, where 𝑓 is the inbreeding

coefficient. The rrBLUP function A.mat returns an additive relationship matrix based on

the above principles. Figure 11 shows a color map of the realized relationship matrix of

122 parental lines.
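The centering and scaling step can be sketched in base R following the first method of VanRaden (2008). This is an illustration under stated assumptions rather than a reproduction of A.mat, whose output may differ depending on its imputation and shrinkage options; the simulated marker matrix and object names are hypothetical.

```r
set.seed(3)
# Simulated marker matrix coded {-1, 0, 1}: 6 lines x 40 loci (hypothetical data)
M <- matrix(sample(c(-1, 0, 1), 6 * 40, replace = TRUE, prob = c(0.3, 0.1, 0.6)),
            nrow = 6)

# Frequency of the allele coded 1 at each locus
p <- (colMeans(M) + 1) / 2

# Center each locus by subtracting 2p - 1 (its mean score)
Z <- sweep(M, 2, 2 * p - 1)

# Scale by 2 * sum(p * (1 - p)) so that G is comparable to an additive relationship matrix
G <- tcrossprod(Z) / (2 * sum(p * (1 - p)))
```

Dividing by the summed heterozygosity is what gives loci with rarer alleles relatively more weight than a plain MM’ count.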

Eigenvectors of the relationship matrix were calculated in JMP® to generate

principal components. Figure 12 summarizes and plots the principal components that

correspond to the main clusters in the realized relationship matrix heat map.

The first two principal components already explain most of the variance in the

population structure. Population structure agrees with the existing pedigree structure (not

used in this study). The three clusters represent three genotype groups – one group

consisting mostly of CMS lines and two groups which are mostly restorer lines.

Figure 11. Heat map of realized relationship matrix of 122 parental lines showing three

main clusters representing one female and two male clusters.

Figure 12. Principal component analysis using marker data: (a) highest two principal

components in 2D plot, (b) highest three principal components in 3D plot, (c)

scree plot showing magnitude of eigenvalues and variance explained by the

principal components, and (d) summary table of eigenvalues.

Evaluation of Genomic Prediction Methods

This work is one of the first genomic selection studies in rice along with Spindel et

al. (2015) and Grenier et al. (2015), and arguably the first in hybrid rice. Data collected

from more than 24,000 plots over four seasons of field trialing in more than 300 locations,

were used in exploring suitability of genomic selection. Three traits with varying

heritabilities were predicted using four genomic selection models. Training population sizes

were varied in the cross validation procedure by differential partitioning of the datasets,

i.e. three-fold and ten-fold cross validation procedures correspond to 2/3 and 9/10 training

population sizes, respectively.

Since the study used SNP chips and also realizing that SNP chips and other fixed

platforms are becoming more common and more affordable, marker density was not held

as a variable. Marker density optimizations are applicable only to flexible marker platforms

but these are relatively more expensive than fixed platforms.

Genomic BLUP (GBLUP) and Ridge Regression

Genomic Estimated Breeding Values (GEBVs) can be estimated by ridge

regression directly relating the imputed marker matrix to the phenotype using the mixed

model function in rrBLUP:

ridgeyld<-mixed.solve (y=phenoyldgca$yldgca, Z=genoimputed)

Another method, GBLUP, uses the genomic relationship matrix through the kinship

BLUP function (kin.blup) of rrBLUP package, instead of the imputed marker matrix:

gblupyld<-kin.blup(data=phenoyldgca,geno='parent',pheno='yldgca', K=G)

Using RR-BLUP as genomic selection model, the accuracies of the two methods

for three traits were compared (Figure 13; Table 3). Results indicate that the two methods

provide nearly identical accuracies. As discussed earlier, the realized genomic

relationships used in GBLUP were derived directly from the marker matrix.
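The agreement between the two methods can be illustrated with a small base R sketch. It assumes the same fixed shrinkage parameter in both forms and an unscaled relationship matrix K = MM’; in rrBLUP the variance components are estimated and the relationship matrix is centered and scaled, so this shows only the algebraic equivalence, not the package internals. The simulated data and object names are hypothetical.

```r
set.seed(4)
n <- 15   # lines
m <- 60   # markers
M <- matrix(sample(c(-1, 0, 1), n * m, replace = TRUE), nrow = n)
y <- rnorm(n)
y <- y - mean(y)   # centered phenotypes (hypothetical)
lambda <- 5        # assumed common shrinkage parameter

# Ridge regression on markers: estimate m marker effects, then sum them per line
u_hat <- solve(crossprod(M) + lambda * diag(m), crossprod(M, y))
gebv_rr <- as.vector(M %*% u_hat)

# GBLUP form: work with the n x n matrix K = MM' instead of the n x m marker matrix
K <- tcrossprod(M)
gebv_gblup <- as.vector(K %*% solve(K + lambda * diag(n), y))

max_diff <- max(abs(gebv_rr - gebv_gblup))
```

The GBLUP form solves a system of the size of the number of lines rather than the number of markers, which is the practical reason for preferring the relationship matrix when markers greatly outnumber lines.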

Table 3. LSD threshold matrix comparing prediction accuracy means of GBLUP and

Ridge Regression for GCA of 122 rice parental lines for three traits.

                   YIELD                   DTF                     PLANT HEIGHT
                   GBLUP     Ridge         GBLUP     Ridge         GBLUP     Ridge
                             Regression              Regression              Regression
GBLUP              -0.18265  -0.18265      -0.14099  -0.14099      -0.12086  -0.12086
Ridge Regression   -0.18265  -0.18265      -0.14099  -0.14099      -0.12086  -0.12086

Positive values show pairs of means that are significantly different.

Figure 13. Correlation of GEBVs in RR-BLUP using marker data directly (Ridge

Regression) and genomic relationships (GBLUP).


Because of these results, genomic relationships were used in succeeding analyses

instead of the marker matrix. The realized relationship matrix is a 122 x 122 matrix

corresponding to the number of lines in both the rows and columns, while the marker matrix

is a large 122 x 43,344 matrix corresponding to the number of lines in the rows and the

number of SNP loci in the columns.
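The reduction from a lines-by-markers matrix to a lines-by-lines matrix can be sketched in Python as follows. The formula is a VanRaden-type relationship matrix for markers coded -1, 0, 1, which is an assumption made for illustration and not necessarily the exact computation used in the study; the function name genomic_relationship and the toy marker scores are hypothetical.

```python
def genomic_relationship(markers):
    """Build an n x n relationship matrix from n lines scored at m markers (-1, 0, 1)."""
    n, m = len(markers), len(markers[0])
    # Frequency of the allele coded +1 at each locus.
    freqs = [(sum(row[j] for row in markers) / n + 1) / 2 for j in range(m)]
    # Centre each marker column at its mean value 2p - 1.
    centred = [[row[j] - (2 * freqs[j] - 1) for j in range(m)] for row in markers]
    # Scale by the summed expected heterozygosity (assumes at least one polymorphic locus).
    scale = 2 * sum(p * (1 - p) for p in freqs)
    return [[sum(a * b for a, b in zip(ri, rj)) / scale for rj in centred]
            for ri in centred]


# Toy data: 4 lines x 5 markers (hypothetical scores).
toy = [[1, -1, 0, 1, -1],
       [1, 1, 0, -1, -1],
       [-1, -1, 1, 1, 0],
       [0, 1, -1, -1, 1]]
G = genomic_relationship(toy)
print(len(G), len(G[0]))  # 4 4
```

Whatever the number of markers, the resulting matrix has one row and one column per line, which is why the relationship matrix is far smaller than the marker matrix when markers greatly outnumber lines.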

Effect of Trait Heritability on Prediction Accuracy

Genomic selection accuracy is affected by the heritability of the trait (Lorenz et al.,

2011; Bernardo, 2010; Asoro et al., 2011). First, low heritability is reflected in the

data collected from field trials, which are used to train the genomic selection model. Second,

heritability is commonly used to account for the unknown true breeding value (TBV),

because what is observed in the field is the empirical breeding value (EBV). The correlation of

GEBV to EBV is divided by the square root of the heritability to relate GEBV to TBV:

\[ \gamma_{GEBV,TBV} = \frac{\gamma_{GEBV,EBV}}{\gamma_{EBV,TBV}} = \frac{\gamma_{GEBV,EBV}}{\sqrt{H}} \]
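The rescaling of the GEBV-EBV correlation by the square root of heritability can be sketched in Python. The function names pearson_r and accuracy_vs_tbv, the five GEBV and EBV values, and the heritability of 0.5 are hypothetical and serve only to show the arithmetic.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def accuracy_vs_tbv(r_gebv_ebv, heritability):
    """Divide the GEBV-EBV correlation by sqrt(H) to relate GEBV to TBV."""
    if not 0 < heritability <= 1:
        raise ValueError("heritability must lie in (0, 1]")
    return r_gebv_ebv / math.sqrt(heritability)


# Hypothetical GEBVs and EBVs for five lines in one validation fold.
gebv = [4.1, 5.0, 4.6, 5.4, 4.9]
ebv = [3.8, 5.2, 4.9, 5.1, 4.4]
r = pearson_r(gebv, ebv)
print(round(r, 3), round(accuracy_vs_tbv(r, 0.5), 3))
```

Because the divisor is at most one, the rescaled value is never smaller than the observed correlation, and the adjustment is largest for traits with low heritability.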

The effect of predicted trait on genomic selection accuracy is a function of the trait

heritability. The computed heritabilities of the traits are presented in Table 2. Heritability

was computed as genotypic variance divided by the total phenotypic variance, excluding

variance due to location. The yield trial was assumed to be conducted in an orthogonal set

of locations. Figure 14 illustrates the variability chart for the prediction accuracy with the


predicted trait as the main variable group, showing that the trait with the lowest heritability,

yield, has the lowest prediction accuracy.

The variability plots of correlations of GEBV and EBV versus cross validation

method across genomic selection models and traits indicate that prediction accuracies are

generally lower in the three-fold method, except for Bayesian LASSO. The variability

graph also indicates that Bayesian LASSO has the lowest correlation except for 10-fold

validation on plant height. Yield again has the lowest prediction accuracy. The preceding

variability charts also show that traits with low heritability are predicted less accurately.

Prediction accuracy means for the three traits were compared using Tukey’s HSD

test (Tukey, 1949) at α=0.05 (Figure 15; Table 4). Prediction accuracies for the traits differ

significantly, although the effect is not contributed exclusively by trait alone. Yield, having

the lowest heritability, has the lowest prediction accuracy. Heritabilities of DTF and plant

height are similar by plant breeding standards, but their prediction accuracies significantly

differ. This suggests differences in trait architecture such as number of QTL involved and

the magnitude of the effects of each QTL.


Figure 14. Variability chart of prediction accuracies per trait in 122 rice parental

lines. First and second level factors were interchanged between the two

graphs.


Figure 15. Box plots of prediction accuracy per trait in 122 rice parental lines and

comparison circles based on Tukey’s honest significant difference test.

Table 4. HSD threshold matrix of genomic selection accuracy means between all pairs

of traits.

               PLANT HEIGHT   DTF        YIELD
Plant height   -0.07295       0.07926    0.20425
DTF             0.07926      -0.07295    0.05204
Yield           0.20425       0.05204   -0.07295
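The entries of Table 4 can be converted back to differences between trait means with a few lines of Python. The sketch assumes that the matrix follows the common convention of reporting the absolute difference between two means minus the HSD, so that each diagonal entry equals the negative of the HSD; this convention is an assumption and is not stated in the text. The variable names are hypothetical.

```python
# Off-diagonal entries of Table 4 (assumed to be |mean_i - mean_j| - HSD).
threshold = {
    ("Plant height", "DTF"): 0.07926,
    ("Plant height", "Yield"): 0.20425,
    ("DTF", "Yield"): 0.05204,
}
hsd = 0.07295  # negative of the diagonal entries, under the same assumption

# Recover the absolute mean differences and flag significant pairs.
abs_diff = {pair: round(value + hsd, 5) for pair, value in threshold.items()}
significant = {pair: value > 0 for pair, value in threshold.items()}

for pair in threshold:
    print(pair, abs_diff[pair], significant[pair])
```

Under this reading, the recovered differences for plant height versus DTF and for DTF versus yield add up to the difference for plant height versus yield, which is what would be expected if the mean accuracies are ordered plant height > DTF > yield.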


Effect of Training Population Size on Prediction Accuracy

Training population size was varied by using two cross validation methods. Ten-fold

cross validation means dividing the population into ten parts and using nine parts as

training set and the tenth part as validation set, and performing ten rounds of cross

validation using different parts as validation sets. This therefore corresponds to a training

population size of 108 lines, or 90% of total number of lines. Three-fold cross validation


method can also be interpreted as using two-thirds of the total number of lines as training

set. The variability chart in Figure 16 shows that a ten-fold cross validation generally has

greater prediction accuracy than a three-fold method. These results on training population

size are consistent with the findings of other researchers (Asoro et al., 2011; Hickey et al.,

2014).

The training population used so far in this study is composed of mixed

subpopulations, wherein the subpopulations or clusters were described previously. The

presence of subpopulations in a breeding germplasm may occur based on breeding history

i.e. frequency of use of a few elite lines as parents of breeding crosses. This is especially

true in hybrid breeding programs utilizing heterotic pools wherein heterotic pools represent

subpopulations.

Mixed subpopulation training sets have been used in cattle (Hayes et al., 2009)

where mixed-breed Jersey and Holstein populations were used to predict purebred Jersey

or Holstein individuals with similar accuracies to within-breed predictions.

Figure 17 shows the comparison of means between the cross validation methods or

training population size using Tukey’s HSD test at α=0.05. The HSD threshold matrix is

given in Table 5. Across all traits and prediction methods, prediction accuracy differs

significantly between the two training population sizes.


Figure 16. Variability chart of prediction accuracy per cross validation method. The

first and second level factors were interchanged between charts.


Figure 17. Box plots of prediction accuracy per cross validation method (training

population size) and comparison circles based on Tukey’s honest significant

difference test.

Table 5. HSD threshold matrix of prediction accuracy means between two cross

validation methods or training population size.

          10-FOLD    3-FOLD
10-fold   -0.04806   0.03558
3-fold     0.03558  -0.08775


Effect of Genomic Selection Model on Prediction Accuracy

Genomic selection models used in this study were selected based on the reported

accuracies and usefulness of various models in literature. Most studies report the usefulness

of RR-BLUP and Bayesian methods (Heffner et al., 2009; Lorenz et al., 2011). Habier et

al. (2007) showed that RR-BLUP modelled genetic relationships more accurately because

it fitted more markers into the model than Bayesian methods, although Bayesian methods


were able to account for marker-QTL association in the model. Hayes et al. (2009),

Lorenzana and Bernardo (2009), Moser et al. (2009) and VanRaden et al. (2009)

demonstrated that prediction models that assume many loci evenly distributed in the

genome (e.g. RR-BLUP) have similar prediction accuracies as methods that assume fewer

loci but with varying effects (e.g. Bayesian). In some cases, RR-BLUP models are even

more accurate than Bayesian models (Lorenzana and Bernardo, 2009). Lorenz et al. (2011)

attribute this to the fact that the genetic architecture of complex traits is more likely

aligned with the infinitesimal model than with a model of a few dozen loci with varying

effects.

Figure 18 illustrates a variability chart of prediction accuracy per prediction model.

It is apparent that there is no general trend among the prediction models. A comparison of

means (Fig. 19; Table 6) confirms this observation.

It should be noted at this point that the means used to compare genomic selection

models are overall means across traits and training population size. Different trait

heritabilities may require different genomic selection models and different training

population sizes. Optimizing genomic selection parameters may identify

settings that balance the desired accuracy against the choice of genomic selection model and against cost, which

comes largely from phenotyping the training population. A large training population

is one of the resource-intensive components of genomic selection; hence, an optimum

prediction accuracy with respect to training population size is recommended.


Figure 18. Variability chart of prediction accuracy per genomic selection model. The

first and second level factors were interchanged between charts.


Figure 19. Box plots of prediction accuracy per genomic selection model and

comparison circles based on Tukey’s honest significant difference

test.

Table 6. HSD threshold matrix among prediction accuracy means between all pairs of

genomic selection models.

           RR-BLUP    BAYESCPI   BAYESRR    BAYESL
RR-BLUP    -0.11330   -0.10805   -0.10515   -0.03876
BayesCPi   -0.10805   -0.11330   -0.11040   -0.04401
BayesRR    -0.10515   -0.11040   -0.11330   -0.04691
BayesL     -0.03876   -0.04401   -0.04691   -0.11330



Correlations of the Different Genomic Selection Models

Almost all studies report varying correlations of genomic selection models (Asoro

et al., 2011; Lorenz et al., 2011; Bernardo and Yu, 2007; Heffner et al., 2009). This is

mainly because of the unique properties of breeding programs and crops on which genomic

selection studies are performed. Breeding programs and consequently populations for

genomic selection and resulting data may differ in trait heritabilities (Bernardo, 2010),

population structure (Hickey et al., 2014), marker density (Meuwissen et al., 2001) and

many other factors. However, general trends may be similar.

The correlations among genomic prediction models reported in this study may only

be valid for the dataset used and by extension the larger population from which the dataset

is drawn. Correlations are given in Figure 20. RR-BLUP, BayesCPi and BayesRR are

highly correlated (>0.9). Spearman’s ρ test (Spearman, 1904) was performed and the results are given

in Table 7. Spearman’s ρ is a nonparametric measure of statistical dependence between

two variables that assesses how well their relationship can be described by a monotonic function.
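The rank-based nature of Spearman’s ρ can be sketched in Python by ranking both variables (with average ranks for ties) and computing the Pearson correlation of the ranks. This is a generic illustration and not the software routine used in the study; the function names and the example accuracies are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical accuracies from two models; the second is a monotonic
# but nonlinear transformation of the first.
model_a = [0.10, 0.40, 0.50, 0.90]
model_b = [v ** 3 for v in model_a]
print(spearman_rho(model_a, model_b))
```

Because only the ordering of the values matters, two models that rank the lines identically have a ρ of one even when their predicted values are not linearly related.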

Table 7. Spearman’s rank correlation coefficient between pairs of genomic selection

models across all traits.

VARIABLE 1   VARIABLE 2   SPEARMAN’S ρ   Prob>|ρ|
BayesRR      RR-BLUP      0.9757         <.0001*
BayesCPi     RR-BLUP      0.9816         <.0001*
BayesCPi     BayesRR      0.9721         <.0001*
BayesL       RR-BLUP      0.6844         <.0001*
BayesL       BayesRR      0.6654         <.0001*


Figure 20. Scatterplot matrices of correlations between pairs of genomic selection models

for all traits and each trait individually.


Population Structure as Covariate

The presence of population structure may confound prediction accuracies (Hayes

et al., 2009) and genome-wide association mapping or GWAS (Zhao et al., 2011). Several

methods have been able to incorporate population structure in genomic selection. The most

common but least accurate is incorporating the categorical grouping into the first stage

phenotypic linear mixed model as fixed effects as shown in the equation:

𝑌 = (1|𝑔) + (1|𝑙) + (1|𝑠) + (1|𝑟𝑒𝑝) + (1|𝑔: 𝑙) + (1|𝑔: 𝑠) + (1|𝑔: 𝑙: 𝑠) + 𝑠𝑢𝑏𝑝𝑜𝑝

Eigenvectors of the principal components, instead of the categorical groups, can

also be incorporated as fixed effects. Usually, the top principal components that together explain

more than half of the total population variance are selected. The advantage of this method is

that the fixed effects due to population structure would have a continuous distribution, as

shown in the mixed model equation:

𝑌 = (1|𝑔) + (1|𝑙) + (1|𝑠) + ⋯ + 𝑃𝐶1 + 𝑃𝐶2 + ⋯ + 𝑃𝐶𝑛
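The rule of retaining the top principal components that together explain more than half of the total variance can be sketched in Python. The function name components_explaining and the eigenvalues are hypothetical; they are not the eigenvalues obtained in this study.

```python
def components_explaining(eigenvalues, threshold=0.5):
    """Return how many leading components are needed to exceed the variance threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total > threshold:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues of the leading principal components.
eigenvalues = [30.0, 18.0, 9.0, 6.0, 4.0, 3.0]
n_pc = components_explaining(eigenvalues)
print(n_pc)  # 2
```

In this hypothetical case the first component explains about 43% of the variance and the first two explain about 69%, so two principal components would enter the model as fixed-effect covariates.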

Asoro and co-workers (2011) demonstrated that principal components can account

for population structure in the genomic prediction step by including significant eigenvectors

in the model, as shown in the mixed model equation:

𝑌 = 𝜇 + 𝑄𝑣 + 𝑀𝛼 + 𝑒


where 𝑌 is the empirical phenotypic value, 𝜇 is the intercept, 𝑄𝑣 is a fixed effects term

where 𝑄 is a matrix of significant eigenvectors and 𝑣 is a vector of regression coefficients

relating the principal components to the phenotypic values, 𝑀𝛼 is a random effects

term where 𝑀 is the marker matrix and 𝛼 is a vector of estimated marker effects, and 𝑒 is the residual.

In hybrid breeding programs, significant population structure is almost always

present in the form of heterotic pools because of the breeding method used. As

such, genomic prediction is naturally done within subpopulations, i.e. heterotic pools.

In this study, predictions for each of the three subpopulations taken individually gave

spurious results because of very small sample sizes. The most reasonable strategy is to jointly

take subpopulations 2 and 3 in a prediction model because it aligns well with breeding

information that these are all members of the male subpopulation. Genomic prediction

accuracies were compared within subpopulation and to the full population predictions.

Figure 21 and Table 8 present the comparison of overall means between prediction

models run on whole population and on subpopulations 2 and 3 only, indicating that

prediction methods are generally more accurate when predicting more related individuals.

These results were similar to those obtained in other works by Crossa et al. (2010), Asoro

et al. (2011), Habier et al. (2007) and Hayes et al. (2009).

Trait x GS model treatments seemed to exhibit similar trends (Fig. 22), except for

Bayesian LASSO when used on plant height, wherein predictions for the whole population

have a higher mean accuracy than subpopulation predictions. Comparison of means for each

prediction model (Fig. 23; Table 9) suggests no significant differences in all pairs of means.

Although this may imply that any of the genomic selection models can be used, it is

important to consider trait architecture when deciding on what genomic selection model to


use. The prediction accuracy means can be used to optimize genomic prediction

parameters.

Figure 21. Box plots of prediction accuracy of overall means using whole

population and subpopulations 2 and 3 jointly, and comparison

circles based on Tukey’s honest significant difference test.

Table 8. HSD threshold matrix of prediction accuracy means between subpopulations

and mixed population.

                   Subpopulation   Mixed population
Subpopulation      -0.05723         0.00164
Mixed population    0.00164        -0.05723



Figure 22. Variability chart for prediction accuracy showing mean differences for mixed

population (All) and subpopulation (Subpop) predictions for each genomic

selection model per trait.


Figure 23. Box plots of prediction accuracy of GS model means using

subpopulations 2 and 3 jointly, and comparison circles based on

Tukey’s honest significant difference test.



Table 9. HSD threshold matrix of prediction accuracy means between pairs of genomic

selection models using subpopulation prediction.

                          RR-BLUP    BAYESCPI   BAYESRR    BAYESL
Yield          RR-BLUP    -0.24388   -0.25407   -0.28010   -0.28552
               BayesCPi   -0.28552   -0.27533   -0.24931   -0.24388
               BayesRR    -0.24931   -0.25950   -0.28552   -0.28010
               BayesL     -0.27533   -0.28552   -0.25950   -0.25407
DTF            RR-BLUP    -0.27918   -0.27811   -0.27772   -0.22064
               BayesCPi   -0.27811   -0.27918   -0.27878   -0.22171
               BayesRR    -0.27772   -0.27878   -0.27918   -0.22210
               BayesL     -0.22064   -0.22171   -0.22210   -0.27918
Plant height   RR-BLUP    -0.27349   -0.27029   -0.25630   -0.09530
               BayesCPi   -0.27029   -0.27349   -0.25310   -0.09209
               BayesRR    -0.25630   -0.25310   -0.27349   -0.11248
               BayesL     -0.09530   -0.09209   -0.11248   -0.27349



A generalized linear model (GLM) was fitted using maximum likelihood as

estimation method for the subpopulation prediction dataset means. GLM is an ANOVA

procedure that uses least squares regression to determine the statistical relationship

between predictors (i.e. heritability, training population size, genomic selection model) and

a continuous response variable (i.e. prediction accuracy). GLM was used in this study to

predict genomic selection accuracy for newly observed trait heritabilities and training

population sizes, and to identify the combination of predictor values that jointly optimizes the fitted

prediction accuracy. The highest and lowest heritability means were taken as

levels in the model to obtain a 2x2x4 full factorial (Table 10). Center points were not


taken because no trait with a heritability at the midpoint of the heritability range was

included in the analysis. Therefore, curvature of the main effects could not be detected. The full

factorial model, which includes the main effects and two-way interactions, is given as

follows:

Prediction Accuracy = Heritability + Training Population + GS model

+ Heritability * Training Population

+ Heritability * GS model

+ Training Population * GS model

Table 10. Full factorial design of genomic selection accuracy means, heritability, training

population size and genomic selection model used in optimization.

RESPONSE:      HERITABILITY   TRAINING     GS MODEL
GS ACCURACY                   POPULATION
0.180270343    0.3130         0.667        RR-BLUP
0.256868733    0.3130         0.667        BayesRR
0.178678867    0.3130         0.667        BayesCPi
0.354683567    0.3130         0.667        BayesL
0.437842510    0.3130         0.9          RR-BLUP
0.416132110    0.3130         0.9          BayesRR
0.434348604    0.3130         0.9          BayesCPi
0.381462360    0.3130         0.9          BayesL
0.752761667    0.5486         0.667        RR-BLUP
0.706175833    0.5486         0.667        BayesRR
0.750252000    0.5486         0.667        BayesCPi
0.686572700    0.5486         0.667        BayesL
0.717289110    0.5486         0.9          RR-BLUP
0.682839460    0.5486         0.9          BayesRR
0.533130330    0.5486         0.9          BayesCPi
0.467523715    0.5486         0.9          BayesL
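The structure of the 2x2x4 design in Table 10 can be sketched in Python. The factor levels are those listed in the table; the bookkeeping of degrees of freedom assumes that heritability and training population size are each treated as two-level factors, which is an interpretation of the design and not a statement of how the statistical software parameterized the model. The variable names are hypothetical.

```python
from itertools import combinations, product

# Factor levels as listed in Table 10.
heritabilities = (0.3130, 0.5486)
training_fractions = (0.667, 0.9)
gs_models = ("RR-BLUP", "BayesRR", "BayesCPi", "BayesL")

# Full factorial: every combination of the three factors.
design = list(product(heritabilities, training_fractions, gs_models))

# Degrees of freedom for main effects and two-way interactions.
levels = {"Heritability": 2, "Training": 2, "GS model": 4}
main_df = {name: n - 1 for name, n in levels.items()}
interaction_df = {(a, b): main_df[a] * main_df[b] for a, b in combinations(levels, 2)}
model_df = sum(main_df.values()) + sum(interaction_df.values())
residual_df = len(design) - 1 - model_df

print(len(design), model_df, residual_df)  # 16 12 3
```

Under these assumptions the model uses 12 degrees of freedom and leaves 3 for the residual, which agrees with the DF values reported in Tables 11a and 11b.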


Whole Model Test for the Generalized Linear Model

The Whole Model Test given in Table 11a indicates that the regression coefficients

of the variables are not all equal to zero; hence, the model is significant. Deviance and Pearson

values are not significant (Table 11b), and thus do not indicate lack of fit.

Table 11a. Whole model test for the generalized linear model created to optimize genomic

selection.

MODEL        -LOG LIKELIHOOD   L-R CHI SQUARE   DF   PROB>CHI SQUARE
Difference   31.5483883        63.0968          12   <.0001
Full         -35.094803
Reduced      -3.5464143

Table 11b. Goodness of fit test for the generalized linear model created to optimize

genomic selection.

GOODNESS OF FIT STATISTIC   CHI SQUARE   DF   PROB>CHI SQUARE   OVERDISPERSION
Pearson                     0.0117       3    0.9997            0.0007
Deviance                    0.0117       3    0.9997

Effect Summary and Effect Tests

Table 12 shows the LogWorth, false discovery rate (FDR) LogWorth and FDR p-

values for the main effects and interactions included in the model. LogWorth is defined as

-log10(p-value). A value that exceeds 2 is significant at the 0.01 level. FDR LogWorth is

defined as -log10(FDR p-value). This is the best statistic for plotting and assessing


significance (JMP, 2015). The FDR p-value is obtained by using the Benjamini-Hochberg

technique, adjusting the p-values to control the false discovery rate for multiple tests.
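The relationship between LogWorth, the Benjamini-Hochberg adjustment and FDR LogWorth can be sketched in Python. The LogWorth values are those reported in Table 12; the function name benjamini_hochberg is hypothetical, and the sketch is a generic implementation of the technique, not the routine of the statistical software used in the study.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Walk from the largest p-value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = m - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# LogWorth values from Table 12; p-values are recovered as 10 ** -LogWorth.
logworth = {
    "Heritability": 13.703,
    "Heritability x Training": 8.375,
    "Heritability x GS Model": 3.394,
    "GS Model x Training": 3.330,
    "GS Model": 1.463,
    "Training": 1.141,
}
p_values = [10 ** -value for value in logworth.values()]
fdr_p = benjamini_hochberg(p_values)
fdr_logworth = [-math.log10(p) for p in fdr_p]

for source, value in zip(logworth, fdr_logworth):
    print(f"{source:26s} {value:.3f}")
```

Applied to the six LogWorth values, the adjustment gives FDR LogWorth values that agree with Table 12 to within rounding, including the shared value for the two interactions involving the GS model, which results from the monotonicity step of the procedure.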

Table 12. LogWorth, FDR LogWorth and FDR p-values of main effects and interactions

in the generalized linear model.

SOURCE                    LOGWORTH   FDR LOGWORTH   FDR P-VALUE
Heritability              13.703     12.925         0.00000
Heritability x Training   8.375      7.898          0.00000
Heritability x GS Model   3.394      3.154          0.00070
GS Model x Training       3.330      3.154          0.00070
GS Model                  1.463      1.384          0.04129
Training                  1.141      1.141          0.07221

The Effects Summary is consistent with the Effect Tests (Table 13), which show

that the main effects heritability and GS model and the interaction effects Heritability x GS model,

Heritability x Training and GS model x Training are significant at the 0.05 level.

Table 13. Effects Test of main effects and interactions in the generalized linear model.

SOURCE                    DF   L-R CHI SQUARE   FDR P-VALUE
Heritability              1    58.551518        <.0001*
GS Model                  3    8.6445336        0.0344*
Training                  1    3.2321473        0.0722
Heritability x GS Model   3    18.181572        0.0004*
Heritability x Training   1    34.51959         <.0001*
GS Model x Training       3    17.871884        0.0005*

*Significant at 0.05 level


Prediction Profiles and Application to Breeding Programs

This study was able to simulate prediction accuracy computed from varying

heritabilities, training population sizes and genomic selection models. It should be noted

that the prediction profiles generated by this study apply only to the dataset within the

scope of this study. However, the general trends may be similar in other studies.

Individual breeding programs should perform similar exploratory studies on the usefulness

of genomic selection because heritability of predicted traits will vary among breeding

programs.

Figure 24 shows a contour map predicting genomic selection accuracy based on

heritability and size of training population. The general trend is that

genomic selection accuracy increases with heritability and training population size.

Figure 24. Contour map showing general trend of relationship between genomic

selection accuracy, heritability and training population size.


Figure 25 shows the prediction profiles of selected combinations of heritability,

genomic selection model and training population size. Genomic selection is most accurate

for high heritability traits, which agrees with the works of Asoro et al. (2011), Heffner et

al. (2009), Lorenz et al. (2011) and Lorenzana and Bernardo (2009). One reason lies at the

very beginning of any genomic selection effort, which is phenotyping. Data used in training

the genomic selection model is generated from phenotyping. A robust phenotyping system

is crucial in the successful implementation of any predictive breeding activity.

The prediction profiles in Figure 25 suggest a number of points for the dataset used

in this study. RR-BLUP and BayesRR are most suitable for either high heritability or large

training population size. In genomic selection, high heritability will not need a very large

proportion of training population as reflected in the prediction profiles. Low heritability

can be compensated by larger size of training population. However, this would entail

additional resources for the breeding program. The optimization of genomic selection in a

bigger picture therefore goes into the operational dimensions of running a breeding

program.

Future optimizations using this GLM prediction approach should include a trait

with a heritability that is midpoint of the high and low heritabilities, as well as midpoint

value for training population size. Marker density can be optimized if the breeding program

uses flexible marker platforms. It is recommended to use a feature similar to JMP’s Design

of Experiment (DOE) to create full factorials with center points. Center points are critical

in determining the curvature of main effects as well as in establishing hidden replication.


Figure 25. Prediction profiles of selected combinations of variables in the genomic

selection model: (A) Low heritability and small training population size, (B)

Low heritability and large training population size, (C) High heritability and

large training population size, and (D) High heritability and small training

population size.


Integrating Genomic Selection into Hybrid Rice Breeding Programs

Any hybrid rice breeding program that intends to use genomic selection needs to

assess the impact of the technology on program resources and stakeholders who will use,

implement and benefit from the technology. Genomic selection replaces the cost of

phenotyping a plot with the cost of fingerprinting the line, which usually translates to cost

reduction. This section of the manuscript proposes a genomic selection component of a

hypothetical breeding program.

Assumptions on the Hypothetical Breeding Program

Baseline information about a breeding program to which genomic selection is to be

introduced needs to be generated. Figure 26 illustrates a basic hybrid breeding program

that utilizes reciprocal recurrent selection as described by Bertran and Hallauer (1996).

Figure 26. A scheme for hybrid breeding program using a reciprocal recurrent selection

that creates 10,000 new inbreds and 10,000 new hybrids every breeding cycle.


The breeding program being described generates 10,000 new inbreds (5,000 from

each heterotic pool) and by testcrossing these inbreds to a tester, 10,000 testcross hybrids

are produced for field trialing in multiple locations. Hybrid performance is the passport

data that determines whether a new inbred is “advanced” or used in the next breeding cycle. Table

14 summarizes the operational assumptions of the breeding program.

Table 14. Operational considerations of a hypothetical hybrid rice breeding program using

DH as a means of rapid inbred production.

YEAR/SEASON   STAGE                               NO. OF ROWS OR PLOTS     NO. OF LOCATIONS
1A            Breeding crosses                    100 rows                 1
1B            F1                                  100 rows                 1
2A-2B         DH inbreds (marker pre-screening)   30,000 rows              1
3A            Seed production                     10,000 rows (inbreds)    1
                                                  1,000 rows (testers)     1
3B            Field trialing                      10,000 plots (hybrids)   2
                                                  500 plots (checks)       2
Total rows = 41,200
Total plots = 10,500

The breeding program being discussed aims to test 10,000 hybrids in two locations,

bringing the total number of plots to more than 20,000, assuming the breeder uses a single

replication in an augmented randomized complete block design. Augmented designs allow

adjustments of phenotypes by removing the effects of spatial trends, thereby increasing the

detection of the genetic signal (i.e. yield) as demonstrated by Gilmour et al. (1997).
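The totals in Table 14 and the number of plots across locations can be checked with a short Python sketch. The figures are those of the hypothetical breeding program in Table 14; the variable names are illustrative only.

```python
# (season, stage, count, unit, number of locations), taken from Table 14.
stages = [
    ("1A", "Breeding crosses", 100, "rows", 1),
    ("1B", "F1", 100, "rows", 1),
    ("2A-2B", "DH inbreds (marker pre-screening)", 30_000, "rows", 1),
    ("3A", "Seed production (inbreds)", 10_000, "rows", 1),
    ("3A", "Seed production (testers)", 1_000, "rows", 1),
    ("3B", "Field trialing (hybrids)", 10_000, "plots", 2),
    ("3B", "Field trialing (checks)", 500, "plots", 2),
]

total_rows = sum(count for _, _, count, unit, _ in stages if unit == "rows")
total_plots = sum(count for _, _, count, unit, _ in stages if unit == "plots")
# Each trial plot is grown at every location assigned to its stage.
plots_across_locations = sum(
    count * locations for _, _, count, unit, locations in stages if unit == "plots"
)

print(total_rows, total_plots, plots_across_locations)  # 41200 10500 21000
```

The 10,500 hybrid and check plots grown in two locations give 21,000 plots in total, which is the figure of more than 20,000 plots referred to in the text.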


Rationale on Increasing the Effectiveness of Breeding Programs

Rice production has been a major source of income of many Filipino farmers. The

average rice yield in the Philippines is 3.8 metric tons per hectare but could be as low as

2.0 t/ha in some farming villages. Planting of hybrids would give a yield of 4.0 t/ha to as

high as 9.0 t/ha. Genetic gain in hybrid rice breeding programs directly contributes to

increased rice production through the products released to the market, and to larger

acceptance of hybrid technology by farmers.

Objectives of the Project Being Proposed

Through a research management perspective, the integration of genomic selection

into an existing breeding program is presented here as a project proposal with the general

objective of increasing rate of genetic gain which will serve the larger outcome of

delivering high-yielding products to farmers. The specific objectives are as follows:

1. Identify the stakeholders who will be impacted by the project and list their

concerns and issues.

2. Understand the causes and effects of low rate of genetic gain in breeding

programs through a problem analysis using a problem tree diagram, and conduct

an objective analysis in response to the identified problem.

3. Design the project on the hypothetical breeding program using the logical

framework.

4. Discuss the management arrangements from planning and implementation to

post-implementation of the project in the hypothetical breeding program.


Stakeholder Analysis

A breeding program has a range of stakeholders that will be impacted by its

success or failure, ranging from the business owner to the end users of its products, who

are the farmers and consumers. The stakeholder map is presented in Table 15.

Farmers and consumers are served by the outcome of breeding programs which is

to increase crop yields. The breeding program described produces hybrid rice, generally

proven to have an average yield increase of 15% over inbred varieties (Virmani, 1999). For

a farmer getting an average yield of 5,000 kg/ha using inbreds, the yield advantage will

translate to an additional income of PHP 30,000.00 per hectare per year for paddy rice sold

at farm gate prices.

The Department of Agriculture’s (DA) concerns lie in the attainment of rice self-sufficiency

for the country. With the country currently 90-95% self-sufficient because of unstable seasons

(typhoons, drought), any significant increase in average rice yields will dramatically

contribute to self-sufficiency. The DA’s scope however starts after product development

as far as hybrid rice business of a private company is concerned. However, the DA can

dictate the criteria on varietal accreditation and recommendation for release, which can

influence the target product profiles of breeding programs.

The DA has an extensive network involved in disseminating new agricultural

technologies. The Hybrid Rice Commercialization Program (HRCP) has been leading the

implementation of strategies for the adoption of hybrid rice seeds since 1998. Within the

DA, the Rice Varietal Improvement Group (RVIG) recommends varieties to be released to

the National Seed Industry Council (NSIC) after a series of multi-location yield trials.


Table 15. Stakeholder map identifying concerns of various entities that will be impacted

by the project.

PARTICIPANTS: Farmers
  PROBLEMS: Current yield levels not sufficient to attain acceptable living conditions.
  EXPECTATIONS: High yield from high quality rice that will fetch a high price.
  WEAKNESSES: Cannot address their problem apart from agronomic management.
  POTENTIALS: Can adopt and accept new varieties.
  CONSEQUENCES FOR A PROJECT: Increased genetic gain will translate to increased yield in farmers’ fields.

PARTICIPANTS: DA
  PROBLEMS: Philippines is not yet fully self-sufficient in rice.
  EXPECTATIONS: High yield from rice farms that will cover rice requirement of the country.
  WEAKNESSES: Limited to recommending existing accredited varieties and agronomic practices.
  POTENTIALS: Well-established network that can bring products to farmers.
  CONSEQUENCES FOR A PROJECT: Increased yield from farmers’ fields will result in attainment of rice self-sufficiency.

PARTICIPANTS: Breeders
  PROBLEMS: Budget constraints of breeding programs limit genetic gain.
  EXPECTATIONS: Increased success rate of products advanced.
  WEAKNESSES: Lack of training in modern quantitative genetics and genomics.
  POTENTIALS: Expertise in development of new varieties; familiarity with germplasm.
  CONSEQUENCES FOR A PROJECT: Increased efficiency of breeding programs will result in increased rate of genetic gain.

PARTICIPANTS: Trialists
  PROBLEMS: Establishment of thousands of trial plots is a complex job.
  EXPECTATIONS: Reduction of trial plots.
  POTENTIALS: Expertise in establishment and management of trials.
  CONSEQUENCES FOR A PROJECT: Increased efficiency of breeding programs will result in reduction of trial plots.

PARTICIPANTS: Sales and marketing
  PROBLEMS: Difficult to sell products that do not differ significantly from competitors.
  EXPECTATIONS: Significant yield advantage over competing products.
  POTENTIALS: Well-established channels for selling and marketing products.
  CONSEQUENCES FOR A PROJECT: Significant yield differentiator from competitors will make products easier to sell.

PARTICIPANTS: Business owners/shareholders
  PROBLEMS: Investment in R&D is costly.
  EXPECTATIONS: Products that can deliver profit.
  POTENTIALS: Can increase R&D funding in response to successful products.
  CONSEQUENCES FOR A PROJECT: More efficient breeding programs free funds for other R&D investments and projects.

PARTICIPANTS: Environmentalists
  PROBLEMS: Demand for increased food supply results in increasing land for cultivation and loss of biodiversity.
  EXPECTATIONS: Reduce conversion of forests into farms.
  CONSEQUENCES FOR A PROJECT: Increased yield can help reduce conversion of forests to farmlands.


The breeders, trialists, sales and marketing, and business owner/shareholders are

part of the internal private company setting. Among these, the breeders and trialists are

members of the breeding organization directly implementing the breeding program. The

roles of the breeders and trialists are discussed under the section on management

arrangements. Breeders and trialists will have a stake in the increased efficiency of the

breeding program that would result in reduced use of resources while attaining a significantly

higher rate of genetic gain.

A higher rate of genetic gain will eventually result in products that can be

differentiated from the competitors in terms of the target product profile. This will allow

the sales and marketing function to market and sell these products easily. Business owners

and shareholders will always welcome the revenue from selling these products, which will

be channeled back to the breeding program as research funding. With the increased

efficiency of the breeding program, a significant part of the budget previously used for trial

plots can now be used for other activities such as disease resistance screening.

An increase in crop yield can help prevent the conversion of forest land to farmland and thus preserve biodiversity, an issue considered very important by environmentalists. Hence, a high rate of genetic gain provides a cushion that allows the needed increase in yield to be met through genetics instead of by clearing forests.

Problem Analysis

Increasing the rate of genetic gain is the primary goal of any breeding program, as discussed under Chapter 1 (Introduction). A low rate of genetic gain was identified as the starter problem (Fig. 27), which is caused by several factors and in turn causes a range of issues. Factors contributing to low genetic gain can be classified into two categories: those that are inherent in the genetic gain equation (phenotyping accuracy, phenotypic standard deviation, heritability, cost and breeding cycle time) and human capacity factors.

The genetic gain equation was reviewed in Chapter 2 (Review of Literature) under

the section “Increasing Genetic Gain.” Some of the most common causes of low genetic

gain are substandard field trialing resulting from poor experimental design and poor choice

of locations, and lack of genetic variability. A robust experimental design will increase the

power of field trials while a good correlation of locations and the target population of

environments will increase phenotyping accuracy. Breeding cycle time in the breeding

program has been addressed with the implementation of DH technology, and will be further

enhanced with genomic selection. Phenotyping cost can also be reduced by genomic

selection by substituting yield plots with cheaper DNA fingerprinting and predicting

breeding values as discussed in this manuscript.
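The way these factors interact can be illustrated with a small calculation. The sketch below assumes the common textbook form of the breeder's equation, R = i × h² × σP / L (expected response per year); the exact form reviewed in Chapter 2 may differ, and the function name and all input values are hypothetical, chosen only to show the effect of a shorter breeding cycle.

```python
def genetic_gain_per_year(intensity, heritability, phenotypic_sd, cycle_years):
    """Expected response to selection per year, R = i * h^2 * sigma_P / L.

    This is the textbook breeder's equation, used here as an assumption.
    """
    if cycle_years <= 0:
        raise ValueError("cycle_years must be positive")
    if not 0.0 <= heritability <= 1.0:
        raise ValueError("heritability must lie between 0 and 1")
    return intensity * heritability * phenotypic_sd / cycle_years


# Hypothetical inputs: selection intensity, heritability, and the
# phenotypic standard deviation of yield in kg/ha.
baseline = genetic_gain_per_year(intensity=1.4, heritability=0.3,
                                 phenotypic_sd=800.0, cycle_years=5)
shorter = genetic_gain_per_year(intensity=1.4, heritability=0.3,
                                phenotypic_sd=800.0, cycle_years=4)

print(f"5-year cycle: {baseline:.1f} kg/ha per year")
print(f"4-year cycle: {shorter:.1f} kg/ha per year")
```

With all other terms held constant, shortening the cycle from five to four years raises the expected gain per year in proportion, which is why breeding cycle time appears in the list of factors above.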

The second category is human capacity. Plant breeders and other scientists run research programs, but most of them are not well trained in managing research. Atlin (2013) stated that human capacity has the largest potential to contribute to genetic gain in any breeding program, adding that plant breeders need to be more like engineers: they need mechanization, computer programming, and higher-level quantitative skills on top of their expertise in genetics and agronomy. The problem tree describes the unwillingness of researchers to adopt newer approaches to plant breeding and the insufficient use of new technologies, both of which stem from a lack of understanding of these new approaches. From a research management perspective, these causes will be addressed in a proposed integration of genomic selection. The focal points of the proposal are the logistical considerations in running a basic genomic-enhanced breeding program and upgrading the skills of the research team that will implement the project.

The effects of a low rate of genetic gain were very evident in the 1950s, when grain yields were too low to avert an impending worldwide famine. The Philippines had a constant rice production of 3.7 million tons annually in the 1950s (FAO, 2011), and the annual yield increase from paddy fields was not sufficient to meet the demands of the growing population. The establishment of the International Rice Research Institute in 1960 effectively institutionalized rice breeding on a global scale, and genetic gain in rice in the form of annual yield increases from new varieties has been dramatic. The Philippines attained 7.7 million tons of annual rice production in 1980 (FAO, 2011).

The effects of the starter problem were identified, with low yield in farmers’ fields as the ultimate impact. Within the breeding organization, the effects include increased cost of field trials and creation of inferior parental lines. These in turn will result in inferior hybrid products and high seed prices due to high cost of goods (COGs), finally resulting in low market share and non-acceptance of hybrid products by farmers.

The proposal targets the following causes of the starter problem:

1. Insufficient understanding of new breeding approaches (i.e. genomic selection), insufficient use of technology and unwillingness of researchers to adopt new technologies will be addressed by genomic selection proof-of-concept experiments and technical training.

2. High cost of nurseries and long breeding cycle time will be addressed by the merits of the technology itself (genomic selection).

Figure 27. Problem tree diagram showing some causes and effects of low rate of genetic gain in breeding programs.

Project Planning Matrix

The project planning matrix is based on a multi-year implementation of the project and eventual scale-up in farmers’ fields. The general assumption is that adoption of varieties is cyclical, as varieties are replaced with newer ones (Fig. 28). The breeding program releases one product every year as a result of advancement decisions from internal trialing efforts. Products typically have a life cycle of 5-7 years from initial release to retirement, as shown in Figure 28. On-farm techno-demos (OFTDs) are started as soon as hybrid products are released, as a showcase to farmers, coupled with marketing and promotion activities such as harvest festivals. Yield trends over the years can be monitored within internal trialing and OFTDs as different varieties are advanced and planted. Scale-up into farmers’ fields can be monitored as early as the second year from the release of the first product.

Figure 28. Life cycle of products from a breeding program that releases one new hybrid

every year. Monitoring of objectively verifiable indicators as described in the

project planning matrix is shown by the arrows.
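The release pattern described above implies a steady-state portfolio whose size equals the product life cycle. The following sketch makes that arithmetic explicit under simplifying assumptions that are not stated in the text: exactly one release per year starting in year 1, and a product released in year r being sold through year r + life cycle - 1. The function name is hypothetical.

```python
def products_on_market(year, life_cycle_years, first_release_year=1):
    """Release years of the products still sold in `year`.

    Assumes one release per year and a fixed life cycle for every product.
    """
    if year < first_release_year:
        return []
    earliest = max(first_release_year, year - life_cycle_years + 1)
    return list(range(earliest, year + 1))


for year in (1, 3, 5, 8):
    active = products_on_market(year, life_cycle_years=5)
    print(f"Year {year}: {len(active)} product(s) on the market, "
          f"released in years {active}")
```

Under these assumptions a five-year life cycle levels off at five concurrent products and a seven-year life cycle at seven, which bounds the number of hybrids that OFTDs and farmer surveys need to follow in any one season.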


Table 16. Project planning matrix on increasing genetic gain of breeding programs by integrating genomic selection.

PROJECT STRATEGY | OBJECTIVELY VERIFIABLE INDICATORS | METHOD OF VERIFICATION | IMPORTANT ASSUMPTIONS

Goal: To increase yields in farmers’ fields. The project will directly address genetic gain in a breeding program but will aim to increase rice production in farmers’ fields in a larger scheme of things. | Yield trend in farmers’ fields planted with varieties produced by the breeding program at scale-up. Yield trend in pre-selected fields planted with varieties produced by the breeding program as on-farm demos. Yield trend in experimental yield trials conducted internally. | Random surveys of farmers who planted the hybrid varieties produced by the breeding program. Monitoring of on-farm demos and yields recorded from harvest festivals. Monitoring of average location yields. Records of purchase of farmers kept by seed distributors and disclosed to seed producers. | Farmers follow recommended cultural practices. Implementation of agreed protocols. Implementation of agreed trial protocols.

Immediate Objective 1: Conduct experiments that will serve as proofs of concept on the benefits of genomic selection. | Correlation of predicted breeding values and empirical breeding values. | Statistical analysis of results obtained from experiments. |

Immediate Objective 2: Provide training sessions to breeders on genomic selection and associated fields such as quantitative genetics and statistics. | Training plan and course outlines. | Training plan document. | Availability of expert resource persons.

Immediate Objective 3: Effectively monitor yield trends in internal trials, OFTDs and farmers’ fields. | Yield in kg/ha obtained every year. | Trial and OFTD data and farmer surveys. | Farmers provide accurate yield figures.

Output 1.1. Assessment of accuracy of genomic predictions. | Scientific report that evaluates prediction accuracy of genomic selection. | One scientific report. |

Output 1.2. Assessment of resources saved by implementing genomic selection. | Feasibility report that highlights phenotyping cost saved. | One feasibility report. |

Output 2.1. Introductory training provided to all researchers. | Course outline on genomic selection and attendance records. | One course report. |

Output 2.2. Advanced training provided to researchers identified as subject matter experts. | Course outline on advanced genomic selection topics and attendance records. Goals added to performance management of individuals identified as subject matter experts. | One course report. Revised goals of target individuals. |

Output 3.1. Year over year yield trend of average yields from internal trials. | Measurement of average yield from experimental entries in field trials. | Advancement decision reports. |

Output 3.2. Year over year yield trend of average yields from OFTDs. | Measurement of harvest per hectare. | Record of number of bags per hectare. |

Output 3.3. Year over year yield trend of average yields from farmers’ fields. | Farmer interviews regarding previous cropping season. | Farmer estimates of yield from previous cropping season. | Farmers are fairly accurate in their estimates.

Activity 1.1. Initiate and conduct genomic selection validation experiments. | Science plans and activity progress reports. | Science plans approved and progress reports submitted monthly. |

Activity 1.2. Compare cost of project with and without genomic selection. | Cash flow with and without project. | Cash flow document. |

Activity 2.1. Conduct training on introduction to genomic selection for breeders. | Number of breeders trained. | Attendance records. |

Activity 2.2.1. Conduct advanced training on identified breeders. | Advanced training conducted. | Attendance records. |

Activity 2.2.2. Identify breeders for advanced training as subject matter experts. | Subject matter experts identified with performance goals added. | Performance management document. |

Activity 3.1. Conduct internal yield trials in the usual manner. | Trial info available in breeding database. | Retrieve trial info and data from breeding database. |

Activity 3.2. Conduct OFTDs in target markets. | Harvest festivals held in OFTDs. | Activity reports. |

Activity 3.3. Identify farmers who planted product from breeding program and conduct survey. | Farmer interviews held. | Interview questionnaires and farmer responses. |
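The indicator for Immediate Objective 1, the correlation of predicted and empirical breeding values, is commonly computed as a Pearson correlation. The sketch below is a minimal illustration of that calculation, assuming the Pearson coefficient is the intended statistic; the function name and the breeding values are hypothetical and are not data from this study.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical breeding values (t/ha) for five lines in a validation set.
predicted = [5.1, 4.8, 6.0, 5.5, 4.2]
empirical = [5.0, 4.9, 5.8, 5.9, 4.0]

accuracy = pearson_correlation(predicted, empirical)
print(f"Prediction accuracy (r) = {accuracy:.2f}")
```

In a validation experiment, the predicted values would come from the genomic selection model and the empirical values from field trials of the same lines.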

Implementation Schedule

The breeding program takes about three years from the initial breeding cross to produce hybrids for extensive testing, which takes another two years. The program is at steady state, releasing one hybrid product per year. Integration of genomic selection can reduce the breeding cycle to at most four years and reduce cost by at least 30%.

The implementation schedule is summarized in the Gantt chart in Table 17. Proof

of concept by conducting validation of genomic selection will take two years, while hybrids

coming out of genomic selection activities will take another four years. In this duration,

internal trialing, OFTDs and farmer surveys can be conducted on products not derived from

genomic selection.


Table 17. Project Gantt chart showing milestones in project implementation.

ACTIVITY | YEAR 1 | YEAR 2 | YEAR 3 | YEAR 4 | YEAR 5 | YEAR 6 | YEAR 7 | YEAR 8 | YEAR 9 | YEAR 10

Activity 1.1. Initiate and conduct genomic selection validation experiments.

Activity 1.2. Compare cost of project with and without genomic selection.

Activity 2.1. Conduct training on introduction to genomic selection for breeders.

Activity 2.2.1. Conduct advanced training on identified breeders.

Activity 2.2.2. Identify breeders for advanced training as subject matter experts.

Activity 3.1. Conduct internal yield trials in the usual manner.

Activity 3.2. Conduct OFTDs in target markets.

Activity 3.3. Identify farmers who planted product from breeding program and conduct survey.

Management Arrangements

The project will utilize the existing organizational structure (Fig. 29). The Senior Breeder drives the overall direction of the program. A number of senior staff work with the Senior Breeder to deliver the goals of the breeding program, each focusing on a specific key aspect of the program.

Figure 29. Organization structure of the hypothetical breeding program in which genomic selection is to be applied in the project proposal.

The matrix in Table 18 summarizes the work relationships between the breeders

and the service providers. The breeders will essentially implement the breeding strategies

and coordinate activities with breeding services (DH lab, genotyping, nurseries), data

management, and trialing team.

Table 18. Service provided to breeders by breeding program support staff.

SUPPORT STAFF | SUPPORT PROVIDED TO BREEDERS

Breeding Services Manager | Coordinates the operations in the breeding center and ensures optimal efficiency on use of resources. Coordinates the logistics among DH laboratory, genotyping team, and nursery services team.

DH Lab Supervisor | This is a critical role that ensures sufficient DH lines are produced from breeding crosses identified by breeders and handed back to the breeders.

Genotyping Supervisor | Coordinates collection of leaf tissues for DNA analysis. Facilitates shipment of samples to genotyping facilities, including liaising with the Plant Quarantine Service of BPI for necessary clearances.

Nurseries Supervisor | Coordinates all logistics related to establishment of breeding nurseries, including assigning of manpower to various field activities. All field operations are handled by the team under the Nurseries Supervisor, including hybrid seed production.

Database Manager | Maintains phenotype and genotype data and delivers formatted field books. Also in charge of germplasm and seed storage, and compliance to material tracking such as bar codes for plots and seed packets.

Trialing Manager | Coordinates multi-location trials and hands over compiled data to the Senior Breeder for analysis. Organizes hybrid advancement meetings.

Trialing Supervisor | Establishes and maintains local trials and provides updates to breeders. In charge of field operations in trials including data gathering.


Implementation of genomic selection will reduce the nursery plots handled by the Nurseries Supervisor. Trial plots handled by the Trialing Supervisors will also be reduced. This may result in higher-quality trials that give higher estimates of heritability and higher-quality data. Genomic selection requires a significant increase in genotyping activities. Almost all DH lines will be subjected to genomic selection, and the breeding program will generate a record number of marker data points. This will increase the workload of the Genotyping Supervisor. To implement the project, Breeders will coordinate with the Genotyping Supervisor for timely fingerprinting of DH lines as well as timely delivery of genotyping data. Breeders will also coordinate with the Trialing team in the establishment and maintenance of hybrid trials derived from the training population.

The management arrangements suggested here place the Senior Breeder as the overall lead in strategic and operational decisions about the breeding program, highlighting the precedence of the mission over the system. In some organizations, managers who are in charge of product development teams are not scientists or breeders, which biases decision-making toward whatever uses the least resources, i.e. a bias toward the system.

Budgetary Requirements

The project is expected to reduce the research cost of phenotyping but increase the cost of genotyping. The net effect of these changes is expected to be an overall reduction in resource utilization, as shown in Table 19. The cost of producing DH lines is held constant and not included in the comparison. All nursery costs presented already include proportional labor cost.


Table 19. Cost comparison between breeding programs with and without genomic selection.

STAGE | WITHOUT GENOMIC SELECTION: Rows/Plots | WITHOUT GENOMIC SELECTION: Cost (PHP) | WITH GENOMIC SELECTION: Rows/Plots | WITH GENOMIC SELECTION: Cost (PHP)

Crossing | 100 | 23,300.00 | 100 | 23,300.00

F1 | 60 | 5,288.00 | 60 | 5,288.00

DH nursery | 10,000 | 480,000.00 | 10,000 | 480,000.00

Genotyping | | | 10,000 | 2,250,000.00

Testcrossing | 10,000 | 216,000.00 | 5,000 | 108,000.00

Trialing (2 locs) | 21,000 | 13,440,000.00 | 10,500 | 6,720,000.00

Total lines evaluated (effective) | 10,000 | | 10,000 |

Total hybrids evaluated (effective) | 10,000 | | 10,000 |

Total cost | | 14,164,588.00 | | 9,586,588.00

Testcrossing includes all operational aspects of producing experimental hybrid seeds, such as labor and isolation barriers. Trialing is the most expensive component of a breeding program. The cost presented includes all operating expenses in maintaining trials, as well as the cost associated with meticulous fertilizer application, harvesting and other phenotyping procedures. Overall, genomic selection saves the breeding program about 32% of its budget.
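The 32% figure can be checked directly from the stage costs in Table 19. The short calculation below transcribes those costs; the assignment of each cost to a stage follows the table layout as read here and should be treated as an assumption, and the variable names are illustrative.

```python
# Stage costs in PHP, transcribed from Table 19.
cost_without_gs = {
    "Crossing": 23_300.00,
    "F1": 5_288.00,
    "DH nursery": 480_000.00,
    "Testcrossing": 216_000.00,
    "Trialing (2 locs)": 13_440_000.00,
}
cost_with_gs = {
    "Crossing": 23_300.00,
    "F1": 5_288.00,
    "DH nursery": 480_000.00,
    "Genotyping": 2_250_000.00,
    "Testcrossing": 108_000.00,
    "Trialing (2 locs)": 6_720_000.00,
}

total_without = sum(cost_without_gs.values())
total_with = sum(cost_with_gs.values())
savings = total_without - total_with
savings_pct = 100 * savings / total_without

print(f"Total without genomic selection: PHP {total_without:,.2f}")
print(f"Total with genomic selection:    PHP {total_with:,.2f}")
print(f"Savings: PHP {savings:,.2f} ({savings_pct:.1f}%)")
```

The added genotyping cost is more than offset by halving the testcrossing and trialing costs, which is where the net saving comes from.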

Table 20 presents the summarized project cost per activity. Activities 1.1 through 2.2.1 are focused on genomic selection proof of concept and training for breeding program staff. Activities 3.1 through 3.3 reflect the activities routinely done by the breeding program and with the commercial team (OFTDs, farmer surveys). Genomic selection is projected to be fully implemented in the sixth year of the project, after the two-year creation of the proof of concept and the four-year generation of new inbreds and hybrids. The cost reduction in the breeding program will therefore take effect in the sixth year, as indicated in the table. The project does not require investment in new equipment. It will utilize the existing infrastructure, logistical and management arrangements. Genotyping will be outsourced to a reputable company that provides reliable DNA fingerprinting services.

Table 20. Projected budget of integrating genomic selection over a ten-year period.

ACTIVITY | COST ENTRIES (YEAR 1 TO YEAR 10)

Activity 1.1. Initiate and conduct genomic selection validation experiments. | 210,000.00; 210,000.00

Activity 1.2. Compare cost of project with and without genomic selection. | 55,000.00; 55,000.00; 55,000.00; 55,000.00; 55,000.00; 55,000.00

Activity 2.1. Conduct training on introduction to genomic selection for breeders. | 512,000.00

Activity 2.2.1. Conduct advanced training on identified breeders. | 128,000.00

Activity 2.2.2. Identify breeders for advanced training as subject matter experts. |

Activity 3.1. Conduct internal yield trials in the usual manner. | 14,164,588.00; 14,164,588.00; 14,164,588.00; 14,164,588.00; 14,164,588.00; 9,586,588.00; 9,586,588.00; 9,586,588.00; 9,586,588.00; 9,586,588.00

Activity 3.2. Conduct OFTDs in target markets. | 350,000.00; 350,000.00; 350,000.00; 350,000.00; 350,000.00; 350,000.00; 350,000.00; 350,000.00; 350,000.00; 350,000.00

Activity 3.3. Identify farmers who planted product from breeding program and conduct survey. | 155,000.00; 155,000.00; 155,000.00; 155,000.00; 155,000.00; 155,000.00; 155,000.00; 155,000.00; 155,000.00; 155,000.00
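The entries in Table 20 can be totaled to compare the project-specific costs with the reduction in trial costs. The sketch below is simple arithmetic on the table values; the ten-year totals and the net figure are derived here and are not stated in the text. It ignores discounting and assumes that, without the project, internal trials would have continued at the Year 1 cost for all ten years.

```python
# Cost entries per activity, transcribed from Table 20
# (one value per budgeted year; year placement does not affect the sums).
project_costs = {
    "1.1 validation experiments": [210_000.00] * 2,
    "1.2 cost comparison": [55_000.00] * 6,
    "2.1 introductory training": [512_000.00],
    "2.2.1 advanced training": [128_000.00],
}
trial_costs = [14_164_588.00] * 5 + [9_586_588.00] * 5  # Activity 3.1
oftd_costs = [350_000.00] * 10                          # Activity 3.2
survey_costs = [155_000.00] * 10                        # Activity 3.3

project_total = sum(sum(values) for values in project_costs.values())
ten_year_total = (project_total + sum(trial_costs)
                  + sum(oftd_costs) + sum(survey_costs))

# Assumed baseline: trials stay at the Year 1 cost for all ten years.
trial_savings = 14_164_588.00 * 10 - sum(trial_costs)
net_savings = trial_savings - project_total

print(f"Project-specific costs (Activities 1.1 to 2.2.1): {project_total:,.2f}")
print(f"Ten-year total budget:                            {ten_year_total:,.2f}")
print(f"Reduction in trial costs, Years 6 to 10:          {trial_savings:,.2f}")
print(f"Net of project-specific costs:                    {net_savings:,.2f}")
```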


Figure 30 illustrates a breeding program with a fully implemented genomic

selection scheme. Half of the inbreds are not testcrossed, reducing the cost of hybrid

production and field trialing.

Figure 30. A breeding scheme with full integration of genomic selection showing the proportions of tested and predicted inbred GCAs.

Recommendations for Inbred Rice

Breeding Programs

Most public research programs on rice deal with developing inbred varieties. Inbred breeding in rice has exhibited tremendous success since the release of IR8. Inbred varieties such as IR36, IR64, IR72, Ciherang, Swarna and many others have been classified as mega-varieties because of their extreme popularity with farmers, resulting in coverage of millions of hectares in rice-producing regions worldwide (Jackson et al., 2014).


Genomic selection in inbred rice has some differences from genomic selection in

hybrids. The first rice genomic selection research (Spindel et al., 2015) was done on

inbreds using per se yield performance. In contrast, yield in hybrid breeding is selected

based on inbred GCA which can be deduced from hybrid performance. Inbred breeding

does not undergo the testcross stage to select for GCA, but inbreds per se are evaluated for

yield and other traits. GEBVs in genomic selection in inbred breeding programs are

therefore obtained from inbred per se performance.

Mature hybrid breeding programs make use of heterotic pools. Breeding crosses

are strictly created within defined heterotic pools and such practice usually eliminates

population structure within pools. In inbred breeding, a collection of lines for evaluation

may come from different breeding crosses derived from various sources representing rice

sub-populations, e.g. indica x tropical japonica. It is therefore more common in inbred

breeding programs to have population structure, which must be accounted for in the

prediction model as discussed in this manuscript.

Large amounts of phenotypic data are available from various testing programs from

the past several years. These datasets can be used in conjunction with genotype data to

initiate genomic selection. Top inbreds may be used in breeding, with the resulting inbred

progenies subjected to prediction of breeding values, with a sufficient proportion tested in

the field to validate the prediction model.

Figure 31 outlines breeding schemes for inbreds with and without genomic

selection. Since testcrossing is not required in inbred breeding, R&D cost is generally less

than that of hybrid breeding.


Figure 31. Breeding schemes for inbred rice development with and without genomic

selection. Genomic selection can drastically reduce trial plots. In these

schemes, testcrossing is not required.


CHAPTER 5

SUMMARY AND CONCLUSION

This study is one of the first genomic selection works in rice and possibly the first

genomic selection application in hybrid rice. This study has successfully shown the genetic

and operational merits of genomic selection. The study was able to accomplish the stated

objectives.

Usefulness of Genomic Selection

Whole genome markers have been demonstrated to be useful in predicting combining ability in hybrid rice. Contributions of parental lines to yield, days to flowering and plant height can be predicted with accuracies comparable to published reports. This study agrees with most research works that genomic selection is generally more accurate for traits with high heritability. As a corollary, increasing the heritability of a trait by implementing more robust phenotyping methods can increase prediction accuracy. Training population size also influences prediction accuracy. The general trend is that a larger training population increases prediction accuracy, which agrees with published reports.

Population structure can confound prediction accuracy as shown in this study and

in other works in other crop species and livestock. Population structure can be included in

the prediction model but for this study, the most practical and relevant approach to

population structure was to predict within subpopulations. Predicting within


subpopulations is similar to predicting within heterotic pools in hybrid breeding programs.

Prediction accuracy increased when predicting within subpopulations.

It is recommended that inbred breeding programs consider population structure, for example indica, japonica, tropical japonica and admixtures in rice, because inbred breeding usually does not work within subpopulations, unlike hybrid breeding programs, which work within heterotic pools. The best method to incorporate population structure is to use the eigenvector matrix as a fixed term in the prediction model.
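One common way to write such a model, offered as an illustration and not necessarily the exact model fitted in this study, is a ridge-regression (RR-BLUP-type) mixed model in which the leading eigenvectors of the marker data enter as fixed covariates. All symbols below are introduced here for illustration.

```latex
\[
\mathbf{y} = \mathbf{1}\mu + \mathbf{P}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e},
\qquad
\mathbf{u} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_u),
\qquad
\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_e)
\]
```

Here y is the vector of phenotypes for n lines, μ is the overall mean, P is the n × k matrix of the first k eigenvectors (principal components) of the marker data with fixed effects β, Z is the n × m matrix of marker scores with random marker effects u, and e is the residual. The genomic estimated breeding value of a line is then obtained from its row of Z multiplied by the estimated marker effects, so that differences between subpopulations are absorbed by the fixed term instead of being attributed to markers.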

Optimizing Genomic Selection Procedures

A general linear model was used to create prediction profiles to predict genomic

selection accuracy with different values of heritability, training population size and

genomic selection model, obtained from a full factorial design. RR-BLUP and BayesRR

are most suitable for either high heritability or large training population size. Prediction of

highly heritable traits will not need a very large training population as reflected in the

prediction profiles. Predicting traits with low heritability can be done more accurately by

employing larger training population size. A breeding program based on these 122 inbreds

can therefore utilize the prediction profile created to incorporate genomic selection into the

breeding process.

It is recommended that breeding programs explore genomic selection in the manner outlined in this work. There are numerous recommendations from previous works that take each effect individually. This work, however, used a generalized linear model to consider the main effects and interaction effects in a quantitative manner.


Implementing Genomic Selection through a

Research Management Approach

A strategy for introducing genomic selection into an existing breeding program was presented from a research management perspective using a project proposal approach. A hybrid breeding program enhanced by genomic selection can effectively evaluate the full range of desired lines and testcross hybrids by predicting the phenotype of a significant portion of the population based on the observed phenotype of the populations actually planted in the field and the fingerprint information. This principle was shown here to save about 32% of a research organization’s budget, which can be used in other aspects of breeding such as disease screening.

Implementing genomic selection as a new breeding procedure requires

consideration of the various stakeholders and the impact of the proposed changes. These

changes need to be constantly communicated, tested and implemented in the target program

for these to be effective.


LITERATURE CITED

APPLEGATE, J.L. 2002. Engaged Graduate Education: Seeing with New Eyes. Preparing

Future Faculty Occas. Pap. 9. Assoc. of Am. Colleges and Universities and Council

of Graduate Schools, Washington, DC.

ASORO, F.G., M. A. NEWELL, W.D. BEAVIS, M.P. SCOTT and J.L. JANNINK. 2011.

Accuracy and Training Population Design for Genomic Selection on Quantitative

Traits in Elite North American Oats. The Plant Genome 4(2): 132-144.

ATLIN, G. 2013. Applying the Breeding Technology Revolution to the Acceleration of

Genetic Gains for Major Food Crops in the Developing World. Paper W734, Plant

and Animal Genome XXI. January 11 - 16, 2013, San Diego, CA.

BARTLETT, M. S. 1937. Properties of Sufficiency and Statistical Tests. Proceedings of

the Royal Society A: Mathematical, Physical and Engineering Sciences 160 (901):

268.

BASAVARAJ, S. H., V. K. SINGH, A. SINGH, A. SINGH, A. SINGH, D. ANAND and

S. YADAV. 2010. Marker-assisted improvement of bacterial blight resistance in

parental lines of Pusa RH10, a superfine grain aromatic rice hybrid. Molecular

Breeding 26: 293-305.

BATES, D., M. MAECHLER, B. BOLKER and S. WALKER. 2015. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67(1): 1-48. doi:10.18637/jss.v067.i01.

BEAVIS, W.D. 1994. The power and deceit of QTL experiments: lessons from comparative

QTL studies. In: Wilkinson DB (ed). Proceedings of the 49th Annual Corn and

Sorghum Research Conference. Washington, DC: American Seed Trade

Association, 250–65.

BEAVIS, W.D. 1998. QTL analyses: Power, precision, and accuracy. p. 145–162. In A.H.

Patterson (ed.) Molecular dissection of complex traits. CRC Press, Boca Raton, FL.

BECKMANN, J.S., and M. SOLLER. 1986. Restriction fragment length polymorphisms

in plant genetic improvement. Oxford Surv. Plant Mol. Cell Biol. 3:196–250.

BERNARDO, R. 1995. Genetic models for predicting maize single cross performance in

unbalanced yield trial data. Crop Sci 35:141–147.

BERNARDO, R. 1996a. Best linear unbiased prediction of maize single-cross

performance. Crop Sci 36:50–56.

BERNARDO, R. 1996b. Best linear unbiased prediction of the performance of crosses

between untested maize inbreds. Crop Sci 36:872–876.


BERNARDO, R. 2010. Breeding for Quantitative Traits in Plants. 2nd ed. Stemma Press,

Woodbury, MN. (ISBN 978‐0‐9720724‐1‐0).

BERNARDO, R., and J. YU. 2007. Prospects for genomewide selection for quantitative

traits in maize. Crop Sci. 47:1082–1090.

BERTRAN, F.J. and A.R. HALLAUER. 1996. Hybrid improvement after reciprocal

recurrent selection in BSSS and BSCB1 maize populations. Maydica 41:360–367.

BLISS, F. 2006. Plant Breeding in the US Private Sector. Horticultural Science 41 (1): 45-47.

BREIMAN, L. 2001. Random Forests. Machine Learning 45 (1): 5–32.

BRESEGHELLO, F., and M.E. SORRELLS. 2006. Association mapping of kernel size

and milling quality in wheat (Triticum aestivum L.) cultivars. Genetics 172:1165–

1177.

COLLARD, B.C.Y., C.M. VERA CRUZ, K.L. MCNALLY, P.S. VIRK, and D.J.

MACKILL. 2008. Rice Molecular Breeding Laboratories in the Genomics Era:

Current Status and Future Considerations. International Journal of Plant Genomics,

vol. 2008.

COLLARD, B.C.Y., M.Z.Z. JAHUFER, J.B. BROUWER and E.C.K. PANG. 2005. An

introduction to markers, quantitative trait loci (QTL) mapping and marker-assisted

selection for crop improvement: The basic concepts. Euphytica 142: 169–196.

COMSTOCK, R.E., H.F. ROBINSON and P.H. HARVEY. 1949. A breeding procedure

designed to make maximum use of both general and specific combining ability.

Agron J 41:360–367.

CROSBIE, T.M., S.R. EATHINGTON, G.R. JOHNSON, M. EDWARDS, R. REITER

and S. STARK. 2003. Plant breeding: Past, present, and future. p. 1–50. In K.R.

Lamkey and M. Lee (ed.) Plant Breeding: The Arnel R. Hallauer Int. Symp.,

Mexico City. 17–23 Aug. 2003. Blackwell, Oxford, UK.

CROSSA, J., G. D. L. CAMPOS, P. PEREZ, D. GIANOLA, J. BURGUEÑO, J.L.

ARAUS, D. MAKUMBI, R.P. SINGH, S. DREISIGACKER, J. YAN, V. ARIEF,

M. BANZIGER and H.J. BRAUN, 2010. Prediction of genetic values of

quantitative traits in plant breeding using pedigree and molecular markers.

Genetics, 186(2):713-24.

DE LOS CAMPOS G., H. NAYA, D. GIANOLA, J. CROSSA, A. LEGARRA, E.

MANFREDI, K. WEIGEL and J. COTES. 2009. Predicting quantitative traits with

regression models for dense molecular markers and pedigree. Genetics 182: 375-

385.


DE LOS CAMPOS, G. and P. PEREZ. 2013. BGLR: Bayesian Generalized Regression R package, version 1.0. URL: https://r-forge.r-project.org/projects/bglr/.

DEKKERS, J.C.M., and F. HOSPITAL. 2002. The use of molecular genetics in the

improvement of agricultural populations. Nat. Rev. Genet. 3:22–32.

ENDELMAN, J.B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255. doi:10.3835/plantgenome2011.08.0024

FALCONER, D.S. 1960. Introduction to Quantitative Genetics. Oliver and Boyd.

Edinburgh, United Kingdom.

FAMOSO, A.N., K. ZHAO, R.T. CLARK, C.W. TUNG, M.H. WRIGHT, C. BUSTAMANTE, L.V. KOCHIAN and S.R. MCCOUCH. 2011. Genetic Architecture of Aluminum Tolerance in Rice (Oryza sativa) Determined through Genome-Wide Association Analysis and QTL Mapping. PLoS Genet 7(8): e1002221. doi:10.1371/journal.pgen.1002221.

FAO. 2009. How to feed the world in 2050. http://www.fao.org. Accessed 02 Nov. 2013.

FAO. 2011. Rice paddies. FAO Fisheries and Agriculture. http://www.fao.org. Accessed 2

May 2016.

FERNANDO, R.L. 2007. Genomic selection. Acta Agric. Scand. Ser. Anim. Sci. 57:192–

195.

FERNANDO, R.L. 2009. Genomic Selection: Bayesian Methods. Available at

http://www.ans.iastate.edu/stud/courses/short/2009/B-Day2-3.pdf (verified 8 Nov.

2013). Iowa State University.

FISHER, R.A. 1918. The correlation between relatives on the supposition of Mendelian

inheritance. Transactions of the Royal Society of Edinburgh 52: 399–433.

FISHER, R.A. 1930. The genetical theory of natural selection. Oxford, England: Clarendon

Press. 272 pp.

FOX, P.N. and A.A. ROSIELLE. 1982. Reducing the influence of environmental main-

effects on pattern analysis of plant breeding environments. Euphytica 31:645–656.

GALBRAITH, J.R. 1971. Matrix Organization Designs: How to combine functional and

project forms. In: Business Horizons, February 1971, 29-40.

GEPTS, P. and J. HANCOCK. 2006. The Future of Plant Breeding. Crop Science 46: 1630-

1634.

GIANOLA, D. and J.B. VAN KAAM. 2008. Reproducing kernel Hilbert spaces regression

methods for genomic assisted prediction of quantitative traits. Genetics 178: 2289–

2303.

GILMOUR, A. R., B. R. CULLIS and A.P. VERBYLA. 1997. Accounting for Natural and

Extraneous Variation in the Analysis of Field Experiments. Journal of Agricultural,

Biological, and Environmental Statistics, 2(3), 269–293.

GILMOUR, A.R. 2010. Why use BLUPs? An introduction to fixed and random effects for

plant breeders. CIMMYT Seminar Series, 17 August 2010.

GOMEZ, K.A. and A.A. GOMEZ. 1984. Statistical procedures for agricultural research

(2nd ed.). John Wiley and Sons, New York. 680 p.

GONZALEZ-RECIO O., K.A. WEIGEL, D. GIANOLA, H. NAYA and G.J.M. ROSA.

2010. L2-Boosting algorithm applied to high-dimensional problems in genomic

selection. Genetics Research 92 (3): 227-37.

GOULDEN, C.H. 1939. Problems in plant selection. In Proceedings of the Seventh

International Genetics Congress. Cambridge University Press, pp. 132-133.

GRENIER, C., T.V. CAO, Y. OSPINA, C. QUINTERO, M. H. CHÂTEL, J. TOHME, B.

COURTOIS and N. AHMADI. 2015. Accuracy of Genomic Selection in a Rice

Synthetic Population Developed for Recurrent Selection Breeding. PLoS ONE

10(8): e0136594. doi:10.1371/journal.pone.0136594.

HABIER, D., R.L. FERNANDO and J.C.M. DEKKERS. 2007. The impact of genetic

relationship information on genome-assisted breeding values. Genetics 177: 2389-

2397.

HAYES, B. 2007. QTL mapping, MAS, and genomic selection. Available at

http://www.ans.iastate.edu/section/abg/shortcourse/notes.pdf (verified 8 Nov.

2013). Animal Breeding & Genetics, Dep. of Animal Science, Iowa State Univ.,

Ames.

HAYES, B., P.J. BOWMAN, A. C. CHAMBERLAIN, K. VERBYLA and M. E.

GODDARD. 2009. Accuracy of genomic breeding values in multi-breed dairy

cattle populations. Genetics Selection Evolution 41:51. DOI: 10.1186/1297-9686-

41-51.

HEFFNER, E.L., M.E. SORRELLS, and J.L. JANNINK. 2009. Genomic Selection for

Crop Improvement. Crop Sci. 49:1–12.

HENDERSON, C.R. 1949. Estimation of changes in herd environment. J Dairy Sci. 32:

706.

HENDERSON, C.R. 1950. Estimation of genetic parameters. Ann Math Stat. 21: 309-310.

HENDERSON, C.R. 1963. Selection index and expected genetic advance. In Statistical

Genetics and Plant Breeding 141-163. NAS-NRC 982, Washington, DC.

HENDERSON, C.R. 1973. Sire evaluation and genetic trends. In Proceedings of the

Animal Breeding and Genetics Symposium in Honour of Dr. Jay L. Lush 10-41.

ASAS and ADSA, Champaign, Ill.

HENDERSON, C.R., O. KEMPTHORNE, S.R. SEARLE, and C.M. VON KROSIGK.

1959. The Estimation of environmental and genetic trends from records subject to

culling. Biometrics 15: 192–218.

HICKEY, J.M., S. DREISIGACKER, J. CROSSA, S. HEARNE, R. BABU, B.M.

PRASANNA, M. GRONDONA, A.S. ZAMBELLI, V.S. WINDHAUSEN, K.

MATHEWS and G. GORJANC. 2014. Evaluation of Genomic Selection Training

Population Designs and Genotyping Strategies in Plant Breeding Programs Using

Simulation. Crop Sci. 54:1476–1488. doi: 10.2135/cropsci2013.03.0195.

HILL, R.R. and J.L. ROSENBERGER. 1985. Methods for combining data from

germplasm evaluation trials. Crop Sci 25:467-470.

HORNER, T.W. and K.J. FREY. 1957. Methods for determining natural areas for oat

varietal recommendations. Agron J 49: 313–315.

IKEHASHI, H. and D. HILLERISLAMBERS. 1977. Single Seed Descent with the Use of

Rapid Generation Advance. Paper presented at the International Rice Research

Conference, 18-22 April 1977. Los Baños, Laguna, Philippines.

INTERNATIONAL RICE RESEARCH INSTITUTE. 1980. Standard evaluation system

for rice. IRRI: Los Baños, Philippines.

JACKSON, M.T., B.V. FORD-LLOYD and M.L. PARRY. 2014. Plant Genetic Resources

and Climate Change. CAB International.

JANNINK, J.L., A.J. LORENZ and H. IWATA. 2010. Genomic selection in plant

breeding: from theory to practice. Briefings in Functional Genomics 9: 166-177.

JANSEN, R., 1993. Interval mapping of multiple quantitative trait loci. Genetics 135: 205–

211.

JMP®. Online Documentation. SAS Institute Inc., Cary, NC, 1989-2015.

KINDALL, H.W. & D. PIMENTEL. 1994. Constraints on the Expansion of the Global

Food Supply. Ambio. 23 (3).

KOTTER, J.P. 2012. Leading Change. Boston: Harvard Business School Press.

KRAAKMAN A.T.W., R.E. NIKS, P.M. VAN DEN BERG, P. STAM and F.A. VAN

EEUWIJK. 2004. Linkage disequilibrium mapping of yield and yield stability in

modern spring barley cultivars. Genetics 168: 435–446.

LANDE, R. and R. THOMPSON. 1990. Efficiency of marker-assisted selection in the

improvement of quantitative traits. Genetics 124: 743–56.

LEGARRA, A.S., C. ROBERT-GRANIE and P. CROISEAU. 2011. Improved Lasso for

genomic selection. Genet. Res., Camb., 93, pp. 77–87.

LI, X., W. YAN, H. AGRAMA, L. JIA, A. JACKSON, K. MOLDENHAUER, K.

YEATER, A. MCCLUNG and D. WU. 2012. Unraveling the Complex Trait of

Harvest Index with Association Mapping in Rice (Oryza sativa L.). PLoS ONE

7(1): e29350. doi:10.1371/journal.pone.0029350.

LIU, B., 1998. Statistical Genomics: Linkage, Mapping and QTL Analysis CRC Press,

Boca Raton.

LORENZ, A.J., S. CHAO, F.G. ASORO, E.L. HEFFNER, T. HAYASHI, H. IWATA,

K.P. SMITH, M.E. SORRELLS, and J.L. JANNINK. 2011. Genomic Selection in

Plant Breeding: Knowledge and Prospects. Advances in Agronomy, Volume 110:

77-123.

LORENZANA, R. and R. BERNARDO. 2009. Accuracy of genotypic value predictions

for marker-based selection in biparental plant populations. Theor. Appl. Genet.

120: 151-161.

LYNCH, M. and B. WALSH. 1998. Genetics and Analysis of Quantitative Traits. Sinauer

Associates. Sunderland, MA, USA.

MACKILL, D.J. 2007. Molecular Markers and Marker-Assisted Selection in Rice. In

“Genomics-Assisted Crop Improvement Vol 2: Genomics Applications in Crops”

by R. K. Varshney and R. Tuberosa (eds.). Springer. Dordrecht, The Netherlands

pp 147-168.

MALUSZYNSKI, M., K.J. KASHA, B.P. FORSTER and I. SZAREJKO (eds.). 2003.

Doubled Haploid Production in Crop Plants: A Manual. Kluwer Academic

Publishers, Dordrecht, The Netherlands.

MCCOUCH, S.R. and R.W. DOERGE. 1995. QTL mapping in rice. Trends Genet 11:

482–487.

MELCHINGER, A.E., H.F. UTZ and C.C. SCHON. 1998. Quantitative trait locus (QTL)

mapping using different testers and independent population samples in maize

reveals low power of QTL detection and large bias in estimates of QTL effects.

Genetics 149: 383–403.

MEUWISSEN, T.H.E., B.J. HAYES, and M.E. GODDARD. 2001. Prediction of total

genetic value using genome-wide dense marker maps. Genetics 157:1819–1829.

MIEDANER T., T. WÜRSCHUM, H.P. MAURER, V. KORZUN, E. EBMEYER and J.C.

REIF. 2011. Association mapping for Fusarium head blight resistance in European

soft winter wheat. Molecular Breeding Volume 28, Issue 4, pp 647-655.

MOHAN, M., S. NAIR, A. BHAGWAT, T.G. KRISHNA, M. YANO, C.R. BHATIA and

T. SASAKI, 1997. Genome mapping, molecular markers and marker-assisted

selection in crop plants. Mol Breed 3: 87–103.

MORRIS, G.P., P. RAMU, S.P. DESHPANDE, C.T. HASH, T. SHAH, H.D.

UPADHYAYA, O. RIERA-LIZARAZU, P.J. BROWN, C.B. ACHARYA, S.E.

MITCHELL, J. HARRIMAN, J.C. GLAUBITZ, E.S. BUCKLER and

S. KRESOVICH. 2013. Population genomic and genome-wide association studies

of agroclimatic traits in sorghum. PNAS 110: 453-458.

MOSER, G., B. TIER, R.R. CRUMP, M.S. KHATKAR, and H.W. RAADSMA. 2009. A

comparison of five methods to predict genomic breeding values of dairy bulls from

genome-wide SNP markers. Genet. Sel. Evol. 41, 56.

NAS, T.M.S., C.S. CASAL, Jr., Z. LI and S.S. VIRMANI. 2000. Application of Molecular

Markers for Identification of Restorers. Rice Genetics Newsletter, Vol. 20.

International Rice Research Institute.

NAS, T.M.S., D.L. SANCHEZ, G.Q. DIAZ, M.S. MENDIORO, and S.S. VIRMANI.

2005. Pyramiding of thermosensitive genetic male sterility (TGMS) genes and

identification of a candidate tms5 gene in rice. Euphytica 145: 67-75.

NEVES, H.H.R., R. CARVALHEIRO and S.A. QUEIRO. A comparison of statistical

methods for genomic selection in a mice population

PATERSON, A.H. 1996. Making genetic maps. In: A.H. Paterson (Ed.), Genome Mapping

in Plants, pp. 23–39. R. G. Landes Company, San Diego, California; Academic

Press, Austin, Texas.

PATTERSON, H.D. and R. THOMPSON. 1971. Recovery of Inter-Block Information

when Block Sizes are Unequal. Biometrika, Vol. 58, No. 3, pp. 545-554.

PHILIPPINE RICE RESEARCH INSTITUTE. Training on Grain Quality Evaluation.

May 9-10, 2012.

PIEPHO, H.P., J. MOHRING, A.E. MELCHINGER, and A. BUCHSE. 2007. BLUP for

phenotypic selection in plant breeding and variety testing. Euphytica 161(1-2): 209-228.

R DEVELOPMENT CORE TEAM. 2015. R: A language and environment for statistical

computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-

900051-07-0, URL http://www.R-project.org/.

RAFALSKI J.A. 2002. Novel genetic mapping tools in plants: SNPs and LD-based

approaches. Plant Sci. 162: 329–33.

REPINSKI, S.L., K.N. HAYES, J.K. MILLER, C.J. TREXLER and F.A. BLISS. 2011.

Plant Breeding Graduate Education: Opinions about Critical Knowledge,

Experience and Skill Requirements from Public and Private Stakeholders

Worldwide. Crop Science 51: 2325-2336.

ROBINSON, G.K. 1991. That BLUP Is a Good Thing: The Estimation of Random Effects.

Statistical Science, Vol. 6, No. 1, 15-61.

SATTARI, M., A. KATHIRESAN, G.B. GREGORIO, J.E. HERNANDEZ, T.M.S. NAS

and S.S. VIRMANI. 2007. Development and use of a two-gene marker-aided

selection system for fertility restorer genes in rice. Euphytica 153: 35-42.

SCHAEFFER, L.R. 2006. Strategy for applying genome-wide selection in dairy cattle. J.

Anim. Breed. Genet. 123 (4): 218–223.

SEARLE, S.R., G. CASELLA, and C.E. MCCULLOCH. 2006. Variance components.

John Wiley & Sons, Hoboken, NJ.

SEPTININGSIH, E.M., A.M. PAMPLONA, D.L. SANCHEZ, C.N. NEERAJA, G.V.

VERGARA, S. HEUER, A.M. ISMAIL, and D.J. MACKILL. 2009. Development

of submergence-tolerant rice cultivars: the Sub1 locus and beyond. Ann. Bot. 103

(2): 151-160.

SINGH, V. K., A. SINGH, S.P. SINGH, R.K. ELLUR, V. CHOUDHARY, S. SARKEL,

S. DEVINDER, S.G. KRISHNANA, M. NAGARAJAN, K.K. VINOD, U.D.

SINGH, R. RATHORE, S.K. PRASHANTHI, P.K. AGRAWAL, J.C. BHATT, T.

MOHAPATRA, K.V. PRABHU and A.K. SINGH. 2012. Incorporation of blast

resistance into “PRR78”, an elite Basmati rice restorer line, through marker assisted

backcross breeding. Field Crops Research, 128, 8-16.

SPEARMAN, C. 1904. The proof and measurement of association between two things.

American Journal of Psychology 15: 72–101. doi:10.2307/1412159.

SPINDEL J., H. BEGUM, D. AKDEMIR, P. VIRK, B. COLLARD, E. REDOÑA, G.

ATLIN, J.L. JANNINK and S.R. MCCOUCH. 2015. Genomic Selection and

Association Mapping in Rice (Oryza sativa): Effect of Trait Genetic Architecture,

Training Population Composition, Marker Number and Statistical Model on

Accuracy of Rice Genomic Selection in Elite, Tropical Rice Breeding Lines. PLoS

Genet 11(6): e1005350. doi: 10.1371/journal.pgen.1005350.

SPRAGUE, G. F. and L.A. TATUM. 1942. General versus specific combining ability in

single crosses of corn. J. Amer. Soc. Agron. 34: 923-32.

STORLIE, E. and G. CHARMET. 2013. Genomic Selection Accuracy using Historical

Data Generated in a Wheat Breeding Program. The Plant Genome, 6(1):1-9.

TANKSLEY, S.D. and C.M. RICK. 1980. Isozymic gene linkage map of the tomato:

Applications in genetics and breeding. Theoretical and Applied Genetics 58(2):

161-170.

TANKSLEY, S.D., 1993. Mapping polygenes. Annu Rev Genet 27: 205–233.

TEICH, A.H. 1984. Heritability of grain yield, plant height and test weight of a population

of winter wheat adapted to Southwestern Ontario. Theor. Appl. Genet. 68(1-2): 21-23.

TUKEY, J. 1949. Comparing Individual Means in the Analysis of Variance. Biometrics 5

(2): 99–114.

VAN-ARENDONK, J., B. TIER, and B.P. KINGHORN. 1994. Use of Multiple Genetic

Markers in Prediction of Breeding Values. Genetics, 137(1), 319–329.

VANRADEN, P. M. 2008. Efficient methods to compute genomic predictions. Journal of

dairy science, 91(11):4414.

VANRADEN, P.M., C.P. VAN TASSELL, G.R. WIGGANS, T.S. SONSTEGARD, R.D.

SCHNABEL, J.F. TAYLOR and F.S. SCHENKEL. 2009. Invited review:

Reliability of genomic predictions for North American Holstein Bulls. J. Dairy Sci.

92: 16-24.

VIRK, P.S., FORD-LLOYD, B.V., JACKSON, M.T., POONI, H.S., CLEMENO, T.P. and

NEWBURY, H.J. 1996. Predicting quantitative variation within rice germplasm

using molecular markers. Heredity 76: 296–304.

VIRMANI, S.S. 1999. Exploitation of heterosis for shifting the yield frontier in rice. p.

423-438 in J.G. Coors and S. Pandey (eds.) The genetics and exploitation of heterosis

in crops. Am. Soc. Agron., Crop Sci. Soc. Am., Madison, Wisconsin.

VOLLMANN, J., H. BUERSTMAYR and P. RUCKENBAUER. 1996. Efficient Control

of Spatial Variation in Yield Trials Using Neighbour Plot Residuals. Experimental

Agriculture, 32, pp 185-197.

WHITTAKER, J.C., R. THOMPSON and M.C. DENHAM. 2000. Marker-assisted selection

using ridge regression. Genet. Res. 75:249–252.

XU, S. 2003. Theoretical basis of the Beavis effect. Genetics 165(4): 2259-2268.

XU, Y. and J.H. CROUCH. 2008. Marker-assisted selection in plant breeding: from

publications to practice. Crop Sci. 48: 391–407.

ZHAI, W., W.G. WANG, Y.I. ZHOU, X. LI, X. ZHENG, Q. ZHANG, G. WANG and L.

ZHU. 2002. Breeding bacterial blight-resistant hybrid rice with the cloned bacterial

blight resistance gene Xa21. Molecular Breeding Vol. 8: 285-293.

ZHAO, K., TUNG, C.W., EIZENGA, G.C., WRIGHT, M.H., ALI, M.L., PRICE, A.H.,

NORTON, G.J., ISLAM, M.R., REYNOLDS, A., MEZEY, J., MCCLUNG, A.M.,

BUSTAMANTE, C.D. and MCCOUCH, S.R. 2011. Genome-wide association

mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nat.

Commun 2:467. doi: 10.1038/ncomms1467.

APPENDIX A. Sample script for deriving BLUPs and GCAs implemented in R.

©Mark Nas

Script provided as reference to future students working on genomic selection in crops.

Please cite this manuscript when using this script. Send email to [email protected] for

questions.

library(lme4)
setwd("C:/Mark's Briefcase/Syngenta Thesis/Final analysis")
phenodata <- read.csv("phenodataset.csv", header=TRUE)

#Fit BLUP models and extract hybrid BLUPs and combining abilities
#male and female parents are included as random terms to obtain GCAs

#yield BLUPs and GCAs
yldblupmodel <- lmer(yield~(1|location)+(1|season)+
  (1|location:season)+(1|rep)+(1|genotype)+(1|male)+(1|female)+
  (1|female:male)+(1|genotype:location)+(1|genotype:season)+
  (1|genotype:location:season), data=phenodata)
yldblupsumm <- summary(yldblupmodel) #variance components
capture.output(yldblupsumm, file="yldblupmodel.txt")
yldr <- ranef(yldblupmodel)
yieldblup <- yldr$genotype #(Hybrid yield BLUPs)
write.csv(yieldblup, file="yieldblup.csv")
yldfgca <- yldr$female #(Yield GCAs of female parents)
write.csv(yldfgca, file="yldGCA_female.csv")
yldmgca <- yldr$male #(Yield GCAs of male parents)
write.csv(yldmgca, file="yldGCA_male.csv")

#days to flowering BLUPs and GCAs
#the response column name dtf is an assumption; the random terms are
#assumed to be the same as in the yield model
dtfblupmodel <- lmer(dtf~(1|location)+(1|season)+
  (1|location:season)+(1|rep)+(1|genotype)+(1|male)+(1|female)+
  (1|female:male)+(1|genotype:location)+(1|genotype:season)+
  (1|genotype:location:season), data=phenodata)
dtfblupsumm <- summary(dtfblupmodel) #variance components
capture.output(dtfblupsumm, file="dtfblupmodel.txt")
dtfr <- ranef(dtfblupmodel)
dtfblup <- dtfr$genotype #(Hybrid DTF BLUP)
write.csv(dtfblup, file="dtfblup.csv")
dtffgca <- dtfr$female #(DTF GCAs of female parents)
write.csv(dtffgca, file="dtfGCA_female.csv")
dtfmgca <- dtfr$male #(DTF GCAs of male parents)
write.csv(dtfmgca, file="dtfGCA_male.csv")

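#plant height BLUPs and GCAs
#the response column name pltht is an assumption; the random terms are
#assumed to be the same as in the yield model
plthtblupmodel <- lmer(pltht~(1|location)+(1|season)+
  (1|location:season)+(1|rep)+(1|genotype)+(1|male)+(1|female)+
  (1|female:male)+(1|genotype:location)+(1|genotype:season)+
  (1|genotype:location:season), data=phenodata)
plthtblupsumm <- summary(plthtblupmodel) #variance components
capture.output(plthtblupsumm, file="plthtblupmodel.txt")
plthtr <- ranef(plthtblupmodel)
plthtblup <- plthtr$genotype #(Hybrid plant height BLUPs)
write.csv(plthtblup, file="plthtblup.csv")
plthtfgca <- plthtr$female #(Plant height GCAs of female parents)
write.csv(plthtfgca, file="plthtGCA_female.csv")
plthtmgca <- plthtr$male #(Plant height GCAs of male parents)
write.csv(plthtmgca, file="plthtGCA_male.csv")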
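The write.csv() calls above store the parent names as unnamed row names and the BLUP column under the header "(Intercept)". The following is a minimal sketch, in base R only, of how such a table could be given explicit column names before it is read by a prediction script. The helper tidy_ranef, the toy values and the output file name are illustrative assumptions and are not part of the original script.

```r
# Hypothetical helper: convert a ranef()-style data frame (row names hold the
# level names, a single "(Intercept)" column holds the BLUPs) into a
# two-column table with explicit headers.
tidy_ranef <- function(re, id_name = "line", value_name = "gca") {
  out <- data.frame(rownames(re), re[[1]], stringsAsFactors = FALSE)
  names(out) <- c(id_name, value_name)
  out
}

# Toy stand-in for yldr$female; the values are invented for illustration.
toy_gca <- data.frame(c(0.42, -0.13, -0.29),
                      row.names = c("F001", "F002", "F003"))
names(toy_gca) <- "(Intercept)"

yld_gca <- tidy_ranef(toy_gca, id_name = "line", value_name = "yldgca")
out_file <- file.path(tempdir(), "yldGCA_female_tidy.csv")
write.csv(yld_gca, file = out_file, row.names = FALSE)
print(read.csv(out_file))
```

Writing with row.names = FALSE keeps the file at exactly two columns, which matches a two-name assignment such as names(x) <- c('line', 'yldgca') in a downstream script.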

APPENDIX B. Sample script for predicting phenotypes using RR-BLUP implemented

in R.

©Mark Nas, Nonoy Bandillo



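Script provided as reference to future students working on genomic selection in crops.

Please cite this manuscript when using this script. Send email to [email protected] for

questions.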

#Ridge regression BLUP
library(rrBLUP)

phenoyldgca <- read.csv('yldcga.csv') #for yield
names(phenoyldgca) <- c('line', 'yldgca')
load(file='genoImputed.rda') #load your marker matrix in {-1,0,1}

#GBLUP using all lines
G <- A.mat(genoImputed) #calculate additive relationship matrix
gblupyld <- kin.blup(data=phenoyldgca, geno='line', pheno='yldgca', K=G)
gbyldGEBV <- gblupyld$g

#shuffle the lines once so that the folds are random
trainsubset <- dim(genoImputed)[1]
set.seed(30109)
xvalgblup <- sample(1:trainsubset, trainsubset)
yldShuff <- phenoyldgca[xvalgblup, ]
gShuff <- G[xvalgblup,xvalgblup]

# Set a 10 fold CV.
count <- 1:12 #first fold; 10 folds of 12 lines cover 120 lines in total
corVec.gblup <- numeric(10)
tf.gblup <- matrix(NA,nrow=trainsubset,ncol=1)
for(i in 1:10)
{
  yldTrain <- yldShuff
  yldTrain[count, 2] <- NA #mask the phenotypes of the validation fold
  gValidate <- kin.blup(data=yldTrain, geno='line', pheno='yldgca',
    K=gShuff)$g[count]
  corVec.gblup[i] <- cor(gValidate, yldShuff[count, 2])
  count <- count+12 #move to the next fold
  print(corVec.gblup[i])
}
gblupyldcorrTenfold <- mean(corVec.gblup) #mean correlation
capture.output(gblupyldcorrTenfold, file="subpop1.RRBLUPyldcorr.txt")
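The script above expects genoImputed to be a lines-by-markers matrix coded {-1, 0, 1} with no missing calls. The following is a minimal sketch, assuming raw genotypes are stored as counts of one allele (0, 1, 2) with NA for missing calls, of how such a matrix could be recoded and mean-imputed in base R. The function recode_markers and the toy matrix are hypothetical, and the imputation actually used for the thesis data may differ.

```r
# Hypothetical helper: recode 0/1/2 allele counts to -1/0/1 and replace
# missing calls with the marker (column) mean.
recode_markers <- function(counts) {
  m <- counts - 1
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    if (any(miss)) m[miss, j] <- mean(m[, j], na.rm = TRUE)
  }
  m
}

# Toy matrix: 4 lines (rows) by 3 markers (columns), filled column by column.
toy_counts <- matrix(c(0, 2, 1, NA,
                       2, 2, 0, 1,
                       1, NA, 0, 0),
                     nrow = 4,
                     dimnames = list(paste0("L", 1:4), paste0("M", 1:3)))
genoToy <- recode_markers(toy_counts)
print(genoToy)
```

Mean imputation produces fractional codes between -1 and 1. A marker with no observed calls at all would stay missing under this sketch and should be dropped beforehand.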


APPENDIX C. Sample script for predicting phenotypes using Bayesian Ridge

Regression implemented in R.




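Script provided as reference to future students working on genomic selection in crops.

Please cite this manuscript when using this script. Send email to [email protected] for

questions.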

#Bayesian Ridge Regression, 10-fold cross-validation
library(BGLR)
library(rrBLUP) #needed for A.mat and kin.blup below

phenoyldgca <- read.csv('yldphenogca.csv') #yield
names(phenoyldgca) <- c('parent', 'yldgca')
load(file='genoImputed.rda')

G <- A.mat(genoImputed) #additive relationship matrix, as in Appendix B
gblupyld <- kin.blup(data=phenoyldgca, geno='parent', pheno='yldgca',
  K=G)

set.seed(30109)
trainsubset <- dim(genoImputed)[1]
xvalBayesRR <- sample(1:trainsubset, trainsubset)
yldShuff <- phenoyldgca[xvalBayesRR, ]
snpShuff <- genoImputed[xvalBayesRR, ]
Gshuff <- G[xvalBayesRR,xvalBayesRR]
count <- 1:12 #first fold; 10 folds of 12 lines cover 120 lines in total
corVec.brr <- numeric(10)
tf.brr <- matrix(NA,nrow=trainsubset,ncol=1)
ETA <- list(list(X=snpShuff, model='BRR', probIn=.10))
for(i in 1:10)
{
  yldTrain <- yldShuff
  yldTrain[count, 2] <- NA #mask the phenotypes of the validation fold
  modelBRR <- BGLR(y=yldTrain[,2], ETA=ETA, burnIn = 1000, nIter=2000,
    verbose=FALSE)
  BRRGebvs <- modelBRR$yHat[count]
  corVec.brr[i] <- cor(BRRGebvs, yldShuff[count, 2])
  tf.brr[count,] <- BRRGebvs
  count <- count+12 #move to the next fold
  print(corVec.brr[i])
}
BRRyldcorrTenfold <- mean(corVec.brr) #mean correlation
capture.output(BRRyldcorrTenfold, file = "BRRyldcorr10fold.txt")
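The cross-validation loops in these appendices step through the shuffled lines in blocks of 12 (count <- 1:12, then count + 12), which visits every line exactly once only when there are 120 lines. The following is a minimal sketch, for an arbitrary number of lines n, of how fold membership could be generated instead. The helper make_folds is hypothetical and is not part of the original scripts.

```r
# Hypothetical helper: assign n shuffled row indices to k folds of nearly
# equal size.
make_folds <- function(n, k = 10, seed = 30109) {
  set.seed(seed)
  shuffled <- sample(seq_len(n))
  # rep_len cycles 1..k, so fold sizes differ by at most one
  split(shuffled, rep_len(seq_len(k), n))
}

folds <- make_folds(n = 125, k = 10)
print(lengths(folds))

# Inside a loop, fold i would be masked before model fitting, for example:
#   yldTrain <- pheno
#   yldTrain[folds[[i]], 2] <- NA
```

With this layout the phenotype table does not need to be reordered, because folds[[i]] already holds the row numbers of the validation lines in fold i.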


APPENDIX D. Sample script for predicting phenotypes using Bayesian CPi implemented

in R.




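Script provided as reference to future students working on genomic selection in crops.

Please cite this manuscript when using this script. Send email to [email protected] for

questions.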

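#Bayesian CPi, 10-fold cross validation
library(BGLR)
library(rrBLUP) #needed for A.mat and kin.blup below

#data loading is assumed to be the same as in Appendix C
phenoyldgca <- read.csv('yldphenogca.csv') #yield
names(phenoyldgca) <- c('parent', 'yldgca')
load(file='genoImputed.rda')

G <- A.mat(genoImputed) #additive relationship matrix, as in Appendix B
gblupyld <- kin.blup(data=phenoyldgca, geno='parent', pheno='yldgca',
  K=G)

set.seed(30109)
trainsubset <- dim(genoImputed)[1]
xvalBayesCpi <- sample(1:trainsubset, trainsubset)
yldShuff <- phenoyldgca[xvalBayesCpi, ]
snpShuff <- genoImputed[xvalBayesCpi, ]
Gshuff <- G[xvalBayesCpi,xvalBayesCpi]
count <- 1:12 #first fold; 10 folds of 12 lines cover 120 lines in total
corVec.Cpi <- numeric(10)
tf.Cpi <- matrix(NA,nrow=trainsubset,ncol=1)
ETA <- list(list(X=snpShuff, model='BayesC', probIn=.10))
for(i in 1:10)
{
  yldTrain <- yldShuff
  yldTrain[count, 2] <- NA #mask the phenotypes of the validation fold
  modelCpi <- BGLR(y=yldTrain[,2], ETA=ETA, burnIn = 1000, nIter=2000,
    verbose=FALSE)
  CpiGebvs <- modelCpi$yHat[count]
  corVec.Cpi[i] <- cor(CpiGebvs, yldShuff[count, 2])
  tf.Cpi[count,] <- CpiGebvs
  count <- count+12 #move to the next fold
  print(corVec.Cpi[i])
}
CPiyldcorrTenfold <- mean(corVec.Cpi) #mean correlation
capture.output(CPiyldcorrTenfold, file = "CPiyldcorr10fold.txt")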


APPENDIX E. Sample script for predicting phenotypes using Bayesian Lasso

implemented in R.


Script provided as reference to future students of UPLB working on genomic selection in

crops. Please cite this manuscript when using this script. Send email to [email protected]

for questions.

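#Bayesian Lasso, 10-fold cross validation
library(BGLR)
library(rrBLUP) #needed for A.mat and kin.blup below

#data loading is assumed to be the same as in Appendix C
phenoyldgca <- read.csv('yldphenogca.csv') #yield
names(phenoyldgca) <- c('parent', 'yldgca')
load(file='genoImputed.rda')

G <- A.mat(genoImputed) #additive relationship matrix, as in Appendix B
gblupyld <- kin.blup(data=phenoyldgca, geno='parent', pheno='yldgca',
  K=G)

set.seed(30109)
trainsubset <- dim(genoImputed)[1]
xvalBayesLas <- sample(1:trainsubset, trainsubset)
yldShuff <- phenoyldgca[xvalBayesLas, ]
snpShuff <- genoImputed[xvalBayesLas, ]
Gshuff <- G[xvalBayesLas,xvalBayesLas]
count <- 1:12 #first fold; 10 folds of 12 lines cover 120 lines in total
corVec.Las <- numeric(10)
tf.Las <- matrix(NA,nrow=trainsubset,ncol=1)
ETA <- list(MRK=list(X=snpShuff, type="gamma", lambda=10, shape=1.1,
  rate=0.5, model="BL"))
for(i in 1:10)
{
  yldTrain <- yldShuff
  yldTrain[count, 2] <- NA #mask the phenotypes of the validation fold
  modelLas <- BGLR(y=yldTrain[,2], ETA=ETA, burnIn = 1000, nIter=2000,
    verbose=FALSE)
  LasGebvs <- modelLas$yHat[count]
  corVec.Las[i] <- cor(LasGebvs, yldShuff[count, 2])
  tf.Las[count,] <- LasGebvs
  count <- count+12 #move to the next fold
  print(corVec.Las[i])
}
LasyldcorrTenfold <- mean(corVec.Las) #mean correlation
capture.output(LasyldcorrTenfold, file = "Lasyldcorr10fold.txt")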
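Each prediction script saves only the mean of the ten fold correlations. The following is a minimal sketch of how the spread across folds could be reported alongside that mean. The helper summarize_cv is hypothetical, and the correlations are simulated for illustration only; they are not results from this study.

```r
# Hypothetical helper: mean, standard deviation and standard error of the
# per-fold prediction accuracies.
summarize_cv <- function(cor_vec) {
  k <- length(cor_vec)
  c(mean = mean(cor_vec), sd = sd(cor_vec), se = sd(cor_vec) / sqrt(k))
}

# Simulated per-fold correlations standing in for a vector such as corVec.Las.
set.seed(1)
toy_cor <- runif(10, min = 0.2, max = 0.6)
print(round(summarize_cv(toy_cor), 3))
```

The same call could be applied to the fold correlation vector from any of the prediction scripts, so that the models are compared with the same summary.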

