
Photometric Redshift Estimation: An Active Learning Approach

R. Vilalta∗, E. E. O. Ishida†, R. Beck‡, R. Sutrisno∗, R. S. de Souza§, A. Mahabal¶

∗Department of Computer Science, University of Houston
3551 Cullen Blvd, 501 PGH, Houston, Texas 77204-3010, USA
Email: {rvilalta,rasutrisno}@uh.edu

†Laboratoire de Physique Corpusculaire, Université Clermont-Auvergne
4 Avenue Blaise Pascal, TSA 60026, CS 60026, 63178, Aubière Cedex, France
Email: [email protected]

‡Department of Physics of Complex Systems, Eötvös Loránd University
Budapest 1117, Hungary
Email: [email protected]

§Department of Physics & Astronomy, University of North Carolina
Chapel Hill NC, 27599-3255, USA
Email: [email protected]

¶Department of Astronomy, California Institute of Technology
Pasadena CA, 91125, USA
Email: [email protected]

Abstract—A long-lasting problem in astronomy is the accurate estimation of galaxy distances based solely on the information contained in photometric filters. Due to observational selection effects, the spectroscopic (source) sample lacks coverage throughout the feature space (e.g. colors and magnitudes) compared to the photometric (target) sample; this results in a clear mismatch in terms of photometric measurement distributions. We propose a solution to this problem based on active learning, a machine learning technique where a sampling strategy enables us to select the most informative instances to build a predictive model; specifically, we use active learning following a Query by Committee approach. We show that by making wisely selected queries in the target domain, we are able to increase predictive performance significantly; a relatively small number of queries (spectroscopic follow-up measurements) suffices to improve the performance of photometric redshift estimators.

I. INTRODUCTION

Galaxy redshift estimation is critical to probe the evolution of the Universe, as it provides the means to measure distances at cosmic scales via Hubble's law [1]. Precise redshifts can be determined through the high-resolution decomposition of electromagnetic radiation as a function of wavelength, i.e., through spectroscopy, which provides absorption or emission-line features in the raw optical and/or near-infrared spectrum. However, the observational cost of this procedure is prohibitive even for medium-size surveys. Acquiring spectroscopic measurements for all cataloged objects is not a viable approach, and this will only become increasingly difficult with the advent of new instruments and associated large-scale surveys, e.g. Pan-STARRS [2], the Dark Energy Survey (DES) [3], and the Large Synoptic Survey Telescope (LSST) [4], among others.

Due to improved efficiency, lower cost, and the capacity to probe distant objects, an alternative and widespread tool for galaxy redshift estimation is the use of photometric techniques (photo-z) [5]–[7]. Photo-z are relevant for a variety of astronomical and cosmological applications such as gravitational lensing [8], baryon acoustic oscillations [9], and studies of Type Ia supernovae [10]. Photometric measurements capture the intensity of electromagnetic radiation through a small number of broad wavelength windows (filters). Compared to spectroscopic techniques (spec-z), photometric measurements exhibit lower resolution and higher estimation uncertainty, but there are also major gains: we can substantially expedite the acquisition and analysis of this type of observation. A major aim in astronomy is to infer spectroscopic properties from purely photometric data.

The challenge behind the construction of accurate photometric galaxy redshift estimators is twofold. One major obstacle is the (near-)degeneracy in photometric feature space of different galaxy types at differing redshifts, exacerbated by photometric measurement errors [11]. The second obstacle is the marked disparity in distribution between spectroscopic (training) and photometric (target) observations, such that the naive approach of building a model on spectroscopic data and applying it to photometric data is prone to failure [12]. This scenario may seem a perfect fit for domain adaptation techniques in machine learning [13]–[17], where the training set is drawn from a source domain, while the set we wish to estimate is drawn from a target domain, and the two exhibit different distributions. The main challenge is to adjust the model obtained in the source domain to account for differences exhibited in the new target domain. In our case, the source domain corresponds to spectroscopic observations, whereas the target domain corresponds to photometric observations. However, under a strong distributional discrepancy between the two domains, domain adaptation techniques alone are inadequate, and a better solution is to combine domain adaptation with active learning techniques [18]–[24].

Active learning (AL) suggests which unlabeled instances (those lacking a value for the response variable) should be queried so that the missing response can be obtained and the instances subsequently incorporated into the training set [25]–[27]. In our case, such instances correspond to galaxies in the photometric (target) sample that can be queried for subsequent spectroscopic follow-up. The idea has been applied to astronomical problems for the determination of stellar population parameters [28], periodic variable stars [29], and supernova photometric classification [18]. A recent approach used a self-organizing map to pinpoint regions in the target domain not covered in the source domain [30], but unlike AL, that methodology does not directly optimize empirical performance and the number of required queries.

In this study, we provide an empirical assessment of the effectiveness of AL on photometric redshift estimation. The idea is to build an initial regression model using a source (spectroscopy) dataset and to refine that model by sampling examples from the target (photometry) dataset. The central goal is to evaluate the efficiency of AL under the assumption that querying galaxy redshifts through spectroscopic analysis is expensive; it is then important to minimize the number of queries while improving the accuracy of the regression model.

Our experiments use datasets specifically constructed to capture the distributional discrepancy between photometric and spectroscopic observations [12]. Results show that using AL through a Query by Committee approach yields significant gains in performance. The comparison is made with similar models built with and without the use of AL. Moreover, results show that a few hundred instances (∼300) suffice to generate a good model to predict redshifts on thousands of spectroscopic observations. The proposed methodology leverages modern regression techniques with the ability to adapt to new domains in the presence of distributional changes.

This paper is organized as follows. Section II explains basic concepts in regression and AL. Section III explains our approach to galaxy redshift estimation using AL. Section IV shows our empirical results. Lastly, Section V gives a summary and conclusions.

II. PRELIMINARY CONCEPTS

A. Basic Notation in Regression

We assume a regression algorithm receives as input a set of training examples $T = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$, where $\mathbf{x} = (x_1, x_2, \cdots, x_n)$ is a vector in the input space $\mathbb{R}^n$, and $y$ is an instance of the output space, $y \in \mathbb{R}$, also referred to as $y(\mathbf{x})$. We assume the training dataset $T$ consists of independently and identically distributed (i.i.d.) examples obtained according to a fixed but unknown joint probability distribution, $P(\mathbf{x}, y)$, over the input-output space $\mathbb{R}^n \times \mathbb{R}$. The outcome of the algorithm is a hypothesis or function $f_\theta(\mathbf{x}|T)$ (parameterized by $\theta$) mapping the input space to the output space, $f: \mathbb{R}^n \rightarrow \mathbb{R}$. We are interested in choosing the hypothesis that minimizes the expected error:

$$\int_{\mathbf{x}} E\big[\, (f_\theta(\mathbf{x}|T) - y(\mathbf{x}))^2 \,\big|\, \mathbf{x} \,\big]\; g(\mathbf{x})\, d\mathbf{x} \qquad (1)$$

where $E[\cdot]$ is the expectation over the posterior distribution $P(y|\mathbf{x})$ and the training set $T$. We assume $g(\mathbf{x})$ is the marginal distribution over the input space, and $f_\theta(\mathbf{x}|T)$ is the estimate of the response for $\mathbf{x}$ induced by the regression algorithm. In what follows, $E_T[\cdot]$ refers to the expectation over the training set, and $E_P[\cdot]$ to the expectation over the posterior distribution $P(y|\mathbf{x})$.

The expected error in Eq. 1 can be decomposed into three main components:

$$E\big[(f_\theta(\mathbf{x}|T) - y(\mathbf{x}))^2 \,\big|\, \mathbf{x}\big] = \underbrace{E_P\big[(y(\mathbf{x}) - E_P[\,y|\mathbf{x}\,])^2\big]}_{\text{noise}} + \underbrace{\big(E_T[\,f_\theta(\mathbf{x}|T)\,] - E_P[\,y|\mathbf{x}\,]\big)^2}_{\text{bias}^2} + \underbrace{E_T\big[(f_\theta(\mathbf{x}|T) - E_T[\,f_\theta(\mathbf{x}|T)\,])^2\big]}_{\text{variance}} \qquad (2)$$

The first term on the right side of Eq. 2 is independent of our estimator $f_\theta$ and captures the inherent randomness or noise in the data; it is unavoidable as long as the data are represented using the same feature (input) space. The second term is known as the squared bias, and denotes how well our model matches the data distribution. The last term quantifies the instability of our model as it wiggles around a central value; this term is known as the model variance. There is a well known trade-off between variance and bias [31]: ideally we would like to have both low bias and low variance, but the two terms are commonly inversely correlated.¹

¹ Notice that both terms can be reduced if the space of hypotheses or functions is intentionally designed to contain a good approximation to the target function, which demands extra-evidential information lying outside the training data.
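To make the decomposition concrete, the following minimal sketch (our own toy illustration, not part of the paper's experiments) estimates the three terms of Eq. 2 by retraining an arbitrary estimator on many independently drawn training sets; the true function, noise level, and choice of regressor are all assumptions made purely for the example.

```python
# Toy numerical check of Eq. 2: noise + bias^2 + variance ~ expected squared error.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
true_fn = lambda x: np.sin(3 * x).ravel()     # arbitrary "true" response
noise_sd = 0.2
x_test = np.linspace(0, 2, 50)[:, None]

preds = []
for _ in range(500):                          # many independent training sets T
    x_tr = rng.uniform(0, 2, size=(60, 1))
    y_tr = true_fn(x_tr) + rng.normal(0, noise_sd, size=60)
    preds.append(DecisionTreeRegressor(max_depth=3).fit(x_tr, y_tr).predict(x_test))
preds = np.array(preds)                       # shape (n_repeats, n_test)

noise = noise_sd ** 2                         # E_P[(y - E_P[y|x])^2]
bias2 = (preds.mean(axis=0) - true_fn(x_test)) ** 2
variance = preds.var(axis=0)

# Monte Carlo estimate of the left-hand side of Eq. 2 (fresh noisy y per repeat).
y_draws = true_fn(x_test) + rng.normal(0, noise_sd, size=preds.shape)
expected_error = ((preds - y_draws) ** 2).mean(axis=0)

print((noise + bias2 + variance).mean(), expected_error.mean())  # approximately equal
```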

B. Active Learning for Regression

Active learning provides a mechanism to query those unlabeled examples deemed relevant to build an accurate predictive model [25]–[27]. Specifically, in (pool-based) AL, we assume the existence of two training sets: $T = \{(\mathbf{x}_i, y_i)\}_{i=1}^{m}$ and $T_U = \{\mathbf{x}_j\}_{j=1}^{l}$, where the latter set lacks values for the response variable. To incorporate set $T_U$ into the regression task, we selectively identify examples from $T_U$ and query their response values to incorporate the new pairs into the training set $T$. An assumption is made that querying instances is costly, and that we should try to minimize the number of queries while improving model performance.

In this paper we follow an approach to AL known as Query by Committee (QBC) [32], [33]. The use of QBC originated in the realm of classification (where the output variable y takes on categorical values), but it has also been used in regression [34], [35]. In essence, the algorithm builds a committee of different models over training set T; it then selects those instances in T_U where most of the committee models disagree. This is straightforward to apply in a classification problem, where the notion of disagreement is easily decided for a categorical response [36]–[38]. For the regression problem, AL is based on the notion of model variance. As an illustration, Figure 1 shows a sequence of two steps where examples are queried based on regions of high variance over the feature space. The left panel shows a distribution of examples approximated using a composition of two linear models. Regions of high variance are queried to yield a better approximating function (right panel).

Figure 1. Active learning for regression focuses on regions of high variance to select new training instances. Once instances are queried and incorporated into the training set, the model is re-computed to identify new regions of high variance.

In the context of QBC, variation is captured by quantifying the level of disagreement among the committee members (or regression models) through a metric called ambiguity [34]. Specifically, let the committee be made of a set of regression models² $\{f_k(\mathbf{x})\}$. Then the ambiguity at instance $\mathbf{x}$, $A(\mathbf{x})$, is defined as follows:

$$A(\mathbf{x}) = \sum_k w_k \big(f_k(\mathbf{x}) - F(\mathbf{x})\big)^2 \qquad (3)$$

where $F(\mathbf{x})$ is the average estimate at $\mathbf{x}$ over the whole ensemble of regression models:

$$F(\mathbf{x}) = \sum_k w_k f_k(\mathbf{x}) \qquad (4)$$

The weights $\{w_k\}$ are the same in Eqs. 3 and 4, and denote our degree of belief in, or the importance of, each model. We assume $w_k \geq 0$ and $\sum_k w_k = 1$.

² For simplicity, we omit the parameter θ in the notation of the function estimators.
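Eqs. 3 and 4 translate directly into a few lines of code. The sketch below is a hedged illustration, assuming committee members that expose a scikit-learn-style predict method; it returns the weighted ensemble prediction F(x) and the ambiguity A(x) for a batch of instances. Ranking a pool of unlabeled instances by A(x) is then a single argsort.

```python
import numpy as np

def ensemble_and_ambiguity(models, weights, X):
    """Return F(x) (Eq. 4) and A(x) (Eq. 3) for every row of X."""
    preds = np.array([m.predict(X) for m in models])   # shape (n_models, n_instances)
    w = np.asarray(weights)[:, None]                   # w_k >= 0, sum_k w_k = 1
    F = (w * preds).sum(axis=0)                        # weighted committee mean
    A = (w * (preds - F) ** 2).sum(axis=0)             # disagreement around F
    return F, A
```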

It can be shown that the generalization error for the whole ensemble can be written as follows:

$$\int_{\mathbf{x}} E\big[(F(\mathbf{x}) - y(\mathbf{x}))^2\big]\; g(\mathbf{x})\, d\mathbf{x} = \Gamma - \Lambda \qquad (5)$$

where Γ is the weighted average of the generalization errors of the regression models:

$$\Gamma = \sum_k w_k E_k = \sum_k w_k \int_{\mathbf{x}} (f_k(\mathbf{x}) - y(\mathbf{x}))^2\, g(\mathbf{x})\, d\mathbf{x} \qquad (6)$$

and Λ is the weighted average of the ambiguities:

$$\Lambda = \sum_k w_k A_k = \sum_k w_k \int_{\mathbf{x}} (f_k(\mathbf{x}) - F(\mathbf{x}))^2\, g(\mathbf{x})\, d\mathbf{x} \qquad (7)$$

The relevant fact about Eq. 5 is the trade-off between bias and variance in the QBC approach, which contrasts with the one defined in Section II-A. If the regression models are highly biased (e.g., linear models), the ambiguity is small, and the error is dominated by the weighted average Γ of the errors of the regression models. But if the average ambiguity Λ is high, the generalization error is reduced; this is achieved by deliberately building an ensemble made of a mixture of markedly different regression models.
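Eq. 5 rests on an identity that holds instance by instance, (F(x) − y(x))² = Σ_k w_k (f_k(x) − y(x))² − A(x), so it can be checked numerically with arbitrary committee outputs. The snippet below is such a sanity check on made-up predictions, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_models, n_points = 5, 1000
w = np.full(n_models, 1.0 / n_models)                      # equal weights, sum to 1
preds = rng.normal(size=(n_models, n_points))              # stand-in committee outputs
y = rng.normal(size=n_points)                              # stand-in true responses

F = (w[:, None] * preds).sum(axis=0)                       # ensemble prediction (Eq. 4)
A = (w[:, None] * (preds - F) ** 2).sum(axis=0)            # ambiguity (Eq. 3)
weighted_err = (w[:, None] * (preds - y) ** 2).sum(axis=0) # integrand of Gamma (Eq. 6)

assert np.allclose((F - y) ** 2, weighted_err - A)         # pointwise form of Eq. 5
```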

III. METHODOLOGY

Our goal is to build accurate galaxy redshift estimators using photometric data. The two major obstacles to attaining this goal, namely the lack of spectroscopic redshifts in new surveys and the marked disparity in distribution between spectroscopic and photometric samples, can be tackled in two steps: 1) build an initial model using the (relatively small) set of galaxies with measured spectroscopic redshifts, and 2) iterate using AL by querying redshifts from the (large) set of galaxies with unknown redshifts. The first step is important to avoid the "cold start" problem [39] in AL, where the efficiency of AL is contingent on an initial accurate model, but such a model requires a representative labeled sample to be constructed in the first place; this circularity is broken when pre-labeled source examples can participate in the model building process. The second step helps alleviate the distributional discrepancy by sampling directly from the target domain (photometric sample) using AL.

Algorithm 1 QBC with initial source model
Input: Labeled training data T, unlabeled data T_U, active learning sample size q
Output: Regression model F(x)
1: Train model F on set T using QBC
2: while T_U ≠ ∅ do
3:     Rank T_U by A(x) (ambiguity, Eq. 3)
4:     Let T*_U contain the top q instances with highest rank
5:     Query redshift values on T*_U
6:     T = T ∪ T*_U,  T_U = T_U \ T*_U
7:     Train model F on set T using QBC
8: return model F

Algorithm 1 shows the main steps behind our approach. A committee of models F is built using a sample with known redshifts (T); this will be referred to as the source model. An iterative process then selects the q instances with highest ambiguity, and queries their redshifts. Training set T is augmented with the set of queried instances (while the same instances are removed from T_U). A new model is then built with the new training set. The process repeats until set T_U is exhausted; the output is the regression model F, which will be referred to as the target model. Results in Section IV quantify the advantages of our active-learning approach applied to photometric redshift estimation.

Figure 2. Distributions of r-band magnitude and the 4 SDSS colors in our source (blue line) and target (green region) galaxy samples, i.e. samples A and D of the TEDDY catalog from [12].
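For readers who prefer code, a minimal Python sketch of Algorithm 1 follows. It assumes scikit-learn-style committee members and a hypothetical query_oracle callback standing in for the spectroscopic follow-up; the batch size and model details are simplified relative to the experiments of Section IV.

```python
import numpy as np

def qbc_active_learning(models, weights, X_train, y_train, X_pool, query_oracle, q=10):
    """Sketch of Algorithm 1: repeatedly query the q most ambiguous pool
    instances, add them to the training set, and retrain the committee."""
    w = np.asarray(weights)[:, None]
    X_train, y_train = np.array(X_train), np.array(y_train)
    pool_idx = np.arange(len(X_pool))

    while len(pool_idx) > 0:
        for m in models:                                    # train committee on current T
            m.fit(X_train, y_train)
        preds = np.array([m.predict(X_pool[pool_idx]) for m in models])
        F = (w * preds).sum(axis=0)                         # Eq. 4
        A = (w * (preds - F) ** 2).sum(axis=0)              # Eq. 3
        top = pool_idx[np.argsort(A)[::-1][:q]]             # highest-ambiguity instances
        y_new = query_oracle(top)                           # hypothetical spectroscopic follow-up
        X_train = np.vstack([X_train, X_pool[top]])
        y_train = np.concatenate([y_train, y_new])
        pool_idx = np.setdiff1d(pool_idx, top)              # remove queried instances from T_U

    return models
```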

IV. EXPERIMENTS

For the purpose of testing the applicability of AL to the area of photometric redshift estimation, our experiments use data specifically designed to capture discrepancies between spectroscopic and photometric observations, as reported by [12]. We select the TEDDY galaxy catalogue of [12] as a starting point, which explicitly deals with extrapolation in feature space, an issue that regularly comes up in real-world photo-z applications [40].

A. Dataset

The TEDDY catalogue³ [12] was assembled from spectroscopic galaxy measurements of the SDSS Data Release 12 (DR12, [41]). There are four different subsets available, of which we utilize TEDDY A and TEDDY D, as detailed below.

The TEDDY A sample consists of galaxies that are within the bounds of a rectangular parallelepiped in the space of the four SDSS colors (1.0 < u − g < 2.2, 1.2 < g − r < 2.0, 0.1 < r − i < 0.8, and 0.2 < i − z < 0.5), with an extra constraint on the r-band magnitude, r < 21. From the SDSS DR12 spectroscopic sample, ≈ 75,000 objects that satisfy these criteria were randomly selected to form TEDDY A. We refer to this sample as our source dataset, where redshifts have been accurately determined, but coverage over the feature space is limited.
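As an illustration of these cuts, the sketch below selects a TEDDY-A-like subsample from a generic magnitude table; the pandas DataFrame and its column names (u, g, r, i, z) are assumptions made for the example and need not match the published catalogue files.

```python
import pandas as pd

def teddy_a_like_cut(cat: pd.DataFrame) -> pd.DataFrame:
    """Apply the color and magnitude cuts quoted above to a magnitude table."""
    u_g, g_r = cat["u"] - cat["g"], cat["g"] - cat["r"]
    r_i, i_z = cat["r"] - cat["i"], cat["i"] - cat["z"]
    mask = (
        (u_g > 1.0) & (u_g < 2.2) &
        (g_r > 1.2) & (g_r < 2.0) &
        (r_i > 0.1) & (r_i < 0.8) &
        (i_z > 0.2) & (i_z < 0.5) &
        (cat["r"] < 21)
    )
    return cat[mask]
```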

TEDDY D is disjoint from TEDDY A, and has been randomly sampled from the entire SDSS DR12 spectroscopic sample to contain ≈ 75,000 galaxies. As it is not limited by the stringent color and magnitude cuts of TEDDY A, its feature space coverage is much wider; it therefore represents a case where training-set coverage is not available, i.e. extrapolation is needed. We refer to this sample as our target dataset.

³ Available in the COINtoolbox: https://github.com/COINtoolbox/photoz_catalogues

Figure 2 shows the SDSS photometric color and r-band magnitude distributions of our source and target datasets. The cuts in parameter space in the case of the former clearly lead to a significantly narrower coverage than that of the latter. The five features (the r magnitude and the u−g, g−r, r−i, and i−z colors) comprise the 5D photometric feature space. Spectroscopic redshifts of all galaxies in the sample are readily available for training or validation.

Clearly, our setup models the task of spectroscopic target selection with the goal of optimizing photo-z estimation results. The datasets were explicitly chosen this way to mirror the scenario of using a limited spectroscopic sample to make inferences on an extended photometric sample. It has to be noted, though, that our target set uses spectroscopic-quality observations; therefore the photometric error distributions of our two sets can in fact be assumed similar. A full analysis, taking into account that photometric samples generally have higher photometric errors, as illustrated by the HAPPY catalogue of [12], will be presented in future work.

B. Experimental Strategy

Our experimental strategy is as follows. We begin by randomly sampling |T| = 10,000 instances from the source set (TEDDY A) to make up the initial training set. This number is large enough to guarantee an accurate model, while avoiding significant computational cost. The result is an initial ensemble model F embedding the following regression algorithms: a linear regression model (LM), local linear regression [12] with neighborhood size k = 100 (LLR 100), local linear regression with k = 1000 (LLR 1000), Random Forest (RF), and Support Vector Machines (SVM).

On the target set (TEDDY D), we sample 5,000 instances to use as the pool for AL (T_U). A separate (mutually exclusive) set of 30,000 instances is sampled to form a validation set. The validation set tracks the performance of photo-z estimation throughout the AL phase, and is never used for training.
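A rough sketch of such a five-member committee is given below. The local linear regression is implemented as a simple k-nearest-neighbor least-squares fit, and the RF and SVM hyperparameters are left at library defaults; these are stand-ins, since the paper does not specify the exact implementations used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import NearestNeighbors

class LocalLinearRegression:
    """Stand-in LLR: ordinary least squares on the k nearest training neighbors."""
    def __init__(self, k=100):
        self.k = k

    def fit(self, X, y):
        self.X_, self.y_ = np.asarray(X), np.asarray(y)    # requires at least k training points
        self.nn_ = NearestNeighbors(n_neighbors=self.k).fit(self.X_)
        return self

    def predict(self, X):
        X = np.asarray(X)
        _, idx = self.nn_.kneighbors(X)
        out = np.empty(len(X))
        for i, nbrs in enumerate(idx):                     # one local fit per query point
            lm = LinearRegression().fit(self.X_[nbrs], self.y_[nbrs])
            out[i] = lm.predict(X[i:i + 1])[0]
        return out

committee = [
    LinearRegression(),             # LM
    LocalLinearRegression(k=100),   # LLR 100
    LocalLinearRegression(k=1000),  # LLR 1000
    RandomForestRegressor(),        # RF (hyperparameters not given in the paper)
    SVR(),                          # SVM (hyperparameters not given in the paper)
]
weights = np.full(len(committee), 1.0 / len(committee))    # equal weights, w_i = 1/s
```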

Our implementation of the Query by Committee algorithm proceeds in batches of q = 10 (see Algorithm 1). Each batch is made up of the set of instances with highest ambiguity (Eq. 3), and is used to extend the size of the training set T. In this study we give the same importance to all models, such that w_i = 1/s, where s is the number of models in the ensemble (in our experiments s = 5). The equal-weight policy was adopted simply to test the feasibility of our proposed approach.

Figure 3. Photo-z estimation results before (top) and after (bottom) using active learning under the Query by Committee approach.

Figure 4. An assessment of the performance of the ensemble model and its constituent models (LM, SVM, RF, LLR 100, LLR 1000) using active learning. The performance diagnostics (mean, standard deviation and MAD of ∆z_norm, in units of 10⁻², and the outlier rate in %) are shown as a function of the number of queries.

C. Results

We first compare the quality of the predictions made by the ensemble model, and each of the regression algorithms embedded in it, with and without the use of AL. Figure 3 shows different performance plots. Each panel shows the photometric redshift (photo-z) estimates versus their spectroscopic redshift (spec-z) values; an error-free model corresponds to the dashed diagonal line. Columns correspond to different regression methods. Rows capture one of two scenarios: the first row shows the performance of the models with no AL; each model is trained on the source dataset and directly applied to the target dataset. The second row shows the performance of each of the regression models after 250 queries using the QBC approach (Section III). Without the use of AL, predictions appear markedly biased due to the distributional discrepancy between source and target domains. The effect of such disparity is more accentuated with Support Vector Machines, Random Forests and LLR 100. LLR 1000 and the Ensemble Model seem less affected by the discrepancy, but still produce a more accurate model after training with the selected queried examples.

For a second round of experiments, we assess the performance of the different models using standard diagnostics adopted in the astronomical literature [12], [42], [43]: the mean, standard deviation and median absolute deviation (MAD) of the normalized redshift error ∆z_norm = (z_spec − z_photo)/(1 + z_spec), where z_photo and z_spec are the photometric and spectroscopic redshifts, respectively, together with the outlier rate. Following [12], [42], [43], outliers are defined as those galaxies where |∆z_norm| > 0.15, and the mean ∆z_norm and standard deviation σ(∆z_norm) are computed with the outliers removed from the samples.
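These diagnostics reduce to a few numpy operations. The helper below is our own hedged implementation of the stated definitions (using the standard median-absolute-deviation convention for MAD); it returns the mean, standard deviation and MAD of ∆z_norm, with outliers removed for the mean and standard deviation, together with the outlier rate.

```python
import numpy as np

def photoz_diagnostics(z_spec, z_photo, outlier_cut=0.15):
    """Mean, std and MAD of Delta z_norm, plus the outlier rate (in %)."""
    z_spec, z_photo = np.asarray(z_spec), np.asarray(z_photo)
    dz = (z_spec - z_photo) / (1.0 + z_spec)            # normalized redshift error
    outlier = np.abs(dz) > outlier_cut
    clean = dz[~outlier]                                # mean and std exclude outliers
    mad = np.median(np.abs(dz - np.median(dz)))         # MAD about the median (one common convention)
    return {
        "mean": clean.mean(),
        "std": clean.std(),
        "mad": mad,
        "outlier_rate_percent": 100.0 * outlier.mean(),
    }
```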

Our experiments compare the evolution of the diagnostics described above using the ensemble model and the other committee models. In Figure 4, the dashed (orange) line shows results from the ensemble model as a function of the number of queries, while the other lines represent the different committee models. The left-most panel shows the mean value of ∆z_norm (i.e. the overall bias), which converges quickly with the number of queries (∼200). Convergence is even faster for the other diagnostics, namely σ(∆z_norm), MAD and the outlier rate (from left to right). The corresponding numerical values, for each diagnostic and model, are shown in Table I.

Table I. Results from applying Query by Committee to the TEDDY catalog.

model      n queries   mean (×10⁻²)   std (×10⁻²)   MAD (×10⁻²)   outlier rate (%)
Ensemble        0         -1.35          6.45          4.04            0.91
Ensemble       50         -0.25          4.61          2.56            0.58
Ensemble      100         -0.34          4.96          2.51            0.48
Ensemble      150         -0.27          4.74          2.46            0.49
Ensemble      200         -0.27          4.67          2.44            0.48
Ensemble      250         -0.25          4.62          2.43            0.50
LM              0         -0.31          5.45          3.67            1.35
LM             50         -0.34          5.40          3.63            1.32
LM            100         -0.31          5.36          3.62            1.29
LM            150         -0.32          5.32          3.59            1.27
LM            200         -0.34          5.29          3.56            1.25
LM            250         -0.30          5.25          3.55            1.26
SVM             0         -2.23          8.07          6.03            4.46
SVM            50         -0.22          4.92          3.04            1.10
SVM           100         -0.46          4.89          2.78            0.60
SVM           150         -0.37          4.71          2.65            0.52
SVM           200         -0.33          4.57          2.58            0.51
SVM           250         -0.29          4.44          2.53            0.51
RF              0         -1.20          8.28          8.09            1.86
RF             50         -0.03          4.93          3.44            0.66
RF            100         -0.21          4.70          2.89            0.47
RF            150         -0.14          4.50          2.74            0.48
RF            200         -0.15          4.38          2.64            0.49
RF            250         -0.16          4.28          2.60            0.51
LLR 100         0         -3.72         13.94          5.39            3.87
LLR 100        50         -0.44          7.27          2.83            0.80
LLR 100       100         -0.55          9.30          2.68            0.63
LLR 100       150         -0.44          8.27          2.55            0.65
LLR 100       200         -0.44          8.06          2.51            0.60
LLR 100       250         -0.43          8.10          2.48            0.62
LLR 1000        0         -1.39          7.13          2.83            0.51
LLR 1000       50         -0.68          5.80          2.59            0.56
LLR 1000      100         -0.50          6.03          2.54            0.51
LLR 1000      150         -0.38          5.79          2.49            0.52
LLR 1000      200         -0.37          5.72          2.47            0.51
LLR 1000      250         -0.35          5.70          2.46            0.52

Figure 5. Evolution of the photometric redshift estimation diagnostics (mean, standard deviation and MAD of ∆z_norm, in units of 10⁻², and the outlier rate in %, as a function of the number of queries) for 3 different labeling strategies, all using the ensemble model: Current, Passive Learning, and Active Learning-QBC.

Clearly, AL helps reduce the number of catastrophic errors considerably, and lowers the scatter and bias for all competitive models after just a couple hundred queries. While such results are encouraging, we must also take into account that the increase itself in the number of points in the training set, even if randomly selected, might impact the diagnostic results. In Figure 5, we show the evolution of the diagnostics as a function of the number of queries for 3 distinct labeling strategies. The current strategy (dot-dashed, blue) illustrates the effect of randomly selecting points following the initial training distribution.⁴ Passive learning is defined as random sampling from the target distribution (dashed, orange), and AL via Query by Committee (solid, green) is as described in the previous sections. From the figure, we see that increasing the size of the training sample following the same labeling strategy produces only a marginal effect on the final diagnostics. On the other hand, passive learning improves all diagnostics almost as effectively as AL, with the exception of bias. This is expected, since in passive learning we are randomly sampling from the target distribution, thus directly sampling over regions of the feature space poorly covered by the source distribution (Figure 2).

⁴ In the context of the TEDDY catalog, this means randomly querying objects from TEDDY B [12].

In order to clarify the differences between the passive and AL strategies, we show in Figure 6 the evolution of the r-band magnitude distribution of the query set for each strategy, superimposed on the corresponding distributions of the training (spectroscopic) and target (photometric) sets. In both panels, the light colored curves show early query sets and darker colored curves indicate later query sets, from 10 to 250 queries. It is clear that when sampling from the target distribution (left panel), all regions of the training feature space are sampled homogeneously, without taking into account the fact that there are already many training points at intermediate r-magnitude values. AL, on the other hand, focuses on querying objects with higher r-band magnitudes in order to compensate for the imbalance between the training and target samples in that region. This difference is responsible for the disparity in the diagnostics shown in Figure 5, particularly in the bias.

Figure 6. Distribution of r-band magnitude in the training/spectroscopic (black dotted) and target/photometric (blue dashed) samples. Also shown is the evolution of the queried sample when adopting a passive (green) or active (red) learning approach via QBC. In both panels, the light colored curves correspond to earlier queries and the dark colored ones to later queries. We show the query set evolution from 10 to 250 queries.

V. SUMMARY AND CONCLUSIONS

We propose active learning (AL) as the next step in the quest for improving galaxy photometric redshift estimation. Our proposed methodology comes as a response to observational constraints that prevent us from ever obtaining a representative spectroscopic sample to be used for training. In the common scenario where the available spectroscopic sample is highly biased towards closer and brighter objects, we propose the use of current spectroscopic samples (source dataset) merely as a resource from which to build an initial regression model; the same model can then be used as a starting point to iteratively build increasingly more accurate models by adding selected objects from the photometric sample (target dataset). Key to the successful application of this procedure is the use of AL to find a relatively small number of objects in the target set that, when added to the existing source set, yield highly predictive models on the target distribution.

Using an approach to AL called Query by Committee (QBC), we show experimental results on datasets designed to capture the distributional discrepancy between spectroscopic and photometric observations. Results show a significant degree of model refinement as an iterative process adds queried examples into the training set. Specifically, we conclude that (1) a few hundred queried galaxies suffice to generate accurate regression models on the photometric (target) domain; and (2) AL lowers the number of catastrophic errors (or the outlier rate) for all competitive models after a few hundred queries. Overall, we observe strong evidence in favor of AL as a means to compensate for the distributional discrepancy between spectroscopic and photometric observations.

As future work we plan to look for alternative approaches to AL in regression. This is an area poorly explored in the machine learning community, where most of the attention has focused on the classification task. Additionally, we plan to look for patterns across the set of queried instances that would enable us to anticipate which galaxies are in need of spectroscopic analysis in real time, just as observations are being recorded. Finally, we plan to estimate photo-z for all SDSS galaxies as proposed here, so as to build luminosity functions and to construct a more accurate map of the large-scale structure of the Universe.

ACKNOWLEDGMENTS

This work was partly supported by the Center for Advanced Computing and Data Systems (CACDS), and by the Texas Institute for Measurement, Evaluation, and Statistics (TIMES) at the University of Houston.

We thank the IAA Cosmostatistics Initiative⁵ (COIN), where our interdisciplinary research team was formed. COIN is a non-profit organization whose aim is to nourish the synergy between the astrophysics, cosmology, statistics and machine learning communities. EEOI and RSS thank Bruno Quint for suggesting the query evolution visualization.

⁵ http://cointoolbox.github.io/

REFERENCES

[1] E. Hubble, “A Relation between Distance and Radial Velocity among Extra-Galactic Nebulae,” Proceedings of the National Academy of Science, vol. 15, pp. 168–173, Mar. 1929.
[2] K. C. Chambers, E. A. Magnier, N. Metcalfe, H. A. Flewelling, M. E. Huber, C. Z. Waters, L. Denneau, P. W. Draper, et al., “The Pan-STARRS1 Surveys,” ArXiv e-prints, Dec. 2016.


[3] Dark Energy Survey Collaboration, T. Abbott, F. B. Abdalla, J. Aleksic, S. Allam, A. Amara, D. Bacon, E. Balbinot, et al., “The Dark Energy Survey: more than dark energy - an overview,” MNRAS, vol. 460, pp. 1270–1299, Aug. 2016.
[4] Z. Ivezic, J. A. Tyson, B. Abel, E. Acosta, R. Allsman, Y. AlSayyad, S. F. Anderson, J. Andrew, et al., “LSST: from Science Drivers to Reference Design and Anticipated Data Products,” ArXiv e-prints, May 2008.
[5] H. Hildebrandt, C. Wolf, and N. Benítez, “A blind test of photometric redshifts on ground-based data,” A&A, vol. 480, no. 3, pp. 703–714, 2008.
[6] A. Krone-Martins, E. E. O. Ishida, and R. S. de Souza, “The first analytical expression to estimate photometric redshifts suggested by a machine,” MNRAS, vol. 443, pp. L34–L38, Sep. 2014.
[7] J. Elliott, R. S. de Souza, A. Krone-Martins, E. Cameron, E. E. O. Ishida, and J. Hilbe, “The overlooked potential of Generalized Linear Models in astronomy-II: Gamma regression and photometric redshifts,” A&C, vol. 10, pp. 61–72, Apr. 2015.
[8] A. Petri, M. May, and Z. Haiman, “Cosmology with photometric weak lensing surveys: Constraints with redshift tomography of convergence peaks and moments,” Physical Review D, vol. 94, no. 6, p. 063534, Sep. 2016.
[9] C. Blake, A. Collister, S. Bridle, and O. Lahav, “Cosmological baryonic and matter densities from 600000 SDSS luminous red galaxies with photometric redshifts,” MNRAS, vol. 374, pp. 1527–1548, Feb. 2007.
[10] R. Kessler, D. Cinabro, B. Bassett, B. Dilday, J. A. Frieman, P. M. Garnavich, S. Jha, J. Marriner, R. C. Nichol, M. Sako, M. Smith, J. P. Bernstein, D. Bizyaev, A. Goobar, S. Kuhlmann, D. P. Schneider, and M. Stritzinger, “Photometric Estimates of Redshifts and Distance Moduli for Type Ia Supernovae,” ApJ, vol. 717, pp. 40–57, Jul. 2010.
[11] N. Benítez, “Bayesian Photometric Redshift Estimation,” ApJ, vol. 536, pp. 571–583, Jun. 2000.
[12] R. Beck, C. A. Lin, E. E. O. Ishida, F. Gieseke, R. S. de Souza, M. Costa-Duarte, M. W. Hattab, and A. Krone-Martins, “On the realistic validation of photometric redshifts,” MNRAS, vol. 468, no. 4, p. 4323, 2017.
[13] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” in Proceedings of the 19th International Conference on Neural Information Processing Systems, ser. NIPS’06. Cambridge, MA, USA: MIT Press, 2006, pp. 137–144.
[14] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Mach. Learn., vol. 79, no. 1-2, May 2010.
[15] H. Daume, III and D. Marcu, “Domain adaptation for statistical classifiers,” J. Artif. Int. Res., vol. 26, no. 1, pp. 101–126, May 2006. [Online]. Available: http://dl.acm.org/citation.cfm?id=1622559.1622562
[16] Y. Mansour, M. Mohri, and A. Rostamizadeh, Domain adaptation: Learning bounds and algorithms, 2009.
[17] Y. Shi and F. Sha, “Information-Theoretical Learning of Discriminative Clusters for Unsupervised Domain Adaptation,” ArXiv e-prints, Jun. 2012.
[18] K. D. Gupta, R. Pampana, R. Vilalta, E. E. O. Ishida, and R. S. de Souza, “Automated supernova Ia classification using adaptive learning techniques,” in 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Dec 2016, pp. 1–8.
[19] G. Matasci, D. Tuia, and M. Kanevski, “SVM-based boosting of active learning strategies for efficient domain adaptation,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 5, no. 5, pp. 1335–1343, Oct 2012.
[20] C. Persello and L. Bruzzone, “Active learning for domain adaptation in the supervised classification of remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 11, pp. 4468–4483, 2012.
[21] P. Rai, A. Saha, H. Daume, III, and S. Venkatasubramanian, “Domain adaptation meets active learning,” in Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing, ser. ALNLP ’10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 27–32.
[22] A. Saha, P. Rai, H. Daume, S. Venkatasubramanian, and S. L. DuVall, “Active supervised domain adaptation,” in Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part III, ser. ECML PKDD’11. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 97–112.
[23] X. Shi, W. Fan, and J. Ren, Actively Transfer Domain Knowledge. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 342–357.
[24] X. Wang, T.-K. Huang, and J. Schneider, “Active transfer learning under model shift,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), T. Jebara and E. P. Xing, Eds. JMLR Workshop and Conference Proceedings, 2014, pp. 1305–1313. [Online]. Available: http://jmlr.org/proceedings/papers/v32/wangi14.pdf
[25] B. Settles, Active Learning. Morgan & Claypool, 2012.
[26] M. F. Balcan, A. Beygelzimer, and J. Langford, “Agnostic active learning,” Journal of Computer and System Sciences, vol. 75, no. 1, pp. 78–89, 2009.
[27] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, “Active learning with statistical models,” Journal of Artificial Intelligence Research, vol. 4, no. 1, pp. 129–145, 1996.
[28] T. Solorio, O. Fuentes, R. Terlevich, and E. Terlevich, “An active instance-based machine learning method for stellar population studies,” MNRAS, vol. 363, pp. 543–554, Oct. 2005.
[29] J. W. Richards, D. L. Starr, H. Brink, A. A. Miller, J. S. Bloom, N. R. Butler, J. B. James, J. P. Long, and J. Rice, “Active Learning to Overcome Sample Selection Bias: Application to Photometric Variable Star Classification,” ApJ, vol. 744, p. 192, Jan. 2012.
[30] D. Masters, P. Capak, D. Stern, O. Ilbert, M. Salvato, S. Schmidt, G. Longo, J. Rhodes, et al., “Mapping the Galaxy Color-Redshift Relation: Optimal Photometric Redshift Calibration Strategies for Cosmology Surveys,” ApJ, vol. 813, p. 53, Nov. 2015.
[31] S. Geman, E. Bienenstock, and R. Doursat, “Neural networks and the bias/variance dilemma,” Neural Comput., vol. 4, no. 1, pp. 1–58, Jan. 1992.
[32] H. S. Seung, M. Opper, and H. Sompolinsky, “Query by committee,” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ser. COLT ’92. New York, NY, USA: ACM, 1992, pp. 287–294.
[33] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby, “Selective sampling using the query by committee algorithm,” Mach. Learn., vol. 28, no. 2-3, pp. 133–168, 1997.
[34] R. Burbidge, J. J. Rowland, and R. D. King, “Active learning for regression based on query by committee,” in Proceedings of the 8th International Conference on Intelligent Data Engineering and Automated Learning. Springer-Verlag, 2007, pp. 209–218.
[35] A. Krogh and J. Vedelsby, “Neural network ensembles, cross validation and active learning,” in Proceedings of the 7th International Conference on Neural Information Processing Systems, ser. NIPS’94. MIT Press, 1994, pp. 231–238.
[36] N. Abe and H. Mamitsuka, “Query learning strategies using boosting and bagging,” in Proceedings of the Fifteenth International Conference on Machine Learning, ser. ICML ’98. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 1–9.
[37] P. Melville and R. J. Mooney, “Diverse ensembles for active learning,” in Proceedings of the Twenty-first International Conference on Machine Learning, ser. ICML ’04. New York, NY, USA: ACM, 2004, pp. 74–.
[38] I. Muslea, S. Minton, and C. A. Knoblock, “Active + semi-supervised learning = robust multi-view learning,” in Proceedings of the Nineteenth International Conference on Machine Learning, ser. ICML ’02. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002, pp. 435–442.
[39] D. Kale and Y. Liu, “Accelerating active learning with transfer learning,” in 2013 IEEE 13th International Conference on Data Mining, 2013, pp. 1085–1090.
[40] C. J. MacDonald and G. Bernstein, “Photometric Redshift Biases from Galaxy Evolution,” Publications of the Astronomical Society of the Pacific, vol. 122, pp. 485–489, Apr. 2010.
[41] S. Alam, Albareti, C. Allende Prieto, F. Anders, S. F. Anderson, T. Anderton, B. H. Andrews, E. Armengaud, E. Aubourg, S. Bailey, et al., “The Eleventh and Twelfth Data Releases of the Sloan Digital Sky Survey: Final Data from SDSS-III,” ApJS, vol. 219, p. 12, Jul. 2015.
[42] H. Hildebrandt, S. Arnouts, P. Capak, L. A. Moustakas, C. Wolf, F. B. Abdalla, R. J. Assef, M. Banerji, et al., “PHAT: PHoto-z Accuracy Testing,” A&A, vol. 523, p. A31, Nov. 2010.
[43] T. Dahlen, B. Mobasher, S. M. Faber, H. C. Ferguson, G. Barro, S. L. Finkelstein, K. Finlator, A. Fontana, et al., “A Critical Assessment of Photometric Redshift Methods: A CANDELS Investigation,” ApJ, vol. 775, p. 93, Oct. 2013.