

SUPPLEMENTARY INFORMATION

Extensive genomic and transcriptional diversity identified through massively parallel DNA and RNA sequencing of eighteen Korean individuals

Young Seok Ju, Jong-Il Kim, Sheehyun Kim, Dongwan Hong, Hansoo Park, Jong-Yeon Shin, Seungbok Lee, Won-Chul Lee, Sujung Kim, Saet-Byeol Yu, Sung-Soo Park, Seung-Hyun Seo, Ji-Young Yun, Hyun-Jin Kim, Dong-Sung Lee, Maryam Yavartanoo, Hyunseok Peter Kang, Omer Gokcumen, Diddahally R. Govindaraju, Jung Hee Jung, Hyonyong Chong, Kap-Seok Yang, Hyungtae Kim, Charles Lee and Jeong-Sun Seo

Nature Genetics: doi:10.1038/ng.872


Table of Contents

Supplementary Note

1. Genome Analysis

Sensitivity of SNP calling
Super nsSNP genes
Estimating the number of novel variants as the number of personal genomes increases
Linkage disequilibrium of novel variants
Detection of large deletions
Identification of large deletion breakpoints
Inferring molecular mechanisms for large deletion formation
Microhomology-sequence motifs for NHEJ deletions
De novo assembly of short reads

2. Transcriptome analysis

Sequence alignment for transcriptome
Expression mapping
Unknown transcripts
Genes escape X-inactivation

3. Comprehensive analysis of genome and transcriptome

Transcriptional base modification (TBM)
Allele-specific expression (ASE)

References



Supplementary Figures

Suppl. Figure 1: Experimental overview of the study
Suppl. Figure 2: Sensitivity of SNP detection at the population level
Suppl. Figure 3: Description of PRIM2 gene - Example of super nsSNP gene
Suppl. Figure 4: Super nsSNP genes
Suppl. Figure 5: Overview of the strategy for detecting large deletion breakpoints
Suppl. Figure 6: Comparison of RNA sequence alignments on cDNA and genome sequences
Suppl. Figure 7: Validation of 4 unknown transcripts
Suppl. Figure 8: Distance between unknown transcripts and nearest gene
Suppl. Figure 9: Validation of 15 TBMs

Supplementary Tables

Suppl. Table 1: Whole Genome Sequencing Statistics (10 individuals)
Suppl. Table 2: WGS SNP Accuracy from validation using Illumina 610K genotyping array
Suppl. Table 3: Primers for validations
Suppl. Table 4: Indel list of 10 individuals extracted by whole genome sequencing
Suppl. Table 5: Exome sequencing statistics
Suppl. Table 6: Non-synonymous SNP list detected from 18 individuals (10 whole genome sequencing, 8 whole exome sequencing)
Suppl. Table 7: Functional assessment of nsSNP of 18 individuals
Suppl. Table 8: Super nsSNP gene list
Suppl. Table 9: List of Korean common novel nsSNP LD
Suppl. Table 10: Total 5,496 large deletion list of 8 individuals including breakpoints information
Suppl. Table 11: Validation of large deletions using 24M CGH array data



Suppl. Table 12: Breakpoints list of NA10851
Suppl. Table 13: Motifs on flanking regions of NHEJ large deletions
Suppl. Table 14: RNA Sequencing Statistics
Suppl. Table 15: Comparison of transcriptome alignment methods using pseudogene expression
Suppl. Table 16: Expression map represented in RPKM value on all RefSeq genes
Suppl. Table 17: List of Korean common novel transcripts
Suppl. Table 18: 23 Genes Escape X-inactivation
Suppl. Table 19: 1,809 TBM sites
Suppl. Table 20: 580 Allele Specific Expression sites
Suppl. Table 21: Contig list generated by de novo assembly
Suppl. Table 22: Alignment result of de novo assembled contigs



Supplementary Note

1. Genome Analysis

1.1. Sensitivity of SNP calling

The number of SNPs detected in an individual genome depends on the SNP filter conditions. Even for the same individual, the number of SNPs called by different algorithms can differ substantially (ref. 1; 3.84 million SNPs by ELAND, 4.13 million SNPs by MAQ, 3.61 million SNPs in common). In this project, we focused on the identification of “rare” SNPs. Because false-positive SNPs tend to appear in an individual-specific manner, and are therefore interpreted as rare, the number of rare SNPs is easily overestimated if the filter conditions are relaxed in a way that admits more false positives. We therefore prioritized reducing false positives in our SNP list, maximizing the positive predictive value (PPV) rather than detection sensitivity.

Based on a comparison of whole-genome sequencing and microarray data, we achieved a high PPV (>99.94%) for SNP detection. Experimental validation by PCR and Sanger sequencing also confirmed the accuracy of the SNPs we called (100%). Given a PPV of ~99.94%, we expect approximately 2,000 false positives in each individual genome (3.5 million SNPs x 0.0006).

Excluding AK1 and AK2, which were sequenced on earlier platforms, the sensitivity of SNP detection for the other Korean individuals is ~97%. Most of the variants that we could not detect (false negatives) appear to be covered by only a few reads, too few to call SNPs accurately. For example, in AK4, 62.4% (n = 4,930) of the sites mismatched between sequencing and the 610K microarray were covered fewer than 10 times. The coverage of this genome (23.1x, lower than 30x) may partly account for the insufficient read depth for SNP calling. Interestingly, 74.3% (n = 3,665) of these low-coverage sites lay in regions of high (>55%) or low (<25%) GC content. The GA technology does not provide robust sequence data in genomic regions of high or low GC content (ref. 1).

Because we sequenced 10 individual genomes, the population-level sensitivity of SNP detection is much higher than that for each individual. Compared with the microarray data, we identified ~95% of singletons. However, we identified >99% of SNPs when more than two individuals carried the variant (Supplementary Figure 2).

1.2. Super nsSNP genes

The density of nsSNPs on coding sequences was 0.254/kb on average. During the course of

summarizing nsSNPs, we identified a subset of genes with more nsSNPs than expected. These

“super nsSNP” genes were defined as those in which 1) the average number of nsSNPs was ≥2

among all the individuals whole-genome sequenced, and 2) the nsSNP density was ≥4/kb of coding

sequence.
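The two criteria above can be sketched as a simple filter. This is a minimal illustration, not the authors' code: the function name and arguments are hypothetical, and it assumes that the density is the per-individual mean nsSNP count divided by the coding-sequence length, which the text does not state explicitly.

```python
def is_super_nssnp_gene(nssnp_counts, cds_length_bp,
                        min_mean=2.0, min_density_per_kb=4.0):
    """Apply the two 'super nsSNP' criteria to one gene.

    nssnp_counts: nsSNP count in this gene for each whole-genome-sequenced
                  individual (hypothetical input format).
    cds_length_bp: coding-sequence length of the gene in base pairs.
    Assumption: density = mean nsSNP count per individual / CDS length in kb.
    """
    mean_count = sum(nssnp_counts) / len(nssnp_counts)
    density_per_kb = mean_count / (cds_length_bp / 1000.0)
    return mean_count >= min_mean and density_per_kb >= min_density_per_kb


# A 500-bp coding sequence with about 3 nsSNPs per individual passes both tests.
print(is_super_nssnp_gene([3, 2, 4], 500))
```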

We found that, in some cases, super nsSNP genes may arise as an artifact of hidden duplication, that is, from genes that are frequently duplicated in the population but whose extra copies are absent from the human reference genome. For example, PRIM2 is a super nsSNP gene. However, the read depth for PRIM2 is highly elevated in all whole-genome-sequenced Korean individuals, suggesting that copy number gain of PRIM2 may be frequent in this population (Supplementary Figure 3). Interestingly, the human reference genome includes only a single copy of PRIM2, on chromosome 6, whereas the C. Venter genome (ref. 2) includes PRIM2-homologous DNA segments on chromosome 5. Therefore, during alignment, short reads generated from the homologous segments are mapped to chromosome 6; all mismatches between the chromosome 5 and chromosome 6 segments then appear as “SNPs” (mostly heterozygous), even though there are actually few true variants of PRIM2 in the human genome.

Because the reference genome is imperfect, human resequencing data should be interpreted carefully. Of the 86 super nsSNP genes we identified, 15 showed more than a 30% increase in read depth (Supplementary Figure 4). In addition, 33 were located in segmental duplications of the human genome, and 75 overlapped with known CNV regions archived in the Database of Genomic Variants (DGV). Only 9 super nsSNP genes were not related to increased read depth, segmental duplications, or CNV regions in the DGV. These observations suggest that structural variants could partially account for super nsSNP genes.

1.3. Estimating the number of novel variants as the number of personal genomes increases



We simulated the number of novel variants that would be “discovered” as the number of personal genomes increases. To accomplish this, we first randomly permuted the order of the 10 personal genomes 1,000 times. At each step, we obtained the average number of “new” variants; that is, those that were not archived in genome databases and had not been discovered in the previous steps. This method was applied to SNPs, short indels, and nsSNPs. For nsSNPs, we performed identical tests using nsSNPs from 18 individuals (10 whole genomes and 8 whole exomes) to further confirm the estimates from the 10 individuals.
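The permutation procedure can be sketched as follows. This is an illustrative sketch under stated assumptions: each genome is represented as a set of variant identifiers, `known` stands in for the database variants, and the function name, input format and fixed random seed are hypothetical rather than taken from the study.

```python
import random


def mean_new_variants(genomes, known, n_perm=1000, seed=0):
    """Average number of newly discovered novel variants at each step.

    genomes: list of sets of variant identifiers, one set per individual.
    known:   set of variants already archived in databases.
    The genome order is shuffled n_perm times; at each step only variants
    that are neither known nor seen in earlier genomes are counted.
    """
    rng = random.Random(seed)
    totals = [0] * len(genomes)
    order = list(range(len(genomes)))
    for _ in range(n_perm):
        rng.shuffle(order)
        seen = set(known)
        for step, idx in enumerate(order):
            new = genomes[idx] - seen
            totals[step] += len(new)
            seen |= new
    return [t / n_perm for t in totals]


toy = [{"a", "b"}, {"b", "c"}, {"k"}]
print(mean_new_variants(toy, known={"k"}, n_perm=200))
```

Whatever the order, the per-step counts add up to the number of distinct novel variants, which gives a quick sanity check on any implementation.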

Then, using the number of novel variants at each incremental step, we obtained trend curves by the least-mean-square method. Surprisingly, extrapolation of the trend showed that a substantial number of SNPs would still be identified as novel even after many personal genomes have been deeply sequenced. For example, ~54,000 novel SNPs would be identified after sequencing 100 haploid genomes (50 individuals; <1% allele frequency). Similarly, ~28,000, ~6,000 and ~700 SNPs would be discovered as novel after 100 (<0.5%), 500 (<0.1%) and 5,000 (<0.01%) individuals have been sequenced at high depth, respectively. However, these numbers should be interpreted cautiously, particularly where the extrapolation is extensive, since approximately 2,000 SNPs in each individual genome are expected to be false positives given the PPV of SNP detection (99.94%).
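The allele-frequency bounds quoted in parentheses are consistent with a simple rule of thumb, stated here as an assumption rather than as the authors' derivation: a variant first observed only after $N$ diploid individuals ($2N$ haploid genomes) have been sequenced is expected to have a population allele frequency $f$ below roughly one copy in $2N$.

```latex
f \lesssim \frac{1}{2N}, \qquad
\begin{aligned}
N = 50    &\;\Rightarrow\; f < 1/100    = 1\%    \\
N = 100   &\;\Rightarrow\; f < 1/200    = 0.5\%  \\
N = 500   &\;\Rightarrow\; f < 1/1000   = 0.1\%  \\
N = 5000  &\;\Rightarrow\; f < 1/10000  = 0.01\%
\end{aligned}
```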

The number of novel nsSNPs decreases more slowly than the number of novel SNPs. From these observations, we may conclude that nsSNPs are relatively more diverse than SNPs overall.

1.4. Linkage disequilibrium of novel variants

We examined the linkage disequilibrium (correlation, r²) between novel but common (allele frequency ≥ 2/20 among Koreans) non-synonymous SNPs and surrounding SNPs (within 20 kb upstream and downstream) among the 10 whole-genome-sequenced individuals. We were particularly interested in the linkage between novel nsSNPs and known variants, or tagging SNPs on common genotyping arrays, since the novel nsSNPs are likely to be functional. However, they can be detected as candidates for complex diseases in genome-wide association studies (GWAS) only if their linkage with known variants is tight.
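As an illustration of the r² statistic, the sketch below computes the squared Pearson correlation between two loci from unphased genotypes coded as 0, 1 or 2 copies of the variant allele. This genotype-correlation estimator is an assumption chosen for simplicity; the text does not specify how r² was computed, and the function name is hypothetical.

```python
def r_squared(geno_a, geno_b):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding).

    Returns None when either locus is monomorphic in the sample, because the
    correlation is undefined in that case.
    """
    n = len(geno_a)
    mean_a = sum(geno_a) / n
    mean_b = sum(geno_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(geno_a, geno_b))
    var_a = sum((a - mean_a) ** 2 for a in geno_a)
    var_b = sum((b - mean_b) ** 2 for b in geno_b)
    if var_a == 0 or var_b == 0:
        return None
    return cov * cov / (var_a * var_b)


# Ten individuals: a novel nsSNP and a neighbouring known SNP in perfect linkage.
novel = [0, 1, 0, 0, 1, 0, 0, 0, 2, 0]
known = [0, 1, 0, 0, 1, 0, 0, 0, 2, 0]
print(r_squared(novel, known))
```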



1.5. Detection of large deletions

Because of the incompleteness of the human reference genome (e.g., PRIM2) and genomic characteristics that make some regions “inaccessible” to sequencing technologies (e.g., extremely high GC ratio, repetitive sequence, gaps), whole-genome resequencing of a single individual against the human reference genome is not a feasible or accurate approach for identifying structural variations that are truly polymorphic among human populations. In this manuscript, by comparing multiple genomes in parallel, we could identify reliable “polymorphisms” existing in human populations.

To find structural variations (SVs), we used read-depth (RD), paired-end (PE) read, and split-read information from each personal genome and performed pairwise comparisons between genomes. First, we calculated the normalized (to 25.0x) personal whole-genome read depth of coverage in 30-bp windows as follows:

$$\text{normalized RD}_{\text{30 bp window, person}} = \frac{\text{RD}_{\text{30 bp window, person}} \times 25.0\text{x}}{\text{average whole-genome RD}_{\text{person}}}$$

Then we compared the 30-bp window coverage between two individuals,

$$\text{RD deviation} = \frac{\text{normalized RD}_{\text{30 bp window, person A}}}{\text{normalized RD}_{\text{30 bp window, person B}}},$$

using all available combinations ($N = {}_{8}C_{2} = 28$). To be defined as a deletion candidate, a region had to show an increased (4/3 = 1.33x) or decreased (3/4 = 0.75x) read-depth deviation for more than 33 windows in a row (>1 kb long). Likewise, to identify regions of homozygous deletion in all individuals considered, we also counted regions in which the read depth for all individuals was less than 5x for more than 33 windows in a row (>1 kb long). Because CN loss predominates over CN gain in the human genome, we proceeded to the next step under the assumption that all candidate regions identified above correspond to CN loss.
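The window-based screen can be sketched as below. This is a minimal sketch, not the authors' pipeline: the function names are hypothetical, per-window read depths are assumed to be precomputed lists, the deviation is taken as the ratio of the two normalized depths, and the thresholds are applied inclusively (≥4/3 or ≤3/4), which the text does not specify.

```python
def normalize_rd(window_rd, target=25.0):
    """Scale per-window read depth so that the genome-wide mean equals target."""
    mean_rd = sum(window_rd) / len(window_rd)
    return [rd * target / mean_rd for rd in window_rd]


def candidate_runs(norm_a, norm_b, min_windows=34, hi=4 / 3, lo=3 / 4):
    """Find runs of consecutive windows whose A/B depth ratio is deviant.

    Returns (start, end, direction) tuples with end exclusive; direction is
    +1 for ratio >= hi and -1 for ratio <= lo. 'More than 33 windows in a
    row' is encoded as min_windows=34.
    """
    def state(a, b):
        if b == 0:  # avoid division by zero; treat as increased if A has depth
            return 1 if a > 0 else 0
        ratio = a / b
        return 1 if ratio >= hi else (-1 if ratio <= lo else 0)

    states = [state(a, b) for a, b in zip(norm_a, norm_b)] + [0]  # sentinel
    runs, start, current = [], 0, 0
    for i, s in enumerate(states):
        if s != current:
            if current != 0 and i - start >= min_windows:
                runs.append((start, i, current))
            start, current = i, s
    return runs


person_a = [10.0] * 40 + [30.0] * 10   # 40 windows (1.2 kb) at one third depth
person_b = [30.0] * 50
print(candidate_runs(person_a, person_b))
```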

For the next step, we investigated stretched paired-end reads, aligning each end onto each flanking

region of large deletion candidates (defined as <1 kb from the estimated junction). We regarded

deletion candidate regions with ≥2 stretched paired-end reads as suggestive. However, existence of

stretched paired-end reads is not an essential prerequisite for large deletion; thus, we did not remove

candidates with one or no paired-end reads in this step.



Thereafter, we checked read-depth changes and paired-end reads near deletion candidate regions for all individuals in parallel. Regions with unstable read depths nearby, which make it difficult to determine a deletion, were removed. If a region in an individual exhibited a remarkable read-depth decline compared to flanking regions, or if read depths were variable among individuals, it was considered a large deletion. If stretched reads in any individual supported a large deletion region, all individuals who showed a clear decline in read depth for that region were regarded as carrying the deletion in their genome. If all individuals showed a read depth of approximately zero (RD < 3), we regarded the region as a unimorphic CN loss.

1.6. Identification of large deletion breakpoints

To identify nucleotide-resolution breakpoints of large deletions, we used “orphan reads”, paired-end reads for which only one end is successfully aligned to the reference genome. We collected all orphan reads that mapped within 1 kb of an estimated large-deletion boundary. The unmapped ends of these orphan reads were then re-aligned to the reference genome using the BLAT program (ref. 3) to check whether each could be split and aligned separately (“split reads”) to both sides of the large-deletion region. The alignment information of the split reads provides nucleotide-resolution breakpoints of large deletions.

To collect the most reliable breakpoints for each large deletion, we picked the split-read group that had the greatest number of split reads with the same gap coordinates. When more than one split-read group with different gap coordinates had the same number of split reads, we chose the group whose gap size was closest to the estimated deletion size. In this way, we summarized the BLAT results of orphan reads into one best split-read group for each detected large deletion.
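The selection rule can be sketched as follows, assuming each split read has already been reduced to its gap coordinates as a (start, end) pair; the function name and input format are hypothetical.

```python
from collections import Counter


def best_split_group(split_read_gaps, estimated_size):
    """Pick the gap coordinates supported by the most split reads.

    split_read_gaps: list of (gap_start, gap_end) tuples, one per split read.
    Ties in support are broken by the gap size closest to estimated_size.
    Returns None if there are no split reads.
    """
    counts = Counter(split_read_gaps)
    if not counts:
        return None
    top = max(counts.values())
    tied = [gap for gap, n in counts.items() if n == top]
    return min(tied, key=lambda gap: abs((gap[1] - gap[0]) - estimated_size))


gaps = [(100, 1200)] * 3 + [(90, 1500)] * 3 + [(100, 1300)]
print(best_split_group(gaps, estimated_size=1000))
```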

1.7. Inferring molecular mechanisms for large deletion formation

To infer the mechanisms of large deletion formation, we classified the 5,496 large deletions we detected into four mechanisms (VNTR, NAHR, TEI and NHEJ) using the DNA sequences of the large deletions and their breakpoint junctions, through an algorithm slightly modified from BreakSeq (ref. 4). If >80% of a large deletion is covered by simple repeat sequences predicted by Tandem Repeats Finder (ref. 5), it is categorized as VNTR. Next, we compared the flanking sequences at both ends of each deletion using BLAST. If the two sequences share exact homology across the breakpoints, the deletion is classified as NAHR. Large deletions not yet categorized as VNTR or NAHR are annotated by RepeatMasker, and if a deletion is completely covered by a transposable element, e.g. Alu or LINE, it is classified as TEI. Finally, the remaining large deletions, which lack all of the former patterns, are classified as NHEJ.
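The order of the tests matters, since each class is assigned only to deletions not captured by an earlier rule. The cascade can be sketched as below, assuming the three inputs have already been derived from Tandem Repeats Finder, BLAST and RepeatMasker output; the function and argument names are hypothetical.

```python
def classify_deletion(tandem_repeat_fraction, flanks_share_homology,
                      covered_by_transposable_element):
    """Assign one formation mechanism to a large deletion.

    tandem_repeat_fraction: fraction (0-1) of the deletion covered by simple
        repeats.
    flanks_share_homology: True if the two flanking sequences share exact
        homology across the breakpoints.
    covered_by_transposable_element: True if the deletion is completely
        covered by a transposable element such as Alu or LINE.
    """
    if tandem_repeat_fraction > 0.80:
        return "VNTR"
    if flanks_share_homology:
        return "NAHR"
    if covered_by_transposable_element:
        return "TEI"
    return "NHEJ"


print(classify_deletion(0.10, False, True))
```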

1.8. Microhomology-sequence motifs for NHEJ deletions

We assessed microhomology-sequence motifs of ≤10 bp within 400 bp upstream and downstream of the breakpoints of NHEJ large deletions (or of the estimated boundaries if breakpoints were not detected). Our large deletion set contains 3,664 large deletions attributed to NHEJ. Large deletions with identical positions were collapsed, yielding a final set of 1,022 non-redundant NHEJ deletions and 2,044 flanking DNA sequences, each 400 bp long.

We used MEME (Motif-based sequence analysis tools, http://meme.ncbr.net/) (ref. 6) to detect the microhomology sequences. The three most significant motifs are shown in Supplementary Table 13.

1.9. De novo assembly of short reads not aligned onto human genome reference

In order to find novel contigs that might be present in Korean individuals, we conducted a de novo sequence assembly using reads not aligned to the reference human genome, gathered from the sequencing of eight Korean individuals. Before merging vertices into contigs, we discarded all reads that contained any ambiguous bases (“N”s) and those with the lowest base quality scores (“B”s), which left about 15G reads of sequence data. De novo assembly of the filtered reads was carried out using the ABySS short-read assembler (ref. 7), version 1.2.1, with the MPI (Message Passing Interface) protocol. To assess assembly performance for different overlapping sub-string (k-mer) sizes, we compared assemblies of 181.2 million paired-end reads for k-values ranging from 25 to 34 bp, and found the optimal k-mer size (32 bp) with parameters of four coverage depths and two erode bases. We then aligned the assembled contigs longer than 1,000 bp to the reference human genome (hg19) and the HuRef genome sequence using NCBI BLAST version 2.2.22 (parameters: e = 1 x 10^-20, a = 23, F = false, -X = 1000), and, using our own scripts, chose those contigs for which less than 99% mapped onto these genomes. In addition, we aligned these contigs to the common chimpanzee whole-genome shotgun draft assembly (Pan troglodytes 2), ultimately retaining contigs with a mapping ratio greater than 99%. To analyze the context of the remaining contigs, we aligned DNA and RNA sequencing data from eight Koreans to those contigs. Moreover, we aligned the sequence read data of YH (ref. 8), NA10851 (ref. 9), NA12878 (ref. 10), NA18507 (ref. 1), NA19240 (ref. 10), ABT (ref. 11), KB1 (ref. 11), and Eskimo (ref. 12) onto each contig using the GSNAP alignment tool (ref. 13), selecting the options “exact matches” and “unique alignment”, and then compared the number of aligned contigs in each genome.



2. Transcriptome Analysis

2.1. Sequence alignment for transcriptome

Using the GSNAP alignment tool, we aligned short reads from transcriptome sequencing to a set of

constructed mRNA sequences instead of the reference human genome to avoid mapping errors

resulting from mRNA splicing. As introduced in the main manuscript (Figure 5a), if short reads are

mapped to genomic sequences, short reads containing splice junctions usually cannot be aligned in

situ. The reads appear to be non-mapped, or mapped to ectopic sites, such as pseudogenes, which

have sequences that are highly homologous to the in situ sequences but lack introns. As a result, (1)

read depths for loci near splice junctions usually decrease; (2) pseudogenes appear to be transcribed,

especially near the sequences of splice junctions in real genes; and (3) false-positive variants are

detected due to ectopic alignments (Supplementary Figure 6 and Supplementary Table 15).

We generated the mRNA sequences set using information about exons from RefSeq, UCSC, and

Ensembl gene databases. All information was downloaded from the UCSC genome browser. Exons

for a total of 161,250 genes were available (33,907 from RefSeq, 65,271 from UCSC and 62,072 from

Ensembl). The mRNA sequences were generated from human reference genome NCBI Build 36.3

based on their exonic positions.

After mapping the short reads from transcriptome sequencing onto the set of 161,250 mRNA sequences, the mapping information for each base (i.e., read depth, type and number of mismatches) was converted from mRNA coordinates to genomic coordinates. The results of transcriptome sequencing, such as expression levels and variant information, were obtained from this mapping information.
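Converting a position on a spliced mRNA back to the genome amounts to walking through the exons in transcript order. The sketch below illustrates the idea under assumed conventions (0-based, half-open exon intervals sorted by genomic position); the function name and coordinate conventions are hypothetical and are not taken from the authors' pipeline.

```python
def mrna_to_genome(mrna_pos, exons, strand="+"):
    """Map a 0-based mRNA position to a 0-based genomic position.

    exons: list of (start, end) genomic intervals, half-open, sorted in
           ascending genomic order.
    strand: "+" or "-"; on the minus strand the transcript starts at the
            last exon and runs towards lower genomic coordinates.
    """
    ordered = exons if strand == "+" else list(reversed(exons))
    offset = mrna_pos
    for start, end in ordered:
        length = end - start
        if offset < length:
            return start + offset if strand == "+" else end - 1 - offset
        offset -= length
    raise ValueError("position lies beyond the end of the transcript")


exons = [(100, 110), (200, 220)]     # two exons, 30 bp of mRNA in total
print(mrna_to_genome(10, exons))     # first base of the second exon
```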

2.2. Expression mapping

We examined the expression level of human genes (RefSeq genes), normalized as reads per kilobase of exon per million mapped reads (RPKM) values (ref. 14), calculated with the following equation:

$$\text{RPKM} = 10^{9} \times \frac{C}{N L}$$

where C is the number of reads mapped to the gene, N is the total number of mapped reads in the experiment, and L is the length of the gene in base pairs. Using a threshold of >1 RPKM (ref. 14), we identified 11,101 genes in active transcription in lymphoblastoid cell lines.
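As a worked example of the RPKM formula, the sketch below evaluates it for one gene; the function name and the example numbers are illustrative only.

```python
def rpkm(reads_on_gene, total_mapped_reads, gene_length_bp):
    """RPKM = 10^9 * C / (N * L), with C, N and L as defined in the text."""
    return 1e9 * reads_on_gene / (total_mapped_reads * gene_length_bp)


# 100 reads on a 1-kb gene in an experiment with one million mapped reads.
print(rpkm(100, 1_000_000, 1000))
```

With these numbers the gene has 100 reads per kilobase and the experiment has exactly one million mapped reads, so the value is 100 RPKM, well above the >1 RPKM threshold used for active transcription.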

2.3. Unknown transcripts

Unknown transcripts were detected by selecting transcriptome short reads that aligned outside of currently known genic regions. First, reads that failed to align to the 161,250 mRNA sequences in the transcriptome pipeline were re-aligned to the human reference genome NCBI build 36.3. Then, short reads overlapping any known gene region were removed based on information from four databases: (1) known genes from RefSeq (downloaded on 12 Sep. 2010), (2) UCSC (downloaded on 10 May 2009), (3) Ensembl (downloaded on 9 Aug. 2009), and (4) known mRNA from GenBank (downloaded on 12 Sep. 2010). All database information was downloaded from repositories at the UCSC genome browser (http://genome.ucsc.edu). To be conservative in identifying unknown transcripts, we also removed short reads that overlapped any human expressed sequence tags (ESTs) (downloaded from the UCSC genome browser). Short reads that overlapped known genic regions by ≥ 1 bp were filtered out. As a result, we obtained short reads that did not overlap any known genic region. These reads were collapsed to construct unknown transcript regions for each individual. Unknown transcript regions with average read depths < 4 were considered insignificant and were removed. Finally, to make our interpretation more conservative, we removed unknown transcript regions found in only a single individual.
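The collapsing and depth-filtering steps can be sketched as an interval merge. This is a simplified illustration for a single chromosome in one individual: reads are assumed to be half-open (start, end) intervals that have already passed the known-gene and EST filters, reads are merged when they overlap by at least 1 bp, and the function names are hypothetical.

```python
def collapse_reads(reads):
    """Merge overlapping reads into regions.

    reads: list of (start, end) half-open intervals on one chromosome.
    Returns (start, end, mean_depth) tuples, where mean_depth is the total
    number of aligned bases divided by the region length.
    """
    regions = []
    for start, end in sorted(reads):
        if regions and start < regions[-1][1]:      # overlaps current region
            regions[-1][1] = max(regions[-1][1], end)
            regions[-1][2] += end - start
        else:
            regions.append([start, end, end - start])
    return [(s, e, bases / (e - s)) for s, e, bases in regions]


def significant_regions(reads, min_depth=4.0):
    """Keep regions whose average read depth is at least min_depth."""
    return [r for r in collapse_reads(reads) if r[2] >= min_depth]


reads = [(0, 10)] * 4 + [(5, 15)] * 4 + [(100, 110)]
print(significant_regions(reads))
```

The final filter in the text, removing regions seen in only one individual, would then compare the surviving regions across individuals.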

2.4. Genes escape X-inactivation

To find X-chromosome genes that are expressed at higher levels in females than in males, we compared gene expression levels between genders. First, of the 948 genes on the X chromosome, we removed 585 genes with expression of < 1 RPKM. Then, we removed 186 genes for which the average expression in females was lower than that in males. Using the expression levels of the remaining 177 genes, we performed a Wilcoxon rank-sum test to analyze differences between the two groups. This non-parametric method was used because the sample size was small for both groups (9 males, 6 females). With these group sizes, the minimum attainable significance level was 0.0018 (e.g. XIST), reached when the expression level in all six females was greater than that in all males. We selected 23 genes as candidates for escape from X-inactivation at a significance level of 0.05.
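The minimum p-value of 0.0018 can be reproduced with a two-sided rank-sum test that uses the normal approximation with a continuity correction. That choice is an assumption made here because it matches the reported value; the text does not state which variant of the test was used, and the function name and expression values below are illustrative.

```python
import math


def rank_sum_p(group_x, group_y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation).

    Uses the Mann-Whitney U statistic with a continuity correction of 0.5
    and no correction for ties.
    """
    n1, n2 = len(group_x), len(group_y)
    # U counts pairs in which the x value exceeds the y value (ties count 0.5).
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0
            for a in group_x for b in group_y)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = max(abs(u - mean_u) - 0.5, 0.0) / sd_u
    return math.erfc(z / math.sqrt(2))


females = [10, 11, 12, 13, 14, 15]        # every female above every male
males = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(round(rank_sum_p(females, males), 4))
```

With complete separation of 6 females from 9 males, U = 54, the expected value is 27 and the standard deviation is the square root of 72, giving z of about 3.12 and a two-sided p-value of about 0.0018.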

To estimate false discovery rates (FDRs) for establishing a cut-off value, we calculated q-values from the p-values of the 162 genes tested, using the QVALUE software (ref. 15). The FDR values for the most significant (XIST, p-value = 0.0018) and minimally suggestive (ALG13, p-value = 0.0392) genes were 0.017 and 0.151, respectively.



3. Comprehensive analysis of genome and transcriptome

3.1. Transcriptional base modification (TBM)

We compared the genome and transcriptome sequences of 15 individuals (seven sets of whole genomes and transcriptomes, and eight sets of whole exomes and transcriptomes), sequenced using an Illumina Genome Analyzer. To identify modifications of RNA relative to the genomic sequence, we identified loci where variations were present in the transcriptome but not in the genome sequence. SNPs in the transcriptome were defined as loci containing more than three identical mismatches in high-quality reads (Q score > 15) with a mismatch allele frequency ≥ 20%. Genomic loci with no variants were defined as those where (1) at least five high-quality reads existed; (2) one or fewer mismatches were found; and (3) the mismatch allele frequency was < 10%, which allows one mismatch only when more than 10 reads are present and supports the existence of a wild-type allele. For the eight exomes, we included an additional criterion, since many UTR regions were not covered by the hybridization capture system: we considered all known SNPs in dbSNP130 to be potential genomic variants in the exome-sequenced individuals. Conversely, loci not reported as variants in dbSNP130 were considered to be wild-type.
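The per-locus criteria can be written as two predicates, one for the transcriptome and one for the genome. The sketch below assumes that the counts refer to high-quality reads only (Q score > 15) and that the mismatch count is for a single identical mismatch allele; the function names are hypothetical, and the dbSNP130 rule for exome samples is not modeled.

```python
def transcriptome_variant(n_mismatch, n_reads):
    """More than three identical mismatches and allele frequency >= 20%."""
    return n_reads > 0 and n_mismatch > 3 and n_mismatch / n_reads >= 0.20


def genome_wild_type(n_mismatch, n_reads):
    """At least five reads, at most one mismatch, allele frequency < 10%."""
    return n_reads >= 5 and n_mismatch <= 1 and n_mismatch / n_reads < 0.10


def is_tbm_candidate(rna_mismatch, rna_reads, dna_mismatch, dna_reads):
    """Variant in the transcriptome at a locus that is wild-type in the genome."""
    return (transcriptome_variant(rna_mismatch, rna_reads)
            and genome_wild_type(dna_mismatch, dna_reads))


# One genomic mismatch is tolerated only when more than 10 reads cover the site.
print(genome_wild_type(1, 10), genome_wild_type(1, 11))
```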

Because most of the errors in transcriptome sequencing would be detected as false positives, we

further filtered the results using conservative filter criteria. First, we filtered out singletons; that is,

those found in only one of the 15 individuals. Candidate loci found in only exome-sequenced

individuals were also removed. Then, we aligned 61-bp long sequences, comprising 30 bp upstream,

the variant allele, and 30 bp downstream of each candidate site, with the Human Reference Genome

Build 36.3 using BLAT. If the sequence matched perfectly to any region, it was removed, since the

modified RNA sequence might be transcribed from the perfectly matched region rather than from the

candidate TBM site.

3.2. Allele-specific expression (ASE)

We explored the expression level of each allele at heterozygous nsSNPs. Among the 28,042 nsSNPs found in 15 individuals, we identified 4,867 loci where two or more individuals had heterozygous SNPs and the corresponding gene was active in transcription. We calculated the read counts for wild-type and variant alleles in both genome and transcriptome sequencing results for each individual carrying the heterozygous SNP. If the read depth for either genome or transcriptome was < 5, the individual was regarded as non-informative for the given variant, and the corresponding locus was removed.

Using read counts for each category, we constructed a 2 x 2 table for each individual at each heterozygous locus, as shown below.

                 Wild type   Variant   Total
Genome           A           B         A+B
Transcriptome    C           D         C+D
Total            A+C         B+D       A+B+C+D

We carried out Fisher's exact test to test for unbalanced expression. If the resulting p-value was <0.10, the individual was considered suggestive for allele-specific expression at the locus. Loci with fewer than two suggestive individuals were considered to be non-informative, and were removed. Preferential expression (PE) was quantified using the following formula:

$$\text{PE} = \frac{1}{n} \sum_{\text{individual}=1}^{n} \left( \text{VariantFrequency}_{\text{individual, transcriptome}} - \text{VariantFrequency}_{\text{individual, genome}} \right),$$

where n is the number of suggestive individuals for the locus. If the PE for a locus was between -0.2 and 0.2, the magnitude of allele-specific expression was regarded as insufficient and the locus was removed.
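The per-individual test and the PE statistic can be sketched in plain Python. The two-sided Fisher's exact test below sums the probabilities of all tables that are no more likely than the observed one, which is a common convention but an assumption here, since the text does not say which two-sided definition was used. Function names and example values are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Rows are genome and transcriptome; columns are wild-type and variant
    read counts, matching the layout described in the text.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x wild-type reads in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def preferential_expression(freq_pairs):
    """Mean of (transcriptome - genome) variant allele frequency.

    freq_pairs: one (transcriptome_freq, genome_freq) pair per suggestive
    individual.
    """
    return sum(t - g for t, g in freq_pairs) / len(freq_pairs)


print(fisher_exact_two_sided(3, 0, 0, 3))
print(preferential_expression([(0.8, 0.5), (0.9, 0.5)]))
```

For a locus with two suggestive individuals whose variant frequencies are 0.8 and 0.9 in the transcriptome and 0.5 in the genome, PE is 0.35, which lies outside the -0.2 to 0.2 band and would therefore be retained.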

Finally, we constructed a new 2 x 2 table using read counts from all suggestive individuals for all suggestive regions, and then repeated Fisher's exact test to obtain a significance level. Applying Bonferroni's correction for multiple testing (n = 4,867), we removed loci with p-values > 1.027 x 10^-5 (0.05/4,867). In total, we identified 580 loci that satisfied the criteria for allele-specific expression.



References

1 Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator

chemistry. Nature 456, 53-59, (2008).

2 Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol 5, e254, (2007).

3 Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Res 12, 656-664, (2002).

4 Lam, H. Y. et al. Nucleotide-resolution analysis of structural variants using BreakSeq and a

breakpoint library. Nat Biotechnol 28, 47-55, (2010).

5 Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27,

573-580, (1999).

6 Bailey, T. L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs

in biopolymers. Proc Int Conf Intell Syst Mol Biol 2, 28-36, (1994).

7 Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res 19,

1117-1123, (2009).

8 Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, 60-65, (2008).

9 Ju, Y. S. et al. Reference-unbiased copy number variant analysis using CGH microarrays. Nucleic

Acids Res, (2010).

10 Durbin, R. M. et al. A map of human genome variation from population-scale sequencing. Nature

467, 1061-1073, (2010).

11 Schuster, S. C. et al. Complete Khoisan and Bantu genomes from southern Africa. Nature 463,

943-947, (2010).

12 Rasmussen, M. et al. Ancient human genome sequence of an extinct Palaeo-Eskimo. Nature 463,

757-762, (2010).

13 Wu, T. D. & Nacu, S. Fast and SNP-tolerant detection of complex variants and splicing in short

reads. Bioinformatics 26, 873-881, (2010).

14 Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying

mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621-628, (2008).

15 Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proc Natl Acad Sci

U S A 100, 9440-9445, (2003).



Supplementary Figures

Supplementary Figure 1. Experimental overview of the study. To identify functional genomic

variants, we performed whole-genome sequencing, targeted-exome capture sequencing, and

transcriptome sequencing.



Supplementary Figure 2. Sensitivity of SNP detection at the population level



Supplementary Figure 3. Description of PRIM2 gene - Example of super nsSNP gene



Supplementary Figure 4. Super nsSNP genes compared to Segmental Duplication (SD), Increase of Read Depth (RD) and Database of Genomic Variants (DGV)



Supplementary Figure 5. Overview of the strategy for detecting nucleotide-resolution large

deletion breakpoints



Supplementary Figure 6. Aligning RNA short reads to cDNA sequences gives better results (higher expression levels) than aligning them to the reference genome sequence. Gene expression levels (genes on chromosome 1) assessed by cDNA alignment and by genome alignment are compared. Genome alignment loses many short reads, especially those spanning splice junctions.



Supplementary Figure 7. Validation results for 4 unknown transcripts using PCR and gel electrophoresis



Supplementary Figure 8. Distance between each unknown transcript and its nearest gene



Supplementary Figure 9. Validation results of 15 TBM sites by Sanger Sequencing of DNA (above)

and RNA (below)



Supplementary Tables

Supplementary Table 1. Whole Genome Sequencing Statistics (10 individuals)

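ID | Gender | Total bases (individual) | Aligned bases (individual) | Read length | Total reads | Total bases | Aligned reads | Aligned bases
AK1 | Male | 107,635,576,460 | 79,278,726,228 | 1 x 36 | 519,486,218 | 18,701,503,848 | 364,705,892 | 13,129,412,112
AK1 | Male | 107,635,576,460 | 79,278,726,228 | 2 x 36 | 1,646,543,336 | 59,275,560,096 | 1,360,491,421 | 48,977,691,156
AK1 | Male | 107,635,576,460 | 79,278,726,228 | 2 x 88 | 123,322,768 | 10,852,403,584 | 99,363,440 | 8,743,982,720
AK1 | Male | 107,635,576,460 | 79,278,726,228 | 2 x 106 | 177,416,122 | 18,806,108,932 | 79,506,040 | 8,427,640,240
AK2* | Female | 328,846,011,200 | 78,911,480,450** | 2 x 25 | 6,371,995,780 | 159,299,894,500 | 1,451,098,162 | 36,277,454,050
AK2* | Female | 328,846,011,200 | 78,911,480,450** | 2 x 50 | 3,390,922,334 | 169,546,116,700 | 852,680,528 | 42,634,026,400
AK3 | Male | 89,154,943,968 | 73,664,437,785 | 2 x 76 | 1,173,091,368 | 89,154,943,968 | 969,340,034 | 73,664,437,785
AK4 | Female | 77,201,068,724 | 66,125,950,244 | 2 x 76 | 444,312,562 | 33,767,754,712 | 403,774,458 | 30,684,050,534
AK4 | Female | 77,201,068,724 | 66,125,950,244 | 2 x 101 | 430,032,812 | 43,433,314,012 | 350,934,340 | 35,441,899,710
AK5 | Male | 89,182,771,792 | 74,703,930,798 | 2 x 36 | 297,272,572 | 10,701,812,592 | 270,105,201 | 9,723,384,303
AK5 | Male | 89,182,771,792 | 74,703,930,798 | 2 x 76 | 1,032,644,200 | 78,480,959,200 | 855,072,031 | 64,980,546,495
AK6 | Female | 73,502,467,582 | 63,796,933,848 | 2 x 36 | 55,752,362 | 2,007,085,032 | 52,073,655 | 1,874,563,411
AK6 | Female | 73,502,467,582 | 63,796,933,848 | 2 x 76 | 540,079,624 | 41,046,051,424 | 486,485,356 | 36,969,646,182
AK6 | Female | 73,502,467,582 | 63,796,933,848 | 2 x 101 | 301,478,526 | 30,449,331,126 | 247,079,721 | 24,952,724,255
AK7 | Male | 103,902,771,632 | 87,557,181,002 | 2 x 76 | 1,367,141,732 | 103,902,771,632 | 1,152,156,028 | 87,557,181,002
AK9 | Male | 92,883,089,498 | 77,220,870,831 | 2 x 151 | 615,119,798 | 92,883,089,498 | 511,452,344 | 77,220,870,831
AK14 | Female | 84,084,604,278 | 74,670,029,486 | 2 x 76 | 287,229,228 | 21,829,421,328 | 262,486,993 | 19,947,194,343
AK14 | Female | 84,084,604,278 | 74,670,029,486 | 2 x 101 | 616,387,950 | 62,255,182,950 | 541,867,353 | 54,722,835,143
AK20 | Female | 74,985,049,228 | 64,940,569,874 | 2 x 36 | 320,810,256 | 11,549,169,216 | 302,486,715 | 10,889,034,343
AK20 | Female | 74,985,049,228 | 64,940,569,874 | 2 x 76 | 216,223,192 | 16,432,962,592 | 194,812,391 | 14,804,360,307
AK20 | Female | 74,985,049,228 | 64,940,569,874 | 2 x 101 | 465,375,420 | 47,002,917,420 | 388,626,600 | 39,247,175,224

The "Total reads" and "Total bases" columns belong to the "Total short read data" group; the "Aligned reads" and "Aligned bases" columns belong to the "Aligned short read data" group. The per-individual totals are the sums over that individual's read-length rows.

* Statistics for AK2, sequenced on the SOLiD platform, are based on current SRA010321 data (SRX018824, SRX018829, SRX018830, SRX018821)

** AK2 statistics exclude redundant pairs from the calculation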



Supplementary Table 2. WGS SNP Accuracy from validation using Illumina 610K genotyping array

Rows are whole genome sequencing (WGS) genotypes; the three count columns are Illumina 610K genotypes.

Individual | WGS genotype | 610K Wildtype | 610K Hetero SNP | 610K Homo SNP | PPV | Genotype accuracy | Sensitivity (total) | Sensitivity (homo) | Sensitivity (hetero)
AK1 | Wildtype | 281159 | 10409 | 1428 | 0.9997275 | 0.9938772 | 0.959271378 | 0.98891 | 0.935696
AK1 | Hetero SNP | 63 | 150036 | 281
AK1 | Homo SNP | 13 | 1426 | 127051
AK2 | Wildtype | 281212 | 15498 | 956 | 0.9996608 | 0.991047292 | 0.943371616 | 0.992618 | 0.903771
AK2 | Hetero SNP | 83 | 144696 | 1594
AK2 | Homo SNP | 10 | 860 | 126957
AK4 | Wildtype | 281099 | 7181 | 1331 | 0.9996718 | 0.996173154 | 0.970826736 | 0.9896 | 0.956157
AK4 | Hetero SNP | 83 | 156453 | 929
AK4 | Homo SNP | 10 | 155 | 125725
AK5 | Wildtype | 281733 | 8117 | 2405 | 0.9996295 | 0.995281657 | 0.963857946 | 0.981356 | 0.949936
AK5 | Hetero SNP | 97 | 153923 | 1233
AK5 | Homo SNP | 7 | 91 | 125360
AK6 | Wildtype | 281588 | 10141 | 2377 | 0.9996486 | 0.994877351 | 0.957024169 | 0.981505 | 0.937692
AK6 | Hetero SNP | 91 | 152428 | 1240
AK6 | Homo SNP | 7 | 188 | 124906
AK7 | Wildtype | 281215 | 9027 | 4191 | 0.9996589 | 0.995934463 | 0.954679485 | 0.967176 | 0.944949
AK7 | Hetero SNP | 90 | 154914 | 1099
AK7 | Homo SNP | 5 | 33 | 122392
AK14 | Wildtype | 280873 | 5585 | 936 | 0.9996148 | 0.996034498 | 0.977666508 | 0.992774 | 0.96562
AK14 | Hetero SNP | 100 | 156749 | 1016
AK14 | Homo SNP | 10 | 116 | 127581
AK20 | Wildtype | 281511 | 6923 | 828 | 0.999496 | 0.989956306 | 0.973392788 | 0.993618 | 0.95715
AK20 | Hetero SNP | 134 | 154497 | 2703
AK20 | Homo SNP | 9 | 145 | 126216

Definition (cells of the 3 x 3 genotype matrix):
Wildtype | a | b | c
Hetero SNP | d | e | f
Homo SNP | g | h | i

PPV = (e+f+h+i)/(d+e+f+g+h+i)
Genotype accuracy = (e+i)/(e+f+h+i)
Sensitivity_total = (e+f+h+i)/(b+c+e+f+h+i)
Sensitivity_homo = (f+i)/(c+f+i)
Sensitivity_hetero = (e+h)/(b+e+h)
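The definitions in Supplementary Table 2 can be checked with a short calculation. The sketch below is illustrative only: the function name snp_call_metrics is hypothetical, and it assumes that matrix rows are WGS genotypes and columns are Illumina 610K genotypes, each in the order wildtype, hetero SNP, homo SNP (cells a to i as in the definition). The AK1 counts are copied from the table.

```python
def snp_call_metrics(matrix):
    """Compute the five Table 2 metrics from a 3 x 3 genotype count matrix.

    Assumption: rows are WGS genotypes and columns are array genotypes,
    both ordered wildtype, hetero SNP, homo SNP (cells a..i).
    """
    (a, b, c), (d, e, f), (g, h, i) = matrix
    # SNPs called by WGS that the array also reports as non-reference
    confirmed = e + f + h + i
    return {
        "ppv": confirmed / (d + e + f + g + h + i),
        "genotype_accuracy": (e + i) / confirmed,
        "sensitivity_total": confirmed / (b + c + confirmed),
        "sensitivity_homo": (f + i) / (c + f + i),
        "sensitivity_hetero": (e + h) / (b + e + h),
    }


# AK1 counts from Supplementary Table 2
ak1 = [
    [281159, 10409, 1428],
    [63, 150036, 281],
    [13, 1426, 127051],
]
metrics = snp_call_metrics(ak1)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

With the AK1 counts, the computed values agree with the tabulated ones to the precision shown in the table.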



Supplementary Table 3. Primer list used for validation of SNPs, indels, novel transcripts and RNA modification

SuppTable3_Validation_Primer_List.xls

Depicted below is a preview of the full version.

ID individual position F primer R primer

SNP01 AK3 chr1:61693622 ACGTAGATCCTGATTTCGTGGT TGACCATAATGCTTGCTGTTTC

SNP02 AK3 chr10:88920229 TTGTACATGTTTTAGAGAAAGCAAA TTGAAGGTGCTCCAATTCTACA

SNP03 AK3 chr15:49853231 AGGAATCTCGGTTGGATATGAA TCAGGACTAACCTGCAAGATCA

SNP04 AK3 chr21:42786256 ATCCAGTCAAGTCAACGGTTCT CAACTTTAAGGTGGGAAAGGTG

SNP05 AK3 chr3:101756793 CAGATTCTGGCAATGAAATGTCT TTTTGGAACCAAGATAGCAGGT

SNP06 AK3 chr5:66091330 ACAGTGTTGGGAGAATGGAGTT GCAGGACCTTGTAAAGAAATGC

SNP07 AK5 chr1:215871347 GGAGTGAACAAAAAGTCGAACC CGCTGAGGAACAACTGGTATAA

SNP08 AK5 chr19:61165270 TGAAATGAGAAACCTCGTGATG CAGTAGCTTTTGCAGTTTGCAC

SNP09 AK5 chr3:39518603 TTTCCTGATGTGCGTTTATGTC TCCAGTCCTCCTCTTTCTTCTG

SNP10 AK5 chr5:172045840 CTCTCGTTAGACGGGAAAGCTA ATTTCCTTACCCAGGGATGACT

SNP11 AK7 chr11:6609168 AGTGAGGGTGCAGAGAGAAAAG TAGGTGAGTTGTGTTCCCACAG

SNP12 AK7 chr15:31665551 CCCTCAGGAAGGACTGTTCATA AGGGAGATGATCGACTTGATGT

SNP13 AK7 chr22:40510576 CATCTTCATTGTCTCCCATCCT CTTACCTGCACCAGGAGGTTCT

SNP14 AK9 chr1:110034661 GCCTTCTGCAGATCACTTTTGT AGGACTGGGAAAACATCTGAAA

SNP15 AK9 chr7:48320193 CTGTCTGGAAAGTGTGATCAGG TTGTAAATGAAAGCTCGCACAT

SNP16 AK9 chr10:5425823 CCAGTTGGACACCAATCTACAA TCTGTGTCAGACATCACCACTG

SNP17 AK9 chr22:37807148 AAGGTTACCCTGACCATCTTTG CCCATCACAGACACTTAAGCAG

SNP18 AK20 chr12:8705937 TGCTATTCCTCTTCGACCTCAT TCTGAGCATCACATTCTCTGCT

SNP19 AK20 chr17:31987125 TGCACAAGGATAGGAACCAGTA GTTACAGACAGGACTCCCTTGG

SNP20 AK20 chr6:30654121 CCCTGATCTTCAAGTTGGATTC GGATGTTTTTCTTCCTCCTCCT

SNP21 AK20 chr7:134269011 GAGAAAGAATTAAAGCCGAGCA CCTTAGCCTTCTCTTCCTCCTC

SNP22 AK4 chr11:133543212 TGAGCTTCCTGCATGACTACAT TTGTGGGAATTACACCTCCTCT

SNP23 AK4 chr17:7621663 TGGTGACGATAGAAATTCATGC TCCTTGCATTTACAGAACATGG

SNP24 AK4 chr3:11276768 TGCTTTCCACTTGATATTGTGC TTCATGTGCAACCCAGATACAT

SNP25 AK4 chr6:431719 TGGACTTCTTTTATGTGGCAGA GGGCTGTAAAACAAGTGTCTCC

SNP26 AK6 chr14:63553000 TGCTTTGTTGGGTATTGTTTTT ACATCAAGCCATCTATCCACAA

SNP27 AK6 chr18:31080233 GTGGCAAAACTTTCAAAAGGAG GCACAAAGCAAGCTAGACTCAA

SNP28 AK6 chr2:43846079 AGAGAAGGGAATTCTGGTAGCC ATGGAGGACAAGGAGTGAATGT

SNP29 AK6 chr6:127679300 TGATTATCCTCTACGGCACAAA GCATTATACCTTTGTGTTTCTGCTT

SNP30 AK6 chrX:131040253 GCTATGTGGACTTGTCCTTTCC TCTAAGCTCCTTCCAAACAAGC

SNP31 AK14 chr1:198644404 GTATGGTTTGATGACCCAGGTT AGAGAAGCCATTTGGATGTGAT

SNP32 AK14 chr5:65386257 AATCTTGGTGATCCAGGCTCTA CTTGATGCATTTGGACCATCTA

SNP33 AK14 chr5:95250409 CTGGTCCAAGCAGAGTTCTAGG TCTGACCTGTGGTTGAAAAATG

Indel01 AK3 chr7:80141323 AAAAGGGTGATAGGCAATTGAA TGGCCTAATATGTAACTTCTCTTTG

Indel02 AK5 chr19:52466505 ACTTGCCTGTGTCCCCAAAG CTCCACCTCTTCACCCCAAT

Indel03 AK5 chr20:74155 AAGTGTCTAAACGACGTTGGAA TCGAAGCAGTAGTCATCATCAAA

Indel04 AK5 chr3:191588765 CAGGGCGTGAGAAAAAGTAAAA TACCTCCAGAGAGTCATCAGCA

Indel05 AK7 chr1:8638908 TGTTGTCATTGTCCTCGTCTTC TCCCTGGAATTGAGTGAGAAAT

Indel06 AK7 chr11:55518142 CAGAGTCCCCACAATATTCACA TCCAGGACCGTCTACCTAAAAA

Indel07 AK7 chr21:44882040 AATCAGGCTACACCAGCTCCT AGACGGACTTAGAGCAGACAGG

Indel08 AK7 chr5:98220064 TTCCGACTACTCCAGGTATGCT TCATCGGTTACATTCAGACCAC

Indel09 AK7 chr6:159580769 AAGCCAATTTTGAGTCTTGGAG GATCCCATTGGAGCTCATTATC

Indel10 AK9 chr1:33068543 GAGGGATCTAAGCACGTTTACAA AATGCTGAAGGTAACAGGAGAAA



Supplementary Table 4. Indel list of 10 individuals extracted by whole genome sequencing

SuppTable4_Indel_Table_1_20.txt


Chromosome type position position_alternative size allele AK1 AK3 AK5 AK7 AK9 AK2 AK4 AK6 AK14 AK20 total sample_count max_total

chr1 del 228 228 27 - 0 0 0 0 0 0 1 1 2 1 5 4 20

chr1 del 328 328 24 - 0 0 1 0 0 0 0 0 0 0 1 1 20

chr1 ins 353 353 1 A 0 0 0 0 0 0 0 0 1 0 1 1 20

chr1 del 39377 39377 1 - 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 del 42096 42096 2 - 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 del 43001 43001 2 - 0 0 0 0 0 0 0 0 1 0 1 1 20

chr1 del 44575 44575 16 - 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 del 51213 51213 1 - 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 ins 51224 51224 1 A 0 0 0 0 0 0 0 0 1 0 1 1 20

chr1 del 53598 53601 3 - 0 0 1 1 1 0 1 0 1 1 6 6 20

chr1 del 56023 56023 6 - 0 0 0 0 0 0 0 0 1 0 1 1 20

chr1 del 62003 62022 4 - 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 del 63705 63705 1 - 0 0 0 1 0 0 0 0 0 1 2 2 20

chr1 del 71453 71453 1 - 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 del 71996 71996 3 - 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 del 73692 73692 20 - 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 del 88784 88784 1 - 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 ins 94048 94048 8 CACACACA 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 del 220913 220913 2 - 0 0 0 0 0 0 0 1 1 0 2 2 20

chr1 del 223548 223551 1 - 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 del 233713 233713 4 - 0 0 0 1 1 0 0 1 0 0 3 3 20

chr1 del 235119 235123 1 - 0 0 0 1 0 0 0 0 0 0 1 1 20

chr1 ins 239148 239148 1 T 0 1 0 0 0 0 0 0 0 0 1 1 20

chr1 del 241490 241491 1 - 0 0 0 0 1 0 2 1 2 0 6 4 20

chr1 ins 245787 245787 2 CT 0 0 0 0 0 0 0 1 1 0 2 2 20

chr1 ins 245792 245792 2 TG 0 0 0 0 0 0 0 0 0 1 1 1 20

chr1 del 333889 333889 3 - 0 0 0 0 0 0 0 0 1 0 1 1 20

chr1 del 530208 530210 5 - 0 1 0 0 0 0 0 0 0 0 1 1 20

chr1 del 536003 536003 5 - 0 0 0 0 0 1 0 0 0 0 1 1 20

chr1 ins 537496 537496 2 GT 0 0 0 0 0 2 0 0 0 0 2 1 20

chr1 del 537701 537701 2 - 0 0 0 0 0 2 0 0 0 0 2 1 20

chr1 del 537719 537719 2 - 0 0 0 0 0 1 0 0 0 0 1 1 20

chr1 del 557102 557102 1 - 2 0 0 0 0 0 0 0 0 0 2 1 20

chr1 del 557879 557884 1 - 0 0 0 0 0 0 0 2 0 0 2 1 20

chr1 del 602551 602555 3 - 0 0 0 0 1 0 0 0 0 1 2 2 20

chr1 ins 670110 670110 6 GTGTGT 0 0 0 0 0 0 0 0 1 0 1 1 20

chr1 ins 703655 703655 3 GCT 0 0 0 0 2 0 0 0 0 0 2 1 20

chr1 ins 710936 710936 2 GC 0 0 0 0 0 1 0 0 0 0 1 1 20

chr1 del 713661 713666 2 - 0 2 0 1 2 1 1 1 1 1 10 8 20

chr1 del 714067 714067 5 - 0 0 0 0 0 1 0 0 0 0 1 1 20

chr1 del 714811 714829 5 - 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 ins 715418 715418 5 GGAAT 0 0 0 0 0 0 0 0 1 0 1 1 20

chr1 del 716096 716096 5 - 0 0 0 0 0 1 0 0 0 0 1 1 20

chr1 del 716176 716176 10 - 0 0 0 0 1 0 0 0 0 0 1 1 20

chr1 del 716176 716176 15 - 0 0 0 0 0 0 1 0 1 0 2 2 20

chr1 del 716888 716888 5 - 0 0 0 0 0 1 0 0 0 0 1 1 20

chr1 ins 724802 724803 1 T 0 0 0 0 1 0 1 1 0 0 3 3 20

chr1 del 735233 735234 1 - 0 0 0 0 0 0 0 0 0 1 1 1 20

chr1 ins 739834 739834 2 AA 0 0 0 0 2 0 0 0 0 1 3 2 20
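The summary columns of Supplementary Table 4 appear to follow from the per-individual values. The sketch below parses one preview row and checks that reading. It rests on assumptions not stated in the table: that each per-individual value counts variant alleles (0, 1 or 2), that total is their sum, that sample_count is the number of individuals with a non-zero value, and that max_total is twice the number of individuals. The function names are hypothetical.

```python
# Sample order copied from the Supplementary Table 4 header.
SAMPLES = ["AK1", "AK3", "AK5", "AK7", "AK9", "AK2", "AK4", "AK6", "AK14", "AK20"]


def parse_indel_row(line):
    """Split one whitespace-delimited preview row into named fields."""
    fields = line.split()
    n = len(SAMPLES)
    if len(fields) != 6 + n + 3:
        raise ValueError(f"expected {6 + n + 3} fields, got {len(fields)}")
    chrom, kind, pos, pos_alt, size, allele = fields[:6]
    genotypes = dict(zip(SAMPLES, (int(x) for x in fields[6:6 + n])))
    total, sample_count, max_total = (int(x) for x in fields[6 + n:])
    return {
        "chrom": chrom,
        "type": kind,
        "position": int(pos),
        "position_alternative": int(pos_alt),
        "size": int(size),
        "allele": allele,
        "genotypes": genotypes,
        "total": total,
        "sample_count": sample_count,
        "max_total": max_total,
    }


def is_consistent(row):
    """Check the assumed relations between genotypes and summary columns."""
    counts = list(row["genotypes"].values())
    return (
        row["total"] == sum(counts)
        and row["sample_count"] == sum(1 for c in counts if c > 0)
        and row["max_total"] == 2 * len(counts)
    )


row = parse_indel_row("chr1 del 713661 713666 2 - 0 2 0 1 2 1 1 1 1 1 10 8 20")
print(row["total"], row["sample_count"], is_consistent(row))
```

A row whose allele field is fused with the next column in the extracted text would fail the field-count check, which makes such rows easy to spot.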



Supplementary Table 5. Exome Sequencing Statistics

Individuals Gender Read length reads bases Aligned reads Aligned bases aligned coverage

AK_N1 Male 2 x 78 65,740,784 5,127,781,152 62,465,112 4,872,092,077 62.7

AK_N2 Male 2 x 78 67,124,814 5,235,735,492 63,544,668 4,956,295,050 63.8

AK_N5 Male 2 x 78 68,444,468 5,338,668,504 64,553,546 5,034,976,940 64.8

AK_N6 Male 2 x 78 66,995,750 5,225,668,500 62,752,082 4,894,465,964 63.0

AK_N7 Male 2 x 78 66,884,224 5,216,969,472 62,973,127 4,911,704,660 63.2

AK_N9 Female 2 x 78 70,142,332 5,471,101,896 63,828,875 4,978,499,064 64.1

AK_N14 Female 2 x 78 71,856,586 5,604,813,708 64,188,537 5,006,552,917 64.5

AK_N15 Male 2 x 78 71,876,294 5,606,350,932 64,543,996 5,034,263,798 64.8



Supplementary Table 6. Non-synonymous SNP list detected from 18 individuals (10 whole genome sequencing, 8 whole exome sequencing)

SuppTable6_nsSNP_from_18individuals.xls


chr pos ref allele AK1 AK3 AK5 AK7 AK9 AK2 AK4 AK6 AK14 AK20 AK_N1 AK_N2 AK_N5 AK_N6 AK_N7 AK_N15 AK_N9 AK_N14 annotations ref_aa snp_aa blosum nssnp

chr1 59374 A G 0 2 2 0 1 0 2 0 2 2 2 2 2 2 2 2 2 2 CDS:OR4F5 T A 0 nsSNP

chr1 855557 C T 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 CDS:SAMD11 H Y 2 nsSNP

chr1 867694 T C 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CDS:SAMD11 W R -3 nsSNP

chr1 878522 T C 2 2 2 0 0 2 2 0 2 2 2 2 2 2 2 2 2 2 CDS:NOC2L I V 3 nsSNP

chr1 879101 G A 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 CDS:NOC2L A V 0 nsSNP

chr1 891991 C T 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CDS:PLEKHN1 A V 0 nsSNP

chr1 895986 G C 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 CDS:PLEKHN1 A P -1 nsSNP

chr1 899101 G C 2 2 1 2 0 2 0 0 2 0 2 2 2 2 2 2 2 2 CDS:PLEKHN1 R P -2 nsSNP

chr1 899172 T C 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CDS:PLEKHN1 S P -1 nsSNP

chr1 925085 C A 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 CDS:HES4::Intron:HES4 "T"R S -1 nsSNP

chr1 939471 G A 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 CDS:ISG15 S N 1 nsSNP

chr1 966461 C T 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 CDS:AGRN T I -1 nsSNP

chr1 970994 A G 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 CDS:AGRN Q R 1 nsSNP

chr1 979070 G C 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 CDS:AGRN S T 1 nsSNP

chr1 1110294 G A 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 CDS:TTLL10 S N 1 nsSNP

chr1 1122801 G A 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 CDS:TTLL10 G D -1 nsSNP

chr1 1212130 G C 2 0 0 0 0 1 0 0 0 1 2 0 0 0 1 0 0 0 CDS:SCNN1D R P -2 nsSNP

chr1 1213248 G C 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 CDS:SCNN1D E Q 2 nsSNP

chr1 1252564 G A 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 CDS:GLTPD1 R K 2 nsSNP



Supplementary Table 7. Trait-O-Matic results on nsSNP of 18 individuals

SuppTable7_TraitOMatic_18individuals.xls


Coordinates | Genotype | Gene, amino acid change | Trait-associated allele | Associated trait
chr1:9246497 | G/A | H6PD, R453Q | A | CORTISONE REDUCTASE DEFICIENCY
chr1:25589952 | C/G | RHCE, P226A | C | RH E/e POLYMORPHISM
chr1:46643348 | C/A | FAAH, P129T | A | DRUG ADDICTION, SUSCEPTIBILITY TO
chr1:65809029 | G/G | LEPR, K109R | G | LEPTIN RECEPTOR POLYMORPHISM
chr1:65831101 | G/G | LEPR, Q223R | G | LEPTIN RECEPTOR POLYMORPHISM
chr1:167963570 | G/A | SELE, H468Y | A | IgA NEPHROPATHY, SUSCEPTIBILITY TO
chr1:194925860 | C/T | CFH, Y402H | C | BASAL LAMINAR DRUSEN, INCLUDED
chr1:194925860 | C/T | CFH, Y402H | C | MACULAR DEGENERATION, AGE-RELATED, 4, SUSCEPTIBILITY TO
chr1:205173101 | A/A | PIGR, A580V | A | IgA NEPHROPATHY, SUSCEPTIBILITY TO
chr1:224086256 | T/C | EPHX1, Y113H | C | EMPHYSEMA, SUSCEPTIBILITY TO, INCLUDED
chr1:224086256 | T/C | EPHX1, Y113H | C | LYMPHOPROLIFERATIVE DISORDERS, SUSCEPTIBILITY TO
chr1:224086256 | T/C | EPHX1, Y113H | C | PREECLAMPSIA, SUSCEPTIBILITY TO, INCLUDED
chr1:224086256 | T/C | EPHX1, Y113H | C | PULMONARY DISEASE, CHRONIC OBSTRUCTIVE, SUSCEPTIBILITY TO, INCLUDED
chr2:108880033 | G/G | EDAR, V370A | G | HAIR MORPHOLOGY 1, HAIR THICKNESS
chr2:230758959 | G/G | SP110, L425S | G | MYCOBACTERIUM TUBERCULOSIS, SUSCEPTIBILITY TO
chr3:46374212 | G/A | CCR2, V64I | A | HUMAN IMMUNODEFICIENCY VIRUS TYPE 1, RESISTANCE TO
chr4:2876505 | T/T | ADD1, G460W | T | HYPERTENSION, SALT-SENSITIVE ESSENTIAL, SUSCEPTIBILITY TO
chr4:100458342 | T/C | ADH1B, R48H | T | AERODIGESTIVE TRACT CANCER, SQUAMOUS CELL, ALCOHOL-RELATED, PROTECTION AGAINST; INCLUDED
chr4:100458342 | T/C | ADH1B, R48H | T | ALCOHOL DEPENDENCE, PROTECTION AGAINST
chr4:100479812 | T/C | ADH1C, I350V | C |
chr4:100482988 | C/T | ADH1C, R272Q | T |
chr4:102970099 | G/A | BANK1, R61H | A | SYSTEMIC LUPUS ERYTHEMATOSUS, ASSOCIATION WITH
chr5:7923973 | A/G | MTRR, I22M | G | DOWN SYNDROME, SUSCEPTIBILITY TO, INCLUDED
chr5:7923973 | A/G | MTRR, I22M | G | NEURAL TUBE DEFECTS, FOLATE-SENSITIVE, SUSCEPTIBILITY TO



Supplementary Table 8. Super nsSNP gene list

SuppTable8_Super_nsSNP_Gene_List.xls


Gene | Code | Chr | Start of first exon (bp) | Stop of last exon (bp) | Number of exons | Total exon length (bp) | Average # nsSNP | Average density (/Kb) | AK1 # nsSNP | AK1 density (/Kb) | AK3 # nsSNP | AK3 density (/Kb) | AK5 # nsSNP | AK5 density (/Kb) | AK7 # nsSNP | AK7 density (/Kb)
ZNF717 | NM_001128223 | 3 | 75868718 | 75915203 | 5 | 2749 | 75 | 27.28 | 85 | 30.92 | 67 | 24.37 | 79 | 28.74 | 75 | 27.28
OR4C3 | NM_001004702 | 11 | 48303068 | 48304058 | 1 | 991 | 16.5 | 16.65 | 17 | 17.15 | 14 | 14.13 | 18 | 18.16 | 11 | 11.1
CDC27 | NM_001256 | 17 | 42553299 | 42621537 | 19 | 2494 | 40.6 | 16.28 | 65 | 26.06 | 31 | 12.43 | 59 | 23.66 | 31 | 12.43
FRG2C | NM_001124759 | 3 | 75796220 | 75797882 | 4 | 853 | 13.5 | 15.83 | 9 | 10.55 | 12 | 14.07 | 16 | 18.76 | 14 | 16.41
OR4C45 | NM_001005513 | 11 | 48323475 | 48330575 | 2 | 921 | 13.7 | 14.88 | 15 | 16.29 | 16 | 17.37 | 18 | 19.54 | 15 | 16.29
OR9G9 | NM_001013358 | 11 | 56224439 | 56225357 | 1 | 919 | 12.9 | 14.04 | 16 | 17.41 | 9 | 9.79 | 7 | 7.62 | 14 | 15.23
FAM104B | NM_001166702 | X | 55189241 | 55204143 | 3 | 342 | 4.6 | 13.45 | 4 | 11.7 | 4 | 11.7 | 4 | 11.7 | 4 | 11.7
PRIM2 | NM_000947 | 6 | 57291202 | 57620661 | 15 | 1543 | 17.7 | 11.47 | 15 | 9.72 | 18 | 11.67 | 19 | 12.31 | 19 | 12.31
HLA-DRB1 | NM_002124 | 6 | 32654845 | 32665497 | 6 | 807 | 8.5 | 10.53 | 6 | 7.43 | 8 | 9.91 | 9 | 11.15 | 5 | 6.2
HLA-DPA1 | NM_033554 | 6 | 33144404 | 33149325 | 5 | 787 | 7.8 | 9.91 | 8 | 10.17 | 8 | 10.17 | 9 | 11.44 | 5 | 6.35
CTBP2 | NM_001329 | 10 | 126668076 | 126717613 | 11 | 1347 | 12.6 | 9.35 | 30 | 22.27 | 9 | 6.68 | 11 | 8.17 | 11 | 8.17
SEC22B | NM_004892 | 1 | 143807903 | 143827246 | 5 | 653 | 6 | 9.19 | 7 | 10.72 | 6 | 9.19 | 6 | 9.19 | 7 | 10.72
HLA-DQB1 | NM_002123 | 6 | 32735990 | 32742362 | 5 | 791 | 7.1 | 8.98 | 4 | 5.06 | 8 | 10.11 | 9 | 11.38 | 10 | 12.64
OR13C5 | NM_001004482 | 9 | 106400558 | 106401515 | 1 | 958 | 8.4 | 8.77 | 5 | 5.22 | 9 | 9.39 | 8 | 8.35 | 4 | 4.18
KCNJ12 | NM_021012 | 17 | 21259247 | 21260549 | 3 | 1303 | 11.2 | 8.6 | 13 | 9.98 | 9 | 6.91 | 13 | 9.98 | 10 | 7.67
MUC4 | NM_138297 | 3 | 196959717 | 197023085 | 23 | 3401 | 28.4 | 8.35 | 34 | 10 | 2 | 0.59 | 30 | 8.82 | 22 | 6.47
TAS2R31 | NM_176885 | 12 | 11074271 | 11075201 | 1 | 931 | 7.3 | 7.84 | 6 | 6.44 | 7 | 7.52 | 7 | 7.52 | 7 | 7.52
HLA-DQA1 | NM_002122 | 6 | 32713213 | 32718519 | 5 | 772 | 6 | 7.77 | 5 | 6.48 | 2 | 2.59 | 8 | 10.36 | 6 | 7.77
OR51Q1 | NM_001004757 | 11 | 5400006 | 5400960 | 1 | 955 | 7.4 | 7.75 | 7 | 7.33 | 7 | 7.33 | 7 | 7.33 | 8 | 8.38
HLA-A | NM_002116 | 6 | 30018309 | 30021211 | 8 | 1106 | 8.4 | 7.59 | 16 | 14.47 | 6 | 5.42 | 13 | 11.75 | 7 | 6.33
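The density column in Supplementary Table 8 is consistent with the number of nsSNPs divided by the total exon length, scaled to 1 kb. The sketch below reproduces a few tabulated values under that assumption; the function name nssnp_density_per_kb is hypothetical.

```python
def nssnp_density_per_kb(n_nssnp, total_exon_length_bp):
    """nsSNPs per kilobase of exon sequence (assumed definition of density)."""
    if total_exon_length_bp <= 0:
        raise ValueError("total exon length must be positive")
    return n_nssnp / total_exon_length_bp * 1000


# (gene, number of nsSNPs, total exon length in bp) taken from the table
examples = [
    ("ZNF717", 75, 2749),
    ("OR4C3", 16.5, 991),
    ("PRIM2", 17.7, 1543),
]
densities = {
    gene: round(nssnp_density_per_kb(n, length), 2)
    for gene, n, length in examples
}
print(densities)
```

Non-integer counts are per-gene averages across individuals, so the same function applies to both the average and the per-individual columns.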



Supplementary Table 9. List of Korean common novel nsSNP LD

SuppTable9_KoreanCommonNovel_nsSNP_LD.xls


chr pos ref_allele var_allele frequency annotation ref_aa snp_aa blosum rsSNP_maxLD r2

chr1 9755922 A T 2//20 CDS:CLSTN1 F Y 3 rs77601527 1

chr1 11694516 G A 2//20 CDS:C1orf187 G R -2 rs77681396 1

chr1 11949647 A G 2//20 CDS:PLOD1 Y C -2 rs116892868 0.44444

chr1 12260475 G A 2//20 CDS:VPS13D D N 1 rs7545503 0.44444

chr1 12760062 C G 3//20 CDS:PRAMEF12 N K 0 rs80177200 1

chr1 12776677 T A 8//20 CDS:PRAMEF1 L stop -4 rs1063776 1

chr1 12776775 T C 2//20 CDS:PRAMEF1 C R -3 rs1613050 1

chr1 12777066 C G 4//20 CDS:PRAMEF1 R G -2 rs1063774 1

chr1 12778353 C T 2//20 CDS:PRAMEF1 A V 0 rs74850310 1

chr1 12778480 T G 2//20 CDS:PRAMEF1 I M 1 rs848426 0.86538

chr1 12810958 T G 2//20 CDS:PRAMEF11 Q H 0 rs1769772 0.64286

chr1 12810984 C T 4//20 CDS:PRAMEF11 D N 1 rs2076063 0.64286

chr1 12811002 C T 5//20 CDS:PRAMEF11 A T 0 rs1736809 0.42857

chr1 12829871 T C 7//20 CDS:LOC649330,HNRNPCL1 T A 0 rs12745844 0.64286

chr1 12829872 G C 7//20 CDS:LOC649330,HNRNPCL1 S R -1 rs12745844 0.64286

chr1 12829903 T C 7//20 CDS:LOC649330,HNRNPCL1 E G -2 rs12745844 0.64286

chr1 12830036 C G 2//20 CDS:LOC649330,HNRNPCL1 D H -1 rs1630264 1

chr1 12830083 T C 2//20 CDS:LOC649330,HNRNPCL1 K R 2 rs1630264 1

chr1 12830120 T C 2//20 CDS:LOC649330,HNRNPCL1 I V 3 rs1630264 1

chr1 12830385 A C 5//20 CDS:LOC649330,HNRNPCL1 F L 0 rs61777008 0.60494

chr1 12830389 C T 6//20 CDS:LOC649330,HNRNPCL1 G D -1 rs1737113 0.64286

chr1 12830390 C T 6//20 CDS:LOC649330,HNRNPCL1 G S 0 rs1737113 0.64286

chr1 12842229 G A 2//20 CDS:PRAMEF2 A T 0 rs61781252 1

chr1 12843952 A C 2//20 CDS:PRAMEF2 S R -1 rs116865587 1

chr1 12843954 C A 2//20 CDS:PRAMEF2 S R -1 rs116865587 1

chr1 12843969 T G 2//20 CDS:PRAMEF2 I M 1 rs58112782 1

chr1 12862091 A C 2//20 CDS:PRAMEF4 F C -2 rs3928864 1

chr1 13105825 T C 2//20 CDS:LOC440563 E G -2 rs113741404 1

chr1 13106059 G A 5//20 CDS:LOC440563 P L -3 rs113259710 1

chr1 13106115 A C 6//20 CDS:LOC440563 F L 0 rs28434299 0.53552

chr1 13106119 C T 5//20 CDS:LOC440563 G D -1 rs78443402 0.66667

chr1 13106120 C T 5//20 CDS:LOC440563 G S 0 rs78443402 0.66667

chr1 16773608 A G 2//20 CDS:NBPF1 S P -1 rs58145953 0.58333

chr1 16786264 T C 4//20 CDS:NBPF1 S G 0 rs598052 0.84848

chr1 17439736 G A 2//20 CDS:PADI1 R H 0 rs4363467 1

chr1 46911224 C T 2//20 CDS:C1orf223 R W -3 rs12562113 0.69262

chr1 47375775 C T 2//20 CDS:CYP4A22 R C -3 rs2224622 1

chr1 52078667 T A 2//20 CDS:NRD1 E V -2 rs117346555 0.58333

chr1 64247576 T G 2//20 CDS:ROR1 S A 1 rs1341511 0.25

chr1 89221866 C T 2//20 CDS:RBMXL1::Intron:CCBL2 "T"A T 0 rs77567101 1

chr1 89221886 C G 3//20 CDS:RBMXL1::Intron:CCBL2 "T"G A 0 rs112636230 1

chr1 89297183 A G 2//20 CDS:GBP1 L P -3 rs12125301 0.76563

chr1 111830738 G A 2//20 CDS:ADORA3 R C -3 rs1415793 0.47368



Supplementary Table 10. List of 5,496 large deletions from 8 individuals, including breakpoint information

SuppTable10_Large_Deletion_List.xls


The columns whole-gene, CDS, UTR, intron and promoter(<1kb) are gene annotations.

index | individual | chr | size | start | stop | breakpoint start | breakpoint stop | #split_reads | whole-gene | CDS | UTR | intron | promoter(<1kb) | inferred mechanism
LargeDeletion_1 | AK3 | chr1 | 1261 | 534631 | 535891 | N/S | N/S | N/S | - | - | - | - | - | VNTR
LargeDeletion_2 | AK3 | chr1 | 1051 | 859231 | 860281 | N/S | N/S | N/S | - | - | - | SAMD11 | - | VNTR
LargeDeletion_2 | AK5 | chr1 | 1051 | 859231 | 860281 | 859230 | 859817 | 1 | - | - | - | SAMD11 | - | VNTR
LargeDeletion_3 | AK3 | chr1 | 1591 | 955171 | 956761 | N/S | N/S | N/S | - | - | - | AGRN | - | VNTR
LargeDeletion_3 | AK7 | chr1 | 1591 | 955171 | 956761 | 955260 | 955805 | 2 | - | - | - | AGRN | - | VNTR
LargeDeletion_4 | AK3 | chr1 | 1231 | 1064821 | 1066051 | 1065396 | 1065944 | 1 | - | - | - | - | - | VNTR
LargeDeletion_4 | AK5 | chr1 | 1231 | 1064821 | 1066051 | 1064327 | 1065607 | 1 | - | - | - | - | - | VNTR



Supplementary Table 11. Validation of large deletions using 24M CGH array data

Individual | Total deletions | Subject regions (≥5 array probes) | Validated regions (p-value* < 0.05) | Validated regions (p-value* < 0.01) | Accuracy (p-value < 0.05) | Accuracy (p-value < 0.01)
AK4 | 674 | 330 | 272 | 255 | 82.42% | 77.27%
AK6 | 693 | 331 | 285 | 271 | 86.10% | 81.87%
AK14 | 700 | 318 | 262 | 245 | 82.39% | 77.04%
AK20 | 683 | 318 | 268 | 245 | 84.28% | 77.04%
Total | 2,750 | 1,297 | 1,087 | 1,016 | 83.81% | 78.33%

* Wilcoxon Rank Sum Test using R statistics
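The accuracy columns in Supplementary Table 11 are consistent with the number of validated regions divided by the number of subject regions. The sketch below recomputes them from the tabulated counts under that assumption; it does not reproduce the Wilcoxon rank sum test itself, and the names used are hypothetical.

```python
# individual: (total deletions, subject regions, validated p<0.05, validated p<0.01)
counts = {
    "AK4": (674, 330, 272, 255),
    "AK6": (693, 331, 285, 271),
    "AK14": (700, 318, 262, 245),
    "AK20": (683, 318, 268, 245),
}


def accuracy_percent(validated, subject):
    """Validated regions as a percentage of subject regions."""
    return round(100 * validated / subject, 2)


per_individual = {
    name: (accuracy_percent(v05, subject), accuracy_percent(v01, subject))
    for name, (_, subject, v05, v01) in counts.items()
}

# Column sums give the "Total" row
totals = tuple(sum(column) for column in zip(*counts.values()))
total_accuracy = (
    accuracy_percent(totals[2], totals[1]),
    accuracy_percent(totals[3], totals[1]),
)
print(per_individual, totals, total_accuracy)
```

Only regions covered by at least 5 array probes enter the denominator, so the accuracy is not a fraction of all called deletions.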



Supplementary Table 12. Breakpoint list of NA10851 deletions (CNV Loss)

SuppTable12_NA10851_Deletion(CNV)_Breakpoints.xls


cnv_id chr cnv_type cnv_size cnv_start cnv_end gap_size gap_start gap_end

NA10851_BP_1 chr1 LOSS 3085 2042821 2045905 987 2042811 2043797

NA10851_BP_2 chr1 LOSS 606 7847539 7848144 310 7847572 7847881

NA10851_BP_3 chr1 LOSS 82381 13219599 13301979 162082 13220037 13382118

NA10851_BP_4 chr1 LOSS 3646 54864917 54868562 3693 54864862 54868554

NA10851_BP_5 chr1 LOSS 901 58516499 58517399 913 58516498 58517410

NA10851_BP_6 chr1 LOSS 1188 59878725 59879912 851 59878834 59879684

NA10851_BP_7 chr1 LOSS 1088 61855369 61856456 844 61855446 61856289

NA10851_BP_8 chr1 LOSS 763 67780751 67781513 856 67780549 67781404

NA10851_BP_9 chr1 LOSS 468 72222105 72222572 412 72222209 72222620

NA10851_BP_10 chr1 LOSS 45875 72538815 72584689 45516 72538912 72584427

NA10851_BP_11 chr1 LOSS 2410 79993560 79995969 1248 79994369 79995616

NA10851_BP_12 chr1 LOSS 2433 89248781 89251213 2716 89248503 89251218

NA10851_BP_13 chr1 LOSS 3298 94060875 94064172 2884 94060962 94063845

NA10851_BP_14 chr1 LOSS 585 104244674 104245258 526 104244723 104245248

NA10851_BP_15 chr1 LOSS 1065 104470632 104471696 439 104470723 104471161

NA10851_BP_16 chr1 LOSS 827 105469153 105469979 912 105469073 105469984

NA10851_BP_17 chr1 LOSS 4341 108534537 108538877 3927 108534849 108538775

NA10851_BP_18 chr1 LOSS 12815 112493555 112506369 12913 112493315 112506227

NA10851_BP_19 chr1 LOSS 34690 150822121 150856810 32198 150822167 150854364

NA10851_BP_20 chr1 LOSS 3337 157134152 157137488 2451 157134158 157136608

NA10851_BP_21 chr1 LOSS 964 157915302 157916265 950 157915332 157916281

NA10851_BP_22 chr1 LOSS 1246 167270761 167272006 871 167270985 167271855

NA10851_BP_23 chr1 LOSS 4730 173338098 173342827 6154 173337124 173343277

NA10851_BP_24 chr1 LOSS 767 186806093 186806859 775 186806079 186806853

NA10851_BP_25 chr1 LOSS 2158 190183033 190185190 2109 190183069 190185177

NA10851_BP_26 chr1 LOSS 1715 197040280 197041994 1665 197040373 197042037

NA10851_BP_27 chr1 LOSS 1521 201330618 201332138 844 201330970 201331813

NA10851_BP_28 chr1 LOSS 3580 205608618 205612197 3601 205608592 205612192

NA10851_BP_29 chr1 LOSS 8527 208144233 208152759 7922 208144679 208152600

NA10851_BP_30 chr1 LOSS 2736 208789026 208791761 2420 208789084 208791503

NA10851_BP_31 chr1 LOSS 6837 220440446 220447282 6374 220440789 220447162

NA10851_BP_32 chr1 LOSS 495 223740078 223740572 331 223740095 223740425

NA10851_BP_33 chr1 LOSS 1696 234985539 234987234 1610 234985557 234987166

NA10851_BP_34 chr1 LOSS 1116 241849373 241850488 1018 241849369 241850386

NA10851_BP_35 chr1 LOSS 2642 244204375 244207016 1560 244204866 244206425
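The size columns in Supplementary Table 12 are consistent with inclusive coordinates, that is, size = end - start + 1. The sketch below checks this for two rows and also computes how many bases the CNV interval and the gap interval share. The overlap is an illustrative quantity that is not reported in the table, and the function names are hypothetical.

```python
def interval_size(start, end):
    """Length of an interval with inclusive start and end coordinates."""
    return end - start + 1


def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


# cnv_id: (cnv_size, cnv_start, cnv_end, gap_size, gap_start, gap_end)
rows = {
    "NA10851_BP_1": (3085, 2042821, 2045905, 987, 2042811, 2043797),
    "NA10851_BP_3": (82381, 13219599, 13301979, 162082, 13220037, 13382118),
}

for cnv_id, (cnv_size, cs, ce, gap_size, gs, ge) in rows.items():
    sizes_match = interval_size(cs, ce) == cnv_size and interval_size(gs, ge) == gap_size
    print(cnv_id, sizes_match, overlap_bp(cs, ce, gs, ge))
```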



Supplementary Table 13. Motif on flanking regions of NHEJ large deletions



Supplementary Table 14. RNA Sequencing Statistics

<8AKs Transcriptome Seq. Run1>

Read length reads bases Aligned reads Aligned bases Coverage

AK3 2 x 101 51,601,132 5,211,714,332 28,345,352 2,862,410,015 36.9

AK4 2 x 101 51,143,140 5,165,457,140 27,636,273 2,790,800,158 35.9

AK5 2 x 101 52,630,854 5,315,716,254 28,317,447 2,859,606,849 36.8

AK6 2 x 101 51,324,294 5,183,753,761 26,733,252 2,699,610,066 34.8

AK7 2 x 101 51,032,760 5,154,308,760 27,011,183 2,727,650,181 35.1

AK14 2 x 101 52,862,496 5,339,112,096 26,717,457 2,698,007,226 34.7

AK20 2 x 101 51,586,996 5,210,286,596 27,373,384 2,764,293,063 35.6

<8AK_Ns Transcriptome Seq. Run1>

Read length reads bases Aligned reads Aligned bases Coverage

AK_N1 2 x 78 77,298,628 6,029,292,984 59,347,587 4,606,250,876 59.3

AK_N2 2 x 78 82,591,894 6,442,167,732 63,596,426 4,940,288,891 63.6

AK_N5 2 x 78 90,704,150 7,074,923,700 64,991,714 5,059,103,355 65.1

AK_N6 2 x 78 86,655,342 6,759,116,676 61,578,609 4,797,048,436 61.8

AK_N7 2 x 78 87,381,024 6,815,719,872 61,237,205 4,775,180,801 61.5

AK_N9 2 x 78 81,470,204 6,354,675,912 61,697,228 4,790,953,276 61.7

AK_N14 2 x 78 88,447,086 6,898,872,708 67,369,082 5,228,014,109 67.3

AK_N15 2 x 78 78,300,992 6,107,477,376 61,677,972 4,787,852,561 61.6



Supplementary Table 15. Aligning RNA short reads to cDNA sequences decreases misalignment of short reads to pseudogenes.

Chr. Gene RPKM_GenomeBased RPKM_cDNABased

1 SUMO1P3 19.8012 0.4284

1 HSP90B3P 44.8018 0.0409

1 TOP1P1 11.3555 0.5103

1 AURKAPS1 12.0098 0.3908

3 PA2G4P4 77.8963 0.3407

6 LYPLA2P1 20.3936 0.4135

7 CLK2P 11.374 0

7 RPL23P8 239.9053 0.1288

8 RNF5P1 30.8248 0.3391

8 NACAP1 34.0562 1.2033

8 PTTG3P 29.723 0

9 PTENP1 13.8665 1.1005

9 ANXA2P2 107.289 1.462

10 PIPSL 37.2138 0.0875

11 CSNK2A1P 45.2763 1.8908

12 NME2P1 360.4471 0.0972

13 ATP5EP2 251.3407 0.0897

16 UBE2MP1 50.9908 0.2103

16 CSDAP1 11.0325 0



Supplementary Table 16. Expression map represented in RPKM value for all Refseq genes

SuppTable16_Gene_Expression_Map_in_RPKM.xls


Columns AK3 through AK_N14 give gene expression (RPKM) for each individual.

chr gene_id gene start stop size accession ave_rpkm AK3 AK5 AK7 AK_N1 AK_N2 AK_N5 AK_N6 AK_N7 AK_N15 AK4 AK6 AK14 AK20 AK_N9 AK_N14
1 29751 LOC100288778 4225 7502 1220 NR_028269 10.54964 12.185 14.002 14.091 10.209 8.219 5.938 7.566 11.588 7.651 14.359 9.887 11.921 11.827 9.663 9.138
1 19700 WASH5P 4225 19233 1769 NR_024540 10.25641 12.048 13.978 14.1 10.128 7.507 5.65 6.886 10.883 7.547 13.713 9.297 11.317 11.607 10.063 9.125
1 2924 FAM138F 24475 25944 1129 NR_026820 0.003707 0 0 0 0 0.036 0 0.019 0 0 0 0 0 0 0 0
1 18978 FAM138A 24475 25944 1129 NR_026818 0.003707 0 0 0 0 0.036 0 0.019 0 0 0 0 0 0 0 0
1 27700 FAM138C 24475 25944 1129 NR_026822 0.003707 0 0 0 0 0.036 0 0.019 0 0 0 0 0 0 0 0
1 33158 OR4F5 58954 59871 918 NM_001005484 0.014827 0.041 0 0.077 0 0 0 0 0 0 0 0 0.059 0 0.045 0
1 29456 LOC100132062 313755 318443 4369 NR_028325 2.633713 4.93 3.59 4.024 2.811 1.914 1.522 1.832 1.704 1.547 3.974 2.052 2.756 2.447 2.915 1.489
1 29775 LOC100133331 313755 318443 4272 NR_028327 2.693387 5.041 3.671 4.116 2.874 1.958 1.557 1.874 1.743 1.582 4.064 2.099 2.819 2.502 2.981 1.522
1 29784 LOC100132287 313755 318443 4369 NR_028322 2.633713 4.93 3.59 4.024 2.811 1.914 1.522 1.832 1.704 1.547 3.974 2.052 2.756 2.447 2.915 1.489
1 23957 OR4F29 357522 358460 939 NM_001005221 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 28748 OR4F3 357522 358460 939 NM_001005224 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 33188 OR4F16 357522 358460 939 NM_001005277 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 31657 MIR1977 556051 556128 78 NR_031741 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 13392 OR4F3 610959 611897 939 NM_001005224 0.00568 0 0.04 0 0.045 0 0 0 0 0 0 0 0 0 0 0
1 23958 OR4F29 610959 611897 939 NM_001005221 0.00568 0 0.04 0 0.045 0 0 0 0 0 0 0 0 0 0 0
1 33189 OR4F16 610959 611897 939 NM_001005277 0.00568 0 0.04 0 0.045 0 0 0 0 0 0 0 0 0 0 0
1 29782 LOC100133331 651003 655594 4272 NR_028327 2.89888 5.379 3.886 4.029 3.307 1.912 1.558 1.886 1.985 1.898 4.063 3.323 3.074 2.684 2.786 1.715
1 26608 NCRNA00115 751450 752765 1316 NR_024321 1.847667 1.928 2.428 1.245 2.615 2.005 2.675 1.418 1.37 2.024 2.014 2.132 1.042 1.692 1.703 1.425
1 22531 LOC643837 752927 779603 1543 NR_015368 5.284347 4.936 6.378 7.498 6.553 4.802 5.208 4.593 4.171 4.376 5.572 5.385 4.794 6.085 4.865 4.049
1 638 FAM41C 793320 802045 1700 NR_027055 1.025213 1.293 1.152 1.332 0.702 0.664 1.045 0.79 0.487 0.514 2.244 1.408 1.296 1.18 0.695 0.577
1 27737 FLJ39609 842818 844680 494 NR_026874 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 12523 SAMD11 850984 869824 2554 NM_152486 4.23398 5.999 5.519 6.475 3.633 2.51 1.994 3.122 4.118 2.685 4.37 4.854 5.73 6.015 2.931 3.557
1 17488 NOC2L 869446 884542 2800 NM_015658 56.4704 81.353 76.418 80.267 40.936 37.365 28.795 44.552 43.75 38.736 63.486 65.319 71.363 79.816 50.294 44.607
1 3040 KLHL17 885830 890958 2560 NM_198317 1.945547 2.619 1.797 1.753 2.343 1.742 1.03 2.025 1.794 1.605 1.983 1.935 1.983 2.277 2.246 2.051
1 11766 PLEKHN1 891740 900345 2398 NM_032129 0.248513 0.163 0.09 0.297 0.295 0.411 0.253 0.253 0.158 0.438 0.182 0.377 0.193 0.165 0.2 0.253
1 31739 PLEKHN1 891740 900345 2293 NM_001160184 0.255713 0.176 0.094 0.286 0.312 0.378 0.251 0.269 0.163 0.454 0.194 0.391 0.212 0.19 0.214 0.253
1 28805 C1orf170 900442 907336 3040 NR_027693 0.058 0.022 0.044 0.117 0.084 0.055 0.014 0.133 0.051 0.08 0.045 0.048 0 0.069 0.074 0.034
1 21904 HES4 924206 925415 961 NM_021170 0.778307 0.901 0.554 0.946 1.414 0.557 0.506 1.116 0.626 1.367 0.944 1.195 0.271 0.731 0.41 0.138
1 29552 HES4 924208 925415 1037 NM_001142467 0.742247 0.845 0.514 0.881 1.345 0.516 0.481 1.077 0.617 1.317 0.943 1.107 0.28 0.691 0.393 0.128
1 8392 ISG15 938710 939782 666 NM_005101 268.2438 186.28 419.374 380.118 259.442 99.085 136.799 241.431 326.256 185.151 190.792 427.151 289.287 314.469 441.4 126.623
1 8191 AGRN 945366 981355 7319 NM_198576 10.80492 8.352 12.267 12.034 16.947 14.3 7.813 11.017 7.653 14.786 10.835 12.316 8.24 9.068 9.438 7.007
1 5344 C1orf159 1007061 1041599 2104 NM_017891 3.982073 7.995 6.259 5.951 3.214 2.325 1.067 2.779 2.76 2.074 4.427 4.863 4.849 5.54 2.981 2.647
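For readers unfamiliar with the unit, the sketch below shows the standard RPKM definition of Mortazavi et al. (reads per kilobase of exon model per million mapped reads) and checks that the ave_rpkm column in Supplementary Table 16 is consistent with the arithmetic mean of the 15 per-individual values. The function name and the example read counts are hypothetical, and the exact read-counting rules used for this table are not restated here.

```python
from statistics import mean


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: 100 reads on a 1 kb gene among 10 million mapped reads
example_rpkm = rpkm(100, 1000, 10_000_000)

# Per-individual RPKM values for LOC100288778, copied from the table preview
loc100288778 = [
    12.185, 14.002, 14.091, 10.209, 8.219, 5.938, 7.566, 11.588,
    7.651, 14.359, 9.887, 11.921, 11.827, 9.663, 9.138,
]
ave_rpkm = mean(loc100288778)
print(example_rpkm, round(ave_rpkm, 4))
```

The small difference between the recomputed mean and the tabulated 10.54964 is expected, since the per-individual values in the preview are rounded to three decimals.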



Supplementary Table 17. List of Unknown transcripts

SuppTable17_Unknown_Transcripts.xls


index individuals chr size(bp) start stop # of individuals nearest_gene distance(bp)

NovelTranscript_1 AK3,AK_N5,AK7,AK_N7,AK_N1,AK_N2,AK_N6,AK_N14,AK6,AK_N15,AK20,AK4,AK5,AK_N9 chr1 268 1333087 1333354 14 MRPL20 531

NovelTranscript_2 AK_N14,AK_N5,AK_N7,AK_N1 chr1 761 1501253 1502013 4 SSU72 1128

NovelTranscript_3 AK6,AK_N5 chr1 138 2464125 2464262 2 LOC115110 6956

NovelTranscript_4 AK_N1,AK_N7 chr1 95 2477222 2477316 2 TNFRSF14 1834

NovelTranscript_5 AK_N14,AK_N9 chr1 345 2501115 2501459 2 C1orf93 6649

NovelTranscript_8 AK3,AK_N7,AK4,AK20,AK_N14,AK_N1,AK_N6,AK5,AK6,AK_N2,AK14,AK_N9,AK7,AK_N5,AK_N15 chr1 166 3641312 3641477 15 KIAA0495 930

NovelTranscript_9 AK4,AK14 chr1 188 4210231 4210418 2 LOC284661 161552

NovelTranscript_10 AK_N2,AK_N5,AK_N15,AK_N9 chr1 735 7897021 7897755 4 TNFRSF9 4738

NovelTranscript_11 AK_N2,AK6 chr1 256 7897756 7898011 2 TNFRSF9 4482

NovelTranscript_12 AK_N2,AK14,AK6,AK_N6,AK_N15,AK_N5,AK_N7,AK_N9,AK4,AK_N14,AK20,AK5 chr1 337 7899895 7900231 12 TNFRSF9 2262

NovelTranscript_13 AK_N5,AK3 chr1 515 7943265 7943779 2 PARK7 521

NovelTranscript_14 AK4,AK6,AK_N5,AK_N6,AK14,AK_N1,AK_N7,AK7,AK_N15,AK_N2,AK_N14,AK20 chr1 531 9018450 9018980 12 SLC2A5 613

NovelTranscript_15 AK_N14,AK_N2 chr1 217 9019361 9019577 2 SLC2A5 16

NovelTranscript_16 AK_N6,AK_N14,AK_N7,AK_N9,AK_N5,AK_N15,AK5 chr1 1011 9970337 9971347 7 NMNAT1 2194

NovelTranscript_17 AK_N7,AK_N15 chr1 326 9972031 9972356 2 NMNAT1 3888

NovelTranscript_18 AK_N14,AK_N1,AK4,AK_N7,AK_N15 chr1 485 10435206 10435690 5 APITD1 409

NovelTranscript_19 AK_N15,AK_N14 chr1 141 10437528 10437668 2 APITD1 2731

NovelTranscript_20 AK_N7,AK_N9,AK14 chr1 333 11044140 11044472 3 SRM 1462

NovelTranscript_21 AK_N5,AK_N7 chr1 254 11047958 11048211 2 EXOSC10 1051

NovelTranscript_22 AK_N5,AK14 chr1 125 11048682 11048806 2 EXOSC10 456

NovelTranscript_23 AK_N2,AK20,AK7,AK_N5,AK3,AK6,AK_N14,AK5,AK_N9,AK_N7,AK14,AK_N6,AK_N15 chr1 746 11278696 11279441 13 UBIAD1 7619

NovelTranscript_24 AK_N7,AK_N14,AK_N5,AK14,AK4,AK_N2,AK_N6,AK_N9,AK6 chr1 426 11279945 11280370 9 UBIAD1 8868

NovelTranscript_25 AK_N2,AK_N9,AK7,AK_N15 chr1 681 12037286 12037966 4 TNFRSF8 8054

NovelTranscript_27 AK_N1,AK_N9,AK_N14,AK_N6,AK_N5,AK_N7,AK_N15 chr1 211 12045808 12046018 7 TNFRSF8 2

NovelTranscript_28 AK4,AK_N2,AK_N1,AK_N5,AK_N15,AK_N7 chr1 430 12126853 12127282 6 TNFRSF8 2

NovelTranscript_29 AK_N2,AK_N5,AK_N6,AK_N9,AK_N15,AK_N14 chr1 1050 12127341 12128390 6 TNFRSF8 490

NovelTranscript_34 AK_N6,AK_N5,AK_N7,AK_N15,AK_N2,AK_N9,AK_N1 chr1 654 13897046 13897699 7 PRDM2 1622

NovelTranscript_35 AK_N14,AK_N9,AK_N2,AK_N6 chr1 193 13898365 13898557 4 PRDM2 764
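The columns of this table can be cross-checked against each other. The sketch below is illustrative only: it assumes, from the values in the excerpt rather than from any statement in the table, that size(bp) equals stop - start + 1 and that "# of individuals" counts the comma-separated entries in the individuals column. The function name `check_row` is hypothetical.

```python
# Three rows copied from the excerpt of Supplementary Table 17:
# (index, individuals, chr, size_bp, start, stop, n_individuals)
rows = [
    ("NovelTranscript_1",
     "AK3,AK_N5,AK7,AK_N7,AK_N1,AK_N2,AK_N6,AK_N14,AK6,AK_N15,AK20,AK4,AK5,AK_N9",
     "chr1", 268, 1333087, 1333354, 14),
    ("NovelTranscript_3", "AK6,AK_N5", "chr1", 138, 2464125, 2464262, 2),
    ("NovelTranscript_15", "AK_N14,AK_N2", "chr1", 217, 9019361, 9019577, 2),
]


def check_row(row):
    """Return (span_ok, count_ok) for one table row.

    Assumption: coordinates are inclusive, so size = stop - start + 1.
    """
    _, individuals, _, size_bp, start, stop, n_individuals = row
    span_ok = (stop - start + 1) == size_bp
    count_ok = len(individuals.split(",")) == n_individuals
    return span_ok, count_ok


results = [check_row(r) for r in rows]
for row, (span_ok, count_ok) in zip(rows, results):
    print(row[0], "size consistent:", span_ok, "count consistent:", count_ok)
```

The same check can be run over the full SuppTable17_Unknown_Transcripts.xls once it is exported to text.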


45

Supplementary Table 18. 23 genes that escape X-inactivation

chr gene start stop ave_rpkm male_rpkm female_rpkm pvalue qvalue previous report*

X PRKX 3532384 3641675 9.560273 8.15844444 11.66283333 0.0027 0.01924878 9/9

X HDHD1A 6976961 7076231 22.92633 18.3414444 29.804 0.0018 0.01668228 8/9

X PNPLA4 7826804 7855475 3.217813 2.287 4.614166667 0.0018 0.01668228 9/9

X MSL3 11686199 11703791 35.78468 29.5732222 45.10183333 0.0027 0.01924878 3/9

X TMSB4X 12903147 12905267 6894.269 6306.36333 7776.126667 0.0392 0.1513762 not analyzed

X TRAPPC2 13640282 13662675 9.250967 7.61022222 11.71216667 0.0018 0.01668228 9/9

X GEMIN8 13934766 13957956 6.921387 6.29111111 7.866833333 0.0018 0.01668228 9/9

X CA5BP 15602960 15631395 5.98988 4.71577778 7.901166667 0.0027 0.01924878 9/9

X ZRSR2 15718495 15751303 10.67819 9.06488889 13.09816667 0.0018 0.01668228 not analyzed

X SYAP1 16647628 16690727 17.53308 16.3571111 19.29716667 0.0392 0.1513762 9/9

X CXorf15 16714476 16772561 10.94431 9.56222222 13.0175 0.0113 0.06981842 9/9

X EIF1AX 20052557 20069887 45.26513 37.8303333 56.41733333 0.0018 0.01668228 9/9

X EIF2S3 23982986 24006851 133.5755 123.438667 148.781 0.0216 0.1177573 9/9

X GPR34 41433170 41441474 1.32278 1.01244444 1.788666667 0.0216 0.1177573 0/9

X CDK16 46962472 46974336 17.27303 15.7872222 19.502 0.0292 0.1503465 7/7

X KDM5C 53237229 53271329 23.22696 17.8846667 31.24066667 0.0018 0.01668228 9/9

X SMC1A 53417795 53466343 32.10438 27.8975556 38.41466667 0.008 0.05295961 7/9

X FOXO4 70232724 70240109 1.87224 1.74288889 2.0665 0.0392 0.1513762 3/9

X RPS4X 71409178 71413866 1781.525 1419.05078 2325.235333 0.0018 0.01668228 9/9

X TSIX 72928765 72965791 3.771327 0.175 9.166 0.0018 0.01668228 not analyzed

X XIST 72957220 72989313 12.37481 0.12877778 30.74416667 0.0018 0.01668228 9/9

X NGFRAP1 102517924 102519657 3.75198 2.45233333 5.701666667 0.0392 0.1513762 0/9

X ALG13 110811002 110820279 5.858393 5.22677778 6.805833333 0.0392 0.1513762 5/9

* Carrel, L. & Willard, H.F. X-inactivation profile reveals extensive variability in X-linked gene expression in females. Nature (2005).

The value is the fraction of hybrids that expressed the gene from the inactivated X chromosome in that report (out of 9 or fewer hybrids tested).
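As a reading aid, the ave_rpkm, male_rpkm and female_rpkm columns can be related to each other. The sketch below rests on an assumption that the table does not state: the listed averages are consistent with group sizes of 9 males and 6 females, which is inferred here from the numbers alone. The helper names are hypothetical, and the female-to-male ratio is only a descriptive summary, not the test behind the pvalue and qvalue columns.

```python
# (gene, ave_rpkm, male_rpkm, female_rpkm) copied from Supplementary Table 18
rows = [
    ("PRKX", 9.560273, 8.15844444, 11.66283333),
    ("KDM5C", 23.22696, 17.8846667, 31.24066667),
    ("XIST", 12.37481, 0.12877778, 30.74416667),
]

# Assumption inferred from the values, not stated in the table.
N_MALE, N_FEMALE = 9, 6


def weighted_mean(male_rpkm, female_rpkm):
    """Overall mean RPKM implied by the two group means."""
    total = N_MALE * male_rpkm + N_FEMALE * female_rpkm
    return total / (N_MALE + N_FEMALE)


def female_to_male_ratio(male_rpkm, female_rpkm):
    """Descriptive fold difference between the two group means."""
    return female_rpkm / male_rpkm


for gene, ave, male, female in rows:
    print(gene,
          "reported ave:", ave,
          "implied ave:", round(weighted_mean(male, female), 4),
          "female/male:", round(female_to_male_ratio(male, female), 2))
```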


46

Supplementary Table 19. 1,809 TBM sites

SuppTable19_TBM_List.xls


Chr position wt snp rna_wt rna_snp rna_var% dna_wt dna_RD rna_A rna_C rna_G rna_T is_dbsnp

chr1 5636 T C 8 7 0.4667 0 11 0 7 0 8 rs2691318

chr1 5636 T C 5 6 0.5455 0 26 0 6 0 5 rs2691318

chr1 5636 T C 4 6 0.6 1 17 0 6 0 4 rs2691318

chr1 6837 C T 50 21 0.2958 0 26 0 50 0 21 rs1045474

chr1 6837 C T 59 18 0.2338 1 30 0 59 0 18 rs1045474

chr1 6837 C T 44 11 0.2 0 23 0 44 0 11 rs1045474

chr1 8231 T C 2 14 0.875 0 7 0 14 0 2 rs4849248

chr1 8231 T C 5 4 0.4444 0 9 0 4 0 5 rs4849248

chr1 8231 T C 2 14 0.875 1 25 0 14 0 2 rs4849248

chr1 8231 T C 4 13 0.7647 0 12 0 13 0 4 rs4849248

chr1 8231 T C 2 9 0.8182 0 13 0 9 0 2 rs4849248

utr_gene utr_strand allele_change individual is_validated_by_2ndRNAseq

||uc009viv.1,uc009viw.1,uc009vix.1 -,-,- AG AK5 yes



||LOC100288778,WASH7P,uc009vit.1,uc009viu.1,uc001aae.2,uc001aab.2,uc009viq.1,uc009vir.1,uc001aac.2,uc009viv.1,uc009viw.1,ENSG00000146556,uc001aaf.1,uc009vix.1,uc001aag.1,uc009viy.1,uc009viz.1,uc009vjc.1,uc009vjd.1,uc001aai.1,uc001aah.2,uc009vja.1,uc009vjb.1 -,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,-,- GA AK20 yes



||uc009viu.1,uc001aab.2,uc001aac.2,uc009viz.1,uc001aai.1,uc009vja.1,uc009vjb.1 -,-,-,-,-,-,- AG AK3 yes
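In the rows shown, the rna_var% column matches rna_snp / (rna_wt + rna_snp), reported as a fraction rather than a percentage. The short sketch below checks that relationship on values copied from the excerpt. The formula is inferred from the numbers, not quoted from the authors, and `rna_var_fraction` is a hypothetical helper name.

```python
# (rna_wt, rna_snp, reported rna_var%) copied from Supplementary Table 19
rows = [
    (8, 7, 0.4667),
    (5, 6, 0.5455),
    (4, 6, 0.6),
    (50, 21, 0.2958),
    (2, 14, 0.875),
]


def rna_var_fraction(rna_wt, rna_snp):
    """Fraction of RNA reads carrying the variant base."""
    return rna_snp / (rna_wt + rna_snp)


for rna_wt, rna_snp, reported in rows:
    computed = rna_var_fraction(rna_wt, rna_snp)
    print(rna_wt, rna_snp, "reported:", reported, "computed:", round(computed, 4))
```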






47

Supplementary Table 20. 580 Allele Specific Expression sites

SuppTable20_Allele_Specific_Expression_Sites.xls


Genome Read-Counts

chr pos wt snp dbsnp annotation wtAA snpAA blosum is_nssnp AK3 AK5 AK7 AK4 AK6 AK14 AK20 AK_N1 AK_N2 AK_N5 AK_N6 AK_N7 AK_N9 AK_N14 AK_N15

1 1411854 G A rs860213 ATAD3B R Q 1 nsSNP 1:0 1:5 3:0 5:5 1:1 4:1 0:3 51:18 45:22 32:12 51:15 37:34 49:0 46:0 18:30

1 1421028 C T rs72468211 ATAD3B P S -1 nsSNP 3:3 3:3 3:4 0:4 3:1 6:6 2:6 10:15 5:2 8:7 7:2 9:4 8:4 9:4 11:4

1 1640647 T C rs17845218 CDK11B,CDK11A H R 0 nsSNP 20:40 26:22 34:29 21:28 30:13 33:18 18:22 1:0 0:0 0:0 0:0 0:1 2:1 3:0 0:0

1 1640657 A G rs1059830 CDK11B,CDK11A C R -3 nsSNP 18:38 27:22 29:23 23:24 29:10 37:17 19:23 1:0 0:0 0:0 0:0 0:1 2:1 2:0 0:0

1 7832324 C T rs2890565 UTS2 S N 1 nsSNP 31:0 6:13 31:0 21:8 15:0 23:0 18:0 65:0 64:0 0:62 69:0 30:32 42:24 86:0 77:0

1 7836017 G A rs228648 UTS2 T M -1 nsSNP 9:17 0:33 22:16 16:11 33:0 13:15 16:11 56:62 55:72 138:0 60:61 110:0 52:35 56:49 113:0

1 9755922 A T novel CLSTN1 F Y 3 nsSNP 22:0 10:14 27:0 17:0 11:0 12:0 7:0 31:0 36:0 19:32 20:16 31:0 11:14 39:0 37:0

1 11773514 C T rs2274976 MTHFR R Q 1 nsSNP 13:0 18:0 19:0 12:0 9:4 6:9 13:0 118:0 105:0 64:62 101:0 102:0 96:0 52:47 84:0

1 19285848 T A rs12584 UBR4 M L 2 nsSNP 0:37 18:21 0:32 0:14 0:19 6:6 0:14 34:18 0:53 33:24 28:33 0:63 27:17 0:46 0:45

Transcriptome Read-counts

AK3 AK5 AK7 AK4 AK6 AK14 AK20 AK_N1 AK_N2 AK_N5 AK_N6 AK_N7 AK_N9 AK_N14 AK_N15 # of heterozygote individuals # of informative individuals # of suggestive individuals AS_score FisherPvalue

0::61 0::49 48::0 0::29 0::40 21::15 0::57 10::11 14::18 5::5 22::21 0::29 23::25 26::0 7::28 6 6 5 0.298 3.61E-10

52::20 64::11 78::11 29::8 46::7 47::6 37::13 27::11 16::5 6::5 28::11 28::8 38::14 33::9 15::10 7 7 3 -0.387 1.83E-06

24::96 31::72 26::86 13::49 19::64 15::61 14::61 19::31 7::34 13::24 28::35 19::34 7::52 15::48 9::29 7 7 7 0.289 2.72E-20

48::100 48::78 47::93 36::64 52::68 37::71 46::59 37::38 11::43 27::31 49::51 34::40 20::56 51::48 32::29 7 7 5 0.231 5.39E-10

0::0 0::0 0::0 0::0 28::0 21::0 14::0 0::0 31::0 0::57 43::0 0::12 7::24 0::0 32::0 4 2 2 0.4 2.56E-06

0::0 0::0 0::0 0::0 20::0 3::0 17::0 0::0 15::0 40::0 16::8 8::1 8::4 1::0 20::0 10 4 3 -0.381 4.78E-08

51::0 32::24 27::0 57::0 38::0 48::0 38::0 73::0 36::1 19::16 22::12 55::0 45::0 61::0 42::0 4 4 3 -0.306 1.69E-06

13::0 17::0 11::0 18::0 0::8 7::7 11::0 19::0 23::0 6::10 20::1 13::0 29::1 5::23 16::0 4 4 2 0.406 9.57E-06

0::113 53::65 0::102 0::131 0::99 64::57 0::114 66::74 1::144 58::74 52::45 2::98 0::129 0::98 0::124 6 6 4 0.217 4.03E-08


48

Supplementary Table 21. Contig list generated by de novo assembly

SuppTable21_Denovo_Contigs_List.xls



49

Supplementary Table 22. Alignment results for de novo assembled contigs

SuppTable22_Denovo_Alignment_Result.xls


Contig ID: 855418

Category: Whole genome (AK3 AK4 AK5 AK6 AK7 AK9 AK14 AK20), Transcriptome (AK3 AK4 AK5 AK6 AK7 AK14 AK20)

AK3 AK4 AK5 AK6 AK7 AK9 AK14 AK20 AK3 AK4 AK5 AK6 AK7 AK14 AK20

# of reads 220 277 261 245 270 63 239 228 0 0 0 0 0 0 0

# of bases 16720 23927 16996 20220 20520 9513 22014 16188 0 0 0 0 0 0 0

# of paired matches 116 190 138 160 130 36 158 162 0 0 0 0 0 0 0

# of exact matched reads 220 277 261 245 270 63 239 228 0 0 0 0 0 0 0

# of exact matched bases 16720 23927 16996 20220 20520 9513 22014 16188 0 0 0 0 0 0 0

coverage of exact matches 16.33 23.37 16.6 19.75 20.04 9.29 21.5 15.81 0 0 0 0 0 0 0

# of exact paired-end matched reads(exact) 116 190 138 160 130 36 158 162 0 0 0 0 0 0 0

# of exact paired-end matched bases(exact) 8816 16490 8648 13110 9880 5436 14558 11352 0 0 0 0 0 0 0

coverage of exact paired-end matches 8.61 16.1 8.45 12.8 9.65 5.31 14.22 11.09 0 0 0 0 0 0 0

Category: Nomatch paired-end reads (AK3 AK4 AK5 AK6 AK7 AK9 AK14 AK20), Publicly opened human genome (NA10851 NA12878 NA18507 NA19240 ABT KB1 Eskimo YH)

AK3 AK4 AK5 AK6 AK7 AK9 AK14 AK20 NA10851 NA12878 NA18507 NA19240 ABT KB1 Eskimo YH

# of reads 225 224 232 240 269 60 122 194 580 1959 188 201 0 166 102 4906

# of bases 17100 18699 15312 19275 20444 9060 10797 14159 27070 70524 6768 7236 0 11056 7156 158178

# of paired matches 190 194 206 198 244 56 102 178 364 186 0 0 0 100 0 202

# of exact matched reads 169 198 181 221 184 45 111 178 449 199 0 1 0 103 78 656

# of exact matched bases 12844 16423 11636 17671 13984 6795 9736 12733 20414 7164 0 36 0 7828 5476 23214

coverage of exact matches 12.54 16.04 11.36 17.26 13.66 6.64 9.51 12.43 19.94 7 0 0.04 0 7.64 5.35 22.67

# of exact paired-end matched reads(exact) 118 162 130 170 132 32 84 156 272 142 0 0 0 70 0 146

# of exact paired-end matched bases(exact) 8968 13562 8040 13680 10032 4832 7284 10906 11972 5112 0 0 0 5320 0 5110

coverage of exact paired-end matches 8.76 13.24 7.85 13.36 9.8 4.72 7.11 10.65 11.69 4.99 0 0 0 5.2 0 4.99
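The coverage values for contig 855418 appear to be the matched-base counts divided by the contig length. The sketch below illustrates this under an assumption: a contig length of 1024 bp, which is not given in the table and is inferred here because it reproduces the listed coverages to two decimals. `CONTIG_LENGTH_BP` and `coverage` are hypothetical names.

```python
# Assumption: length of contig 855418, inferred from bases / coverage.
CONTIG_LENGTH_BP = 1024

# (matched bases, reported coverage) copied from Supplementary Table 22
pairs = [
    (16720, 16.33),  # exact matched bases
    (23927, 23.37),  # exact matched bases
    (9513, 9.29),    # exact matched bases
    (8816, 8.61),    # exact paired-end matched bases
    (12844, 12.54),  # exact matched bases, second block
]


def coverage(matched_bases, contig_length_bp=CONTIG_LENGTH_BP):
    """Mean depth over the contig: matched bases per contig base."""
    return matched_bases / contig_length_bp


for bases, reported in pairs:
    print(bases, "reported:", reported, "computed:", round(coverage(bases), 2))
```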

