Close
About
FAQ
Home
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Novel statistical and computational methods for analyzing genome variation
(USC Thesis Other)
Novel statistical and computational methods for analyzing genome variation
PDF
Download
Share
Open document
Flip pages
Copy asset link
Request this asset
Transcript (if available)
Content
Title
Novel statistical and computational methods for
analyzing genome variation
Author
Yunfei Guo
Conferring Program
University of Southern California Graduate School
Degree being conferred
PhD in Biostatistics
Degree conferral date
August 6, 2016
Acknowledgements
I wish to acknowledge my indebtedness to my mentor Dr. Kai Wang for his instrumental
guidance and inspiring encouragement during the entire project. Special thanks are due to Dr.
David Conti and Dr. Paul Thomas for their openness, criticism and support. I am also deeply
indebted for the support and understanding of my colleagues and friends, especially Chengliang
Dong and Hui Yang. I am particularly grateful to my fiancé, Xiaolu Han, whose strength, and
care made the critical difference for completion of this effort.
Table of contents
Chapter 1. Automated, Accurate And Flexible Variant Detection ....................................................... 1
Abstract .................................................................................................................................................... 1
INTRODUCTION .................................................................................................................................... 1
MATERIAL AND METHODS ................................................................................................................ 3
Workflow .............................................................................................................................................. 3
Built-in parallel processing ................................................................................................................ 6
Support for cluster environment ....................................................................................................... 9
Consensus result generation ............................................................................................................ 9
Analysis summary ............................................................................................................................ 10
RESULTS .............................................................................................................................................. 11
Consistency and accuracy evaluation ........................................................................................... 11
Aligner consistency evaluation ....................................................................................................... 15
Mendelian error evaluation ............................................................................................................. 16
Performance evaluation .................................................................................................................. 19
Mendelian disease disease-contributory variant identification .................................................. 21
DISCUSSION ........................................................................................................................................ 22
Chapter 2. Framework for prioritizing GWAS loci with epigenetic features .................................... 24
ABSTRACT ........................................................................................................................................... 24
INTRODUCTION .................................................................................................................................. 24
IMPLEMENTATION ............................................................................................................................. 26
Architecture ....................................................................................................................................... 26
Input and output ................................................................................................................................ 26
CONCLUSIONS ................................................................................................................................... 32
Chapter 3. Fast De Novo Genome Assembly, Evaluation, Gap filling and Error Correction ........ 33
ABSTRACT ........................................................................................................................................... 33
INTRODUCTION .................................................................................................................................. 33
RESULTS .............................................................................................................................................. 34
Generation of de novo genome assembly on a Chinese genome ........................................... 34
Gap filling on the reference genome by de novo assembly ....................................................... 43
Characterization of structural variants and novel sequences .................................................... 45
DISCUSSION ........................................................................................................................................ 55
METHODS............................................................................................................................................. 56
Generation of sequence data ......................................................................................................... 56
BioNano genome mapping ............................................................................................................. 57
Karyotyping for chromosomal abnormality ................................................................................... 57
de novo draft genome assembly by Falcon ................................................................................. 58
Evaluation of draft genome quality by BioNano........................................................................... 60
Construction of physical map by BioNano nanochannel array ................................................. 60
Improve draft genome assembly by hybrid scaffolding .............................................................. 61
Consensus quality and scaffolding accuracy analysis ............................................................... 62
Short read polishing ......................................................................................................................... 62
Gap filling in GRCh38 ...................................................................................................................... 62
Finding novel sequences and calculating genome coverage .................................................... 65
Comparative analysis using NIST Chinese genome .................................................................. 66
Functional analysis on novel sequence elements ....................................................................... 66
Structural variation calling (Illumina whole-genome sequencing) ............................................. 68
Structural variation calling (BioNano genome map) ................................................................... 68
Structural variation calling (SMRT long reads) ............................................................................ 68
References ................................................................................................................................................ 69
Chapter 1. Automated, Accurate And Flexible Variant Detection
Abstract
Next-generation sequencing (NGS) technology has greatly improved our ability to identify
disease-contributory variants for Mendelian diseases. However, users are often faced with
issues such as software compatibility, complicated configuration, and no access to high-
performance computing facility. More importantly, discrepancies exist among different aligners
and variant callers. We developed a computational pipeline, SeqMule, and a number of helper
methods, to perform automated variant calling from NGS data on human genomes and exomes.
SeqMule integrates computational-cluster-free parallelization capability built on top of the variant
callers, and facilitates normalization/intersection of variant calls to generate consensus set with
high confidence. SeqMule integrates 5 alignment tools, 5 variant calling algorithms and accepts
various combinations all by one-line command, therefore allowing highly flexible yet fully
automated variant calling. In a modern machine (2 Intel Xeon X5650 CPUs, 48GB memory),
when fast turn-around is needed, SeqMule generates annotated VCF files in a day from a 30X
whole-genome sequencing data set; when more accurate calling is needed, SeqMule generates
consensus call set that improves over single callers, as measured by both Mendelian error rate
and consistency. In addition, SeqMule supports Sun Grid Engine for parallel processing, offers
turn-key solution for deployment on Amazon Web Services, allows automated quality check,
Mendelian error check, consistency evaluation, variant normalization, HTML-based summary of
results. SeqMule is available at http://seqmule.openbioinformatics.org.
INTRODUCTION
The development of next-generation sequencing (NGS) technologies has dramatically changed
the landscape of human genetics research
1-6
. Identifying disease-contributory variants for
various human genetic diseases will greatly improve diagnosis and facilitate development of
therapies.
Page 1 of 73
However, besides discrepancies associated with sequencing platforms
7
, there is still
considerable variation across variant calling algorithms; for example, we previously reported
SNV concordance of only 57.4% for 5 bioinformatics pipelines (SOAP, BWA-GATK, BWA-
SNVer, GNUMAP, BWA-SAMtools), while 0.5-5.1% variants were called as unique to each
pipeline
8
. Performance of aligners also varies under different sequencing error rates and indel
distribution
9
. Yet few published pipelines offer two or more alternative aligner and variant calling
programs
10-14
. While some workflow management systems do provide more flexibility
10-13
, local
installation and configuration is highly challenging for average users. Therefore, there is a
strong community need for a comprehensive and flexible pipeline that allows easy execution
and integration of multiple tools.
There are multiple challenges for building such a pipeline. Installation and configuration poses
the first problem, and the severity of this problem is evidenced by numerous attempts to
address it
15-17
. Software libraries such as Bioconductor
15
and Bioperl
16
, and web-based
interfaces [e.g.
17
] all aim to provide ease of access. The diversity of bioinformatics tools has
paradoxically given rise to one more layer of complexity. In a typical variant calling analysis, 4 to
6 tools might be required to perform QC (quality check), alignment, sorting, and variant calling.
Ideally, the output from one program can be fed into another one as is. In real-world scenarios,
this might not be the case. For instance, GATK does not accept output from SOAP2 aligner.
Another issue is that constant and asynchronous development of the software would, from time
to time, lead to loss of compatibility and break down of what was working. Even if compatibility
issues can be solved, reproducibility will be difficult to maintain across highly heterogeneous
pipelines. A pre-packaged virtual machine (VM) provides users with an alternative to address
this problem
18-20
. However, having two operating systems running on the same machine means
at least 1 CPU core and a few gigabytes of memory must be reserved for the host OS, and
unavoidably limits the computational resources available for the guest system. Adding another
layer of operating system also increases computational overhead by 13% to 28% compared
with performance on a native system
19
. Finally, VM implementation reduces flexibility of
software tools as a bundle and becomes difficult to deploy for average users without informatics
skills.
Page 2 of 73
To address the discrepancy issues without compromising ease of use, performance and
reproducibility, we developed a computational pipeline, SeqMule, which performs a series of
automated steps for identifying variants from NGS data. It integrates 5 alignment tools, 5 variant
calling algorithms, and allows various combinations of them via modifying a text-based, human-
readable configuration file. The intersection of sets of variants from different combinations of
tools can be extracted to achieve higher accuracy, both in terms of sensitivity and specificity.
Most setup procedure and analyses can be done with one-line commands. SeqMule also
provides cluster-free parallel capability built on top of the variant callers, which could drastically
reduce the time for variant calling by about an approximately linear factor of N (N is number of
CPU cores). As far as we know, only GATK Queue and FreeBayes provide such parallelism
among variant callers, but users have to manually set up a Queue or generate a region file for
parallel processing. At the end of analysis, an HTML-based report will be prepared to show an
overview for every step of the analysis, which helps assure users of data quality and appropriate
analysis settings. We believe that SeqMule will be useful to easily and efficiently obtain variant
calls from NGS data, and improve variant calling consistency and accuracy.
MATERIAL AND METHODS
Workflow
Currently, SeqMule integrates 5 popular mapping tools: BWA (including BWA-backtrack and
BWA-MEM), Bowtie, Bowtie2, SOAP2, SNAP
21-25
, 5 variant calling algorithms: GATK (including
GATKLite and version 3), SAMtools, VarScan 2, Freebayes, SOAPsnp
21,26-28
and some
accessory programs: FastQC, Picard, tabix and VCFtools
29
. Tools were selected based on their
popularity, ease of use and performance. BWA, Bowtie, Bowtie2 and SOAP2 are the most
popular aligners for NGS analysis; SNAP was the fastest aligner at the time when SeqMule was
developed; GATK, SAMTools, FreeBayes and SOAPsnp are among the most popular variant
callers; VarScan2 is the most widely used variant caller that can perform both germline and
somatic variant calling; FastQC is the most popular tool for NGS quality control and generates
HTML-based reports automatically; Picard, tabix and VCFtools are the de facto gold standard
with robust implementation for BAM/VCF manipulation in NGS analysis. Of note, SNAP can be
orders of magnitude faster compared with the popular aligner BWA-MEM
25,30
. All tools and
related packages, except for those without open-source license, can be downloaded and
installed by one single command with no need for root access. We actively maintain a list of
programs and their related source code, database files to make sure there is no compatibility
conflict under default settings.
Page 3 of 73
A workflow scheme is shown in Figure 1. SeqMule takes FASTQ, gzipped FASTQ or BAM as
input. Quality scores in FASTQ can be encoded either in Phred+33 scheme or in Phred+64
scheme. For FASTQ, SeqMule can automatically decide which scheme is used by examining
the beginning of input. Other necessary files for analysis, including reference genomes,
alignment indexes, known variant databases, can be downloaded via one-line built-in command
from SeqMule website. In a typical pipeline, input data goes through QC, alignment, sorting,
indexing, PCR duplicate removal, variant calling and report generation. All steps use default
parameters. Reads with mapping quality larger than 30 (20 for SNAP) will be used for variant
calling by default. Variants will be filtered following either recommended best practice (for GATK)
or by depth threshold of 10 unless otherwise stated.
Page 4 of 73
Figure 1. Scheme of SeqMule workflow and currently available tools in each step. Dashed line marks non-
mandatory steps, solid line marks mandatory steps.
Page 5 of 73
Multi-sample variant calling can be used if multiple sets of input data from the same lineage are
supplied. BAM and VCF files are generated in the end. The VCF files are ready for downstream
annotation and filtering analysis, which means users can feed them either to locally installed
ANNOVAR program
31
or to wANNOVAR web server
17
.
To allow various combinations of aligners and variant callers, SeqMule uses a specifically
designed configuration file. The configuration file consists of key, value pairs in the form of
‘key=value’. Keys are categorized as global options, programs and local options, each with
different prefixes. The prefixes for program keys also determine whether this program is
mandatory or exclusive or mandatory and exclusive at a particular step. Programs can be either
enabled or disabled by assigning 1 or 0 to the value of the corresponding keys. All settings are
written in plain English with embedded help documentation alongside. Over 40 different
combinations of aligners and variant callers have been tested to produce valid output and their
configurations are readily available in ‘misc/predefined_config’ directory under SeqMule
installation path. More combinations are left for the users to explore.
Built-in parallel processing
Analysis can be run in a parallel fashion either via native support by variant callers (e.g. multi-
processing option inside GATK) or via SeqMule’s built-in multi-processing framework. As is
shown in Figure 2, when SeqMule’s built-in multi-processing is enabled, it splits the genome into
multiple equally sized bins, writes those bins into N BED files (N for number of CPU cores), and
launches multiple processes to call variants over each region.
Page 6 of 73
Figure 2. Bin generation and assignment. Assume we want to run our analysis using 3 processes. We need
to split the chromosomes (blue) into 3 sets of bins. Bin size will be dynamically determined according to the
total size of all chromosomes or user-defined regions. There is a minimum and maximum bin size to
minimize the effects of a read spanning two bins and uneven coverage. In practice, the size limits shown in
the figure work well from a couple of genes up to the whole genome. Bins are generated by walking through
all regions to be analyzed. Subsequently, bins are assigned to each process by rotations. In the end, each
process is expected to deal with approximately same numbers of reads.
Each bin will be large enough to minimize the overhead costs of small bins, and small enough to
have fine-grained genomic intervals. Currently the minimum bin size is set to be 50 Kbp while
the maximum is 1 Mbp. All bins are assigned to each process by rotation (Figure 2) so that two
adjacent bins will not be analyzed by the same process. In contrast to assignment without
rotation, the assignment-by-rotation strategy is able to assign approximately equal number of
reads to each process (Figure 3). As number of reads is directly associated with processing
time, this assignment-by-rotation strategy will avoid having too many reads processed by one
thread due to uneven coverage
32
, therefore, all processes are expected to finish in similar
amount of time.
Page 7 of 73
Figure 3. Read count distribution for 12 process assignments. Chromosome 1 is split into 12 groups of regions following
assignment-by-rotation and assignment-no-rotation schemes, respectively. Reads from NA12878 whole genome sequencing
data (58X) are counted for each group of regions.
We compared variant calling time consumption under different max bin sizes using sample
NA12878 from 1000 Genomes Project. Variants were called by SAMtools with 12 concurrent
processes. Table 1 shows that the standard deviation increases as the max bin size grows from
500 Kbp to 20 Mbp. However, the maximum running time of child processes, or simply put, the
overall variant calling time, does not necessarily increase as the max bin size becomes larger.
Variant calling is fastest (194.9 minutes) when max bin size is 1 Mbp. All processes will be
spawned by fork on a single machine, and so there is no need for cluster infrastructure.
Different processes communicate with each other through a script file recording the status and
command of each step. An execution manager (the parent process) is responsible for
monitoring, starting and stopping all processes. Because some analyses may take hours or
days to finish, the execution manager is designed to be able to stop and then resume the
pipeline at any step. This feature comes in handy when users have to adjust some parameters
(e.g. memory limit) after SeqMule aborts due to errors.
Page 8 of 73
Table 1. Variant calling time consumption and max bin size in quick mode. Sample NA12878 from 1000
Genomes Project was used to benchmark variant calling time with different max bin sizes in quick mode.
Variants were called by SAMtools with 12 concurrent processes. K stands for thousand, M for million. Row
minimums are highlighted by bold face. The statistics was calculated based on running time of all child
processes. The result shows that the smaller the max bin size is, the smaller the standard deviation is, and
that maximum running time does not increase linearly as max bin size grows. The shortest maximum
running time is obtained when max bin size is 1 Mbp.
Max Bin Size (bp) 500K 1M 5M 10M 20M
Average running time (min) 209.2 191.4 191.5 210.0 209.6
Minimum running time (min) 207.7 188.0 180.5 189.6 188.6
Maximum running time (min) 211.4 194.9 199.4 233.1 253.0
Standard Deviation 1.3 1.8 5.4 12.3 16.5
Support for cluster environment
SeqMule supports analyzing large number of samples via Sun Grid Engine (SGE), a popular job
scheduling system in cluster environment. Internally, when SeqMule generates tasks for each
step for every sample, it also specifies their dependencies. The relationship results in a task
dependency graph (TDG) with tasks as nodes and task interactions as edges. SeqMule then
uses topological sorting to rearrange these tasks in linear fashion such that if task A is
dependent on output from task B, task A will be put behind task B. Upon running, SeqMule runs
tasks one by one and will not start a task unless all its dependent tasks are finished. With this
design in place, many samples can be processed in parallel due to their nature of independence.
When SGE is available, SeqMule submits each task to SGE and waits for it to finish.
Consensus result generation
For SNVs, variant calls with the same chromosome and position fields are considered
overlapping. Only such variants will be combined in consensus records. To combine two
overlapping SNVs, alternative alleles in each record will be put together while the chromosome,
position and reference columns remain unchanged.
Page 9 of 73
For non-SNV variants, primarily consisting of indels, records with same chromosome, position
and reference allele fields are considered overlapping. After combining, alternative alleles from
all records will be put together. Same indels might be presented in different ways by different
algorithms. For example, both ‘TGGG TGG‘ and ‘TG T’ can denote deletion of a G allele, but it
is hard to tell which G is deleted in the first case
8
. In this circumstance, we apply 'variant
normalization' in Vt (https://github.com/atks/vt) to first normalize all VCF files to the same
standard before combining them. During normalization, all alleles will be left (5’) and right (3’)
trimmed to remove superfluous nucleotides and be moved to the leftmost positions.
Nevertheless, it is expected that concordance rate between different tools for indels are less
than that for SNVs
8
.
When multiple samples are present in the same VCF file, it will be split into single-sample VCF
before merging. After all VCFs from the same sample are merged, the resulting VCFs will be
combined into one multi-sample VCF. Variant quality and genotype quality from the first input
file will appear in the combined VCF. The files to be merged will be sorted based on the
following priority: GATK>SAMtools>FreeBayes>VarScan>SOAPsnp. The priority is solely for
the purpose of reproducibility of quality score assignment. It will not affect consensus call results
as it only applies to the situation where mutual agreement exists. We assign variants ‘PASS’ in
the filter field only if they are unfiltered in any one of the input VCF files. Consensus calls can be
generated in different ways. Users are able to specify the minimum number of files in which
each consensus call appears. For example, "2-out-of-4" consensus calls represent the calls
found in at least 2 out of 4 input files.
Analysis summary
When an analysis starts or finishes, SeqMule automatically examines the FASTQ, BAM, VCF
and generates an HTML-based report showing various statistics. Important statistics include,
but are not limited to, coverage curve, average coverage, coverage over capture region,
percentage of capture region covered by at least N reads, Ti/Tv ratio,
Heterozygote/Homozygote ratio. FASTQ statistics are calculated by FastQC; alignment
statistics come from SAMtools; coverage statistics are done by SAMtools and SeqMule’s built-in
program; VCFtools output most of the variant statistics. In particular, coverage in a region
specified by a BED file is calculated as follows: SAMtools ‘depth’ subprogram is called over the
region to output base-level depth, then number of bases with different read depths are tabulated
to produce histogram-like statistics.
Page 10 of 73
Besides automatic generation of summary statistics, users can manually feed SeqMule with a
multi-sample VCF from a family trio and obtain Mendelian error rates. The VCF file will first be
converted to MAP and PED format used in PLINK
33
to record all genotypes in a more compact
fashion. Subsequently pedigree information, based on user specification, is added to the PED
file. At last, SeqMule iterates through all variants in the offspring and counts number of calls with
0, 1, 2 IBD (identical by descent) alleles from father or mother, number of allele drop-in events,
number of allele drop-out events. The Mendelian error rate calculation is implemented with
VCFtools, and a modified version of mendelFix
34
.
In accordance with its multi-caller feature, SeqMule also generates a Venn Diagram for users to
inspect overlapping of variant sets from different callers. Variants are split into SNVs and non-
SNVs. Chromosome, position, reference allele and alternative allele of SNVs are used to
determine whether two variants overlap. For non-SNVs, besides variant-normalization, records
will first be intervalized by ignoring reference alleles and alternative alleles, and then extended
10 bp towards both ends. The extended intervals for each non-SNV are used to determine
whether two variants overlap. This feature is implemented with VennDigram package
35
and
custom Perl code.
RESULTS
Consistency and accuracy evaluation
To demonstrate the consistency of calls on identical samples, we analyzed 3 data sets for
NA12878 obtained from the 1000 Genomes project
36
(sequenced by HiSeq and hereafter
referred to as HiSeq-1000G) and from AllSeq (sequenced by HiSeq X Ten and hereafter
referred to as HiSeqX-D, HiSeqX-J). For each data set, 4 variant callers, FreeBayes, SAMtools,
VarScan and GATK HaplotypeCaller (GATK-HC) were used to call variants, then the results
from 4 callers were merged in different ways to obtain consensus calls. For each variant calling
method, we plotted a Venn diagram comparing the overlapping of variants among the 3 data
sets (Figure 4). The concordance rates range from 91.89%~94.37% among all variant callers
and consensus calls while GATK HaplotypeCaller calls give the highest concordance (94.37%).
Page 11 of 73
Figure 4. Consistency evaluation using 3 sequencing data sets for NA12878. The 3 data sets were generated
by 1000 Genomes project and AllSeq (HiSeq-D, HiSeq-J). We used the same alignment algorithms (BWA-
MEM) as the primary focus is to compare variant callers. For each data set, 4 variant callers were used to call
variants (A), then the results from 4 callers were merged in different ways to get consensus calls (B). For
each variant calling method, we plotted a Venn Diagram comparing overlapping of variants among the 3 data
sets. The percentages show the proportion of variants shared by 3 data sets.
As common variants are usually easy to detect, we later used ANNOVAR to filter the variants by
minor allele frequency (MAF). Variants with MAF>1% (1000 genomes project, October, 2014
release, European population) were discarded. The results (Figure 5) show that for rare and
novel variants, consistency rates are considerably lower, ranging from 69.10%~87.69%. It
suggests that most of the inconsistency in variant calling can be attributed to rare and novel
variants. 4-out-of-4 consensus calls show the highest concordance (87.69%) under this
scenario. The overall consistency on rare variants is summarized in Figure 6. Some variant
callers, e.g. GATK, utilizes prior information about known common and rare variants to filter out
low-quality variant calls
37
. Other variant callers, e.g. FreeBayes, SAMtools, do not explicitly use
prior information about common variants in their model; however, the parameters could have
been optimized to guarantee reasonable performance for the majority of variants, i.e., the
common variants.
Page 12 of 73
Figure 5. Consistency evaluation using rare variants from 3 sequencing data sets for NA12878. The 3 data
sets were generated by 1000 Genomes project and AllSeq (HiSeq-D, HiSeq-J). We used the same alignment
algorithms (BWA-MEM) as the primary focus is to compare variant callers. For each data set, 4 variant callers
were used to call variants, variants with MAF>1% were dropped. The remaining variants from 4 callers were
merged in different ways to get consensus calls. For each variant calling method, we plotted a Venn Diagram
comparing overlapping of variants among the 3 data sets (A,B). The percentages show the proportion of
variants shared by 3 data sets.
Figure 6. Consistency evaluation summary using rare variant calls from NA12878.
Page 13 of 73
In addition to consistency, we also examined the precision, sensitivity and specificity. The 1000
genomes data set for NA12878 (HiSeq-1000G) was used in this case. All consensus calls and
variants called by individual callers were compared with the gold standard from Genome In a
Bottle (GIB) project and an Illumina HumanOmni2.5-8v1 SNP array (Figure 7, Figure 8). For the
comparison with GIB gold standard, precision rates range from 97.1% to 99.5% with GATK-HC
being the highest; sensitivity rates range from 97.1% to 98.4%, with GATK-HC being the highest;
specificity rates are all close to 100%. However, as the GIB gold standard was generated by
GATK-HC, there is no doubt that the result is biased in favor of GATK. For the comparison with
SNP array, precision rates range from 99.6% to 99.8% with 4-out-of-4 consensus calls being the
highest; sensitivity rates range from 96.7% to 97.4% with 2-out-of-4 consensus call and GATK-
HC both being the highest; specificity rates are all above 99.5%. These numbers show that all
individual callers and consensus calls can give us accurate results in the regions covered by
GIB gold standard and SNP array, indicating that consensus calling may only have minimal
impact on the variants in GIB gold standard.
Figure 7. Compare individual callers with Genome In a Bottle Gold Standard. Results from 4 individual
variant callers, GATK HaplotypeCaller (gatk_hc), Freebayes, SAMtools and VarScan were uploaded to
http://www.bioplanet.com/gcat to be compared against gold standard from Genome In a Bottle project
(Version 2.18) and Illumina HumanOmni2.5-8v1 SNP array.
Page 14 of 73
Figure 8. Compare consensus results with Genome In a Bottle Gold Standard. Consensus calls were
generated by combining results from individual variant callers in different ways, namely 2-out-of-4, 3-out-of-4
and 4-out-of-4. Then they were uploaded to http://www.bioplanet.com/gcat to be compared against gold
standard from Genome In a Bottle project (Version 2.18) and Illumina HumanOmni2.5-8v1 SNP array. Result
from GATK HaplotypeCaller (GATK-HC) is also shown as a reference.
Aligner consistency evaluation
To our knowledge, there is little comparison in literature regarding different aligners in the
context of variant calling. It has been reported that there is differential performance for aligners
when sequencing error rates and indel frequencies and sizes are varying
9
, so it is sensible to
expect discrepancies associated with variant calling for different aligners. We used 2 variant
callers, SAMtools and FreeBayes, and 3 aligners, BWA-MEM, Bowtie2 and SNAP, for each
caller to evaluate concordance of variant calling on an exome data set
38
. Results in Figure 9
show that concordance rates for SNVs are 77.31% for SAMtools, 78.43% for FreeBayes,
respectively; concordance rates for non-SNVs (after normalization and extension) are 49.73%
for SAMtools, 50.61% for FreeBayes, respectively. While there is no previous study regarding
aligner consistency evaluation, these results suggest that aligners may account for a
considerable amount of difference in variant calling. The above results have been published in
the original SeqMule paper
39
.
Page 15 of 73
Figure 9. Variant concordance for different aligners. An exome data set was used to do alignment and call
variants. For each variant caller, we used 3 aligners to map the reads. Two panels are SNVs, and two panels
are non-SNVs (mostly indels). Panel A shows SNVs from SAMtools, panel B for SNVs from FreeBayes, panel
C for non-SNVs from SAMtools, panel D for non-SNVs from FreeBayes.
Mendelian error evaluation
As family-based analysis is critical in Mendelian disease studies, we went on to examine
Mendelian error rate by different methods in a previously published exome sequencing study on
a family trio
38
. Two types of errors were counted: allele drop-in and allele drop-out. Allele drop-in
(ADI) means that an offspring presents an allele that does not appear in either parent. Allele
drop-out (ADO) means that an offspring misses an allele that should have been inherited from
the parents. Sum of ADI and ADO is the total number of Mendelian errors. Mendelian error rate
is the ratio of all Mendelian errors to all variant calls shared by the family trio (i.e. not marked as
missing in any family member).
Page 16 of 73
The result (Table 2) shows that individual variant callers generally returns 41.1 to 52.6 thousand
trio calls with error rates between 1.78% and 2.88%. Among them, FreeBayes gives us the most
trio calls (52.6K) and GATK-HC gives us the lowest error rate (1.78%). For a single variant
caller, it is hard to predict whether error rate is higher if the overall number of trio calls is small.
In contrast, when we have fewer consensus trio calls, the Mendelian error rate is also lower.
The lowest Mendelian error rate we can get is 0.66%, from 4-out-of-4 consensus calls for this
data set, with 35.7 thousand trio calls. As most of our interests lie in rare and novel variants, we
did variants filtering based on MAF. Variants with MAF>1% (1000 genomes project, October,
2014 release, European population) were discarded. The results show that generally Mendelian
error rates become larger for rare and novel variants, ranging from 0.53% to 6.48% (Table 3,
Figure 10). Again, the lowest error rate was achieved by 4-out-of-4 consensus calls (0.53%),
with nearly five thousand trio calls. Compared with results from individual callers, the
consensus-call approach has more stable performance. It is also more flexible in that users can
reduce Mendelian error rate when needed. In trio-based sequencing studies, researchers may
want to examine the most reliable set of calls first to find low-hanging fruits, before delving deep
into noisier call sets to find additional candidates.
Table 2. Mendelian error rate comparison for variant calling methods. Allele drop in (ADI) means that an
offspring presents an allele that does not appear in either parent. Allele drop out (ADO) means that an
offspring misses an allele that should have been inherited from the parents.
Number of
calls shared
by the family
trio
Number of
allele drop in
Number of
allele drop
out
Total number of
Mendelian errors
Proportion of
Mendelian errors
GATK HaplotypeCaller 44559 215 579 794 1.78%
SAMtools 48586 769 631 1400 2.88%
FreeBayes 52609 797 321 1118 2.13%
VarScan 41102 360 542 902 2.19%
Consensus (2 out of 4) 49741 730 796 1526 3.07%
Consensus (3 out of 4) 45748 228 603 831 1.82%
Consensus (4 out of 4) 35690 6 230 236 0.66%
Page 17 of 73
Table 3. Mendelian error rate comparison for variant calling methods (MAF<1%). Allele drop in (ADI) means
that an offspring presents an allele that does not appear in either parent. Allele drop out (ADO) means that an
offspring misses an allele that should have been inherited from the parents.
Number of calls
shared by the
family trio
Number of
allele drop in
Number of
allele drop
out
Total number of
Mendelian errors
Proportion of
Mendelian errors
GATK HaplotypeCaller 5831 46 51 97 1.66%
SAMtools 9155 471 122 593 6.48%
FreeBayes 13426 644 34 678 5.05%
VarScan 6428 228 88 316 4.92%
Consensus (2 out of 4) 10090 447 159 606 6.01%
Consensus (3 out of 4) 7553 84 87 171 2.26%
Consensus (4 out of 4) 4934 4 22 26 0.53%
Page 18 of 73
Figure 10. Mendelian error rate comparison for variant calling methods (MAF<1%). 2o4 denotes calls from 2-
out-of-4 consensus call set, 3o4 for 3-out-of-4 and 4o4 for 4-out-of-4. Only loci present in all family members
are considered. ADI (allele drop in) and ADO (allele drop out) are counted as Mendelian errors. A trend line is
added for consensus results. Only variants with MAF<1% are shown here.
Performance evaluation
To evaluate the computational resources consumed by SeqMule, we used a whole genome
sequencing data set (mentioned before as HiSeqX-D) with 818 million 151-bp paired end reads
(30X coverage) for benchmarking. BWA-MEM or SNAP was used as the aligner. BWA-MEM is
by far the most popular aligner for NGS variant analysis while SNAP was the fastest aligner to
the best of our knowledge at the time of our evaluation. PicardTools was used to remove PCR
duplicates. GATK HaploTypeCaller or FreeBayes was used to call variants. We used a machine
equipped with 12-core Intel Xeon 2.66 GHz CPU and 48 Gigabytes of memory.
Table 4 details the computation time of SeqMule under different pipeline configurations. As is
shown, conventional BWA-MEM and GATK-HC combination takes over 50 hours to analyze a
whole genome, whereas SNAP+FreeBayes combination needs 36.3 hours. With quick mode
(SeqMule’s built-in parallel framework) enabled, variant calling can run up to 12 times faster (12
CPU cores). However, this comes with the tradeoff of increased memory usage. For example,
Page 19 of 73
when speed is of primary concern, users can use the SNAP+FreeBayes combination. This
combination requires less than 21 hours of running time and 31.3 Gigabytes of memory with
quick mode enabled.
Table 4. Time consumption under different configurations. A human whole genome data set (818 million 151
bp-long paired-end reads, 30X coverage) was aligned with BWA-MEM and SNAP. PCR duplicates were
removed by Picardtools. Variants were called by GATK HaplotypeCaller and FreeBayes. Quick mode here
denotes SeqMule’s built-in parallel framework. Built-in parallel capability is always turned on for underlying
3rd party algorithms (e.g. GATK’s -nt option).
Quick Mode Aligner Variant caller
CPU
(number of
cores)
Max Memory
Used (G)
Alignment
Time
(hour)
Variant
Calling
Time
(hour)
Total
Time
(hour)
No BWA GATK-HC 12 19.8 13.5 22.6 54.0
Yes BWA GATK-HC 12 43.2 13.2 19.8 51.1
No SNAP FreeBayes 12 31.3 5.7 16.0 36.3
Yes SNAP FreeBayes 12 31.3 5.3 1.3 20.5
For users who are interested in analyzing exome data sets, we also did a similar comparison
using an exome data set (138.8 million 90bp-long paired-end reads, 113X coverage in target
region). BWA-MEM was used as the aligner. PicardTools was used to remove PCR duplicates.
GATK-HC was used to call variants. As is shown in Table 5, using more CPUs or processes
reduces the running time in general. With quick mode enabled, variant calling can run 2 to 7
times faster depending on number of CPU cores used. On a machine with same configuration,
the shortest running time is 317 minutes for the BWA-MEM+GATK-HC combination, and 264
minutes for the SNAP+FreeBayes combination (not shown in Table 5).
Page 20 of 73
Table 5. Time consumption under different configurations. A human exome data set (138.8 million 90bp-long
paired-end reads, 113X coverage in target region) was aligned with BWA-MEM. PCR duplicates were removed
by Picardtools. Variants were called by GATK HaplotypeCaller. Quick mode here denotes SeqMule’s built-in
parallel framework. Built-in parallel capability is always turned on for underlying 3rd party algorithms.
Quick
Mode Enabled
CPU (number of
cores)
Max Memory Used
(G)
Variant
Calling Time
(min)
Time (min)
Time Saving Compared
With Analysis Using 1 CPU
No 1 7.03 217.5 910 0.00%
No 2 8.17 208.4 695 23.63%
No 4 11.90 189.9 540 40.66%
No 8 16.27 203.6 529 41.87%
No 12 20.34 192.5 474 47.91%
Yes 2 8.17 117.8 606 33.41%
Yes 4 14.67 51.5 385 57.69%
Yes 8 28.98 40.3 346 61.98%
Yes 12 43.29 31.1 317 65.16%
In the exome analysis, the running time for GATK-HC does not seem to vary with number of
CPU cores when quick mode is disabled. This is perhaps due to the fact that GATK’s
HaplotypeCaller only supports parallel execution with “-nct” option (enable multiple CPU threads
per data thread) which does not scale well and has a constant overhead. The configuration files
used are “snap_freebayes.config”, “bwa_gatk_HaplotypeCaller_norealn_norecal.config”. They
can be found in seqmule/misc/predefined_config/. The software used includes FastQC
(v0.11.2), Picard (v1.115), SAMtools (v0.1.19-44428cd), FreeBayes (v0.9.14-14-gb00b735),
SNAP (v1.0beta.16), BWA-MEM (v0.7.10-r789), GATK (v3.1-1-g07a4bf8).
Mendelian disease disease-contributory variant identification
We used exome sequencing data from a previously reported family trio
38
to demonstrate the
applicability and ease of use of SeqMule. The proband is a 28-year-old Caucasian male
diagnosed with idiopathic hemolytic anemia. This phenotype was not observed either in his
parents nor his siblings. With Agilent SureSelect Human All Exon kit and Illumina HiSeq 2000,
we obtained 97 million paired-end reads per sample with designed capture regions covered at
82x on average. Initial alignment was done by BWA-MEM algorithm, followed by Picard removal
of duplicates. GATK HaplotypeCaller, SAMtools, VarScan were then used to call variants. Multi-
sample variant calling was enabled, so all 3 samples were supplied to each of the variant callers.
Page 21 of 73
We filtered results based on criteria recommended by each algorithm. Subsequently we
extracted consensus variants from 3 sets of filtered variants (in 3-out-of-3 fashion), and reduced
number of variants from 34 thousand per caller to 31 thousand on average. These steps were
performed by executing one single SeqMule command given that the combination of aligner and
variant callers has been tested before.
Next we utilized the ANNOVAR “variants reduction” pipeline
31
, under a recessive disease model.
Synonymous, non-splicing variants and variants observed in the 1000 Genomes Projects and
NHLBI-ESP 6500 exome project with minor allele frequency (MAF) >1% in European
populations
40
were filtered out. The filtering was done with one single ANNOVAR command for
each subject (variants_reduction.pl -protocol nonsyn_splicing,1000g2012apr_all,esp6500_ea,recessive -operation g,f,f,m -
aaf_threshold 0.01 input.father.avinput annovar/humandb/ -buildver hg19).
In the remaining variants, father and son shared 236 variants, while mother and son shared 253.
There are 64 genes overlapping between the two sets of shared variants, 32 of which are
homozygous and 6 are compound heterozygous in the proband. Manual examination of
candidate genes easily identified mutations in PKLR that likely contribute to the disease. PKLR
harbours two heterozygous variants that are predicted by multiple scores to be deleterious, and
has been confirmed by us as the disease-contributory gene in biochemical tests
38
. This example
demonstrated that SeqMule can reveal disease-contributory variants easily with minimal efforts
from the users, therefore greatly facilitating genetic studies on human genetic diseases.
DISCUSSION
In summary, SeqMule is a comprehensive, user-friendly, flexible and efficient tool for analyzing
human exome and genome sequencing data. Five alignment tools, five variant callers and
various accessory programs are included to provide users with numerous choices.
Different call sets can be integrated to produce more reliable results in terms of consistency and
Mendelian error rate. However, if users are solely interested in rare variants and de novo point
mutations from family-based samples, they will have more gain in power and Mendelian
consistency by using more specialized tools such as PolyMutt
41
, DenovoGear
42
and FamSeq
43
.
Test results on our own data suggest that PolyMutt can dramatically reduce Mendelian errors
(0.2% for FreeBayes, 0.0% for GATK-HC, 0.3% for SAMtools, data same as in Figure 10) by
utilizing a family-aware calling framework. When sample size is large, variant quality can also be
improved by imputation-based methods
44
.
Page 22 of 73
SeqMule’s built-in parallel processing capability can improve current variant calling speed by a
number of times without involving high-performance computing infrastructure or complicated
configuration. Besides all of these features, SeqMule uses one-line commands for complicated
tasks wherever possible, making it extremely easy to download, install, configure and run a
large number of bioinformatics tools.
Popular bioinformatics platforms, such as Galaxy, make a number of bioinformatics tools very
accessible. Users just upload their data and start analysis right away. But the restricted storage,
limited data transfer speed and prolonged job queuing time makes it impractical to use when
users have non-trivial amount of data.
The option of launching a Galaxy instance
45
or other pipeline in the cloud is likely to overcome
some of the obstacles mentioned above without cumbersome local installation. But, as a free,
public resource open to all biologists, such a platform is usually built towards serving a large
population who are not only interested in variant discovery, but also many other types of
analyses. In contrast, SeqMule is tailored for exome or genome sequence analysis in the
context of human genetic disease study and therefore hosts many features optimized for this
purpose, such as consensus call generation, multi-sample variant calling, Mendelian error rate
statistics, and HTML-based summary, among other things. Besides, SeqMule is more agile
because new tools, once they are integrated into SeqMule by developers, can be added with
just one update command (see SeqMule’s manual).
In addition to platform solutions, there are quite a few standalone pipelines for variant analysis,
such as HugeSeq
7
, ngs_backbone
46
and bcbio-nextgen (https://github.com/chapmanb/bcbio-
nextgen). HugeSeq integrates one aligner and two SNP/indel callers (SAMtools and GATK),
and was last updated one year ago. ngs_backbone integrates one aligner (BWA) and one
variant caller (GATK). bcbio-nextgen has two aligners (BWA, NovoAlign) and four variant callers
(GATK, FreeBayes, Platypus, SAMtools). Since NovoAlign is a proprietary aligner not free to all
researchers, none of these pipelines provides alternative open-source mappers. In view of
discrepancies associated with aligners
9
(Figure 9), it is important to offer users multiple open-
source alignment programs.
Many users have used SeqMule to analyze their sequencing data and obtained meaningful
results
47-49
. With the rapid development and deployment of next-generation sequencing
technologies, we expect that SeqMule will facilitate analysis of the upcoming massive amounts
of sequencing data to expedite discoveries for human genetic diseases.
Page 23 of 73
Chapter 2. Framework for prioritizing GWAS loci with epigenetic
features
ABSTRACT
Identifying causal variants remains a key challenge in post-GWAS (Genome Wide Association
Study) era, since many GWAS SNPs (including imputed ones) fall into non-coding regions,
making it very difficult to associate statistical significance with predicted functionality. Therefore,
we created a web-based tool, Enlight, which overlays functional annotation information, such as
histone modification states, methylation patterns, transcription factor binding sites, eQTL, higher
order chromosomal structure, to GWAS results. Enlight is accessible by a web browser at
http://enlight.usc.edu.
INTRODUCTION
Genome-wide association studies (GWAS) enabled the identification of genetic variants
associated with complex diseases and traits. To date, there are 14,012 genome-wide significant
variants in the GWAS catalog
50
. However, most of the top hits are believed to be proxy markers
for the underlying causal variants, and the majority of those variants lie within non-coding
sequences
51
. Despite substantial efforts in the post-GWAS era to identify causal variants, lack
of knowledge on biological functions has become a major hurdle for differentiating causal
variants from highly correlated tag SNPs.
A large amount of biological annotation is now available for non-coding sequences, given the
community efforts devoted to elucidating functional significance of non-coding sequences in the
past few years. For example, the ENCODE project has systematically mapped regions of active
transcription, transcription factor association, chromatin structure and histone modification, and
assigned at least one biochemical function or activity, e.g. transcription, histone modification, to
about 80% of the genome
52
. Recent studies showed that disease-related variants detected by
GWAS are significantly enriched in regions where regulatory elements are inferred, such as
DNase I hypersensitivity sites (DHS) and transcriptional factor binding sites (TFBS)
51,53
. Proper
integration of these biological information with GWAS results will likely shed light on
prioritization and identification of candidate causal variants
54
.
Page 24 of 73
To that end, tools such as HaploReg
55
, RegulomeDB
56
, GWAS3D
57
and FunciSNP
58
were
developed to leverage functional annotations to interrogate GWAS variants. However, the
popularity of regional plots highlights the importance of visual inspection of the GWAS results.
While LocusZoom
59
generates high-quality images showing local linkage disequilibrium (LD)
and association significance, it provides little information for interpreting non-coding variants.
Therefore, we developed a web-based tool, Enlight, combining biological annotation and
GWAS-related information into one single plot. It takes GWAS results (including imputed ones)
as input, and returns a regional plot with a variety of epigenetic information such as histone
modification, methylation patterns, transcription factor binding sites, eQTL and high-order
chromosomal structure.
Page 25 of 73
IMPLEMENTATION
Architecture
The server backend is implemented in Perl CGI and MySQL. Regional plots are performed by
LocusZoom
59
, annotation plots are performed by custom R and Python code built on the basis
of LocusZoom. Text annotation is performed by ANNOVAR
31
. Detailed workflow is shown
Figure 11.
Figure 11. Enlight workflow. Enlight takes at least two columns of input, the SNP name and P value. After
submission, a job will be scheduled in MySQL database. Unpon execution, a modified version of LocusZoom
will output regional plots and annotation plots; meanwhile, ANNOVAR outputs text annotation for each
variant.
Input and output
Page 26 of 73
The minimum input contains two columns: marker name (rsID) and P value. Optionally, users
can supply chromosome, start position, end position, reference allele and alternative allele in
the first 5 columns; if they are absent, Enlight converts marker names to corresponding
coordinates based on dbSNP137. Upon submission, Enlight will output a conventional regional
plot containing association strength, LD information, gene position, chromosomal recombination
rate. Meanwhile, Enlight plots categorical annotation like chromHMM classification as colored
stripes, continuous annotation such as histone modification states, chromatin accessibility signal
by DNase-seq, transcription factor (TF) binding strength, etc., as histograms, below the GWAS
results panel (Figure 12). Variants that appear in the GWAS catalog and eQTL databases from
either GTEx project or eQTL browser of Pritchard Lab (http://eqtl.uchicago.edu/), can be shown
by a XYplot. HiC interaction
60
can be plotted as a heatmap. Users are able to select data tracks
from 14 cell lines and 17 annotation types or plot with their own data in BED format.
Enlight allows easy access to a range of high-throughput functional assays. In particular,
ChromHMM
61
is a computational model for learning chromatin states from a large amount of
ChIP-seq data sets. It conveniently classifies every 200bp region to one of 10 chromatin states
such as strong enhancer, weak enhancer, active promoter, to name a few. This classification
system allows easy interpretation of ChIP-seq data and potential functional impact of variants in
a specific region. Histone modification states measured by ChIP-seq usually mark the states of
transcription activity. For example, H3K4me3
62
is a marker for transcriptional activation;
H3K9me3
63
for repression, H3K9ac
64
for active promoter, H3K36me3
65
for active elongation.
DNase-seq assay measures chromatin accessibility. Peaks in DNase-seq are more likely to be
candidates for transcription factor binding
66
. eQTL
67
is short for expression quantitative trait loci.
Essentially eQTL are SNPs that are statistically significantly associated with gene expression,
i.e. a proxy marker for a functional locus that can alter gene expression. HiC
60
is a high-
throughput method that can detect chromosomal interactions across the entire genome. It has
been known that enhancer regulate promoter via long-range interactions, some of which can
interrogated by HiC methods
68
.
In addition to visualization, Enlight generates text annotation for each variant using ANNOVAR,
which can be used for downstream analyses, such as finding biofeature overlap, regression
model building, etc. Because ANNOVAR only supports region-based, filter-based and gene-
based annotation, biofeatures such as transcription factor binding motifs are not readily
available in the results.
Page 27 of 73
Figure 12 shows a typical plot. rs2071278 is genome-wide significantly associated with serum
complement C3 and C4 levels
69
, important measurements in assessment of rheumatoid
arthritis
70
. Therefore, rs2071278 is possibly a SNP tagging nearby functional variants. For
instance, rs204991, rs204990, rs204989, all in high LD with rs2071278 (1000 genomes,
European population, 2012), are shown to be significantly associated with expression of human
leukocyte antigen (HLA) genes by microarray
71
. This finding partly explains their strong
associations with rheumatoid arthritis in WTCCC study
72
. When strong associations appear near
known GWAS hits and are surrounded by interesting epigenetic and other functional features,
they perhaps suggest functional importance.
Page 28 of 73
Figure 12. An example Enlight plot. It shows the association signals for rs944797 and nearby loci from a
diastolic blood pressure study (dbGAP: pha003589.1). The color stripe shows the chromHMM segmentation
result which combines multiple epigenetic chromatin marks. The SNP rs1333045 (yellow), which is in high LD
with rs944797, shows up in a proposed enhancer region in blood vessel relevant cell lines, HUVEC and
HSMM, but not in two blood cell lines, GM12878 or K562. rs1333045 also overlaps with or is adjacent to peaks
in wgEncodeRegTfbsClusteredV3, which summarizes a large collection of ChIP-seq results, and
tfbsConsSites, which describes TFBS conservation across human/mouse/rat.
Page 29 of 73
Enlight has been shown to aid in finding additional variants contributing to genetic basis of some
complex phenotypes. For instance, human serum uric acid concentration (SUA) is a complex
trait. A recent meta-analysis of multiple GWAS identified 28 loci associated with SUA jointly
explaining only 7.7% of the SUA variance, with 3.4% explained by two major loci (SLC2A9 and
ABCG2)
73
. We examined whether gene–gene interactions could regulate SUA using two large
GWAS cohorts included in the meta-analysis [the Atherosclerosis Risk in Communities study
cohort (ARIC) and the Framingham Heart Study cohort (FHS)]. Abundant genome-wide
significant local interactions were found in the 4p16.1 region in both ARIC and FHS data. They
were mostly located in an intergenic area near SLC2A9, and apparently were not a result of
linkage disequilibrium. Forward selection approach helped us indentify 5 more SNPs with
marginal effects and 3 epistatic SNP pairs in ARIC. These findings, together with the lead SNP,
could explain 1.5% more SUA variance than the lead SNP alone. However, only 0.3% was
accounted for by the marginal and epistatic effects in the intergenic area.
Looking into the intergenic region (Figure 13), we can see some strong and weak enhancer
elements along with DNase peaks and conservative transcription factor binding peaks. This
suggests that the epistatically interacting SNPs may alter transcription factor binding activity for
some enhancers in this region. Enrichment analysis by HaploReg
55
showed that enhancers
active in ENCODE hepatic (HepG2) cells and precursor red blood cells (K562) are statistically
significantly enriched (P=4.7e-5, 5.0e-6, respectively) in the intergenic area between WDR1 and
SLC2A9, possibly a regulatory target for the epistatic SNPs. While there is high-throughput data
supporting our statistical results, more experimental validation may be needed, including
enhancer activity assay, qPCR for SLC2A9 and WDR1 gene expression with and without
alternative alleles, to name a few.
Page 30 of 73
Figure 13. All marginal SNPs in the 4p16.1 region in ARIC, their LD with the lead SNP rs3733588 colored in purple and the
recombination rate distribution in the region. The colored stripe shows chromatin states of the 4p16.1 region. Histograms
are UCSC genome browser data tracks of ENCODE data in the region with all scores normalized from 0 to 1000. The
wgEncodeBroadHmmHepg2HMM data track shows chromatin states in the region of HepG2 cells characterized using the
ChromHMM software, including two strong enhancer zones flanking the WDR1 gene; the wgEncodeRegTfbsClusteredV3 data
track shows transcription factor binding site cluster assayed by ChIP-seq; the wgEncodeOpenChromChipHepg2CtcfPk data
track shows DNase-seq signal peaks in HepG2 cells marking open chromatin flanking WDR1; the
wgEncodeBroadHistoneHepg2H3k27acStdPk data track displays active regions in HepG2 cells by histone mark H3K27Ac.
Page 31 of 73
CONCLUSIONS
We have created a user-friendly tool to generate regional plots of association results with
biological annotation as well as text annotation for individual variant. Enlight makes it possible to
visually connect epigenetic features or other functional annotations with the LD information, thus
facilitating the identification of putative causal variants. Enlight can be accessed via a web-
based interface.
Page 32 of 73
Chapter 3. Fast De Novo Genome Assembly, Evaluation, Gap filling
and Error Correction
ABSTRACT
Short-read sequencing enabled de novo assembly of a few individual human genomes, but with
inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual
HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by
NanoChannel arrays, and generate a de novo assembly of 2.93Gb (contig N50: 8.3Mb, scaffold
N50: 22.0Mb), including 39.3 Mb N-bases. The assembly fully or partially fills 274 (28.4%) N-
gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 18.4Mb of HX1-
specific sequences, including 378kb that are not present in previously reported Asian genomes.
These novel sequence elements harbor epigenetic regulatory elements and may influence gene
expression. Our results imply that improved characterization of genome functional variation may
require the use of a range of genomic technologies on diverse human populations. Assemblies
are available for download at http://hx1.wglab.org/.
INTRODUCTION
The advent of next-generation short-read sequencing paved the way to characterize the
genomes of thousands of species, and had enabled de novo assembly of a few individual
human genomes
74,75
. However, these assemblies may have inherent technical limitations in
characterizing repeat elements that span longer than the read length
76,77
, yet repeats and
segmental duplications are known to cover approximately half of the human genome. For
example, a formal analysis of the de novo sequence assembly generated from the genome of a
Han Chinese individual and a Yoruban individual showed that 420.2 megabase pairs of
common repeats and 99.1% of validated duplicated sequences were not present, which resulted
in missing thousands of coding exons in the genome assembly
76
. Therefore, the use of
additional genomic techniques, such as fosmid pooling
78
and long-read sequencing
79
, may be
necessary to better characterize complicated genomic regions in human genomes.
Previous studies reported that pervasive genetic differences exist across different ethnicity
groups, especially on structural variants
80-82
. For example, through reconstruction of the
ancestral human genome, it was reported that megabases of DNA were lost in different human
lineages and that large duplications were introgressed from one lineage to another
81
.
Additionally, genomic elements that are absent from reference genomes may be present in
Page 33 of 73
personal genomes
83,84
. For example, a study estimated that a complete human pan-genome
would contain ~19–40 Mb of novel sequence not present in the extant reference genome
85
.
These novel sequences that are not present in the reference genome may harbor functional
genomic elements that are ethnicity-specific, and may affect gene regulations or transcriptional
diversity.
To address the limitations on previously published de novo genome assembly, and to improve
our understanding of transcriptome variations, here we sequence both DNA and RNA of a
Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing
86
. We also
generate a physical map of the HX1 genome using IrysChip
87
, which is a nanopore array that
detects a characteristic 7-nucleotide sequence along very long genomic segments, typically
hundreds of kilobases. We perform de novo genome assembly to build a Chinese reference
genome, using a hybrid approach that combines long-read sequencing data and IrysChip data
79
.
We demonstrate a few unique applications of the HX1 assembly, including the ability to fill gaps
in the human reference genome assembly GRCh38, as well as the ability to identify fine-scale
structural variants. In parallel, leveraging long-read RNA sequencing, we also identify novel
transcriptional elements, especially those with multiple spliced isoforms. Through the combined
use of a few genomic techniques, we perform detailed characterization of the HX1 genome and
demonstrate that long-read sequencing can detect potentially functional elements in human
genomes that are missed by short-read sequencing.
RESULTS
Generation of de novo genome assembly on a Chinese genome
We leveraged SMRT DNA sequencing technology
86
and sequenced genomic DNA from an
anonymous Chinese individual (HX1) with normal karyotype (Figure 14) at 103X genome-wide
coverage (Table 6). In total, we obtained 44.2 million "subreads" (a portion of the sequencing
read that is informative for downstream analysis) after removing adapters and performing quality
control measures. These subreads have a mean length of 7.0kb and a N50 length of 12.1kb
(Table 7, Figure 15), where N50 refers to the length for which the collection of all sequences of
that length or longer contains at least half of the sum of the lengths of all sequences.
Page 34 of 73
Table 6. Statistics of the genomics data sets generated in this study.
Material Platform # Cells # Reads Bases Coverage Mean
length
N50
length
DNA Illumina
HiSeq X
- 2.8 billion reads 428.8
G
143X 151 151
DNA PacBio
SMRT cell
377 cells 44.2M reads 309.0G 103X 7.0Kb 12.1Kb
DNA BioNano
IrysChip
12 cells 1.169M
molecules
(>150kb)
302.8G 101X 259.0Kb 224.7Kb
RNA PacBio
SMRT cell
50 cells (1-2kb, 2-
3kb, 3-5kb,5kb+)
2.721M error-
corrected reads
5.8G - 2.1Kb 2.7Kb
RNA Illumina
HiSeq 2500
NA 48.9M reads 4.4G - 90 90
Table 7. Statistics on subreads generated from PacBio long-read DNA sequencing on the genome DNA of
HX1. Filtered subreads are used in error-correction by DALIGNER in Falcon. Error-corrected subreads are
used for de novo assembly.
Filtered subreads Error-corrected subreads
Min 35 501
25% Quantile 1,754 2,064
Median 4,853 4,975
Mean 6,990 6,114
75% Quantile 10,531 8,621
Max 56,677 53,633
N50 12,134 9,137
Count 44,207,919 26,562,171
Page 35 of 73
Figure 14. Karyotype. Karyotype analysis did not reveal large-scale chromosomal abnormality in HX1.
Figure 15. Length distribution of subreads and error-corrected subreads. (a) read length distribution of
filtered subreads; (b) read length distribution of error-corrected subreads. Filtered subreads are used in
error-correction by DALIGNER in Falcon. Error-corrected subreads are used for de novo assembly.
Page 36 of 73
We modified and improved the FALCON software
88
and performed de novo genome assembly
on the long reads, resulting in 5,843 contigs (N50=8.3Mb) and a total size of 2.9Gb. In addition,
206Mb of "associate contigs", that is, alternative haplotypes, were constructed along with the
primary contigs. Finally, we also performed short-read sequencing on the Illumina HiSeq X
platform, with 143X coverage of the genome (Table 6). Short reads were used to further polish
HX1 contigs and correct indel and SNV errors. The continuity of the contigs is substantially
higher than assemblies generated from competing technologies in previous studies (Table 9),
demonstrating the clear advantage of long-read sequencing in genome assembly. We note that
a recently published genome using the same SMRT technology reported a contig N50 of
906kb
79
, and we believe that the almost 10-fold improvement in our study can be attributed to
the improved chemistry, longer read length, enhanced assembly algorithm, as well as the much
deeper sequencing depth (Table 8).
Table 8. Assembly quality and sequencing depth. Assembly quality (contig N50 and assembly size) is
compared with regards to sequencing depth and length cutoff for error-corrected subreads. Sequencing
depth (>6kb) denotes average sequencing depth counting only reads longer than 6kb. Depth is
approximately calculated as total base pairs divided by 3 Gb.
Contig N50
(Mb)
Assembly size
(Gb)
Sequencing
depth
Sequencing depth
(>6kb)
Length cutoff for error-
corrected subreads (kb)
0.61 2.68 65 48 2
2.76 2.81 81 64 2
4.65 2.87 99 80 2
6.49 2.88 99 80 9
7.16 2.87 103 83 13
8.01 2.89 103 83 11
8.28 2.88 103 83 12
Page 37 of 73
Table 9. Comparison to previously published de novo assembly on personal genomes. *YHref’s alternative
haplotype is contained in the haploid-resolved diploid genome (HDG) sequence, which is not readily
comparable with our results.
Sample Assembly Accession Sequencin
g Platform
Scaffoldin
g platform
Contig
N50
Scaffold
N50
Size Size
(alternative
haplotype)
# Gaps Gap length
Venter HuRef GCA_000002125.
2
Sanger BAC 108,431 17,664,25
0
2,844,000,50
4
NA 66,906 34,429,377
NA1287
8
ALLPATHS GCA_000185165.
1
Illumina
GAII &
HiSeq
Fosmid 23,924 12,084,11
8
2,786,258,56
5
NA 220,31
8
171,353,64
4
BGI-YH YH_2.0 GCA_000004845.
2
Illumina
HiSeq
Fosmid 20,516 20,520,93
2
2,911,235,36
3
NA 235,51
4
105,204,23
0
BGI-YH YHref NA Ilumina
HiSeq
Fosmid 484,222 23,192,26
0
2,883,329,36
1
NA* NA NA
CHM1 CHM1_1.1 GCF_000306695.
2
Illumina
HiSeq
BAC 143,936 50,362,92
0
3,037,866,61
9
NA 40,915 210,229,88
0
NA1287
8
ASM101398v
1
GCA_001013985.
1
PacBio
SMRT
BioNano
IrysChip
906kb/
1.4Mb
(V2)
26,834,08
1
3,176,574,37
9
NA 2,332 146,352,28
6
HX1 HX1 N/A PacBio
SMRT
BioNano
IrysChip
8,325,00
4
21,979,25
0
2,934,084,19
3
206,388,24
8
10,901 39,341,483
To evaluate the completeness and accuracy of the draft HX1 assembly, we performed several
analyses. First, we generated a physical map of the same DNA sample by NanoChannel-based
fluidic IrysChip
87
. From the IrysChip run with 101X whole-genome coverage (based on all
sequence reads >150kb, Table 10, Figure 16), we calculated the mapping rate of these
fragments on different genome assemblies. We noted that this analysis was biased against
more fragmented assemblies such as HX1, since some IrysChip reads may span two different
contigs. With a highly stringent threshold suitable for human genome analysis, the IrysChip
reads yielded comparable mapping rates to GRCh38 and HX1 at 80.2% and 78.9%,
respectively. Second, we aligned HX1 contigs with gapped (N bases included) to GRCh38, and
found that 97.11% of the non-N regions in GRCh38 are covered by HX1 (Figure 17). This ratio
was at a similar level as other personal genome assemblies including YH2.0
83
(94.99%),
NA12878
79
(97.50%) and HuRef
84
(97.04%). Third, we evaluated consensus quality of the
assemblies, by comparing them with the chromosomes in the reference genome GRCh38 using
MUMmer
89
. We found that the consensus accuracy for HX1 was 99.73%, similar to YH2.0
Page 38 of 73
(99.81%), NA12878 (99.73%), and HuRef (99.84%). Fourth, we aligned the NCBI RefSeq
transcripts to several genome assemblies using the NCBI Assembly Evaluation pipeline. We
found that 776 out of 50909 transcripts could not be placed on HX1 versus 455 for NA12878,
306 for YH2.0 and 22 for GRCh38 (primary assembly). The percentage of mapped coding
transcripts with CDS (coding sequence) coverage ≥95% was higher in HX1 than in all
assemblies except GRCh38: 99.96%, 90.32%, 95.30%, and 97.69% for GRCh38, YH2.0,
NA12878 and HX1, respectively (Table 11), indicating the low number of misassemblies in HX1.
Similarly, the numbers of transcripts split across multiple genomic locations in GRCh38, YH_2.0,
NA12878 and HX1 are 11, 1213, 1375, 348, respectively, demonstrating the high quality of HX1.
Last but not least, short read alignments showed that 99.42% and 99.33% of the Illumina reads
can be mapped to GRCh38 and HX1, respectively, suggesting the high quality and
completeness of HX1.
Table 10. Statistics on the filtered (>150kb) BioNano data.
Measure Value
Total DNA 302.8Gb
Total molecules 1,169,210
Occupancy 2.58%
Average molecule SNR 16.61
Average molecule intensity 0.14
Center of mass (N50) 264.3kb
Average length 259.0kb
Median length 224.7kb
Average label density /100kb 8.92
Page 39 of 73
Table 11. RefSeq transcripts alignment results from the NCBI Assembly Evaluation pipeline.
GRCh38
˦
YH2.0
*
NA12878
*
HX1
#
Number of transcripts retrieved from Entrez 50909 50909 50909 50909
Number of transcripts not aligning 22 306 455 776
Number of transcripts with multiple best alignments
(split transcripts)
11 1213 1375 348
Number of transcripts with CDS coverage < 95% in
aligned transcripts
16 3798 1836 900
Number of conflicting placements of transcripts from
different genes
1 635 490 628
Percentage of aligned coding transcripts with CDS
coverage >= 95%
99.96 90.32 95.30 97.69
*
NCBI accession number: GCA_000004845.2 for YH2, GCA_001013985.1 for NA12878
#
HX1 scaffolds
˦
Unless there is an assembled chrY in an assembly, our code excludes any transcript that aligns best to GRCh38
chrY from the counts for these assemblies without chrY. It means that the denominator (number of sequences that
we align or that we attempt to align) is a little bit lower for the YH2.0, NA12878 and HX1.
Page 40 of 73
Figure 16. Summary of the IrysChip analysis. (A) Distribution of the molecule mass versus the size of the
molecules in filtered data. (B) Distribution of the molecule count versus the size of the molecules in filtered
data.
Page 41 of 73
Figure 17. Non-N coverage of GRCh38 by HX1 assembly for each chromosome.
To generate scaffolds on the draft assembly, we applied a hybrid scaffolding approach
79
on the
IrysChip data and the HX1 draft assembly: First, we performed de novo assembly of the
IrysChip reads, resulting in 2,346 contigs with N50 of 1.80Mb. Next, we stitched HX1 contigs
together based on information from the IrysChip assembly. Together with HX1 contigs that
cannot be anchored, the N50 for the hybrid assembly improved to 22.0 Mb. We next evaluated
the mis-joining error rate with a similar strategy as described in a previous study
79
, by
comparing to the reference human genome GRCh38. The HX1 scaffolds had a mis-joining error
rate of 1.26%, similar to what was observed in YH2.0 (3.88%), NA12878 (0.55%) and HuRef
(1.15%). The final genome assembly contained 2.93 Gb primary sequences (including 39.3Mb
N bases) and 206Mb alternative haplotypes.
Page 42 of 73
Gap filling on the reference genome by de novo assembly
The reference human genome GRCh38 contains 966 "N-gaps" as stretches of Ns, and we next
assessed whether we can fill in these gaps using the de novo assembly. The average and
median lengths of gaps in GRCh38 are 180kb and 998 bp, respectively (Figure 18A). One
previous study by Chaisson et al used SMRT sequencing to fill gaps
90
; this study used a local
assembly approach, and was able to close or extend into 31 of the 172 interstitial euchromatic
gaps in GRCh38. Another study that used SMRT sequencing closed 28 interstitial gaps in
GRCh38 with 34kb of assembled sequences
79
. Given the availability of whole-genome de novo
assembly, we developed a novel statistical approach called GFA (gap-filling by assembly,
Figure 23). Interestingly, we found that 28.4% (254) of the 966 N-gaps in GRCh38 can be
completely or partially filled by HX1. Among the 328 gaps over 10kb, 36.0% (118) of them can
be completely or partially filled. The total length of filled or shortened gaps amounts to 7.1Mb
(Figure 18B). Compared to previous studies, gaps filled by HX1 have overlap with 48 of the 172
interstitial gaps defined by Chaisson et al
90
, adding 1.8 Mb sequences to GRCh38; however,
among the 48 interstitial gaps closed by us, only 10 were closed by Chaisson et al, suggesting
that these two gap closing methods are highly complementary. We further evaluated the repeat
contents within the gaps that can be closed by us, and found that simple repeats and satellite
sequences were significantly enriched within the closed gaps compared to GRCh38 (p<0.001)
(Figure 18C). As an example, one ~700bp gap can be completely and confidently closed
(Figure 18D), where HX1 can be aligned to both flanking segments of the gap (Figure 18E). In
summary, together with Chaisson et al, 69 out of 172 interstitial gaps in GRCh38 can be closed
by long-read sequencing.
Page 43 of 73
Figure 18. Summary of gap filling in GRCh38. (A) Length distribution of all gaps (stretches of "N" in genome
sequence) in GRCh38. (B) Length distribution of all gaps that can be fully or partially closed. (C) Violin plots
showing the distribution of LINE, SINE, LTR, simple repeat and satellite in closed gaps and in GRCh38. (D) A
dotplot showing how a gap on 17p13.3 is closed by a contig in HX1. The plot shows comparison of two
sequences and each dot indicates a region of close similarity between them (E) genome browser shot of the
gap region that was closed. The gap is flanked by two contigs that are new in GRCh38 (not carried forward
from GRCh37), yet a HX1 associate contig (000850F-001-01) can completely align to flanking regions,
therefore filling this assembly gap and revising its length from 718bp to 731bp.
Page 44 of 73
Characterization of structural variants and novel sequences
We next cataloged structural variants (SVs) in the genome of HX1, by comparing to the
GRCh38 genome assembly. From long-read sequencing data, we identified 9,891 deletions and
10,284 insertions by a previously validated local assembly approach
90
(Figure 20A). We
classified these SVs by type and by repeat contents of the variant sequence (Table 12), and
found that about half of the deletion and insertion calls are short tandem repeats (STRs) or
mobile element insertions (MEIs) (Figure 20B). We further compared SVs in HX1 to those
detected in CHM1 - a haploid genome assembled by long-read sequencing and analyzed by the
same method of SV detection
90
, as well as all SVs cataloged in the 1000 Genomes Project
82
(Figure 20C). Due to the increased sensitivity of SV detection from long-read sequencing, HX1
shares substantially more SVs with the CHM1 genome, compared to all SVs cataloged in the
1000 Genomes Project. Additionally, from the IrysChip physical mapping data, we identified 783
insertions and 377 deletions, using a previously validated approach
91
. From the short-read
sequencing data, we also identified 2,403 deletions and 783 insertions. We found that 82.8%
deletions and 66.9% insertions from IrysChip overlap with SVs detected by long-read
sequencing (Figure 19), but only 33.7% deletions and 13.8% insertions from short-read
sequencing can be detected by long-read sequencing. Finally, we demonstrated an example
where a 204.7kb deletion can be visually identified by all technical platforms (Figure 20D,E).
However, a 132bp deletion can only be detected by long-read sequencing (Figure 20F); it is
located within simple repeat regions with high (57.2%) GC contents (Figure 20G) and shows no
clear drop in coverage in Illumina data (Figure 20F), potentially explaining its failure to be
identified in Illumina data using read-depth method. However, an alternative split-read based
method
92
with appropriate parameters was able to identify this deletion. To identify likely
functional SVs specific to HX1, we filtered out calls found in segmental duplications or those
shared with CHM1, and intersected the remaining calls with RefSeq exons. This left us with 35
exonic deletions and 14 exonic insertions. Among these exonic SVs, 8 deletions (23%) and 5
insertions (31%) had been previously observed in the 1000 Genomes Project, including a
homozygous exonic deletion in C1orf168 that completely removes the 10th and 11th exon.
Interestingly, this deletion was only observed as a heterozygous event in East Asian populations
(AF=1.1%), including 10 Han Chinese and 1 Japanese individuals, and was therefore an Asian-
specific SV. In summary, long-read sequencing offers improved sensitivity in identifying SVs,
especially those containing repetitive elements, and some of these SVs may contain functional
genomic elements that are ethnicity-specific.
Page 45 of 73
Table 12. Structural variants detected in HX1 in comparison to GRCh38. All events are larger than 50bp.
Alu_STR: Alu element associated with STR; Complex: multiple repeat elements are found; MEI: mobile
element insertion.
Insertion Deletion Ins/del
Repeat
Category
Num Mean
length
Total
bases
Num Mean
length
Total bases Total
events
Total
bases
MEI 2,172 600 1,302,834 2,499 504 1,258,260 0.87 1.04
L1 175 1,121 196,227 207 552 114,177 0.85 1.72
L1HS 117 2,622 306,826 146 2,648 386,599 0.80 0.79
L1P 152 770 116,964 214 579 123,871 0.71 0.94
Mosaic Alu 187 496 92,842 141 360 50,724 1.33 1.83
AluS 110 219 24,144 127 180 22,819 0.87 1.06
Alu_STR 93 961 89,407 63 439 27,662 1.48 3.23
AluY 865 286 247,698 1,153 284 327,077 0.75 0.76
SVA 292 511 149,172 227 618 140,327 1.29 1.06
HERV 23 318 7,320 31 263 8,141 0.74 0.90
LTR 92 533 49,047 108 256 27,665 0.85 1.77
MER 66 351 23,187 82 356 29,198 0.80 0.79
Centromeric
Satellites
608 561 340,995 436 501 218,243 1.39 1.56
ALR 484 612 296,077 387 551 213,177 1.25 1.39
HSAT 124 362 44,918 49 103 5,066 2.53 8.87
Complex 822 3,065 2,519,672 766 3,075 2,355,090 1.07 1.07
STR 3,655 332 1,212,704 2,130 176 374,176 1.72 3.24
Other 256 269 68,982 431 243 104,795 0.59 0.66
Unannotated 2,771 210 580,924 3,629 180 654,225 0.76 0.89
Total 10,284 586 6,026,111 9,891 502 4,964,789 1.04 1.21
Page 46 of 73
Figure 19. Structural variation calls across different technologies. (A) The number of CNV calls generated by
CNVnator (Illumain short-read sequencing), IrysChip (BioNano physical mapping), FES-SV (PacBio long-read
sequencing) at different size thresholds. (B) The concordance of FES-SV calls with the other two
technologies.
Page 47 of 73
Figure 20. Detection of structural variants by different technologies. (A) Chromosome ideogram showing
large-scale (>1kb) deletions (blue) and insertions (red) identified from long-read sequencing data. (B) Pie
chart showing the distribution of different classes of structural variants identified from long-read sequencing
data. (C) Venn diagram showing the overlap of structural variants between HX1, CHM1 and the 1000
Genomes Project for insertions and deletions, respectively. (D) Integrative Genomics Viewer screen shot of
the long-reads (upper panel) and short reads alignment (lower panel) around a ~200kb deletion (E) Alignment
of de novo assembled genome map (blue) to reference genome map (green) where the ~200kb deletion
occurs. Black vertical lines represent labels for the enzyme recognition site. Contig 2 shows identical label
patterns as reference, yet contig 1 contains the deletion. (F) Integrative Genomics Viewer screen shot of
long-read (upper panel) and short-read (lower panel) alignment around a 132bp deletion on KRTAP1-1. This
deletion is visually discernible from long-read sequencing, because the coverage is reduced and half the
reads contain the deletion in alignments. However, read-depth based method failed to detect this deletion
with short read data. (G) Genome browser screen shot of the region surrounding the 132bp deletion on
KRTAP1-1, demonstrating the presence of simple tandem repeats and the very high GC content of the
deletion.
Page 48 of 73
One of our primary interests lies in the identification of novel genomic elements that are absent
from the reference genome assembly GRCh38. In total, we identified 18.4Mb sequences in HX1
that were not present in GRCh38 primary scaffolds nor its alternative loci, among which only
378kb (~2%) cannot be mapped to previously published Asian genome assemblies
78,83
,
suggesting that most HX1-specific sequences are likely to be found in Asian populations. To
further investigate this, we re-analyzed Illumina short-read sequencing data on a Chinese
subject provided by the National Institute of Standards and Technology (NIST) as standard
benchmarking data. Among the 907 million raw reads, we achieved a mapping rate of 99.56%
to GRCh38; among unmapped reads, we found that 7.8% can be mapped to the 18.4Mb novel
sequence in HX1, confirming that Asian-specific sequence elements were present but were
missed from GRCh38. We next performed variant calling on the NIST genome. In the regions
shared between HX1 and GRCh38, we identified 3,157,818 SNVs on HX1 but 3,852,118 SNVs
on GRCh38, suggesting that Asian-specific reference genome might reduce the number of
called SNVs. However, HX1 might be less appropriate for indel calling due to higher indel error
rate of SMRT long reads. Among the SNVs called on HX1, 37,462 (~1%) resided within the
novel sequences of HX1. We further examined the contents within the novel sequences in HX1
(Table 13), and found that microsatellites are significantly enriched in the novel sequences
compared to genome-wide average (68.4% vs 1.6%, p<10
-5
). Similarly, simple repeats are also
significantly over-represented (10.3% vs 0.9%, p<10
-5
). Therefore, long-read sequencing is
especially effective in capturing highly repetitive regions in the genome.
Page 49 of 73
Table 13. Repeat content within novel (non-GRCh38) sequences in HX1.
Novel sequence in HX1 Random sequences from GRCh38
GC: 40.79%, 928 sequences, size: 18.4Mb GC: 40.95%, 251 sequences, size: 18.4Mb
element count length percentage of
sequence
element count length percentage of
sequence
SINEs: 4507 578352 3.14% 10081 2350498 12.77%
ALUs 4312 551425 3.00% 7272 1937478 10.53%
MIRs 195 26927 0.15% 2790 410656 2.23%
LINEs: 772 751185 4.08% 5523 3638149 19.77%
LINE1 637 716433 3.90% 3597 3101890 16.86%
LINE2 123 33340 0.18% 1657 468664 2.55%
L3/CR1 9 847 0.00% 201 45972 0.25%
LTR 404 263736 1.43% 3051 1589941 8.64%
ERVL 82 41827 0.23% 664 325935 1.77%
ERVL-MaLRs 185 73600 0.40% 1504 631239 3.43%
ERV_classI 104 109682 0.60% 701 539504 2.93%
ERV_classII 21 35248 0.19% 47 57160 0.31%
DNA elements 347 85282 0.46% 2357 564788 3.07%
hAT-Charlie 190 40134 0.22% 1271 256052 1.39%
TcMar-Tigger 105 31711 0.17% 543 201641 1.10%
Unclassified: 23 30339 0.16% 53 36362 0.20%
Total
interspersed
repeats
6053 1708894 9.29% 21065 8179738 44.46%
Satellites 6216 12576907 68.38% 34 299489 1.63%
Simple repeats 11945 1890873 10.28% 2489 165630 0.90%
Low
complexity
586 38670 0.21% 2251 105424 0.57%
Page 50 of 73
Several other previously published studies reported novel sequence elements from Asians, so
we performed comparative analysis. For example, Li et al found 7,330 novel sequences (4.9Mb)
absent from GRCh37
85
. Re-analysis of their data showed that 3,440 (3.7Mb) sequences can be
aligned to GRCh38, yet 4,676 (4.4Mb) can be aligned to HX1. Among these sequences, 3,129
(3.5Mb) can be aligned to both GRCh38 and HX1. To assess whether regulatory functional
elements exist in novel sequences in HX1, we analyzed raw sequencing data on five markers
(CTCF, DNase 1 hypersensitivity, H3K4me1, H3K4me3, H3K27ac) on the lymphoblastoid cell
line GM12878 from the ENCODE project (Table 14)
93
. Among all the reads that cannot be
mapped to GRCh38, 1.1% can be mapped to HX1 (Table 15) and we performed peak-calling on
each marker. We also performed RNA-Seq on HX1 using Illumina short-read sequencing, and
found that genes closest to these peaks within the 500kb flanking region tend to have higher
expression levels than all other genes in the RNA-Seq data (Figure 21A). Similar patterns were
also observed for novel non-GRCh38 sequences of HX1 (Figure 21B, Table 16). In summary,
we demonstrated that long-read sequencing can identify potentially functional pieces in
genomes that evade detection by short-read sequencing.
Page 51 of 73
Table 14. List of data sets downloaded from http://encodeproject.org for functional analysis.
Cell line Producer Assay Replicate# Accession Control Accession
GM12878 broad H3K4me3 1 ENCFF000ASR ENCFF000ARK
GM12878 broad H3K4me3 2 ENCFF000AUB ENCFF000ARO
GM12878 uw H3K4me3 1 ENCFF001EYE ENCFF001HID
GM12878 uw H3K4me3 2 ENCFF001EYF ENCFF001HID
GM12878 broad H3K4me1 1 ENCFF000ASM ENCFF000ARK
GM12878 broad H3K4me1 2 ENCFF000ATK ENCFF000ARO
GM12878 broad H3K27ac 1 ENCFF000ASP ENCFF000ARK
GM12878 broad H3K27ac 2 ENCFF000ASU ENCFF000ARO
GM12878 broad CTCF 1 ENCFF000ARP ENCFF000ARK
GM12878 broad CTCF 2 ENCFF000ARV ENCFF000ARO
GM12878 stanford CTCF 1 ENCFF000VUU ENCFF000VWV
GM12878 stanford CTCF 2 ENCFF000VUW ENCFF000VWV
GM12878 uta CTCF 1 ENCFF000ROU ENCFF000RPB
GM12878 uta CTCF 2 ENCFF000ROX ENCFF000RPB
GM12878 uta CTCF 3 ENCFF000ROZ ENCFF000RPB
GM12878 uw CTCF 1 ENCFF001HHX ENCFF001HID
GM12878 uw CTCF 2 ENCFF001HIA ENCFF001HID
GM12878 uw DNase 1 ENCFF001CUR NA
GM12878 uw DNase 2 ENCFF001CWQ NA
GM12878 duke DNase 1 ENCFF000SLF NA
GM12878 duke DNase 2 ENCFF000SLG NA
GM12878 duke DNase 3 ENCFF000SLL NA
GM12878 duke DNase 4 ENCFF000SLP NA
GM12878 duke DNase 5 ENCFF000SLR NA
Page 52 of 73
Table 15. Mapping rate for ENCODE reads from each one of the assays from GM12878 that cannot be
mapped to GRCh38 but can be realigned to HX1.
Total number of
GRCh38-unmapped
reads
Total number of reads
mapped to HX1
Mapping rate of
GRCh38-
unmapped reads
CTCF 52,428,376 583,528 0.011
DNase 133,235,297 1,102,211 0.008
H3K4me1 18,969,475 173,028 0.009
H3K4me3 8,787,823 961,685 0.109
H3K27ac 55,380,745 262,783 0.005
Total 268,801,716 3,083,235 0.011
Page 53 of 73
Table 16. Summary for peaks in HX1 novel sequences generated using ENCODE GM12878 sample and their
associated genes.
Assay Peak Gene
Contig Start End Contig Start End FPKM Distance
to peak
CTCF 000504F 2,767,793 2,768,161 000504F 2,529,336 2,531,746 7.96 236048
000800F 671,615 672,870 000800F 587,324 693,220 47.00 0
H3K4me1 001058F 115,364 116,029 001058F 181,365 182,584 5.79 65337
H3K27ac 001058F 121,686 121,940 001058F 181,365 182,584 5.79 59426
DNase 000102F 4,813,943 4,814,071 000102F 4,944,978 4,980,549 4.28 130908
000235F 4,325,591 4,325,743
000235F 4,237,882 4,242,650 3.78
82942
000235F 4,325,791 4,325,853 83142
000504F 2,780,203 2,780,257 000504F 2,529,336 2,531,746 7.96 248458
000507F 2,367,118 2,367,161 000507F 2,124,214 2,352,056 2.50 15063
000800F 656,120 656,696
000800F 587,324 693,220 47.00
0
000800F 657,262 658,394 0
000800F 658,567 660,785 0
000800F 660,909 665,520 0
000800F 665,625 668,481 0
000800F 668,731 672,967 0
001058F 115,479 116,041
001058F 181,365 182,584 5.79
65325
001058F 116,444 116,827 64539
001058F 121,651 121,747 59619
001252F 233,670 234,313
001252F 85,421 90,442 1.30
143229
001252F 234,449 236,250 144008
001377F 135,128 135,342 001377F 123,711 124,302 4.83 10827
001632F 79,834 79,884
001632F 119,584 120,284 5.85
39701
001632F 81,055 81,159 38426
001632F 81,265 81,845 37740
001632F 81,848 82,407 37178
001632F 131,939 132,559 11656
001632F 132,804 133,397 12521
001632F 133,729 134,363 13446
002340F 46,144 46,150
002340F 2,013 10,871 8.02
35274
002340F 50,330 50,851 39460
002360F 14,243 14,345 002360F 29,439 30,130 1.84 15095
Page 54 of 73
Figure 21. Distribution of FKPM values for genes flanking each of the five markers (CTCF, DNase, H3K4me1,
H3K4me3, H3K27ac), compared to background expression. For each peak, the nearest gene with 500kb
flanking region is analyzed. (A) Violin plot on all peaks in HX1 (B) Violin plot on peaks in HX1 that are located
in novel sequences. Note that in novel sequences, there is no gene that can be associated with H3K4me3
and there is only one gene that can be associated to H3K4me1 and H3K27ac.
DISCUSSION
In the current study, we generated one of the first near-complete de novo genome assemblies
for a Chinese individual. The contig and scaffold N50 values of the assembly were substantially
higher than previous studies on de novo human genome assembly, implicating the unique
advantage of long-read sequencing in assembling complex genomes such as the human
genome.
There are several current limitations to use long-read sequencing to generate de novo
assemblies and analyze personal genomes. First, although the read length is substantially
higher than short-read sequencing, it is still at the scale of tens of kilobases due to the
technological limitations of the current sequencing platform. Some highly complex genomic
regions may still not be adequately assayed or assembled, especially when sequencing
coverage is low. With the current development of a number of nanopore-based long-read
sequencing platforms, this problem may be alleviated by technological innovations. Second,
some gaps in the reference genome are long and are surrounded by segmental duplications or
other highly repetitive sequences
94
; therefore, they may not be filled by our long read assembly.
For example, 24 of the 966 gaps are longer than 300kb based on current estimation, and none
Page 55 of 73
of them can be closed by our method. Third, compared to the Illumina platform that enabled
$1000 genomes, PacBio long-read sequencing is still relatively cost-prohibitive to be applied to
personal genomes at large scale. With the continued decline in sequencing cost and the
improvement in sequencing throughput per flow cell, this problem may be reduced in the future.
Fourth, due to the much higher error rates (especially for indels) of PacBio long-read
sequencing, variant detection will not be reliable at low sequencing coverage, so analysis of
genetic mutations in personal genomes still needs to rely on more accurate short-read
sequencing data. Finally, in the current study, to demonstrate the use of reference-free de novo
assembly, we used the NanoChannel arrays for scaffolding the contigs from the genome
assembly, which resulted in ~3 fold improvements in N50 values. However, we acknowledge
that we could alternatively use the reference human genome GRCh38 for generating much
longer scaffolds.
In summary, while short-read based alignment and variant calling based on reference genome
remain a common practice to assay personal genomes, de novo assembly by long-read
sequencing may reveal novel and complementary biological insights. Improved understanding
and better characterization of genome functional variation may require the use of a range of
genomic technologies on diverse human populations
95
.
METHODS
Generation of sequence data
Freshly drawn blood samples were obtained from an anonymous healthy Chinese adult male
(HX1) with normal karyotype, using protocols approved by the Institutional Review Board of
Jinan University. HX1 provided written consent for public release of genomic data. For long-read
DNA sequencing, high-molecular-weight DNA was extracted from isolated lymphocytes using
Phenol-Chloroform method and sequenced by the PacBio sequencer RS II with the P6-C4
sequencing reagent. For long-read RNA sequencing, total RNA was extracted from blood using
TRIzol extraction reagent and subjected to the IsoSeq protocol with four library sizes (1-2kb, 2-
3kb, 3-5kb, 5kb+), and sequenced on the PacBio sequencer RS II. For short-read DNA
sequencing, genomic DNA was subject to Illumina TruSeqDNA Nano library preparation
protocol and 151bp paired-end reads were generated by Illumina HiSeq X sequencer. For short-
read RNA sequencing, RNA samples were subject to the TruSeq mRNA Library Kit and
sequenced on Illumina HiSeq2500 sequencer. For BioNano physical mapping, DNA extracted
Page 56 of 73
from freshly drawn whole blood were subject to manufacturer recommended protocols for library
preparation and optical scanning, with the default nicking enzyme NT.BspQI.
BioNano genome mapping
We used the Irys System from BioNano Genomics, a NanoChannel-based fluidic IrysChip that
can unravel, sort, and confine native-state, long, genomic DNA fragments in a linearized
conformation. The DNA is labeled by a nicking enzyme that recognizes characteristic 7-mer in
genome sequence. Labeled sites appear as a uniquely recognizable pattern of “dots on a string”,
so that once the DNA is stretched inside the NanoChannels, the Irys CCD detector optically
images them. Uniquely, the straightened molecules in solution are able to move smoothly
through the nanoscale fluidic environment, enabling multiple cycles of automated loading and
imaging for high-throughput scanning and analysis.
The BioNano optical mapping data was generated using DNA extracted from 3mL freshly drawn
whole blood, using manufacturer recommended protocols for library preparation and optical
scanning. The default nicking enzyme NT.BspQI was used for digesting DNA. A total of 12 flow
cells were used for this DNA analysis. We selected 150kb as the threshold for size filter (those
molecules <150kb are removed from raw data), and selected dynamic SNR (signal-to-noise
ratio) filter which allows IrysView to automatically calculate the optimal label SNR. The final set
of cleaned data includes 302.8Gb data on 1,169,210 molecules, with N50 of 264.3kb (Table 10).
Of note, 30.0Gb (9% bin mass fraction) of the data has length over 500kb. The IrysView
software (software version 2.1.1.8025 with RefAligner/Assembly version r3827), in a desktop
computer was used for data reformatting, QC and visualization, yet heavy-duty data analysis
(such as assembly, scaffolding and genome comparison) was performed using a computing
cluster. Additional procedures for data analysis are described in several sections below.
Karyotyping for chromosomal abnormality
To assess large-scale chromomosal abnormality or aneuploidy in the HX1 genome, we
performed karyotyping at the Clinical Diagnostic Laboratory at the Women and Children's
Hospital of Guangdong. Standard karyotyping techniques are used on 5mL freshly drawn blood.
Briefly, after a short-term culture of cells derived from the blood, dividing cells are arrested in
metaphase by addition of colchicine, which poisons the mitotic spindle. The cells are next
treated with a hypotonic solution that causes their nuclei to swell and the cells to burst. The cells
were then examined on a glass slide with stains to examine the structural features of the
chromosomes.
Page 57 of 73
de novo draft genome assembly by Falcon
After we obtained raw data, we performed subread filtering using the SMRT Portal software
(Pacific Biosciences, Meno Park, CA) with default settings for subread protocol. FASTA files
were then pooled together for assembly. The average read length and N50 are 7.0kb and
12.1kb, respectively. Due to the many random errors in SMRT long reads, a hierarchical
genome assembly process assembler called Falcon
88
specifically designed for PacBio reads
was used to do de novo assembly. In particular, we modified Falcon to improve its performance
in an NFS-based computing cluster. The source code is available on github
(https://github.com/WGLab/EnhancedFALCON).
Unlike hybrid assembly method that combines SMRT long reads and Illumina short reads,
Falcon relies on single source of sequencing data for assembly. It first uses DALIGNER
96
to find
overlaps between filtered subreads and then extracts consensus sequences from overlaps (pre-
assembly). The consensus sequences are much more accurate than subreads
88
. DALIGNER is
called again to find overlaps between error-corrected reads. Based on these overlaps, Falcon
constructs a graph where a node denotes a read and a directed edge denotes an overlap. By
removing redundant edges and resolving complicated paths, simple paths can be identified and
later be converted to contigs. Because human genome is diploid, alternative paths may exist in
certain regions. As a result, associate contigs will also be constructed. A visualization of
assembly procedure is shown in Figure 22.
All reads were used to build the assembly (“-a” option in DBsplit). Length cutoff is 6,000 bp for
error-correction and 12,000 bp for graph construction. Detailed configuration is provided as
follows.
# The length cutoff used for seed reads used for initial mapping
length_cutoff = 6000
# The length cutoff used for seed reads usef for pre-assembly
length_cutoff_pr = 12000
#-M means limit memory usage, use integer
pa_HPCdaligner_option = -v -dal128 -t16 -e.70 -l1000 -s1000
ovlp_HPCdaligner_option = -v -dal128 -t32 -h60 -e.96 -l500 -s1000
#-x is length threshold for including reads in DB
#consider add -a option in future to include secondary reads for error correction
pa_DBsplit_option = -x500 -s400 -a
Page 58 of 73
ovlp_DBsplit_option = -x500 -s400 -a
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 6
overlap_filtering_setting = --max_diff 60 --max_cov 100 --min_cov 2 --bestn 10 --n_core 12
Figure 22. Falcon de novo assembly workflow.
Page 59 of 73
Evaluation of draft genome quality by BioNano
We next used the BioNano data as an orthogonal genome mapping approach to evaluate the
quality and completeness of the draft genome assembly. For this analysis, we compared the
draft genome with GRCh38, to assess whether BioNano tend to be better mapped to one
genome versus the other. This analysis is not fair, since a draft assembly contains many short
contigs, and some BioNano reads may not be mappable to these short contigs, resulting in
reduced mapping rate. However, the difference between draft assembly and a reference
genome assembly (GRCh38) can inform us about the completeness of the draft assembly.
Using the "Molecular Quality Report" function in the BioNano IrysView software (version
2.1.1.8025) with RefAligner/Assembly (version r3827), we evaluated what fraction of BioNano
molecules (>150kb) can be mapped to the draft genome assembly and the GRCh38 reference
assembly, respectively. The parameters used are " -nosplit 2 -BestRef 1 -biaswt 0 -Mfast 0 -FP
1.5 -sf 0.2 -sd 0.0 -A 5 -outlier 1e-4 -endoutlier 1e-3 -S -1000 -sr 0.04 -resbias 5 64 -maxmem
36 -M 3 -minlen 150 -T 1e-9 -maxthreads 12 -hashgen 5 3 2.4 1.5 0.05 5.0 1 1 2 -hash -
hashdelta 10 -hashmaxmem 36 -insertThreads 8 -stdout -stderr". These are the default
parameters in the software, except that we used a highly stringent threshold (T=1e-9) to declare
a match and sampled 50,000 molecules (rather than 5,000 as shown by default settings), per
suggestions from BioNano to process human genomes. The mapping rates of BioNano
molecules to the draft assembly and GRCh38 are 78.9% and 80.2%, respectively.
Construction of physical map by BioNano nanochannel array
The BioNano NanoChannel Array (Irys System) linearizes DNA molecules up to megabases in
length, and uses nicking enzymes that recognize characteristic 7-mer in genome sequence to
provide physical maps. This method therefore allows de novo assembly of the human genome,
and such assembly can be further used as scaffolds for PacBio-assembled genomic sequences.
This approach has already been used recently for analyzing human genomes
79
.
Page 60 of 73
We performed de novo genome assembly on the filtered data, with manufacturer-recommended
parameters for human genome (molecular length threshold: 150kb; min label per molecule: 8;
maximum backbone intensity: 0.6; false positive density/100kb: 1.5; false negative rate: 0.15%;
Scaling SD: 0; siteSD: 0.2kb; relative SD: 0.03; initial assembly p-value cutoff: 1e-10; extension
and refinement p-value cutoff: 1e-11; merge p-value cutoff: 1e-15), autonoise adjustment, and 5
iterations of computation. The final results contain 2,346 contigs with the N50 for the assembly
as 1.80Mb.
Improve draft genome assembly by hybrid scaffolding
After de novo assembly of the physical map using BioNano data, we used the physical map as
scaffolds for PacBio-assembled draft genome. This approach is referred to as "hybrid
scaffolding", which identifies regions that PacBio contigs can be grouped together by BioNano
scaffold, and generate new assembled sequences with improved N50 values, as demonstrated
in a recent study (N50=906kb for PacBio data, and N50=13.6Mb after hybrid scaffolding by
BioNano data)
79
.
With extensive discussion with BioNano, we developed the following strategies and wrote a few
custom scripts (available upon requests) for post-processing of hybrid scaffolds and generating
the complete genome sequence. This procedure is described here:
(1) Use IrysView software to perform hybrid scaffolding, with BioNano assembly as the
scaffold, and PacBio assembly (as raw FASTA sequence rather than enzyme-digested
cMAP) as the query sequence.
(2) Use 'Export to FASTA' option in the IrysView software to export the hybrid assembly into
a FASTA file.
(3) Identify the set of contigs (as cMAP IDs) that were filtered out during the scaffolding
process, through a search of the intermediate results in hybrid scaffolding, and back-
track the identifier of the contigs through the key file generated within the intermediate
results
(4) retrieve these 'discarded' contigs from the original FASTA file from PacBio assembly,
and supplement them into the hybrid assembly to generate the final assembly.
Page 61 of 73
Consensus quality and scaffolding accuracy analysis
Assembly consensus quality was evaluated using MUMmer (nucmer, delta-filter and dnadiff).
Scaffolding accuracy was evaluated following Pendleton’s method
79
with a few modifications.
Sliding 100Kb windows (from start to end) were selected in the assembly to be evaluated.
250bp sequences at head and tail of each window were extracted and aligned to GRCh38 main
chromosomes using BWA-MEM. Among uniquely aligned pairs of reads with mapping quality >
30, if two reads are not within 90 Kb or 110Kb to each other, the corresponding window was
considered mis-joined.
Short read polishing
Illumina short reads were aligned to HX1 contigs with BWA-MEM. SAMtools was used to call
indel and SNVs. Each homozygous (heterozygosity rate<0.1) indel or SNV with genotype
quality > 30 was considered as an error in the assembly. HX1 contigs were polished by
replacing reference alleles with corresponding alternative alleles to the assembly at each variant
site. This procedure was repeated three times to achieve higher consensus quality and lower
indel error rate.
Gap filling in GRCh38
We developed a Gap Filling by Assembly (GFA) procedure for closing gaps in the reference
genome. Any region consisting of continuous runs of N in the target assembly (GRCh38) is
defined as a gap in our method. Therefore, gaps between scaffolds are not considered in our
analysis. Gaps within 200bp to each other are merged. Flanking sequences upstream and
downstream of the gaps were mapped to the source assembly (HX1). If two anchor sequences
for the same gap can both be aligned, they will be examined to remove discordant pairs which
include those alignments with inconsistent orientation, on different contigs, or overlapping with
each other. If only one anchor can be aligned, then the anchor will be extended into the gap
region wherever possible. Code used for gap filling has been deposited to github
(https://github.com/WGLab/uniline). The above steps are illustrated in Figure 23.
For each closable gap, a probability is calculated based on the mapping score and gap length.
A brief description is shown below.
: ℎ
1
: ℎ 1
2
: ℎ 2
Page 62 of 73
: ℎ
= 1
2
When only single anchor is mapped:
= min( 1
, 2
)
1
and 2
come from mapping quality score of BWA-MEM algorithm, which takes both
alignment quality and sequence context into consideration. We make the assumption that
mapping and gap length are independent of each other. This assumption does not necessarily
hold in all circumstances, and can be improved by explicitly modeling the insertion and deletion
length in the target genome (where gaps are to be filled).
(predicted gap is equal to original gap by chance) is calculated as follows.
0
: ℎ
: ℎ
:
Let = − 0
. We assume ~ (0, 2
0
2
), that is we model the difference between predicted
gap length and original gap length with a normal distribution. Therefore we have
= �
� − � − 0
� ≤ ≤ � − 0
� � ≠ 0
( −0.5 ≤ ≤ 0.5) = 0
Intuitively, means the (random) chance observing a gap with or less extreme given 0
under a normal distribution. The following table shows how the model behaves under some
common situations (k=2). As is shown, the model permits ~10% of flexibility at a threshold of 30
and does not penalize harshly when deviates two to three times from 0
.
L
0
1000
L
g
500 900 1000 2000 2500 3000
Phred-scaled P
g
(rounded to
nearest integer)
16 32 85 10 6 4
Page 63 of 73
In our analysis, k = 2 was used to calculate gap closing score. Only uniquely closable gaps
were shown in results, i.e., flanking sequences of these gaps were uniquely mapped. Gap
closing score of 30 were used to filter out low quality predictions.
Figure 23. Illustration of the Gap Filling by Assembly (GFA) procedure.
Page 64 of 73
Finding novel sequences and calculating genome coverage
MUMmer
89,97,98
is among the first programs that utilizes suffix-tree algorithm to find maximal
matches. The time requirement for building suffix-tree for a sequence of length n is of O(n).
Given a suffix tree of S and a query sequence Q of length m, all unique maximal matches can
be found in time proportional to m. The architecture of MUMmer makes it extremely fast for
genome-scale sequence comparison. For each contig in GRCh38 including alt loci
(http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/analysisSet/) and decoy sequences
(hs38d1), we mapped our contigs with MUMmer. The results (*.coords) were post-processed by
custom scripts to filter and extract regions unmappable with >80% identity and gap allowance of
1kb. Here gap allowance means maximum gap length allowed between two closely mapped
regions that will be merged together. However, MUMmer is not as sensitive as BLAST
99
or its
descendent, LastZ
100
, for genome-scale comparison, so a second round of mapping by LastZ
was used to further refine unmapped sequences. LastZ is a successor for BLASTZ
101,102
, which
is an independent implementation of BLAST
99
. BLASTZ
101,102
differs from BLAST primarily in two
aspects: BLASTZ has an option that requires the alignments must occur in the same order and
orientation in both reference and query sequences; BLASTZ uses a different scoring scheme
that takes sequence conservation into consideration. LastZ
100
was developed based on BLASTZ
and is more robust and flexible. LastZ was run with parameters lenient for gap extension,
repetitive sequences. Hence large gaps that are possibly structural variations would not hinder
alignment, and repetitive sequences do not cause significant slowdown of alignment speed.
Regions unmapped by LastZ with gapped threshold of 10,000 and gap allowance of 1kb are
considered as novel sequences.
Percentage of coverage on GRCh38 (no_alt_analysis) by different assemblies is calculated
based off of the results from MUMmer. Same alignment procedure was used as described for
finding novel sequence. Regions mapped with >80% identity and gap allowance of 1kb are
considered as valid alignments. Percentage of non-N base pairs covered by these valid
mappings is denoted as percentage of coverage.
The commands used for MUMmer and LastZ are shown as follows. Custom scripts are
deposited in github (https://github.com/WGLab/uniline).
#commands for mummer
nucmer -c 400 -l 150 --prefix=$prefix $ref $query
Page 65 of 73
show-coords -r -c -l -k $prefix.delta > $prefix.coords
#commands for LastZ
lastz $ref $query --notransition --gap=1000,1 --step=20 --ambiguous=iupac --format=maf --gappedthresh=10000 --
identity=80 --progress=10 --maxwordcount=1 --masking=0
Comparative analysis using NIST Chinese genome
To assess quality of HX1 and GRCh38 as reference genome for variant calling, we downloaded
a Chinese sample NA24694 HiSeq reference data at http://ftp-
trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/HG006_NA24694-
huCA017E_father/NA24694_Father_HiSeq100x/NA24694_Father_HiSeq100x_fastqs/ from
National Institute of Standards and Technology (NIST). In total, we obtained 907 million reads
(133 billion bases), with an estimated coverage of 41X. The raw FASTQ files were aligned to
both HX1 and GRCh38 by BWA. For HX1, we achieved mapping rate of 99.22% and for
GRCh38 we achieved mapping rate of 99.56%. Reads that cannot be mapped to GRCh38 were
further mapped against HX1.
Functional analysis on novel sequence elements
To explore the potential biological function in the novel sequences, we used GM12878 cell line
and downloaded 17 of its associated ChIP-Seq data sets with four assays, including H3K4me3,
H3K4me1, H3K27ac and CTCF, and 7 DNase-Seq data sets for the same cell line.
The raw FASTQ files were aligned to GRCh38 using bowtie2 (2.0.5)
23
. To investigate whether
HX1 can be a complementary reference to GRCh38 for mapping sequencing reads for
nucleosome positioning, transcriptional factor binding sites and histone modification, we then
extracted unmapped ENCODE reads from GRCh38 and realigned them to HX1 with bowtie2.
MACS
103,104
was used to call ChIP-seq peaks and DNase hot spots. We used settings
recommended by authors of MACS. Specifically, for histone mark ChIP-seq, we used “--
nomodel” option plus a control data set; for transcription factor, we used default parameters plus
a control data set; for DNase-seq, we used “--nolambda” option plus no control data.
To examine how these regulatory elements can potentially modify gene expression for HX1, we
performed transcriptome analysis using Illumina RNA-Seq data as follows. First, in order to
assess the expression level for each transcript in HX1,we used Cufflinks to generate
transcriptome assemblies and used FPKM from this assembly to assess the expression level of
each gene predicted by Cufflinks. We examined the distribution of FPKM values of all 36,042
genes predicted by Cufflinks. Note that here we used cutoff of 500bp to ensure that all predicted
Page 66 of 73
genes are likely to be real genes, rather than short artifacts generated from Cufflinks prediction.
From the pattern of the distribution FPKM values of all predicted genes, we observed that after
natural log transformation, the distribution of FPKM values follow roughly Gaussian distribution
with several outliers in the tail. Therefore, we excluded additional 180 potential outliers using
99.99% percentile (FPKM=22,026, natural log FPKM=10) as cutoff and used the remaining
35,862 transcripts for downstream analysis. Then, for each regulatory element, we extracted the
FPKM value for nearest gene that is located within its 500kb flanking regions (250kb upstream
and 250kb downstream) and compared it with the background FPKM value of whole
transcriptome in HX1. Here we define background gene set as set of genes that are not flanking
CTCF binding sites, DNase I hypersensitivity sites, H3K4me1, H3K4me3 or H3K27ac histone
marks. We found that there are 2,338 unique genes located within 500kb flanking regions of
CTCF binding sites, 6,061 unique genes near DNase I hypersensitivity sites, 513 genes near
H3K4me1 histone modification sites, 3,146 genes near H4K4me3 sites and 2,079 genes near
H3K27ac sites. Overall, genes flanking regulatory regions in HX1 have increased expression
level compared to the background. For histone modifications, H3K4me3 has the highest median
FPKM value for its regulated genes (median FPKM=10.48) compared to the background
(median FPKM=3.67) and such difference is significant (P<10
-10
by Wilcoxon Rank-sum
test);H3K4me1 has the lowest median FPKM value for its regulated genes (median FPKM=8.76)
and yet such expression level is still significantly higher than the background (P<10
-10
by
Wilcoxon Rank-sum test). This may imply that in HX1 sequences, there could exist activated
enhancers that increase the expression of nearby genes. Similarly, for genes near DNase I
hypersensitivity sites and genes near CTCF-binding sites,their expression is also significantly
higher than the background (P<10
-10
by Wilcoxon Rank-sum test for both DNase and CTCF
assay).
As for regulatory regions in novel sequences, we found 2 predicted genes flanking CTCF
binding sites, 11 genes flanking DNase hypersensitivity sites, 1 gene for H3K4me1 sites, 1 gene
for H3K27ac sites and no genes flanking H3K4me3 sites.Overall, genes flanking regulatory
regions in HX1 novel sequences have increased expression level compared to the background.
Here we define background gene set as set of genes that are not flanking CTCF binding sites,
DNase I hypersensitivity sites, H3K4me1, H3K4me3 or H3K27ac histone marks that are located
in novel sequences in HX1. For histone modifications, H3K4me3 and H3K4me1 are both linked
to a predicted gene located at contig 001058F coordinate of 181,365 to 182,584, with 1.31-fold
increased expression (FPKM=5.79), compared to the background (median FPKM=4.41). Genes
Page 67 of 73
flanking CTCF binding sites and DNase hypersensitivity I sites in novel sequences also have
higher expression than background. For example, for CTCF binding sites, the associated genes
have 6.23-fold increased expression (median FPKM=27.48) compared to the background. This
may imply the existence of potential functional regulatory elements in HX1 novel sequences.
Structural variation calling (Illumina whole-genome sequencing)
To detect CNVs from the Illumina WGS data, the read depth based tool CNVnator was used,
which uses a Mean-shift Algorithm, to iteratively partition and merge genomic segments for
multiple steps
105
.
For whole-genome sequencing data, all the Illumina short reads were first aligned onto GRCh38
genome, with BWA-MEM
106
, to generate the BAM file. Afterwards, the BAM file was used as
input into CNVnator
105
, to generate CNV calls, with the bin size set as 100 bp. The output is a
list of CNVs, with 37,665 deletions and 1,391 duplications. After filtration with P<0.01 standard
and removal of CNVs overlapping with Immunoglobulin regions and centromeres, there are in
total 2403 deletions and 783 duplications left.
Structural variation calling (BioNano genome map)
Through comparison to an expected digestion map for the GRCh38 genome, the BioNano
Genome Map data can be used to identify structural variations. Using the IrysView software, we
performed a comparison of the cMAP (expected 7-mer maps digested by enzyme) of our
BioNano assembly with GRCh38, and identified 783 insertions, 377 deletions in the HX1
genome in comparison to GRCh38.
Structural variation calling (SMRT long reads)
We identified structural variants >=50 bp from PacBio long-read sequencing data using FES-SV
as previously described
90
. Briefly, we aligned the PacBio reads to GRCh38 using a modified
version of BLASR (https://github.com/mchaisso/blasr) with affine alignment parameters (-bestn
2 -maxAnchorsPerPosition 100 -advanceExactMatches 10 -affineAlign -affineOpen 100 -
affineExtend 0 -insertion 5 -deletion 5 -extend -maxExtendDropoff 50) to identify 49,054
candidate sites of structural variation. We then locally assembled all reads mapping to each
candidate site using MHAP and Celera (8.3rc1) and aligned each local assembly to its original
locus using BLASR with refined affine alignment parameters (-affineAlign -affineOpen 8 -
affineExtend 0 -bestn 1 -maxMatch 30 -sdpTupleSize 13) to identify the precise breakpoints of
all structural variants. Additionally, we performed the same local assemblies in 152,251 sliding
windows across the genome (60 Kbp withi 20 Kbp slide) to discover any additional variants that
Page 68 of 73
might have been missed by the initial candidate discovery. Finally, we annotated the repetitive
content of the sequence associated with each insertion and deletion using RepeatMasker (3.3.0)
with sensitive alignment parameters (-xsmall -no_is -e wublast -s).
References
1. Rabbani, B., Mahdieh, N., Hosomichi, K., Nakaoka, H. & Inoue, I. Next-generation sequencing:
impact of exome sequencing in characterizing Mendelian disorders. J Hum Genet 57, 621-32
(2012).
2. Bamshad, M.J. et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev
Genet 12, 745-55 (2011).
3. Meyerson, M., Gabriel, S. & Getz, G. Advances in understanding cancer genomes through
second-generation sequencing. Nat Rev Genet 11, 685-96 (2010).
4. Mardis, E.R. The impact of next-generation sequencing technology on genetics. Trends Genet 24,
133-41 (2008).
5. Morozova, O. & Marra, M.A. Applications of next-generation sequencing technologies in
functional genomics. Genomics 92, 255-64 (2008).
6. Genomes Project, C. et al. A map of human genome variation from population-scale sequencing.
Nature 467, 1061-73 (2010).
7. Lam, H.Y. et al. Performance comparison of whole-genome sequencing platforms. Nat
Biotechnol 30, 78-82 (2012).
8. O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for
exome and genome sequencing. Genome Med 5, 28 (2013).
9. Ruffalo, M., LaFramboise, T. & Koyuturk, M. Comparative analysis of algorithms for next-
generation sequencing read alignment. Bioinformatics 27, 2790-6 (2011).
10. Hull, D. et al. Taverna: a tool for building and running workflows of services. Nucleic Acids Res 34,
W729-32 (2006).
11. Reich, M. et al. GenePattern 2.0. Nat Genet 38, 500-1 (2006).
12. Abouelhoda, M., Issa, S.A. & Ghanem, M. Tavaxy: integrating Taverna and Galaxy workflows
with cloud computing support. BMC Bioinformatics 13, 77 (2012).
13. Goecks, J., Nekrutenko, A., Taylor, J. & Galaxy, T. Galaxy: a comprehensive approach for
supporting accessible, reproducible, and transparent computational research in the life sciences.
Genome Biol 11, R86 (2010).
14. Pabinger, S. et al. A survey of tools for variant analysis of next-generation genome sequencing
data. Brief Bioinform (2013).
15. Gentleman, R.C. et al. Bioconductor: open software development for computational biology and
bioinformatics. Genome Biol 5, R80 (2004).
16. Stajich, J.E. et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res 12, 1611-8
(2002).
17. Chang, X. & Wang, K. wANNOVAR: annotating genetic variants for personal genomes via the
web. J Med Genet 49, 433-6 (2012).
18. Krampis, K. et al. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for
the genomics community. BMC Bioinformatics 13, 42 (2012).
19. Nocq, J., Celton, M., Gendron, P., Lemieux, S. & Wilhelm, B.T. Harnessing virtual machines to
simplify next-generation DNA sequencing analysis. Bioinformatics 29, 2075-83 (2013).
Page 69 of 73
20. Angiuoli, S.V. et al. CloVR: a virtual machine for automated and portable sequence analysis from
the desktop using cloud computing. BMC Bioinformatics 12, 356 (2011).
21. Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Res 19,
1124-32 (2009).
22. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment of
short DNA sequences to the human genome. Genome Biol 10, R25 (2009).
23. Langmead, B. & Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357-9
(2012).
24. Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966-
7 (2009).
25. Zaharia, M. et al. Faster and More Accurate Sequence Alignment with SNAP. in ArXiv e-prints Vol.
1111 5572 (2011).
26. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-
generation DNA sequencing data. Genome Res 20, 1297-303 (2010).
27. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-9
(2009).
28. Koboldt, D.C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer
by exome sequencing. Genome Res 22, 568-76 (2012).
29. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156-8 (2011).
30. R. Pandya, W.B., M. Zaharia, T. Sittler, K. Curtis, C. Hartl, A. Fox, S. Schenker, I. Stoica, D.
Patterson. SNAP: fast, accurate sequence alignment enabling biological applications. ASHG
meeting 2014, San Diego (2014).
31. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from
high-throughput sequencing data. Nucleic Acids Res 38, e164 (2010).
32. Wang, W., Wei, Z., Lam, T.W. & Wang, J. Next generation sequencing has lower sequence
coverage and poorer SNP-detection capability in the regulatory regions. Sci Rep 1, 55 (2011).
33. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage
analyses. Am J Hum Genet 81, 559-75 (2007).
34. Utsunomiya, Y.T. et al. mendelFix: a Perl script for checking Mendelian errors in high density SNP
data of trio designs. arXiv:1306.2243 (2013).
35. Chen, H. VennDiagram: Generate high-resolution Venn and Euler plots. CRAN (2011).
36. Altshuler, D.M. et al. An integrated map of genetic variation from 1,092 human genomes.
Nature 491, 56-65 (2012).
37. DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation
DNA sequencing data. Nat Genet 43, 491-8 (2011).
38. Lyon, G.J. et al. Exome sequencing and unrelated findings in the context of complex disease
research: ethical and clinical implications. Discovery medicine 12, 41 (2011).
39. Guo, Y., Ding, X., Shen, Y., Lyon, G.J. & Wang, K. SeqMule: automated pipeline for analysis of
human exome/genome sequencing data. Sci Rep 5, 14283 (2015).
40. Genomes Project, C. et al. An integrated map of genetic variation from 1,092 human genomes.
Nature 491, 56-65 (2012).
41. Li, B. et al. A likelihood-based framework for variant calling and de novo mutation detection in
families. PLoS Genet 8, e1002944 (2012).
42. Ramu, A. et al. DeNovoGear: de novo indel and point mutation discovery and phasing. Nat
Methods 10, 985-7 (2013).
43. Peng, G. et al. Rare variant detection using family-based sequencing analysis. Proc Natl Acad Sci
U S A 110, 3985-90 (2013).
Page 70 of 73
44. Nielsen, R., Korneliussen, T., Albrechtsen, A., Li, Y. & Wang, J. SNP calling, genotype calling, and
sample allele frequency estimation from New-Generation Sequencing data. PLoS One 7, e37558
(2012).
45. Afgan, E. et al. Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinformatics 11
Suppl 12, S4 (2010).
46. Blanca, J.M., Pascual, L., Ziarsolo, P., Nuez, F. & Canizares, J. ngs_backbone: a pipeline for read
cleaning, mapping and SNP calling using next generation sequence. BMC Genomics 12, 285
(2011).
47. Shi, L. et al. inverted question markGenotype-first inverted question mark approaches on a
curious case of idiopathic progressive cognitive decline. BMC Med Genomics 7, 66 (2014).
48. Jia, H., Guo, Y., Zhao, W. & Wang, K. Long-range PCR in next-generation sequencing: comparison
of six enzymes and evaluation on the MiSeq sequencer. Sci Rep 4, 5737 (2014).
49. Zhang, X. et al. Exome sequencing on malignant meningiomas identified mutations in
neurofibromatosis type 2 (NF2) and meningioma 1 (MN1) genes. Discov Med 18, 301-11 (2014).
50. Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide association
loci for human diseases and traits. Proc Natl Acad Sci U S A 106, 9362-7 (2009).
51. Maurano, M.T. et al. Systematic localization of common disease-associated variation in
regulatory DNA. Science 337, 1190-5 (2012).
52. Bernstein, B.E. et al. An integrated encyclopedia of DNA elements in the human genome. Nature
489, 57-74 (2012).
53. Schaub, M.A., Boyle, A.P., Kundaje, A., Batzoglou, S. & Snyder, M. Linking disease associations
with regulatory information in the human genome. Genome Res 22, 1748-59 (2012).
54. McCarthy, M.I. & Hirschhorn, J.N. Genome-wide association studies: potential next steps on a
genetic journey. Hum Mol Genet 17, R156-65 (2008).
55. Ward, L.D. & Kellis, M. HaploReg: a resource for exploring chromatin states, conservation, and
regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res 40, D930-
4 (2012).
56. Boyle, A.P. et al. Annotation of functional variation in personal genomes using RegulomeDB.
Genome Res 22, 1790-7 (2012).
57. Li, M.J., Wang, L.Y., Xia, Z., Sham, P.C. & Wang, J. GWAS3D: Detecting human regulatory variants
by integrative analysis of genome-wide associations, chromosome interactions and histone
modifications. Nucleic Acids Res 41, W150-8 (2013).
58. Coetzee, S.G., Rhie, S.K., Berman, B.P., Coetzee, G.A. & Noushmehr, H. FunciSNP: an
R/bioconductor tool integrating functional non-coding data sets with genetic association studies
to identify candidate regulatory SNPs. Nucleic Acids Res 40, e139 (2012).
59. Pruim, R.J. et al. LocusZoom: regional visualization of genome-wide association scan results.
Bioinformatics 26, 2336-7 (2010).
60. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding
principles of the human genome. Science 326, 289-93 (2009).
61. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization.
Nat Methods 9, 215-6 (2012).
62. Liang, G. et al. Distinct localization of histone H3 acetylation and H3-K4 methylation to the
transcription start sites in the human genome. Proc Natl Acad Sci U S A 101, 7357-62 (2004).
63. Barski, A. et al. High-resolution profiling of histone methylations in the human genome. Cell 129,
823-37 (2007).
64. Koch, C.M. et al. The landscape of histone modifications across 1% of the human genome in five
human cell lines. Genome Res 17, 691-707 (2007).
Page 71 of 73
65. Bartke, T. et al. Nucleosome-interacting proteins regulated by DNA and histone methylation. Cell
143, 470-84 (2010).
66. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA sequence and
chromatin accessibility data. Genome Res 21, 447-55 (2011).
67. Gilad, Y., Rifkin, S.A. & Pritchard, J.K. Revealing the architecture of gene regulation: the promise
of eQTL studies. Trends Genet 24, 408-15 (2008).
68. Li, W., Notani, D. & Rosenfeld, M.G. Enhancers as non-coding RNA transcription units: recent
insights and future perspectives. Nat Rev Genet 17, 207-23 (2016).
69. Yang, X. et al. Genome-wide association study for serum complement C3 and C4 levels in
healthy Chinese subjects. PLoS Genet 8, e1002916 (2012).
70. Makinde, V.A., Senaldi, G., Jawad, A.S., Berry, H. & Vergani, D. Reflection of disease activity in
rheumatoid arthritis by indices of activation of the classical complement pathway. Ann Rheum
Dis 48, 302-6 (1989).
71. Zeller, T. et al. Genetics and beyond--the transcriptome of human monocytes and disease
susceptibility. PLoS One 5, e10693 (2010).
72. Wellcome Trust Case Control, C. Genome-wide association study of 14,000 cases of seven
common diseases and 3,000 shared controls. Nature 447, 661-78 (2007).
73. Kottgen, A. et al. Genome-wide association analyses identify 18 new loci associated with serum
urate concentrations. Nat Genet 45, 145-54 (2013).
74. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing.
Genome Res 20, 265-72 (2010).
75. Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from massively parallel
sequence data. Proc Natl Acad Sci U S A 108, 1513-8 (2011).
76. Alkan, C., Sajjadian, S. & Eichler, E.E. Limitations of next-generation genome sequence assembly.
Nat Methods 8, 61-5 (2011).
77. Chaisson, M.J., Wilson, R.K. & Eichler, E.E. Genetic variation and the de novo assembly of human
genomes. Nat Rev Genet (2015).
78. Cao, H. et al. De novo assembly of a haplotype-resolved human genome. Nat Biotechnol 33, 617-
22 (2015).
79. Pendleton, M. et al. Assembly and diploid architecture of an individual human genome via
single-molecule technologies. Nat Methods 12, 780-6 (2015).
80. Jakobsson, M. et al. Genotype, haplotype and copy-number variation in worldwide human
populations. Nature 451, 998-1003 (2008).
81. Sudmant, P.H. et al. Global diversity, population stratification, and selection of human copy
number variation. Science (2015).
82. Sudmant, P.H. et al. An integrated map of structural variation in 2,504 human genomes. Nature
526, 75-81 (2015).
83. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, 60-5 (2008).
84. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol 5, e254 (2007).
85. Li, R. et al. Building the sequence map of the human pan-genome. Nat Biotechnol 28, 57-63
(2010).
86. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133-8
(2009).
87. Cao, H., Tegenfeldt, J.O., Austin, R.H. & Chou, S.Y. Gradient nanostructures for interfacing
microfluidics and nanofluidics. Appl Phys Lett 81, 3058-3060 (2002).
88. Chin, C.S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT
sequencing data. Nat Methods 10, 563-9 (2013).
Page 72 of 73
89. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol 5, R12
(2004).
90. Chaisson, M.J. et al. Resolving the complexity of the human genome using single-molecule
sequencing. Nature 517, 608-11 (2015).
91. Lam, E.T. et al. Genome mapping on nanochannel arrays for structural variation analysis and
sequence assembly. Nat Biotechnol 30, 771-6 (2012).
92. Layer, R.M., Chiang, C., Quinlan, A.R. & Hall, I.M. LUMPY: a probabilistic framework for structural
variant discovery. Genome Biol 15, R84 (2014).
93. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012).
94. International Human Genome Sequencing, C. Finishing the euchromatic sequence of the human
genome. Nature 431, 931-45 (2004).
95. Whole genome? Nat Genet 47, 963 (2015).
96. Myers, E.W. The fragment assembly string graph. Bioinformatics 21 Suppl 2, ii79-85 (2005).
97. Delcher, A.L. et al. Alignment of whole genomes. Nucleic Acids Res 27, 2369-76 (1999).
98. Delcher, A.L., Phillippy, A., Carlton, J. & Salzberg, S.L. Fast algorithms for large-scale genome
alignment and comparison. Nucleic Acids Res 30, 2478-83 (2002).
99. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res 25, 3389-402 (1997).
100. Harris, R.S. IMPROVED PAIRWISE ALIGNMENT OF GENOMIC DNA. PhD Thesis (2007).
101. Schwartz, S. et al. PipMaker--a web server for aligning two genomic DNA sequences. Genome
Res 10, 577-86 (2000).
102. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res 13, 103-7 (2003).
103. Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol 9, R137 (2008).
104. Feng, J., Liu, T. & Zhang, Y. Using MACS to identify peaks from ChIP-Seq data. Curr Protoc
Bioinformatics Chapter 2, Unit 2 14 (2011).
105. Abyzov, A., Urban, A.E., Snyder, M. & Gerstein, M. CNVnator: an approach to discover, genotype,
and characterize typical and atypical CNVs from family and population genome sequencing.
Genome research 21, 974-984 (2011).
106. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv
preprint arXiv:1303.3997 (2013).
Page 73 of 73
Asset Metadata
Creator
Guo, Yunfei (author)
Core Title
Novel statistical and computational methods for analyzing genome variation
Contributor
Electronically uploaded by the author
(provenance)
School
Keck School of Medicine
Degree
Doctor of Philosophy
Degree Program
Biostatistics
Publication Date
06/30/2016
Defense Date
05/31/2016
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
annotation,BioNano,causal variant,de novo assembly,GWAS,indel,NGS,OAI-PMH Harvest,pipeline,SMRT sequencing,SNP,Visualization
Format
application/pdf
(imt)
Language
English
Advisor
Conti, David (
committee chair
), Thomas, Paul D. (
committee member
), Wang, Kai (
committee member
)
Creator Email
guoyunfei1989@gmail.com,yunfeigu@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-259466
Unique identifier
UC11280298
Identifier
etd-GuoYunfei-4491.pdf (filename),usctheses-c40-259466 (legacy record id)
Legacy Identifier
etd-GuoYunfei-4491.pdf
Dmrecord
259466
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Guo, Yunfei
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Abstract (if available)
Abstract
Chapter 1. Automated, Accurate And Flexible Variant Detection: Next-generation sequencing (NGS) technology has greatly improved our ability to identify disease-contributory variants for Mendelian diseases. However, users are often faced with issues such as software compatibility, complicated configuration, and no access to high-performance computing facility. More importantly, discrepancies exist among different aligners and variant callers. We developed a computational pipeline, SeqMule, and a number of helper methods, to perform automated variant calling from NGS data on human genomes and exomes. SeqMule integrates computational-cluster-free parallelization capability built on top of the variant callers, and facilitates normalization/intersection of variant calls to generate consensus set with high confidence. SeqMule integrates 5 alignment tools, 5 variant calling algorithms and accepts various combinations all by one-line command, therefore allowing highly flexible yet fully automated variant calling. In a modern machine (2 Intel Xeon X5650 CPUs, 48GB memory), when fast turn-around is needed, SeqMule generates annotated VCF files in a day from a 30X whole-genome sequencing data set
Tags
annotation
BioNano
causal variant
de novo assembly
GWAS
indel
NGS
pipeline
SMRT sequencing
SNP
Linked assets
University of Southern California Dissertations and Theses