Close
The page header's logo
About
FAQ
Home
Login
USC Login
Register
0
Selected 
Invert selection
Deselect all
Deselect all
 Click here to refresh results
 Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
Novel statistical and computational methods for analyzing genome variation
(USC Thesis Other) 

Novel statistical and computational methods for analyzing genome variation

doctype icon
play button
PDF
 Download
 Share
 Open document
 Flip pages
 More
 Download a page range
 Download transcript
Copy asset link
Request this asset
Transcript (if available)
Content Title
Novel statistical and computational methods for
analyzing genome variation




Author
Yunfei Guo
Conferring Program
University of Southern California Graduate School
Degree being conferred
PhD in Biostatistics
Degree conferral date
August 6, 2016


Acknowledgements
I wish to acknowledge my indebtedness to my mentor Dr. Kai Wang for his instrumental
guidance and inspiring encouragement during the entire project. Special thanks are due to Dr.
David Conti and Dr. Paul Thomas for their openness, criticism and support. I am also deeply
indebted for the support and understanding of my colleagues and friends, especially Chengliang
Dong and Hui Yang. I am particularly grateful to my fiancé, Xiaolu Han, whose strength, and
care made the critical difference for completion of this effort.
 


Table of contents
Chapter 1. Automated, Accurate And Flexible Variant Detection ....................................................... 1
Abstract .................................................................................................................................................... 1
INTRODUCTION .................................................................................................................................... 1
MATERIAL AND METHODS ................................................................................................................ 3
Workflow .............................................................................................................................................. 3
Built-in parallel processing ................................................................................................................ 6
Support for cluster environment ....................................................................................................... 9
Consensus result generation ............................................................................................................ 9
Analysis summary ............................................................................................................................ 10
RESULTS .............................................................................................................................................. 11
Consistency and accuracy evaluation ........................................................................................... 11
Aligner consistency evaluation ....................................................................................................... 15
Mendelian error evaluation ............................................................................................................. 16
Performance evaluation .................................................................................................................. 19
Mendelian disease disease-contributory variant identification .................................................. 21
DISCUSSION ........................................................................................................................................ 22
Chapter 2. Framework for prioritizing GWAS loci with epigenetic features .................................... 24
ABSTRACT ........................................................................................................................................... 24
INTRODUCTION .................................................................................................................................. 24
IMPLEMENTATION ............................................................................................................................. 26
Architecture ....................................................................................................................................... 26
Input and output ................................................................................................................................ 26
CONCLUSIONS ................................................................................................................................... 32
Chapter 3. Fast De Novo Genome Assembly, Evaluation, Gap filling and Error Correction ........ 33
ABSTRACT ........................................................................................................................................... 33
INTRODUCTION .................................................................................................................................. 33
RESULTS .............................................................................................................................................. 34
Generation of de novo genome assembly on a Chinese genome ........................................... 34
Gap filling on the reference genome by de novo assembly ....................................................... 43
Characterization of structural variants and novel sequences .................................................... 45


DISCUSSION ........................................................................................................................................ 55
METHODS............................................................................................................................................. 56
Generation of sequence data ......................................................................................................... 56
BioNano genome mapping ............................................................................................................. 57
Karyotyping for chromosomal abnormality ................................................................................... 57
de novo draft genome assembly by Falcon ................................................................................. 58
Evaluation of draft genome quality by BioNano........................................................................... 60
Construction of physical map by BioNano nanochannel array ................................................. 60
Improve draft genome assembly by hybrid scaffolding .............................................................. 61
Consensus quality and scaffolding accuracy analysis ............................................................... 62
Short read polishing ......................................................................................................................... 62
Gap filling in GRCh38 ...................................................................................................................... 62
Finding novel sequences and calculating genome coverage .................................................... 65
Comparative analysis using NIST Chinese genome .................................................................. 66
Functional analysis on novel sequence elements ....................................................................... 66
Structural variation calling (Illumina whole-genome sequencing) ............................................. 68
Structural variation calling (BioNano genome map) ................................................................... 68
Structural variation calling (SMRT long reads) ............................................................................ 68
References ................................................................................................................................................ 69

 



Chapter 1. Automated, Accurate And Flexible Variant Detection
Abstract
Next-generation sequencing (NGS) technology has greatly improved our ability to identify
disease-contributory variants for Mendelian diseases. However, users are often faced with
issues such as software compatibility, complicated configuration, and no access to high-
performance computing facility. More importantly, discrepancies exist among different aligners
and variant callers. We developed a computational pipeline, SeqMule, and a number of helper
methods, to perform automated variant calling from NGS data on human genomes and exomes.
SeqMule integrates computational-cluster-free parallelization capability built on top of the variant
callers, and facilitates normalization/intersection of variant calls to generate consensus set with
high confidence. SeqMule integrates 5 alignment tools, 5 variant calling algorithms and accepts
various combinations all by one-line command, therefore allowing highly flexible yet fully
automated variant calling. In a modern machine (2 Intel Xeon X5650 CPUs, 48GB memory),
when fast turn-around is needed, SeqMule generates annotated VCF files in a day from a 30X
whole-genome sequencing data set; when more accurate calling is needed, SeqMule generates
consensus call set that improves over single callers, as measured by both Mendelian error rate
and consistency. In addition, SeqMule supports Sun Grid Engine for parallel processing, offers
turn-key solution for deployment on Amazon Web Services, allows automated quality check,
Mendelian error check, consistency evaluation, variant normalization, HTML-based summary of
results. SeqMule is available at http://seqmule.openbioinformatics.org.
INTRODUCTION
The development of next-generation sequencing (NGS) technologies has dramatically changed
the landscape of human genetics research
1-6
. Identifying disease-contributory variants for
various human genetic diseases will greatly improve diagnosis and facilitate development of
therapies.  
Page 1 of 73


However, besides discrepancies associated with sequencing platforms
7
, there is still
considerable variation across variant calling algorithms; for example, we previously reported
SNV concordance of only 57.4% for 5 bioinformatics pipelines (SOAP, BWA-GATK, BWA-
SNVer, GNUMAP, BWA-SAMtools), while 0.5-5.1% variants were called as unique to each
pipeline
8
. Performance of aligners also varies under different sequencing error rates and indel
distribution
9
. Yet few published pipelines offer two or more alternative aligner and variant calling
programs
10-14
. While some workflow management systems do provide more flexibility
10-13
, local
installation and configuration is highly challenging for average users. Therefore, there is a
strong community need for a comprehensive and flexible pipeline that allows easy execution
and integration of multiple tools.
There are multiple challenges for building such a pipeline. Installation and configuration poses
the first problem, and the severity of this problem is evidenced by numerous attempts to
address it
15-17
. Software libraries such as Bioconductor
15
and Bioperl
16
, and web-based
interfaces [e.g.
17
] all aim to provide ease of access. The diversity of bioinformatics tools has
paradoxically given rise to one more layer of complexity. In a typical variant calling analysis, 4 to
6 tools might be required to perform QC (quality check), alignment, sorting, and variant calling.
Ideally, the output from one program can be fed into another one as is. In real-world scenarios,
this might not be the case. For instance, GATK does not accept output from SOAP2 aligner.
Another issue is that constant and asynchronous development of the software would, from time
to time, lead to loss of compatibility and break down of what was working. Even if compatibility
issues can be solved, reproducibility will be difficult to maintain across highly heterogeneous
pipelines. A pre-packaged virtual machine (VM) provides users with an alternative to address
this problem
18-20
. However, having two operating systems running on the same machine means
at least 1 CPU core and a few gigabytes of memory must be reserved for the host OS, and
unavoidably limits the computational resources available for the guest system. Adding another
layer of operating system also increases computational overhead by 13% to 28% compared
with performance on a native system
19
. Finally, VM implementation reduces flexibility of
software tools as a bundle and becomes difficult to deploy for average users without informatics
skills.
Page 2 of 73


To address the discrepancy issues without compromising ease of use, performance and
reproducibility, we developed a computational pipeline, SeqMule, which performs a series of
automated steps for identifying variants from NGS data. It integrates 5 alignment tools, 5 variant
calling algorithms, and allows various combinations of them via modifying a text-based, human-
readable configuration file. The intersection of sets of variants from different combinations of
tools can be extracted to achieve higher accuracy, both in terms of sensitivity and specificity.
Most setup procedure and analyses can be done with one-line commands. SeqMule also
provides cluster-free parallel capability built on top of the variant callers, which could drastically
reduce the time for variant calling by about an approximately linear factor of N (N is number of
CPU cores). As far as we know, only GATK Queue and FreeBayes provide such parallelism
among variant callers, but users have to manually set up a Queue or generate a region file for
parallel processing. At the end of analysis, an HTML-based report will be prepared to show an
overview for every step of the analysis, which helps assure users of data quality and appropriate
analysis settings. We believe that SeqMule will be useful to easily and efficiently obtain variant
calls from NGS data, and improve variant calling consistency and accuracy.
MATERIAL AND METHODS
Workflow
Currently, SeqMule integrates 5 popular mapping tools: BWA (including BWA-backtrack and
BWA-MEM), Bowtie, Bowtie2, SOAP2, SNAP
21-25
, 5 variant calling algorithms: GATK (including
GATKLite and version 3), SAMtools, VarScan 2, Freebayes, SOAPsnp
21,26-28
and some
accessory programs: FastQC, Picard, tabix and VCFtools
29
. Tools were selected based on their
popularity, ease of use and performance. BWA, Bowtie, Bowtie2 and SOAP2 are the most
popular aligners for NGS analysis; SNAP was the fastest aligner at the time when SeqMule was
developed; GATK, SAMTools, FreeBayes and SOAPsnp are among the most popular variant
callers; VarScan2 is the most widely used variant caller that can perform both germline and
somatic variant calling; FastQC is the most popular tool for NGS quality control and generates
HTML-based reports automatically; Picard, tabix and VCFtools are the de facto gold standard
with robust implementation for BAM/VCF manipulation in NGS analysis. Of note, SNAP can be
orders of magnitude faster compared with the popular aligner BWA-MEM
25,30
. All tools and
related packages, except for those without open-source license, can be downloaded and
installed by one single command with no need for root access. We actively maintain a list of
programs and their related source code, database files to make sure there is no compatibility
conflict under default settings.  
Page 3 of 73


A workflow scheme is shown in Figure 1. SeqMule takes FASTQ, gzipped FASTQ or BAM as
input. Quality scores in FASTQ can be encoded either in Phred+33 scheme or in Phred+64
scheme. For FASTQ, SeqMule can automatically decide which scheme is used by examining
the beginning of input. Other necessary files for analysis, including reference genomes,
alignment indexes, known variant databases, can be downloaded via one-line built-in command
from SeqMule website. In a typical pipeline, input data goes through QC, alignment, sorting,
indexing, PCR duplicate removal, variant calling and report generation. All steps use default
parameters. Reads with mapping quality larger than 30 (20 for SNAP) will be used for variant
calling by default. Variants will be filtered following either recommended best practice (for GATK)
or by depth threshold of 10 unless otherwise stated.
Page 4 of 73



Figure 1. Scheme of SeqMule workflow and currently available tools in each step. Dashed line marks non-
mandatory steps, solid line marks mandatory steps.
Page 5 of 73


Multi-sample variant calling can be used if multiple sets of input data from the same lineage are
supplied. BAM and VCF files are generated in the end. The VCF files are ready for downstream
annotation and filtering analysis, which means users can feed them either to locally installed
ANNOVAR program
31
or to wANNOVAR web server
17
.
To allow various combinations of aligners and variant callers, SeqMule uses a specifically
designed configuration file. The configuration file consists of key, value pairs in the form of
‘key=value’. Keys are categorized as global options, programs and local options, each with
different prefixes. The prefixes for program keys also determine whether this program is
mandatory or exclusive or mandatory and exclusive at a particular step. Programs can be either
enabled or disabled by assigning 1 or 0 to the value of the corresponding keys. All settings are
written in plain English with embedded help documentation alongside. Over 40 different
combinations of aligners and variant callers have been tested to produce valid output and their
configurations are readily available in ‘misc/predefined_config’ directory under SeqMule
installation path. More combinations are left for the users to explore.
Built-in parallel processing
Analysis can be run in a parallel fashion either via native support by variant callers (e.g. multi-
processing option inside GATK) or via SeqMule’s built-in multi-processing framework. As is
shown in Figure 2, when SeqMule’s built-in multi-processing is enabled, it splits the genome into
multiple equally sized bins, writes those bins into N BED files (N for number of CPU cores), and
launches multiple processes to call variants over each region.
Page 6 of 73



Figure 2. Bin generation and assignment. Assume we want to run our analysis using 3 processes. We need
to split the chromosomes (blue) into 3 sets of bins. Bin size will be dynamically determined according to the
total size of all chromosomes or user-defined regions. There is a minimum and maximum bin size to
minimize the effects of a read spanning two bins and uneven coverage. In practice, the size limits shown in
the figure work well from a couple of genes up to the whole genome. Bins are generated by walking through
all regions to be analyzed. Subsequently, bins are assigned to each process by rotations. In the end, each
process is expected to deal with approximately same numbers of reads.
Each bin will be large enough to minimize the overhead costs of small bins, and small enough to
have fine-grained genomic intervals. Currently the minimum bin size is set to be 50 Kbp while
the maximum is 1 Mbp. All bins are assigned to each process by rotation (Figure 2) so that two
adjacent bins will not be analyzed by the same process. In contrast to assignment without
rotation, the assignment-by-rotation strategy is able to assign approximately equal number of
reads to each process (Figure 3). As number of reads is directly associated with processing
time, this assignment-by-rotation strategy will avoid having too many reads processed by one
thread due to uneven coverage
32
, therefore, all processes are expected to finish in similar
amount of time.  
Page 7 of 73



Figure 3. Read count distribution for 12 process assignments. Chromosome 1 is split into 12 groups of regions following
assignment-by-rotation and assignment-no-rotation schemes, respectively. Reads from NA12878 whole genome sequencing
data (58X) are counted for each group of regions.
We compared variant calling time consumption under different max bin sizes using sample
NA12878 from 1000 Genomes Project. Variants were called by SAMtools with 12 concurrent
processes. Table 1 shows that the standard deviation increases as the max bin size grows from
500 Kbp to 20 Mbp. However, the maximum running time of child processes, or simply put, the
overall variant calling time, does not necessarily increase as the max bin size becomes larger.
Variant calling is fastest (194.9 minutes) when max bin size is 1 Mbp. All processes will be
spawned by fork on a single machine, and so there is no need for cluster infrastructure.
Different processes communicate with each other through a script file recording the status and
command of each step. An execution manager (the parent process) is responsible for
monitoring, starting and stopping all processes. Because some analyses may take hours or
days to finish, the execution manager is designed to be able to stop and then resume the
pipeline at any step. This feature comes in handy when users have to adjust some parameters
(e.g. memory limit) after SeqMule aborts due to errors.
Page 8 of 73


Table 1. Variant calling time consumption and max bin size in quick mode. Sample NA12878 from 1000
Genomes Project was used to benchmark variant calling time with different max bin sizes in quick mode.
Variants were called by SAMtools with 12 concurrent processes. K stands for thousand, M for million. Row
minimums are highlighted by bold face. The statistics was calculated based on running time of all child
processes. The result shows that the smaller the max bin size is, the smaller the standard deviation is, and
that maximum running time does not increase linearly as max bin size grows. The shortest maximum
running time is obtained when max bin size is 1 Mbp.
Max Bin Size (bp)  500K  1M  5M  10M  20M  
Average running time (min)  209.2  191.4  191.5  210.0  209.6  
Minimum running time (min)  207.7  188.0  180.5  189.6  188.6  
Maximum running time (min)  211.4  194.9  199.4  233.1  253.0  
Standard Deviation  1.3  1.8  5.4  12.3  16.5  


Support for cluster environment
SeqMule supports analyzing large number of samples via Sun Grid Engine (SGE), a popular job
scheduling system in cluster environment. Internally, when SeqMule generates tasks for each
step for every sample, it also specifies their dependencies. The relationship results in a task
dependency graph (TDG) with tasks as nodes and task interactions as edges. SeqMule then
uses topological sorting to rearrange these tasks in linear fashion such that if task A is
dependent on output from task B, task A will be put behind task B. Upon running, SeqMule runs
tasks one by one and will not start a task unless all its dependent tasks are finished. With this
design in place, many samples can be processed in parallel due to their nature of independence.
When SGE is available, SeqMule submits each task to SGE and waits for it to finish.
Consensus result generation
For SNVs, variant calls with the same chromosome and position fields are considered
overlapping. Only such variants will be combined in consensus records. To combine two
overlapping SNVs, alternative alleles in each record will be put together while the chromosome,
position and reference columns remain unchanged.
Page 9 of 73


For non-SNV variants, primarily consisting of indels, records with same chromosome, position
and reference allele fields are considered overlapping. After combining, alternative alleles from
all records will be put together. Same indels might be presented in different ways by different
algorithms. For example, both ‘TGGG TGG‘ and ‘TG T’ can denote deletion of a G allele, but it
is hard to tell which G is deleted in the first case
8
. In this circumstance, we apply 'variant
normalization' in Vt (https://github.com/atks/vt) to first normalize all VCF files to the same
standard before combining them. During normalization, all alleles will be left (5’) and right (3’)
trimmed to remove superfluous nucleotides and be moved to the leftmost positions.
Nevertheless, it is expected that concordance rate between different tools for indels are less
than that for SNVs
8
.
When multiple samples are present in the same VCF file, it will be split into single-sample VCF
before merging. After all VCFs from the same sample are merged, the resulting VCFs will be
combined into one multi-sample VCF. Variant quality and genotype quality from the first input
file will appear in the combined VCF. The files to be merged will be sorted based on the
following priority: GATK>SAMtools>FreeBayes>VarScan>SOAPsnp. The priority is solely for
the purpose of reproducibility of quality score assignment. It will not affect consensus call results
as it only applies to the situation where mutual agreement exists. We assign variants ‘PASS’ in
the filter field only if they are unfiltered in any one of the input VCF files. Consensus calls can be
generated in different ways. Users are able to specify the minimum number of files in which
each consensus call appears. For example, "2-out-of-4" consensus calls represent the calls
found in at least 2 out of 4 input files.
Analysis summary
When an analysis starts or finishes, SeqMule automatically examines the FASTQ, BAM, VCF
and generates an HTML-based report showing various statistics. Important statistics include,
but are not limited to, coverage curve, average coverage, coverage over capture region,
percentage of capture region covered by at least N reads, Ti/Tv ratio,
Heterozygote/Homozygote ratio. FASTQ statistics are calculated by FastQC; alignment
statistics come from SAMtools; coverage statistics are done by SAMtools and SeqMule’s built-in
program; VCFtools output most of the variant statistics. In particular, coverage in a region
specified by a BED file is calculated as follows: SAMtools ‘depth’ subprogram is called over the
region to output base-level depth, then number of bases with different read depths are tabulated
to produce histogram-like statistics.
Page 10 of 73


Besides automatic generation of summary statistics, users can manually feed SeqMule with a
multi-sample VCF from a family trio and obtain Mendelian error rates. The VCF file will first be
converted to MAP and PED format used in PLINK
33
to record all genotypes in a more compact
fashion. Subsequently pedigree information, based on user specification, is added to the PED
file. At last, SeqMule iterates through all variants in the offspring and counts number of calls with
0, 1, 2 IBD (identical by descent) alleles from father or mother, number of allele drop-in events,
number of allele drop-out events. The Mendelian error rate calculation is implemented with
VCFtools, and a modified version of mendelFix
34
.
In accordance with its multi-caller feature, SeqMule also generates a Venn Diagram for users to
inspect overlapping of variant sets from different callers. Variants are split into SNVs and non-
SNVs. Chromosome, position, reference allele and alternative allele of SNVs are used to
determine whether two variants overlap. For non-SNVs, besides variant-normalization, records
will first be intervalized by ignoring reference alleles and alternative alleles, and then extended
10 bp towards both ends. The extended intervals for each non-SNV are used to determine
whether two variants overlap. This feature is implemented with VennDigram package
35
and
custom Perl code.  
RESULTS
Consistency and accuracy evaluation
To demonstrate the consistency of calls on identical samples, we analyzed 3 data sets for
NA12878 obtained from the 1000 Genomes project
36
(sequenced by HiSeq and hereafter
referred to as HiSeq-1000G) and from AllSeq (sequenced by HiSeq X Ten and hereafter
referred to as HiSeqX-D, HiSeqX-J). For each data set, 4 variant callers, FreeBayes, SAMtools,
VarScan and GATK HaplotypeCaller (GATK-HC) were used to call variants, then the results
from 4 callers were merged in different ways to obtain consensus calls. For each variant calling
method, we plotted a Venn diagram comparing the overlapping of variants among the 3 data
sets (Figure 4). The concordance rates range from 91.89%~94.37% among all variant callers
and consensus calls while GATK HaplotypeCaller calls give the highest concordance (94.37%).
Page 11 of 73



Figure 4. Consistency evaluation using 3 sequencing data sets for NA12878. The 3 data sets were generated
by 1000 Genomes project and AllSeq (HiSeq-D, HiSeq-J). We used the same alignment algorithms (BWA-
MEM) as the primary focus is to compare variant callers. For each data set, 4 variant callers were used to call
variants (A), then the results from 4 callers were merged in different ways to get consensus calls (B). For
each variant calling method, we plotted a Venn Diagram comparing overlapping of variants among the 3 data
sets. The percentages show the proportion of variants shared by 3 data sets.
As common variants are usually easy to detect, we later used ANNOVAR to filter the variants by
minor allele frequency (MAF). Variants with MAF>1% (1000 genomes project, October, 2014
release, European population) were discarded. The results (Figure 5) show that for rare and
novel variants, consistency rates are considerably lower, ranging from 69.10%~87.69%. It
suggests that most of the inconsistency in variant calling can be attributed to rare and novel
variants. 4-out-of-4 consensus calls show the highest concordance (87.69%) under this
scenario. The overall consistency on rare variants is summarized in Figure 6. Some variant
callers, e.g. GATK, utilizes prior information about known common and rare variants to filter out
low-quality variant calls
37
. Other variant callers, e.g. FreeBayes, SAMtools, do not explicitly use
prior information about common variants in their model; however, the parameters could have
been optimized to guarantee reasonable performance for the majority of variants, i.e., the
common variants.
Page 12 of 73



Figure 5. Consistency evaluation using rare variants from 3 sequencing data sets for NA12878. The 3 data
sets were generated by 1000 Genomes project and AllSeq (HiSeq-D, HiSeq-J). We used the same alignment
algorithms (BWA-MEM) as the primary focus is to compare variant callers. For each data set, 4 variant callers
were used to call variants, variants with MAF>1% were dropped. The remaining variants from 4 callers were
merged in different ways to get consensus calls. For each variant calling method, we plotted a Venn Diagram
comparing overlapping of variants among the 3 data sets (A,B). The percentages show the proportion of
variants shared by 3 data sets.

Figure 6. Consistency evaluation summary using rare variant calls from NA12878.
Page 13 of 73


In addition to consistency, we also examined the precision, sensitivity and specificity. The 1000
genomes data set for NA12878 (HiSeq-1000G) was used in this case. All consensus calls and
variants called by individual callers were compared with the gold standard from Genome In a
Bottle (GIB) project and an Illumina HumanOmni2.5-8v1 SNP array (Figure 7, Figure 8). For the
comparison with GIB gold standard, precision rates range from 97.1% to 99.5% with GATK-HC
being the highest; sensitivity rates range from 97.1% to 98.4%, with GATK-HC being the highest;
specificity rates are all close to 100%. However, as the GIB gold standard was generated by
GATK-HC, there is no doubt that the result is biased in favor of GATK. For the comparison with
SNP array, precision rates range from 99.6% to 99.8% with 4-out-of-4 consensus calls being the
highest; sensitivity rates range from 96.7% to 97.4% with 2-out-of-4 consensus call and GATK-
HC both being the highest; specificity rates are all above 99.5%. These numbers show that all
individual callers and consensus calls can give us accurate results in the regions covered by
GIB gold standard and SNP array, indicating that consensus calling may only have minimal
impact on the variants in GIB gold standard.

Figure 7. Compare individual callers with Genome In a Bottle Gold Standard. Results from 4 individual
variant callers, GATK HaplotypeCaller (gatk_hc), Freebayes, SAMtools and VarScan were uploaded to
http://www.bioplanet.com/gcat to be compared against gold standard from Genome In a Bottle project
(Version 2.18) and Illumina HumanOmni2.5-8v1 SNP array.
Page 14 of 73



Figure 8. Compare consensus results with Genome In a Bottle Gold Standard. Consensus calls were
generated by combining results from individual variant callers in different ways, namely 2-out-of-4, 3-out-of-4
and 4-out-of-4. Then they were uploaded to http://www.bioplanet.com/gcat to be compared against gold
standard from Genome In a Bottle project (Version 2.18) and Illumina HumanOmni2.5-8v1 SNP array. Result
from GATK HaplotypeCaller (GATK-HC) is also shown as a reference.
Aligner consistency evaluation
To our knowledge, there is little comparison in literature regarding different aligners in the
context of variant calling. It has been reported that there is differential performance for aligners
when sequencing error rates and indel frequencies and sizes are varying
9
, so it is sensible to
expect discrepancies associated with variant calling for different aligners. We used 2 variant
callers, SAMtools and FreeBayes, and 3 aligners, BWA-MEM, Bowtie2 and SNAP, for each
caller to evaluate concordance of variant calling on an exome data set
38
. Results in Figure 9
show that concordance rates for SNVs are 77.31% for SAMtools, 78.43% for FreeBayes,
respectively; concordance rates for non-SNVs (after normalization and extension) are 49.73%
for SAMtools, 50.61% for FreeBayes, respectively. While there is no previous study regarding
aligner consistency evaluation, these results suggest that aligners may account for a
considerable amount of difference in variant calling. The above results have been published in
the original SeqMule paper
39
.
Page 15 of 73



Figure 9. Variant concordance for different aligners. An exome data set was used to do alignment and call
variants. For each variant caller, we used 3 aligners to map the reads. Two panels are SNVs, and two panels
are non-SNVs (mostly indels). Panel A shows SNVs from SAMtools, panel B for SNVs from FreeBayes, panel
C for non-SNVs from SAMtools, panel D for non-SNVs from FreeBayes.
Mendelian error evaluation
As family-based analysis is critical in Mendelian disease studies, we went on to examine
Mendelian error rate by different methods in a previously published exome sequencing study on
a family trio
38
. Two types of errors were counted: allele drop-in and allele drop-out. Allele drop-in
(ADI) means that an offspring presents an allele that does not appear in either parent. Allele
drop-out (ADO) means that an offspring misses an allele that should have been inherited from
the parents. Sum of ADI and ADO is the total number of Mendelian errors. Mendelian error rate
is the ratio of all Mendelian errors to all variant calls shared by the family trio (i.e. not marked as
missing in any family member).
Page 16 of 73


The result (Table 2) shows that individual variant callers generally returns 41.1 to 52.6 thousand
trio calls with error rates between 1.78% and 2.88%. Among them, FreeBayes gives us the most
trio calls (52.6K) and GATK-HC gives us the lowest error rate (1.78%). For a single variant
caller, it is hard to predict whether error rate is higher if the overall number of trio calls is small.
In contrast, when we have fewer consensus trio calls, the Mendelian error rate is also lower.
The lowest Mendelian error rate we can get is 0.66%, from 4-out-of-4 consensus calls for this
data set, with 35.7 thousand trio calls. As most of our interests lie in rare and novel variants, we
did variants filtering based on MAF. Variants with MAF>1% (1000 genomes project, October,
2014 release, European population) were discarded. The results show that generally Mendelian
error rates become larger for rare and novel variants, ranging from 0.53% to 6.48% (Table 3,
Figure 10). Again, the lowest error rate was achieved by 4-out-of-4 consensus calls (0.53%),
with nearly five thousand trio calls. Compared with results from individual callers, the
consensus-call approach has more stable performance. It is also more flexible in that users can
reduce Mendelian error rate when needed. In trio-based sequencing studies, researchers may
want to examine the most reliable set of calls first to find low-hanging fruits, before delving deep
into noisier call sets to find additional candidates.
Table 2. Mendelian error rate comparison for variant calling methods. Allele drop in (ADI) means that an
offspring presents an allele that does not appear in either parent. Allele drop out (ADO) means that an
offspring misses an allele that should have been inherited from the parents.

Number of
calls shared
by the family
trio  
Number of
allele drop in  
Number of
allele drop
out  
Total number of
Mendelian errors  
Proportion of
Mendelian errors  
GATK HaplotypeCaller  44559  215  579  794  1.78%  
SAMtools  48586  769  631  1400  2.88%  
FreeBayes  52609  797  321  1118  2.13%  
VarScan  41102  360  542  902  2.19%  
Consensus (2 out of 4)  49741  730  796  1526  3.07%  
Consensus (3 out of 4)  45748  228  603  831  1.82%  
Consensus (4 out of 4)  35690  6  230  236  0.66%  

Page 17 of 73


Table 3. Mendelian error rate comparison for variant calling methods (MAF<1%). Allele drop in (ADI) means
that an offspring presents an allele that does not appear in either parent. Allele drop out (ADO) means that an
offspring misses an allele that should have been inherited from the parents.

Number of calls
shared by the
family trio  
Number of
allele drop in  
Number of
allele drop
out  
Total number of
Mendelian errors  
Proportion of
Mendelian errors  
GATK HaplotypeCaller  5831  46  51  97  1.66%  
SAMtools  9155  471  122  593  6.48%  
FreeBayes  13426  644  34  678  5.05%  
VarScan  6428  228  88  316  4.92%  
Consensus (2 out of 4)  10090  447  159  606  6.01%  
Consensus (3 out of 4)  7553  84  87  171  2.26%  
Consensus (4 out of 4)  4934  4  22  26  0.53%  

Page 18 of 73



Figure 10. Mendelian error rate comparison for variant calling methods (MAF<1%). 2o4 denotes calls from 2-
out-of-4 consensus call set, 3o4 for 3-out-of-4 and 4o4 for 4-out-of-4. Only loci present in all family members
are considered. ADI (allele drop in) and ADO (allele drop out) are counted as Mendelian errors. A trend line is
added for consensus results. Only variants with MAF<1% are shown here.
Performance evaluation
To evaluate the computational resources consumed by SeqMule, we used a whole genome
sequencing data set (mentioned before as HiSeqX-D) with 818 million 151-bp paired end reads
(30X coverage) for benchmarking. BWA-MEM or SNAP was used as the aligner. BWA-MEM is
by far the most popular aligner for NGS variant analysis while SNAP was the fastest aligner to
the best of our knowledge at the time of our evaluation. PicardTools was used to remove PCR
duplicates. GATK HaploTypeCaller or FreeBayes was used to call variants. We used a machine
equipped with 12-core Intel Xeon 2.66 GHz CPU and 48 Gigabytes of memory.
Table 4 details the computation time of SeqMule under different pipeline configurations. As is
shown, conventional BWA-MEM and GATK-HC combination takes over 50 hours to analyze a
whole genome, whereas SNAP+FreeBayes combination needs 36.3 hours. With quick mode
(SeqMule’s built-in parallel framework) enabled, variant calling can run up to 12 times faster (12
CPU cores). However, this comes with the tradeoff of increased memory usage. For example,
Page 19 of 73


when speed is of primary concern, users can use the SNAP+FreeBayes combination. This
combination requires less than 21 hours of running time and 31.3 Gigabytes of memory with
quick mode enabled.
Table 4. Time consumption under different configurations. A human whole genome data set (818 million 151
bp-long paired-end reads, 30X coverage) was aligned with BWA-MEM and SNAP. PCR duplicates were
removed by Picardtools. Variants were called by GATK HaplotypeCaller and FreeBayes. Quick mode here
denotes SeqMule’s built-in parallel framework. Built-in parallel capability is always turned on for underlying
3rd party algorithms (e.g. GATK’s -nt option).
Quick Mode  Aligner  Variant caller  
CPU
(number of
cores)  
Max Memory
Used (G)  
Alignment
Time
(hour)  
Variant
Calling
Time
(hour)  
Total
Time
(hour)  
No  BWA  GATK-HC  12  19.8  13.5  22.6  54.0  
Yes  BWA  GATK-HC  12  43.2  13.2  19.8  51.1  
No  SNAP  FreeBayes  12  31.3  5.7  16.0  36.3  
Yes  SNAP  FreeBayes  12  31.3  5.3  1.3  20.5  

For users who are interested in analyzing exome data sets, we also did a similar comparison
using an exome data set (138.8 million 90bp-long paired-end reads, 113X coverage in target
region). BWA-MEM was used as the aligner. PicardTools was used to remove PCR duplicates.
GATK-HC was used to call variants. As is shown in Table 5, using more CPUs or processes
reduces the running time in general. With quick mode enabled, variant calling can run 2 to 7
times faster depending on number of CPU cores used. On a machine with same configuration,
the shortest running time is 317 minutes for the BWA-MEM+GATK-HC combination, and 264
minutes for the SNAP+FreeBayes combination (not shown in Table 5).
Page 20 of 73


Table 5. Time consumption under different configurations. A human exome data set (138.8 million 90bp-long
paired-end reads, 113X coverage in target region) was aligned with BWA-MEM. PCR duplicates were removed
by Picardtools. Variants were called by GATK HaplotypeCaller. Quick mode here denotes SeqMule’s built-in
parallel framework. Built-in parallel capability is always turned on for underlying 3rd party algorithms.
Quick
Mode Enabled  
CPU (number of
cores)  
Max Memory Used
(G)  
Variant
Calling Time
(min)  
Time (min)  
Time Saving Compared
With Analysis Using 1 CPU  
No  1  7.03  217.5  910  0.00%  
No  2  8.17  208.4  695  23.63%  
No  4  11.90  189.9  540  40.66%  
No  8  16.27  203.6  529  41.87%  
No  12  20.34  192.5  474  47.91%  
Yes  2  8.17  117.8  606  33.41%  
Yes  4  14.67  51.5  385  57.69%  
Yes  8  28.98  40.3  346  61.98%  
Yes  12  43.29  31.1  317  65.16%  

In the exome analysis, the running time for GATK-HC does not seem to vary with number of
CPU cores when quick mode is disabled. This is perhaps due to the fact that GATK’s
HaplotypeCaller only supports parallel execution with “-nct” option (enable multiple CPU threads
per data thread) which does not scale well and has a constant overhead. The configuration files
used are “snap_freebayes.config”, “bwa_gatk_HaplotypeCaller_norealn_norecal.config”. They
can be found in seqmule/misc/predefined_config/. The software used includes FastQC
(v0.11.2), Picard (v1.115), SAMtools (v0.1.19-44428cd), FreeBayes (v0.9.14-14-gb00b735),
SNAP (v1.0beta.16), BWA-MEM (v0.7.10-r789), GATK (v3.1-1-g07a4bf8).
Mendelian disease disease-contributory variant identification
We used exome sequencing data from a previously reported family trio
38
to demonstrate the
applicability and ease of use of SeqMule. The proband is a 28-year-old Caucasian male
diagnosed with idiopathic hemolytic anemia. This phenotype was not observed either in his
parents nor his siblings. With Agilent SureSelect Human All Exon kit and Illumina HiSeq 2000,
we obtained 97 million paired-end reads per sample with designed capture regions covered at
82x on average. Initial alignment was done by BWA-MEM algorithm, followed by Picard removal
of duplicates. GATK HaplotypeCaller, SAMtools, VarScan were then used to call variants. Multi-
sample variant calling was enabled, so all 3 samples were supplied to each of the variant callers.
Page 21 of 73


We filtered results based on criteria recommended by each algorithm. Subsequently we
extracted consensus variants from 3 sets of filtered variants (in 3-out-of-3 fashion), and reduced
number of variants from 34 thousand per caller to 31 thousand on average. These steps were
performed by executing one single SeqMule command given that the combination of aligner and
variant callers has been tested before.
Next we utilized the ANNOVAR “variants reduction” pipeline
31
, under a recessive disease model.
Synonymous, non-splicing variants and variants observed in the 1000 Genomes Projects and
NHLBI-ESP 6500 exome project with minor allele frequency (MAF) >1% in European
populations
40
were filtered out.  The filtering was done with one single ANNOVAR command for
each subject (variants_reduction.pl -protocol nonsyn_splicing,1000g2012apr_all,esp6500_ea,recessive -operation g,f,f,m -
aaf_threshold 0.01 input.father.avinput annovar/humandb/ -buildver hg19).
In the remaining variants, father and son shared 236 variants, while mother and son shared 253.
There are 64 genes overlapping between the two sets of shared variants, 32 of which are
homozygous and 6 are compound heterozygous in the proband. Manual examination of
candidate genes easily identified mutations in PKLR that likely contribute to the disease. PKLR
harbours two heterozygous variants that are predicted by multiple scores to be deleterious, and
has been confirmed by us as the disease-contributory gene in biochemical tests
38
. This example
demonstrated that SeqMule can reveal disease-contributory variants easily with minimal efforts
from the users, therefore greatly facilitating genetic studies on human genetic diseases.
DISCUSSION
In summary, SeqMule is a comprehensive, user-friendly, flexible and efficient tool for analyzing
human exome and genome sequencing data. Five alignment tools, five variant callers and
various accessory programs are included to provide users with numerous choices.
Different call sets can be integrated to produce more reliable results in terms of consistency and
Mendelian error rate. However, if users are solely interested in rare variants and de novo point
mutations from family-based samples, they will have more gain in power and Mendelian
consistency by using more specialized tools such as PolyMutt
41
, DenovoGear
42
and FamSeq
43
.
Test results on our own data suggest that PolyMutt can dramatically reduce Mendelian errors
(0.2% for FreeBayes, 0.0% for GATK-HC, 0.3% for SAMtools, data same as in Figure 10) by
utilizing a family-aware calling framework. When sample size is large, variant quality can also be
improved by imputation-based methods
44
.
Page 22 of 73


SeqMule’s built-in parallel processing capability can improve current variant calling speed by a
number of times without involving high-performance computing infrastructure or complicated
configuration. Besides all of these features, SeqMule uses one-line commands for complicated
tasks wherever possible, making it extremely easy to download, install, configure and run a
large number of bioinformatics tools.
Popular bioinformatics platforms, such as Galaxy, make a number of bioinformatics tools very
accessible. Users just upload their data and start analysis right away. But the restricted storage,
limited data transfer speed and prolonged job queuing time makes it impractical to use when
users have non-trivial amount of data.
The option of launching a Galaxy instance
45
or other pipeline in the cloud is likely to overcome
some of the obstacles mentioned above without cumbersome local installation. But, as a free,
public resource open to all biologists, such a platform is usually built towards serving a large
population who are not only interested in variant discovery, but also many other types of
analyses.  In contrast, SeqMule is tailored for exome or genome sequence analysis in the
context of human genetic disease study and therefore hosts many features optimized for this
purpose, such as consensus call generation, multi-sample variant calling, Mendelian error rate
statistics, and HTML-based summary, among other things. Besides, SeqMule is more agile
because new tools, once they are integrated into SeqMule by developers, can be added with
just one update command (see SeqMule’s manual).
In addition to platform solutions, there are quite a few standalone pipelines for variant analysis,
such as HugeSeq
7
, ngs_backbone
46
and bcbio-nextgen (https://github.com/chapmanb/bcbio-
nextgen). HugeSeq integrates one aligner and two SNP/indel callers (SAMtools and GATK),
and was last updated one year ago. ngs_backbone integrates one aligner (BWA) and one
variant caller (GATK). bcbio-nextgen has two aligners (BWA, NovoAlign) and four variant callers
(GATK, FreeBayes, Platypus, SAMtools). Since NovoAlign is a proprietary aligner not free to all
researchers, none of these pipelines provides alternative open-source mappers. In view of
discrepancies associated with aligners
9
(Figure 9), it is important to offer users multiple open-
source alignment programs.
Many users have used SeqMule to analyze their sequencing data and obtained meaningful
results
47-49
. With the rapid development and deployment of next-generation sequencing
technologies, we expect that SeqMule will facilitate analysis of the upcoming massive amounts
of sequencing data to expedite discoveries for human genetic diseases.
Page 23 of 73


Chapter 2. Framework for prioritizing GWAS loci with epigenetic
features
ABSTRACT
Identifying causal variants remains a key challenge in post-GWAS (Genome Wide Association
Study) era, since many GWAS SNPs (including imputed ones) fall into non-coding regions,
making it very difficult to associate statistical significance with predicted functionality. Therefore,
we created a web-based tool, Enlight, which overlays functional annotation information, such as
histone modification states, methylation patterns, transcription factor binding sites, eQTL, higher
order chromosomal structure, to GWAS results. Enlight is accessible by a web browser at
http://enlight.usc.edu.  
INTRODUCTION
Genome-wide association studies (GWAS) enabled the identification of genetic variants
associated with complex diseases and traits. To date, there are 14,012 genome-wide significant
variants in the GWAS catalog
50
. However, most of the top hits are believed to be proxy markers
for the underlying causal variants, and the majority of those variants lie within non-coding
sequences
51
. Despite substantial efforts in the post-GWAS era to identify causal variants, lack
of knowledge on biological functions has become a major hurdle for differentiating causal
variants from highly correlated tag SNPs.
A large amount of biological annotation is now available for non-coding sequences, given the
community efforts devoted to elucidating functional significance of non-coding sequences in the
past few years. For example, the ENCODE project has systematically mapped regions of active
transcription, transcription factor association, chromatin structure and histone modification, and
assigned at least one biochemical function or activity, e.g. transcription, histone modification, to
about 80% of the genome
52
. Recent studies showed that disease-related variants detected by
GWAS are significantly enriched in regions where regulatory elements are inferred, such as
DNase I hypersensitivity sites (DHS) and transcriptional factor binding sites (TFBS)
51,53
. Proper
integration of these biological information with GWAS results will likely shed light on
prioritization and identification of candidate causal variants
54
.  
Page 24 of 73


To that end, tools such as HaploReg
55
, RegulomeDB
56
, GWAS3D
57
and FunciSNP
58
were
developed to leverage functional annotations to interrogate GWAS variants. However, the
popularity of regional plots highlights the importance of visual inspection of the GWAS results.
While LocusZoom
59
generates high-quality images showing local linkage disequilibrium (LD)
and association significance, it provides little information for interpreting non-coding variants.
Therefore, we developed a web-based tool, Enlight, combining biological annotation and
GWAS-related information into one single plot. It takes GWAS results (including imputed ones)
as input, and returns a regional plot with a variety of epigenetic information such as histone
modification, methylation patterns, transcription factor binding sites, eQTL and high-order
chromosomal structure.  
Page 25 of 73


IMPLEMENTATION
Architecture
The server backend is implemented in Perl CGI and MySQL. Regional plots are performed by
LocusZoom
59
, annotation plots are performed by custom R and Python code built on the basis
of LocusZoom. Text annotation is performed by ANNOVAR
31
. Detailed workflow is shown
Figure 11.

Figure 11. Enlight workflow. Enlight takes at least two columns of input, the SNP name and P value. After
submission, a job will be scheduled in MySQL database. Unpon execution, a modified version of LocusZoom
will output regional plots and annotation plots; meanwhile, ANNOVAR outputs text annotation for each
variant.

Input and output
Page 26 of 73


The minimum input contains two columns: marker name (rsID) and P value. Optionally, users
can supply chromosome, start position, end position, reference allele and alternative allele in
the first 5 columns; if they are absent, Enlight converts marker names to corresponding
coordinates based on dbSNP137. Upon submission, Enlight will output a conventional regional
plot containing association strength, LD information, gene position, chromosomal recombination
rate. Meanwhile, Enlight plots categorical annotation like chromHMM classification as colored
stripes, continuous annotation such as histone modification states, chromatin accessibility signal
by DNase-seq, transcription factor (TF) binding strength, etc., as histograms, below the GWAS
results panel (Figure 12). Variants that appear in the GWAS catalog and eQTL databases from
either GTEx project or eQTL browser of Pritchard Lab (http://eqtl.uchicago.edu/), can be shown
by a XYplot. HiC interaction
60
can be plotted as a heatmap. Users are able to select data tracks
from 14 cell lines and 17 annotation types or plot with their own data in BED format.  
Enlight allows easy access to a range of high-throughput functional assays. In particular,
ChromHMM
61
is a computational model for learning chromatin states from a large amount of
ChIP-seq data sets. It conveniently classifies every 200bp region to one of 10 chromatin states
such as strong enhancer, weak enhancer, active promoter, to name a few. This classification
system allows easy interpretation of ChIP-seq data and potential functional impact of variants in
a specific region. Histone modification states measured by ChIP-seq usually mark the states of
transcription activity. For example, H3K4me3
62
is a marker for transcriptional activation;
H3K9me3
63
for repression, H3K9ac
64
for active promoter, H3K36me3
65
for active elongation.
DNase-seq assay measures chromatin accessibility. Peaks in DNase-seq are more likely to be
candidates for transcription factor binding
66
. eQTL
67
is short for expression quantitative trait loci.
Essentially eQTL are SNPs that are statistically significantly associated with gene expression,
i.e. a proxy marker for a functional locus that can alter gene expression. HiC
60
is a high-
throughput method that can detect chromosomal interactions across the entire genome. It has
been known that enhancer regulate promoter via long-range interactions, some of which can
interrogated by HiC methods
68
.    
In addition to visualization, Enlight generates text annotation for each variant using ANNOVAR,
which can be used for downstream analyses, such as finding biofeature overlap, regression
model building, etc. Because ANNOVAR only supports region-based, filter-based and gene-
based annotation, biofeatures such as transcription factor binding motifs are not readily
available in the results.
Page 27 of 73


Figure 12 shows a typical plot. rs2071278 is genome-wide significantly associated with serum
complement C3 and C4 levels
69
, important measurements in assessment of rheumatoid
arthritis
70
. Therefore, rs2071278 is possibly a SNP tagging nearby functional variants. For
instance, rs204991, rs204990, rs204989, all in high LD with rs2071278 (1000 genomes,
European population, 2012), are shown to be significantly associated with expression of human
leukocyte antigen (HLA) genes by microarray
71
. This finding partly explains their strong
associations with rheumatoid arthritis in WTCCC study
72
. When strong associations appear near
known GWAS hits and are surrounded by interesting epigenetic and other functional features,
they perhaps suggest functional importance.
Page 28 of 73



Figure 12. An example Enlight plot. It shows the association signals for rs944797 and nearby loci from a
diastolic blood pressure study (dbGAP: pha003589.1). The color stripe shows the chromHMM segmentation
result which combines multiple epigenetic chromatin marks. The SNP rs1333045 (yellow), which is in high LD
with rs944797, shows up in a proposed enhancer region in blood vessel relevant cell lines, HUVEC and
HSMM, but not in two blood cell lines, GM12878 or K562. rs1333045 also overlaps with or is adjacent to peaks
in wgEncodeRegTfbsClusteredV3, which summarizes a large collection of ChIP-seq results, and
tfbsConsSites, which describes TFBS conservation across human/mouse/rat.
Page 29 of 73


Enlight has been shown to aid in finding additional variants contributing to genetic basis of some
complex phenotypes. For instance, human serum uric acid concentration (SUA) is a complex
trait. A recent meta-analysis of multiple GWAS identified 28 loci associated with SUA jointly
explaining only 7.7% of the SUA variance, with 3.4% explained by two major loci (SLC2A9 and
ABCG2)
73
. We examined whether gene–gene interactions could regulate SUA using two large
GWAS cohorts included in the meta-analysis [the Atherosclerosis Risk in Communities study
cohort (ARIC) and the Framingham Heart Study cohort (FHS)]. Abundant genome-wide
significant local interactions were found in the 4p16.1 region in both ARIC and FHS data. They
were mostly located in an intergenic area near SLC2A9, and apparently were not a result of
linkage disequilibrium. Forward selection approach helped us indentify 5 more SNPs with
marginal effects and 3 epistatic SNP pairs in ARIC. These findings, together with the lead SNP,
could explain 1.5% more SUA variance than the lead SNP alone. However, only 0.3% was
accounted for by the marginal and epistatic effects in the intergenic area.  
Looking into the intergenic region (Figure 13), we can see some strong and weak enhancer
elements along with DNase peaks and conservative transcription factor binding peaks. This
suggests that the epistatically interacting SNPs may alter transcription factor binding activity for
some enhancers in this region. Enrichment analysis by HaploReg
55
showed that enhancers
active in ENCODE hepatic (HepG2) cells and precursor red blood cells (K562) are statistically
significantly enriched (P=4.7e-5, 5.0e-6, respectively) in the intergenic area between WDR1 and
SLC2A9, possibly a regulatory target for the epistatic SNPs. While there is high-throughput data
supporting our statistical results, more experimental validation may be needed, including
enhancer activity assay, qPCR for SLC2A9 and WDR1 gene expression with and without
alternative alleles, to name a few.
Page 30 of 73



Figure 13. All marginal SNPs in the 4p16.1 region in ARIC, their LD with the lead SNP rs3733588 colored in purple and the
recombination rate distribution in the region. The colored stripe shows chromatin states of the 4p16.1 region. Histograms
are UCSC genome browser data tracks of ENCODE data in the region with all scores normalized from 0 to 1000. The
wgEncodeBroadHmmHepg2HMM data track shows chromatin states in the region of HepG2 cells characterized using the
ChromHMM software, including two strong enhancer zones flanking the WDR1 gene; the wgEncodeRegTfbsClusteredV3 data
track shows transcription factor binding site cluster assayed by ChIP-seq; the wgEncodeOpenChromChipHepg2CtcfPk data
track shows DNase-seq signal peaks in HepG2 cells marking open chromatin flanking WDR1; the
wgEncodeBroadHistoneHepg2H3k27acStdPk data track displays active regions in HepG2 cells by histone mark H3K27Ac.
Page 31 of 73


CONCLUSIONS
We have created a user-friendly tool to generate regional plots of association results with
biological annotation as well as text annotation for individual variant. Enlight makes it possible to
visually connect epigenetic features or other functional annotations with the LD information, thus
facilitating the identification of putative causal variants. Enlight can be accessed via a web-
based interface.
 
Page 32 of 73


Chapter 3. Fast De Novo Genome Assembly, Evaluation, Gap filling
and Error Correction
ABSTRACT
Short-read sequencing enabled de novo assembly of a few individual human genomes, but with
inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual
HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by
NanoChannel arrays, and generate a de novo assembly of 2.93Gb (contig N50: 8.3Mb, scaffold
N50: 22.0Mb), including 39.3 Mb N-bases. The assembly fully or partially fills 274 (28.4%) N-
gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 18.4Mb of HX1-
specific sequences, including 378kb that are not present in previously reported Asian genomes.
These novel sequence elements harbor epigenetic regulatory elements and may influence gene
expression. Our results imply that improved characterization of genome functional variation may
require the use of a range of genomic technologies on diverse human populations. Assemblies
are available for download at http://hx1.wglab.org/.
INTRODUCTION
The advent of next-generation short-read sequencing paved the way to characterize the
genomes of thousands of species, and had enabled de novo assembly of a few individual
human genomes
74,75
. However, these assemblies may have inherent technical limitations in
characterizing repeat elements that span longer than the read length
76,77
, yet repeats and
segmental duplications are known to cover approximately half of the human genome. For
example, a formal analysis of the de novo sequence assembly generated from the genome of a
Han Chinese individual and a Yoruban individual showed that 420.2 megabase pairs of
common repeats and 99.1% of validated duplicated sequences were not present, which resulted
in missing thousands of coding exons in the genome assembly
76
. Therefore, the use of
additional genomic techniques, such as fosmid pooling
78
and long-read sequencing
79
, may be
necessary to better characterize complicated genomic regions in human genomes.
Previous studies reported that pervasive genetic differences exist across different ethnicity
groups, especially on structural variants
80-82
. For example, through reconstruction of the
ancestral human genome, it was reported that megabases of DNA were lost in different human
lineages and that large duplications were introgressed from one lineage to another
81
.
Additionally, genomic elements that are absent from reference genomes may be present in
Page 33 of 73


personal genomes
83,84
. For example, a study estimated that a complete human pan-genome
would contain ~19–40 Mb of novel sequence not present in the extant reference genome
85
.
These novel sequences that are not present in the reference genome may harbor functional
genomic elements that are ethnicity-specific, and may affect gene regulations or transcriptional
diversity.
To address the limitations on previously published de novo genome assembly, and to improve
our understanding of transcriptome variations, here we sequence both DNA and RNA of a
Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing
86
. We also
generate a physical map of the HX1 genome using IrysChip
87
, which is a nanopore array that
detects a characteristic 7-nucleotide sequence along very long genomic segments, typically
hundreds of kilobases. We perform de novo genome assembly to build a Chinese reference
genome, using a hybrid approach that combines long-read sequencing data and IrysChip data
79
.
We demonstrate a few unique applications of the HX1 assembly, including the ability to fill gaps
in the human reference genome assembly GRCh38, as well as the ability to identify fine-scale
structural variants. In parallel, leveraging long-read RNA sequencing, we also identify novel
transcriptional elements, especially those with multiple spliced isoforms. Through the combined
use of a few genomic techniques, we perform detailed characterization of the HX1 genome and
demonstrate that long-read sequencing can detect potentially functional elements in human
genomes that are missed by short-read sequencing.
RESULTS
Generation of de novo genome assembly on a Chinese genome
We leveraged SMRT DNA sequencing technology
86
and sequenced genomic DNA from an
anonymous Chinese individual (HX1) with normal karyotype (Figure 14) at 103X genome-wide
coverage (Table 6). In total, we obtained 44.2 million "subreads" (a portion of the sequencing
read that is informative for downstream analysis) after removing adapters and performing quality
control measures. These subreads have a mean length of 7.0kb and a N50 length of 12.1kb
(Table 7, Figure 15), where N50 refers to the length for which the collection of all sequences of
that length or longer contains at least half of the sum of the lengths of all sequences.
Page 34 of 73


Table 6. Statistics of the genomics data sets generated in this study.
Material  Platform  # Cells  # Reads  Bases  Coverage  Mean
length  
N50
length  
DNA  Illumina
HiSeq X  
-  2.8 billion reads  428.8
G  
143X  151  151  
DNA  PacBio
SMRT cell  
377 cells  44.2M reads  309.0G  103X  7.0Kb  12.1Kb  
DNA  BioNano
IrysChip  
12 cells  1.169M
molecules
(>150kb)  
302.8G  101X  259.0Kb  224.7Kb  
RNA  PacBio
SMRT cell  
50 cells (1-2kb, 2-
3kb, 3-5kb,5kb+)  
2.721M error-
corrected reads  
5.8G  -  2.1Kb  2.7Kb  
RNA  Illumina
HiSeq 2500  
NA  48.9M reads  4.4G  -  90  90  

Table 7. Statistics on subreads generated from PacBio long-read DNA sequencing on the genome DNA of
HX1. Filtered subreads are used in error-correction by DALIGNER in Falcon. Error-corrected subreads are
used for de novo assembly.  
Filtered subreads  Error-corrected subreads  
Min  35  501  
25% Quantile  1,754  2,064  
Median  4,853  4,975  
Mean  6,990  6,114  
75% Quantile  10,531  8,621  
Max  56,677  53,633  
N50  12,134  9,137  
Count  44,207,919  26,562,171  

Page 35 of 73



Figure 14. Karyotype. Karyotype analysis did not reveal large-scale chromosomal abnormality in HX1.

Figure 15. Length distribution of subreads and error-corrected subreads. (a) read length distribution of
filtered subreads; (b) read length distribution of error-corrected subreads. Filtered subreads are used in
error-correction by DALIGNER in Falcon. Error-corrected subreads are used for de novo assembly.
Page 36 of 73


We modified and improved the FALCON software
88
and performed de novo genome assembly
on the long reads, resulting in 5,843 contigs (N50=8.3Mb) and a total size of 2.9Gb. In addition,
206Mb of "associate contigs", that is, alternative haplotypes, were constructed along with the
primary contigs. Finally, we also performed short-read sequencing on the Illumina HiSeq X
platform, with 143X coverage of the genome (Table 6). Short reads were used to further polish
HX1 contigs and correct indel and SNV errors. The continuity of the contigs is substantially
higher than assemblies generated from competing technologies in previous studies (Table 9),
demonstrating the clear advantage of long-read sequencing in genome assembly. We note that
a recently published genome using the same SMRT technology reported a contig N50 of
906kb
79
, and we believe that the almost 10-fold improvement in our study can be attributed to
the improved chemistry, longer read length, enhanced assembly algorithm, as well as the much
deeper sequencing depth (Table 8).
Table 8. Assembly quality and sequencing depth. Assembly quality (contig N50 and assembly size) is
compared with regards to sequencing depth and length cutoff for error-corrected subreads. Sequencing
depth (>6kb) denotes average sequencing depth counting only reads longer than 6kb. Depth is
approximately calculated as total base pairs divided by 3 Gb.
Contig N50
(Mb)  
Assembly size
(Gb)  
Sequencing
depth  
Sequencing depth
(>6kb)  
Length cutoff for error-
corrected subreads (kb)  
0.61  2.68  65  48  2  
2.76  2.81  81  64  2  
4.65  2.87  99  80  2  
6.49  2.88  99  80  9  
7.16  2.87  103  83  13  
8.01  2.89  103  83  11  
8.28  2.88  103  83  12  

Page 37 of 73


Table 9. Comparison to previously published de novo assembly on personal genomes. *YHref’s alternative
haplotype is contained in the haploid-resolved diploid genome (HDG) sequence, which is not readily
comparable with our results.
Sample  Assembly  Accession  Sequencin
g Platform  
Scaffoldin
g platform  
Contig
N50  
Scaffold
N50  
Size  Size
(alternative
haplotype)  
# Gaps  Gap length  
Venter  HuRef  GCA_000002125.
2  
Sanger  BAC  108,431  17,664,25
0  
2,844,000,50
4  
NA  66,906  34,429,377  
NA1287
8  
ALLPATHS  GCA_000185165.
1  
Illumina
GAII &
HiSeq  
Fosmid  23,924  12,084,11
8  
2,786,258,56
5  
NA  220,31
8  
171,353,64
4  
BGI-YH  YH_2.0  GCA_000004845.
2  
Illumina
HiSeq  
Fosmid  20,516  20,520,93
2  
2,911,235,36
3  
NA  235,51
4  
105,204,23
0  
BGI-YH  YHref  NA  Ilumina
HiSeq  
Fosmid  484,222  23,192,26
0  
2,883,329,36
1  
NA*  NA  NA  
CHM1  CHM1_1.1  GCF_000306695.
2  
Illumina
HiSeq  
BAC  143,936  50,362,92
0  
3,037,866,61
9  
NA  40,915  210,229,88
0  
NA1287
8  
ASM101398v
1  
GCA_001013985.
1  
PacBio
SMRT  
BioNano
IrysChip  
906kb/  
1.4Mb
(V2)  
26,834,08
1  
3,176,574,37
9  
NA  2,332  146,352,28
6  
HX1  HX1  N/A  PacBio
SMRT  
BioNano
IrysChip  
8,325,00
4  
21,979,25
0  
2,934,084,19
3  
206,388,24
8  
10,901  39,341,483  

To evaluate the completeness and accuracy of the draft HX1 assembly, we performed several
analyses. First, we generated a physical map of the same DNA sample by NanoChannel-based
fluidic IrysChip
87
. From the IrysChip run with 101X whole-genome coverage (based on all
sequence reads >150kb, Table 10, Figure 16), we calculated the mapping rate of these
fragments on different genome assemblies. We noted that this analysis was biased against
more fragmented assemblies such as HX1, since some IrysChip reads may span two different
contigs. With a highly stringent threshold suitable for human genome analysis, the IrysChip
reads yielded comparable mapping rates to GRCh38 and HX1 at 80.2% and 78.9%,
respectively. Second, we aligned HX1 contigs with gapped (N bases included) to GRCh38, and
found that 97.11% of the non-N regions in GRCh38 are covered by HX1 (Figure 17). This ratio
was at a similar level as other personal genome assemblies including YH2.0
83
(94.99%),
NA12878
79
(97.50%) and HuRef
84
(97.04%). Third, we evaluated consensus quality of the
assemblies, by comparing them with the chromosomes in the reference genome GRCh38 using
MUMmer
89
. We found that the consensus accuracy for HX1 was 99.73%, similar to YH2.0
Page 38 of 73


(99.81%), NA12878 (99.73%), and HuRef (99.84%). Fourth, we aligned the NCBI RefSeq
transcripts to several genome assemblies using the NCBI Assembly Evaluation pipeline. We
found that 776 out of 50909 transcripts could not be placed on HX1 versus 455 for NA12878,
306 for YH2.0 and 22 for GRCh38 (primary assembly). The percentage of mapped coding
transcripts with CDS (coding sequence) coverage ≥95% was higher in HX1 than in all
assemblies except GRCh38: 99.96%, 90.32%, 95.30%, and 97.69% for GRCh38, YH2.0,
NA12878 and HX1, respectively (Table 11), indicating the low number of misassemblies in HX1.
Similarly, the numbers of transcripts split across multiple genomic locations in GRCh38, YH_2.0,
NA12878 and HX1 are 11, 1213, 1375, 348, respectively, demonstrating the high quality of HX1.
Last but not least, short read alignments showed that 99.42% and 99.33% of the Illumina reads
can be mapped to GRCh38 and HX1, respectively, suggesting the high quality and
completeness of HX1.
Table 10. Statistics on the filtered (>150kb) BioNano data.
Measure  Value  
Total DNA  302.8Gb  
Total molecules  1,169,210  
Occupancy  2.58%  
Average molecule SNR  16.61  
Average molecule intensity  0.14  
Center of mass (N50)  264.3kb  
Average length  259.0kb  
Median length  224.7kb  
Average label density /100kb  8.92  

Page 39 of 73


Table 11. RefSeq transcripts alignment results from the NCBI Assembly Evaluation pipeline.  
  GRCh38
˦
 YH2.0
*
 NA12878
*
 HX1
#
 
Number of transcripts retrieved from Entrez  50909  50909  50909  50909  
Number of transcripts not aligning  22  306  455  776  
Number of transcripts with multiple best alignments
(split transcripts)  
11  1213  1375  348  
Number of transcripts with CDS coverage < 95% in
aligned transcripts  
16  3798  1836  900  
Number of conflicting placements of transcripts from
different genes  
1  635  490  628  
Percentage of aligned coding transcripts with CDS
coverage >= 95%  
99.96  90.32  95.30  97.69  
*
NCBI accession number: GCA_000004845.2 for YH2, GCA_001013985.1 for NA12878
#
HX1 scaffolds
˦
Unless there is an assembled chrY in an assembly, our code excludes any transcript that aligns best to GRCh38  
chrY from the counts for these assemblies without chrY. It means that the denominator (number of sequences that
we align or that we attempt to align) is a little bit lower for the YH2.0, NA12878 and HX1.  
Page 40 of 73



Figure 16. Summary of the IrysChip analysis. (A) Distribution of the molecule mass versus the size of the
molecules in filtered data. (B) Distribution of the molecule count versus the size of the molecules in filtered
data.
Page 41 of 73



Figure 17. Non-N coverage of GRCh38 by HX1 assembly for each chromosome.
To generate scaffolds on the draft assembly, we applied a hybrid scaffolding approach
79
on the
IrysChip data and the HX1 draft assembly: First, we performed de novo assembly of the
IrysChip reads, resulting in 2,346 contigs with N50 of 1.80Mb. Next, we stitched HX1 contigs
together based on information from the IrysChip assembly. Together with HX1 contigs that
cannot be anchored, the N50 for the hybrid assembly improved to 22.0 Mb. We next evaluated
the mis-joining error rate with a similar strategy as described in a previous study
79
, by
comparing to the reference human genome GRCh38. The HX1 scaffolds had a mis-joining error
rate of 1.26%, similar to what was observed in YH2.0 (3.88%), NA12878 (0.55%) and HuRef
(1.15%). The final genome assembly contained 2.93 Gb primary sequences (including 39.3Mb
N bases) and 206Mb alternative haplotypes.
Page 42 of 73


Gap filling on the reference genome by de novo assembly
The reference human genome GRCh38 contains 966 "N-gaps" as stretches of Ns, and we next
assessed whether we can fill in these gaps using the de novo assembly. The average and
median lengths of gaps in GRCh38 are 180kb and 998 bp, respectively (Figure 18A). One
previous study by Chaisson et al used SMRT sequencing to fill gaps
90
; this study used a local
assembly approach, and was able to close or extend into 31 of the 172 interstitial euchromatic
gaps in GRCh38. Another study that used SMRT sequencing closed 28 interstitial gaps in
GRCh38 with 34kb of assembled sequences
79
. Given the availability of whole-genome de novo
assembly, we developed a novel statistical approach called GFA (gap-filling by assembly,
Figure 23). Interestingly, we found that 28.4% (254) of the 966 N-gaps in GRCh38 can be
completely or partially filled by HX1. Among the 328 gaps over 10kb, 36.0% (118) of them can
be completely or partially filled. The total length of filled or shortened gaps amounts to 7.1Mb
(Figure 18B). Compared to previous studies, gaps filled by HX1 have overlap with 48 of the 172
interstitial gaps defined by Chaisson et al
90
, adding 1.8 Mb sequences to GRCh38; however,
among the 48 interstitial gaps closed by us, only 10 were closed by Chaisson et al, suggesting
that these two gap closing methods are highly complementary. We further evaluated the repeat
contents within the gaps that can be closed by us, and found that simple repeats and satellite
sequences were significantly enriched within the closed gaps compared to GRCh38 (p<0.001)
(Figure 18C). As an example, one ~700bp  gap can be completely and confidently closed
(Figure 18D), where HX1 can be aligned to both flanking segments of the gap (Figure 18E). In
summary, together with Chaisson et al, 69 out of 172 interstitial gaps in GRCh38 can be closed
by long-read sequencing.
Page 43 of 73



Figure 18. Summary of gap filling in GRCh38. (A) Length distribution of all gaps (stretches of "N" in genome
sequence) in GRCh38. (B) Length distribution of all gaps that can be fully or partially closed. (C) Violin plots
showing the distribution of LINE, SINE, LTR, simple repeat and satellite in closed gaps and in GRCh38. (D) A
dotplot showing how a gap on 17p13.3 is closed by a contig in HX1. The plot shows comparison of two
sequences and each dot indicates a region of close similarity between them (E) genome browser shot of the
gap region that was closed. The gap is flanked by two contigs that are new in GRCh38 (not carried forward
from GRCh37), yet a HX1 associate contig (000850F-001-01) can completely align to flanking regions,
therefore filling this assembly gap and revising its length from 718bp to 731bp.
Page 44 of 73


Characterization of structural variants and novel sequences
We next cataloged structural variants (SVs) in the genome of HX1, by comparing to the
GRCh38 genome assembly. From long-read sequencing data, we identified 9,891 deletions and
10,284 insertions by a previously validated local assembly approach
90
(Figure 20A). We
classified these SVs by type and by repeat contents of the variant sequence (Table 12), and
found that about half of the deletion and insertion calls are short tandem repeats (STRs) or
mobile element insertions (MEIs) (Figure 20B). We further compared SVs in HX1 to those
detected in CHM1 - a haploid genome assembled by long-read sequencing and analyzed by the
same method of SV detection
90
, as well as all SVs cataloged in the 1000 Genomes Project
82

(Figure 20C). Due to the increased sensitivity of SV detection from long-read sequencing, HX1
shares substantially more SVs with the CHM1 genome, compared to all SVs cataloged in the
1000 Genomes Project. Additionally, from the IrysChip physical mapping data, we identified 783
insertions and 377 deletions, using a previously validated approach
91
. From the short-read
sequencing data, we also identified 2,403 deletions and 783 insertions. We found that 82.8%
deletions and 66.9% insertions from IrysChip overlap with SVs detected by long-read
sequencing (Figure 19), but only 33.7% deletions and 13.8% insertions from short-read
sequencing can be detected by long-read sequencing. Finally, we demonstrated an example
where a 204.7kb deletion can be visually identified by all technical platforms (Figure 20D,E).
However, a 132bp deletion can only be detected by long-read sequencing (Figure 20F); it is
located within simple repeat regions with high (57.2%) GC contents (Figure 20G) and shows no
clear drop in coverage in Illumina data (Figure 20F), potentially explaining its failure to be
identified in Illumina data using read-depth method. However, an alternative split-read based
method
92
with appropriate parameters was able to identify this deletion. To identify likely
functional SVs specific to HX1, we filtered out calls found in segmental duplications or those
shared with CHM1, and intersected the remaining calls with RefSeq exons. This left us with 35
exonic deletions and 14 exonic insertions. Among these exonic SVs, 8 deletions (23%) and 5
insertions (31%) had been previously observed in the 1000 Genomes Project, including a
homozygous exonic deletion in C1orf168 that completely removes the 10th and 11th exon.
Interestingly, this deletion was only observed as a heterozygous event in East Asian populations
(AF=1.1%), including 10 Han Chinese and 1 Japanese individuals, and was therefore an Asian-
specific SV. In summary, long-read sequencing offers improved sensitivity in identifying SVs,
especially those containing repetitive elements, and some of these SVs may contain functional
genomic elements that are ethnicity-specific.
Page 45 of 73


Table 12. Structural variants detected in HX1 in comparison to GRCh38. All events are larger than 50bp.
Alu_STR: Alu element associated with STR; Complex: multiple repeat elements are found; MEI: mobile
element insertion.
Insertion  Deletion  Ins/del  
Repeat
Category  
Num  Mean
length  
Total
bases  
Num  Mean
length  
Total bases  Total
events  
Total
bases  
MEI  2,172  600  1,302,834  2,499  504  1,258,260  0.87  1.04  
L1  175  1,121  196,227  207  552  114,177  0.85  1.72  
L1HS  117  2,622  306,826  146  2,648  386,599  0.80  0.79  
L1P  152  770  116,964  214  579  123,871  0.71  0.94  
Mosaic Alu  187  496  92,842  141  360  50,724  1.33  1.83  
AluS  110  219  24,144  127  180  22,819  0.87  1.06  
Alu_STR  93  961  89,407  63  439  27,662  1.48  3.23  
AluY  865  286  247,698  1,153  284  327,077  0.75  0.76  
SVA  292  511  149,172  227  618  140,327  1.29  1.06  
HERV  23  318  7,320  31  263  8,141  0.74  0.90  
LTR  92  533  49,047  108  256  27,665  0.85  1.77  
MER  66  351  23,187  82  356  29,198  0.80  0.79  
Centromeric
Satellites  
608  561  340,995  436  501  218,243  1.39  1.56  
ALR  484  612  296,077  387  551  213,177  1.25  1.39  
HSAT  124  362  44,918  49  103  5,066  2.53  8.87  
Complex  822  3,065  2,519,672  766  3,075  2,355,090  1.07  1.07  
STR  3,655  332  1,212,704  2,130  176  374,176  1.72  3.24  
Other  256  269  68,982  431  243  104,795  0.59  0.66  
Unannotated  2,771  210  580,924  3,629  180  654,225  0.76  0.89  
Total  10,284  586  6,026,111  9,891  502  4,964,789  1.04  1.21  

Page 46 of 73



Figure 19. Structural variation calls across different technologies. (A) The number of CNV calls generated by
CNVnator (Illumain short-read sequencing), IrysChip (BioNano physical mapping), FES-SV (PacBio long-read
sequencing) at different size thresholds. (B) The concordance of FES-SV calls with the other two
technologies.
Page 47 of 73



Figure 20. Detection of structural variants by different technologies. (A) Chromosome ideogram showing
large-scale (>1kb) deletions (blue) and insertions (red) identified from long-read sequencing data. (B) Pie
chart showing the distribution of different classes of structural variants identified from long-read sequencing
data. (C) Venn diagram showing the overlap of structural variants between HX1, CHM1 and the 1000
Genomes Project for insertions and deletions, respectively. (D) Integrative Genomics Viewer screen shot of
the long-reads (upper panel) and short reads alignment (lower panel) around a ~200kb deletion (E) Alignment
of de novo assembled genome map (blue) to reference genome map (green) where the ~200kb deletion
occurs. Black vertical lines represent labels for the enzyme recognition site. Contig 2 shows identical label
patterns as reference, yet contig 1 contains the deletion. (F) Integrative Genomics Viewer screen shot of
long-read (upper panel) and short-read (lower panel) alignment around a 132bp deletion on KRTAP1-1. This
deletion is visually discernible from long-read sequencing, because the coverage is reduced and half the
reads contain the deletion in alignments. However, read-depth based method failed to detect this deletion
with short read data. (G) Genome browser screen shot of the region surrounding the 132bp deletion on
KRTAP1-1, demonstrating the presence of simple tandem repeats and the very high GC content of the
deletion.
Page 48 of 73


One of our primary interests lies in the identification of novel genomic elements that are absent
from the reference genome assembly GRCh38. In total, we identified 18.4Mb sequences in HX1
that were not present in GRCh38 primary scaffolds nor its alternative loci, among which only
378kb (~2%) cannot be mapped to previously published Asian genome assemblies
78,83
,
suggesting that most HX1-specific sequences are likely to be found in Asian populations. To
further investigate this, we re-analyzed Illumina short-read sequencing data on a Chinese
subject provided by the National Institute of Standards and Technology (NIST) as standard
benchmarking data. Among the 907 million raw reads, we achieved a mapping rate of 99.56%
to GRCh38; among unmapped reads, we found that 7.8% can be mapped to the 18.4Mb novel
sequence in HX1, confirming that Asian-specific sequence elements were present but were
missed from GRCh38. We next performed variant calling on the NIST genome. In the regions
shared between HX1 and GRCh38, we identified 3,157,818 SNVs on HX1 but 3,852,118 SNVs
on GRCh38, suggesting that Asian-specific reference genome might reduce the number of
called SNVs. However, HX1 might be less appropriate for indel calling due to higher indel error
rate of SMRT long reads. Among the SNVs called on HX1, 37,462 (~1%) resided within the
novel sequences of HX1. We further examined the contents within the novel sequences in HX1
(Table 13), and found that microsatellites are significantly enriched in the novel sequences
compared to genome-wide average (68.4% vs 1.6%, p<10
-5
). Similarly, simple repeats are also
significantly over-represented (10.3% vs 0.9%, p<10
-5
). Therefore, long-read sequencing is
especially effective in capturing highly repetitive regions in the genome.
Page 49 of 73


Table 13. Repeat content within novel (non-GRCh38) sequences in HX1.
Novel sequence in HX1  Random sequences from GRCh38  
GC: 40.79%, 928 sequences, size: 18.4Mb  GC: 40.95%, 251 sequences, size: 18.4Mb  
  element count  length  percentage of
sequence  
element count  length  percentage of
sequence  
SINEs:  4507  578352  3.14%  10081  2350498  12.77%  
ALUs  4312  551425  3.00%  7272  1937478  10.53%  
MIRs  195  26927  0.15%  2790  410656  2.23%  
LINEs:  772  751185  4.08%  5523  3638149  19.77%  
LINE1  637  716433  3.90%  3597  3101890  16.86%  
LINE2  123  33340  0.18%  1657  468664  2.55%  
L3/CR1  9  847  0.00%  201  45972  0.25%  
LTR  404  263736  1.43%  3051  1589941  8.64%  
ERVL  82  41827  0.23%  664  325935  1.77%  
ERVL-MaLRs  185  73600  0.40%  1504  631239  3.43%  
ERV_classI  104  109682  0.60%  701  539504  2.93%  
ERV_classII  21  35248  0.19%  47  57160  0.31%  
DNA elements  347  85282  0.46%  2357  564788  3.07%  
hAT-Charlie  190  40134  0.22%  1271  256052  1.39%  
TcMar-Tigger  105  31711  0.17%  543  201641  1.10%  
Unclassified:  23  30339  0.16%  53  36362  0.20%  
Total
interspersed
repeats  
6053  1708894  9.29%  21065  8179738  44.46%  
Satellites  6216  12576907  68.38%  34  299489  1.63%  
Simple repeats  11945  1890873  10.28%  2489  165630  0.90%  
Low
complexity  
586  38670  0.21%  2251  105424  0.57%  

Page 50 of 73


Several other previously published studies reported novel sequence elements from Asians, so
we performed comparative analysis. For example, Li et al found 7,330 novel sequences (4.9Mb)
absent from GRCh37
85
. Re-analysis of their data showed that 3,440 (3.7Mb) sequences can be
aligned to GRCh38, yet 4,676 (4.4Mb) can be aligned to HX1. Among these sequences, 3,129
(3.5Mb) can be aligned to both GRCh38 and HX1. To assess whether regulatory functional
elements exist in novel sequences in HX1, we analyzed raw sequencing data on five markers
(CTCF, DNase 1 hypersensitivity, H3K4me1, H3K4me3, H3K27ac) on the lymphoblastoid cell
line GM12878 from the ENCODE project (Table 14)
93
. Among all the reads that cannot be
mapped to GRCh38, 1.1% can be mapped to HX1 (Table 15) and we performed peak-calling on
each marker. We also performed RNA-Seq on HX1 using Illumina short-read sequencing, and
found that genes closest to these peaks within the 500kb flanking region tend to have higher
expression levels than all other genes in the RNA-Seq data (Figure 21A). Similar patterns were
also observed for novel non-GRCh38 sequences of HX1 (Figure 21B, Table 16). In summary,
we demonstrated that long-read sequencing can identify potentially functional pieces in
genomes that evade detection by short-read sequencing.
Page 51 of 73


Table 14. List of data sets downloaded from http://encodeproject.org for functional analysis.
Cell line  Producer  Assay  Replicate#  Accession  Control Accession  
GM12878  broad  H3K4me3  1  ENCFF000ASR  ENCFF000ARK  
GM12878  broad  H3K4me3  2  ENCFF000AUB  ENCFF000ARO  
GM12878  uw  H3K4me3  1  ENCFF001EYE  ENCFF001HID  
GM12878  uw  H3K4me3  2  ENCFF001EYF  ENCFF001HID  
GM12878  broad  H3K4me1  1  ENCFF000ASM  ENCFF000ARK  
GM12878  broad  H3K4me1  2  ENCFF000ATK  ENCFF000ARO  
GM12878  broad  H3K27ac  1  ENCFF000ASP  ENCFF000ARK  
GM12878  broad  H3K27ac  2  ENCFF000ASU  ENCFF000ARO  
GM12878  broad  CTCF  1  ENCFF000ARP  ENCFF000ARK  
GM12878  broad  CTCF  2  ENCFF000ARV  ENCFF000ARO  
GM12878  stanford  CTCF  1  ENCFF000VUU  ENCFF000VWV  
GM12878  stanford  CTCF  2  ENCFF000VUW  ENCFF000VWV  
GM12878  uta  CTCF  1  ENCFF000ROU  ENCFF000RPB  
GM12878  uta  CTCF  2  ENCFF000ROX  ENCFF000RPB  
GM12878  uta  CTCF  3  ENCFF000ROZ  ENCFF000RPB  
GM12878  uw  CTCF  1  ENCFF001HHX  ENCFF001HID  
GM12878  uw  CTCF  2  ENCFF001HIA  ENCFF001HID  
GM12878  uw  DNase  1  ENCFF001CUR  NA  
GM12878  uw  DNase  2  ENCFF001CWQ  NA  
GM12878  duke  DNase  1  ENCFF000SLF  NA  
GM12878  duke  DNase  2  ENCFF000SLG  NA  
GM12878  duke  DNase  3  ENCFF000SLL  NA  
GM12878  duke  DNase  4  ENCFF000SLP  NA  
GM12878  duke  DNase  5  ENCFF000SLR  NA  

Page 52 of 73


Table 15. Mapping rate for ENCODE reads from each one of the assays from GM12878 that cannot be
mapped to GRCh38 but can be realigned to HX1.
Total number of
GRCh38-unmapped
reads  
Total number of reads
mapped to HX1  
Mapping rate of
GRCh38-
unmapped reads  
CTCF  52,428,376  583,528  0.011  
DNase  133,235,297  1,102,211  0.008  
H3K4me1  18,969,475  173,028  0.009  
H3K4me3  8,787,823  961,685  0.109  
H3K27ac  55,380,745  262,783  0.005  
Total  268,801,716  3,083,235  0.011  

Page 53 of 73


Table 16. Summary for peaks in HX1 novel sequences generated using ENCODE GM12878 sample and their
associated genes.
Assay  Peak  Gene  
Contig  Start  End  Contig  Start  End  FPKM  Distance
to peak  
CTCF  000504F  2,767,793  2,768,161  000504F  2,529,336  2,531,746  7.96  236048  
000800F  671,615  672,870  000800F  587,324  693,220  47.00  0  
H3K4me1  001058F  115,364  116,029  001058F  181,365  182,584  5.79  65337  
H3K27ac  001058F  121,686  121,940  001058F  181,365  182,584  5.79  59426  
DNase  000102F  4,813,943  4,814,071  000102F  4,944,978  4,980,549  4.28  130908  
000235F  4,325,591  4,325,743  
000235F  4,237,882  4,242,650  3.78  
82942  
000235F  4,325,791  4,325,853  83142  
000504F  2,780,203  2,780,257  000504F  2,529,336  2,531,746  7.96  248458  
000507F  2,367,118  2,367,161  000507F  2,124,214  2,352,056  2.50  15063  
000800F  656,120  656,696  
000800F  587,324  693,220  47.00  
0  
000800F  657,262  658,394  0  
000800F  658,567  660,785  0  
000800F  660,909  665,520  0  
000800F  665,625  668,481  0  
000800F  668,731  672,967  0  
001058F  115,479  116,041  
001058F  181,365  182,584  5.79  
65325  
001058F  116,444  116,827  64539  
001058F  121,651  121,747  59619  
001252F  233,670  234,313  
001252F  85,421  90,442  1.30  
143229  
001252F  234,449  236,250  144008  
001377F  135,128  135,342  001377F  123,711  124,302  4.83  10827  
001632F  79,834  79,884  
001632F  119,584  120,284  5.85  
39701  
001632F  81,055  81,159  38426  
001632F  81,265  81,845  37740  
001632F  81,848  82,407  37178  
001632F  131,939  132,559  11656  
001632F  132,804  133,397  12521  
001632F  133,729  134,363  13446  
002340F  46,144  46,150  
002340F  2,013  10,871  8.02  
35274  
002340F  50,330  50,851  39460  
002360F  14,243  14,345  002360F  29,439  30,130  1.84  15095  

Page 54 of 73



Figure 21. Distribution of FKPM values for genes flanking each of the five markers (CTCF, DNase, H3K4me1,
H3K4me3, H3K27ac), compared to background expression. For each peak, the nearest gene with 500kb
flanking region is analyzed. (A) Violin plot on all peaks in HX1 (B) Violin plot on peaks in HX1 that are located
in novel sequences. Note that in novel sequences, there is no gene that can be associated with H3K4me3
and there is only one gene that can be associated to H3K4me1 and H3K27ac.
DISCUSSION
In the current study, we generated one of the first near-complete de novo genome assemblies
for a Chinese individual. The contig and scaffold N50 values of the assembly were substantially
higher than previous studies on de novo human genome assembly, implicating the unique
advantage of long-read sequencing in assembling complex genomes such as the human
genome.  
There are several current limitations to use long-read sequencing to generate de novo
assemblies and analyze personal genomes. First, although the read length is substantially
higher than short-read sequencing, it is still at the scale of tens of kilobases due to the
technological limitations of the current sequencing platform. Some highly complex genomic
regions may still not be adequately assayed or assembled, especially when sequencing
coverage is low. With the current development of a number of nanopore-based long-read
sequencing platforms, this problem may be alleviated by technological innovations. Second,
some gaps in the reference genome are long and are surrounded by segmental duplications or
other highly repetitive sequences
94
; therefore, they may not be filled by our long read assembly.
For example, 24 of the 966 gaps are longer than 300kb based on current estimation, and none
Page 55 of 73


of them can be closed by our method. Third, compared to the Illumina platform that enabled
$1000 genomes, PacBio long-read sequencing is still relatively cost-prohibitive to be applied to
personal genomes at large scale. With the continued decline in sequencing cost and the
improvement in sequencing throughput per flow cell, this problem may be reduced in the future.
Fourth, due to the much higher error rates (especially for indels) of PacBio long-read
sequencing, variant detection will not be reliable at low sequencing coverage, so analysis of
genetic mutations in personal genomes still needs to rely on more accurate short-read
sequencing data. Finally, in the current study, to demonstrate the use of reference-free de novo
assembly, we used the NanoChannel arrays for scaffolding the contigs from the genome
assembly, which resulted in ~3 fold improvements in N50 values. However, we acknowledge
that we could alternatively use the reference human genome GRCh38 for generating much
longer scaffolds.
In summary, while short-read based alignment and variant calling based on reference genome
remain a common practice to assay personal genomes, de novo assembly by long-read
sequencing may reveal novel and complementary biological insights. Improved understanding
and better characterization of genome functional variation may require the use of a range of
genomic technologies on diverse human populations
95
.
METHODS
Generation of sequence data
Freshly drawn blood samples were obtained from an anonymous healthy Chinese adult male
(HX1) with normal karyotype, using protocols approved by the Institutional Review Board of
Jinan University. HX1 provided written consent for public release of genomic data. For long-read
DNA sequencing, high-molecular-weight DNA was extracted from isolated lymphocytes using
Phenol-Chloroform method and sequenced by the PacBio sequencer RS II with the P6-C4
sequencing reagent. For long-read RNA sequencing, total RNA was extracted from blood using
TRIzol extraction reagent and subjected to the IsoSeq protocol with four library sizes (1-2kb, 2-
3kb, 3-5kb, 5kb+), and sequenced on the PacBio sequencer RS II. For short-read DNA
sequencing, genomic DNA was subject to Illumina TruSeqDNA Nano library preparation
protocol and 151bp paired-end reads were generated by Illumina HiSeq X sequencer. For short-
read RNA sequencing, RNA samples were subject to the TruSeq mRNA Library Kit and
sequenced on Illumina HiSeq2500 sequencer. For BioNano physical mapping, DNA extracted
Page 56 of 73


from freshly drawn whole blood were subject to manufacturer recommended protocols for library
preparation and optical scanning, with the default nicking enzyme NT.BspQI.  
BioNano genome mapping
We used the Irys System from BioNano Genomics, a NanoChannel-based fluidic IrysChip that
can unravel, sort, and confine native-state, long, genomic DNA fragments in a linearized
conformation. The DNA is labeled by a nicking enzyme that recognizes characteristic 7-mer in
genome sequence. Labeled sites appear as a uniquely recognizable pattern of “dots on a string”,
so that once the DNA is stretched inside the NanoChannels, the Irys CCD detector optically
images them. Uniquely, the straightened molecules in solution are able to move smoothly
through the nanoscale fluidic environment, enabling multiple cycles of automated loading and
imaging for high-throughput scanning and analysis.
The BioNano optical mapping data was generated using DNA extracted from 3mL freshly drawn
whole blood, using manufacturer recommended protocols for library preparation and optical
scanning. The default nicking enzyme NT.BspQI was used for digesting DNA. A total of 12 flow
cells were used for this DNA analysis. We selected 150kb as the threshold for size filter (those
molecules <150kb are removed from raw data), and selected dynamic SNR (signal-to-noise
ratio) filter which allows IrysView to automatically calculate the optimal label SNR. The final set
of cleaned data includes 302.8Gb data on 1,169,210 molecules, with N50 of 264.3kb (Table 10).
Of note, 30.0Gb (9% bin mass fraction) of the data has length over 500kb. The IrysView
software (software version 2.1.1.8025 with RefAligner/Assembly version r3827), in a desktop
computer was used for data reformatting, QC and visualization, yet heavy-duty data analysis
(such as assembly, scaffolding and genome comparison) was performed using a computing
cluster. Additional procedures for data analysis are described in several sections below.
Karyotyping for chromosomal abnormality
To assess large-scale chromomosal abnormality or aneuploidy in the HX1 genome, we
performed karyotyping at the Clinical Diagnostic Laboratory at the Women and Children's
Hospital of Guangdong. Standard karyotyping techniques are used on 5mL freshly drawn blood.
Briefly, after a short-term culture of cells derived from the blood, dividing cells are arrested in
metaphase by addition of colchicine, which poisons the mitotic spindle. The cells are next
treated with a hypotonic solution that causes their nuclei to swell and the cells to burst. The cells
were then examined on a glass slide with stains to examine the structural features of the
chromosomes.  
Page 57 of 73


de novo draft genome assembly by Falcon
After we obtained raw data, we performed subread filtering using the SMRT Portal software
(Pacific Biosciences, Meno Park, CA) with default settings for subread protocol. FASTA files
were then pooled together for assembly. The average read length and N50 are 7.0kb and
12.1kb, respectively. Due to the many random errors in SMRT long reads, a hierarchical
genome assembly process assembler called Falcon
88
specifically designed for PacBio reads
was used to do de novo assembly. In particular, we modified Falcon to improve its performance
in an NFS-based computing cluster. The source code is available on github
(https://github.com/WGLab/EnhancedFALCON).
Unlike hybrid assembly method that combines SMRT long reads and Illumina short reads,
Falcon relies on single source of sequencing data for assembly. It first uses DALIGNER
96
to find
overlaps between filtered subreads and then extracts consensus sequences from overlaps (pre-
assembly). The consensus sequences are much more accurate than subreads
88
. DALIGNER is
called again to find overlaps between error-corrected reads. Based on these overlaps, Falcon
constructs a graph where a node denotes a read and a directed edge denotes an overlap. By
removing redundant edges and resolving complicated paths, simple paths can be identified and
later be converted to contigs. Because human genome is diploid, alternative paths may exist in
certain regions. As a result, associate contigs will also be constructed. A visualization of
assembly procedure is shown in Figure 22.
All reads were used to build the assembly (“-a” option in DBsplit). Length cutoff is 6,000 bp for  
error-correction and 12,000 bp for graph construction. Detailed configuration is provided as
follows.  
# The length cutoff used for seed reads used for initial mapping
length_cutoff = 6000

# The length cutoff used for seed reads usef for pre-assembly
length_cutoff_pr = 12000

#-M means limit memory usage, use integer
pa_HPCdaligner_option =  -v -dal128 -t16 -e.70 -l1000 -s1000
ovlp_HPCdaligner_option = -v -dal128 -t32 -h60 -e.96 -l500 -s1000

#-x is length threshold for including reads in DB
#consider add -a option in future to include secondary reads for error correction
pa_DBsplit_option = -x500 -s400 -a
Page 58 of 73


ovlp_DBsplit_option = -x500 -s400 -a

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 6
overlap_filtering_setting = --max_diff 60 --max_cov 100 --min_cov 2 --bestn 10 --n_core 12



Figure 22. Falcon de novo assembly workflow.
Page 59 of 73



Evaluation of draft genome quality by BioNano
We next used the BioNano data as an orthogonal genome mapping approach to evaluate the
quality and completeness of the draft genome assembly. For this analysis, we compared the
draft genome with GRCh38, to assess whether BioNano tend to be better mapped to one
genome versus the other. This analysis is not fair, since a draft assembly contains many short
contigs, and some BioNano reads may not be mappable to these short contigs, resulting in
reduced mapping rate. However, the difference between draft assembly and a reference
genome assembly (GRCh38) can inform us about the completeness of the draft assembly.
Using the "Molecular Quality Report" function in the BioNano IrysView software (version
2.1.1.8025) with RefAligner/Assembly (version r3827), we evaluated what fraction of BioNano
molecules (>150kb) can be mapped to the draft genome assembly and the GRCh38 reference
assembly, respectively. The parameters used are " -nosplit 2 -BestRef 1 -biaswt 0 -Mfast 0 -FP
1.5 -sf 0.2 -sd 0.0 -A 5 -outlier 1e-4 -endoutlier 1e-3 -S -1000  -sr 0.04 -resbias 5 64 -maxmem
36 -M 3 -minlen 150 -T 1e-9 -maxthreads 12 -hashgen 5 3 2.4 1.5 0.05 5.0 1 1 2 -hash -
hashdelta 10 -hashmaxmem 36 -insertThreads 8 -stdout -stderr". These are the default
parameters in the software, except that we used a highly stringent threshold (T=1e-9) to declare
a match and sampled 50,000 molecules (rather than 5,000 as shown by default settings), per
suggestions from BioNano to process human genomes. The mapping rates of BioNano
molecules to the draft assembly and GRCh38 are 78.9% and 80.2%, respectively.
Construction of physical map by BioNano nanochannel array
The BioNano NanoChannel Array (Irys System) linearizes DNA molecules up to megabases in
length, and uses nicking enzymes that recognize characteristic 7-mer in genome sequence to
provide physical maps. This method therefore allows de novo assembly of the human genome,
and such assembly can be further used as scaffolds for PacBio-assembled genomic sequences.
This approach has already been used recently for analyzing human genomes
79
.
Page 60 of 73


We performed de novo genome assembly on the filtered data, with manufacturer-recommended
parameters for human genome (molecular length threshold: 150kb; min label per molecule: 8;
maximum backbone intensity: 0.6; false positive density/100kb: 1.5; false negative rate: 0.15%;
Scaling SD: 0; siteSD: 0.2kb; relative SD: 0.03; initial assembly p-value cutoff: 1e-10; extension
and refinement p-value cutoff: 1e-11; merge p-value cutoff: 1e-15), autonoise adjustment, and 5
iterations of computation. The final results contain 2,346 contigs with the N50 for the assembly
as 1.80Mb.  
Improve draft genome assembly by hybrid scaffolding
After de novo assembly of the physical map using BioNano data, we used the physical map as
scaffolds for PacBio-assembled draft genome. This approach is referred to as "hybrid
scaffolding", which identifies regions that PacBio contigs can be grouped together by BioNano
scaffold, and generate new assembled sequences with improved N50 values, as demonstrated
in a recent study (N50=906kb for PacBio data, and N50=13.6Mb after hybrid scaffolding by
BioNano data)
79
.
With extensive discussion with BioNano, we developed the following strategies and wrote a few
custom scripts (available upon requests) for post-processing of hybrid scaffolds and generating
the complete genome sequence. This procedure is described here:  
(1) Use IrysView software to perform hybrid scaffolding, with BioNano assembly as the
scaffold, and PacBio assembly (as raw FASTA sequence rather than enzyme-digested
cMAP) as the query sequence.  
(2) Use 'Export to FASTA' option in the IrysView software to export the hybrid assembly into
a FASTA file.
(3) Identify the set of contigs (as cMAP IDs) that were filtered out during the scaffolding
process, through a search of the intermediate results in hybrid scaffolding, and back-
track the identifier of the contigs through the key file generated within the intermediate
results  
(4) retrieve these 'discarded' contigs from the original FASTA file from PacBio assembly,
and supplement them into the hybrid assembly to generate the final assembly.
Page 61 of 73


Consensus quality and scaffolding accuracy analysis
Assembly consensus quality was evaluated using MUMmer (nucmer, delta-filter and dnadiff).
Scaffolding accuracy was evaluated following Pendleton’s method
79
with a few modifications.
Sliding 100Kb windows (from start to end) were selected in the assembly to be evaluated.
250bp sequences at head and tail of each window were extracted and aligned to GRCh38 main
chromosomes using BWA-MEM. Among uniquely aligned pairs of reads with mapping quality >
30, if two reads are not within 90 Kb or 110Kb to each other, the corresponding window was
considered mis-joined.
Short read polishing
Illumina short reads were aligned to HX1 contigs with BWA-MEM. SAMtools was used to call
indel and SNVs. Each homozygous (heterozygosity rate<0.1) indel or SNV with genotype
quality > 30 was considered as an error in the assembly. HX1 contigs were polished by
replacing reference alleles with corresponding alternative alleles to the assembly at each variant
site. This procedure was repeated three times to achieve higher consensus quality and lower
indel error rate.
Gap filling in GRCh38
We developed a Gap Filling by Assembly (GFA) procedure for closing gaps in the reference
genome. Any region consisting of continuous runs of N in the target assembly (GRCh38) is
defined as a gap in our method. Therefore, gaps between scaffolds are not considered in our
analysis. Gaps within 200bp to each other are merged. Flanking sequences upstream and
downstream of the gaps were mapped to the source assembly (HX1). If two anchor sequences
for the same gap can both be aligned, they will be examined to remove discordant pairs which
include those alignments with inconsistent orientation, on different contigs, or overlapping with
each other. If only one anchor can be aligned, then the anchor will be extended into the gap
region wherever possible. Code used for gap filling has been deposited to github
(https://github.com/WGLab/uniline). The above steps are illustrated in Figure 23.
For each closable gap, a probability is calculated based on the mapping score and gap length.
A brief description is shown below.
:  ℎ              
 1
:   ℎ  1          
 2
:   ℎ  2          
Page 62 of 73


 :                         ℎ  
=   1
 2
 
When only single anchor is mapped:
= min(   1
,   2
)
 1
and   2
come from mapping quality score of BWA-MEM algorithm, which takes both
alignment quality and sequence context into consideration. We make the assumption that
mapping and gap length are independent of each other. This assumption does not necessarily
hold in all circumstances, and can be improved by explicitly modeling the insertion and deletion
length in the target genome (where gaps are to be filled).  
 (predicted gap is equal to original gap by chance) is calculated as follows.
0
:    ℎ          
 :    ℎ          
:          
Let  =   −  0
. We assume  ~  (0,  2
0
2
), that is we model the difference between predicted
gap length and original gap length with a normal distribution. Therefore we have
 = �
� − �   −  0
� ≤  ≤ �   −  0
� �      ≠  0
( −0.5 ≤  ≤ 0.5)      =  0

Intuitively,   means the (random) chance observing a gap with   or less extreme given  0

under a normal distribution. The following table shows how the model behaves under some
common situations (k=2). As is shown, the model permits ~10% of flexibility at a threshold of 30
and does not penalize harshly when   deviates two to three times from  0
.
L
0
 1000
L
g
500 900 1000 2000 2500 3000
Phred-scaled P
g
(rounded to
nearest integer)
16 32 85  10 6 4

Page 63 of 73


In our analysis, k = 2 was used to calculate gap closing score. Only uniquely closable gaps
were shown in results, i.e., flanking sequences of these gaps were uniquely mapped. Gap
closing score of 30 were used to filter out low quality predictions.

Figure 23. Illustration of the Gap Filling by Assembly (GFA) procedure.
Page 64 of 73



Finding novel sequences and calculating genome coverage
MUMmer
89,97,98
is among the first programs that utilizes suffix-tree algorithm to find maximal
matches. The time requirement for building suffix-tree for a sequence of length n is of O(n).
Given a suffix tree of S and a query sequence Q of length m, all unique maximal matches can
be found in time proportional to m. The architecture of MUMmer makes it extremely fast for
genome-scale sequence comparison. For each contig in GRCh38 including alt loci
(http://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/analysisSet/) and decoy sequences
(hs38d1), we mapped our contigs with MUMmer. The results (*.coords) were post-processed by
custom scripts to filter and extract regions unmappable with >80% identity and gap allowance of
1kb.  Here gap allowance means maximum gap length allowed between two closely mapped
regions that will be merged together. However, MUMmer is not as sensitive as BLAST
99
or its
descendent, LastZ
100
, for genome-scale comparison, so a second round of mapping by LastZ
was used to further refine unmapped sequences. LastZ is a successor for BLASTZ
101,102
, which
is an independent implementation of BLAST
99
. BLASTZ
101,102
differs from BLAST primarily in two
aspects: BLASTZ has an option that requires the alignments must occur in the same order and
orientation in both reference and query sequences; BLASTZ uses a different scoring scheme
that takes sequence conservation into consideration. LastZ
100
was developed based on BLASTZ
and is more robust and flexible. LastZ was run with parameters lenient for gap extension,
repetitive sequences. Hence large gaps that are possibly structural variations would not hinder
alignment, and repetitive sequences do not cause significant slowdown of alignment speed.
Regions unmapped by LastZ with gapped threshold of 10,000 and gap allowance of 1kb are
considered as novel sequences.  
Percentage of coverage on GRCh38 (no_alt_analysis) by different assemblies is calculated
based off of the results from MUMmer. Same alignment procedure was used as described for
finding novel sequence. Regions mapped with >80% identity and gap allowance of 1kb are
considered as valid alignments. Percentage of non-N base pairs covered by these valid
mappings is denoted as percentage of coverage.
The commands used for MUMmer and LastZ are shown as follows. Custom scripts are
deposited in github (https://github.com/WGLab/uniline).
#commands for mummer
nucmer -c 400 -l 150 --prefix=$prefix $ref $query
Page 65 of 73


show-coords -r -c -l -k $prefix.delta > $prefix.coords
#commands for LastZ
lastz $ref $query --notransition --gap=1000,1 --step=20 --ambiguous=iupac --format=maf --gappedthresh=10000 --
identity=80 --progress=10 --maxwordcount=1 --masking=0
Comparative analysis using NIST Chinese genome
To assess quality of HX1 and GRCh38 as reference genome for variant calling, we downloaded
a Chinese sample NA24694 HiSeq reference data at http://ftp-
trace.ncbi.nlm.nih.gov/giab/ftp/data/ChineseTrio/HG006_NA24694-
huCA017E_father/NA24694_Father_HiSeq100x/NA24694_Father_HiSeq100x_fastqs/ from
National Institute of Standards and Technology (NIST). In total, we obtained 907 million reads
(133 billion bases), with an estimated coverage of 41X. The raw FASTQ files were aligned to
both HX1 and GRCh38 by BWA. For HX1, we achieved mapping rate of 99.22% and for
GRCh38 we achieved mapping rate of 99.56%. Reads that cannot be mapped to GRCh38 were
further mapped against HX1.
Functional analysis on novel sequence elements
To explore the potential biological function in the novel sequences, we used GM12878 cell line
and downloaded 17 of its associated ChIP-Seq data sets with four assays, including H3K4me3,
H3K4me1, H3K27ac and CTCF, and 7 DNase-Seq data sets for the same cell line.  
The raw FASTQ files were aligned to GRCh38 using bowtie2 (2.0.5)
23
. To investigate whether
HX1 can be a complementary reference to GRCh38 for mapping sequencing reads for
nucleosome positioning, transcriptional factor binding sites and histone modification, we then
extracted unmapped ENCODE reads from GRCh38 and realigned them to HX1 with bowtie2.  
MACS
103,104
 was used to call ChIP-seq peaks and DNase hot spots. We used settings
recommended by authors of MACS. Specifically, for histone mark ChIP-seq, we used “--
nomodel” option plus a control data set; for transcription factor, we used default parameters plus
a control data set; for DNase-seq, we used “--nolambda” option plus no control data.  
To examine how these regulatory elements can potentially modify gene expression for HX1, we
performed transcriptome analysis using Illumina RNA-Seq data as follows. First, in order to
assess the expression level for each transcript in HX1,we used Cufflinks to generate
transcriptome assemblies and used FPKM from this assembly to assess the expression level of
each gene predicted by Cufflinks. We examined the distribution of FPKM values of all 36,042
genes predicted by Cufflinks. Note that here we used cutoff of 500bp to ensure that all predicted
Page 66 of 73


genes are likely to be real genes, rather than short artifacts generated from Cufflinks prediction.
From the pattern of the distribution FPKM values of all predicted genes, we observed that after
natural log transformation, the distribution of FPKM values follow roughly Gaussian distribution
with several outliers in the tail. Therefore, we excluded additional 180 potential outliers using
99.99% percentile (FPKM=22,026, natural log FPKM=10) as cutoff and used the remaining
35,862 transcripts for downstream analysis. Then, for each regulatory element, we extracted the
FPKM value for nearest gene that is located within its 500kb flanking regions (250kb upstream
and 250kb downstream) and compared it with the background FPKM value of whole
transcriptome in HX1. Here we define background gene set as set of genes that are not flanking
CTCF binding sites, DNase I hypersensitivity sites, H3K4me1, H3K4me3 or H3K27ac histone
marks. We found that there are 2,338 unique genes located within 500kb flanking regions of
CTCF binding sites, 6,061 unique genes near DNase I hypersensitivity sites, 513 genes near
H3K4me1 histone modification sites, 3,146 genes near H4K4me3 sites and 2,079 genes near
H3K27ac sites. Overall, genes flanking regulatory regions in HX1 have increased expression
level compared to the background. For histone modifications, H3K4me3 has the highest median
FPKM value for its regulated genes (median FPKM=10.48) compared to the background
(median FPKM=3.67) and such difference is significant (P<10
-10
by Wilcoxon Rank-sum
test);H3K4me1 has the lowest median FPKM value for its regulated genes (median FPKM=8.76)
and yet such expression level is still significantly higher than the background (P<10
-10
by
Wilcoxon Rank-sum test). This may imply that in HX1 sequences, there could exist activated
enhancers that increase the expression of nearby genes. Similarly, for genes near DNase I
hypersensitivity sites and genes near CTCF-binding sites,their expression is also significantly
higher than the background (P<10
-10
by Wilcoxon Rank-sum test for both DNase and CTCF
assay).
As for regulatory regions in novel sequences, we found 2 predicted genes flanking CTCF
binding sites, 11 genes flanking DNase hypersensitivity sites, 1 gene for H3K4me1 sites, 1 gene
for H3K27ac sites and no genes flanking H3K4me3 sites.Overall, genes flanking regulatory
regions in HX1 novel sequences have increased expression level compared to the background.
Here we define background gene set as set of genes that are not flanking CTCF binding sites,
DNase I hypersensitivity sites, H3K4me1, H3K4me3 or H3K27ac histone marks that are located
in novel sequences in HX1. For histone modifications, H3K4me3 and H3K4me1 are both linked
to a predicted gene located at contig 001058F coordinate of 181,365 to 182,584, with 1.31-fold
increased expression (FPKM=5.79), compared to the background (median FPKM=4.41). Genes
Page 67 of 73


flanking CTCF binding sites and DNase hypersensitivity I sites in novel sequences also have
higher expression than background. For example, for CTCF binding sites, the associated genes
have 6.23-fold increased expression (median FPKM=27.48) compared to the background. This
may imply the existence of potential functional regulatory elements in HX1 novel sequences.
Structural variation calling (Illumina whole-genome sequencing)  
To detect CNVs from the Illumina WGS data, the read depth based tool CNVnator was used,
which uses a Mean-shift Algorithm, to iteratively partition and merge genomic segments for
multiple steps
105
.
For whole-genome sequencing data, all the Illumina short reads were first aligned onto GRCh38
genome, with BWA-MEM
106
, to generate the BAM file. Afterwards, the BAM file was used as
input into CNVnator
105
, to generate CNV calls, with the bin size set as 100 bp. The output is a
list of CNVs, with 37,665 deletions and 1,391 duplications. After filtration with P<0.01 standard
and removal of CNVs overlapping with Immunoglobulin regions and centromeres, there are in
total 2403 deletions and 783 duplications left.
Structural variation calling (BioNano genome map)
Through comparison to an expected digestion map for the GRCh38 genome, the BioNano
Genome Map data can be used to identify structural variations. Using the IrysView software, we
performed a comparison of the cMAP (expected 7-mer maps digested by enzyme) of our
BioNano assembly with GRCh38, and identified 783 insertions, 377 deletions in the HX1
genome in comparison to GRCh38.  
Structural variation calling (SMRT long reads)
We identified structural variants >=50 bp from PacBio long-read sequencing data using FES-SV
as previously described
90
. Briefly, we aligned the PacBio reads to GRCh38 using a modified
version of BLASR (https://github.com/mchaisso/blasr) with affine alignment parameters (-bestn
2 -maxAnchorsPerPosition 100 -advanceExactMatches 10 -affineAlign -affineOpen 100 -
affineExtend 0 -insertion 5 -deletion 5 -extend -maxExtendDropoff 50) to identify 49,054
candidate sites of structural variation. We then locally assembled all reads mapping to each
candidate site using MHAP and Celera (8.3rc1) and aligned each local assembly to its original
locus using BLASR with refined affine alignment parameters (-affineAlign -affineOpen 8 -
affineExtend 0 -bestn 1 -maxMatch 30 -sdpTupleSize 13) to identify the precise breakpoints of
all structural variants. Additionally, we performed the same local assemblies in 152,251 sliding
windows across the genome (60 Kbp withi 20 Kbp slide) to discover any additional variants that
Page 68 of 73


might have been missed by the initial candidate discovery. Finally, we annotated the repetitive
content of the sequence associated with each insertion and deletion using RepeatMasker (3.3.0)
with sensitive alignment parameters (-xsmall -no_is -e wublast -s).
References
1. Rabbani, B., Mahdieh, N., Hosomichi, K., Nakaoka, H. & Inoue, I. Next-generation sequencing:
impact of exome sequencing in characterizing Mendelian disorders. J Hum Genet 57, 621-32
(2012).
2. Bamshad, M.J. et al. Exome sequencing as a tool for Mendelian disease gene discovery. Nat Rev
Genet 12, 745-55 (2011).
3. Meyerson, M., Gabriel, S. & Getz, G. Advances in understanding cancer genomes through
second-generation sequencing. Nat Rev Genet 11, 685-96 (2010).
4. Mardis, E.R. The impact of next-generation sequencing technology on genetics. Trends Genet 24,
133-41 (2008).
5. Morozova, O. & Marra, M.A. Applications of next-generation sequencing technologies in
functional genomics. Genomics 92, 255-64 (2008).
6. Genomes Project, C. et al. A map of human genome variation from population-scale sequencing.
Nature 467, 1061-73 (2010).
7. Lam, H.Y. et al. Performance comparison of whole-genome sequencing platforms. Nat
Biotechnol 30, 78-82 (2012).
8. O'Rawe, J. et al. Low concordance of multiple variant-calling pipelines: practical implications for
exome and genome sequencing. Genome Med 5, 28 (2013).
9. Ruffalo, M., LaFramboise, T. & Koyuturk, M. Comparative analysis of algorithms for next-
generation sequencing read alignment. Bioinformatics 27, 2790-6 (2011).
10. Hull, D. et al. Taverna: a tool for building and running workflows of services. Nucleic Acids Res 34,
W729-32 (2006).
11. Reich, M. et al. GenePattern 2.0. Nat Genet 38, 500-1 (2006).
12. Abouelhoda, M., Issa, S.A. & Ghanem, M. Tavaxy: integrating Taverna and Galaxy workflows
with cloud computing support. BMC Bioinformatics 13, 77 (2012).
13. Goecks, J., Nekrutenko, A., Taylor, J. & Galaxy, T. Galaxy: a comprehensive approach for
supporting accessible, reproducible, and transparent computational research in the life sciences.
Genome Biol 11, R86 (2010).
14. Pabinger, S. et al. A survey of tools for variant analysis of next-generation genome sequencing
data. Brief Bioinform (2013).
15. Gentleman, R.C. et al. Bioconductor: open software development for computational biology and
bioinformatics. Genome Biol 5, R80 (2004).
16. Stajich, J.E. et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res 12, 1611-8
(2002).
17. Chang, X. & Wang, K. wANNOVAR: annotating genetic variants for personal genomes via the
web. J Med Genet 49, 433-6 (2012).
18. Krampis, K. et al. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for
the genomics community. BMC Bioinformatics 13, 42 (2012).
19. Nocq, J., Celton, M., Gendron, P., Lemieux, S. & Wilhelm, B.T. Harnessing virtual machines to
simplify next-generation DNA sequencing analysis. Bioinformatics 29, 2075-83 (2013).
Page 69 of 73


20. Angiuoli, S.V. et al. CloVR: a virtual machine for automated and portable sequence analysis from
the desktop using cloud computing. BMC Bioinformatics 12, 356 (2011).
21. Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Res 19,
1124-32 (2009).
22. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S.L. Ultrafast and memory-efficient alignment of
short DNA sequences to the human genome. Genome Biol 10, R25 (2009).
23. Langmead, B. & Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat Methods 9, 357-9
(2012).
24. Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966-
7 (2009).
25. Zaharia, M. et al. Faster and More Accurate Sequence Alignment with SNAP. in ArXiv e-prints Vol.
1111 5572 (2011).
26. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-
generation DNA sequencing data. Genome Res 20, 1297-303 (2010).
27. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-9
(2009).
28. Koboldt, D.C. et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer
by exome sequencing. Genome Res 22, 568-76 (2012).
29. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156-8 (2011).
30. R. Pandya, W.B., M. Zaharia, T. Sittler, K. Curtis, C. Hartl, A. Fox, S. Schenker, I. Stoica, D.
Patterson. SNAP: fast, accurate sequence alignment enabling biological applications. ASHG
meeting 2014, San Diego (2014).
31. Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from
high-throughput sequencing data. Nucleic Acids Res 38, e164 (2010).
32. Wang, W., Wei, Z., Lam, T.W. & Wang, J. Next generation sequencing has lower sequence
coverage and poorer SNP-detection capability in the regulatory regions. Sci Rep 1, 55 (2011).
33. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage
analyses. Am J Hum Genet 81, 559-75 (2007).
34. Utsunomiya, Y.T. et al. mendelFix: a Perl script for checking Mendelian errors in high density SNP
data of trio designs. arXiv:1306.2243 (2013).
35. Chen, H. VennDiagram: Generate high-resolution Venn and Euler plots. CRAN (2011).
36. Altshuler, D.M. et al. An integrated map of genetic variation from 1,092 human genomes.
Nature 491, 56-65 (2012).
37. DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation
DNA sequencing data. Nat Genet 43, 491-8 (2011).
38. Lyon, G.J. et al. Exome sequencing and unrelated findings in the context of complex disease
research: ethical and clinical implications. Discovery medicine 12, 41 (2011).
39. Guo, Y., Ding, X., Shen, Y., Lyon, G.J. & Wang, K. SeqMule: automated pipeline for analysis of
human exome/genome sequencing data. Sci Rep 5, 14283 (2015).
40. Genomes Project, C. et al. An integrated map of genetic variation from 1,092 human genomes.
Nature 491, 56-65 (2012).
41. Li, B. et al. A likelihood-based framework for variant calling and de novo mutation detection in
families. PLoS Genet 8, e1002944 (2012).
42. Ramu, A. et al. DeNovoGear: de novo indel and point mutation discovery and phasing. Nat
Methods 10, 985-7 (2013).
43. Peng, G. et al. Rare variant detection using family-based sequencing analysis. Proc Natl Acad Sci
U S A 110, 3985-90 (2013).
Page 70 of 73


44. Nielsen, R., Korneliussen, T., Albrechtsen, A., Li, Y. & Wang, J. SNP calling, genotype calling, and
sample allele frequency estimation from New-Generation Sequencing data. PLoS One 7, e37558
(2012).
45. Afgan, E. et al. Galaxy CloudMan: delivering cloud compute clusters. BMC Bioinformatics 11
Suppl 12, S4 (2010).
46. Blanca, J.M., Pascual, L., Ziarsolo, P., Nuez, F. & Canizares, J. ngs_backbone: a pipeline for read
cleaning, mapping and SNP calling using next generation sequence. BMC Genomics 12, 285
(2011).
47. Shi, L. et al. inverted question markGenotype-first inverted question mark approaches on a
curious case of idiopathic progressive cognitive decline. BMC Med Genomics 7, 66 (2014).
48. Jia, H., Guo, Y., Zhao, W. & Wang, K. Long-range PCR in next-generation sequencing: comparison
of six enzymes and evaluation on the MiSeq sequencer. Sci Rep 4, 5737 (2014).
49. Zhang, X. et al. Exome sequencing on malignant meningiomas identified mutations in
neurofibromatosis type 2 (NF2) and meningioma 1 (MN1) genes. Discov Med 18, 301-11 (2014).
50. Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide association
loci for human diseases and traits. Proc Natl Acad Sci U S A 106, 9362-7 (2009).
51. Maurano, M.T. et al. Systematic localization of common disease-associated variation in
regulatory DNA. Science 337, 1190-5 (2012).
52. Bernstein, B.E. et al. An integrated encyclopedia of DNA elements in the human genome. Nature
489, 57-74 (2012).
53. Schaub, M.A., Boyle, A.P., Kundaje, A., Batzoglou, S. & Snyder, M. Linking disease associations
with regulatory information in the human genome. Genome Res 22, 1748-59 (2012).
54. McCarthy, M.I. & Hirschhorn, J.N. Genome-wide association studies: potential next steps on a
genetic journey. Hum Mol Genet 17, R156-65 (2008).
55. Ward, L.D. & Kellis, M. HaploReg: a resource for exploring chromatin states, conservation, and
regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res 40, D930-
4 (2012).
56. Boyle, A.P. et al. Annotation of functional variation in personal genomes using RegulomeDB.
Genome Res 22, 1790-7 (2012).
57. Li, M.J., Wang, L.Y., Xia, Z., Sham, P.C. & Wang, J. GWAS3D: Detecting human regulatory variants
by integrative analysis of genome-wide associations, chromosome interactions and histone
modifications. Nucleic Acids Res 41, W150-8 (2013).
58. Coetzee, S.G., Rhie, S.K., Berman, B.P., Coetzee, G.A. & Noushmehr, H. FunciSNP: an
R/bioconductor tool integrating functional non-coding data sets with genetic association studies
to identify candidate regulatory SNPs. Nucleic Acids Res 40, e139 (2012).
59. Pruim, R.J. et al. LocusZoom: regional visualization of genome-wide association scan results.
Bioinformatics 26, 2336-7 (2010).
60. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding
principles of the human genome. Science 326, 289-93 (2009).
61. Ernst, J. & Kellis, M. ChromHMM: automating chromatin-state discovery and characterization.
Nat Methods 9, 215-6 (2012).
62. Liang, G. et al. Distinct localization of histone H3 acetylation and H3-K4 methylation to the
transcription start sites in the human genome. Proc Natl Acad Sci U S A 101, 7357-62 (2004).
63. Barski, A. et al. High-resolution profiling of histone methylations in the human genome. Cell 129,
823-37 (2007).
64. Koch, C.M. et al. The landscape of histone modifications across 1% of the human genome in five
human cell lines. Genome Res 17, 691-707 (2007).
Page 71 of 73


65. Bartke, T. et al. Nucleosome-interacting proteins regulated by DNA and histone methylation. Cell
143, 470-84 (2010).
66. Pique-Regi, R. et al. Accurate inference of transcription factor binding from DNA sequence and
chromatin accessibility data. Genome Res 21, 447-55 (2011).
67. Gilad, Y., Rifkin, S.A. & Pritchard, J.K. Revealing the architecture of gene regulation: the promise
of eQTL studies. Trends Genet 24, 408-15 (2008).
68. Li, W., Notani, D. & Rosenfeld, M.G. Enhancers as non-coding RNA transcription units: recent
insights and future perspectives. Nat Rev Genet 17, 207-23 (2016).
69. Yang, X. et al. Genome-wide association study for serum complement C3 and C4 levels in
healthy Chinese subjects. PLoS Genet 8, e1002916 (2012).
70. Makinde, V.A., Senaldi, G., Jawad, A.S., Berry, H. & Vergani, D. Reflection of disease activity in
rheumatoid arthritis by indices of activation of the classical complement pathway. Ann Rheum
Dis 48, 302-6 (1989).
71. Zeller, T. et al. Genetics and beyond--the transcriptome of human monocytes and disease
susceptibility. PLoS One 5, e10693 (2010).
72. Wellcome Trust Case Control, C. Genome-wide association study of 14,000 cases of seven
common diseases and 3,000 shared controls. Nature 447, 661-78 (2007).
73. Kottgen, A. et al. Genome-wide association analyses identify 18 new loci associated with serum
urate concentrations. Nat Genet 45, 145-54 (2013).
74. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing.
Genome Res 20, 265-72 (2010).
75. Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from massively parallel
sequence data. Proc Natl Acad Sci U S A 108, 1513-8 (2011).
76. Alkan, C., Sajjadian, S. & Eichler, E.E. Limitations of next-generation genome sequence assembly.
Nat Methods 8, 61-5 (2011).
77. Chaisson, M.J., Wilson, R.K. & Eichler, E.E. Genetic variation and the de novo assembly of human
genomes. Nat Rev Genet (2015).
78. Cao, H. et al. De novo assembly of a haplotype-resolved human genome. Nat Biotechnol 33, 617-
22 (2015).
79. Pendleton, M. et al. Assembly and diploid architecture of an individual human genome via
single-molecule technologies. Nat Methods 12, 780-6 (2015).
80. Jakobsson, M. et al. Genotype, haplotype and copy-number variation in worldwide human
populations. Nature 451, 998-1003 (2008).
81. Sudmant, P.H. et al. Global diversity, population stratification, and selection of human copy
number variation. Science (2015).
82. Sudmant, P.H. et al. An integrated map of structural variation in 2,504 human genomes. Nature
526, 75-81 (2015).
83. Wang, J. et al. The diploid genome sequence of an Asian individual. Nature 456, 60-5 (2008).
84. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol 5, e254 (2007).
85. Li, R. et al. Building the sequence map of the human pan-genome. Nat Biotechnol 28, 57-63
(2010).
86. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133-8
(2009).
87. Cao, H., Tegenfeldt, J.O., Austin, R.H. & Chou, S.Y. Gradient nanostructures for interfacing
microfluidics and nanofluidics. Appl Phys Lett 81, 3058-3060 (2002).
88. Chin, C.S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT
sequencing data. Nat Methods 10, 563-9 (2013).
Page 72 of 73


89. Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol 5, R12
(2004).
90. Chaisson, M.J. et al. Resolving the complexity of the human genome using single-molecule
sequencing. Nature 517, 608-11 (2015).
91. Lam, E.T. et al. Genome mapping on nanochannel arrays for structural variation analysis and
sequence assembly. Nat Biotechnol 30, 771-6 (2012).
92. Layer, R.M., Chiang, C., Quinlan, A.R. & Hall, I.M. LUMPY: a probabilistic framework for structural
variant discovery. Genome Biol 15, R84 (2014).
93. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012).
94. International Human Genome Sequencing, C. Finishing the euchromatic sequence of the human
genome. Nature 431, 931-45 (2004).
95. Whole genome? Nat Genet 47, 963 (2015).
96. Myers, E.W. The fragment assembly string graph. Bioinformatics 21 Suppl 2, ii79-85 (2005).
97. Delcher, A.L. et al. Alignment of whole genomes. Nucleic Acids Res 27, 2369-76 (1999).
98. Delcher, A.L., Phillippy, A., Carlton, J. & Salzberg, S.L. Fast algorithms for large-scale genome
alignment and comparison. Nucleic Acids Res 30, 2478-83 (2002).
99. Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res 25, 3389-402 (1997).
100. Harris, R.S. IMPROVED PAIRWISE ALIGNMENT OF GENOMIC DNA. PhD Thesis (2007).
101. Schwartz, S. et al. PipMaker--a web server for aligning two genomic DNA sequences. Genome
Res 10, 577-86 (2000).
102. Schwartz, S. et al. Human-mouse alignments with BLASTZ. Genome Res 13, 103-7 (2003).
103. Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol 9, R137 (2008).
104. Feng, J., Liu, T. & Zhang, Y. Using MACS to identify peaks from ChIP-Seq data. Curr Protoc
Bioinformatics Chapter 2, Unit 2 14 (2011).
105. Abyzov, A., Urban, A.E., Snyder, M. & Gerstein, M. CNVnator: an approach to discover, genotype,
and characterize typical and atypical CNVs from family and population genome sequencing.
Genome research 21, 974-984 (2011).
106. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv
preprint arXiv:1303.3997 (2013).


Page 73 of 73 
Asset Metadata
Creator Guo, Yunfei (author) 
Core Title Novel statistical and computational methods for analyzing genome variation 
Contributor Electronically uploaded by the author (provenance) 
School Keck School of Medicine 
Degree Doctor of Philosophy 
Degree Program Biostatistics 
Publication Date 06/30/2016 
Defense Date 05/31/2016 
Publisher University of Southern California (original), University of Southern California. Libraries (digital) 
Tag annotation,BioNano,causal variant,de novo assembly,GWAS,indel,NGS,OAI-PMH Harvest,pipeline,SMRT sequencing,SNP,Visualization 
Format application/pdf (imt) 
Language English
Advisor Conti, David (committee chair), Thomas, Paul D. (committee member), Wang, Kai (committee member) 
Creator Email guoyunfei1989@gmail.com,yunfeigu@usc.edu 
Permanent Link (DOI) https://doi.org/10.25549/usctheses-c40-259466 
Unique identifier UC11280298 
Identifier etd-GuoYunfei-4491.pdf (filename),usctheses-c40-259466 (legacy record id) 
Legacy Identifier etd-GuoYunfei-4491.pdf 
Dmrecord 259466 
Document Type Dissertation 
Format application/pdf (imt) 
Rights Guo, Yunfei 
Type texts
Source University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection) 
Access Conditions The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law.  Electronic access is being provided by the USC Libraries in agreement with the a... 
Repository Name University of Southern California Digital Library
Repository Location USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Abstract (if available)
Abstract Chapter 1. Automated, Accurate And Flexible Variant Detection: Next-generation sequencing (NGS) technology has greatly improved our ability to identify disease-contributory variants for Mendelian diseases. However, users are often faced with issues such as software compatibility, complicated configuration, and no access to high-performance computing facility. More importantly, discrepancies exist among different aligners and variant callers. We developed a computational pipeline, SeqMule, and a number of helper methods, to perform automated variant calling from NGS data on human genomes and exomes. SeqMule integrates computational-cluster-free parallelization capability built on top of the variant callers, and facilitates normalization/intersection of variant calls to generate consensus set with high confidence. SeqMule integrates 5 alignment tools, 5 variant calling algorithms and accepts various combinations all by one-line command, therefore allowing highly flexible yet fully automated variant calling. In a modern machine (2 Intel Xeon X5650 CPUs, 48GB memory), when fast turn-around is needed, SeqMule generates annotated VCF files in a day from a 30X whole-genome sequencing data set 
Tags
annotation
BioNano
causal variant
de novo assembly
GWAS
indel
NGS
pipeline
SMRT sequencing
SNP
Linked assets
University of Southern California Dissertations and Theses
doctype icon
University of Southern California Dissertations and Theses 
Action button