Resource
A cloud-compatible bioinformatics pipeline
for ultrarapid pathogen identification
from next-generation sequencing of clinical samples
Samia N. Naccache,1,2 Scot Federman,1,2 Narayanan Veeraraghavan,1,2 Matei Zaharia,3 Deanna Lee,1,2 Erik Samayoa,1,2 Jerome Bouquet,1,2 Alexander L. Greninger,4 Ka-Cheung Luk,5 Barryett Enge,6 Debra A. Wadford,6 Sharon L. Messenger,6 Gillian L. Genrich,1 Kristen Pellegrino,7 Gilda Grard,8 Eric Leroy,8 Bradley S. Schneider,9 Joseph N. Fair,9 Miguel A. Martínez,10 Pavel Isa,10 John A. Crump,11,12,13 Joseph L. DeRisi,4 Taylor Sittler,1 John Hackett, Jr.,5 Steve Miller,1,2 and Charles Y. Chiu1,2,14,15
1Department of Laboratory Medicine, UCSF, San Francisco, California 94107, USA; 2UCSF-Abbott Viral Diagnostics and Discovery Center, San Francisco, California 94107, USA; 3Department of Computer Science, University of California, Berkeley, California 94720, USA; 4Department of Biochemistry, UCSF, San Francisco, California 94107, USA; 5Abbott Diagnostics, Abbott Park, Illinois 60064, USA; 6Viral and Rickettsial Disease Laboratory, California Department of Public Health, Richmond, California 94804, USA; 7Department of Family and Community Medicine, UCSF, San Francisco, California 94143, USA; 8Viral Emergent Diseases Unit, Centre International de Recherches Médicales de Franceville, Franceville, BP 769, Gabon; 9Metabiota, Inc., San Francisco, California 94104, USA; 10Departamento de Genética del Desarrollo y Fisiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, 62260, Mexico; 11Division of Infectious Diseases and International Health and the Duke Global Health Institute, Duke University Medical Center, Durham, North Carolina 27708, USA; 12Kilimanjaro Christian Medical Centre, Moshi, Kilimanjaro, 7393, Tanzania; 13Centre for International Health, University of Otago, Dunedin, 9054, New Zealand; 14Department of Medicine, Division of Infectious Diseases, UCSF, San Francisco, California 94143, USA
Unbiased next-generation sequencing (NGS) approaches enable comprehensive pathogen detection in the clinical microbiology laboratory and have numerous applications for public health surveillance, outbreak investigation, and the diagnosis of infectious diseases. However, practical deployment of the technology is hindered by the bioinformatics challenge of analyzing results accurately and in a clinically relevant timeframe. Here we describe SURPI ("sequence-based ultrarapid pathogen identification"), a computational pipeline for pathogen identification from complex metagenomic NGS data generated from clinical samples, and demonstrate use of the pipeline in the analysis of 237 clinical samples comprising more than 1.1 billion sequences. Deployable on both cloud-based and standalone servers, SURPI leverages two state-of-the-art aligners for accelerated analyses, SNAP and RAPSearch, which are as accurate as existing bioinformatics tools but orders of magnitude faster in performance. In fast mode, SURPI detects viruses and bacteria by scanning data sets of 7–500 million reads in 11 min to 5 h, while in comprehensive mode, all known microorganisms are identified, followed by de novo assembly and protein homology searches for divergent viruses in 50 min to 16 h. SURPI has also directly contributed to real-time microbial diagnosis in acutely ill patients, underscoring its potential key role in the development of unbiased NGS-based clinical assays in infectious diseases that demand rapid turnaround times.
[Supplemental material is available for this article.]
There is great interest in the use of unbiased next-generation sequencing (NGS) technology for comprehensive detection of pathogens from clinical samples (Dunne et al. 2012; Wylie et al. 2012; Chiu 2013; Firth and Lipkin 2013). Conventional diagnostic testing for pathogens is narrow in scope and fails to detect the etiologic agent in a significant percentage of cases (Barnes et al. 1998; Louie et al. 2005; van Gageldonk-Lafeber et al. 2005; Bloch and Glaser 2007; Denno et al. 2012). Failure to accurately diagnose and treat infection in a timely fashion contributes to continued transmission and increased mortality in hospitalized patients (Kollef et al. 2008). Ongoing discovery of novel pathogens, such as Bas-Congo rhabdovirus (Grard et al. 2012) and MERS (Middle East Respiratory Syndrome) coronavirus (Zaki et al. 2012), also underscores the need for rapid, broad-spectrum diagnostic assays that are able to recognize these emerging agents.
© 2014 Naccache et al. This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at /licenses/by-nc/4.0/.
15Corresponding author
E-mail charles.chiu@ucsf.edu
Article published online before print. Article, supplemental material, and publication date are at /cgi/doi/10.1101/gr.171934.113.
Freely available online through the Genome Research Open Access option.
24:000–000 Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/ Genome Research 1
Unbiased NGS holds the promise of identifying all potential pathogens in a single assay without a priori knowledge of the target. Given sufficiently long read lengths, multiple hits to the microbial genome, and a well-annotated reference database, nearly all microorganisms can be uniquely identified on the basis of their specific nucleic acid sequence. Thus, NGS has widespread microbiological applications, including infectious disease diagnosis in clinical laboratories (Dunne et al. 2012), pathogen discovery in acute and chronic illnesses of unknown origin (Chiu 2013), and outbreak investigation on a global level (Firth and Lipkin 2013). However, the latest NGS laboratory workflows incur minimum turnaround times exceeding 8 h from clinical sample to sequence (Quail et al. 2012). Thus, it is critical that subsequent computational analyses of NGS data be performed within a timeframe suitable for actionable responses in clinical medicine and public health (i.e., minutes to hours). Such pipelines must also retain sensitivity, accuracy, and throughput in detecting a broad range of clinically relevant pathogenic microorganisms.
Computational analysis of metagenomic NGS data for pathogen identification remains challenging for several reasons. First, alignment/classification algorithms must contend with massive amounts of sequence data. Recent advances in NGS technologies have resulted in instruments that are capable of producing >100 gigabases (Gb) of reads in a day (Loman et al. 2012). Reference databases of host and pathogen sequences range in size from 2 Gb for viruses to 3.1 Gb for the human genome to 42 Gb for all nucleotide sequences in the National Center for Biotechnology Information (NCBI) nucleotide (nt) collection (NCBI nt DB) as of January 2013. Second, only a small fraction of short NGS reads in clinical metagenomic data typically correspond to pathogens (a "needle-in-a-haystack" problem) (Kostic et al. 2012; Wylie et al. 2012; Yu et al. 2012), and such sparse reads often do not overlap sufficiently to permit de novo assembly into longer contiguous sequences (contigs) (Kostic et al. 2011). Thus, individual reads, typically only 100–300 nucleotides (nt) in length, must be classified to a high degree of accuracy. Finally, novel microorganisms with divergent genomes, particularly viruses, are not adequately represented in existing reference databases and often can only be identified on the basis of remote amino acid homology (Xu et al. 2011; Grard et al. 2012).
To address these challenges, the most widely used approach is computational subtraction of reads corresponding to the host (e.g., human), followed by alignment to reference databases that contain sequences from candidate pathogens (MacConaill and Meyerson 2008; Greninger et al. 2010; Kostic et al. 2011; Zhao et al. 2013). Traditionally, the BLAST algorithm (Altschul et al. 1990) is used for classification of human and nonhuman reads at the nucleotide level (BLASTn), followed by low-stringency protein alignments using a translated nucleotide query (BLASTx) for detection of divergent sequences from novel pathogens (Delwart 2007; Briese et al. 2009; Xu et al. 2011; Grard et al. 2012; Chiu 2013). However, BLAST is too slow for routine analysis of NGS metagenomics data (Niu et al. 2011), and end-to-end processing times, even on multicore computational servers, can take several days to weeks. Analysis pipelines that use faster, albeit less sensitive, algorithms upfront for host computational subtraction, such as PathSeq (Kostic et al. 2011), still rely on traditional BLAST approaches for final pathogen determination. In addition, whereas PathSeq works well for tissue samples in which the vast majority of reads are host-derived and thus subject to subtraction, the pipeline becomes computationally prohibitive when analyzing complex clinical metagenomic samples open to the environment, such as respiratory secretions or stool (Fig. 1B; Supplemental Table S1). Other published pipelines are focused solely on limited detection of specific types of microorganisms, are unable to identify highly divergent novel pathogens, and/or utilize computationally taxing algorithms such as BLAST (Bhaduri et al. 2012; Borozan et al. 2012; Dimon et al. 2013; Naeem et al. 2013; Wang et al. 2013; Zhao et al. 2013). Furthermore, few data have been reported to date on the real-life performance of these pipelines for pathogen identification in clinical samples.
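The subtract-then-classify strategy described above can be illustrated with a toy sketch. It is not how BLAST, PathSeq, or SURPI are implemented: exact k-mer lookup stands in for a real aligner, and the function names (`kmers`, `subtract_host`, `classify`), the sequences, and the value of `K` are all hypothetical, chosen only to make the two stages visible.

```python
from collections import Counter

def kmers(seq, k):
    """Return the set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def subtract_host(reads, host_kmers, k, min_shared=1):
    """Stage 1: drop reads that share at least min_shared k-mers with the host."""
    return [r for r in reads if len(kmers(r, k) & host_kmers) < min_shared]

def classify(reads, references, k):
    """Stage 2: assign each remaining read to the reference sharing the most
    k-mers with it, or to 'unclassified' if no reference shares any."""
    ref_kmers = {name: kmers(seq, k) for name, seq in references.items()}
    counts = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        best_name, best_hits = "unclassified", 0
        for name, ref in ref_kmers.items():
            hits = len(read_kmers & ref)
            if hits > best_hits:
                best_name, best_hits = name, hits
        counts[best_name] += 1
    return counts

# Hypothetical toy sequences (not real genomes).
K = 8
host = "ACGTACGTTAGCCGATAGGCTA"
references = {
    "virus_A": "TTTGGGCCCAAATTTGGGAC",
    "bacterium_B": "GATTACAGATTACACCGGTT",
}
reads = ["ACGTACGTTAGC", "TTTGGGCCCAAA", "GATTACAGATTA", "CCCCCCCCCCCC"]

remaining = subtract_host(reads, kmers(host, K), K)
result = classify(remaining, references, K)
print(len(remaining), dict(result))
```

In this sketch the host-derived read is removed before classification, so the second stage only has to handle the smaller nonhost fraction; that reduction is why subtraction pays off for host-dominated samples and less so for samples such as stool.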
Here we describe SURPI ("sequence-based ultrarapid pathogen identification"), a cloud-compatible bioinformatics analysis pipeline that provides extensive classification of reads against viral and bacterial databases in fast mode and against the entire NCBI nt DB in comprehensive mode (Fig. 1A). Novel pathogens are also identified in comprehensive mode by amino acid alignment to viral and/or NCBI nr protein databases. Notably, SURPI generates results in a clinically actionable timeframe of minutes to hours by leveraging two alignment tools, SNAP (Fig. 1C; Zaharia et al. 2011) and RAPSearch (Fig. 1D; Zhao et al. 2012), which have computational times that are orders of magnitude faster than other available algorithms. Here we evaluate the performance of these tools for pathogen detection using both in silico-generated and clinical data and describe use of the SURPI pipeline in the analysis of 15 independent NGS data sets consisting of 157 clinical samples multiplexed across 47 barcodes and including over 1.1 billion reads. These data sets encompass a variety of clinical infections, detected pathogens, sample types, and depths of coverage. We also demonstrate use of the pipeline for detection of emerging novel outbreak viruses and for clinical diagnosis of a case of unknown fever in a returning traveler.
Results
Accuracy of SURPI aligners (SNAP and RAPSearch) using in silico data
The accuracy of SURPI was evaluated by benchmarking its nucleotide alignment tool, SNAP, against BLASTn and two other aligners commonly used for human genome mapping, BWA (Li and Durbin 2009) and Bowtie2 (BT2) (Fig. 2A–E; Langmead and Salzberg 2012). In addition, SURPI's protein similarity search tool, RAPSearch (Zhao et al. 2012), was directly compared to BLASTx (Fig. 2F; Altschul et al. 1990). A query data set of 100 base pair (bp) reads was randomly generated in silico from human, bacterial, and viral reference databases. The data set consisted of 1 million human reads, 250,000 bacterial reads, 25,000 viral reads, and 1000 reads each from four known viruses (norovirus, ebolavirus, human immunodeficiency virus [HIV-1], and influenza A), and three divergent "novel" viruses whose genomes had been removed a priori from the reference database (Supplemental Table S2): Bas-Congo rhabdovirus (BASV) (Grard et al. 2012), titi monkey adenovirus (TMAdV) (Chen et al. 2011), and bat influenza H17N10 (Tong et al. 2012). Receiver operating characteristic (ROC) curves (Akobeng 2007) were generated to assess the sensitivity and specificity of each aligner in classifying human, bacterial, or viral reads. All nucleotide aligners shared >99.5% optimal sensitivity and specificity for human sequence identification (Fig. 2A), with SNAP exhibiting the highest specificity (>99.8%) and comparable sensitivity to BLASTn (99.9% versus 100%). For bacterial detection (Fig. 2B), SNAP was more accurate than BWA and BT2, and exhibited reduced sensitivity (99.5%) albeit superior specificity (98.5%) relative to BLASTn (100% and 97.9%), as was also the trend for viral detection (Fig. 2C). The accuracy of all four tools in identifying sequences from known viruses was comparable (Fig. 2D), but SNAP and BLASTn were superior to BWA and BT2 in identifying reads from divergent viruses using low-stringency parameters (Fig. 2E). Nevertheless, the overall poor performance of all four nucleotide aligners in detecting divergent viral reads (<20% sensitivity) underscored the need for translated nucleotide alignment algorithms such as RAPSearch and BLASTx (Briese et al. 2009; Xu et al. 2011; Grard et al. 2012; Swei et al. 2013). By ROC curve analysis, these two algorithms performed similarly in the detection of sequences from divergent viral genomes (Fig. 2F).
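As a minimal sketch of how the points on such an ROC curve can be computed, assume each read carries a known true label and an alignment score for which lower means a stronger match (as with an E-value or an edit distance). The function name `roc_points` and the toy scores below are hypothetical and are not the authors' evaluation code or data.

```python
def roc_points(scores, is_target, thresholds):
    """For each threshold, call a read positive if its score <= threshold and
    return a list of (threshold, sensitivity, specificity) tuples."""
    positives = sum(is_target)
    negatives = len(is_target) - positives
    points = []
    for t in thresholds:
        # True positives: target reads called positive at this threshold.
        tp = sum(1 for s, y in zip(scores, is_target) if y and s <= t)
        # False positives: non-target reads called positive at this threshold.
        fp = sum(1 for s, y in zip(scores, is_target) if not y and s <= t)
        sensitivity = tp / positives
        specificity = (negatives - fp) / negatives
        points.append((t, sensitivity, specificity))
    return points

# Toy data: one score per read; True marks reads that truly come from the target.
scores = [0, 3, 8, 14, 5, 20]
is_target = [True, True, True, True, False, False]

points = roc_points(scores, is_target, thresholds=[4, 12, 16])
for t, sens, spec in points:
    print(f"threshold={t:>2}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Sweeping the threshold from strict to permissive trades specificity for sensitivity; plotting sensitivity against 1 − specificity for each threshold gives the ROC curve.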
Speed of SURPI aligners (SNAP and RAPSearch) using in silico data
The computational speed of SNAP relative to BLASTn, BT2, and BWA in aligning NGS reads to the human hg19 database (human DB) was evaluated using progressively larger in silico query data sets of 1.25 million, 25 million, 125 million, and 1.25 billion reads (Fig. 3A; Supplemental Tables S2, S3). BLASTn alignments were associated with prohibitively long run times, consuming >19 h to analyze only 1.25 million reads, with proportionally longer estimated times for the larger data sets. Although all three remaining aligners performed comparably well with the 1.25 million read data set, SNAP scaled significantly better with larger data sets and was 23–87× faster than BWA and BT2.
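The proportional estimate mentioned for BLASTn amounts to a linear extrapolation in the number of reads. The sketch below assumes strictly linear scaling and takes the reported lower bound of 19 h for 1.25 million reads as its reference point, so its outputs are lower bounds; the estimates plotted in Figure 3A may have been derived differently, and the function name `extrapolate_hours` is hypothetical.

```python
def extrapolate_hours(ref_reads, ref_hours, target_reads):
    """Scale a measured run time linearly with the number of reads."""
    return ref_hours * (target_reads / ref_reads)

REF_READS = 1_250_000  # smallest in silico data set
REF_HOURS = 19         # reported lower bound for BLASTn on that data set

for n in (25_000_000, 125_000_000, 1_250_000_000):
    hours = extrapolate_hours(REF_READS, REF_HOURS, n)
    print(f"{n:>13,} reads: at least {hours:,.0f} h (about {hours / 24:,.0f} days)")
```

Under this assumption even the 25 million read data set would occupy a server for more than two weeks, which is why measured BLASTn timings were only practical for the smallest data set.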
Next, we investigated the feasibility of using the SNAP algorithm to align reads to all sequences in the 42 Gb NCBI nt DB. Computational subtraction of human host sequences followed by SNAP alignment to the entire NCBI nt DB was accomplished in under 1 h for 1.25 million and <40 h for 1.25 billion reads (Fig. 3B,C). Overall timing metrics for SURPI, whether using a cloud server or local server, were comparable (Fig. 3B), likely due to the use of high-performance, low-latency solid-state drives (SSDs) for the cloud (Supplemental Methods). We also benchmarked the speed of RAPSearch relative to BLASTx in aligning translated query reads to a viral protein database (viral protein DB). RAPSearch was found to be 5–10× faster than BLASTx across all query data sets (Fig. 3C).

Figure 1. The SURPI pipeline for pathogen detection. (A) A schematic overview of the SURPI pipeline. Raw NGS reads are preprocessed by removal of adapter, low-quality, and low-complexity sequences, followed by computational subtraction of human reads using SNAP. In fast mode, viruses and bacteria are identified by SNAP alignment to viral and bacterial nucleotide databases. In comprehensive mode, reads are aligned using SNAP to all nucleotide sequences in the NCBI nt collection, enabling identification of bacteria, fungi, parasites, and viruses. For pathogen discovery of divergent microorganisms, unmatched reads and contigs generated from de novo assembly are then aligned to a viral protein database or all protein sequences in the NCBI nr collection using RAPSearch. SURPI reports include a list of all classified reads with taxonomic assignments, a summary table of read counts, and both viral and bacterial genomic coverage maps. (B) Relative proportion of NGS reads classified as human, bacterial, viral, or other in different clinical sample types. (C) The SNAP nucleotide aligner (Zaharia et al. 2011). SNAP aligns reads by generating a hash table of sequences of length "s" from the reference database and then comparing the hash index with "n" seeds of length "s" generated from the query sequence, producing a match based on the edit distance "d." (D) The RAPSearch protein similarity search tool (Zhao et al. 2012). RAPSearch aligns translated nucleotide queries to a protein database using a compressed amino acid alphabet at the level of chemical similarity for greatly increased processing speed.
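The seed-and-verify scheme summarized for SNAP in Figure 1C (a hash table of length-"s" reference substrings, "n" seeds from the query, and acceptance based on an edit distance "d") can be sketched in a few lines. This is a toy illustration under simplifying assumptions, not SNAP's implementation: seeds are taken at evenly spaced offsets, candidates are verified with a plain dynamic-programming Levenshtein distance against a reference window of the same length as the read, and all names and sequences are hypothetical.

```python
from collections import defaultdict

def build_index(reference, s):
    """Hash table mapping every length-s substring to its reference positions."""
    index = defaultdict(list)
    for i in range(len(reference) - s + 1):
        index[reference[i:i + s]].append(i)
    return index

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def align(read, reference, index, s, n, d):
    """Look up at most n seeds of length s, then verify each candidate start.
    Returns (start, distance) for the best hit with distance <= d, else None."""
    step = max(1, (len(read) - s) // max(1, n - 1))
    offsets = list(range(0, len(read) - s + 1, step))[:n]
    candidates = set()
    for off in offsets:
        for pos in index.get(read[off:off + s], []):
            start = pos - off  # implied start of the whole read on the reference
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    best = None
    for start in sorted(candidates):
        dist = edit_distance(read, reference[start:start + len(read)])
        if dist <= d and (best is None or dist < best[1]):
            best = (start, dist)
    return best

# Hypothetical reference and a read copied from position 8 with one substitution.
reference = "ACGTTGCAGGCTAACGTTAGCCATGGATCCGTA"
read = "GGCTAACGTCAGCCATGGAT"
index = build_index(reference, s=6)

hit = align(read, reference, index, s=6, n=3, d=2)
miss = align("T" * 20, reference, index, s=6, n=3, d=2)
print(hit, miss)
```

The expensive edit-distance computation runs only at positions nominated by a seed hit, which is the general reason hash-based aligners of this kind can be much faster than exhaustive search; a read whose seeds all contain mismatches is missed, which is one way sensitivity can drop for divergent sequences.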
Accuracy of the SURPI aligners (SNAP and RAPSearch) using clinical sample data
To evaluate the "real-life" performance of SNAP relative to BLASTn, BT2, and BWA, and of RAPSearch relative to BLASTx, ROC curves for viral detection were generated from three computationally challenging NGS data sets (Fig. 4A–C; Supplemental Table S3). The query sets corresponded to (1) a complex metagenomic stool sample pool (Yu et al. 2012) harboring a respiratory syncytial virus (RSV) strain with ~10% genomic sequence divergence, (2) a nasal swab sample pool from patients infected with 2009 pandemic influenza H1N1 [influenza A(H1N1)pdm09] and sequenced using short 65-bp reads (Greninger et al. 2010), and (3) a serum sample from a patient with hantavirus pulmonary syndrome from Sin Nombre virus (SNV) infection (Nunez et al. 2014) and sequenced using longer 250-bp reads. Both SNAP and BLASTn exhibited greater sensitivity than BWA and BT2 in detection of reads corresponding to these three viruses. Across all three data sets, 100% specificity was retained using an expectation value (E-value) cutoff of 1 × 10⁻¹⁵ for BLASTn and an edit distance of 12 for SNAP. At that threshold cutoff, the sensitivities of SNAP and BLASTn for detection of each virus were similar (84.3%/85.1% for RSV, 99.6%/98.7% for influenza A(H1N1)pdm09, and 93.8%/99.7% for hantavirus). Among the 15 true hantavirus reads not detected by SNAP and accounting for the reduced 93.8% sensitivity, 10 were found to be chimeric reads, while five were reads with internal regions of low-quality data. Cropping the long 250-bp reads in the hantavirus data set to 75 bp improved sensitivity from 93.8% to 98.1% due to increased detection of these previously undetected reads without affecting specificity (Fig. 4C). By ROC curve analysis, RAPSearch had comparable accuracy to BLASTx across all three clinical data sets (Fig. 4A–C, bottom panels).
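Why cropping can rescue a chimeric read under a fixed distance cutoff is easy to see in a toy example. The sketch below builds a hypothetical 250-bp chimeric read whose first 100 bases match a reference and whose remainder does not, then applies the 75-bp crop and the cutoff of 12 given in the text. For brevity it counts mismatches (Hamming distance) rather than computing a true edit distance, and the sequences and helper names are assumptions for illustration only.

```python
import random

CROP_LENGTH = 75   # crop length given in the text
MAX_DISTANCE = 12  # distance cutoff given in the text

def hamming(a, b):
    """Number of mismatching positions (a simplification of edit distance)."""
    return sum(x != y for x, y in zip(a, b))

def passes(read, reference_window, crop=None):
    """Accept the read if its distance to the reference window is within the cutoff,
    optionally keeping only the first `crop` bases of the read."""
    if crop is not None:
        read = read[:crop]
    return hamming(read, reference_window[:len(read)]) <= MAX_DISTANCE

# Hypothetical 250-bp reference window generated from a fixed seed.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(250))

# Chimeric read: first 100 bases match; every later base is changed to another base.
shift = str.maketrans("ACGT", "CGTA")
chimeric = reference[:100] + reference[100:].translate(shift)

full = passes(chimeric, reference)
cropped = passes(chimeric, reference, crop=CROP_LENGTH)
print(f"full-length read accepted: {full}; cropped read accepted: {cropped}")
```

Here the full-length read accumulates 150 mismatches from its unrelated tail and fails the cutoff, whereas the 75-bp prefix matches exactly and passes; the same fixed cutoff can then be applied to every read regardless of its original length.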
The combined in silico and clinical data on SNAP and RAPSearch performance (Figs. 2–4) guided (1) the choice of an edit distance "d" of 12 as the most appropriate empirical cutoff for SNAP alignment (Fig. 1C), (2) read cropping prior to SNAP alignment to a length of 75 bp to maximize sensitivity for chimeric reads or reads with error-prone 3′ ends from Illumina sequencing and to allow use of a fixed edit distance threshold (Fig. 4C; Supplemental Results; Supplemental Fig. S1), and (3) the serial coupling of the SNAP and RAPSearch algorithms to maximize speed without sacrificing breadth of detection.

Accurate detection of pathogens from clinical samples using SURPI
SURPI was used to accurately classify viral pathogens down to the species and even strain level in various clinical metagenomic NGS data sets (Fig. 5A–G; Supplemental Results; Supplemental Tables S4, S5), automatically generating summary tables (Supplemental Tables S6–S21) and coverage maps (Supplemental Fig. S2) corresponding to the actual virus present in the sample. Plasma samples spiked with human immunodeficiency virus (HIV-1) at titers ranging from 10² to 10⁴ copies/mL were identified and mapped to the correct strain (Fig. 5A), showing a linear correlation between number of aligned reads and viral titer, while sapovirus (SaV) and a divergent human parechovirus 1 (HPeV1) shed in children with diarrhea and provisionally named HPeV-1 isolate MX1 were correctly identified (Fig. 5B), as was human herpesvirus 3 (HHV3) in cerebrospinal fluid (CSF) from a patient
Figure 3. SURPI aligners (SNAP and RAPSearch) are significantly faster than other tested aligners and scale better with larger data sets. Timing performance was benchmarked on a single computational server using in silico query data sets of increasing size. The breaks (zigzag lines) represent computational times that are off-scale. Some of the computational times were estimated (asterisks). (A) Performance time for alignment of reads to the human DB. (B) Performance time for SNAP alignment of reads to the entire 42-Gb NCBI nt DB. The z-axis denotes the approximate number of remaining reads following computational subtraction against the human DB. SNAP performance times were benchmarked separately on local and cloud servers. (C) Performance times for translated nucleotide alignment to the viral protein DB using RAPSearch and BLASTx.