Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight

Mark T.W. Ebbert; Tanner D. Jensen; Karen Jansen-West; Jonathon P. Sens; Joseph S. Reddy; Perry G. Ridge; John S.K. Kauwe; Veronique Belzil; Luc Pregent; Minerva M. Carrasquillo; Dirk Keene; Eric Larson; Paul Crane; Yan W. Asmann; Nilufer Ertekin-Taner; Steven G. Younkin; Owen A. Ross; Rosa Rademakers; Leonard Petrucelli; John D. Fryer

doi:10.1186/s13059-019-1707-2

Research output: Contribution to journal › Article › peer-review

37 Scopus citations

Abstract

Background: The human genome contains "dark" gene regions that cannot be adequately assembled or aligned using standard short-read sequencing technologies, preventing researchers from identifying mutations within these gene regions that may be relevant to human disease. Here, we identify regions with few mappable reads that we call dark by depth, and others that have ambiguous alignment, called camouflaged. We assess how well long-read or linked-read technologies resolve these regions.

Results: Based on standard whole-genome Illumina sequencing data, we identify 36,794 dark regions in 6054 gene bodies from pathways important to human health, development, and reproduction. Of these gene bodies, 8.7% are completely dark and 35.2% are ≥ 5% dark. We identify dark regions that are present in protein-coding exons across 748 genes. Linked-read or long-read sequencing technologies from 10x Genomics, PacBio, and Oxford Nanopore Technologies reduce dark protein-coding regions to approximately 50.5%, 35.6%, and 9.6%, respectively. We present an algorithm to resolve most camouflaged regions and apply it to the Alzheimer's Disease Sequencing Project. We rescue a rare ten-nucleotide frameshift deletion in CR1, a top Alzheimer's disease gene, found in disease cases but not in controls.

Conclusions: While we could not formally assess the association of the CR1 frameshift mutation with Alzheimer's disease due to insufficient sample size, we believe it merits investigation in a larger cohort. There remain thousands of potentially important genomic regions overlooked by short-read sequencing that are largely resolved by long-read technologies.
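The distinction the abstract draws between "dark by depth" and "camouflaged" positions can be illustrated with a minimal sketch. This is not the authors' algorithm. The abstract gives no cutoffs, so the three thresholds below are illustrative assumptions, the function names are hypothetical, and low mapping quality (MAPQ) is used only as a stand-in for "ambiguous alignment".

```python
from typing import Sequence

# Illustrative thresholds only; the abstract does not state the cutoffs.
MAX_DEPTH_FOR_DARK = 5        # assumed meaning of "few mappable reads"
LOW_MAPQ = 10                 # assumed: MAPQ below this counts as ambiguous
MIN_LOW_MAPQ_FRACTION = 0.9   # assumed share of ambiguous reads at a position


def classify_position(mapqs: Sequence[int]) -> str:
    """Label one genomic position from the MAPQs of the reads covering it.

    Returns 'dark_by_depth', 'camouflaged', or 'ok'.
    """
    depth = len(mapqs)
    if depth <= MAX_DEPTH_FOR_DARK:
        return "dark_by_depth"
    low_quality = sum(1 for q in mapqs if q < LOW_MAPQ)
    if low_quality / depth >= MIN_LOW_MAPQ_FRACTION:
        return "camouflaged"
    return "ok"


def percent_dark(region: Sequence[Sequence[int]]) -> float:
    """Percentage of positions in a region that are not labeled 'ok'."""
    if not region:
        return 0.0
    dark = sum(1 for mapqs in region if classify_position(mapqs) != "ok")
    return 100.0 * dark / len(region)


if __name__ == "__main__":
    # Toy region of four positions: well covered, ambiguous, uncovered, well covered.
    toy_region = [[60] * 10, [0] * 10, [], [60] * 20]
    for mapqs in toy_region:
        print(len(mapqs), classify_position(mapqs))
    print(percent_dark(toy_region))
```

Under these assumed thresholds, a gene body would count as "completely dark" when `percent_dark` returns 100 and as "≥ 5% dark" when it returns at least 5, which is how the percentages in the Results could be tallied per gene body.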

Original language	English (US)
Article number	97
Journal	Genome biology
Volume	20
Issue number	1
DOIs	https://doi.org/10.1186/s13059-019-1707-2
State	Published - May 20 2019

Keywords

10x Genomics
APOE
Alzheimer's Disease Sequencing Project (ADSP)
CR1
Camouflaged genes
Dark genes
Long-read sequencing
Oxford Nanopore Technologies (ONT)
Pacific Biosciences (PacBio)

ASJC Scopus subject areas

Ecology, Evolution, Behavior and Systematics
Genetics
Cell Biology

Cite this

Ebbert, M. T. W., Jensen, T. D., Jansen-West, K., Sens, J. P., Reddy, J. S., Ridge, P. G., Kauwe, J. S. K., Belzil, V., Pregent, L., Carrasquillo, M. M., Keene, D., Larson, E., Crane, P., Asmann, Y. W., Ertekin-Taner, N., Younkin, S. G., Ross, O. A., Rademakers, R., Petrucelli, L., & Fryer, J. D. (2019). Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome biology, 20(1), Article 97. https://doi.org/10.1186/s13059-019-1707-2

Ebbert, MTW, Jensen, TD, Jansen-West, K, Sens, JP, Reddy, JS, Ridge, PG, Kauwe, JSK, Belzil, V, Pregent, L, Carrasquillo, MM, Keene, D, Larson, E, Crane, P, Asmann, YW, Ertekin-Taner, N, Younkin, SG, Ross, OA, Rademakers, R, Petrucelli, L & Fryer, JD 2019, 'Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight', Genome biology, vol. 20, no. 1, 97. https://doi.org/10.1186/s13059-019-1707-2

@article{7fdaa6f6301040848032d7fdd884a919,

title = "Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight",

keywords = "10x Genomics, APOE, Alzheimer's Disease Sequencing Project (ADSP), CR1, Camouflaged genes, Dark genes, Long-read sequencing, Oxford Nanopore Technologies (ONT), Pacific Biosciences (PacBio)",

author = "Ebbert, {Mark T.W.} and Jensen, {Tanner D.} and Karen Jansen-West and Sens, {Jonathon P.} and Reddy, {Joseph S.} and Ridge, {Perry G.} and Kauwe, {John S.K.} and Veronique Belzil and Luc Pregent and Carrasquillo, {Minerva M.} and Dirk Keene and Eric Larson and Paul Crane and Asmann, {Yan W.} and Nilufer Ertekin-Taner and Younkin, {Steven G.} and Ross, {Owen A.} and Rosa Rademakers and Leonard Petrucelli and Fryer, {John D.}",

note = "Publisher Copyright: {\textcopyright} 2019 The Author(s).",

year = "2019",

month = may,

day = "20",

doi = "10.1186/s13059-019-1707-2",

language = "English (US)",

volume = "20",

journal = "Genome biology",

issn = "1474-7596",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight

AU - Ebbert, Mark T.W.

AU - Jensen, Tanner D.

AU - Jansen-West, Karen

AU - Sens, Jonathon P.

AU - Reddy, Joseph S.

AU - Ridge, Perry G.

AU - Kauwe, John S.K.

AU - Belzil, Veronique

AU - Pregent, Luc

AU - Carrasquillo, Minerva M.

AU - Keene, Dirk

AU - Larson, Eric

AU - Crane, Paul

AU - Asmann, Yan W.

AU - Ertekin-Taner, Nilufer

AU - Younkin, Steven G.

AU - Ross, Owen A.

AU - Rademakers, Rosa

AU - Petrucelli, Leonard

AU - Fryer, John D.

PY - 2019/5/20

Y1 - 2019/5/20

KW - 10x Genomics

KW - APOE

KW - Alzheimer's Disease Sequencing Project (ADSP)

KW - CR1

KW - Camouflaged genes

KW - Dark genes

KW - Long-read sequencing

KW - Oxford Nanopore Technologies (ONT)

KW - Pacific Biosciences (PacBio)

UR - http://www.scopus.com/inward/record.url?scp=85066014432&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85066014432&partnerID=8YFLogxK

U2 - 10.1186/s13059-019-1707-2

DO - 10.1186/s13059-019-1707-2

M3 - Article

C2 - 31104630

AN - SCOPUS:85066014432

SN - 1474-7596

VL - 20

JO - Genome biology

JF - Genome biology

IS - 1

M1 - 97

ER -
