Abstract
Motivation
Genotype imputation is commonly used in genome-wide association studies (GWAS). However, both genotyping chips and imputation reference panels are constructed using next-generation sequencing (NGS). Due to the nature of NGS and gaps in the reference, some regions of the genome are inaccessible to sequencing. To date, there has been no extensive evaluation of these regions, and their impact on the identification of associations in GWAS remains unclear.
Methods
We systematically assess whether variants in inaccessible regions are underrepresented on genotyping chips, in imputation reference panels, in the GWAS Catalog and in the ClinVar database. We also determine the proportion of genes located in inaccessible regions and compare the results across inaccessibility masks defined by the 1000 Genomes Project and the TOPMed program.
Results
Fewer variants were observed in inaccessible regions in all analyzed categories. Depending on the mask and after normalizing for region size, only 4-17% of the genotyped variants were located in inaccessible regions, and 52 to 581 genes were almost completely inaccessible. From the Cooperative Health Research in South Tyrol (CHRIS) study, we present a case study of an association located in an inaccessible region that can only be identified with genotyped variants in GRCh37 because imputation was inaccurate. To help researchers assess gene and variant accessibility, we provide an online application (https://gab.gm.eurac.edu). Genotyping, NGS, genotype imputation and downstream applications such as GWAS and fine mapping are systematically biased in inaccessible regions, due to spurious associations and missed variants.