Introduction
The Major Histocompatibility Complex (MHC) comprising HLA class I and class II molecules (antigens) are polymorphic cell-membrane-bound glycoproteins that regulate the immune response by presenting peptides of fragmented proteins to circulating cytotoxic and helper T lymphocytes, respectively. These peptide-presenting-antigens are encoded by the genomic regions within the MHC that in humans are known specifically as the Human Leukocyte Antigen (HLA) class I and class II gene loci. The HLA molecules are investigated continuously due to their crucial role in the regulation of innate and adaptive immune responses (1, 2), during rejection, graft-versus-host disease (GVHD) of hematopoietic stem cell transplants (3, 4) and the pathogenesis of numerous infectious and/or autoimmune diseases (5–8).
Conventional PCR-based genotyping approaches incorporating restriction fragment length polymorphism (PCR-RFLP) (9), single strand conformation polymorphism (PCR-SSCP) (10), sequence-specific oligonucleotides (PCR-SSO) (11), sequence-specific primers (PCR-SSP) (12), and Sanger sequencing-based typing (PCR-SBT) (13) have been used for HLA-testing in disease association and pre-transplantation (14–16) analysis. However, these methods are limited in their ability to decipher chromosomal phase (cis/trans) ambiguity and/or imprecise allele identification (17, 18) and may leave multiple pairs of HLA gene alleles unresolved. Moreover, the traditional methods focus only on the variations in the highly polymorphic regions of the HLA genes, and thus the majority of the assays only interrogate exons 2 and 3 or exons 2, 3, and 4 of the class I loci and exons 2 or exons 2 and 3 of class II. As a result, the genetic variations in the non-coding regions that regulate RNA expression levels (19–21), or the exonic coding sequences outside of the polymorphic domains have remained largely ignored. On the other hand, methods that use full-length gene sequences including the promoter-enhancer region, 5′ and 3′ untranslated regions (UTRs), as well as all exonic and intronic regions for HLA genotyping, would be more accurate for the discovery of new polymorphisms associated with disease susceptibility and transplantation outcome. To this end, however, not more than 10% of ~ 17,700 known alleles in the IPD-IMGT/HLA reference database [Release 3.31, https://www.ebi.ac.uk/ipd/imgt/hla/, (22)] are classified as full-length HLA gene sequences. Consequently, a comprehensive collection and classification of full-length HLA allele sequences including information from the “undetermined” regions of the HLA genes are desired to promote the future development of more accurate and reliable HLA genotyping methods.
During the past decade, next-generation sequencing (NGS) using mostly second-generation short-read sequencing technologies have employed exon sequencing for some or all of the HLA coding regions in an attempt to solve the phase ambiguity problem encountered by the conventional Sanger Sequencing-Based HLA genotyping methods (23). We previously reported on the development of super high resolution-sequence-based typing application using two NGS platforms, Ion PGM (Life Technologies) and GS Junior (Roche) (24–26). In both of the studies, long-range PCR amplicons from the promoter-enhancer region to 3′UTR for eight classical HLA loci, HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DQA1, HLA-DQB1, HLA-DPA1, and HLA-DPB1 were shotgun sequenced. Other long-range PCR and NGS-based shotgun sequencing based HLA genotyping methods using the 454 GS-FLX (Roche) and the MiSeq/HiSeq (Illumina) platforms (18, 27–32) also can resolve many of the phase ambiguities. However, all of these methods are affected by the disadvantages associated with the short-read lengths of NGS technologies, which range from 150 to 500 bp. Although the short-read sequences can accurately separate the two phases in SNP dense regions, such as a polymorphic exon, the phasing is impossible in SNP deficient regions (Figure 1). Specifically, it is necessary to have at least two single nucleotide variants (SNVs) in one short-read sequence to separate both phases. The full-length HLA allele sequences are especially difficult to phase in the HLA-DQA1, HLA-DPA1, and HLA-DPB1 loci because their SNP densities are much less than in the other HLA loci (24), making it unlikely that two SNPs will appear in one short-read sequence.