Abstract
Global outbreaks of human influenza arise from influenza A viruses carrying novel Hemagglutinin (HA) molecules to which humans have no immunity, so understanding the origin and evolution of HA genes is of particular importance. Host-origin classification was performed on the two outer proteins, HA and Neuraminidase (NA). Host classification based on the HA protein achieved accuracies between 91.2% and 100% using KNN and random forest, and host classification based on the NA protein achieved accuracies between 91.2% and 100% using the same two classifiers.
Keywords: Influenza, machine learning, host classification.
Introduction
Influenza A viruses belong to the Orthomyxoviridae family of negative-sense, single-stranded, segmented RNA viruses. The RNA core consists of eight gene segments. Immunologically, the most significant surface proteins are Hemagglutinin (HA, 16 subtypes) and Neuraminidase (NA, 9 subtypes). Influenza A subtypes are usually identified by their HA and NA proteins [1, 2].
The HA and NA proteins are integral membrane proteins and are considered the major surface antigens of the influenza virion. HA is a major surface glycoprotein responsible for attaching the virus to receptors on the host cell surface. The role of NA is to free virus particles from host cell receptors, allowing progeny virions to escape from the cell in which they arose and thereby facilitating virus spread [4].
All known subtypes of influenza A viruses are found among avian species that serve as main reservoirs for these agents [2]. In general, an influenza virus infects only a single species; however, whole viruses may occasionally be transmitted from one species to another, and genetic reassortment between viruses from two different hosts can produce a new virus capable of infecting a third host.
Avian influenza viruses are not readily introduced into humans [3], possibly because humans do not possess the α(2,3)-sialyllactose (NeuAc-2,3Gal) receptors required for attachment of the viruses to epithelial cells. However, individual viral genes can be transmitted between humans and avian species, as demonstrated by the avian-human reassortant viruses that caused the 1957 and 1968 influenza pandemics [4, 5]. This finding suggested that an intermediate host may be needed for genetic reassortment of human and avian viruses.
Pigs are considered a logical candidate for this role because they can be infected by either avian or human viruses [6, 7] and because they possess both NeuAc-2,3Gal and NeuAc-2,6Gal receptors. In addition, there is good evidence that pigs are more frequently involved in interspecies transmission of influenza A viruses than are other animals [6, 8, 9]. Previous studies have also defined host specificity markers. For example, Allen et al. (2009) predicted positions in the genome associated with human host specificity.
However, the host markers that these authors identified in the surface glycoproteins HA and NA and in the polymerase protein PB1, as well as in the alternate transcripts NS2, M2, and PB1-F2, were poor-quality host discriminators. In a previous study, host-specific signatures were identified using class association rule mining to confirm significant variations between different influenza hosts. Another study used random forest to predict host tropism, but from avian and human samples only.
Because of the important functional role of HA and NA in cell-receptor attachment, entry, and infectivity, our focus in this study was specifically on the host markers found in HA and NA. The aim of the present study was to establish accurate host-of-origin classifiers capable of indicating signatures in human, avian, and swine influenza viral genomes for the HA and NA proteins. KNN and random forest classification models were used for host-origin classification.
Materials and Methods
All protein sequences isolated from human, avian and swine hosts were downloaded from the NCBI’s Influenza Virus Resources (http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html). The downloaded sequences were filtered so that only non-redundant, complete HA and NA segments were retained. A total of 1500 HA and 2345 NA protein sequences were selected for the study. Part of the data was used for training and the remainder for testing.
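The split into training and test data can be pictured with a short sketch. The text does not state the proportion or the sampling scheme, so the 80/20 ratio, the fixed seed, the per-host (stratified) sampling, and the function name `stratified_split` are all assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def stratified_split(records, train_fraction=0.8, seed=0):
    """Split (sequence, host) records into train and test sets, host by host.

    train_fraction and seed are illustrative assumptions; the source does not
    report the proportion that was actually used.
    """
    by_host = defaultdict(list)
    for record in records:
        by_host[record[1]].append(record)

    rng = random.Random(seed)
    train, test = [], []
    for host in sorted(by_host):
        group = by_host[host][:]          # copy so the input is left untouched
        rng.shuffle(group)
        cut = int(round(train_fraction * len(group)))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Toy records standing in for real sequences (hypothetical data).
records = [("SEQ%d" % i, host)
           for host in ("human", "avian", "swine")
           for i in range(10)]
train, test = stratified_split(records)
print(len(train), len(test))  # 24 6
```

Splitting within each host group keeps the host proportions similar in both sets, which matters when one host is much better represented than the others.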
We used amino acid sequences (20-letter alphabet) because they are known to give more reliable results than nucleotide sequences when sequence divergence is high. To compare the genomic patterns of avian, swine and human influenza viruses with one another, we downloaded HA and NA protein sequences, isolated from various host species, from NCBI’s Influenza Virus Resources. The detailed count of sequences used in this study for each host is given in Table 1.
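To make the per-host tally concrete, here is a minimal sketch that reads FASTA-formatted protein records and counts them by gene and host. The pipe-delimited header layout (`>accession|subtype|gene|host|year`), the accession strings, and the short sequences are hypothetical placeholders; real downloads may use a different header format, in which case `FIELDS` and the delimiter would need adjusting.

```python
from collections import Counter

# Hypothetical header layout: >accession|subtype|gene|host|year
FIELDS = ("accession", "subtype", "gene", "host", "year")


def parse_fasta(text):
    """Return one dict per FASTA record, holding header fields and the sequence."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: split the header into named fields.
            current = dict(zip(FIELDS, line[1:].split("|")))
            current["sequence"] = ""
            records.append(current)
        elif current is not None:
            # Sequence lines may wrap; concatenate them.
            current["sequence"] += line
    return records


# Placeholder records, not real database entries.
example = """>ACC0001|H1N1|HA|Human|2009
MKAILVVLLY
TFATANADTL
>ACC0002|H5N1|HA|Avian|2005
MEKIVLLLAI
>ACC0003|H1N1|NA|Swine|2009
MNPNQKIITI
"""

records = parse_fasta(example)
counts = Counter((r["gene"], r["host"]) for r in records)
for (gene, host), n in sorted(counts.items()):
    print(gene, host, n)
```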
The sequences were grouped according to host type and cover all the viral subtypes found in that host. Downloaded FASTA-format sequences were parsed into fields such as accession number, subtype, gene, host, year of occurrence, and other parameters.
Accurate detection of influenza viral origin can significantly improve influenza surveillance and vaccine development. The classification models constructed from Amino Acid Composition feature vectors all achieved high prediction performance, indicating clear differences among human, avian and swine proteins.
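As an illustration of the feature representation named above, the sketch below computes an Amino Acid Composition vector (the fraction of each of the 20 standard residues in a sequence) and feeds it to a simple k-nearest-neighbour vote. The text gives no details of the classifier settings, so the Euclidean distance, the value k = 3, the function names, and the toy sequences are assumptions; this is not the authors' implementation, and random forest is not shown.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return a 20-element list: fraction of each standard residue in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)  # ignore non-standard symbols
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to the query.

    train is a list of (feature_vector, label) pairs; k and the Euclidean
    metric are illustrative choices.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy sequences with deliberately distinct compositions (hypothetical data).
toy = [("AAAAAG", "human"), ("AAAAGG", "human"),
       ("KKKKKL", "avian"), ("KKKKLL", "avian"),
       ("WWWWWY", "swine"), ("WWWWYY", "swine")]
train = [(amino_acid_composition(seq), host) for seq, host in toy]

query = amino_acid_composition("AAAGGG")
print(knn_predict(train, query))  # human
```

Because the composition vector discards residue order, sequences of different lengths map to vectors of the same size, which is what allows a distance-based classifier such as KNN to compare them directly.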