Application of Machine Learning in Analysis of Transcriptomic Data Derived from Next Generation Sequencing
EasyChair Preprint 1985 • 5 pages • Date: November 18, 2019

Abstract
Tobacco mosaic virus, the most studied plant virus, can infect over 100 species of plants and over 550 species of flowering plants, causing enormous economic losses at home and abroad. Microarrays, an important analytic tool in genomics and genetics, enable researchers to analyze the expression of a large number of genes simultaneously. To identify genes related to replication of tobacco mosaic virus, this research uses gene expression data from Arabidopsis cells infected by the virus, recorded at 5 time points (30 min, 4 hr, 6 hr, 18 hr and 24 hr) and produced by next generation sequencing. The research analyzes the time-series raw data and adapts the Fast Correlation-Based Filter (FCBF) and Wrapper algorithms for gene selection. The selected genes are validated with the C4.5 algorithm and a Multi-Layer Perceptron. Results show that the genes selected by the Wrapper algorithm, with an average accuracy of 75%, an average true positive rate (classification accuracy for the control group) of 77.5%, a true negative rate (classification accuracy for the experiment group) of 72.5%, an average F-measure of 74.85% and an average AUC of 0.7965, perform better overall than the genes selected by the other algorithms.

Keyphrases: Tobacco mosaic virus, Arabidopsis, machine learning, next generation sequencing
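
The abstract reports accuracy, true positive rate, true negative rate and F-measure, with the control group treated as the positive class. As a minimal sketch of how such figures follow from a confusion matrix, the Python below computes them from four counts. The `Confusion` class, the `summarize` function and the example counts are hypothetical illustrations, not the authors' code or data, and AUC is omitted because it needs per-sample scores rather than counts.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion counts, taking the control group as the positive class."""
    tp: int  # control samples classified as control
    fn: int  # control samples classified as experiment
    tn: int  # experiment samples classified as experiment
    fp: int  # experiment samples classified as control


def summarize(c: Confusion) -> dict:
    """Return accuracy, TPR, TNR and F-measure for one confusion matrix.

    Assumes each class has at least one sample and at least one sample
    was predicted positive, so no denominator is zero.
    """
    total = c.tp + c.fn + c.tn + c.fp
    tpr = c.tp / (c.tp + c.fn)        # accuracy on the control group
    tnr = c.tn / (c.tn + c.fp)        # accuracy on the experiment group
    precision = c.tp / (c.tp + c.fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {
        "accuracy": (c.tp + c.tn) / total,
        "tpr": tpr,
        "tnr": tnr,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only; not taken from the paper.
    metrics = summarize(Confusion(tp=8, fn=2, tn=7, fp=3))
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

Averaging these per-run values over several validation folds or classifiers would give averages of the kind quoted in the abstract. How the authors actually averaged is not stated there.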