Date of Publication

12-1-2022

Document Type

Master's Thesis

Degree Name

Master of Science in Computer Science

Subject Categories

Computer Sciences

College

College of Computer Studies

Department/Unit

Software Technology

Thesis Advisor

Anish Man Singh Shrestha

Defense Panel Chair

Roger Luis Uy

Defense Panel Member

Jennifer Ureta
Anish Man Singh Shrestha

Abstract/Summary

RNA-seq is an experimental technique that uses modern, high-throughput sequencing technology to sequence a population of mRNA. A common use of RNA-seq is Differential Gene Expression Analysis (DGEA), the process of identifying genes whose expression levels change significantly across conditions. Typical DGEA pipelines require an annotated reference genome or transcriptome and therefore cannot be applied to most organisms, since only a few organisms have been studied extensively enough to have a high-quality annotated reference transcriptome. For organisms without an annotated reference transcriptome, a more complex pipeline is often used. This pipeline involves constructing a de novo transcriptome assembly, that is, reconstructing transcript sequences from the RNA-seq reads. However, constructing a de novo assembly is computationally expensive. Recently, we proposed a novel alternative in which we directly align the RNA-seq reads to a protein database of a close relative. The alternative pipeline provides improvements in speed and memory usage while improving the precision and recall in identifying differentially expressed genes. However, this alternative pipeline uses full sequence alignments, which take time and generate information unnecessary for DGEA. This study replaces full sequence alignment with quasi-mapping, which determines the mapping by rapid look-ups of substrings of a query sequence. We report a further speed-up from this replacement, making our pipeline > 1000× faster than the assembly-based approach while remaining more accurate. We also compare quasi-mapping with other mapping techniques and show that it is faster, but at the cost of sensitivity.
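
As a rough illustration of the substring look-up idea behind quasi-mapping, the sketch below indexes every length-k substring (k-mer) of a set of reference sequences and assigns a query to the reference that shares the most k-mers with it. This is a simplified toy and not the thesis's implementation: the function names, the voting rule, the `min_hits` threshold, and the example sequences are all assumptions made for illustration, and the query is assumed to be already translated into amino acids.

```python
from collections import Counter, defaultdict


def build_kmer_index(references, k):
    """Map every length-k substring to the set of reference IDs containing it."""
    index = defaultdict(set)
    for ref_id, seq in references.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(ref_id)
    return index


def quasi_map(query, index, k, min_hits=2):
    """Return the reference sharing the most k-mers with the query, or None.

    No base-by-base alignment is computed; only k-mer look-ups are used.
    min_hits is a hypothetical threshold to suppress spurious matches.
    """
    votes = Counter()
    for i in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict
        for ref_id in index.get(query[i:i + k], ()):
            votes[ref_id] += 1
    if not votes:
        return None
    best_id, hits = votes.most_common(1)[0]
    return best_id if hits >= min_hits else None


# Toy, made-up protein sequences (illustrative only)
references = {
    "protA": "MKTAYIAKQRQISFVK",
    "protB": "MSDNGELEDKPPAPPV",
}
index = build_kmer_index(references, k=4)
```

Calling `quasi_map("AYIAKQR", index, k=4)` looks up four k-mers and reports the reference that contains most of them, without computing an alignment. Skipping the alignment is the source of the speed gain, and also of the reduced sensitivity, since a query whose k-mers are all disrupted by mismatches finds no hits.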

Abstract Format

html

Language

English

Format

Electronic

Physical Description

44 leaves

Keywords

Nucleotide sequence; Gene expression

Embargo Period

12-12-2022