Publication Date
2025-5
First Advisor
Kaitlyn Cook
Second Advisor
Luce Ward
Document Type
Honors Project
Degree Name
Bachelor of Arts
Department
Statistical and Data Sciences
Keywords
uncertainty, evolution, biology, prokaryotes, data science, Bayesian, hidden Markov models, bioinformatics, computational biology, phylogenetics, prediction, long-read
Abstract
Using 16S rRNA gene sequencing data is a fast, inexpensive way to perform taxonomic classification of prokaryotes. Many statistical methods have been developed for such classification. However, no existing tools for 16S analysis are geared towards long-read data. Long-read data is becoming increasingly accessible and makes obtaining full-length 16S gene data significantly more feasible. We explore statistical methods for taxonomic assignment, working towards the development of a 16S analysis pipeline focused on long-read, full-gene data. We focus on the RDP Classifier, Bayesian Lowest Common Ancestor (BLCA), and Hidden Markov Model-based Ultra-Fast OTU tools and assess their effectiveness on a testing set of Proteobacteria. We find that, under continuous assignment, BLCA performs very similarly to RDP Classifier whether assignment uses bootstrap confidence scores or posterior probabilities. We conclude that BLCA’s novel Bayesian method shows great promise for growth and potential inclusion in a long-read, full-gene 16S analysis pipeline.
Rights
©2025 Elm Markert. Access limited to the Smith College community and other researchers while on campus. Smith College community members also may access from off-campus using a Smith College log-in. Other off-campus researchers may request a copy through Interlibrary Loan for personal use.
Language
English
Recommended Citation
Markert, Elm, "Towards a 16S rRNA Sequence Analysis Pipeline: Exploring Statistical Methods for Taxonomic Assignment" (2025). Honors Project, Smith College, Northampton, MA.
https://scholarworks.smith.edu/theses/2676
Comments
x, 77 pages : illustrations (some color), charts. Includes bibliographical references (pages 73-77).