Pairwise alignment-free protein sequence clustering
Bachelor of Arts
Bioinformatics, Protein data bank, Proteins, Clustering, Amino acid sequence, Proteins-Database
Proteins are macromolecules that play a pivotal role in biological processes in living organisms. Structural information for proteins is collected in the Protein Data Bank, a large database that at the time of writing contains over 122,000 structures. Grouping, or clustering, protein sequences by similarity allows biologists to identify homologous sequences, that is, those with shared gene ancestry. The current implementation on the RCSB PDB site (www.rcsb.org) uses BLASTClust, which is run weekly to account for the protein data frequently added to the Protein Data Bank. The issue is that each of these updates takes about half a day to run. To determine the similarity between pairs of protein sequences, existing methods align the sequences, for example by inserting gaps so that more pairs of amino acids can be matched. The goal of this project was to create a computational tool that performs pairwise alignment-free protein sequence clustering using Java and Apache Spark, a cluster-computing framework for big data processing and machine learning.
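The abstract does not state which alignment-free similarity measure the project used. As a minimal sketch of the general idea only, the Java example below compares two sequences by the Jaccard similarity of their k-mer sets (all substrings of length k), which requires no alignment and no gap insertion. The class name `KmerJaccard`, the method names, the choice of k = 3, and the toy sequences are all assumptions made for illustration; the sketch does not use Spark and is not the thesis implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of an alignment-free similarity measure.
// Not taken from the thesis; names and parameters are assumptions.
final class KmerJaccard {

    // Collect the distinct k-mers (length-k substrings) of a sequence.
    static Set<String> kmers(String sequence, int k) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            result.add(sequence.substring(i, i + k));
        }
        return result;
    }

    // Jaccard similarity of the two k-mer sets:
    // size of the intersection divided by size of the union, in [0, 1].
    static double jaccard(String a, String b, int k) {
        Set<String> ka = kmers(a, k);
        Set<String> kb = kmers(b, k);
        if (ka.isEmpty() && kb.isEmpty()) {
            return 0.0; // both sequences shorter than k: nothing to compare
        }
        Set<String> intersection = new HashSet<>(ka);
        intersection.retainAll(kb);
        int unionSize = ka.size() + kb.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Toy sequences in one-letter amino acid code (made up for the example).
        String s1 = "MKTAYIAKQR";
        String s2 = "MKTAYIAKQK"; // differs from s1 in the last residue
        String s3 = "GGSGGSGGSG"; // shares no 3-mer with s1

        System.out.println(jaccard(s1, s2, 3)); // 7 shared of 9 distinct 3-mers
        System.out.println(jaccard(s1, s3, 3)); // 0.0
    }
}
```

A clustering step could then group sequences whose pairwise score exceeds a chosen threshold. How the project distributes that work with Spark is not described in the abstract.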
Cheng, Marina Kristy, "Pairwise alignment-free protein sequence clustering" (2017). Honors Project, Smith College, Northampton, MA.