Pairwise alignment-free protein sequence clustering

Publication Date


Document Type

Honors Project

Degree Name

Bachelor of Arts


Computer Science


Sara Mathieson


Bioinformatics, Protein data bank, Proteins, Clustering, Amino acid sequence, Proteins-Database


Proteins are macromolecules that play a pivotal role in the biological processes of living organisms. Structural information for proteins is collected in the Protein Data Bank, a large database that currently contains over 122,000 structures [24]. Clustering protein sequences by similarity allows biologists to identify homologous sequences, that is, sequences with shared gene ancestry. The current implementation on the RCSB PDB site (www.rcsb.org) uses BLASTClust [16], which is run weekly to account for the protein data frequently added to the Protein Data Bank. The issue is that each of these updates takes about half a day to run. Many methods determine the similarity between a pair of protein sequences by aligning them, for example by inserting gaps so that more pairs of amino acids match. The goal of this project was to create a computational tool that performs pairwise alignment-free protein sequence clustering using Java and Apache Spark, a cluster-computing framework for big data processing and machine learning [43] [49].
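As a rough illustration of what alignment-free pairwise clustering can look like, the sketch below compares sequences by the overlap of their k-mer sets (Jaccard similarity) and links any pair whose similarity meets a threshold, which yields single-linkage clusters. This is an assumption made for illustration only: the abstract does not state which similarity measure or clustering rule the project uses, and the class name `KmerClusteringSketch`, the value k = 3, the threshold 0.5, and the example sequences are all hypothetical. The sketch is plain Java; it does not use Apache Spark.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the thesis implementation: alignment-free
// similarity via k-mer sets, followed by threshold-based single linkage.
class KmerClusteringSketch {

    /** Collects the distinct length-k substrings (k-mers) of a sequence. */
    public static Set<String> kmers(String seq, int k) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            out.add(seq.substring(i, i + k));
        }
        return out;
    }

    /** Jaccard similarity of the two k-mer sets: intersection size over union size. */
    public static double jaccard(String a, String b, int k) {
        Set<String> ka = kmers(a, k);
        Set<String> kb = kmers(b, k);
        if (ka.isEmpty() && kb.isEmpty()) {
            return 0.0; // both sequences shorter than k: treat as dissimilar
        }
        Set<String> inter = new HashSet<>(ka);
        inter.retainAll(kb);
        int union = ka.size() + kb.size() - inter.size();
        return (double) inter.size() / union;
    }

    /** Union-find root lookup with path halving. */
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    /**
     * Compares every pair of sequences and merges the clusters of any pair
     * whose similarity is at least the threshold. Returns one cluster label
     * per sequence; sequences with equal labels are in the same cluster.
     */
    public static int[] cluster(List<String> seqs, int k, double threshold) {
        int n = seqs.size();
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (jaccard(seqs.get(i), seqs.get(j), k) >= threshold) {
                    parent[find(parent, j)] = find(parent, i);
                }
            }
        }
        int[] labels = new int[n];
        for (int i = 0; i < n; i++) {
            labels[i] = find(parent, i);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Made-up example sequences: the first two differ only in the last residue.
        List<String> seqs = Arrays.asList("MKTAYIAKQR", "MKTAYIAKQK", "GSHMLEDPVA");
        int[] labels = cluster(seqs, 3, 0.5);
        for (int i = 0; i < labels.length; i++) {
            System.out.println(seqs.get(i) + " -> cluster " + labels[i]);
        }
    }
}
```

No gaps are inserted and no alignment is computed; each comparison reduces to set operations on short substrings. The all-pairs loop is quadratic in the number of sequences, and that pairwise step is the kind of work a cluster-computing framework could distribute.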




73 pages : color illustrations. Includes bibliographical references (pages 69-73)

This document is currently not available here.