Publication Date


First Advisor

Sara Mathieson

Document Type

Honors Project

Degree Name

Bachelor of Arts


Computer Science


Bioinformatics, Protein data bank, Proteins, Clustering, Amino acid sequence, Proteins-Database


Proteins are macromolecules that play a pivotal role in the biological processes of living organisms. Structural information for proteins is collected in the Protein Data Bank, a large database that currently contains over 122,000 structures [24]. Clustering protein sequences by similarity allows biologists to identify homologous sequences, or those with shared gene ancestry. The current implementation on the RCSB PDB site uses BLASTClust [16], which is run weekly to account for the protein data frequently added to the Protein Data Bank. The issue is that each of these updates takes about half a day to run. One way to determine the similarity between a pair of protein sequences is to align them, for example by inserting gaps so that more pairs of amino acids match. The goal of this project was instead to create a computational tool that performs pairwise, alignment-free clustering of protein sequences using Java and Apache Spark, a cluster-computing framework for big data processing and machine learning [43] [49].
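As an illustration of what alignment-free pairwise comparison can look like, the following is a minimal Java sketch. It is not the method of this project, which the abstract does not specify. It assumes one common alignment-free measure, the Jaccard similarity between the sets of length-k substrings (k-mers) of two sequences, and a simple greedy clustering rule. The class name KmerClustering, the method names, and the k and threshold parameters are all hypothetical. The sketch runs on a single machine and does not use Apache Spark.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: k-mer Jaccard similarity with greedy clustering.
final class KmerClustering {

    // Collect the distinct substrings of length k (k-mers) in a sequence.
    static Set<String> kmers(String sequence, int k) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            result.add(sequence.substring(i, i + k));
        }
        return result;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|. Two empty sets count as identical.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Greedy clustering (an assumed rule): each sequence joins the first cluster
    // whose representative (its first member) is at least `threshold` similar;
    // otherwise it starts a new cluster. Returns clusters of sequence indices.
    static List<List<Integer>> cluster(List<String> sequences, int k, double threshold) {
        List<Set<String>> representatives = new ArrayList<>();
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < sequences.size(); i++) {
            Set<String> profile = kmers(sequences.get(i), k);
            boolean placed = false;
            for (int c = 0; c < clusters.size() && !placed; c++) {
                if (jaccard(profile, representatives.get(c)) >= threshold) {
                    clusters.get(c).add(i);
                    placed = true;
                }
            }
            if (!placed) {
                representatives.add(profile);
                List<Integer> fresh = new ArrayList<>();
                fresh.add(i);
                clusters.add(fresh);
            }
        }
        return clusters;
    }
}
```

Because no alignment is computed, each comparison reduces to set operations on k-mers, and the comparisons are independent of one another, which is the kind of work a framework such as Apache Spark can distribute across machines.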




73 pages : color illustrations. Includes bibliographical references (pages 69-73)