Title

Pairwise alignment-free protein sequence clustering

Publication Date

2017-5

Document Type

Honors Thesis

Degree Name

Bachelor of Arts

Department

Computer Science

Advisors

Sara Mathieson

Keywords

Bioinformatics, Protein data bank, Proteins, Clustering, Amino acid sequence, Proteins-Database

Abstract

Proteins are macromolecules that play a pivotal role in the biological processes of living organisms. Structural information for proteins is collected in the Protein Data Bank (PDB), a large database that at the time of writing contains over 122,000 structures [24]. Clustering protein sequences by similarity allows biologists to identify homologous sequences, that is, those with shared gene ancestry. The current implementation on the RCSB PDB site (www.rcsb.org) uses BLASTClust [16], which is run weekly to account for the protein data frequently added to the Protein Data Bank. The issue is that these updates take about half a day to run. To determine the similarity between pairs of protein sequences, some methods align the sequences, for example by inserting gaps so that more pairs of amino acids can be matched. The goal of this project was to create a computational tool that performs pairwise alignment-free protein sequence clustering using Java and Apache Spark, a cluster-computing framework for big data processing and machine learning [43] [49].

Language

English

Comments

73 pages : color illustrations. Includes bibliographical references (pages 69-73)

This document is currently not available here.
