Theses, Dissertations, and Projects

Pairwise alignment-free protein sequence clustering

Marina Kristy Cheng, Smith College

Publication Date

2017-5

First Advisor

Sara Mathieson

Document Type

Honors Project

Degree Name

Bachelor of Arts

Department

Computer Science

Keywords

Bioinformatics, Protein data bank, Proteins, Clustering, Amino acid sequence, Proteins-Database

Abstract

Proteins are macromolecules that play a pivotal role in the biological processes of living organisms. Structural information for proteins is collected in the Protein Data Bank (PDB), a large database that at the time of writing contains over 122,000 structures [24]. Clustering protein sequences by similarity allows biologists to identify homologous sequences, that is, those with shared gene ancestry. The current implementation on the RCSB PDB site (www.rcsb.org) uses BLASTClust [16], which is run weekly to account for the protein data frequently added to the PDB. The issue is that each of these updates takes about half a day to run. One way to determine the similarity between pairs of protein sequences is to align them, for example by inserting gaps so that more pairs of amino acids can be matched. The goal of this project was to create a computational tool that clusters protein sequences through pairwise alignment-free comparison, using Java and Apache Spark, a cluster-computing framework for big data processing and machine learning [43] [49].
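
The abstract does not say which alignment-free measure the project uses. As a rough illustration of the general idea only, the sketch below compares sequences by the overlap of their k-mers (length-k substrings) and groups them greedily, with no gap insertion or alignment step. It is plain single-machine Java without Spark. The class and method names, the choice of Jaccard similarity, k = 3, the 0.5 threshold, and the toy sequences are all assumptions made for this example and are not taken from the thesis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: k-mer based, alignment-free clustering (not the thesis's method).
public class KmerClusteringSketch {

    // Collect the distinct k-mers (length-k substrings) of a sequence.
    static Set<String> kmers(String sequence, int k) {
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            result.add(sequence.substring(i, i + k));
        }
        return result;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty profiles are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Greedy clustering: a sequence joins the first cluster whose representative
    // (its first member) is at least `threshold` similar; otherwise it starts a new cluster.
    static List<List<String>> cluster(List<String> sequences, int k, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        List<Set<String>> representatives = new ArrayList<>();
        for (String seq : sequences) {
            Set<String> profile = kmers(seq, k);
            boolean placed = false;
            for (int c = 0; c < clusters.size() && !placed; c++) {
                if (jaccard(profile, representatives.get(c)) >= threshold) {
                    clusters.get(c).add(seq);
                    placed = true;
                }
            }
            if (!placed) {
                List<String> fresh = new ArrayList<>();
                fresh.add(seq);
                clusters.add(fresh);
                representatives.add(profile);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Made-up toy sequences, for illustration only.
        List<String> toy = Arrays.asList("MKTAYIAKQR", "MKTAYIAKQK", "GGSGGSGGSG");
        System.out.println(cluster(toy, 3, 0.5));
    }
}
```

With these toy inputs, the first two sequences share 7 of their 9 distinct 3-mers and fall into one cluster, while the third shares none and forms its own. Every comparison here is an independent pairwise computation, which is the kind of work a framework such as Apache Spark could distribute across machines.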

Language

English

Comments

73 pages : color illustrations. Includes bibliographical references (pages 69-73)

Recommended Citation

Cheng, Marina Kristy, "Pairwise alignment-free protein sequence clustering" (2017). Honors Project, Smith College, Northampton, MA.
https://scholarworks.smith.edu/theses/1815
