Dna Cancer Prediction

In this assignment, students classify antimicrobial peptide sequences as antibiofilm or not antibiofilm, using neural networks trained on features extracted from the sequences, by the due date of January 25, 2020. Students may make up to 5 leaderboard submissions per day and must compile a final PDF report for evaluation. The grading criteria and an opportunity for earning extra credit are specified.

https://github.com/davidanastasiu/coen-342-wi22

PR1: Peptide Classification

Published Date:

Jan. 12, 2020, 5:00 p.m.

Deadline Date:

Jan. 25, 2020, 11:59 p.m.

Description:

This is an individual assignment.

Overview and Assignment Goals:

The objectives of this assignment are the following:



Create feed-forward neural networks and train them using both your own code and a deep learning framework.
Experiment with different feature extraction techniques.
Think about dealing with imbalanced data.

Detailed Description:

Develop predictive neural networks that can determine, given an antibacterial peptide,

whether it is also an antibiofilm peptide.

“Proteins are large biomolecules, or macromolecules, consisting of one or more long

chains of amino acid residues. Proteins perform a vast array of functions within organisms,

including catalysing metabolic reactions, DNA replication, responding to stimuli, providing

structure to cells, and organisms, and transporting molecules from one location to another.

Proteins differ from one another primarily in their sequence of amino acids, which is

dictated by the nucleotide sequence of their genes, and which usually results in protein

folding into a specific three-dimensional structure that determines its activity.

A linear chain of amino acid residues is called a polypeptide. A protein contains at least

one long polypeptide. Short polypeptides, containing less than 20-30 residues, are rarely

considered to be proteins and are commonly called peptides. […] The sequence of amino

acid residues in a protein is defined by the sequence of a gene, which is encoded in the

genetic code. In general, the genetic code specifies 20 standard amino acids; […] Proteins

can also work together to achieve a particular function, and they often associate to form

stable protein complexes.” [Wikipedia, Accessed 2020-02-07,

https://en.wikipedia.org/wiki/Protein]

Biofilms are tightly-connected multicellular communities of microorganisms encased in self-

secreted extra-cellular matrices. They are currently one of the major causes of disease for

two main reasons. First, roughly 75% of all human infections are caused by biofilms.

Second, due to their robust multicellular matrix structure, they are resistant both to

the host defense mechanisms and to traditional antimicrobial compounds (antibiotics).

Thus, it is important to identify peptide sequences that are not only antimicrobial (can

destroy or render inert the invading microorganism), but also antibiofilm (can penetrate the

extra-cellular matrix so it can get to the microorganism in the first place).

You have been provided with a training set (train.dat) and a test set (test.dat) consisting of

peptide sequences, one per line in the file. Peptides are encoded as strings with characters

from an alphabet of 20 characters, each representing an amino-acid residue. The training

set also includes the label for each sequence as 1 (antibiofilm) or -1 (not antibiofilm) as the

first character in each line of the training file, separated from the sequence by a tab (\t)

character.
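The file layout described above can be read with a few lines of Python. This is a minimal sketch, not a required implementation: the function names are hypothetical, and it assumes UTF-8 text files in which blank lines can be skipped.

```python
from typing import List, Tuple


def parse_labeled_line(line: str) -> Tuple[int, str]:
    """Split one training line of the form '<label>\\t<sequence>'."""
    label_text, sequence = line.rstrip("\n").split("\t", 1)
    return int(label_text), sequence.strip()


def load_training_file(path: str) -> Tuple[List[int], List[str]]:
    """Read train.dat: one labeled peptide per line (label is 1 or -1)."""
    labels: List[int] = []
    sequences: List[str] = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # assumption: blank lines carry no data
            label, sequence = parse_labeled_line(line)
            labels.append(label)
            sequences.append(sequence)
    return labels, sequences


def load_test_file(path: str) -> List[str]:
    """Read test.dat: one unlabeled peptide per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

For example, load_training_file("train.dat") would return two parallel lists, one of integer labels and one of peptide strings.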

The input to your classifiers will not be the peptides themselves, but rather features

extracted from the peptides. Two simple approaches for feature extraction are the bag-of-

words and the k-mer models you should have learned about in Data Mining or Machine

Learning, where a word is one of the amino-acids in the peptide. You should not use any

additional external data in this assignment.
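As an illustration of the two feature extraction approaches mentioned above, the sketch below counts k-mers in a peptide. With k = 1 it reduces to a bag-of-words over single amino acids. The function names are hypothetical, and the alphabet is an assumption: the standard one-letter amino-acid codes, which should be checked against the characters that actually appear in train.dat.

```python
from itertools import product
from typing import Dict, List

# Assumption: the 20-character alphabet is the standard one-letter amino-acid codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_vocabulary(k: int) -> Dict[str, int]:
    """Map every possible k-mer over the alphabet to a feature index."""
    return {
        "".join(chars): index
        for index, chars in enumerate(product(AMINO_ACIDS, repeat=k))
    }


def kmer_counts(sequence: str, k: int, vocabulary: Dict[str, int]) -> List[float]:
    """Count overlapping k-mers in a peptide; k-mers outside the vocabulary are ignored."""
    counts = [0.0] * len(vocabulary)
    for start in range(len(sequence) - k + 1):
        index = vocabulary.get(sequence[start:start + k])
        if index is not None:
            counts[index] += 1.0
    return counts
```

The vocabulary grows as 20^k (20 features for k = 1, 400 for k = 2, 8000 for k = 3), so with only 1566 training records larger values of k produce very sparse feature vectors.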

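Note that the dataset is imbalanced. We will use the Matthews correlation coefficient (MCC) as the evaluation metric for this assignment, which, similar to the F-1 score, combines aspects of the result's sensitivity and specificity. Given the normal confusion matrix resulting from comparing the predicted and true classes of the test samples, with TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, MCC is defined as

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$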
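For local validation, the standard MCC definition can be computed directly from the four confusion-matrix counts. The sketch below is one way to do so. The function name is hypothetical, and returning 0 when the denominator is zero is a common convention rather than something this assignment specifies.

```python
import math
from typing import Sequence


def mcc(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Matthews correlation coefficient for binary labels.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    """
    tp = tn = fp = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denominator
```

A classifier that predicts -1 for every sample scores 0 under this metric no matter how imbalanced the data is, which is why MCC is more informative than plain accuracy here.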

Programs:

You are required to write two separate programs for the classification. The first may only

use basic Python structures (from numpy or scipy) and you should implement your own

functions for training the neural network. This is also the program you will use to make CLP

submissions. In addition, you should write a second program that uses a deep learning

framework of your choice to train the neural network. The structure of the network may be

the same or different from the one you created in the first program. You will present results

from this program (which should be at least as good as those from the first program) in

your report.
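As a starting point for the first program, the sketch below trains a one-hidden-layer feed-forward network with plain numpy. Every design choice in it is an assumption made for illustration, not a requirement of the assignment: a tanh hidden layer, a sigmoid output, mean binary cross-entropy loss, full-batch gradient descent, and hypothetical function names and default hyperparameters.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_network(X, y, hidden=4, lr=0.2, epochs=1000, seed=0):
    """Train a one-hidden-layer network; X is (n, d), y holds labels in {-1, 1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    targets = ((y + 1) / 2).reshape(-1, 1)  # map {-1, 1} to {0, 1}
    W1 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, 1))
    b2 = np.zeros(1)
    for _ in range(epochs):
        # Forward pass.
        H = np.tanh(X @ W1 + b1)
        P = sigmoid(H @ W2 + b2)
        # Backward pass: gradient of mean cross-entropy w.r.t. the output pre-activation.
        dZ2 = (P - targets) / n
        dW2 = H.T @ dZ2
        db2 = dZ2.sum(axis=0)
        dZ1 = (dZ2 @ W2.T) * (1.0 - H ** 2)  # tanh'(z) = 1 - tanh(z)^2
        dW1 = X.T @ dZ1
        db1 = dZ1.sum(axis=0)
        # Gradient descent update.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return W1, b1, W2, b2


def predict(params, X):
    """Return labels in {-1, 1} using a 0.5 probability threshold."""
    W1, b1, W2, b2 = params
    P = sigmoid(np.tanh(X @ W1 + b1) @ W2 + b2)
    return np.where(P.ravel() >= 0.5, 1, -1)
```

On imbalanced data the 0.5 threshold is itself a tunable choice. Selecting the threshold that maximizes MCC on a held-out validation split is one option worth trying.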

Considerations:

Try extracting different features from the peptide strings.
Consider oversampling the negative class to fix the apparent imbalance.
Try out different network configurations and activation functions.
Consider regularization as a way to keep weights balanced in the network.
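One simple way to act on the oversampling suggestion is to duplicate randomly chosen samples of the smaller class until both classes have the same size. The sketch below does this for whichever class turns out to be rarer. The function name is hypothetical, and random duplication is only one of several possible oversampling schemes.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple


def oversample(features: Sequence, labels: Sequence[int], seed: int = 0) -> Tuple[List, List[int]]:
    """Randomly duplicate samples so every class reaches the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    balanced_features = list(features)
    balanced_labels = list(labels)
    for label, count in counts.items():
        pool = [x for x, y in zip(features, labels) if y == label]
        extras = [rng.choice(pool) for _ in range(target - count)]
        balanced_features.extend(extras)
        balanced_labels.extend([label] * len(extras))
    return balanced_features, balanced_labels
```

If you hold out part of train.dat for validation, oversample only the training portion. Duplicating samples before the split would place copies of the same peptide on both sides and inflate the validation score.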

Data Description:

The training dataset consists of 1566 records and the test dataset consists of 392 records.

We provide you with the training class labels and the test labels are held out. Your task is

to predict those labels for the peptides in the test set and create a test.txt file containing

those labels, which you will submit to CLP. Note that CLP only accepts files with extensions .txt or .dat for your predicted labels, and .py, .ipynb, .zip, or .tgz for code.
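A prediction file could be produced along the following lines. The exact layout CLP expects is not stated here, so writing one label per line, in the same order as the sequences in test.dat, is an assumption to verify against the submission site. The function name is hypothetical.

```python
from typing import Iterable


def write_predictions(path: str, labels: Iterable[int]) -> None:
    """Write predicted labels (1 or -1), one per line (assumed format)."""
    with open(path, "w", encoding="utf-8") as handle:
        for label in labels:
            if label not in (1, -1):
                raise ValueError(f"unexpected label: {label!r}")
            handle.write(f"{int(label)}\n")
```

Checking that the file contains exactly 392 lines, one per test record, before uploading avoids spending one of the 5 daily submissions on a malformed file.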

Rules:



This is an individual assignment. Discussion of broad-level strategies is allowed, but any copying of prediction files or source code will result in an honor code violation.



You are allowed 5 submissions per day.



After the submission deadline, only your chosen or last submission is considered for

the leaderboard.

Deliverables:



Valid submissions to the Leader Board website: https://clp.engr.scu.edu (username is

your SCU username and your password is your SCU password).

Canvas Submission for the report:



Include a 2-page, single-spaced report describing details regarding the steps you

followed for feature extraction, designing your neural network, and training your model.

The report should be in PDF format and the file should be called report.pdf. The report

needs to be structured as a technical report (title, abstract, introduction, sections,

conclusion), be free from grammatical errors, and use standard page and font sizes (letter

size page, 10 or 11 pt font). Be sure to include the following in the report:

Name and SCU ID.
Rank & MCC-score for your submission (at the time of writing the report). If

you chose not to see the leaderboard, state so.

Your approach.

Your methodology of choosing the approach and associated parameters.

Ensure you submitted the correct code on CLP that matches your output.

Zip up your report and code for both programs in an archive called .zip or .tgz and submit the archive to Canvas.

Grading:

Grading for the assignment will be split between your implementation (70%) and report (30%).

Extra credit (1% of final grade) will be awarded to the top-3 performing algorithms. Note

that extra credit throughout the semester will be tallied outside of Canvas and will be added

to the final grade at the end of the semester.

Files: available on Canvas.
