COMP 462 / 561 – Homework #3
IMPORTANT: Question 1 of this assignment requires a substantial amount of
programming, and portions (c), (d), and (e) depend on you having a working program. Do
not leave this to the last minute! Get started with the programming early, and get our
help if you need.
Question 1 (80 points).
In this question, you will implement the Viterbi algorithm and apply it to the gene finding
HMM seen in class to predict genes in a bacterial genome.
Vibrio cholerae is the bacteria that causes cholera. It was first sequenced in 2000 ( see
https://www.nature.com/articles/35020000 ). Download the genome sequence here:
• When I click the link above in Acrobat Reader, it prompts me for a password,
which I don’t have. Instead, just copy-paste the link into your browser, and it
won’t ask you for that password. Weird…
• Although the genome of Vibrio cholera Vibrio cholera consists of two
chromosomes, the sequence file contains more than two sequences, because it is
Download the gene annotation here:
The format of both files should be fairly self-explanatory, but you can find more
information about the gff3 file format here: http://gmod.org/wiki/GFF3. The file contains
annotations of different types of functional elements. We will only consider those called
“CDS” (for CoDing Sequences); ignore the other lines (labeled “gene” or “exon” or
• One important detail we’ve omitted in our in-class discussion is that genes can
actually be located either on the forward or the reverse DNA strand. In class, we
have only considered the genes located on the forward strand. In this assignment,
we will also only focus on the genes on the forward strand. It is not much more
complicated to handle genes on the negative strand, but it’s just annoying, so we
will pretend that genes on the negative strand do not exist, i.e. that portions of the
genome occupied by genes on the reverse strand are actually intergenic regions.
• Another level of complexity that we will ignore is the (fairly rare) possibility that
two genes can overlap each other. This obviously breaks our HMM gene finder.
Because such events are very rare, we will assume that this never occurs. The gff3
file given to you actually contains a few of those pairs of overlapping genes. You
can deal with them in whichever way you find simplest; this is not going to affect
your results significantly.
• In bacteria, there are actually multiple possible start codons (not only ATG).
a) (10 points) Using the genome sequence and gene annotation for Vibrio cholerae,
infer the (i) average length of intergenic regions, (ii) average length of genic
region, (iii) the nucleotide frequency table for intergenic regions; (iv) the codon
frequency table for genic regions.
b) (30 points) Using the programming language of your choice, implement the
Viterbi algorithm. Your program does not need to work with an arbitrary HMM.
It but only need to work for the specific bacterial gene-finding HMM seen in class
(4 states: Intergenic, Start, Middle, Stop). The program’s argument should be
I. A Fasta file containing one or more bacterial genome sequences.
II. A configuration file (formatted as you want), which contains the
information found in (a), containing the emission and transition
probabilities for your HMM. (Assume that the initial state probability is 1
for the intergenic state, and zero for all other states).
The output should be a GFF3 file with one line per predicted gene (only lines
corresponding to portions annotated as genes need to be included), e.g. something
DN38.contig00011 ena CDS 85269 86705 . + 0 .
DN38.contig00011 ena CDS 87006 87230 . + 0 .
DN38.contig00011 ena CDS 88079 89323 . + 0 .
Submit your code, along with a simple explanation about how to run it.
c) (10 points) Vibrio vulnificus is a species of bacteria closely related to Vibrio
cholerae. In 2005, it caused an outbreak following the hurricane Katrina
(https://www.ncbi.nlm.nih.gov/pubmed/16195696 ). Its genome was sequenced
and can be downloaded here:
Run your program on this genome, using the parameters obtained for Vibrio
cholerae. Submit the GFF3 file with your gene predictions (forward strand only).
d) (15 points) The genes of Vibrio vulnificus were annotated based on a variety of
types of evidence. You can download the GFF3 file here:
Evaluate the accuracy of your predictions against the gene annotation available
here (always limiting our attention to genes on the positive strand). Specifically,
• The fraction of annotated genes on the positive strand that:
o Perfectly match both ends of one of your predicted genes
o Match the start but not the end of a predicted gene
o Match the end but not the start of a predicted gene
o Do not match neither the start not the end of a predicted gene
• The fraction of your predicted genes that:
o Perfectly match both ends of one of an annotated genes
o Match the start but not the end of an annotated gene
o Match the end but not the start of an annotated gene
o Do not match neither the start not the end of an annotated gene
Note: Gene prediction is not an easy task! You should expect that only
approximately half of the annotated genes are perfectly predicted.
e) (15 points) What properties of annotated genes are associated to an elevated risk
of being partially or completely missed by your predictor? What are the properties
of genes predicted by your predictor that do not match an annotated gene? Write a
paragraph for each question, supported by data and/or figures.
Question 2. (20 points)
If we think of an HMM as a machine to generate a random sequence of observations, the
number of consecutive steps the path will remain at a given state follows a geometric
distribution (https://en.wikipedia.org/wiki/Geometric_distribution ). This means, for
example, that, in our gene-finding HMM, the length distribution of exons, introns, and
intergenic regions will be assumed to be geometric. However, in reality, these regions
have length distributions that can be far from geometric. Consider the very simple twostate gene finding HMM shown below. Assume that we have a desired gene length
distribution, provided in the form of a discrete probability distribution for lengths ranging
from 1 to 1000: Pr[length = k] = pk.
Describe (in at most half a page) how you could modify the HMM below to produce a
second HMM where the distribution over the duration of stay in the gene state (or states)
is exactly the target length distribution. Note that the Gene state will probably need to be
subdivided in several Gene sub-states. Describe not only the states of your HMM but also
the transition probabilities.