Wednesday 24 August 2016

Edit Distance

In this problem we are looking for the Levenshtein distance between two protein sequences. The distance is calculated as the number of single character changes (insertion, deletion or substitution) needed to turn one sequence into the other. One way to do this is described in the Wikipedia page and involves building a matrix of distances between the prefixes of the two strings. The last value to be calculated (in the bottom right position of the matrix) then equals the distance between the two strings.

Sample Dataset
>Rosalind_39
PLEASANTLY
>Rosalind_11
MEANLY

Expected Output
5

The following python3 code initiates a matrix M with len(s) rows and len(t) columns. It then fills the matrix based on if and what type of change is needed between the two strings. The last value is then printed as the answer.

from Bio import SeqIO
seqs = []
with open('sampledata.fasta', 'r') as f:
    for record in SeqIO.parse(f, 'fasta'):
        seqs.append(str(record.seq))
s = seqs[0]
t = seqs[1]
M = [[0 for j in range(len(t) + 1)] for i in range(len(s) + 1)]
for i in range(1, len(s) + 1):
    M[i][0] = i
for i in range(1, len(t) + 1):
    M[0][i] = i

for i in range(1, len(s) + 1):
    for j in range(1, len(t) + 1):
        if s[i - 1] == t[j - 1]:
            M[i][j] = M[i - 1][j - 1]
        else:
            M[i][j] = min(M[i - 1][j] + 1, M[i][j - 1] + 1,
                          M[i - 1][j - 1] + 1)
print(M[len(s)][len(t)])

No comments:

Post a Comment