Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

I have an xls file with one column of 10000 strings, and I want to do a few things with it.

1- Make a heatmap or a cluster figure that shows the similarity percentage between each string and every other one.

To find the percentage of similarity between one string and another, I found the post "Find the similarity percent between two strings" and tried to make it work for me.

As an example, I have these in an xls file, where each line is one string:

AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR
AAAAAGPLQPETENAGTSV
AAAAANNGAAPPDLSLMALAR
AAAAASAVNDYYGTWGQK
AAAAASGASNTDSSATKPK
AAAAGFNWDDADVK
AAAAGFNWDDADVKK

I could not figure out how to use that example when there are many combinations. For instance, in my example I have 7 strings, and each one has a similarity with every other one.

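import xlrd
from difflib import SequenceMatcher

# Note: xlrd 2.0 and later reads only .xls files, not .xlsx.
workbook = xlrd.open_workbook('data.xlsx')

def similar(a, b):
    # Ratio in [0, 1]; 1.0 means the two strings are identical.
    return SequenceMatcher(None, a, b).ratio()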


1 Answer

Taking two of the strings from your list as a sample, I offer this way of calculating a measure.

>>> from collections import Counter
>>> stringA = 'AAAAAGPLQPETENAGTSV'
>>> stringB = 'AAAAANNGAAPPDLSLMALAR'
>>> unionSize = len(stringA) + len(stringB)
>>> A=Counter(list(stringA))
>>> B=Counter(list(stringB))
>>> A
Counter({'A': 6, 'G': 2, 'E': 2, 'T': 2, 'P': 2, 'V': 1, 'Q': 1, 'S': 1, 'N': 1, 'L': 1})
>>> B
Counter({'A': 9, 'L': 3, 'N': 2, 'P': 2, 'G': 1, 'M': 1, 'S': 1, 'R': 1, 'D': 1})
>>> symDiff = set(A.keys()).symmetric_difference(set(B.keys()))
>>> symDiff
{'M', 'V', 'Q', 'E', 'T', 'D', 'R'}
>>> symDiffSize = 0
>>> for key in symDiff:
...     if key in A.keys():
...         symDiffSize += A[key]
...     else:
...         symDiffSize += B[key]
...         
>>> symDiffSize, unionSize
(9, 40)

The measure I have in mind is the fraction unionSize / symDiffSize, here 40/9. If the two strings had all letters in common, there would be no letters in their 'symmetric difference', which would make the denominator zero. So the more letters the strings share and the fewer that are unshared, the greater the fraction. You could perhaps take its logarithm.
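The counting steps from the session above can be wrapped in a small function. This is only a minimal sketch: the names `sym_diff_measure` and `unshared_fraction` are made up for illustration, and dividing the unshared count by the combined length (rather than the other way round) is an assumption made here to sidestep the zero denominator, not something the session itself does.

```python
from collections import Counter

def sym_diff_measure(a, b):
    """Return (symDiffSize, unionSize) for two strings, as in the session above."""
    count_a, count_b = Counter(a), Counter(b)
    # Letters that occur in exactly one of the two strings.
    unshared = set(count_a) ^ set(count_b)
    # Each unshared letter is present in only one counter; the other yields 0.
    sym_diff_size = sum(count_a[k] + count_b[k] for k in unshared)
    return sym_diff_size, len(a) + len(b)

def unshared_fraction(a, b):
    """Hypothetical normalisation: share of characters that are unshared, in [0, 1]."""
    sym_diff_size, union_size = sym_diff_measure(a, b)
    # Two empty strings have nothing unshared, so report 0.0.
    return sym_diff_size / union_size if union_size else 0.0

print(sym_diff_measure('AAAAAGPLQPETENAGTSV', 'AAAAANNGAAPPDLSLMALAR'))
print(unshared_fraction('AAAAAGPLQPETENAGTSV', 'AAAAANNGAAPPDLSLMALAR'))
```

With this orientation a value of 0 means every letter of each string also occurs in the other, and larger values mean more unshared characters.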


I don't have Excel. This code accepts a list of strings, which you could extract from Excel. It computes the multiset (aka bag) of each string only once to save resources. Also, it returns a pair rather than a ratio, because the denominator can sometimes be zero.

from collections import Counter

strings = [
    'AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR', 
    'AAAAAGPLQPETENAGTSV', 
    'AAAAANNGAAPPDLSLMALAR', 
    'AAAAASAVNDYYGTWGQK', 
    'AAAAASGASNTDSSATKPK', 
    'AAAAGFNWDDADVK', 
    'AAAAGFNWDDADVKK', 
    ]

class NikDistance():
    def __init__(self, strings):
        # Precompute lengths and letter counts once per string.
        self.stringLengths = [len(s) for s in strings]
        self.stringCounters = [Counter(s) for s in strings]

    def __call__(self, i, j):
        # Combined length of the two strings.
        unionSize = self.stringLengths[i] + self.stringLengths[j]
        # Letters that occur in exactly one of the two strings.
        symDiff = set(self.stringCounters[i]).symmetric_difference(self.stringCounters[j])
        symDiffSize = 0
        for key in symDiff:
            if key in self.stringCounters[i]:
                symDiffSize += self.stringCounters[i][key]
            else:
                symDiffSize += self.stringCounters[j][key]
        return (symDiffSize, unionSize)

nikDistance = NikDistance(strings)

for i in range(len(strings)):
    for j in range(i+1, len(strings)):
        print (strings[i], strings[j], nikDistance(i,j))

Result:

AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAAGPLQPETENAGTSV (7, 52)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAANNGAAPPDLSLMALAR (11, 54)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAASAVNDYYGTWGQK (9, 51)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAASGASNTDSSATKPK (9, 52)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAGFNWDDADVK (13, 47)
AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR AAAAGFNWDDADVKK (14, 48)
AAAAAGPLQPETENAGTSV AAAAANNGAAPPDLSLMALAR (9, 40)
AAAAAGPLQPETENAGTSV AAAAASAVNDYYGTWGQK (10, 37)
AAAAAGPLQPETENAGTSV AAAAASGASNTDSSATKPK (8, 38)
AAAAAGPLQPETENAGTSV AAAAGFNWDDADVK (15, 33)
AAAAAGPLQPETENAGTSV AAAAGFNWDDADVKK (16, 34)
AAAAANNGAAPPDLSLMALAR AAAAASAVNDYYGTWGQK (14, 39)
AAAAANNGAAPPDLSLMALAR AAAAASGASNTDSSATKPK (9, 40)
AAAAANNGAAPPDLSLMALAR AAAAGFNWDDADVK (12, 35)
AAAAANNGAAPPDLSLMALAR AAAAGFNWDDADVKK (13, 36)
AAAAASAVNDYYGTWGQK AAAAASGASNTDSSATKPK (6, 37)
AAAAASAVNDYYGTWGQK AAAAGFNWDDADVK (6, 32)
AAAAASAVNDYYGTWGQK AAAAGFNWDDADVKK (6, 33)
AAAAASGASNTDSSATKPK AAAAGFNWDDADVK (10, 33)
AAAAASGASNTDSSATKPK AAAAGFNWDDADVKK (10, 34)
AAAAGFNWDDADVK AAAAGFNWDDADVKK (0, 29)

Consider the last item. There are 29 characters altogether, and there are no (zero) characters that don't appear in both strings.

Look at the penultimate item. There are a total of 34 characters. Ten (10) of them are occurrences of letters that appear in only one of the two strings.
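To get from these pairs towards the heatmap asked about in the question, one option is to fill a square matrix with a percentage per pair. The sketch below is an assumption-laden illustration, not part of the original answer: the function name `similarity_matrix` is hypothetical, and the score 100 * (1 - symDiffSize / unionSize) is just one possible way to turn each pair into a percentage.

```python
from collections import Counter

strings = [
    'AAAAAAAAAAAAAADPVAGDHENIVSDSTQASR',
    'AAAAAGPLQPETENAGTSV',
    'AAAAANNGAAPPDLSLMALAR',
    'AAAAASAVNDYYGTWGQK',
    'AAAAASGASNTDSSATKPK',
    'AAAAGFNWDDADVK',
    'AAAAGFNWDDADVKK',
]

def similarity_matrix(strings):
    """Return an n x n list of lists of percentages (hypothetical scoring)."""
    counters = [Counter(s) for s in strings]
    n = len(strings)
    # Every string scores 100 against itself.
    matrix = [[100.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Letters that occur in exactly one of the two strings.
            unshared = set(counters[i]) ^ set(counters[j])
            sym = sum(counters[i][k] + counters[j][k] for k in unshared)
            total = len(strings[i]) + len(strings[j])
            score = 100.0 * (1 - sym / total) if total else 100.0
            # The measure is symmetric, so fill both halves.
            matrix[i][j] = matrix[j][i] = score
    return matrix

matrix = similarity_matrix(strings)
for row in matrix:
    print(' '.join(f'{value:6.1f}' for value in row))
```

A matrix like this could then be handed to a plotting library to draw the heatmap. Note that this score ignores letter order, so 100 does not mean the strings are identical (the last two strings score 100 although one has an extra K); the `SequenceMatcher` ratio from the question is order-sensitive and could be substituted for the score if that matters.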

