I have a file containing genes from different genomes. A gene is denoted by an ID such as NZ_CP019047.1_2993, and its genome by NZ_CP019047. The file looks like this:
NZ_CP019047.1_2993
NZ_CP019047.1_2994
NZ_CP019047.1_2995
NZ_CP019047.1_2999
NZ_CP019047.1_3000
NZ_CP019047.1_3001
NZ_CP019047.1_3003
KE699235.1_379
KE699235.1_1000
KE699235.1_1001
What I want to do is group the genes of each genome (if a genome has more than one gene) by their distance: if genes are fewer than 4 positions apart, I want to group them together. The position is the number after the last '_'. I want something like this:
[NZ_CP019047.1_2993,NZ_CP019047.1_2994,NZ_CP019047.1_2995]
[NZ_CP019047.1_2999,NZ_CP019047.1_3000,NZ_CP019047.1_3001,NZ_CP019047.1_3003]
[KE699235.1_1000,KE699235.1_1001]
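To make the grouping rule concrete, here is a minimal sketch that works on the positions alone. The function name group_by_gap and the max_gap parameter are made up for illustration; it assumes "nearer than 4 positions" means a difference of at most 3 between neighbouring positions, and that a gene with no close neighbour (such as 379 above) is left out.

```python
def group_by_gap(positions, max_gap=3):
    """Split positions into runs whose neighbouring values differ by at most max_gap."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= max_gap:
            # Close enough to the last position of the current group: extend it.
            groups[-1].append(pos)
        else:
            # Gap too large (or first position): start a new group.
            groups.append([pos])
    # Assumption: a group of a single gene is not wanted in the output.
    return [g for g in groups if len(g) > 1]


print(group_by_gap([2993, 2994, 2995, 2999, 3000, 3001, 3003]))
# [[2993, 2994, 2995], [2999, 3000, 3001, 3003]]
print(group_by_gap([379, 1000, 1001]))
# [[1000, 1001]]
```

The gap between 2995 and 2999 is exactly 4, which is why those two end up in different groups.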
What I have tried so far is creating a dictionary that holds, for each genome (in my case NZ_CP019047 and KE699235), all the numbers after '_'. Then I calculate the differences between consecutive numbers, and if a difference is less than 4 I try to group the two genes. The problem is that I get duplicates, and I have trouble when one genome has more than one group of genes, as in this case:
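For reference, a dictionary like the one described could be built roughly as follows. This is only a sketch: positions_by_prefix is a hypothetical helper name, and it keys on everything before the last '_' (for example NZ_CP019047.1 rather than NZ_CP019047), which is an assumption about how the IDs should be split. Splitting on the last underscore matters because IDs such as NZ_CP019047.1_2993 contain more than one.

```python
from collections import defaultdict


def positions_by_prefix(gene_ids):
    """Map each ID prefix (text before the last '_') to its sorted positions."""
    by_prefix = defaultdict(list)
    for gene_id in gene_ids:
        gene_id = gene_id.strip()
        if not gene_id:
            continue  # skip blank lines
        # rpartition splits on the LAST underscore only.
        prefix, _, number = gene_id.rpartition('_')
        by_prefix[prefix].append(int(number))
    return {prefix: sorted(nums) for prefix, nums in by_prefix.items()}


ids = ['NZ_CP019047.1_2994', 'NZ_CP019047.1_2993', 'KE699235.1_379']
print(positions_by_prefix(ids))
# {'NZ_CP019047.1': [2993, 2994], 'KE699235.1': [379]}
```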
[NZ_CP019047.1_2993,NZ_CP019047.1_2994,NZ_CP019047.1_2995]
[NZ_CP019047.1_2999,NZ_CP019047.1_3000,NZ_CP019047.1_3001,NZ_CP019047.1_3003]
This is my code:
for key in sortedDict1:
    cassette = ''
    differences = []
    numbers = sortedDict1[key]
    differences = [x - numbers[i - 1] for i, x in enumerate(numbers)][1:]
    print(differences)
    for i in range(0,len(differences)):
        if differences[i] <= 3:
            pos = i
            el1 = key + str(numbers[i])
            el2 = key + str(numbers[i+1])
            cas = el1 + ' '
            cassette += cas
            cas = el2 + ' '
            cassette += cas
        else:
            cassette + '/n'
        i+=1
The variable cassette is what holds the groups. Can someone please help?
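One possible reworking of the loop, offered as a sketch rather than a definitive fix. In the code above, each close pair appends both el1 and el2, so a gene in the middle of a run is added twice, and `cassette + '/n'` builds a string without storing it, so groups are never separated. The version below keeps one list per group instead of a string. The sample contents of sortedDict1 are made up, and its keys are assumed to be the ID prefix before the last '_'.

```python
# Hypothetical sample data: prefix -> sorted positions.
sortedDict1 = {
    'NZ_CP019047.1': [2993, 2994, 2995, 2999, 3000, 3001, 3003],
    'KE699235.1': [379, 1000, 1001],
}

cassettes = []
for key, numbers in sortedDict1.items():
    if not numbers:
        continue
    current = [numbers[0]]
    # Compare each position with the one before it.
    for prev, num in zip(numbers, numbers[1:]):
        if num - prev <= 3:
            current.append(num)  # same group, each gene added once
        else:
            if len(current) > 1:  # drop genes with no close neighbour
                cassettes.append([f'{key}_{n}' for n in current])
            current = [num]  # start the next group
    # Flush the last group of this key.
    if len(current) > 1:
        cassettes.append([f'{key}_{n}' for n in current])

for cassette in cassettes:
    print(cassette)
# ['NZ_CP019047.1_2993', 'NZ_CP019047.1_2994', 'NZ_CP019047.1_2995']
# ['NZ_CP019047.1_2999', 'NZ_CP019047.1_3000', 'NZ_CP019047.1_3001', 'NZ_CP019047.1_3003']
# ['KE699235.1_1000', 'KE699235.1_1001']
```

Starting a fresh list whenever the gap is too large is what lets a single genome produce several groups.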
Question from: https://stackoverflow.com/questions/66063626/how-to-group-genes-regarding-their-id-and-position-python