Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I have a file containing genes of different genomes. Gene is denoted by NZ_CP019047.1_2993 and Genome by NZ_CP019047 They look like this :

NZ_CP019047.1_2993 
NZ_CP019047.1_2994 
NZ_CP019047.1_2995 
NZ_CP019047.1_2999 
NZ_CP019047.1_3000
NZ_CP019047.1_3001
NZ_CP019047.1_3003
KE699235.1_379 
KE699235.1_1000
KE699235.1_1001

what I want to do is group the genes of a genome (if a genome has more than 1 gene) regarding their distance meaning, if I have genes nearer than 4 positions I want to group them together.The position can be understood as the number after '_'. I want something like these:

[NZ_CP019047.1_2993,NZ_CP019047.1_2994,NZ_CP019047.1_2995]
[NZ_CP019047.1_2999,NZ_CP019047.1_3000,NZ_CP019047.1_3001,NZ_CP019047.1_3003]
[KE699235.1_1000,KE699235.1_1001]

What I have tried so far is creating a dictionary holding for each genome, in my case NZ_CP019047 and KE699235, all the number after '_'. Then I calculate their differences, if it is less than 4 I try to group them. The problem is that I am having duplication and I am having problem in the case when 1 genome has more than 1 group of genes like this case :

[NZ_CP019047.1_2993,NZ_CP019047.1_2994,NZ_CP019047.1_2995]
[NZ_CP019047.1_2999,NZ_CP019047.1_3000,NZ_CP019047.1_3001,NZ_CP019047.1_3003]

This is my code:

for key in sortedDict1:
        cassette = ''
        differences = []
        numbers = sortedDict1[key]
        differences = [x - numbers[i - 1] for i, x in enumerate(numbers)][1:]
        print(differences)
        for i in range(0,len(differences)):
            if differences[i] <= 3:
                pos = i
                el1 = key + str(numbers[i])
                el2 = key + str(numbers[i+1])
                cas = el1 + ' ' 
                cassette += cas
                cas = el2 + ' '
                cassette += cas
            else:
                cassette + '/n' 
                i+=1

I am referring to groups with variable cassette. Can someone please help?

question from:https://stackoverflow.com/questions/66063626/how-to-group-genes-regarding-their-id-and-position-python

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
134 views
Welcome To Ask or Share your Answers For Others

1 Answer

Please see below. You can modify the labels and distances to your requirements.

def get_genome_groups(genome_info):
    genome_info.sort(key = lambda x: (x.split('.')[0], int(x.split('_')[-1])))
    #print(genome_info)
    genome_groups = []
    close_genome_group = []
    last_genome = ''
    position = 0
    last_position = 0

    #'NZ_CP019047.1_2995',
    for genomes in genome_info:
        genome, position = genomes.split('.')
        position = int(position.split('_')[1])

        if last_genome and (genome != last_genome):
            genome_groups.append(close_genome_group)
            close_genome_group = []

        elif close_genome_group and position and (position > last_position+3):
            genome_groups.append(close_genome_group)
            close_genome_group = []

        if genomes:
            close_genome_group.append(genomes)
        last_position = position
        last_genome = genome

    if close_genome_group:
        genome_groups.append(close_genome_group)

    return genome_groups

if __name__ == '__main__':
    genome_group = get_genome_groups(genome_info)
    print(genome_group)

user@Inspiron:~/code/general$ python group_genes.py 
[['KE699235.1_379'], ['KE699235.1_1000', 'KE699235.1_1001'], ['NZ_CP019047.1_2993', 'NZ_CP019047.1_2994', 'NZ_CP019047.1_2995'], ['NZ_CP019047.1_2999', 'NZ_CP019047.1_3000', 'NZ_CP019047.1_3001', 'NZ_CP019047.1_3003']]
user@Inspiron:~/code/general$ 



与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...