pandas - parsing 4 dataframes and a FASTA file

Question

asked Jan 31, 2022 in Technique by 深蓝 (71.8m points)

I have 4 dataframes holding information about genes predicted with AUGUSTUS for 2 different species. For each species I ran the prediction twice: once with its own training parameters, and once with the training parameters of the other species (sp1 parameters for sp2, and sp2 parameters for sp1).

Here is an example of the naming scheme, to make this clearer.

0035: Lepidoptera
0042: WASP

g1.t1_0035_0035 : this gene was predicted with the database of species 0035 and its own training parameters.

g1.t1_0035_0042 : this gene was predicted with the database of species 0035 and the training parameters of species 0042.

g1.t1_0042_0042 : this gene was predicted with the database of species 0042 and its own training parameters.

g1.t1_0042_0035 : this gene was predicted with the database of species 0042 and the training parameters of species 0035.
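The naming scheme above can be split mechanically. The following is a minimal sketch, assuming every ID ends in exactly two underscore-separated codes; the helper name `split_gene_id` is hypothetical and not part of the original post.

```python
def split_gene_id(gene_id):
    """Split an ID such as 'g1.t1_0035_0042' into (gene, database, training)."""
    # Split from the right so that any underscore inside the gene name
    # itself is left untouched. An ID with fewer than two underscores
    # raises ValueError on unpacking.
    gene, database, training = gene_id.rsplit("_", 2)
    return gene, database, training


print(split_gene_id("g1.t1_0035_0042"))
```

The second element identifies the database species and the third the species whose training parameters were used.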

I now have 4 dataframes that look like this:

gene_name   scaf_name       scaf_length cov_depth       GC
g3.t1       scaffold 6      56786         79            0.39
g4.t1       scaffold 6      56786         79            0.39
g1.t1       scaffold 256    789765        86            0.42
g2.t1       scaffold 890    3456          85            0.40
g5.t1       scaffold 1234   590           90            0.41

As you can see, the gene names in these dataframes do not carry the _number1_number2 suffix. Instead, each file corresponds to one specific combination. Here are the file names:

ggf_0042_0042.csv for all the genex_0042_0042
ggf_0042_0035.csv for all the genex_0042_0035
ggf_0035_0035.csv for all the genex_0035_0035
ggf_0035_0042.csv for all the genex_0035_0042

What I would like to do is simply parse a FASTA file, for example:

>g13600.t1_0042_0042
MERVINTQLLRYLEDHQLISDRQYGFR...
>g34744.t1_0042_0035
MSVPAHVAQIFEAIRRSGQQIDED...
>g28436.t1_0035_0042
WKKAKAENALDSYHHNHLMSEE...
>g14327.t1_0042_0042
MTYGAETWSLTVGLVRKLRVTQR...
>g30148.t1_0035_0042
MLRPVLSSKLPTNTKLRVYKTYIRSRLTY...
>g24481.t1_0035_0035
PCAGSNIKLKGTECFEKSFEVCLRNY...

Then, for each record: if the gene name ends with _0035_0035, open the file ggf_0035_0035.csv, grab the row with the same gene name, and add that row to a new dataframe.
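The first half of that rule, getting the IDs out of the FASTA file, needs nothing beyond the standard library. Here is a rough sketch; the function name `fasta_ids` is hypothetical, and the input is an in-memory copy of the first two records shown above (sequences left truncated as in the post, since only the header lines are read).

```python
import io


def fasta_ids(handle):
    """Yield the ID (first word after '>') of each record in a FASTA stream."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare '>' with no ID
                yield fields[0]


# In-memory stand-in for a real file opened with open("example.fasta").
example = io.StringIO(
    ">g13600.t1_0042_0042\n"
    "MERVINTQLLRYLEDHQLISDRQYGFR...\n"
    ">g34744.t1_0042_0035\n"
    "MSVPAHVAQIFEAIRRSGQQIDED...\n"
)

ids = list(fasta_ids(example))
print(ids)
```

Each ID can then be split on its last two underscores to find out which CSV file to consult.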

Here is a hypothetical example of the output:

gene_name               scaf_name       scaf_length   cov_depth       GC
g345.t1_0035_0035       scaffold 567      56778         78            0.39
g23.t1_0042_0035        scaffold 43       434           79            0.43
g46.t1_0042_0042        scaffold 276      785660        87            0.41
g2.t1_0042_0035         scaffold 845      345656        87            0.40

and so on...


1 Answer

深蓝 · Answer 1 · 2022-01-31T07:16:06+0000

Using Biopython and pandas:

import pandas as pd
from Bio import SeqIO

First, create a dictionary that caches each CSV file once it has been read:

ggf = {}

Now iterate over the records

for record in SeqIO.parse("example.fasta", "fasta"):
    id_ = record.id

Split the ID and skip any record that does not have the form gene_database_training:

    parts = id_.split('_')
    if len(parts) != 3:
        continue

Check whether the matching CSV file has already been read, and read it if not:

    if (parts[1], parts[2]) not in ggf:
        # str.join takes a single iterable, so pass the pieces as a list
        f_name = '_'.join(['ggf', parts[1], parts[2]]) + '.csv'
        ggf[(parts[1], parts[2])] = pd.read_csv(f_name)

Now select the row for this gene:

    df = ggf[(parts[1], parts[2])]
    row = df[df.gene_name == parts[0]]  # row(s) whose gene_name matches this record
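To end up with a single output dataframe, the selected rows still have to be collected and concatenated. The sketch below shows one way to do that. It makes several assumptions: the tables are small in-memory stand-ins with made-up values rather than the real CSV files, the IDs are given as a plain list rather than read with SeqIO, and the names `tables`, `fasta_ids`, and `result` are hypothetical.

```python
import pandas as pd

# Stand-ins for the CSV files. With real data each table would come from
# pd.read_csv(f"ggf_{database}_{training}.csv"). Values here are made up.
tables = {
    ("0035", "0035"): pd.DataFrame({
        "gene_name": ["g1.t1", "g2.t1"],
        "scaf_name": ["scaffold 256", "scaffold 890"],
        "scaf_length": [789765, 3456],
        "cov_depth": [86, 85],
        "GC": [0.42, 0.40],
    }),
    ("0042", "0035"): pd.DataFrame({
        "gene_name": ["g1.t1"],
        "scaf_name": ["scaffold 6"],
        "scaf_length": [56786],
        "cov_depth": [79],
        "GC": [0.39],
    }),
}

# IDs as they would appear in the FASTA headers.
fasta_ids = ["g2.t1_0035_0035", "g1.t1_0042_0035", "g9.t1_0042_0042"]

rows = []
for gene_id in fasta_ids:
    parts = gene_id.rsplit("_", 2)
    if len(parts) != 3:
        continue  # ID does not follow gene_database_training
    gene, database, training = parts
    df = tables.get((database, training))
    if df is None:
        continue  # no table available for this combination
    match = df[df["gene_name"] == gene].copy()
    # Restore the full ID so rows from different tables stay distinguishable.
    match["gene_name"] = gene_id
    rows.append(match)

# pd.concat raises on an empty list, so real code should check `rows` first.
result = pd.concat(rows, ignore_index=True)
print(result)
```

Writing the full ID back into gene_name matches the output layout requested in the question, where each row keeps its _database_training suffix.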
