Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

My code loops through multiple files in a directory, parses each file and appends the parsed content of each file to FinalVariantfile.txt.

The code works, but duplicates the content of each file.

When I ran the code with two files the output contained 4 files. Could someone please explain why this is happening and how to fix this?

    #!/usr/bin/perl -w

    use strict;

    #directory structure

    my $home         = "/data/";
    my $tsvdirectory = $home . "test_all_runs/" . $ARGV[0];
    my $tsvfiles     = $home . "test_all_runs/" . $ARGV[0] . "/tsv_files.txt";

    my $FinalVariants = $home . "test_all_runs/" . $ARGV[0] . "/FinalVariantfile.txt";

    my @tsvfiles        = ();
    my @currentlines    = ();
    my $currentline     = '';
    my $currentCNVline  = '';
    my @currentCNVlines = ();
    my @HotSpotLines    = ();
    my @CNVLines        = ();

    # command to produce the vcf_files.txt file stored in each individual run
    # directory; the file list includes solely vcf files which have not been
    # previously prepared and/or annotated
    my $cmd = `ls $tsvdirectory/FOCUS*.tsv > $tsvfiles`;

    # print "$cmd";
    my $cmda = "ls $tsvdirectory/FOCUS*.tsv > $tsvfiles";

    # print "$cmda";

    # this code opens the vcf_files.txt file and passes each line into an array for
    # indidivudal manipulation
    open( TXT2, "$tsvfiles" );
    while ( <TXT2> ) {
        push( @tsvfiles, $_ );
    }
    close(TXT2);

    foreach ( @tsvfiles ) {
        chop($_);
    }

    # this code then parses each of the files listed by name in the tsvfiles array
    foreach ( @tsvfiles ) {

        my $currenttsvfile = "$_";    # establishes the current file being manipulated

        my $MDLfinaltsvfile = $currenttsvfile;
        $MDLfinaltsvfile =~ s/.tsv/_prepared.txt/g;

        # this series of variable calls names the various intermediate or
        # final output files

        my $MDLlinestsvfile = $currenttsvfile;
        $MDLlinestsvfile =~ s/.tsv/_withCNV.txt/g;

        my $Variantlinestsvfile = $currenttsvfile;
        $Variantlinestsvfile =~ s/.tsv/_HotSpot.txt/g;

        my $MDLtsvfile = $currenttsvfile;
        $MDLtsvfile =~ s/.tsv/_FilteredAllcolumns.txt/g;

        my $MDLsampleid = $currenttsvfile;
        $MDLsampleid =~ s/-oncogene.tsv//g;
        print "The currentVCFis############# " . $currenttsvfile . "
";

        my @SampleID = ();
        @SampleID = split ///, $MDLsampleid;
        print "The sampleIDis##############" . $SampleID[4] . "
";

        my $CNVdata = $currenttsvfile;
        $CNVdata =~ s/.tsv/_cnv.txt/g;

        my $FinalCNVdata = $currenttsvfile;
        $FinalCNVdata =~ s/.tsv/_finalcnv.txt/g;

        my $cmd2 = `fgrep -v "#" $currenttsvfile > $MDLlinestsvfile`;
        print "$cmd2";    # this code extracts from the current vcf file all of the
                          # lines of data and outputs them into a separate file

        my $cmd5 = `grep -vwE "(CNV|intronic|synonymous|utr_3|utr_5)" 
#removes lines that contain CNV/intronic/synonymous/utr_3/utr_5"

$MDLlinestsvfile > $Variantlinestsvfile`;
        print "$cmd5";

        open( my $fh_in, '<', $Variantlinestsvfile )
                or die "cannot open $Variantlinestsvfile: $!
"; 
#removes lines that contain 0/0 and ./. genotypes from field 70.

        open( my $fh_out, '>', $MDLtsvfile )
                or die "cannot open $MDLtsvfile: $!
";

        while ( my $line = <$fh_in> ) {

            # tab/field-based:
            my @fields = split( /s+/, $line );
            print $fh_out $line unless ( $fields[70] =~ m|([0.])/1| );
        }
        close($fh_in);
        close($fh_out);

        #open each filtered file with all columns and pushes it into array.
        open( TXT2, "$MDLtsvfile" );
        while (<TXT2>) {
            push( @HotSpotLines, $_ );
        }
        close(TXT2);

        foreach (@HotSpotLines) {
            chop($_);

            my @HotSpotEntries = ();
            my $currentMDLline = $_;
            @HotSpotEntries = split( /	/, $currentMDLline );

            my $chr        = $HotSpotEntries[9];
            my $position   = $HotSpotEntries[10];
            my $cosmicids  = $HotSpotEntries[21];
            my $refforward = $HotSpotEntries[67];
            my $genotype   = $HotSpotEntries[70];
            my $altforward = $HotSpotEntries[77];
            my $altreverse = $HotSpotEntries[78];
            my $cDNA       = $HotSpotEntries[81];
            my $exon       = $HotSpotEntries[83];
            my $conseq     = $HotSpotEntries[84];
            my $location   = $HotSpotEntries[88];
            my $geneclass  = $HotSpotEntries[92];
            my $aachange   = $HotSpotEntries[98];
            my $transcript = $HotSpotEntries[100];

            $currentline
                    = $SampleID[4] . "	"
                    . $chr . "	"
                    . $position . "	"
                    . $cosmicids . "	"
                    . $refforward . "	"
                    . $refreverse . "	"
                    . $genotype . "	"
                    . $altforward . "	"
                    . $altreverse . "	"
                    . $cDNA . "	"
                    . $exon . "	"
                    . $conseq . "	"
                    . $location . "	"
                    . $geneclass . "	"
                    . $aachange . "	"
                    . $transcript;

            # print "The currentVCFlineis ".$currentline."
";
            push( @currentlines, $currentline );

        }

        my $i;

        for ( $i = 0; $i < @currentlines; $i += 1 ) {

            my $currentguiline = $currentlines[$i];

            my $cmd5 = `echo "$currentguiline" >> $FinalVariants`;
            print "$cmd5";

            #my $cmd9 = `sed -i '1i$SampleID[4]' $FinalVariants`; print $cmd9;
        }
    }
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
129 views
Welcome To Ask or Share your Answers For Others

1 Answer

There is no need to start so many new shell subprocesses to do such basic operations. ls, fgrep, grep and echo all have equivalents in Perl, and especially calling echo for each line of text is a very poor way of copying one file to another

I suspect that your problem is because of the line

my $cmd5 = `echo "$currentguiline" >> $FinalVariants`;

which will append each element of @currentlines to the end of the file. So the first time you run your program it will contain a single copy of the result, but every subsequent run will just add more data to the end of your file and it will keep growing

I hate to offer a hack to get things working, but it would take me ages to understand what your program is doing behind all the confusion and write a proper concise version. You can fix it temporarily by adding the line

unlink $FinalVariants or die $!;

before the foreach ( @tsvfiles ) { ... } loop. This will delete the file and ensure that a new version is created for each execution of your program.



Okay, I've studied your code carefully and I think this will do what you intend. Without any data or even file name samples I've been unable to test it beyond making sure that it compiles, so it will be a miracle if it works first time, but I believe it's the best chance you have of getting a coherent solution

Note that there's a problem with $refreverse that you use in your own code but never declare or define it, so there's no way that the code you show will create the problem you say it does because it dies during compilation with the error message

Global symbol "$refreverse" requires explicit package name

I've guessed that it's right after $ref_forward at index 68

Please report back about how well this functions

#!/usr/bin/perl

use strict;
use warnings 'all';

my $home          = "/data";
my $tsv_directory = "$home/test_all_runs/$ARGV[0]";

my $final_variants = "$tsv_directory/final_variant_file.txt";

open my $out_fh, '>', $final_variants
        or die qq{Unable to open "$final_variants" for output: $!};

my @tsv_files = glob "$tsv_directory/FOCUS*.tsv";

for my $tsv_file ( @tsv_files ) {

    print "The current VCF is ############# $tsv_file
";

    $tsv_file =~ m|([^/]+)-oncogene.tsv$| or die "Cant extract Sample ID";
    my $sample_id = $1;
    print "The sample ID is ############## $sample_id
";

    open my $in_fh, '<', $tsv_file
            or die qq{Unable to open "$tsv_file" for input: $!};

    while ( <$in_fh> ) {

        next if /^#/;
        next if /(?:CNV|intronic|synonymous|utr_3|utr_5)/;

        my @fields = split;
        next if $fields[70] eq '0/0' or $fields[70] eq './.';

        my @wanted = ( 9, 10, 21, 67, 68, 70, 77, 78, 81, 83, 84, 88, 92, 98, 100 );
        my $current_line = join "	", @fields[@wanted];

        print $out_fh $current_line, "
";
    }
}

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...