mardi 14 juin 2016

How to optimize my perl code to make it faster

I have this code that is working fine but it's taking pretty much 100% of my cpu to run and it takes around 25min. I'd really like to optimize it but don't know what parts I could improve.

The main problem I guess is that I'm using coordinates of a file to be compared with a 3,5GB file. So I tried putting the information of this 3,5GB file in a hash to make it faster but it's not what I thought it would be. I want to compare exon coordinates to the whole human genome (hg38), see which two letters are before the exon position and which two are in the end, so that would count as an intron. Then, for each intron I want to see if those two letters are compatible with GUAG or not.

#!/usr/bin/perl
use warnings;
use strict;

print "Enter file path (refGene)n";
my $ref_gene = <>; 
chomp ($ref_gene);
my $local1 = "./$ref_gene";

print "Enter file path (hg38)n";
my $filename = <>; 
chomp ($filename);
my $local2 = "./$filename";

open (HG, $local2) or die "error opening $filenamen";

my %chromosome;
my $key;
while (my $line = <HG>) {
    chomp ($line);
    if ($line =~ /^>/) {
        $key = $line;
        $key =~ s/>//;          # / stop editor red coloring
        $chromosome{$key} = "";
    }
    else {
        $chromosome{$key} .= uc $line;
    }}

close (HG);

open (REFGENE, $local1) or die "error opening $ref_genen";

my $total_introns = 0;
my $GUAG_introns = 0;
while (my $line = <REFGENE>) {
    my @column = split ("t", $line);
    my $chr = $column[2]; 
    my $sense = $column[3];
    my $number_exons = $column[8];
    my $beginning_of_exon = $column[9];
    my $end_of_exon = $column[10];

    my @beginning_exons = split (",", $beginning_of_exon);
    my @end_exons = split (",", $end_of_exon);
    foreach my $element (@end_exons) { #subtract 1 from end of exon because its 1-based index
        $element -= 1;
    }

    my $introns = ($number_exons - 1); #amount of introns = exons - 1
    $total_introns += $introns;

    my $current_seq = $chromosome{$chr};

    for (my $i=0; $i<@beginning_exons-1; $i++) {
        my $end = substr ($current_seq, $end_exons[$i]+1, 2); #get two letters after end of exon
        my $beginning = substr ($current_seq, $beginning_exons[$i+1]-2, 2); #get two letters before beggining of exon

        if ($sense eq "+") {
            if ( ($end eq "GT") && ($beginning eq "AG") ) {
            $GUAG_introns += 1;
            }}

        elsif ($sense eq "-") {

            if ( ($end eq "CT") && ($beginning eq "AC") ) {
                $GUAG_introns += 1;
            }}}
}

close (REFGENE);


my $GUAG = (($GUAG_introns/$total_introns)*100); #percentage of GUAG introns
print "number of GUAG introns = $GUAG_introns n total introns = $total_introns n";
print " $GUAG% of GT-AG introns.nn";

exit;

input

717     NM_000525       chr11   -       17385248        17388659        17386918        17388091        1       17385248,       17388659,     
987     NM_000242       chr10   -       52765379        52771700        52768136        52771635        4      52765379,52769246,52770669,52771448,    52768510,52769315,52770786,52771700,

I know it's a lot of code and that nobody will download 3,5GB of human genome to run this but I'd like to know which kind of improvements I can do to optimize the code and not use 100% of cpu. Thanks

edit: The two letters before the exon position would be AG or the last two letters of an intron

column 9 has the coordinates to the start of an exon and column 10 to the end, I'm trying to get the things in the middle and see if they start with GT and end with AG. Column 3 is the sense of the sequence, so if it's negative (-) instead of looking for GUAG (U cause it's RNA in the end but the reference(hg38) is T) I'll be looking for the reverse complement (AC instead of GT, and CT instead of AG). Column 2 is the chromosome, so for chromosome 6 I'd look the reference in hg38 for chromosome 6 and count the introns. The other file is 15MB

Aucun commentaire:

Enregistrer un commentaire