I have this code that is working fine but it's taking pretty much 100% of my cpu to run and it takes around 25min. I'd really like to optimize it but don't know what parts I could improve.
The main problem I guess is that I'm using coordinates of a file to be compared with a 3,5GB file. So I tried putting the information of this 3,5GB file in a hash to make it faster but it's not what I thought it would be. I want to compare exon coordinates to the whole human genome (hg38), see which two letters are before the exon position and which two are in the end, so that would count as an intron. Then, for each intron I want to see if those two letters are compatible with GUAG or not.
#!/usr/bin/perl
use warnings;
use strict;
print "Enter file path (refGene)n";
my $ref_gene = <>;
chomp ($ref_gene);
my $local1 = "./$ref_gene";
print "Enter file path (hg38)n";
my $filename = <>;
chomp ($filename);
my $local2 = "./$filename";
open (HG, $local2) or die "error opening $filenamen";
my %chromosome;
my $key;
while (my $line = <HG>) {
chomp ($line);
if ($line =~ /^>/) {
$key = $line;
$key =~ s/>//; # / stop editor red coloring
$chromosome{$key} = "";
}
else {
$chromosome{$key} .= uc $line;
}}
close (HG);
open (REFGENE, $local1) or die "error opening $ref_genen";
my $total_introns = 0;
my $GUAG_introns = 0;
while (my $line = <REFGENE>) {
my @column = split ("t", $line);
my $chr = $column[2];
my $sense = $column[3];
my $number_exons = $column[8];
my $beginning_of_exon = $column[9];
my $end_of_exon = $column[10];
my @beginning_exons = split (",", $beginning_of_exon);
my @end_exons = split (",", $end_of_exon);
foreach my $element (@end_exons) { #subtract 1 from end of exon because its 1-based index
$element -= 1;
}
my $introns = ($number_exons - 1); #amount of introns = exons - 1
$total_introns += $introns;
my $current_seq = $chromosome{$chr};
for (my $i=0; $i<@beginning_exons-1; $i++) {
my $end = substr ($current_seq, $end_exons[$i]+1, 2); #get two letters after end of exon
my $beginning = substr ($current_seq, $beginning_exons[$i+1]-2, 2); #get two letters before beggining of exon
if ($sense eq "+") {
if ( ($end eq "GT") && ($beginning eq "AG") ) {
$GUAG_introns += 1;
}}
elsif ($sense eq "-") {
if ( ($end eq "CT") && ($beginning eq "AC") ) {
$GUAG_introns += 1;
}}}
}
close (REFGENE);
my $GUAG = (($GUAG_introns/$total_introns)*100); #percentage of GUAG introns
print "number of GUAG introns = $GUAG_introns n total introns = $total_introns n";
print " $GUAG% of GT-AG introns.nn";
exit;
input
717 NM_000525 chr11 - 17385248 17388659 17386918 17388091 1 17385248, 17388659,
987 NM_000242 chr10 - 52765379 52771700 52768136 52771635 4 52765379,52769246,52770669,52771448, 52768510,52769315,52770786,52771700,
I know it's a lot of code and that nobody will download 3,5GB of human genome to run this but I'd like to know which kind of improvements I can do to optimize the code and not use 100% of cpu. Thanks
edit: The two letters before the exon position would be AG or the last two letters of an intron
column 9 has the coordinates to the start of an exon and column 10 to the end, I'm trying to get the things in the middle and see if they start with GT and end with AG. Column 3 is the sense of the sequence, so if it's negative (-) instead of looking for GUAG (U cause it's RNA in the end but the reference(hg38) is T) I'll be looking for the reverse complement (AC instead of GT, and CT instead of AG). Column 2 is the chromosome, so for chromosome 6 I'd look the reference in hg38 for chromosome 6 and count the introns. The other file is 15MB
Aucun commentaire:
Enregistrer un commentaire