Recent Updates Toggle Comment Threads | Keyboard Shortcuts

  • Wang Qinhu 5:14 pm on August 25, 2012 Permalink | Reply
    Tags: , , , parse blast, , tab delimited   

    Parse blast by Perl and format the output 

    Parse blast result is frequently used in our daily research, here is my script for parsing the original blast output, include the full report in tab-delimited format and brief in tab-delimited format, which can be easily identified in Excel or a Text editor.

    For example, we have a demo sequence file named seq.fas, we performed blast search by typing the following:

    blastall -p blastp -i seq.fas -d /disk/www/blast/db/nr -e 1e-10 -b 3 -v 3 -a 8 -o blast.out

    The output file is like this.
    After that, please run this script as following:

    perl parse_blast.pl

    And you will get two report files in full and brief tab-delimited formats.
    Those are the demo output files: full_tab brief_tab
    Hope this is useful for you.
    Here is the code:
    [click here to save this file directly]

    #!/usr/bin/perl -w

    use strict;
    use Bio::SearchIO;

    my $blast_file = $ARGV[0] || "blast.out";
    my $report = $ARGV[1] || 'blast.report';
    my $evalue = undef;

    open (RPT,">$report");
    print RPT "#Query_Name\tTop_Hit\tDescription [Organism]\tE-value\n";
    my $in = new Bio::SearchIO(
    -format => 'blast',
    -file => "$blast_file");
    while(my $result = $in->next_result) {
    while(my $hit = $result->next_hit) {
    while(my $hsp = $hit->next_hsp) {
    print RPT $result->query_name, "\t";
    print RPT $hit->name, "\t", $hit->description, "\t";
    $evalue = $hsp->evalue;
    $evalue =~ s/\,//;
    print RPT $evalue, "\n";
    last;
    }
    last;
    }
    }
    close(RPT);

    open (REC, "$report");
    open (BRI, ">$report.xls");
    while (my $line = ) {

    my ($a, $b, $c, $d) = split /\t/, $line;
    my ($c_head, $c_tail) = split /\]/, $c, 2;
    my $rec = "$a\t$b\t$c_head]\t$d";
    $rec =~ tr/\[//;
    print BRI $rec;
    }
    close REC;
    close BRI;

     
  • Wang Qinhu 5:27 pm on March 9, 2012 Permalink | Reply
    Tags: barplot, , qPCR, R   

    Showing the qPCR data via R 


    When we have generated a batch of  Real-Time quantitative PCR (qPCR) data, to illustrate them, some of the applications may lose their power or generate very ugly graphics, here I supply the following R script to help biologist custom their bar plot (with error bar) when publishing their qPCR data.

    The data file are organized as follow:

    Control Ct_Ref_Rep1 Ct_Ref_RepN Ct_Test_Rep1 Ct_Test_RepN
    SampleA Ct_Ref_Rep1 Ct_Ref_RepN Ct_Test_Rep1 Ct_Test_RepN

    SampleZ Ct_Ref_Rep1 Ct_Ref_RepN Ct_Test_Rep1 Ct_Test_RepN

    My demo files is available here: GeneA, GeneB, and GeneC.

    Please note that I used 2^-ddCt method for the data analysis.

    Here is the script:


    #Error bar fuction
    error.bar <- function(x, y, upper, lower=upper, length=0.1,...){
    if(length(x) != length(y) | length(y) !=length(lower) | length(lower) != length(upper))
    stop("vectors must be same length")
    arrows(x,y+upper, x, y-lower, angle=90, code=3, length=length/3, ...)
    }

    ### User Data
    #Genes list
    #Use the corresponding gene names listed here as the Ct file name
    genes<-c("GeneA","GeneB","GeneC")
    #Number of sample
    num_sam <- 10;
    #Number of repeat (biological or technical)
    num_rep <- 2;
    #Line of Control
    lctrl <- 1;

    num_gene <- length(genes);
    ###Figure layout
    #Number of row
    num_row <- 1;
    #Number of column
    num_col <- 3;
    layout(mat=matrix(1:num_gene, num_row, num_col, byrow=TRUE))

    #Main
    for (myct in genes) {

    #Read data
    ct<- read.table(myct)
    #Format data
    row.names(ct)<-ct$V1
    for (j in 1:(2*num_rep)) {
    ct[j]<-ct[j+1]
    }
    ct<-ct[-(2*num_rep+1)]

    expr=rep(NA, num_sam*num_rep)
    dim(expr)<-c(num_sam,num_rep)

    #ctr_ref
    ref_calibrator<-mean(as.numeric(ct[lctrl,1:num_rep]))
    calibrator<-mean(as.numeric(ct[lctrl,(num_rep+1):(2*num_rep)]-ref_calibrator))

    for (i in 1:num_sam) {
    ref<-mean(as.numeric(ct[i,1:num_rep]))
    #dCt
    dct<-ct[i,(num_rep+1):(2*num_rep)]-ref
    #ddCt
    ddct<-dct-calibrator
    #fold
    expr[i,1:num_rep]<-2^-ddct
    }
    fold<-t(expr)

    fold.means=rep(NA, num_sam)
    fold.sd=rep(NA, num_sam)

    for (i in 1:num_sam) {
    fold.means[i]<-mean(fold[,i])
    fold.sd[i]<-sd(fold[,i])
    }

    ymax=max(fold.means)+1.1*max(fold.sd)
    barx <- barplot(fold.means,col=1,ylim=c(0,ymax),names.arg=row.names(ct), main=myct,xlab="Tissue type",ylab="Normlized fold expression")
    error.bar(barx,fold.means, fold.sd)

    }

    Feed this script with the demo files (GeneA, GeneB and GeneC), we will get the following demo output graphic.

     
  • Wang Qinhu 1:22 pm on February 16, 2012 Permalink | Reply
    Tags: citation report, , ,   

    Generate a citation report of your publication 

    Take the FunCat paper in NAR as an example, following this procedure:
    First, download the full report of the query publication from web of knowledge, click here to get my demo file.
    Second, download all the citations of your query paper in full report format (usually your need to select web of science database rather than all the database), demo file is available here.
    Third, execute the following perl script.
    NOTE: please save your query and citation file in tab-delimted CRLF format, or you may commit a failure.
    A: total citations exclude the authors citations;
    B: total of the author citations;
    C: citations beyond the year you are retrieved.
    A+B+C is the sum of citations, that is, all the citations in web of science database.
    To preview your finally report file click here.


    #!/usr/bin/perl -w

    1. File Name: qsci.pl
    2. Version 0.01
    3. Copyright (c) 2012 NWAFU
    4. Author: Wang Qinhu (qinhu.wang # gmail.com)
    5. Created on Feb 16, 2012

    use strict;

    my $label = $ARGV[0] || “citation.report”;
    my $query = $ARGV[1] || “query.txt”;
    my $database = $ARGV[2] || “database.txt”;
    my $year = $ARGV[3] || “2006″;

    open (QR, $query) or die “Cannot open file $query: $!\n”;
    my $author = undef;
    while (my $line = ) {
    chomp $line;
    if ($line =~ /;/) {
    my @record = split /\t/, $line;
    $author = $record[1];
    }
    }
    close QR;

    my @own_auther = split /\;\s/, $author;

    open (DB, $database) or die “Cannot open file $database: $!\n”;
    my @citation = ();
    my $i = 0;
    while (my $dbr = ) {
    $citation[$i++] = $dbr;
    }
    close DB;

    open (RP, “>$label.txt”) or die “Cannot open file $label.txt: $!\n”;
    my $a = 0;
    my $b = 0;
    my $c = 0;

    foreach my $i (1..$#citation) {
    my $ci = $citation[$i];
    my $own = 0;
    my $once = 0;
    my @split_sci = split /\t/, $ci;
    if ($split_sci[37] >= $year) {
    foreach my $au (@own_auther) {
    if ($ci =~ /$au/) {
    $own++ if $once++ == 0;
    }
    }
    if ($own == 0) {
    $a++;
    print RP “//\nType: A\n”;
    print RP “Author(s): $split_sci[1]\n”;
    print RP “Title: $split_sci[7]\n”;
    print RP “Journal: $split_sci[8]\n”;
    print RP “Year: $split_sci[37]\n”;
    print RP “Volume: $split_sci[38]\n”;
    } else {
    $b++;
    print RP “//\nType: B\n”;
    print RP “Author(s): $split_sci[1]\n”;
    print RP “Title: $split_sci[7]\n”;
    print RP “Journal: $split_sci[8]\n”;
    print RP “Year: $split_sci[37]\n”;
    print RP “Volume: $split_sci[38]\n”;
    }
    } else {
    $c++;
    print RP “//\nType: C\n”;
    print RP “Author(s): $split_sci[1]\n”;
    print RP “Title: $split_sci[7]\n”;
    print RP “Journal: $split_sci[8]\n”;
    print RP “Year: $split_sci[37]\n”;
    print RP “Volume: $split_sci[38]\n”;
    }
    }

    print RP “\n***********************\n”;
    print RP “Citation Report:\n”;
    print RP “A\t$a\n”;
    print RP “B\t$b\n”;
    print RP “C\t$c\n”;
    print RP “***********************\n\n”;
    close RP;

     
  • Wang Qinhu 3:52 pm on October 24, 2011 Permalink | Reply
    Tags: Arabidopsis, , , , , , , ,   

    Functional Catalogue of customized gene list 

    Functional Catalogue (FunCat) is a hierarchically structured, organism-independent, flexible and scalable controlled classification system enabling the functional description of proteins from any organism [1]. Here is my solution to perform FunCat classification in Arabidopsis. Please note in this Perl script, NCBI BLAST was called and bioperl was used to parse the blast result, we use eight cores of CPU in default to perform this analysis.


    #!/usr/bin/perl -w

    1. File Name: FunCat.pl
    2. Note: FASTA files are expected for input.
    3. Version 1.00
    4. Copyright (c) 2008-2010 NWAFU
    5. Author: Wang Qinhu (qinhu.wang # gmail.com)
    6. Created on Aug, 2008

    use strict;
    use Bio::SearchIO;

    #System Parameters:
    my $query_file = $ARGV[0] || 'query.fa';
    my $cpu = $ARGV[1] || '8';

    #Format DataBase
    system("formatdb -i ATprotein -p T -o T");
    #Perform BLAST
    system("blastall -p blastx -d ATprotein -i $query_file -e 1e-5 -a $cpu -b 2 -o blast.out");
    system("rm ATprotein.p*");
    system("rm formatdb.log");
    #Read Arabidopsis FunCat Scheme
    my $id = "";
    my $num = "";
    my %list = ();
    open (IN, "athaliana_funcat2008");
    while (my $line = ) {
    chomp $line;

    ($id, $num) = split(/\|/, $line);
    if (!exists $list{$id}) {
    $list{$id} = $num;
    } else {
    next;
    }
    }
    close IN;
    $list{"NO HITS FOUND"} = "00";

    #Parse BLAST, and Assign FunNum
    my $in = new Bio::SearchIO(
    -format => 'blast',
    -file => 'blast.out');
    open (OUT,">BLAST.cat");
    my $Gene = "";
    my $HitId = "";
    my %Cata = ();
    while(my $result = $in->next_result) {
    $Gene = $result->query_name;
    while(my $hit = $result->next_hit) {
    $Cata{$Gene} = $hit->name;
    last;
    }
    if (!defined $Cata{$Gene}) {
    $Cata{$Gene} = "No hits found";
    }
    print OUT "$Gene\t";
    print OUT "$Cata{$Gene}\t";
    $Cata{$Gene} =~ tr/[a-z]/[A-Z]/;
    if (!exists $list{$Cata{$Gene}}) {
    $list{$Cata{$Gene}} = "-1";
    }
    print OUT $list{$Cata{$Gene}}, "\n";
    }
    close(OUT);

    #Output Cat
    my %dna = ();
    my %log = ();
    my $logdat = undef;
    open (IN, "BLAST.cat");
    while (my $line = ) {
    chomp $line;
    my ($gene, $hit, $cata) = split /\t/, $line;
    my $infor = $gene . "\t" . $hit;
    $dna{$infor} = $cata;
    }
    close IN;
    open (OUT, ">Clean.cat");
    foreach my $infor (sort By_CatNum keys %dna) {
    print OUT $dna{$infor}, "\t", $infor, "\n";
    $logdat = $dna{$infor};
    $logdat = substr($logdat, 0, 2);
    $log{$logdat}++;
    }
    close OUT;
    open (CAT, ">FunCat.dat");
    print CAT "Cata\tNum\n";
    foreach my $logdat (sort keys %log) {
    print CAT $logdat, "\t", $log{$logdat}, "\n";
    }
    close CAT;
    sub By_CatNum { $dna{$a} <=> $dna{$b} }

    Reference:
    1. Ruepp, A. et al. (2004) The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 32, 5539–5545.

     
  • Wang Qinhu 5:24 pm on October 20, 2011 Permalink | Reply
    Tags: , Gene Ontology, GoAnna, GOslim, , ,   

    GoAnna tail 

    This script was wrote three years ago, now I decide to share the script and hope this could inspire the one who want to perform Gene Ontology (GO) annotation themselves, not by the ready-to-use applications.


    #!/usr/bin/perl

    1. A perl script to handle the result of GoAnna, which is used for Geneontology Annotation, and it is available at:
    2. http://www.agbase.msstate.edu/GOAnna.html
    3. This scrit should be feeded by the output of GoAnna (the *.xls file only),
    4. and it generates human readable and computer minable go annotation, also goslims results.
    5. File Name: GoAnna_tail_delta.pl
    6. Version 1.03
    7. Copyright (c) 2008-2011 NWAFU
    8. Author: Wang Qinhu (qinhu.wang # gmail.com)
    9. Created on Aug, 2008
    10. Last Modified: Aug 9, 2009
    11. Additional files:
    12. goaslim.map is available at: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/goslim/goaslim.map
    13. goslim_goa.go is available at: ftp://ftp.geneontology.org/pub/go/GO_slims/goslim_goa.go

    use strict;

    my $in = $ARGV[0] || "GoAnno.sprot.out";
    my $table = $ARGV[1] || "GOA.Table";
    my $goa = $ARGV[2] || "GOA";

    open (IN, "$in") or die "Cannot open file $in: $!\n";
    open (TMP, ">Temp.table") or die "Cannot create file Temp.table: $!\n";

    my $seq_id = undef;
    my $id_counter = 0;
    my @seq_id = ();
    my @feilds = ();
    my %blastx_hit = ();
    my %blastx_pro = ();
    my %blastx_exp = ();

    print TMP "#UniGene_Id\tGO_Id\tOntology\tGO_Term\n";

    #Create Table.
    while (my $line = ) {
    chomp $line;
    if ( $line =~ /Input/) {
    next;
    } elsif ($line =~ /Database/) {
    next;
    } elsif ($line =~ /^=[\D\d]+\>([\D\d]+)\"/) {
    $seq_id = $1;
    $seq_id[$id_counter] = $seq_id;
    $id_counter++;
    } else {
    @feilds = split /\t/, $line;
    #To get the size of this array.
    my $feilds = @feilds;
    #print $feilds, "\n";
    if ($feilds == 5) {
    $blastx_hit{$seq_id} = $feilds[1];
    $blastx_pro{$seq_id} = $feilds[2];
    $blastx_exp{$seq_id} = $feilds[4];
    } elsif ($feilds == 18) {
    if ($blastx_exp{$seq_id} <= 1e-5) {
    print TMP $seq_id, "\t";
    print TMP $feilds[6], "\t", $feilds[10], "\t", $feilds[11], "\n";
    }
    } else {
    next;
    }
    }
    }

    close TMP;
    close IN;

    #Remove Redundant Record.
    open (TAB, "Temp.table") or die "Cannot open file Temp.table: $!\n";
    open (RRR, ">$table.nr") or die "Cannot create file $table.nr: $!\n";

    my @table = ();
    my $i = 0;
    while (my $line = ) {
    #Load Table.
    $table[$i] = $line;
    $i++;
    }
    my @table_nr = ();
    my %control = ();
    foreach (@table) {
    #Use Hash to Remove Redundance.
    if (!$control{$_}) {
    push (@table_nr, $_);
    }
    $control{$_}++;
    }

    #Output Temp.table (or Table.alpha).
    print RRR "@table_nr\n";

    close RRR;
    close TAB;

    open (NR, "$table.nr") or die "Cannot open file $table.nr: $!\n";

    my $g_index = undef;
    my %goa_dict = ();
    my %goa = ();
    while (my $line = ) {
    if ($line =~ /^#/) {
    next;
    } elsif ($line =~ /\S+/) {
    my @element = split /\t/, $line;
    $g_index = $element[1];
    $goa_dict{$g_index} = $element[2] . "\t" . $element[3];
    $goa{$g_index} .= $element[0] . "\n";
    } else {
    next;
    }
    }

    close NR;

    #split into different ontologies.
    open (NR, "$table.nr") or die "Cannot open file $table.nr: $!\n";
    open (F, ">$table.F") or die "Cannot create file $table.F: $!\n";
    open (P, ">$table.P") or die "Cannot create file $table.P: $!\n";
    open (C, ">$table.C") or die "Cannot create file $table.C: $!\n";
    print F "#UniGene_Id\tGO_Id\tOntology\tGO_Term\n";
    print P "#UniGene_Id\tGO_Id\tOntology\tGO_Term\n";
    print C "#UniGene_Id\tGO_Id\tOntology\tGO_Term\n";

    while (my $line = ) {
    if ($line =~ /^\#/) {
    next;
    }
    if ($line =~ /^\s*$/) {
    next;
    }
    my @words = split /\t/, $line;
    if (length($words[2]) != 1) {
    print $words[2], "\n";
    }
    my $ontology = $words[2];
    if ($ontology eq "F") {
    print F $line;
    } elsif ($ontology eq "P") {
    print P $line;
    } elsif ($ontology eq "C") {
    print C $line;
    } else {
    next;
    }
    }

    close NR;
    close F;
    close P;
    close C;

    #Assign unigene to goa
    open (GOA, ">$goa") or die "Cannot create file $goa: $!\n";

    foreach (keys %goa) {
    print GOA $_, "\t", $goa_dict{$_}, $goa{$_};
    }

    close GOA;

    open (GOA, "$goa") or die "Cannot open file $goa: $!\n";
    open (F, ">$goa.F") or die "Cannot create file $goa.F: $!\n";
    open (P, ">$goa.P") or die "Cannot create file $goa.P: $!\n";
    open (C, ">$goa.C") or die "Cannot create file $goa.C: $!\n";

    my $go = undef;
    my %goterm = ();
    my %onto = ();
    my %freq = ();

    while (my $line = ) {
    chomp $line;
    if ($line =~ /^#/) {
    next;
    } elsif ($line =~ /^GO/) {
    my @header = split /\t/, $line;
    $go = $header[0];
    $onto{$go} = $header[1];
    $goterm{$go} = $header[2];
    $freq{$go} = 0;
    } elsif ($line =~ /^\s\S+/) {
    $freq{$go}++;
    } else {
    next;
    }
    }

    foreach my $go (keys %onto) {
    if ($onto{$go} eq "F") {
    print F $go, "\t", $freq{$go}, "\t", $goterm{$go}, "\n";
    } elsif ($onto{$go} eq "P") {
    print P $go, "\t", $freq{$go}, "\t", $goterm{$go}, "\n";
    } elsif ($onto{$go} eq "C") {
    print C $go, "\t", $freq{$go}, "\t", $goterm{$go}, "\n";
    } else {
    print "Opps!\n";
    }
    }

    close F;
    close P;
    close C;
    close GOA;

    #################################################################
    #Parse Goaslim.Map
    #################################################################

    open (MAP, "goaslim.map") or die "Cannot open file GOA.F: $!\n";

    my %map_ori = ();
    my $go_id = undef;

    while (my $line = ) {
    chomp $line;
    if ($line =~ /^!/) {
    next;
    } elsif ($line =~ /^\s*$/) {
    next;
    } else {
    my @map = split /\t/, $line;
    #Updated v1.02
    unless ($map[0] eq $map[1]) {
    $go_id = $map[0];
    $map_ori{$go_id} = $map[1];
    }
    }
    }

    close MAP;

    #Remap the lower gomap which its higher gomaps has a super-higher gomap.
    #Updated at v_1.02
    my %map = ();
    foreach (keys %map_ori) {
    if (exists $map_ori{$map_ori{$_}}) {
    $map{$_} = $map_ori{$map_ori{$_}};
    } else {
    $map{$_} = $map_ori{$_};
    }

    }

    #################################################################
    #Parsing Goslim_goa
    #################################################################
    open (GO, "goslim_goa.go") or die "Cannot open file goslim_generic.go: $!\n";
    open (GOB, ">goslim.brief") or die "Cannot create file goslim.brief: $!\n";

    my $gi = undef;
    my %on = ();

    while (my $line = ) {
    if ($line =~ /^!/) {
    next;
    } elsif ($line =~ /^$/) {
    next;
    } elsif ($line =~ /^\s*$/) {
    next;
    } elsif ($line =~ /^\s%(\S+)\s;/) {
    print GOB $1, "\n";
    } elsif ($line =~ /^\s{2,4}[%<]([a-z\s\\,]+)\s;\s(GO:[\d]+)/) {
    $gi = $2;
    $on{$gi} = $1;
    $on{$gi} =~ s/\\//g;
    print GOB $gi, "\t", $on{$gi}, "\n";
    } else {
    next;
    }
    }
    $on{"GO:OMF"} = "other molecular function";
    $on{"GO:OBP"} = "other biological processes";
    $on{"GO:OCC"} = "other cellular components";
    close GOB;
    close GO;
    #################################################################

    open (SLIM, ">Goslim.xls") or die "Cannot create file Goslim.xls: $!\n";
    open (LOG, ">Nomap.log") or die "Cannot create file Nomap.log: $!\n";
    #################################################################
    #Molecular Function
    #################################################################
    open (GOAF, "GOA.F") or die "Cannot open file GOA.F: $!\n";
    print LOG "GOA.F.Nomap\n";
    print LOG "*" x 25, "\n";
    my %slimf_num = ();
    while (my $line = ) {
    chomp $line;
    if ($line =~ /^\s*$/) {
    next;
    } elsif ($line =~ /GO:/) {
    my @slim_child = split /\t/, $line;
    if (exists $map{$slim_child[0]}) {
    $slimf_num{$map{$slim_child[0]}} += $slim_child[1];
    } else {
    print LOG $slim_child[0], "\n";
    $slimf_num{"GO:OMF"} += $slim_child[1];
    }
    } else {
    next;
    }
    }
    close GOAF;
    print SLIM "Molecular Function\n";
    foreach (sort {$slimf_num{$a} <=> $slimf_num{$b}} keys %slimf_num) {
    print SLIM "$_\t$on{$_}\t$slimf_num{$_}\n";
    }

    #################################################################
    #Biological Processes
    #################################################################
    open (GOAP, "GOA.P") or die "Cannot open file GOA.P: $!\n";
    print LOG "\n\nGOA.P.Nomap\n";
    print LOG "*" x 25, "\n";
    my %slimp_num = ();
    while (my $line = ) {
    chomp $line;
    if ($line =~ /^\s*$/) {
    next;
    } elsif ($line =~ /GO:/) {
    my @slim_child = split /\t/, $line;
    if (exists $map{$slim_child[0]}) {
    $slimp_num{$map{$slim_child[0]}} += $slim_child[1];
    } else {
    print LOG $slim_child[0], "\n";
    $slimp_num{"GO:OBP"} += $slim_child[1];
    }
    } else {
    next;
    }
    }
    close GOAP;
    print SLIM "\n\nBiological Processes\n";
    foreach (sort {$slimp_num{$a} <=> $slimp_num{$b}} keys %slimp_num) {
    print SLIM "$_\t$on{$_}\t$slimp_num{$_}\n";
    }

    #################################################################
    #Cellular Components
    #################################################################
    open (GOAC, "GOA.C") or die "Cannot open file GOA.C: $!\n";
    print LOG "\n\nGOA.C.Nomap\n";
    print LOG "*" x 25, "\n";
    my %slimc_num = ();
    while (my $line = ) {
    chomp $line;
    if ($line =~ /^\s*$/) {
    next;
    } elsif ($line =~ /GO:/) {
    my @slim_child = split /\t/, $line;
    if (exists $map{$slim_child[0]}) {
    $slimc_num{$map{$slim_child[0]}} += $slim_child[1];
    } else {
    print LOG $slim_child[0], "\n";
    $slimc_num{"GO:OCC"} += $slim_child[1];
    }
    } else {
    next;
    }
    }
    close GOAF;
    print SLIM "\n\nCellular Components\n";
    foreach (sort {$slimc_num{$a} <=> $slimc_num{$b}} keys %slimc_num) {
    print SLIM "$_\t$on{$_}\t$slimc_num{$_}\n";
    }

    close LOG;
    close SLIM;

    #List the unigenes of each goslim.
    #This function is added in v1.03
    #Particular for biological processes.
    open (BP, "$table.P") or die "Cannot open file $table.P: $!\n";
    open (BPL, ">$table.P.list") or die "Cannot open file $table.P.list: $!\n";
    my %bpl = ();
    my $i = 0;
    while (my $line = ) {
    if ($line =~ /^#/) {
    next;
    } elsif ($line =~ /\S+/) {
    my ($unigene, $goid, $ontology, $goterm) = split /\t/, $line;
    if (exists $on{$map{$goid}}) {
    $bpl{$i} = "$on{$map{$goid}}\t$unigene\t$goid\t$goterm";
    $i++;
    }
    } else {
    next;
    }
    }
    foreach (sort values %bpl) {
    print BPL $_;
    }
    close BP;
    close BPL;

    1. __END__

     
  • Wang Qinhu 5:34 am on October 19, 2011 Permalink | Reply
    Tags: GIMP,   

    Gimp arrow brushes 

    I prepared these arrow brushes for personal use, however, anyone are welcome to download and use them, if wish.
    To install, just copy the gbr files to your brushes folder of your GIMP. These brushes were tested on Mac OS X, but it should also works on Linux and Windows.
    Click here to download.

     
  • Wang Qinhu 7:45 am on April 25, 2011 Permalink | Reply
    Tags: , , , , map, , , sort,   

    Mutiple sorting by Perl 

    Here is the code I used to sort the map file produced by bowtie, few modification should make it suitable for your files to deal with, hope it useful to you.


    #!/usr/bin/perl -w

    1. Sort map, also could be used in bed or wig files.
    2. Contact: Wang Qinhu (qinhu.wang at gmail.com)

    use strict;

    my @map = ();
    my $i = 0;

    open (IN, "reads.map") or die "Cannot open file: $!\n";
    while () {
    chomp;
    my @words = split /\t/, $_;
    my ($id, $num) = split /\_/, $words[0];
    my $strand = $words[1];
    my $scaffold = $words[2];
    my $start = $words[3];
    my $end = $words[3] + length($words[4]);
    my $read = $words[4];
    my ($head, $ord) = split /\_/, $scaffold;
    my $scaffold_num = undef;
    if ($ord < 10 ) {
    $scaffold_num = $head . "_000" . "$ord";
    } elsif ($ord < 100) {
    $scaffold_num = $head . "_00" . "$ord";
    } elsif ($ord < 1000) {
    $scaffold_num = $head . "_0" . "$ord";
    } elsif ($ord < 10000) {
    $scaffold_num = $head . "_" . "$ord";
    } else {
    print "More than 10000 RefSeq found!\n";
    exit;
    }
    $map[$i++] = { "scaffold" => $scaffold,
    "scaffold_num" => $scaffold_num,
    "strand" => $strand,
    "start" => $start,
    "end" => $end,
    "num" => $num,
    "id" => $id,
    "read" => $read};
    }
    close IN;

    my @sort_map = sort by_map @map;
    open (OUT, ">./output/sort.map") or die "Cannot open file: $!\n";
    foreach my $map (@sort_map) {
    print OUT $map->{"scaffold"}, "\t", $map->{"strand"}, "\t", $map->{"start"}, "\t";
    print OUT $map->{"end"}, "\t", $map->{"num"}, "\t", $map->{"id"}, "\t", $map->{"read"}, "\n";
    }
    close OUT;

    sub by_map {
    $a->{"scaffold_num"} cmp $b->{"scaffold_num"}
    or $a->{"strand"} cmp $b->{"strand"}
    or $a->{"start"} <=> $b->{"start"}
    or $b->{"num"} <=> $a->{"num"}
    or $a->{"end"} <=> $b->{"end"}
    }

    This is my demo:

    wangqinhu@IceThink[map]cat reads.map
    lab0000001_285583 + scaffold_4 116913 GGCGAGCCCGGCGGAGTCGC
    lab0000001_285583 - scaffold_25 11613 CGTGGGTCAGTGCGACGCGTG
    lab0000002_136986 + scaffold_4 116912 GGCAGCGCGGAAGTCGTGTC
    lab0000002_136986 - scaffold_25 11613 GGTAGCGCGGCGATGCGCG
    lab0000003_59405 + scaffold_4 116911 GGAGCGGGCGACGACGGCGC
    lab0000003_59405 - scaffold_25 11613 GCGTAGTACGTCGTCGTCAGT
    lab0000004_42594 + scaffold_4 116913 GGTGTCAGTACGCGTCGTGTC
    lab0000004_42594 - scaffold_25 11612 GGTCAGTACGTGTACCGTCGTAG
    lab0000005_32758 - scaffold_25 12003 GACGTGTCAGTCGTCGTC
    lab0000005_32758 + scaffold_4 116520 GGTCGTCAGTCGACGTACGTC
    wangqinhu@IceThink[map]perl sortmap.pl
    wangqinhu@IceThink[map]cat sort.map
    scaffold_4 + 116520 116541 32758 lab0000005 GGTCGTCAGTCGACGTACGTC
    scaffold_4 + 116911 116931 59405 lab0000003 GGAGCGGGCGACGACGGCGC
    scaffold_4 + 116912 116932 136986 lab0000002 GGCAGCGCGGAAGTCGTGTC
    scaffold_4 + 116913 116933 285583 lab0000001 GGCGAGCCCGGCGGAGTCGC
    scaffold_4 + 116913 116934 42594 lab0000004 GGTGTCAGTACGCGTCGTGTC
    scaffold_25 - 11612 11635 42594 lab0000004 GGTCAGTACGTGTACCGTCGTAG
    scaffold_25 - 11613 11634 285583 lab0000001 CGTGGGTCAGTGCGACGCGTG
    scaffold_25 - 11613 11632 136986 lab0000002 GGTAGCGCGGCGATGCGCG
    scaffold_25 - 11613 11634 59405 lab0000003 GCGTAGTACGTCGTCGTCAGT
    scaffold_25 - 12003 12021 32758 lab0000005 GACGTGTCAGTCGTCGTC

     
  • Wang Qinhu 3:45 pm on December 1, 2010 Permalink | Reply
    Tags: , , , cross_match, , ftp, , phred, shell   

    Sequences transport and cDNA pipeline 

    Hardware and Software:
    LAB Server: RedHat Enterprise Linux 4.5.
    HPC Server: RedHat Enterprise Linux 5.4, with vsftp.
    PC: BioLinux 6 (Ubuntu Linux 10.04)
    SSH for remote visit.

    Flow:
    Local PC [cDNA sequences files (*.ab1 files)] => HPC Server [cDNA sequences files (*.ab1 files)] ]
    Local PC ==> LAB Server
    LAB Server: run airman.tcsh

    Function:
    Transport local cDNA sequences to HPC (NOT in this airman.tcsh)
    LAB Server download cDNA sequences from HPC ~/lab2seq to /home/wangqinhu/seqman/input
    Simple cDNA piple:
    phred: convert chromat files to phd files
    phd2fasta: phd files to fasta files
    cross_match: screen vector
    cap3: NOT include in this script, if you need, type in one line “cap3 xyz.fa > xyz.cap3 &” if you pre-installed cap3
    LAB Server upload result file to HPC ~/lab2seq

    code of airman.tcsh


    #!/bin/tcsh
    #Seqman For Lab2
    #Filename: airman.tcsh
    #Author Wang Qinhu (qinhu.wang # gmail.com)
    #Date: Dec 1, 2010

    #download data
    echo
    echo Download files form HPC
    ftp -niv < open 123.45.67.89
    user wangqinhu passwd
    binary
    cd lab2seq
    lcd /home/wangqinhu/seqman/input
    mget *.ab1
    bye
    !
    echo OK
    echo
    sleep 1

    cd /home/wangqinhu/seqman

    #seqman
    echo Prepare to Process Sequences
    #Parameters
    set input = "input"
    set label = "lab2seq"
    set vector = $1
    setenv PHRED_PARAMETER_FILE /usr/local/etc/phredpar.dat
    mkdir chrom_dir
    mkdir phd_dir
    mkdir work_dir
    cp $input/*ab1 chrom_dir/
    echo
    echo Current DATA or FILENAME processed:
    echo
    echo $input
    echo
    echo Convert chromat file to phd file
    echo
    phred -id chrom_dir -pd phd_dir -qa ./work_dir/$input.qual -trim_alt 50
    echo
    echo Convert phd file to fasta file
    echo
    phd2fasta -id phd_dir -os ./work_dir/$label -oq ./work_dir/$label.screen.qual
    echo
    echo Screen out the vector sequences, and generates the
    echo screened sequence file "$label.screen"
    echo
    cross_match ./work_dir/$label $vector -minmatch 12 -minscore 20 -screen > ./work_dir/screen.out
    echo
    tar -czf $label.tar.gz chrom_dir/* phd_dir/* work_dir/*
    rm -r input/*
    rm -r chrom_dir
    rm -r phd_dir
    rm -r work_dir
    echo
    echo OK
    echo
    sleep 1

    #upload data
    echo
    echo Upload files to HPC
    ftp -niv < open 123.45.67.89
    user wangqinhu passwd
    binary
    cd lab2seq
    put lab2seq.tar.gz
    bye
    !
    rm lab2seq.tar.gz
    echo OK
    echo
    sleep 1

    echo Sir, all done, data uploaded to HPC ~/lab2seq
    echo Goodbye

     
  • Wang Qinhu 4:43 am on November 29, 2010 Permalink | Reply
    Tags: , , , , , ,   

    Remove duplicated sequences from NCBI 

    We usually download sequences from NCBI by one entrez retrieve, but for more than one retrieves, we may get some duplicated sequences, in this case, we often need to remove the duplicated sequences and get a unique sequences set. Here I supply a simple perl script to do this. Please make sure that your sequences have the same ids must have the same sequences contents, correspondingly; or this script is not suitable for you, but you can use cap3 or other assemble tool instead.


    #!/usr/bin/perl -w
    #Use hash to remove the duplicated sequences downloaded from NCBI.
    #The sequences from NCBI have unique accession number, when remove duplicated ID, duplicated
    #sequences were removed meanwhile.

    use strict;

    my $id = undef;
    my %seq = ();

    open (IN, "ncbi.fa") or die "Cannot open file: $!\n";
    while () {
    chomp;
    if (/^\>/) {
    my @head = split /\|/, $_;
    $id = $head[3];
    if (exists $seq{$id}) {
    $seq{$id} = '';
    }
    } else {
    $seq{$id} .= $_;
    }
    }
    close IN;

    open (OUT, ">uniseq.fa");
    foreach (sort keys %seq) {
    print OUT ">", $_, "\n";
    print OUT $seq{$_}, "\n";
    }
    close OUT;

    Note: Your input/output sequences format is FASTA.

     
  • Wang Qinhu 11:28 am on August 30, 2010 Permalink | Reply
    Tags: , , , , ,   

    Split the sequences having significant BLAST hits 

    To perform sequences similarity search, we usually use NCBI BLAST. However, we often need to parse the result to split the sequences which have significant and/or insignificant hits, respectively. Here is the sample to do this work via bioperl, I hope it will supply useful information to you. Thank you!

    #!/usr/bin/perl -w

    use strict;
    use Bio::SearchIO;

    my $query = "";
    my $SeqId = undef;
    my %hit = ();
    my %Seq = ();
    my %Des = ();

    open (PNI, "seq.fasta");
    while (my $line =
    ) {
    if ($line =~ /^>(\S+)\s*.*$/) {
    $SeqId = $1;
    $Des{$SeqId} = $line;
    } else {
    $Seq{$SeqId} .= $line;
    }
    }
    close PNI;

    open (NO, ">nohits.fasta");
    open (HIT, ">hits.fasta");

    my $in = new Bio::SearchIO(
    -format => 'blast',
    -file => 'blast.txt');

    while(my $result = $in->next_result) {
    $query = $result->query_name;
    while (my $hit = $result->next_hit) {
    $hit{$query} = $hit->name;
    }
    if (!defined $hit{$query}) {
    #$hit{$query} = "No hits found";
    print NO $Des{$query};
    print NO $Seq{$query};
    } else {
    print HIT $Des{$query};
    print HIT $Seq{$query};
    }
    }

    close NO;
    close HIT;

     
    • Wordpress Themes 12:12 pm on September 25, 2010 Permalink | Reply

      Good fill someone in on and this enter helped me alot in my college assignement. Thank you seeking your information.

    • Jerrod Rist 6:17 am on November 4, 2010 Permalink | Reply

      Superb blog post, precisely what I had been on the lookout for.

c
compose new post
j
next post/next comment
k
previous post/previous comment
r
reply
e
edit
o
show/hide comments
t
go to top
l
go to login
h
show/hide help
shift + esc
cancel