NAME

CompBio::Simple.pm - Simple OO interface to some basic methods useful in bioinformatics.


SYNOPSIS

use CompBio::Simple;

my $cbs = CompBio::Simple-new;>

my $fa_seq = $cbs-tbl_to_fa($tblseq,%params);>

$tblseq could be either scalar, array by reference, 2d array by reference, or a hash by reference; return data type is same as submitted (see RETURN_TYPE option under general method description). Scalar may be either sequence records (one record per line), or a filename. Array's should contain an entire sequence record (loci\tsequence) in each indexed position. 2D arrays should be ids in the first column, sequences in the second, such that $array[2][1] would be the aa sequence for the third record. Hash's should have the loci as the key, and only the sequence as the value. Newlines at the ends of sequences in array and hash types are unneccisarry and will be stripped off


DESCRIPTION

Originally developed at the BioMolecular Engineering Research Center (http://bmerc-www.bu.edu), this module is intended to take a number of small commonly used methods, and make them into a single package.

The early versions of this module assumed installation on our local system. Although I have tried to correct this in the current version, you may find this package requires a litle twidling to get working. I'll try to leave comments where I think it is most likely, but hopefully use of a relational database and local setting changes in the base module CompBio.pm will have taken care of it. If not _please_ email me at seanq@darwin.bu.edu with the details.

Thanks!


Methods

A couple of important general notes. All methods will describe the key required arguments and will also indicate which arg is the optional %params hash. %params can be passed as a hash or reference. Also, most arguments that handle sequences or ids accept input as scalar (as is or by reference), array (by reference), 2D array (by reference), or hash (by reference). scalar may be a filename. If I miss documenting it, try it just in case. Unless a parameter option is passed to signify otherwise, I assume 'doing the right thing' is to return the same type of data construct as recieved. The prefered method for submitting sequence data is as a reference to an array, each index containing one sequence record (and that preferably in table format).

All methods in Simple.pm accept a %params hash as the final argument. Options available for a specific method are described in that methods section. Options that can be defined for any (well, more than one at least) method are:

RETURN_TYPE <type>
Value from: (SCALAR,REFSCALAR,ARRAY,2D,HASH)

This option overrides the default behavior of returning data in the same format submitted and return data in the manner specified.

OUTFILE <filename>
Opens up <filename> for writing. Results of operation are written to file and 0 is returned.

DEBUG <value>
A numeric value supplied with this option sets the debug level for the specific method call only. Does not affect the 'global' debug level set when creating a new CompBio::Simple object.

Also note that the params hash is shared by all methods called by the method invoked, so params for those internal methods could also be added. The most notable case of where this might be used is that check_type is actually called by almost every other method (Simple tries to never assume or restrict what sequence format is going to be used). So check_type's CONFIDENCE option could be included to affect it's behavior.

new

Construct an object for invoking methods in CompBio::Simple.

Options:

DEBUG: Sets default debug level for all method calls using this object. Can be overridden for a specific operation by defining a DEBUG option with the method call. Higher integer values indicate more output. 1 and 2 are good enough generally, 3 and higher can produce overwhelming amounts of output, 4 or higher will also cause the program to die where it would otherwise warn.


notes

OK, so the idea of handeling the output/return method with dynamically designed code is good, but current state is ugly and re-introduces the problem of converting fa back to table on tbl_to_fa type calls. So the kludge below will no longer work. Somehow we need to pass the method into _munge_input so the right RET_CODE can be designed (hopefully more gracefully). i.e. if in tbl_to_fa, just send to chosen output stream (how to let user define an output stream, such as a network socket?) without converting back to tbl. However, current edit state has moved return to _munge_input, which only sets it for file/stdout. It needs to be set whenever needed, either sending to filehandle or into array as nec.

OK, this whole mess badly needs to be cleaned up. Likely we need to either give up on using autoload, or be a lot more explicit in the breakdown.

check_type

Checks a given sequence or set of sequences for it's type. Currently groks fasta(.fa), table(.tbl), raw genome(.raw), intelligenics[?](.ig) and coding dna sequence(.cdna) types. Each index of the referenced array should be an entire sequence record. * It is however not recomended that you load up an entire raw genome into memory to do this test - see perlfunc read *

$type = check_type(\@seqs,%parameters);


Possible return types are CDNA, TBL, FA, IG, RAW and UNKNOWN.

OPTION:

CONFIDENCE: Set this parameter to an positive integer representing the number of checks against your sequence set you would like performed. The default value is 3, however if only one or two sequences are in your set only one check will be performed. If CONFIDENCE is set to a value equal to or greater than the number of sequences in your set, it will be reset to a value approx. 66% of the number of sequences in your set (a very high level of confidence).

Check type will then make CONFIDENCE # of tests on youe sequence set, randomly selecting 2 members into a subset and checking the format. If more than one format type guesses check_type will throw an exception reporting all the types found.

tbl_to_fa

Converts sequence records in table (tab delimited, usually .tbl file extension) format to fasta format.

$fa_seq = $cbs-tbl_to_fa($tblseq,%params);>

Extra data fields in the table format will be added to the annotation line (>) in the fasta output.

tbl_to_ig

Converts sequence records in table (tab delimited) format to .ig format.

$aref_igseqs = $cbc-tbl_to_ig(\@tbl_seqs,%params);>

Extra data fields in the table format will be placed on a single comment line (;) in the ig output.

fa_to_tbl

Converts sequence records in fasta format to table format.

$aref_faseqs = $cbc-fa_to_tbl(\@fa_seq);>

Extra data in the fasta format will be placed in an extra tab delimited field after the sequence record.

Options:

CLEAN: Reduces signifier to the first strech of non white space characters found at least 4 characters long with none of the characters (| \ ? / ! *)

ig_to_tbl

Accepts ig format sequences in a referenced array, one complete sequence record per index. This method returns the sequence(s) in table(.tbl) format contained in a referenced array.

$aref_igseqs = $cbc-ig_to_tbl(\@fa_seq);>

Extra comment lines in the ig format will be placed as extra tab delimited values after the sequence record.

dna_to_aa

Convert dna sequences, containing no whitespace, to the amino acid residues as coded by the standard 'universal' genetic code. dna sequence may contain standard special characters (i.e. R,S,B,N, ect.).

$aa = dna_to_aa(\$dna_seq,%params);

Options:

C: Set to a true value to indicate dna should be converted to it's complement before translation.

ALTCODE: A reference to a hash containing alternate coding keys where the value is the new aa to code for. Stop codons are represented by ``.''.

SEQFIX: Set to true to alter first position, making V or L an M, and removing stop in last position.

complement

Converts dna to it's complementary strand. DNA sequence is submitted as scalar by reference. There is no return as sequence is modified. To maintain original sequence, send a reference to a copy.

complement(\$dna);

six_frame

$result = six_frame($raw_file,$id,$seq_len,$out_file);


Converts a submitted dna sequence into all 6 frame translations to aa.
Output id's have start and stop positions encoded; if first value is larger,
translated from anti-sense.

Options:

ALTCODE: A reference to a hash containing alternate coding keys where the value is the new aa to code for. Stop codons are represented by ``.''. Multiple allowable values for the final dna in the codon may be provided in the format AT[TCA].

ID: Prefix for translated segment identifiers, default is SixFrame.

SEQLEN: Minimum length of aa sequence to return, default is 10.

OUTPUT: A filename to pipe results to. This is recomended for large dna sequences such as large contigs or whole chromosomes, otherwise results are stored in memory until process complete ** this includes the use of the general method option of OUTFILE ** . If the value of OUTFILE is STDOUT results will be sent directly to standard out.

AA_HASH

Creates a hash table with amino acid lookups by codon. Includes all cases where even an alternate na code (such as M for A or C) would return an unambiguous aa. Also consistent with the complement method in this package, ie, lower cases in some contexts, for ease of use with six_frame.

 C<%aa_hash = aa_hash;>


EXPORT

None by default.


HISTORY

  1. 01
    Original version; created by h2xs 1.20 with options
      -AXC -n CompBio::Simple

  2. 42
    Got initial set of methods linking to CompBio in. Developed the _munge stuff.

  3. 43
    Brought tests up to same point as in CompBio. Added six_frame. Moved the data handeling for dna_to_aa to own _munge type internal method. Added RETURN_TYPE parameter to _munge input so user can pick. Created the CompBio::Simple::ArrayFile module to TIE a sequence file to an array. Added six_frame using same array processing _munge as dna_to_aa.

  4. 44
    Made seriouse attempt to get the docs caught up. Added random sampling and CONFIDENCE to check_type.

  5. 46
    Lots of changes to this version, I'm a bit behind in making version notes I'm afraid, and due to time constraints, these will be too brief, so I'll just try to get up to date.

    Simple seems to now handle using autoload and not only passes the built in tests in test.pl, but the more brutal usage of a few of the utils. Unfortunately this has left a bit of sloppy code as I fixed bugs without real rewrites. The design of using munges in and out seems to be holding up, the major complications comming from IO to and from files and the tied array. While that seems to have been adressed functionally, I look foward to making the code cleaner, and hopefully more eleagant, for the next major version. I also think _munge_array_to_scalar has been largely made needless by the new layout (CompBio handles loops internally in most functions now), and the RET_CODE parameter opens up some exiting possibilities.


TO DO

_munge_seq_imput needs to also detect when it just has a list of loci and hand request to DB to attempy to fulfil, and/or accept a DB param.

Look at using a tie type method similar to ArrayFile.pm for fetching data from a database - simply throwing an exception if no db module provided (eval use?).

rewrite (<sigh>) _munges to detect submited format (fasta, etc) and return same format whenever possible by default, and adding RETURN_FORMAT parameter, negating the need for user to call converter after dna_to_aa for example. *This is probably over the top since most methods _are_ the converters, but why make myself write if/elsif block in all the utils that offer different format returns?*


COPYRIGHT

Developed at the BioMolecular Engineering Research Center at Boston University under the NHLBI's Programs for Genomic Applications grant.

Copyright Sean Quinlan, Trustees of Boston University 2000-2001.

All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.


AUTHOR

Sean Quinlan, seanq@darwin.bu.edu

Please email me with any changes you make or suggestions for changes/additions. Latest version is available under ftp://mcclintock.bu.edu/BMERC/perl/.

Thank you!


SEE ALSO

perl(1), CompBio(3).