fastachar package¶
Submodules¶
fastachar.fasta_doc module¶
fastachar.fasta_io module¶
-
class
fastachar.fasta_io.
Alignment
(sequences=None)¶ Bases:
object
Class to hold sequences
-
are_sequences_of_equal_lengths
(sequences)¶ Check whether all sequences are equally long.
Parameters: sequences (list of str) – contains a list of sequence characters. Returns: True if all are equal length, False otherwise Return type: boolean
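The check described above can be sketched in a few lines. This is a minimal standalone illustration, not the package's own code; treating an empty list as "equal" is an assumption made here.

```python
def are_sequences_of_equal_lengths(sequences):
    """Return True if every sequence string has the same length."""
    # A set of lengths holds at most one element when all lengths agree.
    # Assumption: an empty list counts as trivially equal.
    return len({len(s) for s in sequences}) <= 1


aligned = ["ACGT-", "ACGTA", "AC-TA"]
ragged = ["ACGT", "ACG"]
print(are_sequences_of_equal_lengths(aligned))
print(are_sequences_of_equal_lengths(ragged))
```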
-
generate_regex_dict
(header_format, IDregex, SPECIESregex)¶ Generate dictionaries containing the regex for fasta header parsing.
Parameters: - header_format (str) – Regular expression for the header format. Must contain the literal strings {ID} and {SPECIES}
- IDregex (str) – Regular expression that should match the IDs or lab codes.
- SPECIESregex (str) – Regular expression that should match the species names
Returns: - pattern_dict (dict of {str : str}) – dictionary with the regex patterns
- regex_dict (dict of {str: re.compile}) – dictionary with compiled regular expressions.
Notes
The header_format string should contain both the literal strings {ID} and {SPECIES}, which are placeholders for the IDregex and SPECIESregex strings.
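To illustrate how the placeholders might be expanded, the sketch below substitutes the two regular expressions into the header format and compiles the result. The helper name `build_header_regex`, the use of named groups, and the sample header are assumptions for illustration only; the package may build its dictionaries differently. The regex values are the defaults listed for set_fasta_hdr_fmt.

```python
import re


def build_header_regex(header_format, id_regex, species_regex):
    """Expand the {ID} and {SPECIES} placeholders into named groups."""
    # Assumption: named groups are used so the parts can be read back by name.
    pattern = header_format.replace("{ID}", f"(?P<ID>{id_regex})")
    pattern = pattern.replace("{SPECIES}", f"(?P<SPECIES>{species_regex})")
    return re.compile(pattern)


header_re = build_header_regex(
    "{ID}[_ ]{SPECIES}",
    r"[A-Za-z0-9_]+[0-9\.]+[A-Za-z0-9]*",
    r"[A-Za-z_]+",
)

# Hypothetical fasta header: lab code, separator, species name.
match = header_re.match("WBET001_Genus_species")
print(match.group("ID"))       # WBET001
print(match.group("SPECIES"))  # Genus_species
```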
-
get_species_info
()¶ Return a dictionary mapping species name and lab codes.
Returns: species info with species name as key, IDs as values. Return type: dictionary of {str: str}
-
get_species_list
()¶ Get a sorted list of species names.
Returns: a list of sorted species names, and a list with the number of sequences for each species, in the same order. Return type: tuple of (list of str, list of int)
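The shape of this return value can be illustrated with a small standalone sketch. The function name `species_list` and its input (one species name per sequence) are assumptions; the real method reads this information from the loaded alignment.

```python
from collections import Counter


def species_list(species_per_sequence):
    """Return (sorted species names, number of sequences per species)."""
    counts = Counter(species_per_sequence)
    names = sorted(counts)
    # The counts follow the same order as the sorted names.
    return names, [counts[name] for name in names]


names, numbers = species_list(["Mus_b", "Mus_a", "Mus_b"])
print(names)    # ['Mus_a', 'Mus_b']
print(numbers)  # [1, 2]
```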
-
load
(fn)¶ Load sequence data from file
Parameters: fn (str) – filename of file to open Returns: - errorcode (int) – error code indicating what went wrong, if anything. Returns 0 if OK, otherwise see error codes above.
- arg (string) – Error message
-
parse_hdr
(hdr, **kwds)¶ Parse the header string of the sequence
Parameters: - hdr (string) – fasta header to parse
- **kwds – if available, pattern_dict and regex_dict are extracted from the parameter list, otherwise default values are used.
Returns: - IDstring (string) – a string representation of the ID or lab code
- species (string) – the name of the species
This method tries to parse the header of a sequence as read from a fasta file.
-
select_sequences
(regex, invert=False, exclude=None)¶ select sequences using regular expressions
Parameters: - regex (string) – a regular expression or exact string to match the species names
- invert (bool) – if True, the inverted selection is returned (not matching species)
- exclude (None or a regular expression) – exclude the matches that are included by the regex parameter.
Returns: List of sequences.
Return type: list of
fastachar.fasta_logic.Sequence
This method can be used to select a set of species using regular expressions. All species that match the specified regex are returned, or all except these if invert is set to True. An optional exclude regex can be given to filter the list further. This affects the behaviour of the invert option; see the note below.
Notes
If an expression is given for the exclude parameter and invert==True, then those sequences that match the regex selection AND the exclude selection are returned.
-
select_sequences_from_list
(itemlist)¶ Selects sequence objects from a list of species names
Parameters: itemlist (list of strings) – list of species names Returns: list of matching sequences. Return type: list of fastachar.fasta_logic.Sequence
-
select_two_sequence_sets
(regex)¶ Return two selections of sequences, one that matches regex, and one that does not.
Parameters: regex (string) – a regular expression or exact string Returns: - set_A (
fastachar.fasta_logic.Sequence
) – list of matching sequences - set_B (
fastachar.fasta_logic.Sequence
) – list of non-matching sequences
Notes
The union of the two sets is identical to the whole data set.
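The partition property stated in the note can be sketched as follows. Plain species-name strings stand in for Sequence objects, and matching with `re.match` is an assumption; the package may match differently.

```python
import re


def split_by_regex(species_names, regex):
    """Split names into those matching regex (set A) and the rest (set B)."""
    pattern = re.compile(regex)
    # Assumption: a name belongs to set A when the regex matches at its start.
    set_a = [name for name in species_names if pattern.match(name)]
    set_b = [name for name in species_names if not pattern.match(name)]
    return set_a, set_b


all_names = ["Mus_musculus", "Mus_spretus", "Rattus_rattus"]
set_a, set_b = split_by_regex(all_names, "Mus_")
print(set_a)
print(set_b)
```

Every name lands in exactly one of the two lists, so together they reproduce the whole data set.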
-
set_fasta_hdr_fmt
(header_format='{ID}[_ ]{SPECIES}', IDregex='[A-Za-z0-9_]+[0-9\\.]+[A-Za-z0-9]*', SPECIESregex='[A-Za-z_]+')¶ Sets the regular expressions used to parse the fasta headers
Parameters: - header_format (str) – regular expression and containing the strings {ID} and {SPECIES}
- IDregex (str) – regular expression matching IDs and lab codes
- SPECIESregex (str) – regular expression matching species names.
Notes
If the header strings cannot be parsed correctly, a workaround is to specify the header_format as ‘{SPECIES}’ and let the SPECIESregex capture anything by setting it to ‘.+’
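The effect of this workaround can be shown with the standard `re` module alone. The named group and the sample header are assumptions for illustration; the point is only that ‘.+’ accepts the whole header line as the species name.

```python
import re

# Workaround settings from the note above.
header_format = "{SPECIES}"
species_regex = ".+"

# Assumption: the placeholder is replaced by a named group.
pattern = re.compile(
    header_format.replace("{SPECIES}", f"(?P<SPECIES>{species_regex})")
)

# Hypothetical header that a stricter format would fail to parse.
header = "sample 12 | Genus species (lab 3)"
match = pattern.match(header)
print(match.group("SPECIES"))
```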
-
-
class
fastachar.fasta_io.
Report
(filename=None, output_filename=None, reportxls=None)¶ Bases:
object
Class for reporting results
Parameters: - filename (str, optional) – filename of fasta file used for data input
- output_filename (str or None, optional) – name of output filename. If not None, then output is sent to stdout.
- reportxls (
ReportXLS
or None, optional) – instance of an excel worksheet object
-
filename
¶ filename of fasta file used for data input
Type: str, optional
-
output_filename
¶ name of output filename. If not None, then output is sent to stdout.
Type: str or None, optional
Write footer of the report
-
report_header
(set_A, set_B, method)¶ Write header of the report
Parameters: - set_A (list of
fastachar.fasta_logic.Sequence
) – Sequence list A - set_B (list of
fastachar.fasta_logic.Sequence
) – Sequence list B - method (str) – Description of operation method
-
report_mdcs
(set_name, set_A, set_B, mdcs, method)¶ Write results of molecular diagnostic characters
Parameters: - set_name (str) – name of the set (List A for example)
- set_A (list of
fastachar.fasta_logic.Sequence
) – Sequence list A - set_B (list of
fastachar.fasta_logic.Sequence
) – Sequence list B - mdcs (list of tuples of (int,
fastachar.fasta_logic.State
)) – list of position and State tuples, i.e. molecular diagnostic characters - method (str) – short description of operation method.
-
report_nucs
(set_name, set_A, nucs)¶ Report non-unique characters in list of sequences
Parameters: - set_name (str) – Name of the set
- set_A (list of
fastachar.fasta_logic.Sequence
) – list of sequences - nucs (list of tuples of (int,
fastachar.fasta_logic.State
)) – list of position and State tuples
-
class
fastachar.fasta_io.
ReportXLS
¶ Bases:
object
A class to report results in Excel format.
-
clear
()¶ Remove all worksheets
-
define_styles
()¶ Define some styles used
Returns: s Return type: dict of styles
Report Header
Notes
Not implemented.
-
report_header
(set_A, set_B, method)¶ Write header of the report
Parameters: - set_A (list of
fastachar.fasta_logic.Sequence
) – Sequence list A - set_B (list of
fastachar.fasta_logic.Sequence
) – Sequence list B - method (str) – Description of operation method
-
report_mdcs
(set_name, set_A, set_B, mdcs, method)¶ Write results of molecular diagnostic characters
Parameters: - set_name (str) – name of the set (List A for example)
- set_A (list of
fastachar.fasta_logic.Sequence
) – Sequence list A - set_B (list of
fastachar.fasta_logic.Sequence
) – Sequence list B - mdcs (list of tuples of (int,
fastachar.fasta_logic.State
)) – list of position and State tuples, i.e. molecular diagnostic characters - method (str) – description of operation method
-
report_mdcs_summary
(set_A, set_B, mdcs, method)¶
-
report_nucs
(set_name, nucs)¶ Report non-unique characters in list of sequences
Parameters: - set_name (str) – Name of the set
- set_A (list of
fastachar.fasta_logic.Sequence
) – list of sequences - nucs (list of tuples of (int,
fastachar.fasta_logic.State
)) – list of position and State tuples
-
save
(fn)¶ Save to results file
Parameters: fn (str) – name of the file to write the results into Notes
If the workbook has no data, nothing is saved, and any errors are silently ignored.
-
fastachar.fasta_logic module¶
-
class
fastachar.fasta_logic.
Char
(c, masked)¶ Bases:
set
A character object representing a nucleotide in a sequence
The object is initialised with a character from the IUPAC list. Ambiguous characters, such as Y and W are expanded into their base nucleotides.
Parameters: - c (str) – IUPAC character
- subst_c (str) – IUPAC character substitute for logical operations.
-
_value
¶ (non-expanded) character representation of nucleotide character.
Type: str
Notes
The (IUPAC) characters supported are:
A T C G - (gap)
The ambiguous characters and their expansions:
Y -> C and T
R -> A and G
W -> A and T
S -> G and C
K -> T and G
M -> C and A
D -> A, G and T
V -> A, G and C
H -> A, C and T
B -> C, G and T
The masking characters X and N expand to A, C, T and G.
-
IUPAC
= {'-': '-', 'A': 'A', 'B': 'CGT', 'C': 'C', 'D': 'AGT', 'G': 'G', 'H': 'ACT', 'K': 'TG', 'M': 'CA', 'N': 'ACTG', 'R': 'AG', 'S': 'GC', 'T': 'T', 'V': 'AGC', 'W': 'AT', 'X': 'ACTG', 'Y': 'CT'}¶
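A short sketch of how such an expansion table supports set-based comparison. The table is copied from the IUPAC attribute above; the helper names `expand` and `may_be_same` are hypothetical and are not part of the Char class.

```python
IUPAC = {'-': '-', 'A': 'A', 'B': 'CGT', 'C': 'C', 'D': 'AGT', 'G': 'G',
         'H': 'ACT', 'K': 'TG', 'M': 'CA', 'N': 'ACTG', 'R': 'AG',
         'S': 'GC', 'T': 'T', 'V': 'AGC', 'W': 'AT', 'X': 'ACTG', 'Y': 'CT'}


def expand(c):
    """Expand an IUPAC character into the set of characters it stands for."""
    return set(IUPAC[c])


def may_be_same(c1, c2):
    """True if the two characters share at least one possible nucleotide."""
    return bool(expand(c1) & expand(c2))


print(sorted(expand('Y')))    # ['C', 'T']
print(may_be_same('Y', 'C'))  # True
print(may_be_same('Y', 'R'))  # False
```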
-
is_masked
¶ Evaluates to True if this character is a masked character.
-
class
fastachar.fasta_logic.
Sequence
(ID, species, sequence_chars)¶ Bases:
collections.UserList
A class to hold the information of a single sequence
Parameters: - ID (str) – ID or lab code
- species (str) – species name
- sequence_chars (str) – ascii representation of the sequence
-
PATTERNS
= (re.compile('^[N]+'), re.compile('[N]+$'), re.compile('^[X]+'), re.compile('[X]+$'), re.compile('^[-]+'), re.compile('[-]+$'))¶
-
get_masked_positions
(sequence_chars)¶ Get masked positions
Returns the positions where this sequence has a contiguous block of N, X or - characters, either leading or trailing.
Parameters: sequence_chars (str) – string of sequence characters Returns: m – True where masked N appears. Return type: list of int
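One way to find such leading and trailing blocks is to apply each of the PATTERNS expressions listed above and collect the index range of every match. This is a sketch under assumptions: the function name `masked_positions` is hypothetical, and returning 0-based indices is a choice made here, not necessarily the method's actual return convention.

```python
import re

# Same expressions as the PATTERNS attribute above.
PATTERNS = (re.compile('^[N]+'), re.compile('[N]+$'),
            re.compile('^[X]+'), re.compile('[X]+$'),
            re.compile('^[-]+'), re.compile('[-]+$'))


def masked_positions(sequence_chars):
    """Return sorted 0-based indices of leading/trailing N, X or - blocks."""
    positions = set()
    for pattern in PATTERNS:
        m = pattern.search(sequence_chars)
        if m:
            positions.update(range(m.start(), m.end()))
    return sorted(positions)


# The gap at index 6 is internal, so it is not reported.
print(masked_positions("NNACGT-AC--"))  # [0, 1, 9, 10]
```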
-
class
fastachar.fasta_logic.
SequenceLogic
¶ Bases:
object
Class for state comparison
-
compute_mdcs
(set_A, set_B, method='MDC')¶ Computes molecular diagnostic characters
This method computes molecular diagnostic characters by comparing the sequences in lists set_A and set_B. Two different criteria for comparison can be selected: return molecular diagnostic characters, or only the potential molecular diagnostic characters.
Parameters: method (str) – Method determining the comparison method:
- “MDC” returns Molecular Diagnostic Characters only: conditions 1 and 2 are honoured
- “potential_MDC_only” returns potential MDCs only: condition 2 is honoured, condition 1 is violated.
Returns: Each tuple contains the position, its state for list A, and its state for list B sequences.
-
list_non_unique_characters_in_set
(aset)¶ list non-unique characters in set.
Parameters: aset (list of Char) – list of sequences Returns: list of tuples of position and characters, for which more than one distinct character was found. Return type: list of tuple of (int, State)
-
list_unique_characters_in_set
(aset)¶ list where aset has unique characters
Parameters: aset (list of Char) – list of sequences Returns: list of tuples of position and characters, for which only one character was found. Return type: list of tuple of (int, State)
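The two listings above can be sketched column by column. This simplified illustration uses plain strings and Python sets instead of Sequence, Char and State objects, ignores IUPAC expansion, and uses 0-based positions; all of these are assumptions made for brevity.

```python
def column_states(sequences):
    """Collect the set of characters found at each alignment position."""
    # zip(*sequences) walks the aligned sequences column by column.
    return [set(column) for column in zip(*sequences)]


def non_unique_positions(sequences):
    """Positions where more than one distinct character occurs."""
    return [(i, s) for i, s in enumerate(column_states(sequences)) if len(s) > 1]


def unique_positions(sequences):
    """Positions where exactly one character occurs."""
    return [(i, s) for i, s in enumerate(column_states(sequences)) if len(s) == 1]


seqs = ["ACGT", "ACGA", "ATGA"]
print(non_unique_positions(seqs))
print(unique_positions(seqs))
```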
-
mark_unit_length_states_within_set
(aset)¶ marks for each position whether this position has a unique character
Parameters: aset (list of Char) Returns: a list of tuples; the first element is True for a unique character, the second element is the State holding the character(s) at this position. Return type: list of tuple of (bool, State)
-
-
class
fastachar.fasta_logic.
State
(chars)¶ Bases:
set
The purpose of this class is to hold a number of Char objects and treat them as a set.
Parameters: chars (iterable of :obj: Char) – -
_value
¶ ascii representation of characters.
Type: list of str
-
intersection_of_subsets
()¶
-
state
¶
-
update
(s)¶ update the set with a new element
Parameters: s (instance of a Char object) –
fastachar.tkgui module¶
Module implementing the graphical interface
-
fastachar.tkgui.
CONFIG
¶ default settings for the configuration files.
Type: dict of {string:string}
-
fastachar.tkgui.
DEFAULT
¶ default regular expressions
Type: dict
-
class
fastachar.tkgui.
Case
¶ Bases:
object
Class to hold the information for case files
-
data
¶ dictionary containing all information to write to file.
Type: dict
-
LIST_KWDS
= ['species', 'setA', 'setB']¶
-
load
(filename)¶ Load a case file
Parameters: filename (str) – name of case file Returns: - error (int) – error code
- arg (str) – error message
-
parse_line
(line)¶ Parse a line read from the case file
Parameters: line (str) – header string Returns: - kwd (str) – attribute of the configuration
- value (str or list of str) – the value of the attribute
-
populate
(filename, species, setA, setB, operation, regex_header_format, regex_id, regex_species)¶ Write the case information into data dictionary
Parameters: - filename (str) – Name of input fasta file
- species (list of str) – Names of all species read
- setA (list of
fastachar.fasta_logic.Sequence
) – list of sequences in list A - setB (list of
fastachar.fasta_logic.Sequence
) – list of sequences in list B - operation (int) – operation of comparison
- regex_header_format (str) – regular expression for the header format
- regex_id (str) – regular expression for matching the ID or lab codes
- regex_species (str) – regular expression for matching the name of the species.
-
save
(filename)¶ Save a case file
Parameters: filename (str) – Name of the case file
-
-
class
fastachar.tkgui.
ConfigFastachar
¶ Bases:
object
Class to contain the configuration of the Fastachar gui
-
config
¶ A ConfigParser object holding the configuration read from file.
Type: configparser.ConfigParser
-
get_home
()¶ Gets the user’s home directory
Returns: home_dir – Home directory Return type: str
-
get_path
()¶ Return the full path to the configuration file.
Returns: path Return type: str
-
load
()¶ Load and parse configuration file
Upon calling this method, the configuration dictionary self.config gets populated.
-
save
()¶ Save the current configuration to file.
This method writes self.config to file.
-
set_defaults
(section, **p)¶ Sets the default values for the configuration file
Parameters: - section (str) – section name of the configuration
- **p – optional keywords that are part of the section
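Since the configuration is held in a configparser.ConfigParser object, setting defaults for a section could look like the sketch below. It is a standalone function rather than the class method, and both the rule "only fill in keys that are missing" and the section and key names are assumptions, not confirmed behaviour of ConfigFastachar.

```python
import configparser


def set_defaults(config, section, **p):
    """Add section if needed and fill in any keys that are not yet set."""
    if not config.has_section(section):
        config.add_section(section)
    for key, value in p.items():
        # Assumption: values already present in the file win over defaults.
        if not config.has_option(section, key):
            config.set(section, key, str(value))


config = configparser.ConfigParser()
# Hypothetical existing configuration with one key already set.
config.read_string("[regex]\nid = [0-9]+\n")
set_defaults(config, "regex", id="[A-Z]+", species="[A-Za-z_]+")
print(config.get("regex", "id"))
print(config.get("regex", "species"))
```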
-
-
class
fastachar.tkgui.
Gui
¶ Bases:
object
Class defining the graphical user interface
-
root
¶ main window
Type: Tk.Tk()
-
config
¶ Configuration of FastaChar
Type: ConfigFastachar
-
cwd
¶ current working directory
Type: str
-
alignment
¶ aligned sequences.
Type: fasta_io.Alignment
-
reportxls
¶ object for reporting results as excel work sheets.
Type: fasta_io.ReportXLS
-
about_window
()¶ Create and populate the About window
-
cb_about
()¶ Callback to call about window
-
cb_b1_motion_lb
(event)¶
-
cb_b1_release_lb
(event)¶
-
cb_close_regex
(window, v)¶ callback to close regex window
Parameters: - window – window to close
- v – list with Tk variables holding the regexes.
Notes
If a fasta file is marked for successful opening, it will be read.
-
cb_clr
()¶ Callback to clear output
-
cb_help
()¶ Callback to call help window
-
cb_open_case_file
()¶ Callback to open case file.
-
cb_open_fasta_file
()¶ Callback to open a fasta file.
-
cb_open_fasta_file_for_hdr
(parent, regexs)¶ Callback function to open a fasta file for reading headers
Parameters: - parent – parent window
- regexes (list) – list of regular expressions
Notes
This method sets fasta_file. Depending on the result of reading the file, it also sets fasta_file_is_valid.
-
cb_open_text_window
(lines)¶ Create a general text window
Parameters: lines (list of str) – Text to be displayed
-
cb_reset
(v)¶ Callback to reset the regular expression to the default values.
-
cb_run
()¶ Callback to run operation and do the comparison.
-
cb_save_case_file
()¶ Callback to save the case file.
-
cb_save_report
()¶ Callback to save report.
-
cb_save_report_xls
()¶ Callback to save report as xls file.
-
cb_set_regex
()¶ Callback function to allow the user to set the regular expressions
Notes
This method sets fasta_file initially to None; it may later be set to a str by other methods called from this callback.
-
cb_set_working_dir
()¶ Callback to set the working directory
-
create_bindings
()¶ Create the key bindings
-
create_layout
()¶ Create the main layout of the application
Create the Menu entries
-
error_window
(err_code, arg='', parent=None)¶ Create and populate an error window
Parameters: - err_code (int) – error code
- arg (str) – error message to be displayed.
-
getcwd
()¶ Get current working directory
Returns: cwd – current working directory Return type: str
-
help_window
()¶ Create and populate the Help window
-
move_items
(lb_from, lb_to)¶ Move items from one list to another
Parameters: - lb_from – from list (listbox)
- lb_to – to list (listbox)
-
open_case_file
(case_file)¶ Read and process a case file.
-
open_fasta_file
()¶ Read and process a fasta file.
-
populate_list_with_items
(items, lb, delete_all=True)¶
-
release_in_listbox
(event)¶
-
setcwd
(cwd)¶ Set current working directory
Parameters: cwd (str) – current working directory
-
-
fastachar.tkgui.
main
()¶ Main function starting the GUI