fastachar package

Submodules

fastachar.fasta_doc module

fastachar.fasta_io module

class fastachar.fasta_io.Alignment(sequences=None)

Bases: object

Class to hold sequences

are_sequences_of_equal_lengths(sequences)

Check whether all sequences are equally long.

Parameters:sequences (list of str) – contains a list of sequence characters.
Returns:True if all are equal length, False otherwise
Return type:boolean
generate_regex_dict(header_format, IDregex, SPECIESregex)

Generate dictionaries containing the regex for fasta header parsing.

Parameters:
  • header_format (str) – Regular expression for the header format. Must contain the literal strings {ID} and {SPECIES}
  • IDregex (str) – Regular expression that should match the IDs or lab codes.
  • SPECIESregex (str) – Regular expression that should match the species names
Returns:

  • pattern_dict (dict of {str : str}) – dictionary with the regex patterns
  • regex_dict (dict of {str: re.compile}) – dictionary with compiled regular expressions.

Notes

The header_format string should contain both the literal strings {ID} and {SPECIES}, which are placeholders for the IDregex and SPECIESregex strings.
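
The placeholder substitution described in this note can be sketched as follows. This is a minimal illustration and not the package's implementation: the helper name build_header_regex, the use of named groups and the sample header are all assumptions. The regular expressions are the default values listed for set_fasta_hdr_fmt in this section.

```python
import re


def build_header_regex(header_format, id_regex, species_regex):
    """Hypothetical helper: replace {ID} and {SPECIES} by named groups."""
    pattern = header_format.replace("{ID}", "(?P<ID>" + id_regex + ")")
    pattern = pattern.replace("{SPECIES}", "(?P<SPECIES>" + species_regex + ")")
    return re.compile(pattern)


# Default values as documented for set_fasta_hdr_fmt.
header_regex = build_header_regex(
    "{ID}[_ ]{SPECIES}",
    r"[A-Za-z0-9_]+[0-9\.]+[A-Za-z0-9]*",
    r"[A-Za-z_]+",
)

# Made-up header, for illustration only.
match = header_regex.match("WBET042 Amphiura_filiformis")
if match:
    print(match.group("ID"), match.group("SPECIES"))
```

Named groups are only one way to recover the two fields; separate compiled expressions for the ID and the species, as suggested by the regex_dict return value, would work as well.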

get_species_info()

Return a dictionary mapping species name and lab codes.

Returns:species info with species name as key, IDs as values.
Return type:dictionary of {str: str}
get_species_list()

Get a sorted list of species names.

Returns:a sorted list of species names, and a list with the number of sequences for each species (in the same order).
Return type:tuple of (list of str, list of int)
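
The shape of this return value can be illustrated with a short sketch. The function name species_list_with_counts and the species names are made up; the sketch takes a plain list of species names, one per sequence, rather than an Alignment object.

```python
from collections import Counter


def species_list_with_counts(species_names):
    """Return (sorted unique names, number of sequences per name)."""
    counts = Counter(species_names)
    names = sorted(counts)
    return names, [counts[name] for name in names]


names, numbers = species_list_with_counts(
    ["Ophiura_albida", "Amphiura_filiformis", "Ophiura_albida"]
)
print(names)    # ['Amphiura_filiformis', 'Ophiura_albida']
print(numbers)  # [1, 2]
```
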
load(fn)

Load sequence data from file

Parameters:fn (str) – filename of file to open
Returns:
  • errorcode (int) – error code indicating what went wrong, if anything. Returns 0 if OK, otherwise see error codes above.
  • arg (string) – Error message
parse_hdr(hdr, **kwds)

Parse the header string of the sequence

Parameters:
  • hdr (string) – fasta header to parse
  • **kwds – if available, pattern_dict and regex_dict are extracted from the parameter list, otherwise default values are used.
Returns:

  • IDstring (string) – a string representation of the ID or lab code
  • species (string) – the name of the species

This method tries to parse the header of a sequence as read from a fasta file.

select_sequences(regex, invert=False, exclude=None)

Select sequences using regular expressions

Parameters:
  • regex (string) – a regular expression or exact string to match the species names
  • invert (bool) – if True, the inverted selection is returned (not matching species)
  • exclude (None or a regular expression) – exclude the matches that are included by the regex parameter.
Returns:

List of sequences.

Return type:

list of fastachar.fasta_logic.Sequence

This method can be used to select a set of species using regular expressions. All species that match the specified regex are returned, or all except these if invert is set to True. An optional exclude regex can be given to filter the list further. This affects the behaviour of the invert option, see the note below.

Notes

If an expression is given for the exclude parameter and invert==True, then those sequences that match the regex selection AND the exclude selection are returned.
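
The interplay of regex, invert and exclude can be sketched as below. This follows the behaviour stated in this section literally and is not taken from the package source: the function name select, the Seq stand-in for fastachar.fasta_logic.Sequence, the use of re.match (anchored at the start of the species name) and the sample data are assumptions.

```python
import re
from collections import namedtuple

# Stand-in for fastachar.fasta_logic.Sequence, with only the fields used here.
Seq = namedtuple("Seq", ["ID", "species"])


def select(sequences, regex, invert=False, exclude=None):
    rx = re.compile(regex)
    if exclude is None:
        # Plain selection, or its complement when invert is True.
        return [s for s in sequences if bool(rx.match(s.species)) != invert]
    rx_ex = re.compile(exclude)
    if invert:
        # As documented: matches of regex AND of exclude.
        return [s for s in sequences
                if rx.match(s.species) and rx_ex.match(s.species)]
    # Matches of regex that are not matched by exclude.
    return [s for s in sequences
            if rx.match(s.species) and not rx_ex.match(s.species)]


seqs = [Seq("A1", "Ophiura_albida"),
        Seq("A2", "Ophiura_ophiura"),
        Seq("A3", "Amphiura_filiformis")]

print([s.ID for s in select(seqs, "Ophiura")])               # ['A1', 'A2']
print([s.ID for s in select(seqs, "Ophiura", invert=True)])  # ['A3']
print([s.ID for s in select(seqs, "Ophiura", exclude="Ophiura_albida")])  # ['A2']
print([s.ID for s in select(seqs, "Ophiura", invert=True,
                            exclude="Ophiura_albida")])      # ['A1']
```

Note that with an exclude expression the inverted selection is not the complement of the whole data set: in the last call, A3 is in neither result.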

select_sequences_from_list(itemlist)

Selects sequence objects from a list of species names

Parameters:itemlist (list of strings) – list of species names
Returns:list of matching sequences.
Return type:list of fastachar.fasta_logic.Sequence
select_two_sequence_sets(regex)

Return two selections of sequences, one that matches regex, and one that does not.

Parameters:regex (string) – a regular expression or exact string
Returns:two selections of sequences, one that matches regex, and one that does not.

Notes

The union of the two sets is identical to the whole data set.

set_fasta_hdr_fmt(header_format='{ID}[_ ]{SPECIES}', IDregex='[A-Za-z0-9_]+[0-9\\.]+[A-Za-z0-9]*', SPECIESregex='[A-Za-z_]+')

Sets the regular expressions used to parse the fasta headers

Parameters:
  • header_format (str) – regular expression containing the literal strings {ID} and {SPECIES}
  • IDregex (str) – regular expression matching IDs and lab codes
  • SPECIESregex (str) – regular expression matching species names.

Notes

If the header strings cannot be parsed correctly, a workaround is to specify the header_format as ‘{SPECIES}’, and let the SPECIESregex capture anything by setting it to ‘.+’
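
The effect of this workaround can be seen in a short sketch. It assumes, purely for illustration, that the {SPECIES} placeholder is replaced by a named group; the actual substitution scheme of the package is not shown in this section, and the header is made up.

```python
import re

# Workaround values from the note.
header_format = "{SPECIES}"
species_regex = ".+"

# Assumed substitution scheme: the placeholder becomes a named group.
pattern = re.compile(
    header_format.replace("{SPECIES}", "(?P<SPECIES>" + species_regex + ")")
)

# A header that does not follow an "ID species" layout.
match = pattern.match("sample 17 | unknown brittle star")
print(match.group("SPECIES"))  # sample 17 | unknown brittle star
```

With these settings the whole header line ends up as the species name, so sequences can only be grouped by their complete header.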

class fastachar.fasta_io.MyWorkbook(*p, **kw)

Bases: xlwt.Workbook.Workbook

remove_all_sheets()
class fastachar.fasta_io.Report(filename=None, output_filename=None, reportxls=None)

Bases: object

Class for reporting results

Parameters:
  • filename (str, optional) – filename of fasta file used for data input
  • output_filename (str or None, optional) – name of output filename. If None, then output is sent to stdout.
  • reportxls (ReportXLS or None, optional) – instance of an excel worksheet object
filename

filename of fasta file used for data input

Type:str, optional
output_filename

name of output filename. If None, then output is sent to stdout.

Type:str or None, optional
reportxls

instance of an excel worksheet object

Type:ReportXLS or None, optional

Write footer of the report

report_header(set_A, set_B, method)

Write header of the report

Parameters:
report_mdcs(set_name, set_A, set_B, mdcs, method)

Write results of molecular diagnostic characters

Parameters:
report_nucs(set_name, set_A, nucs)

Report non-unique characters in list of sequences

Parameters:
class fastachar.fasta_io.ReportXLS

Bases: object

A class to report results in Excel format.

clear()

Remove all worksheets

define_styles()

Define some styles used

Returns:s
Return type:dict of styles

Report Header

Notes

Not implemented.

report_header(set_A, set_B, method)

Write header of the report

Parameters:
report_mdcs(set_name, set_A, set_B, mdcs, method)

Write results of molecular diagnostic characters

Parameters:
report_mdcs_summary(set_A, set_B, mdcs, method)
report_nucs(set_name, nucs)

Report non-unique characters in list of sequences

Parameters:
save(fn)

Save to results file

Parameters:fn (str) – name of the file to write the results into

Notes

If the workbook has no data, nothing is saved, and any errors are silently ignored.

fastachar.fasta_logic module

class fastachar.fasta_logic.Char(c, masked)

Bases: set

A character object representing a nucleotide in a sequence

The object is initialised with a character from the IUPAC list. Ambiguous characters, such as Y and W are expanded into their base nucleotides.

Parameters:
  • c (str) – IUPAC character
  • subst_c (str) – IUPAC character substitute for logical operations.
_value

(non-expanded) character representation of nucleotide character.

Type:str

Notes

The (IUPAC) characters supported are:

A
T
C
G
- (gap)

The ambiguous characters and their expansions:

Y -> C and T
R -> A and G
W -> A and T
S -> G and C
K -> T and G
M -> C and A

D -> A, G and T
V -> A, G and C
H -> A, C and T
B -> C, G and T

The masking characters X and N expand to A, C, T and G.

IUPAC = {'-': '-', 'A': 'A', 'B': 'CGT', 'C': 'C', 'D': 'AGT', 'G': 'G', 'H': 'ACT', 'K': 'TG', 'M': 'CA', 'N': 'ACTG', 'R': 'AG', 'S': 'GC', 'T': 'T', 'V': 'AGC', 'W': 'AT', 'X': 'ACTG', 'Y': 'CT'}
is_masked

Evaluates to True if this character is a masked character.
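
The expansion of ambiguous characters into sets can be sketched with plain Python sets. The IUPAC table is copied from the class attribute above; the function name expand is an assumption, and the sketch leaves out masking and the Char class itself.

```python
IUPAC = {'-': '-', 'A': 'A', 'B': 'CGT', 'C': 'C', 'D': 'AGT', 'G': 'G',
         'H': 'ACT', 'K': 'TG', 'M': 'CA', 'N': 'ACTG', 'R': 'AG',
         'S': 'GC', 'T': 'T', 'V': 'AGC', 'W': 'AT', 'X': 'ACTG', 'Y': 'CT'}


def expand(c):
    """Expand an IUPAC character into the set of nucleotides it stands for."""
    return set(IUPAC[c])


print(sorted(expand("Y")))                  # ['C', 'T']
print(expand("Y") & expand("W"))            # {'T'}
print(expand("R").isdisjoint(expand("Y")))  # True
```

Treating each character as a set makes comparisons between ambiguous characters a matter of set operations: Y and W share T, whereas R and Y have nothing in common.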

class fastachar.fasta_logic.Sequence(ID, species, sequence_chars)

Bases: collections.UserList

A class to hold the information of a single sequence

Parameters:
  • ID (str) – ID or lab code
  • species (str) – species name
  • sequence_chars (str) – ascii representation of the sequence
PATTERNS = (re.compile('^[N]+'), re.compile('[N]+$'), re.compile('^[X]+'), re.compile('[X]+$'), re.compile('^[-]+'), re.compile('[-]+$'))
get_masked_positions(sequence_chars)

Get masked positions

Returns the positions where this sequence has a continuous block of N, X or - characters, either leading or trailing.

Parameters:sequence_chars (str) – string of sequence characters
Returns:m – True where masked N appears.
Return type:list of int
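
A minimal sketch of how leading and trailing blocks could be located with the PATTERNS listed above. The function name masked_positions and the choice to return a sorted list of 0-based indices are assumptions; the package may represent the result differently.

```python
import re

# Same patterns as the PATTERNS class attribute.
PATTERNS = (re.compile('^[N]+'), re.compile('[N]+$'),
            re.compile('^[X]+'), re.compile('[X]+$'),
            re.compile('^[-]+'), re.compile('[-]+$'))


def masked_positions(sequence_chars):
    """Collect indices covered by a leading or trailing block of N, X or -."""
    positions = set()
    for pattern in PATTERNS:
        m = pattern.search(sequence_chars)
        if m:
            positions.update(range(m.start(), m.end()))
    return sorted(positions)


print(masked_positions("NNACGT--"))  # [0, 1, 6, 7]
print(masked_positions("ACNNGT"))    # []
```

The second call shows that N characters in the interior of a sequence are not reported, because every pattern is anchored to the start or the end of the string.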
class fastachar.fasta_logic.SequenceLogic

Bases: object

Class for state comparison

compute_mdcs(set_A, set_B, method='MDC')

Computes molecular diagnostic characters

Parameters:
  • set_A (list of Char) – list of sequences in list A
  • set_B (list of Char) – list of sequences in list B
  • method ({"MDC", "potential_MDC_only"}) – method of comparison.
Returns:

Each tuple contains the position, its state for list A, and its state for list B sequences.

Return type:

list of tuples of (int, State, State)

This method computes molecular diagnostic characters by comparing the sequences in list set_A and set_B. Two different criteria for comparison can be selected: return molecular diagnostic characters, or only the potential molecular diagnostic characters.

The method parameter determines the comparison method:
  • “MDC” returns Molecular Diagnostic Characters only:
    conditions 1 and 2 are honoured.
  • “potential_MDC_only” returns potential MDCs only:
    condition 2 is honoured, condition 1 is violated.
list_non_unique_characters_in_set(aset)

List non-unique characters in set.

Parameters:aset (list of Char) – list of sequences
Returns:list of tuples of position and characters, for which more than one distinct character was found.
Return type:list of tuple of (int, State)
list_unique_characters_in_set(aset)

List the positions where aset has a unique character

Parameters:aset (list of Char) – list of sequences
Returns:list of tuples of position and characters, for which only one character was found.
Return type:list of tuple of (int, State)
mark_unit_length_states_within_set(aset)

Marks for each position whether this position has a unique character

Parameters:aset (list of Char) – list of sequences
Returns:a list of tuples, one per position: the first element is True if the position has a unique character, and the second element is the State holding the character(s) at this position.
Return type:list of tuple of (bool, State)
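
The three set-inspection methods above can be illustrated with a simplified sketch. It uses plain strings and Python sets in place of Sequence, Char and State objects, ignores ambiguous and masked characters, and reports 0-based positions; all function names are made up for this illustration.

```python
def states_per_position(aset):
    """For each column, collect the set of characters seen across sequences."""
    return [set(column) for column in zip(*aset)]


def mark_unit_length_states(aset):
    """Pair each position's state with a flag: True if it holds one character."""
    return [(len(state) == 1, state) for state in states_per_position(aset)]


def unique_characters(aset):
    marked = mark_unit_length_states(aset)
    return [(i, state) for i, (is_unique, state) in enumerate(marked) if is_unique]


def non_unique_characters(aset):
    marked = mark_unit_length_states(aset)
    return [(i, state) for i, (is_unique, state) in enumerate(marked) if not is_unique]


aset = ["ACGT", "ACGA", "ATGA"]
print(unique_characters(aset))                           # [(0, {'A'}), (2, {'G'})]
print([i for i, _ in non_unique_characters(aset)])       # [1, 3]
```

In this example positions 0 and 2 carry the same character in all three sequences, whereas positions 1 and 3 each show two different characters.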
class fastachar.fasta_logic.State(chars)

Bases: set

The class’ purpose is to hold a number of Char objects
and treat these as a set.
Parameters:chars (iterable of Char) –
_value

ascii representation of characters.

Type:list of str
intersection_of_subsets()
state
update(s)

Update the set with a new element

Parameters:s (instance of a Char object) –

fastachar.tkgui module

Module implementing the graphical interface

fastachar.tkgui.CONFIG

default settings for the configuration files.

Type:dict of {string:string}
fastachar.tkgui.DEFAULT

default regular expressions

Type:dict
class fastachar.tkgui.Case

Bases: object

Class to hold the information for case files

data

dictionary containing all information to write to file.

Type:dict
LIST_KWDS = ['species', 'setA', 'setB']
clear()

Clear data

load(filename)

Load a case file

Parameters:filename (str) – name of case file
Returns:
  • error (int) – error code
  • arg (str) – error message
parse_line(line)

Parse a line read from the case file

Parameters:line (str) – header string
Returns:
  • kwd (str) – attribute of the configuration
  • value (str or list of str) – the value of the attribute
populate(filename, species, setA, setB, operation, regex_header_format, regex_id, regex_species)

Write the case information into data dictionary

Parameters:
  • filename (str) – Name of input fasta file
  • species (list of str) – Names of all species read
  • setA (list of fastachar.fasta_logic.Sequence) – list of sequences in list A
  • setB (list of fastachar.fasta_logic.Sequence) – list of sequences in list B
  • operation (int) – operation of comparison
  • regex_header_format (str) – regular expression for the header format
  • regex_id (str) – regular expression for matching the ID or lab codes
  • regex_species (str) – regular expression for matching the name of the species.
save(filename)

Save a case file

Parameters:filename (str) – Name of the case file
class fastachar.tkgui.ConfigFastachar

Bases: object

Class to contain the configuration of the Fastachar gui

config

A ConfigParser object holding the configuration read from file.

Type:configparser.ConfigParser
get_home()

Gets the user’s home directory

Returns:home_dir – Home directory
Return type:str
get_path()

Return the full path to the configuration file.

Returns:path
Return type:str
load()

Load and parse configuration file

Upon calling this method, the configuration dictionary self.config gets populated.

save()

Save the current configuration to file.

This method writes self.config to file.

set_defaults(section, **p)

Sets the default values for the configuration file

Parameters:
  • section (str) – section name of the configuration
  • **p – optional keywords that are part of the section
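
How defaults might be merged into a configparser.ConfigParser can be sketched as follows. This is an illustration under stated assumptions and not the ConfigFastachar implementation: the file name, the section name regex, the option names and the rule that existing values are never overwritten are all hypothetical.

```python
import configparser
import io
import os


def get_path(filename=".fastachar_example.ini"):  # hypothetical file name
    """Full path of a configuration file in the user's home directory."""
    return os.path.join(os.path.expanduser("~"), filename)


def set_defaults(config, section, **p):
    """Add the section and any missing options; keep values already set."""
    if not config.has_section(section):
        config.add_section(section)
    for key, value in p.items():
        if not config.has_option(section, key):
            config.set(section, key, str(value))


# interpolation=None keeps regular expressions from being treated as templates.
config = configparser.ConfigParser(interpolation=None)
config.add_section("regex")
config.set("regex", "species", "[A-Z][a-z]+")  # pretend this was read from file

set_defaults(config, "regex",
             species="[A-Za-z_]+", header_format="{ID}[_ ]{SPECIES}")

# Write to an in-memory buffer instead of touching the file system.
buffer = io.StringIO()
config.write(buffer)
print(buffer.getvalue())
```

In this sketch the species option keeps the value that was already present, and only header_format is filled in from the defaults.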
class fastachar.tkgui.Gui

Bases: object

Class defining the graphical user interface

root

main window

Type:Tk.Tk()
config

Configuration of FastaChar

Type:ConfigFastachar
cwd

current working directory

Type:str
alignment

aligned sequences.

Type:fasta_io.Alignment
case

Case object

Type:Case
reportxls

object for reporting results as excel work sheets.

Type:fasta_io.ReportXLS
about_window()

Create and populate the About window

cb_about()

Callback to call about window

cb_b1_motion_lb(event)
cb_b1_release_lb(event)
cb_close_regex(window, v)

callback to close regex window

Parameters:
  • window – window to close
  • v – list with Tk variables holding the regexes.

Notes

If a fasta file is marked for successful opening, it will be read.

cb_clr()

Callback to clear output

cb_help()

Callback to call help window

cb_open_case_file()

Callback to open case file.

cb_open_fasta_file()

Callback to open a fasta file.

cb_open_fasta_file_for_hdr(parent, regexs)

Callback function to open a fasta file for reading headers

Parameters:
  • parent – parent window
  • regexes (list) – list of regular expressions

Notes

This method sets fasta_file. Depending on the result of reading the file, it also sets fasta_file_is_valid.

cb_open_text_window(lines)

Create a general text window

Parameters:lines (list of str) – Text to be displayed
cb_reset(v)

Callback to reset the regular expression to the default values.

cb_run()

Callback to run operation and do the comparison.

cb_save_case_file()

Callback to save the case file.

cb_save_report()

Callback to save report.

cb_save_report_xls()

Callback to save report as xls file.

cb_set_regex()

Callback function to allow the user to set the regular expressions

Notes

This method initially sets fasta_file to None; it may be set to a str by other methods called from this callback.

cb_set_working_dir()

Callback to set the working directory

create_bindings()

Create the key bindings

create_layout()

Create the main layout of the application

create_menu()

Create the Menu entries

error_window(err_code, arg='', parent=None)

Create and populate an error window

Parameters:
  • err_code (int) – error code
  • arg (str) – error message to be displayed.
getcwd()

Get current working directory

Returns:cwd – current working directory
Return type:str
help_window()

Create and populate the Help window

move_items(lb_from, lb_to)

Move items from one list to another

Parameters:
  • lb_from – from list (listbox)
  • lb_to – to list (listbox)
open_case_file(case_file)

Read and process a case file.

open_fasta_file()

Read and process a fasta file.

populate_list_with_items(items, lb, delete_all=True)
release_in_listbox(event)
setcwd(cwd)

Set current working directory

Parameters:cwd (str) – current working directory
fastachar.tkgui.main()

Main function starting the GUI

Module contents