API Reference

HCRProbeDesign.probeDesign

Core probe design workflow and CLI entry points.

build_parser()

Build the argument parser for probe design CLIs.

Returns:
  • argparse.ArgumentParser instance.

calcOligoCost(tiles, pricePerBase=0.19)

Calculate the cost of the oligo library

Parameters:
  • tiles

    a list of Tile objects

  • pricePerBase

    the cost of a base of probe

Returns:
  • The estimated cost of the oligo library.

main()

Main function for HCR probe design. Called when the module is run directly from the command line.

main_batch()

Batch probe design for multi-record FASTA inputs.

outputIDT(tiles, outHandle=sys.stdout)

Formats tile output for direct ordering using IDT template

outputRunParams(args)

Print run parameters to stderr.

Parameters:
  • args

    Parsed argparse namespace.

Returns:
  • None.

outputTable(tiles, outHandle=sys.stdout)

Formats tile output and writes to outHandle

scanSequence(sequence, seqName, tileStep=1, tileSize=52)

Given a sequence, a name for the sequence, a step size, and a tile size, return a list of Tile objects that tile across the sequence.

Parameters:
  • sequence

    the sequence to be tiled

  • seqName

    the name of the sequence

  • tileStep

    The step size to take between tiles, defaults to 1 (optional)

  • tileSize

    The size of the tile, defaults to 52 (optional)

Returns:
  • A list of Tile objects
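
The tiling described above can be sketched as a sliding window. This is a minimal illustration, not the library's implementation: the name scan_sequence_sketch is hypothetical, it returns plain (start, subsequence) pairs instead of Tile objects, and the 1-based start positions are an assumption taken from the Tile description in this reference.

```python
def scan_sequence_sketch(sequence, tile_step=1, tile_size=52):
    """Return (start, subsequence) pairs for every full-length window.

    Hypothetical stand-in for scanSequence; start positions are 1-based
    by assumption, and windows shorter than tile_size are skipped.
    """
    tiles = []
    for i in range(0, len(sequence) - tile_size + 1, tile_step):
        tiles.append((i + 1, sequence[i:i + tile_size]))
    return tiles


# A 10-base sequence tiled with 4-base windows, stepping 2 bases at a time.
for start, tile in scan_sequence_sketch("ACGTACGTAC", tile_step=2, tile_size=4):
    print(start, tile)
```

With the defaults (tileStep=1, tileSize=52), a sequence of length n gives n - 51 windows under this scheme, or none if n is below 52.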

test()

Manual test harness with hard-coded parameters and FASTA file.

Returns:
  • None.

HCRProbeDesign.tiles

Tile and probe representation used in the design pipeline.

Tile(sequence, seqName, startPos)

Represents a candidate probe tile extracted from a target sequence.

Initialize a Tile from a sequence and positional metadata.

Parameters:
  • sequence

    Tile sequence (string).

  • seqName

    Source sequence name.

  • startPos

    1-based start position in the source sequence.

GC()

Return GC percentage for the tile sequence.

RajTm()

Return the SantaLucia-style melting temperature estimate.

Tm()

Return the basic melting temperature estimate for the tile.

__cmp__(other)

Legacy comparison for sorting tiles by name and position.

__eq__(other)

Compare tiles by their sequences.

__hash__()

Hash tiles by their sequence.
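
Because equality and hashing are both based on the sequence, tiles with identical sequences collapse to one entry in sets and dict keys, whatever their name or position. A minimal sketch with a stand-in class (TileSketch is hypothetical and is not the library's Tile):

```python
class TileSketch:
    """Hypothetical stand-in for Tile, keeping only the compared fields."""

    def __init__(self, sequence, seq_name, start_pos):
        self.sequence = sequence
        self.seq_name = seq_name
        self.start_pos = start_pos

    def __eq__(self, other):
        # Equality looks at the sequence only, as described above.
        if not isinstance(other, TileSketch):
            return NotImplemented
        return self.sequence == other.sequence

    def __hash__(self):
        # Must agree with __eq__: equal tiles hash to the same value.
        return hash(self.sequence)


tiles = [
    TileSketch("ACGT", "geneA", 1),
    TileSketch("ACGT", "geneB", 40),  # same sequence, different source
    TileSketch("TTTT", "geneA", 5),
]
# dict.fromkeys drops duplicates while preserving first-seen order.
unique = list(dict.fromkeys(tiles))
print([t.seq_name for t in unique])
```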

__iter__()

Iterate over the tile sequence bases.

__len__()

Return the length of the tile sequence.

__repr__()

Return a debug-friendly representation of the tile.

__str__()

Return a human-readable representation of the tile.

calcGibbs()

Calculate the Gibbs free energy of binding for a given sequence

calcdTm()

Calculate the difference in melting temperature between the 5' and 3' sequences

distance(b, enforceStrand=False)

Return the absolute distance between the start positions of self and another interval b.

hasRuns(runChar, runLength, mismatches)

Return True if the tile sequence contains a run of runChar of length runLength, allowing up to the specified number of mismatches within the run.

Parameters:
  • runChar

    the character that indicates a run

  • runLength

    the length of the run of the same character

  • mismatches

    the number of mismatches allowed in the run

Returns:
  • A boolean value.
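
One plausible reading of the run check is a sliding window that tolerates a limited number of non-matching characters. The sketch below follows that reading; the function name has_run_sketch and the exact mismatch semantics are assumptions, since this reference does not spell them out.

```python
def has_run_sketch(sequence, run_char, run_length, mismatches=0):
    """Return True if any window of run_length characters contains at
    most `mismatches` characters other than run_char.

    Hypothetical helper; comparison is case-sensitive by assumption.
    """
    for i in range(len(sequence) - run_length + 1):
        window = sequence[i:i + run_length]
        if run_length - window.count(run_char) <= mismatches:
            return True
    return False


print(has_run_sketch("ACGGGGTA", "G", 4, 0))  # exact run of four G's present
print(has_run_sketch("ACGGAGTA", "G", 4, 0))  # run is interrupted by an A
print(has_run_sketch("ACGGAGTA", "G", 4, 1))  # one mismatch is tolerated
```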

isMasked()

Return True if the tile contains masked bases.

makeProbes(channel)

This function creates the probes for the channel.

Parameters:
  • channel

    the channel that the probe is on

overlaps(b)

Return True if b overlaps self.

splitProbe()

Split the sequence in half with the two middle bases removed (a flexible gap to help the initiator sequence land), i.e. a 52mer is split into two 25mers and the middle two bases of the 52mer are dropped.
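
The index arithmetic for the split can be sketched as follows. The helper split_probe_sketch is hypothetical, works on a plain string, and assumes an even-length sequence.

```python
def split_probe_sketch(sequence):
    """Split an even-length sequence into two halves, dropping the two
    middle bases. Hypothetical helper, not the library's splitProbe.
    """
    mid = len(sequence) // 2
    left = sequence[:mid - 1]    # bases before the two-base gap
    right = sequence[mid + 1:]   # bases after the two-base gap
    return left, right


# A 52mer built so the dropped bases are easy to spot ("NN" in the middle).
probe = "A" * 25 + "NN" + "C" * 25
left, right = split_probe_sketch(probe)
print(len(left), len(right))
```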

toBed()

Placeholder for BED formatting support.

toFasta()

Return the tile formatted as a FASTA record.

validate()

Run lightweight validation checks on the tile.

TileError(value)

Bases: Exception

Custom exception type for tile validation and processing.

HCRProbeDesign.genomeMask

Genome masking utilities based on Bowtie2 alignments.

countHitsFromSam(samFile)

For each read in the sam file, add 1 to the count of hits for that read

Parameters:
  • samFile

    the name of the sam file

Returns:
  • A dictionary with the read name as the key and the number of hits as the value.
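
The counting step can be sketched with the standard library. This is an illustration under stated assumptions: count_hits_sketch is a hypothetical name, it takes an open handle where the documented function takes a file name, and it counts every alignment record without inspecting the FLAG field, so unaligned reads would also be counted.

```python
import io
from collections import Counter


def count_hits_sketch(handle):
    """Count SAM alignment records per read name (QNAME, first column)."""
    hits = Counter()
    for line in handle:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        read_name = line.split("\t", 1)[0]
        hits[read_name] += 1
    return dict(hits)


# Tiny in-memory SAM: one header line, two records for tile1, one for tile2.
sam_text = (
    "@HD\tVN:1.0\tSO:unsorted\n"
    "tile1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII\n"
    "tile1\t256\tchr2\t500\t1\t4M\t*\t0\t0\tACGT\tIIII\n"
    "tile2\t0\tchr3\t900\t42\t4M\t*\t0\t0\tTTGA\tIIII\n"
)
print(count_hits_sketch(io.StringIO(sam_text)))
```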

genomemask(fasta_string, handleName='tmp', species='mouse', nAlignments=3, index=None)

Run Bowtie2 to align probe tiles and write a SAM file to disk.

Parameters:
  • fasta_string

    FASTA formatted string with probe sequences.

  • handleName

    Prefix for FASTA/SAM output files.

  • species

    Species key in HCRconfig.yaml.

  • nAlignments

    Number of alignments to report per read.

  • index

    Optional Bowtie2 index prefix override.

Returns:
  • Bowtie2 subprocess return code.

install_index(url='https://genome-idx.s3.amazonaws.com/bt/mm10.zip', genome='mm10', species='mouse')

Download and extract a prebuilt Bowtie2 index into the package indices.

Parameters:
  • url

    URL to a zipped Bowtie2 index archive.

  • genome

    Genome name used to name the extraction directory.

Returns:
  • None.

test()

Quick manual test for Bowtie2 masking and SAM parsing.

HCRProbeDesign.referenceGenome

Build and register Bowtie2 indices for reference genomes.

build_bowtie2_index(fasta_paths, species, index_name=None, indices_dir=None, threads=1, force=False)

Build a Bowtie2 index from the provided FASTA files.

Parameters:
  • fasta_paths

    List of FASTA file paths.

  • species

    Species name for the index directory.

  • index_name

    Optional index basename override.

  • indices_dir

    Output directory for indices.

  • threads

    Number of threads for bowtie2-build.

  • force

    Overwrite existing index files if True.

Returns:
  • Index prefix path.

Raises:
  • RuntimeError

    If bowtie2-build is not available.

  • FileExistsError

    If index exists and force is False.

collect_fasta_inputs(paths)

Resolve FASTA inputs from files and directories.

Parameters:
  • paths

    List of FASTA files or directories.

Returns:
  • Deduplicated list of FASTA file paths.

Raises:
  • FileNotFoundError

    If a path or directory has no FASTA files.
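
A rough sketch of how such input resolution might look with pathlib. The name collect_fasta_sketch and the set of recognized suffixes are assumptions; the library's accepted extensions are not listed in this reference.

```python
import tempfile
from pathlib import Path

# Assumed suffixes; the real function may recognize a different set.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}


def collect_fasta_sketch(paths):
    """Expand files and directories into a deduplicated list of FASTA paths."""
    found = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            matches = sorted(
                p for p in path.iterdir()
                if p.is_file() and p.suffix.lower() in FASTA_SUFFIXES
            )
            if not matches:
                raise FileNotFoundError(f"no FASTA files in directory: {path}")
            found.extend(matches)
        elif path.is_file():
            found.append(path)
        else:
            raise FileNotFoundError(f"path not found: {path}")
    # Resolve to absolute paths, then drop duplicates while keeping order.
    return list(dict.fromkeys(p.resolve() for p in found))


with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "chr1.fa"
    fasta.write_text(">chr1\nACGT\n")
    # The same file is reachable through the directory and directly.
    print(len(collect_fasta_sketch([tmp, str(fasta)])))
```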

format_index_path(index_prefix)

Format an index prefix relative to the package when possible.

Parameters:
  • index_prefix

    Bowtie2 index prefix path.

Returns:
  • Relative path if within package, otherwise absolute path.

load_config(config_path=DEFAULT_CONFIG_PATH)

Load the HCRconfig.yaml file.

Parameters:
  • config_path

    Path to the YAML configuration file.

Returns:
  • Parsed config dictionary (empty if missing).

main()

CLI entry point for building and registering a reference genome index.

register_species(config_path, species, index_prefix, force=False)

Register a species and its Bowtie2 index prefix in the config file.

Parameters:
  • config_path

    Path to HCRconfig.yaml.

  • species

    Species key to register.

  • index_prefix

    Bowtie2 index prefix path.

  • force

    Overwrite an existing species entry if True.

Returns:
  • None.

Raises:
  • ValueError

    If the species exists and force is False.

save_config(config, config_path=DEFAULT_CONFIG_PATH)

Write configuration data to HCRconfig.yaml.

Parameters:
  • config

    Configuration dictionary.

  • config_path

    Path to write the configuration.

Returns:
  • None.

HCRProbeDesign.sequencelib

Sequence parsing and utility functions.

FastaIterator(handle)

Generator function to iterate over FASTA records in a file handle. Use it in a loop to process each Seq record contained in a .fasta file. Input: an open file handle for the FASTA file, opened for reading. Returns an iterator across the sequences in the file.
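
A generator of this kind can be sketched in a few lines. The version below is an assumption-laden illustration: fasta_iter_sketch is a hypothetical name, and it yields (name, sequence) tuples, which may differ from the record type the library yields.

```python
import io


def fasta_iter_sketch(handle):
    """Yield (name, sequence) for each record in an open FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


fasta_text = ">geneA\nACGT\nTTGA\n>geneB\nGGCC\n"
for record_name, seq in fasta_iter_sketch(io.StringIO(fasta_text)):
    print(record_name, len(seq))
```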

GenRandomSeq(length, type='DNA')

Generate a random sequence of DNA or RNA of a given length

Parameters:
  • length

    the length of the random sequence

  • type

    DNA or RNA, defaults to DNA (optional)

Returns:
  • A random sequence string of the requested length.

allindices(string, sub, listindex=[], offset=0)

Return a list of all indices of a substring in a string

Parameters:
  • string

    the string to be searched

  • sub

    The string you're looking for

  • listindex

    an empty list

  • offset

    the index in the string where you want to start searching, defaults to 0 (optional)

Returns:
  • A list of all the indices where the substring sub is found in string.

complement(s)

Return the complement of a DNA sequence

Parameters:
  • s

    sequence

Returns:
  • A list of the complement bases of the sequence.

draw(distribution)

Draw a random value from the distribution, where values with a higher probability are drawn more often

Parameters:
  • distribution

    a list of positive numbers that sum to 1

Returns:
  • The index of the array that the random number falls into.
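
Weighted index selection of this kind is commonly done by comparing a uniform random number against cumulative sums. The sketch below shows that technique; draw_sketch and its rng parameter are hypothetical and may not match how the library implements draw.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def draw_sketch(distribution, rng=random):
    """Return an index chosen with probability proportional to its weight."""
    cumulative = list(accumulate(distribution))
    # Scale by the total so weights need not sum to exactly 1.
    r = rng.random() * cumulative[-1]
    # bisect_right finds the first bucket whose upper bound exceeds r;
    # min() guards against floating-point rounding at the top edge.
    return min(bisect_right(cumulative, r), len(distribution) - 1)


rng = random.Random(0)  # seeded so repeated runs give the same draws
counts = [0, 0, 0]
for _ in range(1000):
    counts[draw_sketch([0.1, 0.6, 0.3], rng)] += 1
print(counts)
```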

find_all(seq, sub)

Find all occurrences of a substring in a string

Parameters:
  • seq

    the string to be searched

  • sub

    The substring to search for

Returns:
  • A list of all the positions where the substring was found.
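
A small sketch of substring scanning with str.find. The helper find_all_sketch is hypothetical and reports overlapping matches with 0-based positions; whether the library does the same is an assumption.

```python
def find_all_sketch(seq, sub):
    """Return 0-based start positions of every (overlapping) match."""
    positions = []
    start = seq.find(sub)
    while start != -1:
        positions.append(start)
        # Advance by one so overlapping matches are also reported.
        start = seq.find(sub, start + 1)
    return positions


print(find_all_sketch("GATATATC", "ATA"))
```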

gc_content(seq)

Given a DNA sequence, return the percentage of G's and C's in the sequence

Parameters:
  • seq

    the sequence to be analyzed

Returns:
  • The GC content of the sequence.
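
The calculation can be illustrated as below. The function gc_content_sketch is hypothetical; it is case-insensitive and returns 0.0 for an empty sequence, both of which are assumptions about edge-case handling.

```python
def gc_content_sketch(seq):
    """Return the percentage of G and C bases in seq (0 to 100)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


print(gc_content_sketch("ATGCGC"))
```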

genRandomFromDist(length, freqs)

Generates a random sequence of length 'length' drawing from a distribution of base frequencies in a dictionary

getGC(seq)

Take a DNA sequence string and return its GC content.

Parameters:
  • seq

    the sequence to be analyzed

Returns:
  • The GC content of the sequence.

getTm(seq)

Take a sequence and return its melting temperature.

Parameters:
  • seq

    the sequence of interest

Returns:
  • The melting temperature of the sequence.

get_seeds(iter, seeds={})

Given a list of sequences, return a dictionary of the counts of each seed

Parameters:
  • iter

    the iterator of sequences

  • seeds

    a dictionary of seeds and their counts

Returns:
  • A dictionary with the seeds as keys and the number of times they occur as values.

kmer_dictionary(seq, k, dic={}, offset=0)

Returns dictionary of k,v = kmer:'list of kmer start positions in seq'

kmer_dictionary_counts(seq, k, dic={})

Returns a dictionary of k,v = kmer:'count of kmer in seq'
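
Counting k-mers amounts to sliding a window of width k across the sequence. In this sketch, kmer_counts_sketch is a hypothetical name, and it builds a fresh dictionary on every call instead of using a mutable default argument like the dic={} in the signature above, so counts never carry over between calls.

```python
def kmer_counts_sketch(seq, k, counts=None):
    """Return {kmer: count} for every overlapping k-mer in seq."""
    if counts is None:
        counts = {}  # fresh dict per call; pass one in to accumulate
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


print(kmer_counts_sketch("ACGACG", 3))
```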

kmer_stats(kmer, dic, genfreqs)

Takes as argument a kmer string, a dictionary with kmers as keys from kmer_dictionary_counts, and a dictionary of genomic frequencies with kmers as keys. Returns a dictionary of stats for kmer ("Signal2Noise Ratio, Z-score")

makeDistFromFreqs(freqs)

Given a dictionary of character frequencies, return a list of cumulative frequencies

Parameters:
  • freqs

    a dictionary of the frequencies of each nucleotide at each position

Returns:
  • a list of cumulative frequencies.

mcount(s, chars)

Sums the counts of appearances of each char in chars

Parameters:
  • s

    the string to search

  • chars

    a string of characters to count

Returns:
  • The number of times the characters in chars appear in s.

prob_seq(seq, pGC=0.5)

Given a sequence and a background GC probability, what is the probability of getting that sequence

Parameters:
  • seq

    the sequence of interest

  • pGC

    the probability of a GC base pair

Returns:
  • The probability of the sequence given the background GC probability.
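
One common background model treats bases as independent, with G and C each having probability pGC/2 and A and T each having (1 - pGC)/2. The sketch below uses that model; it is an assumption about what prob_seq computes, and prob_seq_sketch is a hypothetical name.

```python
def prob_seq_sketch(seq, p_gc=0.5):
    """Probability of seq under an independent-base GC background model."""
    p = 1.0
    for base in seq.upper():
        if base in "GC":
            p *= p_gc / 2
        elif base in "AT":
            p *= (1 - p_gc) / 2
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return p


# With p_gc = 0.5 every base has probability 0.25.
print(prob_seq_sketch("ACGT"))
```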

rcomp(s)

Equivalent to reverse_complement: return the reverse complement of s.

reverse_complement(s)

Return the reverse complement of a DNA sequence

Parameters:
  • s

    The sequence to be reverse complemented

Returns:
  • The reverse complement of the input sequence.
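
For reference, a reverse complement can be computed with a translation table. The helper reverse_complement_sketch is hypothetical; it handles A, C, G, and T in either case and passes any other character through unchanged, which is an assumption about ambiguity codes.

```python
# Translation table mapping each base to its complement, preserving case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(s):
    """Complement every base, then reverse the string."""
    return s.translate(_COMPLEMENT)[::-1]


print(reverse_complement_sketch("ATGC"))
```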

seed()

Seed the random number generator with system entropy.

transcribe(seq)

Take a DNA sequence and replace each thymine (T) with uracil (U) to give the transcribed RNA sequence.

Parameters:
  • seq

    the sequence to be transcribed

Returns:
  • The transcribed RNA sequence.

HCRProbeDesign.thermo

@authors: Marshall J. Levesque, Arjun Raj, Daniel Wei

Tm(sequence)

The function calculates the melting temperature of a sequence

Parameters:
  • sequence

    the sequence of the primer

Returns:
  • The melting temperature of the primer.

Tm_RNA_DNA(sequence)

Given a sequence, the function returns the Tm of the sequence using the SantaLucia 98 parameters

Parameters:
  • sequence

    the sequence of the primer

Returns:
  • The dG value.

containsAny(astring, aset)

Check whether a string contains any of the given characters.

Parameters:
  • astring

    Input string.

  • aset

    Iterable of characters to search for.

Returns:
  • True if any character is present.

gibbs(dH, dS, temp=37)

Calculate the Gibbs free energy in cal/mol from enthalpy, entropy, and temperature.

Parameters:
  • dH

    enthalpy in kcal/mol

  • dS

    entropy in cal/(mol * Kelvin)

  • temp

    temperature in Celsius (default 37 degrees C)
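
The underlying relation is ΔG = ΔH - TΔS, with the temperature in Kelvin. A minimal sketch using the units listed above (gibbs_sketch is a hypothetical name, and the kcal-to-cal conversion is inferred from the stated units, not confirmed against the library's code):

```python
def gibbs_sketch(dH, dS, temp=37):
    """Gibbs free energy in cal/mol.

    dH   -- enthalpy in kcal/mol (converted to cal/mol here)
    dS   -- entropy in cal/(mol * Kelvin)
    temp -- temperature in Celsius
    """
    temp_kelvin = temp + 273.15
    return dH * 1000.0 - temp_kelvin * dS


# Example: dH = -10 kcal/mol, dS = -30 cal/(mol*K) at 37 C.
print(round(gibbs_sketch(-10.0, -30.0), 1))
```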

init_dna_dna(inseq)

Return [enthalpy, entropy] list with units kcal/mol and cal/(mol*Kelvin) for DNA/DNA duplex initiation for the input DNA sequence (actg 5'->3'). Values from SantaLucia 1998. Argument is DNA

init_rna_dna()

Return [enthalpy, entropy] list in kcal/mol and cal/(mol*Kelvin) for RNA/DNA duplex initiation. Values from Sugimoto et al 1995

melting_temp(dH, dS, ca, cb, salt)

Calculates the melting temperature of a DNA sequence.

Parameters:
  • dH

    Enthalpy (delta H). This is the energy required to separate the strands of the DNA duplex in kilo Joules per mole

  • dS

    Entropy of hybridization (cal/(K*mol))

  • ca

    concentration of a strand in nM

  • cb

    concentration of the complementary strand (M)

  • salt

    the molarity of the Na+ in the hybridisation reaction

Returns:
  • The melting temperature of the primer.

overhang_dna(inseq, end)

Return the Gibbs free energy contribution at 37 °C (in kcal/mol) from a single-base overhang in a DNA/DNA duplex.

Parameters:
  • inseq

    2bp DNA sequence (5' -> 3')

  • end

    specifies which end the overhang is on (valid values: 3 or 5)

Table 2 in Bommarito, S. (2000). Nucleic Acids Research

overhang_rna(inseq, end)

Return the Gibbs free energy contribution at 37 °C (in kcal/mol) from a single-base overhang in an RNA/RNA duplex.

Parameters:
  • inseq

    2bp RNA sequence (5' -> 3'), uracil->thymidine

  • end

    specifies which end the overhang is on (valid values: 3 or 5)

Table 3 in Freier et al, Biochemistry, 1986

salt_adjust(delG, nbases, saltconc)

Adjust the Gibbs free energy from 1 M Na+ to another concentration.

Parameters:
  • delG

    Gibbs free energy in kcal/mol

  • nbases

    number of bases in the sequence

  • saltconc

    desired Na+ concentration for the new Gibbs free energy calculation

Equation 7 SantaLucia 1998

stacks_dna_dna(inseq, temp=37)

Calculate thermodynamic values for DNA/DNA hybridization.

Parameters:
  • inseq

    the input DNA sequence of the DNA/DNA hybrid (5'->3')

  • temp

    temperature in Celsius for the Gibbs free energy calculation (default 37 °C)

  • salt

    salt concentration in mol/L (default 0.33 M); listed in the docstring but not present in the signature above

Return [enthalpy, entropy] list in kcal/mol and cal/(mol*Kelvin)

stacks_rna_dna(inseq)

Calculate RNA/DNA base stack thermodynamic values (Sugimoto et al 1995)

Sugimoto 95 parameters for RNA/DNA Hybridization (Table 3) "Thermodynamic Parameters To Predict Stability of RNA/DNA Hybrid Duplexes" in Biochemistry 1995

Parameters:
  • inseq

    RNA sequence of the RNA/DNA hybrid (5'->3', uracil->thymidine)

Return [enthalpy, entropy] list in kcal/mol and cal/(mol*Kelvin)

HCRProbeDesign.utils

Miscellaneous utilities for sequence processing and formatting.

FastaIterator(handle)

Generator function to iterate over FASTA records in a file handle. Use it in a loop to process each sequence record contained in a .fasta file. Input: an open file handle for the FASTA file, opened for reading. Returns an iterator across the sequences in the file.

buildTags(numTags, tagLength, sites=None)

Generate random DNA tags with optional restriction site filtering.

Parameters:
  • numTags

    Number of tags to generate.

  • tagLength

    Length of each tag.

  • sites

    Comma-delimited restriction sites to avoid.

Returns:
  • List of DNA tag strings.

eprint(*args, **kwargs)

Print to stderr with the same signature as print().

Returns:
  • None.
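
The usual idiom for such a helper is a thin wrapper that forwards everything to print() with file=sys.stderr. Shown here as a sketch; eprint_sketch is a stand-in name, and the library's own body is not reproduced in this reference.

```python
import sys


def eprint_sketch(*args, **kwargs):
    """Forward all arguments to print(), targeting stderr."""
    # sys.stderr is looked up at call time, so redirection is respected.
    print(*args, file=sys.stderr, **kwargs)


eprint_sketch("designing probes for", "geneA", sep=" ")
```

Keeping progress messages on stderr leaves stdout free for table or FASTA output that may be piped to a file.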

estimateAffixLength(sequence, tagLength)

Estimate sequence length after tag insertion.

Parameters:
  • sequence

    Sequence containing an optional '@' tag marker.

  • tagLength

    Length of the tag to be inserted.

Returns:
  • Adjusted sequence length.

Raises:
  • TileError

    If multiple tag markers are present.
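
One way to read the length estimate: the single '@' marker is replaced by a tag of tagLength bases, so the result is the sequence length minus one plus the tag length. The sketch below follows that reading, which is an assumption; estimate_affix_length_sketch is a hypothetical name, and it raises ValueError where the library raises TileError.

```python
def estimate_affix_length_sketch(sequence, tag_length):
    """Length of sequence once an optional '@' marker is replaced by a tag."""
    n_markers = sequence.count("@")
    if n_markers > 1:
        # The library documents TileError here; ValueError keeps this sketch
        # free of package imports.
        raise ValueError("multiple '@' tag markers in sequence")
    if n_markers == 1:
        return len(sequence) - 1 + tag_length
    return len(sequence)


print(estimate_affix_length_sketch("ACGT@ACGT", 10))
```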

findUnique(tiles)

Return a list of unique Tile objects from the input list.

hasRestrictionSites(sequence, sites)

Check if a sequence contains restriction sites.

Parameters:
  • sequence

    Sequence to scan.

  • sites

    Comma-delimited restriction site names.

Returns:
  • True if any sites are present.

onlyNucleic(seq, set=['a', 'c', 'g', 't', 'u', 'A', 'C', 'G', 'T', 'U', 'n', 'N', '@'])

Check whether a sequence contains only nucleic characters.

Parameters:
  • seq

    Input sequence string.

  • set

    Allowed characters.

Returns:
  • True if all characters are allowed.
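
A membership check of this kind can be sketched with all(). The helper only_nucleic_sketch is hypothetical; it uses the same allowed characters as the default in the signature above, stored in a frozenset, and treats an empty string as valid by assumption.

```python
# Same characters as the documented default, as an immutable set.
ALLOWED = frozenset("acgtuACGTUnN@")


def only_nucleic_sketch(seq, allowed=ALLOWED):
    """Return True if every character of seq is in the allowed set."""
    return all(ch in allowed for ch in seq)


print(only_nucleic_sketch("ACGTN@acgu"))
print(only_nucleic_sketch("ACGT-X"))
```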

pp(d, level=-1, maxw=0, maxh=0, parsable=0)

Wrapper around pretty_print that prints to stdout.

pretty_print(f, d, level=-1, maxw=0, maxh=0, gap='', first_gap='', last_gap='')

Pretty-print nested structures to a file handle.

Parameters:
  • f

    Output file-like object.

  • d

    Data structure to render.

  • level

    Recursion depth (-1 for unlimited).

  • maxw

    Maximum width per line.

  • maxh

    Maximum items per list/dict/tuple.

  • gap

    Base indentation string.

  • first_gap

    Indentation for opening delimiter line.

  • last_gap

    Indentation for closing delimiter line.

Returns:
  • None.

warnRestrictionSites(sequence, name, sites)

Print a warning if restriction sites are found in a sequence.

Parameters:
  • sequence

    Sequence to scan.

  • name

    Sequence label for logging.

  • sites

    Comma-delimited restriction site names.

Returns:
  • None.

HCRProbeDesign.BLAST

NCBI BLAST utilities for probe sequence validation.

blastProbes(fasta_string, species='mouse', verbose=True)

Submit a BLASTN job for the given FASTA string.

Parameters:
  • fasta_string

    FASTA-formatted string containing probe sequences.

  • species

    Species key for Entrez restriction (mouse or human).

  • verbose

    Emit progress messages to stderr.

Returns:
  • NCBIWWW result handle for subsequent parsing.

getNHits(blast_handle, verbose=True)

Report the number of hits for each record in a BLAST response.

Parameters:
  • blast_handle

    Handle returned by NCBIWWW.qblast.

  • verbose

    Reserved for future verbosity controls.

Returns:
  • None.

HCRProbeDesign.repeatMask

RepeatMasker web API helper utilities (deprecated).

repeatmask(sequence, dnasource='mouse')

This function takes a sequence and returns a masked sequence with help from RepeatMasker

Parameters:
  • sequence

    The sequence to be masked

  • dnasource

    vertebrate, mammal, human, rodent, mouse, rat, danio, drosophila, elegans, defaults to mouse (optional)

Returns:
  • A masked sequence.

repeatmasker_local(sequence, dnasource='mouse')

Placeholder for a local RepeatMasker wrapper.

Parameters:
  • sequence

    Sequence to mask.

  • dnasource

    RepeatMasker DNA source key.

Returns:
  • None.

test()

Simple smoke test for the remote RepeatMasker flow.