API

katz

back_off

class katz.back_off.BackOff(corpus)[source]

Bases: object

get_alpha(old_phrase)[source]

Compute the back-off weight

Args:

old_phrase (tuple):: The (n-1)-length tuple used to find the back-off weight for the n-length tuple

Returns:

alpha (float):: The back-off weight
beta (float):: The left-over probability mass for the (n-1)-gram

get_d(phrase)[source]

Compute the amount of discounting found by Good-Turing estimation

Args:

phrase (tuple):: Collection of words to find discounting for

Returns:

d (float):: Good-Turing estimate for discounting

get_pbo(wnew, old_phrase)[source]

Compute the probability of the word wnew given the preceding set of words old_phrase

Args:

wnew (str):: The new word we wish to know the probability of obtaining
old_phrase (tuple):: The preceding phrase

Returns:

pbo (float):: The conditional probability P(wnew|old_phrase)

sort_endings(phrase)[source]

Find all ways of completing a phrase such that the new phrase appears in the corpus

Args:

phrase (tuple):: Collection of words to find valid completions to

Returns:

seen (list):: List of words which can complete the phrase to produce a phrase found in the corpus
unseen (list):: List of words which, if appended to the phrase, produce a phrase NOT found in the corpus
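
To illustrate how the seen/unseen split, the back-off weight alpha, the left-over mass beta and the back-off probability fit together, here is a minimal bigram sketch. It is not the BackOff implementation: the function names, the string (rather than tuple) context and the fixed discount d=0.5 are assumptions made for illustration, whereas the class above takes its discount from Good-Turing estimation via get_d.

```python
from collections import Counter


def sort_endings(context, bigram_counts, vocab):
    """Split the vocabulary into words seen / not seen directly after `context`."""
    seen = [w for w in vocab if bigram_counts[(context, w)] > 0]
    unseen = [w for w in vocab if bigram_counts[(context, w)] == 0]
    return seen, unseen


def backoff_prob(word, context, bigram_counts, unigram_counts, d=0.5):
    """Katz-style back-off estimate of P(word | context) for bigrams.

    `d` is a fixed, hypothetical discount used only in this sketch.
    """
    vocab = sorted(unigram_counts)
    total = sum(unigram_counts.values())
    context_count = sum(bigram_counts[(context, w)] for w in vocab)
    if context_count == 0:
        # Context never observed: fall back to the unigram estimate.
        return unigram_counts[word] / total
    seen, unseen = sort_endings(context, bigram_counts, vocab)
    if word in seen:
        # Discounted relative frequency for observed bigrams.
        return d * bigram_counts[(context, word)] / context_count
    # beta: probability mass left over after discounting the seen bigrams.
    beta = 1.0 - sum(d * bigram_counts[(context, w)] / context_count for w in seen)
    # alpha: back-off weight that spreads beta over the unseen words.
    alpha = beta / sum(unigram_counts[w] / total for w in unseen)
    return alpha * unigram_counts[word] / total


words = "a b a c a b b c".split()
unigram_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))
probs = {w: backoff_prob(w, "a", bigram_counts, unigram_counts) for w in unigram_counts}
```

In this toy corpus the probabilities conditioned on "a" sum to one, because the mass removed from the seen bigrams is redistributed over the unseen word.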

good_turing

class katz.good_turing.GoodTuring(corpus)[source]

Bases: object

actual_count(word)[source]

Return the number of times word appeared in the corpus

Args:

word:: The word whose frequency we wish to find

Returns:

int:: The number of times this word appeared in the corpus

expected_count(word)[source]

Compute the predicted number of times a word should appear in a text of the same length as the corpus

Args:

word:: The word whose expected frequency we wish to find

Returns:

float:: The estimated frequency of occurrence of this word in the corpus

get_S(r)[source]

Compute the smoothed frequency estimate, S

Args:

r (int):: The number of times a given species was previously observed

Returns:

S (float):: The smoothed/adjusted estimate for the number of objects which occur r times
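
As a rough illustration of the relationship between actual and expected counts, the sketch below applies the basic Good-Turing formula r* = (r + 1) N_{r+1} / N_r, where N_r is the number of distinct words seen exactly r times. It is an assumption-laden simplification: it uses the raw N_r rather than the smoothed estimate S returned by get_S, and the function name and the fallback to the raw count when N_{r+1} = 0 are choices made here, not taken from the library.

```python
from collections import Counter


def good_turing_counts(tokens):
    """Return (actual counts, Good-Turing adjusted counts) for each token.

    Uses unsmoothed frequencies of frequencies; hypothetical helper for
    illustration only.
    """
    counts = Counter(tokens)
    freq_of_freq = Counter(counts.values())  # N_r: how many tokens occur r times
    adjusted = {}
    for token, r in counts.items():
        n_r = freq_of_freq[r]
        n_next = freq_of_freq[r + 1]
        if n_next > 0:
            adjusted[token] = (r + 1) * n_next / n_r
        else:
            # No token occurs r + 1 times, so keep the raw count (sketch-only choice).
            adjusted[token] = float(r)
    return counts, adjusted


tokens = "a a a b b c c d e f".split()
counts, adjusted = good_turing_counts(tokens)
```

The ratio adjusted[token] / counts[token] plays the role of a discount factor for that token in this simplified setting.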

prior

class katz.prior.KatzPrior(n, basis_functions, in_eqfile, out_eqfile, input_delimiter=',')[source]

Bases: object

logprior(eq)[source]

Compute the natural logarithm of the prior of a given equation

Args:

eq (str):: The equation to find the prior probability of

Returns:

p (float):: The natural logarithm of the prior of the supplied equation

process_equations

class katz.process_equations.SymbolCoder(basis_functions)[source]

Bases: object

equation2ntuples(n, eq, locs)[source]

Convert an equation into n-tuples describing the tree structure of the function

Args:

n (int):: The length of the n-tuples to produce
eq (str):: The equation to convert to an n-tuple
locs (dict):: dictionary of string:sympy objects describing variables

Returns:

ntuples (list):: List of n-tuples which describe the tree structure of the function

labels2ntuples(n, labels)[source]

Convert a list of labels giving the tree representation of an equation into n-tuples describing the tree structure of the function

Args:

n (int):: The length of the n-tuples to produce
labels (list):: The list giving the equation to convert to an n-tuple

Returns:

ntuples (list):: List of n-tuples which describe the tree structure of the function

nodes2ntuples(n, nodes)[source]

Convert a node object giving the tree representation of an equation into n-tuples describing the tree structure of the function

Args:

n (int):: The length of the n-tuples to produce
nodes (esr.generation.generator.DecoratedNode):: Node object corresponding to the tree

Returns:

ntuples (list):: List of n-tuples which describe the tree structure of the function

op2codeword(op)[source]

Convert an operator name as defined by sympy into the codeword assigned to it

Args:

op (str):: Operator name of sympy class

Returns:

str:: The codeword used to represent this symbol

op2str(op)[source]

Convert operator names defined by sympy into symbols used here

Args:

op (str):: Operator name of sympy class

Returns:

str:: The equivalent symbol used here

process_all_equations(n, all_eq, maxvar)[source]

Turn all equations into n-tuples describing the tree structures of their functions

Args:

n (int):: The length of the n-tuples to produce
all_eq (list):: List of equations as strings to convert to n-tuples
maxvar (int):: The maximum number of variables appearing in any of the equations

Returns:

ntuples (list):: List of n-tuples which describe tree structures of the functions

katz.process_equations.split_by_punctuation(s)[source]

Split a string s into a list, where each instance of punctuation or a space causes a split. E.g. the string s = 'Hello, how are you?' becomes ['Hello', ',', ' ', 'how', ' ', 'are', ' ', 'you', '?']

Args:

s (str):: String we wish to split

Returns:

split_str (list):: List of strings split by punctuation
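
One way to reproduce the documented example is a regular expression that keeps runs of word characters together and emits every other character on its own. This is a sketch, not the library's code: the name split_on_punctuation is hypothetical, and treating the underscore as a word character is an assumption that follows from using \w.

```python
import re


def split_on_punctuation(s):
    """Split `s` so that every punctuation mark or space becomes its own item."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w] matches any single remaining character (punctuation, spaces).
    return re.findall(r"\w+|[^\w]", s)


parts = split_on_punctuation("Hello, how are you?")
# parts == ['Hello', ',', ' ', 'how', ' ', 'are', ' ', 'you', '?']
```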

katz.process_equations.standardise_file(in_name, out_name, input_delimiter)[source]

Standardise the input equations used so that variables are named x0, x1, …, x9

Args:

in_name (str):: Name of file containing the equations to study. If None, the equations are read from out_name
out_name (str):: Name of file to output the new equations to
input_delimiter (str):: The delimiter used in the input csv file

Returns:

all_eq (list):: List of equations as strings with the standardised variable names
max_var (int):: The maximum number of variables appearing in any of the equations

scripts

esr_prior

fit_benchmark

class scripts.fit_benchmark.MockLikelihood(name, nx, frac_sigx, samp_num)[source]

Bases: object

clear_data()[source]: Clear data used for numerical integration (not required here)

get_pred(xdata, a, eq_numpy, **kwargs)[source]

Return evaluated function

Args:

xdata (float or np.array):: input variable
a (list):: parameters to substitute into equation considered
eq_numpy (numpy function):: function to use which gives y

Returns:

y (float or np.array):: the predicted value of y

negloglike(a, eq_numpy, **kwargs)[source]

Negative log-likelihood for a given function

Args:

a (list):: parameters to substitute into equation considered
eq_numpy (numpy function):: function to use which gives y

Returns:

nll (float):: The negative log-likelihood for this function and parameters

run_sympify(fcn_i, **kwargs)[source]

Sympify a function

Args:

fcn_i (str):: string representing function we wish to fit to data

Returns:

fcn_i (str):: string representing function we wish to fit to data (with superfluous characters removed)
eq (sympy object):: sympy object representing function we wish to fit to data
integrated (bool, always False):: whether we analytically integrated the function (True) or not (False)

scripts.fit_benchmark.apply_language_prior(likelihood, comp, tmax=5)[source]

Apply model selection methods based on a language-model prior over functions to the results of an ESR run, given a likelihood class

Args:

likelihood (esr.fitting.likelihood object):: object containing data and likelihood function
comp (int):: The complexity of function to consider
tmax (float):: maximum time in seconds to run any one part of simplification procedure for a given function

Returns:

None

scripts.fit_benchmark.fit_mocks(name, nx, frac_sigx, samp_num, comp, tmax=5)[source]

Run ESR for a given mock sample using the default MDL prescription

Args:

name (str):: The name of the equation to be used
nx (int):: The number of data points to be used in the mock
frac_sigx (float):: The fraction of the standard deviation to use as sigma
samp_num (int):: The mock number
comp (int):: The complexity of function to consider
tmax (float):: maximum time in seconds to run any one part of simplification procedure for a given function

Returns:

None

scripts.fit_benchmark.get_sig(f, x_range, frac_sigx)[source]

Determine the noise level to use. This is given as frac_sigx times the standard deviation of the function's values evaluated at 10^5 randomly generated points within x_range

Args:

f (str):: The function which should be evaluated
x_range (list[float, float]):: The [min, max] values of x to consider
frac_sigx (float):: The fraction of the standard deviation to use as sigma

Returns:

sig (float):: The value of sigma to use
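
The noise prescription described above can be sketched in a few lines. The version below is illustrative only: it takes a Python callable rather than the string f used by get_sig, draws points uniformly (the documentation says only "randomly generated"), uses the population standard deviation, and the name noise_level and the seed argument are hypothetical.

```python
import math
import random


def noise_level(f, x_range, frac_sigx, n_points=10**5, seed=0):
    """Return frac_sigx times the standard deviation of f over random x values."""
    rng = random.Random(seed)
    x_min, x_max = x_range
    # Evaluate the function at points drawn uniformly from [x_min, x_max].
    values = [f(rng.uniform(x_min, x_max)) for _ in range(n_points)]
    mean = math.fsum(values) / n_points
    variance = math.fsum((v - mean) ** 2 for v in values) / n_points
    return frac_sigx * math.sqrt(variance)


# For f(x) = x on [0, 1] the standard deviation is close to 1/sqrt(12).
sig = noise_level(lambda x: x, [0.0, 1.0], frac_sigx=0.5)
```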

scripts.fit_benchmark.get_split_idx(L)[source]

Find the indices of an array of length L which should be considered by this rank. For output data_start, data_end, the rank considers the entries array[data_start:data_end] where len(array) = L.

Args:

L (int):: The length of the array we wish to split among ranks.

Returns:

data_start (int):: The first index of the array to be considered by the rank
data_end (int):: The final index (+1) to be considered by the rank.
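
A common way to divide an array among ranks is to give each rank a contiguous block and hand any remainder to the lowest ranks. The sketch below assumes that scheme and passes the rank and the number of ranks explicitly; get_split_idx takes only L, so it presumably obtains these from its parallel environment. The name split_indices and its extra arguments are hypothetical.

```python
def split_indices(length, rank, size):
    """Return (data_start, data_end) so that this rank handles array[data_start:data_end]."""
    base, extra = divmod(length, size)
    # The first `extra` ranks each take one additional element.
    data_start = rank * base + min(rank, extra)
    data_end = data_start + base + (1 if rank < extra else 0)
    return data_start, data_end


chunks = [split_indices(10, rank, 3) for rank in range(3)]
# chunks == [(0, 4), (4, 7), (7, 10)]
```

Because data_end is exclusive, consecutive ranks cover the array without gaps or overlap.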

scripts.fit_benchmark.get_top_eqs(all_fun, all_loss, nkeep, all_true_eq=None)[source]

Rank functions by their loss and keep up to nkeep of these, while removing duplicates of the true equation

Args:

all_fun (list[str]):: The list of functions
all_loss (np.ndarray):: The loss function to sort by
nkeep (int):: The maximum number of the top equations to keep
all_true_eq (list[str] or None):: List of variants of the true equation to find

Returns:

m_fun (list[str]):: The strings of the top functions
m_loss (np.ndarray):: The loss functions of the top functions

scripts.fit_benchmark.get_truth_location(fun, loss, all_true_eq)[source]

Find the highest ranked of the 'true' equations

Args:

fun (list[str]):: The list of functions
loss (np.ndarray):: The loss function to sort by
all_true_eq (list[str]):: List of variants of the true equation to find

Returns:

m_ftrue (str):: The true equation used
m_ltrue (float):: The loss function of the truth

scripts.fit_benchmark.main()[source]: Run the benchmarks

scripts.fit_benchmark.make_data(name, f, x_range, nx, frac_sigx, samp_num, sig, make_fig=False)[source]

Make a mock data sample from a given equation with a certain number of data points and a given noise level. The results are saved to a file in the directory ‘../data/’

Args:

name (str):: The name of the equation to be used
f (str):: The function which should be evaluated
x_range (list[float, float]):: The [min, max] values of x to consider
nx (int):: The number of data points to be used in the mock
frac_sigx (float):: The fraction of the standard deviation to use as sigma
samp_num (int):: The mock number (which sets the seed of the random number generator)
sig (float):: The value of sigma to use for Gaussian noise
make_fig (bool):: Whether to make a plot of the mock data with the generating function

Returns:

None
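
The core of mock generation, stripped of file output and plotting, might look like the sketch below. Several details are assumptions rather than documented behaviour: x values are drawn uniformly from x_range, the function is passed as a callable instead of a string, and the helper name make_mock is hypothetical. Only the use of samp_num as the random seed and of sig as the Gaussian noise level come from the description above.

```python
import random


def make_mock(f, x_range, nx, sig, samp_num):
    """Return (x, y) lists for one mock: y = f(x) plus Gaussian noise of width sig."""
    rng = random.Random(samp_num)  # the mock number sets the seed
    x_min, x_max = x_range
    x = sorted(rng.uniform(x_min, x_max) for _ in range(nx))
    y = [f(xi) + rng.gauss(0.0, sig) for xi in x]
    return x, y


x, y = make_mock(lambda v: 2.0 * v + 1.0, [0.0, 5.0], nx=20, sig=0.1, samp_num=3)
```

Seeding with the mock number means that regenerating a given sample reproduces the same data.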

scripts.fit_benchmark.print_text(text)[source]

Function to print progress announcements in standardised format

Args:

text (str):: The text to be printed

Returns:

None

scripts.fit_benchmark.process_data(dirname, final_prefix, all_comp)[source]

Convert results of all optimisations into a list of functions.

Args:

dirname (str):: Directory name containing the optimisation results
final_prefix (str):: Start of file names which contain result
all_comp (list[int]):: All complexities of equation to consider

Returns:

fun (list[str]):: The list of functions selected
res (np.ndarray):: The terms used for model selection of the returned functions
params (np.ndarray):: The maximum likelihood parameters of the returned functions
store_comp (np.ndarray):: The complexities of the returned functions

scripts.fit_benchmark.process_fit(dirname, all_comp, nx, all_true_eq=None, remove_by_like=True)[source]

Process the results of all fits to give a function ranking according to different model selection methods. If all_true_eq is not None, then this will also find the location of the true equation in the rankings. The results are written to a file called selection_summary.csv in the directory given by dirname.

Args:

dirname (str):: Directory name containing the optimisation results
all_comp (list[int]):: All complexities of equation to consider
nx (int):: The number of data points to be used in the mock
all_true_eq (list[str] or None):: List of variants of the true equation to find
remove_by_like (bool, default=True):: Whether to remove duplicates if they have the same likelihood

Returns:

None

scripts.fit_benchmark.remove_duplicates_and_sort(all_fun, all_loss, all_like, remove_by_like=True)[source]

Attempt to keep only the highest ranked of any duplicate equation. This will not catch all duplicates, so the user must check for them.

Args:

all_fun (list[str]):: The list of functions
all_loss (np.ndarray):: The loss function to sort by
all_like (np.ndarray):: The negative log-likelihood values
remove_by_like (bool, default=True):: Whether to remove duplicates if they have the same likelihood

Returns:

fun (list[str]):: The processed list of functions
loss (np.ndarray):: The sorted loss functions
like (np.ndarray):: The log-likelihoods of the sorted functions
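
A simple version of this procedure sorts by loss and then drops any entry whose function string, or optionally whose likelihood, has already appeared. The sketch below is an assumption about how such filtering could work, not the function's actual logic: it uses plain lists instead of NumPy arrays, compares likelihoods by exact equality, and the name dedupe_and_sort is hypothetical. As noted above, string or likelihood matching will not catch every duplicate.

```python
def dedupe_and_sort(all_fun, all_loss, all_like, remove_by_like=True):
    """Sort by loss and keep only the best-ranked copy of each apparent duplicate."""
    order = sorted(range(len(all_fun)), key=lambda i: all_loss[i])
    fun, loss, like = [], [], []
    seen_fun, seen_like = set(), set()
    for i in order:
        if all_fun[i] in seen_fun:
            continue  # identical string already kept with a lower loss
        if remove_by_like and all_like[i] in seen_like:
            continue  # same likelihood: treated as a rewritten duplicate
        seen_fun.add(all_fun[i])
        seen_like.add(all_like[i])
        fun.append(all_fun[i])
        loss.append(all_loss[i])
        like.append(all_like[i])
    return fun, loss, like


all_fun = ["x0+a0", "a0+x0", "x0*a0", "x0+a0"]
all_loss = [3.0, 3.5, 2.0, 3.2]
all_like = [1.5, 1.5, 0.7, 1.5]
fun, loss, like = dedupe_and_sort(all_fun, all_loss, all_like)
```

With remove_by_like=False, "a0+x0" would survive in this example because only identical strings are then removed.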