API

katz

back_off

class katz.back_off.BackOff(corpus)[source]

Bases: object

get_alpha(old_phrase)[source]

Compute the back-off weight

Args:
old_phrase (tuple):

The (n-1)-length tuple used to find the back-off weight for the n-length tuple

Returns:
alpha (float):

The back-off weight

beta (float):

The left-over probability mass for the (n-1)-gram

get_d(phrase)[source]

Compute the amount of discounting found by Good-Turing estimation

Args:
phrase (tuple):

Collection of words to find discounting for

Returns:
d (float):

Good-Turing estimate for discounting
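
As a rough illustration of Good-Turing discounting (not necessarily how get_d is implemented), the sketch below computes the ratio d = r*/r with r* = (r + 1) N_{r+1} / N_r, where N_r is the number of distinct phrases seen exactly r times. The name good_turing_discount and the toy counts are hypothetical, and raw N_r values are used here rather than smoothed ones.

```python
from collections import Counter


def good_turing_discount(counts, phrase):
    """Return d = r*/r for `phrase`, using unsmoothed frequencies of frequencies."""
    r = counts[phrase]
    if r == 0:
        raise ValueError("phrase was not observed, so no discount is defined")
    # N_r: how many distinct phrases were observed exactly r times
    freq_of_freq = Counter(counts.values())
    n_r, n_r1 = freq_of_freq[r], freq_of_freq[r + 1]
    if n_r1 == 0:
        # No phrase seen r + 1 times: leave the count undiscounted (assumption)
        return 1.0
    r_star = (r + 1) * n_r1 / n_r
    return r_star / r


# Toy bigram counts (hypothetical data): N_1 = 3 and N_2 = 1
counts = Counter({("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 1, ("c", "d"): 2})
d = good_turing_discount(counts, ("a", "b"))
d_top = good_turing_discount(counts, ("c", "d"))
```

In this toy case a phrase seen once has its count scaled by 2/3, while the most frequent phrase keeps its full count because no N_3 is available.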

get_pbo(wnew, old_phrase)[source]

Compute the probability of word wnew given the preceding set of words old_phrase

Args:
wnew (str):

The new word we wish to know the probability of obtaining

old_phrase (tuple):

The preceding phrase

Returns:
pbo (float):

The conditional probability P(wnew|old_phrase)
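
To make the roles of alpha, beta and the back-off probability concrete, here is a minimal bigram sketch. It is an assumption-laden toy, not the BackOff class itself: the function backoff_prob is hypothetical, it only backs off from bigrams to unigrams, and a constant discount ratio d stands in for the Good-Turing estimate.

```python
from collections import Counter


def backoff_prob(word, prev, unigrams, bigrams, d=0.5):
    """P(word | prev) for a bigram model that backs off to unigram frequencies.

    `d` is a constant discount ratio standing in for a Good-Turing estimate.
    """
    total = sum(unigrams.values())
    # Counts of every word observed directly after `prev`
    seen = {w2: c for (w1, w2), c in bigrams.items() if w1 == prev}
    context_count = sum(seen.values())
    if context_count == 0:
        # Unseen context: fall back entirely on the unigram distribution
        return unigrams[word] / total
    if word in seen:
        # Seen bigram: discounted relative frequency
        return d * seen[word] / context_count
    # beta: probability mass left over after discounting the seen bigrams
    beta = 1.0 - sum(d * c / context_count for c in seen.values())
    # Unigram mass of the words never seen after `prev`
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen) / total
    if unseen_mass == 0:
        return 0.0
    # alpha: rescales unigram probabilities so the unseen words share beta
    alpha = beta / unseen_mass
    return alpha * unigrams[word] / total


tokens = "a b a c a b".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
probs = {w: backoff_prob(w, "a", unigrams, bigrams) for w in unigrams}
```

Because alpha is chosen so that the unseen completions share exactly the left-over mass beta, the values in probs sum to one over the vocabulary.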

sort_endings(phrase)[source]

Find all ways of completing a phrase such that the new phrase appears in the corpus

Args:
phrase (tuple):

Collection of words to find valid completions to

Returns:
seen (list):

List of words which can complete the phrase to produce a phrase found in the corpus

unseen (list):

List of words which, if appended to the phrase, produce a phrase NOT found in the corpus

good_turing

class katz.good_turing.GoodTuring(corpus)[source]

Bases: object

actual_count(word)[source]

Return the number of times word appeared in the corpus

Args:
word:

The word whose frequency we wish to find

Returns:
int:

The number of times this word appeared in the corpus

expected_count(word)[source]

Compute the predicted number of times a word should appear in a text of the same length as the corpus

Args:
word:

The word whose expected frequency we wish to find

Returns:
float:

The estimated frequency of occurrence of this word in the corpus

get_S(r)[source]

Compute the smoothed frequency estimate, S

Args:
r (int):

The number of times a given species was previously observed

Returns:
S (float):

The smoothed/adjusted estimate for the number of objects which occur r times
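
One common way to obtain a smoothed estimate of this kind is a log-linear fit, log N_r = a + b log r, to the observed frequencies of frequencies. The sketch below assumes that approach in a simplified form; whether get_S does exactly this is not stated here, and fit_smoothed_counts and the data are hypothetical.

```python
import numpy as np


def fit_smoothed_counts(r_values, n_r_values):
    """Fit log N_r = a + b log r and return a function S(r)."""
    slope, intercept = np.polyfit(np.log(r_values), np.log(n_r_values), 1)

    def S(r):
        # Evaluate the fitted power law at any positive r
        return float(np.exp(intercept + slope * np.log(r)))

    return S


# Hypothetical frequencies of frequencies following N_r = 100 / r^2 exactly
r_values = np.array([1, 2, 4, 5])
n_r_values = 100.0 / r_values**2
S = fit_smoothed_counts(r_values, n_r_values)
s3 = S(3)  # r = 3 was never observed, but the fit still gives an estimate
```

The benefit of the fit is that S(r) is defined and non-zero even for values of r that never occur in the corpus, which the raw counts cannot provide.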

prior

class katz.prior.KatzPrior(n, basis_functions, in_eqfile, out_eqfile, input_delimiter=',')[source]

Bases: object

logprior(eq)[source]

Compute the natural logarithm of the prior of a given equation

Args:
eq (str):

The equation to find the prior probability of

Returns:
p (float):

The natural logarithm of the prior of the supplied equation

process_equations

class katz.process_equations.SymbolCoder(basis_functions)[source]

Bases: object

equation2ntuples(n, eq, locs)[source]

Convert an equation into n-tuples describing the tree structure of the function

Args:
n (int):

The length of the n-tuples to produce

eq (str):

The equation to convert to an n-tuple

locs (dict):

dictionary of string:sympy objects describing variables

Returns:
ntuples (list):

List of n-tuples which describe the tree structure of the function

labels2ntuples(n, labels)[source]

Convert a list of labels giving the tree representation of an equation into n-tuples describing the tree structure of the function

Args:
n (int):

The length of the n-tuples to produce

labels (list):

The list giving the equation to convert to an n-tuple

Returns:
ntuples (list):

List of n-tuples which describe the tree structure of the function

nodes2ntuples(n, nodes)[source]

Convert a node object giving the tree representation of an equation into n-tuples describing the tree structure of the function

Args:
n (int):

The length of the n-tuples to produce

nodes (esr.generation.generator.DecoratedNode):

Node object corresponding to the tree

Returns:
ntuples (list):

List of n-tuples which describe the tree structure of the function

op2codeword(op)[source]

Convert an operator name as defined by sympy into the codeword assigned to it

Args:
op (str):

Operator name of sympy class

Returns:
str:

The codeword used to represent this symbol

op2str(op)[source]

Convert operator names defined by sympy into symbols used here

Args:
op (str):

Operator name of sympy class

Returns:
str:

The equivalent symbol used here

process_all_equations(n, all_eq, maxvar)[source]

Turn all equations into n-tuples describing the tree structures of their functions

Args:
n (int):

The length of the n-tuples to produce

all_eq (list):

List of equations as strings to convert to n-tuples

maxvar (int):

The maximum number of variables appearing in any of the equations

Returns:
ntuples (list):

List of n-tuples which describe tree structures of the functions

katz.process_equations.split_by_punctuation(s)[source]

Split a string s into a list, where each instance of punctuation or a space causes a split. E.g. the string s = 'Hello, how are you?' becomes ['Hello', ',', ' ', 'how', ' ', 'are', ' ', 'you', '?']

Args:
s (str):

String we wish to split

Returns:
split_str (list):

List of strings split by punctuation
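
The behaviour described above can be reproduced with a capturing regular expression, as in the following sketch. It is an independent illustration rather than the package's code; split_keep_delimiters is a hypothetical name, and treating string.punctuation plus the space character as the set of delimiters is an assumption.

```python
import re
import string

# A capturing group makes re.split return the delimiters as well
_SPLIT_PATTERN = re.compile("([" + re.escape(string.punctuation + " ") + "])")


def split_keep_delimiters(s):
    """Split `s` at every punctuation mark or space, keeping the delimiters."""
    # Adjacent delimiters produce empty strings, which are dropped
    return [tok for tok in _SPLIT_PATTERN.split(s) if tok != ""]


tokens = split_keep_delimiters("Hello, how are you?")
```

Joining the resulting list with an empty separator recovers the original string, since no character is discarded.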

katz.process_equations.standardise_file(in_name, out_name, input_delimiter)[source]

Standardise the input equations used so that variables are named x0, x1, …, x9

Args:
in_name (str):

Name of file containing the equations to study. If None, the equations are read from out_eqfile

out_name (str):

Name of file to output the new equations to

input_delimiter (str):

The delimiter used in the input csv file

Returns:
all_eq (list):

List of equations as strings with the standardised variable names

max_var (int):

The maximum number of variables appearing in any of the equations

scripts

esr_prior

fit_benchmark

class scripts.fit_benchmark.MockLikelihood(name, nx, frac_sigx, samp_num)[source]

Bases: object

clear_data()[source]

Clear data used for numerical integration (not required here)

get_pred(xdata, a, eq_numpy, **kwargs)[source]

Return evaluated function

Args:
xdata (float or np.array):

input variable

a (list):

parameters to substitute into equation considered

eq_numpy (numpy function):

function to use which gives y

Returns:
y (float or np.array):

the predicted value of y

negloglike(a, eq_numpy, **kwargs)[source]

Negative log-likelihood for a given function

Args:
a (list):

parameters to substitute into equation considered

eq_numpy (numpy function):

function to use which gives y

Returns:
nll (float):

The negative log-likelihood, -log(likelihood), for this function and parameters
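
For orientation, a negative log-likelihood for data with independent Gaussian noise of known width can be sketched as below. This assumes a Gaussian form, consistent with the Gaussian noise used for the mocks, but the exact expression used by MockLikelihood is not given here; gaussian_negloglike and its explicit xdata, ydata and sigma arguments are hypothetical.

```python
import numpy as np


def gaussian_negloglike(a, eq_numpy, xdata, ydata, sigma):
    """-log(likelihood) for y = eq_numpy(x, *a) with Gaussian noise of width sigma."""
    pred = eq_numpy(xdata, *a)
    resid = (ydata - pred) / sigma
    # Each data point contributes a chi-squared term and a normalisation term
    return 0.5 * np.sum(resid**2 + np.log(2.0 * np.pi * sigma**2))


def line(x, a0, a1):
    # Hypothetical model: a straight line with two parameters
    return a0 * x + a1


xdata = np.array([0.0, 1.0, 2.0])
ydata = line(xdata, 2.0, 1.0)  # noiseless data, so the best fit is exact
nll_best = gaussian_negloglike([2.0, 1.0], line, xdata, ydata, sigma=1.0)
nll_off = gaussian_negloglike([2.0, 0.0], line, xdata, ydata, sigma=1.0)
```

With the true parameters only the normalisation term remains, and any other parameter choice gives a larger value.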

run_sympify(fcn_i, **kwargs)[source]

Sympify a function

Args:
fcn_i (str):

string representing function we wish to fit to data

Returns:
fcn_i (str):

string representing function we wish to fit to data (with superfluous characters removed)

eq (sympy object):

sympy object representing function we wish to fit to data

integrated (bool, always False):

whether we analytically integrated the function (True) or not (False)

scripts.fit_benchmark.apply_language_prior(likelihood, comp, tmax=5)[source]

Apply model selection methods based on the language-model function prior to the results of an ESR run, given a likelihood class

Args:
likelihood (esr.fitting.likelihood object):

object containing data and likelihood function

comp (int):

The complexity of function to consider

tmax (float):

maximum time in seconds to run any one part of simplification procedure for a given function

Returns:

None

scripts.fit_benchmark.fit_mocks(name, nx, frac_sigx, samp_num, comp, tmax=5)[source]

Run ESR for a given mock sample using the default MDL prescription

Args:
name (str):

The name of the equation to be used

nx (int):

The number of data points to be used in the mock

frac_sigx (float):

The fraction of the standard deviation to use as sigma

samp_num (int):

The mock number

comp (int):

The complexity of function to consider

tmax (float):

maximum time in seconds to run any one part of simplification procedure for a given function

Returns:

None

scripts.fit_benchmark.get_sig(f, x_range, frac_sigx)[source]

Determine the noise level to use. This is given as frac_sigx times the standard deviation of the function's values evaluated at 10^5 randomly generated points within x_range

Args:
f (str):

The function which should be evaluated

x_range (list[float, float]):

The [min, max] values of x to consider

frac_sigx (float):

The fraction of the standard deviation to use as sigma

Returns:
sig (float):

The value of sigma to use
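
The noise-level rule described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: noise_level is a hypothetical name, the function is passed as a Python callable rather than a string, and the points are drawn uniformly within x_range, which the description does not specify.

```python
import numpy as np


def noise_level(f, x_range, frac_sigx, n_points=10**5, seed=0):
    """Return frac_sigx times the standard deviation of f over random x values."""
    rng = np.random.default_rng(seed)
    # Uniform sampling within [min, max] is an assumption of this sketch
    x = rng.uniform(x_range[0], x_range[1], n_points)
    return frac_sigx * np.std(f(x))


sig = noise_level(lambda x: 2.0 * x, [0.0, 1.0], 0.5)
```

For the linear test function the standard deviation over a uniform range is known analytically (2 / sqrt(12)), which gives a quick sanity check on the Monte Carlo estimate.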

scripts.fit_benchmark.get_split_idx(L)[source]

Find the indices of an array of length L which should be considered by this rank. For output data_start, data_end, the rank considers the entries array[data_start:data_end] where len(array) = L.

Args:
L (int):

The length of the array we wish to split among ranks.

Returns:
data_start (int):

The first index of the array to be considered by the rank

data_end (int):

The final index (+1) to be considered by the rank.
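
A typical way to compute such index ranges is to give every rank the same base number of entries and hand the remainder to the lowest ranks. The sketch below assumes that scheme; split_indices is hypothetical and takes the rank and the number of ranks explicitly, whereas get_split_idx presumably obtains them from the parallel environment.

```python
def split_indices(length, rank, size):
    """Return (data_start, data_end) for `rank` out of `size` ranks."""
    base, extra = divmod(length, size)
    # The first `extra` ranks each take one additional entry
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end


# An array of length 10 shared among 3 ranks
chunks = [split_indices(10, rank, 3) for rank in range(3)]
```

The slices are contiguous and non-overlapping, so array[data_start:data_end] over all ranks covers every entry exactly once.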

scripts.fit_benchmark.get_top_eqs(all_fun, all_loss, nkeep, all_true_eq=None)[source]

Rank functions by their loss and keep up to nkeep of these, while removing duplicates of the true equation

Args:
all_fun (list[str]):

The list of functions

all_loss (np.ndarray):

The loss function to sort by

nkeep (int):

The maximum number of the top equations to keep

all_true_eq (list[str] or None):

List of variants of the true equation to find

Returns:
m_fun (list[str]):

The strings of the top functions

m_loss (np.ndarray):

The loss functions of the top functions
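
The ranking step alone can be illustrated as follows. This sketch only sorts by loss and truncates to nkeep entries; it does not attempt the removal of duplicates of the true equation that get_top_eqs also performs, and top_by_loss is a hypothetical name.

```python
import numpy as np


def top_by_loss(all_fun, all_loss, nkeep):
    """Return up to `nkeep` functions with the smallest loss, best first."""
    all_loss = np.asarray(all_loss)
    # A stable sort keeps the original order among equal losses
    order = np.argsort(all_loss, kind="stable")[:nkeep]
    return [all_fun[i] for i in order], all_loss[order]


m_fun, m_loss = top_by_loss(["x", "x**2", "sin(x)"], [3.0, 1.0, 2.0], nkeep=2)
```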

scripts.fit_benchmark.get_truth_location(fun, loss, all_true_eq)[source]

Find the highest-ranked of the 'true' equations

Args:
fun (list[str]):

The list of functions

loss (np.ndarray):

The loss function to sort by

all_true_eq (list[str]):

List of variants of the true equation to find

Returns:
m_ftrue (str):

The true equation used

m_ltrue (float):

The loss function of the truth

scripts.fit_benchmark.main()[source]

Run the benchmarks

scripts.fit_benchmark.make_data(name, f, x_range, nx, frac_sigx, samp_num, sig, make_fig=False)[source]

Make a mock data sample from a given equation with a certain number of data points and a given noise level. The results are saved to a file in the directory ‘../data/’

Args:
name (str):

The name of the equation to be used

f (str):

The function which should be evaluated

x_range (list[float, float]):

The [min, max] values of x to consider

nx (int):

The number of data points to be used in the mock

frac_sigx (float):

The fraction of the standard deviation to use as sigma

samp_num (int):

The mock number (which sets the seed of the random number generator)

sig (float):

The value of sigma to use for Gaussian noise

make_fig (bool):

Whether to make a plot of the mock data with the generating function

Returns:

None

scripts.fit_benchmark.print_text(text)[source]

Function to print progress announcements in standardised format

Args:
text (str):

The text to be printed

Returns:

None

scripts.fit_benchmark.process_data(dirname, final_prefix, all_comp)[source]

Convert results of all optimisations into a list of functions.

Args:
dirname (str):

Directory name containing the optimisation results

final_prefix (str):

Start of file names which contain result

all_comp (list[int]):

All complexities of equation to consider

Returns:
fun (list[str]):

The list of functions selected

res (np.ndarray):

The terms used for model selection of the returned functions

params (np.ndarray):

The maximum likelihood parameters of the returned functions

store_comp (np.ndarray):

The complexities of the returned functions

scripts.fit_benchmark.process_fit(dirname, all_comp, nx, all_true_eq=None, remove_by_like=True)[source]

Process the results of all fits to give a function ranking according to different model selection methods. If all_true_eq is not None, then this will also find the location of the true equation in the rankings. The results are written to a file called selection_summary.csv in the directory given by dirname.

Args:
dirname (str):

Directory name containing the optimisation results

all_comp (list[int]):

All complexities of equation to consider

nx (int):

The number of data points to be used in the mock

all_true_eq (list[str] or None):

List of variants of the true equation to find

remove_by_like (bool, default=True):

Whether to remove duplicates if they have the same likelihood

Returns:

None

scripts.fit_benchmark.remove_duplicates_and_sort(all_fun, all_loss, all_like, remove_by_like=True)[source]

Attempt to keep only the highest ranked of any duplicate equation. This will not catch all duplicates, so the user must check for them.

Args:
all_fun (list[str]):

The list of functions

all_loss (np.ndarray):

The loss function to sort by

all_like (np.ndarray):

The negative log-likelihood values

remove_by_like (bool, default=True):

Whether to remove duplicates if they have the same likelihood

Returns:
fun (list[str]):

The processed list of functions

loss (np.ndarray):

The sorted loss functions

like (np.ndarray):

The log-likelihoods of the sorted functions