API
katz
back_off
- class katz.back_off.BackOff(corpus)[source]
Bases: object
- get_alpha(old_phrase)[source]
Compute the back-off weight
- Args:
- old_phrase (tuple):
The (n-1)-length tuple used to find the back-off weight for the n-length tuple
- Returns:
- alpha (float):
The back-off weight
- beta (float):
The left-over probability mass for the (n-1)-gram
- get_d(phrase)[source]
Compute the amount of discounting found by Good-Turing estimation
- Args:
- phrase (tuple):
Collection of words to find discounting for
- Returns:
- d (float):
Good-Turing estimate for discounting
- get_pbo(wnew, old_phrase)[source]
Compute the probability of word wnew given the preceding set of words old_phrase
- Args:
- wnew (str):
The new word we wish to know the probability of obtaining
- old_phrase (tuple):
The preceding phrase
- Returns:
- pbo (float):
The conditional probability P(wnew|old_phrase)
- sort_endings(phrase)[source]
Find all ways of completing a phrase such that the new phrase appears in the corpus
- Args:
- phrase (tuple):
Collection of words to find valid completions to
- Returns:
- seen (list):
List of words which can complete the phrase to produce a phrase found in the corpus
- unseen (list):
List of words which, if appended to the phrase, produce a phrase NOT found in the corpus
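The four methods above fit together in the standard Katz back-off scheme: seen completions receive a discounted relative frequency, and the left-over mass beta is shared among unseen completions through the weight alpha. The following is a minimal bigram sketch of that idea, not the `BackOff` implementation. The function name `katz_bigram_sketch` and the constant `discount` (standing in for the Good-Turing value returned by `get_d`) are assumptions made for illustration.

```python
from collections import Counter


def katz_bigram_sketch(tokens, discount=0.5):
    """Toy bigram Katz back-off. The constant discount is an assumption;
    the documented class derives it from Good-Turing estimation instead."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    vocab = sorted(unigrams)
    total = sum(unigrams.values())

    def sort_endings(prev):
        # Split the vocabulary by whether (prev, w) occurs in the corpus.
        seen = [w for w in vocab if (prev, w) in bigrams]
        unseen = [w for w in vocab if (prev, w) not in bigrams]
        return seen, unseen

    def pbo(wnew, prev):
        seen, unseen = sort_endings(prev)
        context_count = sum(bigrams[(prev, w)] for w in seen)
        if wnew in seen:
            # Discounted relative frequency for a seen bigram.
            return discount * bigrams[(prev, wnew)] / context_count
        if not unseen:
            return 0.0
        # beta: probability mass left over after discounting seen bigrams.
        beta = 1.0 - sum(
            discount * bigrams[(prev, w)] / context_count for w in seen
        )
        # alpha: rescales unigram probabilities so unseen words share beta.
        alpha = beta / sum(unigrams[w] / total for w in unseen)
        return alpha * unigrams[wnew] / total

    return vocab, sort_endings, pbo


vocab, sort_endings, pbo = katz_bigram_sketch("a b a c a b".split())
seen, unseen = sort_endings("a")
probabilities = {w: pbo(w, "a") for w in vocab}
```

In this sketch the probabilities conditioned on any word of the toy corpus sum to one, which is the property alpha is designed to enforce.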
good_turing
- class katz.good_turing.GoodTuring(corpus)[source]
Bases: object
- actual_count(word)[source]
Return the number of times word appeared in the corpus
- Args:
- word:
The word whose frequency we wish to find
- Returns:
- int:
The number of times this word appeared in the corpus
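For context, the textbook Good-Turing estimate replaces a raw count r with r* = (r + 1) N_{r+1} / N_r, where N_r is the number of distinct items seen exactly r times, so the discount ratio is r*/r. The sketch below illustrates that formula on a toy corpus. The helper name `good_turing_discount` and the fallback to no discount when a frequency class is empty are assumptions; the library may smooth N_r or handle these cases differently.

```python
from collections import Counter


def good_turing_discount(counts, r):
    """Textbook Good-Turing discount ratio r*/r for items seen r times.

    Hypothetical helper: falls back to 1.0 (no discount) when r is zero
    or when N_r or N_{r+1} is empty.
    """
    freq_of_freq = Counter(counts.values())  # N_r for each observed r
    n_r = freq_of_freq.get(r, 0)
    n_r_plus_1 = freq_of_freq.get(r + 1, 0)
    if r == 0 or n_r == 0 or n_r_plus_1 == 0:
        return 1.0
    return (r + 1) * n_r_plus_1 / (r * n_r)


# Raw counts play the role of actual_count(word) in this sketch.
counts = Counter("a a a b b c c d e f".split())
d_two = good_turing_discount(counts, 2)
```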
prior
process_equations
- class katz.process_equations.SymbolCoder(basis_functions)[source]
Bases: object
- equation2ntuples(n, eq, locs)[source]
Convert an equation into n-tuples describing the tree structure of the function
- Args:
- n (int):
The length of the n-tuples to produce
- eq (str):
The equation to convert to an n-tuple
- locs (dict):
dictionary of string:sympy objects describing variables
- Returns:
- ntuples (list):
List of n-tuples which describe tree structure of function
- labels2ntuples(n, labels)[source]
Convert a list of labels giving the tree representation of an equation into n-tuples describing the tree structure of the function
- Args:
- n (int):
The length of the n-tuples to produce
- labels (list):
The list giving the equation to convert to an n-tuple
- Returns:
- ntuples (list):
List of n-tuples which describe tree structure of function
- nodes2ntuples(n, nodes)[source]
Convert a node object giving the tree representation of an equation into n-tuples describing the tree structure of the function
- Args:
- n (int):
The length of the n-tuples to produce
- nodes (esr.generation.generator.DecoratedNode):
Node object corresponding to the tree
- Returns:
- ntuples (list):
List of n-tuples which describe tree structure of function
- op2codeword(op)[source]
Convert an operator name as defined by sympy into the codeword assigned to it
- Args:
- op (str):
Operator name of sympy class
- Returns:
str: The codeword used to represent this symbol
- op2str(op)[source]
Convert operator names defined by sympy into symbols used here
- Args:
- op (str):
Operator name of sympy class
- Returns:
str: The equivalent symbol used here
- process_all_equations(n, all_eq, maxvar)[source]
Turn all equations into n-tuples describing the tree structures of their functions
- Args:
- n (int):
The length of the n-tuples to produce
- all_eq (list):
List of equations as strings to convert to n-tuples
- maxvar (int):
The maximum number of variables appearing in any of the equations
- Returns:
- ntuples (list):
List of n-tuples which describe tree structures of the functions
- katz.process_equations.split_by_punctuation(s)[source]
Split a string s into a list, where each instance of punctuation or a space causes a split. E.g. the string s = 'Hello, how are you?' becomes ['Hello', ',', ' ', 'how', ' ', 'are', ' ', 'you', '?']
- Args:
- s (str):
String we wish to split
- Returns:
- split_str (list):
List of strings split by punctuation
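The splitting behaviour described above can be reproduced with a single regular expression. This is a sketch of the documented behaviour, not the library's source; the name `split_by_punctuation_sketch` is hypothetical, and treating digits and underscores as word characters is an assumption.

```python
import re


def split_by_punctuation_sketch(s):
    # \w+ keeps runs of letters, digits, and underscores together;
    # every other character (punctuation or whitespace) is its own item.
    return re.findall(r"\w+|[^\w]", s)


parts = split_by_punctuation_sketch("Hello, how are you?")
# parts == ['Hello', ',', ' ', 'how', ' ', 'are', ' ', 'you', '?']
```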
- katz.process_equations.standardise_file(in_name, out_name, input_delimiter)[source]
Standardise the input equations used so that variables are named x0, x1, …, x9
- Args:
- in_name (str):
Name of file containing the equations to study. If None, then equations are read from out_eqfile
- out_name (str):
Name of file to output the new equations to
- input_delimiter (str):
The delimiter used in the input csv file
- Returns:
- all_eq (list):
List of equations as strings with the standardised variable names
- max_var (int):
The maximum number of variables appearing in any of the equations
scripts
esr_prior
fit_benchmark
- class scripts.fit_benchmark.MockLikelihood(name, nx, frac_sigx, samp_num)[source]
Bases: object
- get_pred(xdata, a, eq_numpy, **kwargs)[source]
Return evaluated function
- Args:
- xdata (float or np.array):
input variable
- a (list):
parameters to substitute into equation considered
- eq_numpy (numpy function):
function to use which gives y
- Returns:
- y (float or np.array):
the predicted value of y
- negloglike(a, eq_numpy, **kwargs)[source]
Negative log-likelihood for a given function
- Args:
- a (list):
parameters to substitute into equation considered
- eq_numpy (numpy function):
function to use which gives y
- Returns:
- nll (float):
-log(likelihood) for this function and parameters
- run_sympify(fcn_i, **kwargs)[source]
Sympify a function
- Args:
- fcn_i (str):
string representing function we wish to fit to data
- Returns:
- fcn_i (str):
string representing function we wish to fit to data (with superfluous characters removed)
- eq (sympy object):
sympy object representing function we wish to fit to data
- integrated (bool, always False):
whether we analytically integrated the function (True) or not (False)
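As a rough illustration of what a negative log-likelihood for mock data with Gaussian noise can look like, here is a minimal sketch. It is not the `MockLikelihood` code: the function name `gaussian_negloglike`, the explicit `xdata`, `ydata`, and `sigma` arguments (which the class would hold internally), and the calling convention `eq_numpy(xdata, *a)` are all assumptions.

```python
import numpy as np


def gaussian_negloglike(a, eq_numpy, xdata, ydata, sigma):
    """Negative log-likelihood for independent Gaussian errors (sketch)."""
    # Assumed convention: parameters are passed positionally after x.
    ypred = eq_numpy(xdata, *a)
    resid = (ydata - ypred) / sigma
    # Sum over points of 0.5 * [resid^2 + log(2 pi sigma^2)].
    return 0.5 * np.sum(resid**2 + np.log(2 * np.pi * sigma**2))


x = np.array([0.0, 1.0, 2.0])
y = 2.0 * x + 1.0


def line(x, m, c):
    return m * x + c


nll_true = gaussian_negloglike([2.0, 1.0], line, x, y, 1.0)
nll_off = gaussian_negloglike([2.0, 0.0], line, x, y, 1.0)
```

With noise-free data the residual term vanishes at the true parameters, so `nll_true` reduces to the normalisation term and is smaller than `nll_off`.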
- scripts.fit_benchmark.apply_language_prior(likelihood, comp, tmax=5)[source]
Apply model selection methods based on the language-model function prior to the results of an ESR run, given a likelihood class
- Args:
- likelihood (esr.fitting.likelihood object):
object containing data and likelihood function
- comp (int):
The complexity of function to consider
- tmax (float):
maximum time in seconds to run any one part of simplification procedure for a given function
- Returns:
None
- scripts.fit_benchmark.fit_mocks(name, nx, frac_sigx, samp_num, comp, tmax=5)[source]
Run ESR for a given mock sample using the default MDL prescription
- Args:
- name (str):
The name of the equation to be used
- nx (int):
The number of data points to be used in the mock
- frac_sigx (float):
The fraction of the standard deviation to use as sigma
- samp_num (int):
The mock number
- comp (int):
The complexity of function to consider
- tmax (float):
maximum time in seconds to run any one part of simplification procedure for a given function
- Returns:
None
- scripts.fit_benchmark.get_sig(f, x_range, frac_sigx)[source]
Determine the noise level to use. This is given as frac_sigx times the standard deviation of the function's values evaluated on 10^5 randomly generated points within the x_range
- Args:
- f (str):
The function which should be evaluated
- x_range (list[float, float]):
The [min, max] values of x to consider
- frac_sigx (float):
The fraction of the standard deviation to use as sigma
- Returns:
- sig (float):
The value of sigma to use
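The noise-level rule described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the script's code: the name `get_sig_sketch` is hypothetical, `f` is taken to be a Python callable rather than a string, the points are drawn uniformly, and the `seed` argument is added only to make the example repeatable.

```python
import numpy as np


def get_sig_sketch(f, x_range, frac_sigx, npoints=10**5, seed=0):
    """Return frac_sigx times the standard deviation of f on random points."""
    rng = np.random.default_rng(seed)
    # Assumption: points are drawn uniformly within [min, max].
    x = rng.uniform(x_range[0], x_range[1], npoints)
    return frac_sigx * np.std(f(x))


# For f(x) = 2x on [0, 1] the standard deviation is close to 2 / sqrt(12).
sig = get_sig_sketch(lambda x: 2.0 * x, [0.0, 1.0], 0.5)
```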
- scripts.fit_benchmark.get_split_idx(L)[source]
Find the indices of an array of length L which should be considered by this rank. For output data_start, data_end, the rank considers the entries array[data_start:data_end] where len(array) = L.
- Args:
- L (int):
The length of the array we wish to split among ranks.
- Returns:
- data_start (int):
The first index of the array to be considered by the rank
- data_end (int):
The final index (+1) to be considered by the rank.
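One common way to divide an array of length L among parallel ranks is to give every rank the same base number of entries and hand the remainder to the lowest ranks. The sketch below shows that scheme. The name `split_indices` and the explicit `rank` and `size` arguments are assumptions; the documented function takes only L and presumably reads the rank information from its parallel environment, and it may distribute the remainder differently.

```python
def split_indices(L, rank, size):
    """Return (data_start, data_end) so that this rank handles
    array[data_start:data_end]. Hypothetical helper for illustration."""
    base, extra = divmod(L, size)
    # The first `extra` ranks each take one additional entry.
    data_start = rank * base + min(rank, extra)
    data_end = data_start + base + (1 if rank < extra else 0)
    return data_start, data_end


# Ten entries over three ranks: (0, 4), (4, 7), (7, 10).
chunks = [split_indices(10, rank, 3) for rank in range(3)]
```

The slices are contiguous and non-overlapping, so concatenating them in rank order recovers the full index range.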
- scripts.fit_benchmark.get_top_eqs(all_fun, all_loss, nkeep, all_true_eq=None)[source]
Rank functions by their loss and keep up to nkeep of these, while removing duplicates of the true equation
- Args:
- all_fun (list[str]):
The list of functions
- all_loss (np.ndarray):
The loss function to sort by
- nkeep (int):
The maximum number of the top equations to keep
- all_true_eq (list[str] or None):
List of variants of the true equation to find
- Returns:
- m_fun (list[str]):
The strings of the top functions
- m_loss (np.ndarray):
The loss functions of the top functions
- scripts.fit_benchmark.get_truth_location(fun, loss, all_true_eq)[source]
Find the highest ranked of the ‘true’ equations
- Args:
- fun (list[str]):
The list of functions
- loss (np.ndarray):
The loss function to sort by
- all_true_eq (list[str]):
List of variants of the true equation to find
- Returns:
- m_ftrue (str):
The true equation used
- m_ltrue (float):
The loss function of the truth
- scripts.fit_benchmark.make_data(name, f, x_range, nx, frac_sigx, samp_num, sig, make_fig=False)[source]
Make a mock data sample from a given equation with a certain number of data points and a given noise level. The results are saved to a file in the directory ‘../data/’
- Args:
- name (str):
The name of the equation to be used
- f (str):
The function which should be evaluated
- x_range (list[float, float]):
The [min, max] values of x to consider
- nx (int):
The number of data points to be used in the mock
- frac_sigx (float):
The fraction of the standard deviation to use as sigma
- samp_num (int):
The mock number (which sets the seed of the random number generator)
- sig (float):
The value of sigma to use for Gaussian noise
- make_fig (bool):
Whether to make a plot of the mock data with the generating function
- Returns:
None
- scripts.fit_benchmark.print_text(text)[source]
Function to print progress announcements in standardised format
- Args:
- text (str):
The text to be printed
- Returns:
None
- scripts.fit_benchmark.process_data(dirname, final_prefix, all_comp)[source]
Convert results of all optimisations into a list of functions.
- Args:
- dirname (str):
Directory name containing the optimisation results
- final_prefix (str):
Start of file names which contain result
- all_comp (list[int]):
All complexities of equation to consider
- Returns:
- fun (list[str]):
The list of functions selected
- res (np.ndarray):
The terms used for model selection of the returned functions
- params (np.ndarray):
The maximum likelihood parameters of the returned functions
- store_comp (np.ndarray):
The complexities of the returned functions
- scripts.fit_benchmark.process_fit(dirname, all_comp, nx, all_true_eq=None, remove_by_like=True)[source]
Process the results of all fits to give a function ranking according to different model selection methods. If all_true_eq is not None, then this will also find the location of the true equation in the rankings. The results are written to a file called selection_summary.csv in the directory given by dirname.
- Args:
- dirname (str):
Directory name containing the optimisation results
- all_comp (list[int]):
All complexities of equation to consider
- nx (int):
The number of data points to be used in the mock
- all_true_eq (list[str] or None):
List of variants of the true equation to find
- remove_by_like (bool, default=True):
Whether to remove duplicates if they have the same likelihood
- Returns:
None
- scripts.fit_benchmark.remove_duplicates_and_sort(all_fun, all_loss, all_like, remove_by_like=True)[source]
Attempt to keep only the highest ranked of any duplicate equation. This will not catch all duplicates, so the user must check for them.
- Args:
- all_fun (list[str]):
The list of functions
- all_loss (np.ndarray):
The loss function to sort by
- all_like (np.ndarray):
The negative log-likelihood values
- remove_by_like (bool, default=True):
Whether to remove duplicates if they have the same likelihood
- Returns:
- fun (list[str]):
The processed list of functions
- loss (np.ndarray):
The sorted loss functions
- like (np.ndarray):
The log-likelihoods of the sorted functions
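A simple way to approximate this kind of de-duplication is to sort by loss and keep only the first occurrence of each function string and, optionally, of each likelihood value. The sketch below follows that idea; it is not the script's implementation. The name `dedupe_and_sort` and the `decimals` rounding used to compare likelihoods are assumptions, and, as the description above warns, matching on strings and likelihoods will not catch every duplicate.

```python
import numpy as np


def dedupe_and_sort(all_fun, all_loss, all_like, remove_by_like=True, decimals=8):
    """Sort by loss and keep the best-ranked copy of each duplicate (sketch)."""
    order = np.argsort(all_loss, kind="stable")
    seen_fun, seen_like = set(), set()
    keep = []
    for i in order:
        # Assumption: likelihoods equal after rounding mark a duplicate.
        like_key = round(float(all_like[i]), decimals)
        if all_fun[i] in seen_fun:
            continue
        if remove_by_like and like_key in seen_like:
            continue
        seen_fun.add(all_fun[i])
        seen_like.add(like_key)
        keep.append(i)
    keep = np.array(keep, dtype=int)
    fun = [all_fun[i] for i in keep]
    return fun, np.asarray(all_loss)[keep], np.asarray(all_like)[keep]


funs = ["x+a", "a+x", "x*a", "x+a"]
loss = np.array([3.0, 2.0, 5.0, 4.0])
like = np.array([1.0, 1.0, 2.5, 1.0])
fun_kept, loss_kept, like_kept = dedupe_and_sort(funs, loss, like)
```

Here "x+a" and "a+x" share a likelihood, so with `remove_by_like=True` only the lower-loss form "a+x" survives alongside "x*a".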