LikelihoodMachine

class remu.likelihood.LikelihoodMachine(data_vector, response_matrix, truth_limits=None, limit_method='raise', eff_threshold=0.0, eff_indices=None, is_sparse=False)[source]

Class that calculates likelihoods for truth vectors.

Parameters:
data_vector : array like

The vector with the actual data

response_matrix : array like

One or more response matrices as ndarrays

truth_limits : array like, optional

The upper limits of the truth bins up to which the response matrix stays valid.

limit_method : {‘raise’, ‘prohibit’}, optional

How to handle truth values outside their limits.

eff_threshold : float, optional

Only consider truth bins with an efficiency above this threshold.

eff_indices : list of ints, optional

Use only these truth bins for likelihood calculations.

is_sparse : bool, optional

Assume that the response matrix is already a sparse matrix with only the eff_indices being present.

Notes

The optional truth_limits tells the LikelihoodMachine up to which truth bin value the response matrix stays valid. If the machine is asked to calculate the likelihood of an out-of-bounds truth vector, it is handled according to limit_method:

‘raise’ (default)
An exception is raised.
‘prohibit’
A likelihood of 0 (log likelihood of -inf) is returned.

This can be used to constrain the testable theories to those whose events have been simulated often enough in the detector Monte Carlo data, e.g. if one demands a 10x higher MC statistic:

truth_limits = generator_truth_vector / 10.

The eff_threshold determines above which total reconstruction efficiency a truth bin counts as efficient. If the total efficiency of a bin is equal to or below the threshold, the bin is ignored in all likelihood calculations.

Alternatively, a list of eff_indices can be provided. Only the specified truth bins are used for likelihood calculations in that case. If the flag is_sparse is set to True, the provided response_matrix is not sliced according to the eff_indices. Instead it is assumed that the given matrix already only contains the columns as indicated by the eff_indices array, i.e. it must fulfill the following condition:

response_matrix.shape[-1] == len(eff_indices)

If the matrix is sparse, the vector of truth limits must be provided. Its length must be that of the non-sparse response matrix, i.e. the number of truth bins irrespective of efficient indices.
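
As a point of reference, here is a minimal construction sketch; the shapes and numbers are illustrative placeholders, not part of the package:

import numpy as np
from remu.likelihood import LikelihoodMachine

# Hypothetical toy setup: 2 reco bins, 3 truth bins
data = np.array([10, 20])                  # measured event counts per reco bin
resp = np.array([[0.8, 0.1, 0.0],          # probability for an event in truth bin j
                 [0.1, 0.7, 0.2]])         # to be reconstructed in reco bin i
limits = np.full(3, 100.)                  # e.g. generator_truth_vector / 10.

lm = LikelihoodMachine(data, resp, truth_limits=limits, limit_method='prohibit')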

static generate_random_data_sample(response_matrix, truth_vector, size=None, each=False)[source]

Generate random data samples from the provided truth_vector.

Parameters:
response_matrix : array like

The matrix that translates the truth vector into reco-space expectation values.

truth_vector : array like

The truth-space expectation values used to generate the data.

size : int or tuple of ints, optional

The number of data vectors to be generated.

each : bool, optional

If True, generate size data vectors for each response matrix. Otherwise, size determines the total number of generated data vectors and the response matrices are chosen at random.
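
A hedged usage sketch, reusing the toy resp from the constructor example above; the truth values are placeholders:

import numpy as np

truth = np.array([50., 80., 30.])
# Draw 1000 random data vectors distributed around the
# reco-space expectation values resp @ truth.
toys = LikelihoodMachine.generate_random_data_sample(resp, truth, size=1000)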

likelihood_p_value(truth_vector, N=2500, generator_matrix_index=None, systematics='marginal', **kwargs)[source]

Calculate the likelihood p-value of a truth vector.

The likelihood p-value is the probability that randomly generated data yields a lower likelihood than the actual data, given that the simple hypothesis truth_vector is true.

Parameters:
truth_vector : array like

The evaluated hypothesis expressed as a vector of truth expectation values.

N : int, optional

The number of MC evaluations of the hypothesis.

generator_matrix_index : int or tuple, optional

The index of the response matrix to be used to generate the fake data. This needs to be specified only if the LikelihoodMachine contains more than one response matrix. If it is None, N data sets are thrown for each matrix, and a p-value is evaluated for all of them. The return value thus becomes an array of p-values.

systematics : {‘profile’, ‘marginal’}, optional

How to deal with detector systematics, i.e. multiple response matrices:

‘profile’, ‘maximum’

Choose the response matrix that yields the highest likelihood.

‘marginal’, ‘average’

Take the arithmetic average probability yielded by the matrices.

**kwargs : optional

Additional keyword arguments will be passed to log_likelihood().

Returns:
p : float or ndarray

The likelihood p-value.

Notes

The p-value is estimated by randomly creating N data samples according to the given truth_vector. The number n of data sets that yield a likelihood as bad as, or worse than, the likelihood of the actual data is counted. The estimate for p is then:

p = n/N.

The variance of the estimator follows that of binomial statistics:

         var(n)   Np(1-p)      1
var(p) = ------ = ------- <= ---- .
          N^2       N^2       4N

The expected uncertainty can thus be directly influenced by choosing an appropriate number of evaluations.
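
A short sketch, continuing with the hypothetical lm and truth defined in the examples above:

p = lm.likelihood_p_value(truth, N=2500)
if p < 0.05:
    print("truth vector rejected at the 5% level")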

log_likelihood(truth_vector, systematics='marginal')[source]

Calculate the log likelihood of a vector of truth expectation values.

Parameters:
truth_vector : array like

Array of truth expectation values. Can be a multidimensional array of truth vectors. The shape of the array must be (a, b, c, …, n_truth_values).

systematics : {‘profile’, ‘marginal’} or tuple or ndarray or None

How to deal with detector systematics, i.e. multiple response matrices:

‘profile’, ‘maximum’

Choose the response matrix that yields the highest probability.

‘marginal’, ‘average’

Take the arithmetic average probability yielded by the matrices.

tuple(index)

Select one specific matrix.

array(indices)

Select a specific matrix for each truth vector. Must have the shape (a, b, c, ..., len(index)).

None

Do nothing, return multiple likelihoods.
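
For illustration, a sketch evaluating two truth vectors at once with the hypothetical lm from above:

import numpy as np

truths = np.array([[50., 80., 30.],
                   [55., 75., 35.]])                        # shape (2, n_truth_values)
ll_marg = lm.log_likelihood(truths)                         # marginal over matrices
ll_prof = lm.log_likelihood(truths, systematics='profile')  # best matrix per vector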

static log_probability(data_vector, response_matrix, truth_vector, _constant=None)[source]

Calculate the log probability of some data given a response matrix and truth vector.

Parameters:
data_vector : array like

The measured data.

response_matrix : array like

The detector response matrix as ndarray.

truth_vector : array like

The vector of truth expectation values.

Returns:
p : ndarray

The log likelihood of data_vector given the response_matrix and truth_vector.

Notes

Each of the three objects can actually be an array of vectors/matrices:

data_vector.shape = (a,b,...,n_data)
response_matrix.shape = (c,d,...,n_data,n_truth)
truth_vector.shape = (e,f,...,n_truth)

In this case, the return value will have the following shape:

p.shape = (a,b,...,c,d,...,e,f,...)
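
A broadcasting sketch of this shape rule, using toy arrays (resp is the hypothetical matrix from the constructor example):

import numpy as np

data = np.random.poisson(10., size=(5, 2))    # shape (a=5, n_data=2)
matrices = resp[np.newaxis, ...]              # shape (c=1, n_data=2, n_truth=3)
truths = np.full((4, 3), 50.)                 # shape (e=4, n_truth=3)
p = LikelihoodMachine.log_probability(data, matrices, truths)
# Following the rule above, p.shape == (5, 1, 4).
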
max_likelihood_p_value(composite_hypothesis, parameters=None, N=250, generator_matrix_index=None, systematics='marginal', nproc=0, **kwargs)[source]

Calculate the maximum-likelihood p-value of a composite hypothesis.

The maximum-likelihood p-value is the probability of the data yielding a lower maximum likelihood (over the possible parameter space of the composite hypothesis) than the actual data, given that the best-fit parameter set of the composite hypothesis is true.

Parameters:
composite_hypothesis : CompositeHypothesis

The evaluated composite hypothesis.

parameters : array like, optional

The assumed true parameters of the composite hypothesis. If no parameters are given, they will be determined with the maximum likelihood method.

N : int, optional

The number of MC evaluations of the hypothesis.

generator_matrix_index : int or tuple, optional

The index of the response matrix to be used to generate the fake data. This needs to be specified only if the LikelihoodMachine contains more than one response matrix. If it is None, N data sets are thrown for each matrix, and a p-value is evaluated for all of them. The return value thus becomes an array of p-values.

systematics : {‘profile’, ‘marginal’}, optional

How to deal with detector systematics, i.e. multiple response matrices:

‘profile’, ‘maximum’

Choose the response matrix that yields the highest likelihood.

‘marginal’, ‘average’

Take the arithmetic average probability yielded by the matrices.

nproc : int, optional

If this parameter is >= 1, the corresponding number of processes is spawned to calculate the p-value in parallel.

**kwargs : optional

Additional keyword arguments will be passed to max_log_likelihood().

Returns:
p : float or ndarray

The maximum-likelihood p-value.

Notes

When used to reject composite hypotheses, this p-value is sometimes called the “profile plug-in p-value”, as one “plugs in” the maximum likelihood estimate of the hypothesis parameters to calculate it. Its coverage properties are not exact, so care has to be taken to make sure it performs as expected (e.g. by testing it with simulated data).

The p-value is estimated by randomly creating N data samples according to the truth vector corresponding to the assumed (or best-fit) parameters. The number n of data sets that yield a maximum likelihood as bad as, or worse than, the one of the actual data is counted. The estimate for p is then:

p = n/N.

The variance of the estimator follows that of binomial statistics:

         var(n)   Np(1-p)      1
var(p) = ------ = ------- <= ---- .
          N^2       N^2       4N

The expected uncertainty can thus be directly influenced by choosing an appropriate number of evaluations.
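
A minimal sketch, assuming hypo is a CompositeHypothesis constructed elsewhere (its construction is not covered on this page):

p = lm.max_likelihood_p_value(hypo, N=250, nproc=4)
print(p)   # plug-in p-value of the composite hypothesis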

max_likelihood_ratio_p_value(H0, H1, par0=None, par1=None, N=250, generator_matrix_index=None, systematics='marginal', nproc=0, nested=True, **kwargs)[source]

Calculate the maximum-likelihood-ratio p-value of two composite hypotheses.

The maximum-likelihood-ratio p-value is the probability of the data yielding a lower ratio of maximum likelihoods (over the possible parameter spaces of the composite hypotheses) than the actual data, given that the best-fit parameter set of hypothesis H0 is true.

Parameters:
H0 : CompositeHypothesis

The tested composite hypothesis. Usually a subset of H1.

H1 : CompositeHypothesis

The alternative composite hypothesis.

par0 : array like, optional

The assumed true parameters of the tested hypothesis. If no parameters are given, they will be determined with the maximum likelihood method.

par1 : array like, optional

The maximum likelihood parameters of the alternative hypothesis. If no parameters are given, they will be determined with the maximum likelihood method.

N : int, optional

The number of MC evaluations of the hypothesis.

generator_matrix_index : int or tuple, optional

The index of the response matrix to be used to generate the fake data. This needs to be specified only if the LikelihoodMachine contains more than one response matrix. If it is None, N data sets are thrown for each matrix, and a p-value is evaluated for all of them. The return value thus becomes an array of p-values.

systematics : {‘profile’, ‘marginal’}, optional

How to deal with detector systematics, i.e. multiple response matrices:

‘profile’, ‘maximum’

Choose the response matrix that yields the highest likelihood.

‘marginal’, ‘average’

Take the arithmetic average probability yielded by the matrices.

nproc : int, optional

If this parameters is >= 1, the according number of processes are spawned to calculate the p-value in parallel.

nested : bool or ‘ignore’, optional

Is H0 a nested theory, i.e. does it cover a subset of H1? In this case, the maximum likelihoods must always satisfy L0 <= L1. If True or ‘ignore’, the calculation of likelihood ratios is re-tried a couple of times if no valid likelihood ratio is found. If True and no valid value is found after 10 tries, an error is raised. If False, those cases are simply accepted.

**kwargs : optional

Additional keyword arguments will be passed to max_log_likelihood().

Returns:
p : float or ndarray

The maximum-likelihood-ratio p-value.

Notes

When used to reject composite hypotheses, this p-value is sometimes called the “profile plug-in p-value”, as one “plugs in” the maximum likelihood estimate of the hypothesis parameters to calculate it. Its coverage properties are not exact, so care has to be taken to make sure it performs as expected (e.g. by testing it with simulated data).

The p-value is estimated by randomly creating N data samples according to the truth vector corresponding to the assumed (or best-fit) parameters of H0. The number n of data sets that yield a likelihood ratio as bad as, or worse than, the one of the actual data is counted. The estimate for p is then:

p = n/N.

The variance of the estimator follows that of binomial statistics:

         var(n)   Np(1-p)      1
var(p) = ------ = ------- <= ---- .
          N^2       N^2       4N

The expected uncertainty can thus be directly influenced by choosing an appropriate number of evaluations.
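
A hedged sketch, assuming H0 and H1 are CompositeHypothesis objects with H0 nested in H1:

p = lm.max_likelihood_ratio_p_value(H0, H1, N=250, nested=True)
print(p)   # plug-in p-value of H0 against the alternative H1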

max_log_likelihood(composite_hypothesis, *args, **kwargs)[source]

Calculate the maximum possible likelihood in the given CompositeHypothesis.

Parameters:
composite_hypothesis : CompositeHypothesis

The hypothesis to be evaluated.

systematics : {‘profile’, ‘marginal’}, optional

How to deal with detector systematics, i.e. multiple response matrices:

‘profile’, ‘maximum’

Choose the response matrix that yields the highest probability.

‘marginal’, ‘average’

Take the arithmetic average probability yielded by the matrices.

disp : bool, optional

Display status messages during optimization.

method : {‘differential_evolution’, ‘basinhopping’}, optional

Select the method to be used for maximization.

kwargs : dict, optional

Keyword arguments to be passed to the minimizer. If empty, reasonable default values will be used.

Returns:
res : OptimizeResult

Object containing the maximum log likelihood res.L. In case of systematics=='profile', it also contains the index of the response matrix that yielded the maximum likelihood res.i.
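
A usage sketch, with hypo again being a CompositeHypothesis defined elsewhere:

res = lm.max_log_likelihood(hypo, systematics='profile')
print(res.L)   # maximum log likelihood
print(res.i)   # index of the response matrix that yielded it ('profile' only)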

static max_log_probability(data_vector, response_matrix, composite_hypothesis, systematics='marginal', disp=False, method='basinhopping', kwargs={})[source]

Calculate the maximum possible probability of the data in a CompositeHypothesis.

Parameters:
data_vector : array like

Vector of measured values.

response_matrix : array like

The response matrix that translates truth into reco space. Can be an arbitrarily shaped array of response matrices.

composite_hypothesis : CompositeHypothesis

The hypothesis to be evaluated.

systematics : {‘profile’, ‘marginal’}, optional

How to deal with detector systematics, i.e. multiple response matrices:

‘profile’, ‘maximum’

Choose the response matrix that yields the highest probability.

‘marginal’, ‘average’

Take the arithmetic average probability yielded by the matrices.

disp : bool, optional

Display status messages during optimization.

method : {‘differential_evolution’, ‘basinhopping’}, optional

Select the method to be used for maximization.

kwargs : dict, optional

Keyword arguments to be passed to the minimizer. If empty, reasonable default values will be used.

Returns:
res : OptimizeResult

Object containing the maximum log probability res.P. In case of systematics=='profile', it also contains the index res.i of the response matrix that yielded the maximum probability.
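
A sketch of the static variant, reusing the toy data and resp from the constructor example and a CompositeHypothesis hypo defined elsewhere:

res = LikelihoodMachine.max_log_probability(data, resp, hypo, method='basinhopping')
print(res.P)   # maximum log probability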

mcmc(composite_hypothesis, prior_only=False)[source]

Return a Markov Chain Monte Carlo object for the hypothesis.

The hypothesis must define priors for its parameters.

Parameters:
composite_hypothesis : CompositeHypothesis
prior_only : bool, optional

Use only the prior information. Ignore the data. Useful for testing purposes.

Notes

See documentation of PyMC for a description of the MCMC class.
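
A hedged sketch; hypo must define priors for its parameters, and all sampling happens through the returned PyMC object rather than the LikelihoodMachine itself:

mc = lm.mcmc(hypo)
# Sampling proceeds via the PyMC interface of the returned object,
# e.g. mc.sample(iter=10000); see the PyMC documentation for details.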

plot_bin_efficiencies(filename, plot_limits=False, bins_per_plot=20)[source]

Plot bin by bin efficiencies.

Parameters:
filename : string

Where to save the plot.

plot_limits : bool, optional

Also plot the truth limits for each bin on a second axis.

bins_per_plot : int, optional

How many bins are combined into a single plot.

Returns:
fig : Figure
ax : list of Axis
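
A one-line usage sketch; the output file name is arbitrary:

fig, ax = lm.plot_bin_efficiencies("bin_efficiencies.png", plot_limits=True)
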
plot_reco_bin_traces(filename, trace, toy_index=None, plot_data=False, bins_per_plot=20)[source]

Plot bin by bin MCMC reco traces.

Parameters:
filename : string

Where to save the plot.

trace : array like

The posterior trace of the truth bin values of an MCMC.

toy_index : array like, optional

The posterior trace of the chosen toy matrices of an MCMC.

plot_data : bool or ‘relative’, optional

Also plot the actual data content of the reco bins. If ‘relative’, the values are divided by the data before plotting.

bins_per_plot : int, optional

How many bins are combined into a single plot.

Returns:
fig : Figure
ax : list of Axis

plot_truth_bin_traces(filename, trace, plot_limits=False, bins_per_plot=20)[source]

Plot bin by bin MCMC truth traces.

Parameters:
filename : string

Where to save the plot.

trace : array like

The posterior trace of the truth bin values of an MCMC.

plot_limits : bool, optional

Also plot the truth limits.

bins_per_plot : int, optional

How many bins are combined into a single plot.

Returns:
fig : Figure
ax : list of Axis

plr(H0, parameters0, toy_indices0, H1, parameters1, toy_indices1)[source]

Calculate the Posterior distribution of the log Likelihood Ratio.

Parameters:
H0, H1 : CompositeHypothesis

Composite Hypotheses to be compared.

parameters0, parameters1 : array like

Arrays of parameter vectors, drawn from the posterior distribution of the hypotheses, e.g. with the MCMC objects:

parameters0 = [
    [1.0, 2.0, 3.0],
    [1.1, 1.9, 2.8],
    ...
    ]

toy_indices0, toy_indices1 : array_like

The corresponding systematic toy indices, in an array of equal dimensionality. That means that even if the toy index is just a single integer, it must be provided as an array of length 1:

toy_indices0 = [
    [0],
    [3],
    ...
    ]

Returns:
PLR : ndarray

A sample from the PLR as calculated from the parameter sets.

model_preference : float

The resulting model preference.

Notes

The model preference is calculated as the fraction of likelihood ratios in the PLR that prefer H1 over H0:

model_preference = N(PLR > 0) / N(PLR)

It can be interpreted as the posterior probability of the data preferring H1 over H0.

The PLR is antisymmetric under exchange of the hypotheses:

PLR(H0, H1) = -PLR(H1, H0)
preference(H0, H1) = 1. - preference(H1, H0) # modulo the cases of PLR = 0.
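
A minimal sketch, assuming the parameter and toy-index arrays were drawn from the posterior distributions of the two hypotheses, e.g. with the mcmc() objects described above:

PLR, preference = lm.plr(H0, parameters0, toy_indices0, H1, parameters1, toy_indices1)
print(preference)   # posterior probability that the data prefer H1 over H0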