LikelihoodMachine
class remu.likelihood.LikelihoodMachine(data_vector, response_matrix, truth_limits=None, limit_method='raise', eff_threshold=0.0, eff_indices=None, is_sparse=False)
Class that calculates likelihoods for truth vectors.
Parameters: - data_vector : array like
The vector with the actual data.
- response_matrix : array like
One or more response matrices as ndarrays
- truth_limits : array like, optional
The upper limits of the truth bins up to which the response matrix stays valid.
- limit_method : {‘raise’, ‘prohibit’}, optional
How to handle truth values outside their limits.
- eff_threshold : float, optional
Only consider truth bins with an efficiency above this threshold.
- eff_indices : list of ints, optional
Use only these truth bins for likelihood calculations.
- is_sparse : bool, optional
Assume that the response matrix is already a sparse matrix with only the eff_indices being present.
Notes
The optional truth_limits tells the LikelihoodMachine up to which truth bin value the response matrix stays valid. If the machine is asked to calculate the likelihood of an out-of-bounds truth vector, it is handled according to limit_method:
- ‘raise’ (default)
- An exception is raised.
- ‘prohibit’
- A likelihood of 0 (log likelihood of -inf) is returned.
This can be used to constrain the testable theories to events that have been simulated enough times in the detector Monte Carlo data. E.g. if one demands a 10x higher MC statistic:
truth_limits = generator_truth_vector / 10.
The eff_threshold determines above which total reconstruction efficiency a truth bin counts as efficient. If the total efficiency is equal to or below the threshold, the bin is ignored in all likelihood calculations.
Alternatively, a list of eff_indices can be provided. Only the specified truth bins are used for likelihood calculations in that case. If the flag is_sparse is set to True, the provided response_matrix is not sliced according to the eff_indices. Instead it is assumed that the given matrix already only contains the columns as indicated by the eff_indices array, i.e. it must fulfill the following condition:
response_matrix.shape[-1] == len(eff_indices)
If the matrix is sparse, the vector of truth limits must be provided. Its length must be that of the non-sparse response matrix, i.e. the number of truth bins irrespective of efficient indices.
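The efficient-bin bookkeeping can be sketched in plain NumPy (the matrix values are made up for illustration; this is not ReMU's internal code):

```python
import numpy as np

# Hypothetical 3-reco-bin x 4-truth-bin response matrix; truth bin 2 is never
# reconstructed, so its total efficiency is zero.
response_matrix = np.array([[0.8, 0.1, 0.0, 0.0],
                            [0.1, 0.7, 0.0, 0.1],
                            [0.0, 0.1, 0.0, 0.6]])

# The total efficiency of a truth bin is its column sum.
efficiency = response_matrix.sum(axis=0)
eff_indices = np.flatnonzero(efficiency > 0.0)   # bins above eff_threshold=0.0

# A "sparse" matrix keeps only the efficient columns, so it satisfies
# response_matrix.shape[-1] == len(eff_indices).
sparse_matrix = response_matrix[:, eff_indices]
```

With `is_sparse=True`, a matrix of this reduced shape would be passed in directly instead of being sliced internally.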
static generate_random_data_sample(response_matrix, truth_vector, size=None, each=False)
Generate random data samples from the provided truth_vector.
Parameters: - response_matrix : array like
The matrix that translates the truth vector to reco-space expectation values.
- truth_vector : array like
The truth-space expectation values used to generate the data.
- size : int or tuple of ints, optional
The number of data vectors to be generated.
- each : bool, optional
Generate size vectors for each response matrix. Otherwise size determines the total number of generated data vectors and the response matrices are chosen randomly.
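A minimal NumPy sketch of what this amounts to for a single response matrix, assuming Poisson-distributed reco bins (illustrative values, not ReMU's implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 2x2 response matrix and truth expectation values.
response_matrix = np.array([[0.8, 0.1],
                            [0.1, 0.7]])
truth_vector = np.array([100.0, 50.0])

# Reco-space expectation values: fold the truth through the response matrix.
mu = response_matrix @ truth_vector

# Draw `size` independent Poisson data vectors around that expectation.
size = 5
data_samples = rng.poisson(mu, size=(size, len(mu)))
```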
likelihood_p_value(truth_vector, N=2500, generator_matrix_index=None, systematics='marginal', **kwargs)
Calculate the likelihood p-value of a truth vector.
The likelihood p-value is the probability that randomly generated data yield a lower likelihood than the actual data, given that the simple hypothesis of truth_vector is true.
Parameters: - truth_vector : array like
The evaluated hypothesis expressed as a vector of truth expectation values.
- N : int, optional
The number of MC evaluations of the hypothesis.
- generator_matrix_index : int or tuple, optional
The index of the response matrix to be used to generate the fake data. This needs to be specified only if the LikelihoodMachine contains more than one response matrix. If it is None, N data sets are thrown for each matrix, and a p-value is evaluated for all of them. The return value thus becomes an array of p-values.
- systematics : {‘profile’, ‘marginal’}, optional
How to deal with detector systematics, i.e. multiple response matrices:
- ‘profile’, ‘maximum’
Choose the response matrix that yields the highest likelihood.
- ‘marginal’, ‘average’
Take the arithmetic average probability yielded by the matrices.
- **kwargs : optional
Additional keyword arguments will be passed to log_likelihood().
Returns: - p : float or ndarray
The likelihood p-value.
Notes
The p-value is estimated by randomly creating N data samples according to the given truth_vector. The number n of data sets that yield a likelihood as bad as, or worse than, the likelihood of the actual data is counted. The estimate for p is then:
p = n/N.
The variance of the estimator follows that of binomial statistics:
var(p) = var(n) / N^2 = N p (1-p) / N^2 <= 1 / (4N).
The expected uncertainty can thus be directly influenced by choosing an appropriate number of evaluations.
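The procedure can be sketched with NumPy for a single response matrix, assuming Poisson statistics in the reco bins (all values illustrative):

```python
import numpy as np
from math import lgamma

def poisson_log_likelihood(data, mu):
    # Sum of Poisson log-probabilities: k*log(mu) - mu - log(k!)
    return float(sum(k * np.log(m) - m - lgamma(k + 1.0) for k, m in zip(data, mu)))

rng = np.random.default_rng(0)

# Hypothetical 2-bin response matrix and truth vector.
response_matrix = np.array([[0.8, 0.1],
                            [0.1, 0.7]])
truth_vector = np.array([100.0, 50.0])
mu = response_matrix @ truth_vector     # reco-space expectation values
data = np.array([90, 40])               # stand-in for the actual data

N = 2500
L_data = poisson_log_likelihood(data, mu)

# Throw N toy data sets and count those with a likelihood as bad as the data's.
toys = rng.poisson(mu, size=(N, len(mu)))
n = sum(poisson_log_likelihood(t, mu) <= L_data for t in toys)
p = n / N                               # estimator with var(p) <= 1/(4N)
```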
log_likelihood(truth_vector, systematics='marginal')
Calculate the log likelihood of a vector of truth expectation values.
Parameters: - truth_vector : array like
Array of truth expectation values. Can be a multidimensional array of truth vectors. The shape of the array must be (a, b, c, …, n_truth_values).
- systematics : {‘profile’, ‘marginal’} or tuple or ndarray or None
How to deal with detector systematics, i.e. multiple response matrices:
- ‘profile’, ‘maximum’
Choose the response matrix that yields the highest probability.
- ‘marginal’, ‘average’
Take the arithmetic average probability yielded by the matrices.
- tuple(index)
Select one specific matrix.
- array(indices)
Select a specific matrix for each truth vector. Must have the shape (a, b, c, ..., len(index)).
- None
Do nothing, return multiple likelihoods.
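The difference between the 'profile' and 'marginal' strategies can be sketched with NumPy (the log-likelihood values are made up):

```python
import numpy as np

# Hypothetical log-likelihoods of one truth vector under three response matrices.
log_likes = np.array([-12.0, -10.0, -11.0])

# 'profile' / 'maximum': pick the matrix yielding the highest likelihood.
profile = np.max(log_likes)

# 'marginal' / 'average': average the probabilities (not the logs),
# computed stably via log-sum-exp.
m = np.max(log_likes)
marginal = m + np.log(np.mean(np.exp(log_likes - m)))
```

By Jensen's inequality the marginal value never exceeds the profile value.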
static log_probability(data_vector, response_matrix, truth_vector, _constant=None)
Calculate the log probability of some data given a response matrix and truth vector.
Parameters: - data_vector : array like
The measured data.
- response_matrix : array like
The detector response matrix as ndarray.
- truth_vector : array like
The vector of truth expectation values.
Returns: - p : ndarray
The log likelihood of data_vector given the response_matrix and truth_vector.
Notes
Each of the three objects can actually be an array of vectors/matrices:
data_vector.shape = (a, b, ..., n_data)
response_matrix.shape = (c, d, ..., n_data, n_truth)
truth_vector.shape = (e, f, ..., n_truth)
In this case, the return value will have the following shape:
p.shape = (a,b,...,c,d,...,e,f,...)
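A NumPy sketch of this broadcasting behaviour, using a Poisson log-probability up to the constant -log(k!) term (shapes chosen arbitrarily):

```python
import numpy as np

n_data, n_truth = 3, 4
data_vector = np.ones((2, n_data))                              # shape (a, n_data)
response_matrix = np.full((5, n_data, n_truth), 1.0 / n_truth)  # (c, n_data, n_truth)
truth_vector = np.ones((7, n_truth))                            # shape (e, n_truth)

# Fold every truth vector through every response matrix: mu has shape (c, e, n_data).
mu = np.einsum('cij,ej->cei', response_matrix, truth_vector)

# Insert axes so the data's leading index group broadcasts against (c, e):
d = data_vector[:, None, None, :]                               # (a, 1, 1, n_data)
log_p = np.sum(d * np.log(mu) - mu, axis=-1)                    # shape (a, c, e)
```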
marginal_log_likelihood(composite_hypothesis, parameters, toy_indices)
Calculate the marginal likelihood.
Parameters: - composite_hypothesis : CompositeHypothesis
The composite hypothesis for which the likelihood will be calculated.
- parameters : array like
Array of parameter vectors, drawn from the prior or posterior distribution of the hypothesis, e.g. with the MCMC objects:
parameters = [ [1.0, 2.0, 3.0], [1.1, 1.9, 2.8], ... ]
- toy_indices : array like
The corresponding systematic toy indices, in an array of equal dimensionality. That means, even if the toy index is just a single integer, it must be provided as an array of length 1:
toy_indices = [ [0], [3], ... ]
Returns: - L : float
The marginal log-likelihood.
Notes
The marginal likelihood is used in the construction of Bayes factors, when comparing the evidence in the data for two hypotheses:
bayes_factor = marginal_likelihood0 / marginal_likelihood1
or in the case of log-likelihoods:
log_bayes_factor = marginal_log_likelihood0 - marginal_log_likelihood1
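For example, with made-up marginal log-likelihood values:

```python
import numpy as np

# Hypothetical marginal log-likelihoods of two composite hypotheses.
marginal_log_likelihood0 = -105.2
marginal_log_likelihood1 = -107.5

log_bayes_factor = marginal_log_likelihood0 - marginal_log_likelihood1
bayes_factor = np.exp(log_bayes_factor)   # > 1: the data favour hypothesis 0
```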
max_likelihood_p_value(composite_hypothesis, parameters=None, N=250, generator_matrix_index=None, systematics='marginal', nproc=0, **kwargs)
Calculate the maximum-likelihood p-value of a composite hypothesis.
The maximum-likelihood p-value is the probability that randomly generated data yield a lower maximum likelihood (over the possible parameter space of the composite hypothesis) than the actual data, given that the best-fit parameter set of the composite hypothesis is true.
Parameters: - composite_hypothesis : CompositeHypothesis
The evaluated composite hypothesis.
- parameters : array like, optional
The assumed true parameters of the composite hypothesis. If no parameters are given, they will be determined with the maximum likelihood method.
- N : int, optional
The number of MC evaluations of the hypothesis.
- generator_matrix_index : int or tuple, optional
The index of the response matrix to be used to generate the fake data. This needs to be specified only if the LikelihoodMachine contains more than one response matrix. If it is None, N data sets are thrown for each matrix, and a p-value is evaluated for all of them. The return value thus becomes an array of p-values.
- systematics : {‘profile’, ‘marginal’}, optional
How to deal with detector systematics, i.e. multiple response matrices:
- ‘profile’, ‘maximum’
Choose the response matrix that yields the highest likelihood.
- ‘marginal’, ‘average’
Take the arithmetic average probability yielded by the matrices.
- nproc : int, optional
If this parameter is >= 1, the corresponding number of processes is spawned to calculate the p-value in parallel.
- **kwargs : optional
Additional keyword arguments will be passed to max_log_likelihood().
Returns: - p : float or ndarray
The maximum-likelihood p-value.
Notes
When used to reject composite hypotheses, this p-value is sometimes called the “profile plug-in p-value”, as one “plugs in” the maximum likelihood estimate of the hypothesis parameters to calculate it. Its coverage properties are not exact, so care has to be taken to make sure it performs as expected (e.g. by testing it with simulated data).
The p-value is estimated by randomly creating N data samples according to the truth vector of the assumed or best-fit parameters. The number n of data sets that yield a maximum likelihood as bad as, or worse than, the maximum likelihood of the actual data is counted. The estimate for p is then:
p = n/N.
The variance of the estimator follows that of binomial statistics:
var(p) = var(n) / N^2 = N p (1-p) / N^2 <= 1 / (4N).
The expected uncertainty can thus be directly influenced by choosing an appropriate number of evaluations.
max_likelihood_ratio_p_value(H0, H1, par0=None, par1=None, N=250, generator_matrix_index=None, systematics='marginal', nproc=0, nested=True, **kwargs)
Calculate the maximum-likelihood-ratio p-value of two composite hypotheses.
The maximum-likelihood-ratio p-value is the probability that randomly generated data yield a lower ratio of maximum likelihoods (over the possible parameter spaces of the composite hypotheses) than the actual data, given that the best-fit parameter set of hypothesis H0 is true.
Parameters: - H0 : CompositeHypothesis
The tested composite hypothesis. Usually a subset of H1.
- H1 : CompositeHypothesis
The alternative composite hypothesis.
- par0 : array like, optional
The assumed true parameters of the tested hypothesis. If no parameters are given, they will be determined with the maximum likelihood method.
- par1 : array like, optional
The maximum likelihood parameters of the alternative hypothesis. If no parameters are given, they will be determined with the maximum likelihood method.
- N : int, optional
The number of MC evaluations of the hypothesis.
- generator_matrix_index : int or tuple, optional
The index of the response matrix to be used to generate the fake data. This needs to be specified only if the LikelihoodMachine contains more than one response matrix. If it is None, N data sets are thrown for each matrix, and a p-value is evaluated for all of them. The return value thus becomes an array of p-values.
- systematics : {‘profile’, ‘marginal’}, optional
How to deal with detector systematics, i.e. multiple response matrices:
- ‘profile’, ‘maximum’
Choose the response matrix that yields the highest likelihood.
- ‘marginal’, ‘average’
Take the arithmetic average probability yielded by the matrices.
- nproc : int, optional
If this parameter is >= 1, the corresponding number of processes is spawned to calculate the p-value in parallel.
- nested : bool or ‘ignore’, optional
Is H0 a nested theory, i.e. does it cover a subset of H1? In this case, the maximum likelihoods must always satisfy L0 <= L1. If True or ‘ignore’, the calculation of likelihood ratios is re-tried a couple of times if no valid likelihood ratio is found. If True and no valid value was found after 10 tries, an error is raised. If False, those cases are just accepted.
- **kwargs : optional
Additional keyword arguments will be passed to max_log_likelihood().
Returns: - p : float or ndarray
The maximum-likelihood-ratio p-value.
Notes
When used to reject composite hypotheses, this p-value is sometimes called the “profile plug-in p-value”, as one “plugs in” the maximum likelihood estimate of the hypothesis parameters to calculate it. Its coverage properties are not exact, so care has to be taken to make sure it performs as expected (e.g. by testing it with simulated data).
The p-value is estimated by randomly creating N data samples according to the truth vector of the assumed or best-fit parameters of H0. The number n of data sets that yield a likelihood ratio as bad as, or worse than, the likelihood ratio of the actual data is counted. The estimate for p is then:
p = n/N.
The variance of the estimator follows that of binomial statistics:
var(p) = var(n) / N^2 = N p (1-p) / N^2 <= 1 / (4N).
The expected uncertainty can thus be directly influenced by choosing an appropriate number of evaluations.
max_log_likelihood(composite_hypothesis, *args, **kwargs)
Calculate the maximum possible likelihood in the given CompositeHypothesis.
Parameters: - composite_hypothesis : CompositeHypothesis
The hypothesis to be evaluated.
- systematics : {‘profile’, ‘marginal’}, optional
How to deal with detector systematics, i.e. multiple response matrices:
- ‘profile’, ‘maximum’
Choose the response matrix that yields the highest probability.
- ‘marginal’, ‘average’
Take the arithmetic average probability yielded by the matrices.
- disp : bool, optional
Display status messages during optimization.
- method : {‘differential_evolution’, ‘basinhopping’}, optional
Select the method to be used for maximization.
- kwargs : dict, optional
Keyword arguments to be passed to the minimizer. If empty, reasonable default values will be used.
Returns: - res : OptimizeResult
Object containing the maximum log likelihood res.L. In case of systematics=='profile', it also contains the index of the response matrix that yielded the maximum likelihood, res.i.
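A toy sketch of such a maximization for a hypothetical one-parameter composite hypothesis, where a simple grid search stands in for the actual basinhopping/differential_evolution optimizers (all values are made up):

```python
import numpy as np
from math import lgamma

# Toy composite hypothesis: truth_vector(theta) = theta * template.
response_matrix = np.array([[0.8, 0.1],
                            [0.1, 0.7]])
template = np.array([1.0, 0.5])
data = np.array([90.0, 40.0])

def log_likelihood(theta):
    # Poisson log-likelihood of the data given the folded truth expectation.
    mu = response_matrix @ (theta * template)
    return float(sum(k * np.log(m) - m - lgamma(k + 1.0) for k, m in zip(data, mu)))

# Crude 1-D grid search standing in for the real optimizers.
thetas = np.linspace(50.0, 200.0, 1501)
best = thetas[np.argmax([log_likelihood(t) for t in thetas])]
# For this scaling model the analytic optimum is
# sum(data) / sum(response_matrix @ template) = 130 / 1.3 = 100.
```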
static max_log_probability(data_vector, response_matrix, composite_hypothesis, systematics='marginal', disp=False, method='basinhopping', kwargs={})
Calculate the maximum possible probability of the data in a CompositeHypothesis.
Parameters: - data_vector : array like
Vector of measured values.
- response_matrix : array like
The response matrix that translates truth into reco space. Can be an arbitrarily shaped array of response matrices.
- composite_hypothesis : CompositeHypothesis
The hypothesis to be evaluated.
- systematics : {‘profile’, ‘marginal’}, optional
How to deal with detector systematics, i.e. multiple response matrices:
- ‘profile’, ‘maximum’
Choose the response matrix that yields the highest probability.
- ‘marginal’, ‘average’
Take the arithmetic average probability yielded by the matrices.
- disp : bool, optional
Display status messages during optimization.
- method : {‘differential_evolution’, ‘basinhopping’}, optional
Select the method to be used for maximization.
- kwargs : dict, optional
Keyword arguments to be passed to the minimizer. If empty, reasonable default values will be used.
Returns: - res : OptimizeResult
Object containing the maximum log probability res.P. In case of systematics=='profile', it also contains the index of the response matrix that yielded the maximum likelihood, res.i.
mcmc(composite_hypothesis, prior_only=False)
Return a Markov Chain Monte Carlo object for the hypothesis.
The hypothesis must define priors for its parameters.
Parameters: - composite_hypothesis : CompositeHypothesis
- prior_only : bool, optional
Use only the prior information. Ignore the data. Useful for testing purposes.
See also
plr, plot_truth_bin_traces, plot_reco_bin_traces, likelihood_p_value, max_likelihood_p_value, max_likelihood_ratio_p_value
Notes
See documentation of PyMC for a description of the MCMC class.
plot_bin_efficiencies(filename, plot_limits=False, bins_per_plot=20)
Plot bin by bin efficiencies.
Uses Matplotlib's boxplot, showing the median (line), quartiles (box), 5% and 95% percentiles (error bars), and mean (point) of the efficiencies over all matrices.
Parameters: - filename : string
Where to save the plot.
- plot_limits : bool, optional
Also plot the truth limits for each bin on a second axis.
- bins_per_plot : int, optional
How many bins are combined into a single plot.
Returns: - fig : Figure
- ax : list of Axis
plot_reco_bin_traces(filename, trace, toy_index=None, plot_data=False, bins_per_plot=20)
Plot bin by bin MCMC reco traces.
Uses Matplotlib's boxplot, showing the traces' median (line), quartiles (box), 5% and 95% percentiles (error bars), and mean (point).
Parameters: - filename : string
Where to save the plot.
- trace : array like
The posterior trace of the truth bin values of an MCMC.
- toy_index : array like, optional
The posterior trace of the chosen toy matrices of an MCMC.
- plot_data : bool or ‘relative’, optional
Also plot the actual data content of the reco bins. If ‘relative’, the values are divided by the data before plotting.
- bins_per_plot : int, optional
How many bins are combined into a single plot.
Returns: - fig : Figure
- ax : list of Axis
plot_truth_bin_traces(filename, trace, plot_limits=False, bins_per_plot=20)
Plot bin by bin MCMC truth traces.
Uses Matplotlib's boxplot, showing the traces' median (line), quartiles (box), 5% and 95% percentiles (error bars), and mean (point).
Parameters: - filename : string
Where to save the plot.
- trace : array like
The posterior trace of the truth bin values of an MCMC.
- plot_limits : bool or ‘relative’, optional
Also plot the truth limits. If ‘relative’, the values are divided by the limits before plotting.
- bins_per_plot : int, optional
How many bins are combined into a single plot.
Returns: - fig : Figure
- ax : list of Axis
plr(H0, parameters0, toy_indices0, H1, parameters1, toy_indices1)
Calculate the Posterior distribution of the log Likelihood Ratio (PLR).
Parameters: - H0, H1 : CompositeHypothesis
Composite Hypotheses to be compared.
- parameters0, parameters1 : array like
Arrays of parameter vectors, drawn from the posterior distribution of the hypotheses, e.g. with the MCMC objects:
parameters0 = [ [1.0, 2.0, 3.0], [1.1, 1.9, 2.8], ... ]
- toy_indices0, toy_indices1 : array_like
The corresponding systematic toy indices, in an array of equal dimensionality. That means, even if the toy index is just a single integer, it must be provided as arrays of length 1:
toy_indices0 = [ [0], [3], ... ]
Returns: - PLR : ndarray
A sample from the PLR as calculated from the parameter sets.
- model_preference : float
The resulting model preference.
Notes
The model preference is calculated as the fraction of likelihood ratios in the PLR that prefer H1 over H0:
model_preference = N(PLR > 0) / N(PLR)
It can be interpreted as the posterior probability of the data preferring H1 over H0.
The PLR is symmetric:
PLR(H0, H1) = -PLR(H1, H0)
preference(H0, H1) = 1. - preference(H1, H0)  # modulo the cases of PLR = 0