matrix_utils
Utility functions for working with response matrices.
- remu.matrix_utils.compatibility(first, second, N=None, return_all=False, truth_indices=None, min_quality=0.95, **kwargs)
Calculate the compatibility between two response matrices.
This checks whether the point “the matrices are identical” is an outlier in the distribution of matrix differences as defined by the statistical uncertainties of the matrix elements. The test statistic is the Mahalanobis distance. If the point “the matrices are identical” is not a plausible part of that distribution, it is not reasonable to assume that the true matrices are identical.
- Parameters
- second : ResponseMatrix
The second response matrix.
- N : int, optional
Number of random matrices to be generated for the calculation. This number must be larger than the number of reco bins, otherwise the covariances cannot be calculated correctly. Defaults to #(reco bins) + 100.
- return_all : bool, optional
If False, return only null_prob_count and null_prob_chi2.
- truth_indices : list of ints, optional
Only use the given truth indices to calculate the compatibility. If this is not specified, only indices with a minimum “quality” are used. This quality requires enough statistics in the bins so that the difference between the mean matrices is not dominated by the shared prior.
- Returns
- null_prob_count : float
The Bayesian p-value evaluated by counting the expected number of random matrix differences more extreme than the mean difference.
- null_prob_chi2 : float
The Bayesian p-value evaluated by assuming a chi-square distribution of the squares of Mahalanobis distances.
- null_distance : float, optional
The squared Mahalanobis distance of the mean differences between the two matrices:
D_M^2( mean(first.random_matrices - second.random_matrices) )
- distances : ndarray, optional
The set of squared Mahalanobis distances between randomly generated matrix differences and the mean matrix difference:
D_M^2( (first.random_matrices - second.random_matrices) - mean(first.random_matrices - second.random_matrices) )
- df : int, optional
Degrees of freedom of the assumed chi-squared distribution of the squared Mahalanobis distances. This is equal to the number of matrix elements that are considered for the calculation:
df = len(truth_indices) * #(reco_bins in matrix)
Notes
The distribution of matrix differences is evaluated by generating N random response matrices from both compared matrices and calculating the (n-dimensional) differences. The resulting set of matrix differences defines the mean mean(differences) and the covariance matrix cov(differences). The covariance in turn defines a metric for the Mahalanobis distance D_M(x) on the space of matrix differences, where x is a set of matrix element differences.
The distance between the mean difference and the Null hypothesis, that the two true matrices are identical, is the null_distance:
null_distance = D_M(0 - mean(differences)) = D_M(mean(differences))
The compatibility between the matrices is now defined as the Bayesian probability that the true difference between the matrices is more extreme (has a larger distance from the mean difference) than the Null hypothesis. For this, we can just evaluate the set of matrix differences that was used to calculate the covariance matrix:
distances = D_M(differences - mean(differences))
null_prob_count = np.sum(distances >= null_distance) / distances.size
It will be 1 if the mean difference between the matrices is 0, and tend to 0 when the mean difference between the matrices is far from 0. “Far” in this case is determined by the uncertainty, i.e. the covariance, of the difference determination.
In the case of normally distributed differences, the squared Mahalanobis distances follow a chi-squared distribution. The number of degrees of freedom of that distribution is the number of variates, i.e. the number of response matrix elements that are being considered. This can be used to calculate a theoretical value for the compatibility:
df = len(truth_indices) * #(reco_bins)
null_prob_chi2 = chi2.sf(null_distance**2, df)
Since the distribution of differences is not necessarily Gaussian, this is only an estimate. Its advantage is that it is less dependent on the number of randomly drawn matrices.
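The procedure described in these notes can be sketched with NumPy and SciPy on synthetic data. This is a minimal illustration under stated assumptions, not the code used by compatibility(): the array differences is a made-up stand-in for the flattened differences of N random matrix pairs, and the names n_elements, true_shift and squared_mahalanobis are introduced here for illustration only. The sketch works with squared distances throughout, which leaves the ordering of the distances unchanged.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Synthetic stand-in for the differences of N random matrix pairs.
# Each row is one flattened matrix difference (assumption: Gaussian noise).
n_elements = 6      # plays the role of len(truth_indices) * #(reco bins)
N = 2000            # number of random matrix pairs, must exceed n_elements
true_shift = np.full(n_elements, 0.05)
differences = true_shift + 0.05 * rng.standard_normal((N, n_elements))

# Mean and covariance of the differences define the Mahalanobis metric.
mean = differences.mean(axis=0)
cov = np.cov(differences, rowvar=False)
inv_cov = np.linalg.inv(cov)


def squared_mahalanobis(x):
    """Squared Mahalanobis distance of each row of x from the origin."""
    x = np.atleast_2d(x)
    return np.einsum("ni,ij,nj->n", x, inv_cov, x)


# Squared distance between the mean difference and the Null hypothesis (0).
null_distance_sq = squared_mahalanobis(mean)[0]

# Squared distances of the random differences from their own mean.
distances_sq = squared_mahalanobis(differences - mean)

# Counting estimate: fraction of draws at least as extreme as the Null.
null_prob_count = np.sum(distances_sq >= null_distance_sq) / distances_sq.size

# Chi-squared estimate with one degree of freedom per matrix element.
null_prob_chi2 = chi2.sf(null_distance_sq, n_elements)

print(null_prob_count, null_prob_chi2)
```

Because the synthetic differences are Gaussian, the two estimates should agree closely here. For real response matrices they can differ, which is why both values are returned.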
- remu.matrix_utils.improve_stats(response_matrix, data_index=None)
Reduce the statistical uncertainty by merging some bins in the truth binning.
- Parameters
- response_matrix : ResponseMatrix
- data_index : int, optional
Improve the stats at this truth binning data index. Defaults to the bin with the fewest entries.
- Returns
- new_response_matrix : ResponseMatrix
Warning
The resulting matrix will have the nuisance/impossible indices set to []!
Notes
Depending on the truth binning, one or more bins will be merged. The bin corresponding to data_index will be among them. The “direction” of the merge (i.e. which neighbouring bin to merge it with) is decided by the compatibility of the sets of to-be-merged bins. I.e. the algorithm tries to minimize the response difference between the merged bins.
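The idea of choosing the merge direction can be illustrated with a toy example in plain NumPy. This is only a sketch and not how improve_stats() is implemented: it assumes a one-dimensional truth binning, uses the Euclidean distance between response columns as a simple stand-in for the compatibility measure, and the arrays response and entries as well as the helper choose_merge_partner are hypothetical.

```python
import numpy as np

# Hypothetical toy response: rows are reco bins, columns are truth bins.
response = np.array([
    [0.60, 0.55, 0.10, 0.05],
    [0.20, 0.25, 0.50, 0.45],
    [0.05, 0.05, 0.30, 0.40],
])
# Hypothetical number of entries in each truth bin.
entries = np.array([500, 12, 480, 510])


def choose_merge_partner(response, index):
    """Return the neighbouring truth bin whose response column is closest."""
    n_truth = response.shape[1]
    neighbours = [i for i in (index - 1, index + 1) if 0 <= i < n_truth]
    gaps = [np.linalg.norm(response[:, index] - response[:, i])
            for i in neighbours]
    return neighbours[int(np.argmin(gaps))]


# Default choice: the truth bin with the fewest entries.
data_index = int(np.argmin(entries))

# Merge towards the neighbour with the most similar response.
partner = choose_merge_partner(response, data_index)

# The merged bin pools the statistics of both bins.
merged_entries = entries[data_index] + entries[partner]

print(data_index, partner, merged_entries)
```

In this toy, the sparsely populated bin is merged with the neighbour whose response column differs least, so the merged bin gains statistics while changing the response as little as possible.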
- remu.matrix_utils.mahalanobis_distance(first, second, shape=None, N=None, return_distances_from_mean=False, **kwargs)
Calculate the squared Mahalanobis distance of the two matrices for each truth bin.
- Parameters
- first, second : ResponseMatrix
The two response matrices for the comparison.
- shape : tuple of ints, optional
The shape of the returned matrix. Defaults to (#(truth bins),).
- N : int, optional
Number of random matrices to be generated for the calculation. This number must be larger than the number of reco bins, otherwise the covariances cannot be calculated correctly. Defaults to #(reco bins) + 100.
- return_distances_from_mean : bool, optional
Also return the ndarray distances_from_mean.
- **kwargs : optional
Additional keyword arguments are passed through to generate_random_response_matrices().
- Returns
- distance : ndarray
Array of shape shape with the squared Mahalanobis distance of the mean difference between the matrices for each truth bin:
D_M^2( mean(first.random_matrices - second.random_matrices) )
- distances_from_mean : ndarray, optional
Array of shape (N,)+shape with the squared Mahalanobis distances between the randomly generated matrix differences and the mean matrix difference for each truth bin:
D_M^2( (first.random_matrices - second.random_matrices) - mean(first.random_matrices - second.random_matrices) )
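The per-truth-bin calculation described above can be sketched in NumPy as follows. This is an illustration under assumptions, not the library's implementation: the arrays first and second are synthetic stand-ins for the random matrices with an assumed layout of (N, reco bins, truth bins), and one truth bin is shifted on purpose so that it stands out.

```python
import numpy as np

rng = np.random.default_rng(1)

N, n_reco, n_truth = 200, 4, 3   # N must exceed n_reco

# Synthetic stand-ins for the random matrices, shape (N, n_reco, n_truth).
first = 0.25 + 0.02 * rng.standard_normal((N, n_reco, n_truth))
second = 0.25 + 0.02 * rng.standard_normal((N, n_reco, n_truth))
second[:, 0, 2] += 0.1           # make truth bin 2 differ between the sets

differences = first - second
mean = differences.mean(axis=0)  # shape (n_reco, n_truth)

distance = np.empty(n_truth)
distances_from_mean = np.empty((N, n_truth))
for j in range(n_truth):
    # Covariance of the reco-bin differences within this truth bin.
    cov = np.cov(differences[:, :, j], rowvar=False)
    inv_cov = np.linalg.inv(cov)

    # Squared distance of the mean difference from zero.
    m = mean[:, j]
    distance[j] = m @ inv_cov @ m

    # Squared distances of the individual differences from their mean.
    centred = differences[:, :, j] - m
    distances_from_mean[:, j] = np.einsum(
        "ni,ij,nj->n", centred, inv_cov, centred
    )

print(distance)
```

The sample covariance of N draws has rank at most N - 1, so it can only be inverted if N exceeds the number of reco bins, which is the reason for the requirement on N. With the default normalisation of np.cov, the average of distances_from_mean over the N draws equals n_reco * (N - 1) / N in every truth bin.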
- remu.matrix_utils.plot_compatibility(first, second, filename=None, **kwargs)
Plot the compatibility of the two matrices.
- Parameters
- first, second : ResponseMatrix
Two instances of ResponseMatrix for comparison.
- filename : string
The filename where the plot will be saved.
- **kwargs : optional
Additional keyword arguments are passed to compatibility().
- Returns
- fig : Figure
The figure that was used for plotting.
- ax : Axis
The axis that was used for plotting.
- remu.matrix_utils.plot_in_bin_variation(response_matrix, filename=None, **kwargs)
Plot the maximum in-bin variation for each truth bin.
The plot will contain the minimum, maximum, and median marginalization of these maximum numbers.
- Parameters
- response_matrix : ResponseMatrix
The response matrix to plot.
- filename : string
The filename where the plot will be saved.
- **kwargs : optional
Additional keyword arguments are passed to the plotting function.
- Returns
- fig : Figure
The figure that was used for plotting.
- ax : Axis
The axis that was used for plotting.
- remu.matrix_utils.plot_mahalanobis_distance(first, second, filename=None, plot_expectation=True, **kwargs)
Plot the squared Mahalanobis distance D_M^2 between two matrices.
- Parameters
- first, second : ResponseMatrix
The two response matrices for the comparison.
- plot_expectation : bool
Also plot the expected distance.
- filename : str, optional
Save the plot to this location.
- **kwargs : optional
Additional keyword arguments are passed to the plotting function.
- Returns
- fig : Figure
The figure that has been plotted on.
- ax : Axes
The axes that have been plotted into.
Notes
The expected distance is only an estimate based on the statistics in the bins. It is not exact and should be treated as a rough guide rather than a hard compatibility criterion.
- remu.matrix_utils.plot_mean_efficiency(response_matrix, filename=None, nuisance_value=0.0, **kwargs)
Plot mean efficiencies for all truth bins.
This ignores the statistical uncertainties of the bin entries. The plot will contain the minimum, maximum, and median marginalization of these mean efficiencies.
- Parameters
- response_matrix : ResponseMatrix
The response matrix to plot.
- filename : string
The filename where the plot will be saved.
- nuisance_value : float, optional
Nuisance bins are set to this value.
- **kwargs : optional
Additional keyword arguments are passed to the plotting function.
- Returns
- fig : Figure
The figure that was used for plotting.
- ax : Axis
The axis that was used for plotting.
- remu.matrix_utils.plot_mean_response_matrix(response_matrix, filename=None, **kwargs)
Plot the smearing and efficiency.
- Parameters
- response_matrix : ResponseMatrix
The response matrix to plot.
- filename : string
The filename where the plot will be saved.
- **kwargs : optional
Additional keyword arguments are passed to the plotting function.
- Returns
- fig : Figure
The figure that was used for plotting.
- ax : Axis
The axis that was used for plotting.
- remu.matrix_utils.plot_relative_in_bin_variation(response_matrix, filename=None, **kwargs)
Plot the maximum in-bin variation relative to statistical uncertainty.
The plot will contain the minimum, maximum, and median marginalization of these maximum numbers.
- Parameters
- response_matrix : ResponseMatrix
The response matrix to plot.
- filename : string
The filename where the plot will be saved.
- **kwargs : optional
Additional keyword arguments are passed to the plotting function.
- Returns
- fig : Figure
The figure that was used for plotting.
- ax : Axis
The axis that was used for plotting.
- remu.matrix_utils.plot_statistical_uncertainty(response_matrix, filename=None, **kwargs)
Plot the maximum sqrt(statistical variance) of each truth bin.
The plot will contain the minimum, maximum, and median marginalization of these maximum numbers.
- Parameters
- response_matrix : ResponseMatrix
The response matrix to plot.
- filename : string
The filename where the plot will be saved.
- **kwargs : optional
Additional keyword arguments are passed to the plotting function.
- Returns
- fig : Figure
The figure that was used for plotting.
- ax : Axis
The axis that was used for plotting.