ResponseMatrix¶
-
class
remu.migration.
ResponseMatrix
(reco_binning, truth_binning, nuisance_indices=[], impossible_indices=[], response_binning=None)¶ Matrix that describes the detector response to true events.
Parameters: - truth_binning : RectangularBinning
The Binning object describing the truth categorization.
- reco_binning : RectangularBinning
The Binning object describing the reco categorization.
- nuisance_indices : list of ints, optional
List of indices of nuisance truth bins. These are treated like their efficiency is exactly 1.
- impossible_indices :list of ints, optional
List of indices of impossible reco bins. These are treated like their probability is exactly 0.
- response_binning : RectangularBinning, optional
The Binning object describing the reco and truth categorization. Usually this will be generated from the truth and reco binning using their
RectangularBinning.cartesian_product()
method.
Notes
The truth and reco binnings will be combined with their cartesian_product method.
The truth bins corresonding to the nuisance_indices will be treated like they have a total efficiency of 1.
The reco bins corresonding to the impossible_indices will be treated like they are filled with a probability of 0.
Two response matrices can be combined by adding them
new_resp = respA + respB
. This yields a new matrix that is equivalent to one that has been filled with the data in bothrespA
andrespB
. The truth and reco binnings inrespA
andrespB
must be identical for this to make sense.-
compatibility
(other, N=None, return_all=False, truth_indices=None, **kwargs)¶ Calculate the compatibility between this and another response matrix.
Basically, this checks whether the point of “the matrices are identical” is an outlier in the distribution of matrix differences as defined by the statistical uncertainties of the matrix elements. This is done using the Mahalanobis distance as the test statistic. If the point “the matrices are identical” is not a reasonable part of the distribution, it is not reasonable to assume that the true matrices are identical.
Parameters: - other : ResponseMatrix
The other response matrix.
- N : int, optional
Number of random matrices to be generated for the calculation. This number must be larger than the number of reco bins! Otherwise the covariances cannot be calculated correctly. Defaults to
#(reco bins) + 100)
.- return_all : bool, optional
If
False
, return only null_prob_count, and null_prob_chi2.- truth_indices : list of ints, optional
Only use the given truth indices to calculate the compatibility. If this is not specified, only the indices with at least one entry in both matrices are used.
Returns: - null_prob_count : float
The Bayesian p-value evaluated by counting the expected number of random matrix differences more extreme than the mean difference.
- null_prob_chi2 : float
The Bayesian p-value evaluated by assuming a chi-square distribution of the squares of Mahalanobis distances.
- null_distance : float
The squared Mahalanobis distance of the mean differences between the two matrices.
- distances : ndarray
The set of squared Mahalanobis distances between randomly generated matrix differences and the mean matrix difference.
- df : int
Degrees of freedom of the assumed chi-squared distribution of the squared Mahalanobis distances. This is equal to the number of matrix elements that are considered for the calculation:
df = len(truth_indices) * len(reco_bins in matrix)
.
Notes
The distribution of matrix differences is evaluated by generating
N
random response matrices from both compared matrices and calculating the (n-dimensional) differences. The resulting set of matrix differences defines the meanmean(differences)
and the covariance matrixcov(differences)
. The covariance in turn defines a metric for the Mahalanobis distanceD_M(x)
on the space of matrix differences, wherex
is a set of matrix element differences.The distance between the mean difference and the Null hypothesis, that the two true matrices are identical, is the
null_distance
:null_distance = D_M(0 - mean(differences))
The compatibility between the matrices is now defined as the Bayesian probability that the true difference between the matrices is more extreme (has a larger distance from the mean difference) than the Null hypothesis. For this, we can just evaluate the set of matrix differences that was used to calculate the covariance matrix:
distances = D_M(differences - mean(differences)) null_prob_count = np.sum(distances >= null_distance) / distances.size
It will be 1 if the mean difference between the matrices is 0, and tend to 0 when the mean difference between the matrices is far from 0. “Far” in this case is determined by the uncertainty, i.e. the covariance, of the difference determination.
In the case of normal distributed differences, the distribution of squared Mahalanobis distances becomes chi-squared distributed. The numbers of degrees of freedom of that distribution is the number of variates, i.e. the number of response matrix elements that are being considered. This can be used to calculate a theoretical value for the compatibility:
df = len(truth_indices) * #(reco_bins) null_prob_chi2 = chi2.sf(null_distance**2, df)
Since the distribution of differences is not necessarily Gaussian, this is only an estimate. Its advantage is that it is less dependent on the number of randomly drawn matrices.
-
distance
(other, N=None, return_distances_from_mean=False, **kwargs)¶ Return the overall squared Mahalanobis distance between the two matrices.
Parameters: - other : ResponseMatrix
The other ResponseMatrix for the comparison.
- N : int, optional
Number of random matrices to be generated for the calculation. This number must be larger than the number of reco bins! Otherwise the covariances cannot be calculated correctly. Defaults to
#(reco bins) + 100)
.- return_distances_from_mean : bool, optional
Also return the ndarray
distances_from_mean
.- kwargs : optional
Additional keyword arguments are passed through to generate_random_response_matrices.
Returns: - distance : flaot
The squared Mahalanobis distance of the mean difference between the matrices:
D^2( mean(self.random_matrices - other.random_matrices) )
- distances_from_mean : ndarray
Array of shape
(N,)
with the squared Mahalanobis distances between the randomly generated matrix differences and the mean matrix difference:D^2( (self.random_matrices - other.random_matrices) - mean(self.random_matrices - other.random_matrices) )
-
distance_as_ndarray
(other, shape=None, N=None, return_distances_from_mean=False, **kwargs)¶ Calculate the squared Mahalanobis distance of the two matrices for each truth bin.
Parameters: - other : ResponseMatrix
The other ResponseMatrix for the comparison.
- shape : tuple of ints, optional
The shape of the returned matrix. Defaults to
(#(truth bins),)
.- N : int, optional
Number of random matrices to be generated for the calculation. This number must be larger than the number of reco bins! Otherwise the covariances cannot be calculated correctly. Defaults to
#(reco bins) + 100)
.- return_distances_from_mean : bool, optional
Also return the ndarray
distances_from_mean
.- kwargs : optional
Additional keyword arguments are passed through to generate_random_response_matrices.
Returns: - distance : ndarray
Array of shape shape with the squared Mahalanobis distance of the mean difference between the matrices for each truth bin:
D^2( mean(self.random_matrices - other.random_matrices) )
- distances_from_mean : ndarray, optional
Array of shape
(N,)+shape
with the squared Mahalanobis distances between the randomly generated matrix differences and the mean matrix difference for each truth bin:D^2( (self.random_matrices - other.random_matrices) - mean(self.random_matrices - other.random_matrices) )
-
fill
(*args, **kwargs)¶ Fill events into the binnings.
-
fill_from_csv_file
(*args, **kwargs)¶ Fill binnings from csv file.
See
Binning.fill_from_csv_file
for a description of the parameters.See also
fill_up_truth_from_csv_file
- Re-fill only truth bins from different file.
-
fill_up_truth
(*args, **kwargs)¶ Re-fill the truth bins with the given events file.
This can be used to get proper efficiencies if the true signal events are stored separate from the reconstructed events.
It takes the same parameters as
fill()
.Notes
A new truth binning is created and filled with the events from the provided events. Each bin is compared to the corresponding bin in the already present truth binning. The larger value of the two is taken as the new truth. This way, event types that are not present in the pure truth data, e.g. background, are not affected by this. It can only increase the value of the truth bins, lowering their efficiency.
For each truth bin, one of the following must be true for this operation to make sense:
- All events in the migration matrix are also present in the new truth events. In this case, the additional truth events lower the efficiency of the truth bin. This is the case, for example, if not all true signal events are reconstructed.
- All events in the new truth events are also present in the migration matrix. In this case, the events in the new truth events have no influence on the response matrix. This is the case, for example, if only a subset of the reconstructed background is saved in the truth file.
If there are events in the response matrix that are not in the new truth events and there are events in the new truth events that are not in the response matrix, this method will lead to a wrong efficiency of the affected truth bin.
-
fill_up_truth_from_csv_file
(*args, **kwargs)¶ Re-fill the truth bins with the given csv file.
This can be used to get proper efficiencies if the true signal events are saved in a separate file from the reconstructed events.
It takes the same parameters as
fill_from_csv_file()
.Notes
A new truth binning is created and filled with the events from the provided file. Each bin is compared to the corresponding bin in the already present truth binning. The larger value of the two is taken as the new truth. This way, event types that are not present in the pure truth data, e.g. background, are not affected by this. It can only increase the value of the truth bins, lowering their efficiency.
For each truth bin, one of the following must be true for this operation to make sense:
- All events in the migration matrix are also present in the truth file. In this case, the additional truth events lower the efficiency of the truth bin. This is the case, for example, if not all true signal events are reconstructed.
- All events in the truth file are also present in the migration matrix. In this case, the events in the truth file have no influence on the response matrix. This is the case, for example, if only a subset of the reconstructed background is saved in the truth file.
If there are events in the response matrix that are not in the truth tree and there are events in the truth tree that are not in the response matrix, this method will lead to a wrong efficiency of the affected truth bin.
-
generate_random_response_matrices
(size=None, shape=None, **kwargs)¶ Generate random response matrices according to the estimated variance.
Parameters: - size : int or tuple of ints, optional
How many random matrices should be generated.
- shape : tuple of ints, optional
The shape of the returned matrices. Defaults to
(#(reco bins), #(truth bins))
.- kwargs : optional
See
get_mean_response_matrix_as_ndarray()
for a description of more optional kwargs.
Returns: - ndarray
Notes
This is a three step process:
- Draw the binomal efficiencies from Beta distributions
- Draw the multinomial reconstruction probabilities from a Dirichlet distribution.
- Draw weight corrections from normal distributions.
If no shape is specified, it will be set to
(#(reco bins, #(truth bins))
.If truth_indices are provided, a sliced matrix with only the given columns will be returned.
-
get_in_bin_variation_as_ndarray
(shape=None, truth_only=True, ignore_variables=[], variable_slices={}, truth_indices=None)¶ Get an estimate for the variation of the response within a bin.
The in-bin variation is estimated from the maximum difference to the surrounding bins. The differences are normalized to the estimated statistical errors, so values close to one indicate a statistically dominated variation.
Parameters: - shape : tuple of ints, optional
The shape of the returned ndarray. Default:
(#(reco bins), #(truth bins))
- truth_only : bool, optional
Only consider the neighbouring bins along the truth-axes.
- ignore_variables : list of strings, optional
These variables will not be considered. This is useful to exclude categorical variables, where the response is not expected to vary smoothly.
- variable_slices : dict of slices, optional
For variables in variable_slices only the specified slice will be used for comparison, e.g. variable_slices = {‘var_A’: slice(1,5)}. Useful if the response is only expected to be smooth over a given range of the variable.
- truth_indices : list of ints, optional
Return a sliced matrix with only the given columns.
Returns: - ndarray
-
get_mean_response_matrix_as_ndarray
(shape=None, **kwargs)¶ Get the means of the posterior distributions of the response matrix elements.
This is different from the “raw” matrix one gets from
get_response_matrix_as_ndarray()
. The latter simply divides the sum of weights in the respective bins.Parameters: - shape : tuple of ints, optional
The shape of the returned matrices. Defaults to
(#(reco bins), #(truth bins))
.- expected_weight : float, optional
The expected average weight of the events. This is used int the calculation of the weight variance. Default: 1.0
- nuisance_indices : list of ints, optional
List of truth bin indices. These bins will be treated like their efficiency is exactly 1. Default: Use the nuisance_indices attribute of the ResponseMatrix.
- impossible_indices : list of ints, optional
List of reco bin indices. These bins will be treated like their their probability is exactly 0. Default: Use the impossible_indices attribute of the ResponseMatrix.
- truth_indices : list of ints, optional
List of truth bin indices. Only return the response of the given truth bins. Default: Return full matrices.
Returns: - ndarray
-
get_reco_entries_as_ndarray
(*args, **kwargs)¶ Get the number of entries in the reco binning as ndarray.
-
get_reco_sumw2_as_ndarray
(*args, **kwargs)¶ Get the sum of squared weights in the reco binning as ndarray.
-
get_reco_values_as_ndarray
(*args, **kwargs)¶ Get the values of the reco binning as ndarray.
-
get_response_entries_as_ndarray
(*args, **kwargs)¶ Get the number of entries in the response binning as ndarray.
-
get_response_matrix_as_ndarray
(shape=None, truth_indices=None)¶ Return the ResponseMatrix as a ndarray.
Uses the information in the truth and response binnings to calculate the response matrix.
Parameters: - shape : tuple of ints, optional
The shape of the returned ndarray. Default:
(#(reco bins), #(truth bins))
- truth_indices : list of ints, optional
Only return the response of the given truth bins. Default: Return full matrix.
Returns: - ndarray
See also
Notes
If shape is None, it s set to
(#(reco bins), #(truth bins))
. The expected response of a truth vector can then be calculated like this:v_reco = response_matrix.dot(v_truth)
If truth_indices are provided, a sliced matrix with only the given columns will be returned.
-
get_response_sumw2_as_ndarray
(*args, **kwargs)¶ Get the sum of squared weights in the response binning as ndarray.
-
get_response_values_as_ndarray
(*args, **kwargs)¶ Get the values of the response binning as ndarray.
-
get_statistical_variance_as_ndarray
(shape=None, **kwargs)¶ Get the statistical variance of the single ResponseMatrix elements as ndarray.
The variance is estimated from the actual bin contents in a Bayesian motivated way.
Parameters: - shape : tuple of ints, optional
The shape of the returned matrix. Defaults to
(#(reco bins), #(truth bins))
.- kwargs : optional
See
get_mean_response_matrix_as_ndarray()
for a description of more optional kwargs.
Returns: - ndarray
Notes
The response matrix creation is modeled as a three step process:
- Reconstruction efficiency according to a binomial process.
- Distribution of truth events among the reco bins according to a multinomial distribution.
- Correction of the categorical probabilities according to the mean weights of the events in each bin.
So the response matrix element can be written like this:
R_ij = m_ij * p_ij * eff_j
where
eff_j
is the total efficiency of events in truth binj
,p_ij
is the unweighted multinomial reconstruction probability in reco bini
andm_ij
the weight correction. The variance ofR_ij
is estimated by estimating the variances of these values separately.The variance of
eff_j
is estimated by using the Bayesian conjugate prior for biinomial distributions: the Beta distribution. We assume a prior that is uniform in the reconstruction efficiency. We then update it with the simulated events. The variance of the posterior distribution is taken as the variance of the efficiency.The variance of
p_ij
is estimated by using the Bayesian conjugate prior for multinomial distributions: the Dirichlet distribution. We assume a prior that is uniform in the ignorant about reconstruction probabilities. We then update it with the simulated events. The variance of the posterior distribution is taken as the variance of the transition probability.If a list of nuisance_indices is provided, the probabilities of not reconstructing events in the respective truth categories will be fixed to 0. This is useful for background categories where one is not interested in the true number of events.
If a list of impossible_indices is provided, the probabilities of reconstructing events in the respective reco categories will be fixed to 0. This is useful for bins that are impossible to have any events in them by theiur definition.
The variances of m_ij is estimated from the errors of the average weights in the matrix elements as classical “standard error of the mean”. To avoid problems with bins with 0 or 1 entries, we add a “prior expectation” point to the data. This ensures that all bins have at least 1 entry (no divisions by zero) and that variances can be estimated even for bins with only one (true) entry (from the difference to the expected value).
This is just an estimate! The true variance of the randomly generated response matrices can deviate from the returned numbers. Also, these variances ignore the correlations between matrix elements.
If no shape is specified, it will be set to (N_reco, N_truth).
If truth_indices are provided, a sliced matrix with only the given columns will be returned.
-
get_truth_entries_as_ndarray
(*args, **kwargs)¶ Get the number of entries in the truth binning as ndarray.
-
get_truth_sumw2_as_ndarray
(*args, **kwargs)¶ Get the sum of squared weights in the truth binning as ndarray.
-
get_truth_values_as_ndarray
(*args, **kwargs)¶ Get the values of the truth binning as ndarray.
-
maximize_stats_by_rebinning
(in_bin_variation_limit=5.0, select='entries', ignore_variables=[], variable_slices={}, **kwargs)¶ Maximize the number of events per bin by rebinning the matrix.
Parameters: - in_bin_variation_limit : float, optional
Bins will only be merged if the maximum in-bin variation of the resulting matrix does not exceed the in_bin_variation_limit.
- select : {‘entries’, ‘entries_sum’, ‘in-bin’, ‘in-bin_sum’}
Determines how the merging candidate is selected:
- entries
the bin with the lowest number of truth entries
- entries_sum
the pair of bins with the lowest number of truth entries
- in-bin
the bin with the lowest maximum in-bin variation
- in-bin_sum
the pair of bins with the lowest sum of maximum in-bin variations
- kwargs : optional
Additional keyword arguments will be passed to the method
get_in_bin_variation_as_ndarray()
.
See also
-
plot_compatibility
(filename, other)¶ Plot the compatibility of the two matrices.
Parameters: - filename : string
The filename where the plot will be stored.
- other : ResponseMatrix
The other Response Matrix for the comparison.
Returns: - fig : Figure
The figure that was used for plotting.
- ax : Axis
The axis that was used for plotting.
-
plot_distance
(filename, other, expectation=True, variables=None, kwargs1d={}, kwargs2d={}, figax=None, reduction_function=<function sum>, **kwargs)¶ Plot the squared mahalanobis distance between two matrices.
Parameters: - filename : string
The filename of the plot to be stored.
- other : ResponseMatrix
The other ResponseMatrix for the comparison
- expectation : bool, optional
Plot an approximation of the expected values for identical matrices in the 1D histograms.
- variables : {None, (None, None), list, (list, list)}, optional
None
, plot marginal histograms for all variables.(None, None)
, plot 2D histograms of all possible variable combinations.list
, plot marginal histograms for these variables.(list, list)
, plot 2D histograms of the Cartesian product of the two variable lists. 2D histograms where both variables are identical are plotted as 1D histograms.- kwargs1d, kwargs2d : dict, optional
- Additional keyword arguments for the 1D/2D histograms.
If the key label is present, a legend will be drawn.
- figax : (figure, axes), optional
Pair of figure and axes to be used for plotting. Can be used to plot multiple binnings on top of one another. Default: Create new figure and axes.
- reduction_function : function, optional
Use this function to marginalize out variables.
- kwargs : optional
Additional kwargs will be passed to calculate_squared_mahalanobis_distance.
Returns: - fig : Figure
The Figure that was drawn on.
- ax : Axis
The axes objects.
-
plot_entries
(*args, **kwargs)¶ Plot the entries of the response binning.
This plots the distribution of events that have both a truth and reco bin.
-
plot_expected_efficiency
(filename, variables=None, kwargs1d={}, kwargs2d={}, figax=None, **kwargs)¶ Plot expected efficiencies for projections on all truth variables.
This assumes the truth values are distributed like the generator data. This does not consider the statistical uncertainty of the matrix elements.
Parameters: - filename : string or None
The target filename of the plot. If None, the plot fill not be saved to disk. This is only useful with the figax option.
- variables : optional
One of the following:
- list of strings
List of variables to plot marginal histograms for.
- None
Plot marginal histograms for all variables.
- (list of strings, list of strings)
Plot 2D histograms of the cartesian product of the two variable lists. 2D histograms where both variables are identical are plotted as 1D histograms.
- (None, None)
Plot 2D histograms of all possible variable combinations. 2D histograms where both variables are identical are plotted as 1D histograms.
Default: None
- kwargs1d, kwargs2d : dict, optional
Additional keyword arguments for the 1D/2D histograms. If the key label is present, a legend will be drawn.
- figax : tuple of (Figure, list of list of Axis), optional
Pair of figure and axes to be used for plotting. Can be used to plot multiple binnings on top of one another. Default: Create new figure and axes.
- kwargs : optional
Additional kwargs will be passed to
get_response_values_as_ndarray()
andget_truth_values_as_ndarray()
.
Returns: - fig : Figure
The Figure that was used for plotting.
- ax : list of list of Axis
The axes that were used for plotting.
-
plot_in_bin_variation
(filename, variables=None, kwargs1d={}, kwargs2d={}, figax=None, **kwargs)¶ Plot the maximum in-bin variation.
Parameters: - filename : string or None
The target filename of the plot. If None, the plot fill not be saved to disk. This is only useful with the figax option.
- variables : optional
One of the following:
- list of strings
List of variables to plot marginal histograms for.
- None
Plot marginal histograms for all variables.
- (list of strings, list of strings)
Plot 2D histograms of the cartesian product of the two variable lists. 2D histograms where both variables are identical are plotted as 1D histograms.
- (None, None)
Plot 2D histograms of all possible variable combinations. 2D histograms where both variables are identical are plotted as 1D histograms.
Default: None
- kwargs1d, kwargs2d : dict, optional
Additional keyword arguments for the 1D/2D histograms. If the key label is present, a legend will be drawn.
- figax : tuple of (Figure, list of list of Axis), optional
Pair of figure and axes to be used for plotting. Can be used to plot multiple binnings on top of one another. Default: Create new figure and axes.
- kwargs : optional
Additional kwargs will be passed on to
get_in_bin_variation_as_ndarray()
.
Returns: - fig : Figure
The Figure that was used for plotting.
- ax : list of list of Axis
The axes that were used for plotting.
See also
-
plot_statistical_variance
(filename, variables=None, kwargs1d={}, kwargs2d={}, figax=None, **kwargs)¶ Plot the maximum statistical variation for projections on all truth variables.
Parameters: - filename : string or None
The target filename of the plot. If None, the plot fill not be saved to disk. This is only useful with the figax option.
- variables : optional
One of the following:
- list of strings
List of variables to plot marginal histograms for.
- None
Plot marginal histograms for all variables.
- (list of strings, list of strings)
Plot 2D histograms of the cartesian product of the two variable lists. 2D histograms where both variables are identical are plotted as 1D histograms.
- (None, None)
Plot 2D histograms of all possible variable combinations. 2D histograms where both variables are identical are plotted as 1D histograms.
Default: None
- kwargs1d, kwargs2d : dict, optional
Additional keyword arguments for the 1D/2D histograms. If the key label is present, a legend will be drawn.
- figax : tuple of (Figure, list of list of Axis), optional
Pair of figure and axes to be used for plotting. Can be used to plot multiple binnings on top of one another. Default: Create new figure and axes.
- kwargs : optional
Additional kwargs will be passed on to
get_statistical_variance_as_ndarray()
.
Returns: - fig : Figure
The Figure that was used for plotting.
- ax : list of list of Axis
The axes that were used for plotting.
See also
-
plot_values
(*args, **kwargs)¶ Plot the values of the response binning.
This plots the distribution of events that have both a truth and reco bin.
See also
remu.Binning.RectangularBinning.plot_values
-
rebin
(remove_binedges)¶ Return a new ResponseMatrix with the given bin edges removed.
The values of the bins adjacent to the removed bin edges will be summed up in the resulting larger bin. Please note that bin values are lost if the first or last binedge of a variable are removed.
Parameters: - remove_binedges : dict of list of ints
A dictionary specifying the bin edge indices of each variable that should be removed. Binning variables that are not part of the dictionary are kept as is. E.g. if you want to remove bin edge 2 in
var_A
and bin edges 3, 4 and 7 invar_C
:remove_binedges = { 'var_A': [2], 'var_C': [3, 4, 7] }
Returns: - ResponseMatrix
The new response matrix with the given bin edges removed.
Warning
Please note that the nuisance_indices and impossible_indices of the new matrix are set to []!
-
reset
()¶ Reset all binnings.