ResponseMatrix

class remu.migration.ResponseMatrix(reco_binning, truth_binning, nuisance_indices=[], impossible_indices=[], response_binning=None)[source]

Matrix that describes the detector response to true events.

Parameters:
truth_binning : RectangularBinning

The Binning object describing the truth categorization.

reco_binning : RectangularBinning

The Binning object describing the reco categorization.

nuisance_indices : list of ints, optional

List of indices of nuisance truth bins. These are treated like their efficiency is exactly 1.

impossible_indices :list of ints, optional

List of indices of impossible reco bins. These are treated like their probability is exactly 0.

response_binning : RectangularBinning, optional

The Binning object describing the reco and truth categorization. Usually this will be generated from the truth and reco binning using their RectangularBinning.cartesian_product() method.

Notes

The truth and reco binnings will be combined with their cartesian_product method.

The truth bins corresonding to the nuisance_indices will be treated like they have a total efficiency of 1.

The reco bins corresonding to the impossible_indices will be treated like they are filled with a probability of 0.

Two response matrices can be combined by adding them new_resp = respA + respB. This yields a new matrix that is equivalent to one that has been filled with the data in both respA and respB. The truth and reco binnings in respA and respB must be identical for this to make sense.

compatibility(other, N=None, return_all=False, truth_indices=None, **kwargs)[source]

Calculate the compatibility between this and another response matrix.

Basically, this checks whether the point of “the matrices are identical” is an outlier in the distribution of matrix differences as defined by the statistical uncertainties of the matrix elements. This is done using the Mahalanobis distance as the test statistic. If the point “the matrices are identical” is not a reasonable part of the distribution, it is not reasonable to assume that the true matrices are identical.

Parameters:
other : ResponseMatrix

The other response matrix.

N : int, optional

Number of random matrices to be generated for the calculation. This number must be larger than the number of reco bins! Otherwise the covariances cannot be calculated correctly. Defaults to #(reco bins) + 100).

return_all : bool, optional

If False, return only null_prob_count, and null_prob_chi2.

truth_indices : list of ints, optional

Only use the given truth indices to calculate the compatibility. If this is not specified, only the indices with at least one entry in both matrices are used.

Returns:
null_prob_count : float

The Bayesian p-value evaluated by counting the expected number of random matrix differences more extreme than the mean difference.

null_prob_chi2 : float

The Bayesian p-value evaluated by assuming a chi-square distribution of the squares of Mahalanobis distances.

null_distance : float

The squared Mahalanobis distance of the mean differences between the two matrices.

distances : ndarray

The set of squared Mahalanobis distances between randomly generated matrix differences and the mean matrix difference.

df : int

Degrees of freedom of the assumed chi-squared distribution of the squared Mahalanobis distances. This is equal to the number of matrix elements that are considered for the calculation: df = len(truth_indices) * len(reco_bins in matrix).

Notes

The distribution of matrix differences is evaluated by generating N random response matrices from both compared matrices and calculating the (n-dimensional) differences. The resulting set of matrix differences defines the mean mean(differences) and the covariance matrix cov(differences). The covariance in turn defines a metric for the Mahalanobis distance D_M(x) on the space of matrix differences, where x is a set of matrix element differences.

The distance between the mean difference and the Null hypothesis, that the two true matrices are identical, is the null_distance:

null_distance = D_M(0 - mean(differences))

The compatibility between the matrices is now defined as the Bayesian probability that the true difference between the matrices is more extreme (has a larger distance from the mean difference) than the Null hypothesis. For this, we can just evaluate the set of matrix differences that was used to calculate the covariance matrix:

distances = D_M(differences - mean(differences))
null_prob_count = np.sum(distances >= null_distance) / distances.size

It will be 1 if the mean difference between the matrices is 0, and tend to 0 when the mean difference between the matrices is far from 0. “Far” in this case is determined by the uncertainty, i.e. the covariance, of the difference determination.

In the case of normal distributed differences, the distribution of squared Mahalanobis distances becomes chi-squared distributed. The numbers of degrees of freedom of that distribution is the number of variates, i.e. the number of response matrix elements that are being considered. This can be used to calculate a theoretical value for the compatibility:

df = len(truth_indices) * #(reco_bins)
null_prob_chi2 = chi2.sf(null_distance**2, df)

Since the distribution of differences is not necessarily Gaussian, this is only an estimate. Its advantage is that it is less dependent on the number of randomly drawn matrices.

distance(other, N=None, return_distances_from_mean=False, **kwargs)[source]

Return the overall squared Mahalanobis distance between the two matrices.

Parameters:
other : ResponseMatrix

The other ResponseMatrix for the comparison.

N : int, optional

Number of random matrices to be generated for the calculation. This number must be larger than the number of reco bins! Otherwise the covariances cannot be calculated correctly. Defaults to #(reco bins) + 100).

return_distances_from_mean : bool, optional

Also return the ndarray distances_from_mean.

kwargs : optional

Additional keyword arguments are passed through to generate_random_response_matrices.

Returns:
distance : flaot

The squared Mahalanobis distance of the mean difference between the matrices: D^2( mean(self.random_matrices - other.random_matrices) )

distances_from_mean : ndarray

Array of shape (N,) with the squared Mahalanobis distances between the randomly generated matrix differences and the mean matrix difference: D^2( (self.random_matrices - other.random_matrices) - mean(self.random_matrices - other.random_matrices) )

distance_as_ndarray(other, shape=None, N=None, return_distances_from_mean=False, **kwargs)[source]

Calculate the squared Mahalanobis distance of the two matrices for each truth bin.

Parameters:
other : ResponseMatrix

The other ResponseMatrix for the comparison.

shape : tuple of ints, optional

The shape of the returned matrix. Defaults to (#(truth bins),).

N : int, optional

Number of random matrices to be generated for the calculation. This number must be larger than the number of reco bins! Otherwise the covariances cannot be calculated correctly. Defaults to #(reco bins) + 100).

return_distances_from_mean : bool, optional

Also return the ndarray distances_from_mean.

kwargs : optional

Additional keyword arguments are passed through to generate_random_response_matrices.

Returns:
distance : ndarray

Array of shape shape with the squared Mahalanobis distance of the mean difference between the matrices for each truth bin: D^2( mean(self.random_matrices - other.random_matrices) )

distances_from_mean : ndarray, optional

Array of shape (N,)+shape with the squared Mahalanobis distances between the randomly generated matrix differences and the mean matrix difference for each truth bin: D^2( (self.random_matrices - other.random_matrices) - mean(self.random_matrices - other.random_matrices) )

fill(event, weight=1.0)[source]

Fill events into the binnings.

fill_from_csv_file(*args, **kwargs)[source]

Fill binnings from csv file.

See Binning.fill_from_csv_file for a description of the parameters.

See also

fill_up_truth_from_csv_file
Re-fill only truth bins from different file.
fill_up_truth_from_csv_file(*args, **kwargs)[source]

Re-fill the truth bins with the given csv file.

This can be used to get proper efficiencies if the true signal events are saved in a separate file from the reconstructed events.

It takes the same parameters as fill_from_csv_file().

Notes

A new truth binning is created and filled with the events from the provided file. Each bin is compared to the corresponding bin in the already present truth binning. The larger value of the two is taken as the new truth. This way, event types that are not present in the pure truth data, e.g. background, are not affected by this. It can only increase the value of the truth bins, lowering their efficiency.

For each truth bin, one of the following must be true for this operation to make sense:

  • All events in the migration matrix are also present in the truth file. In this case, the additional truth events lower the efficiency of the truth bin. This is the case, for example, if not all true signal events are reconstructed.
  • All events in the truth file are also present in the migration matrix. In this case, the events in the truth file have no influence on the response matrix. This is the case, for example, if only a subset of the reconstructed background is saved in the truth file.

If there are events in the response matrix that are not in the truth tree and there are events in the truth tree that are not in the response matrix, this method will lead to a wrong efficiency of the affected truth bin.

generate_random_response_matrices(size=None, shape=None, **kwargs)[source]

Generate random response matrices according to the estimated variance.

Parameters:
size : int or tuple of ints, optional

How many random matrices should be generated.

shape : tuple of ints, optional

The shape of the returned matrices. Defaults to (#(reco bins), #(truth bins)).

kwargs : optional

See get_mean_response_matrix_as_ndarray() for a description of more optional kwargs.

Returns:
ndarray

Notes

This is a three step process:

  1. Draw the binomal efficiencies from Beta distributions
  2. Draw the multinomial reconstruction probabilities from a Dirichlet distribution.
  3. Draw weight corrections from normal distributions.

If no shape is specified, it will be set to (#(reco bins, #(truth bins)).

If truth_indices are provided, a sliced matrix with only the given columns will be returned.

get_in_bin_variation_as_ndarray(shape=None, truth_only=True, ignore_variables=[], variable_slices={}, truth_indices=None)[source]

Get an estimate for the variation of the response within a bin.

The in-bin variation is estimated from the maximum difference to the surrounding bins. The differences are normalized to the estimated statistical errors, so values close to one indicate a statistically dominated variation.

Parameters:
shape : tuple of ints, optional

The shape of the returned ndarray. Default: (#(reco bins), #(truth bins))

truth_only : bool, optional

Only consider the neighbouring bins along the truth-axes.

ignore_variables : list of strings, optional

These variables will not be considered. This is useful to exclude categorical variables, where the response is not expected to vary smoothly.

variable_slices : dict of slices, optional

For variables in variable_slices only the specified slice will be used for comparison, e.g. variable_slices = {‘var_A’: slice(1,5)}. Useful if the response is only expected to be smooth over a given range of the variable.

truth_indices : list of ints, optional

Return a sliced matrix with only the given columns.

Returns:
ndarray
get_mean_response_matrix_as_ndarray(shape=None, **kwargs)[source]

Get the means of the posterior distributions of the response matrix elements.

This is different from the “raw” matrix one gets from get_response_matrix_as_ndarray(). The latter simply divides the sum of weights in the respective bins.

Parameters:
shape : tuple of ints, optional

The shape of the returned matrices. Defaults to (#(reco bins), #(truth bins)).

expected_weight : float, optional

The expected average weight of the events. This is used int the calculation of the weight variance. Default: 1.0

nuisance_indices : list of ints, optional

List of truth bin indices. These bins will be treated like their efficiency is exactly 1. Default: Use the nuisance_indices attribute of the ResponseMatrix.

impossible_indices : list of ints, optional

List of reco bin indices. These bins will be treated like their their probability is exactly 0. Default: Use the impossible_indices attribute of the ResponseMatrix.

truth_indices : list of ints, optional

List of truth bin indices. Only return the response of the given truth bins. Default: Return full matrices.

Returns:
ndarray
get_reco_entries_as_ndarray(*args, **kwargs)[source]

Get the number of entries in the reco binning as ndarray.

get_reco_sumw2_as_ndarray(*args, **kwargs)[source]

Get the sum of squared weights in the reco binning as ndarray.

get_reco_values_as_ndarray(*args, **kwargs)[source]

Get the values of the reco binning as ndarray.

get_response_entries_as_ndarray(*args, **kwargs)[source]

Get the number of entries in the response binning as ndarray.

get_response_matrix_as_ndarray(shape=None, truth_indices=None)[source]

Return the ResponseMatrix as a ndarray.

Uses the information in the truth and response binnings to calculate the response matrix.

Parameters:
shape : tuple of ints, optional

The shape of the returned ndarray. Default: (#(reco bins), #(truth bins))

truth_indices : list of ints, optional

Only return the response of the given truth bins. Default: Return full matrix.

Returns:
ndarray

Notes

If shape is None, it s set to (#(reco bins), #(truth bins)). The expected response of a truth vector can then be calculated like this:

v_reco = response_matrix.dot(v_truth)

If truth_indices are provided, a sliced matrix with only the given columns will be returned.

get_response_sumw2_as_ndarray(*args, **kwargs)[source]

Get the sum of squared weights in the response binning as ndarray.

get_response_values_as_ndarray(*args, **kwargs)[source]

Get the values of the response binning as ndarray.

get_statistical_variance_as_ndarray(shape=None, **kwargs)[source]

Get the statistical variance of the single ResponseMatrix elements as ndarray.

The variance is estimated from the actual bin contents in a Bayesian motivated way.

Parameters:
shape : tuple of ints, optional

The shape of the returned matrix. Defaults to (#(reco bins), #(truth bins)).

kwargs : optional

See get_mean_response_matrix_as_ndarray() for a description of more optional kwargs.

Returns:
ndarray

Notes

The response matrix creation is modeled as a three step process:

  1. Reconstruction efficiency according to a binomial process.
  2. Distribution of truth events among the reco bins according to a multinomial distribution.
  3. Correction of the categorical probabilities according to the mean weights of the events in each bin.

So the response matrix element can be written like this:

R_ij = m_ij * p_ij * eff_j

where eff_j is the total efficiency of events in truth bin j, p_ij is the unweighted multinomial reconstruction probability in reco bin i and m_ij the weight correction. The variance of R_ij is estimated by estimating the variances of these values separately.

The variance of eff_j is estimated by using the Bayesian conjugate prior for biinomial distributions: the Beta distribution. We assume a prior that is uniform in the reconstruction efficiency. We then update it with the simulated events. The variance of the posterior distribution is taken as the variance of the efficiency.

The variance of p_ij is estimated by using the Bayesian conjugate prior for multinomial distributions: the Dirichlet distribution. We assume a prior that is uniform in the ignorant about reconstruction probabilities. We then update it with the simulated events. The variance of the posterior distribution is taken as the variance of the transition probability.

If a list of nuisance_indices is provided, the probabilities of not reconstructing events in the respective truth categories will be fixed to 0. This is useful for background categories where one is not interested in the true number of events.

If a list of impossible_indices is provided, the probabilities of reconstructing events in the respective reco categories will be fixed to 0. This is useful for bins that are impossible to have any events in them by theiur definition.

The variances of m_ij is estimated from the errors of the average weights in the matrix elements as classical “standard error of the mean”. To avoid problems with bins with 0 or 1 entries, we add a “prior expectation” point to the data. This ensures that all bins have at least 1 entry (no divisions by zero) and that variances can be estimated even for bins with only one (true) entry (from the difference to the expected value).

This is just an estimate! The true variance of the randomly generated response matrices can deviate from the returned numbers. Also, these variances ignore the correlations between matrix elements.

If no shape is specified, it will be set to (N_reco, N_truth).

If truth_indices are provided, a sliced matrix with only the given columns will be returned.

get_truth_entries_as_ndarray(*args, **kwargs)[source]

Get the number of entries in the truth binning as ndarray.

get_truth_sumw2_as_ndarray(*args, **kwargs)[source]

Get the sum of squared weights in the truth binning as ndarray.

get_truth_values_as_ndarray(*args, **kwargs)[source]

Get the values of the truth binning as ndarray.

maximize_stats_by_rebinning(in_bin_variation_limit=5.0, select='entries', ignore_variables=[], variable_slices={}, **kwargs)[source]

Maximize the number of events per bin by rebinning the matrix.

Parameters:
in_bin_variation_limit : float, optional

Bins will only be merged if the maximum in-bin variation of the resulting matrix does not exceed the in_bin_variation_limit.

select : {‘entries’, ‘entries_sum’, ‘in-bin’, ‘in-bin_sum’}

Determines how the merging candidate is selected:

entries

the bin with the lowest number of truth entries

entries_sum

the pair of bins with the lowest number of truth entries

in-bin

the bin with the lowest maximum in-bin variation

in-bin_sum

the pair of bins with the lowest sum of maximum in-bin variations

kwargs : optional

Additional keyword arguments will be passed to the method get_in_bin_variation_as_ndarray().

plot_compatibility(filename, other)[source]

Plot the compatibility of the two matrices.

Parameters:
filename : string

The filename where the plot will be stored.

other : ResponseMatrix

The other Response Matrix for the comparison.

Returns:
fig : Figure

The figure that was used for plotting.

ax : Axis

The axis that was used for plotting.

plot_distance(filename, other, expectation=True, variables=None, kwargs1d={}, kwargs2d={}, figax=None, reduction_function=<function sum>, **kwargs)[source]

Plot the squared mahalanobis distance between two matrices.

Parameters:
filename : string

The filename of the plot to be stored.

other : ResponseMatrix

The other ResponseMatrix for the comparison

expectation : bool, optional

Plot an approximation of the expected values for identical matrices in the 1D histograms.

variables : {None, (None, None), list, (list, list)}, optional

None, plot marginal histograms for all variables. (None, None), plot 2D histograms of all possible variable combinations. list, plot marginal histograms for these variables. (list, list), plot 2D histograms of the Cartesian product of the two variable lists. 2D histograms where both variables are identical are plotted as 1D histograms.

kwargs1d, kwargs2d : dict, optional
Additional keyword arguments for the 1D/2D histograms.

If the key label is present, a legend will be drawn.

figax : (figure, axes), optional

Pair of figure and axes to be used for plotting. Can be used to plot multiple binnings on top of one another. Default: Create new figure and axes.

reduction_function : function, optional

Use this function to marginalize out variables.

kwargs : optional

Additional kwargs will be passed to calculate_squared_mahalanobis_distance.

Returns:
fig : Figure

The Figure that was drawn on.

ax : Axis

The axes objects.

plot_entries(*args, **kwargs)[source]

Plot the entries of the response binning.

This plots the distribution of events that have both a truth and reco bin.

plot_expected_efficiency(filename, variables=None, kwargs1d={}, kwargs2d={}, figax=None, **kwargs)[source]

Plot expected efficiencies for projections on all truth variables.

This assumes the truth values are distributed like the generator data. This does not consider the statistical uncertainty of the matrix elements.

Parameters:
filename : string or None

The target filename of the plot. If None, the plot fill not be saved to disk. This is only useful with the figax option.

variables : optional

One of the following:

list of strings

List of variables to plot marginal histograms for.

None

Plot marginal histograms for all variables.

(list of strings, list of strings)

Plot 2D histograms of the cartesian product of the two variable lists. 2D histograms where both variables are identical are plotted as 1D histograms.

(None, None)

Plot 2D histograms of all possible variable combinations. 2D histograms where both variables are identical are plotted as 1D histograms.

Default: None

kwargs1d, kwargs2d : dict, optional

Additional keyword arguments for the 1D/2D histograms. If the key label is present, a legend will be drawn.

figax : tuple of (Figure, list of list of Axis), optional

Pair of figure and axes to be used for plotting. Can be used to plot multiple binnings on top of one another. Default: Create new figure and axes.

kwargs : optional

Additional kwargs will be passed to get_response_values_as_ndarray() and get_truth_values_as_ndarray().

Returns:
fig : Figure

The Figure that was used for plotting.

ax : list of list of Axis

The axes that were used for plotting.

plot_in_bin_variation(filename, variables=None, kwargs1d={}, kwargs2d={}, figax=None, **kwargs)[source]

Plot the maximum in-bin variation.

Parameters:
filename : string or None

The target filename of the plot. If None, the plot fill not be saved to disk. This is only useful with the figax option.

variables : optional

One of the following:

list of strings

List of variables to plot marginal histograms for.

None

Plot marginal histograms for all variables.

(list of strings, list of strings)

Plot 2D histograms of the cartesian product of the two variable lists. 2D histograms where both variables are identical are plotted as 1D histograms.

(None, None)

Plot 2D histograms of all possible variable combinations. 2D histograms where both variables are identical are plotted as 1D histograms.

Default: None

kwargs1d, kwargs2d : dict, optional

Additional keyword arguments for the 1D/2D histograms. If the key label is present, a legend will be drawn.

figax : tuple of (Figure, list of list of Axis), optional

Pair of figure and axes to be used for plotting. Can be used to plot multiple binnings on top of one another. Default: Create new figure and axes.

kwargs : optional

Additional kwargs will be passed on to get_in_bin_variation_as_ndarray().

Returns:
fig : Figure

The Figure that was used for plotting.

ax : list of list of Axis

The axes that were used for plotting.

plot_statistical_variance(filename, variables=None, kwargs1d={}, kwargs2d={}, figax=None, **kwargs)[source]

Plot the maximum statistical variation for projections on all truth variables.

Parameters:
filename : string or None

The target filename of the plot. If None, the plot fill not be saved to disk. This is only useful with the figax option.

variables : optional

One of the following:

list of strings

List of variables to plot marginal histograms for.

None

Plot marginal histograms for all variables.

(list of strings, list of strings)

Plot 2D histograms of the cartesian product of the two variable lists. 2D histograms where both variables are identical are plotted as 1D histograms.

(None, None)

Plot 2D histograms of all possible variable combinations. 2D histograms where both variables are identical are plotted as 1D histograms.

Default: None

kwargs1d, kwargs2d : dict, optional

Additional keyword arguments for the 1D/2D histograms. If the key label is present, a legend will be drawn.

figax : tuple of (Figure, list of list of Axis), optional

Pair of figure and axes to be used for plotting. Can be used to plot multiple binnings on top of one another. Default: Create new figure and axes.

kwargs : optional

Additional kwargs will be passed on to get_statistical_variance_as_ndarray().

Returns:
fig : Figure

The Figure that was used for plotting.

ax : list of list of Axis

The axes that were used for plotting.

plot_values(*args, **kwargs)[source]

Plot the values of the response binning.

This plots the distribution of events that have both a truth and reco bin.

See also

remu.Binning.RectangularBinning.plot_values

rebin(remove_binedges)[source]

Return a new ResponseMatrix with the given bin edges removed.

The values of the bins adjacent to the removed bin edges will be summed up in the resulting larger bin. Please note that bin values are lost if the first or last binedge of a variable are removed.

Parameters:
remove_binedges : dict of list of ints

A dictionary specifying the bin edge indices of each variable that should be removed. Binning variables that are not part of the dictionary are kept as is. E.g. if you want to remove bin edge 2 in var_A and bin edges 3, 4 and 7 in var_C:

remove_binedges = { 'var_A': [2], 'var_C': [3, 4, 7] }
Returns:
ResponseMatrix

The new response matrix with the given bin edges removed.

Warning

Please note that the nuisance_indices and impossible_indices of the new matrix are set to []!

reset()[source]

Reset all binnings.