Example 05 – Backgrounds

Aims

  • Understand and deal with background events in measurements

Instructions

Real experiments will almost always include some sort of background or noise events in the recorded data. ReMU is well equipped to deal with those in a consistent fashion.

Let us first create the noisy data as well as simulations of background and noise events:

# Create "real" data
../simple_experiment/run_experiment.py 10 real_data.txt --enable-background

# Create BG simulation
../simple_experiment/simulate_experiment.py 100 background nominal_bg_data.txt bg_truth.txt

# Create Noise simulation
../simple_experiment/simulate_experiment.py 100 noise noise_data.txt noise_truth.txt

# Create Variations
../simple_experiment/vary_detector.py nominal_bg_data.txt bg_data.txt

In this context, we call “noise” events that are recorded but do not have meaningful true information associated with them. “Background”, on the other hand, can be understood as regular events in the detector that we are simply not interested in (e.g. because they are created by a different process than our main subject of study).

Background events can behave differently from signal events, but because they also have well-defined true information, we can treat them just like signal. The only difference is that they will occupy different truth bins than the signal events.

Noise events, in contrast, do not have meaningful true properties, so we will just assign them all to a single truth bin. The content of that single bin will then determine the number of expected noise events in the reconstructed binning.

Starting from the truth binning we created for the signal models in a previous example, let us create a binning for the background events:

import numpy as np
from remu import binning
from remu import migration
from remu import matrix_utils
from remu import plotting

builder = migration.ResponseMatrixArrayBuilder(1)

with open("../01/reco-binning.yml", 'rt') as f:
    reco_binning = binning.yaml.full_load(f)
with open("../01/optimised-truth-binning.yml", 'rt') as f:
    signal_binning = binning.yaml.full_load(f)

# Start from a copy of the signal truth binning
bg_binning = signal_binning.clone()
resp = migration.ResponseMatrix(reco_binning, bg_binning)

# Fill with the first set of toy weights
i = 0
resp.fill_from_csv_file("bg_data.txt", weightfield='weight_%i'%(i,),
    rename={'reco_x_%i'%(i,): 'reco_x'}, buffer_csv_files=True)
resp.fill_up_truth_from_csv_file("bg_truth.txt", buffer_csv_files=True)

# Merge bins until every truth bin contains at least 10 events
entries = resp.get_truth_entries_as_ndarray()
while np.min(entries) < 10:
    resp = matrix_utils.improve_stats(resp)
    entries = resp.get_truth_entries_as_ndarray()
bg_binning = resp.truth_binning

# Clear the filled events before re-using the binnings
bg_binning.reset()
reco_binning.reset()

When building the response matrix, this bg_binning will now contain at least 10 events in each truth bin.
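
To convince ourselves that the loop terminated for the right reason, we can inspect the entry counts it produced. This is only a sanity check and uses nothing but the entries array from the loop above:

# `entries` still holds the truth-bin entry counts from the last
# loop iteration, taken before the binnings were reset
assert np.min(entries) >= 10
print("background truth bins:", len(entries))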

Next we will create a binning that can distinguish between noise, background and signal events according to a variable aptly called event_type:

truth_binning = binning.LinearBinning(
    variable = 'event_type',
    bin_edges = [-1.5, -0.5, 0.5, 1.5],
    subbinnings = {
        1: bg_binning,
        2: signal_binning,
        }
    )

with open("truth-binning.yml", 'wt') as f:
    binning.yaml.dump(truth_binning, f)

On its own, the LinearBinning would have only three bins, but we insert the previously created background and signal binnings as subbinnings into the second and third bin respectively. Events that fall into those bins will thus be further subdivided by the subbinnings. The result is a binning that puts noise events (event_type == -1) into a single bin, sorts background events (event_type == 0) according to bg_binning, and sorts signal events (event_type == 1) according to signal_binning.
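
In the flattened truth binning, the single noise bin comes first, followed by the background subbins and then the signal subbins. A quick sanity check of the resulting bin count (purely illustrative, using only methods that appear elsewhere in this example):

n_bg = len(bg_binning.get_values_as_ndarray())
n_signal = len(signal_binning.get_values_as_ndarray())
# one noise bin plus all background and signal subbins
assert len(truth_binning.get_values_as_ndarray()) == 1 + n_bg + n_signal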

Now we create a ResponseMatrix with this binning, just like we did in previous examples:

# Truth bin 0 is the single noise bin; we declare it a nuisance index,
# since we will not (and cannot) fill up its truth information from a
# model prediction
resp = migration.ResponseMatrix(reco_binning, truth_binning,
                                nuisance_indices=[0])

To fill the matrix, though, we need to tell it the event_type of each event. This information might already be part of the simulated data, but in our case we have to add the variable by hand.

For this we can use the cut_function parameter. A cut function takes the data (a structured numpy array) as its only argument and returns the data that should be filled into the binning:

import numpy.lib.recfunctions as rfn

# Cut functions that append the missing `event_type` column on the fly
def set_signal(data):
    # Signal events: event_type = 1
    return rfn.append_fields(data, 'event_type', np.full_like(data['true_x'], 1.))
def set_bg(data):
    # Background events: event_type = 0
    return rfn.append_fields(data, 'event_type', np.full_like(data['true_x'], 0.))
def set_noise(data):
    # Noise events have no true properties, so the reconstructed
    # variable serves as the shape template: event_type = -1
    return rfn.append_fields(data, 'event_type', np.full_like(data['reco_x'], -1.))
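
Note that a cut function does not have to add variables; as the name suggests, it can also select events by returning only a subset of the data. A purely hypothetical example that keeps only events with a positive reconstructed x:

def cut_positive_reco(data):
    # keep only events reconstructed above zero
    return data[data['reco_x'] > 0.]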

n_toys = 100
for i in range(n_toys):
    resp.reset()
    resp.fill_from_csv_file(["../03/modelA_data.txt", "../03/modelB_data.txt"],
        weightfield='weight_%i'%(i,), rename={'reco_x_%i'%(i,): 'reco_x'},
        cut_function=set_signal, buffer_csv_files=True)
    resp.fill_up_truth_from_csv_file(
        ["../00/modelA_truth.txt", "../00/modelB_truth.txt"],
        cut_function=set_signal, buffer_csv_files=True)
    resp.fill_from_csv_file("bg_data.txt", weightfield='weight_%i'%(i,),
        rename={'reco_x_%i'%(i,): 'reco_x'}, cut_function=set_bg,
        buffer_csv_files=True)
    resp.fill_up_truth_from_csv_file("bg_truth.txt", cut_function=set_bg,
        buffer_csv_files=True)
    # Calling `fill_up_truth_from_csv_file` twice only works because
    # the two files fill completely different truth bins
    resp.fill_from_csv_file("noise_data.txt", cut_function=set_noise,
        buffer_csv_files=True)
    builder.add_matrix(resp)

builder.export("response_matrix.npz")
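
The exported response matrix is a regular NumPy .npz archive. If you are curious, you can peek inside with plain numpy; note that the stored key names are an internal detail of ReMU and may differ between versions:

with np.load("response_matrix.npz") as archive:
    for key in archive.files:
        print(key, archive[key].shape)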

We can take a look at the truth information that has been filled into the last of the matrices:

pltr = plotting.get_plotter(truth_binning)
pltr.plot_values(density=False)
pltr.savefig('truth.png')

The base binning is a LinearBinning with only three bins. The corresponding plotter does not know how to plot the subbinnings, so it just marginalizes them out. To plot the content of all truth bins, we can use the basic BinningPlotter, which simply plots the content of each bin:

pltr = plotting.BinningPlotter(truth_binning)
pltr.plot_values(density=False)
pltr.savefig('all_truth.png')
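
The marginalisation the plotter performs can also be reproduced by hand. As a small illustration (it assumes the noise-first flat bin ordering described above), we sum the subbin contents to recover the three top-level totals:

values = truth_binning.get_values_as_ndarray()
n_bg = len(bg_binning.get_values_as_ndarray())
print("noise:", values[0])
print("background:", values[1:1 + n_bg].sum())
print("signal:", values[1 + n_bg:].sum())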

We can look at the content of the subbinnings directly for some nicer plots:

pltr = plotting.get_plotter(signal_binning)
pltr.plot_values()
pltr.savefig('signal_truth.png')
pltr = plotting.get_plotter(bg_binning)
pltr.plot_values()
pltr.savefig('bg_truth.png')

And we can take a look at the efficiencies using the corresponding convenience function:

matrix_utils.plot_mean_efficiency(resp, "efficiency.png")

Next we can use the response matrix for some hypothesis tests. First we need to create the ResponseMatrixPredictor and the templates to be used with the TemplatePredictor:

from remu import likelihood
from multiprocess import Pool
pool = Pool(8)
likelihood.mapper = pool.map
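# Note: setting `likelihood.mapper` to a parallel map is purely an
# optimisation for the p-value calculations below; the analysis is
# expected to work (more slowly) with the default serial mapper, too.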

with open("../01/reco-binning.yml", 'rt') as f:
    reco_binning = binning.yaml.full_load(f)
with open("truth-binning.yml", 'rt') as f:
    truth_binning = binning.yaml.full_load(f)

reco_binning.fill_from_csv_file("real_data.txt")
data = reco_binning.get_entries_as_ndarray()
data_model = likelihood.PoissonData(data)

response_matrix = "response_matrix.npz"
matrix_predictor = likelihood.ResponseMatrixPredictor(response_matrix)

calc = likelihood.LikelihoodCalculator(data_model, matrix_predictor)
maxi = likelihood.BasinHoppingMaximizer()

import numpy.lib.recfunctions as rfn
def set_signal(data):
    return rfn.append_fields(data, 'event_type', np.full_like(data['true_x'], 1.))
def set_bg(data):
    return rfn.append_fields(data, 'event_type', np.full_like(data['true_x'], 0.))

truth_binning.fill_from_csv_file("../00/modelA_truth.txt",
                                 cut_function=set_signal)
modelA = truth_binning.get_values_as_ndarray()
modelA /= np.sum(modelA)

truth_binning.reset()
truth_binning.fill_from_csv_file("../00/modelB_truth.txt",
                                 cut_function=set_signal)
modelB = truth_binning.get_values_as_ndarray()
modelB /= np.sum(modelB)

truth_binning.reset()
truth_binning.fill_from_csv_file("bg_truth.txt", cut_function=set_bg)
bg = truth_binning.get_values_as_ndarray()
bg /= np.sum(bg)

truth_binning.reset()
noise = truth_binning.get_values_as_ndarray()
noise[0] = 1.

Since we put all noise events into the first truth bin, the noise template is just a value of 1 in that bin.

Now we start by trying to fit the model A template without background or noise events:

modelA_only = likelihood.TemplatePredictor([modelA])
calcA_only = calc.compose(modelA_only)

retA_only = maxi(calcA_only)
print(retA_only)
                        fun: 38.52930140475345
             log_likelihood: -38.52930140475345
 lowest_optimization_result:       fun: 38.52930140475345
 hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
      jac: array([4.2633015e-06])
  message: 'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
     nfev: 54
      nit: 24
     njev: 27
   status: 0
  success: True
        x: array([1682.32663089])
                    message: ['requested number of basinhopping iterations completed successfully']
      minimization_failures: 0
                       nfev: 92
                        nit: 10
                       njev: 46
                    success: True
                          x: array([1682.32663089])

To judge how well the result actually fits the data, we can consult the result's p-value:

testA_only = likelihood.HypothesisTester(calcA_only)
print(testA_only.likelihood_p_value(retA_only.x))
0.0024

It clearly is a very bad fit, which is reflected in the maximum likelihood p-value as well:

print(testA_only.max_likelihood_p_value())
0.0

So the signal-only model A hypothesis can be excluded. Now let us try again with background and noise templates added in:

modelA_bg = likelihood.TemplatePredictor([noise, bg, modelA])
calcA_bg = calc.compose(modelA_bg)

retA_bg = maxi(calcA_bg)
print(retA_bg)
                        fun: 28.30046763136394
             log_likelihood: -28.30046763136394
 lowest_optimization_result:       fun: 28.30046763136394
 hess_inv: <3x3 LbfgsInvHessProduct with dtype=float64>
      jac: array([-3.90798750e-06, -5.68433738e-06, -3.19743977e-06])
  message: 'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
     nfev: 124
      nit: 30
     njev: 31
   status: 0
  success: True
        x: array([ 99.5327942 , 476.05489184, 791.38644159])
                    message: ['requested number of basinhopping iterations completed successfully']
      minimization_failures: 0
                       nfev: 264
                        nit: 10
                       njev: 66
                    success: True
                          x: array([ 99.5327942 , 476.05489184, 791.38644159])
testA_bg = likelihood.HypothesisTester(calcA_bg)
print(testA_bg.likelihood_p_value(retA_bg.x))
0.9012
print(testA_bg.max_likelihood_p_value())
0.62

This fit is clearly much better.
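
The improvement also shows up directly in the best-fit log-likelihoods, which gain roughly 10 units over the signal-only fit (this just reuses the log_likelihood values stored on the optimization results above):

print(retA_bg.log_likelihood - retA_only.log_likelihood)

We can repeat the same exercise with model B: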

modelB_only = likelihood.TemplatePredictor([modelB])
calcB_only = calc.compose(modelB_only)

retB_only = maxi(calcB_only)
print(retB_only)
                        fun: 37.51360728486539
             log_likelihood: -37.51360728486539
 lowest_optimization_result:       fun: 37.51360728486539
 hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
      jac: array([-4.2633015e-06])
  message: 'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
     nfev: 78
      nit: 22
     njev: 39
   status: 0
  success: True
        x: array([1866.10415724])
                    message: ['requested number of basinhopping iterations completed successfully']
      minimization_failures: 0
                       nfev: 124
                        nit: 10
                       njev: 62
                    success: True
                          x: array([1866.10415724])
testB_only = likelihood.HypothesisTester(calcB_only)
print(testB_only.likelihood_p_value(retB_only.x))
0.002
print(testB_only.max_likelihood_p_value())
0.004
modelB_bg = likelihood.TemplatePredictor([noise, bg, modelB])
calcB_bg = calc.compose(modelB_bg)

retB_bg = maxi(calcB_bg)
print(retB_bg)
                        fun: 27.239578250837532
             log_likelihood: -27.239578250837532
 lowest_optimization_result:       fun: 27.239578250837532
 hess_inv: <3x3 LbfgsInvHessProduct with dtype=float64>
      jac: array([7.10542172e-06, 1.42108434e-06, 2.13165075e-06])
  message: 'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
     nfev: 48
      nit: 7
     njev: 12
   status: 0
  success: True
        x: array([ 210.20797605,  188.07865899, 1044.45892048])
                    message: ['requested number of basinhopping iterations completed successfully']
      minimization_failures: 0
                       nfev: 336
                        nit: 10
                       njev: 84
                    success: True
                          x: array([ 210.20797605,  188.07865899, 1044.45892048])
testB_bg = likelihood.HypothesisTester(calcB_bg)
print(testB_bg.likelihood_p_value(retB_bg.x))
0.98
print(testB_bg.max_likelihood_p_value())
0.796
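
As for model A, adding the background and noise templates improves the best-fit log-likelihood by roughly 10 units (again using the values stored on the results above):

print(retB_bg.log_likelihood - retB_only.log_likelihood)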

We can also take a qualitative look at the results by plotting the maximum likelihood predictions in reco and truth space:

pltr = plotting.get_plotter(reco_binning)
modelA_reco, modelA_weights = calcA_only.predictor(retA_only.x)
modelB_reco, modelB_weights = calcB_only.predictor(retB_only.x)
modelA_bg_reco, modelA_bg_weights = calcA_bg.predictor(retA_bg.x)
modelB_bg_reco, modelB_bg_weights = calcB_bg.predictor(retB_bg.x)
pltr.plot_array(modelA_reco, label='model A only', stack_function=0.68,
                hatch=r'//', edgecolor='C1')
pltr.plot_array(modelA_bg_reco, label='model A + bg', stack_function=0.68,
                hatch=r'*', edgecolor='C1')
pltr.plot_array(modelB_reco, label='model B only', stack_function=0.68,
                hatch=r'\\', edgecolor='C2')
pltr.plot_array(modelB_bg_reco, label='model B + bg', stack_function=0.68,
                hatch=r'O', edgecolor='C2')
pltr.plot_entries(edgecolor='C0', label='data', hatch=None, linewidth=2.)
pltr.legend()
pltr.savefig('reco-comparison.png')
pltr = plotting.get_plotter(truth_binning)
pltr.plot_array(modelA_only(retA_only.x)[0], label='model A only',
                hatch=r'//', edgecolor='C1', density=False)
pltr.plot_array(modelA_bg(retA_bg.x)[0], label='model A + bg',
                hatch=r'*', edgecolor='C1', density=False)
pltr.plot_array(modelB_only(retB_only.x)[0], label='model B only',
                hatch=r'\\', edgecolor='C2', density=False)
pltr.plot_array(modelB_bg(retB_bg.x)[0], label='model B + bg',
                hatch=r'O', edgecolor='C2', density=False)
pltr.legend(loc='upper left')
pltr.savefig('truth-comparison.png')