Tools for Statistical Tests and Inference



The recommended frequentist methodology is based on the profile likelihood ratio as a test statistic with the corresponding modifications made to be appropriate for discovery, 1-sided upper-limits, and measurements~cite{Cowan:2010js}. The procedure for the mode of search and 1-sided upper-limits based on CLs~cite{CLs} is further documented in Ref.~cite{ATLAS:2011tau} and Ref.~cite{Recommendations}. The extension of these recommendations for measurement problems (68% and 95% confidence intervals in a single parameter or multiple parameters with or without physical boundaries) is described in Ref.~cite{ExtendedRecommendations}.


In the case of upper limits we start with the statistical model \(f(data | \mu, \alpha)\), where \(\mu\) is the parameter of interests (like a cross-section, signal yield, or signal strength) and \(\alpha\) are the nuisance parameters. In most cases, we have the physical boundary \(\mu\ge 0\), which motivates the use of the CLs procedure for 1-sided upper-limits in order to avoid the “sensitivity problem” (the possibility of excluding arbitrarily small values of \(\mu\) where we have no sensitivity due to downward fluctuations). The profile likelihood ratio \(\tilde{q}_\mu\) (as defined in Ref.~cite{Cowan:2010js}) is used as a test statistic. The p-values for the \(\mu\) (s+b) and \(\mu=0\) (b-only) hypotheses can be evaluated either with ensembles of pseudo-experiments (toy Monte Carlo) or by using the asymptotic distributions. Both techniques are non-trivial to code. The toy Monte Carlo approach which requires dealing with randomizing the global observables associated to nuisance parameters in a frequentist way (as opposed to the mixed frequentist-Bayesian hybrid procedure that is implemented in several tools) and the ‘’profile-construction’’ for making the multi-dimensional Neyman-Construction computationally tractable~cite{Chuang:2000tba,Cranmer:2005hi,ATLAS:2011tau,Demortier}.

The RooStats tool HypoTestInverter performs the “hypothesis test inversion” of the Neyman-Construction, and can be configured to use either FrequentistCalculator (toy Monte Caro) or AsymptoticCalculator (asymptotics) to calculate p-values. The FrequentistCalculator can use several test statistics, but for upper-limits one uses the ProfileLikelihoodRatioTestStat to calculate the 1-sided profile-likelihood ratio test statistic \(q_\mu\). The RooStats tool FrequentistCalculator implements the fully-frequentist p-value calculation based on toys Monte Carlo with the recommended treatment of global observables and the profile-construction.

Improvements Needed/Planned

An improved procedure for the \(\pm 1,2\sigma\) bands for limits based on the asymptotic formulae have been investigated~cite{AsymptoticBands}. The key issue there is that \(\sigma^2\), the variance of \(\mu\), depends on \(\mu\), which is a second-order effect, while the original formulation assumed constant \(\sigma^2\). The improved procedure was described and used for several Higgs results, but the corresponding code has not yet been ported into the ROOT codebase. This is a work package for the Statistics Forum.

Various improvements and feature requests are also work packages for the statistics forum. This includes improved robustness, improved scanning algorithms, and requests for streamlined, user-friendly interfaces to these underlying tools.

Background-only p-values for Searches

In the case of searches, we are interested in calculating a background-only p-value. Typically we start with the statistical model \(f(data | \mu, \alpha)\), where \(\mu\) is proportional to the signal cross-section (e.g. a signal yield or signal strength) and \(\alpha\) are the nuisance parameters. The appropriate test statistic \(\tilde{q}_0\) (as defined in Ref.~cite{Cowan:2010js}), with increasing values indicating more events than expected in the background-only hypothesis, is implemented with ProfileLikelihoodRatioTestStat. The background-only p-value, denoted \(p_0\), can be calculated using toy Monte Carlo with RooStats FrequentistCalculator or by using the asymptotic formulae with the RooStats AsymptoticCalculator. The subtleties associated to the treatment of global observables and nuisance parameters described above in the upper-limit section also apply here.

Measurements and Confidence Intervals / Parameter Contours

The extended recommendations in Ref.~cite{ExtendedRecommendations}. are aimed at measurement problems (68% and 95% confidence intervals in a single parameter or multiple parameters with or without physical boundaries). The document is a natural extension on the search/upper-limit recommendations. The primary difference is to change the test statistic to \(t_\mu\) (as defined in Ref.~cite{Cowan:2010js}), whcihis appropriate for measurements instead of 1-sided tests. This test statistic is also implemented with the RooStats ProfileLikelihoodRatioTestStat. As above, the p-values can be calculated either with asymptotic or toy Monte Carlo; however, there are improvements needed and planned in this case.

Improvements Needed/Planned

The HypoTestInverter currently only supports 1-d problems. The NeymanConstruction and FeldmanCousins classes support N-D problems (with boundaries), but is not as configurable as the HypoTestInverter class. The planning in RooStats is to unify these tools. Note, the RooStats FrequentistCalculator can calculate p-values using toy Monte Carlo and the recommended treatment of global observables and nuisance parameters even in multi-dimensional cases with complex boundaries – the missing part is not the p-value calculation, but the scan over the parameter space and the actual hypothesis test inversion. Efficient scanning becomes increasingly important for multidimensional problems.

The HypoTestInverter can also be configured to use the AsymptoticCalculator to calculate p-values more quickly. The AsymptoticCalculator only has the 1-d case with a lower-boundary implemented. The 1-d case with upper- and lower-boundaries has been worked out~cite{Cowan:2012se} and should be implemented as well.

For multi-dimensional problems, p-value based on toy MC can become quite time consuming. In many cases the asymptotic approach is sufficiently accurate and much faster. The presence of boundaries modifies the asymptotic distributions; however, in general this depends on the shape of the boundary which means there will be no general formulae. It is possible that one can find a formulae for the asymptotic distribution for simple boundaries (e.g. or \(\mu_1>0\), \(\mu_2>0\), or \(\mu_1>0 \&\& \mu_2>0\) ). Neglecting these modifications to the boundary leads to over-coverage and some protection to the sensitivity problem near the boundary similar to CLs, thus the current recommendations are to use the uncorrected \(\chi^2_n\) distribution and make this clear in the text of the paper. Thus, the tools needed for the asymptotic procedure for this non-calibrated procedure are already in place with RooStats ProfileLikelihoodRatioTestStat and the standard \(\chi^2_n\) cutoffs for 68% and 95% confidence intervals (and have been used in recent Higgs property papers).

Diagnostics are important for all statistical methods, particularly for complicated problems. There are a number tools that have been developed that are in use by the physics groups, but these need to migrated into the common code bases (either ROOT or the appropriate ATLAS distribution).



The Bayesian approach has been used for a number of ATLAS results, primarily for upper-limits on a signal cross-section. The Bayesian approach needs two key pieces of information: the likelihood \(L(\mu,\alpha)\) and the prior \(\pi(\mu,\alpha)\), where \(\mu\) denotes the parameter(s) of interest and \(\alpha\) denotes the nuisance parameters.

As mentioned in Section~ref{S:modeling}, if one has a full statistical model \(f(data|\mu,\alpha)\), then one evaluates the likelihood via \(L(\mu,\alpha) \equiv f(\textrm{observed~data} | \mu,\alpha)\). It is possible that one defines a likelihood function \(L(\mu,\alpha)\) without providing the ability to generate pseudo data, but this is approach rules out the ability to compare to or provide complementary analysis with frequentist methods.


The prior \(\pi(\mu,\alpha)\) must also be specified for Bayesian methods. There are several approaches to specifying the prior ranging from a fully subjective ‘’prior belief’’ to priors derived from formal rules. Often the prior is factorized as \(\pi(\mu,\alpha) = \pi(\mu)\pi(\alpha)\). While priors may be informed by previous measurements, there is always some trace of an original prior, as described below. Recall that a change of variables influences the prior. In particular, the change of variables from \(\mu \to \nu(\mu)\) introduces a Jacobian factor for the prior \(\pi(\mu) \to \pi(\nu) = \pi(\mu)/ | d\nu/d\mu |\). This implies that a statement like ‘’uniform prior’’ is meaningless unless you say that is uniform with respect to some particular variable. This is relevant for situations in which the signal rate is a function of some theoretical parameters – like the energy scale \(\Lambda\) for a contact interaction. The choice of ‘’flat’’ prior in \(s\), \(\Lambda\), \(\Lambda^2\), etc. are not equivalent.

The prior for the nuisance parameters is often dealt with in way that is closely connected to frequentist approach. In particular, there is often some auxiliary measurement described by the statistical model \(f(a | \alpha)\), where \(a\) is the data associated to that auxiliary measurement used to measure/constrain \(\alpha\). Often the statistical model for these auxiliary measurements are adequately approximated by a Gaussian or Poisson. The corresponding prior used for the main measurement can be seen as the posterior from the auxiliary measurement. In particular, it is given by Bayes’ theorem with \(\pi(\alpha) \propto f(a|\alpha) \eta(\alpha)\), where \(\eta(\alpha)\) is some original prior (sometimes called the ‘’ur-prior’’ in reference to the German ‘’ur-‘’) from before both the main measurement and the auxiliary measurement. Most commonly, the \(\eta(\alpha)\) is taken to be uniform in \(\alpha\) – which is mainly irrelevant when the parameter has small relative uncertainty after the auxiliary measurement – in which case \(\pi(\alpha) \propto f(a|\alpha)\). Table~ref{tab:constraints} provides a few common and consistent relationships between the auxiliary measurement, the ur-prior , and the resulting posterior from the auxiliary measurement (which is used as prior for the main measurement).

In cases like theoretical uncertainties, there may be no auxiliary measurement for the parameter \(\alpha\). In the Bayesian approach, one can directly start with an assumed \(\pi(\alpha)\), while the constraint term used in the frequentist approach is typically introduced artificially following Table~ref{tab:constraints} backwards (i.e. \(\pi(\alpha) \to f(a|\alpha)\), ).

The prior for the parameter of interest is more delicate as it directly impacts the inference on \(\mu\). As mentioned above, for situations in which the signal rate is a function of some theoretical parameters like the energy scale \(\Lambda\) for a contact interaction, then the choice of ‘’flat’’ prior in \(s\), \(\Lambda\), or \(\Lambda^2\) will lead to different answers. The discussion of which prior to use is beyond the scope of this document, but the most common choice is to use a uniform prior for the signal yield. While not a justification, it is an important to know that 1-sided Bayesian upper-limits using this prior agree with the 1-sided CLs upper-limits in almost all problems that have been investigated, even complex Higgs combinations~cite{ATLAS:2011gia}.

The other priors that have been advocated are Jeffreys’s prior~cite{Jeffreys} and the Reference prior~cite{Demortier:2010sn,Casadei:2011hx}, which are both the result of a formal procedure based on the full statistical model \(f(data|\mu,\alpha)\) (another reason it is good to describe the full statistical model, not just the likelihood). These priors have nice invariance properties to reparametrization (e.g. \(s(\Lambda) \to \Lambda\)) and strive to be `non-informative’ in a precise sense. Unfortunately, both priors are difficult to calculate except in rather simple problems. For example, the reference prior has been calculated for the common number counting problem

\[f(n,m | \mu, \alpha) = \Pois_{SR}(n|\mu+\alpha) \, \Pois_{CR}(n|\tau\alpha) \;,\]

where \(\tau\) is a fixed extrapolation factor. Note, the reference prior for the number counting experiment is implemented in BAT.

Sampling and Marginalization Tools

The key object of Bayesian analysis is the posterior distribution, which is proportional to the product of the likelihood \(L(\mu,\alpha)\) and the prior \(\pi(\mu,\alpha)\). When interested in inference on the parameter of interest \(\mu\) the Bayesian approach is to marginalize (integrate) over the nuisance parameters \(\alpha\). These integrals can be difficult and are typically performed with numerical algorithms that produce a large sample of \(\{\mu,\alpha\}\) that is proportional to the posterior. The primary algorithms for this sampling are the Markov Chain Monte Carlo (Metropolis-Hastings or Gibbs sampling) and Nested Sampling. The two primary tools that implement these algorithms are the BAT toolkit~cite{BAT} and RooStats, and there is interoperability between these two tools~cite{roobat}. BAT provides MCMC sampling for likelihoods implemented in both BAT’s modeling language or RooFit modeling language. RooStats can use either it’s internal texttt{MCMCCalculator}, the interface to BAT’s MCMC algorithm, or the nested sampling algorithm implemented in MultiNest~cite{Feroz:2008xx}.

Improvements Needed/Planned

The interface between BAT/RooFit interfaces should be maintained and well tested for complicated examples. The RooStats tool for numerically evaluating the Jeffreys prior for an arbitrary RooFit model cite{RooJeffreys} should be further developed and documented (as this can be used by both RooStats and BAT). Diagnostics of various sorts are available, but can always be improved.

Good Practices

Modeling systematics

  • Uncerstanding constrained NPs - associated tool - ranking plots [ PLL, Exo, Slides ]

Recommendations for measurements

(as opposed to limits)

Bayesian misc

  • marginalisation tools etc, choice of prior, sampling techniques etc [ Kyles notes? ]

Tables of systematics

  • how to do this, conceptual issues with NP correlations between reported groups [ Exo, Slides, PLL ]

Diagnostic Tools

Understanding fit failures etc


expected vs observed