paper_cw_mcmc.tex 74.7 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
\documentclass[aps, prd, twocolumn, superscriptaddress, floatfix, showpacs, nofootinbib, longbibliography]{revtex4-1}

% Fixes Adbobe error (131)
\pdfminorversion=4

% Ams math
\usepackage{amsmath}
\usepackage{amssymb}

% Packages for setting figures
\usepackage[]{graphicx}
\usepackage{subfig}

% Colors
\usepackage{color}

% Bibliography
\usepackage{natbib}

\usepackage{placeins}

\usepackage{multirow}

\usepackage{hhline}

% get hyperlinks
%\usepackage{hyperref}

% Tables
\newcommand{\specialcell}[2][c]{%
  \begin{tabular}[#1]{@{}c@{}}#2\end{tabular}}

\input{definitions.tex}
34
\input{macros}
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53

% For editing purposes: remove before submition
\usepackage[normalem]{ulem}	%% only added for 'strikeout' \sout
\usepackage[usenames,dvipsnames]{xcolor}

\newcommand{\dcc}{LIGO-{\color{red}}}

%% ---------- editing/commenting macros: make sure all a cleared at the end! ----------
\newcommand{\mygreen}{\color{green!50!black}}
\newcommand{\addtext}[1]{\textcolor{green!50!black}{#1}}
\newcommand{\meta}[1]{\addtext{#1}}
\newcommand{\CHECK}[1]{\textcolor{red}{#1}}
\newcommand{\strike}[1]{\textcolor{red}{\sout{#1}}}
\newcommand{\comment}[1]{\textcolor{red}{[#1]}}
\newcommand{\replace}[2]{\strike{#1}\addtext{$\rightarrow$ #2}}
%% ---------- end: editing/commenting macros ----------------------------------------

\begin{document}

54
\title{MCMC follow-up of continuous gravitational wave candidates}
55

56
57
58
59
60
61
62
63
64
%    \author{G. Ashton}
%    \email[E-mail: ]{gregory.ashton@ligo.org}
%    \affiliation{Max Planck Institut f{\"u}r Gravitationsphysik
%                 (Albert Einstein Institut) and Leibniz Universit\"at Hannover,
%                 30161 Hannover, Germany}
%    \author{R. Prix}
%    \affiliation{Max Planck Institut f{\"u}r Gravitationsphysik
%                 (Albert Einstein Institut) and Leibniz Universit\"at Hannover,
%                 30161 Hannover, Germany}
65
66
67
68

\date{\today}

\begin{abstract}
69

70
71
72
73
74
75
76
77
78
We describe methods to follow-up potential continuous gravitational wave
candidates (as identified by wide-parameter space semi-coherent searches)
leveraging MCMC optimisation of the $\mathcal{F}$-statistic.  Such a framework
provides a unique advantage in both following-up standard signals (i.e. during
the `zoom-stage' in which the coherence time is increased aiming to localise
the fully-coherent candidate and in parameter estimation) as well as follow-up
of non-standard waveforms. In particular, we demonstrate provide a
glitch-robust statistic for follow-up of glitching continuous gravitational
wave signals.
79
80
81
82
83
84
85
86
87
88
89
90

\end{abstract}

\pacs{04.80.Nn, 97.60.Jd, 04.30.Db}
\input{git_tag.tex}
\date{\commitDATE; \commitIDshort-\commitSTATUS, \dcc}

\maketitle


\section{Introduction}

91
92
A target for the advanced gravitational wave detector network of LIGO
and Virgo are long-lived periodic sources called continuous gravitational waves (CWs).
93
94
95
96
97
98
Rapidly rotating nonaxisymmetric neutron stars are potentially capable of
producing detectable CWs which last much longer than typical observation spans.
There exists three well known sources of the nonaxisymmetry: `mountains',
precession, and r-mode oscillations; each of which make a prediction for the
scaling between $\nu$, the NS spin frequency and $f$, the gravitational wave
frequency. In any case, observing neutron stars through their gravitational
99
wave emission would provide a unique astrophysical insight and has hence
100
101
102
103
104
105
106
motivated numerous searches.

As shown by \citet{jks1998}, the gravitational wave signal from a
nonaxisymmetric source produces a strain in the detector $h(t, \A, \blambda)$;
where $\A{=}\{h_0, \cos\iota, \psi, \phi_0\}$ is a vector of the four
\emph{amplitude-parameters} (expressed in `physical coordinates') and
$\blambda$ is a vector of the \emph{Doppler-parameters} consisting of the
107
108
sky-location, frequency $f$, and any spindown terms $\{\dot{f}, \ddot{f},
\ldots\}$.
109

110
111
112
113
CW searches often use a fully-coherent matched-filtering methods whereby a
\emph{template} (the signal model at some specific set of parameters) is
convolved against the data resulting in a detection statistic (in this work, we
will consider the $\F$-statistic, defined in \citet{jks1998}). Since the signal
114
parameters are unknown, it is usual to perform this matched filtering over a
115
116
117
118
119
120
121
122
123
124
125
126
grid of points to maximise the detection statistic.  Three search categories
can be identified: \emph{targeted} searches for a signal from a known
electromagnetic pulsar where the Doppler parameters are considered `known';
\emph{directed} searches in which the location is known, but not the frequency
and spin-down (i.e.  searching for the neutron star in a supernova remnant
which does not have a known pulsar); and \emph{all-sky} searches where none of
the parameters are considered known.  Searching over more parameters amounts to
an increase in the dimension of the search space. Since the density of grid
points required to resolve a signal scales inversely with total coherence time
$\Tcoh$ (the span of data used in a fully-coherent matched filter),
wide-parameter searches (such as the all-sky) with many search dimensions over
long durations of data are computationally infeasible.
127
128
129
130

At a fixed computing cost, it has been shown (see for example \citep{brady1998,
prix2012}) that a semi-coherent search is more sensitive for unknown signals
than a fully-coherent search. While the exact details of how the search works
131
132
133
134
135
136
137
138
139
140
141
142
143
depends on the implementation, a semi-coherent search involves by dividing the
total observation span $\Tspan$ into $\Nseg$ segments (each lasting for
$\Tcoh$) and in each segment computes the fully-coherent detection statistic;
the semi-coherent detection statistic is then computed by some combination of
all segments summed at the same point in parameter space. Fundamentally, this
gain in sensitivity is because the width of a peak in the detection statistic
due to a signal is inversely proportional to the coherence time: shorter
coherence times make the peak wider and hence a lower density of templates is
required.  This idea was first proposed by \citet{brady2000} along with the
first implementation, the `Stack-slide' search. Since then, several
modifications such as the `Hough-transform' \citep{krishnan2004, astone2014},
and the `Powerflux' method (first described in \citet{allyskyS42008}) have been
proposed, implemented and applied to gravitational wave searches.
144
145
146

Wide parameter space searches produce a list of candidates with an associated
detection statistic which passes some threshold. In order to verify these
147
candidates, they are subjected to a \emph{follow-up}: a process of increasing
148
the coherence time, eventually aiming to calculate a fully-coherent detection
149
statistic over the maximal span of data. In essence, the semi-coherent search
150
is powerful as it spreads the significance of a candidate over a wider area of
151
parameter space, the follow-up attempts to reverse this process and recover
152
153
the maximum significance and tightly constrain the candidate parameters. The
original hierarchical follow-up of \citet{brady2000} proposed a two-stage method
154
(an initial semi-coherent  stage followed directly by a fully-coherent search.
155
156
157
158
159
160
161
162
However, it was shown in a numerical study by \citet{cutler2005} that allowing
an arbitrary number of semi-coherent stages before the final fully-coherent
stage can significantly improve the efficiency: ultimately they concluded that
three semi-coherent stages provide the best trade-off between sensitivity and
computational cost.

The first implementation of a two-stage follow-up was given by
\citet{shaltev2013} and used the Mesh Adaptive Direct Search algorithm for
163
optimisation. A multi-stage gridded approach was presented in \citet{Papa2016}.
164

165
In this paper, we propose an alternative hierarchical follow-up procedure using
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
Markov-Chain Monte-Carlo (MCMC) as the optimisation tool. An MCMC tool is
advantages due to its ability to trace the evolution of multiple modes
simultaneously through the follow-up procedure and allow the optimisation to
decide between them without arbitrary cuts.

The follow-up of canonical CW signals is a well studied problem and, provided
they behave as expected, the maximum loss of detection statistic (i.e. the
template bank mismatch) can be calculated. However, nature does not always
behave as expected and it is plausible that the CW signals in the data may
differ from the templates used in wide-parameter space searches. There are of
course an unlimited number of ways this may manifest given our ignorance of
neutron stars, but from the study of pulsars three obvious mechanisms present
themselves: \emph{transient-}, \emph{glitching-}, and \emph{noisy-} signals.
A search method for long-duration transient signals (i.e. signals which
last from hours to weeks) was first presented by
\citet{prix2011}, in Sec.~\ref{sec_transients} we will describe how MCMC
optimisation can be used to perform parameter estimation using this method.
In \citet{ashton2017}, we noted the need for a glitch-robust follow-up method
(i.e. a search capable of finding signals which undergo sudden changed in the
rotation frequency as seen in pulsars); to respond to this in Sec.~\ref{sec_glitches}
we provide such a statistic, demonstrate its use on simulated glitching
signals, and discuss significance estimation. For noisy-signals, specifically
motivated by the observation of so-called timing noise in radio pulsars (see
\citet{hobbs2010} for a review), in \citet{ashton2015} we determined that these
where unlikely to be of immediate concern.

The methods introduced in this paper have been implemented in the
\texttt{python} package \texttt{pyfstat}. All examples in this work, along with
tutorials can be found at
\url{https://gitlab.aei.uni-hannover.de/GregAshton/PyFstat}.
196
197
198
199

We begin in Section~\ref{sec_hypothesis_testing} with a review of search
methods from a Bayesian perspective. Then in
Section~\ref{sec_MCMC_and_the_F_statistic} we introduce the MCMC optimisation
200
procedure and give details of our particular implementation. In
201
202
203
204
Section~\ref{sec_follow_up} we describe the zoom follow-up procedure and test
its efficient. In Sections~\ref{sec_transients} and \ref{sec_glitches} we
demonstrate how searches can be performed for either transient or glitches CWs
before we finally conclude in Section~\ref{sec_conclusion}.
205

Gregory Ashton's avatar
Gregory Ashton committed
206
207
\section{MCMC and the $\F$-statistic}
\label{sec_MCMC_and_the_F_statistic}
208

Gregory Ashton's avatar
Gregory Ashton committed
209
210
211
212
In this section, we will give a brief introduction to the use of the
$\F$-statistic in a hypothesis testing framework, introduce conceptual ideas
about the use of Markov-Chain Monte-Carlo (MCMC) algorithms in this context,
and develop the idea of using MCMC routines in CW searches.
213

Gregory Ashton's avatar
Gregory Ashton committed
214
215
216
217
218
219
220
221
222
223
224
225
226
227
\subsection{Hypothesis testing framework}
\label{sec_hypothesis_testing}

In \citet{prix2009}, a framework was introduced demonstrating the use of the
$\F$-statistic (first discussed in \citet{jks1998}) in defining the \emph{Bayes
factor} $\Bsn(\boldsymbol{x})$ between the signal hypothesis and the noise
hypothesis computed on some data $\boldsymbol{x}$.  It was shown that for an
unphysical uniform prior on the amplitude parameters, the log-Bayes factor was
proportional to the $\F$-statistic. In this work, we will use the modification
proposed in \citet{prix2011}, which removed the antenna-pattern weighting
factor; let us now briefly review this framework.

The Bayes-factor between the signal and noise hypothesis, given the
`$\F$-statistic prior', is
228
\begin{equation}
Gregory Ashton's avatar
Gregory Ashton committed
229
230
\Bsn(\boldsymbol{x}) = \int \Bsn(\boldsymbol{x}| \blambda)P(\blambda) d\blambda
\label{eqn_Bsn_x}
231
\end{equation}
Gregory Ashton's avatar
Gregory Ashton committed
232
233
234
where $\Bsn(\boldsymbol{x}| \blambda)$ is the
targeted Bayes factor (i.e. computed at a set of
Doppler parameters $\blambda$):
235
\begin{equation}
Gregory Ashton's avatar
Gregory Ashton committed
236
237
238
\Bsn(\boldsymbol{x}|\blambda) = \frac{70}{\rhohatmax^{4}}
e^{\F(\boldsymbol{x}; \blambda)}.
\label{eqn_CW_likelihood}
239
\end{equation}
Gregory Ashton's avatar
Gregory Ashton committed
240
241
242
243
244
245
246
247
Here, $\rhohatmax$ is a maximum cutoff SNR required to normalise the prior.
The original calculation was done for a transient-CW, in Eqn~\eqref{eqn_CW_likelihood}
we constrain the system to the special case of standard CW signals for which the
marginalisation over the transient parameters yields unity.

Furthermore, \citet{prix2011} also demonstrated that the semi-coherent Bayes
factor naturally arrises by splitting the likelihood into $N$ independent
\emph{segment} resulting in
248
\begin{equation}
Gregory Ashton's avatar
Gregory Ashton committed
249
250
251
252
\Bsn^{\textrm{SC}}(\boldsymbol{x}|\blambda) =
\left(\frac{70}{\rhohatmax^{4}}\right)^{N}
e^{\sum_{\ell=1}^{N}\F(\boldsymbol{x}_{(\ell)}; \blambda)},
\label{eqn_SC_CW_likelihood}
253
\end{equation}
Gregory Ashton's avatar
Gregory Ashton committed
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
where $\boldsymbol{x}_{(\ell)}$ refers to the data in the $\ell^{\mathrm{th}}$
segment. As such, this can replace the relevant term in



%\section{Hypothesis testing}
%\label{sec_hypothesis_testing}
%
%\subsection{Bayes factors}
%Given some data $x$ and a set of background assumptions $I$, we formulate
%two hypotheses: $\Hn$, the data contains solely Gaussian noise and $\Hs$, the
%data contains an additive mixture of noise and a signal $h(t; \A, \blambda)$.
%In order to make a quantitative comparison, we use Bayes theorem in the usual
%way to write the odds as
%\begin{equation}
%O_{\rm S/N} \equiv \frac{P(\Hs| x, I)}{P(\Hn| x, I)} =
%\Bsn(x| I) \frac{P(\Hs| I)}{P(\Hn | I)},
%\end{equation}
%where the second factor is the \emph{prior odds} while the first is the
%\emph{Bayes factor}:
%\begin{equation}
%\Bsn(x| I) = \frac{P(x| \Hs, I)}{P(x| \Hn, I)}.
%\end{equation}
%Typically, we set the prior odds to unity such that it is the Bayes factor
%which determines our confidence in the signal hypothesis. In this work, we will
%therefore discuss the Bayes factor with the implied assumption this is
%equivalent to the odds, unless we have a good reason to change the prior odds.
%
%We can rewrite the Bayes factor in terms of the two sets of signal parameters
%as
%\begin{equation}
%\Bsn(x| I) = \frac{P(x, \A, \blambda|\Hs, I)}
%{P(\A| \Hs, I)P(\blambda| \Hs, I)P(x| \Hn, I)}.
%\end{equation}
%Marginalising over the two sets of parameters we find that
%\begin{equation}
%\Bsn(x| I)= \iint
%\mathcal{L}(x; \A, \blambda)
%P(\A| \Hs, I)P(\blambda| \Hs, I)
%d\blambda d\A
%\label{eqn_full_bayes}
%\end{equation}
%where
%\begin{equation}
%\mathcal{L}(x; \A, \blambda) \equiv \frac{P(x |\A, \blambda, \Hs, I)}{P(x| \Hn, I)},
%\label{eqn_likelihood}
%\end{equation}
%is the \emph{likelihood-ratio}.
%
%At this point, we can appreciate the problems of searching for unknown signals:
%one has four amplitude parameters and several Doppler parameters over which
%this integral must be performed. If a single signal exists in the data, this
%corresponds to a single peak in the likelihood-ratio, but at an unknown
%location. Therefore, one must must first search for peaks (candidates), and
%then subsequently analyse their significance. If one has prior knowledge this
%can be used; for example in targeted searches for gravitational waves from
%known pulsars emitting CWs from a mountain the Doppler parameters are
%considered known collapsing their integral to a single evaluation (see for
%example the method proposed by \citet{dupuis2005}).
%\comment{Should we cite other known-pulsar methodologies?}
%
%\subsection{The $\F$-statistic: a frequentist perspective}
%
%For directed and all-sky searches, a common method introduced by
%\citet{jks1998} to reduce the parameter space is the maximum-likelihood
%approach. In this approach (often referred to as `frequentist'), one starts by
%defining the likelihood-ratio, Equation~\eqref{eqn_likelihood}, which in this
%context is a \emph{matched-filtering} amplitude. Then, analytically maximising
%this likelihood-ratio with respect to the four amplitude parameters results
%(c.f.~\citet{prix2009}) in a maximised log-likelihood given by $\F(x|
%\blambda)$: the so-called $\F$-statistic. Picking a particular set of Doppler
%parameters $\blambda$ (the template) one can then compute a detection statistic
%(typically $\twoFtilde$ is used\footnote{Note that the `tilde' denote that it
%is a fully-coherent quantity while we use a `hat' to denote a semi-coherent
%quantity}) which can be used to quantify the significance of the template.
%Usually this is done by calculating a corresponding false alarm rate, the
%probability of seeing such a detection statistic in Gaussian noise.
%
%Calculations of the significance are possible due to the properties of the
%detection statistic: in Gaussian noise, it can be shown \citep{jks1998,
%cutlershutz2005} that $\widetilde{2\F}$ follows a chi-squared distribution with
%4 degrees of freedom. In the presence of a signal, $\widetilde{2\F}$ is
%still chi-squared with 4 degrees of freedom, but has a non-centrality parameter
%$\tilde{\rho}^{2}$ such that its expectation value is
%\begin{equation}
%\textrm{E}[\widetilde{2\F}(x; \blambda)] = 4 + \tilde{\rho}(x; \blambda)^2.
%\label{eqn_twoF_expectation}
%\end{equation}
%The non-centrality parameter in this context is the SNR of the matched-filter
%given by
%\begin{equation}
%\rho^{2} = (h | h) \propto \frac{h_0^2}{\Sn}\Tcoh \mathcal{N}
%\end{equation}
%where $(h|h)$ is the inner product of the signal with itself (see for example
%\citet{prix2009}), $\Sn$ is a (frequency-dependent) measure of the noise in
%the detector and $\mathcal{N}$ is the number of detectors.
%
%
%\subsection{The $\F$-statistic: a Bayesian perspective}
%
%At first, it
%appears that the $\F$-statistic is independent of the Bayesian framework since
%it was first derived directly from the likelihood. However, it was shown in
%\citet{prix2009, prix2011} that if we marginalise over the four amplitude parameters of
%Equation~\eqref{eqn_full_bayes} (using the ``$\F$-statistic prior'' \citep{prix2011},
%which we denote by $\AmplitudePrior$)
%then the integral, when $\homax \gg 1$, is a Gaussian integral and can be
%computed analytically as
%\begin{align}
%\begin{split}
%\Bsn(x| \blambda, \AmplitudePrior, I) & \equiv
%\int
%\mathcal{L}(x ;\A, \blambda)
%P(\A| \Hs, \AmplitudePrior, I) d\A
%\label{eqn_B_S_N}
%\\
%& \propto e^{\tilde{\F}(x| \blambda)}
%\end{split}
%\end{align}
%where $\mathcal{F}$ is exactly the $\F$-statistic, first derived
%as a maximised log-likelihood \citep{jks1998}. This result
%demonstrates that the $\mathcal{F}$-statistic is proportional to the log-Bayes
%factors when calculated with a uniform prior on the amplitude parameters and
%fixed Doppler parameters.
%
%As such, we can define the Bayes-factor of Equation~\eqref{eqn_full_bayes} as
%\begin{equation}
%\Bsn(x| \AmplitudePrior, I) = \int
%\Bsn(x| \AmplitudePrior, \blambda, I) P(\blambda| \Hs, I)
% d\blambda,
%\end{equation}
%or neglecting the constants
%\begin{equation}
%\Bsn(x| \AmplitudePrior, I) \propto \int
%e^{\tilde{\F}(x| \blambda)} P(\blambda| \Hs, I)
% d\blambda.
%\label{eqn_bayes_over_F}
%\end{equation}
%
%Formulating the significance of a CW candidate in this way is pragmatic in that
%there exists a wealth of well-tested tools (i.e. \texttt{LALSuite}
%\citep{lalsuite}) capable of computing the $\mathcal{F}$-statistic for CW
%signals, transient-CWs, and CW signals from binary systems; these can be
%leveraged to compute Equation~\eqref{eqn_bayes_over_F}.
398
399

The MCMC class of optimisation tools are formulated to solve the problem of
400
inferring the posterior distribution of some general model parameters $\theta$
Gregory Ashton's avatar
Gregory Ashton committed
401
402
403
404
405
given given some data $x$. We will not give a full description here, but
refer the reader to the literature, e.g. \citet{mackay2003information}.
In this case, we will be concerned with the posterior
distribution of $\blambda$, the Doppler parameters. From Bayes theorem, we
see that this is proportional to the integrand of Eqn.~\eqref{eqn_Bsn_x}:
406
\begin{equation}
Gregory Ashton's avatar
Gregory Ashton committed
407
\Bsn(\blambda | \boldsymbol{x}) \propto \Bsn(\boldsymbol{x}| \blambda)P(\blambda).
408
409
\end{equation}

Gregory Ashton's avatar
Gregory Ashton committed
410
411
412
413
414
415
416
The likelihood function (either for the fully-coherent of semi-coherent case)
and the prior distributions can be passed to any optimisation routine in order
to maximise over the unknown signal parameters. In this paper, we will discuss
the use of an ensemble MCMC optimization routine, this has the advantage of
being efficient in high-dimensional spaces, the ability to sampler multi-modal
posteriors, and to calculate the marginalised Bayes factor (i.e.
Eqn~\ref{eqn_Bsn_x}).
417
418
419
420
421
422
423
424

\subsection{The \texttt{emcee} sampler}

In this work we will use the \texttt{emcee} ensemble sampler
\citep{foreman-mackay2013}, an implementation of the affine-invariant ensemble
sampler proposed by \citet{goodman2010}. This choice addresses a key issue with
the use of MCMC sampler, namely the choice of \emph{proposal distribution}. At
each step of the MCMC algorithm, the sampler generates from some distribution
425
426
(known as the proposal-distribution) a jump in parameter space. Usually, this
proposal distribution must be `tuned' so that the MCMC sampler efficiently
427
428
429
walks the parameter space without either jumping too far off the peak, or
taking such small steps that it takes a long period of time to traverse the
peak. The \texttt{emcee} sampler addresses this by using an ensemble, a large
430
431
432
433
434
435
436
437
number ${\sim}100$ parallel \emph{walkers}, in which the proposal for each
walker is generated from the current distribution of the other walkers.
Moreover, by applying an an affine transformation, the efficiency of the
algorithm is not diminished when the parameter space is highly anisotropic. As
such, this sampler requires little in the way of tuning: a single proposal
scale and the number of steps to take.

\subsection{Parallel tempering: sampling multimodal posteriors}
438
\label{sec_parallel_tempering}
439
440
Beyond the standard ensemble sampler, we will also use one further
modification, the parallel-tempered ensemble sampler. A parallel
441
442
tempered MCMC simulation, first proposed by \citet{swendsen1986}, runs
$\Ntemps$ simulations in parallel with the likelihood in the $i^{\rm th}$
443
parallel simulation is raised to a power of $1/T_{i}$ where $T_i$ is referred
444
to as the \emph{temperature}. As such, Equation~\eqref{eqn_lambda_posterior} is
445
written as
446
\begin{equation}
447
448
P(\blambda | T_i, x, \AmplitudePrior, \Hs, I)
%=\Bsn(x| \AmplitudePrior, \blambda)^{1/T_i} P(\blambda| \Hs I).
449
450
\propto (e^{\F(x| \blambda)})^{T_i} P(\blambda| \Hs I).
\end{equation}
451
Setting $T_0=1$ with $T_i > T_0 \; \forall \; i > 1$, such that the lowest
452
453
temperature recovers Equation~\eqref{eqn_lambda_posterior} while for higher
temperatures the likelihood is broadened (for a Gaussian likelihood, the
454
455
standard deviation is larger by a factor of $\sqrt{T_i}$). Periodically, the
algorithm swaps the position of the walkers between the different
456
temperatures. This allows the $T_0$ chain (from which we draw samples of the
457
posterior) to efficiently sample from multi-modal posteriors. This method introduces
458
two additional tuning parameters, the number and range of the set of
459
temperatures $\{T_i\}$, we will discuss their significance when relevant.
460

461
\subsection{Parallel tempering: estimating the Bayes factor}
462
\comment{Greg: I don't think we actually need this since we aren't using the Bayes factor}
463
464
In addition, parallel-tempering also offers a robust method to estimate the
Bayes factor of Equation~\eqref{eqn_bayes_over_F}. If we define
465
$\beta\equiv1/T$, the inverse temperature and $Z(\beta)\equiv \Bsn(x| \AmplitudePrior, I)$, then as noted by \citet{goggans2004} for the general case, we may
466
467
468
469
write
\begin{align}
\frac{1}{Z} \frac{\partial Z}{\partial \beta}=
\frac{
470
471
\int \Bsn(x| \AmplitudePrior, \blambda)^{\beta}
\log(\Bsn(x| \AmplitudePrior, \blambda))P(\blambda| I)
472
d\blambda
473
474
}
{
475
\int \Bsn(x| \AmplitudePrior, \blambda)^{\beta})P(\blambda| I)
476
d\blambda
477
478
479
480
481
}
\end{align}
The right-hand-side expresses the average of the log-likelihood at $\beta$. As
such, we have
\begin{align}
Gregory Ashton's avatar
Gregory Ashton committed
482
\frac{\partial \log Z}{\partial \beta} =
483
\langle \log(\Bsn(x| \AmplitudePrior, \blambda) \rangle_{\beta}
484
\end{align}
485
The log-likelihood values are calculated during the MCMC sampling. As such, one
486
487
can numerically integrate to get the Bayes factor, i.e.
\begin{align}
488
489
\log \Bsn(x| \AmplitudePrior, I) = \log Z = \int_{0}^{1}
\langle \log(\Bsn(x| \AmplitudePrior, \blambda) \rangle_{\beta} d\beta.
490
\end{align}
491
In practice, we use a simple numerical quadrature over a finite ladder of
492
$\beta_i$ with the smallest chosen such that choosing a smaller value does not
493
494
change the result beyond other numerical uncertainties. Typically, getting
accurate results for the Bayes factor requires a substantially larger number of
495
temperatures than are required for efficiently sampling multi-modal
496
distributions.  Therefore, it is recommended that one uses a few
497
498
499
500
temperatures during the search stage, and subsequently a larger number of
temperatures (suitably initialised close to the target peak) when estimating
the Bayes factor.

Gregory Ashton's avatar
Gregory Ashton committed
501
\subsection{Evaluating the search space}
502
503
504
505

In general, MCMC samplers are highly effective in generating samples of the
posterior in multi-dimensional parameter spaces. However, they will perform
poorly if the posterior has multiple small maxima with only a small number of
Gregory Ashton's avatar
Gregory Ashton committed
506
507
508
large maxima which occupy a small fraction of the prior volume. Later on, in
Section~\ref{sec_topology} we will show that for a signal in Gaussian noise our
log-likelihood $\F$ will have a single peak due to the signal which is
509
510
511
surrounded by other smaller peaks particular to given noise realisation. The relative
size of the these peaks to the signal peak depends on the SNR.  However,
regardless of the strength of the signal peak, if the width of the signal peak
Gregory Ashton's avatar
Gregory Ashton committed
512
513
514
515
516
517
is small compared to the prior volume, the sampler will get `stuck' on the
local maxima and be inefficient at finding the global maxima.  This problem is
exacerbated in higher-dimensional search spaces where the volume fraction of
the signal scales with the exponent of the number of dimensions.

\subsubsection{The metric}
518
519
520

In a traditional CW search which uses a grid of templates (also known as a
template bank), the spacings of the grid are chosen such that the loss of
521
SNR is bounded to less than $\mu$, the
Gregory Ashton's avatar
Gregory Ashton committed
522
523
524
525
526
527
\emph{template-bank mismatch}. The optimal choice of grid then consists of
minimising the computing cost while respecting this minimum template-bank
mismatch or vice-verse (for examples see \citet{pletsch2010, prix2012,
wette2013, wette2015}). We will now discuss how the work on setting up these
grids can be applied to the problem of determining whether the setup is
appropriate for an MCMC method: i.e. given the prior volume do we expect a
528
signal to occupy a non-negligible fraction of the volume.
529
530
531

For a fully-coherent $\F$-statistic search on data containing Gaussian noise
and a signal with Doppler parameters $\blambdaSignal$, the template-bank
Gregory Ashton's avatar
Gregory Ashton committed
532
mismatch at the grid point $\blambda_{l}$ we define as
533
\begin{align}
Gregory Ashton's avatar
Gregory Ashton committed
534
\mutilde(\blambdaSignal, \blambda_{l}) \equiv 1 -
535
536
537
538
539
\frac{\tilde{\rho}(\blambda_l;\blambdaSignal)^{2}}
{\tilde{\rho}(\blambdaSignal; \blambdaSignal)^{2}},
\end{align}
where $\tilde{\rho}(\blambda_l; \blambdaSignal)$ is the non-centrality
parameter (c.f. Equation~\ref{eqn_twoF_expectation}) at $\blambda_l$, given
Gregory Ashton's avatar
Gregory Ashton committed
540
541
542
that the signal is at $\blambdaSignal$; as such $\tilde{\rho}(\blambdaSignal;
\blambdaSignal)$ is the perfectly-matched non-centrality parameter.  For a
fully-coherent search, this non-centrality parameter is equivalent to
543
fully-coherent matched-filter SNR. However, as noted by
Gregory Ashton's avatar
Gregory Ashton committed
544
545
546
\citet{leaci2015}, this is true for the fully-coherent case only.  Therefore,
we choose to use the non-centrality parameter which easily generalises to the
semi-coherent case.
547

548
To make analytic calculations of the template-bank mismatch possible, as first shown by
549
550
\citet{brady1998}, the mismatch can be approximated by
\begin{equation}
Gregory Ashton's avatar
Gregory Ashton committed
551
\mutilde(\blambda, \Delta\blambda) \approx
552
\tilde{g}_{ij}^\phi \Delta\lambda^{i}\Delta\lambda^{j}
553
+ \mathcal{O}\left(\Delta\blambda^{3}\right)
Gregory Ashton's avatar
Gregory Ashton committed
554
\label{eqn_mutilde_expansion}
555
\end{equation}
Gregory Ashton's avatar
Gregory Ashton committed
556
557
where we use index notation with an implicit summation over repeated indices.
Here, $\tilde{g}_{ij}^{\phi}$ is the \emph{phase-metric} given by
558
\begin{align}
559
\tilde{g}_{ij}^\phi (\blambda) =
Gregory Ashton's avatar
Gregory Ashton committed
560
561
562
\langle
\partial_{\Delta\lambda^{i}}\phi
\partial_{\Delta\lambda^{j}}\phi
563
564
\rangle
-
Gregory Ashton's avatar
Gregory Ashton committed
565
566
\langle
\partial_{\Delta\lambda^{i}}\phi
567
568
\rangle
\langle
Gregory Ashton's avatar
Gregory Ashton committed
569
\partial_{\Delta\lambda^{j}}\phi
570
571
572
573
574
575
\rangle,
\label{eqn_metric}
\end{align}
where $\langle \cdot \rangle$ denotes the time-average over $\Tcoh$ and
$\phi(t; \blambda)$ is the phase evolution of the source. The phase metric is
in fact an approximation of the full metric which includes modulations of the
Gregory Ashton's avatar
Gregory Ashton committed
576
577
578
amplitude parameters $\A$. It was shown by \citet{prix2007metric} that
when using data spans longer than a day and data from multiple detectors, the
difference between phase metric and full metric is negligible.
579

580
The phase metric, Equation~\eqref{eqn_metric} provides the necessary tool to
581
measure distances in the Doppler parameter space in units of mismatch. To
Gregory Ashton's avatar
Gregory Ashton committed
582
calculate components, we define the phase evolution
583
584
585
586
of the source as \citep{wette2015}
\begin{align}
\phi(t; \blambda) \approx 2\pi\left(
\sum_{s=0}^{\smax} f^{(s)}\frac{(t-\tref)^{s+1}}{(s+1)!}
Gregory Ashton's avatar
Gregory Ashton committed
587
+ \frac{r(t)\cdot\mathbf{n}(\alpha, \delta)}{c} \fmax\right),
588
589
590
\label{eqn_phi}
\end{align}
where $\mathbf{n}(\alpha, \delta)$ is the fixed position of the source with
Gregory Ashton's avatar
Gregory Ashton committed
591
592
593
594
respect to the solar system barycenter (with coordinates $\alpha$ and $\delta$,
the right ascension and declination), $f^{(s)}\equiv d^{s}\phi/dt^s$, and
$\fmax$ a constant approximating $f(t)$, chosen conservatively to be the
maximum frequency over the data span \citep{wette2013}.
595
596
597

The frequency and spin-down components of the metric can be easily calculated
due to their linearity in Equation~\eqref{eqn_phi} and for the special case in
Gregory Ashton's avatar
Gregory Ashton committed
598
599
which $\tref$ is in the middle of the data span, the phase-evolution metric
with $\smax{=}1$ (i.e. the frequency and spin-down components) are given by
Gregory Ashton's avatar
Gregory Ashton committed
600
\meta{This is not explained properly - just link to a previous paper (I used Eqn (33) of RP's frequency metrix notes}
Gregory Ashton's avatar
Gregory Ashton committed
601
602
603
604
605
606
607
608
609
\begin{equation}
\tilde{g}^{\textrm{PE}}_{ij} \equiv
\left[
\begin{array}{cc}
\frac{\pi^{2}\Tcoh^{2}}{3} & 0 \\
0 & \frac{4\pi^{2}\Tcoh^{4}}{45}
\end{array}
\right]
\textrm{ for } i, j \in [f, \dot{f}].
610
\end{equation}
Gregory Ashton's avatar
Gregory Ashton committed
611
612
613
614
615
616
617
618
619
620

Accurately approximating the sky components of the metric is non-trivial, but
was accomplished by \citet{wette2013} for the fully-coherent case. In
\citet{wette2015} it was shown how to calculate the equivalent semi-coherent
metric $\hat{g}_{ij}(\blambda, \Nseg)$; in the following, we
will work with this generalised semi-coherent method with the understanding
that $\hat{g}_{ij}(\blambda, \Nseg{=}1)=
\tilde{g}_{ij}(\blambda)$.

\subsubsection{The metric volume}
621
To understand the volume of parameter space which a true signal would occupy,
Gregory Ashton's avatar
Gregory Ashton committed
622
we can make use of the \emph{metric-volume}, defined as \citep{prix2007}
623
\begin{align}
Gregory Ashton's avatar
Gregory Ashton committed
624
625
626
\mathcal{V} = \int
\sqrt{\textrm{det}\hat{g}_{ij}(\blambda, \Nseg)} d\blambda \approx
\sqrt{\textrm{det}\hat{g}_{ij}(\blambda, \Nseg)} \Delta\blambda
627
628
\end{align}
where in the second step we assume a constant coefficient metric. Here, $\Delta
Gregory Ashton's avatar
Gregory Ashton committed
629
\blambda$ is the volume element which is given by
630
\begin{equation}
Gregory Ashton's avatar
Gregory Ashton committed
631
\Delta\lambda = \frac{\cos(\delta_0)\Delta\delta\Delta\alpha}{2}
632
633
634
%\frac{1}{2}\sin\delta\Delta\delta\Delta\alpha
\prod_{s=0}^{\smax} \Delta f^{(s)},
\end{equation}
Gregory Ashton's avatar
Gregory Ashton committed
635
636
637
638
639
where the numerator of the prefactor is the solid angle of the sky-patch at a
declination of $\delta_0$, the factor of $1/2$ comes from converting the
normalised determinant (which is computed over the whole sky) to the solid angle
of the directed search, and $\Delta f^{(s)}$ is the extent of the frequency and
spin-down range(s) searched.
640
641

The metric volume $\V$ is the approximate number of templates required to cover
642
given Doppler parameter volume at a fixed mismatch of $\approx 1$. As
643
644
such, its inverse gives the approximate (order of magnitude) volume fraction of
the search volume which would be occupied by a signal. This can therefore be
645
646
used as a proxy for determining if an MCMC search will operate in a regime where
it is efficient (i.e. where the a signal occupies a reasonable fraction of the
647
648
649
search volume).

The volume $\V$ combines the search volume from all search dimensions. However,
650
let us now discuss how we can delve deeper to understand how each dimension
651
652
653
654
655
contributes to the total product. This is done by noticing that the metric has
a particular block form:
\begin{align}
g_{ij} = \left[
\begin{array}{cc}
Gregory Ashton's avatar
Gregory Ashton committed
656
657
g^{\textrm{Sky}} & 0 \\
0 & g^{\textrm{PE}}.
658
659
660
661
662
663
664
665
666
667
668
\end{array}
\right]
\end{align}
where $g^{\rm Sky}$ is the $2\times2$ sky-metric, while $g^{\rm PE}$ is the
$(\smax{+}1){\times}(\smax{+}1)$ phase-evolution metric.
As such, the volume can be decomposed as
\begin{align}
\mathcal{V} & =
\sqrt{\textrm{det}g^{\rm Sky}}\frac{\Delta\Omega}{2} \times
\sqrt{\textrm{det}g^{\rm PE}}\prod_{s=0}^{\smax}\Delta f^{(s)} \\
& = \Vsky \times \Vpe.
669
\label{eqn_metric_volume}
670
671
672
673
674
675
676
677
678
\end{align}
Moreover, if $\tref$ is in the middle of the observation span, the diagonal
nature of $g^{\rm PE}$ means that one can further identify
\begin{align}
\Vpe = \prod_{s=0}^{\smax}\sqrt{g^{\rm PE}_{ss}} \Delta f^{(s)}
= \prod_{s=0}^{\smax}\Vpe^{(s)}
\end{align}
This decomposition may be useful in setting up MCMC searches.

Gregory Ashton's avatar
Gregory Ashton committed
679
680
681
682
683
\subsection{The topology of the likelihood}
\label{sec_topology}

We intend to use the $\F$-statistic as our log-likelihood in MCMC simulations,
but before continuing, it is worthwhile to acquaint ourselves with the typical
Gregory Ashton's avatar
Gregory Ashton committed
684
685
686
687
688
689
690
691
692
693
694
695
behavior of the log-likelihood by considering a specific example.  In
particular, let us consider the frequency-domain `topology' of the $\twoFtilde$
likelihood that our MCMC simulations will need to explore.

We simulate a signal with $h_0=$\GridedFrequencyhzero\; in data lasting
\GridedFrequencyT~days with $\sqrt{S_{x}}=$\GridedFrequencySqrtSx.  In
Figure~\ref{fig_grid_frequency}, we plot the numerically computed $\twoFtilde$
for this data as a function of the template frequency (all other parameters are
held fixed at the simulation values). Three clear peaks can be observed: a
central peak at $f_0$ (the simulated signal frequency), and two `Doppler lobes'
at $f_0\pm 4/(1\textrm{-day})$. \comment{Need to understand the cause of
these}.
Gregory Ashton's avatar
Gregory Ashton committed
696
697

As shown in Equation~\eqref{eqn_twoF_expectation}, the expectation of
Gregory Ashton's avatar
Gregory Ashton committed
698
699
700
701
702
$\widetilde{2\F}$ is $4$ in Gaussian noise alone, but proportional to the
square of the non-centrality parameter (or SNR) in the presence of a signal.
This behaviour is exactly seen in Figure~\ref{fig_grid_frequency}: between the
peaks one finds noise fluctations.

Gregory Ashton's avatar
Gregory Ashton committed
703
704
\begin{figure}[htb]
\centering \includegraphics[width=0.45\textwidth]{grided_frequency_search_1D}
Gregory Ashton's avatar
Gregory Ashton committed
705
706
\caption{Numerically computed $\twoFtilde$ for a
simulated signal with frequency $f_0$ in Gaussian noise.}
Gregory Ashton's avatar
Gregory Ashton committed
707
708
709
\label{fig_grid_frequency}
\end{figure}

Gregory Ashton's avatar
Gregory Ashton committed
710
711
712
713
714
715
716
717
718
719
720
721
722
723
The widths of the peaks can be estimated by using the metric, inverting
Eqn.~\ref{eqn_mutilde_expansion} for the frequency component alone one finds
that $\Delta f \approx \sqrt{3}/{\pi\Tcoh}$ (taking a mismatch of 1 as a proxy
for the peak width). In Figure~\ref{fig_grid_frequency} the coherence time is
short enough that the width of the peak is about an order of magnitude smaller
than the Doppler window $\sim 1/(1\mathrm{-day}$. In general, CW searches are
performs on much longer stretches of data and therefore the peak width will
be much narrower.

When running an MCMC simulation we must therefore be aware that in addition to
the signal peak, the likelihood can contain multiple modes which may be either
noise fluctuations, secondary peaks of the signal, or the signal peak itself.


724
\subsection{Example: MCMC optimisation for a signal in noise}
725
726
727

In order to familiarise the reader with the features of an MCMC search, we will
now describe a simple directed search (over $f$ and $\dot{f}$) for a simulated
728
729
signal in Gaussian noise. The signal will have a frequency of
$\BasicExampleFzero$~Hz and a spin-down of $\BasicExampleFone$~Hz/s, all other
730
731
732
733
Doppler parameters are considered known. Moreover, the signal has an amplitude
$\BasicExamplehzero$~Hz$^{-1/2}$ while the Gaussian noise has
$\sqrt{\Sn}=\BasicExampleSqrtSn$~Hz$^{-1/2}$ such that the signal has a
\emph{depth} of $\sqrt{\Sn}/h_0 = \BasicExampleDepth$.
734

735
First, we must define a prior for each search parameter Typically, we recommend
736
either a uniform prior bounding the area of interest, or a normal distribution
737
738
739
740
centered on the target and with some well defined width. In defining the prior
distributions, one must ensure
that the MCMC simulation has a reasonable chance at finding a peak. To do so,
we can use the corresponding metric-volume given in
741
Equation~\eqref{eqn_metric_volume}. For this example, we will use a uniform
742
743
744
745
746
prior with a frequency range of $\Delta f=\BasicExampleDeltaFzero$~Hz and a
spin-down range of $\Delta \fdot=\BasicExampleDeltaFone$~Hz/s both centered on
the simulated signal frequency and spin-down rate. We set the reference time to
coincide with the middle of the data span, therefore the metric volume can be
decomposed into the frequency contribution and spin-down contribution:
747
748
frequency,
\begin{align}
749
\Vpe^{(0)} = \frac{(\pi\Tcoh\Delta f)^2}{3} \approx \BasicExampleVFzero
750
751
752
\end{align}
and
\begin{align}
753
\Vpe^{(1)} = \frac{4(\pi\Delta \fdot)^2\Tcoh^{4}}{45} \approx \BasicExampleVFone
754
\end{align}
755
756
757
such that $\V\approx\BasicExampleV$ (note that $\Vsky$ does not contribute
since we do not search over the sky parameters). This metric volume indicates
that the signal will occupy about 1\% of the prior volume, therefore the MCMC
758
is expected to work efficiently. Alternative priors will need careful thought about how to
759
translate them into a metric volume: for example using a Gaussian one could use
760
761
762
the standard deviation as a proxy for the allowed search region.

In addition to defining the prior, one must also consider how to
763
764
765
766
767
768
769
770
771
772
\emph{initialise} the walkers of the ensemble sampler. If the prior genuinely
represents the stated prior knowledge, the usual solution is to initialise the
walkers from the prior: that is the starting position is drawn from the prior.
However, instances do occur when one would like to initialise the walkers from
a different distribution. For example, if one only needs to estimate the
evidence (given a particular prior), but is aware from previous searches that
the only significant peak lies in a small area of parameter space, one could
initialise the walkers in a small cluster close to that area. In this example,
we initialise the walkers from the prior such that they have the chance to
explore the entire prior volume.
773
774
775
776
777
778
779

Having defined the prior, the final setup step is to define the number of
\emph{burn-in} and \emph{production} steps the sampler should take and the
number of walkers; this is a tuning parameter of the MCMC algorithm. The number
of walkers should be typically a few hundred, the greater the number the more
samples will be taken resulting in improved posterior estimates. The burn-in
steps refers to an initial set of steps which are discarded as they are taken
780
whilst the walkers converge. After they have converged the steps are known as
781
782
783
784
785
786
production steps since they are used to produce posterior estimates and
calculate the marginal likelihood.

Using these choices, the simulation is run. To illustrate the full MCMC
process, in Figure~\ref{fig_MCMC_simple_example} we plot the progress of all
the individual walkers (each represented by an individual line) as a function
Gregory Ashton's avatar
Gregory Ashton committed
787
788
789
790
791
792
793
794
795
796
of the total number of steps. The portion of steps before the dashed vertical
line are ``burn-in'' samples and hence discarded, from this plot we see why:
the walkers are initialised from the uniform prior and initially spend some
time exploring the whole parameter space before converging. The fact that they
converge to a single unique point is due to the strength of the signal
(substantially elevating the likelihood above that of Gaussian fluctuations)
and the tight prior which was quantified through the metric volume $\V$. The
``production'' samples, those after the black dashed vertical line, can be used
to generate posterior plots. In this instance, the number of burn-in and
production samples was pre-specified by hand.
797
798
\begin{figure}[htb]
\centering
Gregory Ashton's avatar
Gregory Ashton committed
799
\includegraphics[width=0.45\textwidth]{fully_coherent_search_using_MCMC_walkers}
800
801
802
\caption{The progress of the MCMC simulation for a simulated signal in Gaussian
noise, searching over the frequency and spin-down. The upper two panels show
the position of all walkers as a function of the number of steps for the
Gregory Ashton's avatar
Gregory Ashton committed
803
804
frequency and spin-down; samples to the left of the dashed vertical line are discarded as
burn-in (the first 100 steps), while the rest are used
805
as production samples.}
806
\label{fig_MCMC_simple_example}
807
808
\end{figure}

809
\subsection{Testing convergence}
810
811
812
813
814
815
816
817
818
819

In Figure~\ref{fig_MCMC_simple_example}, it is easy to see by eye that that the
MCMC simulation has converged during the burn-in. However, choosing the number
of steps, balancing the extra computing cost against ensuring the chains have
converged, is often a matter of trial and error. So, while graphical methods
for assesing convergence are the most robust and informative, quantitative
methods (diagnostics) also exist to test convergence and can therefore provide
the utility to prematurely stop the burn-in period. As such one can then simply
choose a large number of steps (as compared to a typical burn-in for the given
problem) and expect that in most cases these will not all be used. Moreover, if
820
a large number of searches are to be run such that inspection of each MCMC
821
822
823
simulation is impossible, the convergence tests can easily be checked
numerically.

824
\subsubsection{The Gelman \& Rubin statistic}
825
826
827
828
829
830
831
We will use the ``Gelman \& Rubin statistic" \citep{gelman1992} to assess
convergence. This method, developed in the context of single-walker samplers
(i.e. not ensemble samplers) requires multiple independent simulations to be
run, each initialised from an overdispersed estimate of the target
distribution. The statistic, the \emph{potential scale reduction factor}
(PSRF), is calculated for each scalar quantity of interest and estimates the
amount by which the scale factor of the target distribition (approximated by a
832
Student-t distribution) might shrink if the simulation where run indefinitely.
833
834
835
836
837
838
839
840
841
This method makes a normality assumption on the posterior distribution, which
in testing appears to be reasonble for all parameters except the transient
start time and durations (discussed later in Sec.~\ref{sec_transients}).
However, paraphrasing \citet{brooks1998}, Gelman and Rubin's approach is
generally applicable to any situation where the posterior inference can be
summarized by the mean and variance (or standard deviation), even if the
posterior distributions are not themselves believed to be normally distributed.

\comment{Note I have not yet fully implemented the method, for simplicity I'm
842
         assuming df $\rightarrow$ infty}
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860

In this setting, we depart from the original description by using the walkers
of the ensemble sampler in place of independent MCMC simulations. Moreover,
this is done periodicically checking the last $n$ steps of the simulation. As
such, testing convergence requires only a small overhead in calculating the
variances: there is no requirement to run multiple independent simulations.
At each periodic check, the statistic is computed for each model parameter
and if all model parameters are less than the threshold (with a default value
of 1.2), the simulation is considered converged. However, the burn-in period is
only prematurely stopped if convergence is reached a pre-specific number of
times (the default value is 10). This is a conservative system to prevent
misdiagnosis of convergence.

While in testing this method was found to agree well with graphical
determination convergence, but we echo the conclusions of \citet{cowles1996markov}
that no convergence diagnostic can say with certainty that the posterior
samples are truly representative of the underlying stationary distribution.

861
\subsubsection{Example}
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891

In Figure~\ref{fig_MCMC_convergence_example}, we illustrate our implementation
of the Rubin-Gelman statistic for a fully-coherent search of data including
a strong continuous signal. In essense this repeats the simulation in
Figure~\ref{fig_MCMC_simple_example}, but we have expanded the prior ranges
on $f$ and $\fdot$. As a result, the sampler takes many more steps to reach
convergence: it can be seen by eye that the traces of a few walkers remain
isolated from the bulk of walkers. When this is the case, the PSRF is much
larger than unity, since the between-walker variance is much larger than the
within-walker variance. Once that walker joins the main bulk, the PSRF tends
to unity for several periods and hence the burn-in period is prematurely
halted. The sampler is then continued, with all samples saved as production
samples. In the figure, the gap indicates the `missing' samples which would
have been taken if convergence had not been obtained before the end of the
maximum burn-in period.
\begin{figure}[htb]
\centering
\includegraphics[width=0.5\textwidth]{fully_coherent_search_using_MCMC_convergence_walkers}
\caption{Demonstration of the Potential Scale Reduction Face (PSRF) convergence
statistic (in blue) calculated separately for each parameter: values below 1.2
are considered converged.}
\label{fig_MCMC_convergence_example}
\end{figure}

It is worth noting that this type of long convergence, when a small number of
chains remain isolated from the main bulk can be easily countered by the use
of parallel tempering (see Sec.~\ref{sec_parallel_tempering}). This was intenionally
not used here to highlight the utility of the convergence statistic.


892
\section{Zoom follow-up}
893
894
895
896
\label{sec_follow_up}

Incoherent detection statistics trade significance (the height of the peak) for
sensitivity (the width of the peak). We will now discuss the advantages of
Gregory Ashton's avatar
Gregory Ashton committed
897
898
899
using an MCMC sampler to follow-up a candidate found incoherently, decreasing
the number of segments (of increasing
the coherence time) until finally estimating its parameters and significance
900
fully-coherently. We begin by rewriting Equation~\eqref{eqn_lambda_posterior},
901
the posterior distribution of the Doppler parameters, with the explicit
Gregory Ashton's avatar
Gregory Ashton committed
902
dependence on the $\Nseg$:
903
\begin{equation}
904
905
P(\blambda | \Tcoh, x, \AmplitudePrior, \Hs, I)
%=\Bsn(x| \Tcoh, \AmplitudePrior, \blambda) P(\blambda| \Hs I).
Gregory Ashton's avatar
Gregory Ashton committed
906
\propto e^{\hat{\F}(x| \Nseg, \blambda)} P(\blambda| \Hs I).
907
908
\end{equation}

Gregory Ashton's avatar
Gregory Ashton committed
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
Introducing the coherence time $\Nseg$ as a variable provides a free parameter
which adjusts the width of signal peaks in the likelihood. Given a candidate
found from a semi-coherent search with $\Nseg^{0}$ segments, , a natural way to
perform a follow-up is to start the MCMC simulations with $\Nseg^{0}$ segments
(such that the signal peak occupies a substantial fraction of the prior volume)
and then subsequently incrementally decrease the number of segments in a
controlled manner. The aim being at each decrease in the number of segments to
allow the MCMC walkrs to converge to the narrower likelihood before again
decreasing the number of segments. Ultimately, this process continuous until
$\Nseg^{\Nstages}=1$, where $\Nstages$ is the total number of stages in  a
ladder of segment-numbers $\Nseg^{j}$. Choosing the ladder of segment-numbers
is a tuning balancing computation time against allowing sufficient
time for the MCMC walkers to converge at each stage so that the signal is not
lost.

Before discussing how to choose the ladder, we remain that in some ways, this
method bears a resemblance to `simulated annealing', a method in which the
likelihood is raised to a power (the inverse temperature) and subsequently
`cooled' \comment{Add citation?}. The important difference being that the
semi-coherent likelihood is wider at short coherence times, rather than flatter
as in the case of high-temperature simulated annealing stages. For a discussion
and examples of using simulated annealing in the context of CW searches see
\citet{veitch2007}.

For an efficient zoom follow-up, we want to choose the ladder of
segments-numbers to ensure that
935
the metric volume at the $i^{th}$ stage $\V_i \equiv \V(\Nseg^i)$ is a constant
936
937
938
939
940
fraction of the volume at adjacent stages. That is we define
\begin{equation}
\mathcal{R} \equiv \frac{\V_i}{\V_{i+1}},
\end{equation}
where $\mathcal{R} \ge 1$ as $\Tcoh^{i+1} > \Tcoh^{i}$.
941

942
943
Assuming then that the candidate comes with a suitably small prior bounding box,
such that $\V(\Nseg^{0}) \sim \mathcal{O}(100)$, we define $\Nseg^0$ as the
Gregory Ashton's avatar
Gregory Ashton committed
944
number of segments used in the search which found the candidate.
945
946

We can therefore define a minimisation problem for $\Nseg^{i+1}$ given $\Nseg^{i}$:
947
948
949
950
951
952
953
954
\begin{align}
\min_{\Nseg^{i+1} \in \mathbb{N}}
\left| \log\mathcal{R} + \log\V(\Nseg^{i+1}) - \log\V(\Nseg^i)\right|
\end{align}
which can be solved numerically to a specified tolerance. Due to the difficulty
in solving such a minimisation with the integer constrain on $\Nseg^{i+1}$ we
simply solve it as a real scalar, and then round to the nearest integer. We now
have a method to generate a ladder of $\Nseg^{i}$ which keep the ratio of
955
volume fractions fixed.
956

957
958
959
960
961
We run the minimisation problem, given a value of $\mathcal{R}$, iterating
until $\Nseg^{\Nstages} = 1$ at which point we stop: we have generated the
ladder of $\Nseg^i$ given a single tuning parameter, $\mathcal{R}$. Note that
$\Nstages$ is not known prior to runing the iterative minimization probelm, and
will in general depend on both the initial search prior bounding box and setup.
962
963
964
965
966
967
968

\subsection{Example}

We now provide an illustrative example of the follow-up method. We consider a
directed search over the sky position and frequency in 100 days of data from a
single detector, with $\sqrt{\Sn}=10^{-23}$~Hz$^{-1/2}$ (at the fiducial
frequency of the signal). The simulated signal has an amplitude
969
$h_0=2\times10^{-25}$ such that the signal has a depth of $\sqrt{\Sn}/h_0=50$
970
971
972
in the noise.

First, we must define the setup for the run. Using $\mathcal{R}=10$ and
973
$\V^{\rm min}=100$ our optimisation procedure is run and proposes the setup
974
layed out in Table~\ref{tab_follow_up_run_setup}. In addition, we show the
975
number of MCMC steps taken at each stage.
976
\begin{table}[htb]
977
\caption{The search setup used in Figure~\ref{fig_follow_up}, generated with
978
$\mathcal{R}=10$ and $\Nseg^0=100$.}
979
\label{tab_follow_up_run_setup}
980
\input{follow_up_run_setup}
981
982
\end{table}

983
The choice of $\mathcal{R}$ and $\V^{\rm min}$ is a compromise between the
984
985
986
987
988
total computing time and the ability to ensure a candidate will be identified.
From experimentation, we find that $\V^{\rm min}$ values of 100 or so are
sufficient to ensure that any peaks are sufficiently broad during the
initial stage. For $\mathcal{R}$ value much larger than $10^{3}$ or so where
found to result in the MCMC simulations `loosing' the peaks between stages, we
989
conservatively opt for 10 here, but values as large as 100 where also successful.
990
991

In Figure~\ref{fig_follow_up} we show the progress of the MCMC sampler during
992
the follow-up.  As expected from Table~\ref{tab_follow_up_run_setup}, during
993
994
995
the initial stage the signal peak is broad with respect to the size of the
prior volume, therefore the MCMC simulation quickly converges to it. Subsequently,
each time the number of segments is reduced, the peak narrows and the samplers
996
similarly converge to this new solution. At times it can appear to be inconsistent,
997
however this is due to the changing way that the Gaussian noise adds to the signal.
998
Eventually, the walkers all converge to the simulated signal.
999
\begin{figure}[htb]
Gregory Ashton's avatar
Gregory Ashton committed
1000
\centering
For faster browsing, not all history is shown. View entire blame