\documentclass[journal]{IEEEtran} \usepackage{amsmath} \usepackage{amsfonts} \usepackage{newalg} \usepackage{amssymb} \usepackage{graphicx} \begin{document} \newcommand{\code}[1]{\texttt{#1}} \title{A Simple Implementation of the Condensation Algorithm} \author{Aly~Merchant,~Keir~Mierle } \maketitle \begin{abstract} We implemented a simple contour tracker using the condensation algorithm. The tracker consists of two main parts. First, there is the observer which takes an estimate of the current shape position, and uses image features to find a new spline. The new spline is projected onto a subspace of Euclidean similarities of a spline template. The second part is the particle filter, the heart of the condensation algorithm. It estimates the posterior density of the shape being tracked by modeling a non-Gaussian distribution as a collection of weighted samples (also known as factored sampling). Our implementation differs from the implementation presented by Blake et al by 1) simplifying the dynamic model prediction to a simple uniform diffusion, 2) replacing B-Splines with Catmull-Rom splines, and 3) tracking only control points (rather than considering spline shape). The tracker uses the OpenCV library from Intel to take care of low-level video processing and acquisition. It successfully tracks the the authors head in a video stream from a webcam in realtime, despite the low resolution and low frame rate of the webcam. \end{abstract} \section{Introduction} \hfill May 14th, 2005 \\ \PARstart{T}{he} condensation algorithm\cite{isard98condensation} approximates the posterior density of the state of a tracked object. In other words, it tracks objects (represented as spline curves, parameterized as low-dimensional vectors) in significant clutter. In this paper, we describe a simple implementation of the condensation algorithm using Intel's OpenCV library. First a description of the condensation algorithm is presented, followed by a description of how to use our software. Then the details of how we implemented condensation are presented, and finally a discussion of the results. \section{The \textsc{Condensation} Algorithm} The condensation algorithm is a stochastic framework for tracking curves in clutter. Algorithms such as condensation and the bootstrap algorithm (presented in the seminal work by Gordon et al in \cite{gordon93}) are considered part of a class of algorithms, called \emph{Sequential Monte Carlo Methods}\cite{smchome}. In SMC, it is possibly to avoid making assumptions about the distribution of the model by approximating the distributions as a collection of weighted samples. In condensation, the idea is to track objects, represented by parameterized spline curves, in substantial clutter at video frame rate. Furthermore, the algorithm is computationally efficient enough to run in realtime on modest hardware. At the heart of the condensation algorithm is a particle filter which performs three steps for each iteration: selection, prediction and measurement. Each step is detailed in the following sections. \subsection{Selection} The first step of condensation is to form a set of potential states for evaluation. The algorithm tracks the current state as a multi-modal probability distribution. It samples from this distribution using a method called {\em factored sampling}. This sampling method focuses on the peaks of the current distribution as it propagates the distribution forward in time. Often, states which have a large weight will be sampled multiple times and other states will be discarded. Each state will be processed by the prediction and observation steps, which will predict the state's motion and reassess its value according to measurement respectively. \begin{figure} \centering \includegraphics[angle=-90,width=3.45in]{figs/propagation} \caption{The state density is propagated over time according to the steps shown above. In the deterministic drift step, the dynamic model predicts the location of samples. Stochastic diffusion moves the density around according to process noise, also from the dynamic model. Finally, before continuing to the next time step, image measurements tighten the output from the dynamics step. The graphic is from \cite{isard98condensation}. } \label{density} \end{figure} \subsection{Prediction} The prediction step attempts to evaluate how the state of an object will change by feeding it through a dynamic model. The dynamic model serves two purposes: imposing deterministic drift to the state density, and spreading the state density by stochastic diffusion. As an intuitive example, consider a baseball thrown through the air in front of the camera. If we know the current position and velocity, we can fairly accurately predict the balls next position according to standard Newtonian dynamics. This is the deterministic drift. However, it's likely that there is some wind disturbance, or maybe the ball is spinning which affects its air resistance, introducing an unknown component called the process noise. This is the stochastic diffusion. See figure \ref{density} for a visual representation of the process. If the dynamic model predicts the motion correctly, the algorithm will ``know where to look'' which simplifies the measurement step. This is represented by how tight the probability distribution remains after the diffusion step. \subsection{Measurement} The measurement step of condensation attempts to reconcile possible states generated by the prediction step with actual data. Essentially, there are two steps. \begin{enumerate} \item Gather data, usually by evaluating features in an image such as edges. \item Using the analyzed data (i.e. comparing features), determine the importance of each potential state. \end{enumerate} The importance of each state affects the probability distribution. If features match with the data then the measure step will likely correct for any diffusion from the prediction step, leaving tight distributions. As another intuitive example, consider observing a moving car. If the car is unobscured and not surrounded by similar cars, it is very easy to predict the position based soley on observation (ie, just by looking). If the car passes through a tunnel, predicting its position becomes more challenging. It is possible to estimate the cars next position based on its previous velocity, but not nearly as accurately because the car is no longer visible. In condensation, the first example is handled by the observation step, while the second is handled by the prediction step. The effects of each step (selection, prediction, and measurement) on the state density is shown in figure \ref{density}. \section{How to use the software} First, the compile the software. The only dependency is on the Intel OpenCV libraries, which are available from SourceForge\cite{opencv}. In order to get the video capture working with Linux, it is necessary to download the latest version of OpenCV from CVS and compile manually, rather than the an official release. Instructions on how to do this are on the OpenCV website\cite{opencv}. \begin{figure} \centering \includegraphics[width=1.7in]{figs/nospline} \includegraphics[width=1.7in]{figs/tracking} \caption{\emph{Left:} At startup, this is how the interface looks before a template spline is defined. The user must then click a series of points around the object to track to define a spline, followed by a right click to start tracking. \emph{Right:} After the user finishes drawing the spline, a right click displays the spline and starts tracking. The video feed is supplied either from a webcam or a video file.} \label{interface} \end{figure} To compile the software, run \code{scons}\footnote{SCons is a superior make replacement written in Python. See \cite{scons}.} in the source directory. After successful compilation, launch the software by running it from the command line: \code{./condensation}. A window will appear with a video feed from the webcam, looking like the left image in figure \ref{interface}. Notice there is no spline. It must be drawn around the object of interest (in this case, the author's head and shoulders) by clicking successive points along the curve. Unfortunately the spline is not visible while drawing it; just click points on the outline of the object anyway. Make sure the points are in succession. For best results, make the winding counterclockwise. When finished drawing the outline, right click. The outline spline with normals to the curve will appear; outline in green and normals in blue, as seen on the right in figure \ref{interface}. At this point, tracking is active. To switch to outline mode, shown in figure \ref{thresh}, left-click anywhere in the video. If a live video feed is not available, the software can be launched with a video file: \code{./condensation videosource.avi}. Then the keys are slightly different: left-click advances a frame, right-click places a control point, middle click starts tracking, and double-click starts/stops playing the video. \section{Implementation} \subsection{Particle Filter} The particle filter implemented in the class \code{Filter} outlines the basic steps of the condensation algorithm: selection, prediction, and measurement. The selection step is common to all condensation implementations; however, the prediction and observation steps rely on models which are specific to dynamics of the tracked object. The generic selection method is part of the \code{Filter::predict()} code. In the selection step, elements representing possible future states are sampled from a weighted list of possible present states. The weight of a sample determines the probability of selecting it for the next step. Initially the list is generated as multiple copies of a single state using \code{Filter::nsamples()}. Pseudo code for the method is shown in figure \ref{sampling}. \begin{figure} \[ \parbox{3.0in}{\hbadness3000 \begin{raggedright} $ S : \text{list of current states} $ \\ $ S' : \text{list of states for consideration} $ \\ $ C : \text{cumulative distribution of weights} $ \\ $ \textsc{Search}(A,b) : \text{find smallest $i$ such that}~A[i] \geq b $ \\ $ \textsc{Random}(a,b) : \text{uniform random number}~ \in [a,b) $ \\ $ $ \\ \end{raggedright} \begin{algorithm}{Sample}{} \begin{REPEAT} \\ r \= \CALL{Random}(0,1) \\ selected \= \CALL{Search}(C,r) \\ S' \= S' \cup S[selected] \end{REPEAT} \CALL{size}(S') = \CALL{size}(S) \end{algorithm} }\] \caption{The factored sampling algorithm} \label{sampling} \end{figure} The prediction step is also performed in \code{Filter::predict()}. To run the dynamic model it calls \code{Filter::moveSample()}, a virtual function which applies the dynamic model to predict a state's next location. Lastly, the observation model is called using a similar method, \code{Filter::makeCdf()}, which calls virtual function \code{Filter::fixWeight()} for each of the states selected for propagation. \code{fixWeight} must also adjust the weight of a possible future state based on the observation model. \code{makeCdf} also recreates the cumulative distribution as its name implies. \subsection{Representation of Splines} A generic spline can be represented as a multiplication of a spline basis vector (composed of spline basis functions) with a vector of control points. \begin{equation} \mathbf{r}(s) = \left(I_2 \otimes \mathbf{B}(s)^\top\right) \mathbf{Q} = \left( \begin{array}{cc} \mathbf{B}(s)^\top & \mathbf{0} \\ \mathbf{0} & \mathbf{B}(s)^\top \end{array} \right) \mathbf{Q} \end{equation} $\mathbf{Q}$ is the stacked $x$ and $y$ coordinates of the control points, $\mathbf{Q} = \left(x_0, x_1, x_2, ... x_n, y_0, y_1, y_2, ... y_n \right)^\top$ where the $x_i$ and $y_i$ are the coordinates of control point $\mathbf{p}_i$. $\mathbf{0}$ is a row vector of zeros. There are many choices for $\mathbf{B}(s)$. A typical example would be B-splines or Bezier curves. We chose Catmull-Rom splines\cite{catmull74}, which are simple cubic splines which interpolate according to two simple rules: the curve passes through all points, and the tangent to the curve at control point $\mathbf{p_i}$ is proportional to the vector connecting $\mathbf{p_{i-1}}$ and $\mathbf{p_{i+1}}$. In other words, if $\mathbf{r}(s) = \mathbf{p}_i$, then $\mathbf{r}'(s) \propto \mathbf{p_{i+1}} - \mathbf{p_{i-1}}$. In our code, the \code{Spline} class implements simple Catmull-Rom splines. For a detailed investigation of B-splines, see \cite{activecontours}. \subsection{Representation of Shape Space} \label{shaperep} It is very convenient to represent the outline of an object as a constrained spline, i.e. a deformed template, rather than any arbitrary spline configuration. This is necessary because otherwise, the spline would be distorted in ways that do not match the underlying object by image clutter. When the user completes entering the object contour as a series of control points, the centroid of the points is subtracted off each one to create a template centered around the origin. Then, we parameterize the splines according to: \begin{equation} \label{eqnexpand} \mathbf{Q} = W\mathbf{X} + \mathbf{Q}_0 \end{equation} where \begin{equation} W = \left( \begin{array}{cccr} \mathbf{1} & \mathbf{0} & \mathbf{Q}_0^x & -\mathbf{Q}_0^y \\ \mathbf{0} & \mathbf{1} & \mathbf{Q}_0^y & \mathbf{Q}_0^x \end{array} \right) \end{equation} and $\mathbf{1}$ and $\mathbf{0}$ are column vectors of 1's and 0's of length $n$, respectively. The shape space vector, $\mathbf{X} \in \mathbb{R}^2$, defines the low-dimensional parameterization of the template as the set of Euclidean similarities (translation, rotation and uniform scaling). The shape template, $\mathbf{Q}_0$, is defined according to \begin{equation} \mathbf{Q}_0 = \left(x_0, x_1, x_2, ... x_n, y_0, y_1, y_2, ... y_n \right)^\top \end{equation} where the $x_i$ and $y_i$ are the coordinates of the control points. To see how this parameterization works, imagine $\mathbf{X} = (x,y,0,0)^\top$. This represents a translation of the template by $(x,y)$. Scaling by factor $r$ is represented by $\mathbf{X} = (0,0,r-1,0)^\top$. Rotation about an angle $\theta$ is captured by $\mathbf{X} = (0,0,\cos\theta-1,\sin\theta)^\top$. The most important reason to parameterize spline templates as above is that the shape space is a proper vector space, which leads to a simple way of computing the closest shape space vector to an arbitrary spline. Note that there are many other parameterizations that are also proper vector subspaces; this is usually accomplished by extending $W$ to have more columns and expanding the dimension of $\mathbf{X}$. This is described in the next section. The method \code{SplineTemplate::setDerived()} takes the template stored internally in a \code{SplineTemplate} instance, generates a derived spline with equation (\ref{eqnexpand}), and stores it in the derived spline, also stored in the \code{SplineTemplate} instance. \subsection{Fitting spline templates} When making observations from the current frame, it is required to come up with the `closest' shape space vector $\mathbf{X}$ to the deformed spline produced by displacing control points to the nearest detected edge. The problem can be expressed as a minimization problem: \begin{equation} \min_{\mathbf{X}} = ||W\mathbf{X} + \mathbf{Q}_0 - \mathbf{Q}_f|| \end{equation} To solve this, we made an approximation. In Active Contours\cite{activecontours}, several algorithms are presented to solve this problem iteratively. Now, because the parameterization is a proper subspace of $W$, we can \emph{project} a deformed spline onto $W$ to get the `closest' shape space vector which represents the deformed spline. A bit of algebra reveals: \begin{equation} \label{eqnapprox} \mathbf{\hat{X}} = W^+\left(\mathbf{Q}_f - \mathbf{Q}_0\right) \end{equation} In Active Contours, $W^+$, the pseudo-inverse of the spline is defined as \begin{equation} W^+ = \left(W^\top \mathcal{U} W\right)^{-1}W^\top \mathcal{U} \end{equation} Now, the computation of $\mathcal{U}$, which defines the $\mathcal{L}_2$ norm over a spline curve (via $||\mathbf{Q}||^2 = \mathbf{Q}^\top \mathcal{U} \mathbf{Q}$), is not particularly pretty: \begin{equation} \mathcal{U} = I_2 \otimes \mathcal{B} = \left(\begin{array}{cc} \mathcal{B} & 0 \\ 0 & \mathcal{B} \end{array} \right) \end{equation} where the spline metric matrix, $\mathcal{B}$, is defined as \begin{equation} \mathcal{B} = \frac{1}{L}\int\limits^L_0 \mathbf{B}(s)\mathbf{B}(s)^\top ds \end{equation} and $\mathbf{B}(s)$ is the vector of spline basis functions. The authors made two observations, ultimately leading to setting $\mathcal{U} = I$, the identity matrix. First, in our implementation we used Catmull-Rom splines\cite{catmull74} rather than the traditional B-Splines. Catmull-Rom splines always pass through every control point, unlike B-Splines, where the spline curve does not usually pass through control points, and is contained within the convex hull of control points. Secondly, if the spline curve passes through all control points, then the difference between a polygon constructed from the control points and the spline curve itself is small for a reasonable number of control points. A polygon can be written with the same equation as a spline, where the spline basis functions are linear interpolators between control points. Thus, by defining the $\mathcal{L}_2$ norm as the identity, we get \begin{equation} W^+ = \left(W^\top I W\right)^{-1}W^\top I = \left(W^\top W\right)^{-1}W^\top \end{equation} which is just the normal equation for the pseudo-inverse of $W$. After substituting into (\ref{eqnapprox}), we get \begin{equation} \label{eqnimpl} \mathbf{\hat{X}} = \left(W^\top W\right)^{-1}W^\top \left(\mathbf{Q}_f - \mathbf{Q}_0\right) \end{equation} which is readily computed. In the observation step, after we find the deformed spline, we find the closest shape space vector $\mathbf{\hat{X}}$ with this equation. The above computation is carried out in \code{Observer::observe()}. \subsection{Observation Model} \label{obsmodel} To make an observation (implemented in \code{Observer::observe()}), an estimate from the particle filter for the current shape state, $\mathbf{X}$, is expanded into a spline $\mathbf{Q}$ according to equation (\ref{eqnexpand}). The entire image is then run through a Canny edge detection routine, accessed as \code{cvCanny()} in OpenCV. The output of the edge detector is shown in figure \ref{thresh}. For greater computational efficiency, we could have only run edge detection along the scanned normals, but we since our computers were plenty fast for doing this in realtime, we left it as is. Then we deform the spline according to the edges, in \code{Observer::fullyTweakSpline()}. To do this, for each control point two rays are cast from the control point in directions normal to the curve (in opposite directions to search both sides of the curve). If either or both of the rays hit an edge, the $(x,y)$ coordinate of the closest one replaces the current control point. The rays are only scanned for a limited number of pixels. One can change the distance of normal scanning by adjusting the \emph{Normal Length} slider seen in figures \ref{interface} and \ref{thresh}. If the rays does not hit an edge, no action is taken---yet. See the next section for details. After the spline $\mathbf{Q}$ is deformed into $\mathbf{Q}_f$, it is fed to the spline fitting routine, which computes the closest shape space vector $\mathbf{\hat{X}}$ to the spline $\mathbf{Q}_f$ via equation (\ref{eqnimpl}). \begin{figure} \centering \includegraphics[width=1.7in]{figs/highthresh} \includegraphics[width=1.7in]{figs/lowthresh} \caption{\emph{Left:} The observation model runs the video feed through a Canny edge detector, then scans along normals to the spline. Here the entire GUI is shown; Thresh1 and Thresh2 are parameters for the OpenCV \code{cvCanny} function. \emph{Right:} Here we can see what happens when Thresh2 is lowered: more edges are detected. It is interesting to note that while the tracking is not as robust as with a high threshold, it is still effective enough to track ones head if movement is slow.} \label{thresh} \end{figure} \subsection{Observation Model Feedback} After the samples from the last time step are diffused by uniform noise, they are weighted according to their closeness to measurement for this time frame. To impose the reactive effect of measurement on the estimated density, the observation model iterates over the current samples and evaluates their closeness to the observed sample. Our observation model is tied to the particle filter with the \code{SplineFilter::fixWeight()} function. It evaluates a sample via the observation model and adjusts the samples weight accordingly. There are several steps in determining the new weight for a selected sample. The first step is to expand the sample into a curve using the method provided by the observation model, described in section \ref{obsmodel}. The observation model returns the control points of the transformed curve. It also calculates the nearest edge to each control point along the curve's normal (\code{Observer::fitTransform()}). Then, how well a curve fits with the edges in the image determines the value of that curve's state. The error between the curve's control points and the edge points is calculated in \code{SplineTemplate::mse()} and returned to the \code{fixWeight} method. This error is calculated as \begin{equation} E = \sum_i ||c_i - e_i||^2 \label{mse} \end{equation} where $c_i$ and $e_i$ are the image coordinates rather than vectors defining the shape. If the observation step found no edge for this point, then the term is replaced with a fixed penalty. The error is converted into a weight according to \begin{equation} w = e^{-\sqrt{err}} \label{expdecay} \end{equation} % If the ray does not hit an edge, then that control point is attributed a % fixed penality in the calculation of the MSE (during the reactive effect % of measurement) to reflect that it does not carry any information. \subsection{Dynamics} In the prediction step of the algorithm, the dynamic model predicts deterministic drift of samples and spreads samples according to some form of stochastic diffusion. To do this, a sample from the current sample set is selected, then its state in the next sample set is estimated using the dynamic model. After the prediction, stochastic diffusion spreads the samples. In our implementation, we entirely ignore the deterministic drift and rely on the stochastic diffusion. The reason for not including a more complex dynamic model is that we did not have time to implement learning dynamics. Due to its simplicity the dynamic model is directly implemented in the \code{SplineFilter} class as part of the \code{SplineFilter::moveSample()} function. To apply stochastic diffusion, we add independent uniform noise to each of the existing samples: \begin{equation} \mathbf{X}_{t} = \mathbf{X}_{t-1} + \left( \Delta x, \Delta y, \Delta p, \Delta q \right)^\top \end{equation} where $\Delta x$ and $\Delta y$ offset the translational motion, and $\Delta p$ and $\Delta q$ offset the scaling and rotation (as explained in section \ref{shaperep}). The noise has different magnitude for each of the dimensions; we found the diffusion in the scaling / rotation dimensions ($\mathbf{X}^3$ and $\mathbf{X}^4$) should be smaller than the drift in the translation ($\mathbf{X}^1$ and $\mathbf{X}^2$) dimensions. For our implementation we found that diffusing translation by noise with magnitude $\pm 2.5$ pixels and scaling/rotation by noise with $\pm 0.05$ magnitude yielded acceptable results. This leads to a very simple implementation; however, it also requires a large sample set compared to other models. The purpose of the prediction step is to estimate how the state changes between frames. If a dynamic model is good at guessing then fewer guesses are required. To increase speed, or to handle cases where the motion will not be constrained to a small search space, other models are needed. Models based on difference equations generate better estimates than a purely random walk: they lie somewhere in the middle of the simplicity vs. speed trade-off. If the motion is quite complex it might be useful to implement a machine learning algorithm to use in the prediction. \section{Conclusion} The condensation algorithm is an effective way of tracking multi-modal distributions across time. Our simple implementation succeeds in tracking the authors head on a live video stream from a webcam. It is surprising that it works as well as it does, considering the extremely low resolution video, the low frame rate of 10 frames per second, and the many approximations made in the implementation. Things we are considering adding to our implementation are a 2nd order auto-regressive dynamics model, and more sophisticated image processing for the observation step. \begin{figure} \[ \parbox{3.0in}{\hbadness3000 \begin{raggedright} $ S : \text{list of current states} $ \\ $ S' : \text{list of states for consideration} $ \\ $ C : \text{cumulative distribution of weights} $ \\ $ $ \\ \end{raggedright} \begin{algorithm}{Predict}{S,C} S' \= \{\} \\ \begin{WHILE}{ size(S) \neq size(S') } \\ S' \= S' \cup \CALL{PickWeightedSample}(S,C) \\ \end{WHILE} \\ \begin{FOR}{\EACH s \IN S'} \\ \CALL{AdjustStateUsingDynamics}(s) \\ \end{FOR} \\ \RETURN S' \end{algorithm} \begin{algorithm}{EvaluateState}{s,edges} C \= \CALL{ExpandStateToCurve}(s) \\ E \= \CALL{FindMatchingEdgePoints}(C,edges) \\ error \= C - E \\ \RETURN error \end{algorithm} \begin{algorithm}{MakeCDF}{S,edges} C_{-1} \= 0 \\ \begin{FOR}{i \= 1 \TO size(S)} \\ e \= \CALL{EvaluateState}(S_i,edges) \\ w \= \CALL{ErrorToWeight}(e) \\ C_i \= C_{i-1} + w \end{FOR} \end{algorithm} \begin{algorithm}{Tracker}{} \text{Initialize data structures} \\ \begin{WHILE}{true} \\ image \= \CALL{GetFrame}(stream) \\ \CALL{Blur}(image) \\ \CALL{Grayscale}(image) \\ edges \= \CALL{Canny}(image) \\ \\ S' \= \CALL{Predict}(S,C) \\ % FIXME % \CALL{Observe}(S, edges) \\ % I'm actually not sure how this fits with our current code % The observation logic is now called from MakeCDF, so I'm moving this % there for now C \= \CALL{MakeCDF}(S',edges) \\ \\ \CALL{DrawMostLikely}(S, image) \end{WHILE} \end{algorithm} } \] \caption{The overall algorithm for our implementation} \label{overall} \end{figure} \bibliography{condproject} \bibliographystyle{IEEEtran} \end{document}