A Benchmark for the Comparison of 3D Motion Segmentation Algorithms
Roberto Tron Ren´e VidalCenter for Imaging Science, Johns Hopkins University308B Clark Hall, 3400 N. Charles St., Baltimore MD 21218, USA
http://www.vision.jhu.edu
Abstract
Over the past few years, several methods for segmenting a scene containing multiple rigidly moving objects havebeen proposed. However, most existing methods have beentested on a handful of sequences only, and each method hasbeen often tested on a different set of sequences. Therefore,the comparison of different methods has been fairly limited. In this paper, we compare four 3D motion segmentation algorithms for afﬁne cameras on a benchmark of 155 motionsequences of checkerboard, trafﬁc, and articulated scenes.
1. Introduction
Motion segmentation is a very important preprocessingstep for several applications in computer vision, such assurveillance, tracking, action recognition, etc. During thenineties, these applications motivated the development of several 2D motion segmentation techniques. Such techniques aimed to separate each frame of a video sequenceinto different regions of coherent 2D motion (optical ﬂow).For example, avideo ofa rigidscene seenby amoving camera could be segmented into multiple 2D motions, becauseof depth discontinuities, occlusions, perspective effects, etc.However, in several applications the scene may containseveral moving objects, and one may need to identify eachobject as a coherent entity. In such cases, the segmentationtask must be performed based on the assumption of severalmotions in 3D space, not simply in 2D. This has motivatedseveral works on 3D motion segmentation during the lastdecade, which can be roughly separated into two categories:1.
Afﬁne methods
assume an afﬁne projection model,which generalizes orthographic, weakperspective andparaperspective projection. Under the afﬁne model,point trajectories associated with each moving objectacross multiple frames lie in a linear subspace of dimension at most 4. Therefore, 3D motion segmentation can be achieved by clustering point trajectoriesinto different motion subspaces. At present, several algebraic and statistical methods for performing this task have been developed (see
§
2 for a brief review). However, all existing techniques have been typically evaluated on a handful of sequences, with limited comparison against other methods. This motivates a study onthe real performances of these methods.2.
Perspective methods
assume a perspective projectionmodel. In this case, point trajectories associated witheach moving object lie in a multilinear variety (bilinearfor two views, trilinear for three views, etc.) Therefore, motion segmentation is equivalent to clusteringthese multilinear varieties. Because this problem isnontrivial, most prior work has been limited to algebraic methods for factorizing bilinear and trilinear varieties (see
e.g
. [18, 7]) and statistical methods for two[15] and multiple [13] views. At present, the evaluation of perspective methods is still far behind that of afﬁne methods. It is arguable that perspective methodsstill need to be signiﬁcantly improved, before a meaningful evaluation and comparison can be made.In this paper, we present a benchmark and a comparison of 3D motion segmentation algorithms. We choose tocompare only afﬁne methods, not only because the afﬁnecase is better understood, but also because afﬁne methods are at present better developed than their perspectivecounterparts. We compare four stateoftheart algorithms,GPCA [16], Local Subspace Afﬁnity (LSA) [21], MultiStageLearning(MSL)[14]andRANSAC[4], onadatabaseof 155 motion sequences. The database includes 104 indoor checkerboard sequences, 38 outdoor trafﬁc sequences,and 13 articulated/nonrigid sequences, all with two or threemotions. Our experiments show that LSA is the most accurate method, with average classiﬁcation errors of 3.45% fortwomotionsand9.73%forthreemotions. However, fortwomotions, GPCA and RANSAC are faster and have a limited1%2% drop in accuracy. More importantly, the results varydepending on the type of sequences: LSA is more accuratefor checkerboard sequences, while GPCA is more accuratefor trafﬁc and articulated scenes. The MSL algorithm is often very accurate, but signiﬁcantly slower.
2. Multibody Motion Segmentation Problem
In this section, we review the geometry of the 3Dmotion segmentation problem from multiple afﬁne viewsand show that it is equivalent to clustering multiple lowdimensional linear subspaces of a highdimensional space.
2.1. Motion Subspace of a RigidBody Motion
Let
{
x
fp
∈
R
2
}
f
=1
,...,F p
=1
,...,P
be the projections of
P
3Dpoints
{
X
p
∈
P
3
}
P p
=1
lyingon arigidlymoving object onto
F
frames of a rigidly moving camera. Under the afﬁne pro jection model, which generalizes orthographic, weak perspective, and paraperspective projection, the images satisfythe equation
x
fp
=
A
f
X
p
,
(1)where
A
f
=
K
f
1 0 0 00 1 0 00 0 0 1
R
f
t
f
0
1
∈
R
2
×
4
is the
afﬁne camera matrix
at frame
f
, which depends on the camera calibration parameters
K
f
∈
R
2
×
3
and the object poserelative to the camera
(
R
f
,
t
f
)
∈
SE
(3)
.Let
W
1
∈
R
2
F
×
P
be the matrix whose
P
columns are theimage point trajectories
{
x
fp
}
P p
=1
. It follows from (1) that
W
1
can be decomposed into a
motion matrix
M
1
∈
R
2
F
×
4
and a
structure matrix
S
1
∈
R
P
×
4
as
W
1
=
M
1
S
1
x
11
···
x
1
P
......
x
F
1
···
x
FP
2
F
×
P
=
A
1
...
A
F
2
F
×
4
X
1
···
X
P
4
×
P
,
(2)hence rank
(
W
1
)
≤
4
. Note also that the rows of each
A
f
involve linear combinations of the ﬁrst two rows of the rotation matrix
R
f
, hence rank
(
W
1
)
≥
rank
(
A
f
) = 2
. Therefore, under the afﬁne projection model, the 2D trajectories of a set of 3D points seen by a rigidly moving camera(the columns of
W
1
) live in a subspace of
R
2
F
of dimension
d
1
=
rank
(
W
1
) = 2
,
3
or
4
.
2.2. Segmentation of Multiple RigidBody Motions
Assume now that the
P
trajectories
{
x
fp
}
P p
=1
correspondto
n
objectsundergoing
n
rigidbodymotionsrelativeto a moving camera. The
3D motion segmentation problem
is the task of clustering these
P
trajectories according tothe
n
moving objects. Since the trajectories associated witheach object span a
d
i
dimensional linear subspace of
R
2
F
,the 3D motion segmentation problem is equivalent to clustering a set of points into
n
subspaces of
R
2
F
of unknowndimensions
d
i
∈ {
2
,
3
,
4
}
for
i
= 1
,...,n
.Notice that the data matrix can be written as
W
=
W
1
,
W
2
,
···
,
W
n
Γ
∈
R
2
F
×
P
,
(3)where the columns of
W
i
∈
R
2
F
×
P
i
are the
P
i
trajectories associated with the
i
th moving object,
P
=
n
i
=1
P
i
, and
Γ
∈
R
P
×
P
is an unknown matrix permuting the
P
trajectoriesaccording tothe
n
motions. Since
W
i
canbefactorizedinto matrices
ˆ
M
i
∈
R
2
F
×
d
i
and
ˆ
S
i
∈
R
P
i
×
d
i
as
W
i
= ˆ
M
i
ˆ
S
i
i
= 1
,...,n,
(4)the matrix associated with all the objects can be factorizedinto matrices
M
∈
R
2
F
×
P
ni
=1
d
i
and
S
∈
R
P
×
P
ni
=1
d
i
as
W
=
W
1
,
W
2
,
···
,
W
n
Γ
∈
R
2
F
×
P
=
ˆ
M
1
,
ˆ
M
2
,
···
,
ˆ
M
n
ˆ
S
1
ˆ
S
2
...
ˆ
S
n
Γ
=
MS
Γ
.
(5)It follows that one possible way of solving the motionsegmentation problem is to ﬁnd a permutation matrix
Γ
,such that the matrix
WΓ
can be decomposed into a motion matrix
M
and a
block diagonal
structure matrix
S
. Thisidea has been the basis for most existing motion segmentation algorithms [1, 3, 5, 8, 10, 11, 19]. However, as shownin [10], in order for
W
to factor according to (5), the motionsubspaces
{W
i
⊂
R
2
F
}
ni
=1
must be
independent
, that is,for all
i
=
j
= 1
,...,n
, we must have
dim(
W
i
∩W
j
) = 0
,so that rank
(
W
) =
ii
=1
d
i
, where
d
i
= dim(
W
i
)
.Unfortunately, most practical motion sequences exhibit
partially dependent
motions,
i.e
. there are
i,j
∈ {
1
,...,n
}
such that
0
<
dim(
W
i
∩W
j
)
<
min
{
d
i
,d
j
}
. For example, when two objects have the same rotational but differenttranslational motion relative to the camera [14], or for articulated motions [20]. This has motivated the development of several algorithms for dealing with partially dependent motions, including statistical methods [6, 14], spectral methods [21, 22] and algebraic methods [16]. We review someof these methods in the next section.
3. Multibody Motion Segmentation Algorithms
3.1. Generalized PCA (GPCA) [17, 16]
Generalized Principal Component Analysis (GPCA) isan algebraic method for clustering data lying in multiplesubspaces proposed by Vidal
et al
. [17]. The main idea behind GPCA is that one can ﬁt a union of
n
subspaces with aset of polynomials of degree
n
, whose derivatives at a pointgive a vector normal to the subspace containing that point.The segmentation of the data is then obtained by groupingthese normal vectors, which can be done using several techniques. In the context of motion segmentation, GPCA operates as follows [16]:
1.
Projection
: Project the trajectories onto a subspace of
R
2
F
of dimension 5 to obtain the projected data matrix
ˆ
W
= [
w
1
,...,
w
P
]
∈
R
5
×
P
.
The reason for projecting is as follows. Since the maximumdimensionofeachmotionsubspaceis4, projecting onto a generic subspace of dimension 5 preservesthe number and dimensions of the motion subspaces.As a byproduct, there is an important reduction in thedimensionality of the problem, which is now reducedto clustering subspaces of dimension at most 4 in
R
5
.Another advantage of the projection, is that it allowsone to deal with missing data, as a rank5 factorization of
W
can be computed using matrix factorizationtechniques for missing data (see [2] for a review).2.
Multibody motion estimation via polynomial ﬁtting
:Fit a homogeneous polynomial representing all motionsubspaces to the projected data. For example, if wehave
n
motion subspaces of dimension 4, then eachone can be represented with a unique normal vector in
R
5
as
{
w
:
b
i
w
= 0
}
. The union of
n
subspaces isrepresented as
{
w
:
q
n
(
w
) = (
b
1
w
)
···
(
b
n
w
) = 0
}
.
q
n
is a polynomial of degree
n
in
w
that can be writtenas
c
ν
n
(
w
)
, where
c
is the vector of coefﬁcients, and
ν
n
(
w
)
is the vector of all monomials of degree
n
in
w
.The vector of coefﬁcients is of dimension O
(
n
4
)
andcan be computed from the linear system
c
ν
n
(
w
1
)
ν
n
(
x
2
)
···
ν
n
(
w
P
)
= 0
.
(6)3.
Feature clustering via polynomial differentiation
: For
n
= 2
,
∇
q
2
(
w
) = (
b
2
w
)
b
1
+ (
b
1
w
)
b
2
, thus if
w
p
belongs to the ﬁrst motion, then
∇
q
2
(
w
)
∼
b
1
. Moregenerally, one can obtain the normal to the hyperplanecontaining point
w
p
from the gradient of
q
n
(
w
)
at
w
p
b
(
w
p
)
∼ ∇
q
n
(
w
p
)
.
(7)One can then cluster the point trajectories by applyingspectral clustering [12] to the similarity matrix
S
ij
=cos
2
(
θ
ij
)
, where
θ
ij
is the angle between the vectors
∇
q
n
(
w
i
)
and
∇
q
n
(
w
j
)
for
i,j
= 1
,...,P
.The ﬁrst advantage of GPCA is that it is an algebraic algorithm, thus it is computationally very cheap. Second, aseach subspace is represented with a hyperplane containingthe subspace, intersections between subspaces are automatically allowed, and so the algorithm can deal with both independent and partially dependent motions. Third, GPCAcandealwithmissingdatabyperformingtheprojectionstepusing matrix factorization techniques for missing data [2].The main drawback of GPCA is that
c
is of dimensionO
(
n
4
)
, while there are only
4
n
unknowns in the
n
normal vectors. Since
c
is computed using leastsquares, thiscauses the performance of GPCA to deteriorate as
n
increases. Also, the computation of
c
is sensitive to outliers.
3.2. Local Subspace Afﬁnity (LSA) [21]
The LSA algorithm proposed by Yan and Pollefeysin [21] is also based on a linear projection and spectralclustering. The main difference is that LSA ﬁts a subspace
locally
around each projected point, while GPCA uses thegradients of a polynomial that is
globally
ﬁt to the projecteddata. The main steps of the local algorithm are as follows:1.
Projection
: Project the trajectories onto a subspace of dimension
D
=
rank
(
W
)
using the SVD of
W
. Thevalue of
D
is determined using model selection techniques. The resulting points in
R
D
are then projectedonto the hypersphere
S
D
−
1
by setting their norm to 1.2.
Local subspace estimation
: For each point
i
, computeits
k
nearest neighbors using the angles between thevectors or their Euclidean distance as a metric. Then ﬁta local subspace
W
i
to the point and its neighbors. Thedimension
d
i
of the subspace
W
i
depends on the kindof motion (e.g., general motion, purely translational,etc.) and the position of the 3D points (e.g. generalposition, all on the same plane, etc.). The dimension
d
i
is also determined using model selection techniques.3.
Spectral clustering
: Compute a similarity matrix between two points
i,j
= 1
,...,P
as
S
ij
= exp
{−
d
ij
m
=1
sin
2
(
θ
m
)
}
,
(8)where the
{
θ
m
}
d
ij
m
=1
are the principal angles betweenthe two subspaces
W
i
and
W
j
, and
d
ij
is the minimumbetween
dim(
W
i
)
and
dim(
W
j
)
. Finally, cluster thefeatures by applying spectral clustering [12] to
S
.The LSA algorithm has two main advantages when compared to GPCA. First, outliers are likely to be “rejected”,because they are far from all the points and so they arenot considered as neighbors of the inliers. Second, LSArequires only
Dn
≤
4
n
2
point trajectories, while GPCAneeds O
(
n
4
)
. On the other hand, LSA has two main drawbacks. First, the neighbors of a point could belong to a different subspace – this case is more likely to happen near theintersection of two subspaces. Second, the selected neighbors may not span the underlying subspace. Both cases area source of potential misclassiﬁcations.Duringourexperiments, wehadsomedifﬁcultiesinﬁnding a set of model selection parameters that would work across all sequences. Thus, we decided to avoid model selection in the ﬁrst two steps of the algorithm and ﬁx boththe dimension of the projected space
D
and the dimensionsof the individual subspaces
{
d
i
}
ni
=1
. We used two choicesfor
D
. One choice is
D
= 5
, which is the dimension usedby GPCA. The other is
D
= 4
n
, which implicitly assumesthat all motions are independent and fulldimensional. Inour experiments in
§
5 we will refer to these two variants as
LSA 5
and
LSA
4
n
, respectively. As for the dimension of the individual subspaces, we assumed
d
i
= 4
.
3.3. MultiStage Learning method (MSL) [14]
The MultiStage Learning (MSL) algorithm is a statistical approach proposed by Sugaya and Kanatani in [14].It builds on Costeira and Kanade’s factorization method(CK) [3] and Kanatani’s subspace separation method (SS)[10, 11]. While the CK and SS methods apply to independent and nondegenerate subspaces, MSL can handle someclasses of degenerate motions by reﬁning the solution of SSusing the Expectation Maximization algorithm (EM).The CK algorithm proceeds by computing a rank
D
approximation
V
∈
R
P
×
D
of
W
from its SVD
W
=
UΣV
. Asshown in [10], when the motions are
independent
, the shapeinteraction matrix
Q
=
VV
∈
R
P
×
P
is such that
Q
ij
= 0
if points
i
and
j
belong to different objects. (9)With noisy data, this equation holds only approximately.CK’salgorithmobtainsthesegmentationbymaximizingthesum of squared entries of the noisy
Q
in different groups.However, this process is very sensitive to noise [5, 10, 19].The SS algorithm [10, 11] deals with noise using twoprinciples:
dimension correction
and
model selection
. Dimension correction is used to induce exact zero entries in
Q
by replacing points in a group with their projections ontoan optimally ﬁtted subspace. Model selection, particularlythe Geometric Akaike Information Criterion [9] (GAIC), isused to decide whether to merge two groups. This can beachieved by applying CK’s method to a scaled version of
QS
ij
=
GAIC
W
i
,
W
j
GAIC
W
i
∪W
j
max
k
∈W
i
,l
∈W
j

Q
kl

.
(10)However, in most practical sequences the motion subspaces are degenerate,
e.g
. of dimension three for 2D translational motions. In this case the SS algorithm gives wrongresults, because the calculation of the GAIC uses the incorrect dimensions for the individual subspaces. The MSLalgorithm deals with degenerate motions by assuming thatthetypeofdegeneracyisknown(
e.g
.2Dtranslational), andcomputing the GAIC accordingly. Another issue is that inmost practical sequences the motion subspaces are partiallydependent. In this case, the SS algorithm also gives wrongresults, because equation (9) does not hold even with perfect data. To overcome these issues, the MSL algorithm iteratively reﬁnes the segmentation given by the SS algorithmusing EM for clustering subspaces as follows:1. Obtain an initial segmentation using SS adapted to independent 2D translational motions.2. Use the current solution to initialize an EM algorithmadapted to independent 2D translational motions.3. Use the current solution to initialize an EM algorithmadapted to independent afﬁne subspaces.4. Use the current solution to initialize an EM algorithmadapted to full and independent linear subspaces.The intuition behind the MSL algorithm is as follows. If the motions are degenerate, then the ﬁrst two stages willgive a good solution, which will simply be reﬁned by thelast two stages. On the other hand, if the motions are notdegenerate, then the third stage will anyhow provide a goodinitialization for the last stage to operate correctly.As with all algorithms based on EM, the MSL methodsuffers from convergence to a local minimum. Therefore,good initialization is needed to reach the global optimum.When the initialization is not good, it often happens that thealgorithm takes a long time to converge (several hours), asit performs a series of optimization problems. Another disadvantage is that the algorithm is not designed for partiallydependent motions, thus sometimes its performance is notideal. In spite of these difﬁculties in theory, in practice thealgorithm is quite accurate, as we will see in
§
5.
3.4. Random Sample Consensus (RANSAC) [4, 15]
RANdom SAmple Consensus (RANSAC) is a statisticalmethod for ﬁtting a model to a cloud of points corruptedwith outliers in a statistically robust way. More speciﬁcally,if
d
is the minimum number of points required to ﬁt a modelto the data, RANSAC randomly samples
d
points from thedata, ﬁts a model to these
d
points, computes the residual of each data point to this model, and chooses the points whoseresidual is below a threshold as the inliers. The procedure isthen repeated for another
d
sample points, until the numberof inliers is above a threshold, or enough samples have beendrawn. The outputs of the algorithm are the parameters of the model and the labeling of inliers and outliers.In the case of motion segmentation, the model to be ﬁtby RANSAC is a subspace of dimension
d
. Since there aremultiplesubspaces, RANSACproceedsiterativelybyﬁttingone subspace at a time as follows:1. Apply RANSAC to the srcinal data set and recover abasis for the ﬁrst subspace along with the set of inliers.Allpointsinothersubspacesareconsideredasoutliers.2. Remove the inliers from the current data set and repeatstep 1 until all the subspaces are recovered.3. For each setof inliers, usePCAtoﬁnd an optimalbasisfor each subspace. Segment the data into multiple subspaces by assigning each point to its closest subspace.The main advantage of RANSAC is its ability to handleoutliers explicitly. Also, notice that RANSAC can deal withpartially dependent motions, because it computes one subspace at a time. However, the performance of RANSAC deteriorates quickly as the number of motions
n
increases, because the probability of drawing
d
inliers reduces exponentially with the number of subspaces. Another drawback of RANSAC is that it uses
d
= 4
as the dimension of the subspaces, which is not the minimum number of points neededto deﬁne a degenerate subspace (of dimension 2 or 3).
3.5. Reference
Data from real sequences contain not only noise and outliers, but also some degree of perspective effects, which arenot accounted for by the afﬁne model. Therefore, obtaininga perfect segmentation is not always possible.In order to verify the validity of the afﬁne model on realdata, we will also compare the performance of afﬁne algorithms with an “oracle” algorithm (here called
Reference
).This algorithm cannot be used in practice, because it requires the ground truth segmentation as an input. The algorithmusesleastsquarestoﬁtasubspacetothedatapointsineach group using the SVD. Then, the data are resegmentedby assigning each point to its nearest subspace.This Reference algorithm shows, with a perfect estimation of the subspaces, if the data can be segmented usingthe approximation of afﬁne cameras and constitutes a goodterm of comparison for all the other (practical) algorithms.
4. Benchmark
We collected a database of
50 video sequences
of indoorand outdoors scenes containing two or three motions. Eachvideo sequence
X
with three motions was split into threemotion sequences
X g12
,
X g13
and
X g23
containing thepoints from groups one and two, one and three, and twoand three, respectively. This gave a total of
155 motion sequences
:120 with two motions and 35 with three motions.Figure 1 shows a few sample images from the videos inthe database with feature points superimposed. The entiredatabase is available at
http://www.vision.jhu.edu
.These sequences contain degenerate and nondegeneratemotions, independent and partially dependent motions, articulated motions, nonrigid motions, etc. To summarize theamount of motion present in all the sequences, we estimatedthe rotation and translation between all pairs of consecutiveframes for each motion in each sequence. This informationwas used to produce the histograms shown in Figure 2.Based on the content of the video and the type of motion,the sequences can be categorized into three main groups:
Checkerboard sequences:
this group consists of 104 sequences of indoor scenes taken with a handheld camera under controlled conditions. The checkerboard pattern on theobjects is used to assure a large number of tracked points.Sequences
1R2RC
–
2T3RTCR
contain three motions: twoobjects (identiﬁed by the numbers
1
and
2
, or
2
and
3
) andthecameraitself(identiﬁedbytheletter
C
).Thetypeofmotion of each object is indicated by a letter:
R
for rotation,
T
for translation and
RT
for both rotation and translation.If there is no letter after the
C
, this signiﬁes that the camera is ﬁxed. For example, if a sequence is called
1R2TC
it means that the ﬁrst object rotates, the second translatesand the camera is ﬁxed. Sequence
threecars
is taken from[18] and contains three motions of two toy cars and a boxmoving on a plane (the table) taken by a ﬁxed camera.
(a) 1R2RCT B (b) 2T3RCRT(c) cars3 (d) cars10(e) people2 (f) kanatani3
Figure 1: Sample images from some sequences in thedatabase with tracked points superimposed.
0 1 2 3 4 5 6 7 80102030Rotation [°]
O c c u r e n c e s [ % ]
(a) Amount of rotation
θ
=
acos
`
(
trace
(
R
f
+1
R
f
))
−
1)
/
2
´
in degrees.
0 0.02 0.04 0.06 0.08 0.1 0.120204060Translation (normalized)
O c c u r e n c e s [ % ]
(b) Amount of translation
τ
=
t
f
/
max
{
depth
}
.
Figure 2: Histograms with the amount of rotation and translation between two consecutive frames for each motion.
Trafﬁc sequences:
this group consists of 38 sequences of outdoor trafﬁc scenes taken by a moving handheld camera.Sequences
carsX
–
truckX
have vehicles moving on a street.Sequences
kanatani1
and
kanatani2
are taken from [14] anddisplay a car moving in a parking lot. Most scenes containdegenerate motions, particularly linear and planar motions.