A Benchmark for the Comparison of 3-D Motion Segmentation Algorithms

of 8
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Information Report
Category:

Business

Published:

Views: 29 | Pages: 8

Extension: PDF | Download: 0

Share
Description
A Benchmark for the Comparison of 3-D Motion Segmentation Algorithms
Tags
Transcript
  A Benchmark for the Comparison of 3-D Motion Segmentation Algorithms Roberto Tron Ren´e VidalCenter for Imaging Science, Johns Hopkins University308B Clark Hall, 3400 N. Charles St., Baltimore MD 21218, USA http://www.vision.jhu.edu Abstract Over the past few years, several methods for segment-ing a scene containing multiple rigidly moving objects havebeen proposed. However, most existing methods have beentested on a handful of sequences only, and each method hasbeen often tested on a different set of sequences. Therefore,the comparison of different methods has been fairly limited. In this paper, we compare four 3-D motion segmentation al-gorithms for affine cameras on a benchmark of 155 motionsequences of checkerboard, traffic, and articulated scenes. 1. Introduction Motion segmentation is a very important pre-processingstep for several applications in computer vision, such assurveillance, tracking, action recognition, etc. During thenineties, these applications motivated the development of several 2-D motion segmentation techniques. Such tech-niques aimed to separate each frame of a video sequenceinto different regions of coherent 2-D motion (optical flow).For example, avideo ofa rigidscene seenby amoving cam-era could be segmented into multiple 2-D motions, becauseof depth discontinuities, occlusions, perspective effects, etc.However, in several applications the scene may containseveral moving objects, and one may need to identify eachobject as a coherent entity. In such cases, the segmentationtask must be performed based on the assumption of severalmotions in 3-D space, not simply in 2-D. This has motivatedseveral works on 3-D motion segmentation during the lastdecade, which can be roughly separated into two categories:1.  Affine methods  assume an affine projection model,which generalizes orthographic, weak-perspective andparaperspective projection. Under the affine model,point trajectories associated with each moving objectacross multiple frames lie in a linear subspace of di-mension at most 4. Therefore, 3-D motion segmen-tation can be achieved by clustering point trajectoriesinto different motion subspaces. At present, several al-gebraic and statistical methods for performing this task have been developed (see § 2 for a brief review). How-ever, all existing techniques have been typically evalu-ated on a handful of sequences, with limited compari-son against other methods. This motivates a study onthe real performances of these methods.2.  Perspective methods  assume a perspective projectionmodel. In this case, point trajectories associated witheach moving object lie in a multilinear variety (bilinearfor two views, trilinear for three views, etc.) There-fore, motion segmentation is equivalent to clusteringthese multilinear varieties. Because this problem isnontrivial, most prior work has been limited to alge-braic methods for factorizing bilinear and trilinear va-rieties (see  e.g . [18, 7]) and statistical methods for two[15] and multiple [13] views. At present, the evalua-tion of perspective methods is still far behind that of affine methods. It is arguable that perspective methodsstill need to be significantly improved, before a mean-ingful evaluation and comparison can be made.In this paper, we present a benchmark and a compari-son of 3-D motion segmentation algorithms. We choose tocompare only affine methods, not only because the affinecase is better understood, but also because affine meth-ods are at present better developed than their perspectivecounterparts. We compare four state-of-the-art algorithms,GPCA [16], Local Subspace Affinity (LSA) [21], Multi-StageLearning(MSL)[14]andRANSAC[4], onadatabaseof 155 motion sequences. The database includes 104 in-door checkerboard sequences, 38 outdoor traffic sequences,and 13 articulated/non-rigid sequences, all with two or threemotions. Our experiments show that LSA is the most accu-rate method, with average classification errors of 3.45% fortwomotionsand9.73%forthreemotions. However, fortwomotions, GPCA and RANSAC are faster and have a limited1%-2% drop in accuracy. More importantly, the results varydepending on the type of sequences: LSA is more accuratefor checkerboard sequences, while GPCA is more accuratefor traffic and articulated scenes. The MSL algorithm is of-ten very accurate, but significantly slower.  2. Multibody Motion Segmentation Problem In this section, we review the geometry of the 3-Dmotion segmentation problem from multiple affine viewsand show that it is equivalent to clustering multiple low-dimensional linear subspaces of a high-dimensional space. 2.1. Motion Subspace of a Rigid-Body Motion Let  { x fp  ∈  R 2 } f  =1 ,...,F  p =1 ,...,P   be the projections of   P   3-Dpoints { X   p  ∈  P 3 } P  p =1  lyingon arigidlymoving object onto F   frames of a rigidly moving camera. Under the affine pro- jection model, which generalizes orthographic, weak per-spective, and paraperspective projection, the images satisfythe equation x fp  =  A f  X   p ,  (1)where  A f   =  K f   1 0 0 00 1 0 00 0 0 1  R f   t f  0  1   ∈  R 2 × 4 is the affine camera matrix  at frame f  , which depends on the cam-era calibration parameters  K f   ∈  R 2 × 3 and the object poserelative to the camera  ( R f  , t f  )  ∈  SE  (3) .Let  W 1 ∈ R 2 F  × P  be the matrix whose  P   columns are theimage point trajectories  { x fp } P  p =1 . It follows from (1) that W 1  can be decomposed into a  motion matrix  M 1  ∈  R 2 F  × 4 and a  structure matrix  S 1  ∈  R P  × 4 as W 1  =  M 1 S  1  x 11 ··· x 1 P  ...... x F  1 ··· x FP   2 F  × P  =  A 1 ... A F   2 F  × 4  X  1 ··· X  P   4 × P   , (2)hence rank  ( W 1 )  ≤  4 . Note also that the rows of each  A f  involve linear combinations of the first two rows of the ro-tation matrix  R f  , hence rank  ( W 1 )  ≥  rank  ( A f  ) = 2 . There-fore, under the affine projection model, the 2-D trajecto-ries of a set of 3-D points seen by a rigidly moving camera(the columns of   W 1 ) live in a subspace of   R 2 F  of dimension d 1  =  rank  ( W 1 ) = 2 ,  3  or  4 . 2.2. Segmentation of Multiple Rigid-Body Motions Assume now that the  P   trajectories  { x fp } P  p =1  corre-spondto n objectsundergoing n rigid-bodymotionsrelativeto a moving camera. The  3-D motion segmentation problem is the task of clustering these  P   trajectories according tothe  n  moving objects. Since the trajectories associated witheach object span a  d i -dimensional linear subspace of   R 2 F  ,the 3-D motion segmentation problem is equivalent to clus-tering a set of points into  n  subspaces of   R 2 F  of unknowndimensions  d i  ∈ { 2 , 3 , 4 } for  i  = 1 ,...,n .Notice that the data matrix can be written as W  =  W 1 , W 2 , ···  , W n  Γ  ∈  R 2 F  × P  ,  (3)where the columns of   W i  ∈  R 2 F  × P  i are the  P  i  trajecto-ries associated with the  i th moving object,  P   = n  i =1 P  i , and Γ  ∈  R P  × P  is an unknown matrix permuting the  P   trajec-toriesaccording tothe n motions. Since W i  canbefactorizedinto matrices  ˆ M i  ∈  R 2 F  × d i and  ˆ S i  ∈  R P  i × d i as W i  = ˆ M i ˆ S  i  i  = 1 ,...,n,  (4)the matrix associated with all the objects can be factorizedinto matrices  M  ∈  R 2 F  × P ni =1  d i and  S  ∈  R P  × P ni =1  d i as W  =  W 1 , W 2 , ···  , W n  Γ  ∈  R 2 F  × P  =  ˆ M 1 , ˆ M 2 , ···  , ˆ M n  ˆ S  1 ˆ S  2 ... ˆ S  n  Γ =  MS  Γ . (5)It follows that one possible way of solving the motionsegmentation problem is to find a permutation matrix  Γ ,such that the matrix  WΓ  can be decomposed into a mo-tion matrix  M  and a  block diagonal  structure matrix  S . Thisidea has been the basis for most existing motion segmenta-tion algorithms [1, 3, 5, 8, 10, 11, 19]. However, as shownin [10], in order for  W  to factor according to (5), the motionsubspaces  {W  i  ⊂  R 2 F  } ni =1  must be  independent  , that is,for all  i   =  j  = 1 ,...,n , we must have  dim( W  i ∩W  j ) = 0 ,so that rank  ( W ) =   ii =1  d i , where  d i  = dim( W  i ) .Unfortunately, most practical motion sequences exhibit  partially dependent   motions,  i.e . there are  i,j  ∈ { 1 ,...,n } such that  0  <  dim( W  i  ∩W  j )  <  min { d i ,d j } . For exam-ple, when two objects have the same rotational but differenttranslational motion relative to the camera [14], or for artic-ulated motions [20]. This has motivated the development of several algorithms for dealing with partially dependent mo-tions, including statistical methods [6, 14], spectral meth-ods [21, 22] and algebraic methods [16]. We review someof these methods in the next section. 3. Multibody Motion Segmentation Algorithms 3.1. Generalized PCA (GPCA) [17, 16] Generalized Principal Component Analysis (GPCA) isan algebraic method for clustering data lying in multiplesubspaces proposed by Vidal  et al . [17]. The main idea be-hind GPCA is that one can fit a union of   n  subspaces with aset of polynomials of degree  n , whose derivatives at a pointgive a vector normal to the subspace containing that point.The segmentation of the data is then obtained by groupingthese normal vectors, which can be done using several tech-niques. In the context of motion segmentation, GPCA op-erates as follows [16]:  1.  Projection : Project the trajectories onto a subspace of  R 2 F  of dimension 5 to obtain the projected data matrix ˆ W  = [ w 1 ,..., w P  ]  ∈  R 5 × P  . The reason for projecting is as follows. Since the max-imumdimensionofeachmotionsubspaceis4, project-ing onto a generic subspace of dimension 5 preservesthe number and dimensions of the motion subspaces.As a byproduct, there is an important reduction in thedimensionality of the problem, which is now reducedto clustering subspaces of dimension at most 4 in  R 5 .Another advantage of the projection, is that it allowsone to deal with missing data, as a rank-5 factoriza-tion of   W  can be computed using matrix factorizationtechniques for missing data (see [2] for a review).2.  Multibody motion estimation via polynomial fitting :Fit a homogeneous polynomial representing all motionsubspaces to the projected data. For example, if wehave  n  motion subspaces of dimension 4, then eachone can be represented with a unique normal vector in R 5 as  { w  :  b  i  w  = 0 } . The union of   n  subspaces isrepresented as { w  :  q  n ( w ) = ( b  1 w ) ··· ( b  n w ) = 0 } . q  n  is a polynomial of degree  n  in  w  that can be writtenas  c  ν  n ( w ) , where  c  is the vector of coefficients, and ν  n ( w )  is the vector of all monomials of degree  n  in  w .The vector of coefficients is of dimension O ( n 4 )  andcan be computed from the linear system c   ν  n ( w 1 )  ν  n ( x 2 )  ···  ν  n ( w P  )   = 0 .  (6)3.  Feature clustering via polynomial differentiation : For n  = 2 ,  ∇ q  2 ( w ) = ( b  2  w ) b 1  + ( b  1  w ) b 2 , thus if   w  p belongs to the first motion, then  ∇ q  2 ( w )  ∼  b 1 . Moregenerally, one can obtain the normal to the hyperplanecontaining point  w  p  from the gradient of   q  n ( w )  at  w  p b ( w  p )  ∼ ∇ q  n ( w  p ) .  (7)One can then cluster the point trajectories by applyingspectral clustering [12] to the similarity matrix  S ij  =cos 2 ( θ ij ) , where  θ ij  is the angle between the vectors ∇ q  n ( w i )  and ∇ q  n ( w j )  for  i,j  = 1 ,...,P  .The first advantage of GPCA is that it is an algebraic al-gorithm, thus it is computationally very cheap. Second, aseach subspace is represented with a hyperplane containingthe subspace, intersections between subspaces are automat-ically allowed, and so the algorithm can deal with both in-dependent and partially dependent motions. Third, GPCAcandealwithmissingdatabyperformingtheprojectionstepusing matrix factorization techniques for missing data [2].The main drawback of GPCA is that  c  is of dimensionO ( n 4 ) , while there are only  4 n  unknowns in the  n  nor-mal vectors. Since  c  is computed using least-squares, thiscauses the performance of GPCA to deteriorate as  n  in-creases. Also, the computation of   c  is sensitive to outliers. 3.2. Local Subspace Affinity (LSA) [21] The LSA algorithm proposed by Yan and Pollefeysin [21] is also based on a linear projection and spectralclustering. The main difference is that LSA fits a subspace locally  around each projected point, while GPCA uses thegradients of a polynomial that is  globally  fit to the projecteddata. The main steps of the local algorithm are as follows:1.  Projection : Project the trajectories onto a subspace of dimension  D  =  rank  ( W )  using the SVD of   W . Thevalue of   D  is determined using model selection tech-niques. The resulting points in  R D are then projectedonto the hypersphere  S D − 1 by setting their norm to 1.2.  Local subspace estimation : For each point  i , computeits  k  nearest neighbors using the angles between thevectors or their Euclidean distance as a metric. Then fita local subspace W  i  to the point and its neighbors. Thedimension  d i  of the subspace  W  i  depends on the kindof motion (e.g., general motion, purely translational,etc.) and the position of the 3-D points (e.g. generalposition, all on the same plane, etc.). The dimension d i is also determined using model selection techniques.3.  Spectral clustering : Compute a similarity matrix be-tween two points  i,j  = 1 ,...,P   as S ij  = exp {− d ij  m =1 sin 2 ( θ m ) } ,  (8)where the  { θ m } d ij m =1  are the principal angles betweenthe two subspaces W  i  and W  j , and  d ij  is the minimumbetween  dim( W  i )  and  dim( W  j ) . Finally, cluster thefeatures by applying spectral clustering [12] to  S  .The LSA algorithm has two main advantages when com-pared to GPCA. First, outliers are likely to be “rejected”,because they are far from all the points and so they arenot considered as neighbors of the inliers. Second, LSArequires only  Dn  ≤  4 n 2 point trajectories, while GPCAneeds O ( n 4 ) . On the other hand, LSA has two main draw-backs. First, the neighbors of a point could belong to a dif-ferent subspace – this case is more likely to happen near theintersection of two subspaces. Second, the selected neigh-bors may not span the underlying subspace. Both cases area source of potential misclassifications.Duringourexperiments, wehadsomedifficultiesinfind-ing a set of model selection parameters that would work across all sequences. Thus, we decided to avoid model se-lection in the first two steps of the algorithm and fix boththe dimension of the projected space  D  and the dimensionsof the individual subspaces  { d i } ni =1 . We used two choicesfor  D . One choice is  D  = 5 , which is the dimension usedby GPCA. The other is  D  = 4 n , which implicitly assumesthat all motions are independent and full-dimensional. Inour experiments in § 5 we will refer to these two variants as  LSA 5  and  LSA  4 n , respectively. As for the dimension of the individual subspaces, we assumed  d i  = 4 .  3.3. Multi-Stage Learning method (MSL) [14] The Multi-Stage Learning (MSL) algorithm is a statis-tical approach proposed by Sugaya and Kanatani in [14].It builds on Costeira and Kanade’s factorization method(CK) [3] and Kanatani’s subspace separation method (SS)[10, 11]. While the CK and SS methods apply to indepen-dent and non-degenerate subspaces, MSL can handle someclasses of degenerate motions by refining the solution of SSusing the Expectation Maximization algorithm (EM).The CK algorithm proceeds by computing a rank- D  ap-proximation  V  ∈  R P  × D of   W  from its SVD  W  =  UΣV  . Asshown in [10], when the motions are  independent  , the shapeinteraction matrix  Q  =  VV  ∈  R P  × P  is such that Q ij  = 0  if points  i  and  j  belong to different objects. (9)With noisy data, this equation holds only approximately.CK’salgorithmobtainsthesegmentationbymaximizingthesum of squared entries of the noisy  Q  in different groups.However, this process is very sensitive to noise [5, 10, 19].The SS algorithm [10, 11] deals with noise using twoprinciples:  dimension correction  and  model selection . Di-mension correction is used to induce exact zero entries in Q  by replacing points in a group with their projections ontoan optimally fitted subspace. Model selection, particularlythe Geometric Akaike Information Criterion [9] (G-AIC), isused to decide whether to merge two groups. This can beachieved by applying CK’s method to a scaled version of   QS ij  =  G-AIC W  i , W  j G-AIC W  i ∪W  j max k ∈W  i ,l ∈W  j | Q kl | .  (10)However, in most practical sequences the motion sub-spaces are degenerate,  e.g . of dimension three for 2-D trans-lational motions. In this case the SS algorithm gives wrongresults, because the calculation of the G-AIC uses the in-correct dimensions for the individual subspaces. The MSLalgorithm deals with degenerate motions by assuming thatthetypeofdegeneracyisknown( e.g .2-Dtranslational), andcomputing the G-AIC accordingly. Another issue is that inmost practical sequences the motion subspaces are partiallydependent. In this case, the SS algorithm also gives wrongresults, because equation (9) does not hold even with per-fect data. To overcome these issues, the MSL algorithm it-eratively refines the segmentation given by the SS algorithmusing EM for clustering subspaces as follows:1. Obtain an initial segmentation using SS adapted to in-dependent 2-D translational motions.2. Use the current solution to initialize an EM algorithmadapted to independent 2-D translational motions.3. Use the current solution to initialize an EM algorithmadapted to independent affine subspaces.4. Use the current solution to initialize an EM algorithmadapted to full and independent linear subspaces.The intuition behind the MSL algorithm is as follows. If the motions are degenerate, then the first two stages willgive a good solution, which will simply be refined by thelast two stages. On the other hand, if the motions are notdegenerate, then the third stage will anyhow provide a goodinitialization for the last stage to operate correctly.As with all algorithms based on EM, the MSL methodsuffers from convergence to a local minimum. Therefore,good initialization is needed to reach the global optimum.When the initialization is not good, it often happens that thealgorithm takes a long time to converge (several hours), asit performs a series of optimization problems. Another dis-advantage is that the algorithm is not designed for partiallydependent motions, thus sometimes its performance is notideal. In spite of these difficulties in theory, in practice thealgorithm is quite accurate, as we will see in § 5. 3.4. Random Sample Consensus (RANSAC) [4, 15] RANdom SAmple Consensus (RANSAC) is a statisticalmethod for fitting a model to a cloud of points corruptedwith outliers in a statistically robust way. More specifically,if   d  is the minimum number of points required to fit a modelto the data, RANSAC randomly samples  d  points from thedata, fits a model to these  d  points, computes the residual of each data point to this model, and chooses the points whoseresidual is below a threshold as the inliers. The procedure isthen repeated for another  d  sample points, until the numberof inliers is above a threshold, or enough samples have beendrawn. The outputs of the algorithm are the parameters of the model and the labeling of inliers and outliers.In the case of motion segmentation, the model to be fitby RANSAC is a subspace of dimension  d . Since there aremultiplesubspaces, RANSACproceedsiterativelybyfittingone subspace at a time as follows:1. Apply RANSAC to the srcinal data set and recover abasis for the first subspace along with the set of inliers.Allpointsinothersubspacesareconsideredasoutliers.2. Remove the inliers from the current data set and repeatstep 1 until all the subspaces are recovered.3. For each setof inliers, usePCAtofind an optimalbasisfor each subspace. Segment the data into multiple sub-spaces by assigning each point to its closest subspace.The main advantage of RANSAC is its ability to handleoutliers explicitly. Also, notice that RANSAC can deal withpartially dependent motions, because it computes one sub-space at a time. However, the performance of RANSAC de-teriorates quickly as the number of motions  n  increases, be-cause the probability of drawing  d  inliers reduces exponen-tially with the number of subspaces. Another drawback of RANSAC is that it uses  d  = 4  as the dimension of the sub-spaces, which is not the minimum number of points neededto define a degenerate subspace (of dimension 2 or 3).  3.5. Reference Data from real sequences contain not only noise and out-liers, but also some degree of perspective effects, which arenot accounted for by the affine model. Therefore, obtaininga perfect segmentation is not always possible.In order to verify the validity of the affine model on realdata, we will also compare the performance of affine algo-rithms with an “oracle” algorithm (here called  Reference ).This algorithm cannot be used in practice, because it re-quires the ground truth segmentation as an input. The algo-rithmusesleast-squarestofitasubspacetothedatapointsineach group using the SVD. Then, the data are re-segmentedby assigning each point to its nearest subspace.This Reference algorithm shows, with a perfect estima-tion of the subspaces, if the data can be segmented usingthe approximation of affine cameras and constitutes a goodterm of comparison for all the other (practical) algorithms. 4. Benchmark We collected a database of   50 video sequences  of indoorand outdoors scenes containing two or three motions. Eachvideo sequence  X   with three motions was split into threemotion sequences  X g12  ,  X g13  and  X g23  containing thepoints from groups one and two, one and three, and twoand three, respectively. This gave a total of   155 motion se-quences :120 with two motions and 35 with three motions.Figure 1 shows a few sample images from the videos inthe database with feature points superimposed. The entiredatabase is available at  http://www.vision.jhu.edu .These sequences contain degenerate and non-degeneratemotions, independent and partially dependent motions, ar-ticulated motions, nonrigid motions, etc. To summarize theamount of motion present in all the sequences, we estimatedthe rotation and translation between all pairs of consecutiveframes for each motion in each sequence. This informationwas used to produce the histograms shown in Figure 2.Based on the content of the video and the type of motion,the sequences can be categorized into three main groups: Checkerboard sequences:  this group consists of 104 se-quences of indoor scenes taken with a handheld camera un-der controlled conditions. The checkerboard pattern on theobjects is used to assure a large number of tracked points.Sequences  1R2RC  – 2T3RTCR  contain three motions: twoobjects (identified by the numbers  1  and  2  , or  2   and  3 ) andthecameraitself(identifiedbytheletter C  ).Thetypeofmo-tion of each object is indicated by a letter:  R  for rotation, T   for translation and  RT   for both rotation and translation.If there is no letter after the  C  , this signifies that the cam-era is fixed. For example, if a sequence is called  1R2TC  it means that the first object rotates, the second translatesand the camera is fixed. Sequence  three-cars  is taken from[18] and contains three motions of two toy cars and a boxmoving on a plane (the table) taken by a fixed camera. (a) 1R2RCT B (b) 2T3RCRT(c) cars3 (d) cars10(e) people2 (f) kanatani3 Figure 1: Sample images from some sequences in thedatabase with tracked points superimposed. 0 1 2 3 4 5 6 7 80102030Rotation [°]    O  c  c  u  r  e  n  c  e  s   [   %   ] (a) Amount of rotation  θ  =  acos ` ( trace ( R  f  +1 R f  ))  −  1) / 2 ´  in degrees. 0 0.02 0.04 0.06 0.08 0.1 0.120204060Translation (normalized)    O  c  c  u  r  e  n  c  e  s   [   %   ] (b) Amount of translation  τ   =   t f   / max { depth } . Figure 2: Histograms with the amount of rotation and trans-lation between two consecutive frames for each motion. Traffic sequences:  this group consists of 38 sequences of outdoor traffic scenes taken by a moving handheld camera.Sequences  carsX  – truckX   have vehicles moving on a street.Sequences  kanatani1  and  kanatani2   are taken from [14] anddisplay a car moving in a parking lot. Most scenes containdegenerate motions, particularly linear and planar motions.
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks
SAVE OUR EARTH

We need your sign to support Project to invent "SMART AND CONTROLLABLE REFLECTIVE BALLOONS" to cover the Sun and Save Our Earth.

More details...

Sign Now!

We are very appreciated for your Prompt Action!

x