On the Effects of Dimensionality on Data Analysis with Neural Networks

M. Verleysen (1), D. François (2), G. Simon (1), V. Wertz (2) *

Université catholique de Louvain
(1) DICE, 3 place du Levant, B-1348 Louvain-la-Neuve, Belgium
(2) CESAME, 4 av. G. Lemaitre, B-1348 Louvain-la-Neuve, Belgium
{verleysen, simon}@dice.ucl.ac.be, {francois, wertz}@auto.ucl.ac.be

* MV is a Senior Research Associate at the Belgian FNRS. GS is funded by the Belgian FRIA. The work of DF and VW is supported by the Interuniversity Attraction Pole (IAP), initiated by the Belgian Federal State, Ministry of Sciences, Technologies and Culture. The scientific responsibility rests with the authors.

J. Mira (Ed.): IWANN 2003, LNCS 2687, pp. 105-112, 2003. © Springer-Verlag Berlin Heidelberg 2003

Abstract. Modern data analysis often faces high-dimensional data. Nevertheless, most neural network data analysis tools are not adapted to high-dimensional spaces, because of the use of conventional concepts (such as the Euclidean distance) that scale poorly with dimension. This paper shows some limitations of such concepts and suggests some research directions, such as the use of alternative distance definitions and of non-linear dimension reduction.

1. Introduction

In the last few years, data analysis has become a specific discipline, sometimes far from its mathematical and statistical origin, where understanding the problems and limitations coming from the data themselves is often more valuable than developing complex algorithms and methods. The specificity of modern data mining is that huge amounts of data are considered. There are new fields where data mining becomes crucial (medical research, financial analysis, etc.); furthermore, collecting huge amounts of data often becomes easier and cheaper.

A main concern in that direction is the dimensionality of data. Think of each measurement as one observation, each observation being composed of a set of variables. Analyzing 10000 observations of 3 variables each is very different from analyzing 100 observations of 50 variables each! One way to get a feeling for this difficulty is to imagine each observation as a point in a space whose dimension is the number of variables. 10000 observations in a 3-dimensional space most probably form a structured shape, one or several clouds, from which it is possible to extract relevant information, like principal directions, variances of clouds, etc. On the contrary, at first sight 100 observations in a 50-dimensional space do not represent anything specific, because the number of observations is too low. Nevertheless, many modern datasets have this unpleasant characteristic of being high-dimensional. And despite the above difficulties, there are ways to analyze such data and extract information from the observations. If the current data analysis methodologies are not adapted to high-dimensional, sparse data, then it is our duty to develop adapted methods, even if some well-admitted concepts must be questioned. In particular, artificial neural network methods, now widely and successfully used in data analysis, should be confronted with high-dimensional data and modified if necessary.

This paper makes no pretence of presenting generic solutions to this problem; the current state of the art is far from that. However, we will illustrate some surprising facts about high-dimensional data in general (Section 2), and about the use of neural networks in this context (Section 3).
In particular, we will show that the use of standard notions such as the Euclidean distance, the nearest neighbor, and more generally similarity search, is not adapted to high-dimensional spaces. There is thus a need for alternative solutions; this paper only gives paths towards future developments (Section 4), towards a new research activity that could considerably influence the field of neural networks for data mining in the next few years.

2. Some weird facts about high-dimensional space

High-dimensional spaces in fact escape our mental representations. What we take for granted in dimension one, two or three, because we can figure it out quite easily, might not actually hold in higher dimensions. Let us highlight some weird facts.

2.1. The empty space phenomenon

Scott and Thompson [1] first noticed some counter-intuitive facts related to high-dimensional Euclidean spaces, and described what they called the "empty space phenomenon".

Fact 1. The volume of a hyper-sphere of unit radius goes to zero as the dimension grows. The volume of a sphere of radius r in d dimensions is given by

    V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} \, r^d .    (1)

Figure 1a shows this volume for r = 1; we see that it rapidly decreases with d. So, in high dimension, a unit sphere is nearly empty.

Fact 2. The ratio between the volumes of a sphere and a cube of the same radius tends towards zero with increasing dimension, as illustrated in Figure 1b. In one dimension these volumes are equal, and in two dimensions the ratio is approximately 0.8, but in higher dimension the volume of a hyper-cube concentrates in its corners.

Fact 3. The ratio of the volumes of two spheres of radius 1 and 1-ε tends towards zero, given the obvious fact that its value is equal to (1-ε) to the power d. With d as small as 20 and ε = 0.1, the outer shell spanning only 10% of the radius contains about 90% of the volume of the outer sphere, so the volume concentrates in a thin outer shell. The same holds for hyper-cubes and hyper-ellipsoids as well.

Fig. 1. Left (a): volume of the unit sphere; Right (b): ratio between the volumes of the unit sphere and the unit cube, with respect to the dimension of the space.

These observations imply that high-dimensional spaces are mostly empty. They indeed show that local neighborhoods of points are mostly empty, and that even in the case of uniform distributions, data is concentrated at the borders of the volume of interest.

2.2. The concentration of measure phenomenon

We will now have a deeper look at the behavior of the widely used Euclidean distance (i.e. the L2-norm of the difference) when applied to high-dimensional vectors.

Fact 1. The standard deviation of the norm of random vectors converges to a constant as the dimension increases, whereas the expectation of their norm grows as the square root of the dimension. More precisely, it has been proven in [2] that, under an i.i.d. assumption on the components x_i of an n-dimensional vector x,

    \mu_x = E(\|x\|) = \sqrt{a n - b} + O(1/n) ,    (2)

    \sigma_x^2 = Var(\|x\|) = b + O(1/\sqrt{n}) ,    (3)

where a and b are constants depending only on the first four moments of x_i. Note that the same law applies to the Euclidean distance between any two points, since their difference is itself a random vector.

Fact 2. The difference between the distances from a randomly-chosen point to its furthest and nearest neighbors decreases as the dimensionality increases.
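These facts are easy to check numerically. The following short Python sketch is not part of the original paper; the sample size, the random seed, and the choice of points drawn uniformly in [0, 1]^d are arbitrary illustrative assumptions. It computes the unit-ball volume of equation (1) and its ratio to the enclosing cube, and estimates the mean and standard deviation of the norm together with the nearest/furthest-neighbor gap of Fact 2, for increasing dimension.

```python
import numpy as np
from math import pi, lgamma, exp

def unit_ball_volume(d):
    # Equation (1) with r = 1: V_d(1) = pi^(d/2) / Gamma(d/2 + 1)
    return exp((d / 2) * np.log(pi) - lgamma(d / 2 + 1))

rng = np.random.default_rng(0)
n_points = 2000  # arbitrary sample size for the empirical part

print("  d | V_d(1)   | ball/cube | mean||x|| | std||x|| | (Dmax-Dmin)/Dmin")
for d in (1, 2, 5, 10, 20, 50, 100):
    # Empty space phenomenon: volume of the unit ball and its ratio
    # to the enclosing cube of side 2 (same "radius").
    v = unit_ball_volume(d)
    ratio = v / 2.0 ** d

    # Concentration of measure: norms of points drawn uniformly in [0, 1]^d.
    x = rng.uniform(size=(n_points, d))
    norms = np.linalg.norm(x, axis=1)

    # Relative contrast for one query point (Fact 2): distances to all others.
    dist = np.linalg.norm(x[1:] - x[0], axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()

    print(f"{d:3d} | {v:8.2e} | {ratio:9.2e} | {norms.mean():9.3f} |"
          f" {norms.std():8.3f} | {contrast:8.3f}")
```

As d grows, the mean norm increases roughly as the square root of the dimension while its standard deviation levels off, as predicted by equations (2) and (3), and the gap between the nearest and furthest neighbors shrinks relative to the nearest distance.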
Fact 2 is formalized by the asymptotic behavior of the relative contrast [3]: if

    \lim_{d \to \infty} Var( \|x_d\| / E(\|x_d\|) ) = 0 , then (Dmax_d - Dmin_d) / Dmin_d \to_p 0 ,    (4)

where Dmin_d and Dmax_d are the distances from a particular point to its nearest and furthest neighbors respectively. Note that the hypothesis of the theorem is induced by equations (2) and (3). A more general proof of the theorem can be found in [4].

The conclusion we can draw from these observations is that, in high-dimensional spaces, all points tend to be equally distant from each other with respect to the Euclidean distance. As the dimension increases, the observed distance between any two points tends towards a constant. This can be illustrated by computing histograms of distances between random points of increasing dimensionality: (1) the mean of the histogram grows, and (2) its variance shrinks.

2.3. The curse of dimensionality

Finally, let us have a look at a not-so-weird but often-ignored fact, which Richard Bellman named "the curse of dimensionality" [5]. It refers to the huge number of points that are necessary to cover an input space in high dimension, for example with a regular grid spanning a certain portion of the space. Filling the hypercube [0, 1]^5 with a 0.1-spaced grid already requires no less than 100,000 points.

3. Consequences for neural network learning

The considerations developed in the previous section have important consequences for ANN (Artificial Neural Network) learning. The following subsections give examples of such consequences in specific contexts.

3.1. Supervised learning

When modeling a process that produces an output on the basis of observed values of particular inputs, one has to fit a chosen model to a dataset. The more extensive the dataset, the more accurate the model. Ideally, the dataset should span the whole input space of interest, in order to ensure that any predicted value (i.e. output of the model) is the result of an interpolation process and that no hazardous extrapolation occurs. But one has to face the curse of dimensionality. Silverman [6] addressed the problem of finding the number of training points (samples) necessary to approximate a Gaussian distribution with fixed Gaussian kernels. His results show that the required number of samples grows exponentially with the dimension. Fukunaga [7] obtained similar results for the k-NN classifier, showing that whereas 44 observations are sufficient in 4 dimensions, no fewer than 3.8 x 10^57 are necessary when the dimension is 128.

3.2. Local models

Local artificial neural networks are often argued to be more sensitive to dimensionality than global ones. By local models, we mean approximators (or classifiers, or density estimators) made of a combination of local functions (for example Gaussian kernels); examples of such approximators are RBFN (Radial-Basis Function Networks) and kernel methods. Indeed, Gaussian functions also have an unexpected behavior when extended to high-dimensional spaces. When a normal distribution with standard deviation σ is assumed, the probability density function of finding a point at distance r from the center of the distribution is given by [8]:

    f_d(r) = \frac{r^{d-1} e^{-r^2 / (2\sigma^2)}}{2^{d/2 - 1} \sigma^d \Gamma(d/2)} ,    (5)

which is maximum for r / \sigma = \sqrt{d - 1}.
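The location of this maximum is easy to verify numerically. The sketch below is not from the paper; the grid resolution, the unit value of sigma, and the chosen dimensions are arbitrary assumptions. It evaluates equation (5) (in log form, to avoid overflow) for several dimensions and compares the empirical mode of the radial density with \sigma\sqrt{d-1}.

```python
import numpy as np
from math import lgamma, log

def log_radial_density(r, d, sigma=1.0):
    # log of equation (5): density of the distance r = ||x|| for a
    # d-dimensional isotropic normal distribution with standard deviation sigma.
    return ((d - 1) * np.log(r) - r**2 / (2 * sigma**2)
            - (d / 2 - 1) * log(2) - d * log(sigma) - lgamma(d / 2))

sigma = 1.0
r = np.linspace(1e-6, 15.0, 200000)  # fine grid, wide enough for d <= 100

print("  d | empirical mode r* | sigma*sqrt(d-1)")
for d in (1, 2, 5, 20, 50, 100):
    mode = r[np.argmax(log_radial_density(r, d, sigma))]
    print(f"{d:3d} | {mode:17.3f} | {sigma * np.sqrt(d - 1):15.3f}")
```

For d = 1 the mode sits at the origin, but it moves out to roughly \sigma\sqrt{d-1} as the dimension grows, which is precisely the behavior discussed next.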
In one dimension, the radial density is maximum at the center of the distribution, as expected, but when the dimension grows, the maximum moves away from the center (see Figure 2), which becomes nearly empty even though the Gaussian density is maximal there! This shows that Gaussian kernels are no longer local in higher dimensions, and that models that have been conceived as sums of local kernels do not behave as such in high dimension.

Fig. 2. Probability density of a point from a normal distribution being at distance r from the center, for several space dimensions.

Fig. 3. Example of distance histogram in a several-clusters distribution.

This limitation to the use of local models, in particular with Gaussian kernels, seems severe. However, it should be emphasized that global models, such as MLP (Multi-Layer Perceptrons), probably suffer equally. Indeed, in many cases, sums of sigmoids as in MLPs result in functions taking significant values only in a limited region of the space. While mathematically different, models such as MLPs and RBFNs thus often behave similarly in practice. This reinforces the conviction that local and global models suffer equally from the curse of dimensionality (and related effects), even though this is probably harder to prove for global models.

3.3. Similarity search and Euclidean distances

Most neural network models, as well as clustering techniques, rely on the computation of distances between vectors. For RBFN, it is the distance between a data point and each kernel center; for MLP, it is the scalar product between a data point and the weight vector of each unit.
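To connect this back to the concentration results of Section 2.2, the following sketch (again not part of the paper; the uniform data distribution, the number of centers, and the common kernel width set to the mean distance are arbitrary illustrative choices) computes Gaussian RBF activations of a random high-dimensional point with respect to a set of random centers, and shows that the response of the nearest center becomes barely distinguishable from that of the furthest one as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(1)
n_centers = 50  # arbitrary number of kernel centers

print("   d | nearest/furthest distance | furthest/nearest activation")
for d in (2, 10, 50, 200, 1000):
    centers = rng.uniform(size=(n_centers, d))
    x = rng.uniform(size=d)  # one query point

    dist = np.linalg.norm(centers - x, axis=1)
    sigma = dist.mean()  # heuristic width of the order of the observed distances

    # Gaussian RBF activations phi_i(x) = exp(-||x - c_i||^2 / (2 sigma^2))
    act = np.exp(-dist**2 / (2 * sigma**2))

    print(f"{d:4d} | {dist.min() / dist.max():25.3f} |"
          f" {act.min() / act.max():27.3f}")
```

As the dimension grows, the ratio between the smallest and largest activations approaches one, so a layer of Gaussian kernels loses its ability to discriminate between nearby and faraway centers, exactly as the relative contrast result of equation (4) suggests.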