Pattern Recognition (Theodoridis)

Pattern recognition is a scientific discipline that is becoming increasingly important in the age of automation and information handling and retrieval. Pattern Recognition, 2e covers the entire spectrum of pattern recognition applications, from image analysis to speech recognition and communications, and presents cutting-edge material on neural networks. The use of pattern recognition and classification is fundamental to many of the automated electronic systems in use today.

However, despite the existence of a number of notable books in the field, the subject remains very challenging, especially for the beginner. Pattern Recognition and Classification presents a comprehensive introduction to the field. The book offers a thorough introduction to pattern recognition aimed at master's and advanced bachelor's students of engineering and the natural sciences.

Besides classification, the heart of pattern recognition, special emphasis is put on features: their typology, their properties, and their systematic construction. Additionally, general principles that govern the field are discussed. This is the first text to provide a unified and self-contained introduction to visual pattern recognition and machine learning. It is useful as a general introduction to artificial intelligence and knowledge engineering, and no previous knowledge of pattern recognition or machine learning is necessary.

In addition, the book teaches students how to recognize patterns and distinguish the similarities and differences between them. Observing the environment and recognizing patterns with the goal of decision making is central to human nature. This book deals with the scientific discipline that enables similar perception in machines through pattern recognition, which has applications in diverse technology areas: character recognition, image processing, industrial automation, web searches, speech recognition, and medical diagnosis.

Introduction to Pattern Recognition. Pattern Recognition and Classification by Geoff Dougherty. Thus, probability density functions (pdfs) give their place to probabilities. Acyclic means that there are no cycles in the graph.

For example, the graph in Figure 2. The joint probability of the variables can now be obtained by multiplying all conditional probabilities with the prior probabilities of the root nodes.
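To make the chain-rule computation concrete, here is a minimal Python sketch for a hypothetical three-node chain x -> y -> z with binary variables; the conditional-probability values are illustrative, not taken from the figure.

```python
# Hypothetical chain DAG x -> y -> z with binary variables.
# The joint is the product of each node's conditional given its parents.
P_x = {0: 0.4, 1: 0.6}                                          # prior of the root node
P_y_x = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}    # key: (y, x)
P_z_y = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}    # key: (z, y)

def joint(x, y, z):
    """P(x, y, z) = P(x) P(y|x) P(z|y): the chain rule on this DAG."""
    return P_x[x] * P_y_x[(y, x)] * P_z_y[(z, y)]

# Sanity check: the eight joint probabilities sum to one.
total = sum(joint(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1))
```

The same pattern extends to any DAG: one factor per node, conditioned on that node's parents.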

All that is needed is to perform a topological sorting of the random variables; that is, to order the variables such that every variable comes before its descendants in the related graph.
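Topological sorting itself is standard; a sketch using Kahn's algorithm (the node names and edges below are illustrative, not the figure's):

```python
from collections import deque

def topological_sort(nodes, edges):
    """Kahn's algorithm: repeatedly output a node with no remaining parents."""
    indeg = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for parent, child in edges:
        indeg[child] += 1
        children[parent].append(child)
    queue = deque(n for n in nodes if indeg[n] == 0)   # root nodes first
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for c in children[n]:
            indeg[c] -= 1
            if indeg[c] == 0:       # all of c's parents already placed
                queue.append(c)
    return order

order = topological_sort(['x', 'y', 'z', 'w'],
                         [('x', 'y'), ('y', 'z'), ('y', 'w')])
```

Every variable ends up after all of its ancestors, which is exactly the property the chain-rule factorization needs.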

Bayesian networks have been used in a variety of applications. The network in Figure 2. S stands for smokers, C for lung cancer, and H for heart disease. H1 and H2 are heart disease medical tests, and C1 and C2 are cancer medical tests. The tables along the nodes of the tree are the respective conditional probabilities. The probabilities used in Figure 2.

A detailed treatment of the topic is beyond the scope of this book; the interested reader may consult more specialized texts, such as [Neap 04]. Given the values of some of the variables, known as evidence, the goal is to compute the conditional probabilities for some or all of the other variables in the graph, given the evidence. For notational simplicity we avoid subscripts, and the involved variables are denoted by x, y, z, w. Each variable is assumed to be binary.

Those below the graph can be derived. Take, for example, the y node. Note that all of these parameters should be available prior to performing probability inference. Suppose now that (a) x is measured and its value is x1 (the evidence); we seek to compute P(z1|x1) and P(w0|x1); or (b) w is measured and its value is w1; we seek to compute P(x0|w1) and P(z1|w1). To answer (a), the following calculations are in order. This idea can be carried out to any net of any size of the form given in Figure 2.

For Bayesian networks that have a tree structure, probability inference is achieved via a combination of downward and upward computations propagated through the tree. See, for example, [Pear 88, Laur 96]. For the case of singly connected graphs, these algorithms have complexity that is linear in the number of nodes.

A singly connected graph is one that has no more than one path between any two nodes. Although it is beyond our scope to focus on algorithmic details, it is quite instructive to highlight the basic idea around which this type of algorithm evolves. Let us take as an example the DAG shown in Figure 2. For more variables and a large number of values, L, this can be a prohibitively large number. Let us now exploit the structure of the Bayesian network in order to reduce this computational burden.

Taking into account the relations implied by the topology of the graph shown in Figure 2. Underbraces indicate what variable the result of each summation depends on. Thus, the total number of operations required to compute 2. Each summation can be viewed as a processing stage that removes a variable and provides as output a function. The essence of the algorithm given in [Li 94] is to search for the factorization that requires the minimal number of operations.
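The factorization idea can be illustrated on a hypothetical binary chain x -> y -> z -> w: summing variables out one at a time gives the same marginal as brute-force summation over the full joint, with far fewer operations. The conditional-probability values are made up for the demo.

```python
# Chain DAG x -> y -> z -> w, binary variables; illustrative CPT values.
Px  = [0.3, 0.7]                       # Px[x]
Pyx = [[0.8, 0.2], [0.4, 0.6]]         # Pyx[x][y]
Pzy = [[0.9, 0.1], [0.25, 0.75]]       # Pzy[y][z]
Pwz = [[0.6, 0.4], [0.1, 0.9]]         # Pwz[z][w]

def p_w_brute(w):
    """Brute force: sum the full joint over all configurations of x, y, z."""
    return sum(Px[x] * Pyx[x][y] * Pzy[y][z] * Pwz[z][w]
               for x in (0, 1) for y in (0, 1) for z in (0, 1))

def p_w_elim(w):
    """Variable elimination: each summation removes one variable and
    outputs a function of the next one, as the underbraces indicate."""
    f_y = [sum(Px[x] * Pyx[x][y] for x in (0, 1)) for y in (0, 1)]    # removes x
    f_z = [sum(f_y[y] * Pzy[y][z] for y in (0, 1)) for z in (0, 1)]   # removes y
    return sum(f_z[z] * Pwz[z][w] for z in (0, 1))                    # removes z
```

For L values per variable and a chain of n nodes, elimination costs on the order of n*L^2 operations instead of the exponential L^n of the brute-force sum.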

This algorithm also has linear complexity in the number of nodes for singly connected networks. In general, for multiply connected networks the probability inference problem is NP-hard [Coop 90]. In light of this result, one tries to seek approximate solutions, as in [Dagu 93]. Training: Training of a Bayesian network consists of two parts.

For example, the fraction (relative frequency) of the number of instances that an event occurs over the total number of trials performed is a way to approximate probabilities. A review of learning procedures can be found in [Heck 95]. For the reader who wishes to delve further into the exciting world of Bayesian networks, the books [Pear 88, Neap 04, Jens 01] will prove indispensable tools. Hint: It is easier to work with the probability of correct decision.
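The relative-frequency approximation of a probability, mentioned above, in a minimal sketch (the bias value 0.3 and trial count are arbitrary choices):

```python
import random

random.seed(0)
p_true = 0.3          # the event's (in practice unknown) probability
N = 100_000           # number of trials
count = sum(1 for _ in range(N) if random.random() < p_true)
p_hat = count / N     # relative frequency approximates the probability
```

By the law of large numbers, p_hat approaches p_true as N grows.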

This is known as the Neyman-Pearson test, and it is similar to the Bayesian minimum risk rule. Assuming that the classes are equiprobable, (a) classify the feature vector [1. Derive the corresponding linear discriminant functions and the equation describing the decision surface. Observe that this is a decreasing function of dm. What is the percentage error for each case? Hint: To generate the vectors, recall from [Papo 91, p. Show that the decision hyperplane at the point x0, Eq.

Hint: (a) Compute the gradient of the Mahalanobis distance with respect to x. Comment on the results. Using a multivariate Gaussian pdf, show that this corresponds to a maximum and not to a minimum. Comment on the result. Hint: For the latter, note that the probabilities add to one; thus a Lagrange multiplier must be used.

Generate samples according to the following rule. This rule repeats until all samples have been generated. The distance between the centers of the circles is greater than 4r.

Let N be the number of the available training samples. In the sequel, generate vectors from each class and classify them according to the NN and 3NN rules.
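A minimal sketch of the NN and 3NN (k-nearest-neighbor) rules referred to above; the two-class toy training set is illustrative:

```python
import math
from collections import Counter

def knn_classify(train, labels, x, k):
    """Majority vote among the k training samples nearest to x (Euclidean)."""
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Illustrative two-class training set.
train = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 1, 1, 1]
```

With k = 1 this is the NN rule; with k = 3, the 3NN rule of the exercise.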

Observe that for large values of h the variance is small. Compute P(x1|z0) and P(w0|z0). Based on this test, compute the probability that the patient has developed cancer.

Needless to say, there may be other implementations of these functions. Short comments are also given along with the code. Solution: Just type mvnrnd(m,S,N). P is the c-dimensional vector that contains the a priori probabilities of the classes.

It plots: (a) the data vectors of X, using a different color for each class, and (b) the mean vectors of the class distributions. It is assumed that the data live in the two-dimensional space. Write a MATLAB function that will take as inputs: (a) the mean vectors, (b) the covariance matrices of the class distributions of a c-class problem, (c) the a priori probabilities of the c classes, and (d) a matrix X containing column vectors that stem from the above classes.
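The exercise asks for MATLAB; as a language-neutral illustration, here is a hedged Python sketch of the same Bayesian classifier, simplified to isotropic Gaussian class densities (the means, variances, and priors below are assumptions for the demo, not values from the book):

```python
import math

def gauss2d(x, m, var):
    """Isotropic 2-D Gaussian density (a simplifying assumption for this sketch)."""
    d2 = (x[0] - m[0]) ** 2 + (x[1] - m[1]) ** 2
    return math.exp(-d2 / (2.0 * var)) / (2.0 * math.pi * var)

def bayes_classify(means, variances, priors, X):
    """Assign each vector to the class maximizing prior * likelihood."""
    return [max(range(len(means)),
                key=lambda j: priors[j] * gauss2d(x, means[j], variances[j]))
            for x in X]

# Illustrative two-class setting.
means, variances, priors = [(0.0, 0.0), (3.0, 3.0)], [1.0, 1.0], [0.5, 0.5]
preds = bayes_classify(means, variances, priors, [(0.2, -0.1), (2.9, 3.2)])
```

With equal priors and equal variances this reduces to the minimum-Euclidean-distance classifier the following exercises also ask for.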

They are used to serve as references, as we will see later on. Write a MATLAB function that will take as inputs: (a) the mean vectors, and (b) a matrix X containing column vectors that stem from the above classes.

Write a MATLAB function that will take as inputs: (a) the mean vectors, (b) the covariance matrix of the class distributions of a c-class problem, and (c) a matrix X containing column vectors that stem from the above classes.

Its output will be the percentage of the places where the two vectors differ (i.e., the classification error). This is important for the reproducibility of the results. Solution: Figure 2. This is directly affected by the corresponding covariance matrix. Using the same settings, generate a data set Z, where the class from which a data vector stems is known.

Bayesian Theory, John Wiley.
Pattern Recognition and Machine Learning, Springer.
Introduction to Statistical Pattern Recognition, 2nd ed.
Bayesian Networks and Decision Graphs, Springer.
Graphical Models, Oxford University Press.
Learning Bayesian Networks, Prentice Hall.
Probability, Random Variables and Stochastic Processes, 3rd ed.

Figure 3. On one side of the plane, g(x) takes positive values and on the other, negative. We will approach the problem as a typical optimization task (Appendix C). Thus we need to adopt (a) an appropriate cost function and (b) an algorithmic scheme to optimize it.

Obviously, the sum in 3. The perceptron cost function in 3. However, we must be careful here. Note that Eq. The weight vector is then corrected according to the preceding rule. A pseudocode for the perceptron algorithm is given below.
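A minimal Python rendering of the perceptron rule described here, operating in the extended space [x; 1] with labels in {+1, -1} (the learning rate rho and the stopping rule are illustrative choices):

```python
def perceptron_train(X, y, rho=1.0, max_epochs=100):
    """Perceptron in the extended space: on a misclassified sample,
    correct the weight vector in the direction of x."""
    w = [0.0] * (len(X[0]) + 1)               # last entry plays the role of w0
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, y):
            xe = list(x) + [1.0]              # extended vector [x; 1]
            s = sum(wi * xi for wi, xi in zip(w, xe))
            if t * s <= 0:                    # misclassified (or on the plane)
                w = [wi + rho * t * xi for wi, xi in zip(w, xe)]
                errors += 1
        if errors == 0:                       # converged: all samples correct
            break
    return w

# Toy linearly separable data.
X = [(0.0, 0.0), (0.0, 1.0), (2.0, 2.0), (2.0, 3.0)]
y = [-1, -1, 1, 1]
w = perceptron_train(X, y)
```

For linearly separable classes the loop is guaranteed to terminate; the resulting hyperplane is one of infinitely many valid solutions.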

The perceptron algorithm corrects the weight vector in the direction of x. No doubt, this sequence is critical for the convergence. The update of the weight vector is in the direction of x in order to turn the decision hyperplane to include x in the correct class. The solution is not unique, because there are more than one hyperplanes separating two linearly separable classes.

The convergence proof is necessary because the algorithm is not a true gradient descent algorithm and the general tools for the convergence of gradient descent schemes cannot be applied.

Then from 3. Hence, 3. An example of a sequence satisfying conditions 3. In other words, the corrections become increasingly small. Example 3. We will now state another simpler and also popular form. The N training vectors enter the algorithm cyclically, one after the other. Let w t be the weight vector estimate and x t the corresponding feature vector, presented at the tth iteration step. The algorithm belongs to a more general algorithmic family known as reward and punishment schemes.

The perceptron algorithm was originally proposed by Rosenblatt in the late 1950s. The algorithm was developed for training the perceptron, the basic unit used for modeling neurons of the brain.

This was considered central in developing powerful models for machine learning [Rose 58, Min 88]. According to 3. Step 3. Step 4. These are known as synaptic weights or simply synapses. The products are summed up together with the threshold value w0. The result then goes through a nonlinear device, which implements the so-called activation function. Another popular choice is 1 and 0, and it is achieved by choosing the two levels of the step function appropriately.

This basic network is known as a perceptron or neuron. Later on we will use the perceptron as the basic building element for more complex learning networks. Sometimes a neuron with a hard limiter device is referred to as a McCulloch-Pitts neuron. Other types of neurons will be considered in Chapter 4.

The Pocket Algorithm

A basic requirement for the convergence of the perceptron algorithm is the linear separability of the classes.

If this is not true, as is usually the case in practice, the perceptron algorithm does not converge. Set a history counter hs of the weight vector ws to zero. Continue the iterations. The generalization to an M-class task is straightforward. That is, we require all the vectors to lie on the same side of the decision hyperplane. The initial vector of the algorithm, w(0), is computed using a uniform pseudorandom sequence generator in [0, 1]. The goal now is to compute the corresponding weight vector under a suitable optimality criterion.
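A sketch of a pocket-style variant in Python: perceptron updates run as usual, while the best weight vector found so far is kept "in the pocket". Keeping the lowest-training-error weights is a simplification of the history-counter bookkeeping described above; data, rates, and iteration counts are illustrative.

```python
import random

def pocket_train(X, y, rho=1.0, iters=200, seed=0):
    """Perceptron updates with a 'pocket': keep the weights with the lowest
    training error seen so far (a simplified pocket criterion)."""
    rng = random.Random(seed)
    data = list(zip(X, y))

    def n_errors(w):
        return sum(1 for x, t in data
                   if t * sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) <= 0)

    w = [rng.random() for _ in range(len(X[0]) + 1)]   # uniform init in [0, 1]
    pocket, pocket_err = list(w), n_errors(w)
    for _ in range(iters):
        x, t = rng.choice(data)
        xe = list(x) + [1.0]
        if t * sum(wi * xi for wi, xi in zip(w, xe)) <= 0:   # misclassified
            w = [wi + rho * t * xi for wi, xi in zip(w, xe)]
            err = n_errors(w)
            if err < pocket_err:                             # better: pocket it
                pocket, pocket_err = list(w), err
    return pocket, pocket_err

# XOR-like data: not linearly separable, so the plain perceptron never converges.
X = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
y = [-1, -1, 1, 1]
pocket, pocket_err = pocket_train(X, y)
```

The pocketed error can only decrease over the run, so the returned hyperplane is at least as good on the training set as the random initialization.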

The least squares methods are familiar to us, in one way or another, from our early college courses. Let us then build upon them. However, we will have to live with errors; that is, the true output will not always be equal to the desired one. Thus, the mean square optimal weight vector results as the solution of a linear set of equations, provided, of course, that the correlation matrix is invertible.

It is interesting to point out that there is a geometrical interpretation of this solution. Random variables can be considered as points in a vector space. This is illustrated by an example in Figure 3. Equation 3. The corresponding desired output responses i. This is in agreement with the two-class case.

That is, matrix W has as columns the weight vectors w i. The MSE criterion in 3. This presupposes knowledge of the underlying distributions, which in general are not known. Thus, our major goal now becomes to see if it is possible to solve 3. The answer has been provided by Robbins and Monro [Robb 51] in the more general context of stochastic approximation theory. The proof is beyond the scope of the present text. However, we will demonstrate its validity via an example.

Most natural! Let us now return to our original problem and apply the iteration to solve 3. Then 3. The algorithm is known as the least mean squares (LMS) or Widrow-Hoff algorithm, after those who suggested it in the early 1960s [Widr 60, Widr 90]. The algorithm converges asymptotically to the MSE solution. A number of variants of the LMS algorithm have been suggested and used. The interested reader may consult, for example, [Hayk 96, Kalou 93]. However, in this case the algorithm does not converge to the MSE solution.
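A minimal sketch of the LMS (Widrow-Hoff) recursion on a toy linear-regression task; the step size mu, epoch count, and data are illustrative choices:

```python
def lms(samples, mu=0.1, epochs=200):
    """Widrow-Hoff LMS: w <- w + mu * (d - w.x) * x, one sample at a time."""
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, d in samples:
            err = d - sum(wi * xi for wi, xi in zip(w, x))   # instantaneous error
            w = [wi + mu * err * xi for wi, xi in zip(w, x)]
    return w

# Toy task: the desired responses obey d = 2*x1 - x2 exactly.
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0),
           ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
w = lms(samples)
```

Because the toy data are noise-free and consistent, the recursion settles at the exact weights; with noisy data it hovers around the MSE solution with a variance controlled by mu.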

In case k is a time index, LMS is a time-adaptive scheme, which adapts to the solution as successive samples become available to the system. This type of training, which neglects the nonlinearity during training and applies the desired response just after the adder of the linear combiner part of the neuron Figure 3.

In other words, the adaline is a neuron that is trained according to the LMS instead of the perceptron algorithm. In this way, we overcome the need for explicit knowledge of the underlying pdfs. Minimizing 3. Matrix X^T X is known as the sample correlation matrix. The solution obtained by the pseudoinverse is the vector that minimizes the sum of error squares.
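For two features, the sum-of-error-squares solution via the normal equations (which is what the pseudoinverse computes when X^T X is invertible) can be written out directly; a sketch with illustrative data:

```python
def least_squares(X, y):
    """w = (X^T X)^{-1} X^T y for a two-feature problem, solved in closed form."""
    # Build the 2x2 sample correlation matrix X^T X and the vector X^T y.
    a = sum(x[0] * x[0] for x in X)
    b = sum(x[0] * x[1] for x in X)
    c = sum(x[1] * x[1] for x in X)
    p = sum(x[0] * t for x, t in zip(X, y))
    q = sum(x[1] * t for x, t in zip(X, y))
    det = a * c - b * b                      # assumed nonzero (invertible X^T X)
    return [(c * p - b * q) / det, (a * q - b * p) / det]

# Same toy data as before: desired responses obey d = 2*x1 - x2.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
y = [2.0, -1.0, 1.0, 3.0]
w = least_squares(X, y)
```

The batch solution agrees with what the LMS recursion approaches sample by sample.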

It is easy to show that under mild assumptions the sum of error squares tends to the MSE solution for large values of N Problem 3. Of course, this is not necessary. Obviously, all we have said so far is still applicable. However, the interesting aspect of this generalization would be to compute these desired values in an optimal way, in order to obtain a better solution.

The Ho—Kashyap algorithm is such a scheme solving for both the optimal w and optimal desired values yi. The interested reader may consult [Ho 65, Tou 74].

Generalization to the multi-class case follows the same concept as that introduced for the MSE cost, and it is easily shown that it reduces to M equivalent problems of scalar desired responses, one for each discriminant function Problem 3.

The task is not linearly separable. However, the resulting sum of error squares is minimum. The task of interest is to estimate the value of y, given the value of x that is obtained from an experiment. In a more general setting, the values of y may not be discrete.

The task now is to estimate (predict) the value of y, given the value of x. This type of problem is known as a regression task. One of the most popular optimality criteria for regression is the mean square error (MSE).

In this section, we will focus on the MSE regression and highlight some of its properties. It will be shown that it results in higher mean square error. In general, this is a nonlinear vector-valued function of x.

It can be shown Problem 3. We will consider the multiclass case. Given x, we want to estimate its class label. Let gi x be the discriminant functions to be designed.

The cost function in Eq. Adding and subtracting (E[yi|x])^2, Eq. Let us concentrate and look at it more carefully. Training the discriminant functions gi with desired outputs 1 or 0 in the MSE sense, Eq. An important issue here is to assess how good the resulting estimates are. If, for example, we adopt linear models, as was the case in Eq.

Our focus in the next chapter will be on developing modeling techniques for nonlinear functions. The latter plays its part when the approximation accuracy issue comes into the scene.

MSE cost is just one of the costs that have this important property. Other cost functions share this property too; see, for example, [Rich 91, Bish 95, Pear 90, Cid 99]. To emphasize the explicit dependence on D, we write g(x; D). The key factor here is the dependence of the approximation on D. The effectiveness of an estimator can be evaluated by computing its mean square deviation from the desired optimal value.

In other words, even if the estimator is unbiased, it can still result in a large mean square error due to a large variance term. Increasing the bias decreases the variance and vice versa. This is known as the bias—variance dilemma. This behavior is reasonable. Thus, it will result in low bias but will yield high variance, as we change from one data set to another. The major issue now is to seek ways to make both bias and variance low at the same time. This is natural, because one takes advantage of the available information and helps the optimization process.
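A toy simulation of the dilemma: estimator A ignores the data entirely (zero variance, large bias), while estimator B simply returns a noisy observation of f(x0) from a single training point (near-zero bias, variance equal to the noise level). The target function, test point, and noise level are all illustrative choices.

```python
import math
import random

random.seed(1)
f = lambda x: math.sin(2.0 * math.pi * x)   # the (in practice unknown) target
x0, sigma, runs = 0.25, 0.3, 2000           # test point, noise level, data sets

# Estimator B: predict the noisy training value y = f(x0) + noise,
# redrawn for each of the `runs` independent training sets.
preds_b = [f(x0) + random.gauss(0.0, sigma) for _ in range(runs)]
mean_b = sum(preds_b) / runs
bias2_b = (mean_b - f(x0)) ** 2                            # near zero
var_b = sum((p - mean_b) ** 2 for p in preds_b) / runs     # near sigma**2

# Estimator A: always predict 0, whatever the data set.
bias2_a, var_a = (0.0 - f(x0)) ** 2, 0.0                   # large bias, no variance
```

Across the two estimators, one term is traded for the other; the mean square error is the sum bias^2 + variance in both cases.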

Such an assumption is not an unreasonable one. On the other hand, since g(x) has been chosen arbitrarily, in general one expects the bias term to be large. In contrast to g(x), the function g1(x), shown in Figure 3. That is, at the training points, the bias is zero. Due to the continuity of f(x) and g1(x), one expects similar behavior at the points that lie in the vicinity of the training points xi.

Thus, if N is large enough we can expect the bias to be small for all the points in the interval [0, 1]. However, now the variance increases. However, now the algebra gets a bit more involved, and some further assumptions need to be adopted e. A simple and excellent treatment of the bias—variance dilemma task can be found in [Gema 92]. As for ourselves, this was only the beginning. More on the optimization task and the properties of the obtained solution can be found in, for example, [Ande 82, McLa 92].

There is a close relationship between the method of logistic discrimination and the LDA method, discussed in Chapter 2. It does not take much thought to realize that under the Gaussian assumption and for equal covariance matrices across all classes the following holds true.

However, LDA and logistic discrimination are not identical methods. Their subtle difference lies in the way the unknown parameters are estimated. In LDA, the class probability densities are assumed to be Gaussian and the unknown parameters are, basically, estimated by maximizing 3. In this maximization, the marginal probability densities p x k play their own part, since they enter implicitly into the game.

However, in the case of logistic discrimination, marginal densities contribute to C and do not affect the solution. Thus, if the Gaussian assumption is a reasonable one for the problem at hand, LDA is the natural approach since it exploits all available information.

However, in practice it has been reported [Hast 01] that there is little difference between the results obtained by the two methods. Generalizations of the logistic discrimination method to include nonlinear models have also been suggested. See, for example, [Yee 96, Hast 01]. We will start with the two-class linearly separable task, and then we will extend the method to more general cases where data are not separable.

As we have already discussed in Section 3. The perceptron algorithm may converge to any one of the possible solutions. Having gained in experience, this time we will be more demanding. Both hyperplanes do the job for the training set. No doubt the answer is: the full-line one. Thus such a hyperplane can be trusted more, when it is faced with the challenge of operating with unknown data.

We will come to this issue over and over again. Let us now quantify the term margin that a hyperplane leaves from both classes. Every hyperplane is characterized by its direction determined by w and its exact position in space determined by w0. This is illustrated in Figure 3. Our goal is to search for the direction that gives the maximum possible margin. However, each hyperplane is determined within a scaling factor.

We will free ourselves from it, by appropriate scaling of all the candidate hyperplanes. Recall from Section 3. We have now reached the point where mathematics will take over. This is a nonlinear quadratic optimization task subject to a set of linear inequality constraints. As pointed out in Appendix C, a nonzero Lagrange multiplier corresponds to a so-called active constraint. Hence, as the set of constraints in 3.

In practice, w0 is computed as an average value obtained using all conditions of this type. Furthermore, the inequality constraints consist of linear functions. As discussed in Appendix C, these two conditions guarantee that any local minimum is also global and unique. This is most welcome. Having stated all these very interesting properties of the optimal hyperplane of a support vector machine, we next need to compute the involved parameters.

From a computational point of view this is not always an easy task, and a number of algorithms exist, for example, [Baza 79]. We will move to a path, which is suggested to us by the special nature of our optimization task, given in 3.

As we discuss in Appendix C, such problems can be solved by considering the so-called Lagrangian duality. We have already gained something. The training feature vectors enter into the problem via equality constraints rather than inequality ones, which can be easier to handle. Substituting 3. The training vectors enter into the game in pairs, in the form of inner products. This is most interesting. The cost function does not depend explicitly on the dimensionality of the input space!

We will return to this at the end of Chapter 4. In words, the expansion of w in terms of support vectors in 3. These vectors comply with the constraints in 3.

These are the points placed in squares in Figure 3. The optimizing task becomes more involved, yet it falls under the same rationale as before. In an M-class problem, a straightforward extension is to consider it as a set of M two-class problems (one-against-all).

This becomes more serious when the number of classes is relatively large. An alternative technique is the one-against-one. The decision is made on the basis of a majority vote.

In [Plat 00] a methodology is suggested that may speed up the procedure. A different and very interesting rationale has been adopted in [Diet 95].

The multiclass task is treated in the context of error correcting coding, inspired by the coding schemes used in communications. Each class is now represented by a binary code word of length L. Each row must be distinct and corresponds to a class. Herein lies the power of the technique. For the matrix in 3. For example, one grouping is based on the existence of a horizontal line in the numeric digits.
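A sketch of the decoding step with a hypothetical 5-bit code for four classes: every pair of code words differs in at least 3 positions, so any single flipped binary-classifier output is still decoded correctly.

```python
# Hypothetical 5-bit code words for a 4-class task (minimum Hamming distance 3).
# Each row is a class code word; each column defines one binary subproblem.
code = {0: (0, 0, 0, 0, 0), 1: (0, 1, 0, 1, 1),
        2: (1, 0, 1, 0, 1), 3: (1, 1, 1, 1, 0)}

def hamming(a, b):
    """Number of positions in which two bit tuples differ."""
    return sum(u != v for u, v in zip(a, b))

def decode(bits):
    """Assign the class whose code word is nearest in Hamming distance
    to the L bits produced by the L binary classifiers."""
    return min(code, key=lambda c: hamming(code[c], bits))
```

With minimum distance d between rows, up to floor((d-1)/2) erroneous binary decisions can be corrected, which is exactly the error-correcting-code rationale.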

In [Zhou 08], the composition of the individual binary problems and their number code word length, L is the result of a data-adaptive procedure that designs the code words by taking into account the inherent structure of the training data.

A naive implementation of a quadratic programming (QP) solver takes O(N^3) operations, and its memory requirements are of the order of O(N^2). For problems with a relatively small number of training data, any general-purpose optimization algorithm can be used.

However, for a large number of training points, this is no longer practical. Training of SVM is usually performed in batch mode. For large problems this sets high demands on computer memory requirements. To attack such problems, a number of procedures have been devised. Their philosophy relies on the decomposition, in one way or another, of the optimization problem into a sequence of smaller ones, for example, [Bose 92, Osun 97, Chan 00].

Optimization is, then, performed on this subset via a general optimizer. Support vectors remain in the working set while others are replaced by new ones, outside the current working set, that violate severely the KKT conditions.

It can be shown that this iterative procedure guarantees that the cost function is decreasing at each iteration step. In [Plat 99, Matt 99], the so-called Sequential Minimal Optimization (SMO) algorithm is proposed, where the idea of decomposition is pushed to its extreme and each working set consists of only two points. Its great advantage is that the optimization can now be performed analytically.

In [Keer 01], a set of heuristics is used for the choice of the pair of points that constitute the working set. To this end, it is suggested that the use of two threshold parameters can lead to considerable speedups. Theoretical issues related to the algorithm, such as convergence, are addressed in [Chen 06] and the references therein.

The parallel implementation of the algorithm is considered in [Cao 06]. In [Joac 98] the working set is the result of a search for the steepest feasible direction. It is reported that substantial computational savings can be obtained compared to the SMO algorithm. A sequential algorithm, which operates on the primal problem formulation, has been proposed in [Navi 01], where an iterative reweighted least squares procedure is employed and alternates weight optimization with constraint forcing.

An advantage of the latter technique is that it naturally leads to online implementations. Another trend is to employ an algorithm that aims at an approximate solution to the problem.

In [Fine 01] a low-rank approximation is used in place of the so-called kernel matrix, which is involved in the computations. In [Tsan 06, Hush 06] the issues of complexity and accuracy of the approximation are considered together. For large problems, the test phase can also be quite demanding, if the number of support vectors is excessively high. Methods that speed up computations have also been suggested, for example, [Burg 97, Nguy 06]. The points lie on the corners of a square, as shown in Figure 3.

Indeed, a careful observation of Figure 3. For any other direction, e. It must be pointed out that the same solution is obtained if one solves the associated KKT conditions Problem 3. Let us now consider the mathematical formulation of our problem. However, all of them lead to the unique optimal separating line. The full line in Figure 3. Dotted lines meet the conditions given in 3. The setting in Figure 3. This is because the second term in 3.

In other words, the width of the margin does not depend entirely on the data distribution, as was the case with the separable class case, but is heavily affected by the choice of C.

However, since the margin is such an important entity in the design of SVM (after all, the essence of the SVM methodology is to maximize it), a natural question that arises is why not involve it in a more direct way in the cost function, instead of leaving its control to a parameter (i.e., C).

To this end, in [Scho 00] a variant of the soft margin SVM was introduced. Under this new setting, the primal problem given in 3. We will revisit this issue later on. Observe that in contrast to 3.

Also, the new formulation has an extra constraint. As we will see in the next section, it leads to a geometric interpretation of the SVM task for nonseparable classes. First, as we have already commented, it directly affects the computational load, since large Ns means that a large number of inner products are to be computed for classifying an unknown pattern. Second, as we will see at the end of Section 5.

It can be shown e. It turns out that solving the dual optimization problem in 3. In other words, searching for the maximum margin hyperplane is equivalent to searching for two nearest points between the corresponding convex hulls!

Let us investigate this a bit further. Having established the geometric interpretation of the SVM optimization task, any algorithm that has been developed to search for nearest points between convex hulls e. It is now the turn of the nonseparable class problem to enter into the game, which, at this point becomes more exciting. Obviously, this has no effect on the solution.

Hence, the solution obtained via 3. The Wolfe dual representation of the primal problem in 3. In Figure 3. Adopting a procedure similar to the one that led to 3. In the separable class case, the constraints 3. In contrast, in the nonseparable class case a lower upper bound i. From the geometry point of view, this means that the search for the nearest points is limited within the respective reduced convex hulls. It is natural to choose it as the one bisecting the line segment joining two nearest points between the reduced convex hulls.

Then, as can be deduced from Figure 3. Thus, both approaches result in a separating hyperplane pointing in the same direction recall from Section 3. However, it is early to say that the two solutions are exactly the same. However, note that the value in Eq. In other words, a solution is feasible if the centroids of the two classes do not coincide.

This is because, for the latter case, such algorithms rely on the extreme points of the involved convex hulls. This is not the case for the reduced convex hulls, where extreme points are combinations of points of the original data sets. After convergence, draw the corresponding decision line. Produce samples for each of the classes. Show that in this case the sum of error squares criterion and the ML estimation result in identical estimates. Hint: Take N training data samples of known class labels.

For simplicity, consider equiprobable classes. Restrict the search for the optimum among the lines crossing the origin. It returns the estimated parameter vector. Function g x is approximated in terms of up to order r polynomials of the x components, for large enough r.

The generalization of 4. That is, even for medium-size values of the network order and the input space dimensionality the number of free parameters gets very high. Let us consider, for example, our familiar nonlinearly separable XOR problem. One can easily observe the close relation that exists between this and the Parzen approximation method for the probability density functions of Chapter 2.

In contrast, in 4. Coming back to Figure 4. As has already been said in Section 4. At this point, it is important to stress one basic difference between RBF networks and multilayer perceptrons.

Hence, the output is the same for all points on a hyperplane. In other words, the activation responses of the nodes are of a local nature in the RBF and of a global nature in the multilayer perceptron networks. This intrinsic difference has important repercussions for both the convergence speed and the generalization performance. In general, multilayer perceptrons learn slower than their RBF counterparts.

Simulation results in [Hart 90] show that, in order to achieve performance similar to that of multilayer perceptrons, an RBF network should be of much higher order. Let us now come back to our XOR problem and adopt an RBF network to perform the mapping to a linearly separable class problem.

Figure 4. The decision curve is linear in the transformed space (a) and nonlinear in the original space (b). In our example we selected the centers c1, c2 as [0, 0]^T and [1, 1]^T. This is an important issue for RBF networks.
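To make this concrete, the following small sketch applies the RBF mapping with the centers c1 = [0, 0]^T and c2 = [1, 1]^T to the four XOR points (the Gaussian width sigma = 1 is an assumed value; the text does not fix it). In the transformed (z1, z2) space the two classes become linearly separable.

```python
import numpy as np

# The four XOR points: class A -> (0,0), (1,1); class B -> (0,1), (1,0)
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)

# Centers as in the text; the width sigma is an assumed value
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
sigma = 1.0

def rbf_map(x):
    """Map x to (z1, z2) = (exp(-||x-c1||^2/(2s^2)), exp(-||x-c2||^2/(2s^2)))."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

Z = np.array([rbf_map(x) for x in X])
sums = Z.sum(axis=1)
# In the (z1, z2) plane, both class-A points map to the same z1 + z2 value,
# which differs from the common value of the class-B points, so a straight
# line z1 + z2 = constant separates the classes.
print(sums)
```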

Some basic directions on how to tackle this problem are given in the following. Provided that the training set is distributed in a representative manner over all the feature vector space, this seems to be a reasonable way to choose the centers. The computational complexity of such a scheme is prohibitive for a number of practical situations. To overcome this drawback, alternative techniques have been suggested.

One way is to choose the centers in a manner that is representative of the way data are distributed in space. This can be achieved by unraveling the clustering properties of the data and choosing a representative for each cluster as the corresponding center [Mood 89]. This is a typical problem of unsupervised learning,and algorithms discussed in the relevant chapters later in the book can be employed.

The unknown weights, wi, are then learned via a supervised scheme. Thus, such schemes use a combination of supervised and unsupervised learning procedures. An alternative strategy is described in [Chen 91]. A large number of candidate centers is initially chosen from the training vector set.
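As an illustration of this hybrid scheme, the sketch below first places the centers by plain k-means clustering (the unsupervised step) and then learns the output weights by least squares (the supervised step). The toy data, the number of centers k, the width sigma, and the least-squares weight estimation are illustrative assumptions, not the book's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data (illustrative; not from the book)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(2.0, 0.5, (50, 2))])
y = np.hstack([np.zeros(50), np.ones(50)])

def kmeans(X, k, iters=50):
    """Unsupervised step: plain k-means to place the RBF centers."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

k, sigma = 4, 1.0
centers = kmeans(X, k)

# Supervised step: with the centers fixed, the output weights are linear
# parameters and can be learned by least squares
Phi = np.exp(-((X[:, None] - centers[None]) ** 2).sum(-1) / (2 * sigma ** 2))
Phi = np.hstack([Phi, np.ones((len(X), 1))])   # constant (bias) term
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

pred = (Phi @ w > 0.5).astype(int)
print((pred == y).mean())  # training accuracy, close to 1 on this easy data
```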

This technique also provides a way to estimate the order of the model k. A recursive form of the method, which can lead to computational savings, is given in [Gomm 00]. Another method has been proposed based on support vector machines.

The idea behind this methodology is to look at the RBF network as a mapping machine, through the kernels, into a high-dimensional space. These are the support vectors, which correspond to the centers in the input space.

The training consists of a quadratic programming problem and guarantees a global optimum [Scho 97]. The nice feature of this algorithm is that it automatically computes all the unknown parameters including the number of centers.

We will return to it later in this chapter. In [Plat 91] an approach similar in spirit to the constructive techniques,discussed for the multilayer perceptrons, has been suggested.

The novelty of each training input-desired output pair is determined by two conditions: (a) the input vector is very far (according to a threshold) from all already existing centers, and (b) the corresponding output error, using the RBF network trained up to this point, is greater than another predetermined threshold.

If not, the input-desired output pair is used to update the parameters of the network according to the adopted training algorithm, for example, the gradient descent scheme. A variant of this scheme that allows removal of previously assigned centers has also been suggested in [Ying 98]. This is basically a combination of the constructive and pruning philosophies.

The procedure suggested in [Kara 97] also moves along the same direction. However, the assignment of the new centers is based on a procedure of progressive splitting of the feature space, according to a splitting criterion, using clustering or learning vector quantization techniques (discussed in the relevant chapters later in the book). As was the case with the aforementioned techniques, network growing and training are performed concurrently.

A number of other techniques have also been suggested. For a review see, for example, [Hush 93]. A comparison of RBF networks with different center selection strategies versus multilayer perceptrons in the context of speech recognition is given in [Wett 92]. A major problem associated with polynomial expansions is that good approximations are usually achieved only for large values of r. That is, the convergence to g(x) is slow.

However, large values of r, besides the computational complexity and generalization issues (due to the large number of free parameters required), also lead to poor numerical accuracy in the computations, because of the large number of products involved. The slow decrease of the approximation error with respect to the system order and the input space dimension is common to all expansions of the form 4. The scenario becomes different if data-adaptive functions are chosen, as is the case with the multilayer perceptrons.
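To quantify this blow-up: the number of monomials of degree up to r in l variables is the binomial coefficient C(l + r, r), a standard combinatorial fact. A short sketch:

```python
from math import comb

def num_poly_params(l, r):
    """Number of monomials of degree <= r in l variables: C(l + r, r)."""
    return comb(l + r, r)

# Even moderate input dimensions l and orders r blow up quickly
for l, r in [(2, 3), (10, 3), (10, 6), (50, 4)]:
    print(l, r, num_poly_params(l, r))
# e.g., l = 10, r = 6 already needs 8008 free parameters
```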

In the latter, the argument in the activation functions is f(w^T x), with w computed in an optimal fashion from the available data. In other words, the input space dimension does not enter explicitly into the scene and the error is inversely proportional to the system order, that is, the number of neurons. Obviously, the price we pay for it is that the optimization process is now nonlinear, with the associated disadvantage of the potential for convergence to local minima.

The universal approximation property is also true for the class of RBF functions. Obviously, in Eq. The critical computation involving the unknown feature vector, x, in Eq.

After normalization, and combining Eqs. Each node corresponds to a training point, and it is numbered accordingly. Only the synaptic weights for the kth node are drawn. The synaptic weights, leading to the kth hidden node, consist of the components of the respective normalized training feature vector x_k.

In other words, the training of this type of network is very simple and is directly dictated by the values of the training points. Output nodes are linear combiners. Each output node is connected to all hidden layer nodes associated with the respective class. Probabilistic neural network architectures were introduced in [Spec 90], and they have been used in a number of applications, for example, [Rome 97, Stre 94, Rutk 04].
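A minimal sketch of such a probabilistic neural network follows; the function name, the width sigma, and the toy data are illustrative assumptions. Each training point supplies one hidden (pattern) node whose weights are the normalized training vector, and each class output node simply sums the activations of its own class's nodes.

```python
import numpy as np

def pnn_classify(X_train, y_train, x, sigma=0.3):
    """Sketch of a probabilistic neural network: one hidden (pattern)
    node per training point, with weights equal to the normalized
    training vector; each class output node sums the activations of
    the nodes of its own class, and the largest sum wins."""
    Xn = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    xn = x / np.linalg.norm(x)
    # For unit vectors, exp((x^T x_k - 1) / sigma^2) is the Gaussian
    # activation expressed through the inner product
    act = np.exp((Xn @ xn - 1.0) / sigma ** 2)
    classes = np.unique(y_train)
    scores = np.array([act[y_train == c].sum() for c in classes])
    return classes[np.argmax(scores)]

# Toy data: two classes at roughly orthogonal directions
X = np.array([[1.0, 0.1], [1.0, 0.2], [0.1, 1.0], [0.2, 1.0]])
y = np.array([0, 0, 1, 1])
print(pnn_classify(X, y, np.array([1.0, 0.0])))  # 0
print(pnn_classify(X, y, np.array([0.0, 1.0])))  # 1
```

Note that, as the text says, no iterative training is needed: the hidden-layer weights are dictated directly by the training points.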

Then, in the framework discussed in Section 4. However, there is an elegant property in the SVM methodology that can be exploited for the development of a more general approach. Recall from Chapter 3 that, in the computations involved in the Wolfe dual representation, the feature vectors participate in pairs, via the inner product operation.

Thus, once more, only inner products enter into the scene. If the design is to take place in the new k-dimensional space, the only difference is that the involved vectors will be the k-dimensional mappings of the original input feature vectors. A naive look at it would lead to the conclusion that now the complexity is much higher, since, usually, k is much higher than the input space dimensionality l, in order to make the classes linearly separable.

However, there is a nice surprise just waiting for us. Let us start with a simple example. Most interesting! The opposite is always true; that is, for any symmetric, continuous function K(x, z) satisfying 4. This is the case, for example, for the radial basis Gaussian kernel [Burg 99]. For more on these issues, the mathematically inclined reader is referred to [Cour 53]. The number of nodes is determined by the number of support vectors Ns. The nodes perform the inner products between the mapping of x and the corresponding mappings of the support vectors in the high-dimensional space, via the kernel operation.
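The finite-sample face of this result is easy to check numerically: for any choice of points, the Gaussian kernel matrix is symmetric and positive semidefinite. A sketch (the data are random and purely illustrative):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)): the inner product
    <phi(x_i), phi(x_j)> in the induced (infinite-dimensional) feature
    space, computed without ever forming phi explicitly."""
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))          # arbitrary points
K = gaussian_kernel_matrix(X)

# Finite-sample form of Mercer's condition: symmetric, positive semidefinite
print(np.allclose(K, K.T))                    # True
print(np.linalg.eigvalsh(K).min() > -1e-10)   # True
```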

Dotted lines mark the margin and circled points the support vectors. However, the approach followed here is different. In Section 4. The Gaussian RBF kernel was used. In the SVM approach,the number of nodes as well as the centers are the result of the optimization procedure.

If it is chosen as a kernel, the resulting architecture is a special case of a two-layer perceptron. Once more, the number of nodes is the result of the optimization procedure. This is important. Although the SVM architecture is the same as that of a two-layer perceptron, the training procedure is entirely different for the two methods.

The same is true for the RBF networks. Thus, the curse of dimensionality is bypassed. In other words, one designs in a high-dimensional space without having to adopt explicit models using a large number of parameters, as this would be dictated by the high dimensionality of the space. We will return to this issue at the end of Chapter 5.

This is still an unsolved, yet challenging, research issue. Once a kernel function has been adopted, the so-called kernel parameters have to be tuned. A different approach to the task of data-adaptive kernel tuning, with the same goal of improving the error performance, is to use information geometry arguments [Amar 99].

The basic idea behind this approach is to introduce a conformal mapping into the Riemannian geometry induced by the chosen kernel function, aiming at enhancing the margin. It turns out that under some very general assumptions S is a Riemannian manifold with a metric that can be expressed solely in terms of the kernel. Sometimes this is also known as the kernel trick. This is possible if all the computations can be expressed in terms of inner product operations.

The perceptron rule was introduced in Section 3. This drawback has prevented the perceptron algorithm from being used in realistic practical applications. In this perspective, the kernelized version of the perceptron rule transcends its historical, theoretical, and educational role and takes on a more practical value as a candidate for solving linearly separable tasks in the RKHS.

We will choose to work on the perceptron algorithm in its reward and punishment form, given in Eqs. The heart of the method is the update given by Eqs. We have already done so in Section 4.

After all, it is a small world! Using the Gaussian kernel in Eq. Equation 4. There are a few differences, however. In contrast to the Parzen expansion, g(x) in 4. Moreover, from the practical point of view, the most important difference lies in the different number of terms involved in the summation. In practice, only a small fraction of the training points enters the summation in Eq.
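A sketch of the kernelized reward-punishment perceptron, run on the XOR task (the Gaussian kernel width and the epoch cap are assumed values): instead of updating a weight vector, the algorithm keeps a count of how many times each training point triggered a correction, and the decision function is expressed entirely through kernel evaluations.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=100):
    """Perceptron in reward-punishment form, kernelized: keep a count
    alpha[i] of how often point i triggered a correction; the decision
    function is g(x) = sum_i alpha[i] * y[i] * K(x_i, x)."""
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    for _ in range(epochs):
        errors = 0
        for i in range(n):
            if y[i] * ((alpha * y) @ K[:, i]) <= 0:  # misclassified (or on boundary)
                alpha[i] += 1                        # punishment step
                errors += 1
        if errors == 0:                              # converged: separable in RKHS
            break
    return alpha

# XOR, not linearly separable in the input space, becomes separable
# in the RKHS induced by the Gaussian kernel
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([-1, -1, 1, 1])
alpha = kernel_perceptron(X, y, rbf)
g = lambda x: (alpha * y) @ np.array([rbf(xi, x) for xi in X])
print([int(np.sign(g(x))) for x in X])  # [-1, -1, 1, 1]
```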

A sparse solution spends computational resources only on the most relevant of the training patterns. A closer look behind the SVM philosophy reveals that the source of sparsity in the obtained solution is the presence of the margin term in the cost function.

Another way to view the term ||w||^2 in the cost function in 3. A bias constant term can also be added, but it has been omitted for simplicity. Functions of the form in 4. For a more mathematical treatment of this result, the interested reader may refer to, for example, [Scho 02]. In order to see how this theorem can simplify the search for the optimal solution in practice, let us consider the following example. Example 4.

Hence, recalling 4. Based on 4. The computation of the unknown weights is carried out in the Bayesian framework rationale (Chapter 2). Memory requirements scale with the square, and the required computational resources scale with the cube, of the number of basis functions, which makes the algorithm less practical for large data sets.

In contrast, the memory requirements for the SVMs are linear, and the number of computations is somewhere between linear and approximately quadratic in the size of the training set [Plat 99]. A more recent trend is to obtain solutions of the form in Eq.

That is, the solution is updated each time a new training pair (yi, xi) is received. This is most important when the statistics of the involved data are slowly time varying. The summation accounts for the total number of errors committed on all the samples that have been received up to the time instant t. Instead of minimizing, for example, 4. It can be shown see, e.

However, formulating the optimization task as in 4. In [Slav 08] an adaptive solution of the cost in Eq. It is shown that such a constraint becomes equivalent to imposing a forgetting factor that forces the algorithm to forget data in the remote past and adaptation focuses on the most recent samples.

The algorithm scales linearly with the number of data corresponding to its effective memory due to the forgetting factor. A dictionary of basis functions is adaptively formed. If the dependence measure is below a threshold value, the new sample is included in the dictionary, whose cardinality is thereby increased by one; otherwise the dictionary remains unaltered. The expansion of the solution is carried out by using only the basis functions associated with the samples in the dictionary.
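A rough sketch of such a dictionary-based scheme follows. The coherence-style dependence measure (maximal kernel value against the current dictionary), the threshold mu, the LMS-style coefficient update, and the class name are all illustrative assumptions, not the specific algorithms of the cited papers.

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian kernel."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

class SparseOnlineKernel:
    """Sketch of a dictionary-based online kernel scheme: a new sample
    enters the dictionary only if its maximal kernel value against the
    current dictionary falls below mu (a coherence-style dependence
    measure); otherwise only the existing coefficients are adapted."""
    def __init__(self, mu=0.5, eta=0.2):
        self.mu, self.eta = mu, eta
        self.dict, self.alpha = [], []

    def predict(self, x):
        return sum(a * rbf(d, x) for a, d in zip(self.alpha, self.dict))

    def update(self, x, y):
        err = y - self.predict(x)
        if not self.dict or max(rbf(d, x) for d in self.dict) < self.mu:
            self.dict.append(x)              # novel sample: grow dictionary
            self.alpha.append(self.eta * err)
        else:
            # not novel: gradient (LMS-style) step on existing coefficients
            for i, d in enumerate(self.dict):
                self.alpha[i] += self.eta * err * rbf(d, x)

# Track a simple nonlinear target; the dictionary stays small
rng = np.random.default_rng(2)
model = SparseOnlineKernel(mu=0.7)
for _ in range(2000):
    x = rng.uniform(-3, 3, size=1)
    model.update(x, np.sin(x[0]))
print(len(model.dict))  # far fewer dictionary entries than the 2000 samples seen
```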

A pitfall of this technique is that the complexity scales with the square of the size of the dictionary, as opposed to the linear complexity of the two previous adaptive techniques. In our discussion so far we have assumed the use of a loss function. The choice of the loss function is user-dependent. For smaller values, the loss function becomes positive, and it is also linearly increasing as the value of y g(x) becomes smaller, moving toward negative values.

That is, it provides a measure of how far from the margin the estimate lies.
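The loss described here is the hinge loss used with SVMs; a minimal sketch (placing the margin at y g(x) = 1 is the standard convention, which the surrounding text does not state explicitly):

```python
import numpy as np

def hinge_loss(y, g):
    """Hinge loss: zero when y*g(x) >= 1 (beyond the margin), and
    increasing linearly as y*g(x) moves below 1 toward negative values."""
    return np.maximum(0.0, 1.0 - y * g)

margins = np.array([2.0, 1.0, 0.5, 0.0, -1.0])   # sample values of y*g(x)
print(hinge_loss(1.0, margins))  # losses: 0, 0, 0.5, 1, 2
```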


