Test Run - Linear Discriminant Analysis Using C#

October 2015

Volume 30 number 10

By James McCaffrey

The goal of a binary classification problem is to predict the value of a variable that can take on one of two possible values. For example, you might want to predict the change in the stock price of some company (increase or decrease) based on predictor variables such as average number of shares sold, number of transactions, amount of insider ownership, price-to-earnings ratio, and so on. Or you might want to predict a person's political inclination (liberal or conservative) based on age, income, education, and so on.

There are roughly a dozen major algorithms that can be used for binary classification. Linear discriminant analysis (LDA) is one of the oldest approaches, dating from the 1930s. This article explains what LDA is, describes how it works and shows how to implement LDA in code.

Realistically, it's unlikely you'll ever need to write LDA code yourself. Even so, there are three reasons why this article might be useful to you. First, programming LDA gives you a thorough understanding of how LDA works, in case you ever run across it. Second, some of the coding techniques used when implementing LDA can be useful in other, more common programming scenarios. And third, the ideas behind LDA are exceptionally clever, and you may find LDA interesting in its own right.

The best way to get a feel for what LDA binary classification is, and to see where this article is headed, is to take a look at the demo program in Figure 1.


Figure 1 Linear Discriminant Analysis Binary Classification Demo

The goal of the demo program is to predict the political inclination, liberal or conservative, of a person based on the person's age and annual income. The demo uses a tiny eight-item training data set to build the LDA prediction model. Age and income have been normalized in some way so that the values of both are roughly the same magnitude.

Political inclination is encoded as liberal = 0 and conservative = 1. Unlike many binary classification algorithms, LDA can use any kind of encoding for the variable to predict, so political inclination could have been encoded as -1 and +1, or as "A" and "B".

Basic LDA binary classification can use any number of predictor variables, and it's possible to extend basic LDA to predict a variable that can take one of three or more values. For example, you could predict political inclination where the possible values are liberal, conservative and moderate. This article describes binary classification only.

The key to LDA is something called the linear discriminant, usually represented by a lowercase "w". Using the eight data items, the demo computes w = (-0.868, -0.496). The w array will always have the same number of values as there are predictor variables. Behind the scenes, when computing w, the demo calculates the mean of each of the two classes, then uses the means to compute a scatter matrix for each class, and finally uses the two scatter matrices to compute a combined within-class scatter matrix. The within-class scatter matrix is needed to compute w.

After computing w, the demo uses it to predict the class of a new person who has normalized age = 4.0 and normalized income = 5.0. The LDA prediction is that the person has a liberal political inclination.

This article assumes you have at least intermediate programming skills, but doesn't assume you know anything about the LDA classification algorithm. The demo is coded in C#, but you shouldn't have too much trouble refactoring the code to another language such as Visual Basic .NET or JavaScript.

Understanding LDA binary classification

The main concepts of LDA binary classification are illustrated in the two graphs in Figure 2. The top graph shows the data from the demo program. The three blue dots represent the three liberal people. The five red dots are the conservatives. The yellow dot at (4.0, 5.0) represents a person with unknown political inclination. Even for such a simple problem, it's not obvious whether the unknown person should be classified as a liberal or as a conservative.

Figure 2 LDA Prediction Using the Discriminant Vector

Note that the only reason the demo data can be displayed as in Figure 2 is that there are just two predictor variables. With more than two predictor variables, it wouldn't be possible to see the data so nicely in a 2D graph, but the same principles of LDA apply. Regardless of how many predictor variables there are, in binary LDA there will always be just two classes. It's easy to confuse the number of predictor variables (two or more) with the number of values the variable to predict can take (always exactly two for binary classification).

Most binary classification algorithms try to find a line (technically, a vector) that directly separates the two classes. For example, in the top graph in Figure 2, you can imagine a hypothetical separating line running from about (3, 9) down to about (5, 0). Any unknown data point on the left side of the line would be classified as liberal, and any point on the right side would be classified as conservative. LDA works quite differently.

The first step in LDA is to find a line called the discriminant. In Figure 2, the discriminant line is shown in green, and the discriminant is (0.868, 0.496). In LDA, the discriminant line is always identified by a single endpoint, and the starting point is always implicitly (0, 0, .., 0).

But wait a minute: the output of the demo program in Figure 1 shows the discriminant as (-0.868, -0.496) rather than (+0.868, +0.496). As it turns out, multiplying the components of the discriminant by any constant has no effect on the result. So, because it makes the graph in Figure 2 nicer looking and easier to interpret, for illustration purposes I used (+0.868, +0.496) for w rather than the actual computed values, which were negative. In other words, I multiplied both components by -1.

Put another way, the discriminant could be identified by any point on the green line in Figure 2, such as -2 * (-0.868, -0.496) = (1.736, 0.992). The standard, but not universal, approach in LDA is to scale the discriminant so that it has length 1 from the origin. Notice that the length from (0, 0) to (0.868, 0.496) = sqrt((0 - 0.868)^2 + (0 - 0.496)^2) = sqrt(0.754 + 0.246) = sqrt(1.000) = 1.

But what is the meaning of the discriminant vector? In Figure 2, black dashed lines are projected from each data point onto the discriminant line, where the dashed lines are perpendicular to the discriminant. As it turns out, the discriminant is the line, anchored at (0, 0), that simultaneously minimizes the spread of the projected points within each class and maximizes the distance between the two groups of projected points. This concept is not at all obvious.

OK, but even supposing it's possible to compute the discriminant vector for a set of data, it's still not clear how the discriminant can be used to make a prediction. In Figure 2, the purple diamond labeled "mean" is the average of the two class means. You can see that the mean is halfway between the two sets of data points. Now imagine a perpendicular dashed line from the purple mean to the green discriminant line. The projection hits the discriminant line at about (5.2, 3.0).

Next, imagine a perpendicular line from the yellow unknown point to the green discriminant line. Because the projection of the unknown point lands to the left of the projection of the mean, it's closer to the projections of the liberal data points, and so the unknown point is classified as liberal. If the projection of the unknown point onto the discriminant line had landed on the other side of the projection of the mean, the point would have been classified as conservative. A very clever idea.

Implementing LDA binary classification

The overall structure of the demo program, with a few WriteLine statements removed and minor edits made to save space, is presented in Figure 3. To create the demo program, I launched Visual Studio and created a new C# console application named LinearDiscriminate. The demo has no significant .NET version dependencies, so any version of Visual Studio should work. After the template code loaded into the editor, I deleted all using statements except for the single reference to the top-level System namespace. In the Solution Explorer window, I renamed file Program.cs to the more descriptive LinearDiscriminateProgram.cs, and Visual Studio automatically renamed class Program for me.

Figure 3 Overall Structure of the Demo Program
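A skeleton of that structure, sketched from the method descriptions in this article (the exact helper names are assumptions, and method bodies are omitted):

using System;
namespace LinearDiscriminate
{
  class LinearDiscriminateProgram
  {
    static void Main(string[] args) { .. }
    static double[][] Discriminate(double[][] data, bool unitize) { .. }
    static double[][] Mean(double[][] data, int c) { .. }
    static double[][] ScatterWithin(double[][] data) { .. }  // combined Sw
    static int Prediction(double[][] data, double[] x, double[][] w) { .. }
    static int Prediction(double[][] data, double[][] x, double[][] w) { .. }
    // matrix helpers: MatrixTranspose, MatrixAdd, MatrixSubtract,
    // MatrixProduct, MatrixInverse, Unitize, ShowMatrix, and so on
  }
}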

The demo program is too long to present in its entirety here, but you can find the complete source code in the download that accompanies this article. I used a static-method approach rather than an object-oriented programming approach.

The Main method begins by setting up the eight-item training data set, hard-coded into an array-of-arrays style matrix.
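A sketch of that setup, where each row holds the two normalized predictor values followed by the class label (0 = liberal, 1 = conservative). The three class-0 rows are the values given later in this article; the five class-1 rows are illustrative placeholders, not the actual demo values, which are in the code download:

double[][] data = new double[8][];
data[0] = new double[] { 1.0, 4.0, 0 };  // liberal items from this article
data[1] = new double[] { 2.0, 2.0, 0 };
data[2] = new double[] { 3.0, 3.0, 0 };
data[3] = new double[] { 6.0, 6.0, 1 };  // conservative items: placeholder
data[4] = new double[] { 7.0, 7.0, 1 };  // values for illustration only
data[5] = new double[] { 8.0, 9.0, 1 };
data[6] = new double[] { 8.0, 7.0, 1 };
data[7] = new double[] { 9.0, 8.0, 1 };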

In a non-demo scenario, you'd likely read data from a text file into memory using a helper method named something like LoadData. The discriminant is then computed by calling the Discriminate method.
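A sketch of that call, assuming a display helper named ShowMatrix:

double[][] w = Discriminate(data, true);  // true: scale w to unit length
Console.WriteLine("Discriminant w =");
ShowMatrix(w, 3, true);  // assumed helper: 3 decimals, trailing newline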

Note that the return value of the Discriminate method is an array-of-arrays style matrix rather than an array. Most of the operations used in LDA are matrix operations, such as matrix multiplication and matrix inversion. So, instead of being an array with two cells, here w is a matrix with two rows and one column. Such one-column matrices are called column vectors. If you don't work with matrices often, getting used to them in place of arrays can take a while.

In Figure 1, you can see that several intermediate objects are computed and displayed. The display statements, located in the Discriminate method and its helper methods, are intended to help you understand LDA. You'd probably remove the display statements in production code.

Next, the item to predict is set up.
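Consistent with the values shown in Figure 1, the setup is presumably just:

double[] x = new double[] { 4.0, 5.0 };  // normalized age and income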

For calling simplicity, the item to predict is housed in a plain numeric array here, even though it must later be converted to a one-column matrix. The prediction is then made by calling the Prediction method.
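A sketch of those statements:

int c = Prediction(data, x, w);
Console.WriteLine("Predicted class = " + c);  // 0 = liberal, 1 = conservative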

The Prediction method accepts the data matrix so it can compute the mean of the class means, shown as the purple diamond in Figure 2. The method also requires the item to predict, x, and the discriminant vector, w.

Computing the discriminant

Method Discriminate computes the LDA discriminant vector. With the WriteLine and display statements removed, the code is surprisingly short because most of the work is done by helper methods.
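A sketch that follows that description (the helper method names are assumptions):

static double[][] Discriminate(double[][] data, bool unitize)
{
  double[][] mean0 = Mean(data, 0);           // class 0 mean, as column vector
  double[][] mean1 = Mean(data, 1);           // class 1 mean
  double[][] Sw = ScatterWithin(data);        // combined within-class scatter
  double[][] SwInv = MatrixInverse(Sw);
  double[][] diff = MatrixSubtract(mean0, mean1);
  double[][] w = MatrixProduct(SwInv, diff);  // w = inverse(Sw) * (mean0 - mean1)
  if (unitize == true)
    return Unitize(w);                        // scale w to length 1.0
  else
    return w;
}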

The math derivation of the LDA discriminant is quite complex, but the results are fairly simple. Helper method Mean computes the mean of a given class (0 or 1). For example, suppose the three class-0 data items are (1, 4), (2, 2) and (3, 3). Their mean is ((1 + 2 + 3) / 3, (4 + 2 + 3) / 3) = (2.0, 3.0).
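A minimal sketch of what helper Mean might look like, assuming the last column of each data row holds the class label:

static double[][] Mean(double[][] data, int c)
{
  // compute the mean of class c, returned as a column vector
  int dim = data[0].Length - 1;  // number of predictor columns
  double[][] result = new double[dim][];
  for (int j = 0; j < dim; ++j)
    result[j] = new double[1];
  int count = 0;
  for (int i = 0; i < data.Length; ++i)
  {
    if ((int)data[i][dim] != c) continue;  // skip the other class
    for (int j = 0; j < dim; ++j)
      result[j][0] += data[i][j];
    ++count;
  }
  for (int j = 0; j < dim; ++j)
    result[j][0] /= count;
  return result;
}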

A scatter matrix is a measure of how variable a data set is. Given the two class means, mean0 and mean1, and the within-class scatter matrix Sw, the discriminant is the product of the inverse of Sw and the matrix difference between mean0 and mean1; that is, w = inverse(Sw) * (mean0 - mean1).

You can find several resources on the Internet that explain the rather complex math used to derive the equation for the discriminant. Be aware that there are many variations of the derivation of w. In particular, you may see references to covariance and to the between-class scatter matrix (Sb). A covariance matrix is a mathematical entity that's closely related to a within-class scatter matrix. The between-class scatter matrix is a mathematical entity used in the derivation of the LDA equation, but Sb isn't explicitly used to compute the discriminant or to make a prediction.

Method Discriminate has a Boolean parameter named unitize, which indicates whether to make the discriminant have unit length, that is, to scale it to a length of 1.0. In most situations you'll want to pass true as the corresponding argument.
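Unitizing simply divides each component of w by the vector's length. A sketch:

static double[][] Unitize(double[][] w)
{
  // scale column vector w so its length is 1.0
  double len = 0.0;
  for (int i = 0; i < w.Length; ++i)
    len += w[i][0] * w[i][0];
  len = Math.Sqrt(len);
  double[][] result = new double[w.Length][];
  for (int i = 0; i < w.Length; ++i)
    result[i] = new double[] { w[i][0] / len };
  return result;
}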

Making a prediction

The demo program has two overloaded Prediction methods. The first accepts the item to predict, x, as a plain numeric array.
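A sketch of that overload, where MatrixFromVector (an assumed helper name) converts a plain array into a one-column matrix:

static int Prediction(double[][] data, double[] x, double[][] w)
{
  double[][] xm = MatrixFromVector(x);  // convert x to a column matrix
  return Prediction(data, xm, w);
}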

This version of Prediction is provided for calling simplicity, and is just a wrapper around the version of Prediction that does the actual work.
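A sketch of the worker version, following the description in the next few paragraphs (again, the helper names are assumptions):

static int Prediction(double[][] data, double[][] x, double[][] w)
{
  double[][] wT = MatrixTranspose(w);      // 1 x dim row vector
  double[][] m0 = Mean(data, 0);           // class 0 mean
  double[][] m1 = Mean(data, 1);           // class 1 mean
  double[][] m = MatrixAdd(m0, m1);
  m = MatrixScalarProduct(m, 0.5);         // mean of the two class means
  double[][] tc = MatrixProduct(wT, m);    // 1 x 1 threshold constant
  double[][] wTx = MatrixProduct(wT, x);   // projection of x onto w
  if (wTx[0][0] > tc[0][0])                // greater-than worked for the demo;
    return 0;                              // the operator can vary by problem
  else
    return 1;
}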

Method Prediction computes the transpose of w so that w can be used in matrix multiplication. The means of the two classes are computed, and then the mean of those two means is computed by multiplying each component by 0.5, with the result stored in matrix m.

Matrix tc is the threshold constant, and is the product of the transpose of the discriminant vector, wT, and the mean of the class means, m. Matrix tc will always be a 1x1 matrix holding a single value. The value in tc represents the projection of the mean of the means onto the discriminant vector.

The projection of the item to predict, x, onto the discriminant vector is computed in a similar way, as the product of the transpose of the discriminant vector and x. Unfortunately, because the components of the discriminant can be positive or negative, whether to use a less-than or a greater-than Boolean comparison operator varies from problem to problem. The simplest approach is to try greater-than and see if the results make sense. It's possible to programmatically determine which operator to use, but doing so takes quite a bit more code than you might expect.

An option when making a prediction is to adjust the projection of the average of the class means to take into account the probabilities of each class. The adjustment is log(p0 / p1), where p0 is the probability of class 0 and p1 is the probability of class 1. For the demo data, p0 = 3/8 = 0.375 and p1 = 5/8 = 0.625, so the adjustment is log(0.375 / 0.625) = log(0.6) = -0.22. Note that if the two class probabilities are assumed to be equal, then p0 = p1 and the adjustment factor is log(1) = 0.

Comments and Limitations

There are actually several different kinds of LDA. The version presented in this article is usually called Fisher's LDA. LDA can also be used for feature selection (identifying which predictor variables are most useful, so that less useful predictors can be ignored) as well as for classification. And there's an entirely different statistical LDA (latent Dirichlet allocation) that's used in natural language processing.

LDA binary classification is mathematically elegant, but it has some important limitations. Behind the scenes, the various versions of LDA make certain assumptions, for example, that the predictor variables are normally distributed and have similar covariances. In many situations these assumptions aren't true. Even so, LDA often works quite well in practice even when the math assumptions aren't met. Another math-related limitation is that LDA must compute the inverse of the within-class scatter matrix. Matrix inversion is a tricky process that can easily fail.

Perhaps the biggest limitation of LDA binary classification is that it assumes the two classes are linearly separable. Loosely speaking, this means that in a graph like the one in Figure 2, it's possible to find a straight line that separates the two classes. The data for many problems simply aren't linearly separable.

In my opinion, although LDA binary classification is clever, alternative classification algorithms, notably logistic regression classification and neural network classification, are more practical. But LDA sure is interesting.


Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products, including Internet Explorer and Bing. Dr. McCaffrey can be reached at [email protected].

Thanks to the following Microsoft Research technical experts for reviewing this article: Madison Minsk and Amy Shan