# Pre-processing: feature selection, feature scaling, and dimensionality reduction

Pre-processing involves feature selection, feature scaling, and dimensionality reduction.

✓ Feature selection – We are only interested in retaining meaningful features that can help to build a good classifier. Feature selection is often based on domain knowledge or exploratory analyses, such as histograms or scatterplots. The feature selection approach will eventually lead to a smaller feature space.

✓ Feature scaling/normalization – Normalization and other feature scaling techniques are often needed to make different attributes comparable. If the attributes were measured on different scales, proper scaling of features is a requirement for many machine learning algorithms, particularly those that rely on distances or gradient-based optimization. Two common techniques are:
(i) Min-max scaling – the simplest approach; it rescales each feature to a fixed range, typically [0, 1], using x' = (x - min) / (max - min).
(ii) Standardization – it converts each feature so that it has a mean of 0 and a standard deviation of 1, using z = (x - mean) / std.
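
As a minimal sketch of the two scaling techniques above, the following NumPy code rescales each column of a small matrix. The function names `min_max_scale` and `standardize`, the sample data, and the handling of constant columns are assumptions made for illustration only.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Assumption: a constant column is mapped to 0 instead of dividing by zero.
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

def standardize(X):
    """Shift and scale each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Assumption: a constant column is left at 0 after centering.
    std[std == 0] = 1.0
    return (X - mean) / std

# Hypothetical data: height in cm and weight in kg, measured on different scales.
X = np.array([[150.0, 50.0],
              [160.0, 65.0],
              [170.0, 80.0],
              [180.0, 95.0]])

X_minmax = min_max_scale(X)
X_std = standardize(X)
```

After scaling, both columns span comparable ranges, so neither attribute dominates a distance computation simply because of its units.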

✓ Dimensionality reduction techniques – How do we analyze the characteristics of a dataset with hundreds of columns? As the number of dimensions grows, many algorithms become computationally expensive, and some become infeasible. Dimensionality reduction techniques reduce the number of dimensions while preserving the structure of the data as much as possible. A number of techniques are used for dimensionality reduction.

✓ Matrix Decomposition
✶Matrix decomposition is a way of expressing a matrix as a product of other matrices. Say that A is a product of two other matrices, B and C. The matrix B is supposed to contain vectors that explain the directions of variation in the data. The matrix C is supposed to contain the magnitude of this variation. Thus, our original matrix A is now expressed as the product of B and C, with each column of A being a linear combination of the columns of B.
✶Some methods, such as principal component analysis, insist that the basis vectors be orthogonal to each other; others, such as dictionary learning, do not impose this requirement.

✓ Principal component analysis (PCA)
✶PCA is an unsupervised method. In multivariate problems, PCA is used to reduce the dimension of the data with minimal information loss, that is, while retaining as much of the variation in the data as possible. Here, variation refers to the directions along which the data is most widely dispersed.
✶Selection criteria for the number of components:
(i) The eigenvalue criterion – When the variables are standardized, an eigenvalue of 1 explains about one variable's worth of variability. We can therefore include only those components whose eigenvalue is greater than or equal to one. You can adjust the threshold to suit your dataset: in a very high-dimensional dataset, including components that explain only one variable's worth of variability may not be very useful.
(ii) The proportion of variance explained criterion – The proportion of variance explained by a component is its eigenvalue divided by the sum of all eigenvalues. We keep enough components to reach a chosen cumulative proportion.
PCA works when the input dataset has correlated columns. If the input variables are uncorrelated, PCA cannot help us.
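
The two selection criteria can be sketched with NumPy by taking the eigenvalues of the correlation matrix. This is only an illustration under stated assumptions: the synthetic data (three columns driven by one shared factor plus two independent columns), the 0.90 variance threshold, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 5 columns.
# Columns 0-2 are correlated through a shared latent factor; columns 3-4 are noise.
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 3)),
    rng.normal(size=(200, 2)),
])

# Standardize so that each column contributes one unit of variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the correlation matrix, sorted from largest to smallest.
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# (i) Eigenvalue criterion: keep components with eigenvalue >= 1.
k_eigen = int(np.sum(eigvals >= 1.0))

# (ii) Proportion of variance explained: eigenvalue / sum of eigenvalues.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
# Assumed threshold: keep enough components to explain 90% of the variance.
k_variance = int(np.searchsorted(cumulative, 0.90) + 1)

# Project the standardized data onto the retained components.
scores = Z @ eigvecs[:, :k_eigen]
```

Because the first three columns are strongly correlated, a single component captures most of their shared variation; with fully uncorrelated columns, every eigenvalue would be close to 1 and no component would stand out.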

✓ Kernel PCA
✶PCA is limited to data in which the variation falls along straight lines. In other words, it captures only linear structure in the data.
✶Kernel PCA is used to reduce the dimension of datasets whose variation does not fall along straight lines. In kernel PCA, a kernel function is applied to all pairs of data points. This implicitly transforms the input data into kernel space, and then a normal PCA is performed in the kernel space.
✶A kernel is a function that computes the dot product, that is, the similarity, between the two vectors passed to it as input. Commonly used kernel functions include linear, polynomial, sigmoid, and cosine.
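
The kernel PCA steps described above (apply a kernel to the data points, then run PCA in kernel space) can be sketched in NumPy as follows. This is a simplified illustration, not a reference implementation: the polynomial kernel parameters (`degree=2`, `coef0=1.0`), the helper names, and the concentric-circle data are all assumptions.

```python
import numpy as np

def polynomial_kernel(X, Y, degree=2, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree for every pair of rows."""
    return (X @ Y.T + coef0) ** degree

def kernel_pca(X, n_components, kernel):
    """Project the training points onto the top components in kernel space."""
    n = X.shape[0]
    K = kernel(X, X)  # n x n matrix of pairwise similarities

    # Center the kernel matrix, which centers the data in kernel space.
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Ordinary PCA step: eigen-decomposition, largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Coordinates of the training points along each retained component.
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))

# Hypothetical data: two concentric circles, a structure a straight line cannot capture.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=100)
radii = np.repeat([1.0, 3.0], 50)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

projected = kernel_pca(X, n_components=2, kernel=polynomial_kernel)
```

Swapping in a different kernel only requires passing another function with the same two-argument call shape; which kernel suits a dataset is usually decided by experiment.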

✓ Extracting features using singular value decomposition (SVD)
✶SVD is another matrix decomposition technique that can be used to tackle the curse of dimensionality.
✶It can be used to find the best approximation of the original data using fewer dimensions.
✶Unlike PCA, SVD does not need a covariance or correlation matrix; it works on the original data matrix. SVD factors an m x n matrix A into a product of three matrices: A = U * S * V.T. Here, U is an m x k matrix, V is an n x k matrix, and S is a k x k diagonal matrix, where k is at most min(m, n). The columns of U are called left singular vectors and the columns of V are called right singular vectors. The values on the diagonal of the S matrix are called singular values.
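
A short sketch of the factorization A = U * S * V.T using `numpy.linalg.svd`, followed by a rank-k approximation. The random 6 x 4 matrix and the choice k = 2 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))  # hypothetical m x n data matrix (m = 6, n = 4)

# Thin SVD: U is m x r, s holds the r singular values, Vt is r x n, with r = min(m, n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and their singular vectors.
k = 2
U_k = U[:, :k]         # m x k, left singular vectors
S_k = np.diag(s[:k])   # k x k, singular values on the diagonal
V_k = Vt[:k, :].T      # n x k, right singular vectors

# Rank-k approximation of A and the reduced m x k representation of the rows.
A_k = U_k @ S_k @ V_k.T
A_reduced = U_k @ S_k
```

The Frobenius-norm error of the rank-k approximation equals the square root of the sum of the squared discarded singular values, which is why dropping small singular values loses little information.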

✓ Decomposing the feature matrices using nonnegative matrix factorization
✶Non-negative matrix factorization (NMF) is used extensively in recommendation systems based on collaborative filtering.
✶Let’s say that our input matrix A is of dimension m x n and has no negative entries. NMF factorizes the input matrix into two non-negative matrices, A_dash and H, such that A ≈ A_dash*H.
✶Let’s say that we want to reduce the dimension of the A matrix to d, that is, we want the original m x n matrix to be represented by an m x d matrix, where d << n.
✶The A_dash matrix is of size m x d and the H matrix is of size d x n. NMF solves this as an optimization problem, that is, minimizing the function |A - A_dash*H|^2 subject to all entries of A_dash and H being non-negative.
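
One common way to carry out this minimization is the multiplicative update rule, sketched below in NumPy. It is only one of several possible solvers and is not necessarily the one a given library uses; the function name `nmf`, the iteration count, the small constant `eps`, and the ratings-like matrix are assumptions. Here A_dash has shape m x d and H has shape d x n, so that their product is m x n.

```python
import numpy as np

def nmf(A, d, n_iter=200, eps=1e-10, seed=0):
    """Factor a non-negative m x n matrix A into A_dash (m x d) and H (d x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random non-negative starting point.
    A_dash = rng.random((m, d)) + eps
    H = rng.random((d, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        # and do not increase |A - A_dash*H|^2.
        H *= (A_dash.T @ A) / (A_dash.T @ A_dash @ H + eps)
        A_dash *= (A @ H.T) / (A_dash @ H @ H.T + eps)
    return A_dash, H

# Hypothetical user-by-item ratings matrix (5 users, 4 items).
A = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)

A_dash, H = nmf(A, d=2)
error = np.linalg.norm(A - A_dash @ H) ** 2  # value of the objective function
```

Each row of A_dash is the reduced d-dimensional representation of the corresponding row of A, and the rows of H describe how those d components map back to the original n columns.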