# Learning Data Science – part 1

Data matrix
Data can often be represented or abstracted as an n×d data matrix, with n rows and d columns, where rows correspond to entities in the dataset and columns represent attributes, features, or properties of interest.

The n×d data matrix is given as D = [xij], where xij is the value of the jth attribute for the ith entity.
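As a concrete sketch, a small data matrix can be built directly with numpy; the entities, attributes, and values below are invented purely for illustration.

```python
import numpy as np

# A toy 4x2 data matrix: 4 entities (rows) and 2 numeric attributes
# (columns), e.g. Age and Height -- the values are made up.
D = np.array([
    [25.0, 170.0],
    [32.0, 165.0],
    [47.0, 180.0],
    [51.0, 175.0],
])

n, d = D.shape  # n rows (entities), d columns (attributes)
print(n, d)     # 4 2
```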

Numeric Attributes – A numeric attribute is one that has a real-valued or integer-valued domain. For example, Age.

Categorical Attributes – A categorical attribute is one that has a set-valued domain composed of a set of symbols. For example, Sex is a categorical attribute.

Orthogonality – Two vectors a and b are said to be orthogonal if the angle between them is 90◦, which implies that cos θ = 0; equivalently, the dot product of a and b is 0.

Orthogonal Projection – In data mining, we may need to project a point or vector onto another vector to obtain a new point after a change of the basis vectors. Let a and b be two m-dimensional vectors. The orthogonal decomposition of b in the direction of a, illustrated in the figure below, is b = p + r, where p = ((aTb)/(aTa)) a is the component parallel to a and r = b − p is the component perpendicular to a.

The vector p is called the orthogonal projection or simply projection of b on the vector a.
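The projection can be computed in a couple of lines; the two vectors below are arbitrary examples chosen for the sketch.

```python
import numpy as np

a = np.array([3.0, 1.0])
b = np.array([2.0, 4.0])

# p = ((a.b) / (a.a)) a is the orthogonal projection of b on a;
# r = b - p is the component of b perpendicular to a.
p = (a @ b) / (a @ a) * a
r = b - p

print(p)                   # the projection of b on a
print(abs(a @ r) < 1e-12)  # True: r is orthogonal to a
```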

Centered Data Matrix
The centered data matrix is obtained by subtracting the mean from all the points: Z = D − 1·µT, where 1 is the n×1 vector of ones and µ is the multivariate mean vector.
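A minimal numpy sketch of centering, using a made-up 3×2 data matrix:

```python
import numpy as np

D = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 4.0]])

mu = D.mean(axis=0)  # multivariate mean (mean of each column)
Z = D - mu           # centered data matrix

print(Z.mean(axis=0))  # [0. 0.]: every attribute now has zero mean
```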

Linear Independence
We say that the vectors v1, . . . , vk are linearly dependent if at least one vector can be written as a linear combination of the others, that is,

vj = c1v1 + · · · + cj−1vj−1 + cj+1vj+1 + · · · + ckvk for some j,

where c1, c2, . . . , ck are scalars

A set of vectors is linearly independent if none of them can be written as a linear combination of the other vectors in the set.

Dimension and Rank
The maximum number of linearly independent vectors in a matrix is equal to the number of non-zero rows in its row echelon matrix. Therefore, to find the rank of a matrix, we simply transform the matrix to its row echelon form and count the number of non-zero rows.

For the data matrix D ∈ Rn×d, we have rank(D) ≤ min(n,d), which follows from the fact that the column space can have dimension at most d, and the row space can have dimension at most n. If rank(D) < d, then the data points reside in a lower dimensional subspace of Rd, and in this case rank(D) gives an indication about the intrinsic dimensionality of the data.
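The rank computation can be sketched with numpy; the data below is constructed so that the third attribute is an exact linear combination of the first two, which is an artificial best case.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([2.0, 1.0, 0.0, 1.0])

# The third attribute is a linear combination of the first two,
# so the n points lie in a 2-D subspace of R^3.
D = np.column_stack([x1, x2, x1 + x2])

r = np.linalg.matrix_rank(D)
print(r, min(D.shape))  # 2 3: rank(D) < d
```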

In fact, with dimensionality reduction methods it is often possible to approximate D ∈ Rn×d with a derived data matrix D′ ∈ Rn×k, which has much lower dimensionality, that is, k ≪ d. In this case k may reflect the “true” intrinsic dimensionality of the data.

Statistic
We can estimate a parameter of the population by defining an appropriate sample statistic, which is defined as a function of the sample.

A random sample of size m drawn from a (multivariate) random variable X is defined as a set of m independent and identically distributed random vectors S1, S2, . . . , Sm, each distributed as X.

A statistic θ̂ is a function of the sample, θ̂ = θ̂(S1, S2, . . . , Sm).

The statistic θ̂ is an estimate of the corresponding population parameter θ. If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.

Univariate analysis
Univariate analysis focuses on a single attribute at a time. The data matrix in this case reduces to an n×1 matrix, that is, a single column of values x1, x2, . . . , xn for the attribute X.

X is assumed to be a random variable.

Mean – The mean, also called the expected value, of a random variable X is the arithmetic average of the values of X. The mean of a discrete variable is defined as

µ = E[X] = Σx x·f(x), where f(x) is the probability mass function of X.

The expected value of a continuous random variable X is defined as

µ = E[X] = ∫ x·f(x) dx, where f(x) is the probability density function of X.

Sample Mean – The sample mean is a statistic, defined as the average value of the xi's,

µ̂ = (1/n) Σi xi

A statistic is robust if it is not affected by extreme values (outliers) in the data.

Median – The median of a random variable X is defined as the value m for which P(X ≤ m) ≥ 1/2 and P(X ≥ m) ≥ 1/2.

The median is robust, as it is not affected very much by extreme values.
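A quick sketch of this robustness, using an invented sample with one extreme value appended:

```python
data = [10, 12, 11, 13, 12]
with_outlier = data + [1000]  # one extreme value added

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

# The mean shifts dramatically; the median barely moves.
print(mean(data), mean(with_outlier))      # 11.6 vs ~176.3
print(median(data), median(with_outlier))  # 12 vs 12.0
```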

Measures of Dispersion
The measures of dispersion give an indication about the spread or variation in the values of a random variable.

Range
The range of a random variable X is the difference between the maximum and minimum values of X, defined as r = max(X) − min(X).

Interquartile Range
Quartiles divide the data into four equal parts, and correspond to the quantile values 0.25, 0.5, and 0.75. The first quartile is the value q1 = F−1(0.25). The second quartile is the same as the median, q2 = F−1(0.5). The third quartile is q3 = F−1(0.75).

Interquartile range (IQR) is defined as IQR = q3 − q1.
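A short numpy sketch of the empirical quartiles and IQR, on a made-up sample (numpy's default percentile interpolation stands in for F−1):

```python
import numpy as np

x = np.array([1.0, 3.0, 4.0, 7.0, 8.0, 10.0, 12.0, 15.0])

# Empirical quantiles F^-1(0.25) and F^-1(0.75)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(q1, q3, iqr)  # 3.75 10.5 6.75
```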

Variance and Standard Deviation
The variance of a random variable X provides a measure of how much the values of X deviate from the mean or expected value of X. Variance is defined as

σ2 = E[(X − µ)2]

The standard deviation, σ, is defined as the square root of the variance, σ2.

Sample variance is defined as

σ̂2 = (1/n) Σi (xi − µ̂)2

The standard score, or z-score, of a sample value xi is the number of standard deviations the value is away from the mean:

zi = (xi − µ̂) / σ̂
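The sample variance, standard deviation, and z-scores above can be sketched as follows; the eight values are invented so the numbers come out exactly.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()
var = ((x - mu) ** 2).mean()  # sample variance, dividing by n
sigma = np.sqrt(var)

z = (x - mu) / sigma          # z-scores: distance from the mean in std units
print(mu, sigma)              # 5.0 2.0
print(z.mean(), z.std())      # ~0.0 1.0
```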

Multivariate analysis
With d numeric attributes, the full data matrix is the n×d matrix D whose columns are the attributes X1, X2, . . . , Xd.

Mean
The multivariate mean vector is obtained by taking the mean of each attribute, µ = (µ1, µ2, . . . , µd)T.

Covariance Matrix
The multivariate covariance information is captured by the d×d (square) symmetric covariance matrix Σ that gives the covariance for each pair of attributes, with entry (i, j) equal to σij.

The diagonal element σi2 specifies the variance of attribute Xi, whereas the off-diagonal elements σij = σji represent the covariance between the attribute pairs Xi and Xj.
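A sketch of the sample covariance matrix computed from the centered data (Σ = ZTZ/n, dividing by n); the 3×2 matrix is made up.

```python
import numpy as np

D = np.array([[2.0, 8.0],
              [4.0, 10.0],
              [6.0, 12.0]])

Z = D - D.mean(axis=0)
Sigma = (Z.T @ Z) / D.shape[0]  # d x d sample covariance matrix

print(np.allclose(Sigma, Sigma.T))  # True: the covariance matrix is symmetric
print(np.diag(Sigma))               # attribute variances on the diagonal
```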

Data Normalization
When analyzing two or more attributes it is often necessary to normalize the values of the attributes, especially in those cases where the values are vastly different in scale.

In range normalization, each value is scaled as follows,

x′i = (xi − min(X)) / (max(X) − min(X))

After the transformation the new attribute takes on values in the range [0, 1].

Standard Score Normalization
In standard score normalization, also called z-normalization, each value is replaced by its z-score, x′i = (xi − µ̂) / σ̂.
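Both normalizations can be sketched in a few lines; the attribute values below are arbitrary.

```python
import numpy as np

x = np.array([12.0, 47.0, 25.0, 33.0, 51.0])

# Range normalization maps the values into [0, 1]
x_range = (x - x.min()) / (x.max() - x.min())

# z-normalization gives zero mean and unit standard deviation
x_z = (x - x.mean()) / x.std()

print(x_range.min(), x_range.max())  # 0.0 1.0
```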

Univariate Normal Distribution
If a random variable X has a normal distribution, with mean µ and variance σ2, the probability density function of X is given as

f(x) = (1 / √(2πσ2)) exp(−(x − µ)2 / (2σ2))

Probability Mass
Given an interval [a, b], the probability mass of the normal distribution within that interval is given as P(a ≤ X ≤ b) = ∫ab f(x) dx.

The probability mass concentrated within k standard deviations of the mean is P(µ − kσ ≤ X ≤ µ + kσ).
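Using the standard identity P(|X − µ| ≤ kσ) = erf(k/√2) for a normal distribution, the familiar 68–95–99.7 numbers can be recovered directly:

```python
import math

def mass_within_k_sigma(k):
    # For a normal distribution, P(|X - mu| <= k*sigma) = erf(k / sqrt(2))
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(mass_within_k_sigma(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```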

Normal distribution with different variances

Multivariate Normal Distribution
Given the d-dimensional vector random variable X = (X1, X2, . . . , Xd), we say that X has a multivariate normal distribution, with mean µ and covariance matrix Σ, if its joint probability density function is given as

f(x) = (1 / ((2π)d/2 |Σ|1/2)) exp(−(x − µ)TΣ−1(x − µ)/2)

An example of bivariate normal density and contours is shown as follows,

# Deep learning part 2 – Recurrent neural networks (RNN)

The details of feedforward networks were covered in the previous post; in this post we go through recurrent networks.

Recurrent networks are used to learn patterns in sequences of data, such as text, handwriting, speech, and time series. They can also be used for vision applications, where images are decomposed into a series of patches and treated as a sequence.

Recurrent networks have two sources of information: they take the current input together with the hidden state computed one step back in time. Because the output depends on current and previous inputs, recurrent networks are often said to have memory. Recurrent networks are differentiated from feedforward networks by feedback loops; a recurrent net can be seen as a (very deep) feedforward net with shared weights.

A simple RNN architecture is shown in below diagram.

✓ x0, x1, . . . , xT are the inputs, and h0, h1, . . . , hT are the hidden states of the recurrent network.

✓ The hidden states are recursively estimated as below,

ht = f(W xt + U ht−1), with outputs yt = g(V ht),

where W is the input-to-hidden weights, U is the hidden-to-hidden weights, V is the hidden-to-label weights, and f, g are activation functions.

✓ Weights are estimated by minimizing a loss function between the network outputs and the target labels.

✓ The backpropagation through time (BPTT) algorithm is used to estimate the weights.
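The forward computation above can be sketched with numpy; the dimensions, tanh activation, and random weight values are illustrative assumptions, not taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_h, d_out, T = 3, 4, 2, 5              # illustrative sizes
W = rng.normal(scale=0.1, size=(d_h, d_in))   # input-to-hidden weights
U = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden-to-hidden weights (shared over time)
V = rng.normal(scale=0.1, size=(d_out, d_h))  # hidden-to-output weights

xs = rng.normal(size=(T, d_in))               # an input sequence x_0 ... x_{T-1}
h = np.zeros(d_h)                             # initial hidden state

outputs = []
for x in xs:
    h = np.tanh(W @ x + U @ h)                # h_t depends on x_t and h_{t-1}
    outputs.append(V @ h)

print(len(outputs), outputs[0].shape)         # 5 (2,)
```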

Backpropagation through time (BPTT)
✓ First, the RNN is unfolded in time.
✓ This yields a deep neural network with shared weights W and U.
✓ The unfolded RNN is trained using normal backpropagation, which was explained in the previous post.
✓ In practice, the number of unfolding steps is limited to 5–10.
✓ It is computationally more efficient to propagate gradients after a few training examples (batch training).

✶ As we propagate the gradients back in time, their magnitude usually decreases quickly; this is called the vanishing gradient problem.

✶ Sometimes the gradients start to increase exponentially during backpropagation through the recurrent weights (the exploding gradient problem). This happens rarely, but the huge gradients lead to large weight changes that destroy what has been learned so far, which is one reason why RNNs can be unstable. Clipping or normalizing the gradient values avoids such huge weight changes.

✶ In practice, learning long term dependencies in data is difficult for simple RNN architecture. Special RNN architectures such as long short-term memory (LSTM) address this problem.

# Deep learning basics – part 1

A typical neural network structure is shown in the figure below.

Neural networks are typically organized in layers. Layers are made up of a number of interconnected nodes, each of which contains an activation function. Each node in a layer is connected in the forward direction to every unit in the next layer. A network usually has an input layer, one or more hidden layers, and an output layer.

Patterns are presented to the network via the input layer. It communicates to one or more hidden layers where the actual processing is done via a system of weighted connections. The hidden layers then link to an output layer.

✓ Network representation
The equation of a single-layer neural network in matrix form is as follows,
y = Wx
It can be also written in terms of individual components,
yk = Σi wki xi

where
Input vector x = (x0,x1, . . . , xd)T
Output vector y = (y1, . . . , yk)T
Weight matrix W:wki is the weight from input xi to output yk

✓ Weight
A node usually receives many simultaneous inputs, and each input has its own relative weight; some inputs are thereby made more important than others. Weights are a measure of an input’s connection strength. These strengths can be modified in response to various training sets and according to a network’s specific topology or its learning rules.

✓ Summation function
The inputs and weighting coefficients can be combined in many different ways before passing on to the transfer function. The summation function can be sum, max, min, average, OR, or AND. The specific function for combining neural inputs is determined by the chosen network architecture.

✓ Transfer function
The result of the summation function is transformed into a working output through the transfer function. The transfer function can be, for example, hyperbolic tangent, linear, sigmoid, or sine.

✓ Scaling and limiting
After the transfer function, the result can pass through scaling and limiting. Scaling simply multiplies the transfer value by a scale factor and then adds an offset. Limiting ensures that the scaled result does not exceed an upper or lower bound.

✓ Output function
Normally, the output is directly equivalent to the transfer function’s result. Some network topologies modify the transfer result to incorporate competition among neighboring processing elements.

✓ Error function/ Backpropagation
The values from the output layer are compared with the expected values and an error is computed for each output unit. The weights connected to the output units are adjusted to reduce those errors. The error estimates of the output units are then used to derive error estimates for the units in the hidden layers. The weights are adjusted to reduce the errors. Finally, the errors are propagated back to the connections stemming from the input units.

✓ Learning function
The purpose of the learning function is to modify the weights on the inputs of each processing element according to some neural-based algorithm.

✓ Sigmoid transfer/ activation function
The transfer function for neural networks must be differentiable, as the derivative of the transfer function is required for computing the local gradient. The sigmoid is one of the most common transfer functions used in the construction of artificial neural networks. It is represented by the following equation,

σ(a) = 1 / (1 + e−a)
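The sigmoid and its derivative can be sketched directly; the convenient identity σ′(a) = σ(a)(1 − σ(a)) is what makes the gradient cheap to compute.

```python
import math

def sigmoid(a):
    # sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_deriv(a):
    # Identity used in backpropagation: sigma'(a) = sigma(a) * (1 - sigma(a))
    s = sigmoid(a)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_deriv(0.0))  # 0.25
```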

✓ There are different types of artificial neural networks,
(1) Single layer feed forward network
(2) Multilayer feed forward network
(3) Recurrent network
(4) ….. etc

Neural networks can be trained in a supervised or unsupervised manner.

✓ Supervised training – In supervised training, both the inputs and the outputs are provided and the network processes the inputs and compares resulting outputs against the expected outputs. Errors are then propagated back through the system, causing the system to adjust the weights.

✓ Unsupervised training – In this type, the network is provided with inputs but not with desired outputs.

✓ Learning rates
The learning rate depends upon several controllable factors. A slower rate means more time is spent producing an adequately trained system; with a faster rate, the network may not be able to make the fine discriminations possible with a system that learns more slowly. The learning rate is positive, typically between 0 and 1.

✓ Learning laws
✶ Hebb’s rule – If a neuron receives an input from another neuron and if both are highly active (same sign), the weight between the two neurons should be strengthened.
✶ Hopfield law – If the desired output and the input are both active or both inactive, increment the connection weight by the learning rate; otherwise decrement the weight by the learning rate.
✶ The delta rule – This rule is based on the simple idea of continuously modifying the strengths of the input connections to reduce the difference between the desired output value and the actual output of a processing element.
✶ The Gradient Descent rule – This is similar to delta Rule. The derivative of the transfer function is still used to modify the delta error before it is applied to the connection weights. However, an additional proportional constant tied to the learning rate is appended to the final modifying factor acting upon the weight.
✶ Kohonen’s law – The processing elements compete for the opportunity to learn or update their weights. The element with largest output is declared the winner and has the capability of inhibiting its competitors as well as exciting its neighbors. Only the winner is permitted an output and only the winner plus its neighbors are allowed to adjust their connection weights.

✓ Backpropagation for feed forward networks
The backpropagation algorithm is the most commonly used training method for feedforward networks. The main objective in neural network training is to find an optimal set of weight parameters, w, which are learned through the training process.

✶ Consider a multi-layer perceptron with k hidden layers. Layer 0 is the input layer and layer k+1 is the output layer.
✶ The weight between the jth unit in layer m and the ith unit in layer m+1 is denoted by wij^m.
✶ The input to the network for training sample n is u(n) = (x1^0(n), . . . , xK^0(n)).
✶ The desired output is d(n) = (d1(n), . . . , dL(n)).
✶ The activation of a non-input unit is xi^(m+1)(n) = f(Σj wij^m xj^m(n)).
✶ The network response obtained at the output layer is y(n) = (x1^(k+1)(n), . . . , xL^(k+1)(n)).

The difference between the desired output and the neural network output is known as the error, and it is quantified as follows,

E = (1/2) Σn ||d(n) − y(n)||2

The weights, w, are estimated by minimizing above objective function. This is done by incrementally changing the weights along the direction of the error gradient with respect to weights.

The new weight is

wnew = w − γ ∂E/∂w

where γ is the learning rate. Weights are initialized with random numbers. In batch learning mode, new weights are computed after presenting all training samples. One such pass through all samples is called an epoch.

The procedure for one epoch of batch processing is given below.
✶ For each sample n, compute activations of internal and output units (forward pass).

✶ The error propagation term δi^m(n) is estimated backward through m = k+1, k, . . . , 1 for each unit. The error propagation term for the output layer is,

δi^(k+1)(n) = (xi^(k+1)(n) − di(n)) f′(ui^(k+1)(n))

The error propagation term of a hidden layer is estimated using the error propagation terms of the layer above it, as below,

δi^m(n) = f′(ui^m(n)) Σj wji^m δj^(m+1)(n)

where the internal state (pre-activation) of unit xi^m is ui^m(n) = Σj wij^(m−1) xj^(m−1)(n).

✶ Update the weight parameters as below,

wij^m ← wij^m − γ δi^(m+1)(n) xj^m(n)

After every such epoch, the error is computed, and weight updating stops when the error falls below a predetermined threshold.
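The whole procedure can be sketched for a tiny 2-4-1 network on the XOR problem; the architecture, seed, learning rate, and bias handling are all assumptions made for the example, not details from the post.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Tiny 2-4-1 network on XOR; a constant 1 is appended to the inputs
# and to the hidden activations to act as a bias unit.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
d = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(3, 4))  # input (+bias) to hidden weights
W2 = rng.normal(size=(5, 1))  # hidden (+bias) to output weights
gamma = 0.5                   # learning rate

def forward():
    h = sigmoid(X @ W1)
    hb = np.hstack([h, np.ones((4, 1))])
    return h, hb, sigmoid(hb @ W2)

err_before = 0.5 * np.sum((d - forward()[2]) ** 2)

for epoch in range(5000):
    h, hb, y = forward()                         # forward pass
    delta2 = (y - d) * y * (1 - y)               # output-layer error term
    delta1 = (delta2 @ W2[:4].T) * h * (1 - h)   # hidden-layer error term
    W2 -= gamma * hb.T @ delta2                  # gradient-descent updates
    W1 -= gamma * X.T @ delta1

err_after = 0.5 * np.sum((d - forward()[2]) ** 2)
print(err_before, err_after)  # the error decreases over training
```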

# Factor analysis modelling

Factor analysis is a statistical method used to describe variability among correlated observed variables in terms of a potentially lower number of unobserved variables.

The generative model is given by
y = μ + Λx +ε

y is P ×1 dimension observed variable
μ is P ×1 dimension mean vector
Λ is P × R dimension factor loading matrix
x is R×1 dimension unobserved variable (or latent variable)
ε is P ×1 dimension error term

We assume that
Ε(x) = Ε(ε) = 0
Ε(xxT) = Ι, Ε(εεT) = Ψ (diagonal)
p(x) = N(x | 0, I)

Ε(y) = μ
Σ = Ε((y − μ)(y − μ)T) = ΛΛT + Ψ
p(y | θ) = N(y | μ, ΛΛT + Ψ)

Mean μ is estimated using the observed variable.

The model parameters Λ,Ψ are estimated using expectation maximization (EM) algorithm.

Initially, model parameters Λ,Ψ are selected with random values and iteratively updated with EM algorithm.

In the E step, the posterior p(xn | yn, θt) is computed as below,
qt+1 = p(xn | yn, θt) = N(xn | mn, Vn)
Vn = (I + ΛTΨ−1Λ)−1
mn = VnΛTΨ−1(yn − μ)

In the M step, the model parameters Λt+1, Ψt+1 are updated using the posterior estimates as below (with yn taken as centred, i.e. yn − μ),
Λt+1 = (Σn ynmnT)(Σn (Vn + mnmnT))−1
Ψt+1 = (1/N) diag(Σn ynynT − Λt+1 Σn mnynT)

In each iteration, the log-likelihood L(Λ, Ψ) is evaluated in order to check whether the model has converged. Up to an additive constant, it is given as
L(Λ, Ψ) = −(N/2) log|ΛΛT + Ψ| − (N/2) tr(S(ΛΛT + Ψ)−1)

where S = (1/N) Σn (yn − μ)(yn − μ)T is the sample covariance matrix.

The log-likelihood values are plotted against iterations; as the number of iterations increases the likelihood levels off, indicating that the model has converged, as shown in the following figure.
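The EM loop above can be sketched end-to-end on synthetic data; the one-factor generative model, sample size, noise level, and iteration count below are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a 1-factor model: y = mu + Lambda x + eps  (P=3, R=1)
N, P, R = 500, 3, 1
true_L = np.array([[2.0], [1.0], [0.5]])
Y = (true_L @ rng.normal(size=(R, N))).T + rng.normal(scale=0.3, size=(N, P))

mu = Y.mean(axis=0)
Yc = Y - mu                   # centred observations

L = rng.normal(size=(P, R))   # random initial loadings
Psi = np.eye(P)               # diagonal noise covariance

for it in range(50):
    # E step: posterior p(x_n | y_n) = N(m_n, V), V shared across samples
    Pinv = np.linalg.inv(Psi)
    V = np.linalg.inv(np.eye(R) + L.T @ Pinv @ L)
    M = Yc @ Pinv @ L @ V     # row n holds m_n^T
    # M step: update loadings and diagonal noise covariance
    Exx = N * V + M.T @ M     # sum_n E[x_n x_n^T]
    L = (Yc.T @ M) @ np.linalg.inv(Exx)
    Psi = np.diag(np.diag(Yc.T @ Yc - L @ (M.T @ Yc)) / N)

print(L.ravel(), np.diag(Psi))
```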

The main applications of factor analysis are to reduce the number of variables and to detect structure in the relationships between variables. Factor analysis is commonly used to model the variability in speaker and face recognition applications.

# Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) separates samples of distinct groups by transforming them to a space that maximises between-class separability while minimising within-class variability.

The between-class scatter matrix Sb is defined as

Sb = Σi Ni (μi − μ)(μi − μ)T

The within-class scatter matrix Sw is defined as

Sw = Σi Σj (xi,j − μi)(xi,j − μi)T

where xi,j is an n-dimensional data point j from class i, Ni is the number of training examples from class i, μi is the mean of class i, μ is the overall mean, and g is the total number of classes or groups (the sums over i run from 1 to g).

The main objective of LDA is to find a projection matrix Plda that maximises the ratio of the determinant of the projected Sb to the determinant of the projected Sw (Fisher’s criterion),

Plda = argmaxP |PTSbP| / |PTSwP|

Fisher’s criterion is maximised when the projection matrix Plda is composed of the leading eigenvectors of Sw−1Sb.

The example below illustrates how the LDA projection increases between-class separability while minimising within-class variability.
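The scatter matrices and the eigen-decomposition step can be sketched on synthetic two-class data; the class means, spread, and sample sizes are invented so the discriminative direction is known in advance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated Gaussian classes in 2-D (synthetic data)
X1 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
X2 = rng.normal(loc=[3.0, 0.0], scale=0.5, size=(100, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
m = (m1 + m2) / 2  # overall mean

# Between-class and within-class scatter matrices
Sb = 100 * np.outer(m1 - m, m1 - m) + 100 * np.outer(m2 - m, m2 - m)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher's criterion is maximised by the leading eigenvector of Sw^-1 Sb
evals, evecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
w = evecs[:, np.argmax(evals.real)].real
w = w / np.linalg.norm(w)

print(w)  # close to +/- [1, 0], the direction that separates the classes
```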

LDA seeks directions that are efficient for discriminating data whereas PCA seeks directions that are efficient for representing data. LDA can be applied in several applications such as speaker recognition, face recognition, bankruptcy prediction, marketing, biomedical studies.

# Support Vector Machine (SVM) Classification

The support vector machine (SVM) is a powerful discriminative classification technique. Using the labelled training vectors, SVM training finds a separating hyperplane (H) that maximizes the margin of separation between the two classes. Based on the training data, SVM training can be divided into two categories: (a) linearly separable training, (b) non-linearly separable training.

✓ An example of a SVM trained using linearly-separable data is given in following Figure.

✓ Let’s define the hyperplane H such that:
wTxi + b ≥ +1 when yi =+1
wTxi + b ≤ -1 when yi =-1

The distance between a point xi and H is yi(wTxi + b)/||w||.

✓ The distance between H1 and H2 is: 2/||w||.

The distance between H1 and H2 is called the margin. We want a classifier with as large a margin as possible. In order to maximize the margin, we need to minimize ||w||, subject to the condition that there are no data points between H1 and H2:
wTxi +b ≥ +1 when yi =+1
wTxi +b ≤ -1 when yi =-1

✓ Find w and b such that
||w||2/2 minimized and for all (xi,yi), yi(wTxi +b) ≥ 1

We are now optimizing a quadratic function subject to linear constraints. This is a quadratic programming problem, which is solved by introducing Lagrange multipliers αi; the solution has the form
w = ∑ αiyixi
b = yk − wTxk for any support vector xk

✓ Non-linearly separable training
In the case of performing SVM training on non-linearly separable data, the misclassification of training examples will cause the Lagrange multipliers to grow exceedingly large and prevent a feasible solution.

The idea is to gain linear separation by mapping the data to a higher-dimensional space. An SVM kernel is used to transform the training and testing data into a higher-dimensional space that provides better linear separation.

✓ A kernel function is defined as follows,
K(xi, xj) = Φ(xi)TΦ(xj)

where Φ(x) is a mapping function used to convert input vectors x to the desired space. Linear, polynomial, radial basis function (Gaussian), and sigmoid (the neural net activation function) kernels are commonly used as non-linear mapping functions.
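Two of these kernels can be sketched directly; the input vectors and the kernel hyperparameters (gamma, degree, offset c) are arbitrary choices for the example.

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=1.0):
    # K(xi, xj) = exp(-gamma * ||xi - xj||^2), the radial basis function kernel
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def poly_kernel(xi, xj, degree=2, c=1.0):
    # K(xi, xj) = (xi . xj + c)^degree, the polynomial kernel
    return (xi @ xj + c) ** degree

x = np.array([1.0, 2.0])
z = np.array([1.0, 2.0])
print(rbf_kernel(x, z))   # 1.0: identical points have maximal similarity
print(poly_kernel(x, z))  # 36.0: (1*1 + 2*2 + 1)^2
```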

An example of non-linearly separable data, and how a kernel function transforms it into separable data, is shown in the figure below,

SVM classification can be used in several applications, such as natural language processing, computer vision and bioinformatics.

# Logistic regression

Logistic regression is a model used to predict a categorical variable. It measures the relationship between the categorical dependent variable and the independent variables by estimating probabilities using a logistic function.

For example, logistic regression can be used if we want to predict whether an American voter will vote Democratic or Republican, based on age, income, sex, race, state of residence, votes in previous elections. It can also be used for predicting the probability of failure of a given process, system or product.

Apart from binary classification, it can also be used for ordinal or multinomial outcomes. Multinomial logistic regression deals with situations where the outcome can have three or more unordered possible types.

As the outcome of logistic regression is binary, Y needs to be transformed so that the regression process can be used. The logistic regression model can be written as follows,

logit(p(x)) = log(p(x)/(1 − p(x))) = β0 + x·β

Solving for p(x),

p(x) = 1 / (1 + e−(β0 + x·β))

We should predict Y = 1 when p(x) ≥ 0.5 and Y = 0 when p(x) < 0.5. The inverse of the logit, p(x), is shown in the following graph,

Our goal is to find β0 and β using the training data. Maximum likelihood estimation is used to estimate the parameters.

For each training data point, we have a vector of features, xi, and an observed class, yi. The probability of that class is either p(xi), if yi = 1, or 1 − p(xi), if yi = 0. The likelihood is,

L(β0, β) = Πi p(xi)^yi (1 − p(xi))^(1−yi)

The log-likelihood is,

ℓ(β0, β) = Σi [yi log p(xi) + (1 − yi) log(1 − p(xi))]

To find the maximum likelihood estimates, we differentiate the log-likelihood with respect to the parameters β, say βj, and set the derivatives equal to zero. These equations have no closed-form solution, but they can be solved numerically; Newton–Raphson and stochastic gradient descent methods can be used for this.
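One simple numerical approach, gradient ascent on the log-likelihood (whose gradient is XT(y − p)), can be sketched on synthetic data; the true coefficients, sample size, step size, and iteration count are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary data: class 1 becomes more likely as x grows
N = 200
x = rng.normal(size=N)
true_b0, true_b1 = -0.5, 2.0
p = 1 / (1 + np.exp(-(true_b0 + true_b1 * x)))
y = (rng.random(N) < p).astype(float)

X = np.column_stack([np.ones(N), x])  # column of 1s carries beta_0
beta = np.zeros(2)

# Gradient ascent on the log-likelihood: gradient = X^T (y - p)
for step in range(2000):
    p_hat = 1 / (1 + np.exp(-X @ beta))
    beta += 0.01 * X.T @ (y - p_hat)

print(beta)  # roughly [-0.5, 2.0], up to sampling noise
```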

Multinomial logistic regression (Logistic regression with more than two classes)
If Y can take on more than two values, say classes c, the predicted conditional probabilities are given by a softmax,

P(Y = c | x) = e^(β0(c) + x·β(c)) / Σc′ e^(β0(c′) + x·β(c′))

When there are only two classes (say, 0 and 1), the above equation reduces to the previous equation, with β0 = β0(1) − β0(0) and β = β(1) − β(0).