Learning Data Science – part 1

Data matrix
Data can often be represented or abstracted as an n×d data matrix, with n rows and d columns, where rows correspond to entities in the dataset, and columns represent attributes or features or properties of interest.

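The n×d data matrix is given as

$$
D = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}
$$

where the i-th row $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ is the i-th entity and the j-th column $X_j$ is the j-th attribute.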

Numeric Attributes – A numeric attribute is one that has a real-valued or integer-valued domain. For example, Age.

Categorical Attributes – A categorical attribute is one that has a set-valued domain composed of a set of symbols. For example, Sex is a categorical attribute.

Orthogonality – Two vectors a and b are said to be orthogonal if the angle θ between them is 90°, which implies that cos θ = 0. Equivalently, the dot product of a and b is 0.

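Orthogonal Projection – In data mining, we may need to project a point or vector onto another vector to obtain a new point after a change of the basis vectors. Let a, b be two m-dimensional vectors. An orthogonal decomposition of the vector b in the direction of another vector a, illustrated in the figure below, is given as

$$
\mathbf{b} = \mathbf{p} + \mathbf{r}
$$

where p is parallel to a, and r = b − p is perpendicular (orthogonal) to a. Since p is a scalar multiple of a and a is orthogonal to r, it follows that

$$
\mathbf{p} = \left( \frac{\mathbf{a}^T \mathbf{b}}{\mathbf{a}^T \mathbf{a}} \right) \mathbf{a}
$$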

The vector p is called the orthogonal projection or simply projection of b on the vector a.
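
The projection can be checked numerically. The following is a minimal sketch, assuming the standard formula p = (a·b / a·a) a; the helper names `dot` and `project` and the example vectors are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def project(b, a):
    """Return (p, r): the orthogonal projection of b onto a and the residual r = b - p."""
    scale = dot(a, b) / dot(a, a)          # scalar coefficient of a
    p = [scale * ai for ai in a]           # component of b parallel to a
    r = [bi - pi for bi, pi in zip(b, p)]  # component of b orthogonal to a
    return p, r


# Hypothetical example vectors.
a = [4.0, 0.0]
b = [3.0, 2.0]
p, r = project(b, a)
print(p, r, dot(p, r))
```

The residual r is orthogonal to a (and hence to p), so the final dot product is zero for this example.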

Centered Data Matrix
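The centered data matrix is obtained by subtracting the mean from all the points:

$$
Z = D - \mathbf{1} \cdot \boldsymbol{\mu}^T
$$

where $\boldsymbol{\mu}$ is the mean vector of the data and $\mathbf{1}$ is the n-dimensional vector of ones, so that the i-th row of Z is $\mathbf{x}_i - \boldsymbol{\mu}$.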
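
As a rough illustration of centering, the sketch below assumes the data matrix is stored as a list of rows; the function name `center` and the sample values are hypothetical.

```python
def center(data):
    """Subtract the per-attribute mean from every row of an n x d data matrix."""
    n = len(data)
    d = len(data[0])
    # Mean of each column (attribute).
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    # Each centered row is the original row minus the mean vector.
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    return mean, centered


# Hypothetical 3 x 2 data matrix.
D = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
mean, Z = center(D)
print(mean)
print(Z)
```

After centering, every column of Z has mean zero.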


Linear Independence
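We say that the vectors v1, . . . ,vk are linearly dependent if at least one vector can be written as a linear combination of the others. Equivalently, they are linearly dependent if

$$
c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \cdots + c_k \mathbf{v}_k = \mathbf{0}
$$

where c1, c2, . . . , ck are scalars, at least one of which is not zero.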

A set of vectors is linearly independent if none of them can be written as a linear combination of the other vectors in the set.

Dimension and Rank
The rank of a matrix is the maximum number of linearly independent rows (or, equivalently, columns) in the matrix. It is equal to the number of non-zero rows in the row echelon form of the matrix. Therefore, to find the rank of a matrix, we simply transform the matrix to its row echelon form and count the number of non-zero rows.

For the data matrix $D \in \mathbb{R}^{n \times d}$, we have rank(D) ≤ min(n, d), which follows from the fact that the column space can have dimension at most d, and the row space can have dimension at most n. If rank(D) < d, then the data points reside in a lower dimensional subspace of $\mathbb{R}^d$, and in this case rank(D) gives an indication about the intrinsic dimensionality of the data.

In fact, with dimensionality reduction methods it is often possible to approximate $D \in \mathbb{R}^{n \times d}$ with a derived data matrix $D' \in \mathbb{R}^{n \times k}$, which has much lower dimensionality, that is, k ≪ d. In this case k may reflect the “true” intrinsic dimensionality of the data.
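
The row echelon procedure for finding the rank can be sketched as follows. This is a minimal illustration using Gaussian elimination with exact fractions to avoid rounding issues; the function name `rank` and the example matrix are hypothetical.

```python
from fractions import Fraction


def rank(matrix):
    """Reduce the matrix to row echelon form and count the non-zero (pivot) rows."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    pivot_row = 0
    for col in range(n_cols):
        # Find a row at or below pivot_row with a non-zero entry in this column.
        pivot = next((r for r in range(pivot_row, n_rows) if rows[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[pivot_row], rows[pivot] = rows[pivot], rows[pivot_row]
        # Eliminate the entries below the pivot.
        for r in range(pivot_row + 1, n_rows):
            factor = rows[r][col] / rows[pivot_row][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[pivot_row])]
        pivot_row += 1
        if pivot_row == n_rows:
            break
    return pivot_row


# Hypothetical 4 x 3 data matrix: row 2 is twice row 1, and row 4 = row 1 + 2 * row 3.
D = [[1, 2, 3], [2, 4, 6], [1, 0, 1], [3, 2, 5]]
print(rank(D))
```

Here rank(D) is smaller than d = 3, so the four points lie in a two-dimensional subspace.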

We can estimate a parameter of the population by defining an appropriate sample statistic, which is defined as a function of the sample.

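The random sample of size m drawn from a (multivariate) random variable X is defined as a set of m independent and identically distributed random variables $S_1, S_2, \ldots, S_m$, each having the same probability distribution as X.

A statistic $\hat{\theta}$ is a function $\hat{\theta} : (S_1, S_2, \ldots, S_m) \to \mathbb{R}$.

The statistic $\hat{\theta}$ is an estimate of the corresponding population parameter $\theta$. If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.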

Univariate analysis
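Univariate analysis focuses on a single attribute at a time. The data matrix is given as the n×1 matrix

$$
D = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
$$

where X is the numeric attribute of interest and each $x_i$ is a real number.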

X is assumed to be a random variable.

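Mean – The mean, also called the expected value, of a random variable X is the arithmetic average of the values of X. The mean of a discrete random variable is defined as

$$
\mu = E[X] = \sum_{x} x \, f(x)
$$

where f(x) is the probability mass function of X.

The expected value of a continuous random variable X is defined as

$$
\mu = E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx
$$

where f(x) is the probability density function of X.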

Sample Mean – The sample mean is a statistic, $\hat{\mu} : \{x_1, x_2, \ldots, x_n\} \to \mathbb{R}$, which is defined as the average value of the $x_i$'s:

$$
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

A statistic is robust if it is not affected by extreme values (outliers) in the data.

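Median – The median of a random variable is defined as the value m such that

$$
P(X \le m) \ge \frac{1}{2} \quad \text{and} \quad P(X \ge m) \ge \frac{1}{2}
$$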

The median is robust, as it is not affected very much by extreme values.
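
The difference in robustness between the mean and the median is easy to see on a small sample. The sketch below uses Python's standard `statistics` module; the sample values and the outlier are made up for illustration.

```python
import statistics

# Hypothetical sample, then the same sample with one extreme value added.
values = [21, 22, 23, 24, 25]
with_outlier = values + [500]

mean_clean = statistics.mean(values)
median_clean = statistics.median(values)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print(mean_clean, median_clean)      # both 23
print(mean_outlier, median_outlier)  # 102.5 and 23.5
```

A single outlier moves the mean from 23 to 102.5, while the median barely changes.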

Measures of Dispersion
The measures of dispersion give an indication about the spread or variation in the values of a random variable.

The range of a random variable X is the difference between the maximum and minimum values of X, which is defined as

$$
r = \max\{X\} - \min\{X\}
$$

Interquartile Range
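Quartiles divide the data into four equal parts. They correspond to the quantile values of 0.25, 0.5, 0.75, and 1.0. Writing $F^{-1}$ for the inverse cumulative distribution function (quantile function) of X, the first quartile is the value $q_1 = F^{-1}(0.25)$. The second quartile is the same as the median value, $q_2 = F^{-1}(0.5)$. The third quartile is $q_3 = F^{-1}(0.75)$.

Interquartile range (IQR) is defined as

$$
IQR = q_3 - q_1 = F^{-1}(0.75) - F^{-1}(0.25)
$$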
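
For a finite sample, the range and IQR can be estimated from the sorted values. The sketch below assumes one particular convention for the sample quantile, namely the smallest sample value whose empirical CDF is at least q; other conventions interpolate between values and can give slightly different results. The function name `empirical_quantile` and the data are hypothetical.

```python
import math


def empirical_quantile(values, q):
    """Smallest sample value x whose empirical CDF F(x) is at least q (0 < q <= 1)."""
    ordered = sorted(values)
    n = len(ordered)
    k = max(1, math.ceil(q * n))  # number of points that must be <= x
    return ordered[k - 1]


# Hypothetical sample.
values = [7, 1, 5, 3, 8, 2, 6, 4]

sample_range = max(values) - min(values)
q1 = empirical_quantile(values, 0.25)
q2 = empirical_quantile(values, 0.50)
q3 = empirical_quantile(values, 0.75)
iqr = q3 - q1

print(sample_range, q1, q2, q3, iqr)
```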


Variance and Standard Deviation
The variance of a random variable X provides a measure of how much the values of X deviate from the mean or expected value of X. Variance is defined as

$$
\sigma^2 = var(X) = E[(X - \mu)^2]
$$

The standard deviation, $\sigma$, is defined as the positive square root of the variance, $\sigma^2$:

$$
\sigma = \sqrt{var(X)}
$$


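Sample variance is defined as

$$
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2
$$

where $\hat{\mu}$ is the sample mean. Some texts divide by n − 1 instead of n, which gives an unbiased estimator of the variance.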

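The standard score, or z-score, of a sample value $x_i$ is the number of standard deviations the value lies away from the mean:

$$
z_i = \frac{x_i - \hat{\mu}}{\hat{\sigma}}
$$

where $\hat{\mu}$ and $\hat{\sigma}$ are the sample mean and sample standard deviation.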
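
The sample variance, standard deviation, and z-scores can be computed directly from their definitions. This sketch assumes the variance divides by n (not n − 1); the function name `sample_stats` and the data are hypothetical.

```python
import math


def sample_stats(values):
    """Return (mean, variance, standard deviation), with the variance dividing by n."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n  # assumption: 1/n, not 1/(n-1)
    return mean, var, math.sqrt(var)


# Hypothetical sample.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, var, std = sample_stats(values)

# z-score: distance from the mean measured in standard deviations.
z = [(x - mean) / std for x in values]

print(mean, var, std)
print(z)
```

For this sample the mean is 5, the variance is 4, and the standard deviation is 2, so the value 9 has a z-score of 2.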


Multivariate analysis
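With d numeric attributes, the full n×d data matrix is defined as

$$
D = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1d} \\
x_{21} & x_{22} & \cdots & x_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nd}
\end{pmatrix}
$$

The multivariate mean vector is obtained by taking the mean of each attribute, which is defined as

$$
\boldsymbol{\mu} = E[\mathbf{X}] = \begin{pmatrix} E[X_1] \\ E[X_2] \\ \vdots \\ E[X_d] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_d \end{pmatrix}
$$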

Covariance Matrix
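The multivariate covariance information is captured by the d×d (square) symmetric covariance matrix that gives the covariance for each pair of attributes:

$$
\Sigma = E[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T] = \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2
\end{pmatrix}
$$

The diagonal element $\sigma_i^2$ specifies the attribute variance for $X_i$, whereas the off-diagonal elements $\sigma_{ij} = \sigma_{ji}$ represent the covariance between attribute pairs $X_i$ and $X_j$.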
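
A sample estimate of the covariance matrix can be sketched as below. It assumes the data matrix is a list of rows and that covariances divide by n, matching the 1/n convention for the sample variance; the function name `covariance_matrix` and the data are hypothetical.

```python
def covariance_matrix(data):
    """Sample covariance matrix (d x d) of an n x d data matrix, dividing by n."""
    n = len(data)
    d = len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    # Entry (i, j) averages the product of deviations of attributes i and j.
    return [
        [sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in data) / n
         for j in range(d)]
        for i in range(d)
    ]


# Hypothetical 4 x 2 data matrix.
D = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [2.0, 4.0]]
cov = covariance_matrix(D)
print(cov)
```

The diagonal holds the two attribute variances, and the matrix is symmetric because the product of deviations does not depend on the order of the two attributes.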

Data Normalization
When analyzing two or more attributes it is often necessary to normalize the values of the attributes, especially in those cases where the values are vastly different in scale.

In range normalization, each value is scaled as follows:

$$
x_i' = \frac{x_i - \min_j\{x_j\}}{\max_j\{x_j\} - \min_j\{x_j\}}
$$

After transformation the new attribute takes on values in the range [0, 1].
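
Range normalization is a one-line transformation per attribute. The sketch below is illustrative only: the function name `range_normalize` and the attribute values are hypothetical, and mapping a constant attribute to all zeros is an assumption made here to avoid division by zero.

```python
def range_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no spread, so map it to zeros.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical attributes on very different scales.
ages = [20.0, 30.0, 40.0, 60.0]
incomes = [20000.0, 50000.0, 35000.0, 80000.0]

print(range_normalize(ages))
print(range_normalize(incomes))
```

After normalization both attributes lie in [0, 1], so neither dominates a distance computation simply because of its units.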

Standard Score Normalization
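In standard score normalization, also called z-normalization, each value is replaced by its z-score:

$$
x_i' = \frac{x_i - \hat{\mu}}{\hat{\sigma}}
$$

where $\hat{\mu}$ is the sample mean and $\hat{\sigma}^2$ is the sample variance of the attribute. After transformation, the new attribute has mean 0 and standard deviation 1.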

Univariate Normal Distribution
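If a random variable X has a normal distribution, with the parameters mean $\mu$ and variance $\sigma^2$, the probability density function of X is given as

$$
f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}
$$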

Probability Mass
Given an interval [a, b], the probability mass of the normal distribution within that interval is given as

$$
P(a \le x \le b) = \int_a^b f(x \mid \mu, \sigma^2) \, dx
$$

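The probability mass concentrated within k standard deviations from the mean is given as

$$
P(\mu - k\sigma \le x \le \mu + k\sigma) = \frac{1}{\sqrt{2\pi}} \int_{-k}^{k} e^{-z^2/2} \, dz = \mathrm{erf}\left( \frac{k}{\sqrt{2}} \right)
$$

where $z = (x - \mu)/\sigma$ is the standard score and erf is the Gauss error function.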
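
These probabilities can be evaluated with the error function from Python's standard `math` module. The sketch below is a minimal illustration; the helper names `normal_cdf`, `mass_in_interval`, and `mass_within_k_sigma` are hypothetical.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of the normal distribution N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mass_in_interval(a, b, mu=0.0, sigma=1.0):
    """Probability mass of N(mu, sigma^2) inside the interval [a, b]."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)


def mass_within_k_sigma(k):
    """Probability mass within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(k, round(mass_within_k_sigma(k), 4))
```

The result does not depend on the particular mean or variance: roughly 68.27% of the mass lies within one standard deviation, 95.45% within two, and 99.73% within three.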


Normal distribution with different variances


Multivariate Normal Distribution
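Given the d-dimensional vector random variable $\mathbf{X} = (X_1, X_2, \ldots, X_d)^T$, we say that X has a multivariate normal distribution, with the parameters mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$, if its joint multivariate probability density function is given as

$$
f(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} \, |\Sigma|^{1/2}} \exp\left\{ -\frac{(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}{2} \right\}
$$

where $|\Sigma|$ is the determinant of the covariance matrix.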
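
For d = 2 the density can be evaluated by hand, since a 2×2 matrix has a simple closed-form determinant and inverse. The sketch below is restricted to that bivariate case and assumes the standard multivariate normal density; the function name `bivariate_normal_pdf` and the parameter values are hypothetical.

```python
import math


def bivariate_normal_pdf(x, mu, cov):
    """Density of a 2-dimensional normal distribution at point x."""
    (a, b), (c, d) = cov
    det = a * d - b * c  # determinant of the covariance matrix
    inv = [[d / det, -b / det], [-c / det, a / det]]  # inverse of a 2 x 2 matrix
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T * inverse(cov) * (x - mu).
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    # For d = 2 the normalizing constant is (2 * pi) * sqrt(det).
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


# Hypothetical parameters: standard bivariate normal, evaluated at its mean.
identity = [[1.0, 0.0], [0.0, 1.0]]
peak = bivariate_normal_pdf([0.0, 0.0], [0.0, 0.0], identity)
print(peak)  # equals 1 / (2 * pi)
```

When the covariance matrix is diagonal, the attributes are uncorrelated and the joint density factors into the product of the two univariate normal densities.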


An example of a bivariate normal density and its contours is shown in the figure below.