Learning R (Probability distributions)

Binomial distribution
The binomial distribution is a discrete probability distribution. It describes the outcome of n independent trials in an experiment. Each trial is assumed to have only two outcomes, either success or failure. If the probability of a successful trial is p, then the probability of having x successful outcomes in an experiment of n independent trials is

$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, \ldots, n$$

The binomial distribution describes the behavior of a count variable X if the following conditions apply:
1. The number of observations n is fixed.
2. Each observation is independent.
3. Each observation represents one of two outcomes (“success” or “failure”).
4. The probability of “success” p is the same for each observation.

#R syntax for estimating the probability
dbinom(x, size, prob)
#Example 1:
#Compute the probability of getting five heads in seven tosses of a fair coin.
dbinom(x = 5, size = 7, prob = 0.5)

#Example 2:
#Compute the probability of getting at most four heads in seven tosses of a fair coin.
#Method 1 – Compute the individual probabilities and sum them
dbinom(x = 0, size = 7, prob = 0.5) +
  dbinom(x = 1, size = 7, prob = 0.5) +
  dbinom(x = 2, size = 7, prob = 0.5) +
  dbinom(x = 3, size = 7, prob = 0.5) +
  dbinom(x = 4, size = 7, prob = 0.5)

#Method 2 – Compute the cumulative probability using the built-in R function
pbinom(4, 7, 0.5)

#Example 3:
#Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.
pbinom(4, 12, 0.2)

◙ The expected value (or mean) of a binomial random variable is np
◙ The variance of a binomial distribution is np(1 − p)
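
The mean and variance formulas can be checked numerically from the probability mass function. The following is a small sketch; the values n = 7 and p = 0.5 are chosen only for illustration.

```r
# Illustrative values (assumptions, not taken from the text)
n <- 7
p <- 0.5
x <- 0:n

# Probability mass function evaluated at every possible count
pmf <- dbinom(x, size = n, prob = p)

# Mean and variance computed directly from their definitions
binom_mean <- sum(x * pmf)
binom_var  <- sum((x - binom_mean)^2 * pmf)

print(c(mean = binom_mean, np = n * p))
print(c(variance = binom_var, npq = n * p * (1 - p)))
```

Both pairs of numbers should agree up to rounding error.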

#Visualizing the binomial distribution using a histogram of simulated values
hist(rbinom(1000, size = 7, prob = 0.5), col = "grey", main = "Binomial", breaks = seq(-0.5, 7.5, by = 1))

Poisson distribution
The Poisson distribution is the probability distribution of independent event occurrences in an interval. If λ is the mean occurrence per interval, then the probability of having x occurrences within a given interval is:

$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$$

◙ The mean and variance of a Poisson random variable are both equal to λ.
#R syntax of estimating the probability of Poisson distribution
dpois(x, lambda)

#Example 1:
#According to the Poisson model, the probability of three arrivals at an automatic bank teller in the next minute, where the average number of arrivals per minute is 0.5, is
dpois(x = 3, lambda = 0.5)

#Cumulative probability is computed using
ppois(q, lambda)

#We can generate n Poisson random numbers using
rpois(n, lambda)

#Example 2:
#Suppose traffic accidents occur at an intersection with a mean rate of 3 per year. Simulate the annual number of accidents for a 12-year period, assuming a Poisson model.
rpois(12, 3)

#Example 3:
#If there are eleven cars crossing a bridge per minute on average, find the probability of having sixteen or more cars crossing the bridge in a particular minute.
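
One way to answer Example 3 is sketched below. "Sixteen or more" is the complement of "fifteen or fewer", so the probability is P(X ≥ 16) = 1 − P(X ≤ 15). The variable names are only illustrative.

```r
# P(X >= 16) for a Poisson variable with a mean of 11 cars per minute,
# taken from the upper tail of the distribution
p_upper <- ppois(15, lambda = 11, lower.tail = FALSE)

# Equivalent form: complement of the cumulative probability P(X <= 15)
p_complement <- 1 - ppois(15, lambda = 11)

print(p_upper)
print(p_complement)
```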

#Visualizing the Poisson distribution using a histogram plot

Exponential distribution
The exponential distribution describes the arrival time of a randomly recurring independent event sequence. If μ is the mean waiting time for the next event recurrence, its probability density function is

$$f(x) = \frac{1}{\mu} e^{-x/\mu}, \quad x \ge 0$$

where the mean μ is the reciprocal of the rate parameter (rate = 1/μ).
#R syntax for estimating the probability of exponential distribution
pexp(q, rate)

#Example 1:
#Suppose the service time at a bank teller can be modeled as an exponential random variable with a rate of 3 per minute. Then the probability of a customer being served in less than 1 minute is
pexp(1, rate = 3)

#The rexp() function can be used to generate n random exponential variates.
rexp(n, rate)

#Example 2:
#A bank has a single teller who is facing a queue of 8 customers. The time for each customer to be served is exponentially distributed with rate 2 per minute. We can simulate the service times (in minutes) for the 8 customers.
servicetimes <- rexp(8, rate = 2)

Normal distribution
A normal random variable X has a probability density function given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}$$

where µ is the expected value of X, and σ² denotes the variance of X.

◙ The standard normal random variable has mean µ = 0 and standard deviation σ = 1.

◙ The normal density function can be evaluated using dnorm()
◙ The distribution function can be evaluated using pnorm()
◙ Normal pseudorandom variables can be generated using rnorm()
#Example 1:
#Assume that the test scores of a college entrance exam fit a normal distribution. Furthermore, the mean test score is 75, and the standard deviation is 15. What is the percentage of students scoring 80 or more in the exam?
pnorm(80, mean=75, sd=15, lower.tail=FALSE)

#Visualizing the normal distribution using a histogram plot
x<-rnorm(1000, 5, 0.2)
hist(x,probability = TRUE)

Chi-squared Distribution
If X1, X2, …, Xm are m independent random variables having the standard normal distribution, then the quantity

$$V = X_1^2 + X_2^2 + \cdots + X_m^2$$

follows a Chi-Squared distribution with m degrees of freedom. Its mean is m, and its variance is 2m.
#Example 1:
#Find the 97th percentile of the Chi-Squared distribution with 6 degrees of freedom.
qchisq(.97, df=6)        # 6 degrees of freedom

Examining the distribution of a set of data
The distribution can be examined in a number of ways.
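# The examples below assume that eruptions comes from R's built-in faithful data set
eruptions <- faithful$eruptions

# 1. Generating a summary of the data
summary(eruptions)

# 2. Drawing the histogram
hist(eruptions, seq(1.6, 5.2, 0.2), prob=TRUE)
lines(density(eruptions, bw=0.1))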


Learning R (Programming Basics)

For loop in R

# Example 1
# print 1, 2, 3...10
for(i in 1:10) {
  print(i)
}

# Example 2
# print a, b, c, d
x <- c("a", "b", "c", "d")
for(i in 1:4) {
  print(x[i])
}

Nested for loop
A nested for loop is a for loop placed inside another for loop.

M <- matrix(1:9, ncol = 3)
total <- 0
for (i in seq(nrow(M))) {
  for (j in seq(ncol(M))) {
    # add each element of M to the running total
    total <- total + M[i, j]
    print(total)
  }
}

While loop
A while loop keeps running as long as its condition is true.

i <- 1
# When i >= 8, the loop terminates
while (i < 8) {
  print(i)
  i <- i + 1
}

If statement
This structure allows you to test a condition and act on it depending on whether
it’s true or false.

# if the random number is greater than 3, print 10; otherwise print 0.
x <- runif(1, 0, 10)
if(x > 3) {
  print(10)
} else {
  print(0)
}

Ifelse statement
This structure is the vectorized form of the if else condition: the test is applied to each element of a vector individually.

# ifelse() checks each element of the vector: if it is odd,
# the element is labelled as an odd number, otherwise as an even number.
x <- 1:8
ifelse(x %% 2 == 1, paste0(x, " : odd number"), paste0(x, " : even number"))

Next statement
The next statement is used to skip an iteration of a loop.

#Example 1
# the next statement skips the first 20 iterations
for(i in 1:100) {
  if(i <= 20) {
    next
  }
  print(i)
}

Break statement
Break is used to exit a loop immediately, regardless of what iteration the loop may be on.

#Example 2
# stop loop after 20 iterations
for(i in 1:100) {
  if(i > 20) {
    break
  }
  print(i)
}

In an infinite loop (such as one written with repeat), a break statement is used to exit the loop.

apply() returns a vector, array, or list of values obtained by applying a function to the margins of an array or matrix.

lapply() returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
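
The two descriptions above can be illustrated with a short sketch. The matrix m and the squaring function are illustrative choices only.

```r
# A small 2 x 3 matrix used only for illustration
m <- matrix(1:6, nrow = 2, ncol = 3)

# apply() over margin 1 (rows) and margin 2 (columns)
row_sums <- apply(m, 1, sum)
col_sums <- apply(m, 2, sum)

# lapply() always returns a list, one element per input element
squares <- lapply(1:3, function(v) v^2)

print(row_sums)   # 9 12
print(col_sums)   # 3 7 11
print(squares)
```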

Writing functions in R
Functions are defined using the function() directive and are stored as R objects just like anything else. In particular, they are R objects of class “function”.

#Example 1
func1 <- function() {
  print("Hello, world!")
}
func1()


#Example 2
func2 <- function(num) {
  for(i in seq_len(num)) {
    print("Hello, world!")
  }
}
func2(3)

Function with return
return() returns a value immediately from a function.

#counting the odd numbers in a vector (elements with remainder 1 modulo 2)
oddcount <- function(x) {
  return(sum(x %% 2 == 1))
}

R Programming Environment
An environment is a collection of objects.

Global variables
Global variables are those variables which exist throughout the execution of a program. They can be changed and accessed from any part of the program.

Local variables
Local variables are those variables which exist only within a certain part of a program, such as a function, and are released when the function call ends.
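
The difference can be seen in a minimal sketch. The names counter, add_one and local_value are hypothetical. Note that in R a plain assignment inside a function creates a local variable; changing a global from inside a function requires the <<- operator.

```r
# Global variable: defined at the top level of the session
counter <- 10

add_one <- function() {
  # Local variable: exists only while add_one() is running
  local_value <- counter + 1   # reads the global 'counter'
  counter <- 100               # creates a local 'counter'; the global is untouched
  local_value
}

result <- add_one()
print(result)    # 11
print(counter)   # still 10
```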

Taking input from user

my.age <- readline(prompt="Enter age: ")
# Convert to integer
my.age <- as.integer(my.age)

Recursive function
A recursive function is a function that calls itself.

#Example 1
#Finding factorial - n! = n*(n-1)!
recursive.factorial <- function(x) {
  if (x == 0) return (1)
  else return (x * recursive.factorial(x-1))
}

#Example 2
#The Fibonacci sequence is a series of numbers
#where a number is found by adding up the two numbers before it.
#Starting with 0 and 1, the sequence goes 0, 1, 1, 2, 3, 5, 8, 13, 21, 34.

recurse_fibonacci <- function(n) {
  if(n <= 1) return(n)
  else return(recurse_fibonacci(n-1) + recurse_fibonacci(n-2))
}

# print the first 12 terms of the sequence
for(i in 0:(12-1)) {
  print(recurse_fibonacci(i))
}

Algorithm analysis
An algorithm is evaluated based on the following attributes:
◙ Shorter running time
◙ Lower memory utilization

Memory management in R
R allocates memory differently to different objects in its environment. Memory allocation can be determined using the object_size function from the pryr package.

System runtime in R
System runtime helps to compare different algorithms and pick the best one. The microbenchmark package on CRAN is used to evaluate the runtime of any expression, function, or piece of code with sub-millisecond accuracy.
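
As a rough sketch that needs no extra packages, base R's system.time() and object.size() can stand in for microbenchmark and pryr's object_size. They are coarser tools and are swapped in here only so the example runs on a plain R installation. The two functions being compared are hypothetical.

```r
# Two ways of summing the squares of 1..n (n is an illustrative size)
n <- 1e5

sum_squares_loop <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i^2
  total
}

sum_squares_vec <- function(n) sum(as.numeric(seq_len(n))^2)

# system.time() reports user, system and elapsed time in seconds
t_loop <- system.time(r_loop <- sum_squares_loop(n))
t_vec  <- system.time(r_vec  <- sum_squares_vec(n))

print(t_loop["elapsed"])
print(t_vec["elapsed"])

# Memory footprint of a numeric vector of length n, in bytes
print(object.size(numeric(n)))
```

Timings vary between machines and runs, so only the relative difference between the two versions is meaningful.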

Algorithm asymptotic analysis
Asymptotic notations are commonly used to describe the complexity of an algorithm’s runtime. Big O (upper bound), Big Omega (lower bound), and Big Theta (tight bound, both upper and lower) are the simplest forms of functional equations, which represent an algorithm’s growth rate or its system runtime.

Assignment operator
Assigning an element (numeric, character, complex, or logical) to an object requires a constant amount of time. The asymptote (Big Theta notation) of the assignment operation is θ(1).

Simple for loop
The total cost of a simple for loop over n elements is θ(n).

Nested loop
The total cost of a nested loop in which both loops run n times is θ(n²).
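
The growth rates can be made concrete by counting how often the loop body runs. This is a small sketch with hypothetical helper names; doubling n doubles the count for the simple loop and quadruples it for the nested loop.

```r
# Count how many times the loop body runs for an input of size n
count_simple <- function(n) {
  steps <- 0
  for (i in seq_len(n)) steps <- steps + 1
  steps
}

count_nested <- function(n) {
  steps <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(n)) steps <- steps + 1
  }
  steps
}

for (n in c(10, 20, 40)) {
  cat("n =", n, " simple:", count_simple(n), " nested:", count_nested(n), "\n")
}
```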

Writing sorting algorithms in R
Bubble sort
Bubble sort is a simple, comparison-based sorting algorithm in which each pair of adjacent elements is compared and the elements are swapped if they are not in order.

bubblesort <- function(x) {
  if (length(x) < 2)
    return (x)
  # last is the last element to compare with
  for(last in length(x):2) {
    for(first in 1:(last - 1)) {
      if(x[first] > x[first + 1]) {
        # swap the pair
        save <- x[first]
        x[first] <- x[first + 1]
        x[first + 1] <- save
      }
    }
  }
  return (x)
}

Quick sort
Quick sort involves the following steps:
◙ Pick an element, called a pivot, from the array.
◙ Partitioning: reorder the array so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.
◙ Recursively apply the above steps to the sub-array of elements with smaller values and separately to the sub-array of elements with greater values.

quickSort <- function(vect) {
  if (length(vect) <= 1) {
    return(vect)
  }
  # Pick an element (the pivot) from the vector
  element <- vect[1]
  partition <- vect[-1]
  # Reorder the vector so that values less than the pivot come before it,
  # and values greater than or equal to the pivot come after it.
  v1 <- partition[partition < element]
  v2 <- partition[partition >= element]
  # Recursively apply the steps to the smaller vectors.
  v1 <- quickSort(v1)
  v2 <- quickSort(v2)
  return(c(v1, element, v2))
}

Learning R (Introduction)

♦ The R programming language is generally used for statistical analysis, graphical representation, and reporting.

<- symbol is the assignment operator.
# symbol is used to comment a line

# x equals 1
x <- 1
# msg equals hello
msg <- "hello"

The basic arithmetic operations using R

# addition
18 + 12
# subtraction
18 - 12
# multiplication
18 * 12
# division    
18 / 12 
# just the integer part of the quotient
18 %/% 12
# just the remainder part (modulo)
18 %% 12
# exponentiation (raising to a power)
18 ^ 12
# natural log (base e)
log(18)
# base 10 logs
log10(18)
# square root
sqrt(18)
# absolute value
abs(18 / -12)

Defining vectors in R

# Method 1
# numeric vector
x <- c(0.5, 0.6)
# complex vector
x <- c(1+0i, 2+4i)    

# Method 2 - Use the vector() function to initialize vectors
x <- vector("numeric", length = 10)

# Method 3 - Creating a vector of numbers
# number sequence 1, 2, 3,.... 10
my.seq <- 1:10
# number sequence from 1 to 10 with an interval of 1
my.seq <- seq(from=1, to=10, by=1)

# Some useful functions
# to check the data type of my.seq
class(my.seq)
# to check whether my.seq is a vector
is.vector(my.seq)
# divide each element of my.seq by 3
my.seq <- my.seq / 3

Defining matrices in R

# Method 1
# To create empty 2 by 3 matrix
m <- matrix(nrow = 2, ncol = 3)
# To check the dimensionality
dim(m)

# Method 2 
# Matrices are constructed column-wise
m <- matrix(1:6, nrow = 2, ncol = 3)

# Method 3
#Matrix created directly from vectors by adding a dimension attribute.
m <- 1:10
dim(m) <- c(2, 5)

# Method 4
x <- 1:3
y <- 10:12
# create matrix by column-binding
cbind(x, y)
# create matrix by row-binding 
rbind(x, y)

Defining lists in R
Lists are a special type of vector that can contain elements of different classes.

# Method 1
# this list contains different class of elements
x <- list(1, "a", TRUE, 1 + 4i)

# Method 2
# create empty list with the length of 5
x <- vector("list", length = 5)

Defining factors in R
♦ Factors are used to represent categorical data and can be unordered or ordered.
♦ Factors are important in statistical modelling.

# Levels are put in alphabetical order
x <- factor(c("yes", "yes", "no", "yes", "no"))

# table() shows how many yes and no values there are
table(x)

# Levels are put in the order given by the levels argument, not alphabetically
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))

Missing Values
Missing values are denoted by NA or NaN. NA is used to represent missing values, and NaN is used to represent invalid numbers (for example, 0/0).

# a vector is defined with a missing value
x <- c(1, 2, NA, 10, 3)

# check whether this vector has any NA values
any(is.na(x))

Data Frames
♦ Data frames are used to store tabular data in R.
♦ Data frames are represented as a special type of list where every element of the list has to have the same length.
♦ Unlike matrices, data frames can store different classes of objects in each column.
♦ In addition to column names, indicating the names of the variables or predictors, data frames have a special attribute called row.names which holds information about each row of the data frame.

# Define a data frame in R
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

# To show number of rows
nrow(x)

# To show number of columns
ncol(x)

Managing Data Frames with the dplyr package
♦ The data frame is a key data structure in statistics and in R.
♦ The dplyr package is designed for filtering, re-ordering, and collapsing data frames.

#Installing dplyr package
install.packages("dplyr")

# load dplyr package into your R session
library(dplyr)
# Load chicago.rds file
chicago <- readRDS('C:\\Users\\ahilan\\Dropbox\\Elect_dept_UOJ\\Statistics using 

#To show the number of rows and columns of the data
dim(chicago)

# The select() function can be used to select columns of a data frame.

# Suppose we wanted to take the first 3 columns only.
subset <- select(chicago, city:dptp)

# if you wanted to keep every variable that ends with a “2”
subset <- select(chicago, ends_with("2"))

# You can also omit variables using the select()
select(chicago, -(city:dptp))

# If we wanted to keep every variable that starts with a “d”
subset <- select(chicago, starts_with("d"))

# The filter() function is used to extract subsets of rows from a data frame
# Extract the rows where PM2.5 is greater than 30
chic.f <- filter(chicago, pm25tmean2 > 30)    

# Extract the rows where PM2.5 is greater than 30 and temperature is greater
# than 80 degrees Fahrenheit.
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)

# The arrange() function is used to reorder rows of a data frame according to one
# of the variables/columns

# We can order the rows of the data frame by date, so that the first row is the
# earliest (oldest) observation and the last row is the latest (most recent)
chicago <- arrange(chicago, date)
# Use desc() to order the rows by date in descending order instead
chicago <- arrange(chicago, desc(date))

Logical operation in R

# define a vector using boolean values
a <- c(TRUE, FALSE, FALSE, TRUE)
# define a numeric vector
b <- c(13, 7, 8, 2)
# selects true value elements
b[a]   # 13 2
# inverse of a
!a
# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(a)  # 2

Built-in search function


Data input and output

#Changing directories
#Changing the default to the mydata folder in the C: drive
setwd("c:\\mydata")

#Save the objects for a future session
dump("usefuldata", "useful.R")

#Retrieve the saved objects
source("useful.R")

#Save all of the objects that you have created during a session
dump(list=objects(), "all.R")

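#Redirecting R output to text file
# Create a file solarmean.txt for output
sink("solarmean.txt")
# Write mean value to solarmean.txt (solar.radiation is assumed to be an existing numeric vector)
mean(solar.radiation)
# Close solarmean.txt; print new output to screen
sink()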

Learning Data Science – part 1

Data matrix
Data can often be represented or abstracted as an n×d data matrix, with n rows and d columns, where rows correspond to entities in the dataset, and columns represent attributes or features or properties of interest.

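The n×d data matrix is given as

$$D = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}$$

where row i holds the i-th entity and column j holds the j-th attribute.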


Numeric Attributes – A numeric attribute is one that has a real-valued or integer-valued domain. For example, Age.

Categorical Attributes – A categorical attribute is one that has a set-valued domain composed of a set of symbols. For example, Sex is a categorical attribute.

Orthogonality – Two vectors a and b are said to be orthogonal if the angle between them is 90°, which implies that cos θ = 0. Equivalently, the dot product of a and b is 0.

Orthogonal Projection – In data mining, we may need to project a point or vector onto another vector to obtain a new point after a change of the basis vectors. Let a, b be two m-dimensional vectors. An orthogonal decomposition of the vector b in the direction of another vector a is given as

$$b = p + r, \qquad p = \left(\frac{a^T b}{a^T a}\right) a, \qquad r = b - p$$

where p is parallel to a and r is orthogonal to a. The vector p is called the orthogonal projection or simply projection of b on the vector a.
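
A short numerical sketch of the projection follows. The vectors a and b hold arbitrary illustrative values, and r is a name assumed here for the residual b − p.

```r
# Illustrative vectors (assumed values)
a <- c(2, 0, 1)
b <- c(1, 3, 2)

# Projection of b on a: p = (a.b / a.a) * a
p <- as.numeric(crossprod(a, b) / crossprod(a, a)) * a

# Residual vector, which should be orthogonal to a
r <- b - p

print(p)            # 1.6 0.0 0.8
print(sum(a * r))   # dot product of a and r; zero up to rounding error
```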

Centered Data Matrix
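The centered data matrix is obtained by subtracting the mean from all the points:

$$Z = D - \mathbf{1} \cdot \mu^T$$

where 1 is the n-dimensional vector of ones and µ is the mean vector of the data.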


Linear Independence
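We say that the vectors v1, . . . , vk are linearly dependent if at least one vector can be written as a linear combination of the others. Equivalently, they are linearly dependent if

$$c_1 v_1 + c_2 v_2 + \cdots + c_k v_k = 0$$

where c1, c2, . . . , ck are scalars, at least one of which is not zero.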

A set of vectors is linearly independent if none of them can be written as a linear combination of the other vectors in the set.

Dimension and Rank
The maximum number of linearly independent vectors in a matrix is equal to the number of non-zero rows in its row echelon matrix. Therefore, to find the rank of a matrix, we simply transform the matrix to its row echelon form and count the number of non-zero rows.

For the data matrix D ∈ R^{n×d}, we have rank(D) ≤ min(n, d), which follows from the fact that the column space can have dimension at most d, and the row space can have dimension at most n. If rank(D) < d, then the data points reside in a lower dimensional subspace of R^d, and in this case rank(D) gives an indication about the intrinsic dimensionality of the data.
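
In R, the rank can be obtained numerically rather than by hand reduction to row echelon form. The sketch below uses qr(), whose rank component gives the numerical rank; the matrix D is an illustrative example whose third column is the sum of the first two.

```r
# Illustrative 4 x 3 data matrix; column 3 = column 1 + column 2
D <- cbind(c(1, 2, 3, 4),
           c(0, 1, 0, 1),
           c(1, 3, 3, 5))

# qr() computes a QR decomposition; its 'rank' component is the numerical rank
rank_D <- qr(D)$rank

print(rank_D)          # 2
print(min(dim(D)))     # upper bound min(n, d) = 3
```

Because the rank is 2 rather than 3, the four points lie in a two-dimensional subspace.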

In fact, with dimensionality reduction methods it is often possible to approximate D ∈ R^{n×d} with a derived data matrix D′ ∈ R^{n×k}, which has much lower dimensionality, that is, k ≪ d. In this case k may reflect the “true” intrinsic dimensionality of the data.

We can estimate a parameter of the population by defining an appropriate sample statistic, which is defined as a function of the sample.

A random sample of size m drawn from a (multivariate) random variable X is defined as a set of m independent and identically distributed random variables S1, S2, . . ., Sm.

A statistic θ̂ is a function θ̂: (S1, S2, . . ., Sm) → R.

The statistic θ̂ is an estimate of the corresponding population parameter θ. If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.

Univariate analysis
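Univariate analysis focuses on a single attribute at a time. The data matrix is given as the n×1 matrix

$$D = (x_1, x_2, \ldots, x_n)^T$$

where X is the attribute of interest, assumed to be a random variable, and each xi is an observed value of X.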

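Mean – The mean, also called the expected value, of a random variable X is the arithmetic average of the values of X. The mean of a discrete random variable with probability mass function f(x) is defined as

$$\mu = E[X] = \sum_x x \, f(x)$$

The expected value of a continuous random variable X with probability density function f(x) is defined as

$$\mu = E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$

Sample Mean – The sample mean is a statistic, µ̂: {x1, x2, . . . , xn} → R, which is defined as the average value of the xi’s:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$$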


A statistic is robust if it is not affected by extreme values (outliers) in the data.

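Median – The median of a random variable is defined as the value m such that

$$P(X \le m) \ge \frac{1}{2} \quad \text{and} \quad P(X \ge m) \ge \frac{1}{2}$$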


The median is robust, as it is not affected very much by extreme values.

Measures of Dispersion
The measures of dispersion give an indication about the spread or variation in the values of a random variable.

The range of a random variable X is the difference between the maximum and minimum values of X, which is defined as

$$r = \max\{X\} - \min\{X\}$$


Interquartile Range
Quartiles divide the data into four equal parts. They correspond to the quantile values of 0.25, 0.5, 0.75, and 1.0. With F^{-1} denoting the inverse cumulative distribution function, the first quartile is the value q1 = F^{-1}(0.25). The second quartile is the same as the median value q2 = F^{-1}(0.5). The third quartile is q3 = F^{-1}(0.75).

Interquartile range (IQR) is defined as

$$IQR = q_3 - q_1 = F^{-1}(0.75) - F^{-1}(0.25)$$


Variance and Standard Deviation
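The variance of a random variable X provides a measure of how much the values of X deviate from the mean or expected value of X. Variance is defined as

$$\sigma^2 = var(X) = E[(X - \mu)^2]$$

The standard deviation, σ, is defined as the square root of the variance, σ².

Sample variance is defined as

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

Dividing by n − 1 instead of n gives the unbiased estimator, which is what R's var() computes.

The standard score (z-score) of a sample value xi is the number of standard deviations the value is away from the mean:

$$z_i = \frac{x_i - \hat{\mu}}{\hat{\sigma}}$$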


Multivariate analysis
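For d numeric attributes X1, X2, . . . , Xd, the full data matrix is the n×d matrix D whose entry x_{ij} is the value of attribute Xj for the i-th point.

The multivariate mean vector is obtained by taking the mean of each attribute, which is defined as

$$\mu = E[X] = (\mu_1, \mu_2, \ldots, \mu_d)^T$$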


Covariance Matrix
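The multivariate covariance information is captured by the d×d (square) symmetric covariance matrix that gives the covariance for each pair of attributes:

$$\Sigma = E[(X - \mu)(X - \mu)^T] = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{pmatrix}$$

The diagonal element σ_i² specifies the attribute variance for Xi, whereas the off-diagonal elements σ_ij = σ_ji represent the covariance between attribute pairs Xi and Xj.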
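
The mean vector, the centered data matrix and the covariance matrix can be computed in a few lines. The sketch below uses made-up data; it shows both the 1/n normalisation and R's cov(), which divides by n − 1.

```r
# Illustrative data matrix: 5 points, 2 numeric attributes (assumed values)
D <- cbind(X1 = c(2, 4, 6, 8, 10),
           X2 = c(1, 3, 2, 5, 4))
n <- nrow(D)

mu <- colMeans(D)        # multivariate mean vector
Z  <- sweep(D, 2, mu)    # centered data matrix: subtract the mean from each row

# Covariance matrix with the 1/n normalisation
Sigma_n <- crossprod(Z) / n

# cov() uses the n - 1 denominator instead
Sigma_unbiased <- cov(D)

print(mu)
print(Sigma_n)
print(Sigma_unbiased)
```

The diagonal of either matrix holds the attribute variances, and the matrix is symmetric because the covariance of Xi and Xj does not depend on their order.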

Data Normalization
When analyzing two or more attributes it is often necessary to normalize the values of the attributes, especially in those cases where the values are vastly different in scale.

In range normalization, each value is scaled as follows,

$$x_i' = \frac{x_i - \min_i\{x_i\}}{\max_i\{x_i\} - \min_i\{x_i\}}$$

After transformation the new attribute takes on values in the range [0, 1].

Standard Score Normalization
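In standard score normalization, also called z-normalization, each value is replaced by its z-score:

$$x_i' = \frac{x_i - \hat{\mu}}{\hat{\sigma}}$$

where µ̂ is the sample mean and σ̂ is the sample standard deviation of the attribute.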
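
Both normalizations are one-liners in R. The following sketch uses an arbitrary illustrative vector x; note that sd() uses the n − 1 denominator.

```r
# Illustrative attribute values (assumed)
x <- c(12, 14, 18, 23, 27, 28, 34, 37, 39, 40)

# Range (min-max) normalization: values end up in [0, 1]
x_range <- (x - min(x)) / (max(x) - min(x))

# Standard score (z) normalization: mean 0, standard deviation 1
x_z <- (x - mean(x)) / sd(x)

print(round(x_range, 3))
print(round(x_z, 3))
```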


Univariate Normal Distribution
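If a random variable X has a normal distribution, with the parameters mean µ and variance σ², the probability density function of X is given as

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(x - \mu)^2}{2\sigma^2}\right\}$$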


Probability Mass
Given an interval [a, b] the probability mass of the normal distribution within that interval is given as

$$P(a \le X \le b) = \int_a^b f(x \mid \mu, \sigma^2) \, dx$$


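The probability mass concentrated within k standard deviations from the mean is given as

$$P(\mu - k\sigma \le X \le \mu + k\sigma) = \frac{1}{\sqrt{2\pi}} \int_{-k}^{k} e^{-z^2/2} \, dz$$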
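
This probability does not depend on µ or σ, so it can be evaluated with the standard normal distribution function. A minimal sketch for k = 1, 2, 3:

```r
# Probability mass within k standard deviations of the mean,
# computed from the standard normal distribution function
k <- 1:3
mass <- pnorm(k) - pnorm(-k)

print(round(mass, 4))   # 0.6827 0.9545 0.9973
```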


Normal distribution with different variances


Multivariate Normal Distribution
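Given the d-dimensional vector random variable X = (X1, X2, . . . , Xd), we say that X has a multivariate normal distribution, with the parameters mean µ and covariance matrix Σ, if its joint multivariate probability density function is given as

$$f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{-\frac{(x - \mu)^T \Sigma^{-1} (x - \mu)}{2}\right\}$$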


An example of bivariate normal density and contours is shown as follows,



Before training the model, the dataset is randomly split into a training dataset and a test dataset. The training dataset is used to train the model, and the test dataset is used to evaluate the performance of the final model. A number of techniques are used for sampling:
(a) Simple random sampling, in which every tuple has an equal chance of being selected.
(b) Stratified sampling
(c) Progressive sampling
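
The first two techniques can be sketched in base R. The 70/30 split ratio, the seed, and the use of the built-in iris data set are assumptions made only for illustration.

```r
set.seed(42)          # arbitrary seed so the split is reproducible
n <- nrow(iris)

# (a) Simple random sampling: every row has the same chance of selection
train_idx <- sample(n, size = round(0.7 * n))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# (b) Stratified sampling: draw 70% of the rows within each Species
strat_idx <- unlist(lapply(
  split(seq_len(n), iris$Species),
  function(rows) sample(rows, size = round(0.7 * length(rows)))
))
strat_train <- iris[strat_idx, ]

print(dim(train))
print(dim(test))
print(table(strat_train$Species))
```

Stratified sampling keeps the class proportions of the full dataset in the training set, which simple random sampling only does approximately.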

Pre-processing: feature selection, feature scaling, and dimensionality reduction

Pre-processing involves feature selection, feature scaling, and dimensionality reduction.

✓ Feature selection – We are only interested in retaining meaningful features that can help to build a good classifier. Feature selection is often based on domain knowledge or exploratory analyses, such as histograms or scatterplots. The feature selection approach will eventually lead to a smaller feature space.

✓ Feature scaling/ normalization – Normalization and other feature scaling techniques are often mandatory in order to make comparisons between different attributes. If the attributes were measured on different scales, proper scaling of features is a requirement for most machine learning algorithms. A number of techniques are used for feature scaling,
(i) The simplest scaling is min-max scaling.
(ii) standardizing the data – it is the process of converting the input so that it has a mean of 0 and standard deviation of 1.

✓ Dimensionality reduction techniques – How do we analyze the characteristics of a dataset with hundreds of columns? As the number of dimensions grows, many algorithms become computationally infeasible. Dimensionality reduction techniques preserve the structure of the data as much as possible while reducing the number of dimensions. A number of techniques are used for dimensionality reduction.

✓ Matrix Decomposition
✶Matrix decomposition is a way of expressing a matrix as a product of other matrices. Say that A is a product of two other matrices B and C. The matrix B is supposed to contain vectors that can explain the direction of variation in the data. The matrix C is supposed to contain the magnitude of this variation. Thus, our original matrix A is now expressed as the product of B and C.
✶There are methods that insist that the basis vectors have to be orthogonal to each other, such as principal component analysis, and there are some that don’t insist on this requirement, such as dictionary learning.

✓ Principal component analysis (PCA)
✶PCA is an unsupervised method. In multivariate problems, PCA is used to reduce the dimension of the data with minimal information loss while retaining the maximum variation in the data. Variation means the direction in which the data is dispersed to the maximum.
✶Selection criteria of number of components:
(i) The eigenvalue criterion – An eigenvalue of one means that the component explains about one variable’s worth of variability. We therefore include only those components whose eigenvalue is greater than or equal to one. The threshold can be adjusted to the data set: in a very high-dimensional dataset, a component that explains only one variable’s worth of variability may not be very useful.
(ii) The proportion of variance explained criterion – This is calculated from the eigenvalues: each eigenvalue divided by the sum of all eigenvalues gives the share of the total variance explained by that component.
PCA works only if the input dataset has correlated columns. Without correlation among the input variables, PCA cannot help us.
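
Both selection criteria can be read off the output of prcomp(). The sketch below runs PCA on the four numeric columns of the built-in iris data set, an illustrative choice; the columns are standardised, so the eigenvalues are those of the correlation matrix.

```r
# PCA on the four numeric columns of iris, standardised to unit variance
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# The eigenvalues are the squared standard deviations of the components
eigenvalues <- pca$sdev^2

# (i) Eigenvalue criterion: keep components with eigenvalue >= 1
keep_eigen <- which(eigenvalues >= 1)

# (ii) Proportion of variance explained, and its running total
prop_var <- eigenvalues / sum(eigenvalues)
cum_var  <- cumsum(prop_var)

print(round(eigenvalues, 3))
print(keep_eigen)
print(round(cum_var, 3))
```

With standardised inputs the eigenvalues sum to the number of variables, which is why a threshold of one corresponds to "one variable's worth" of variability.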

✓ Kernel PCA
✶PCA is limited to only those variables where the variation in the data falls in a straight line. In other words, it works only with linearly separable data.
✶Kernel PCA is used to reduce the dimension of datasets where the variations in them are not straight lines. In kernel PCA, a kernel function is applied to all the data points. This transforms the input data into kernel space, and then a normal PCA is performed in the kernel space.
✶A kernel is a function that computes the dot product, that is, the similarity between two vectors, which are passed to it as input. Some commonly used kernel functions are linear, polynomial, sigmoid, and cosine.

✓ Extracting features using singular value decomposition (SVD)
✶SVD is another matrix decomposition technique that can be used to tackle the curse of the dimensionality problem.
✶It can be used to find the best approximation of the original data using fewer dimensions.
✶Unlike PCA, SVD does not need a covariance or correlation matrix; it works on the original data matrix. SVD factors an m x n matrix A into a product of three matrices: A = U*S*V.T, where V.T denotes the transpose of V. Here, U is an m x k matrix, V is an n x k matrix, and S is a k x k matrix. The columns of U are called left singular vectors and the columns of V are called right singular vectors. The values on the diagonal of the S matrix are called singular values.
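
Base R's svd() returns the three factors directly. The following sketch uses a made-up 4 x 3 matrix, reconstructs it from the factors, and then builds a rank-2 approximation by keeping the two largest singular values.

```r
# Illustrative 4 x 3 matrix (assumed values)
A <- matrix(c(2, 0, 1,
              1, 3, 0,
              0, 1, 4,
              3, 1, 2), nrow = 4, byrow = TRUE)

s <- svd(A)   # s$u: 4 x 3, s$d: singular values, s$v: 3 x 3

# Full reconstruction: A = U * S * t(V)
A_full <- s$u %*% diag(s$d) %*% t(s$v)

# Rank-k approximation keeps only the k largest singular values
k <- 2
A_k <- s$u[, 1:k] %*% diag(s$d[1:k], k, k) %*% t(s$v[, 1:k])

print(round(s$d, 3))
print(max(abs(A - A_full)))          # reconstruction error, close to zero
print(norm(A - A_k, type = "F"))     # error of the rank-2 approximation
```

The Frobenius error of the rank-2 approximation equals the discarded third singular value.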

✓ Decomposing the feature matrices using nonnegative matrix factorization
✶Non-negative Matrix Factorization (NMF) is used extensively in recommendation systems using a collaborative filtering algorithm.
✶Let’s say that our input matrix A is of a dimension m x n. NMF factorizes the input matrix into two matrices, A_dash and H: A = A_dash*H.
✶Let’s say that we want to reduce the dimension of the A matrix to d, that is, we want the original m x n matrix to be represented by an m x d matrix, where d << n.
✶The A_dash matrix is of size m x d and the H matrix is of size d x n. NMF solves this as an optimization problem, that is, minimizing the function |A - A_dash*H|^2 subject to all entries of A_dash and H being non-negative.
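
One common way to solve this optimization is the multiplicative update rule of Lee and Seung. The base R sketch below is one possible approach, not the only one; the sizes, seed, iteration count and the name W (playing the role of A_dash) are all assumptions.

```r
set.seed(1)                            # arbitrary seed for reproducibility
m <- 6; n <- 5; d <- 2                 # illustrative sizes, with d < n
A <- matrix(runif(m * n), nrow = m)    # non-negative input matrix

W <- matrix(runif(m * d), nrow = m)    # plays the role of A_dash (m x d)
H <- matrix(runif(d * n), nrow = d)    # d x n
eps <- 1e-9                            # guards against division by zero

# Squared Frobenius norm of the residual, |A - W*H|^2
frob_error <- function(A, W, H) sum((A - W %*% H)^2)
err_start <- frob_error(A, W, H)

# Multiplicative updates keep every entry of W and H non-negative
for (iter in 1:200) {
  H <- H * (t(W) %*% A) / (t(W) %*% W %*% H + eps)
  W <- W * (A %*% t(H)) / (W %*% H %*% t(H) + eps)
}

err_end <- frob_error(A, W, H)
print(c(start = err_start, end = err_end))
```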

Data cleansing

Data cleansing is a process of removing incorrect, inaccurate, incomplete, improperly formatted, and duplicated data. The quality of data affects the data analysis results. In many real-world scenarios, we have the problem of incomplete or missing data, and missing or sparse data can also lead to highly misleading results.

Correcting the dirty data:

Statistical methods – Statistical validations can be used to handle missing values. A strategy for dealing with missing data would be imputation: Replacement of missing values using certain statistics rather than complete removal.
(a) For categorical data, the missing value can be imputed with the most frequent category.
(b) For numerical data, the sample average or median can be used to impute missing values.
In general, substitution via k-nearest neighbour imputation is considered to be superior to substitution of missing data by the overall sample mean. Scikit-learn provides an imputer to handle missing data (Imputer() in the preprocessing module in older releases, SimpleImputer in the sklearn.impute module in current ones).
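
The two simple strategies (a) and (b) can be sketched in base R as follows. The data frame and its column names are made up for illustration.

```r
# Illustrative data with missing values (assumed)
df <- data.frame(
  age    = c(23, NA, 31, 45, NA, 38),
  colour = c("red", "blue", NA, "red", "red", NA),
  stringsAsFactors = FALSE
)

# (b) Numerical data: replace NA with the median of the observed values
df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)

# (a) Categorical data: replace NA with the most frequent category
most_frequent <- names(which.max(table(df$colour)))
df$colour[is.na(df$colour)] <- most_frequent

print(df)
```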

Text parsing – Text parsing can be used to validate the data and avoid the syntax errors.

Detecting outliers – Unknowingly including outliers in some algorithms may lead to wrong results or conclusions. It is very important to account for them properly and to use the right algorithms to handle them:
(a) Mean plus or minus three standard deviations – This rule is used to detect outliers in univariate data. For Gaussian data, we know that 68.27 percent of the data lies within one standard deviation of the mean, 95.45 percent within two, and 99.73 percent within three. Thus, according to this rule, any point that is more than three standard deviations from the mean is classified as an outlier. The finite sample breakdown point is defined as the proportion of the observations in a sample that can be replaced before the estimator fails to describe the data accurately.
(b) Median absolute deviation – Median is a more robust estimate. The median is the middle observation in a finite set of observations that is sorted in an ascending order. For the median to change drastically, we have to replace half of the observations in the data that are far away from the median. This gives you a 50 percent finite sample breakdown point for the median.
(c) Discovering outliers using the local outlier factor (LOF) method – LOF detects the outliers based on comparing the local density of the data instance with its neighbours. It’s inspired by the KNN (K-Nearest Neighbours) algorithm and is widely used.
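
Rules (a) and (b) can be compared on a small sample. In the sketch below the data values and the cut-off of 3 for the median-based rule are assumptions; mad() rescales the median absolute deviation by 1.4826 so that it is comparable to the standard deviation for Gaussian data.

```r
# Illustrative sample with one gross outlier (assumed values)
x <- c(10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.0, 25.0)

# (a) Mean plus or minus three standard deviations
z <- (x - mean(x)) / sd(x)
outliers_sd <- x[abs(z) > 3]

# (b) Median and median absolute deviation
robust_z <- (x - median(x)) / mad(x)
outliers_mad <- x[abs(robust_z) > 3]

print(outliers_sd)    # empty: the outlier inflates the mean and sd
print(outliers_mad)   # 25
```

In this sample the three-standard-deviation rule flags nothing, because the outlier itself distorts the mean and the standard deviation, while the median-based rule flags the value 25.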

OpenRefine – OpenRefine is a formatting tool which is used for data cleansing, data exploration, and data transformation.
(a) Text facet – Text facet is a very useful tool, similar to filter in a spreadsheet. Text facet groups unique text values into groups.
(b) Clustering – We can cluster all the similar values by clicking on our text facet, which allows us to find the dirty data.
(c) Numeric facets – A numeric facet groups numbers into numeric range bins.
(d) Transforming data – If the data is in different format, by transformation, we can bring all the data to same format.