Learning R (Probability distributions)

Binomial distribution
The binomial distribution is a discrete probability distribution. It describes the outcome of n independent trials in an experiment. Each trial is assumed to have only two outcomes, either success or failure. If the probability of a successful trial is p, then the probability of having x successful outcomes in an experiment of n independent trials is as follows,
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, \dots, n$$

The binomial distribution describes the behavior of a count variable X if the following conditions apply:
1. The number of observations n is fixed.
2. Each observation is independent.
3. Each observation represents one of two outcomes (“success” or “failure”).
4. The probability of “success” p is the same for each observation.

#R syntax for estimating the probability
dbinom(x, size, prob)
#Example 1:
#Compute the probability of getting five heads in seven tosses of a fair coin.
dbinom(x = 5, size = 7, prob = 0.5)

#Example 2:
#Compute the probability of getting four or fewer heads in seven tosses of a fair coin.
#Method 1 – Compute the individual probabilities and sum them together
sum(dbinom(x = 0:4, size = 7, prob = 0.5))

#Method 2 – Compute the cumulative probability using the built-in R function
pbinom(4, 7, 0.5)

#Example 3:
#Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.
pbinom(4, 12, 0.2)

◙ The expected value (or mean) of a binomial random variable is np
◙ The variance of a binomial distribution is np(1 − p)
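
The mean and variance stated above can be checked numerically from the probabilities returned by dbinom(). This is only a minimal sketch: the values n = 12 and p = 0.2 are borrowed from Example 3, and the variable names are illustrative.

```r
# Check the binomial mean and variance against the closed-form expressions
n <- 12
p <- 0.2
x <- 0:n

# Probability of each possible number of successes
probs <- dbinom(x, size = n, prob = p)

# Mean and variance computed directly from the probability mass function
mean_x <- sum(x * probs)
var_x  <- sum((x - mean_x)^2 * probs)

# Compare with the number of trials times p, and with n * p * (1 - p)
print(c(computed = mean_x, formula = n * p))
print(c(computed = var_x,  formula = n * p * (1 - p)))
```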

#Visualizing a bimodal sample (a mixture of two normal distributions) using a histogram plot
hist(c(rnorm(500,0,2),rnorm(500,8,2)),col="grey",main="Bimodal", breaks = 15)

Poisson distribution
The Poisson distribution is the probability distribution of independent event occurrences in an interval. If λ is the mean occurrence per interval, then the probability of having x occurrences within a given interval is:
$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \dots$$

◙ The mean and variance of a Poisson random variable are both equal to λ.
#R syntax of estimating the probability of Poisson distribution
dpois(x, lambda)

#Example 1:
#According to the Poisson model, the probability of three arrivals at an automatic bank teller in the next minute, where the average number of arrivals per minute is 0.6, is
dpois(x = 3, lambda = 0.6)

#The cumulative probability is computed using
ppois(q, lambda)

#We can generate Poisson random numbers using
rpois(n, lambda)

#Example 2:
#Suppose traffic accidents occur at an intersection with a mean rate of 3 per year. Simulate the annual number of accidents for a 12-year period, assuming a Poisson model.
rpois(12, 3)

#Example 3:
#If there are eleven cars crossing a bridge per minute on average, find the probability of having sixteen or more cars crossing the bridge in a particular minute.
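
One way to answer Example 3 is with the upper tail of ppois(). The sketch below assumes that "sixteen or more" means P(X ≥ 16), so the cut-off passed to ppois() is 15; the variable names are illustrative.

```r
# Average number of cars crossing the bridge per minute
lambda <- 11

# P(X >= 16) = 1 - P(X <= 15); lower.tail = FALSE returns P(X > 15)
p_16_or_more <- ppois(15, lambda = lambda, lower.tail = FALSE)

# Cross-check by summing the individual probabilities for 0..15 cars
p_check <- 1 - sum(dpois(0:15, lambda = lambda))

print(p_16_or_more)
print(p_check)
```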

#Visualizing the Poisson distribution using a histogram plot

Exponential distribution
The exponential distribution describes the waiting time between events in a randomly recurring sequence of independent events. If μ is the mean waiting time for the next event recurrence, its probability density function is
$$f(x) = \frac{1}{\mu} e^{-x/\mu}, \quad x \ge 0$$
where the mean μ is the reciprocal of the rate parameter.
#R syntax for estimating the probability of exponential distribution
pexp(q, rate)

#Example 1:
#Suppose the service time at a bank teller can be modeled as an exponential random variable with a rate of 3 per minute. Then the probability of a customer being served in less than 1 minute is
pexp(1, rate = 3)

#The rexp() function can be used to generate n random exponential variates.
rexp(n, rate)

#Example 2:
#A bank has a single teller who is facing a queue of 8 customers. The time for each customer to be served is exponentially distributed with rate 2 per minute. We can simulate the service times (in minutes) for the 8 customers.
servicetimes <- rexp(8, rate = 2)

Normal distribution
A normal random variable X has a probability density function given by
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
where µ is the expected value of X, and σ² denotes the variance of X.

◙ The standard normal random variable has mean µ = 0 and standard deviation σ = 1.

◙ The normal density function can be evaluated using dnorm().
◙ The distribution function can be evaluated using pnorm().
◙ Normal pseudorandom variables can be generated using rnorm().
#Example 1:
#Assume that the test scores of a college entrance exam fit a normal distribution. Furthermore, the mean test score is 75, and the standard deviation is 15. What is the percentage of students scoring 80 or more in the exam?
pnorm(80, mean=75, sd=15, lower.tail=FALSE)

#Visualizing the normal distribution using a histogram plot
x<-rnorm(1000, 5, 0.2)
hist(x,probability = TRUE)

Chi-squared Distribution
If X1, X2, …, Xm are m independent random variables having the standard normal distribution, then the quantity $V = X_1^2 + X_2^2 + \dots + X_m^2$ follows a Chi-Squared distribution with m degrees of freedom. Its mean is m, and its variance is 2m.
#Example 1:
#Find the 97th percentile of the Chi-Squared distribution with 6 degrees of freedom.
qchisq(.97, df=6)        # 6 degrees of freedom

Examining the distribution of a set of data
The distribution can be examined in a number of ways.
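# 1. Generating a summary of the data
# (eruptions is the eruption duration column of R's built-in faithful dataset)
eruptions <- faithful$eruptions
summary(eruptions)

# 2. Drawing the histogram
hist(eruptions, seq(1.6, 5.2, 0.2), prob=TRUE)
lines(density(eruptions, bw=0.1))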


Learning R (Programming Basics)

For loop in R

# Example 1
# print 1, 2, 3...10
for(i in 1:10) {
  print(i)
}

# Example 2
# print a, b, c, d
x <- c("a", "b", "c", "d")
for(i in 1:4) {
  print(x[i])
}

Nested for loop
It is defined as a set of for loops within for loops.

# sum the elements of a 3 by 3 matrix, printing the running total
M <- matrix(1:9, ncol = 3)
total <- 0
for (i in seq(nrow(M))) {
  for (j in seq(ncol(M))) {
    total <- total + M[i, j]
    print(total)
  }
}

While loop
The loop runs as long as its condition is true.

i <- 1
# When i >= 8, the loop terminates
while (i < 8) {
  print(i)
  i <- i + 1
}

If statement
This structure allows you to test a condition and act on it depending on whether
it’s true or false.

# if the random number is greater than 3, it will print 10. Otherwise it prints 0.
x <- runif(1, 0, 10)
if(x > 3) {
  print(10)
} else {
  print(0)
}

Ifelse statement
This structure is an equivalent form of the if else condition, but it is applied to each element of a vector individually.

# The ifelse statement checks each element of the vector: if it's odd,
# it is labelled as an odd number, otherwise as an even number.
x <- 1:8
ifelse(x %% 2 == 1, paste0(x, " : odd number"), paste0(x, " : even number"))

Next statement
The next statement is used to skip some iterations.

#Example 1
# the next statement skips the first 20 iterations
for(i in 1:100) {
  if(i <= 20) {
    next
  }
  print(i)
}

Break statement
Break is used to exit a loop immediately, regardless of what iteration the loop may be on.

#Example 1
# stop the loop after 20 iterations
for(i in 1:100) {
  if(i > 20) {
    break
  }
  print(i)
}

In an infinite loop (for example, one written with repeat), the break statement is the only way to terminate the loop.

apply returns a vector, array, or list of values obtained by applying a function to the margins of an array or matrix.

lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
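
A short sketch of these functions on toy data may help. The matrix, the squaring function, and the variable names are made up for illustration; sapply() is included as the variant that simplifies the list result to a vector.

```r
# A 2 by 3 matrix, filled column-wise: rows are (1, 3, 5) and (2, 4, 6)
M <- matrix(1:6, nrow = 2, ncol = 3)

# apply() over margin 1 (rows) and margin 2 (columns)
row_sums <- apply(M, 1, sum)
col_sums <- apply(M, 2, sum)

# lapply() applies a function to each element and returns a list
squares_list <- lapply(1:3, function(v) v^2)

# sapply() does the same but simplifies the result to a vector
squares_vec <- sapply(1:3, function(v) v^2)

print(row_sums)
print(col_sums)
print(squares_vec)
```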

Writing functions in R
Functions are defined using the function() directive and are stored as R objects just like anything else. In particular, they are R objects of class “function”.

#Example 1
func1 <- function() {
  print("Hello, world!")
}
func1()

#Example 2
func2 <- function(num) {
  for(i in seq_len(num)) {
    print("Hello, world!")
  }
}
func2(3)

Function with return
return() is used to return a value immediately from a function.

#counting odd numbers from a vector
oddcount <- function(x) {
  return(sum(x %% 2 == 1))
}

R Programming Environment
An environment is a collection of objects.

Global variables
Global variables are variables that exist throughout the execution of a program. They can be accessed and changed from any part of the program.

Local variables
Local variables are variables that exist only within a certain part of a program, such as a function, and are released when the function call ends.
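
The difference can be seen in a small sketch. The names counter, add_local, and add_global are hypothetical; the superassignment operator <<- is one way (not the only one) to modify a variable outside the function.

```r
counter <- 0   # variable defined outside any function

add_local <- function() {
  # This creates a local variable named counter; the outer one is untouched
  counter <- counter + 1
  counter
}

add_global <- function() {
  # <<- assigns to the existing variable in the enclosing environment
  counter <<- counter + 1
  invisible(counter)
}

local_result <- add_local()
after_local  <- counter      # still 0: the change was local to the function

add_global()
after_global <- counter      # now 1: the outer variable was modified

print(c(local_result, after_local, after_global))
```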

Taking input from user

my.age <- readline(prompt="Enter age: ")
# Convert to integer
my.age <- as.integer(my.age)

A recursive function is a function that calls itself.

#Example 1
#Finding factorial - n! = n*(n-1)!
recursive.factorial <- function(x) {
  if (x == 0) return(1)
  else return(x * recursive.factorial(x - 1))
}

#Example 2
#The Fibonacci sequence is a series of numbers
#where a number is found by adding up the two numbers before it.
#Starting with 0 and 1, the sequence goes 0, 1, 1, 2, 3, 5, 8, 13, 21, 34.

recurse_fibonacci <- function(n) {
  if(n <= 1) return(n)
  else return(recurse_fibonacci(n - 1) + recurse_fibonacci(n - 2))
}

# print the first 12 terms of the sequence
for(i in 0:(12 - 1)) {
  print(recurse_fibonacci(i))
}

Algorithm analysis
An algorithm is evaluated on the following attributes:
◙ Shorter running time
◙ Lower memory utilization

Memory management in R
R allocates memory differently to different objects in its environment. Memory allocation can be determined using the object_size function from the pryr package.

System runtime in R
System runtime helps to compare different algorithms and pick the best one. The microbenchmark package on CRAN is used to evaluate the runtime of any expression, function, or piece of code with sub-millisecond accuracy.

Algorithm asymptotic analysis
Asymptotic notations are commonly used to describe the complexity of an algorithm's runtime. Big O (upper bound), Big Omega (lower bound), and Big Theta (tight bound, both upper and lower) are the simplest forms of functional equations, which represent an algorithm’s growth rate or its system runtime.

Assignment operator
Assigning an element (numeric, character, complex, or logical) to an object requires a constant amount of time. The asymptote (Big Theta notation) of the assignment operation is Θ(1).

Simple for loop
A for loop that performs a constant-time operation in each of its n iterations has a total cost of Θ(n).

Nested loop
Two nested loops that each run n times have a total cost of Θ(n²).
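
These growth rates can be illustrated by counting how many times the loop body runs. The helper count_ops below is a hypothetical name, and the sketch counts basic operations rather than measuring time.

```r
# Count how often the body of a single loop and of a nested loop executes
count_ops <- function(n) {
  single <- 0
  for (i in seq_len(n)) {
    single <- single + 1          # runs n times
  }

  nested <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      nested <- nested + 1        # runs n * n times
    }
  }

  c(single = single, nested = nested)
}

ops_10 <- count_ops(10)
ops_20 <- count_ops(20)
print(ops_10)
print(ops_20)
```

Doubling n doubles the count for the single loop but quadruples it for the nested loop.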

Writing sorting algorithms in R
Bubble sort
Bubble sort is a simple sorting algorithm. It is a comparison-based algorithm in which each pair of adjacent elements is compared and the elements are swapped if they are not in order.

bubblesort <- function(x) {
  if (length(x) < 2)
    return(x)
  # last is the last element to compare with
  for(last in length(x):2) {
    for(first in 1:(last - 1)) {
      if(x[first] > x[first + 1]) {
        # swap the pair
        save <- x[first]
        x[first] <- x[first + 1]
        x[first + 1] <- save
      }
    }
  }
  return(x)
}

Quick sort
Quick sort involves the following steps:
◙ Pick an element, called a pivot, from the array.
◙ Partitioning: reorder the array so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.
◙ Recursively apply the above steps to the sub-array of elements with smaller values and separately to the sub-array of elements with greater values.

quickSort <- function(vect) {
  if (length(vect) <= 1) {
    return(vect)
  }
  # Pick the first element of the vector as the pivot
  element <- vect[1]
  partition <- vect[-1]
  # Split the remaining values into those less than the pivot
  # and those greater than or equal to it.
  v1 <- partition[partition < element]
  v2 <- partition[partition >= element]
  # Recursively apply the steps to the smaller vectors.
  v1 <- quickSort(v1)
  v2 <- quickSort(v2)
  return(c(v1, element, v2))
}

Learning R (Introduction)

♦ The R programming language is generally used for statistical analysis, graphical representation, and reporting.

<- symbol is the assignment operator.
# symbol is used to comment a line

# x equals 1
x <- 1
# msg equals hello
msg <- "hello"

The basic arithmetic operations using R

# addition
18 + 12
# subtraction
18 - 12
# multiplication
18 * 12
# division    
18 / 12 
# just the integer part of the quotient
18 %/% 12
# just the remainder part (modulo)
18 %% 12
# exponentiation (raising to a power)
18 ^ 12
# natural log (base e)
log(18)
# base 10 logs
log10(18)
# square root
sqrt(18)
# absolute value
abs(18 / -12)

Defining vectors in R

# Method 1
# numeric vector
x <- c(0.5, 0.6)
# complex vector
x <- c(1+0i, 2+4i)    

# Method 2 - Use the vector() function to initialize vectors
x <- vector("numeric", length = 10)

# Method 3 - Creating vector of numerical numbers
# number sequence 1, 2, 3,.... 10
my.seq <- 1:10
# number sequence from 1 to 10 and interval is 1
seq(from=1, to=10, by=1)

# Some useful functions
# to check the type of data of my.seq
class(my.seq)
# to check whether my.seq is vector
is.vector(my.seq)
# it will divide each element of my.seq by 3
my.seq <- my.seq / 3

Defining matrices in R

# Method 1
# To create empty 2 by 3 matrix
m <- matrix(nrow = 2, ncol = 3)
# To check the dimensionality
dim(m)

# Method 2 
# Matrices are constructed column-wise
m <- matrix(1:6, nrow = 2, ncol = 3)

# Method 3
#Matrix created directly from vectors by adding a dimension attribute.
m <- 1:10
dim(m) <- c(2, 5)

# Method 4
x <- 1:3
y <- 10:12
# create matrix by column-binding
cbind(x, y)
# create matrix by row-binding 
rbind(x, y)

Defining lists  in R
Lists are a special type of vector that can contain elements of different classes.

# Method 1
# this list contains different class of elements
x <- list(1, "a", TRUE, 1 + 4i)

# Method 2
# create empty list with the length of 5
x <- vector("list", length = 5)

Defining factors in R
♦ Factors are used to represent categorical data and can be unordered or ordered.
♦ Factors are important in statistical modelling.

# Levels are put in alphabetical order
x <- factor(c("yes", "yes", "no", "yes", "no"))

# table() shows how many "yes" and "no" values there are
table(x)

# Levels are set explicitly instead of in alphabetical order
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))

Missing Values
Missing values are denoted by NA or NaN. NA is used to represent missing values, and NaN is used to represent undefined numerical results (for example, 0/0).

# a vector is defined with missing number
x <- c(1, 2, NA, 10, 3)    

# it will check which elements of this vector are NA
is.na(x)
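
The distinction between NA and NaN shows up in how is.na() and is.nan() treat them. The vectors below are made-up examples, and anyNA() and the na.rm argument are shown as common ways of handling missing values.

```r
x <- c(1, 2, NA, 10, 3)
y <- c(1, 2, NaN, NA, 4)

na_x   <- is.na(x)     # TRUE only at the NA position
any_na <- anyNA(x)     # TRUE if at least one value is missing

na_y  <- is.na(y)      # is.na() is TRUE for both NaN and NA
nan_y <- is.nan(y)     # is.nan() is TRUE only for NaN

# Many summary functions skip missing values when na.rm = TRUE
mean_x <- mean(x, na.rm = TRUE)

print(na_y)
print(nan_y)
print(mean_x)
```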

Data Frames
♦ Data frames are used to store tabular data in R.
♦ Data frames are represented as a special type of list where every element of the list has to have the same length.
♦ Unlike matrices, data frames can store different classes of objects in each column.
♦ In addition to column names, indicating the names of the variables or predictors, data frames have a special attribute called row.names which indicate information about each row of the data frame.

# Define a data frame in R
x <- data.frame(foo = 1:4, bar = c(T, T, F, F))

# To show number of rows
nrow(x)

# To show number of columns
ncol(x)

Managing Data Frames with the dplyr package
♦ The data frame is a key data structure in statistics and in R.
♦ The dplyr package is designed for filtering, re-ordering, and collapsing data frames.

#Installing dplyr package
install.packages("dplyr")

# load dplyr package into your R session
library(dplyr)

# Load chicago.rds file
chicago <- readRDS('C:\\Users\\ahilan\\Dropbox\\Elect_dept_UOJ\\Statistics using 

#To show the number of rows and columns of the data
dim(chicago)

# The select() function can be used to select columns of a data frame.

# Suppose we wanted to take the first 3 columns only.
subset <- select(chicago, city:dptp)

# if you wanted to keep every variable that ends with a “2”
subset <- select(chicago, ends_with("2"))

# You can also omit variables using the select()
select(chicago, -(city:dptp))

# If we wanted to keep every variable that starts with a “d”
subset <- select(chicago, starts_with("d"))

# The filter() function is used to extract subsets of rows from a data frame
# Extract the rows where PM2.5 is greater than 30
chic.f <- filter(chicago, pm25tmean2 > 30)    

# Extract the rows where PM2.5 is greater than 30 and temperature is greater
# than 80 degrees Fahrenheit.
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)

#The arrange() function is used to reorder rows of a data frame according to one
#of the variables/columns

# We can order the rows of the data frame by date, so that the first row is the
# earliest (oldest) observation and the last row is the latest (most recent)
chicago <- arrange(chicago, date)
# Use desc() to order the rows by date in descending order instead
chicago <- arrange(chicago, desc(date))

Logical operation in R

# define a vector using boolean values
a <- c(TRUE, FALSE, FALSE, TRUE)
# define a numeric vector
b <- c(13, 7, 8, 2)
# selects true value elements
b[a]   # 13 2
# inverse of a
!a
# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(a)  # 2

Built-in search function


Data input and output

#Changing directories
#Changing the default to the mydata folder in the C: drive
setwd("c:\\mydata")

#Save the objects for a future session
dump("usefuldata", "useful.R")

#Retrieve the saved objects
source("useful.R")

#Save all of the objects that you have created during a session
dump(list=objects(), "all.R")

#Redirecting R output to text file
# Create a file solarmean.txt for output
# Write mean value to solarmean.txt
# Close solarmean.txt; print new output to screen
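
The three steps above are typically done with sink(). In the sketch below the vector solar and its values are hypothetical stand-ins for the data, and the file is written to a temporary directory rather than the working directory.

```r
# Hypothetical data standing in for the solar measurements
solar <- c(11.2, 13.5, 9.8)

# Create a file solarmean.txt for output (in a temporary directory here)
out_file <- file.path(tempdir(), "solarmean.txt")
sink(out_file)

# Write the mean value to solarmean.txt
print(mean(solar))

# Close solarmean.txt; new output is printed to the screen again
sink()

# Read the file back to confirm what was written
written <- readLines(out_file)
print(written)
```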

Learning Data Science – part 1

Data matrix
Data can often be represented or abstracted as an n×d data matrix, with n rows and d columns, where rows correspond to entities in the dataset, and columns represent attributes or features or properties of interest.

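The n×d data matrix is given as
$$D = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}$$
where row i holds the d attribute values of the ith entity.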


Numeric Attributes – A numeric attribute is one that has a real-valued or integer-valued domain. For example, Age.

Categorical Attributes – A categorical attribute is one that has a set-valued domain composed of a set of symbols. For example, Sex is a categorical attribute.

Orthogonality – Two vectors a and b are said to be orthogonal if the angle between them is 90°, which implies that cos θ = 0. Equivalently, the dot product of a and b is 0.

Orthogonal Projection – In data mining, we may need to project a point or vector onto another vector to obtain a new point after a change of the basis vectors. Let a, b be two m-dimensional vectors. An orthogonal decomposition of the vector b in the direction of another vector a is given as
$$b = p + r, \qquad p = \left(\frac{a^T b}{a^T a}\right) a, \qquad r = b - p$$
where p is parallel to a and r is orthogonal to a.

The vector p is called the orthogonal projection or simply projection of b on the vector a.
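
As a small worked example, the projection can be computed with base R vector arithmetic. The sketch assumes the usual formula p = (aᵀb / aᵀa) a, and the vectors a and b are made-up values.

```r
a <- c(1, 1)
b <- c(3, 1)

# Projection of b onto a: scale a by (a . b) / (a . a)
p <- (sum(a * b) / sum(a * a)) * a

# Residual component of b that is orthogonal to a
r <- b - p

print(p)            # component of b along a
print(r)            # component of b perpendicular to a
print(sum(r * a))   # dot product of r and a
```

The dot product of r and a is 0, which confirms that the two components are orthogonal.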

Centered Data Matrix
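The centered data matrix is obtained by subtracting the mean from all the points,
$$Z = D - \mathbf{1} \cdot \mu^T$$
where μ is the vector of attribute means and 1 is the n-dimensional vector of ones.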


Linear Independence
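We say that the vectors v1, . . . ,vk are linearly dependent if at least one vector can be written as a linear combination of the others. Equivalently, they are linearly dependent if
$$c_1 v_1 + c_2 v_2 + \dots + c_k v_k = 0$$
where c1,c2, . . . ,ck are scalars, at least one of which is not zero.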

A set of vectors is linearly independent if none of them can be written as a linear combination of the other vectors in the set.

Dimension and Rank
The maximum number of linearly independent vectors in a matrix is equal to the number of non-zero rows in its row echelon matrix. Therefore, to find the rank of a matrix, we simply transform the matrix to its row echelon form and count the number of non-zero rows.
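
In R, the rank can be obtained without doing the row reduction by hand. The sketch below uses the QR decomposition from base R (a different route from the row echelon form described above), and the matrix D is a made-up example whose second row is twice the first.

```r
# Second row is 2 times the first row, so only two rows are independent
D <- matrix(c(1, 2, 3,
              2, 4, 6,
              1, 0, 1), nrow = 3, byrow = TRUE)

# qr() computes a QR decomposition and reports the numerical rank
rank_D <- qr(D)$rank
print(rank_D)

# The rank can never exceed min(number of rows, number of columns)
print(rank_D <= min(dim(D)))
```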

For the data matrix $D \in \mathbb{R}^{n \times d}$, we have rank(D) ≤ min(n,d), which follows from the fact that the column space can have dimension at most d, and the row space can have dimension at most n. If rank(D) < d, then the data points reside in a lower dimensional subspace of $\mathbb{R}^d$, and in this case rank(D) gives an indication about the intrinsic dimensionality of the data.

In fact, with dimensionality reduction methods it is often possible to approximate $D \in \mathbb{R}^{n \times d}$ with a derived data matrix $D' \in \mathbb{R}^{n \times k}$, which has much lower dimensionality, that is, k ≪ d. In this case k may reflect the “true” intrinsic dimensionality of the data.

We can estimate a parameter of the population by defining an appropriate sample statistic, which is defined as a function of the sample.

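The random sample of size m drawn from a (multivariate) random variable X is defined as a set of m independent and identically distributed random variables S1, S2, . . ., Sm, each with the same distribution as X.

A statistic $\hat{\theta}$ is a function of the sample, $\hat{\theta}: (S_1, S_2, \dots, S_m) \to \mathbb{R}$.

The statistic $\hat{\theta}$ is an estimate of the corresponding population parameter θ. If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.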

Univariate analysis
Univariate analysis focuses on a single attribute at a time. The data matrix is given as


X is assumed to be a random variable.

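Mean – The mean, also called the expected value, of a random variable X is the arithmetic average of the values of X. The mean of a discrete random variable with probability mass function f(x) is defined as
$$\mu = E[X] = \sum_x x f(x)$$

The expected value of a continuous random variable X with probability density function f(x) is defined as
$$\mu = E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$$

Sample Mean – The sample mean is a statistic, $\hat{\mu}: \{x_1, x_2, \dots, x_n\} \to \mathbb{R}$, which is defined as the average value of the $x_i$'s
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$$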


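A statistic is robust if it is not affected by extreme values (outliers) in the data.

Median – The median of a random variable is defined as the value m such that
$$P(X \le m) \ge \frac{1}{2} \quad \text{and} \quad P(X \ge m) \ge \frac{1}{2}$$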


The median is robust, as it is not affected very much by extreme values.

Measures of Dispersion
The measures of dispersion give an indication about the spread or variation in the values of a random variable.

The range of a random variable X is the difference between the maximum and minimum values of X, which is defined as
$$r = \max\{X\} - \min\{X\}$$

Interquartile Range
Quartiles divide the data into four equal parts. They correspond to the quantile values of 0.25, 0.5, 0.75, and 1.0. With $F^{-1}$ denoting the inverse cumulative distribution function (quantile function), the first quartile is the value $q_1 = F^{-1}(0.25)$. The second quartile is the same as the median value $q_2 = F^{-1}(0.5)$. The third quartile is $q_3 = F^{-1}(0.75)$.

Interquartile range (IQR) is defined as
$$IQR = q_3 - q_1 = F^{-1}(0.75) - F^{-1}(0.25)$$


Variance and Standard Deviation
The variance of a random variable X provides a measure of how much the values of X deviate from the mean or expected value of X. Variance is defined as
$$\sigma^2 = var(X) = E[(X - \mu)^2]$$

The standard deviation, σ, is defined as the square root of the variance, σ².

Sample variance is defined as
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

The standard score (z score) of a sample value $x_i$ is the number of standard deviations the value is away from the mean:
$$z_i = \frac{x_i - \hat{\mu}}{\hat{\sigma}}$$


Multivariate analysis
The d numeric attributes full data matrix is defined as


The multivariate mean vector is obtained by taking the mean of each attribute which is defined as


Covariance Matrix
The multivariate covariance information is captured by the d ×d (square) symmetric covariance matrix that gives the covariance for each pair of attributes:


The diagonal element $\sigma_i^2$ specifies the attribute variance for $X_i$, whereas the off-diagonal elements $\sigma_{ij} = \sigma_{ji}$ represent the covariance between attribute pairs $X_i$ and $X_j$.

Data Normalization
When analyzing two or more attributes it is often necessary to normalize the values of the attributes, especially in those cases where the values are vastly different in scale.

In range normalization, each value is scaled as follows,
$$x_i' = \frac{x_i - \min_i\{x_i\}}{\max_i\{x_i\} - \min_i\{x_i\}}$$

After transformation the new attribute takes on values in the range [0, 1].
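
Range normalization is a one-line function in R. The sketch below assumes the usual form (x − min) / (max − min); the function name range_normalize and the age values are illustrative.

```r
# Scale a numeric vector so that its minimum maps to 0 and its maximum to 1
range_normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

age <- c(20, 30, 40, 60)
age_scaled <- range_normalize(age)

print(age_scaled)
print(range(age_scaled))
```

Note that the function divides by zero if all values are equal, so a constant attribute would need separate handling.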

Standard Score Normalization
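In standard score normalization, also called z-normalization, each value is replaced by its z-score
$$x_i' = \frac{x_i - \hat{\mu}}{\hat{\sigma}}$$
where $\hat{\mu}$ is the sample mean and $\hat{\sigma}$ is the sample standard deviation of the attribute.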


Univariate Normal Distribution
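If a random variable X has a normal distribution, with the parameters mean µ and variance σ², the probability density function of X is given as
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$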


Probability Mass
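Given an interval [a, b] the probability mass of the normal distribution within that interval is given as
$$P(a \le X \le b) = \int_a^b f(x \mid \mu, \sigma^2)\,dx$$

The probability mass concentrated within k standard deviations from the mean is given as
$$P(\mu - k\sigma \le X \le \mu + k\sigma) = 2\Phi(k) - 1$$
where Φ is the cumulative distribution function of the standard normal distribution.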
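
The probability mass within k standard deviations can be evaluated with pnorm(). The sketch below uses the standard normal distribution, which gives the same answer for any µ and σ; the variable names are illustrative.

```r
# Probability mass within k = 1, 2, 3 standard deviations of the mean
k <- 1:3
mass <- pnorm(k) - pnorm(-k)

# Equivalent form using the symmetry of the normal distribution
mass_sym <- 2 * pnorm(k) - 1

print(round(mass, 4))
```

The three values are approximately 0.6827, 0.9545, and 0.9973.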


Normal distribution with different variances


Multivariate Normal Distribution
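Given the d-dimensional vector random variable X = (X1,X2, . . . ,Xd), we say that X has a multivariate normal distribution, with the parameters mean µ and covariance matrix Σ, if its joint multivariate probability density function is given as
$$f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{(x - \mu)^T \Sigma^{-1} (x - \mu)}{2}\right)$$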


An example of bivariate normal density and contours is shown as follows,


Deep learning part 2 – Recurrent neural networks (RNN)

Feedforward networks were covered in detail in the previous post; this post goes through recurrent networks.

Recurrent networks are used to learn patterns in sequences of data, such as text, handwriting, the spoken word, and time series data. They can also be used for vision applications where images are decomposed into a series of patches and treated as a sequence.

Recurrent networks have two sources of information: the current input and what was perceived one step back in time. As the output depends on both current and previous inputs, it is often said that recurrent networks have memory. Recurrent networks are differentiated from feedforward networks by feedback loops. A recurrent net can be seen as a (very deep) feedforward net with shared weights.

A simple RNN architecture is shown in the diagram below.
(Figure: RNN1)

✓ x0, x1, …, xT are the inputs, and h0, h1, …, hT are the hidden states of the recurrent network.

✓ The hidden states are recursively estimated as below,
$$h_t = f(W x_t + U h_{t-1}), \qquad y_t = g(V h_t)$$
where W is the input-to-hidden weights, U is the hidden-to-hidden weights, V is the hidden-to-label weights, and f and g are activation functions.
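
The recursion can be sketched in a few lines of R. This is only an illustration under stated assumptions: the update is taken to be h_t = tanh(W x_t + U h_{t-1}) with output V h_t, the tanh activation is an assumption, and the function name rnn_forward, the dimensions, and the random weights are all made up.

```r
# Forward pass of a simple recurrent network over a sequence of inputs
rnn_forward <- function(x_seq, W, U, V, h0) {
  h <- h0
  outputs <- vector("list", length(x_seq))
  for (t in seq_along(x_seq)) {
    # New hidden state depends on the current input and the previous state
    h <- tanh(W %*% x_seq[[t]] + U %*% h)
    # Output (label score) at time t
    outputs[[t]] <- V %*% h
  }
  list(h = h, outputs = outputs)
}

set.seed(1)
W <- matrix(rnorm(6), nrow = 3, ncol = 2)   # input-to-hidden weights
U <- matrix(rnorm(9), nrow = 3, ncol = 3)   # hidden-to-hidden weights
V <- matrix(rnorm(3), nrow = 1, ncol = 3)   # hidden-to-label weights

# A toy sequence of three 2-dimensional inputs
x_seq <- list(c(1, 0), c(0, 1), c(1, 1))
res <- rnn_forward(x_seq, W, U, V, h0 = rep(0, 3))

print(res$h)
```

The same W, U, and V are reused at every time step, which is the weight sharing that makes the unfolded network equivalent to a deep feedforward net.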

✓ The weights are estimated by minimizing a loss function over the training sequences.
✓ The backpropagation through time (BPTT) algorithm is used to estimate the weights.

Backpropagation through time (BPTT)
✓ Firstly, the RNN is unfolded in time.
✓ A deep neural network is obtained with shared weights W and U.
✓ The unfolded RNN is trained using normal backpropagation, which is explained in the previous post.
✓ In practice, the number of unfolding steps is limited to 5–10.
✓ It is computationally more efficient to propagate gradients after a few training examples (batch training).

✶ As we propagate the gradients back in time, their magnitude usually decreases quickly. This is called the vanishing gradient problem.

✶ Sometimes, the gradients start to increase exponentially during backpropagation through the recurrent weights. This happens rarely. The huge gradients lead to big changes in the weights, and thus destroy what has been learned so far. This is one reason why RNNs can be unstable. Clipping or normalizing the values of the gradients avoids huge changes in the weights.

✶ In practice, learning long-term dependencies in data is difficult for a simple RNN architecture. Special RNN architectures such as long short-term memory (LSTM) address this problem.

Deep learning basics – part 1

A typical neural network structure is shown in the figure below.

Neural networks are typically organized in layers. Layers are made up of a number of interconnected nodes which contain an activation function. Each node in a layer is connected in the forward direction to every unit in the next layer. A network usually has an input layer, one or more hidden layers, and an output layer.

Patterns are presented to the network via the input layer. It communicates to one or more hidden layers where the actual processing is done via a system of weighted connections. The hidden layers then link to an output layer.

✓ Network representation
The equation of a single-layer neural network in matrix form is as follows,
y = Wx
It can be also written in terms of individual components,
$y_k = \sum_{i=0}^{d} w_{ki} x_i$

Input vector $x = (x_0, x_1, \dots, x_d)^T$
Output vector $y = (y_1, \dots, y_K)^T$
Weight matrix W: $w_{ki}$ is the weight from input $x_i$ to output $y_k$

✓ Weight
A node usually receives many simultaneous inputs, and each input has its own relative weight. Some inputs are made more important than others. Weights are a measure of an input’s connection strength. These strengths can be modified in response to various training sets and according to a network’s specific topology or its learning rules.

✓ Summation function
The input and weighting coefficients can be combined in many different ways before passing on to the transfer function. The summation function can be sum, max, min, average, or, and. The specific function for combining neural inputs is determined by the chosen network architecture.

✓ Transfer function
The result of the summation function is transformed to a working output through the transfer function. The transfer function can be hyperbolic tangent, linear, sigmoid, sin.

✓ Scaling and limiting
After the transfer function, the result can pass through scaling and limiting. Scaling simply multiplies the transfer value by a scale factor and then adds an offset. Limiting ensures that the scaled result does not exceed an upper or lower bound.

✓ Output function
Normally, the output is directly equivalent to the transfer function’s result. Some network topologies modify the transfer result to incorporate competition among neighboring processing elements.

✓ Error function/ Backpropagation
The values from the output layer are compared with the expected values and an error is computed for each output unit. The weights connected to the output units are adjusted to reduce those errors. The error estimates of the output units are then used to derive error estimates for the units in the hidden layers. The weights are adjusted to reduce the errors. Finally, the errors are propagated back to the connections stemming from the input units.

✓ Learning function
The purpose of the learning function is to modify the weights on the inputs of each processing element according to some neural based algorithm.

✓ Sigmoid transfer/ activation function
The transfer function for neural networks must be differentiable, as the derivative of the transfer function is required for computation of the local gradient. Sigmoid is one of the most common forms of transfer function used in the construction of artificial neural networks. It’s represented by the following equation,
$$f(x) = \frac{1}{1 + e^{-x}}$$
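
The sigmoid and its derivative are easy to write down in R. The sketch assumes the standard logistic form 1 / (1 + exp(-x)); the function names sigmoid and sigmoid_grad are illustrative.

```r
# Logistic sigmoid: maps any real number into the interval (0, 1)
sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

# Its derivative can be written in terms of the sigmoid itself
sigmoid_grad <- function(x) {
  s <- sigmoid(x)
  s * (1 - s)
}

x <- c(-2, 0, 2)
print(sigmoid(x))
print(sigmoid_grad(x))
```

The derivative is largest at x = 0 (value 0.25) and shrinks towards 0 for large positive or negative inputs, which is one source of small gradients during training.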

✓ There are different types of artificial neural networks,
(1) Single layer feed forward network
(2) Multilayer feed forward network
(3) Recurrent network
(4) ….. etc

Neural networks can be trained in a supervised or an unsupervised manner.

✓ Supervised training – In supervised training, both the inputs and the outputs are provided and the network processes the inputs and compares resulting outputs against the expected outputs. Errors are then propagated back through the system, causing the system to adjust the weights.

✓ Unsupervised training – In this type, the network is provided with inputs but not with desired outputs.

✓ Learning rates
The learning rate depends upon several controllable factors. A slower rate means more time is spent in producing an adequately trained system. With faster learning rates, the network may not be able to make the fine discriminations that are possible with a system learning slowly. The learning rate is positive and typically lies between 0 and 1.

✓ Learning laws
✶ Hebb’s rule – If a neuron receives an input from another neuron and if both are highly active (same sign), the weight between the two neurons should be strengthened.
✶ Hopfield law – If the desired output and the input are both active or both inactive, increment the connection weight by the learning rate, otherwise decrement the weight by the learning rate.
✶ The delta rule – This rule is based on the simple idea of continuously modifying the strengths of the input connections to reduce the difference between the desired output value and the actual output of a processing element.
✶ The Gradient Descent rule – This is similar to delta Rule. The derivative of the transfer function is still used to modify the delta error before it is applied to the connection weights. However, an additional proportional constant tied to the learning rate is appended to the final modifying factor acting upon the weight.
✶ Kohonen’s law – The processing elements compete for the opportunity to learn or update their weights. The element with largest output is declared the winner and has the capability of inhibiting its competitors as well as exciting its neighbors. Only the winner is permitted an output and only the winner plus its neighbors are allowed to adjust their connection weights.

✓ Backpropagation for feed forward networks
The backpropagation algorithm is the most commonly used training method for feed forward networks. The main objective in neural network is to find an optimal set of weight parameters, w. The parameters are trained through the training process.

✶ Consider a multi-layer perceptron with k hidden layers. Layer 0 is the input layer and layer k+1 is the output layer.
✶ The weight connecting the jth unit in layer m to the ith unit in layer m+1 is denoted by $w_{ij}^{m}$.
✶ The input data for training a feedforward network is $u(n) = (x_1^0(n), \dots, x_K^0(n))$.
✶ The desired output is $d(n) = (d_1^{k+1}(n), \dots, d_L^{k+1}(n))$.
✶ The activation of non-input units is $x_i^{m+1}(n) = f\left(\sum_j w_{ij}^{m} x_j^{m}(n)\right)$.
✶ The network response obtained in the output layer is $y(n) = (x_1^{k+1}(n), \dots, x_L^{k+1}(n))$.

The difference between the desired output and the neural network output is known as the error, and it’s quantified as follows,
$$E = \frac{1}{2} \sum_n \lVert d(n) - y(n) \rVert^2$$
The weights, w, are estimated by minimizing the above objective function. This is done by incrementally changing the weights along the direction of the negative error gradient with respect to the weights.
The new weight is
$$w_{ij}^{m} \leftarrow w_{ij}^{m} - \gamma \frac{\partial E}{\partial w_{ij}^{m}}$$
where γ is the learning rate. The weights are initialized with random numbers. In batch learning mode, new weights are computed after presenting all training samples. One such pass through all samples is called an epoch.

The procedure for one epoch of batch processing is given below.
✶ For each sample n, compute the activations of the internal and output units (forward pass).

✶ The error propagation term is estimated backward through m = k+1, k, …, 1, for each unit. The error propagation term for the output layer is,
$$\delta_i^{k+1}(n) = \left(d_i(n) - y_i(n)\right) f'\left(z_i^{k+1}(n)\right)$$
The error propagation term of a hidden layer is estimated using the error propagation terms of the layer above it as below,
$$\delta_j^{m}(n) = f'\left(z_j^{m}(n)\right) \sum_i \delta_i^{m+1}(n)\, w_{ij}^{m}$$
where the internal state of unit $x_i^{m}$ is
$$z_i^{m}(n) = \sum_j w_{ij}^{m-1} x_j^{m-1}(n)$$

✶ Update the weight parameters as below,
$$w_{ij}^{m-1} \leftarrow w_{ij}^{m-1} + \gamma \sum_n \delta_i^{m}(n)\, x_j^{m-1}(n)$$
After every such epoch, the error is computed, and the weight updates stop when the error falls below a predetermined threshold.


Factor analysis modelling

Factor analysis is a statistical method. It is used to describe variability among correlated observed variables in terms of a potentially lower number of unobserved variables.

The generative model is given by
y = μ + Λx +ε

y is P ×1 dimension observed variable
μ is P ×1 dimension mean vector
Λ is P × R dimension factor loading matrix
x is R×1 dimension unobserved variable (or latent variable)
ε is P ×1 dimension error term

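We assume that
$E(x) = E(\varepsilon) = 0$
$E(xx^T) = I$
$p(x) = N(x \mid 0, I)$
$Cov(\varepsilon) = \Psi$, where Ψ is a diagonal matrix

$E(y) = \mu$
$\Sigma = E\left((y - \mu)(y - \mu)^T\right) = \Lambda\Lambda^T + \Psi$
$p(y \mid \theta) = N(y \mid \mu, \Lambda\Lambda^T + \Psi)$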

Mean μ is estimated using the observed variable.

The model parameters Λ,Ψ are estimated using expectation maximization (EM) algorithm.

Initially, model parameters Λ,Ψ are selected with random values and iteratively updated with EM algorithm.

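In the E step, the posterior $p(x_n \mid y_n, \theta^t)$ is defined as below,
$q^{t+1} = p(x_n \mid y_n, \theta^t) = N(x_n \mid m_n, V_n)$
$V_n = (I + \Lambda^T \Psi^{-1} \Lambda)^{-1}$
$m_n = V_n \Lambda^T \Psi^{-1} (y_n - \mu)$

In the M step, the model parameters $\Lambda^{t+1}$, $\Psi^{t+1}$ are updated with the posterior estimates as below,
$\Lambda^{t+1} = \left(\sum_n (y_n - \mu)\, m_n^T\right) \left(\sum_n (V_n + m_n m_n^T)\right)^{-1}$
$\Psi^{t+1} = \frac{1}{N} \mathrm{diag}\left(\sum_n (y_n - \mu)(y_n - \mu)^T - \Lambda^{t+1} \sum_n m_n (y_n - \mu)^T\right)$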

In each iteration, the likelihood L(Λ,Ψ) is evaluated in order to check whether the model has converged. Up to an additive constant, it is computed as below,
$L(\Lambda, \Psi) = -\frac{N}{2} \log|\Psi| - \frac{N}{2} \mathrm{tr}(S \Psi^{-1})$

where $S = \frac{1}{N} \sum_n (y_n - \mu - \Lambda x_n)(y_n - \mu - \Lambda x_n)^T$

The likelihood values are plotted against the iterations, and it can be observed that the model converges as the number of iterations increases.
(Figure: likelihood against the number of EM iterations)

The main applications of factor analysis are to reduce the number of variables and to detect structure in the relationships between variables. Factor analysis is commonly used to model the variability in speaker and face recognition applications.