Learning R (Probability distributions)

Binomial distribution
The binomial distribution is a discrete probability distribution. It describes the outcome of n independent trials in an experiment. Each trial is assumed to have only two outcomes, either success or failure. If the probability of a successful trial is p, then the probability of having x successful outcomes in an experiment of n independent trials is as follows:

$$f(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \qquad x = 0, 1, \ldots, n$$

The binomial distribution describes the behavior of a count variable X if the following conditions apply:
1. The number of observations n is fixed.
2. Each observation is independent.
3. Each observation represents one of two outcomes (“success” or “failure”).
4. The probability of “success” p is the same for each observation.
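As a quick sanity check, the binomial probability C(n, x) p^x (1 - p)^(n - x) can be computed by hand and compared with R's built-in dbinom(). This is only an illustrative sketch; the names by_hand and built_in are made up for the example.

```r
# Binomial probability from the definition, then from dbinom().
n <- 7      # number of trials
x <- 5      # number of successes
p <- 0.5    # probability of success in each trial

by_hand  <- choose(n, x) * p^x * (1 - p)^(n - x)
built_in <- dbinom(x, size = n, prob = p)

print(by_hand)
print(built_in)
print(all.equal(by_hand, built_in))
```

Both values should agree, since dbinom() evaluates the same probability mass function.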

#R syntax for computing the probability
dbinom(x, size, prob)
#Example 1:
#Compute the probability of getting five heads in seven tosses of a fair coin.
dbinom(x = 5, size = 7, prob = 0.5)

#Example 2:
#Compute the probability of getting four or fewer heads in seven tosses of a fair coin.
#Method 1 – Compute the individual probabilities and add them together
dbinom(x = 0, size = 7, prob = 0.5) +
  dbinom(x = 1, size = 7, prob = 0.5) +
  dbinom(x = 2, size = 7, prob = 0.5) +
  dbinom(x = 3, size = 7, prob = 0.5) +
  dbinom(x = 4, size = 7, prob = 0.5)

#Method 2 – Compute the cumulative probability using the built-in R function
pbinom(4, 7, 0.5)

#Example 3:
#Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.
pbinom(4, 12, 0.2)

◙ The expected value (or mean) of a binomial random variable is np
◙ The variance of a binomial distribution is np(1 − p)
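The mean and variance can be checked numerically by summing over the probability mass function; they should come out as n*p and n*p*(1 - p). This is a minimal sketch assuming n = 12 and p = 0.2 (the quiz values from Example 3); the names pmf, mean_x and var_x are illustrative.

```r
n <- 12     # number of trials
p <- 0.2    # probability of success

x   <- 0:n                            # every possible number of successes
pmf <- dbinom(x, size = n, prob = p)  # probability of each value

mean_x <- sum(x * pmf)                # expected value from the definition
var_x  <- sum((x - mean_x)^2 * pmf)   # variance from the definition

print(c(mean_x, n * p))
print(c(var_x, n * p * (1 - p)))
```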

#Visualizing the binomial distribution using a histogram
hist(rbinom(500, size = 7, prob = 0.5), col = "grey", main = "Binomial")

Poisson distribution
The Poisson distribution is the probability distribution of independent event occurrences in an interval. If λ is the mean occurrence per interval, then the probability of having x occurrences within a given interval is:

$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \ldots$$

◙ The mean and variance of a Poisson random variable are both equal to λ.
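The Poisson probability λ^x e^(-λ) / x! and the claim that the mean and variance both equal λ can be checked numerically. The sketch below assumes λ = 3 and truncates the sum at x = 50, where the remaining probability is negligible; all variable names are illustrative.

```r
lambda <- 3
x      <- 0:50                                   # truncated support

pmf     <- dpois(x, lambda)                      # built-in probabilities
by_hand <- lambda^x * exp(-lambda) / factorial(x)

mean_x <- sum(x * pmf)                           # should be close to lambda
var_x  <- sum((x - mean_x)^2 * pmf)              # should be close to lambda

print(all.equal(pmf, by_hand))
print(c(mean_x, var_x))
```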
#R syntax for computing Poisson probabilities
dpois(x, lambda)

#Example 1:
#According to the Poisson model, the probability of three arrivals at an automatic bank teller in the next minute, where the average number of arrivals per minute is 0.6, is
dpois(x = 3, lambda = 0.6)

#Cumulative probability is computed using
ppois(q, lambda)

#We can generate Poisson random numbers using
rpois(n, lambda)

#Example 2:
#Suppose traffic accidents occur at an intersection with a mean rate of 3 per year. Simulate the annual number of accidents for a 12-year period, assuming a Poisson model.
rpois(12, 3)

#Example 3:
#If there are eleven cars crossing a bridge per minute on average, find the probability of having sixteen or more cars crossing the bridge in a particular minute.
#P(X >= 16) = 1 - P(X <= 15), which is the upper tail above 15
ppois(15, lambda = 11, lower.tail = FALSE)

#Visualizing the Poisson distribution using a histogram
hist(rpois(500, 2), col = "grey", main = "Poisson")

Exponential distribution
The exponential distribution describes the arrival time of a randomly recurring independent event sequence. If μ is the mean waiting time for the next event recurrence, its probability density function is

$$f(x) = \frac{1}{\mu} e^{-x/\mu}, \qquad x \ge 0$$

where the mean μ is the reciprocal of the rate parameter.
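The relationship between the mean and the rate can be illustrated numerically. The sketch below assumes a rate of 3, so the mean is 1/3. It compares the closed-form cumulative probability 1 - exp(-rate * q) with pexp(), and recovers the mean by numerical integration. The variable names are illustrative.

```r
rate <- 3
mu   <- 1 / rate          # mean waiting time
q    <- 1

# Cumulative probability P(X <= q), by hand and with pexp()
cdf_by_hand  <- 1 - exp(-rate * q)
cdf_built_in <- pexp(q, rate = rate)

# Density at q written in terms of the mean, compared with dexp()
pdf_by_hand  <- (1 / mu) * exp(-q / mu)
pdf_built_in <- dexp(q, rate = rate)

# Mean obtained by integrating x * f(x) over [0, Inf)
mean_numeric <- integrate(function(t) t * dexp(t, rate = rate), 0, Inf)$value

print(c(cdf_by_hand, cdf_built_in))
print(c(pdf_by_hand, pdf_built_in))
print(c(mean_numeric, mu))
```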
#R syntax for computing the cumulative probability of the exponential distribution
pexp(q, rate)

#Example 1:
#Suppose the service time at a bank teller can be modeled as an exponential random variable with a rate of 3 per minute. Then the probability of a customer being served in less than 1 minute is
pexp(1, rate = 3)

#The rexp() function can be used to generate n random exponential variates.
rexp(n, rate)

#Example 2:
#A bank has a single teller who is facing a queue of 8 customers. The time for each customer to be served is exponentially distributed with rate 2 per minute. We can simulate the service times (in minutes) for the 8 customers.
servicetimes <- rexp(8, rate = 2)

Normal random distribution
A normal random variable X has a probability density function given by

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

where µ is the expected value of X, and σ² denotes the variance of X.

◙ The standard normal random variable has mean µ = 0 and standard deviation σ = 1.

◙ The normal density function can be evaluated using dnorm().
◙ The distribution function can be evaluated using pnorm().
◙ Normal pseudorandom variables can be generated using rnorm().
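The three functions can be tried together on the standard normal distribution. This is a small illustrative sketch; the seed and the variable names are arbitrary choices made for the example.

```r
# Density at 0: equals 1 / sqrt(2 * pi) for the standard normal
d0 <- dnorm(0)

# Distribution function: P(Z <= 1.96) is roughly 0.975
p196 <- pnorm(1.96)

# Standardizing gives the same probability as passing mean and sd directly
p_direct <- pnorm(80, mean = 75, sd = 15)
p_scaled <- pnorm((80 - 75) / 15)

# Pseudorandom draws; the sample mean should be close to 0
set.seed(1)
z <- rnorm(10000)

print(d0)
print(p196)
print(c(p_direct, p_scaled))
print(mean(z))
```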
#Example 1:
#Assume that the test scores of a college entrance exam fit a normal distribution. Furthermore, the mean test score is 75, and the standard deviation is 15. What is the percentage of students scoring 80 or more in the exam?
pnorm(80, mean=75, sd=15, lower.tail=FALSE)

#Visualizing the normal distribution using a histogram
x<-rnorm(1000, 5, 0.2)
hist(x,probability = TRUE)
lines(density(x))

Chi-squared Distribution
If X1, X2, …, Xm are m independent random variables having the standard normal distribution, then the following quantity follows a Chi-Squared distribution with m degrees of freedom:

$$V = X_1^2 + X_2^2 + \cdots + X_m^2 \sim \chi^2_{(m)}$$

Its mean is m, and its variance is 2m.
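The stated mean and variance can be illustrated by simulation: draw m independent standard normal values, square them, add them up, and repeat many times. The sketch below assumes m = 6 and 20000 repetitions; the seed and the variable names are arbitrary, and the simulated values are only approximately equal to m and 2m.

```r
set.seed(42)
m     <- 6        # degrees of freedom
n_sim <- 20000    # number of simulated sums

# Each row holds m independent standard normal draws
z <- matrix(rnorm(n_sim * m), nrow = n_sim, ncol = m)

# Sum of squares across each row
v <- rowSums(z^2)

print(mean(v))    # should be close to m
print(var(v))     # should be close to 2 * m
```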

#Example 1:
#Find the 97th percentile of the Chi-Squared distribution with 6 degrees of freedom.
qchisq(.97, df=6)        # 6 degrees of freedom

Examining the distribution of a set of data
The distribution can be examined in a number of ways.
# 1. Generating a summary of the data
summary(faithful$eruptions)

# 2. Drawing the histogram
eruptions <- faithful$eruptions
hist(eruptions, seq(1.6, 5.2, 0.2), prob=TRUE)
lines(density(eruptions, bw=0.1))

Learning R (Programming Basics)

For loop in R

```
# Example 1
# print 1, 2, 3...10
for (i in 1:10) {
  print(i)
}

# Example 2
# print a, b, c, d
x <- c("a", "b", "c", "d")
for (i in seq_along(x)) {
  print(x[i])
}
```

Nested for loop
A nested for loop is a for loop placed inside another for loop.

```
# Example
M <- matrix(1:9, ncol = 3)
total <- 0
# Visit every cell of M and keep a running total
for (i in seq_len(nrow(M))) {
  for (j in seq_len(ncol(M))) {
    total <- total + M[i, j]
    print(total)
  }
}
```

While loop
A while loop runs for as long as its condition is true.

```
# Example:
i <- 1
# The loop terminates once i >= 8
while (i < 8) {
  print(i)
  i <- i + 1
}
```

If statement
This structure allows you to test a condition and act on it depending on whether
it’s true or false.

```
# Example
# If the random number is greater than 3, print 10. Otherwise print 0.
x <- runif(1, 0, 10)
print(x)
if (x > 3) {
  print(10)
} else {
  print(0)
}
```

ifelse statement
This structure is a vectorized form of the if else condition: the test is applied to each element of a vector individually.

```
# Example
# ifelse() checks each element of the vector and labels it
# as an odd number or as an even number.
x <- 1:8
ifelse(x %% 2 == 1, paste0(x, " : odd number"), paste0(x, " : even number"))
```

Next statement
The next statement is used to skip the rest of the current iteration and move on to the next one.

```
# Example
# The next statement skips the first 20 iterations,
# so only elements 21 to 100 of x are filled in.
x <- numeric(100)
for (i in 1:100) {
  if (i <= 20) {
    next
  }
  x[i] <- i
}
```

Break statement
Break is used to exit a loop immediately, regardless of what iteration the loop may be on.

```
# Example
# Stop the loop after 20 iterations
for (i in 1:100) {
  print(i)
  if (i >= 20) {
    break
  }
}
```

Repeat
A repeat loop is an infinite loop; a break statement is used to exit it.
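A minimal sketch of a repeat loop, assuming we want to print the numbers 1 to 5 and then stop. Without the break statement, the loop would never end.

```r
i <- 1
repeat {
  print(i)
  i <- i + 1
  # Leave the loop once i has passed 5
  if (i > 5) {
    break
  }
}
```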

apply
Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
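A minimal sketch of apply() on a small matrix, where margin 1 means rows and margin 2 means columns. The matrix M and the result names are illustrative.

```r
# 2 x 3 matrix filled column-wise:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
M <- matrix(1:6, nrow = 2, ncol = 3)

row_sums <- apply(M, 1, sum)   # apply sum() to each row
col_sums <- apply(M, 2, sum)   # apply sum() to each column

print(row_sums)
print(col_sums)
```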

lapply
lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
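A short sketch of lapply() on a list of numeric vectors. The list X and its element names are made up for the example. sapply(), a base R companion that simplifies the result to a vector where possible, is shown for comparison.

```r
X <- list(a = 1:5, b = c(2.5, 3.5), c = 10)

# Apply mean() to each element; the result is a list of the same length as X
means <- lapply(X, mean)
str(means)

# sapply() does the same work but returns a named numeric vector here
means_vec <- sapply(X, mean)
print(means_vec)
```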

Writing functions in R
Functions are defined using the function() directive and are stored as R objects just like anything else. In particular, they are R objects of class “function”.

```
# Example 1
func1 <- function() {
  print("Hello, world!")
}

func1()

# Example 2
func2 <- function(num) {
  for (i in seq_len(num)) {
    print("Hello, world!")
  }
}
func2(4)
```

Function with return
return() returns a value from a function immediately; any code after it in the function body is not run.

```
# Example
# Counting the odd numbers in a vector
oddcount <- function(x) {
  return(length(which(x %% 2 == 1)))
}
```

R Programming Environment
An environment is a collection of objects, such as functions and variables.

Global variables
Global variables are variables that exist throughout the execution of a program. They can be accessed and changed from any part of the program.

Local variables
Local variables are variables that exist only within a certain part of a program, such as a function, and are released when the function call ends.
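The difference can be seen in a small sketch. All names here (counter, scratch_value, increment_local, increment_global) are hypothetical. Note that in R a plain `<-` inside a function creates a local variable, so changing a global variable from inside a function needs the `<<-` operator.

```r
counter <- 0                      # global variable

increment_local <- function() {
  scratch_value <- 99             # local: exists only during the call
  counter <- counter + 1          # creates a local copy named counter
  counter
}

local_result <- increment_local()
print(local_result)               # value computed inside the function
print(counter)                    # the global counter is unchanged
print(exists("scratch_value"))    # the local variable is gone

increment_global <- function() {
  counter <<- counter + 1         # <<- assigns to the existing outer variable
  invisible(counter)
}

increment_global()
print(counter)                    # the global counter has changed
```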

Taking input from user

```
# Example
# Read the user's input as a character string
my.age <- readline(prompt = "Enter age: ")
# Convert to integer
my.age <- as.integer(my.age)
```

Recursion
A recursive function is a function that calls itself.

```
# Example 1
# Finding the factorial: n! = n * (n-1)!
recursive.factorial <- function(x) {
  if (x == 0) return(1)
  else return(x * recursive.factorial(x - 1))
}

# Example 2
# The Fibonacci sequence is a series of numbers
# where a number is found by adding up the two numbers before it.
# Starting with 0 and 1, the sequence goes 0, 1, 1, 2, 3, 5, 8, 13, 21, 34.

recurse_fibonacci <- function(n) {
  if (n <= 1) return(n)
  else return(recurse_fibonacci(n - 1) + recurse_fibonacci(n - 2))
}

# Print the first 12 terms
for (i in 0:(12 - 1)) {
  print(recurse_fibonacci(i))
}
```

Algorithm analysis
An algorithm is evaluated based on the following attributes:
◙ Shorter running time
◙ Lower memory utilization

Memory management in R
R allocates memory differently to different objects in its environment. Memory allocation can be determined using the object_size function from the pryr package.
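As a rough illustration that does not need an extra package, the sketch below uses base R's object.size() from the utils package as a stand-in for pryr's object_size. The two functions can report slightly different numbers. The objects being measured are made up for the example.

```r
# A few objects of different types and lengths
objects <- list(
  small_numeric = numeric(10),
  large_numeric = numeric(1000),
  characters    = as.character(1:1000),
  a_list        = as.list(1:1000)
)

# Approximate memory footprint of each object, in bytes
sizes <- sapply(objects, function(obj) as.numeric(utils::object.size(obj)))
print(sizes)
```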

System runtime in R
System runtime helps to compare different algorithms and pick the best one. The microbenchmark package on CRAN is used to evaluate the runtime of any expression, function, or piece of code with sub-millisecond accuracy.
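For a coarse comparison without installing anything, base R's system.time() can be used instead of microbenchmark. It is much less precise, so treat the sketch below as a rough illustration only. The helper loop_sum() and the vector size are arbitrary choices.

```r
# Sum a vector with an explicit loop
loop_sum <- function(v) {
  total <- 0
  for (value in v) {
    total <- total + value
  }
  total
}

v <- as.numeric(seq_len(1e5))

# Time the explicit loop and the built-in vectorized sum()
t_loop <- system.time(result_loop <- loop_sum(v))
t_vec  <- system.time(result_vec <- sum(v))

print(t_loop["elapsed"])
print(t_vec["elapsed"])
print(result_loop == result_vec)
```

Both approaches return the same total; only the elapsed times differ, and those vary from machine to machine.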

Algorithm asymptotic analysis
Asymptotic notations are commonly used to describe how the runtime of an algorithm grows with the size of its input. Big O (upper bound), Big Omega (lower bound), and Big Theta (tight bound, both upper and lower) are the simplest forms of functional equations, which represent an algorithm’s growth rate or its system runtime.

Assignment operator
Assigning an element (numeric, character, complex, or logical) to an object requires a constant amount of time. The asymptote (Big Theta notation) of the assignment operation is θ(1).

Simple for loop
The total cost of a simple for loop that does a constant amount of work in each of its n iterations is θ(n).

Nested loop
The total cost of a nested loop, where both the outer and inner loops run n times, is θ(n²).
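The growth rates of a single loop and a nested loop can be made concrete by counting how many times the loop body runs. The helper functions count_simple() and count_nested() below are hypothetical and exist only for this illustration.

```r
# Body of a single loop runs n times
count_simple <- function(n) {
  ops <- 0
  for (i in seq_len(n)) {
    ops <- ops + 1
  }
  ops
}

# Body of a doubly nested loop runs n * n times
count_nested <- function(n) {
  ops <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      ops <- ops + 1
    }
  }
  ops
}

n_values <- c(10, 20, 40)
simple <- sapply(n_values, count_simple)
nested <- sapply(n_values, count_nested)

print(simple)   # doubles when n doubles
print(nested)   # quadruples when n doubles
```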

Writing sorting algorithms in R
Bubble sort
Bubble sort is a simple, comparison-based sorting algorithm in which each pair of adjacent elements is compared and the elements are swapped if they are not in order.

```
bubblesort <- function(x) {
  if (length(x) < 2)
    return(x)
  # last is the last element to compare with
  for (last in length(x):2) {
    for (first in 1:(last - 1)) {
      if (x[first] > x[first + 1]) {
        # swap the pair
        save <- x[first]
        x[first] <- x[first + 1]
        x[first + 1] <- save
      }
    }
  }
  return(x)
}
```

Quick sort
Quick sort involves the following steps:
◙ Pick an element, called a pivot, from the array.
◙ Partitioning: reorder the array so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.
◙ Recursively apply the above steps to the sub-array of elements with smaller values and separately to the sub-array of elements with greater values.

```
quickSort <- function(vect) {
  if (length(vect) <= 1) {
    return(vect)
  }
  # Pick an element from the vector to act as the pivot
  element <- vect[1]
  partition <- vect[-1]
  # Reorder the vector so that values less than the pivot
  # come before it, and all other values come after it.
  v1 <- partition[partition < element]
  v2 <- partition[partition >= element]
  # Recursively apply the steps to the smaller vectors.
  v1 <- quickSort(v1)
  v2 <- quickSort(v2)
  return(c(v1, element, v2))
}
```

Learning R (Introduction)

♦ The R programming language is generally used for statistical analysis, graphical representation, and reporting.

<- symbol is the assignment operator.
# symbol is used to comment a line

```
# x equals 1
x <- 1
# msg equals "hello"
msg <- "hello"
```

The basic arithmetic operations using R

```
# addition
18 + 12
# subtraction
18 - 12
# multiplication
18 * 12
# division
18 / 12
# just the integer part of the quotient
18 %/% 12
# just the remainder part (modulo)
18 %% 12
# exponentiation (raising to a power)
18 ^ 12
# natural log (base e)
log(10)
# base 10 logs
log10(100)
# square root
sqrt(88)
# absolute value
abs(18 / -12)
```

Defining vectors in R

```
# Method 1
# numeric vector
x <- c(0.5, 0.6)
# complex vector
x <- c(1+0i, 2+4i)

# Method 2 - Use the vector() function to initialize vectors
x <- vector("numeric", length = 10)

# Method 3 - Creating vector of numerical numbers
# number sequence 1, 2, 3,.... 10
1:10
# number sequence from 1 to 10 and interval is 1
seq(from=1, to=10, by=1)

# Some useful functions
my.seq <- seq(from = 1, to = 10, by = 1)
# to check the type of data of my.seq
class(my.seq)
# to check whether my.seq is a vector
is.vector(my.seq)
# divide each element of my.seq by 3
my.seq <- my.seq / 3
```

Defining matrices in R

```
# Method 1
# To create empty 2 by 3 matrix
m <- matrix(nrow = 2, ncol = 3)
# To check the dimensionality
dim(m)

# Method 2
# Matrices are constructed column-wise
m <- matrix(1:6, nrow = 2, ncol = 3)

# Method 3
#Matrix created directly from vectors by adding a dimension attribute.
m <- 1:10
dim(m) <- c(2, 5)

# Method 4
x <- 1:3
y <- 10:12
# create matrix by column-binding
cbind(x, y)
# create matrix by row-binding
rbind(x, y)
```

Defining lists in R
Lists are a special type of vector that can contain elements of different classes.

```
# Method 1
# this list contains different class of elements
x <- list(1, "a", TRUE, 1 + 4i)

# Method 2
# create empty list with the length of 5
x <- vector("list", length = 5)
```

Defining factors in R
♦ Factors are used to represent categorical data and can be unordered or ordered.
♦ Factors are important in statistical modelling.

```
# Levels are put in alphabetical order
x <- factor(c("yes", "yes", "no", "yes", "no"))

# table() shows how many "yes" and "no" values there are
table(x)

# The levels argument sets the order explicitly instead of alphabetically
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
```

Missing Values
Missing values are denoted by NA or NaN. NA is used to represent missing values, and NaN is used to represent invalid numbers (for example, 0/0).

```
# a vector is defined with a missing value
x <- c(1, 2, NA, 10, 3)

# it will check whether this vector has any na values
is.na(x)
```
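The difference between NA and NaN shows up when the two checks are run side by side. In the sketch below (the vector is made up for illustration), is.na() flags both kinds of value, while is.nan() flags only the invalid number.

```r
# One missing value (NA) and one invalid number (0 / 0 gives NaN)
x <- c(1, NA, 0 / 0, 4)

na_flags  <- is.na(x)    # TRUE for both NA and NaN
nan_flags <- is.nan(x)   # TRUE only for NaN

print(na_flags)
print(nan_flags)

# na.rm = TRUE drops both NA and NaN before computing the mean
clean_mean <- mean(x, na.rm = TRUE)
print(clean_mean)
```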

Data Frames
♦ Data frames are used to store tabular data in R.
♦ Data frames are represented as a special type of list where every element of the list has to have the same length.
♦ Unlike matrices, data frames can store different classes of objects in each column.
♦ In addition to column names, indicating the names of the variables or predictors, data frames have a special attribute called row.names, which indicates information about each row of the data frame.

```
# Define a data frame in R
x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))

# To show number of rows
nrow(x)

# To show number of columns
ncol(x)
```

Managing Data Frames with the dplyr package
♦ The data frame is a key data structure in statistics and in R.
♦ The dplyr package is designed for filtering, re-ordering, and collapsing data frames.

```
#Installing dplyr package
install.packages("dplyr")

library(dplyr)

R\\chicago.rds')

#To show number of col and row of data
dim(chicago)
str(chicago)

# The select() function can be used to select columns of a data frame.

# Suppose we wanted to take the first 3 columns only.
names(chicago)[1:3]
subset <- select(chicago, city:dptp)

# if you wanted to keep every variable that ends with a “2”
subset <- select(chicago, ends_with("2"))

# You can also omit variables using the select()
select(chicago, -(city:dptp))

# If we wanted to keep every variable that starts with a “d”
subset <- select(chicago, starts_with("d"))

# The filter() function is used to extract subsets of rows from a data frame
# Extract the rows where PM2.5 is greater than 30
chic.f <- filter(chicago, pm25tmean2 > 30)
str(chic.f)
summary(chic.f$pm25tmean2)

# Extract the rows where PM2.5 is greater than 30 and temperature is greater
# than 80 degrees Fahrenheit.
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
str(chic.f)
summary(chic.f$pm25tmean2)

# The arrange() function is used to reorder rows of a data frame according to one
# of the variables/columns

# We can order the rows of the data frame by date, so that the first row is the
# earliest (oldest) observation and the last row is the latest (most recent)
# observation.
chicago <- arrange(chicago, date)
# Use desc() to sort in descending order instead (most recent first)
chicago <- arrange(chicago, desc(date))
```

Logical operation in R

```
# define a vector using boolean values
a <- c(TRUE, FALSE, FALSE, TRUE)
# define a numeric vector
b <- c(13, 7, 8, 2)
# selects the elements of b where a is TRUE
b[a]    # 13 2
# inverse of a
!a      # FALSE TRUE TRUE FALSE
# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(a)  # 2
```

Built-in search function

```
example(mean)
help.search("optimization")
help(mean)
```

Data input and output

```
#Changing directories
#Changing the default to the mydata folder in the C: drive
setwd("c:\\mydata")

#Save the objects for a future session
dump("usefuldata", "useful.R")

#Retrieve the saved objects
source("useful.R")

#Save all of the objects that you have created during a session
dump(list=objects(), "all.R")

#Redirecting R output to text file
# Create a file solarmean.txt for output
sink("solarmean.txt")
# Write mean value to solarmean.txt