Introduction to R

R is a useful programming language that allows us to perform a variety of statis- tical tests and data manipulation. It can also be used to generate fantastic data visualisations. Here we will go through some of the basics of R so that you can better understand the practicals throughout the workshop.

Basics

To being type R in the terminal:

R

Libraries

Most functionality of R is organised in packages or libraries. To access these functions, we will have to install and load these packages. Most commonly used packages are installed together with the standard installation process. You can install a new library using the install.packages function.

For example, to install ggplot2, run the command:

install.packages("ggplot2")

After installation, you can load the library by typing

    library(ggplot2)

Variables in R

You can assign a value or values to any variable you want using \<-. e.g

Assign a number to a

  a <- 1

Assign a vector containing a,b,c to v1

  v1 <- c("a", "b","c")

Functions

You can perform lots of operations in R using different built-in R functions. Some examples are below:

Assign number of samples

    nsample <- 10000

Generate nsample random normal variable with mean = 0 and sd = 1

    normal <- rnorm(nsample, mean=0,sd=1)

    normal.2 <- rnorm(nsample, mean=0,sd=1)

We can examine the first few entries of the result using head

  head(normal)

And we can obtain the mean and sd using

  mean(normal)

  sd(normal)

We can also calculate the correlation between two variables using cor

  cor(normal, normal.2)

Plotting

While R contains many powerful plotting functions in its base packages,customisationn can be difficult (e.g. changing the colour scales, arranging the axes). ggplot2 is a powerful visualization package that provides extensive flexibility and customi- sation of plots. As an example, we can do the following

Load the package

  library(ggplot2)

Specify sample size

  nsample <-1000

Generate random grouping using sample with replacement

 groups <- sample(c("a","b"), nsample, replace=T)

Now generate the data

 dat <- data.frame(x=rnorm(nsample), y=rnorm(nsample), groups)

Generate a scatter plot with different coloring based on group

 ggplot(dat, aes(x=x,y=y,color=groups))+geom_point()

Regression Models

In statistical modelling, regression analyses are a set of statistical techniques for estimating the relationships among variables or features. We can perform regression analysis in R.

Use the following code to perform linear regression on simulated variables "x" and "y":

Simulate data

 nsample <- 10000

 x <- rnorm(nsample)

 y <- rnorm(nsample)

Run linear regression

  lm(y~x)

We can store the result into a variable

   reg <- lm(y~x)

And get a detailed output using summary

   summary(lm(y~x))

We can also extract the coefficient of regression using

   reg$coefficient

And we can obtain the residuals by

  residual <- resid(reg)

Examine the first few entries of residuals

   head(residual)

We can also include covariates into the model

  covar <- rnorm(nsample)

  lm(y~x+covar)

And can even perform interaction analysis

  lm(y~x+covar+x*covar)

Alternatively, we can use the glm function to perform the regression:

  glm(y~x)

For binary traits (case controls studies), logistic regression can be performed using

Simulate samples

  nsample <- 10000

  x <- rnorm(nsample)

Simulate binary traits (must be coded with 0 and 1)

  y <- sample(c(0,1), size=nsample, replace=T)

Perform logistic regression

  glm(y~x, family=binomial)

Obtain the detailed output

  summary(glm(y~x, family=binomial))