Importing Data

When importing data we use a few common functions:

read.csv() - to read in .csv files or files separated by commas/,
read.table() - to read files separated by delimiters other than commas - like spaces/" ", tabs/"/t", semicolons/";", etc.
openxlsx::read.xlsx() - to read excel files

You’ll note that read.xlsx() has the prefix openxlsx::. This is because the read. xlsx() function is not avaiable with base R. To get this function you will need to install a new package. To do so you’ll enter this general formula into the console:

install.packages("packageToInstall")

So to install the openxlsx package all you’ll need to do is write out install.packages("openxlsx"). Installing packages aside, let’s dive into how to use each function described above.

read.csv()

When importing .csv files you’ll need to specify the path to where you’re file is located. So if your .csv file is in /Documents/test.csv, you can download it like so:

test <- read.csv("/Documents/test.csv")

We can also extend this to URL’s as well:

url_test <- read.csv(url("http://plugins.biogps.org/download/human_sample_annot.csv"))

Once you have data loaded you will see it in your environment window:

You can click this to inspect your data and it will appear in your script editor window:

read.table()

Like read.csv(), read.table() can also import data. The latter function is very useful in that it can download files not delimted (a.k.a separated) by commas. So to open a “.tsv” file (a.k.a a file delimeted by a tab/"/t"):

tsv_test <- read.table("/Documents/test.tsv",sep="\t",stringsAsFactors=FALSE)

You’ll notice in the code above that we include the option, stringsAsFactors=FALSE. If this was set to TRUE it would coerce your character columns into factor columns and this isn’t always desired. So here we explicitly say stringsAsFactors=FALSE to be safe.

read.xlsx()

While files like the ones mentioned above are popular, so are excel spreadsheets. So it is worth mentioning how to read in excel data as well. However, to do wo we will need pull in the package openxlsx, which we did above by entering install.packages("openxlsx"). To use this package we can do one of the following:

call it for every function we use: openxlsx::read.xlsx("/Documents/test.xlsx")
load the entire library of functions:

library(openxlsx)

xlsx_test <- read.xlsx(“/Documents/test.xlsx”)

Often times we just load in the whole library, if we were writing our own package we would try and call it for every function we use for traceability reasons. Now in excel spreadsheets you may only want to pull out one page or start from a row that isn’t the first. To do so you can use:

 library(openxlsx)

 xlsx_test <- read.xlsx("/Documents/test.xlsx",sheet=2,startRow = 5,colNames = TRUE,rowNames = FALSE)

So here we are pulling: the document “/Documents/test.xlsx”, the second sheet, starting from the fifth row, specifying we do have column names, specifying we do not have row names.

Inspecting Data

Now before we go about manipulating it let’s inspect it. For training purposes we will inspect a base training dataset in R called iris:

To get a summary of each column:

summary(iris)

Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

To get the data’s class:

class(iris)

data.frame

To get a display of the data’s contents:

str(iris)

data.frame:	150 obs. of  5 variables:
$ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

To get the first 6 rows:

head(iris)
 
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
        5.1         3.5          1.4         0.2  setosa
        4.9         3.0          1.4         0.2  setosa
        4.7         3.2          1.3         0.2  setosa
        4.6         3.1          1.5         0.2  setosa
        5.0         3.6          1.4         0.2  setosa
        5.4         3.9          1.7         0.4  setosa

To get the last 6 rows:

tail(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
        6.7         3.3          5.7         2.5 virginica
        6.7         3.0          5.2         2.3 virginica
        6.3         2.5          5.0         1.9 virginica
        6.5         3.0          5.2         2.0 virginica
        6.2         3.4          5.4         2.3 virginica
        5.9         3.0          5.1         1.8 virginica

To get the length of a vector:

length(iris$Sepal.Length)

150

To get the dimensions of a matrix/data frame:

dim(iris)

150 5 #(so this would be 150 rows and 5 columns)

To get the number of columns/rows:

ncol(iris)

5

nrow(iris)

150

To get your column names:

colnames(iris)

Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

To get your row names:

rownames(iris)

"1"   "2"   "3"   "4"   "5" ...

Now that we know how to import our data and inspect it, we can go ahead and manipulate it!

Next Workshop: Manipulating Data

Main Page

Importing Data

Tools for Life Science

The Basics

NGS Analysis

Metagenomics Analysis

Protein Structure Analysis

Galaxy Tutorials

Importing Data

read.csv()

read.table()

read.xlsx()

Inspecting Data