An introduction to the R package neat

Introduction

What’s neat?

neat is the R package that implements NEAT, the Network Enrichment Analysis Test which is presented in Signorelli, M., Vinciotti, V., Wit, E. C. (2016). NEAT: an efficient network enrichment analysis test. BMC Bioinformatics, 17:352.

The article is freely available from the website of BMC Bioinformatics.

What’s “network” enrichment analysis?

Network enrichment analysis is an extension of traditional gene enrichment analysis (GEA) tests, which are typically used to provide a characterization of a target gene set by relating it to gene sets (such as Gene Ontologies or KEGG pathways) whose function is already known.

A known limitation of GEA tests is that they ignore associations and dependences between genes. The purpose of network enrichment analysis is thus to integrate GEA tests with information on known relations between genes, represented by means of a gene network.

Loosely speaking, we can say that network enrichment analysis incorporates genetic networks, with their information on gene dependences, into gene enrichment tests. Hence, the name “network” enrichment analysis.

Get started

In order to be able to use the package, you need to install it with

install.packages('neat')

and, then, to load it with the command

library('neat')

A first example

Let’s first have a quick look at an example of how a network enrichment analysis can be carried out with NEAT.

The analysis will typically consist of three steps: preparation of the data, computation of the test and inspection of the results.

Preparation of the data

Let’s start by loading yeast, a list which contains the data that we will need for the analysis:

data(yeast) # load the data
ls(yeast) # display the content of the list
## [1] "esr1"       "esr2"       "goslimproc" "kegg"       "yeastnet"  
## [6] "ynetgenes"

Let’s say that we are interested to know whether a set of differentially expressed genes, yeast$esr2, can be related to some functional gene sets contained in yeast$goslimproc. Let’s focus the attention on two of these processes, namely ‘response to heat’ and ‘response to starvation’.

Before we can proceed with the analysis, we have to create two lists of gene sets, one (which we will call induced_genes) containing the set of differentially expressed genes and the other (called functional_sets) with the functional sets of interest:

induced_genes = list('ESR 2' = yeast$esr2) # set of differentially expressed genes 
#(ESR 2 is the set of induced ESR genes)
functional_sets = yeast$goslimproc[c(72,75)] # two functional gene sets of interest: 
#response to heat and to starvation

Besides these two lists, we will need two further objects:

  • yeast$yeastnet, a two-column matrix that contains YeastNet (a network incorporating known functional couplings between yeast genes, see the help page ?yeast for more details);
  • yeast$ynetgenes, a vector containing the names of all the genes that are present in the network.

Computation of the test

The idea behind NEAT is that if two gene sets are related, then in the network we expect them to be connected by a larger (or smaller) number of links than we would expect to observe by chance. Our null hypothesis, thus, is that if A and B are unrelated, then links are randomly placed between the two groups, so that the total number of links between A and B can be assumed to follow an hypergeometric distribution.

If, however, the number of links that we actually observe between A and B turns out to be significantly different from what we would expect to get if links were placed randomly, then we take this fact as potential evidence of a relation between the two groups and we say that there is ``enrichment” between them.

The computation of the test can be done with the function neat as follows:

test = neat(alist = induced_genes, blist = functional_sets, network = yeast$yeastnet,
            nettype = 'undirected', nodes = yeast$ynetgenes, alpha = 0.01, mtc.type = 'fdr')

Analysis of the results

The results are now saved in the object test, which we can display with the command print:

print(test)
##       A                      B nab expected_nab pvalue adjusted_p
## 1 ESR 2       response_to_heat  86         96.9 0.2518     0.2518
## 2 ESR 2 response_to_starvation 459        331.4 0.0000     0.0000
##       conclusion
## 1  No enrichment
## 2 Overenrichment

From the table we can see that the set of differentially expressed genes (ESR 2) is not enriched with respect to the set of genes involved in response to heating, whereas it is overenriched with respect to the set of genes that are responsible for response to starvation (that is to say, the observed number of links, 459, is significantly higher than what we would expect to get by chance, i.e. 331). Thus, we can conclude that genes in ESR 2 are regulated when the yeast cell is exposed to starvation, but not when exposed to heating.

A closer look to the package

The core of the package is the function neat:

neat(alist, blist, network, nettype, nodes, 
     alpha = NULL, mtc.type = 'fdr',
     anames = NULL, bnames = NULL)

The fundamental arguments of the function are:

  • alist and blist, two lists of gene sets;

  • network, which can be specified in three different formats;

  • nettype, either 'undirected' or 'directed';

  • nodes, a vector containing the names of all nodes in the network.

Moreover, three optional arguments are alpha, which allows to specify the significance level of the test, and anames and bnames (they can be used to name the elements of alist and blist, if not already named). Starting from package version 1.2.0, a further argument mtc.type has been added to allow computation of multiple testing corrections within neat.

As a (toy) example, let’s consider a partially directed network with 7 nodes defined by the following adjacency matrix

A = matrix(0, nrow=7, ncol=7)
labels = letters[1:7]
rownames(A) = labels; colnames(A) = labels
A[1,c(2,3)]=1; A[2,c(5,7)]=1;A[3,c(1,4)]=1;A[4,c(2,5,7)]=1;A[6,c(2,5)]=1;A[7,4]=1
print(A)
##   a b c d e f g
## a 0 1 1 0 0 0 0
## b 0 0 0 0 1 0 1
## c 1 0 0 1 0 0 0
## d 0 1 0 0 1 0 1
## e 0 0 0 0 0 0 0
## f 0 1 0 0 1 0 0
## g 0 0 0 1 0 0 0

How to specify the lists of gene sets

Let’s consider three sets of genes {a,e}, {c,g} and {d,f} and suppose we want to test whether there is enrichment from the first two sets to the third one.

First of all, let’s create a vector for each of the three sets:

set1 = c('a','e')
set2 = c('c','g')
set3 = c('d','f')

As we want to know whether there is enrichment from set1 and set2 to set3, we can create two gene lists, one (alist) containing set1 and set2 and the other (blist) containing set3:

alist = list('set 1' = set1, 'set 2' = set2)
blist = list('set 3' = set3)

Alternative network formats

Above we have defined the network with its adjacency matrix A. However, the network can be passed to neat in three alternative formats:

  • a sparse adjacency matrix, e.g.
library(Matrix)
as(A, 'sparseMatrix')
## 7 x 7 sparse Matrix of class "dgCMatrix"
##   a b c d e f g
## a . 1 1 . . . .
## b . . . . 1 . 1
## c 1 . . 1 . . .
## d . 1 . . 1 . 1
## e . . . . . . .
## f . 1 . . 1 . .
## g . . . 1 . . .
  • an igraph graph;

  • a two-column matrix where every row represents and edge (for directed and mixed networks, parent nodes must be in the first column, and child nodes in the second), e.g.:

##       [,1] [,2]
##  [1,] "a"  "b" 
##  [2,] "a"  "c" 
##  [3,] "b"  "e" 
##  [4,] "b"  "g" 
##  [5,] "c"  "a" 
##  [6,] "c"  "d" 
##  [7,] "d"  "b" 
##  [8,] "d"  "e" 
##  [9,] "d"  "g" 
## [10,] "f"  "b" 
## [11,] "f"  "e" 
## [12,] "g"  "d"

Network type

Set the argument nettype equal to 'undirected' if an undirected network is at hand, and equal to 'directed' if you are considering a directed or partially directed network.

Compute the test

Once you have prepared the lists of gene sets and the network, what you need is to run neat, without forgetting to specify the correct nettype (here nettype = 'directed') and the labels of nodes (here nodes = labels):

test1 = neat(alist = alist, blist = blist, network = A, 
             nettype = 'directed', nodes = labels)
print(test1)
##       A     B nab expected_nab     pvalue adjusted_p
## 1 set 1 set 3   0       0.3333 0.68181818 0.68181818
## 2 set 2 set 3   2       0.5000 0.04545455 0.09090909

If you want to add to the results a column specifying the conclusion of the test (overenrichment, no enrichment or underenrichment) for a given significance level, you use the option alpha. Note that the conclusion will obviously depend on the type of multiple correction you choose (compare test2 and test3 below):

# results using FDR correction (default since version 1.2.0)
test2 = neat(alist = alist, blist = blist, network = A, 
             nettype = 'directed', nodes = labels, alpha = 0.05, mtc.type = 'fdr')
print(test2)
##       A     B nab expected_nab     pvalue adjusted_p    conclusion
## 1 set 1 set 3   0       0.3333 0.68181818 0.68181818 No enrichment
## 2 set 2 set 3   2       0.5000 0.04545455 0.09090909 No enrichment
# results without multiple testing correction (mtc.type = 'none')
test3 = neat(alist = alist, blist = blist, network = A, 
             nettype = 'directed', nodes = labels, alpha = 0.05, mtc.type = 'none')
print(test3)
##       A     B nab expected_nab     pvalue adjusted_p     conclusion
## 1 set 1 set 3   0       0.3333 0.68181818 0.68181818  No enrichment
## 2 set 2 set 3   2       0.5000 0.04545455 0.04545455 Overenrichment

Further details and material

The aim of this vignette is to provide a quick introduction to the computation of NEAT using R. Here I focused my attention on the fundamental aspects that one needs to use the package.

Further details, functions and examples can be found in the manual of the package.

The description of the method is available in an article which you can read here.