The objective of calculating the nucleotide frequency of siRNA sequences is to transform nucleotide sequences of any length into fixed-length feature vectors, which machine-learning techniques require as input. Because each siRNA in our dataset comprises both a sense and an antisense strand, together with their modifications, the mononucleotide and dinucleotide composition of each siRNA can be encapsulated in vectors of 70 and 2450 features, respectively.
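As a minimal sketch of how such fixed-length composition vectors could be computed, the Python example below assumes that the 35 one-letter symbols used in this study form the alphabet, that frequencies are expressed as percentages, and that the sense and antisense vectors are simply concatenated (35 × 2 = 70 and 35² × 2 = 2450). These are assumptions made for illustration, and the function names and example strands are hypothetical.

```python
from itertools import product

# Assumed alphabet: the 35 one-letter symbols (standard and modified nucleotides).
ALPHABET = "ACGTUBDEFHIJKLMNOPQRSVWXYZbdeghmnqr"

def mono_frequency(seq, alphabet=ALPHABET):
    """Return one percentage per symbol (35 values for a 35-symbol alphabet)."""
    n = len(seq)
    return [100.0 * seq.count(s) / n if n else 0.0 for s in alphabet]

def di_frequency(seq, alphabet=ALPHABET):
    """Return one percentage per ordered symbol pair (35 * 35 = 1225 values)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    counts = {}
    for p in pairs:
        counts[p] = counts.get(p, 0) + 1
    return [100.0 * counts.get(a + b, 0) / total if total else 0.0
            for a, b in product(alphabet, repeat=2)]

def composition_features(sense, antisense, k=1):
    """Concatenate sense and antisense composition vectors (k=1 mono, k=2 di)."""
    f = mono_frequency if k == 1 else di_frequency
    return f(sense) + f(antisense)

sense = "GUCAUCACACUGAAUACCAAU"       # hypothetical example strand
antisense = "AUUGGUAUUCAGUGUGAUGAC"   # hypothetical example strand
mono = composition_features(sense, antisense, k=1)
di = composition_features(sense, antisense, k=2)
print(len(mono), len(di))
```

Whatever the length of the input strands, the output length depends only on the alphabet size, which is the property the learning step relies on.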
We employed binary patterns to extract siRNA features based on the occupancy of nucleotides at each position of the siRNA sequence. A total of 35 binary patterns were used, one for each nucleotide symbol. These are:
A= 000001, C= 000010, G= 000011, T= 000100, U= 000101, B= 000110, D= 000111, E= 001000, F= 001001, H= 001010, I= 001011, J= 001100, K= 001101, L= 001110, M= 001111, N= 010000, O= 010001, P= 010010, Q= 010011, R= 010100, S= 010101, V= 010111, W= 011000, X= 011001, Y= 011010, Z= 011011, b= 011100, d= 011101, e= 011110, g= 011111, h= 100000, m= 100001, n= 100010, q= 100011, r= 100100
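The position-wise encoding can be sketched as follows. The code table is copied from the list above; the function name `encode_binary` is hypothetical, and emitting six bits per position is an assumption, since the text does not state how the patterns are laid out in the final feature vector.

```python
# Symbol -> 6-bit pattern, copied from the list above.
CODE_TABLE = (
    "A=000001 C=000010 G=000011 T=000100 U=000101 B=000110 D=000111 "
    "E=001000 F=001001 H=001010 I=001011 J=001100 K=001101 L=001110 "
    "M=001111 N=010000 O=010001 P=010010 Q=010011 R=010100 S=010101 "
    "V=010111 W=011000 X=011001 Y=011010 Z=011011 b=011100 d=011101 "
    "e=011110 g=011111 h=100000 m=100001 n=100010 q=100011 r=100100"
)
BINARY_CODE = dict(item.split("=") for item in CODE_TABLE.split())

def encode_binary(seq):
    """Encode each position of seq as its 6-bit pattern (assumed layout)."""
    bits = []
    for pos, sym in enumerate(seq, start=1):
        if sym not in BINARY_CODE:
            raise ValueError(f"unknown symbol {sym!r} at position {pos}")
        bits.extend(int(b) for b in BINARY_CODE[sym])
    return bits

print(encode_binary("AUG"))
# -> [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1]
```

Unlike composition features, this encoding preserves the order of the nucleotides, so a given symbol contributes differently depending on where it occurs in the strand.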
In the hybrid approach, features from nucleotide frequency and binary patterns, besides being used individually, were combined in order to improve the performance of the prediction method. We used two hybrid methods, mono-binary and mono-binary-N&C terminal, which yield vectors of 112 and 118 features, respectively.
| Model No. | Features | Vector Size |
|---|---|---|
| 1 | Mononucleotide Frequency (MNC) | 70 |
| 2 | Hybrid 1: MNC + Binary Antisense 24 (MNC+BIN-AS) | 94 |
| 3 | Hybrid 2: MNC + Binary Antisense-5' 13 (MNC+BIN-AS(seed 13)) | 83 |
| 4 | Hybrid 3: MNC + Binary Antisense-seed-5' 8 (MNC+BIN-AS(seed 8)) | 78 |
| 5 | Hybrid 3: MNC + Binary Antisense-seed-3' 8 (MNC+BIN-AS(last 8)) | 78 |
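The vector sizes in the table are consistent with appending one value per selected antisense position to the 70 mononucleotide features (70 + 24 = 94, 70 + 13 = 83, 70 + 8 = 78). The sketch below illustrates that arithmetic only. It assumes one value per position, a 24-position antisense strand written 5' to 3', and hypothetical region labels; none of these details are confirmed by the text.

```python
# Hypothetical labels for the antisense regions named in the table.
REGIONS = {
    "AS-24": slice(0, 24),       # all 24 antisense positions
    "AS-seed-13": slice(0, 13),  # first 13 positions from the 5' end
    "AS-seed-8": slice(0, 8),    # first 8 positions from the 5' end
    "AS-last-8": slice(-8, None) # last 8 positions at the 3' end
}

def hybrid_features(mnc, antisense_codes, region):
    """Append the selected antisense position values to the MNC vector."""
    if len(mnc) != 70:
        raise ValueError("expected 70 mononucleotide features")
    if len(antisense_codes) != 24:
        raise ValueError("expected 24 antisense position values")
    return list(mnc) + list(antisense_codes[REGIONS[region]])

mnc = [0.0] * 70                      # placeholder composition vector
antisense_codes = list(range(1, 25))  # placeholder per-position values
sizes = {name: len(hybrid_features(mnc, antisense_codes, name))
         for name in REGIONS}
print(sizes)
```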
Support vector machines (SVMs) were trained in regression mode on the selected sequence features to predict the potency of modified siRNAs. SVMs allow a choice of several parameters and kernels. The SVMlight software package (available at http://svmlight.joachims.org/) was used to construct the SVM models. In this study, we used the radial basis function (RBF) kernel:
$$k(x, y) = \exp\left(-\gamma \lVert x - y \rVert^{2}\right)$$
where x and y are two data vectors, and γ is a training parameter.
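For illustration, the kernel can be evaluated directly for two feature vectors. This is a minimal standalone sketch, not the SVMlight implementation; the function name `rbf_kernel` and the example values are hypothetical.

```python
import math

def rbf_kernel(x, y, gamma):
    """Return exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))  # identical vectors give 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.1))  # exp(-0.1 * 25)
```

The kernel equals 1 for identical vectors and decays towards 0 as they move apart; a larger γ makes the decay faster, so each training example influences a narrower neighbourhood.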
In addition, we used another machine-learning method, Random Forest (R package). Random forests are an ensemble learning method for classification and regression that operates by constructing a multitude of decision trees at training time and outputting either the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression).
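The aggregation step alone can be sketched as below. This is not the R package used in the study and does not train any trees; it only shows how the outputs of already-trained trees would be combined, with hypothetical function names and example values.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_votes):
    """Return the most frequent class label among the trees' votes (the mode)."""
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_outputs):
    """Return the average of the trees' numeric predictions."""
    return mean(tree_outputs)

print(aggregate_classification(["high", "low", "high"]))  # -> high
print(aggregate_regression([0.5, 0.75, 1.0]))             # -> 0.75
```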
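In order to evaluate the performance of our models, we used Pearson's correlation coefficient (R):

$$R = \frac{n \sum_{i=1}^{n} E_i^{act} E_i^{pred} - \sum_{i=1}^{n} E_i^{act} \sum_{i=1}^{n} E_i^{pred}}{\sqrt{\left[ n \sum_{i=1}^{n} \left(E_i^{act}\right)^{2} - \left(\sum_{i=1}^{n} E_i^{act}\right)^{2} \right] \left[ n \sum_{i=1}^{n} \left(E_i^{pred}\right)^{2} - \left(\sum_{i=1}^{n} E_i^{pred}\right)^{2} \right]}}$$

where n is the size of the test set, and $E_i^{pred}$ and $E_i^{act}$ are the predicted and actual values, respectively. All models were evaluated using the 10-fold cross-validation technique.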
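The evaluation procedure can be sketched in Python as follows, using Pearson's correlation coefficient in its standard computational form together with a simple 10-fold index split. The function names are hypothetical, and the way samples are assigned to folds (every k-th sample) is an assumption; the text does not specify how folds were drawn or whether R was computed per fold or over pooled predictions.

```python
import math

def pearson_r(actual, predicted):
    """Pearson's correlation coefficient between actual and predicted values."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    s_a, s_p = sum(actual), sum(predicted)
    s_ap = sum(a * p for a, p in zip(actual, predicted))
    s_aa = sum(a * a for a in actual)
    s_pp = sum(p * p for p in predicted)
    numerator = n * s_ap - s_a * s_p
    denominator = math.sqrt((n * s_aa - s_a ** 2) * (n * s_pp - s_p ** 2))
    if denominator == 0:
        raise ValueError("R is undefined when either input is constant")
    return numerator / denominator

def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; fold i tests every k-th sample from i."""
    for fold in range(k):
        test = list(range(fold, n_samples, k))
        train = [i for i in range(n_samples) if i % k != fold]
        yield train, test

folds = list(k_fold_indices(25, k=10))
print(len(folds), folds[0][1])
print(pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```

In each round, a model would be trained on the `train` indices and its predictions on the `test` indices compared with the actual values, so every sample is used for testing exactly once.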