Frequently Asked Questions

What are CRISPRs?

Clustered regularly interspaced short palindromic repeats (CRISPRs) are repeat sequences present in the genomes of archaea and bacteria. Between these repeats lie spacer sequences, which are derived from phages.

What is the CRISPR/Cas system?

The CRISPR/Cas system is an antiviral defense mechanism of bacteria and archaea. RNAs transcribed from the CRISPR array, together with the CRISPR-associated (Cas) proteins, direct sequence-specific cleavage of foreign genetic material.

What is the role of CRISPRs in genome editing?

The CRISPR/Cas system provides sequence-specific cleavage, which can be used for targeted editing of an organism's genome. Cas9 proteins are endonucleases that make double-stranded breaks at the site specified by a single guide RNA (sgRNA), also called a chimeric RNA. By designing suitable sgRNA and Cas9 constructs, in principle any intended site in the genome can be targeted.

What are PAM sequences?

Protospacer adjacent motifs (PAMs) are short nucleotide sequences that Cas proteins require in order to recognize a target sequence. A PAM can lie either upstream or downstream of the target site.
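
As a rough illustration of how a PAM constrains target selection, the sketch below scans one strand of a sequence for 20-nt sites followed directly by a PAM. It assumes the NGG motif and the 20-nt target length commonly associated with Cas9 from Streptococcus pyogenes; neither value is stated in this FAQ, and the function name find_ngg_targets is hypothetical, not part of geCRISPR.

```python
import re

def find_ngg_targets(seq, protospacer_len=20):
    """Return (start, protospacer, pam) for every site followed by an NGG PAM.

    Assumptions (not taken from this FAQ): the PAM is NGG, it lies
    immediately downstream (3') of the target, and only the given
    strand is scanned.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]

# Hypothetical input: a 20-nt site followed by the PAM "AGG".
example = "ACGT" * 5 + "AGGTT"
print(find_ngg_targets(example))
# -> [(0, 'ACGTACGTACGTACGTACGT', 'AGG')]
```

Scanning the reverse strand would require running the same function on the reverse complement of the input.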

Which dataset was used to develop the geCRISPR algorithms?

The geCRISPR algorithms were developed on 4569 sgRNAs that were experimentally validated in high-throughput studies from the literature:

  • 1841 sgRNAs (Doench et al, Nature Biotech 2014)
  • 1278 sgRNAs (Shalem et al, Science 2014)
  • 1020 sgRNAs (Miguel et al, Nature Methods 2015)
  • 430 sgRNAs (Chari et al, Nature Methods 2015)

Which sgRNA features (structure + sequence) and hybrids were used to develop the geCRISPR algorithms?

Compositional, positional (binary), structural and thermodynamic profiles:

S no.  Properties                                 Vector size
1      mono-compo                                 4
2      di-compo                                   16
3      tri-compo                                  64
4      tetra-compo                                256
5      penta-compo                                1024
6      mono+di compo hybrid                       20
7      mono+di+tri compo hybrid                   84
8      mono+di+tri+tetra compo hybrid             340
9      mono+di+tri+tetra+penta compo hybrid       1364
10     mono-binary                                80
11     di-binary                                  304
12     di-2-degree binary                         288
13     di-3-degree binary                         272
14     mono+di-binary hybrid                      384
15     di-1+2+3-degree binary hybrid              864
16     mono+di (1+2+3-degree) binary hybrid       944
17     secondary structure                        20
18     Thermodynamic property                     21
19     7+14 hybrid                                468
20     7+14+17 hybrid                             488
21     7+14+18 hybrid                             489
22     7+14+17+18 hybrid                          509
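
The vector sizes listed above are consistent with k-mer composition (4^k values) and position-wise one-hot encoding of a 20-nt sgRNA. The sketch below reproduces those sizes under that reading. It is an interpretation, not the geCRISPR code: the percentage scaling, the meaning of "degree" as the distance between the two paired positions, and all function names are assumptions.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_composition(seq, k):
    """Percent composition of all 4**k possible k-mers in seq (assumed scaling)."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        counts[seq[i:i + k]] += 1  # raises KeyError on non-ACGT characters
    return [100.0 * counts[km] / windows for km in kmers]

def mono_binary(seq):
    """One-hot encode every position: 4 bits per nucleotide."""
    seq = seq.upper()
    return [int(base == ref) for base in seq for ref in ALPHABET]

def di_binary(seq, degree=1):
    """One-hot encode every pair (i, i + degree): 16 bits per pair.

    'degree' is assumed to be the distance between the paired positions.
    """
    seq = seq.upper()
    pairs = [a + b for a, b in product(ALPHABET, repeat=2)]
    return [int(seq[i] + seq[i + degree] == ref)
            for i in range(len(seq) - degree) for ref in pairs]

sgrna = "GACGTTCAGGATCCATGCAA"  # hypothetical 20-nt guide sequence

print(len(kmer_composition(sgrna, 1)), len(kmer_composition(sgrna, 2)))  # -> 4 16
print(len(mono_binary(sgrna)))                                           # -> 80
print([len(di_binary(sgrna, d)) for d in (1, 2, 3)])                     # -> [304, 288, 272]

# A hybrid is taken here to be a simple concatenation, e.g. rows 7 and 14.
hybrid = (kmer_composition(sgrna, 1) + kmer_composition(sgrna, 2)
          + kmer_composition(sgrna, 3) + mono_binary(sgrna) + di_binary(sgrna, 1))
print(len(hybrid))                                                       # -> 468
```

Under this reading, rows 19 to 22 combine earlier rows by serial number: 84 + 384 = 468, and adding the 20 structural and 21 thermodynamic values gives 488, 489 and 509.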

Which machine learning techniques (MLTs) were employed to develop the geCRISPR algorithms?

  • Support Vector Machine (SVM)
  • Random Forest (RF) in R

*Only the SVM models are used in the webserver, as SVM performed better than RF.

Which in-silico methods are available for predicting sgRNA genome editing efficiency for the CRISPR/Cas system?

  • WU-CRISPR (Wong et al, Genome Biology 2015): used the top 20% (~368p) and bottom 20% (~368n) of the 1841 sgRNAs from Doench et al, Nature Biotech 2014
  • sgRNAScorer: 450 sgRNAs (Chari et al, Nature Methods 2015)
  • CRISPRscan: 1280 sgRNAs (Miguel et al, Nature Methods 2015)

How do the geCRISPR algorithms compare with existing methods?

Classification based: predicts sgRNA genome editing efficiency qualitatively, as either high or low.

Algorithm       Dataset (Train/Test)  ACC (%)  MCC   ROC   Dataset (Independent)  ACC (%)  MCC   ROC
WU-CRISPR       ~736=368p+368n        NA       NA    0.92  NA                     NA       NA    NA
sgRNAScorer_sp  279=133p+146n         73.20    NA    NA    NA                     NA       NA    NA
sgRNAScorer_st  171=82p+89n           81.50    NA    NA    NA                     NA       NA    NA
geCRISPRc       1840=895p+945n        81.17    0.75  0.92  250=126p+124n          88.80    0.78  0.94

ACC: Accuracy
MCC: Matthews Correlation Coefficient
ROC: Receiver operating characteristic
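
For reference, ACC and MCC can both be computed from the four cells of a confusion matrix. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not taken from any of the datasets above, and the function name is an assumption.

```python
import math

def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy in %, Matthews correlation coefficient).

    tp/tn: correctly predicted high/low efficiency sgRNAs
    fp/fn: incorrectly predicted high/low efficiency sgRNAs
    """
    total = tp + tn + fp + fn
    acc = 100.0 * (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc

# Hypothetical counts for a balanced test set of 100 sgRNAs.
print(accuracy_and_mcc(tp=40, tn=40, fp=10, fn=10))  # -> (80.0, 0.6)
```

MCC ranges from -1 to 1 and, unlike accuracy, stays informative when the positive and negative classes differ in size.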

Regression based: predicts sgRNA genome editing efficiency quantitatively, from 0 to 100%.

Algorithm   Dataset (Train/Test)  PCC (R)  Dataset (Independent)  PCC (R)
CRISPRscan  1280                  0.45     NA                     0.58
geCRISPRr   3619                  0.68     520                    0.69

PCC: Pearson correlation coefficient
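
The PCC measures how well predicted efficiencies track the measured ones on a linear scale. A minimal sketch of the standard formula follows; the efficiency values in the example are made up for illustration and the function name is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)

# Hypothetical measured vs. predicted editing efficiencies (in %).
measured = [12.0, 35.0, 50.0, 71.0, 90.0]
predicted = [20.0, 30.0, 55.0, 60.0, 85.0]
print(round(pearson_r(measured, predicted), 2))
```

A value of 1 means a perfect positive linear relationship, 0 means none, and -1 a perfect negative one.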

Why use the geCRISPR pipeline/algorithm?

Salient features of the geCRISPR pipeline are:

  • It was developed on a very large dataset, approximately 3-fold larger than those used by existing methods.

  • The dataset used to develop the algorithm contains heterogeneous, multi-platform data, which makes the method more general.

  • The algorithm performs better than existing methods, as shown in the tables above.

  • It is easy to use: the user only needs to provide a genomic sequence/region in FASTA format.
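
For readers unfamiliar with the input format, the sketch below shows one way to read FASTA text into a dictionary of sequences before submitting a region. It is a generic illustration, not part of the geCRISPR pipeline; the function name read_fasta and the example record are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {record_id: sequence}.

    The record id is taken as the first word after '>'; sequence lines
    are joined and upper-cased. Lines before the first header are ignored.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else "unnamed"
            records[current] = []
        elif current is not None:
            records[current].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}

# Hypothetical input with one record split over two lines.
fasta_text = """>region1 example genomic region
GACGTTCAGGATCCATGCAA
aggttcagt
"""
print(read_fasta(fasta_text))
# -> {'region1': 'GACGTTCAGGATCCATGCAAAGGTTCAGT'}
```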