Frequently Asked Questions
What are CRISPRs?
Clustered regularly interspaced short palindromic repeats (CRISPRs) are repeat sequences present in the genomes of archaea and bacteria. Between these repeats lie spacer sequences derived from phages.
What is the CRISPR/Cas system?
The CRISPR/Cas system is an antiviral defense mechanism of bacteria and archaea. RNAs transcribed from the CRISPR array, together with the CRISPR-associated (Cas) proteins, direct sequence-specific cleavage of foreign genetic material.
What is the role of CRISPRs in genome editing?
The CRISPR/Cas system provides sequence-specific cleavage, which can be used for targeted genome editing. Cas9 proteins are endonucleases that make double-stranded breaks at the site specified by a single guide RNA (sgRNA), also called chimeric RNA. By designing suitable CRISPR and Cas9 constructs, any intended site of the genome can be targeted.
What are PAM sequences?
Protospacer adjacent motifs (PAMs) are short nucleotide sequences that Cas proteins require in order to recognize the target sequence. Depending on the Cas protein, the PAM lies either upstream or downstream of the target site.
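As an illustration of PAM-dependent target recognition, the minimal Python sketch below scans the forward strand of a sequence for 20-nt protospacers immediately followed by an NGG PAM, the motif recognized by Streptococcus pyogenes Cas9. The function name `find_ngg_targets`, the 20-nt length, and the choice of NGG are assumptions made for this example and are not taken from the geCRISPR implementation. Other Cas proteins use different PAMs, and a complete scan would also cover the reverse strand.

```python
import re


def find_ngg_targets(seq, protospacer_len=20):
    """Return (start, protospacer, pam) for each NGG PAM site on the forward strand.

    Assumption: SpCas9-style targeting, i.e. the PAM (5'-NGG-3') sits
    immediately downstream (3') of the protospacer.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


# Hypothetical input: one 20-nt stretch followed by a TGG PAM.
example = "C" + "A" * 20 + "TGG" + "C"
for start, protospacer, pam in find_ngg_targets(example):
    print(start, protospacer, pam)
```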
Which dataset was used to develop the geCRISPR algorithms?
The geCRISPR algorithms were developed on 4569 sgRNAs experimentally validated in high-throughput studies from the literature: 1841 sgRNAs (Doench et al, Nature Biotech 2014); 1278 sgRNAs (Shalem et al, Science 2014); 1020 sgRNAs (Miguel et al, Nature Methods 2015) and 430 sgRNAs (Chari et al, Nature Methods 2015).
Which sgRNA features (structure+sequence) and hybrids were used to develop the geCRISPR algorithms?
Compositional, positional (binary), structural and thermodynamic profiles:
S no. | Properties | Vector size |
---|---|---|
1 | mono-compo | 4 |
2 | di-compo | 16 |
3 | tri-compo | 64 |
4 | tetra-compo | 256 |
5 | penta-compo | 1024 |
6 | mono+di compo hybrid | 20 |
7 | mono+di+tri+compo hybrid | 84 |
8 | mono+di+tri+tetra compo hybrid | 340 |
9 | mono+di+tri+tetra+penta compo hybrid | 1364 |
10 | mono-binary | 80 |
11 | di-binary | 304 |
12 | di-2-degree binary | 288 |
13 | di-3-degree binary | 272 |
14 | mono+di-binary hybrid | 384 |
15 | di-1+2+3-degree binary hybrid | 864 |
16 | mono+di (1+2+3-degree) binary hybrid | 944 |
17 | secondary structure | 20 |
18 | Thermodynamic property | 21 |
19 | 7+14 hybrid | 468 |
20 | 7+14+17 hybrid | 488 |
21 | 7+14+18 hybrid | 489 |
22 | 7+14+17+18 hybrid | 509 |
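
The vector sizes in the table are consistent with a 20-nt sgRNA: composition features have 4^k values, and binary features have 4 values per position or 16 values per dinucleotide pair. The Python sketch below shows one way such features could be computed. It assumes that "n-degree" means pairing each nucleotide with the one n positions downstream, which reproduces the sizes 304, 288 and 272. The function names and the example guide are hypothetical, and this is not the geCRISPR code itself.

```python
from itertools import product

BASES = "ACGT"


def kmer_composition(seq, k):
    """Frequency of every k-mer (4**k values), in lexicographic order.

    Assumes seq contains only uppercase A, C, G, T.
    """
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        counts[seq[i:i + k]] += 1
    return [counts[kmer] / windows for kmer in kmers]


def mono_binary(seq):
    """One-hot encode every position: 4 values per nucleotide."""
    return [1 if base == b else 0 for base in seq for b in BASES]


def di_binary(seq, gap=1):
    """One-hot encode every pair (seq[i], seq[i + gap]): 16 values per pair."""
    pairs = ["".join(p) for p in product(BASES, repeat=2)]
    vec = []
    for i in range(len(seq) - gap):
        pair = seq[i] + seq[i + gap]
        vec.extend(1 if pair == p else 0 for p in pairs)
    return vec


sgrna = "GACGTTAGCCATGCAATCGA"  # hypothetical 20-nt guide

# Hybrid features are plain concatenations of the individual vectors.
mono_di_compo = kmer_composition(sgrna, 1) + kmer_composition(sgrna, 2)
mono_di_bin = mono_binary(sgrna) + di_binary(sgrna)

print(len(mono_di_compo), len(mono_binary(sgrna)), len(mono_di_bin))
print(len(di_binary(sgrna, 2)), len(di_binary(sgrna, 3)))
```

Concatenating the vectors in this way gives the hybrid sizes listed in the table, for example 4 + 16 = 20 for the mono+di composition hybrid and 80 + 304 = 384 for the mono+di binary hybrid.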
Which machine learning techniques (MLTs) were employed to develop the geCRISPR algorithms?
- Support Vector Machine (SVM)
- Random Forest (RF) in R
*SVM was used to build the web server because it performed better than RF.
Which in-silico methods are available for predicting sgRNA genome editing efficiency for the CRISPR/Cas system?
- WU-CRISPR: top 20% (~368p) and bottom 20% (~368n) of the 1841 sgRNAs from Doench et al, Nature Biotech 2014 (Wong et al, Genome Biology 2015)
- sgRNAScorer: 450 sgRNAs (Chari et al, Nature Methods 2015)
- CRISPRscan: 1280 sgRNAs (Miguel et al, Nature Methods 2015)
How do the geCRISPR algorithms compare with existing methods?
Classification-based: predicts sgRNA genome editing efficiency qualitatively, as either high or low
Algorithm | Dataset (Train/Test) | ACC (%) | MCC | ROC | Dataset | ACC (%) | MCC | ROC |
---|---|---|---|---|---|---|---|---|
WU-CRISPR | ~736=368p+368n | NA | NA | 0.92 | NA | NA | NA | NA |
sgRNAScorer_sp | 279=133p+146n | 73.20 | NA | NA | NA | NA | NA | NA |
sgRNAScorer_st | 171=82p+89n | 81.50 | NA | NA | NA | NA | NA | NA |
geCRISPRc | 1840=895p+945n | 81.17 | 0.75 | 0.92 | 250=126p+124n | 88.80 | 0.78 | 0.94 |

MCC: Matthews Correlation Coefficient
ROC: Receiver operating characteristic
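
For readers who want to reproduce the classification metrics, the short Python sketch below computes accuracy and MCC from the four cells of a confusion matrix using their standard definitions. The counts in the example are made up for illustration and are not the confusion matrices behind the values reported in the table.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Percentage of correctly classified sgRNAs."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts: 50 high-efficiency (p) and 50 low-efficiency (n) sgRNAs.
tp, fn = 40, 10
tn, fp = 45, 5
print(f"ACC = {accuracy(tp, tn, fp, fn):.2f}%")
print(f"MCC = {mcc(tp, tn, fp, fn):.2f}")
```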
Regression-based: predicts sgRNA genome editing efficiency quantitatively, from 0 to 100%
Algorithm | Dataset (Train/Test) | PCC (R) | Dataset | PCC (R) |
---|---|---|---|---|
CRISPRscan | 1280 | 0.45 | NA | 0.58 |
geCRISPRr | 3619 | 0.68 | 520 | 0.69 |

PCC: Pearson correlation coefficient
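
The Pearson correlation coefficient (R) used to score the regression models can be computed between observed and predicted efficiencies as in the minimal Python sketch below. The two lists are invented values for illustration only and do not come from any of the datasets above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical observed vs. predicted editing efficiencies (0 to 100%).
observed = [12.0, 35.0, 48.0, 70.0, 91.0]
predicted = [20.0, 30.0, 55.0, 62.0, 85.0]
print(f"R = {pearson_r(observed, predicted):.2f}")
```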
Why use the geCRISPR pipeline/algorithm?
Salient features of the geCRISPR pipeline are:
- It was developed on a very large dataset, approximately 3-fold larger than those used by existing methods.
- The dataset used to develop the algorithm contains multiplatform (heterogeneous) data, which makes the method more general.
- Its performance is better than that of the existing methods, as shown in the tables above.
- It is easy to use: the user only needs to provide a genomic sequence/region in FASTA format.
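
For readers unfamiliar with the input format, a FASTA record is a header line starting with `>` followed by one or more sequence lines. The Python sketch below parses such text into a dictionary. The function name `read_fasta` and the example record are hypothetical, and this is not how the geCRISPR server itself reads its input.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped; collect them and join later.
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical input with one wrapped record.
fasta_text = ">region1 example\nacgtacgtac\nGGCCTTAAGG\n"
for name, seq in read_fasta(fasta_text).items():
    print(name, len(seq), seq)
```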