ge-CRISPR

Frequently Asked Questions

What are CRISPRs?

Clustered regularly interspaced short palindromic repeats (CRISPRs) are repeat sequences present in the genomes of archaea and bacteria. Between these repeats lie spacer sequences, which are derived from phages and other invading genetic elements such as plasmids.

What is the CRISPR/Cas system?

The CRISPR/Cas system is an antiviral defense mechanism of bacteria and archaea. RNAs transcribed from the CRISPR array, together with the CRISPR-associated (Cas) proteins, direct sequence-specific cleavage of foreign genetic material.

What is the role of CRISPRs in genome editing?

The CRISPR/Cas system provides sequence-specific cleavage, which can be harnessed for targeted genome editing. Cas9 proteins are endonucleases that make double-stranded breaks at the site specified by a single guide RNA (sgRNA), also called chimeric RNA. By designing suitable sgRNA and Cas9 constructs, in principle any intended site of the genome can be targeted.

What are PAM sequences?

Protospacer adjacent motifs (PAMs) are short nucleotide sequences that Cas proteins require to recognize a target sequence. Depending on the CRISPR/Cas system, the PAM lies either upstream or downstream of the target site.
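As an illustration of how a PAM constrains target selection, the sketch below scans a DNA string for 20-nt protospacers followed immediately by an NGG motif, the PAM commonly associated with Streptococcus pyogenes Cas9. The function name, the choice of NGG, and the 20-nt length are assumptions made for this example. It is not the geCRISPR implementation, and it only scans the strand it is given.

```python
import re


def find_spcas9_targets(seq, protospacer_len=20):
    """Return (start, protospacer, pam) for every NGG PAM site on the given strand.

    Hypothetical helper for illustration: assumes an SpCas9-style PAM (NGG)
    located immediately 3' of the protospacer.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % protospacer_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


if __name__ == "__main__":
    example = "TTGACGTTAGCATCGATCGGATAGGCTTAA"  # made-up sequence
    for start, protospacer, pam in find_spcas9_targets(example):
        print(start, protospacer, pam)
```

To cover both strands, the same function could be applied to the reverse complement of the input.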

Which dataset was used to develop the geCRISPR algorithms?

The geCRISPR algorithms were developed on 4569 experimentally validated sgRNAs from high-throughput studies in the literature: 1841 sgRNAs (Doench et al, Nature Biotech 2014); 1278 sgRNAs (Shalem et al, Science 2014); 1020 sgRNAs (Moreno-Mateos et al, Nature Methods 2015); and 430 sgRNAs (Chari et al, Nature Methods 2015).

Which sgRNA features (structure + sequence) and hybrids were used to develop the geCRISPR algorithms?

Compositional/positional (binary)/structural/thermodynamic Profile:

S no.	Properties	Vector size
1	mono-compo	4
2	di-compo	16
3	tri-compo	64
4	tetra-compo	256
5	penta-compo	1024
6	mono+di compo hybrid	20
7	mono+di+tri compo hybrid	84
8	mono+di+tri+tetra compo hybrid	340
9	mono+di+tri+tetra+penta compo hybrid	1364
10	mono-binary	80
11	di-binary	304
12	di-2-degree binary	288
13	di-3-degree binary	272
14	mono+di-binary hybrid	384
15	di-1+2+3-degree binary hybrid	864
16	mono+di (1+2+3-degree) binary hybrid	944
17	secondary structure	20
18	Thermodynamic property	21
19	7+14 hybrid	468
20	7+14+17 hybrid	488
21	7+14+18 hybrid	489
22	7+14+17+18 hybrid	509
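The vector sizes in the table are consistent with a 20-nt sgRNA: k-mer composition gives 4^k values, mono-binary gives 20 x 4 = 80 bits, and di-binary gives 19 x 16 = 304 bits. The sketch below shows one way such features could be computed. The function names are hypothetical, the composition is normalized as a fraction (the source does not say whether fractions or percentages were used), and reading "2-degree" and "3-degree" as nucleotide pairs taken 2 and 3 positions apart is an assumption that happens to reproduce the sizes 288 and 272.

```python
from itertools import product

BASES = "ACGT"


def composition(seq, k):
    """Fraction of each of the 4**k possible k-mers (expects uppercase A/C/G/T)."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        counts[seq[i:i + k]] += 1
    return [counts[kmer] / windows for kmer in kmers]


def mono_binary(seq):
    """Position-specific one-hot encoding: 4 bits per nucleotide."""
    return [1 if base == b else 0 for base in seq for b in BASES]


def di_binary(seq, gap=1):
    """One-hot encoding of nucleotide pairs (i, i + gap): 16 bits per pair."""
    pairs = ["".join(p) for p in product(BASES, repeat=2)]
    vec = []
    for i in range(len(seq) - gap):
        pair = seq[i] + seq[i + gap]
        vec.extend(1 if pair == p else 0 for p in pairs)
    return vec


if __name__ == "__main__":
    sgrna = "GACGTTAGCATCGATCGGAT"  # made-up 20-nt guide sequence
    # Hybrid features are plain concatenations, e.g. row 7 (mono+di+tri compo).
    hybrid_7 = composition(sgrna, 1) + composition(sgrna, 2) + composition(sgrna, 3)
    print(len(hybrid_7), len(mono_binary(sgrna)), len(di_binary(sgrna)))
```

Concatenating the individual vectors in this way yields the hybrid sizes listed in the table, for example 4 + 16 + 64 = 84 for row 7 and 80 + 304 = 384 for row 14.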

Which machine learning techniques (MLTs) were employed to develop the geCRISPR algorithms?

Support Vector Machine (SVM)
Random Forest (RF) in R

*However, SVM was used to develop the web server because it performed better than RF.

Which in-silico methods are available for predicting sgRNA genome editing efficiency for the CRISPR/Cas system?

WU-CRISPR: top 20% (~368p) and bottom 20% (~368n) of the 1841 sgRNAs from Doench et al, Nature Biotech 2014 (Wong et al, Genome Biology 2015)
sgRNAScorer: 450 sgRNAs (Chari et al, Nature Methods 2015)
CRISPRscan: 1280 sgRNAs (Moreno-Mateos et al, Nature Methods 2015)

How do the geCRISPR algorithms compare with existing methods?

Classification based: predicts sgRNA genome editing efficiency qualitatively, as either high or low

Algorithm	Dataset (Train/Test)	ACC (%)	MCC	ROC	Dataset (Independent)	ACC (%)	MCC	ROC
WU-CRISPR	~736=368p+368n	NA	NA	0.92	NA	NA	NA	NA
sgRNAScorer_sp	279=133p+146n	73.20	NA	NA	NA	NA	NA	NA
sgRNAScorer_st	171=82p+89n	81.50	NA	NA	NA	NA	NA	NA
geCRISPRc	1840=895p+945n	81.17	0.75	0.92	250= 126p+124n	88.80	0.78	0.94

ACC: Accuracy
MCC: Matthews Correlation Coefficient
ROC: Area under the receiver operating characteristic curve
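For readers unfamiliar with these measures, the following minimal sketch computes ACC and MCC from the four cells of a confusion matrix using their standard definitions. The function name and the counts in the usage line are invented for illustration and are not taken from the geCRISPR evaluation.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy in percent, Matthews correlation coefficient).

    tp/tn: correctly predicted high/low efficiency sgRNAs.
    fp/fn: wrongly predicted high/low efficiency sgRNAs.
    """
    total = tp + tn + fp + fn
    acc = 100.0 * (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    acc, mcc = accuracy_and_mcc(tp=40, tn=45, fp=5, fn=10)
    print(f"ACC = {acc:.2f}%, MCC = {mcc:.2f}")
```

MCC ranges from -1 to +1 and, unlike accuracy, stays informative when the positive and negative classes differ in size.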

Regression based: predicts sgRNA genome editing efficiency quantitatively, from 0 to 100%

Algorithm	Dataset (Train/Test)	PCC (R)	Dataset (Independent)	PCC (R)
CRISPRscan	1280	0.45	NA	0.58
geCRISPRr	3619	0.68	520	0.69

PCC: Pearson correlation coefficient
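The PCC between predicted and experimentally measured efficiencies can be computed as in the sketch below, which uses the standard definition (covariance divided by the product of the standard deviations). The function name and the example values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up efficiencies (percent), for illustration only.
    observed = [12.0, 35.0, 50.0, 71.0, 90.0]
    predicted = [20.0, 30.0, 55.0, 60.0, 85.0]
    print(round(pearson_r(observed, predicted), 3))
```

A value of 1 indicates a perfect positive linear relationship between predicted and observed efficiencies, and 0 indicates no linear relationship.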

Why use the geCRISPR pipeline/algorithm?

Salient features of the geCRISPR pipeline are:

It is developed on a very large dataset, approximately 3-fold larger than those used by the existing methods.

The dataset used to develop the algorithm contains multiplatform (heterogeneous) data, which makes it a more general, universal method.

The performance of the algorithm is better than that of the existing methods, as shown in the tables above.

Finally, the pipeline is easy to use: the user only needs to provide a genomic sequence/region in FASTA format.