Enzymes play an important role in our daily lives and are used in a variety of industries and sectors, such as food, detergents and medicine. The demand for certain enzymes, such as lipases, proteases, hydrolases and polymerases, has increased exponentially. Research laboratories and industries are working extensively to find newer and better candidates.
Major enzyme industries regularly introduce new enzymes to the market. In the past two decades, several patents on enzymes have been filed and issued. Apart from this, there are ongoing efforts to replace chemical reaction processes in industry with enzymatic processes, as they are greener and more environment-friendly alternatives.
It is widely accepted that cleaner chemical synthesis processes should be practiced to prevent pollution and avoid the generation of toxic wastes. Enzymatic synthesis of chemical compounds has emerged as a simple, better and competitive route in comparison to chemical methods. In addition, high substrate specificity and a better conversion rate, with few or no by-products, make enzymes a robust and efficient choice.
Recently, Merck and Codexis developed a greener process for the synthesis of Sitagliptin, a drug used in diabetes treatment. In recent years, advances in recombinant DNA technology have resulted in successful approaches to overexpress an enzyme in a variety of host cells, which can help in producing the biocatalyst in large amounts.
To obtain an efficient enzyme candidate, stringent selection criteria are required to achieve high activity, specificity, and stability. In an industrial process, the substrate, solvent and reaction conditions are important, and the chosen enzyme should be able to withstand these components and conditions. It is difficult to find a natural enzyme with all the properties desired in an industrial process.
To meet the massive enzyme demand, various approaches are practiced to constantly explore different resources to obtain new and better enzymes. Among these, in-silico bioprospecting has emerged as an efficient, cost- and time-effective approach to discover new enzyme candidates. Although this approach has been practiced in various laboratories, it has not been reviewed or discussed.
New enzyme discovery can be accomplished using various conventional and contemporary methods, as outlined in the figure. Common screening methods for identifying novel enzymes explore natural sources like industrial waste or soil, but they require an established screening assay or selection method based on the desired properties of the enzyme. This process involves biochemical screening and isolating the organism on selective media, which is usually time- and resource-consuming and may or may not yield a novel candidate.
From these screening assays, the selected organism needs to be identified, followed by identification of the gene sequence coding for the desired enzyme and function. One approach is to perform random mutagenesis to create enzyme mutants, and then sequence the DNA region.
Another way is to perform targeted or whole-genome sequencing to identify the desired enzyme gene sequence. As an alternative, the target gene can be amplified using degenerate primers. Primer design involves challenges that affect the success rate. The process is followed by PCR library cloning and screening for prospective candidates with desired properties, which again demands a well-established protocol for screening positive candidates.
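One of the primer design challenges can be made concrete with a small calculation: each ambiguous position in a degenerate primer multiplies the number of distinct oligonucleotides in the primer pool. The following is a minimal sketch of that calculation using the standard IUPAC nucleotide codes. The function name and the example primer are hypothetical and are not taken from any of the studies discussed here.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_degeneracy(primer: str) -> int:
    """Return the number of distinct sequences encoded by a degenerate primer."""
    fold = 1
    for base in primer.upper():
        # Each position contributes as many variants as it has possible bases.
        fold *= len(IUPAC[base])
    return fold


# Hypothetical primer, for illustration only (R and Y are 2-fold, N is 4-fold).
example_primer = "ATGGARYTNCC"
print(primer_degeneracy(example_primer))
```

A highly degenerate primer dilutes the fraction of molecules that match any one template exactly, which is one reason why primer design affects the success rate.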
After selecting the desired clone, the responsible gene can be sequenced, cloned and expressed. The direct screening and identification methods are preferred where molecular biology resources are inadequate.
These experimental approaches are commonly used, but they are time- and resource-consuming, with a low success rate. In contrast, in-silico bioprospecting is a simple, straightforward and promising approach to identify novel enzyme candidates with better enzymatic properties. A compilation of recent reports in which the in-silico bioprospecting approach has been used to find novel enzymes is given in Table 1.
The current fast-paced, high-throughput whole genome/metagenome sequencing has tremendously expanded biological databases and thus the known enzyme diversity. This diversity has in turn increased the complexity and difficulty of finding a novel candidate. The in-silico bioprospecting process can be broadly divided into two steps: (i) searching databases and (ii) using bioinformatics tools to screen, analyse and shortlist prospective candidates.
The first step, searching databases, can be performed using various search tools based on homology, conserved motifs, a consensus-guided approach, or simply a keyword search. The search results can be further screened using filters such as percentage identity, query coverage and e-value. For example, a keyword search in the NCBI protein database can be performed, followed by filtering the results to show candidates with between 30 and 80% identity and query coverage > 95%. Gupta et al.
used keywords such as ‘Hypothetical Protein of T. aestivum’ and ‘Hypothetical Proteins of wheat’ in the NCBI database, followed by manual screening, to get unique protein candidates. After removing redundant entries, the unique candidates were subjected to physicochemical, localization, function and domain analysis. In another database search, keywords such as hydroxybutyrate, hydroxyalkanoate, hydroxyalkanoic, PHA and PHB were used as input.
Another common approach practiced by researchers is to search biological databases using a known candidate enzyme sequence. While choosing a potential enzyme gene sequence, it is of utmost importance to select a full-length protein sequence having conserved domains, as many incomplete sequences annotated in databases do not code for a functional protein when checked experimentally. Also, the sequence similarity of a selected candidate to the known sequence should not be very high.
This is to ensure that a novel candidate is shortlisted and not a close homologue of a known sequence. In the similarity search results, hits with >90% identity are very closely related to the query, typically coming from sources such as different species of the same family, and are likely to have very similar properties. In contrast, hits with ~80% identity or lower are candidates that differ from the query and are not closely related to it, but still share conserved sequences with known candidates.
This ensures that the chosen candidates are novel: they are predicted to retain the enzyme activity but differ from the search query. There have been reports where researchers selected candidates with sequence similarity as low as 40 percent. Sharma et al. searched for novel sources of nitrilases in microbial genomes by adopting a homology-based approach and selected sequences which exhibited >30% and <80% identity.
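The identity and coverage window described above can be applied programmatically to similarity search results. The sketch below assumes a tab-separated table with five columns (query, subject, percent identity, percent query coverage, e-value), such as could be produced with a custom tabular BLAST+ output; this column layout, the function names, the example rows and the default e-value cut-off are assumptions made for illustration and are not taken from the cited studies.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    subject_id: str
    pident: float   # percent identity to the query
    qcov: float     # percent of the query covered by the alignment
    evalue: float


def parse_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse tab-separated rows: query, subject, pident, qcov, evalue."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        _query, subject, pident, qcov, evalue = line.split("\t")[:5]
        hits.append(Hit(subject, float(pident), float(qcov), float(evalue)))
    return hits


def select_novel(hits: Iterable[Hit], min_ident: float = 30.0,
                 max_ident: float = 80.0, min_qcov: float = 95.0,
                 max_evalue: float = 1e-2) -> List[Hit]:
    """Keep hits inside the identity window with high coverage and low e-value."""
    return [
        h for h in hits
        if min_ident < h.pident < max_ident
        and h.qcov > min_qcov
        and h.evalue <= max_evalue
    ]


# Made-up rows, for illustration only.
example_rows = [
    "q1\tcandA\t92.5\t99\t1e-120",  # too similar to the query
    "q1\tcandB\t55.0\t97\t3e-60",   # inside the window
    "q1\tcandC\t41.2\t80\t2e-30",   # query coverage too low
    "q1\tcandD\t28.0\t98\t5e-3",    # identity too low
]
selected = select_novel(parse_hits(example_rows))
print([h.subject_id for h in selected])
```

The thresholds are exposed as parameters because the appropriate window depends on the enzyme family and on how divergent a candidate the search is meant to find.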
The shortlisted search results need to be checked for complete coding sequences. For example, shortlisted nitrilase candidates were checked with the GeneMarkS tool to verify complete coding sequences. Since the protein length is known for the input sequence, the search results should be restricted to lengths close to that of the input sequence. In the case of nitrilases, sequences with fewer than 100 amino acids were considered false positives and were discarded.
In another instance, sequences shorter than 250 amino acids were excluded to find novel BVMO (Baeyer-Villiger monooxygenase) enzymes. For PHA synthase, sequences of ~120 to 260 bp were considered prospective candidates in a database search. These search filters, along with others like e-value, can aid in gathering positive sequences that could code for a functional enzyme of appropriate length, and reduce the chance of false discovery or of random or irrelevant search results.
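A length filter of this kind is straightforward to apply to a set of candidate sequences in FASTA format. The following is a minimal sketch using only the Python standard library. The 100-residue floor follows the nitrilase example above, whereas the tolerance around the query length, the function names and the example records are assumptions introduced here for illustration.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA-formatted lines into a dict of {identifier: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the identifier.
            name = line[1:].split()[0] if len(line) > 1 else "unnamed"
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def filter_by_length(records: Dict[str, str], query_length: int,
                     min_length: int = 100,
                     tolerance: float = 0.25) -> Dict[str, str]:
    """Keep sequences above a hard floor and close to the query length."""
    low = max(min_length, query_length * (1 - tolerance))
    high = query_length * (1 + tolerance)
    return {n: s for n, s in records.items() if low <= len(s) <= high}


# Made-up records, for illustration only (60, 310 and 500 residues).
example_fasta = [
    ">frag hypothetical fragment", "M" + "A" * 59,
    ">cand1 hypothetical protein", "M" + "A" * 309,
    ">long1 hypothetical fusion", "M" + "A" * 499,
]
kept = filter_by_length(read_fasta(example_fasta), query_length=300)
print(sorted(kept))
```

Here only the 310-residue record survives, since the fragment falls below the floor and the 500-residue record lies outside the assumed 25% tolerance around a 300-residue query.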
In certain cases, motifs designed from selected protein sequences [e.g. by using MAST (Motif Alignment and Search Tool) in the MEME suite] can be used to search bacterial genomes. For example, a homology-based approach and motif search resulted in the identification of 138 putative/hypothetical protein sequences with the potential to code for nitrilase. Vaquero et al.
also adopted a homology-based strategy to screen for novel CalB-type lipases in fungal genomes using the blastp algorithm against the JGI and NCBI databases, with an e-value cut-off of 10⁻². In the same study, the conserved motif approach failed to identify putative lipase genes due to the absence of a conserved sequence motif generated by the MEME software. Therefore, different individual strategies, or combinations of them, should be implemented in the process of finding novel putative enzymes. A consensus-guided approach, using Pfam domains, can also be used to search databases for the presence of a particular enzyme family.
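The idea behind a conserved-motif search can be illustrated with a simple pattern scan over candidate protein sequences. Note that this sketch uses a plain regular expression as a stand-in; it is not how MAST or MEME work, since those tools score position-specific motif profiles statistically. The pattern shown, G-x-S-x-G, is the pentapeptide commonly found around the catalytic serine of many serine hydrolases and is used here only as an example; the function name and the sequences are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def find_motif(sequences: Dict[str, str],
               pattern: str) -> Dict[str, List[Tuple[int, str]]]:
    """Return (1-based start, matched text) pairs for each sequence with a hit."""
    # Wrapping the pattern in a lookahead also reports overlapping occurrences.
    regex = re.compile(f"(?=({pattern}))")
    hits: Dict[str, List[Tuple[int, str]]] = {}
    for name, seq in sequences.items():
        found = [(m.start() + 1, m.group(1)) for m in regex.finditer(seq.upper())]
        if found:
            hits[name] = found
    return hits


# Made-up sequences, for illustration only.
example_sequences = {
    "cand1": "MKTAGHSLGGALAT",
    "cand2": "MKTAAAPLLV",
}
# "." matches any residue, so "G.S.G" encodes Gly-x-Ser-x-Gly.
print(find_motif(example_sequences, "G.S.G"))
```

A rigid pattern of this kind misses family members in which the motif has diverged, which is consistent with the observation that a motif-based search can fail where a homology-based search succeeds.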
A consensus-guided approach was adopted by Shakeel et al. to obtain heat-stable alkane-producing enzymes, using the ado gene from Synechococcus elongatus PCC7942 as a query to search the IMG MER hot spring database. A consensus sequence was generated from the list of homologous sequences using bioinformatics tools and was further validated computationally and experimentally. Specific datasets, such as metagenomes from various ecosystems, can also be searched to obtain novel enzymes.
Around 264 putative monooxygenases were obtained when Pfam domain and blastp searches were used to find BVMOs among ~14 million protein-coding sequences present in a metagenomic dataset of cold marine sediments. Metagenome data of mangrove soil were explored to find polyhydroxyalkanoate (PHA) synthase genes. Adam et al. reported a novel activity-based approach to screen for H2-uptake enzymes from a hydrothermal metagenome.
Toyama et al. reported a novel β-glucosidase from the microbial metagenome of a lake in the Amazon. Tan et al. reported a novel thermostable phytase, screened from a metagenome database using a bioinformatics approach. The various steps and approaches used in gene mining from metagenome data have been discussed and reviewed recently, and the reader is referred to those articles and reviews for details.
The steps of in-silico bioprospecting can be modified according to the desired property of the enzyme. For example, if a thermostable enzyme is desired but the known enzyme is not thermostable, similarity searches in thermophiles will be useful to find putative thermostable enzymes. It has been commonly observed that thermostable enzyme sequences differ from their mesophilic counterparts. The putative thermophilic candidates found this way should be further analysed (discussed in Step 2) to make sure that residues important for structure and function are conserved.
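To illustrate what generating a consensus sequence from homologous sequences involves, the sketch below derives a simple majority-rule consensus from sequences that have already been aligned. It is only one possible rule and is not claimed to be the procedure used in the study mentioned above; the function name, the 50% threshold, the use of "X" for unresolved positions and the example alignment are assumptions made here.

```python
from collections import Counter
from typing import List


def consensus(aligned: List[str], gap: str = "-",
              min_fraction: float = 0.5, unknown: str = "X") -> str:
    """Build a majority-rule consensus from equal-length aligned sequences."""
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        if residue == gap:
            continue  # the gap is the most common symbol, so drop the column
        # Keep the residue only if enough sequences agree on it.
        result.append(residue if count / len(column) >= min_fraction else unknown)
    return "".join(result)


# Made-up alignment, for illustration only.
example_alignment = [
    "MKV-LA",
    "MRV-LA",
    "MKVGLS",
    "MKI-LA",
]
print(consensus(example_alignment))
```

The resulting consensus can then be used as a query or compared against individual candidates, after which computational and experimental validation would still be required.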
Using Bioinformatics Tools to Screen, Analyse and Shortlist Prospective Candidates
Once the primary list has been generated using various database search approaches, the next step is to analyse the physicochemical, phylogenetic and functional properties of the candidates using different bioinformatics tools. The ProtParam tool on the ExPASy server is widely used to assess physicochemical properties of putative candidates, such as the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index, and grand average of hydropathicity (GRAVY). The predicted values of all parameters of the putative enzyme(s) are compared to those of well-characterized enzymes, which affects the level of confidence in taking the putative enzyme(s) forward for experimental study.
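Two of these sequence-derived parameters are simple enough to compute directly, which makes the comparison against well-characterized enzymes easy to illustrate. The sketch below calculates GRAVY as the mean Kyte-Doolittle hydropathy value and the aliphatic index as X(Ala) + 2.9·X(Val) + 3.9·(X(Ile) + X(Leu)), where X is the mole percent of each residue. It is a stand-alone approximation for illustration, not a replacement for ProtParam; the function names, the range check and all example sequences are hypothetical.

```python
from typing import Iterable

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Grand average of hydropathicity: mean hydropathy over all residues."""
    seq = seq.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def aliphatic_index(seq: str) -> float:
    """Aliphatic index from mole percentages of Ala, Val, Ile and Leu."""
    seq = seq.upper()

    def mole_percent(aa: str) -> float:
        return 100.0 * seq.count(aa) / len(seq)

    return (mole_percent("A") + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def within_reference_range(value: float, reference: Iterable[float]) -> bool:
    """Check whether a value lies between the min and max of reference values."""
    reference = list(reference)
    return min(reference) <= value <= max(reference)


# Made-up sequences, for illustration only.
reference_enzymes = ["MKVLAAGIVALS", "MSTNDKKLEAGH", "MALVIGGSTPEK"]
candidate = "MKILAAGSTDEV"

reference_gravy = [gravy(s) for s in reference_enzymes]
print(round(gravy(candidate), 3), round(aliphatic_index(candidate), 1))
print(within_reference_range(gravy(candidate), reference_gravy))
```

A candidate whose values fall far outside the range seen for characterized members of the family may deserve closer inspection before it is taken forward.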
For example, ProtParam predicted the physicochemical properties of 138 putative nitrilases to be within the range of well-characterized nitrilases. All the parameters are derived from the protein sequence, i.e. the analysis is sequence-dependent; therefore, a complete or nearly complete sequence is necessary for accurate analysis and prediction of the various physicochemical properties. Phylogenetic analysis can be performed using tools like Molecular Evolutionary Genetics Analysis (MEGA).
For example, phylogenetic analysis of selected putative candidates belonging to the CalB family grouped the putative lipases into different clusters of known lipases depending on their evolutionary closeness, thus helping to decide on novel and unique candidates. Structural modelling of putative candidates can be performed using the SWISS-MODEL server or MODELLER v9.15 software.
Other tools can predict structural features such as signal peptides (e.g. SignalP) or disulfide linkages (e.g. DiANNA). For example, the DiANNA 1.1 web server predicted two disulfide bonds in PlicB, whereas CalB and Uml2 lack disulfide bonds. Protein functional domains and families are studied by comparing the list of putative enzyme(s) against databases like Pfam, CATH, SVM-Prot, CDART and SMART.
In one study, hypothetical proteins (HPs) were explored using tools based on domain architecture and profiles. Out of 124 HPs, 77 sequences were annotated with high confidence by using Pfam, CATH, SVM-Prot, CDART, SMART and ProtoNet, and among them, 16 were predicted as enzymes.
Authors: Asmita Kamble, Sumana Srinivasan, Harinder Singh