find-a-gene.py to find novel genes
For details on find-a-gene.py, please see GitHub.
Human synapsin Ia was chosen as the starting gene and used as the query in a TBLASTN search of the NCBI EST database. Hits from human, mouse, and rat were discarded because genes from these organisms are highly likely to be "known" already.
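The organism filter can be sketched as follows. This is a minimal illustration, not the code in find-a-gene.py: the hit dictionaries, the `organism` key, and the function name `drop_excluded_organisms` are assumptions made for the example.

```python
# Hypothetical sketch: each hit is a dict with an "organism" field.
# The real script may represent TBLASTN hits differently.
EXCLUDED_ORGANISMS = {"homo sapiens", "mus musculus", "rattus norvegicus"}


def drop_excluded_organisms(hits, excluded=EXCLUDED_ORGANISMS):
    """Return the hits whose organism is not in the excluded set.

    The comparison ignores case and surrounding whitespace.
    """
    return [h for h in hits if h["organism"].strip().lower() not in excluded]


# Made-up accessions, for illustration only.
example_hits = [
    {"accession": "EST_A", "organism": "Homo sapiens"},
    {"accession": "EST_B", "organism": "Danio rerio"},
    {"accession": "EST_C", "organism": "Rattus norvegicus"},
]
kept = drop_excluded_organisms(example_hits)
```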
The results were then filtered based on identity, similarity, and gaps. The thought is that sequences too similar to the original (potentially well-studied) sequence are more likely to be annotated already. While this theory was not empirically validated, the results were good!
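A filter of this kind might look like the sketch below. The cutoff values and the direction of the similarity and gap tests are assumptions chosen for illustration; the thresholds actually used are not stated here.

```python
# Hypothetical thresholds (percentages); not the values used by find-a-gene.py.
def passes_filter(hit, max_identity=90.0, min_similarity=40.0, max_gaps=10.0):
    """Keep hits that are related to the query but not near-identical to it."""
    return (
        hit["identity"] <= max_identity        # too similar: probably annotated
        and hit["similarity"] >= min_similarity  # assumed floor to drop noise
        and hit["gaps"] <= max_gaps              # assumed cap on gapped alignments
    )


def filter_hits(hits, **thresholds):
    """Apply passes_filter to every hit, forwarding any custom thresholds."""
    return [h for h in hits if passes_filter(h, **thresholds)]


# Made-up hits, for illustration only.
example_hits = [
    {"accession": "EST_1", "identity": 99.0, "similarity": 100.0, "gaps": 0.0},
    {"accession": "EST_2", "identity": 72.0, "similarity": 85.0, "gaps": 3.0},
    {"accession": "EST_3", "identity": 30.0, "similarity": 35.0, "gaps": 2.0},
]
filtered = filter_hits(example_hits)
```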
The top 50 hits from the filtered results were checked for novelty by running a BLAST search for each of them against the non-redundant database. By Pevsner's criteria, a gene is novel if there is no 100% identity match from the query organism. If the original gene of interest is not found in the results, the hit is likely a false positive.
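The decision rule described above can be expressed as a short function. This is a sketch under stated assumptions: the hit fields are hypothetical, and recognizing the original gene by a keyword in the hit description is a simplification introduced for the example.

```python
def classify_candidate(nr_hits, query_organism, original_gene="synapsin"):
    """Classify a candidate from its hits against the non-redundant database.

    nr_hits: list of dicts with "organism", "identity" (percent), "description".
    Returns "likely false positive", "known", or "novel".
    """
    # Assumption: the original gene is recognized by a keyword in the description.
    found_original = any(
        original_gene.lower() in h["description"].lower() for h in nr_hits
    )
    if not found_original:
        return "likely false positive"

    # A 100% identity match from the candidate's own organism means it is known.
    has_exact_match = any(
        h["organism"].lower() == query_organism.lower() and h["identity"] >= 100.0
        for h in nr_hits
    )
    return "known" if has_exact_match else "novel"
```

The false-positive test runs first, so a candidate that never recovers the original gene is not reported as novel merely because it lacks an exact match.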
Multiple sequence alignment was performed using Clustal Omega for both the full set of genes checked for novelty (regardless of the result) and the discovered novel genes only. In both cases, the original sequence and 3 comparator sequences (human synapsins IIa, IIb, and III) were included. Phylogenetic trees were calculated from these alignments using PhyML with nonparametric bootstrapping for refinement.
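Assembling the alignment input amounts to writing the candidate sequences together with the original and comparator sequences into one FASTA file. The helper below is a hypothetical sketch of that step; the function name and the placeholder sequences are assumptions, not real data.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapping sequences at width."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        # Split the sequence into fixed-width chunks, one per line.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Placeholder sequences, not real synapsin data.
records = [
    ("synapsin_Ia_original", "MNYLRRRLSD"),
    ("candidate_1", "MNFLRRRLSE"),
]
fasta_text = to_fasta(records, width=5)
```

The resulting text can be saved to a file and supplied to an alignment tool such as Clustal Omega.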
For synapsin Ia and the novel genes, homology modeling was performed using Prime (Schrödinger). Briefly, a BLASTP search was performed on the protein sequence, limited to proteins with structures deposited in the PDB. The template for the model was chosen by a combination of E-value and maximum overlap with the query sequence. After the model was generated, the structure was refined using constrained molecular dynamics, and loops not directly fit to the PDB structure were further refined.
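One way to combine E-value and overlap when picking a template is sketched below. The exact rule used is not specified here, so this version is an assumption: it discards hits above an E-value cutoff, then prefers the hit covering the largest fraction of the query, breaking ties by lower E-value. The field names and the cutoff are hypothetical, and the sketch is independent of Prime.

```python
def coverage(hit, query_length):
    """Fraction of the query covered by the aligned region (1-based, inclusive)."""
    return (hit["query_end"] - hit["query_start"] + 1) / query_length


def choose_template(hits, query_length, max_evalue=1e-5):
    """Pick the significant hit with the greatest query coverage.

    Returns None if no hit meets the assumed E-value cutoff.
    """
    candidates = [h for h in hits if h["evalue"] <= max_evalue]
    if not candidates:
        return None
    # Highest coverage wins; a lower E-value breaks ties.
    return max(candidates, key=lambda h: (coverage(h, query_length), -h["evalue"]))


# Made-up hits, for illustration only.
example_hits = [
    {"pdb_id": "template_1", "evalue": 1e-50, "query_start": 100, "query_end": 400},
    {"pdb_id": "template_2", "evalue": 1e-30, "query_start": 50, "query_end": 450},
    {"pdb_id": "template_3", "evalue": 0.5, "query_start": 1, "query_end": 700},
]
best = choose_template(example_hits, query_length=700)
```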