Datasets and Workflow

 
Data collection
Transcriptomic data from 414 marine micro-planktonic species were used. Of these, 406 were derived from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP). The most represented phylum, Bacillariophyta, also includes transcriptomes of the six most abundant diatom genera not covered by MMETSP. In addition, the transcriptomes of two species, Phaeodactylum tricornutum and Thalassiosira pseudonana, were assembled using an in-house script. The distribution of the species transcriptomes per group is as follows:
 
 
Workflow
Inspired by ensemble machine learning methods, the pipeline implements a majority voting procedure to predict the coding potential of transcripts. The aim is to increase reliability by combining the predictions of multiple, diverse models. To this end, ten coding potential prediction tools, including widely used ones such as CPC2 and CPAT, were selected and run in the pipeline. Each transcript sequence was submitted to every tool, which evaluated both the forward and the reverse strand and labeled each as “Coding” or “NonCoding”. Within a single tool, a transcript was predicted as “NonCoding” only if both strands were labeled “NonCoding”; otherwise it was assigned the “Coding” label. The majority label across all tools was taken as the final coding potential class. Transcripts satisfying all of the following conditions were then marked as non-coding: encoded peptide(s) shorter than 100 aa, no significant protein domain found in the Pfam database, and no significant hit in SwissProt. Finally, the non-coding transcripts were separated into lncRNAs and sncRNAs on the basis of transcript length.
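The voting logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the server's actual implementation: the function names are hypothetical, ties between tools are resolved to “Coding”, the lncRNA/sncRNA cutoff is set to 200 nt, and transcripts that win the non-coding vote but fail a filter receive a placeholder label “Unconfirmed”. None of these three choices is specified in the text.

```python
from collections import Counter


def tool_label(forward, reverse):
    """Per-tool rule: NonCoding only if both strands are NonCoding."""
    if forward == "NonCoding" and reverse == "NonCoding":
        return "NonCoding"
    return "Coding"


def majority_vote(strand_calls):
    """strand_calls holds one (forward_label, reverse_label) pair per tool."""
    votes = Counter(tool_label(fwd, rev) for fwd, rev in strand_calls)
    # Assumption: a tie is resolved to "Coding"; the text does not cover ties.
    return "NonCoding" if votes["NonCoding"] > votes["Coding"] else "Coding"


def classify(strand_calls, transcript_len, max_peptide_len,
             has_pfam_domain, has_swissprot_hit, lnc_min_len=200):
    """Combine the vote, the three filters, and the length-based split."""
    if majority_vote(strand_calls) == "Coding":
        return "Coding"
    passes_filters = (
        max_peptide_len < 100          # encoded peptide(s) shorter than 100 aa
        and not has_pfam_domain        # no significant Pfam domain
        and not has_swissprot_hit      # no significant SwissProt hit
    )
    if not passes_filters:
        # Placeholder label (assumption): the text does not name this case.
        return "Unconfirmed"
    # Assumption: 200 nt cutoff between long and small non-coding RNAs.
    return "lncRNA" if transcript_len > lnc_min_len else "sncRNA"


# Ten tools: six call both strands NonCoding, four call the forward strand Coding.
calls = [("NonCoding", "NonCoding")] * 6 + [("Coding", "NonCoding")] * 4
print(classify(calls, transcript_len=1500, max_peptide_len=40,
               has_pfam_domain=False, has_swissprot_hit=False))
```

With ten tools a 5/5 split is possible, so whatever tie rule is used in practice should be stated alongside the results.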
Performance evaluation
The majority voting pipeline was tested on transcriptomic datasets from 18 different species and compared with state-of-the-art coding potential prediction tools. The test data contained sets of coding and non-coding RNA sequences from the Ensembl and RefSeq databases, covering organisms from distinct representative clades. Each tool was applied to each test dataset to predict the class of every transcript, using its default parameters and pre-trained models. A cross-tabulation of observed and predicted classes was generated, and performance metrics including accuracy, sensitivity, specificity, and AUROC were calculated.
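As a small illustration of how these metrics follow from the cross-tabulation, the sketch below computes them in plain Python. It rests on assumptions the text does not state: “NonCoding” is treated as the positive class, and AUROC is computed from a per-transcript score (for example, the fraction of tools voting “NonCoding”). The function names are hypothetical.

```python
def confusion_counts(observed, predicted, positive="NonCoding"):
    """Cross-tabulate observed vs. predicted labels as (tp, fp, tn, fn)."""
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if pred == positive:
            if obs == positive:
                tp += 1
            else:
                fp += 1
        elif obs == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(observed, predicted, positive="NonCoding"):
    """Accuracy, sensitivity and specificity; both classes must be present."""
    tp, fp, tn, fn = confusion_counts(observed, predicted, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


def auroc(observed, scores, positive="NonCoding"):
    """AUROC as the probability that a positive outscores a negative.

    Ties count as one half (the Mann-Whitney formulation).
    """
    pos = [s for o, s in zip(observed, scores) if o == positive]
    neg = [s for o, s in zip(observed, scores) if o != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


observed = ["NonCoding", "NonCoding", "Coding", "Coding"]
predicted = ["NonCoding", "Coding", "Coding", "Coding"]
scores = [0.9, 0.4, 0.3, 0.6]  # hypothetical NonCoding scores
print(metrics(observed, predicted), auroc(observed, scores))
```

The pairwise AUROC is quadratic in the number of transcripts, which is fine for a demonstration; a rank-based computation would be preferable for large datasets.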
 
Server construction
The architecture of the LncPlankton server consists of two major components: a client web interface and a server backend (including the MySQL database). The client web interface interacts with users through the input and output displays and handles the service logic, including sequence validation and formatting. The server backend executes the whole prediction process described in the pipeline above, including feature calculation for the coding potential tools, prediction, and generation of data visualizations. Prediction results were processed using the R language. JSON (JavaScript Object Notation) is the main format used for communication between the server and the web interface. In addition, the server is integrated with the following components:
           - A standalone BLAST implemented via the interface rBLAST for online similarity search,
           - A standalone RNAfold implemented via the package LncFinder and RNAPlot implemented via the package RRNA for the calculation and the visualisation of secondary structure,
           - ORFfinder implemented via LncFinder for the exploration of lncRNAs containing sORFs, and the seqinr package for translation to peptide sequences.
These functionalities are provided via an express REST API web service (see API documentation) implemented using the R package Plumber. A shiny server function was also developed and integrated into the server component. This function processes requests from the shiny prediction app in the presentation layer and uses the NoSQL/static part of the data layer. The presentation layer contains several modules based on AJAX (Asynchronous JavaScript and XML), jQuery, and the PHP server-side scripting language, together with the CSS code that describes how HTML elements are displayed in the user-side web interface. jQuery and AJAX provide methods for making asynchronous requests to the logic tier using GET and POST, parsing the JSON response, and dynamically rendering the browser display.
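To make the validate-then-send flow concrete, the sketch below checks a FASTA submission and serializes it to JSON. It is written in Python purely for illustration; the actual client uses PHP and jQuery, and the real request schema is defined in the API documentation. The field names `sequences`, `id`, and `seq`, the accepted alphabet (A, C, G, T, U, N), and the function names are assumptions.

```python
import json
import re

# Assumption: nucleotide input restricted to A, C, G, T, U and N (any case).
VALID_BASES = re.compile(r"^[ACGTUN]+$", re.IGNORECASE)


def parse_fasta(text):
    """Return {sequence_id: sequence}; raise ValueError on malformed input."""
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            current = fields[0]
            if current in records:
                raise ValueError(f"duplicate identifier {current!r}")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data before the first FASTA header")
        elif not VALID_BASES.match(line):
            raise ValueError(f"invalid characters in sequence {current!r}")
        else:
            records[current] += line.upper()
    for seq_id, seq in records.items():
        if not seq:
            raise ValueError(f"empty sequence for {seq_id!r}")
    return records


def build_request(text):
    """Serialize validated sequences into a JSON body (hypothetical schema)."""
    records = parse_fasta(text)
    return json.dumps(
        {"sequences": [{"id": k, "seq": v} for k, v in records.items()]}
    )


print(build_request(">t1 example transcript\nACGT\nacgu\n"))
```

Rejecting malformed input before the request is sent keeps the prediction tools on the backend from being started on data they cannot process.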
 
Using the shiny prediction app

To allow users to customize their coding potential predictions, a dedicated shiny app was developed. The app provides a user-friendly interface for selecting the desired tool and the filters to be applied. For the visualization of the output, the app offers the following choices: table, stacked plot, circular plot, and map. As an example, predictions for the diatom species were generated using CPC2 with the following filters: reverse strand, transcript length > 200 bp, and peptide length < 100 aa.
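The effect of such filters on a set of predictions can be sketched as follows. This is a minimal illustration and not the app's code (the app itself is written in R/shiny): the `Prediction` record, its field names, and the `apply_filters` function are hypothetical, and the example values are invented for demonstration only.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Prediction:
    transcript_id: str
    strand: str          # "forward" or "reverse"
    transcript_len: int  # nucleotides
    peptide_len: int     # amino acids in the longest predicted peptide
    label: str           # "Coding" or "NonCoding"


def apply_filters(predictions, strand=None, min_transcript_len=None,
                  max_peptide_len=None):
    """Keep predictions passing every active filter (None disables a filter)."""
    kept = []
    for p in predictions:
        if strand is not None and p.strand != strand:
            continue
        if min_transcript_len is not None and p.transcript_len <= min_transcript_len:
            continue
        if max_peptide_len is not None and p.peptide_len >= max_peptide_len:
            continue
        kept.append(p)
    return kept


# Invented records, for demonstration only.
preds = [
    Prediction("t1", "reverse", 850, 42, "NonCoding"),
    Prediction("t2", "reverse", 180, 30, "NonCoding"),   # too short
    Prediction("t3", "forward", 900, 50, "NonCoding"),   # wrong strand
    Prediction("t4", "reverse", 1200, 310, "Coding"),    # peptide too long
]

# Filters from the example: reverse strand, length > 200 bp, peptide < 100 aa.
selected = apply_filters(preds, strand="reverse",
                         min_transcript_len=200, max_peptide_len=100)
print([p.transcript_id for p in selected])
print(Counter(p.label for p in selected))  # per-class counts for a table view
```

Treating each filter as optional mirrors the idea that users choose which filters to apply; the per-class counts are the kind of summary a table or stacked plot would display.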

LncPlankton Current Status
Phyla/Groups > 9
Species
Total transcripts
lncRNA-like
sncRNA-like
Coding-like
High-confidence lncRNAs