This project is an ANR project, part of the "Action de Recherche Amont (ARA) Masse de Données".
The project we propose aims to meet the challenges raised by the present and future deluge of genomic data. In short, we plan to design an integrated resource for microbial genomics. The objective is to gather as much relevant data as possible and to make it available to various data mining approaches in spite of its heterogeneity. A graphic interface will be designed to allow efficient, simple yet expressive queries in order to extract relevant pieces of knowledge through a visual and interactive system. This will make it possible to cross-fertilize and analyze all available genomic data in the finest detail.
Since 1995, when the first complete bacterial genome was sequenced, we have witnessed a profound change in the way biological problems are approached. Instead of studying a few genes involved in a particular biological system, biologists can now adopt an encyclopaedic approach, considering the whole genome of the organism. This new approach, known as integrative biology, is characterised by large amounts of data that need to be analysed. At the time of writing (June 2005), 230 microbial genomes have been sequenced (corresponding to some 600,000 proteins) and 370 more microbial genomes are expected in the near future. Moreover, genomic data are rather heterogeneous and spread over a number of biological collections: nucleic sequences, protein sequences, protein 3D structures, data from the scientific literature, interaction networks, results from DNA chips (microarrays) or proteomic experiments, etc.
It is important to note that these raw (primary) data are of little interest by themselves; only the biological knowledge that can be extracted from them is meaningful. In particular, the principal goal of biologists is to understand how the characteristics of a genome can explain the biological properties of an organism, exploring all the intermediate organisational levels that are pertinent for the living organism studied. The purpose of bioinformatics is to help experimental biologists in this task, which is complex because it requires integrating various data coming from many different sources.
In addition to the raw genomic sequence, biologists must take into account results from bioinformatics tools, data from general or specialized databases, and knowledge coming from the literature. Amongst all these secondary data, data and biological knowledge coming from other sequenced genomes are of increasing importance. Comparing different genomes is a rich source of information about the biological properties of the organisms. The information so gathered is extremely useful for annotating new genomes or reannotating already published genomes. It also makes it possible to study gene evolution (protein history) by detecting events such as duplication and/or fusion, to study genome evolution in terms of fluidity/rigidity (synteny conservation, rearrangement events, horizontal transfers), and can help to reconstruct the phylogenomic trees of organisms. Genomics therefore calls for the integration of data on a huge scale in both volume and complexity.
To address these challenges in the case of microbial genomics we intend to create an integrated resource, "Microbiogenomics", that will be organized around a data warehouse containing the primary and secondary data described above and will provide a number of tools to exploit the stored data. In order to set up an efficient system we have identified several bottlenecks that must be overcome. (i) To cope with the heterogeneity of the data, especially at the level of functional classification, we have to structure them with a unifying ontology. This point is crucial for microbial genomics, where different functional classifications are used and for which the well-known Gene Ontology is of little help. (ii) Once structured, these genomic data will be integrated into a single warehouse using an integrative schema that allows data coming from various sources (including our present relational databases) to be stored and queried in a uniform way. The challenge is to find an integrated solution that is, as far as possible, both tight (semantic schema matching and semantic data matching are performed) and flexible (evolution of the schema and of the number of sources considered is taken into account). (iii) Such an integrated resource will be an invaluable mine of information, ready for efficient extraction. To ease data mining, we wish to develop two kinds of tools: data mining tools that make use of the data to predict gene functional classes as accurately as possible, and innovative visualization tools (principally devoted to the analysis of phylogenetic trees) that take advantage of the capacity of human vision to facilitate the exploration of the data.
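As an illustration only, points (i) and (ii) can be sketched in a few lines of Python. This is not the schema or the ontology of the project: the source names ("source_a", "source_b"), the functional class labels, the "MBG:" term identifiers and the helper functions are all hypothetical, chosen simply to show how labels from different functional classifications might be mapped onto one unified term and then queried in a uniform way.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from (source, source-specific label) to a unified term.
# Source names, labels and term identifiers are invented for illustration.
UNIFIED_TERMS = {
    ("source_a", "1.2.3 amino acid biosynthesis"): "MBG:0001",
    ("source_b", "AA_BIOSYNTH"): "MBG:0001",
    ("source_a", "4.1 transport"): "MBG:0002",
}


@dataclass(frozen=True)
class Annotation:
    """One uniform row of the (hypothetical) integrated store."""
    gene_id: str
    source: str
    unified_term: Optional[str]  # None when no mapping is known


def integrate(records):
    """Convert (gene_id, source, label) tuples into uniform Annotation rows."""
    return [
        Annotation(gene_id, source, UNIFIED_TERMS.get((source, label)))
        for gene_id, source, label in records
    ]


def genes_with_term(annotations, term):
    """Return the sorted gene identifiers annotated with a unified term."""
    return sorted({a.gene_id for a in annotations if a.unified_term == term})


rows = integrate([
    ("geneX", "source_a", "1.2.3 amino acid biosynthesis"),
    ("geneY", "source_b", "AA_BIOSYNTH"),
    ("geneZ", "source_b", "UNKNOWN_LABEL"),  # unmapped label stays visible
])
print(genes_with_term(rows, "MBG:0001"))
```

Keeping unmapped labels as explicit `None` values, rather than dropping them, is one simple way to leave room for the ontology and the set of sources to evolve, in the spirit of the flexibility requirement stated in (ii).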
There are a number of existing warehouses whose aim is to store curated primary genomic data. RefSeq at the NCBI provides a non-redundant collection of microbial genome sequences for a number of species derived from GenBank. These sequences are, to some extent, curated by human experts. TIGR has developed the Comprehensive Microbial Resource, which contains robust annotation for all the complete microbial genomes and allows for a wide variety of data retrieval. The EBI is currently developing a project called Genome Reviews whose goal is to provide an up-to-date, standardised and comprehensively annotated view of the genomic sequences of organisms with completely deciphered genomes. The PEDANT genome database at the MIPS provides an exhaustive computational analysis of genomic sequences for 21 archaea, 218 eubacteria and 25 eukaryotes. The HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes) project aims to automatically reannotate, in the framework of the Swiss-Prot protein knowledge base, proteomes presently originating from 188 bacterial and 21 archaeal sequenced genomes. As can be seen from this list, which is by no means exhaustive, the community feels the need to gather microbial genomic primary data. In these projects a strong emphasis is put on the standardisation of the data, which is a necessary condition for further exploitation.
Fewer projects are devoted to gathering and organizing secondary data. At the EMBL, the STRING database aims to simplify access to ‘protein associations’ derived not only from predictions based on gene context analysis but also from the mining of databases and literature or from the results of high-throughput experiments. In France, one of the teams of the Genoscope, the Atelier de Génomique Comparative, has developed PkGDB, a relational database whose goal is to gather “clean” annotation data. This database contains primary data coming from RefSeq that are corrected if required (in terms of gene prediction). It also contains synteny conservation data, which are used during the annotation of new genomes or the re-annotation of existing ones. The PBIL (Pôle Bioinformatique Lyonnais) has replaced HOBACGEN by HOGENOM, a database of homologous genes from fully sequenced organisms, structured under the ACNUC sequence database management system.
Thus, few projects integrate primary and secondary data in the same warehouse. They also differ strongly in the tools they provide for mining the data (in fact, a number of them are only concerned with standardising the annotation data and with retrieval facilities). Another aspect we consider of prime importance is the interface between the users and the system. In our experience, graphic interfaces allowing an easy visualization of the data are really useful to experimental biologists when they need to inspect and cross-reference a large number of heterogeneous data in the course of their work. The international context regarding this last point is described below (in section 2.4). Finally, the ease of extracting the requested data in a convenient output format is a crucial point.
The MIG (INRA, Jouy) and EMBG (IGM, Orsay) groups are engaged in vast programmes of exhaustive comparison of microbial genomes, which collect numerous data belonging to different fields and displayed in various formats. Both groups share many needs at several steps of their respective analyses, such as reannotating genomes, understanding the relationship between the structure of a protein and its function, and analysing the modular structure of proteins. Moreover, the MIG and EMBG groups are already engaged in an active collaboration with the Laboratoire de Recherche en Informatique (Bioinformatique and IN SITU groups, Orsay), which has an acknowledged expertise in storage and analysis of massive biological data, data source integration, knowledge representation, data mining, visualization of data, etc. Bringing together the competences of our laboratories would allow us to take up challenges such as storing and analysing huge amounts of genomic data in order to optimise the extraction of relevant information. Our longer-term goal is to build an integrated resource that makes new discoveries systematically possible, such as the one the EMBG group recently made in the case of the so-called orphan enzymes. Moreover, the properties (durability, ease of use, possibilities of sophisticated analyses) of this integrated resource would make it an attractive tool for our communities of bioinformaticians and of experimental biologists. Indeed, opening new fields