Jean Armengaud, Celine Bland, Joseph Christie-Oleza and Guylaine Miotello
No consensus has yet been reached on how to define the fundamental unit of biological diversity, the species, for prokaryotes. Although high-throughput molecular tools are now available to assess microbial diversity, estimating the total number of bacterial and archaeal species on Earth remains a challenge because of the huge number of low-abundance species present in environmental samples. Since the first complete cellular genome was sequenced, that of Haemophilus influenzae in 1995, more than seven thousand complete genomes have been reported. This avalanche of genome sequences is producing exceptional documentation of representatives of numerous taxa. While the annotation of these genomes has gained in accuracy with new gene prediction tools, proteogenomics has proved helpful in discovering new genes, identifying the true translational initiation codon of coding sequences, and characterizing maturation events at the protein level, such as signal peptide processing. Besides this structural annotation, proteogenomics can also yield significant insights into protein function. In essence, proteogenomics consists of acquiring massive amounts of protein sequence data through large-scale shotgun proteomic strategies and high-throughput tandem mass spectrometry, and then using these experimental data to improve genome annotation. Unexpected results, such as the reversal of gene sequences in different bacteria or the use of non-canonical start codons for translation in Deinococcus species, are only some of the numerous corrections documented so far. Today, the proteogenomic analysis of a set of representatives that fully covers the tree of life would provide a firmer basis for the accurate annotation of novel strains. This would improve comparative genomics studies and could help in assessing how closely related species differ.
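
The core idea of mapping experimentally identified peptides back onto a genome can be illustrated with a minimal sketch. It assumes the peptides have already been identified from tandem mass spectra by some upstream search, assumes the standard genetic code, and uses exact string matching against a six-frame translation. The function names (`reverse_complement`, `translate`, `six_frame_hits`) and the toy sequence are hypothetical and chosen only for illustration; this is not the workflow of any particular study.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the organism uses the standard codon assignments).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_hits(genome, peptide):
    """List (strand, frame, residue_offset) for every exact peptide match.

    The offset is counted in amino acids within the translated frame.
    """
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                hits.append((strand, frame, pos))
                pos = protein.find(peptide, pos + 1)
    return hits


if __name__ == "__main__":
    toy_genome = "ATGGCTAAAGGTTAA"  # hypothetical toy sequence
    print(six_frame_hits(toy_genome, "AKG"))
```

A peptide that matches outside any annotated gene, on the strand opposite to an annotated gene, or upstream of an annotated start codon is the kind of evidence that would prompt a closer look at the annotation. Real analyses additionally have to handle false-discovery control and alternative start codons, which this sketch deliberately leaves out.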