Genetic database construction and sequence sampling
Sequences for nirS and eNOR genes from SURF MAG 42 (Table S1) were used as queries to BLAST (Camacho et al., 2009) three genomic repositories:
  1. Genome databases constructed for 21 Chloroflexi genomes assembled from deep-subsurface MAG data (Jungbluth, Amend and Rappé, 2017; Momper et al., 2017) (Table S1).
  2. Genome databases constructed for 86 recently assembled sludge bioreactor MAGs (Parks et al., 2017) (Table S3).
  3. The full NCBI non-redundant protein database (as of 25 September 2019) (Agarwala et al., 2018).
Additionally, putative environmental homologs were evaluated by using protein sequences from SURF MAG 42 to query NCBI’s non-redundant environmental metagenomic sequence database (env-nr, as of June 2020) (Agarwala et al., 2018) (Supplementary Datafile S2).
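The searches described above could be scripted roughly as follows. This is a minimal sketch, not the commands actually run: it assumes protein queries searched with BLAST+ `blastp`, and the database labels, paths, and query filename are hypothetical placeholders. Passing the E-value cutoff at search time is also an assumption; the text applies the cutoff when hits are assessed.

```python
from pathlib import Path

# Hypothetical database labels and paths; the real locations are not given in the text.
DATABASES = {
    "chloroflexi_mags": "db/chloroflexi_mags",
    "bioreactor_mags": "db/bioreactor_mags",
    "nr": "db/nr",
    "env_nr": "db/env_nr",
}


def blastp_command(query_fasta, db_path, out_tsv, evalue=1e-10):
    """Return a BLAST+ blastp argument list that writes tabular (outfmt 6) hits."""
    return [
        "blastp",
        "-query", str(query_fasta),
        "-db", str(db_path),
        "-evalue", str(evalue),  # assumption: cutoff applied at search time
        "-outfmt", "6",
        "-out", str(out_tsv),
    ]


def build_searches(query_fasta, out_dir="hits"):
    """Build one command per database, keyed by database label."""
    out = Path(out_dir)
    return {
        label: blastp_command(query_fasta, db, out / f"{label}.tsv")
        for label, db in DATABASES.items()
    }
```

Each returned argument list can be passed to `subprocess.run(cmd, check=True)` on a system where BLAST+ is installed and the databases have been formatted.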
Hits from all databases (Table S4) were combined and assessed for quality; hits with E ≤ 1 × 10⁻¹⁰ were included for initial analyses. To capture diversity while limiting imprecision and biased sampling of overrepresented groups (e.g., Proteobacteria), hits were subsampled to the genus level, except for members of the Chloroflexi, which were all retained to fully capture the taxonomic distribution of the novel gene variant. One additional, divergent multispecies hit was allowed per genus. The genus-level filter was also removed for C1, where non-Chloroflexi hits were severely limited (see below). Duplicate sequences (from strains with multiple genome entries or present in more than one of the databases surveyed) were removed.
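The filtering logic above can be illustrated with a short sketch. It is a simplified approximation under stated assumptions, not the procedure actually used: the `Hit` record and its fields are hypothetical, the best-scoring hit is taken as the genus representative, any multispecies entry is accepted as the one additional hit per genus without testing divergence, and the C1 exception is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    # Hypothetical record layout; field names are illustrative only.
    accession: str
    genus: str
    phylum: str
    evalue: float
    sequence: str
    multispecies: bool = False


def subsample_hits(hits, max_evalue=1e-10, exempt_phylum="Chloroflexi"):
    """Apply the E-value cutoff, drop duplicate sequences, and keep one hit
    per genus (plus one multispecies hit) outside the exempt phylum."""
    kept, seen_sequences, per_genus = [], set(), set()
    # Visit the best hits first so each genus is represented by its lowest E-value.
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue or hit.sequence in seen_sequences:
            continue
        if hit.phylum != exempt_phylum:
            slot = (hit.genus, hit.multispecies)
            if slot in per_genus:
                continue  # this genus already has a hit of this kind
            per_genus.add(slot)
        seen_sequences.add(hit.sequence)
        kept.append(hit)
    return kept
```

Under these assumptions, every Chloroflexi hit that passes the cutoff and is not an exact duplicate is retained, while each other genus contributes at most two sequences.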