Genetic diversity analysis of the study cohorts
In this analysis, PCA was used to compare and assess the genetic diversity of our study cohorts with respect to the 1000 Genomes and IGV populations (Figure S2). As expected, the 1000 Genomes European (EUR), American (AMR) and African (AFR) super populations are distant from the Indian populations, while SAS is proximal to the majority of the IGV large populations. The TB group, in contrast, is closer to the EAS super population than to any other 1000 Genomes super population or to the other IGV populations (Figure 2). OG-W-IP (an outgroup population of African descent), which was earlier demonstrated to be an admixed Indo-African population from the western part of India, lies on a cline between the Indian and African populations (Narang et al., 2011; Shah et al., 2011). Further, we excluded the 1000 Genomes AFR, AMR and EUR super populations as well as the Indian outgroup population (OG-W-IP) to fine-map the genetic structure. We observed that the majority of the IE and DR large populations are proximal to the 1000 Genomes SAS group (1kg_SAS). However, the AA and DR isolated populations as well as the TB genetic cluster are under-represented in the 1000 Genomes. The EAS group (1kg_EAS) in the 1000 Genomes is genetically distinct from the populations in the TB cluster (FST = 0.01-0.02) (Figure S3). Under-representation of Indian genomic diversity in the 1000 Genomes was reported earlier and is consistent with our findings (Sengupta, Choudhury, Basu, & Ramsay, 2016). In addition, the recently published GAsP project lacks representation from the TB group and has comparatively few samples in the SAS group (n=724), which might bias frequency estimates for that group.
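To illustrate the kind of projection that underlies such a PCA plot, the following is a minimal sketch only; the text does not state which software or settings were used, so nothing here should be read as the authors' pipeline. It assumes a small samples-by-SNPs matrix of 0/1/2 genotype counts, centres each SNP, and extracts the first principal axis by power iteration on the sample-by-sample Gram matrix. The function name `first_pc_scores` and the toy genotypes are hypothetical.

```python
import math


def first_pc_scores(genotypes, iterations=200):
    """Return unit-length PC1 scores (one per sample) for a 0/1/2 genotype matrix.

    genotypes: list of rows, one row per sample, one column per SNP.
    The sign of the returned axis is arbitrary, as in any PCA.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Centre each SNP (column) on its mean genotype.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # Sample-by-sample Gram matrix; its leading eigenvector is proportional
    # to the PC1 scores of the samples.
    gram = [[sum(a * b for a, b in zip(ri, rk)) for rk in centered]
            for ri in centered]
    # Non-constant start vector: a constant vector lies in the null space
    # of the Gram matrix of column-centred data.
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            raise ValueError("genotype matrix has no variation")
        vec = [x / norm for x in nxt]
    return vec


# Hypothetical toy data: samples 0-2 and samples 3-5 form two groups.
toy_genotypes = [
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [2, 2, 1, 2],
    [2, 1, 2, 2],
    [2, 2, 2, 1],
]
toy_scores = first_pc_scores(toy_genotypes)
```

In this toy example the two groups fall on opposite sides of zero along the first axis, which is the same separation that the population clusters show in a PCA plot. Analyses on real data normally rely on dedicated population-genetics software and commonly also scale each SNP by its expected variance, which this sketch omits.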
Lastly, we compared the genetic diversity of our study cohorts with the IGV populations as well as the 1000 Genomes SAS and EAS groups. Figure 3 shows that our cohorts include representation of the IE and DR large populations as well as the TB group (high-altitude populations). Representation of the AA and DR isolated groups is also lacking in our study samples. However, FST analysis suggests that our study cohorts are closer to the IGV populations than 1kg_SAS is. More specifically, the AA and DR isolated groups as well as the TB low-altitude populations are genetically closer to our study cohorts than to 1kg_SAS.
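The pairwise FST comparisons above can be illustrated with a short sketch. The text does not say which FST estimator was used, so the choice of Hudson's estimator (computed as a ratio of sums across SNPs) is an assumption made purely for illustration; the function name `hudson_fst` and the example frequencies are hypothetical. For allele frequencies $p_1$, $p_2$ and $n_1$, $n_2$ sampled chromosomes, the per-SNP numerator is $(p_1-p_2)^2 - p_1(1-p_1)/(n_1-1) - p_2(1-p_2)/(n_2-1)$ and the denominator is $p_1(1-p_2) + p_2(1-p_1)$.

```python
def hudson_fst(p1, p2, n1, n2):
    """Hudson's FST between two populations, as a ratio of sums over SNPs.

    p1, p2: per-SNP allele frequencies in population 1 and population 2.
    n1, n2: number of sampled chromosomes in each population (must be > 1).
    """
    if len(p1) != len(p2):
        raise ValueError("frequency lists must cover the same SNPs")
    if n1 < 2 or n2 < 2:
        raise ValueError("each population needs at least two chromosomes")
    numerator = 0.0
    denominator = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, corrected for sampling variance.
        numerator += ((a - b) ** 2
                      - a * (1 - a) / (n1 - 1)
                      - b * (1 - b) / (n2 - 1))
        # Probability that two alleles drawn from different populations differ.
        denominator += a * (1 - b) + b * (1 - a)
    if denominator == 0.0:
        raise ValueError("no polymorphic SNPs between the two populations")
    return numerator / denominator


# Hypothetical allele frequencies at five SNPs (not taken from the study).
cohort_freqs = [0.10, 0.35, 0.50, 0.72, 0.90]
reference_freqs = [0.14, 0.30, 0.55, 0.70, 0.85]
fst_example = hudson_fst(cohort_freqs, reference_freqs, n1=200, n2=200)
```

Under this estimator, a smaller value for one pair of populations than for another indicates that the first pair is genetically closer, which is the sense in which "closer" and "proximal" are used above. Two populations fixed for different alleles at every SNP give a value of 1, and two samples with identical frequencies give a value slightly below 0 because of the sampling correction.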