Materials and Methods
Construction of human transcription factor and miRNA regulatory networks
The human transcription factor (TF) and miRNA regulatory networks were built by integrating miRTarBase, TarBase, TRANSFAC and TransmiR 14-16. These four databases contain curated interactions among human TFs, miRNAs, and target genes. Gene and miRNA names within the regulatory network were standardized using data from the NCBI and miRBase databases. Additionally, all regulatory relationships within the regulatory network were literature-supported. In total, the regulatory network contained 460 TFs, 2,434 miRNAs, 13,898 target genes and 98,894 edges.
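The integration step can be pictured as taking the union of directed regulator-to-target edges after mapping every name to one standard identifier. The following is a minimal sketch under that assumption only; the function `merge_edges`, the alias table, and the toy records are hypothetical and are not taken from any of the four databases.

```python
def merge_edges(sources, aliases):
    """Union directed edges from several sources after name standardization.

    sources: dict mapping a source label to an iterable of (regulator, target)
    aliases: dict mapping an alternative name to its standard name
    """
    merged = set()
    for edges in sources.values():
        for regulator, target in edges:
            # Names without an alias entry are kept unchanged.
            merged.add((aliases.get(regulator, regulator),
                        aliases.get(target, target)))
    return merged


# Toy, made-up records used only to show the behaviour.
sources = {
    "db_a": [("TF1", "miR-x"), ("miR-x", "GENE1")],
    "db_b": [("tf1_alias", "miR-x"), ("miR-x", "GENE2")],
}
aliases = {"tf1_alias": "TF1"}
edges = merge_edges(sources, aliases)
```

Storing edges in a set means that an interaction reported by more than one source is counted once after standardization.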
Known HCC and HCV-associated genes and miRNAs
DisGeNET, a discovery platform containing one of the largest publicly available collections of genes and variants associated with human diseases, was used to identify genes associated with the two diseases 17. miRNAs associated with the two diseases were collected from miR2Disease 18 and HMDD 19, which are curated databases of experimentally supported associations between human microRNAs (miRNAs) and diseases. We also used genes in the KEGG pathways associated with HCC (168) or HCV (155). For HCC, we included 30 known HCC-associated genes from DisGeNET and 463 known HCC-associated miRNAs from either miR2Disease or HMDD. For HCV, 18 known HCV-associated genes from DisGeNET and 100 known HCV-associated miRNAs from either miR2Disease or HMDD were used for network analysis.
Disease-related network construction
Construction of the disease-related network relied on the observation that the closer a node lies to known disease genes in the network, the more likely it is to be disease-associated 20. To construct a more closely related subnetwork, we selected nodes directly connected to the known disease-associated genes in the background network to build an HCC- and HCV-related network. In total, this regulatory network contained 409 TFs, 2,300 miRNAs, 10,697 target genes and 48,423 edges.
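The first-neighbor selection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `first_neighbor_subnetwork` and the toy network are hypothetical, and the sketch assumes that edges between two selected neighbors are also retained (an induced subgraph), which the text does not specify.

```python
def first_neighbor_subnetwork(edges, seeds):
    """Keep seed nodes plus nodes one step away, and the edges among them.

    edges: iterable of directed (regulator, target) pairs
    seeds: known disease-associated genes or miRNAs
    """
    edges = list(edges)
    seeds = set(seeds)
    keep = set()
    for u, v in edges:
        # An edge touching a seed brings both of its endpoints into the set.
        if u in seeds or v in seeds:
            keep.update((u, v))
    # Assumption: retain every background edge whose two ends were kept.
    return {(u, v) for u, v in edges if u in keep and v in keep}


# Toy background network with one seed gene, G1.
background = [
    ("TF1", "G1"), ("miR1", "G1"), ("G1", "miR2"),
    ("TF1", "miR1"),              # links two neighbors of G1
    ("TF1", "TF2"), ("TF2", "G2"),  # TF2 and G2 are not adjacent to G1
]
subnet = first_neighbor_subnetwork(background, {"G1"})
```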
Differentially expressed genes in the three datasets
The normalized mRNA expression profiles of HCC (TCGA), HCV (GSE15387) and HCV-related HCC (GSE44074) were downloaded from the Gene Expression Omnibus (GEO) database 21 and The Cancer Genome Atlas (TCGA) database 22. The TCGA dataset contained 374 HCC samples and 50 normal samples, GSE44074 contained 35 HCV-related HCC samples and 37 HCC samples, and GSE15387 contained 60 HCV samples and 60 normal samples. For mRNA expression data, probe sets were mapped to Entrez Gene IDs. When multiple probes corresponded to the same gene, the mean expression value of these probes was used to represent the gene expression level. We obtained 2, 3, and 4 differentially expressed genes in the three datasets at p-values of less than 0.05, using edgeR (TCGA data) and SAM (GEO data).
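The probe-to-gene averaging step can be illustrated with a short sketch. The function `collapse_probes`, the probe names, and the gene IDs below are hypothetical placeholders; the sketch also assumes that probes without a gene mapping are dropped, which the text does not state.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes(probe_values, probe_to_gene):
    """Average per-sample expression over all probes mapped to one gene.

    probe_values: dict probe ID -> list of expression values (one per sample)
    probe_to_gene: dict probe ID -> gene ID
    """
    by_gene = defaultdict(list)
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # assumption: unmapped probes are discarded
        by_gene[gene].append(values)
    # zip(*rows) walks the samples; mean() averages the probes per sample.
    return {gene: [mean(col) for col in zip(*rows)]
            for gene, rows in by_gene.items()}


# Toy values for two samples; IDs are invented.
probe_values = {"p1": [2.0, 4.0], "p2": [4.0, 6.0],
                "p3": [1.0, 1.0], "p4": [9.0, 9.0]}
probe_to_gene = {"p1": "101", "p2": "101", "p3": "202"}
gene_values = collapse_probes(probe_values, probe_to_gene)
```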
Identification of the subnetworks for each dataset
To construct a subnetwork for each dataset, we extracted the differentially expressed genes and their neighbor genes from the disease-related network. The regulatory relationships between these genes and miRNAs constituted a core regulatory subnetwork at multiple stages of disease development. We identified three subnetworks, which we termed the HCC subnetwork, the HCC-HCV subnetwork and the HCV subnetwork.
Extraction of candidate risk regulatory pathways
We used the breadth-first search (BFS) algorithm to extract risk regulatory pathways from the three subnetworks. We identified all pathways in each network that run from a node with in-degree 0 to a node with out-degree 0, and pathways with a length greater than 2 were regarded as candidate risk pathways.
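A breadth-first enumeration of such pathways might look like the sketch below. It is a hedged illustration rather than the authors' code: the function `candidate_pathways` and the toy edges are hypothetical, path length is assumed to be counted in edges, and nodes are assumed not to repeat within a pathway so that cycles terminate.

```python
from collections import defaultdict, deque


def candidate_pathways(edges, min_length=3):
    """Enumerate simple paths from in-degree-0 to out-degree-0 nodes via BFS.

    min_length is the minimum number of edges; 3 encodes "length greater
    than 2" under the assumption that length is counted in edges.
    """
    succ = defaultdict(list)
    nodes, has_incoming = set(), set()
    for u, v in edges:
        succ[u].append(v)
        nodes.update((u, v))
        has_incoming.add(v)
    sources = sorted(nodes - has_incoming)  # nodes with in-degree 0

    paths = []
    queue = deque([s] for s in sources)     # each queue item is a partial path
    while queue:
        path = queue.popleft()
        nexts = succ.get(path[-1], [])
        if not nexts:                       # out-degree 0: the path ends here
            if len(path) - 1 >= min_length:
                paths.append(path)
            continue
        for nxt in nexts:
            if nxt not in path:             # assumption: no repeated nodes
                queue.append(path + [nxt])
    return paths


# Toy subnetwork with invented node names.
toy_edges = [("TF1", "miR1"), ("miR1", "TF2"), ("TF2", "G1"),
             ("TF1", "G2"), ("miR1", "G3")]
pathways = candidate_pathways(toy_edges)
```

In the toy network only TF1 → miR1 → TF2 → G1 has more than two edges, so the two shorter routes from TF1 are discarded. The number of simple paths can grow exponentially in dense networks, so a sketch like this is practical only for subnetworks of moderate size.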
Prediction of key regulators
Gene expression varies in different tissues and during different diseases. Some genes are expressed at a specific stage of a given disease, while some genes continue to play a role throughout the process. We analyzed all the pathways in the three subnetworks to identify the most critical pathways in each network by examining highly shared genes. We propose a KP score to evaluate key pathways, which is calculated as follows:
The score is defined in terms of the following quantities: the number of nodes on a pathway within a subnetwork; the number of intersection nodes between the pathway and the subnetwork; the length of the longest pathway within the subnetwork that satisfies the conditions; the location weight score of an intersection gene within the pathway; and an indicator for each position on the pathway, which equals 1 if the gene at that position is an intersection gene and 0 otherwise. Upstream genes receive higher location weight scores.
Survival analysis
In this study, we constructed three subnetworks for HCC, HCV samples and normal samples, and identified key pathways from the subnetworks. We next investigated whether the key regulators could distinguish HCC patients with good outcomes from those with poor outcomes. We obtained mRNA expression, miRNA expression and clinical information for the TCGA HCC dataset. Next, we used the K-means method (K=2) to cluster all patients into two groups based on their mRNA and miRNA expression. Finally, Kaplan–Meier curves and the log-rank test were used to evaluate the difference in overall survival time between the two groups of patients.
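For readers unfamiliar with the Kaplan–Meier estimator, the sketch below computes the survival curve for one patient group from follow-up times and event indicators. It covers only the curve estimation, not the K-means clustering or the log-rank test, and the function name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times: follow-up time per patient
    events: 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Gather all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored patients leave the risk set
    return curve


# Toy follow-up data for five invented patients.
curve = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Running the function once per cluster gives the two curves that a log-rank test would then compare.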