This is an openaccess article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Bioinformatics and Biotechnology, is properly cited. The complete bibliographic information, a link to the original publication on https://bioinform.jmir.org/, as well as this copyright and license information must be included.
Large amounts of biological data have been generated over the last few decades, encouraging scientists to look for connections between genes that cause various diseases. Clustering illustrates such a relationship between numerous species and genes. Finding an appropriate distancelinkage metric to construct clusters from diverse biological data sets has thus become critical. Pleiotropy is also important for a gene’s expression to vary and create varied consequences in living things. Finding the pleiotropy of genes responsible for various diseases has become a major research challenge.
Our goal was to establish the optimal distancelinkage strategy for creating reliable clusters from diverse data sets and identifying the common genes that cause various tumors to observe genes with pleiotropic effect.
We considered 4 linking methods—single, complete, average, and ward—and 3 distance metrics—Euclidean, maximum, and Manhattan distance. For assessing the quality of different sets of clusters, we used a fitness function that combines silhouette width and withincluster distance.
According to our findings, the maximum distance measure produces the highestquality clusters. Moreover, for medium data set, the average linkage method, and for large data set, the ward linkage method works best. The outcome is not improved by using ensemble clustering. We also discovered genes that cause 3 different cancers and used gene enrichment to confirm our findings.
Accuracy is crucial in clustering, and we investigated the accuracy of numerous clustering techniques in our research. Other studies may aid related works if the data set is similar to ours.
A substantial amount of genetic data began to accumulate in the hands of bioinformatics experts at the turn of the 21st century. The process was sped by advances in technology hardware and improved computer algorithms. Scientists began storing all of this genomic information in sequential data [
Microarray [
Sample microarray data.
Genes  Time 1  Time 2  Time 3 
Gene 1  0.25  0.22  0.65 
Gene 2  –0.75  1.25  –0.63 
Gene 3  0.05  0.66  0.75 
Gene 4  1.25  –0.52  0.15 
Researchers have been extracting valuable biological information from microarray data. The construction of a phylogenetic tree is one of the most extensively used methodologies [
Different clustering approaches have been presented to extract information from the microarray data set [
Various distance methods are employed depending on the data set’s characteristics. The way the 2 distance methods determine the difference between 2 distant data points is the fundamental distinction between them. Euclidean [
Pleiotropy is another key phenomenon identified in the investigation of gene functions behind many diseases. Pleiotropy occurs when a single gene influences many phenotypic features [
In this work, we used a variety of data sets to investigate different distancelinkage combinations for hierarchical clustering. These clusters have revealed which gene groupings are closely connected to one another. We also assessed the fitness of those groupings and attempted to determine which distancelinkage combination produced the greatest results. We validated our findings using 8 different data sets. Furthermore, we used the best measure to identify common genes responsible for various tumors. Gene enrichment scores about their influence on various diseases were used to corroborate our findings.
This section goes over our proposed methodology. First, we provided the proposed workflow for determining the optimum clustering distancelinkage approach. Then we went over several distance metrics, linkage methods, and our selection procedure for comparing the performance of various combinations. Finally, the pleiotropic gene observation methodology is discussed.
Our investigation begins with the import of a microarray data set into our procedure. This microarray data set is typically a 2D array, with rows representing different genes and columns representing their intensity at various time stamps. To minimize the dimensionality of the data set, we will use Principal Component Analysis. It is a sophisticated approach used by academics to remove irrelevant data from a data set while keeping its integrity.
Then, in our data set, we run a distance metric. A distance measure, in general, calculates the similarity of 2 genes and determines how far apart they are. We employed the following 3 different distance metrics: Euclidean, Manhattan, and maximum. We chose a linkage method to connect related genes and generate a hierarchical tree after picking the distance metric. We used the following 4 linkage methods: single, complete, average, and ward linkage methods. We constructed a hierarchical tree using the distancelinkage method, where each leaf represents a gene, and the branches reflect the dissimilarity among them. The tree was then cut to various heights, resulting in several sets of genes for each cut point. Subsequently, we identified the appropriate cut point for that hierarchical tree by calculating how well those genes are clustered on different cut points. We used “Average Silhouette Width” and “Distance within Cluster” to calculate the fitness of the groups formed by different cut locations. The optimal fitness value is calculated using these fitness values. We determined the best combination of distance and linkage methods for a single data set by repeating this process with different combinations of distance and linkage methods.
Proposed algorithm for finding the best distancelinkage combination. Input: Microarray data set. Output: Distancelinkage combination.
For a particular data set, D, optimal fitness value can be expressed by the following equations:
Where d
Euclidean distance uses Pythagorean formula to calculate the distance between 2 genes. For n dimensional space, we can write that formula as follows:
Unlike Euclidean distance, Manhattan distance takes the modulus value of the subtraction. For ndimensional space, the equation of Manhattan Distance will be as follows:
Maximum distance, on the other hand, calculates the subtraction value for each column before selecting the highest number. The formula for ndimensional space is as follows:
The single linkage approach connects 2 clusters by taking the shortest distance between them. The equation for the single linkage method to calculate the distance between any element and another element in another group is as follows:
Where p is an element in cluster P and q is an element of cluster Q.
To compute the distance, the complete technique uses the farthest points in 2 clusters and connects the clusters with the shortest distance. The equation for the entire linking approach is as follows:
The average method determines the average value for each gene inside the cluster, then connects them one by one on each layer to form a hierarchical tree. Equation 7 is the average linkage method update formula.
Where m is all the instances of cluster a, and n is all the instances of cluster b.
A centroid point is determined using Ward linkage (much like the centroid method). The squared distance value of each point in each cluster is then calculated using that centroid. It then sums all the squared distance values obtained by the 2 clusters together. It takes the smallest total value produced by a cluster pair and merges them on that level after repeating the same technique for every cluster on the same level. Equation 8 is the Ward linkage method update formula.
The fitness of the clusters we acquired after cutting the hierarchical tree at a specific height was calculated using the following 2 metrics: average silhouette width (ASW) and distance inside cluster. The following formula is used to compute silhouette width:
Where a(i) is the average distance from object i and all the other points of the cluster in which i belongs; b(i) is the distance of the closest point in other cluster; and s(i) is the silhouette value between 2 clusters.
ASW is the average of all the silhouette values. Generally, it varies from –1 to 1, and the value closer to 1 is considered better.
The distance within a cluster is used to determine how close the elements are. Each cluster’s centroid is chosen during this process. The distance between each object in the cluster and the centroid is then determined as an average. This calculation’s formula is as follows:
Where dist(c,i) is the distance between centroid c and element i in a cluster; E is the set of elements in the cluster; and E is the number of elements in the cluster.
From the characteristics, we can understand that ASW measures the quality of clusters. A greater ASW indicates good quality of clusters, that is, for a data set D, distance metric d and linkage method l,
Where S_{i} is the ASW for cut point i.
However, distance within clusters measures how compact the data points are in the clusters. Therefore, betterquality clusters will have lower distance within clusters, that is,
Where W_{i} is the distance within clusters for cut point i.
Thus, to compare the quality of clusters we acquired at different cut points i in the hierarchical tree, our fitness function combines these 2 criteria. When these 2 relationships are combined, our fitness function becomes as follows:
From this function, we can find out the optimal fitness for a specific combination of metrics in a certain data set.
We will try ensemble clustering [
It starts by creating an n×n binary matrix in which the input is 1 if two objects belong to the same cluster and 0 otherwise. Every clustering approach is put through it. The final ensemble cluster is then generated using an entrywise average of all clustering approaches.
The data set is represented as a hypergraph by this algorithm. The hypergraph is then partitioned to determine the smallest number of edges. It produces the ensemble cluster based on the smallest number of edges.
The metaclustering algorithm starts by creating numerous clusters from a data set. The dissimilarity between those clusters is then calculated, and a metacluster is generated as a result of that measurement. In this approach, the ensemble is represented by the final metacluster.
One of the most important characteristics of these algorithms is that the number of clusters that the algorithm will build must be declared at the start. For the specified data set, we used the cluster number created by the best distancelinkage combination.
We identified the genes responsible for various cancer tumors from the data sets and then evaluated their expression in different patients with cancer to report their various phenotypes in order to discover the pleiotropic behavior of distinct genes. We built a secondary data set by extracting the expression data for each gene from each data set after identifying the common genes across these disorders. Every primary data set must contain an equal number of time stamp values in order to build a 2D microarray data set. The data sets, however, have different numbers of columns. Central nervous system, for example, includes 60 time stamps for a single gene, but the ALLAML (acute lymphoblastic leukemiaacute myeloid leukemia) data set has 72 time stamps. We cannot modify or remove any columns from the data set because doing so could compromise the data’s integrity or result in the loss of valuable information. To address this issue, we estimated the mean, median, standard deviation, and variance, which may be used to summarize numerical data [
Since no human or animal trial was conducted during this research, the authors did not apply for an ethical approval for the study.
We will discuss the experimental outcomes we discovered in our research in this part. We started by explaining the data sets we used. The findings for various distancelinkage method combinations were then shown. We later presented our findings in terms of pleiotropy for the shared genes.
We obtained gene expression data for various cancers from a publicly accessible database [
The number of genes and patients, or the number of conditions for each gene, differs among these data sets. We used a diverse data set to discover the ideal metrics, which can be used to any gene expression data set. Furthermore, these databases contain certain genes that are widely used. We have created a secondary data set to explore and analyze those genes further.
Description of data sets.
Data set  Data domain  Number of patients  Number of genes 
CNS^{a}  Central nervous system  60  7129 
ALLAML^{b}  Acute lymphocytic leukemia  72  7129 
Lung cancer  Lung cancer  181  12,533 
Ovarian cancer  Ovarian cancer  253  15,154 
Lymphoma  Lymphoma  62  4022 
SRBCT^{c}  Small round blue cell tumor  83  2308 
^{a}CNS: central nervous system.
^{b}ALLAML: acute lymphoblastic leukemiaacute myeloid leukemia.
^{c}SRBCT: small round blue cell tumor.
In our experiment, we employed several combinations of distance measurements and connection algorithms to generate a hierarchical tree. To validate our founding, we used 3 distance metrics and 4 linking methods. We combined these 3 distance metrics and 4 linkage methods to build 12 hierarchical trees for each data set. We cut each tree on numerous cut points after building hierarchical trees. As a result, the tree has been separated into several distinct groups. We assessed the fitness value for each cut point and selected the highest as the ideal value for that hierarchical tree given that particular distance metriclinkage method combination.
A portion of a hierarchical tree of genes from the lung cancer data set is shown in
Hierarchical tree created using the maximumWard method on lung cancer data set.
Fitness value for different combinations of distance and linkage metrics.
Data set and linkage  Manhattan distance  Euclidean distance  Maximum distance  






Average  3.80×10^{13}  9.47×10^{12}  3.50×10^{11} 

Complete  1.42×10^{13}  6.78×10^{12}  3.44×10^{12} 

Single  2.59×10^{13}  5.72×10^{12}  2.16×10^{11} 

Ward  4.49×10^{14}  3.22×10^{13}  3.09×10^{12} 






Average  1.20×10^{6}  1.45×10^{5}  3.39×10^{5} 

Complete  8.89×10^{7}  2.11×10^{5}  1.51×10^{5} 

Single  1.11×10^{6}  1.37×10^{5}  1.24×10^{5} 

Ward  4.41×10^{7}  2.64×10^{6}  3.07×10^{5} 






Average  5.56×10^{8}  1.48×10^{6}  3.36×10^{6} 

Complete  5.35×10^{8}  1.23×10^{6}  1.52×10^{9} 

Single  5.33×10^{8}  6.47×10^{7}  5.86×10^{7} 

Ward  3.03×10^{8}  1.19×10^{6}  6.71×10^{6} 






Average  1.25×10^{5}  1.59×10^{4}  2.87×10^{4} 

Complete  1.71×10^{5}  7.49×10^{5}  6.28×10^{4} 

Single  2.88×10^{6}  3.12×10^{4}  1.28×10^{4} 

Ward  2.49×10^{4}  3.44×10^{5}  9.31×10^{4} 






Average  1.29×10^{7}  2.81×10^{6}  9.66×10^{6} 

Complete  2.21×10^{8}  2.34×10^{6}  6.00×10^{6} 

Single  1.01×10^{7}  2.81×10^{6}  8.10×10^{6} 

Ward  1.23×10^{8}  6.05×10^{7}  2.82×10^{6} 






Average  1.52×10^{7}  6.73×10^{6}  4.41×10^{5} 

Complete  1.03×10^{7}  4.72×10^{6}  3.73×10^{5} 

Single  8.24×10^{8}  4.34×10^{6}  3.00×10^{5} 

Ward  3.88×10^{9}  8.55×10^{8}  2.67×10^{6} 
^{a}CNS: Central Nervous System.
^{b}ALLAML: acute lymphoblastic leukemiaacute myeloid leukemia.
^{c}SRBCT: small round blue cell tumor.
We chose the data set (ALLAML) for testing and ran these 4 ensemble clustering techniques. For this data set, the maximumaverage combination produced the best result, with a cluster number of 135.
Fitness value for different ensemble techniques.
Ensemble techniques  Fitness value 
CSPA^{a}  4.32×10^{6} 
HGPA^{b}  3.29×10^{6} 
MCLA^{c}  1.53×10^{5} 
Maximumaverage  3.39×10^{5} 
^{a}CSPA: clusterbased similarity partitioning.
^{b}HGPA: hyper graph partitioning algorithm.
^{c}MCLA: metaclustering algorithm.
Multiple tumors can be caused by a small number of genes. We discovered 9 genes linked to the following 3 types of cancer: central nervous system, lymphoma, and lung cancer. AFFXTrpnX5 at, AFFXThrX5 at, AFFXThrX3 at, AFFXPheXM at, AFFXPheX5 at, AFFXPheX3 at, AFFXLysXM at, AFFXLysX3 at, and AFFXLysX5 at were discovered to be common genes. We found the gene enrichment score publicly available at [
Gene enrichment score vs cancer type.
The maximum distance method combined with the average linkage method produces better hierarchical trees in 4 data sets (central nervous system, leukemia, lymphoma, and SRBCT), according to the fitness values provided in
The maximum distance metric outperforms the other 2 distance methods among the 3 most commonly used distance metrics. Maximum distance considers only 1 column where those 2 genes have the most variance when calculating distance between them. The Euclidean and Manhattan distance methods, on the other hand, would have taken distances across all columns. As a result, the dissimilarity values for the Euclidean and Manhattan distances are approaching the maximum distance. As a result, in clustering, the Euclidean and Manhattan distances place points slightly farther apart than the Maximum distance. Furthermore, because all the columns indicate the same features of a gene evaluated at different time stamps, we can analyze the worst scenario (ie, the greatest differential in the expression of 2 genes at a certain moment). This is the most significant difference between these 2 genes. To put it another way, maximum distance calculates only the difference that matters. The Euclidean and Manhattan distances, on the other hand, are becoming buried in the massive amount of data. The maximum distance, on the other hand, may create undesirable clusters in a different data set with uniform variation across all columns.
When the data set is small, the average linkage approach performs well, and when the data set is huge, the ward method performs well. The single linkage approach may be faster than the average method for joining clusters, but it is not necessarily better. When determining the proximity of 2 clusters, it always considers only 2 points and ignores all others. The average linkage approach, on the other hand, considers all the points in the cluster when determining relatedness. When using the ward technique, the sum square error is used to determine similarity. When working with small or mediumsized data sets, the average linkage approach outperforms the ward linkage method, but as the data sets grow larger, the sum square error values take over and produce superior results compared with the average linkage method.
We tried to identify the optimal combination in our research and found that the maximum distance method performs better on hierarchical clustering when column variance is not uniform across the data set. However, if the data set is medium in size, with around 20007000 rows and 6080 columns, the average linkage technique will outperform other linkage methods, and if the data set is very large, with 12,00015,000 rows and 100200 columns, the ward linkage approach will outperform other linkage methods. Furthermore, it has been discovered that ensemble clustering can improve performance by a very little amount at the cost of extra work.
We discovered 9 common genes that cause the following 3 diseases: lymphoma, central nervous system cancer, and lung cancer. We tried to figure out how these genes play a role in these 3 diseases using the data provided in the data sets. The maximumaverage hierarchical clustering technique was chosen since it performed the best in the first experiment. We used gene enrichment score to confirm our findings on whether the 9 genes discovered have an impact on these 3 conditions.
Bioinformatics is becoming more and more involved in health sectors, such as disease detection and individualized medicine recommendation, as computational technology advances. Clustering techniques are becoming increasingly important in these industries. We investigated several distancelinkage combinations and attempted to find a solution. We hope that other researchers who use hierarchical clustering will profit from our findings and apply what they have learned to their own study. We also discovered common genes with multiple symptoms, which we confirmed using gene enrichment profiling. Knowing the pleiotropic nature of these genes will help scientists work on them to combat cancer.
In this study, we discovered a set of measures that will yield higherquality clusters for gene expression data. Pleiotropic behavior of common genes for many disorders was also discovered. To validate our findings, we used a variety of data sets that varied in size and richness. We used a fitness function to compare cluster quality between sets of clusters while assessing cluster quality. For mediumsized data sets, we discovered that the maximum distance metric combined with average linkage works best. Ward linkage also works better with huge data sets. Furthermore, due to data dimension differences, we had to preprocess data while identifying common genes for various disorders. It is critical to identify genes with similar symptoms more precisely and to separate those genes more effectively. Furthermore, detecting a gene by applying the clustering technique to find comparable genes is a critical work for researchers, and if done correctly, might save countless lives. For all these reasons, correct clustering is becoming increasingly important in bioinformatics. Therefore, if their data set resembles our microarray data, researchers from other fields can employ this technology.
acute lymphoblastic leukemiaacute myeloid leukemia
average silhouette width
None declared.