This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Bioinformatics and Biotechnology, is properly cited. The complete bibliographic information, a link to the original publication on https://bioinform.jmir.org/, as well as this copyright and license information must be included.
There is a great need to develop a computational approach to analyze and exploit the information contained in gene expression data. The recent utilization of nonnegative matrix factorization (NMF) in computational biology has demonstrated the capability to derive essential details from a high amount of data in particular gene expression microarrays. A common problem in NMF is finding the proper number rank (r) of factors of the degraded demonstration, but no agreement exists on which technique is most appropriate to utilize for this purpose. Thus, various techniques have been suggested to select the optimal value of rank factorization (r).
In this work, a new metric for rank selection is proposed based on the elbow method, which was methodically compared against the cophenetic metric.
To decide the optimum number rank (r), this study focused on the unit invariant knee (UIK) method of the NMF on gene expression data sets. Since the UIK method requires an extremum distance estimator that is eventually employed for inflection and identification of a knee point, the proposed method finds the first inflection point of the curvature of the residual sum of squares of the proposed algorithms using the UIK method on gene expression data sets as a target matrix.
Computation was conducted for the UIK task using gene expression data of acute lymphoblastic leukemia and acute myeloid leukemia samples. Consequently, the distinct results of NMF were subjected to comparison on different algorithms. The proposed UIK method is easy to perform, fast, free of a priori rank value input, and does not require initial parameters that significantly influence the model’s functionality.
This study demonstrates that the elbow method provides a credible prediction for both gene expression data and for precisely estimating simulated mutational processes data with known dimensions. The proposed UIK method is faster than conventional methods, including metrics utilizing the consensus matrix as a criterion for rank selection, while achieving significantly better computational efficiency without visual inspection on the curvatives. Finally, the suggested rank tuning method based on the elbow method for gene expression data is arguably theoretically superior to the cophenetic measure.
Nonnegative matrix factorization (NMF) algorithms have been advanced for the application fields of bioinformatics, artificial intelligence [
Several approaches have been developed for clustering samples, mutational processes, and gene expression levels that draw similar expression motifs [
A common problem in conventional multivariate data analysis methods such as factor analysis (FA), principal component analysis (PCA), cluster analysis, and NMF is to detect the proper number (r) of factors, principal components, clusters, and ranks, respectively. Item redundancy is common in long questionnaires such as those used in a pilot questionnaire study, arguing for the utilization of FA and the variance inflation factor on a lifestyle questionnaire. Staffini et al [
The aim of this study was to utilize the unit invariant knee (UIK) method for obtaining related biological and molecular correlations in gene expression data. The UIK method is used to catch compositions essential for the data and to offer biological understanding by systematizing both the features and samples. The approach is based on a “knee point” and its unit invariant estimation using the extremum distance estimator method introduced by Christopoulous [
Therefore, given an NMF method and a data set (a target matrix), the tens of thousands of genes regarding a small number of signatures can be analyzed. Gene expression patterns of samples can then be studied to determine the expression motifs of the signatures. The signatures define an interesting decomposition of genes, analogous to the motifs of Hutchins et al [
In this study, the elbow technique was considered for model selection utilizing alternative parsing and its robustness was evaluated [
Finally, this study applies the combination of NMF and the UIK method (designated the uikNMF method) to simplify cancer classification tasks by clustering tumor samples and mutational signature data sets. This enables illustrating numerous sturdy decompositions of genetic and mutational signatures from experimental and simulated data sets.
Given a target matrix Vm×n, NMF identifies nonnegative matrices such that Nm×r and Mr×n (ie, with all entries≥0) to present the matrix decomposition as:
In practice, N is typically viewed as a basis or metagenes matrix, and the mixture coefficient matrix and metagene expression profiles refer to the matrix N. The rank factorization is chosen such that r≤min(m,n). The goal behind this selection is to explain and split the details classified among V into r factors (ie, the columns of N). Given a matrix Vm×n, NMF finds two nonnegative matrices, Nm×r and Mr×n (ie, with all elements≥0), to represent the decomposed matrix as
for instance by natural demanding of nonnegative N and M to minimize the reconstruction error:
In this case, we consider a gene expression data set characterized by the expression levels of
The aim is to identify a small number of rank factorizations, each defined as a positive linear combination of the
In the framework of classification analyses, Brunet et al [
Hutchins et al [
For instance, a fundamental step for any unsupervised algorithm is to determine the optimal number of clusters (k) into which the data may be clustered [
In many cases, utilization is referred to as uik(x,y), where x is the vector of ranks, components, clusters, or factors and y is the related vector of the RSS curve [
(A) Rank survey plots for residual sum of squares (RSS) and (B) cophenetic coefficient curves factorization rank. The factorization rank ranges from 3 to 37. The aim is to decide whether the optimal rank factorization is very rigid by simple visual inspection. (C) The function of factorization rank is selected as the emergence rank of the RSS survey. The rank range between knee points is detected by the uik() function of the R package "inflection" at the curve of the cumulative rank units. The best fit is determined using a linear regression model.
This study used cross-validation to select an optimal number of implicit elements in NMF. The goal of NMF is to obtain low-dimensional N and M with all nonnegative elements by minimizing the reconstruction error |V – NM|2. Leaving out a single entry of V (eg,
Consequently, the left-out element V
One can repeat this process by crossing out all entries of V
Since the NMF must be reiterated for each crossed-out value and might also be difficult to code (depending on the target matrix entries and how smooth it is to implement NMF with missing values), this can be a computationally expensive procedure. For instance, in PCA, one can avoid this by crossing out entire rows of V, which eventually speeds up the computing [
Note that various techniques have been developed to select the optimal rank factorization. For example, Brunet et al [
This study illustrates the utilization of NMF based on the UIK method to select the optimal rank on the RSS curve with a leukemia gene expression data set (esGolub) in simplifying cancer subtypes [
Furthermore, this data set has served as a touchstone in cancer classification at the molecule, histology, and stage levels [
For example, by looking at the NMF rank survey plot of RSS in
The simulated mutational process data obtained from Alexandrov et al [
Analyses were performed utilizing the R programming language. Before the procedure, the low-quality genes with an inadequate number of reads were eliminated and gene expression values were converted to a logarithmic scale. The data set (
Gene expression and simulated mutational data sets.
Data set | Size | Samples |
esGolub gene expression | 5000×38 | 38 |
Mutational processes | 100×96 | 96 |
The present results are based on the NMF package of Gaujoux and Seoighe [
By simply looking at the cophenetic correlation or RSS plots of rank factorization in
Application of the unit invariant knee (UIK) method on different algorithms: (A) “Brunet” and (B) “nsNMF.” The optimal rank, which UIK represents, is 15 for the Brunet algorithm, whereas the UIK of the nsNMF algorithm reveals 14 as an optimum rank, similar to the “Lee” algorithm.
(A) Estimation of the optimal rank. Nonnegative matrix factorization (NMF) survey plot of quality measures obtained from factorization rank from 2 to 6 by running the target matrix esGolub [1:200] 10 times. (B) The function of factorization rank is selected as the emergence rank of the residual sum of squares (RSS) survey. For example, the rank range of 2 to 6 is between knee points detected by the R inflection package's uik()function at 3. Overall, the method of the UIK estimation was confirmed with former results.
It is challenging to observe the rank factorization of the simulated data on the cophenetic coefficient curve (
(A) It is complicated to locate the optimal rank with the cophenetic correlation coefficient approach. (B) However, the unit invariant knee (UIK) method can facilitate this decision more quickly and more accurately, which agrees with the number of signatures detected by Alexandrov et al [
The novel finding of this study is the ability to apply the UIK method in selecting optimal ranks based on the RSS curve of factorization ranks of the NMF technique. First, this study employed the Golub et al [
In the second module, the UIK precisely estimates simulated data with known dimensions. The UIK technique is free of a priori rank parameter input and does not require setting initial parameters that considerably affect the performance. Finally, this method was tested on gene expression data deconvolution, achieving optimal rank estimation.
The proposed uikNMF technique was tested on both experimental gene expression and simulated mutational processes data sets. Moreover, our recent study of utilization of the UIK technique on NMF revealed the genetic links of type 2 diabetes (T2D) that could lead to the development of Alzheimer disease (AD) [
This study further shows that the UIK method provides a credible prediction for gene expression data and precisely estimates simulated data with known dimensions. The proposed UIK method based on the RSS curvature’s first inflection point to estimate the optimal rank is theoretically superior or equivalent to existing implementation and software. All the undertaking is done with R programming and is freely available.
As future work, some software functionality ideas include adapting the UIK method on NMF rank estimation in a single function package to accommodate analyses of gene expression, mutational processes, and other biological data sets at the molecular level.
The analysis has some limitations such that other NMF packages or software on gene expression research were not tested. This study demonstrates that the UIK method provides a credible prediction for gene expression data. However, it was simply assumed that the same algorithms of NMF are used, as far as the RSS and residual curves would be approximated the same way so that the UIK method would result in the same optimal ranks.
One of the arguments related to the choice of rank is to remove noise and recover the signatures [
Several methods have been developed to select the optimal rank factorization [
This study demonstrates that the elbow method provides a credible prediction for both gene expression data and for precisely estimating simulated mutational processes data with known dimensions. The suggested UIK method is faster than conventional methods with regard to usage of the consensus matrix as a benchmark for rank choice, while achieving considerably better computational adeptness without visual inspection on the curvatives. It is further argued that the suggested rank tuning method based on the elbow method with gene expression data is theoretically superior to the cophenetic measure. Lastly, the proposed method could be applied to other types of gene expression data sets to reveal the most significant genes (so-called “metagenes”) in various diseases, including T2D and other metabolic diseases, and may further be helpful for understanding the underlying mechanism of AD and related neurological disorders.
Alexandrov et al [
Alzheimer disease
acute lymphoblastic leukemia
differentially expressed gene
factor analysis
mean squared error
nonnegative matrix factorization
principal component analysis
predicted residual sum of squares
residual sum of squares
type 2 diabetes
unit invariant knee
The Golub gene expression [
None declared.