The remarkable growth of medical repositories has made it necessary to develop various text mining tools (Ranjit). One such important tool is the selection of relevant features. Feature selection is of utmost importance across scientific fields because datasets have grown into complex, high-dimensional structures. Feature selection therefore helps users analyze the data stored in online resources (Shoushan).
Among the online medical databases that use feature selection are PubMed and MEDLINE (Imambi & Sudha). These two repositories are growing at a very high rate and handle large volumes of data at any given time. For instance, the classification of documents in MEDLINE has become increasingly complex because of the high dimensionality of the feature space (Imambi & Sudha). There is therefore a need for these online medical repositories to adopt the most appropriate feature selection methods, ones that ease their burden and ensure documents are accessible to the people who need them.
Indeed, Imambi and Sudha, both of departments of Computer Science, recommended in experimental studies that PubMed adopt GRW in selecting its features. Further investigations using a medical dataset showed that this method was more effective and accurate than the other existing feature weighting methods.
Dissanayake and Corne share similar sentiments, arguing that, because medical datasets contain several thousand features, it is wise to develop tools that can reliably distinguish only the relevant features in these datasets. This, they argue, improves both the speed of machine learning algorithms and the quality of predictive models built on these datasets. Recent research has demonstrated that an evolutionary algorithm can be combined with a k-nearest-neighbour classifier (EA/k-NN) to distinguish the relevant features (Dissanayake & Corne). This combination acts both as a feature selection mechanism and as a machine learning method, and so yields an accurate classification of the features, making it easier to distinguish the most relevant ones.
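The wrapper idea behind the EA/k-NN combination can be sketched as follows. This is only a minimal illustration, assuming a simple generational evolutionary algorithm with truncation selection and bit-flip mutation, and leave-one-out 1-NN accuracy as the fitness function; it is not Dissanayake and Corne's exact configuration.

```python
# Sketch of wrapper feature selection: an evolutionary algorithm (EA)
# evolves binary feature masks, scored by k-NN accuracy (assumed setup,
# not the authors' exact method).
import numpy as np

def knn_accuracy(X, y, mask, k=1):
    """Leave-one-out k-NN accuracy using only the features in `mask`."""
    Xs = X[:, mask]
    if Xs.shape[1] == 0:
        return 0.0
    d = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point may not be its own neighbour
    correct = 0
    for i in range(len(y)):
        nn = np.argsort(d[i])[:k]
        pred = np.bincount(y[nn]).argmax()  # majority vote of neighbours
        correct += pred == y[i]
    return correct / len(y)

def ea_feature_select(X, y, pop=20, gens=30, p_mut=0.1, seed=0):
    """Evolve binary feature masks; fitness is k-NN accuracy."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    masks = rng.random((pop, n_feat)) < 0.5  # random initial population
    for _ in range(gens):
        fit = np.array([knn_accuracy(X, y, m) for m in masks])
        order = np.argsort(fit)[::-1]
        parents = masks[order[: pop // 2]]         # keep the fittest half
        children = parents.copy()
        flip = rng.random(children.shape) < p_mut  # bit-flip mutation
        children ^= flip
        masks = np.vstack([parents, children])
    fit = np.array([knn_accuracy(X, y, m) for m in masks])
    return masks[fit.argmax()]
```

Because the classifier itself scores each candidate feature subset, the same procedure serves as both the selection mechanism and the learning method, which is the dual role described above.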
On the other hand, a newer feature selection method for medical repositories, called Kernel F-Score Feature Selection (KFFS), was proposed by Kemal and Salih in 2009. According to the authors, this method is effective at the pre-processing phase of text mining for classifying medical datasets. The method acts in two phases. The first phase transforms the features into kernel space using a Linear or Radial Basis Function (RBF) kernel, which raises the medical dataset into a higher-dimensional feature space (Kemal & Salih). The second phase calculates the F-score value of each feature in this high-dimensional feature space using the F-score formula and then establishes the mean of these values. Any feature in the medical dataset whose F-score exceeds the mean is considered relevant and is therefore selected (Kemal & Salih); otherwise, the feature is removed from the high-dimensional input feature space. This ensures that no redundant features remain for further consideration, making the selection process an effective one (Kemal & Salih). Perhaps the most distinctive aspect of KFFS, which the other methods lack, is its transformation of a non-linearly separable medical dataset into a linearly separable feature space, making it an easier representation to work with (Kemal & Salih).
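The two phases described above can be sketched in code. This is a hedged illustration, not the authors' implementation: it assumes an RBF kernel for phase one, the standard two-class F-score formula for phase two, and a synthetic binary-labelled dataset.

```python
# Sketch of the two-phase KFFS idea: (1) map samples into kernel space,
# (2) keep kernel-space features whose F-score exceeds the mean F-score.
import numpy as np

def rbf_kernel_features(X, gamma=1.0):
    """Phase 1: represent each sample by its RBF similarity to every
    training sample, so an (n, d) dataset becomes an (n, n) dataset
    in a higher-dimensional feature space."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def f_scores(K, y):
    """Phase 2a: F-score of each kernel-space feature for binary labels.
    Large values mean the positive- and negative-class means are far
    apart relative to the within-class variances."""
    pos, neg = K[y == 1], K[y == 0]
    m, mp, mn = K.mean(0), pos.mean(0), neg.mean(0)
    num = (mp - m) ** 2 + (mn - m) ** 2
    den = pos.var(0, ddof=1) + neg.var(0, ddof=1)
    return num / (den + 1e-12)  # epsilon guards constant features

def kffs_select(X, y, gamma=1.0):
    """Phase 2b: select features whose F-score exceeds the mean F-score."""
    K = rbf_kernel_features(X, gamma)
    f = f_scores(K, y)
    mask = f > f.mean()
    return K[:, mask], mask
```

The mean-thresholding step is what makes the method quantitative: a feature is kept or discarded by an explicit numeric criterion rather than a heuristic judgement.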
All of the feature selection methods mentioned above are probably vital and effective in their own capacities. For instance, the GRW method used by PubMed has been described as the most effective and accurate feature selection method that exists for medical datasets. However, it is not as quantitative and justified as the KFFS method. The quantitative nature of the KFFS method makes it accurate and justifiable in selecting relevant features in medical datasets (Ronen & James). I therefore think that it is the most appropriate feature selection method to be used in medical datasets.