In text mining and classification research, many factors must be considered when deciding the basis on which classification is performed. These factors are termed features. The difficulty of visualizing the training data grows directly with the number of features. In most cases, the features are highly correlated and redundant. Dimensionality reduction reduces the number of features for the task by obtaining a set of principal variables. In previous work, an automated feature extraction technique using weighted TF-IDF was proposed. Although that method performed well, some of the generated features were correlated with each other, which led to high dimensionality and, in turn, greater time complexity and memory usage. This paper proposes an automatic text summarization method that uses the weighted TF-IDF model and K-means clustering to reduce the dimensionality of the extracted features. Various similarity measures are used to identify the similarity between the sentences of a document, and the sentences are then grouped into clusters on the basis of the term frequency-inverse document frequency (TF-IDF) values of their words. Experiments were carried out on student text data from the US educational data hub, and the results were compared with other dimensionality reduction methods in terms of co-selection, content-based, weight-based, and term significance parameters. The proposed method was found to be efficient in terms of memory usage and time complexity.
Text Mining, Classification, Dimension Reduction, Text Summarization, Weighted TF-IDF, K-Means Clustering.