Feature selection: a key technique for data mining


Jean-Charles Lamirel



Since the 1990s, advances in computing power and storage capacity have made it possible to manipulate very large datasets. Whether in bioinformatics or in text mining, it is not uncommon for core mining algorithms such as classifiers to operate on description spaces of several thousand, or even tens of thousands, of features. One might expect such algorithms to become more effective as the number of features grows. However, the first problem that arises is the increase in computation time. Moreover, the fact that a significant proportion of the features are redundant or irrelevant to the classification task significantly perturbs their operation. Just as it is crucial in human learning, the integration of a feature selection process is therefore a central concern in the classification of high-dimensional data.
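
As a rough illustration of this point (not part of the talk itself), the following sketch shows how padding a problem with purely irrelevant features tends to degrade cross-validated accuracy; the synthetic dataset and the nearest-neighbour classifier are arbitrary choices made only for the demonstration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for n_noise in (0, 100, 1000, 5000):
    # 20 informative features plus n_noise purely random, irrelevant ones
    X, y = make_classification(n_samples=500, n_features=20 + n_noise,
                               n_informative=20, n_redundant=0, random_state=0)
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean()
    print(f"{n_noise:5d} irrelevant features -> accuracy {acc:.3f}")
```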



To present the existing methods and their limitations in a highly multidimensional context, the course will rely on a central example drawn from a complex, “real life” text mining task.



The first part of the presentation will introduce the main principles of classification and provide some examples of classifiers and of their application in the data mining domain. It will also illustrate the effect of handling high-dimensional data on classifier results. Common additional problems related to the management of rare or imbalanced data and to highly similar classes will also be discussed in this part.
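
For readers who want something concrete to relate this part to, here is a minimal, hypothetical text classification pipeline; the 20 Newsgroups corpus and the naive Bayes classifier are stand-ins chosen for illustration and are not the course example itself:

```python
from collections import Counter
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

data = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(min_df=2).fit_transform(data.data)  # tens of thousands of features
y = data.target
print("documents x features:", X.shape)
print("class sizes:", Counter(y))                       # reveals any class imbalance

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = MultinomialNB().fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), target_names=data.target_names))
```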



The second part of the presentation will focus on the feature selection principle and on the main categories of feature selection methods. The pros and cons of the usual methods will be discussed, and the effect of applying them in combination with classifiers in the context of highly multidimensional data will be highlighted. The additional use of resampling techniques for the management of imbalanced data will also be investigated in this part.
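
As a hedged sketch of how such a combination might look in practice, the example below pairs a filter-type selection step (an ANOVA F-score filter, one common choice among the categories mentioned) with naive random oversampling of the minority class; the specific methods, thresholds and synthetic data are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.utils import resample

# Imbalanced, high-dimensional synthetic data (95% / 5% class split).
X, y = make_classification(n_samples=2000, n_features=3000, n_informative=30,
                           n_redundant=0, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Filter-type feature selection: keep the 100 features with the best F-score.
selector = SelectKBest(f_classif, k=100).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

def oversample(X, y, random_state=0):
    """Duplicate minority-class samples until every class matches the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts_X, parts_y = [], []
    for c, n in zip(classes, counts):
        Xc, yc = X[y == c], y[y == c]
        if n < target:
            Xc, yc = resample(Xc, yc, n_samples=target, replace=True,
                              random_state=random_state)
        parts_X.append(Xc)
        parts_y.append(yc)
    return np.vstack(parts_X), np.concatenate(parts_y)

X_bal, y_bal = oversample(X_tr_sel, y_tr)
clf = LinearSVC().fit(X_bal, y_bal)
print("balanced accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te_sel)))
```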



The last part of the presentation will focus on one of our recent research directions in the domain of feature selection. The principle of this new and promising approach, based on the original theory of feature maximization and its associated metric, will be explained. The behavior of the method will be compared to that of the usual methods in the above-mentioned context. Additional advantages related to specific class labeling and graph visualization capabilities, as well as intrinsic properties of the method such as incrementality and non-parametric behavior, will finally be discussed.
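
The exact definition of the metric is left to the presentation itself. Purely as an orientation, the sketch below implements one plausible feature-maximization-style score, combining per-class feature recall and feature precision into an F-measure and keeping features that score above average; both the formulation and the selection rule are assumptions made for illustration, not the author's actual method:

```python
import numpy as np

def feature_f_measures(X, y):
    """Per-class feature F-measure: harmonic mean of feature recall (share of a
    feature's total weight falling into the class) and feature precision
    (share of the class's total weight carried by the feature)."""
    classes = np.unique(y)
    # class_sums[c, f] = total weight of feature f inside class c
    class_sums = np.vstack([X[y == c].sum(axis=0) for c in classes])
    recall = class_sums / np.maximum(class_sums.sum(axis=0, keepdims=True), 1e-12)
    precision = class_sums / np.maximum(class_sums.sum(axis=1, keepdims=True), 1e-12)
    return 2 * recall * precision / np.maximum(recall + precision, 1e-12)

def select_features(X, y):
    """Keep features whose best per-class F-measure exceeds the global average
    (a simple thresholding rule, used here purely for illustration)."""
    ff = feature_f_measures(X, y)
    return np.where(ff.max(axis=0) > ff.mean())[0]

# Tiny example with non-negative, term-frequency-like data.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 50)).astype(float)
y = rng.integers(0, 3, size=200)
X[y == 1, :5] += 3   # make the first five features characteristic of class 1
print("selected features:", select_features(X, y))
```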