If we keep adding more dimensions, then in order to avoid overfitting we would need to increase the size of the training data considerably. In a high-dimensional space, each feature has a number of possible values, and we would need to make sure that our data contains at least 5 samples for each combination of values, which is a frequency typically chosen in the literature (Koutroumbas & Theodoridis, 2008; Spruyt, 2014). Failing to do so will result in overfitting, i.e. the model will learn exceptions that are specific to the training data and will not be able to generalize to new, unseen examples.

When it comes to data sparsity, as the number of dimensions increases the volume of the feature space grows rapidly and the data becomes sparse. This is a problem because, given the large space, the data require a lot of storage, and it can also be a problem for a machine learning method because it increases the cost of observing a feature (Koutroumbas & Theodoridis, 2008; Trunk, 1979).

In order to avoid the above-mentioned issues with high dimensionality, dimensionality reduction methods can be applied. They are normally divided into feature selection and feature extraction (Spruyt, 2014; Pudil & Novovičová, 1998). Feature selection methods are used to detect and remove data that do not help to increase the accuracy of the model, or that may even lead to a decrease in accuracy. The main aims of feature selection are the following: “improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data” \cite{fs}. In simpler words, feature selection methods allow us to decide which features are the best to use, reduce dimensionality, and remove noise, and therefore also speed up the learning process and increase the accuracy. Which features to use will mainly depend on the task at hand.
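The 5-samples-per-combination requirement above grows exponentially with the number of dimensions, which can be made concrete with a small calculation. The following is a minimal sketch; the assumption of 10 possible values per feature is illustrative, not taken from the cited sources:

```python
def required_samples(n_features, values_per_feature=10, samples_per_cell=5):
    """Minimum training-set size if every combination of feature values
    must be observed at least `samples_per_cell` times.
    The number of combinations is values_per_feature ** n_features."""
    return samples_per_cell * values_per_feature ** n_features

for d in (1, 2, 3, 5):
    print(d, required_samples(d))
# With 10 values per feature: 1 feature needs 50 samples,
# 2 need 500, 3 need 5,000 and 5 already need 500,000.
```

Even with this modest per-feature cardinality, every added dimension multiplies the required training-set size by a factor of 10, which is why the data requirement quickly becomes infeasible.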
For instance, to identify a word sense, the surrounding words can be a good feature to choose, while in an author identification task, the appearance and frequency of certain words can play a distinctive role as features.

The main goal of cross-validation is to set aside a test dataset while the model is training, so that the performance of the model can be observed on unseen data. To obtain the final estimate of the model's prediction performance, the results of the evaluation on the test data in each round of training are combined. By using this technique, it is better guaranteed that we do not lose important modelling or testing capability \cite{CV}.

Since it is not possible to go through all the possible combinations of features on which the chosen classifier could be trained and tested, several search methods can be applied, such as greedy methods, best-first methods, etc., in order to find the best number and combination of features (Spruyt, 2014). One of the feature selection methods well known for the text classification task is Chi-square.
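The cross-validation procedure described above, splitting the data into folds, holding one fold out per round, and combining the per-round results, can be sketched as follows. The function names and the dummy scorer in the usage example are illustrative, not part of the cited method:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal, disjoint test folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, labels, k, train_and_score):
    """Train on k-1 folds, evaluate on the held-out fold, and combine
    the per-round scores into a single average, as described above.
    `train_and_score(train_idx, test_idx)` returns one round's score."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for test_idx in folds:
        train_idx = [i for i in range(len(data)) if i not in test_idx]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)

# Illustrative usage: a dummy scorer that just reports the test-fold size.
avg = cross_validate(list(range(10)), [0] * 10, k=5,
                     train_and_score=lambda tr, te: len(te))
print(avg)  # each of the 5 folds holds out 2 samples, so the average is 2.0
```

In practice `train_and_score` would fit the classifier on the training indices and return its accuracy on the test indices; averaging over rounds gives the combined performance estimate.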
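Chi-square feature selection for text classification scores each term by the dependence between its occurrence and the class label, using the standard 2x2 contingency-table statistic; terms with the highest scores are kept. A minimal pure-Python sketch follows; the toy corpus and function names are illustrative. Note that the statistic is symmetric, so it measures association with a class, not its direction:

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 term/class contingency table:
    a = docs of the class containing the term,
    b = other docs containing the term,
    c = docs of the class without the term,
    d = other docs without the term."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def score_terms(docs, labels, target_class):
    """Score every vocabulary term by its chi-square association
    with target_class, highest first."""
    vocab = {w for doc in docs for w in doc.split()}
    n_pos = sum(1 for y in labels if y == target_class)
    n_neg = len(labels) - n_pos
    scores = {}
    for term in vocab:
        a = sum(1 for doc, y in zip(docs, labels)
                if y == target_class and term in doc.split())
        b = sum(1 for doc, y in zip(docs, labels)
                if y != target_class and term in doc.split())
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy corpus (illustrative): class 0 = sports, class 1 = politics.
docs = ["the match ended with a goal",
        "the team scored a goal",
        "parliament passed the law",
        "the law was debated in parliament"]
labels = [0, 0, 1, 1]
# Class-specific words such as "goal" get the top score,
# while "the", which appears in every document, scores 0.
print(score_terms(docs, labels, target_class=0)[:3])
```

Keeping only the top-scoring terms as features is exactly the kind of selection that reduces dimensionality and noise before training the classifier.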