Text mining is derived from data mining [1] and is used to understand
documents in large collections of text data. Text mining techniques extract
relevant and specific information from textual data to retrieve informative
knowledge. These techniques work by interpreting relations, patterns, rules,
and facts related to NLP, as well as by applying data mining and machine
learning algorithms [2]. The main applications of text mining include text
summarization, text classification, and text clustering. Among these
applications, automatic text classification (also called categorization) [3]
has attracted extensive attention because it deals successfully with
real-world applications.
Generally, text classification methods can be divided into two groups:
rule-based and learning-based. Rule-based approaches rely on predefined rules,
whereas in learning-based classification, documents or their patterns must be
labeled in order to classify the documents. From the perspective of machine
learning, on the other hand, a classifier operates in a supervised,
unsupervised, or semi-supervised fashion. In supervised text classification
schemes, some labeled documents, such as those obtained from human feedback,
are taken into account to classify the documents correctly. The unsupervised
method is usually known as text clustering [4], in which no external
information or predefined classes are used in the process of text
classification.
Finally, semi-supervised methods take advantage of supervision on some of the
documents in order to learn efficiently. In the text classification task, a
single-label document belongs to only one class, and a multi-label document
belongs to more than one class. The inductive process of text classification
starts by learning from a set of pre-classified documents in order to classify
test documents [3]. There are some crucial challenges in classifying text
data: the high dimensionality of the feature space, which makes the data
difficult for a machine to handle, and ontology considerations (embedded and
abstract representations of terms and concepts) [5], such as the use of
Wikipedia [6, 7] and WordNet [8, 9].
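
To make the inductive process described above more concrete, the following is
a minimal sketch of supervised, single-label classification: a model is learned
from a few pre-classified documents and then used to label an unseen test
document. The text does not name a specific classifier, so a multinomial Naive
Bayes model with add-one smoothing is assumed here purely for illustration.
The function names (`tokenize`, `train`, `classify`) and the toy documents are
hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(labeled_docs):
    """Learn per-class document and word counts from (text, label) pairs."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return class_doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-posterior for `text`."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore terms never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][token] + 1) / (
                total_words + len(vocabulary)
            )
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Pre-classified (training) documents; the labels are assumed for illustration.
training_set = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stock markets fell after the rates decision", "finance"),
]
model = train(training_set)

# An unseen test document is assigned to exactly one class (single-label).
predicted = classify("a goal in the final match", model)
print(predicted)
```

Even in this toy setting the vocabulary grows with every new document, which
hints at the high-dimensionality challenge mentioned above.
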
Making the most of the syntactic and semantic aspects of terms is also
essential for efficient classification. Semantic analysis [10, 11, 12] is the
process of describing the relationships between terms, which makes it possible
to find similarity between documents even if they do not contain the same
terms. The most sophisticated semantic methods in text analysis include Latent
Semantic Analysis (LSA) [13, 14], Latent Semantic Indexing (LSI) [15, 16],
Probabilistic LSA (PLSA) [17, 18], Probabilistic LSI (PLSI) [19], and Latent
Dirichlet Allocation (LDA) [20]. Furthermore, to avoid over-fitting in text
classification, some considerations should be taken into account, including
appropriate text representation [21], pre-processing [22], and dimension
reduction [23, 24, 25]. Vector Space Model
(VSM) [21] is a typical text representation model in which each document
$d_j$ is represented by a vector of term weights. With $w_{kj}$ denoting the
weight of term $k$ in document $j$, the vector is defined as

$$d_j = (w_{1j}, w_{2j}, \ldots, w_{nj}),$$

where $n$ is the number of terms in the feature space. Although this
representation is effective, it produces a huge number
of features. To reduce this high dimensionality, two kinds of methods are
often used, feature selection and feature extraction, such as: Term Frequency
(TF), Document Frequency (DF), TF-IDF [26, 27, 28], Mutual Information (MI)
[23], Gini Index (GI) [29], T-test [30], Information Gain (IG) [31], Gain
Ratio [32], and Odds Ratio [33].
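
As a rough illustration of the vector space representation and of why
dimension reduction matters, the sketch below builds one weight vector per
document and then shrinks the feature space with a document-frequency
threshold. It assumes one common TF-IDF variant, the raw term count multiplied
by log(N / DF), which may differ from the weighting used in the cited works.
The function name `build_tfidf_vectors`, the `min_df` parameter, and the toy
documents are hypothetical.

```python
import math
from collections import Counter


def build_tfidf_vectors(documents, min_df=1):
    """Return (vocabulary, vectors); vectors[j][k] is the TF-IDF weight
    of vocabulary[k] in documents[j]."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency (DF): number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # DF-based feature selection: keep terms found in at least min_df documents.
    vocabulary = sorted(term for term, count in df.items() if count >= min_df)

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency (TF) within this document
        vectors.append(
            [tf[term] * math.log(n_docs / df[term]) for term in vocabulary]
        )
    return vocabulary, vectors


documents = [
    "text mining extracts knowledge from text",
    "data mining finds patterns in data",
    "text classification assigns labels",
]

# Full feature space: one dimension per distinct term.
full_vocabulary, full_vectors = build_tfidf_vectors(documents)

# Reduced feature space: only terms that occur in at least two documents.
reduced_vocabulary, reduced_vectors = build_tfidf_vectors(documents, min_df=2)

print(len(full_vocabulary), "features before selection")
print(len(reduced_vocabulary), "features after selection:", reduced_vocabulary)
```

The threshold used here is only the simplest of the selection criteria listed
above; measures such as Mutual Information or Information Gain also use the
class labels, which this sketch does not.
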