MINING CAPSTONE TASK 1
For this task I used Python and the toolkits given to obtain the results.
They proved highly useful for getting an overview of the reviews and of the
topics discussed in them.
The specific packages I used for the topic extraction process are gensim and
sklearn.
TASK 1.1 TOPIC MINING OF ALL
To understand what the reviewers were talking about, I used the LDA topic
model to extract 10 topics from all the restaurant reviews in the data.
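To make the idea of an LDA topic model concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. This is not the gensim or sklearn implementation used for the report (those libraries use variational inference); the function name `lda_gibbs`, the hyperparameter values and the toy documents are all assumptions made for illustration, and the toy example uses 2 topics rather than 10.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens per topic, per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total tokens per topic

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        topics = [rng.randrange(num_topics) for _ in doc]
        assignments.append(topics)
        for word, k in zip(doc, topics):
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Resample its topic given all the other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    return topic_word, doc_topic

# Made-up toy documents, already tokenized.
docs = [
    "pizza crust cheese pizza".split(),
    "cheese pizza slice".split(),
    "wait slow service wait".split(),
    "slow service rude".split(),
]
topic_word, doc_topic = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [word for word, _ in counts.most_common(3)])
```

Each topic is summarized by its most frequent words, which is the same kind of term list that the topic visualizations in this report are built from.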
To vectorize the review data I applied TfidfVectorizer. I used a linear
term-frequency transformation with IDF reweighting and set the n-gram range
to 1 to 2, so the extracted terms are either single words or two-word
phrases.
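As an illustration of what an n-gram range of 1 to 2 combined with IDF reweighting produces, here is a small pure-Python sketch. It is not sklearn's implementation: by default TfidfVectorizer smooths the IDF term and L2-normalizes each document vector, while this sketch uses the plain textbook formula count * log(N / df). The helper names and the sample reviews are made up. In sklearn the corresponding setting is `TfidfVectorizer(ngram_range=(1, 2))`.

```python
import math
from collections import Counter

def ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous n-grams with min_n <= n <= max_n as strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def tfidf(docs):
    """Weight each term as raw count * log(N / document frequency)."""
    term_lists = [ngrams(doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(docs)
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(terms).items()}
        for terms in term_lists
    ]

# Made-up sample reviews.
reviews = [
    "great pizza great service",
    "slow service",
    "great sushi",
]
weights = tfidf(reviews)
for term, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

Two-word terms such as "great pizza" occur in fewer documents than their component words, so they tend to receive a higher IDF weight.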
To visualize the data I opted to use D3.
To represent the topic models effectively, I chose a word cloud
visualization in which the font size of each term represents its significance
within the given topic model.
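The mapping from term significance to font size can be sketched as follows. The report does not state the exact scaling used, so the linear mapping, the size limits and the function name `font_sizes` are assumptions for illustration, and the term weights are made up.

```python
def font_sizes(term_weights, min_size=10, max_size=60):
    """Map each term's weight linearly onto the range [min_size, max_size]."""
    low = min(term_weights.values())
    high = max(term_weights.values())
    span = high - low
    sizes = {}
    for term, weight in term_weights.items():
        # If every weight is equal, fall back to the largest size.
        fraction = (weight - low) / span if span else 1.0
        sizes[term] = round(min_size + fraction * (max_size - min_size))
    return sizes

# Made-up weights for the terms of one topic.
topic = {"pizza": 0.08, "crust": 0.04, "slice": 0.02}
sizes = font_sizes(topic)
print(sizes)
```

The resulting sizes could then be passed to whichever word cloud layout is used for drawing, for example a D3 layout.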
Some observations from the data are as follows:
· Topics 6, 7 and 9 place a strong emphasis on a specific cuisine or food,
such as pizza or Chinese food. Topics 1, 4 and 9 likewise talk about foods
or drinks, such as fish chips, chicken and “food drinks”.
· The comments towards the restaurants are mostly good, as indicated by
topics 0 and 2.
· Topic 3 shows that time is a very important topic for customers when they
review a restaurant.
Figure showing the topics mined from the raw restaurant data.
TASK 1.2 TOPIC MINING OF POSITIVE AND NEGATIVE REVIEWS
This task explores the topic distribution for subsets of the reviews.
Specifically, it required that observations be made on the subsets of
positive and negative reviews.
For the positive reviews I used reviews with a star rating >= 4, while
for the negative reviews I used reviews with a star rating <= 2. I again
used LDA as the topic model, with the same configuration as in the previous
task.
Some observations from the results are as follows:
· Regardless of whether the reviews are positive or negative, the main focus
is on the foods or cuisines that the reviewers frequently revisit.
· In the positive reviews the top topics include Indian food, pizza, sushi
and chicken, while in the negative reviews they include, but are not limited
to, pizza, hot dogs, sushi and tacos.
· Compared with the full set of reviews, the subsets offer different
information. The positive reviews say little directly about how the
reviewers rate the service offered. They contain an abundance of general
phrases, such as "great place" and "amazing", which do not amount to a
specific rating and carry no impression of negativity. The negative reviews
give a better description despite also using general phrases, because they
give a clear and specific review on a given topic. For example, "portion
size" indicates that the food portion was not satisfactory, and "limited
menu" indicates that the reviewer lacked enough options on the menu to
decide what to eat or drink.
The diagrams below show the topics for the positive and negative reviews. I
again used the word cloud representation.
Figure showing the ten topics mined representing positive restaurant reviews
(panels: topic 0 to topic 9).
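The selection of the positive and negative subsets in this task (star rating >= 4 and star rating <= 2) can be sketched as below. The record layout with `stars` and `text` fields, the function name `split_by_stars` and the sample reviews are assumptions for illustration; the report does not show how the data was loaded.

```python
def split_by_stars(reviews, positive_min=4, negative_max=2):
    """Return (positive_texts, negative_texts); 3-star reviews fall in neither."""
    positive = [r["text"] for r in reviews if r["stars"] >= positive_min]
    negative = [r["text"] for r in reviews if r["stars"] <= negative_max]
    return positive, negative

# Made-up sample records.
sample_reviews = [
    {"stars": 5, "text": "great place, amazing sushi"},
    {"stars": 4, "text": "good pizza"},
    {"stars": 3, "text": "it was okay"},
    {"stars": 2, "text": "limited menu"},
    {"stars": 1, "text": "small portion size, long wait"},
]
positive, negative = split_by_stars(sample_reviews)
print(len(positive), "positive;", len(negative), "negative")
```

Each subset can then be vectorized and passed to the topic model separately, using the same configuration as for the full set of reviews.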
Figure representing the 10 topics extracted for the negative restaurant
reviews (panels: topic 0 to topic 9).
CONCLUSION
In conclusion, ten topics can be extracted from the raw restaurant data, and
after this extraction the data can be divided into subsets for further topic
extraction. These subsets, in this case the positive and negative reviews,
yield further information from the data. I hope this report has given a
detailed explanation of my findings from the Yelp data.