Monday, 13 January 2014

Creating basic "word clouds" with R

There's more to it than meets the eye! 

Reading a text document in full might be the best way to judge its contents, but that is seldom feasible when the text comes in bulk. What if we could get the essence much faster? There is a way, and here we deal with a very basic one: text mining, of course.


An integral part of text mining is being able to judge a piece of text (or several pieces) on the basis of its contents. We count how often each word occurs, apply a few manipulations here and there, and we have our word cloud ready. All we need is R, the open-source statistical programming tool, and of course any piece of text we want to analyze.
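To make the idea of "frequency of occurrence" concrete, here is a minimal sketch in base R only (no packages). The sentence is made up for illustration and is not taken from the cuisine text; a word cloud essentially draws each word at a size proportional to counts like these.

```r
# A made-up sentence standing in for a real document (an assumption for illustration).
text <- "Roast the beef, rest the beef, then slice the beef thinly."

# Lowercase the text, strip punctuation, and split it on whitespace.
clean <- gsub("[[:punct:]]", "", tolower(text))
words <- strsplit(clean, "\\s+")[[1]]

# Count each distinct word and sort from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 3))
```

Notice that "the" ties with "beef" for the top spot, which is exactly why stopwords are removed before the cloud is drawn.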


The story goes like this - we choose a piece of text and save it as a plain text file (using any free editor such as Notepad) in Documents, on the Desktop, or in any folder of our choice on the hard drive. In this case we create a folder called "cuisine", search the internet for any cuisine-related article, blindly copy it into a text file, and save that file in our newly created folder. Easy.


The next part involves writing a few lines of code in R, but one can simply copy the lines given later. To continue our story, we then create a "Corpus" - a working copy of the text held inside R, so that the original file on our disk does not get tampered with. A single corpus can also hold more than one text, but let us stick to simplicity.


We then let R do a few things - strip extra whitespace from our text, convert it to lowercase so that "Beef" and "beef" count as the same word, remove common English stopwords, reduce words to their stems, and finally remove all punctuation and numbers. We might want to save the resulting word cloud in PNG format for future use, and then we execute it!
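Before applying these steps to a real text, it may help to see roughly what they do to a single sentence. The following is only a sketch in base R, not the tm functions themselves: the input sentence and the tiny stopword list are made up for illustration, the order of the steps is simplified, and stemming is left out because it needs a proper stemmer.

```r
# Made-up input for illustration; not taken from the cuisine text.
raw <- "  Roasting 2 Beef   Tenderloins: the BEST recipe!  "

# A tiny hand-picked stopword list (tm's stopwords("english") is much longer).
stops <- c("the", "a", "an", "and", "of")

step1 <- gsub("\\s+", " ", trimws(raw))         # trim and collapse whitespace
step2 <- tolower(step1)                          # convert to lowercase
step3 <- gsub("[[:punct:]]", "", step2)          # drop punctuation
step4 <- gsub("[[:digit:]]", "", step3)          # drop numbers
tokens <- strsplit(trimws(step4), "\\s+")[[1]]   # split into words
tokens <- tokens[!tokens %in% stops]             # drop stopwords
print(tokens)
```

What remains are the content-bearing words, which is the material the word cloud is built from.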





Running the code on our cuisine text, we find that it was mostly about beef tenderloin recipes; we also get a fair idea of the type of recipe (roasted!) and of a few of the ingredients. That is quite revealing, given that we never took the time to read the text in the first place (remember, we had blindly copied it!).

All we need to do is copy the following code into R (it is largely self-explanatory, since every step is commented) and run it.

# loads the packages "tm" (text mining) and "wordcloud"
library(tm)
library(wordcloud)
# first save the plain text file in a folder
# assume the file is cuisine.txt and the folder is "cuisine"
# also assume the folder resides at C:/Users/Ayan/Documents/cuisine/
# note: we specify the folder in which the file lives,
# not the file itself
# loads the text into a corpus
cuisine <- Corpus(DirSource("C:/Users/Ayan/Documents/cuisine/"))
# a corpus is not a simple object, so to view its contents
# use inspect(cuisine) rather than just typing cuisine
inspect(cuisine)
# tm_map function comes with the "tm" package
# remove unnecessary whitespace
cuisine <- tm_map(cuisine,stripWhitespace)
# converts everything to lower case
# (tm 0.6 and later need the content_transformer() wrapper around tolower)
cuisine <- tm_map(cuisine, content_transformer(tolower))
# removes English stopwords like "the", "they", etc.
cuisine <- tm_map(cuisine, removeWords, stopwords("english"))
# stems words, e.g. play, playing, played all become play
# (stemDocument needs the SnowballC package to be installed)
cuisine <- tm_map(cuisine,stemDocument)
# removes all punctuation
cuisine <- tm_map(cuisine,removePunctuation)
# removes all numbers
cuisine <- tm_map(cuisine,removeNumbers)
# saves the wordcloud as a PNG image in the working directory
png("wordcloud_packages.png", width=1280, height=800)
# applies wordcloud algorithm
# scale - controls difference between largest and smallest fonts
# max.words - limits number of words in the cloud
# rot.per - proportion of words drawn vertically (0.35 = 35%)
wordcloud(cuisine, scale=c(5,0.5), max.words=100,
          random.order=FALSE, rot.per=0.35, use.r.layout=FALSE,
          colors=brewer.pal(8, "Dark2"))
# closes the image file so that the PNG is actually written
dev.off()

It's almost done, but not quite. Depending on the context, we might want to drop a few words from the word cloud we created. We can do that as well. After the cloud is formed, we pick the words we want to omit and type in:

# replace the placeholders with the actual words to be removed
cuisine <- tm_map(cuisine, removeWords, c("first_word", "second_word"))

and then run the word cloud code again to see the difference. Note that the image is saved in R's working directory (check it with getwd()), which in this case is the Documents folder. If we would rather see the cloud inside R, we just leave out the png() line that saves it in image format, along with any dev.off() call that closes the file.

By playing around with the arguments, we can take our creativity a bit further, for example by changing the colors or the number of words displayed. I leave that to the readers!
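As a starting point for such experiments, here is a small sketch that calls wordcloud() directly on a vector of words and their counts instead of a corpus. The words and counts are made up for illustration, and the particular settings are just one possible choice, not a recommendation.

```r
library(wordcloud)
library(RColorBrewer)  # provides brewer.pal()

# Hypothetical words and counts, made up for illustration.
words <- c("beef", "tenderloin", "roast", "garlic", "pepper", "butter")
freq  <- c(12, 9, 7, 4, 3, 2)

wordcloud(words, freq,
          min.freq = 1,          # lower the default threshold of 3
          max.words = 5,         # show only the five most frequent words
          scale = c(4, 0.8),     # largest and smallest font size
          rot.per = 0,           # keep every word horizontal
          random.order = FALSE,  # plot words in decreasing frequency
          colors = brewer.pal(9, "Blues")[4:9])  # blue palette, palest shades skipped
```

Changing max.words, rot.per, or the palette name (for example "Dark2" or "Set1") and re-running the call is the quickest way to see what each argument does.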

Happy text mining!

Note: Before running the code for the first time, we need to install the packages it uses by typing in the following commands:

install.packages("tm")
install.packages("wordcloud")
# SnowballC is needed by stemDocument
install.packages("SnowballC")

Alternatively, we can install the packages from a CRAN mirror through the package menus in R or RStudio.