Topic extraction from Message Text using external scripts/library

Introduction

Topic modeling means detecting “abstract” topics in a collection of texts. Several techniques exist, such as Latent Dirichlet Allocation (LDA) and the Hierarchical Dirichlet Process (HDP). Among them, LDA has been the most successful in terms of accuracy and usability.

This statistical text-processing algorithm takes text as input and produces a list of topics; the exact output depends on the LDA variant that is implemented. The number of topics to detect must be defined before LDA is applied.
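As a rough illustration of the input side, LDA implementations usually work on a bag-of-words form of the texts rather than on raw strings. The sketch below shows one way to build that form. The helper name buildCorpus, the tokenizing rule, and the tiny stop-word set are assumptions made for this example and are not taken from lda.js.

```javascript
// Hypothetical helper: turns raw texts into the bag-of-words form that
// LDA implementations typically consume.
function buildCorpus(texts, stopWords) {
  const vocab = [];            // index -> word
  const indexOf = new Map();   // word -> index
  const documents = [];        // one array of word indices per text

  for (const text of texts) {
    const ids = [];
    for (const raw of text.toLowerCase().split(/[^a-z0-9']+/)) {
      // Drop empty tokens, single characters, and stop words.
      if (raw.length < 2 || stopWords.has(raw)) continue;
      if (!indexOf.has(raw)) {
        indexOf.set(raw, vocab.length);
        vocab.push(raw);
      }
      ids.push(indexOf.get(raw));
    }
    if (ids.length > 0) documents.push(ids); // skip texts with no usable words
  }
  return { vocab, documents };
}

const corpus = buildCorpus(
  ["Meeting at the office tomorrow", "Office meeting moved to Friday"],
  new Set(["at", "the", "to"])
);
const numTopics = 1; // the topic count is fixed before LDA runs
// corpus.vocab     is ["meeting", "office", "tomorrow", "moved", "friday"]
// corpus.documents is [[0, 1, 2], [1, 0, 3, 4]]
```

The vocabulary size (corpus.vocab.length), the indexed documents, and the chosen topic count are the usual inputs to an LDA run.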

This tip shows how to integrate a JavaScript implementation of LDA into a Tizen Web app to detect topics in the texts of the messages in the message box. Each topic is represented as a collection of words. Word strength depends on word length and on the word's association with other words. For simplicity, the number of topics to detect is limited to one in this sample application.
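In LDA terms, a topic is a probability distribution over the vocabulary, and it is usually shown to the user as its few most probable words. A minimal sketch of that last step follows. The helper name topWords and the probabilities are made up for illustration and do not come from the sample application.

```javascript
// Hypothetical helper: describe one topic by its n most probable words.
// wordProbs[i] is the probability of vocab[i] within the topic.
function topWords(wordProbs, vocab, n) {
  return wordProbs
    .map((p, i) => ({ word: vocab[i], p }))
    .sort((a, b) => b.p - a.p)   // most probable first
    .slice(0, n)
    .map((entry) => entry.word);
}

const topic = topWords(
  [0.05, 0.4, 0.3, 0.25],                    // made-up probabilities
  ["hello", "meeting", "office", "friday"],
  3
);
// topic is ["meeting", "office", "friday"]
```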

Some of the steps for extracting messages from device message storage are discussed in a separate Tip & Tech document. Those steps are:

Step-1: Getting all messages in Web(UI) App

Step-2: Receiving launch request from UI app

Step-3: Retrieving messages in service App

Step-4: Closing message service and destroying message handle

All these steps are described in detail at this link:

https://developer.tizen.org/community/tip-tech/communication-within-tizen-hybrid-app

After the "closing message service and destroying message handle" step, a topic modeling algorithm has to be applied to extract topics from the message texts.

Step-5: Adding the LDA library to a Tizen Web app

To use the LDA library in a Tizen Web project, add all the .js files from the link below.

https://github.com/awaisathar/lda.js/tree/gh-pages/js

Copy those files into the js folder of the Tizen Web project, as shown in the image below.

After that, link the library from index.html by adding these two lines:

<script type="text/javascript" src="js/stopwords.js"></script>
<script type="text/javascript" src="js/lda.js"></script>

index.html will then look like the image below:

Step-6: Detecting different topics

Topics are detected from the message text when the Analyze button of the Web UI app is pressed. Pressing that button calls the btnTopiciseClicked() function, which is written as follows:

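function btnTopiciseClicked() {
	// Disable the button while the analysis runs.
	$('#btnTopicise').attr('disabled','disabled');
	// Treat each line of the text area as one sentence.
	// sentences is assigned as a global so that topicise() can read it.
	sentences = $('#text').val().split("\n");
	topicise();
	// Re-enable the button once the topics have been generated.
	$('#btnTopicise').removeAttr('disabled');
}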
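Splitting on "\n" alone leaves empty strings for blank lines and a trailing "\r" on lines that end with Windows-style line breaks. Whether the library filters these out is not shown here, so a defensive variant of the split can be sketched as below. The helper name toSentences is an assumption for this example and is not part of the sample application.

```javascript
// Hypothetical helper: split text into one entry per line, dropping
// blank lines and handling Windows-style line endings.
function toSentences(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

const sampleSentences = toSentences("See you at 5\r\n\r\n  Lunch tomorrow?  \n");
// sampleSentences is ["See you at 5", "Lunch tomorrow?"]
```

In the click handler, the result of such a helper could be assigned to sentences in place of the plain split.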

The topicise() function generates topics using a variant of LDA. LDA represents each document as a mixture of topics with certain probabilities, where each topic is a collection of words. It is a way of automatically discovering the topics that documents contain.
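To make the idea concrete, here is a minimal sketch of one common way to fit LDA, collapsed Gibbs sampling. It is an illustration written for this tip and is not the code inside lda.js. The function name ldaGibbs, its parameters, and the hyperparameter values alpha and beta are assumptions chosen for the example. Documents are given as arrays of word indices into a vocabulary of size V.

```javascript
// Minimal collapsed Gibbs sampler for LDA (illustrative, not lda.js itself).
// docs: arrays of word indices; V: vocabulary size; K: number of topics.
function ldaGibbs(docs, V, K, alpha, beta, iterations, rng = Math.random) {
  const docTopic = docs.map(() => new Array(K).fill(0));   // count of topic k in doc d
  const topicWord = Array.from({ length: K }, () => new Array(V).fill(0)); // count of word w in topic k
  const topicTotal = new Array(K).fill(0);                 // words assigned to topic k

  // Start from a random topic assignment for every word occurrence.
  const z = docs.map((doc, d) =>
    doc.map((w) => {
      const k = Math.floor(rng() * K);
      docTopic[d][k]++; topicWord[k][w]++; topicTotal[k]++;
      return k;
    })
  );

  for (let it = 0; it < iterations; it++) {
    docs.forEach((doc, d) => {
      doc.forEach((w, i) => {
        // Remove the current assignment from the counts.
        let k = z[d][i];
        docTopic[d][k]--; topicWord[k][w]--; topicTotal[k]--;

        // Unnormalised probability of each topic given all other assignments.
        const weights = [];
        let sum = 0;
        for (let t = 0; t < K; t++) {
          const p = ((docTopic[d][t] + alpha) * (topicWord[t][w] + beta)) /
                    (topicTotal[t] + V * beta);
          weights.push(p);
          sum += p;
        }

        // Draw the new topic in proportion to the weights.
        let u = rng() * sum;
        k = 0;
        while (k < K - 1 && u >= weights[k]) { u -= weights[k]; k++; }

        z[d][i] = k;
        docTopic[d][k]++; topicWord[k][w]++; topicTotal[k]++;
      });
    });
  }

  // theta[d][k]: topic mix per document; phi[k][w]: word distribution per topic.
  const theta = docTopic.map((row, d) =>
    row.map((n) => (n + alpha) / (docs[d].length + K * alpha)));
  const phi = topicWord.map((row, k) =>
    row.map((n) => (n + beta) / (topicTotal[k] + V * beta)));
  return { theta, phi };
}

// Two tiny documents over a five-word vocabulary, two topics.
const model = ldaGibbs([[0, 1, 2], [1, 0, 3, 4]], 5, 2, 0.1, 0.01, 200);
```

Each row of model.theta gives the topic proportions of one document, and each row of model.phi gives the word probabilities of one topic. With a single topic, as in this sample application, phi reduces to smoothed word frequencies over all the texts.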

For more technical details on this NLP algorithm, refer to https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation and http://www.cs.columbia.edu/~blei/topicmodeling.html

Running the sample application

The sample application can now retrieve the list of messages from the message box and detect topics in the message texts.

Figure: Retrieving list of messages & detecting different topics from message texts

References:

[1] https://github.com/awaisathar/lda.js

[2] https://developer.tizen.org/ko/development/api-references/native-application?redirect=/dev-guide/2.3.1/org.tizen.native.wearable.apireference/group__CAPI__MESSAGING__MESSAGES__MODULE.html&langredirect=1

[3] https://developer.tizen.org/ko/development/guides/native-application/application-management/application-controls?langredirect=1

[4] https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation

[5] http://www.cs.columbia.edu/~blei/topicmodeling.html

SDK Version Since: 2.4 mobile/2.3.1 wearable