
Saturday, January 2, 2016

Issues while setting up Jade (Java Agent Development Framework)

Use the following command to avoid the errors given below:

java -cp "jade.jar:(path to your classes)" jade.Boot -agents nickName:(fully qualified name for the agent class E.g., packageName:className)

E.g.,
java -cp "jade.jar:Control-0.0.1-SNAPSHOT.jar" jade.Boot -agents buy:Examples.BookBuyerAgent

Possible errors due to issues in class path or class name:

Error creating the Profile [Can't load properties: Cannot find file buyer:BookBuyerAgent]
jade.core.ProfileException: Can't load properties: Cannot find file buyer:BookBuyerAgent
    at jade.core.ProfileImpl.<init>(ProfileImpl.java:129)
    at jade.Boot.main(Boot.java:76)


jade.Boot: No such file or directory

SEVERE: Cannot create agent buyer: Class BookBuyerAgent for agent ( agent-identifier :name buyer@172.20.10.2:1099/JADE ) not found - Caused by:  BookBuyerAgent

Tuesday, September 15, 2015

Issues when setting up DeepLearning4J in Mac OSX

I got the following errors when I was trying out a DeepLearning4J example with Deep Belief Nets (DBNs) on Mac OS X 10.10 (Yosemite).
Jblas is said to be already available in Mac OS X, but I still got some errors related to it.

Jblas is a pre-requisite for setting up DeepLearning4J. 


DeepLearning4J uses ND4J to enable scientific computing with N-dimensional arrays for Java. ND4J works on several backend linear algebra libraries (with CPU or GPU execution support). Jblas is one Java backend used in DeepLearning4J for the required matrix operations.

NoAvailableBackendException ND4J 

Solution: Add the following dependency

<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-jblas</artifactId>
    <version>0.4-rc0</version>
</dependency>

java.lang.ClassNotFoundException: org.jblas.NativeBlas 

Solution: Add the following dependency

<dependency>
    <groupId>org.jblas</groupId>
    <artifactId>jblas</artifactId>
    <version>1.2.4</version>
</dependency>

Saturday, August 15, 2015

Latent Dirichlet Allocation (LDA) with Apache Spark MLlib

Latent Dirichlet allocation (LDA) is a scalable machine learning algorithm for topic annotation or topic modelling. It is available in Apache Spark MLlib. I will not explain the internals of the algorithm in detail here.

Please visit the following link for more information about LDA algorithm.
http://jayaniwithanawasam.blogspot.com/2013/12/infer-topics-for-documents-using-latent.html

Here’s the code for LDA algorithm in Spark MLlib.
import scala.Tuple2;

import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;

public class lda {

  public static void main(String[] args) {

    // Spark configuration details
    SparkConf conf = new SparkConf().setAppName("LDA");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load and parse the data (sample_lda_data.txt is available with the Spark installation)
    // word count vectors (columns: terms [vocabulary], rows: documents)
    String path = "data/mllib/sample_lda_data.txt";

    // Read data
    // Creates an RDD with each line as an element
    // E.g., 1 2 6 0 2 3 1 1 0 0 3
    JavaRDD<String> data = sc.textFile(path);

    // Map is a transformation that passes each element through a function
    // It returns a new RDD representing the results
    // Prepares the input as a numerical representation
    JavaRDD<Vector> parsedData = data.map(
        new Function<String, Vector>() {
          public Vector call(String s) {
            String[] sarray = s.trim().split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++)
              values[i] = Double.parseDouble(sarray[i]);
            return Vectors.dense(values);
          }
        }
    );

    // Index documents with unique IDs
    // The transformation 'zipWithIndex' provides stable indexing, numbering each element in its original order.
    JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
        new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
          public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
            return doc_id.swap();
          }
        }
    ));
    corpus.cache();

    // Cluster the documents into three topics using LDA
    // number of topics = 3
    DistributedLDAModel ldaModel = new LDA().setK(3).run(corpus);

    // Topic and its term distribution
    // columns = 3 topics, rows = terms (vocabulary)
    System.out.println("Topic-Term distribution: \n" + ldaModel.topicsMatrix());

    // Document and its topic distribution
    // [(doc ID: [topic 1, topic 2, topic 3]), (doc ID: ...]
    JavaRDD<Tuple2<Object, Vector>> topicDist = ldaModel.topicDistributions().toJavaRDD();
    System.out.println("Document-Topic distribution: \n" + topicDist.collect());

    sc.close();
  }
}

Output:

Topic-Term distribution




Document-Topic distribution

Market Basket Analysis with Apache Spark MLlib FP-Growth

Market Basket Analysis 

source: http://www.noahdatatech.com/solutions/predictive-analytics/

Market basket analysis identifies items in a supermarket that customers are likely to buy together.
e.g., customers who bought pampers also bought beer

      
This is important for supermarkets, both to arrange their items in a consumer-convenient manner and to come up with promotions that take item affinity into consideration.

Frequent Item set Mining and Association Rule Learning  


Frequent item set mining is a sub-area of data mining that focuses on identifying frequently co-occurring items. Once the frequent item sets are ready, we can come up with rules to derive associations between items.
e.g., frequent item set = {pampers, beer, milk}, association rule = {pampers, milk ---> beer}

There are two popular approaches for frequent item set mining and association rule learning, as given below:

Apriori algorithm 
FP-Growth algorithm

To explain the above algorithms, let us consider an example with 4 customers making 4 transactions in a supermarket, containing 7 items in total, as given below:

    Transaction 1: Jana’s purchase: egg, beer, pampers, milk
    Transaction 2: Abi’s purchase: carrot, milk, pampers, beer
    Transaction 3: Mahesha’s purchase: perfume, tissues, carrot
    Transaction 4: Jayani’s purchase: perfume, pampers, beer

    Item index
    1: egg, 2: beer, 3: pampers, 4: carrot, 5: milk, 6: perfume, 7: tissues

Using Apriori algorithm


The Apriori algorithm identifies frequent item sets by starting with individual items and extending the item set by one item at a time. This is known as the candidate generation step.
The algorithm relies on the property that any subset of a frequent item set is also frequent.

Transaction: Items
1: 1, 2, 3, 5
2: 4, 5, 3, 2
3: 6, 7, 4
4: 6, 3, 2

Minimum Support 


Minimum support is used to prune the associations that are less frequent.

Support of an item set = number of transactions in which the item set occurs / total number of transactions

For example, let's say we define the minimum support as 0.5.
The support for egg is 1/4 = 0.25 (0.25 < 0.5), so it is eliminated. The support for beer is 3/4 = 0.75 (0.75 > 0.5), so it is considered for further processing.

Calculation of support for all items

size of the candidate itemset = 1

itemset: support
1: 0.25: eliminated
2: 0.75
3: 0.75
4: 0.5
5: 0.5
6: 0.5
7: 0.25: eliminated

remaining items: 2, 3, 4, 5, 6

extend candidate itemset by 1
size of the items = 2

itemset: support
2, 3: 0.75
2, 4: 0.25: eliminated
2, 5: 0.5
2, 6: 0.25: eliminated
3, 4: 0.25: eliminated
3, 5: 0.5
3, 6: 0.25: eliminated
4, 5: 0.25: eliminated
4, 6: 0.25: eliminated
5, 6: 0.25: eliminated

remaining items: {2,3},{ 2, 5}, {3, 5}

extend candidate itemset by 1
size of the items = 3

2, 3, 5: 0.5
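
To make the support calculation concrete, here is a minimal, self-contained sketch that computes the support of the candidate item set {2, 3, 5} over the four example transactions above (class and method names are illustrative):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SupportExample {

    // support = number of transactions containing all candidate items / total transactions
    static double support(List<Set<Integer>> transactions, Set<Integer> candidate) {
        int count = 0;
        for (Set<Integer> t : transactions) {
            if (t.containsAll(candidate)) {
                count++;
            }
        }
        return (double) count / transactions.size();
    }

    public static void main(String[] args) {
        List<Set<Integer>> transactions = Arrays.asList(
            new HashSet<Integer>(Arrays.asList(1, 2, 3, 5)),  // Jana
            new HashSet<Integer>(Arrays.asList(4, 5, 3, 2)),  // Abi
            new HashSet<Integer>(Arrays.asList(6, 7, 4)),     // Mahesha
            new HashSet<Integer>(Arrays.asList(6, 3, 2)));    // Jayani

        double minSupport = 0.5;
        Set<Integer> candidate = new HashSet<Integer>(Arrays.asList(2, 3, 5)); // beer, pampers, milk
        double s = support(transactions, candidate);
        // prints: support({2,3,5}) = 0.5, frequent: true
        System.out.println("support({2,3,5}) = " + s + ", frequent: " + (s >= minSupport));
    }
}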

Using FP-Growth algorithm


In the FP-Growth algorithm, frequent patterns are mined using a tree approach (construction of a Frequent Pattern Tree).
FP-Growth has been shown to execute much faster than the Apriori algorithm.

Calculate the support for frequent items and sort them in decreasing order of frequency, as given below:

item: frequency
1: 1 - eliminated
2: 3
3: 3
4: 2
5: 2
6: 2
7: 1 - eliminated

Decreasing order of the frequency
2 (3), 3 (3), 4 (2), 5 (2), 6 (2)

Construction of FP-Tree

A) Transaction 1
 1, 2, 3, 5 > 2 (1), 3 (1), 5 (1)

B) Transaction 2
4, 5, 3, 2 > 2 (2), 3 (2), 4 (1), 5 (1)

C) Transaction 3
6, 7, 4 > 4 (1), 6 (1)

D) Transaction 4
6, 3, 2 > 2 (3), 3 (3), 6 (1)

Once the FP-tree is constructed, frequent item sets are calculated using a depth-first strategy along with a divide-and-conquer mechanism.
This makes the algorithm computationally more efficient and parallelizable (e.g., using map-reduce).

Code Example with Apache Spark MLlib

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.fpm.FPGrowth;
import org.apache.spark.mllib.fpm.FPGrowthModel;

import com.google.common.base.Joiner;

public class fpgrowth {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("Market Basket Analysis");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Items
        String item1 = "egg";
        String item2 = "beer";
        String item3 = "pampers";
        String item4 = "carrot";
        String item5 = "milk";
        String item6 = "perfume";
        String item7 = "tissues";

        // Transactions
        List<String> transaction1 = new ArrayList<String>();
        transaction1.add(item1);
        transaction1.add(item2);
        transaction1.add(item3);
        transaction1.add(item5);

        List<String> transaction2 = new ArrayList<String>();
        transaction2.add(item4);
        transaction2.add(item5);
        transaction2.add(item3);
        transaction2.add(item2);

        List<String> transaction3 = new ArrayList<String>();
        transaction3.add(item6);
        transaction3.add(item7);
        transaction3.add(item4);

        List<String> transaction4 = new ArrayList<String>();
        transaction4.add(item6);
        transaction4.add(item3);
        transaction4.add(item2);

        List<List<String>> transactions = new ArrayList<List<String>>();
        transactions.add(transaction1);
        transactions.add(transaction2);
        transactions.add(transaction3);
        transactions.add(transaction4);

        // Make the transaction collection parallel with Spark
        JavaRDD<List<String>> transactionsRDD = sc.parallelize(transactions);

        // Set configurations for FP-Growth
        FPGrowth fpg = new FPGrowth()
          .setMinSupport(0.5)
          .setNumPartitions(10);

        // Generate the model
        FPGrowthModel<String> model = fpg.run(transactionsRDD);

        // Display frequently co-occurring items
        for (FPGrowth.FreqItemset<String> itemset: model.freqItemsets().toJavaRDD().collect()) {
           System.out.println("[" + Joiner.on(",").join(itemset.javaItems()) + "], " + itemset.freq());
        }
        sc.close();
    }
}

Saturday, July 25, 2015

How to set up Apache Spark (Java) - MLlib in Eclipse?

Apache Spark version: 1.3.0

Download Apache Spark required pre-built version from the following link:
http://spark.apache.org/downloads.html

Create Maven project in Eclipse
File > New > Maven Project

Add the following dependencies in pom.xml:



<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.10</artifactId>
  <version>1.3.0</version>
  <scope>provided</scope>
</dependency>

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.3.0</version>
  <scope>provided</scope>
</dependency>

We set the scope to "provided" because those dependencies are already available on the Spark server.

Create a new class and add your Java source code for the required MLlib algorithm.

Run as > Maven Build… > package

Verify that the .jar file is created in the 'target' folder of the Maven project.

Change directory to the Spark installation you downloaded and unpacked, and try the following command:
./bin/spark-submit --class <main class> --master local[2] <path to application jar>
E.g.,

./bin/spark-submit --class fpgrowth --master local[2] /Users/XXX/target/uber-TestMLlib-0.0.1-SNAPSHOT.jar

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.mllib

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.mllib.clustering.LDA.run(Lorg/apache/spark/api/java/JavaPairRDD;)Lorg/apache/spark/mllib/clustering/DistributedLDAModel

Cause: the Spark version used at compile time (from the Maven repository) was different from the runtime Spark version in the Spark server class path (Spark installation directory/lib).

Friday, November 7, 2014

Java Game Development - Part 2: Developing multi user game



Extending the game as multi-user application


When extending this game to a multi-user game, another player baby fish is added to the game. In addition to escaping from enemy fish and meeting friendly fish, the baby fish players compete with each other to reach home sooner than the other, while collecting the maximum number of points.

Design

Client - server architecture

A client-server architecture is used instead of a peer-to-peer approach due to its simplicity and ease of development.

In a client-server architecture all the game players (baby fishes), or "clients", are connected to a central machine, the Fish game server. 

The server is responsible for important decisions such as creating game friend/ enemy fish collection, managing state and broadcasting this information (x, y coordinates of players and non-players) to the individual clients. 

As a result, the server can become a key bottleneck for both bandwidth and computation, and this approach consumes more network bandwidth overall.



Concurrent game playing using multi-threading


A multi-threading approach is used to enable multiple users to play the game concurrently. A separate thread represents each client.

Network Access using socket communication


TCP/IP socket communication (low-level network communication) is used for two-way communication between the server and clients. The Remote Method Invocation (RMI) approach is not used here, as it would incur additional processing overhead.
Using encapsulation to support multiple network protocols

The FishServer class and FishClient interface do not include any TCP/IP networking-specific programming. This generic design will support different network protocols without changing the core game logic.

Class Diagram


 Sequence Diagram


Implementation

Threading in Java

Synchronized keyword

The synchronized keyword is used because the game server accepts different threads that request or send messages and access the same resources (e.g., objects, variables); synchronization prevents thread interference and memory consistency errors.
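
For illustration, here is a minimal sketch of a shared server-side object guarded with synchronized methods (the ScoreBoard name and fields are assumptions, not the actual game code):

import java.util.HashMap;
import java.util.Map;

public class ScoreBoard {

    private final Map<String, Integer> scores = new HashMap<String, Integer>();

    // Only one client-handler thread can update the scores at a time
    public synchronized void addPoints(String player, int points) {
        Integer current = scores.get(player);
        scores.put(player, current == null ? points : current + points);
    }

    public synchronized int getScore(String player) {
        Integer current = scores.get(player);
        return current == null ? 0 : current;
    }
}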

Runnable interface

The Runnable interface is used to implement what each player client thread is supposed to do once executed.
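
A minimal sketch of such a client-handler thread is given below (the PlayerHandler name is illustrative):

import java.net.Socket;

public class PlayerHandler implements Runnable {

    private final Socket clientSocket;

    public PlayerHandler(Socket clientSocket) {
        this.clientSocket = clientSocket;
    }

    @Override
    public void run() {
        // Read player moves from clientSocket and broadcast updated game state here
    }
}

// Usage: new Thread(new PlayerHandler(socket)).start();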

Networking in Java

Serializable Interface

We need to send game information such as game scores and player/non-player x, y coordinates across the network.

To achieve this, the state of the objects is transmitted across the network by converting the objects to byte arrays.

Objects are serialized before being sent over the network and deserialized once received, using the Serializable interface.
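
A minimal sketch of such a serializable message class (the class name and fields are illustrative assumptions):

import java.io.Serializable;

// Carries a snapshot of game state across the network
public class GameState implements Serializable {

    private static final long serialVersionUID = 1L;

    public int playerX;
    public int playerY;
    public int score;
}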

Socket and ServerSocket


A ServerSocket object is created to listen for incoming game player client connections.
The accept method is used to wait for incoming connections.

The accept method returns an instance of the Socket class, which represents the connection to the new client.

ObjectOutputStream/ObjectInputStream methods (writeObject, readObject) are used to obtain object streams for reading from and writing to the new client.
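
A minimal sketch of the server-side flow described above (the port number and messages are illustrative; in the game, the objects written and read would be serializable game-state objects):

import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class FishGameServer {

    public static void main(String[] args) throws Exception {
        ServerSocket serverSocket = new ServerSocket(5000); // port is illustrative
        while (true) {
            Socket client = serverSocket.accept(); // wait for a player to connect

            ObjectOutputStream out = new ObjectOutputStream(client.getOutputStream());
            ObjectInputStream in = new ObjectInputStream(client.getInputStream());

            out.writeObject("welcome");            // send an initial (serializable) object
            Object fromClient = in.readObject();   // read the client's update

            // In the real server, hand the connection to a per-client handler thread here
        }
    }
}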

UI









 



Java Game Development - Part 1: Developing a single user game using strategy design pattern

Introduction


The Game

The lost fish is a simple educational game for children to learn about behaviors and characteristics of different sea creatures such as fishes and plants in a fun and challenging way.

E.g., Dolphin is an innocent, friendly sea animal, whereas shark is harmful

Also, it is intended to make the children aware of common characteristics of the sea creatures, which belongs to a particular category.

E.g., harmful animals will scream furiously and look angry

In this game, the player is a baby fish that is lost in the sea. The baby fish has to find its way home past different barriers, while escaping from harmful animals.

Game Rules


Assume that, before starting the game, the baby fish is given some knowledge about the sea creatures by its mother fish. However, the mother fish will not be able to tell it about all the sea creatures.

If the baby fish collides with a harmful creature, it will get weak (lose energy).

If the baby fish identifies and meets a friendly fish, it will gain energy. If the energy level drops below zero, the baby fish dies and the game is over.

The baby fish will get bonus points if it reaches home soon.

Win the Game


To win the game, the baby fish has to reach home with the maximum energy level and maximum bonus points.

Design

This game is designed with the intention of extending it further with a variety of sea animals with different appearances and behaviours. This will keep the players interested and also make the game a good learning resource.

Design Decisions

OOP concepts such as encapsulation, inheritance and polymorphism are used to improve the reusability, maintainability and the extendibility of the game.

 Strategy Design Pattern

Strategy design pattern is used to effectively extend the game with new sea creatures with diverse behaviors, to keep the player entertained. Also, different behaviors can be dynamically invoked using this approach.

An example scenario is given below:

Assume we have to add two new sea animals (e.g., whale and jellyfish) with different or existing sound behaviours to the game, without making major changes to the core game design and while avoiding duplicate code. If we use the traditional inheritance approach, where all sea animal behaviours (e.g., sound/scream) are inherited from the parent class SeaAnimal, then the code will be duplicated for similar behaviours. The given approach solves that problem by using interface implementations for similar behaviours, as shown in the sketch after the list below.

The current design supports the above scenario in two different ways.
•    Inheritance: New sea creatures can be created by extending “SeaAnimal” abstract class
•    Polymorphism: Novel sound behaviors can be added or existing sound behaviors can be reused using “Sound” interface.
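
To make this concrete, here is a minimal sketch of the described design (the SeaAnimal and Sound names come from the design above; the concrete classes and sound behaviours are illustrative assumptions, not the actual game source):

// Sound behaviour is composed into SeaAnimal rather than inherited
interface Sound {
    void makeSound();
}

class FuriousScream implements Sound {
    public void makeSound() { System.out.println("Furious scream!"); }
}

class FriendlySound implements Sound {
    public void makeSound() { System.out.println("Gentle whistle"); }
}

abstract class SeaAnimal {

    private Sound sound;

    protected SeaAnimal(Sound sound) { this.sound = sound; }

    public void performSound() { sound.makeSound(); }

    // Behaviour can be swapped at runtime
    public void setSound(Sound sound) { this.sound = sound; }
}

class Shark extends SeaAnimal {
    Shark() { super(new FuriousScream()); }
}

class Dolphin extends SeaAnimal {
    Dolphin() { super(new FriendlySound()); }
}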

Using Constants

Constant values are used wherever applicable to improve reusability and maintainability.

Class Diagram


Sequence diagrams






 UI Design


Here's the video: https://www.youtube.com/watch?v=ipv_6yYAUw4

References

[1] http://obviam.net/index.php/design-in-game-entities-object-composition-strategies-part-1/



Tuesday, April 8, 2014

Basic SOLR concepts explained with a simple use case

This is an example scenario to understand the basic concepts behind SOLR/Lucene indexing and search, using an advertising web site [4].

Use case:

Searcher: I want to search for cars by different aspects such as car model, location, transmission, special features, etc. Also, I want to see similar cars that belong to the same model, as recommendations.

SOLR uses an index, which is an optimized data structure for fast retrieval.

To create an index, we need to come up with a set of documents with fields in it. How do we create a document for the following advertisement?



Title: Toyota Rav4 for sale
Category: Jeeps
Location: Seeduwa
Model: Toyota Rav4
Transmission: Automatic
Description: find later


SOLR document:


document 1

Title: Toyota Rav4 for sale
Category: Jeeps
Location: Seeduwa
Model: Toyota Rav4
Transmission: Automatic
Description: Brought Brand New By Toyota Lanka-Toyota Rav4 ACA21, YOM-2003, HG-XXXX Auto, done approx 79,500 Km excellent condition, Full Option, Alloy Wheels, Hood Railings, call No brokers please.


Some more documents based on advertisements...

document 2

Title: Nissan March for sale
Category: Cars
Location: Dankotuwa
Model: K11
Transmission: Automatic
Description: A/C, P/S, P/W, Center locking, registered year 1998, full option, Auto, New battery, Alloys, 4 doors, Home used car, Mint condition, Negotiable

document 3

Title: Nissan March K12 for rent
Category: Cars
Location: Galle
Model: K12
Transmission: Automatic
Description: A/C, P/S, P/W, Center locking, registered year 2004, full option, Auto, New battery, Alloys, 4 doors, cup holder, Doctor used car, Mint condition, Negotiable

Inverted Index


Then SOLR creates an inverted index as given below (taking the Title field as an example):

toyota doc1(1x)
rav4 doc1(1x)
sale doc1(1x) doc2(1x)
nissan doc2(1x)
march doc2(1x)

Here, 1x denotes the term frequency of the term in that document for that particular field.


Lucene Analyzers


Note that the term "for" is eliminated during the Lucene stop-word removal process, using Lucene text analyzers. You can also come up with your own analyzer based on your preference.

Field configuration and search

You can configure which fields your documents can contain, and how those fields should be dealt with when adding documents to the index or when querying those fields, using schema.xml.

For example, if you need to index the description field as well, and the value of that field should be retrievable during search, you need to add the corresponding field definition to schema.xml [1], as sketched below.
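
A sketch of such a field definition (the field type name text_general is an assumption and depends on the types defined in your schema):

<field name="description" type="text_general" indexed="true" stored="true"/>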

Now, assume a user search for a vehicle.

Search query: “nissan cars for rent”

The SOLR query would be /solr/select/?q=title:"nissan cars for rent"

OK, what about the other fields (category, location, transmission, etc.)?

By default, the SOLR standard query parser can only search one field. To use multiple fields such as title and description, and give them weights (boosts) to consider during retrieval based on their significance, we should use the DisMax parser [2, 3]. Simply put, using the DisMax parser you can make the title field more important than the description field.
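
For illustration, a DisMax query over both fields with a boost on the title might look like this (the field weights are illustrative):

/solr/select/?q=nissan cars for rent&defType=dismax&qf=title^2.0 description^0.5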



Anatomy of a SOLR query


q - main search statement
fl - fields to be returned
wt - response writer (response format)

http://localhost:8983/solr/select?q=*:*&wt=json
- select all the advertisements

http://localhost:8983/solr/select?q=*:*&fl=title,category,location,transmission&sort=title desc
- select title,category,location,transmission and sort by title in descending order

wt parameter = response writer
http://localhost:8983/solr/select?q=*:*&wt=json - Display results in json format
http://localhost:8983/solr/select?q=*:*&wt=xml - Display results in XML format

http://localhost:8983/solr/select?q=category:cars&fl=title,category,location,transmission -
Give results related to cars only

More options can be found at [5].

Coming up next...

  • Extending SOLR functionality using RequestHandlers and Components
  • SOLR more like this
References:
[1] http://wiki.apache.org/solr/SchemaXml
[2] https://wiki.apache.org/solr/DisMax
[3] http://searchhub.org//2010/05/23/whats-a-dismax/
[4] Ikman.lk
[5] http://wiki.apache.org/solr/CommonQueryParameters

Wednesday, April 2, 2014

How would you decide if a class should be abstract class or interface?

It depends :)

In my opinion, to implement methods declared in an abstract class you need to inherit from the abstract class. One of the key benefits of inheritance is minimising the amount of duplicate code by implementing common functionality in parent classes. So if the abstract class has some common, generic behaviour that can be shared with its concrete classes, then using an abstract class would be optimal.

However, if all the methods are abstract and those methods do not represent any unique or significant behaviour related to the class instances, it may be better to use an interface instead.

Use abstract classes to define planned inheritance hierarchies. Classes with an already defined inheritance hierarchy can extend their behaviour in terms of the "roles" they can play, which are not common to their parent or all the other children, using interfaces. Abstract classes will not help in this situation because of the multiple inheritance restriction in the Java language.
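
As a minimal illustrative sketch (the class and interface names are assumptions, not from any particular code base), an abstract class shares common behaviour while an interface adds a "role":

// Common, reusable behaviour lives in the abstract parent
abstract class Vehicle {

    void startEngine() { System.out.println("Engine started"); }

    // Subclasses must supply their own implementation
    abstract int numberOfWheels();
}

// A "role" any class can take on, regardless of its parent
interface Towable {
    void tow();
}

class Car extends Vehicle implements Towable {

    int numberOfWheels() { return 4; }

    public void tow() { System.out.println("Towing the car"); }
}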

How do interfaces avoid the "Deadly Diamond of Death" problem?

A key difference between an interface and an abstract class is that interfaces simulate multiple inheritance in languages where multiple inheritance is not supported due to the "Deadly Diamond of Death" problem.


Since interface methods do not have an underlying implementation (unlike inherited class methods), this problem does not arise: there can be multiple identical method signatures, but only one implementation for a particular class instance, because duplicate method implementations would not compile.

Reference:
Head First Java

Thursday, March 13, 2014

Constructor() has private access in Class

If it is obvious to you that this has nothing to do with access modifiers, check for version incompatibilities of the .class file or the related class.

package java.nio.file does not exist in Mac OSX

The java.nio.file package is a new addition in Java 1.7, so if the default JDK is set to an older version, this error is given. However, when I checked java -version, it reported java version "1.7.0_45".

If you have Java-version-specific code in your Maven application, add a maven-compiler-plugin section like the following (source/target 1.7) to your pom.xml:

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <version>2.5.1</version>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
      </configuration>
    </plugin>
  </plugins>
</build>

Still it will give the following error:
[ERROR] Failed to execute goal X.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project X: Compilation failure
[ERROR] Failure executing javac, but could not parse the error:
[ERROR] javac: invalid target release: 1.7
[ERROR] Usage: javac
[ERROR] use -help for a list of possible options


To solve this issue, set the JAVA_HOME variable using either of the following methods:

// Set JAVA_HOME for one session
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home

OR

// Set JAVA_HOME permanently
vim ~/.bash_profile
export JAVA_HOME=$(/usr/libexec/java_home)
source .bash_profile
echo $JAVA_HOME

Now compile the application

For those who are curious...

When deciding which JVM to use for compiling, the path specified in JAVA_HOME is used. Here's how to check that:
echo $JAVA_HOME

If it is not specified in JAVA_HOME, you can see where the JDK is located on your machine using the following command:
which java

It will give something like this: /usr/bin/java

Try this to find out where this command points:
ls -l /usr/bin/java

This is a symbolic link to the path /System/Library/Frameworks/JavaVM.framework/Versions/Current/Commands

Now try the following command:
cd /System/Library/Frameworks/JavaVM.framework/Versions
ls

Check where "CurrentJDK" version is linked to. (Right click > Get info)
Mine it was  /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents.

The version that "CurrentJDK" points to determines which JVM is used from the available JVMs.

So, this is why I got "package java.nio.file does not exist" in the first place: the default referenced JDK was older than 1.7.

How to point Current JDK to correct version?

cd /System/Library/Frameworks/JavaVM.framework/Versions
sudo rm CurrentJDK
sudo ln -s /Library/Java/JavaVirtualMachines/jdk1.7.0_21.jdk/Contents/ CurrentJDK

Additional info...

Also, use the following command to verify where java -version is read from (just for fun! :))
sudo dtrace -n 'syscall::posix_spawn:entry { trace(copyinstr(arg1)); }' -c "/usr/bin/java -version"

It will output something like this:
dtrace: description 'syscall::posix_spawn:entry ' matched 1 probe
dtrace: pid 7584 has exited
CPU     ID                    FUNCTION:NAME
  2    629                posix_spawn:entry   /Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home/bin/java



Sunday, February 2, 2014

Better approach to load resources using relative paths in Java

FileInputStream (Absolute path)


To load a resource file such as config.properties for program use, the first thing we would consider is specifying the absolute file path, as given below:

InputStream input = new FileInputStream("/Users/jwithanawasam/some_dir/src/main/resources/
config.properties”);

However, whenever we move the project to another location, this path has to be changed, which is not acceptable.

FileInputStream (Relative path)


So, the next option would be to use the relative file path as given below, instead of giving absolute file path:

InputStream input = new FileInputStream("src/main/resources/config.properties”);

This approach seems to solve the above mentioned concern.

However, the problem with this is that the relative path depends on the current working directory from which the JVM is started. In this scenario it is "/Users/jwithanawasam/some_dir", but in a different deployment setting this may change, which would require changing the given relative path accordingly. Moreover, we as developers do not have much control over the JVM's current working directory.


In either of the above cases, we may get a java.io.FileNotFoundException, which is a familiar exception for most Java developers.


class.getResourceAsStream


At runtime, the JVM checks the class path to locate any user-defined classes and packages. (In Maven, build artifacts and dependencies are stored under the path given by the M2_REPO class path variable, e.g., /Users/jwithanawasam/.m2/repository.) The .jar file, which is the deployable unit of the project, will be located here.

The JVM uses a class loader to load the Java libraries specified in the class path.

So, the best thing we can do is load the resource by specifying a path relative to the class path, using the class loader. The specified relative path will then work irrespective of the actual disk location where the package is deployed.

The following method reads the file using the class loader:

InputStream input = Test.class.getResourceAsStream("/config.properties");
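
For example, the returned stream can be loaded into a java.util.Properties object (a minimal sketch; the Test class name and config.properties follow the snippet above, and the property key is illustrative):

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class Test {

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // "/config.properties" is resolved against the class path root,
        // i.e. src/main/resources/ in a typical Maven project
        try (InputStream input = Test.class.getResourceAsStream("/config.properties")) {
            props.load(input);
        }
        System.out.println(props.getProperty("some.key")); // key name is illustrative
    }
}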

Usually, in Java projects, resources such as configuration files, images, etc. are located under the src/main/resources/ path. So, if we add a resource immediately inside this folder, during packaging the file will be located in the top-level folder of the .jar file.

We can verify this using the following command to extract the contents of the jar file:

jar xf someproject.jar

If you place the resources in another subfolder, then you have to specify the path relative to the src/main/resources/ path.

So, using this approach we can load resources using relative paths in a manner independent of the hard disk location. Once we package the application, it is ready to be deployed anywhere, as it is, without the overhead of having to validate resource file paths, thus improving the portability of the application.

ServletContext.getResourceAsStream for web applications


For web applications, use the following method:

ServletContext context = getServletContext();
    InputStream is = context.getResourceAsStream("/filename.txt");
 
Here, the file path is taken relative to your web application folder (the unzipped version of the .war file).
E.g., mywebapplication.war (unzipped) will have a hierarchy similar to the following:
 
mywebapplication
    META-INF
    WEB-INF
        classes
    filename.txt
 
So, "/" means the root of this web application folder.  
This method allows servlet containers to make a resource available to a servlet from any location, without using a class loader. 


 

Thursday, December 19, 2013

Topic Modeling: Infer topics for documents using Latent Dirichlet Allocation (LDA)

Introduction to Latent Dirichlet Allocation (LDA)


In the LDA model, you first need to create a vocabulary with a probabilistic term distribution over each topic, using a set of training documents.

In a simple scenario, assume there are 2 documents in the training set and their content has the following unique, important terms (important terms are extracted using TF vectors, as mentioned later).

Document 1: "car", "hybrid", "Toyota"
Document 2: "birds", "parrot", "Sri Lanka"

Using the above terms, LDA creates a vocabulary with a probabilistic term distribution over each topic, as given below. We define that we need to form 2 topics from this training content.

Topic 1: car: 0.7,  hybrid: 0.1, Toyota: 0.1, birds: 0.02, parrot: 0.03, Sri Lanka: 0.05

Topic 1: Term-Topic distribution

Topic 2: car: 0.05,  hybrid: 0.03, Toyota: 0.02, birds: 0.4, parrot: 0.5, Sri Lanka: 0.1

Topic 2: Term-Topic distribution

The topic model is created based on the above training data, and it will later be used for inference.

For a new document, you need to infer the probabilistic topic distribution over the document. Assume the document content is as follows:

Document 3: "Toyota", "Prius", "Hybrid", "For sale", "2003"

For the above document, the probabilistic topic distribution will (roughly!) be a value like this:

Topic 1: 0.99, Topic 2: 0.01

Topic distribution over the new document


So, we can use the terms in the topics with high probability (e.g., car, hybrid) as metadata for the document, which can be used in different applications such as search indexing, document clustering, business analytics, etc.

Pre-processing 


  • Preparing input TF vectors

To bring out the important words within a document, we normally use TF-IDF vectors. However, in LDA, TF vectors are used instead of TF-IDF vectors, to recognize the co-occurrence or correlation between words.

(In vector space model [VSM] it is assumed that occurrences of the words are independent of each other, but this assumption is wrong in many cases! n-gram generation is a solution for this problem)
    • Convert input documents to SequenceFile format

A sequence file is a flat file consisting of binary key/value pairs. It is used as the input/output file format for map-reduce jobs in Hadoop, which is the underlying framework that Mahout runs on.
        Configuration conf = new Configuration();
        HadoopUtil.delete(conf, new Path(infoDirectory));
        SequenceFilesFromDirectory sfd = new SequenceFilesFromDirectory();

        // input: directory contains number of text documents
        // output: the directory where the sequence files will be created
        String[] para = { "-i", targetInputDirectoryPath, "-o", sequenceFileDirectoryPath };
        sfd.run(para);
      • Convert sequence files to TF vectors

    Configuration conf = new Configuration();

    Tokenization and Analyzing


    During tokenization, the document content is split into a set of terms/tokens. Different analyzers may use different tokenizers. Stemming and stop-word removal can be done and customized at this stage. Please note that both stemming and stop words are language dependent.

    You can specify your own analyzer if you want, specifying how you want the terms to be extracted. It has to extend the Lucene Analyzer class.

    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);

    DocumentProcessor.tokenizeDocuments(new Path(sequenceFileinputDirectoryPath + "/" + "part-m-00000"), analyzer.getClass().asSubclass(Analyzer.class),
                    new Path(infoDirectory + "/" + DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER), conf);
            analyzer.close();

    There are a couple of important parameters for generating TF vectors.

    In Mahout, the DictionaryVectorizer class is used for TF weighting and n-gram collocation.

    // Minimum frequency of the term in the entire collection to be considered as part of the dictionary file. Terms with lesser frequencies are ignored.
            int minSupport = 5;

    // Maximum size of n-grams to be selected. For more information, visit:  ngram collocation in Mahout
            int maxNGramSize = 2;


    // Minimum log likelihood ratio (This is related to ngram collocation. Read more here.)
    // This work only when maxNGramSize > 1 (Less significant ngrams have lower score here)
            float minLLRValue = 50;


    // Parameters for Hadoop map reduce operations
            int reduceTasks = 1;
            int chunkSize = 200;
            boolean sequentialAccessOutput = true;

        DictionaryVectorizer.createTermFrequencyVectors(new Path(infoDirectory + DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER),
                    new Path(infoDirectory), DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf, minSupport, maxNGramSize, minLLRValue,
                    -1.0f, false, reduceTasks, chunkSize, sequentialAccessOutput, true);

    Once the TF vectors are generated for each training document, the model can be created.

    Training

    • Generate term distribution for each topic and generate topic distribution for each training document 

      (Read about the CVB algorithm in mahout here.)
    CVB0Driver cvbDriver = new CVB0Driver();

    I will explain the parameters and how you need to assign them values. Before that, you need to read the training dictionary into memory as given below:

    Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(
                    dictionaryFilePath), conf);
            Text key = new Text();
            IntWritable val = new IntWritable();
            ArrayList<String> dictLst = new ArrayList<String>();
            while (reader.next(key,val)) {
                System.out.println(key.toString()+" "+val.toString());
                dictLst.add(key.toString());
            }
            String[] dictionary = new String[dictLst.size()];
            dictionary = dictLst.toArray(dictionary);


    Then, you have to convert the vector representation of the documents to a matrix, like this:
            RowIdJob rowidjob = new RowIdJob();
            String[] para = { "-i", inputVectorPath, "-o",
                    TRAINING_DOCS_OUTPUTMATRIX_PATH };
            rowidjob.run(para);

    Now, I will explain each parameter and the factors you should consider when deciding its value.

    // Input path to the above created matrix using TF vectors
    Path inputPath = new Path(TRAINING_DOCS_OUTPUTMATRIX_PATH + "/matrix");

    // Path to save the model (Note: You may need this during inferring new documents)
    Path topicModelOutputPath = new Path(TRAINING_MODEL_PATH);

    // Number of topics (#important!). A lower value results in broader topics and a higher value may result in niche topics. The optimal value for this parameter can vary depending on the given use case. A large number of topics may cause the system to slow down.
    int numTopics = 2;

    // Number of terms in the training dictionary. Here's the method to read that:
    private static int getNumTerms(Configuration conf, Path dictionaryPath) throws IOException {
        FileSystem fs = dictionaryPath.getFileSystem(conf);
        Text key = new Text();
        IntWritable value = new IntWritable();
        int maxTermId = -1;
        for (FileStatus stat : fs.globStatus(dictionaryPath)) {
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, stat.getPath(), conf);
          while (reader.next(key, value)) {
            maxTermId = Math.max(maxTermId, value.get());
          }
          reader.close();
        }
       
        return maxTermId + 1;
      }
          
    int numTerms = getNumTerms(conf, new Path(TRAINING_DOCS_ROOT_PATH + "dictionary.file-0"));

    // Smoothing parameters for p(topic|document) prior: This value can control how term topic likelihood is calculated for each document
            double alpha = 0.0001;
            double eta = 0.0001;
            int maxIterations = 10;
            int iterationBlockSize = 10;
            double convergenceDelta = 0;
            Path dictionaryPath = new Path(TRAINING_DOCS_ROOT_PATH + "dictionary.file-0");

    // Final output path for probabilistic topic distribution training documents
            Path docTopicOutputPath = new Path(TRAINING_DOCS_TOPIC_OUTPUT_PATH);

    // Temporary output path for saving models in each iteration
            Path topicModelStateTempPath = new Path(TRAINING_MODEL_TEMP_PATH);

            long randomSeed = 1;

    // Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. LDA is a generative model: you start with a known model and try to explain the data by refining the parameters to fit the model to the data. These values can be used to evaluate performance.
            boolean backfillPerplexity = false;

            int numReduceTasks = 1;
            int maxItersPerDoc = 10;
            int numUpdateThreads = 1;
            int numTrainThreads = 4;
            float testFraction = 0;

            cvbDriver.run(conf, inputPath, topicModelOutputPath,
                    numTopics, numTerms, alpha, eta, maxIterations, iterationBlockSize, convergenceDelta, dictionaryPath, docTopicOutputPath, topicModelStateTempPath, randomSeed, testFraction, numTrainThreads, numUpdateThreads, maxItersPerDoc, numReduceTasks, backfillPerplexity)    ;

    Once this step is completed, the training phase of topic modeling is over. Now, let's see how to infer topics for new documents using the trained model.
    • Topic Inference for new document

    To infer the topic distribution for a new document, you need to follow the same steps I mentioned earlier, now applied to the new document:
      • Pre-processing - stop word removal
      • Convert the document to sequence file format
      • Convert the content in the sequence file to TF vectors
    There is an important step here. (I missed this step the first time and got wrong results as the outcome :( )

    We need to map the new document's dictionary against the training documents' dictionary and identify the common terms that appear in both. Then, a TF vector needs to be created for the new document with the cardinality of the training documents' dictionary. This is how you should do that:

            //Get the model dictionary file
                    HashMap<String, Integer> modelDictionary = new HashMap<>();
                    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path("reuters-dir/dictionary.file-0"), conf);
                    Text keyModelDict = new Text();
                    IntWritable valModelDict = new IntWritable();
                    int cardinality = 0;
                    while(reader.next(keyModelDict, valModelDict)){
                        cardinality++;
                        modelDictionary.put(keyModelDict.toString(), Integer.parseInt(valModelDict.toString()));
                    }   
                   
                    RandomAccessSparseVector newDocVector = new RandomAccessSparseVector(cardinality);
                   
                    reader.close();
                   
            //Get the new document dictionary file
                    ArrayList<String> newDocDictionaryWords = new ArrayList<>();
                    reader = new SequenceFile.Reader(fs, new Path("reuters-test-dir/dictionary.file-0"), conf);
                    Text keyNewDict = new Text();
                    IntWritable newVal = new IntWritable();
                    while(reader.next(keyNewDict,newVal)){
                        System.out.println("Key: "+keyNewDict.toString()+" Val: "+newVal);
                        newDocDictionaryWords.add(keyNewDict.toString());
                    }
                   
                    //Get the document frequency count of the new vector
                    HashMap<String, Double> newDocTermFreq = new HashMap<>();
                    reader = new SequenceFile.Reader(fs, new Path("reuters-test-dir/wordcount/ngrams/part-r-00000"), conf);
                    Text keyTFNew = new Text();
                    DoubleWritable valTFNew = new DoubleWritable();
                    while(reader.next(keyTFNew, valTFNew)){
                        newDocTermFreq.put(keyTFNew.toString(), Double.parseDouble(valTFNew.toString()));
                    }
                   
                    //perform the process of term frequency vector creation
                    for (String string : newDocDictionaryWords) {
                        if(modelDictionary.containsKey(string)){
                            int index = modelDictionary.get(string);
                            double tf = newDocTermFreq.get(string);
                            newDocVector.set(index, tf);
                        }
                    }
                    System.out.println(newDocVector.asFormatString());

      • Read the model (Term distribution for each topic) 
     // Dictionary is the training dictionary

        double alpha = 0.0001; // default: doc-topic smoothing
        double eta = 0.0001; // default: term-topic smoothing
        double modelWeight = 1f;

    TopicModel model = new TopicModel(conf, eta, alpha, dictionary, 1, modelWeight, new Path(TRAINING_MODEL_PATH));
      • Infer topic distribution for the new document
    The final result, which is the probabilistic topic distribution over the new document, will be stored in this vector.
    If you have a prior guess as to what the topic distribution should be, you can start with it here instead of the uniform prior.

            Vector docTopics = new DenseVector(new double[model.getNumTopics()]).assign(1.0/model.getNumTopics());

    An empty matrix holding intermediate data: the term-topic likelihoods for each term in the new document will be stored here.

            Matrix docTopicModel = new SparseRowMatrix(model.getNumTopics(), newDocVector.size());

     int maxIters = 100;
            for(int i = 0; i < maxIters; i++) {
                model.trainDocTopicModel(newDocVector, docTopics, docTopicModel);
            }
        model.stop();

    To be continued...

    References: Mahout In Action, Wikipedia

Wednesday, December 18, 2013

How to resolve the "import java.nio.file cannot be resolved" error?

You will get the "import java.nio.file cannot be resolved" error with the following imports:
import java.nio.file.Files;
import java.nio.file.Paths;

To resolve it, do the following:
Right click on the project > Properties > Java Compiler > Set Compiler compliance level to 1.7

Refresh the project.

Friday, November 22, 2013

Error: JAVA_HOME is not set.

The following command will output the Java installation directory:
which java

Mine is /usr/bin/java (OS X 10.9).

Then set JAVA_HOME using the command given below:
export JAVA_HOME=/usr/bin/java