Please visit the following link for more information about the LDA algorithm.
http://jayaniwithanawasam.blogspot.com/2013/12/infer-topics-for-documents-using-latent.html
Here’s the code for the LDA algorithm in Spark MLlib (Java).
import scala.Tuple2;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;
public class LDAExample {
public static void main(String[] args) {
// Spark configuration details
SparkConf conf = new SparkConf().setAppName("LDA");
JavaSparkContext sc = new JavaSparkContext(conf);
// Load and parse the data (sample_lda_data.txt is available with Spark installation)
// word count vectors (columns: terms [vocabulary], rows [documents])
String path = "data/mllib/sample_lda_data.txt";
// Read data
// sc.textFile creates an RDD with each line as an element
// E.g., "1 2 6 0 2 3 1 1 0 0 3"
JavaRDD<String> data = sc.textFile(path);
// map is a transformation that passes each element of the RDD through a function
// It returns a new RDD representing the results
// Prepares the input as a numerical (word-count vector) representation
JavaRDD<Vector> parsedData = data.map(
new Function<String, Vector>() {
public Vector call(String s) {
String[] sarray = s.trim().split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i < sarray.length; i++)
values[i] = Double.parseDouble(sarray[i]);
return Vectors.dense(values);
}
}
);
// Index documents with unique IDs
// The transformation 'zipWithIndex' provides stable indexing, numbering each element in its original order
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
return doc_id.swap();
}
}
));
corpus.cache();
// Cluster the documents into three topics using LDA
// number of topics = 3
DistributedLDAModel ldaModel = new LDA().setK(3).run(corpus);
// Topic-term distribution
// rows = terms (vocabulary), columns = 3 topics
System.out.println("Topic-Term distribution: \n" + ldaModel.topicsMatrix());
// Document-topic distribution
// [(doc ID: [topic 1, topic 2, topic 3]), (doc ID: ...]
JavaRDD<Tuple2<Object, Vector>> topicDist = ldaModel.topicDistributions().toJavaRDD();
System.out.println("Document-Topic distribution: \n" + topicDist.collect());
sc.close();
}
}
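The sample file stores one document per line as space-separated term counts over a fixed vocabulary. If you start from raw text instead, you first need to build those count vectors yourself. Here is a minimal plain-Java sketch (no Spark dependency; the vocabulary, tokenization, and class name are hypothetical, just for illustration):

```java
import java.util.Arrays;
import java.util.List;

public class CountVectors {
    // Build a word-count vector over a fixed vocabulary.
    // Tokens not in the vocabulary are simply ignored.
    static double[] countVector(String doc, List<String> vocab) {
        double[] counts = new double[vocab.size()];
        for (String token : doc.toLowerCase().split("\\s+")) {
            int idx = vocab.indexOf(token);
            if (idx >= 0) counts[idx]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("spark", "lda", "topic", "model");
        double[] v = countVector("Spark LDA topic model Spark", vocab);
        // Print in the same format as a line of sample_lda_data.txt
        StringBuilder sb = new StringBuilder();
        for (double c : v) sb.append((int) c).append(" ");
        System.out.println(sb.toString().trim());  // prints "2 1 1 1"
    }
}
```

Each such line can then be parsed into a dense vector exactly as in the `map` step above.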
Output:
Topic-Term distribution
Document-Topic distribution
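The raw `topicsMatrix()` output (one row per vocabulary term, one column per topic) is hard to read directly; a common next step is to list the highest-weighted terms for each topic. A minimal plain-Java sketch of that post-processing, using a toy weight matrix with made-up numbers:

```java
import java.util.ArrayList;
import java.util.List;

public class TopTerms {
    // matrix[term][topic]: term weights per topic, laid out like topicsMatrix()
    static List<Integer> topTermIndices(double[][] matrix, int topic, int n) {
        int vocabSize = matrix.length;
        boolean[] used = new boolean[vocabSize];
        List<Integer> top = new ArrayList<>();
        for (int pick = 0; pick < n && pick < vocabSize; pick++) {
            int best = -1;
            // Select the highest-weighted term not chosen yet
            for (int t = 0; t < vocabSize; t++)
                if (!used[t] && (best < 0 || matrix[t][topic] > matrix[best][topic]))
                    best = t;
            used[best] = true;
            top.add(best);
        }
        return top;
    }

    public static void main(String[] args) {
        // Toy 4-term x 2-topic weight matrix (made-up numbers)
        double[][] m = {
            {10, 1},
            { 2, 8},
            { 7, 1},
            { 1, 6},
        };
        System.out.println(topTermIndices(m, 0, 2));  // prints [0, 2]
        System.out.println(topTermIndices(m, 1, 2));  // prints [1, 3]
    }
}
```

Mapping the resulting indices back through your vocabulary list gives human-readable topic summaries.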