Dev 007: July 2015

Saturday, July 25, 2015

Content Recommendations with PredictionIO

I’m currently working on content recommendations using PredictionIO, mainly with its Similar Product and Recommendation engine templates (linked below):

https://docs.prediction.io/templates/similarproduct/quickstart/
https://docs.prediction.io/templates/recommendation/quickstart/

Before going into the details of content recommendations, let's get an understanding of the underlying frameworks, technologies and languages used.

PredictionIO

PredictionIO (https://prediction.io/) is an open source machine learning server that mainly provides infrastructure support for diverse machine learning algorithms. It offers built-in REST APIs and SDKs to access those algorithms in an operational machine learning environment.

PredictionIO abstracts away the key steps in machine learning, such as:

reading data from the data source (datasource)
preparing data in the required input format (datapreparator)
training the model on the data set (train)
testing the model and querying (test)

It provides a framework for using, implementing or customising algorithms according to application needs.
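
To make the four steps concrete, here is a minimal plain-Python sketch of the same flow. It is not PredictionIO code: the function names, the in-memory event list and the trivial "mean rating per item" model are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the event store (made-up data).
RAW_EVENTS = [
    {"user": "u1", "item": "i1", "rating": 5},
    {"user": "u1", "item": "i2", "rating": 3},
    {"user": "u2", "item": "i1", "rating": 4},
    {"user": "u2", "item": "i3", "rating": 2},
]

def read_data(events):
    """Datasource: read raw events from the data source."""
    return list(events)

def prepare(events):
    """Datapreparator: reshape events into (user, item, rating) triples."""
    return [(e["user"], e["item"], float(e["rating"])) for e in events]

def train(triples):
    """Train: a deliberately trivial model, the mean rating per item."""
    totals, counts = defaultdict(float), defaultdict(int)
    for _, item, rating in triples:
        totals[item] += rating
        counts[item] += 1
    return {item: totals[item] / counts[item] for item in totals}

def query(model, num):
    """Query: return the `num` items with the highest mean rating."""
    ranked = sorted(model.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in ranked[:num]]

model = train(prepare(read_data(RAW_EVENTS)))
top_items = query(model, 2)
print(top_items)
```

Each stage only depends on the output of the previous one, which is what lets a framework swap in a different data source or algorithm without touching the rest.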

PredictionIO is mostly built on the Apache Spark cluster computing framework.

Spark RDDs

Most of the algorithms are implemented in Scala, a JVM-based functional programming language.
The algorithms run on top of Spark, which distributes data as RDDs across a cluster of computers to parallelize the computations performed on them.

An RDD (Resilient Distributed Dataset) is the core abstraction of Spark. It is partitioned across the cluster so that operations can be performed on it in memory. Unlike the Hadoop MapReduce programming paradigm, Spark can cache intermediate results in memory (rather than on disk), which makes it significantly faster for workloads that reuse those results.

Consequently, this approach is well suited to iterative machine learning algorithms such as K-means, gradient descent and ALS (Alternating Least Squares). Further, RDDs optimise fault tolerance in a distributed setting: a lost RDD is recomputed using the instructions recorded as a DAG (Directed Acyclic Graph) of transformations and actions. This avoids the need to replicate the data for recovery.
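
The idea of recomputing lost data from recorded instructions can be sketched with a toy, single-machine class. This is not Spark's API: `ToyRDD` and its `lose_cache` method are hypothetical names used only to illustrate lazy transformations, caching and recovery from lineage.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it remembers how to compute its data
    (the lineage) instead of relying on a stored copy."""

    def __init__(self, source, lineage=()):
        self._source = source      # original input data
        self._lineage = lineage    # recorded chain of transformations
        self._cache = None         # in-memory result, if cached

    def map(self, fn):
        # Transformation: only recorded, nothing is computed yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: only recorded, nothing is computed yet.
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replay the recorded transformations over the source data.
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def cache(self):
        # Keep the computed result in memory for reuse.
        self._cache = self._compute()
        return self

    def lose_cache(self):
        # Simulate losing the in-memory copy (e.g. a failed node).
        self._cache = None

    def collect(self):
        # Action: use the cached copy if present, else recompute from lineage.
        if self._cache is None:
            return self._compute()
        return self._cache

squares_of_evens = (
    ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
)
before = squares_of_evens.collect()
squares_of_evens.lose_cache()
after = squares_of_evens.collect()   # recomputed from the lineage
print(before == after)
```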

Content Recommendations using ALS in MLlib

Content recommendation is a key area that PredictionIO focuses on. Currently, ALS (Alternating Least Squares) is supported in MLlib, Spark's scalable machine learning library.
MLlib is built on Apache Spark for large-scale data processing and is a standard component of Spark.

Why MLlib? - http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf

A key goal of using machine learning in any application is to find hidden patterns or insights in large datasets.

When it comes to recommendations, take movie recommendation as an example: there can be latent (hidden) factors such as actors, genre and theme that affect user ratings and user preferences for movies. The idea of the ALS algorithm is to uncover these latent factors using matrix factorization from linear algebra.

A set of user features (e.g., John likes Keanu Reeves movies) and product features (e.g., Keanu Reeves acts in The Matrix), which can be the implicit reasons behind user preferences, is estimated mathematically from the known user preferences for given items. These features are then used to predict the unseen/unknown preferences of a given user.
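
As a rough illustration of the idea, here is a small plain-Python sketch of alternating least squares on explicit ratings. It is not MLlib's implementation: the toy ratings, the helper names and the dense Gauss-Jordan solver are assumptions chosen to keep the example self-contained. Item factors are held fixed while each user's factors are solved by regularised least squares, then the roles are swapped.

```python
import random

def solve(a, b):
    """Solve the small linear system a x = b (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def update_factors(fixed, ratings_by_row, rank, lam):
    """Recompute one side's factors while the other side stays fixed."""
    updated = []
    for observed in ratings_by_row:           # observed: {other_index: rating}
        # Normal equations: (sum f f^T + lam * I) x = sum rating * f
        a = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for other, rating in observed.items():
            f = fixed[other]
            for i in range(rank):
                b[i] += rating * f[i]
                for j in range(rank):
                    a[i][j] += f[i] * f[j]
        updated.append(solve(a, b))
    return updated

def als(ratings, n_users, n_items, rank=2, num_iterations=20, lam=0.01, seed=3):
    rng = random.Random(seed)
    users = [[rng.random() for _ in range(rank)] for _ in range(n_users)]
    items = [[rng.random() for _ in range(rank)] for _ in range(n_items)]
    by_user = [dict() for _ in range(n_users)]
    by_item = [dict() for _ in range(n_items)]
    for u, i, r in ratings:
        by_user[u][i] = r
        by_item[i][u] = r
    for _ in range(num_iterations):
        users = update_factors(items, by_user, rank, lam)   # items fixed
        items = update_factors(users, by_item, rank, lam)   # users fixed
    return users, items

def predict(users, items, u, i):
    """Predicted rating: dot product of user and item latent factors."""
    return sum(x * y for x, y in zip(users[u], items[i]))

# Made-up (user, item, rating) triples; missing pairs are the unknown preferences.
RATINGS = [(0, 0, 5), (0, 1, 3), (0, 3, 1), (1, 0, 4), (1, 3, 1),
           (2, 0, 1), (2, 1, 1), (2, 3, 5), (3, 0, 1), (3, 3, 4),
           (4, 1, 1), (4, 2, 5), (4, 3, 4)]

users, items = als(RATINGS, n_users=5, n_items=4)
print(predict(users, items, 1, 1))   # a rating user 1 never gave
```

Here the two latent factors play the role of the hidden "reasons" (actor, genre and so on); the algorithm never names them, it only finds numbers that explain the known ratings.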

The following link contains a good introduction to matrix factorisation with the mathematical details. You can try out the Python example as well.
http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/

The following link explains the scalable in-memory matrix calculations of ALS in Spark MLlib well.
http://www.slideshare.net/MrChrisJohnson/collaborative-filtering-with-spark

Given below are the key stages that your data will go through to provide useful item recommendations to users in PredictionIO.

Datasource

HBase is the default backend for the Event store; input is taken as a REST query [item/user/ratings]
Metadata is stored in Elasticsearch
An example of an imported event:
{"eventId":"fBAiPFEGKPOdFsTrGK7VwwAAAU6RUP-Tl37BvsDRuHs","event":"rate","entityType":"user","entityId":"admin","targetEntityType":"item","targetEntityId":"6","properties":{"rating":5},"eventTime":"2015-07-15T10:44:41.491Z","creationTime":"2015-07-15T10:44:41.491Z"}
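
A minimal sketch of how such an event could be prepared for the Event Server's REST interface is given below. The URL assumes a local Event Server on its default port, and `ACCESS_KEY` is a placeholder for your own app's access key. The request is only built, not sent, so the sketch runs without a server.

```python
import json
from urllib import request

# Assumptions: a local Event Server on port 7070 and a placeholder access key.
EVENT_SERVER_URL = "http://localhost:7070/events.json"
ACCESS_KEY = "YOUR_ACCESS_KEY"

def build_rate_event(user_id, item_id, rating, event_time):
    """Build a 'rate' event with the same fields as the imported example."""
    return {
        "event": "rate",
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
        "properties": {"rating": rating},
        "eventTime": event_time,
    }

def build_request(event):
    """Wrap the event in an HTTP POST request with a JSON body."""
    body = json.dumps(event).encode("utf-8")
    return request.Request(
        EVENT_SERVER_URL + "?accessKey=" + ACCESS_KEY,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

event = build_rate_event("admin", "6", 5, "2015-07-15T10:44:41.491Z")
req = build_request(event)
# request.urlopen(req) would send it to a running Event Server.
print(req.get_method(), req.full_url)
```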

Datapreparator

Prepares the input data in the format required by the ML algorithm, which in this case is users, items and user ratings
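
A rough sketch of this preparation step, in plain Python rather than the engine's own code: it keeps only "rate" events and maps the string IDs to integer indices, since MLlib's ALS works with integer user and product IDs. The sample events and the `prepare_ratings` helper are made up for illustration.

```python
import json

# Made-up events in the same JSON shape as the example in the Datasource section.
RAW_EVENTS = [
    '{"event":"rate","entityType":"user","entityId":"admin",'
    '"targetEntityType":"item","targetEntityId":"6","properties":{"rating":5}}',
    '{"event":"rate","entityType":"user","entityId":"john",'
    '"targetEntityType":"item","targetEntityId":"6","properties":{"rating":3}}',
    '{"event":"view","entityType":"user","entityId":"john",'
    '"targetEntityType":"item","targetEntityId":"9","properties":{}}',
]

def prepare_ratings(raw_events):
    """Return id-to-index maps and (user_index, item_index, rating) triples."""
    user_index, item_index, ratings = {}, {}, []
    for line in raw_events:
        e = json.loads(line)
        if e["event"] != "rate":
            continue  # only rating events carry a rating value
        u = user_index.setdefault(e["entityId"], len(user_index))
        i = item_index.setdefault(e["targetEntityId"], len(item_index))
        ratings.append((u, i, float(e["properties"]["rating"])))
    return user_index, item_index, ratings

user_index, item_index, ratings = prepare_ratings(RAW_EVENTS)
print(ratings)
```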

Train

Once trained, the model is saved in the local file system.
Training parameters for the ALS algorithm can be configured in the engine.json file as given below:
"algorithms": [
    {
      "name": "als",
      "params": {
        "rank": 10,
        "numIterations": 20,
        "lambda": 0.01,
        "seed": 3
      }
    }
]

rank - number of latent factors in the model
numIterations - number of iterations to run
lambda - regularization parameter in ALS (to avoid overfitting)
seed - random seed, which makes training runs reproducible
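
If you want to inspect or sanity-check these values outside PredictionIO, a small Python sketch like the following can read them. The `als_params` helper is hypothetical, and the JSON text is just the fragment above wrapped in braces so that it parses on its own.

```python
import json

# The "algorithms" fragment from engine.json, wrapped in braces to be valid JSON.
ENGINE_JSON_FRAGMENT = """
{
  "algorithms": [
    {
      "name": "als",
      "params": {"rank": 10, "numIterations": 20, "lambda": 0.01, "seed": 3}
    }
  ]
}
"""

def als_params(engine_json_text):
    """Return the params block of the algorithm named 'als'."""
    config = json.loads(engine_json_text)
    for algo in config["algorithms"]:
        if algo["name"] == "als":
            return algo["params"]
    raise KeyError("no 'als' algorithm configured")

params = als_params(ENGINE_JSON_FRAGMENT)
print(params["rank"], params["numIterations"], params["lambda"], params["seed"])
```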

Querying (Test)

Once the model is trained, recommendations for unseen user preferences can be requested via the REST service.
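
A minimal sketch of such a request is shown below. It assumes a deployed engine listening locally on port 8000 and the query/response shape used by the Recommendation template ("user" and "num" in the query, "itemScores" in the response). The sample response text is made up, and the request is only built, not sent.

```python
import json
from urllib import request

# Assumption: the engine is deployed locally on port 8000.
ENGINE_URL = "http://localhost:8000/queries.json"

def build_query(user_id, num):
    """Build a POST request asking for `num` recommendations for one user."""
    body = json.dumps({"user": user_id, "num": num}).encode("utf-8")
    return request.Request(
        ENGINE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_item_scores(response_text):
    """Turn the assumed response shape into a list of (item, score) pairs."""
    payload = json.loads(response_text)
    return [(entry["item"], entry["score"]) for entry in payload["itemScores"]]

req = build_query("admin", 4)
# request.urlopen(req).read() would return the real response from the engine.

# Made-up response text, only to show how the result could be read.
SAMPLE_RESPONSE = '{"itemScores": [{"item": "6", "score": 4.8}, {"item": "2", "score": 3.1}]}'
recommendations = parse_item_scores(SAMPLE_RESPONSE)
print(recommendations)
```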

How to set up Apache Spark (Java) - MLlib in Eclipse?

Apache Spark version: 1.3.0

Download the required pre-built version of Apache Spark from the following link:
http://spark.apache.org/downloads.html

Create a Maven project in Eclipse
File > New > Maven Project

Add the following dependencies to pom.xml.

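<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.10</artifactId>
    <version>1.3.0</version>
    <scope>provided</scope>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.3.0</version>
    <scope>provided</scope>
</dependency>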

The scope is set to “provided” because those dependencies are already available on the Spark server.

Create a new class and add your Java source code for the required MLlib algorithm

Run as > Maven Build… > package

Verify that the .jar file is created in the 'target' folder of the Maven project

Change to the directory of the Spark installation you downloaded and unpacked, and run the following command:
./bin/spark-submit --class <main class> --master local[2] <path to jar>
E.g.,

./bin/spark-submit --class fpgrowth --master local[2] /Users/XXX/target/uber-TestMLlib-0.0.1-SNAPSHOT.jar

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.mllib

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.mllib.clustering.LDA.run(Lorg/apache/spark/api/java/JavaPairRDD;)Lorg/apache/spark/mllib/clustering/DistributedLDAModel

This error occurred because the Spark version used at compile time (from the Maven repository) was different from the Spark version at runtime on the Spark server class path (the lib directory of the Spark installation). Making the two versions match resolves it.
