PredictionIO
PredictionIO (https://prediction.io/) is an open source machine learning server that mainly provides infrastructure support for diverse machine learning algorithms. It provides built-in REST APIs and SDKs for accessing machine learning algorithms in an operational machine learning environment.
PredictionIO abstracts away the key steps in machine learning, such as:
- reading data from data source (datasource)
- preparing data according to required input format (datapreparator)
- training the model on the data set (train)
- testing the model and querying (test)
It provides a framework for using, implementing or customising algorithms according to application needs.
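The four steps above can be sketched as plain functions chained together. This is only an illustration of the flow: the function names and the trivial "mean rating per item" model are assumptions made for this example and are not PredictionIO's API.

```python
from collections import defaultdict

# Hypothetical stage functions; the names mirror the steps listed above,
# they are not part of PredictionIO.

def read_data():
    """Datasource: return raw (user, item, rating) records as strings."""
    return [
        ("u1", "i1", "5"),
        ("u1", "i2", "3"),
        ("u2", "i1", "4"),
        ("u2", "i3", "1"),
    ]

def prepare(raw_events):
    """Datapreparator: convert raw records into the format the algorithm needs."""
    return [(user, item, float(rating)) for user, item, rating in raw_events]

def train(ratings):
    """Train: fit a deliberately trivial model, the mean rating per item."""
    totals = defaultdict(lambda: [0.0, 0])
    for _user, item, rating in ratings:
        totals[item][0] += rating
        totals[item][1] += 1
    return {item: total / count for item, (total, count) in totals.items()}

def query(model, item, default=0.0):
    """Test/query: look up the predicted rating for an item."""
    return model.get(item, default)

model = train(prepare(read_data()))
print(query(model, "i1"))  # 4.5, the mean of 5 and 4
```

Keeping each stage behind its own function is what lets one stage (for example the algorithm in `train`) be swapped without touching the others.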
PredictionIO is built mostly on the Spark cluster computing framework.
Most of the algorithms are implemented in Scala, a JVM-based functional programming language.
The algorithms run on top of Spark, which distributes data as RDDs across a cluster of computers to parallelize the computations performed on them.
An RDD (Resilient Distributed Dataset) is the core abstraction of Spark: a collection partitioned across the cluster so that operations on it can run in memory. Unlike the Hadoop MapReduce programming paradigm, Spark can cache intermediate results in memory rather than writing them to disk, which makes it significantly faster for workloads that reuse data.
Consequently, this approach is well suited to iterative machine learning algorithms such as K-means, gradient descent and ALS (Alternating Least Squares). Further, RDDs optimise fault tolerance in a distributed setting: lost partitions are recomputed from their lineage, which is recorded as a DAG (Directed Acyclic Graph) of transformations. This largely avoids the need for data replication.
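The idea of recomputing lost data from recorded transformations can be shown with a toy class. This is a single-machine sketch, not Spark: `ToyDataset` and its methods are hypothetical names chosen for this example, and the source data is assumed to remain available.

```python
class ToyDataset:
    """Toy stand-in for an RDD: it records how to compute its data."""

    def __init__(self, source, lineage=()):
        self.source = source            # original input, assumed reliably stored
        self.lineage = tuple(lineage)   # ordered transformations applied to the source
        self._cache = None              # in-memory copy of the computed result

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        return ToyDataset(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):
        return ToyDataset(self.source, self.lineage + (("filter", pred),))

    def cache(self):
        # Keep the computed result in memory for reuse.
        self._cache = self._compute()
        return self

    def lose_cache(self):
        # Simulate a failure that drops the in-memory copy.
        self._cache = None

    def _compute(self):
        # Replay every recorded transformation against the source.
        data = list(self.source)
        for kind, fn in self.lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def collect(self):
        # Use the cached result if present, otherwise replay the lineage.
        if self._cache is not None:
            return self._cache
        return self._compute()

evens = ToyDataset(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()
before = evens.collect()
evens.lose_cache()
after = evens.collect()
print(before == after)  # the result is rebuilt from the lineage
```

Because the lineage is enough to rebuild the result, no second copy of the computed data has to be kept.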
Content Recommendations using ALS in MLlib
Content recommendation is a key area that PredictionIO focuses on. Currently, ALS (Alternating Least Squares) is supported through MLlib, Spark's scalable machine learning library.
MLlib is built on Apache Spark for large-scale data processing and is a standard component of Spark.
Why MLlib? - http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf
A key goal of using machine learning in any application is to find hidden patterns or insights in large datasets.
When it comes to recommendations, take movie recommendation as an example: there can be latent (hidden) factors such as the actors, genre and theme of a movie that affect user ratings and preferences. The idea of the ALS algorithm is to uncover these latent factors using matrix factorization from linear algebra.
A set of user features (e.g., John likes Keanu Reeves movies) and product features (e.g., Keanu Reeves acts in The Matrix), which can be the implicit reasons for user preferences, is estimated mathematically from known user preferences for given items. These features are then used to predict the unknown preferences of a given user.
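To make the factorization idea concrete, here is a minimal single-machine sketch of ALS in plain Python. It is an illustration under stated assumptions, not MLlib's implementation: the function names, the tiny ratings list and the plain (unweighted) regularisation term are all choices made for this example. Each half-step holds one side's factors fixed and solves a small regularised least-squares problem for the other side.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def update(fixed, observed, rank, lam):
    """Recompute one side's factors while the other side (`fixed`) is held constant."""
    result = []
    for pairs in observed:  # pairs: (index into `fixed`, rating)
        a = [[lam if p == q else 0.0 for q in range(rank)] for p in range(rank)]
        b = [0.0] * rank
        for j, rating in pairs:
            f = fixed[j]
            for p in range(rank):
                b[p] += f[p] * rating
                for q in range(rank):
                    a[p][q] += f[p] * f[q]
        result.append(solve(a, b))
    return result

def predict(users, items, u, i):
    """Estimated rating: dot product of the user and item factor vectors."""
    return sum(x * y for x, y in zip(users[u], items[i]))

def objective(ratings, users, items, lam):
    """Squared error on known ratings plus the regularisation penalty."""
    err = sum((r - predict(users, items, u, i)) ** 2 for u, i, r in ratings)
    norm = sum(v * v for row in users + items for v in row)
    return err + lam * norm

def als(ratings, n_users, n_items, rank=2, lam=0.1, iterations=10, seed=0):
    rng = random.Random(seed)
    users = [[rng.random() for _ in range(rank)] for _ in range(n_users)]
    items = [[rng.random() for _ in range(rank)] for _ in range(n_items)]
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    history = [objective(ratings, users, items, lam)]
    for _ in range(iterations):
        users = update(items, by_user, rank, lam)   # items fixed, solve for users
        items = update(users, by_item, rank, lam)   # users fixed, solve for items
        history.append(objective(ratings, users, items, lam))
    return users, items, history

# Made-up (user, item, rating) triples; (0, 3) and (3, 0) are left unknown.
ratings = [
    (0, 0, 5), (0, 1, 5), (0, 2, 1),
    (1, 0, 5), (1, 1, 5), (1, 2, 1), (1, 3, 1),
    (2, 0, 1), (2, 1, 1), (2, 2, 5), (2, 3, 5),
    (3, 1, 1), (3, 2, 5), (3, 3, 5),
]
users, items, history = als(ratings, n_users=4, n_items=4)
print(round(history[0], 2), "->", round(history[-1], 2))
print(predict(users, items, 0, 3))  # user 0's estimated rating for unseen item 3
```

Since every half-step exactly minimises the objective over one block of factors, the recorded objective never increases from one iteration to the next. A distributed implementation such as MLlib's follows the same alternating scheme but partitions the factor updates across the cluster.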
The following link contains a good introduction, with mathematical details, to matrix factorisation. You can try out the Python example as well.
The following link gives a good explanation of the scalable in-memory matrix calculations of ALS in Spark MLlib.
Given below are the key stages that your data will go through to provide useful item recommendations to users in PredictionIO.
Datasource
HBase is a default backend for the event store. Input is taken as REST queries [item/user/ratings].
Metadata is stored in Elasticsearch.
One example event imported:
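As an illustration only, a rating event in the JSON shape used by PredictionIO's Event API could be built as in the sketch below. The helper name `rate_event`, the IDs and the rating value are made up for this example. Sending the payload is left out; it would typically be POSTed to the Event Server's `/events.json` endpoint with an access key, which is assumed here rather than shown.

```python
import json

def rate_event(user_id, item_id, rating):
    """Build a 'rate' event: a user entity rating a target item entity."""
    return {
        "event": "rate",
        "entityType": "user",
        "entityId": user_id,
        "targetEntityType": "item",
        "targetEntityId": item_id,
        "properties": {"rating": rating},
    }

# Hypothetical IDs and rating, for illustration only.
payload = json.dumps(rate_event("u1", "i42", 4.0), sort_keys=True)
print(payload)
```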
Datapreparator
Prepares the input data in the format that the ML algorithm requires, which in this case is users, items and user ratings.
Train
Once trained, the model is saved to the local file system.
Training parameters for the ALS algorithm can be configured in the engine.json file, as given below:
- rank - number of latent factors for model
- lambda - regularization parameter in ALS (to avoid overfitting)
- iterations - number of iterations
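A fragment carrying these three parameters might be generated as in the sketch below. The values are arbitrary, and the surrounding layout (an `algorithms` list with a `name` and `params` per entry) is an assumption for illustration. The exact key names, in particular the one for the iteration count, can differ between engine templates, so the template's own engine.json should be treated as the reference.

```python
import json

# Illustrative values only; the key names follow the parameter list above.
als_params = {"rank": 10, "lambda": 0.01, "iterations": 20}

# Assumed layout: a list of algorithms, each with a name and its params.
engine_fragment = {"algorithms": [{"name": "als", "params": als_params}]}

print(json.dumps(engine_fragment, indent=2))
```

A higher rank lets the model capture more latent factors at a higher training cost, while a larger lambda penalises large factor values more strongly to limit overfitting.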