Saturday, August 15, 2015

Market Basket Analysis with Apache Spark MLlib FP-Growth

[Image: Market basket analysis. Source: http://www.noahdatatech.com/solutions/predictive-analytics/]

Market basket analysis identifies items in a supermarket that customers are likely to buy together.
e.g., customers who bought pampers also bought beer

This is important for supermarkets, both to arrange their items in a customer-convenient manner and to design promotions that take item affinity into consideration.

Frequent Item Set Mining and Association Rule Learning


Frequent item set mining is a subfield of data mining that focuses on identifying frequently co-occurring items. Once the frequent item sets are known, we can derive rules that express associations between items.
        e.g., frequent item set = {pampers, beer, milk}, association rule = {pampers, milk ---> beer}

        There are two popular approaches for frequent item set mining and association rule learning, as given below:

Apriori algorithm 
FP-Growth algorithm

To explain the above algorithms, let us consider an example with 4 customers making 4 transactions in a supermarket that carries 7 items in total, as given below:

    Transaction 1: Jana’s purchase: egg, beer, pampers, milk
    Transaction 2: Abi’s purchase: carrot, milk, pampers, beer
    Transaction 3: Mahesha’s purchase: perfume, tissues, carrot
    Transaction 4: Jayani’s purchase: perfume, pampers, beer

    Item index
    1: egg, 2: beer, 3: pampers, 4: carrot, 5: milk, 6: perfume, 7: tissues
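The example rule from earlier, {pampers, milk ---> beer}, can be checked directly against these four transactions. The following is a minimal plain-Java sketch (not Spark; the class and method names are mine) that computes the support of an item set and the confidence of a rule:

```java
import java.util.List;
import java.util.Set;

public class RuleConfidence {

    // support(itemset) = fraction of transactions that contain every item in the set
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    // confidence(X -> Y) = support(X union Y) / support(X)
    static double confidence(List<Set<String>> transactions, Set<String> x, Set<String> xUnionY) {
        return support(transactions, xUnionY) / support(transactions, x);
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("egg", "beer", "pampers", "milk"),    // Jana
                Set.of("carrot", "milk", "pampers", "beer"), // Abi
                Set.of("perfume", "tissues", "carrot"),      // Mahesha
                Set.of("perfume", "pampers", "beer"));       // Jayani

        // Rule: {pampers, milk} -> {beer}
        double conf = confidence(transactions,
                Set.of("pampers", "milk"),
                Set.of("pampers", "milk", "beer"));
        System.out.println(conf); // → 1.0 (every basket with pampers and milk also has beer)
    }
}
```

Here the confidence is 1.0 because both baskets containing pampers and milk (Jana's and Abi's) also contain beer.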

Using the Apriori algorithm


The Apriori algorithm identifies frequent item sets by starting with individual items and extending each item set by one item at a time. This is known as the candidate generation step.
The algorithm relies on the observation that any subset of a frequent item set must itself be frequent, so candidates containing an infrequent subset can be pruned.

Transaction: Items
1: 1, 2, 3, 5
2: 4, 5, 3, 2
3: 6, 7, 4
4: 6, 3, 2

Minimum Support 


Minimum support is used to prune the item sets that are less frequent.

support(item set) = number of transactions containing the item set / total number of transactions

For example, let's say we define the minimum support as 0.5.
The support for egg is 1/4 = 0.25 (0.25 < 0.5), so it is eliminated. The support for beer is 3/4 = 0.75 (0.75 > 0.5), so it is considered for further processing.
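The single-item pass below can be sketched in a few lines of plain Java (class and method names are mine, not from the Spark API), using the item indices from the table above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class SingleItemSupport {

    // Return the items whose support meets minSupport, in ascending item order.
    static List<Integer> frequentItems(List<Set<Integer>> transactions, double minSupport) {
        // Count how many transactions each item appears in
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Set<Integer> t : transactions)
            for (int item : t)
                counts.merge(item, 1, Integer::sum);

        // Keep items with support >= minSupport
        List<Integer> frequent = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet())
            if ((double) e.getValue() / transactions.size() >= minSupport)
                frequent.add(e.getKey());
        return frequent;
    }

    public static void main(String[] args) {
        // The four transactions from the table above, using the item index
        List<Set<Integer>> transactions = List.of(
                Set.of(1, 2, 3, 5),
                Set.of(4, 5, 3, 2),
                Set.of(6, 7, 4),
                Set.of(6, 3, 2));

        // Items 1 (egg) and 7 (tissues) fall below the minimum support and are pruned
        System.out.println(frequentItems(transactions, 0.5)); // → [2, 3, 4, 5, 6]
    }
}
```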

Calculation of support for all items

size of the candidate itemset = 1

itemset: support
1: 0.25: eliminated
2: 0.75
3: 0.75
4: 0.5
5: 0.5
6: 0.5
7: 0.25: eliminated

remaining items: 2, 3, 4, 5, 6

extend candidate itemsets by 1
size of the candidate itemset = 2

itemset: support
2, 3: 0.75
2, 4: 0.25: eliminated
2, 5: 0.5
2, 6: 0.25: eliminated
3, 4: 0.25: eliminated
3, 5: 0.5
3, 6: 0.25: eliminated
4, 5: 0.25: eliminated
4, 6: 0.25: eliminated
5, 6: 0.25: eliminated

remaining itemsets: {2, 3}, {2, 5}, {3, 5}

extend candidate itemsets by 1
size of the candidate itemset = 3

2, 3, 5: 0.5
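The whole walkthrough above (prune single items, then repeatedly extend surviving candidates by one item) can be condensed into a small plain-Java sketch. This is an illustration of the candidate generation idea on this toy data set, not a production Apriori implementation, and all names in it are mine:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class MiniApriori {

    // support(itemset) = fraction of transactions containing every item in the set
    static double support(List<Set<Integer>> tx, Set<Integer> itemset) {
        return (double) tx.stream().filter(t -> t.containsAll(itemset)).count() / tx.size();
    }

    // One candidate generation step: extend each frequent itemset by one surviving
    // single item and keep only the candidates that meet the minimum support.
    static Set<Set<Integer>> extend(List<Set<Integer>> tx, Set<Set<Integer>> frequent,
                                    Set<Integer> items, double minSupport) {
        Set<Set<Integer>> next = new HashSet<>();
        for (Set<Integer> fs : frequent)
            for (int item : items) {
                if (fs.contains(item)) continue;
                Set<Integer> candidate = new TreeSet<>(fs);
                candidate.add(item);
                if (support(tx, candidate) >= minSupport) next.add(candidate);
            }
        return next;
    }

    public static void main(String[] args) {
        List<Set<Integer>> tx = List.of(
                Set.of(1, 2, 3, 5), Set.of(4, 5, 3, 2), Set.of(6, 7, 4), Set.of(6, 3, 2));
        double minSupport = 0.5;

        // Level 1: frequent single items (2, 3, 4, 5, 6 in this example)
        Set<Set<Integer>> frequent = new HashSet<>();
        Set<Integer> items = new TreeSet<>();
        for (int item = 1; item <= 7; item++)
            if (support(tx, Set.of(item)) >= minSupport) {
                items.add(item);
                frequent.add(Set.of(item));
            }

        // Levels 2..n: keep extending until no candidate survives
        Set<Set<Integer>> level = frequent;
        while (!(level = extend(tx, level, items, minSupport)).isEmpty())
            frequent.addAll(level);

        System.out.println(frequent.size() + " frequent itemsets"); // → 9 frequent itemsets
    }
}
```

The 9 itemsets are the five frequent single items, the three pairs {2, 3}, {2, 5}, {3, 5}, and the triple {2, 3, 5}, matching the tables above.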

Using the FP-Growth algorithm


In the FP-Growth algorithm, frequent patterns are mined using a tree approach (construction of a Frequent Pattern Tree, or FP-tree).
FP-Growth has been shown to execute much faster than the Apriori algorithm, since it avoids the explicit candidate generation step.

Calculate the support for the frequent items and sort them in decreasing order of frequency, as given below:

item: frequency
1: 1 - eliminated
2: 3
3: 3
4: 2
5: 2
6: 2
7: 1 - eliminated

Decreasing order of the frequency
2 (3), 3 (3), 4 (2), 5 (2), 6 (2)

Construction of FP-Tree

A) Transaction 1
 1, 2, 3, 5 > 2 (1), 3 (1), 5 (1)

B) Transaction 2
4, 5, 3, 2 > 2 (2), 3 (2), 4 (1), 5 (1)

C) Transaction 3
6, 7, 4 > 4 (1), 6 (1)

D) Transaction 4
6, 3, 2 > 2 (3), 3 (3), 6 (1)

Once the FP-tree is constructed, frequent item sets are calculated using a depth-first strategy along with a divide-and-conquer mechanism.
This makes the algorithm computationally more efficient and parallelizable (e.g., using map-reduce).
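The tree construction step (A through D above) can be sketched in plain Java. Each transaction is first filtered to frequent items and sorted in decreasing order of global frequency, then inserted into the tree, incrementing the counts along the shared prefix path. This covers only the construction, not the mining, and the class and method names are mine:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FpTreeSketch {

    // A tree node: a running count plus children keyed by item id
    static class Node {
        final Map<Integer, Node> children = new HashMap<>();
        int count;
    }

    // Insert one transaction (already filtered and frequency-sorted),
    // incrementing counts along the shared prefix path.
    static void insert(Node root, List<Integer> transaction) {
        Node node = root;
        for (int item : transaction) {
            node = node.children.computeIfAbsent(item, k -> new Node());
            node.count++;
        }
    }

    public static void main(String[] args) {
        Node root = new Node();
        // Transactions with the infrequent items (1, 7) removed and the rest
        // sorted in decreasing order of global frequency: 2, 3, 4, 5, 6
        insert(root, List.of(2, 3, 5));    // transaction 1
        insert(root, List.of(2, 3, 4, 5)); // transaction 2
        insert(root, List.of(4, 6));       // transaction 3
        insert(root, List.of(2, 3, 6));    // transaction 4

        Node n2 = root.children.get(2);
        Node n3 = n2.children.get(3);
        System.out.println("count(2)=" + n2.count + ", count(2->3)=" + n3.count);
        // → count(2)=3, count(2->3)=3, matching step D above
    }
}
```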

Code Example with Apache Spark MLlib

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.fpm.FPGrowth;
    import org.apache.spark.mllib.fpm.FPGrowthModel;

    import com.google.common.base.Joiner;

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("Market Basket Analysis");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Items
        String item1 = "egg";
        String item2 = "beer";
        String item3 = "pampers";
        String item4 = "carrot";
        String item5 = "milk";
        String item6 = "perfume";
        String item7 = "tissues";

        // Transactions
        List<String> transaction1 = new ArrayList<>();
        transaction1.add(item1);
        transaction1.add(item2);
        transaction1.add(item3);
        transaction1.add(item5);

        List<String> transaction2 = new ArrayList<>();
        transaction2.add(item4);
        transaction2.add(item5);
        transaction2.add(item3);
        transaction2.add(item2);

        List<String> transaction3 = new ArrayList<>();
        transaction3.add(item6);
        transaction3.add(item7);
        transaction3.add(item4);

        List<String> transaction4 = new ArrayList<>();
        transaction4.add(item6);
        transaction4.add(item3);
        transaction4.add(item2);

        List<List<String>> transactions = new ArrayList<>();
        transactions.add(transaction1);
        transactions.add(transaction2);
        transactions.add(transaction3);
        transactions.add(transaction4);

        // Make the transaction collection a distributed dataset with Spark
        JavaRDD<List<String>> transactionsRDD = sc.parallelize(transactions);

        // Set configurations for FP-Growth
        FPGrowth fpg = new FPGrowth()
          .setMinSupport(0.5)
          .setNumPartitions(10);

        // Generate the model
        FPGrowthModel<String> model = fpg.run(transactionsRDD);

        // Display frequently co-occurring item sets and their frequencies
        for (FPGrowth.FreqItemset<String> itemset : model.freqItemsets().toJavaRDD().collect()) {
            System.out.println("[" + Joiner.on(",").join(itemset.javaItems()) + "], " + itemset.freq());
        }
        sc.close();
    }

3 comments:

  1. "min support" is very well explained.
    could you please elaborate on "num partitions" parameter?
    thank you
