Tuesday, July 10, 2012

How to run Mahout k-means Clustering algorithm 


Work
====
1. Start Hadoop.
2. Create a folder "work" in mahout-distribution-0.5/examples/bin/
3. Copy reuters21578.tar.gz to "mahout-distribution-0.5/examples/bin/work"
4. Create a folder "reuters-sgm".
5. Extract reuters21578.tar.gz into the "reuters-sgm" folder.
6. Create a folder "reuters-out".
7. Convert the .sgm files to text files using org.apache.lucene.benchmark.utils.ExtractReuters.

8. Turn the raw text in a directory into Mahout sequence files:
  ./bin/mahout seqdirectory -c UTF-8  -i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-out -o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-out-seqdir

(or)

  ./bin/mahout seqdirectory -i ./examples/bin/work/reuters-out/ -o ./examples/bin/work/reuters-out-seqdir -c UTF-8 -chunk 5

  Note: Hadoop should not be running when you run the above command.

9. Examine the sequence files with seqdumper:

./bin/mahout seqdumper -s /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-out-seqdir/chunk-0 | more


10. Create TF-IDF weighted vectors:

   mahout seq2sparse \
   -i reuters-seqfiles/ \
   -o reuters-vectors/ \
   -ow -chunk 100 \
   -x 90 \
   -seq \
   -ml 50 \
   -n 2 \
   -nv

   This uses the default analyzer and default TF-IDF weighting. -n 2 applies L2 normalization,
   which is a good fit for the cosine distance we use in clustering and similarity; -x 90 means
   that a token appearing in more than 90% of the docs is treated as a stop word; -nv produces
   named vectors, which makes the downstream data files easier to inspect.

(or)

   ./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse
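To make the weighting and normalization concrete, here is a minimal pure-Python sketch of what seq2sparse computes conceptually: TF-IDF weights per document, L2 normalization (the -n 2 option), and cosine similarity on the result. This is not Mahout's implementation; the toy corpus and function names are purely illustrative.

```python
import math

# Toy corpus standing in for the Reuters documents (illustrative only)
docs = [
    ["oil", "price", "rises"],
    ["oil", "output", "falls"],
    ["price", "of", "gold", "rises"],
]

def tfidf_vector(doc, corpus):
    """TF-IDF weights for one document, with idf = log(N / df)."""
    n = len(corpus)
    vec = {}
    for term in set(doc):
        tf = doc.count(term)
        df = sum(1 for d in corpus if term in d)
        vec[term] = tf * math.log(n / df)
    return vec

def l2_normalize(vec):
    """What -n 2 does: scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

def cosine(a, b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

vecs = [l2_normalize(tfidf_vector(d, docs)) for d in docs]
print(cosine(vecs[0], vecs[1]))
```

With L2-normalized vectors the dot product *is* the cosine, which is why -n 2 pairs naturally with CosineDistanceMeasure later. A term that appeared in every document would get idf = log(1) = 0; the -x 90 cutoff drops such near-universal terms outright.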


11. Examine the vectors if you like, but they are not really human-readable...

mahout seqdumper -s reuters-seqfiles/part-r-00000
          (or)


df-count / part-r-00000
-----------------------

./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/df-count/part-r-00000

tfidf-vectors / part-r-00000
----------------------------

./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/part-r-00000

tf-vectors / part-r-00000
-------------------------

./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/tf-vectors/part-r-00000

tokenized-documents / part-m-00000
----------------------------------

./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents/part-m-00000

wordcount / part-r-00000
------------------------

./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/wordcount/part-r-00000

dictionary.file-0
-----------------

./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0

frequency.file-0
----------------
./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/frequency.file-0


12. Examine the tokenized docs to make sure the analyzer is filtering out enough (note that the rest of this example used a more restrictive Lucene analyzer, not the default, so your results may vary):

mahout seqdumper  -s reuters-vectors/tokenized-documents/part-m-00000


    This should show each doc with nice clean tokenized text.
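What "nice clean tokenized text" means depends on the analyzer. As a rough illustration only (this is not Lucene's actual analyzer), a restrictive pass might lowercase, keep alphabetic tokens of length three or more, and drop stop words; the stop-word list here is made up for the example:

```python
import re

# Illustrative stand-in for a restrictive Lucene analyzer (not the real one)
STOPWORDS = {"the", "and", "for", "with", "said"}

def analyze(text):
    # Lowercase, keep runs of letters, then filter short tokens and stop words
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]

print(analyze("The U.S. oil prices rose 3% and OPEC said output fell."))
# -> ['oil', 'prices', 'rose', 'opec', 'output', 'fell']
```

If seqdumper shows lots of numbers, one-letter fragments, or function words in tokenized-documents, the analyzer is not filtering enough and the vectors (and clusters) will be noisier.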


13. Examine the dictionary. It maps token id to token text.

mahout seqdumper -s reuters-vectors/dictionary.file-0 | more



Cluster documents using kmeans
==============================

1.  Create clusters and assign documents to the clusters

mahout kmeans \
  -i reuters-vectors/tfidf-vectors/ \
  -c reuters-kmeans-centroids \
  -cl \
  -o reuters-kmeans-clusters \
  -k 20 \
  -ow \
  -x 10 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure

If both -c and -k are specified, kmeans puts random seed vectors into the -c directory.
If -c is given without -k, the -c directory is treated as input and kmeans uses each
vector in it to seed the clustering. -cl tells kmeans to also assign the input doc
vectors to clusters at the end of the run and write them to reuters-kmeans-clusters/clusteredPoints;
if -cl is not specified, the documents are not assigned to clusters.

(or)

./bin/mahout kmeans \
-i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ \
-c ./examples/bin/work/clusters \
-o ./examples/bin/work/reuters-kmeans \
-x 10 \
-k 20 \
-ow

./bin/mahout kmeans -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c ./examples/bin/work/clusters -o ./examples/bin/work/reuters-kmeans -x 10 -k 2 -ow

Note:
# The following parameters must be specified
#i|input = /path/to/input
#c|clusters = /path/to/initial/clusters
#o|output = /path/to/output
#x|max = <the maximum number of iterations to attempt>

# The following parameters all have default values if not specified
#ow|overwrite = <clear output directory if present>
#cl|clustering = <cluster points if present>
#dm|distance = <distance measure class name. Default: SquaredEuclideanDistanceMeasure>
#cd|convergenceDelta = <the convergence threshold. Default: 0.5>
#r|numReduce = <the number of reduce tasks to launch. Default: 1>
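The loop the parameters above control can be sketched in a few lines of pure Python: seed k random centroids (-k), assign each vector to its nearest centroid under the distance measure (-dm), recompute centroids, and stop after -x iterations or once centroids move less than the convergence threshold (-cd, default 0.5 in Mahout). This is a conceptual sketch, not Mahout's distributed implementation; the sample vectors are made up.

```python
import math, random

def cosine_distance(a, b):
    """CosineDistanceMeasure: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def kmeans(vectors, k, max_iter=10, delta=0.001, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)            # -k: random seed vectors
    for _ in range(max_iter):                     # -x: max iterations
        clusters = [[] for _ in range(k)]
        for v in vectors:                         # assign to nearest centroid
            i = min(range(k), key=lambda j: cosine_distance(v, centroids[j]))
            clusters[i].append(v)
        new_centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        moved = max(cosine_distance(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved < delta:                         # -cd: convergence threshold
            break
    return centroids, clusters

vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
centroids, clusters = kmeans(vectors, k=2)
```

Note the role of -cd: a looser threshold stops earlier with rougher centroids, which is why Mahout's default of 0.5 is often tightened for cosine distance, whose values only range over [0, 2].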




Examine kmeans cluster output
=============================

clusters/part-randomSeed
------------------------
./bin/mahout seqdumper -s examples/bin/work/clusters/part-randomSeed

reuters-kmeans/clusters-1/part-r-00000
--------------------------------------
./bin/mahout seqdumper -s examples/bin/work/reuters-kmeans/clusters-1/part-r-00000


2. Examine the clusters, and perhaps even do some analysis of how good the clusters are:

  mahout clusterdump \
  -d reuters-vectors/dictionary.file-0 \
  -dt sequencefile \
  -s reuters-kmeans-clusters/clusters-3-final/part-r-00000 \
  -n 20 \
  -b 100 \
  -p reuters-kmeans-clusters/clusteredPoints/

Note: clusterdump can do some analysis of the quality of clusters, but that is not shown here.

(or)

./bin/mahout clusterdump \
-s examples/bin/work/reuters-kmeans/clusters-10 \
-d examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
-dt sequencefile \
-b 100 \
-n 20

./bin/mahout clusterdump -s examples/bin/work/reuters-kmeans/clusters-10 -d examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 20




Calculate several similar docs to each doc in the data
------------------------------------------------------

1. First create a matrix from the vectors:

mahout rowid \
  -i reuters-vectors/tfidf-vectors/part-r-00000 \
  -o reuters-matrix

(or)

./bin/mahout rowid -Dmapred.input.dir=/home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-out-seqdir-sparse/tfidf-vectors/part-r-00000 -Dmapred.output.dir=/home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-matrix
 
Note: the second form is the one that worked in this setup.

21578 rows and 31805 columns

2. Create a collection of similar docs for each row of the matrix above:

   mahout rowsimilarity \
   -i reuters-named-matrix/matrix \
   -o reuters-named-similarity \
   -r 19515 \
   --similarityClassname SIMILARITY_COSINE \
   -m 10 \
   -ess

(or)

./bin/mahout rowsimilarity \
-i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-matrix/matrix \
-o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-named-similarity \
-r 21578 \
--similarityClassname SIMILARITY_COSINE \
-m 10 \
-ess

(or)

./bin/mahout rowsimilarity  -i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-matrix/matrix  -o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-named-similarity -r 21578 --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE -m 10 --tempDir /home
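Conceptually, rowsimilarity computes, for each row of the matrix, the -m most similar other rows under the chosen similarity (here cosine). A minimal brute-force sketch in pure Python, on a tiny made-up matrix, not Mahout's MapReduce implementation:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def row_similarity(matrix, m):
    """For each row, the indices of the m most cosine-similar other rows."""
    result = []
    for i, row in enumerate(matrix):
        sims = [(cosine(row, other), j)
                for j, other in enumerate(matrix) if j != i]
        sims.sort(reverse=True)
        result.append([j for _, j in sims[:m]])
    return result

matrix = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0, 1]]
result = row_similarity(matrix, m=2)
print(result)
```

The real job avoids the all-pairs comparison (that is what -ess, excludeSelfSimilarity, and the threshold machinery are for at Reuters scale), but the output has the same shape: the top-m neighbors per row.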



----------------------------------------------------------------------------------------------

1. List the available short job names by running the script with no arguments, then run one by name:

./bin/mahout
./bin/mahout shortJobName


2. The mapping of short job names to driver classes is defined in driver.classes.props in the conf folder.



