Tuesday, July 10, 2012

How to run Mahout k-means Clustering algorithm 


Work
====
1. Start Hadoop.
2. Create a folder "work" in mahout-distribution-0.5/examples/bin/
3. Copy reuters21578.tar.gz to "mahout-distribution-0.5/examples/bin/work"
4. Create a folder "reuters-sgm".
5. Extract reuters21578.tar.gz into the "reuters-sgm" folder.
6. Create a folder "reuters-out".
7. Convert the .sgm files to text files using org.apache.lucene.benchmark.utils.ExtractReuters.

8. Turn the raw text in a directory into Mahout sequence files:
  ./bin/mahout seqdirectory -c UTF-8  -i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-out -o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-out-seqdir

(or)

  ./bin/mahout seqdirectory -i ./examples/bin/work/reuters-out/ -o ./examples/bin/work/reuters-out-seqdir -c UTF-8 -chunk 5

  Note: Hadoop should not be running when you run the above command.

9. Examine the sequence files with seqdumper:

./bin/mahout seqdumper -s /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-out-seqdir/chunk-0 | more


10. Create TF-IDF weighted vectors:

   mahout seq2sparse \
   -i reuters-seqfiles/ \
   -o reuters-vectors/ \
   -ow -chunk 100 \
   -x 90 \
   -seq \
   -ml 50 \
   -n 2 \
   -nv

   This uses the default analyzer and default TF-IDF weighting. -n 2 applies L2 normalization,
   which is a good fit for the cosine distance we use in clustering and similarity; -x 90 means
   that a token appearing in more than 90% of the docs is treated as a stop word; -nv produces
   named vectors, which makes the downstream data files easier to inspect.

(or)

   ./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse
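To make the weighting and normalization concrete, here is a minimal pure-Python sketch of what seq2sparse computes conceptually: TF-IDF weights per document, L2 normalization (the -n 2 option), and cosine similarity on the result. This is not Mahout's implementation; the toy corpus and function names are purely illustrative.

```python
import math

# Toy corpus standing in for the Reuters documents (illustrative only)
docs = [
    ["oil", "price", "rises"],
    ["oil", "output", "falls"],
    ["price", "of", "gold", "rises"],
]

def tfidf_vector(doc, corpus):
    """TF-IDF weights for one document, with idf = log(N / df)."""
    n = len(corpus)
    vec = {}
    for term in set(doc):
        tf = doc.count(term)
        df = sum(1 for d in corpus if term in d)
        vec[term] = tf * math.log(n / df)
    return vec

def l2_normalize(vec):
    """What -n 2 does: scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec

def cosine(a, b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

vecs = [l2_normalize(tfidf_vector(d, docs)) for d in docs]
print(cosine(vecs[0], vecs[1]))
```

With L2-normalized vectors the dot product *is* the cosine, which is why -n 2 pairs naturally with CosineDistanceMeasure later. A term that appeared in every document would get idf = log(1) = 0; the -x 90 cutoff drops such near-universal terms outright.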


11. Examine the vectors if you like, but they are not really human-readable...

mahout seqdumper -s reuters-seqfiles/part-r-00000
          (or)


df-count / part-r-00000
-----------------------

./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/df-count/part-r-00000

tfidf-vectors / part-r-00000
----------------------------

./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/part-r-00000

tf-vectors / part-r-00000
-------------------------

./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/tf-vectors/part-r-00000

tokenized-documents / part-m-00000
----------------------------------

./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents/part-m-00000

wordcount / part-r-00000
------------------------

./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/wordcount/part-r-00000

dictionary.file-0
-----------------

./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0

frequency.file-0
----------------
./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/frequency.file-0


12. Examine the tokenized docs to make sure the analyzer is filtering out enough (note that the rest of this example used a more restrictive Lucene analyzer, not the default, so your results may vary):

mahout seqdumper  -s reuters-vectors/tokenized-documents/part-m-00000


    This should show each doc with nice clean tokenized text.
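What "nice clean tokenized text" means depends on the analyzer. As a rough illustration only (this is not Lucene's actual analyzer), a restrictive pass might lowercase, keep alphabetic tokens of length three or more, and drop stop words; the stop-word list here is made up for the example:

```python
import re

# Illustrative stand-in for a restrictive Lucene analyzer (not the real one)
STOPWORDS = {"the", "and", "for", "with", "said"}

def analyze(text):
    # Lowercase, keep runs of letters, then filter short tokens and stop words
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]

print(analyze("The U.S. oil prices rose 3% and OPEC said output fell."))
# -> ['oil', 'prices', 'rose', 'opec', 'output', 'fell']
```

If seqdumper shows lots of numbers, one-letter fragments, or function words in tokenized-documents, the analyzer is not filtering enough and the vectors (and clusters) will be noisier.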


13. Examine the dictionary. It maps token id to token text.

mahout seqdumper -s reuters-vectors/dictionary.file-0 | more



Cluster documents using kmeans
==============================

1.  Create clusters and assign documents to the clusters

mahout kmeans \
  -i reuters-vectors/tfidf-vectors/ \
  -c reuters-kmeans-centroids \
  -cl \
  -o reuters-kmeans-clusters \
  -k 20 \
  -ow \
  -x 10 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure

If both -c and -k are specified, kmeans puts random seed vectors into the -c directory.
If -c is given without -k, the -c directory is treated as input and kmeans uses each
vector in it to seed the clustering. -cl tells kmeans to also assign the input doc
vectors to clusters at the end of the run and write them to reuters-kmeans-clusters/clusteredPoints;
if -cl is not specified, the documents are not assigned to clusters.

(or)

./bin/mahout kmeans \
-i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ \
-c ./examples/bin/work/clusters \
-o ./examples/bin/work/reuters-kmeans \
-x 10 \
-k 20 \
-ow

./bin/mahout kmeans -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c ./examples/bin/work/clusters -o ./examples/bin/work/reuters-kmeans -x 10 -k 2 -ow

Note:
# The following parameters must be specified
#i|input = /path/to/input
#c|clusters = /path/to/initial/clusters
#o|output = /path/to/output
#x|max = <the maximum number of iterations to attempt>

# The following parameters all have default values if not specified
#ow|overwrite = <clear output directory if present>
#cl|clustering = <cluster points if present>
#dm|distance = <distance measure class name. Default: SquaredEuclideanDistanceMeasure>
#cd|convergenceDelta = <the convergence threshold. Default: 0.5>
#r|numReduce = <the number of reduce tasks to launch. Default: 1>
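The loop the parameters above control can be sketched in a few lines of pure Python: seed k random centroids (-k), assign each vector to its nearest centroid under the distance measure (-dm), recompute centroids, and stop after -x iterations or once centroids move less than the convergence threshold (-cd, default 0.5 in Mahout). This is a conceptual sketch, not Mahout's distributed implementation; the sample vectors are made up.

```python
import math, random

def cosine_distance(a, b):
    """CosineDistanceMeasure: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def kmeans(vectors, k, max_iter=10, delta=0.001, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)            # -k: random seed vectors
    for _ in range(max_iter):                     # -x: max iterations
        clusters = [[] for _ in range(k)]
        for v in vectors:                         # assign to nearest centroid
            i = min(range(k), key=lambda j: cosine_distance(v, centroids[j]))
            clusters[i].append(v)
        new_centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        moved = max(cosine_distance(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if moved < delta:                         # -cd: convergence threshold
            break
    return centroids, clusters

vectors = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
centroids, clusters = kmeans(vectors, k=2)
```

Note the role of -cd: a looser threshold stops earlier with rougher centroids, which is why Mahout's default of 0.5 is often tightened for cosine distance, whose values only range over [0, 2].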




Examine kmeans cluster output
=============================

clusters/part-randomSeed
------------------------
./bin/mahout seqdumper -s examples/bin/work/clusters/part-randomSeed

reuters-kmeans/clusters-1/part-r-00000
--------------------------------------
./bin/mahout seqdumper -s examples/bin/work/reuters-kmeans/clusters-1/part-r-00000


2. Examine the clusters, and perhaps even do some analysis of how good the clusters are:

  mahout clusterdump \
  -d reuters-vectors/dictionary.file-0 \
  -dt sequencefile \
  -s reuters-kmeans-clusters/clusters-3-final/part-r-00000 \
  -n 20 \
  -b 100 \
  -p reuters-kmeans-clusters/clusteredPoints/

Note: clusterdump can do some analysis of the quality of clusters, but that is not shown here.

(or)

./bin/mahout clusterdump \
-s examples/bin/work/reuters-kmeans/clusters-10 \
-d examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
-dt sequencefile \
-b 100 \
-n 20

./bin/mahout clusterdump -s examples/bin/work/reuters-kmeans/clusters-10 -d examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 20




Calculate several similar docs to each doc in the data
------------------------------------------------------

1. First create a matrix from the vectors:

mahout rowid \
  -i reuters-vectors/tfidf-vectors/part-r-00000 \
  -o reuters-matrix

(or)

./bin/mahout rowid -Dmapred.input.dir=/home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-out-seqdir-sparse/tfidf-vectors/part-r-00000 -Dmapred.output.dir=/home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-matrix
 
Note: the second form is the one that worked in this setup.

21578 rows and 31805 columns

2. Create a collection of similar docs for each row of the matrix above:

   mahout rowsimilarity \
   -i reuters-named-matrix/matrix \
   -o reuters-named-similarity \
   -r 19515 \
   --similarityClassname SIMILARITY_COSINE \
   -m 10 \
   -ess

(or)

./bin/mahout rowsimilarity \
-i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-matrix/matrix \
-o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-named-similarity \
-r 21578 \
--similarityClassname SIMILARITY_COSINE \
-m 10 \
-ess

(or)

./bin/mahout rowsimilarity  -i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-matrix/matrix  -o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-named-similarity -r 21578 --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE -m 10 --tempDir /home
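Conceptually, rowsimilarity computes, for each row of the matrix, the -m most similar other rows under the chosen similarity (here cosine). A minimal brute-force sketch in pure Python, on a tiny made-up matrix, not Mahout's MapReduce implementation:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def row_similarity(matrix, m):
    """For each row, the indices of the m most cosine-similar other rows."""
    result = []
    for i, row in enumerate(matrix):
        sims = [(cosine(row, other), j)
                for j, other in enumerate(matrix) if j != i]
        sims.sort(reverse=True)
        result.append([j for _, j in sims[:m]])
    return result

matrix = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0, 1]]
result = row_similarity(matrix, m=2)
print(result)
```

The real job avoids the all-pairs comparison (that is what -ess, excludeSelfSimilarity, and the threshold machinery are for at Reuters scale), but the output has the same shape: the top-m neighbors per row.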



----------------------------------------------------------------------------------------------

1. List the available short job names by running the script with no arguments, then run one by name:

./bin/mahout
./bin/mahout shortJobName


2. The mapping of short job names to driver classes is defined in driver.classes.props in the conf folder.



