How to run the Mahout k-means clustering algorithm
Work -
====
1. Start Hadoop.
2. Create a folder "work" in mahout-distribution-0.5/examples/bin/
3. Copy reuters21578.tar.gz to "mahout-distribution-0.5/examples/bin/work"
4. Create a folder "reuters-sgm".
5. Extract reuters21578.tar.gz into the "reuters-sgm" folder.
6. Create a folder "reuters-out".
7. Convert the .sgm files to .txt files using org.apache.lucene.benchmark.utils.ExtractReuters.
8. Turn the raw text in a directory into Mahout sequence files:
./bin/mahout seqdirectory -c UTF-8 -i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-out -o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-out-seqdir
(or)
./bin/mahout seqdirectory -i ./examples/bin/work/reuters-out/ -o ./examples/bin/work/reuters-out-seqdir -c UTF-8 -chunk 5
Note: Hadoop should not be running while executing the above command.
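Steps 2-6 above are plain directory setup. As an illustration only, the same layout can be sketched in Python (the function name and paths here are ours, not part of Mahout):

```python
import tarfile
from pathlib import Path

def prepare_work_dir(base: Path, archive: Path) -> Path:
    """Create the work/, reuters-sgm/ and reuters-out/ layout from steps 2-6
    and unpack the Reuters archive into reuters-sgm/ (step 5)."""
    work = base / "work"
    sgm = work / "reuters-sgm"
    out = work / "reuters-out"
    for d in (work, sgm, out):
        d.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tf:
        tf.extractall(sgm)  # the .sgm files land in reuters-sgm/
    return work
```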
9. Examine the sequence files with seqdumper:
./bin/mahout seqdumper -s /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work/reuters-out-seqdir/chunk-0 | more
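Conceptually, seqdirectory turns each file into one (key, value) record: the key is the file's path relative to the input root, the value is its contents. A sketch of that mapping (the real output is a binary Hadoop SequenceFile, which this does not reproduce):

```python
from pathlib import Path

def seqdirectory_records(root: Path):
    """Yield (key, value) pairs the way seqdirectory does conceptually:
    one record per file, keyed by its path relative to the input root."""
    for p in sorted(root.rglob("*")):
        if p.is_file():
            yield "/" + p.relative_to(root).as_posix(), p.read_text(encoding="utf-8")
```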
10. Create tfidf weighted vectors
mahout seq2sparse \
  -i reuters-seqfiles/ \
  -o reuters-vectors/ \
  -ow -chunk 100 \
  -x 90 \
  -seq \
  -ml 50 \
  -n 2 \
  -nv
This uses the default analyzer and default TF-IDF weighting. -n 2 (L2 normalization) is good for the cosine distance we use in clustering and for similarity; -x 90 means that a token appearing in more than 90% of the docs is treated as a stop word; -nv produces named vectors, making the downstream data files easier to inspect.
(or)
./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o ./examples/bin/work/reuters-out-seqdir-sparse
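The weighting seq2sparse applies can be pictured with a simplified sketch. This is illustrative only: it mirrors the -x document-frequency cutoff and -n p-norm scaling described above, but not Mahout's exact TF-IDF formula or analyzer:

```python
import math
from collections import Counter

def tfidf_vectors(docs, max_df_pct=90, norm_p=2):
    """Simplified TF-IDF over tokenized docs: drop tokens appearing in more
    than max_df_pct percent of docs (-x), weight by tf * idf, then scale
    each vector to unit p-norm (-n)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency per token
    vocab = {t for t, c in df.items() if 100.0 * c / n <= max_df_pct}
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in vocab)
        vec = {t: f * (1.0 + math.log(n / df[t])) for t, f in tf.items()}
        norm = sum(abs(w) ** norm_p for w in vec.values()) ** (1.0 / norm_p)
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors
```

With -n 2 every non-empty vector has unit L2 length, which is what makes cosine distance behave well downstream.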
11. Examine the vectors if you like, but they are not really human-readable:
mahout seqdumper -s reuters-seqfiles/part-r-00000
(or)
df-count / part-r-00000
----------------------
./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/df-count/part-r-00000
tfidf-vectors / part-r-00000
----------------------------
./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/part-r-00000
tf-vectors / part-r-00000
-------------------------
./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/tf-vectors/part-r-00000
tokenized-documents / part-m-00000
----------------------------------
./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/tokenized-documents/part-m-00000
wordcount / part-r-00000
------------------------
./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/wordcount/part-r-00000
dictionary.file-0
----------------
./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0
frequency.file-0
----------------
./bin/mahout seqdumper -s examples/bin/work/reuters-out-seqdir-sparse/frequency.file-0
12. Examine the tokenized docs to make sure the analyzer is filtering out enough (note that the rest of this example used a more restrictive Lucene analyzer, not the default, so your results may vary):
mahout seqdumper -s reuters-vectors/tokenized-documents/part-m-00000
This should show each doc with nice clean tokenized text.
13. Examine the dictionary. It maps token IDs to token text.
mahout seqdumper -s reuters-vectors/dictionary.file-0 | more
Cluster documents using kmeans
==============================
1. Create clusters and assign documents to the clusters
mahout kmeans \
-i reuters-vectors/tfidf-vectors/ \
-c reuters-kmeans-centroids \
-cl \
-o reuters-kmeans-clusters \
-k 20 \
-ow \
-x 10 \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure
If -c and -k are both specified, kmeans will put random seed vectors into the -c directory; if -c is provided without -k, the -c directory is assumed to be input and kmeans will use each vector in it to seed the clustering. -cl tells kmeans to also assign the input doc vectors to clusters at the end of the process and put them in reuters-kmeans-clusters/clusteredPoints; if -cl is not specified, the documents are not assigned to clusters.
(or)
./bin/mahout kmeans \
  -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ \
  -c ./examples/bin/work/clusters \
  -o ./examples/bin/work/reuters-kmeans \
  -x 10 \
  -k 20 \
  -ow
./bin/mahout kmeans -i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c ./examples/bin/work/clusters -o ./examples/bin/work/reuters-kmeans -x 10 -k 2 -ow
Note:
# The following parameters must be specified
#i|input = /path/to/input
#c|clusters = /path/to/initial/clusters
#o|output = /path/to/output
#x|max = <the maximum number of iterations to attempt>
# The following parameters all have default values if not specified
#ow|overwrite = <clear output directory if present>
#cl|clustering = <cluster points if present>
#dm|distance = <distance measure class name. Default: SquaredEuclideanDistanceMeasure>
#cd|convergenceDelta = <the convergence threshold. Default: 0.5>
#r|numReduce = <the number of reduce tasks to launch. Default: 1>
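The behaviour described above (-k random seeds, at most -x iterations, a pluggable distance measure, -cl assignments) can be sketched in miniature. This is an illustrative in-memory k-means over sparse dict vectors, not Mahout's MapReduce implementation:

```python
import random

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse dict vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sum(w * w for w in a.values()) ** 0.5
    nb = sum(w * w for w in b.values()) ** 0.5
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)

def kmeans(vectors, k, max_iter=10, seed=0):
    """Plain k-means: k random seed centroids (-k), at most max_iter
    iterations (-x), cosine distance (-dm). Returns (centroids,
    assignments); the assignments are what -cl writes to clusteredPoints."""
    rng = random.Random(seed)
    centroids = [dict(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(max_iter):
        # assign every vector to its nearest centroid
        assignments = [min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
                       for v in vectors]
        # recompute each centroid as the mean of its members
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                terms = {t for m in members for t in m}
                centroids[c] = {t: sum(m.get(t, 0.0) for m in members) / len(members)
                                for t in terms}
    return centroids, assignments
```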
Examine kmeans cluster output
=============================
clusters/part-randomSeed
------------------------
./bin/mahout seqdumper -s examples/bin/work/clusters/part-randomSeed
reuters-kmeans/clusters-1/part-r-00000
--------------------------------------
./bin/mahout seqdumper -s examples/bin/work/reuters-kmeans/clusters-1/part-r-00000
2. Examine the clusters, and perhaps even do some analysis of how good they are:
mahout clusterdump \
-d reuters-vectors/dictionary.file-0 \
-dt sequencefile \
-s reuters-kmeans-clusters/clusters-3-final/part-r-00000 \
-n 20 \
-b 100 \
-p reuters-kmeans-clusters/clusteredPoints/
Note: clusterdump can also do some analysis of cluster quality, but that is not shown here.
(or)
./bin/mahout clusterdump \
  -s examples/bin/work/reuters-kmeans/clusters-10 \
  -d examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 \
  -dt sequencefile \
  -b 100 \
  -n 20
./bin/mahout clusterdump -s examples/bin/work/reuters-kmeans/clusters-10 -d examples/bin/work/reuters-out-seqdir-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 20
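The "top terms" that clusterdump prints per cluster are essentially the -n highest-weight entries of the centroid, resolved to text through the dictionary (-d). A sketch of that lookup, with made-up sample data:

```python
def top_terms(centroid, dictionary, n=20):
    """Return the n highest-weight tokens of a centroid (cf. clusterdump -n),
    mapping term IDs to token text via the dictionary (id -> token)."""
    ids = sorted(centroid, key=centroid.get, reverse=True)[:n]
    return [dictionary[i] for i in ids]
```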
Calculate several similar docs for each doc in the data
------------------------------------------------------
1. First create a matrix from the vectors:
mahout rowid \
  -i reuters-vectors/tfidf-vectors/part-r-00000 \
  -o reuters-matrix
(or)
./bin/mahout rowid -Dmapred.input.dir=/home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-out-seqdir-sparse/tfidf-vectors/part-r-00000 -Dmapred.output.dir=/home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-matrix
Note: the second form works; it reports 21578 rows and 31805 columns.
2. Create a collection of similar docs for each row of the matrix above:
mahout rowsimilarity \
  -i reuters-named-matrix/matrix \
  -o reuters-named-similarity \
  -r 19515 \
  --similarityClassname SIMILARITY_COSINE \
  -m 10 \
  -ess
(or)
./bin/mahout rowsimilarity \
  -i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-matrix/matrix \
  -o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-named-similarity \
  -r 21578 \
  --similarityClassname SIMILARITY_COSINE \
  -m 10 \
  -ess
(or)
./bin/mahout rowsimilarity -i /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-matrix/matrix -o /home/venkat/Desktop/mahout/mahout-distribution-0.5/examples/bin/work_full_reuters/reuters-named-similarity -r 21578 --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE -m 10 --tempDir /home
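rowsimilarity's output can be pictured as, for each row of the matrix, the indices of its -m most similar other rows under the chosen similarity. A cosine-based sketch (a brute-force illustration; Mahout computes this as a distributed job):

```python
def cosine_similarity(a, b):
    """Cosine similarity between two sparse dict vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sum(w * w for w in a.values()) ** 0.5
    nb = sum(w * w for w in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def row_similarity(rows, m=10):
    """For every row, the indices of its m most cosine-similar other rows
    (cf. rowsimilarity --similarityClassname SIMILARITY_COSINE -m)."""
    result = []
    for i, r in enumerate(rows):
        sims = sorted(((cosine_similarity(r, s), j) for j, s in enumerate(rows) if j != i),
                      reverse=True)
        result.append([j for _, j in sims[:m]])
    return result
```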
----------------------------------------------------------------------------------------------
1. To list the available program (short job) names, run ./bin/mahout with no arguments; a job is then invoked as ./bin/mahout shortJobName.
2. The short job names are mapped to their driver classes in driver.classes.props in the conf folder.