How to use mahout kmeans

2025-04-11 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how to use Mahout kmeans. It walks through running the k-means example that ships with Mahout on Hadoop, and then gives a short overview of the algorithms Mahout provides.

Mahout is an open source project of the Apache Software Foundation (ASF).

It provides scalable implementations of classic machine learning algorithms, and aims to help developers build intelligent applications more easily and quickly.

Mahout includes many implementations, covering clustering, classification, recommendation (collaborative filtering), and frequent itemset mining.

In addition, by using the Apache Hadoop library, Mahout can scale effectively to the cloud.

Run the kmeans algorithm that comes with Mahout and verify that Mahout is running properly.

Prepare the test data: download the data file and put it in the $MAHOUT_HOME directory (synthetic_con; the commands that follow refer to it as synthetic_control.data).
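Before uploading, the data file can be inspected locally as a quick sanity check. The sketch below assumes each line of the file holds one time series as whitespace-separated numbers (the synthetic control chart dataset is usually described as 600 rows of 60 values). The function name `load_series`, the sample values, and the file path in the comment are illustrative assumptions, not part of Mahout.

```python
from typing import Iterable, List


def load_series(lines: Iterable[str]) -> List[List[float]]:
    """Parse whitespace-separated numeric rows, skipping blank lines."""
    rows = []
    for line in lines:
        fields = line.split()
        if fields:
            rows.append([float(value) for value in fields])
    return rows


# Tiny inline sample standing in for the real data file (made-up values).
sample = ["28.78 34.46 31.33", "", "24.89 25.74 27.55"]
series = load_series(sample)
print(len(series), "rows,", len(series[0]), "values per row")

# For the real file (the path is an assumption):
# with open("synthetic_control.data") as handle:
#     series = load_series(handle)
```

If the row count or row length differs from what is expected, the download is probably incomplete and is worth fixing before running the job.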

[hdfs@cloudra ~]$ hadoop fs -mkdir testdata

[hdfs@cloudra root]$ hadoop fs -mkdir /output

[hdfs@cloudra ~]$ hadoop fs -put synthetic_control.data testdata

DEPRECATED: Use of this script to execute hdfs command is deprecated.

Instead use the hdfs command for it.

You have new mail in /var/spool/mail/root

/usr/java/default

export JAVA_HOME=/usr/java/jdk1.7.0_79

[hdfs@cloudra ~]$ mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

Or: hadoop jar mahout-distribution-0.7/mahout-examples-0.7-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

[root@localhost mahout-distribution-0.9]# hadoop fs -mkdir /user/root/testdata

16-11-23 05:28:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... Using builtin-java classes where applicable

mkdir: `/user/root/testdata': No such file or directory

[root@localhost mahout-distribution-0.9]# hadoop fs -mkdir -p /user/root/testdata

16-11-23 05:28:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... Using builtin-java classes where applicable

[root@localhost mahout-distribution-0.9]# ls

bin lib mahout-examples-0.9.jar NOTICE.txt

conf LICENSE.txt mahout-examples-0.9-job.jar README.txt

docs mahout-core-0.9.jar mahout-integration-0.9.jar

examples mahout-core-0.9-job.jar mahout-math-0.9.jar

[root@localhost mahout-distribution-0.9]# cd ..

[root@localhost soft]# cd ..

[root@localhost ~]# cd -

/root/soft

[root@localhost soft]# ls

data hadoop-2.6.0 jdk1.7.0_79 mahout-distribution-0.9

[root@localhost soft]# cd data

[root@localhost data]# ls

synthetic_control.data

[root@localhost data]# hadoop fs -put /user/root/testdata

16-11-23 05:29:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... Using builtin-java classes where applicable

put: `/user/root/testdata': No such file or directory

[root@localhost data]# hadoop fs -put synthetic_control.data /user/root/testdata

16-11-23 05:29:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... Using builtin-java classes where applicable

[root@localhost data]# ls

synthetic_control.data

[root@localhost data]# cd ..

[root@localhost soft]# ls

data hadoop-2.6.0 jdk1.7.0_79 mahout-distribution-0.9

[root@localhost soft]# cd mahout-distribution-0.9/

[root@localhost mahout-distribution-0.9]# ls

bin lib mahout-examples-0.9.jar NOTICE.txt

conf LICENSE.txt mahout-examples-0.9-job.jar README.txt

docs mahout-core-0.9.jar mahout-integration-0.9.jar

examples mahout-core-0.9-job.jar mahout-math-0.9.jar

[root@localhost mahout-distribution-0.9]# hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

16-11-23 05:30:30 INFO kmeans.Job: Running with default arguments

16-11-23 05:30:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... Using builtin-java classes where applicable

16-11-23 05:30:40 INFO kmeans.Job: Preparing Input

16-11-23 05:30:41 INFO client.RMProxy: Connecting to ResourceManager at hadoop02/127.0.0.1:8032

16-11-23 05:30:42 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

16-11-23 05:30:46 INFO input.FileInputFormat: Total input paths to process: 1

16-11-23 05:30:46 INFO mapreduce.JobSubmitter: number of splits:1

16-11-23 05:30:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1479907436985_0002

16-11-23 05:30:49 INFO impl.YarnClientImpl: Submitted application application_1479907436985_0002

16-11-23 05:30:49 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1479907436985_0002/

16-11-23 05:30:49 INFO mapreduce.Job: Running job: job_1479907436985_0002

16-11-23 05:31:40 INFO mapreduce.Job: Job job_1479907436985_0002 running in uber mode: false

16-11-23 05:31:40 INFO mapreduce.Job: map 0% reduce 0%
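To make clearer what the job submitted above is computing, here is a minimal single-machine sketch of the k-means iteration (assign each point to its nearest center, then move each center to the mean of its members). It is not Mahout's MapReduce implementation: the function names, the explicit `initial_centers` argument, the stopping rule (stop when no assignment changes), and the sample points are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: List[Point]) -> List[float]:
    """Component-wise mean of a non-empty list of points."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(points: List[Point], initial_centers: List[Point],
           max_iterations: int = 10) -> Tuple[List[List[float]], List[int]]:
    """Return (centers, assignment) after at most max_iterations rounds."""
    centers = [list(c) for c in initial_centers]
    assignment: List[int] = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        new_assignment = [
            min(range(len(centers)),
                key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centers are stable
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = centroid(members)
    return centers, assignment


# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, assignment = kmeans(points, [points[0], points[3]])
print(assignment)
```

In the distributed setting, the assignment step maps naturally onto mappers and the update step onto reducers, which is why each iteration shows up as a separate MapReduce job in the log.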

mahout seqdumper converts a SequenceFile into readable text; the corresponding source file is org.apache.mahout.utils.SequenceFileDumper.java.

Vector files are converted into readable text by the vector dumper; the corresponding source file is org.apache.mahout.utils.vectors.VectorDumper.java.

mahout clusterdump analyzes the final clustering output; the corresponding source file is org.apache.mahout.utils.clustering.ClusterDumper.java.

[root@localhost bin]# mahout seqdumper -s output/clusters-5/part-r-00000 -o /txt.data

mahout clusterdump --seqFileDir /user/root/output/clusters-10-final --pointsDir /user/root/output/clusteredPoints --output $MAHOUT_HOME/examples/output/clusteranalyze.txt

Mahout's algorithms fall into three main areas: clustering, collaborative filtering (item and user recommendation), and classification (for example, Bayesian).

Clustering, also known as cluster analysis, is a statistical method for studying how samples or indicators can be grouped, and is also an important data mining algorithm.

Cluster analysis operates on patterns, where a pattern is usually a vector of measurements, or a point in a multidimensional space.

Cluster analysis is based on similarity: patterns within one cluster are more similar to each other than to patterns in other clusters.
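The similarity idea can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the measure (smaller distance means more similar) and two made-up groups of 2-D points; `average_distance` is an illustrative helper, not a Mahout API.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_distance(group_a, group_b):
    """Mean distance over all pairs (a, b), skipping a point paired with itself."""
    pairs = [(a, b) for a, b in product(group_a, group_b) if a is not b]
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)


cluster_1 = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_2 = [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

within = average_distance(cluster_1, cluster_1)
between = average_distance(cluster_1, cluster_2)
print("within:", within, "between:", between)
```

For a reasonable clustering, the within-cluster average should be clearly smaller than the between-cluster average, which is the property the paragraph above describes.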

Clustering has a wide range of uses.

In business, clustering can help market analysts distinguish different consumer groups from the consumption database, and summarize the consumption patterns or habits of each category of consumers.

Clustering algorithms: Canopy clustering, K-means clustering, fuzzy K-means, expectation maximization (EM) clustering, mean shift clustering, hierarchical clustering, Dirichlet process clustering, and latent Dirichlet allocation (LDA) clustering.
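Of the clustering algorithms listed, Canopy is the simplest to sketch, and it is often used as a cheap first pass to pick starting centers for k-means. The code below follows the usual textbook formulation with a loose threshold T1 and a tight threshold T2 (T1 > T2). The function name, the threshold values, and the points are illustrative assumptions; Mahout's distributed implementation is organized differently.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2):
    """Group points into (center, members) canopies; canopies may overlap."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # next unclaimed point becomes a center
        members = [center]
        still_remaining = []
        for p in remaining:
            d = euclidean(center, p)
            if d < t1:
                members.append(p)  # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly close: may join others
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


# Made-up 2-D points and thresholds.
points = [(1.0, 1.0), (1.5, 1.2), (3.0, 3.0), (8.0, 8.0), (8.5, 8.1)]
canopies = canopy(points, t1=3.0, t2=1.0)
for center, members in canopies:
    print(center, members)
```

Note that the point (3.0, 3.0) ends up in the first canopy and also becomes a center of its own, which shows how canopies are allowed to overlap.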

Classification labels objects according to a given standard, and then separates them into categories according to those labels.

The categories are defined in advance and their number is fixed; for example, soybeans and mung beans can be told apart by color and size.

Classification algorithms: logistic regression, Bayesian, support vector machine, perceptron and Winnow, neural network, random forests, and restricted Boltzmann machine.
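As a small illustration of classification with predefined categories, here is a sketch of the classic perceptron learning rule applied to the bean example. The two features (size in millimetres and a 0 to 1 "greenness" score), their values, and the function names are made-up assumptions for illustration; this is not Mahout's implementation.

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Learn weights and bias for labels in {+1, -1} with the perceptron rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else -1
            if predicted != y:
                # Misclassified: nudge the boundary toward the correct side.
                weights = [w + learning_rate * y * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * y
                errors += 1
        if errors == 0:
            break  # every training sample is classified correctly
    return weights, bias


def predict(weights, bias, x):
    """Return +1 (soybean) or -1 (mung bean) for feature vector x."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Hypothetical (size_mm, greenness) measurements.
samples = [(7.0, 0.2), (6.5, 0.3), (4.0, 0.9), (3.5, 0.8)]
labels = [1, 1, -1, -1]  # +1 = soybean, -1 = mung bean

weights, bias = train_perceptron(samples, labels)
print(predict(weights, bias, (6.8, 0.25)), predict(weights, bias, (3.8, 0.85)))
```

The perceptron only converges when the two classes are linearly separable, which holds for this toy data but is not guaranteed in general.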

Collaborative filtering

Recommendation systems (product recommendation, user recommendation).

Recommendation / collaborative filtering: non-distributed recommenders (Taste: UserCF, ItemCF, SlopeOne) and distributed recommenders (ItemCF).
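To give a feel for how a collaborative filtering recommender predicts a rating, here is a minimal sketch of weighted Slope One, which is assumed to be the scheme the recommender list above refers to. The rating data, the user names, and the function name are illustrative assumptions; the code is plain Python and does not use Mahout's Taste API.

```python
def slope_one_predict(ratings, user, target_item):
    """Predict user's rating of target_item with weighted Slope One.

    ratings maps user -> {item: rating}. Returns None when no other user
    has rated both target_item and one of the items this user rated.
    """
    numerator = 0.0
    denominator = 0
    for item, rating in ratings[user].items():
        if item == target_item:
            continue
        # Rating differences (target - item) from users who rated both.
        diffs = [r[target_item] - r[item]
                 for r in ratings.values()
                 if target_item in r and item in r]
        if diffs:
            average_diff = sum(diffs) / len(diffs)
            numerator += (rating + average_diff) * len(diffs)
            denominator += len(diffs)
    return numerator / denominator if denominator else None


# Made-up ratings on a 1 to 5 scale.
ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
prediction = slope_one_predict(ratings, "lucy", "A")
print(round(prediction, 2))
```

Each item the user has rated contributes its own estimate (the user's rating plus the average difference to the target item), weighted by how many users supplied that difference.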

Vector similarity calculation: RowSimilarityJob (calculates column similarity) and VectorDistanceJob (calculates vector distance).

Non-MapReduce algorithms: hidden Markov models.

Collection extensions: Collections, which extends Java's collection classes.

Association rule mining: parallel FP-Growth algorithm.

Regression: locally weighted linear regression.

Dimensionality reduction: stochastic singular value decomposition, principal component analysis, independent component analysis, and Gaussian discriminant analysis.

Evolutionary algorithms: a parallelized version of the Watchmaker framework.
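Among the remaining entries, the vector similarity calculation is the easiest to illustrate. The sketch below assumes cosine similarity as the measure (the list does not say which measure is used) and computes it between the columns of a small made-up matrix. It is plain Python for illustration and is not how RowSimilarityJob is implemented.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def column_similarities(matrix):
    """Return {(i, j): similarity} for every pair of columns with i < j."""
    columns = list(zip(*matrix))
    n = len(columns)
    return {(i, j): cosine(columns[i], columns[j])
            for i in range(n) for j in range(i + 1, n)}


# Made-up matrix: columns 0 and 1 point the same way, column 2 is orthogonal.
matrix = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],
    [0.0, 0.0, 3.0],
]
similarities = column_similarities(matrix)
for pair, value in sorted(similarities.items()):
    print(pair, round(value, 3))
```

In a recommender setting, the columns would typically be items and the rows users, so the result is an item-to-item similarity table.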

This concludes the walkthrough of how to use mahout kmeans. The specifics still need to be verified in practice.
