
A Big Data Learning Route (Self-Made, Starting from Scratch)

2025-04-17 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Big data has been popular for a long time. I always wanted to understand and learn it but never had the time. After the Chinese New Year I finally had some, so I read up on the subject and, taking my own situation into account, put together a preliminary learning route. If there are problems with it, I hope the experts will point them out.

Learning route

Linux (shell, high-concurrency architecture, Lucene, Solr)

Hadoop (Hadoop, HDFS, MapReduce, YARN, Hive, HBase, Sqoop, ZooKeeper, Flume)

Machine learning (R, Mahout)

Storm (Storm, Kafka, Redis)

Spark (Scala, Spark, Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX)

Python (Python, Spark Python)

Cloud computing platform (Docker, KVM, OpenStack)

Explanation of terms

I. Linux

Lucene: a full-text search engine library that provides the architecture for indexing and searching text.

Solr: a full-text search server built on Lucene. It is configurable and extensible, optimizes query performance, and provides a full-featured administration interface.

II. Hadoop

Hadoop Common: the common utilities and libraries that support the other Hadoop modules.

HDFS: a distributed storage system made up of NameNode and DataNode processes. The NameNode keeps the metadata; the DataNodes store the data.

YARN: can be understood as the coordination mechanism for MapReduce. In essence it is the resource management and job scheduling layer for Hadoop's processing and analysis work, and it is divided into the ResourceManager and the NodeManager.

MapReduce: a software framework for writing programs that process large data sets in parallel across a cluster.
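To make the idea concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases, using word counting as the example. It is plain Python rather than Hadoop's actual API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative assumptions, not names from any library.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big plans", "data moves fast"])
```

In a real cluster the map and reduce calls would run on many machines and the shuffle would move data over the network; the sketch only shows how the three phases fit together.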

Hive: a data warehouse that can be queried with SQL; the queries are run as Map/Reduce programs. It is used for tasks such as calculating trends or analyzing site logs, not for real-time queries, because results take a long time to return.

HBase: a database that is well suited to real-time queries over big data. Facebook has used HBase to store message data and analyze messages in real time.

ZooKeeper: a reliable coordination service for large-scale distributed systems. Hadoop's distributed synchronization is implemented with ZooKeeper, for example active/standby switching between multiple NameNodes.

Sqoop: a data transfer tool that moves data in both directions between relational databases and HDFS.

Mahout: scalable machine learning and data mining libraries, used for recommendation mining, clustering, classification, and frequent itemset mining.

Chukwa: an open source data collection system for monitoring large distributed systems, built on the HDFS and Map/Reduce frameworks. It also displays, monitors, and analyzes the collected results.

Ambari: a web-based tool with a friendly interface for configuring, managing, and monitoring Hadoop clusters.

II. Hadoop (continued): Cloudera

Cloudera Manager: integrates cluster management, monitoring, and diagnostics.

Cloudera CDH: (Cloudera's Distribution, including Apache Hadoop) Cloudera has made its own changes to Hadoop, and the resulting distribution is called CDH.

Cloudera Flume: a log collection system that supports customizing the data senders within the logging system in order to collect data.

Cloudera Impala: provides direct, interactive SQL queries over data stored in HDFS and HBase in Apache Hadoop.

Cloudera Hue: a web manager consisting of Hue UI, Hue Server, and Hue DB. Hue provides a web interface to the shell interfaces of all CDH components, and you can write MapReduce jobs in Hue.

III. Machine Learning / R

R: a language and environment for statistical analysis and graphics. R integration with Hadoop (Hadoop-R) is currently available.

Mahout: provides scalable implementations of classic machine learning algorithms, including clustering, classification, recommendation filtering, and frequent itemset mining, and can be extended to the cloud through Hadoop.

IV. Storm

Storm: a distributed, fault-tolerant, real-time stream computing system. It can be used for real-time analytics, online machine learning, information stream processing, continuous computation, distributed RPC, and processing messages in real time to update databases.

Kafka: a high-throughput distributed publish-subscribe messaging system that can handle all of the activity stream data (page views, searches, and so on) of a consumer-scale website. Where such data was traditionally handled through log processing and offline analysis in Hadoop, Kafka also makes real-time processing possible. Online and offline message processing are unified through Hadoop's parallel loading mechanism.
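The publish-subscribe idea, where a real-time consumer and an offline consumer read the same stream independently, can be sketched with a toy in-memory log. This is not Kafka's client API; the `TopicLog` class, its methods, and the group names are hypothetical and exist only for illustration.

```python
class TopicLog:
    """Hypothetical in-memory stand-in for one topic: an append-only message list."""

    def __init__(self):
        self._messages = []
        self._offsets = {}  # consumer group name -> index of the next unread message

    def publish(self, message):
        """Append a message and return its offset (its position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, group, max_messages=10):
        """Return up to max_messages unread messages for this group and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch


log = TopicLog()
for event in ["view:/home", "search:hadoop", "view:/docs"]:
    log.publish(event)

realtime_batch = log.poll("realtime", max_messages=2)  # a low-latency consumer
offline_batch = log.poll("offline")                    # a batch consumer reading everything
```

Because each group tracks its own offset, reading by one consumer does not remove messages for the other, which is what lets online and offline processing share one stream.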

Redis: a key-value database written in C. It is network-accessible, memory-based, and can persist its data, including through a log of write operations.
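A memory-based key-value store with log-style persistence can be sketched in a few lines. The example below is only loosely in the spirit of that description: it is not Redis's implementation, protocol, or file format, and the `TinyKV` class and its JSON log layout are assumptions made for illustration.

```python
import json
import os


class TinyKV:
    """Hypothetical in-memory key-value store that persists writes to an append-only log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()

    def _replay(self):
        """Rebuild the in-memory dictionary by replaying the log, if one exists."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                op, key, value = json.loads(line)
                if op == "set":
                    self.data[key] = value
                elif op == "del":
                    self.data.pop(key, None)

    def _append(self, record):
        """Write one operation to the end of the log file."""
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")

    def set(self, key, value):
        self._append(["set", key, value])
        self.data[key] = value

    def get(self, key, default=None):
        """Reads are served from memory only; the log is never consulted."""
        return self.data.get(key, default)

    def delete(self, key):
        self._append(["del", key, None])
        self.data.pop(key, None)
```

Reads never touch the disk, which is where the speed comes from; the log exists only so that a restarted process can reconstruct the same state by replaying it.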

V. Spark

Scala: a programming language similar to Java that runs on the JVM. It is fully object-oriented and also supports functional programming.

Spark: a general-purpose parallel framework similar to Hadoop MapReduce, implemented in Scala. It shares the advantages of Hadoop MapReduce, but unlike MapReduce it can keep the intermediate output of a job in memory, so it does not need to read and write HDFS between steps. This makes Spark a better fit for iterative algorithms such as those used in data mining and machine learning. Spark can run alongside the Hadoop file system, which is supported through the third-party cluster framework Mesos.
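The benefit of keeping intermediate results in memory can be illustrated by counting storage accesses in a toy iterative job. This is plain Python, not Spark's API; `CountingStorage`, `step`, `run_via_storage`, and `run_in_memory` are hypothetical names, and the "algorithm" is just a placeholder that pulls each value halfway toward the mean.

```python
class CountingStorage:
    """Hypothetical stand-in for a distributed file system that counts reads and writes."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self._values)

    def write(self, values):
        self.writes += 1
        self._values = list(values)


def step(values):
    """One iteration of a placeholder algorithm: move each value halfway toward the mean."""
    mean = sum(values) / len(values)
    return [(v + mean) / 2 for v in values]


def run_via_storage(storage, iterations):
    """Chained-job style: every iteration reads its input from storage and writes its output back."""
    for _ in range(iterations):
        storage.write(step(storage.read()))
    return storage.read()


def run_in_memory(storage, iterations):
    """In-memory style: read once, iterate on the in-memory result, write once at the end."""
    values = storage.read()
    for _ in range(iterations):
        values = step(values)
    storage.write(values)
    return values
```

Both functions compute the same result, but for N iterations the first performs N writes and N + 1 reads against storage, while the second performs one of each regardless of N.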

Spark SQL: Spark's module for working with structured data using SQL queries.

Spark Streaming: a real-time computing framework built on Spark that extends Spark's ability to process streaming data.

Spark MLlib: a library implementing commonly used machine learning algorithms on Spark. As of May 2014 it supports binary classification, regression, clustering, and collaborative filtering, and it also includes a basic gradient descent optimization algorithm. MLlib depends on the jblas linear algebra library, and jblas itself depends on native Fortran routines.
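As a rough illustration of what a basic gradient descent optimizer does, the sketch below fits a straight line to a handful of points by minimizing mean squared error. It is single-machine Python and does not use MLlib; the function name, learning rate, and step count are assumptions chosen for this small example.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ~ w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Points that lie exactly on y = 2x + 1, so w should approach 2 and b should approach 1.
w, b = gradient_descent([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])
```

A distributed version would compute the two gradient sums in parallel over partitions of the data and combine them before each update; the update rule itself stays the same.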

Spark GraphX: Spark's API for graphs and graph-parallel computation. It offers a one-stop data solution on top of Spark, so that a complete graph-computation pipeline can be carried out conveniently and efficiently.

jblas: a fast linear algebra library for Java. It is based on BLAS and LAPACK, the de facto industry standard for matrix computations, and uses state-of-the-art implementations such as ATLAS for all of its computational routines, which makes it very fast.

Fortran: one of the earliest high-level programming languages, widely used in scientific and engineering computing.

BLAS: Basic Linear Algebra Subprograms, a library of ready-made routines for common linear algebra operations.

LAPACK: well-known open software for solving the most common numerical linear algebra problems in scientific and engineering computing, such as systems of linear equations, linear least-squares problems, eigenvalue problems, and singular value problems.

ATLAS: an optimized implementation of the BLAS linear algebra library.

Spark Python: Spark is written in Scala, but Java and Python interfaces are provided for wider adoption and compatibility.

VI. Python

Python: an object-oriented, interpreted programming language.

VII. Cloud computing platform

Docker: an open source application container engine

KVM: Kernel-based Virtual Machine, the virtualization module built into the Linux kernel.

OpenStack: an open source cloud computing management platform project.

