
Good Programmer Big Data: How to Start Learning Hadoop from Scratch

2025-01-18 Update. From: SLTechnology News&Howtos (Shulou)


Shulou (Shulou.com), 06/03 report

How should you start learning Hadoop from scratch? Many students learn Hadoop as their way into big data, and their main learning material is usually books. "Hadoop: The Definitive Guide" is indeed a very good introductory book on big data, but a big data system is itself a distributed system, so I think the concepts of distributed systems are the foundation for mastering the various big data frameworks and the knowledge around them.

1 Getting started

The Hadoop framework integrates storage (HDFS), computing (the MapReduce computing model, MR) and resource management (YARN). It is, of course, the product of a particular historical stage, so let's look at the well-known WordCount (MR) example: how is it computed, and in what scenarios?

1-1 Distributed system

First of all, a WordCount program can also be written in the traditional single-machine mode, where multithreading, file splitting and similar techniques come to mind. In short, the idea of parallel computing has been around for a long time. As hardware has kept improving, multi-core computing has been developed for many years. At the same time, the amount of data generated worldwide has grown rapidly. The original single-machine model of multitasking and multithreading, and the multi-core parallelism that followed, ran into a serious mismatch between processing speed and data volume, so finding a way to increase computing power became unavoidable. Clusters solve this by scaling computing resources horizontally while still computing in parallel, and this is the core idea today. We can think of a cluster (as a black box) as analogous to a traditional single machine. Parallel computing between the nodes of a cluster involves master-slave architecture, cluster management, message communication, fault tolerance and so on. These are exactly the problems a distributed system has to consider and solve, because the cluster is itself a distributed system.
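To make the single-machine approach concrete, here is a minimal Python sketch of WordCount using input splitting and a thread pool. The function names (split_lines, count_chunk, word_count) and the sample data are made up for illustration; this is not Hadoop code, only the same idea on one machine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_lines(lines, parts):
    """Cut the input into roughly equal chunks (the 'file splitting' step)."""
    size = max(1, -(-len(lines) // parts))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_chunk(chunk):
    """Count the words in one chunk; each worker runs this independently."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def word_count(lines, workers=4):
    """Count words by splitting the input and merging the partial counts."""
    chunks = split_lines(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # merge step: add the partial counts together
    return total


if __name__ == "__main__":
    sample = ["hello hadoop", "hello big data", "hadoop stores data"]
    print(word_count(sample, workers=2))
```

The structure (split, count each part independently, merge) is what a cluster repeats across machines. On one machine the limit is the hardware itself; in standard CPython, threads also share one interpreter lock, so CPU-bound counting would likely need processes rather than threads to run truly in parallel.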

1-2 Distributed storage

The previous section briefly covered distributed systems. Computing hides another problem: computation needs data, which means storage, so storage is fundamental. To use a distributed storage system (HDFS), you must understand its components (such as blocks, file systems and distributed file systems) and how to use them (reading from and writing to HDFS). However, most students are familiar with relational databases and SQL, which sit at the application level. They do not know the underlying details, have not taken part in developing database software, and have relatively little study or work experience with files. As a result they are unfamiliar with the topics involved here: file I/O, serialization, compression, and built-in or custom file formats and read/write modes. This matters because HDFS is essentially a file system.
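The file-level ideas mentioned above (blocks, serialization, compression) can be tried out without a cluster. The sketch below is plain Python, not the HDFS API: block_ranges, serialize and deserialize are hypothetical helper names, and the 128 MB block size is an assumption based on the common default of the dfs.blocksize setting in Hadoop 2 and later.

```python
import gzip
import json

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (common dfs.blocksize default)


def block_ranges(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def serialize(records):
    """Turn Python dicts into compressed bytes: JSON lines, then gzip."""
    text = "\n".join(json.dumps(r, sort_keys=True) for r in records)
    return gzip.compress(text.encode("utf-8"))


def deserialize(data):
    """Reverse of serialize: gunzip, then parse one JSON object per line."""
    text = gzip.decompress(data).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


if __name__ == "__main__":
    # A 300 MB file occupies three blocks; the last one is only partly filled.
    for offset, length in block_ranges(300 * 1024 * 1024):
        print(offset, length)

    records = [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}]
    payload = serialize(records)
    # Split the bytes into tiny 16-byte "blocks" and join them again.
    blocks = [payload[o:o + n] for o, n in block_ranges(len(payload), 16)]
    print(deserialize(b"".join(blocks)) == records)
```

A file is just a byte sequence cut into fixed-size pieces, and the records inside only make sense after the pieces are reassembled, decompressed and parsed. That is why file formats and read/write modes matter so much once storage is a file system rather than a database.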

1-3 Distributed computing

Most learners have also had little contact with the MR computing model and have no practical experience with it: what MR can do, in which scenarios it is used, and so on. This is because what we met before was mostly OLTP (Online Transaction Processing): highly transactional systems, generally highly available online systems dominated by small transactions and small queries, with traditional relational databases as the main application, handling basic day-to-day business transactions such as bank transactions.

Big data was originally used for data mining and is closer to OLAP (Online Analytical Processing), sometimes also called DSS (decision support system), which is what we usually call a data warehouse. It is analysis-oriented, produces a large number of queries, and rarely involves inserts, deletes or updates.

The map and reduce operations of the MR computing model correspond to requirements we meet often. Map is responsible for data cleaning and transformation, and reduce is responsible for data aggregation. The select and group by clauses in SQL address the same kind of requirements, only in a different way.
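The correspondence between map/reduce and SQL can be sketched in a few lines of Python. This is an in-memory illustration, not the Hadoop MapReduce API; the function names, the "city,amount" input format and the orders table in the SQL comment are all assumptions made up for the example.

```python
from collections import defaultdict

# Roughly equivalent SQL (hypothetical table):
#   SELECT city, SUM(amount) FROM orders GROUP BY city;


def map_phase(raw_lines):
    """Clean and transform: drop malformed lines, emit (city, amount) pairs."""
    for line in raw_lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue  # cleaning: skip lines that do not have two fields
        city, amount = parts
        try:
            value = float(amount)
        except ValueError:
            continue  # cleaning: skip lines whose amount is not a number
        yield city.lower(), value


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group; here the aggregate is a sum."""
    return {key: sum(values) for key, values in groups.items()}


if __name__ == "__main__":
    orders = ["Beijing, 10", "shanghai,5", "bad line", "beijing,2.5", "x,abc"]
    print(reduce_phase(shuffle(map_phase(orders))))
```

Here the map step plays the role of the select clause (choosing and transforming fields), while the grouping and reduce steps play the role of group by with an aggregate function.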

2 Advanced

2-1 It is recommended to look at the various big data frameworks from the perspective of distributed systems, and to understand distributed theory such as the CAP theorem and master-slave architecture.

2-2 Of course, since these frameworks address problems in different directions, we classify them first, as shown below.

Technical architecture


1 Data acquisition: Flume, Logstash

2 Data storage: HDFS, HBase, Alluxio, ES, Neo4j, JanusGraph, Redis, MongoDB, TiDB

3 Data computing: Hive, Impala, Spark, Flink, Druid

4 Data channels: Kafka, Pulsar

5 Task scheduling: Azkaban, Airflow

6 Multidimensional data model

7 Data synchronization: Sqoop, DataX, Canal

8 Data formats: Parquet, ORC, CSV, JSON

9 Coordination service: ZooKeeper

10 Monitoring: Zabbix, Prometheus

3 Recommendations

3-1 The official websites of the various big data frameworks are always the first-hand resource. Be sure to read them.

3-2 The many WeChat official accounts, Stack Overflow, GitHub, etc.

3-3 Searching for resources on Google.
