
Hadoop Learning Series (1. Big data's typical characteristics and distributed development difficulties)

2025-04-14 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

First day

1. Big data's typical characteristics and distributed development difficulties

2. Hadoop framework introduction and search technology system introduction
3. Hadoop versions and characteristics
4. Hadoop core module: HDFS distributed file system architecture
5. Hadoop core module: Yarn operating-system architecture
6. Linux security settings to disable and JDK installation
7. Hadoop pseudo-distributed environment deployment: HDFS part
8. Hadoop pseudo-distributed environment deployment: Yarn and MR part
9. Common errors in the use of the Hadoop environment
10. General settings and auxiliary functions of the Hadoop environment (1)
11. General settings and auxiliary functions of the Hadoop environment (2)
12. Notes on deploying the Eclipse plug-in in a Windows environment

1. Big data's typical characteristics and distributed development difficulties

1. Typical characteristics of big data

Before big data technology, sampling statistics was the usual approach. Take estimating a city's male-to-female ratio as an example: we would go to crowded places, randomly select some people, and use the ratio among them as the ratio for the whole city. The error of such an estimate is large, and the more data we use, the more accurate the result becomes. But to use all of the data, we must first solve the problem of storing such a huge volume of it (and this example does not even involve a variety of data types), and then solve the problem of computing over it, since we cannot count person by person manually. Big data technology solves these problems for us.
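The sampling intuition can be sketched numerically. This is a minimal simulation (not from the original text), assuming a hypothetical city whose true male ratio is 51%: the average estimation error shrinks as the sample grows.

```python
import random

def estimate_male_ratio(true_ratio, sample_size, rng):
    """Draw sample_size residents at random; return the observed male ratio."""
    males = sum(1 for _ in range(sample_size) if rng.random() < true_ratio)
    return males / sample_size

def mean_abs_error(true_ratio, sample_size, trials, rng):
    """Average absolute estimation error over several independent samples."""
    return sum(abs(estimate_male_ratio(true_ratio, sample_size, rng) - true_ratio)
               for _ in range(trials)) / trials

rng = random.Random(42)
TRUE_RATIO = 0.51  # hypothetical ground truth for the whole city

small = mean_abs_error(TRUE_RATIO, 100, 200, rng)     # small samples
large = mean_abs_error(TRUE_RATIO, 10_000, 200, rng)  # much larger samples
print(f"avg error with n=100:   {small:.4f}")
print(f"avg error with n=10000: {large:.4f}")
```

The larger sample's error is roughly ten times smaller, which is exactly why computing over *all* the data, rather than a sample, is attractive once storage and computation are solved.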

Traditional RDBMSs hit a bottleneck here. Relational data assumes certain relationships among data items, which must be designed carefully up front during the database design stage. But today we often need to analyze data with no predefined relationships. For example, when building a recommendation system we analyze customer behavior, and there is no fixed relational structure among behavior records. The coexistence of structured and unstructured data makes the data more diverse.
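As an illustration (hypothetical clickstream events, not from the original text), semi-structured log records each carry their own structure, so the schema is interpreted at read time rather than fixed at database-design time:

```python
import json

# Hypothetical clickstream events: fields vary per record, which is awkward
# to model as a fixed relational schema designed up front.
raw_events = [
    '{"user": "u1", "action": "view", "item": "p42"}',
    '{"user": "u2", "action": "search", "query": "hadoop", "results": 130}',
    '{"user": "u1", "action": "click", "item": "p42", "position": 3}',
]

# Schema-on-read: parse each record and interpret only the fields needed
# for this particular analysis.
events = [json.loads(line) for line in raw_events]
actions_per_user = {}
for e in events:
    actions_per_user.setdefault(e["user"], []).append(e["action"])

print(actions_per_user)  # {'u1': ['view', 'click'], 'u2': ['search']}
```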

On top of the huge volume, such data must also be processed very quickly, which is a great technical challenge. These are the characteristics of big data:

Many: the amount of data is huge; we need to solve the problem of massive data storage.

Complex: structured, unstructured, and semi-structured data coexist.

Fast: such a large amount and variety of data must also be processed quickly, or processing becomes the bottleneck of the system.

Our ultimate goal is to mine useful and valuable data.

2. What can big data do?

3. The work of a complete data platform

3.1 Offline

- Batch computing

3.2 Real-time

- Streaming computing
- Online analysis

3.3 Data sharing
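A toy contrast between the two computation styles above (illustrative only; real platforms use engines such as MapReduce for batch and a dedicated streaming engine for real-time): batch computes over a complete dataset in one pass, while streaming updates state record by record and can be queried at any moment.

```python
from collections import Counter

# Batch (offline): the full dataset is already available; compute over it once.
def batch_word_count(lines):
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Streaming (real-time): records arrive one at a time; state is updated
# incrementally as each record comes in.
class StreamingWordCount:
    def __init__(self):
        self.counts = Counter()
    def on_record(self, line):
        self.counts.update(line.split())

data = ["hadoop hdfs yarn", "hadoop mapreduce"]
print(batch_word_count(data)["hadoop"])  # 2

stream = StreamingWordCount()
for line in data:
    stream.on_record(line)  # counts are current after every record
print(stream.counts["hadoop"])  # 2
```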

4. Data platform metrics

- Number of machines: 5,000
- Total storage: 100 PB+
- Daily growth: 200 TB, with a monthly growth rate of 10%
- Multiple data products
- 100,000+ storage tables
- Average number of jobs run per day
- Average daily computation: 5 PB+

5. The difficulties of distributed development

- Platform building
- Distribution itself
- Synchronization and consistency: configuration (many frameworks must be deployed) and time (even tiny clock errors matter)
- Automated deployment and management platforms
- CDH, the Hadoop distribution released by Cloudera
- Cloudera Manager (CM for short)
- Open-source frameworks are not always reliable

So many companies develop their own frameworks based on open-source ones, such as Taobao's TFS file system.

Similarly, alongside the task-scheduling framework Oozie, Taobao built its own scheduler, Zeus.

- The question of cost

Because clusters are built from relatively cheap machines, node failures are inevitable, so there must be a corresponding fault-tolerance mechanism to ensure the robustness of the cluster.
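A minimal sketch of one such mechanism, replication with read failover (hypothetical class and method names, loosely modeled on how HDFS replicates blocks; not an actual HDFS API): a read tolerates a dead node by falling back to the next replica.

```python
# Each block is stored on several nodes; if one node is down, a reader
# simply moves on to the next replica instead of failing.
class Node:
    def __init__(self, name, alive=True):
        self.name, self.alive = name, alive
        self.blocks = {}  # block_id -> bytes

    def read(self, block_id):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.blocks[block_id]

def replicated_read(replicas, block_id):
    """Try each replica in turn; succeed as long as any one is alive."""
    for node in replicas:
        try:
            return node.read(block_id)
        except ConnectionError:
            continue  # tolerate the failure, try the next replica
    raise RuntimeError("all replicas failed")

nodes = [Node("n1", alive=False), Node("n2"), Node("n3")]
for n in nodes:
    n.blocks["blk_1"] = b"payload"

print(replicated_read(nodes, "blk_1"))  # b'payload' despite n1 being down
```

The design choice mirrored here is that fault tolerance lives in software, so the hardware can stay cheap: losing a node costs a retry, not the data.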

6. Foundations for learning big data:

These are my own study notes, so the organization may have problems; if you don't like them, please be forgiving.
