An introduction to big data and related technologies

2025-01-17 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

What is big data?

Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a tolerable time frame. It is a massive, fast-growing, and diversified information asset that requires new processing models to deliver stronger decision-making power, insight, and process-optimization capability. Big data is commonly defined by the 4Vs: Volume, Velocity, Variety, and Veracity. Put simply: big, fast, varied, and true.

Volume: large data volume

With the development of technology, people's ability to collect information has grown ever stronger, and the volume of data obtained is increasing explosively. For example, Baidu processes hundreds of PB of data every day, and its total data volume has reached the EB level.

Velocity: fast processing speed

Velocity refers to the frequency of the events people care about, such as sales, transactions, and measurements. On Singles' Day 2017, peak successful payments reached 256,000 per second, and peak real-time data processing reached 472 million messages per second.

Variety: diverse data sources

The data sources to be processed now include various relational databases, NoSQL stores, flat files, XML files, machine logs, pictures, audio, and video, and new data formats and data sources appear every day.

Veracity: authenticity

Hardware and software anomalies, application bugs, human error, and the like can all make data incorrect. Big data processing should analyze and filter out the biased, forged, and abnormal parts to prevent dirty data from harming the accuracy of results.

How to learn big data

When it comes to learning big data, Hadoop and Spark have to be mentioned.

Hadoop

Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the underlying details of distribution, making full use of the power of a cluster for high-speed computation and storage.

Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware; it provides high-throughput access to application data, which makes it suitable for applications with very large data sets. HDFS relaxes some POSIX requirements so that file system data can be accessed as a stream (streaming access).

The core of Hadoop's design is HDFS and MapReduce: HDFS provides storage for massive data, while MapReduce provides computation over it.

In short, Hadoop is a distributed system infrastructure for processing big data.

Spark

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark is an open-source, general parallel framework in the style of Hadoop MapReduce, developed by UC Berkeley's AMP Lab (Algorithms, Machines, and People Lab). Spark has the advantages of Hadoop MapReduce, but unlike MapReduce, a job's intermediate output can be kept in memory, so there is no need to read and write HDFS between stages. Spark is therefore better suited to iterative algorithms such as data mining and machine learning.

Spark is an open-source cluster computing environment similar to Hadoop, but with some useful differences that make it perform better on certain workloads: in addition to providing interactive queries, Spark keeps distributed datasets in memory to optimize iterative workloads.

Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark and Scala are tightly integrated, and Scala can manipulate distributed datasets as easily as local collection objects. Although Spark was created to support iterative jobs on distributed datasets, it is actually a complement to Hadoop and can run in parallel on the Hadoop file system; a third-party cluster framework named Mesos supports this. Spark can be used to build large, low-latency data analysis applications.

In short, Spark is a tool for processing big data held in distributed storage.

As for learning Hadoop and Spark, I am also a beginner and cannot yet give a good overall learning route, but I can recommend some good articles and related resources for learning big data, which can be found at the bottom of this article.

Introduction to big data related technologies

First, take a look at the overall big data technology landscape to get a more intuitive impression.

[Figure: the overall big data technology landscape]

Note: Shark has been replaced by Spark SQL.

Seeing so many related technologies, are you dazzled? Never mind being proficient in all of them; even being able to use them all competently is a tall order. So which of these technologies should you focus on?

Let's classify these technologies first.

File storage: Hadoop HDFS, Tachyon, KFS
Offline computing: Hadoop MapReduce, Spark
Streaming and real-time computing: Storm, Spark Streaming, S4, Heron, Flink
K-V and NoSQL databases: HBase, Redis, MongoDB
Resource management: YARN, Mesos
Log collection: Flume, Scribe, Logstash, Kibana
Message systems: Kafka, StormMQ, ZeroMQ, RabbitMQ
Query analysis: Hive, Impala, Pig, Presto, Phoenix, SparkSQL, Drill, Kylin, Druid
Distributed coordination services: Zookeeper
Cluster management and monitoring: Ambari, Ganglia, Nagios, Cloudera Manager
Data mining and machine learning: Mahout, Spark MLlib
Data synchronization: Sqoop
Task scheduling: Oozie

Having sorted them out as a whole, do you now have a clearer route for how to learn?

So, personally, I think the skills for preliminary learning should be as follows:

HDFS

HDFS (Hadoop Distributed File System) is the foundation of data storage management in the Hadoop system. It is a highly fault-tolerant system able to detect and respond to hardware failures, designed to run on low-cost commodity hardware. HDFS simplifies the file consistency model and provides high-throughput application data access through streaming data access, making it suitable for applications with large data sets. The main roles in HDFS are:

Client: the system user; calls the HDFS API to operate on files, interacts with the NameNode to obtain file metadata, and interacts with DataNodes to read and write data.
NameNode: the metadata node and the sole manager of the system; responsible for managing metadata, serving metadata queries from clients, and assigning data storage nodes.
DataNode: the data storage node; responsible for storing and redundantly replicating data blocks and for performing block read and write operations.
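To make the block storage idea concrete, here is a minimal sketch of how a file might be split into fixed-size blocks and each block replicated across DataNodes. The round-robin placement and the node names are illustrative assumptions; real HDFS uses 128 MB blocks by default and a rack-aware placement policy.

```python
# Simplified sketch of HDFS-style block splitting and replica placement.
# Round-robin placement is an illustrative assumption; real HDFS uses a
# rack-aware placement policy for its (default 3) replicas.

def split_into_blocks(file_size, block_size=128):
    """Split a file of `file_size` (in MB) into block-sized chunks."""
    blocks = []
    offset = 0
    while offset < file_size:
        blocks.append(min(block_size, file_size - offset))
        offset += block_size
    return blocks

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(300)          # a 300 MB file -> [128, 128, 44]
nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(len(blocks), nodes)
```

With a 300 MB file this yields three blocks (128, 128, and 44 MB), each stored on three of the four DataNodes, which is why the loss of any single node does not lose data.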

MapReduce

MapReduce is a computing model for processing large amounts of data. Hadoop's MapReduce implementation, together with Common and HDFS, constituted the three components of Hadoop in its early days. MapReduce divides an application into two steps, Map and Reduce: Map performs a specified operation on each independent element of the data set and produces intermediate results in the form of key-value pairs; Reduce combines all values with the same key in the intermediate results to obtain the final result. This functional decomposition makes MapReduce very well suited to data processing in a distributed parallel environment composed of a large number of machines.
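The Map, shuffle, and Reduce steps above can be sketched in a few lines with the classic word-count example. This runs in a single process purely to show the model; a real Hadoop job would distribute the map and reduce tasks across the cluster.

```python
from collections import defaultdict

# Minimal in-process sketch of the MapReduce model: word count.

def map_phase(document):
    # Map: emit an intermediate (key, value) pair for every word.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all intermediate values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: combine all values of the same key into the final result.
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data is big", "data is everywhere"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
# counts == {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```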

YARN

YARN is Hadoop's newer resource management system. Besides Hadoop MapReduce, the Hadoop ecosystem now has many applications that operate on data stored in HDFS, and a resource management system lets multiple applications' jobs run at the same time. For example, in one cluster, some users may submit MapReduce jobs while others submit Spark jobs. The role of resource management is to ensure that both computing frameworks get the resources they need and, when multiple people submit queries simultaneously, that those queries are served in a reasonable way.
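A toy sketch of that resource-manager idea: applications ask for containers (memory, here in GB), and the scheduler grants them only while cluster capacity remains. The FIFO order and the job names are illustrative assumptions; YARN actually ships several pluggable schedulers (FIFO, Capacity, Fair).

```python
# Toy sketch of the resource-manager idea behind YARN.
# FIFO grant order is an illustrative assumption.

def schedule(cluster_memory, requests):
    """requests: list of (app_name, memory_needed_gb); returns (granted, free)."""
    granted, free = [], cluster_memory
    for app, mem in requests:
        if mem <= free:          # grant a container only if resources remain
            granted.append(app)
            free -= mem
    return granted, free

granted, free = schedule(16, [("mapreduce-job", 8), ("spark-job", 6), ("adhoc", 4)])
# mapreduce-job and spark-job fit (8 + 6 = 14 of 16 GB);
# adhoc needs 4 GB but only 2 GB remain, so it must wait.
```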

Spark Streaming

Spark Streaming is a high-throughput, fault-tolerant streaming system for real-time data streams. It can perform complex operations such as Map, Reduce, and Join on data from a variety of sources (such as Kafka, Flume, Twitter, ZeroMQ, and TCP sockets) and save the results to external file systems, databases, or real-time dashboards.
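The core idea behind Spark Streaming is micro-batching: the unbounded event stream is cut into small batches, and an ordinary batch computation runs on each one. A minimal sketch (the batch size, event names, and per-batch word count are illustrative assumptions):

```python
# Sketch of the micro-batch idea behind Spark Streaming: cut a stream into
# fixed-size batches and run a normal batch computation on each batch.

def micro_batches(events, batch_size):
    # Cut the incoming stream into fixed-size batches.
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def count_per_batch(batch):
    # The "batch job": count occurrences of each event in this batch.
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["click", "view", "click", "buy", "view", "view"]
results = [count_per_batch(b) for b in micro_batches(stream, 3)]
# results == [{'click': 2, 'view': 1}, {'buy': 1, 'view': 2}]
```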

Spark SQL

Spark SQL is another famous SQL engine on Hadoop. As the name suggests, it uses Spark as its underlying computing framework. Spark's basic data structure is the RDD, a read-only data set distributed across cluster nodes. The traditional MapReduce framework forces a specific linear data flow onto distributed programs: a MapReduce program reads input data from disk, decomposes it into key/value pairs, produces output after processing steps such as cleaning, sorting, and merging, and saves the final results back to disk. The results of both the Map phase and the Reduce phase are written to disk, which greatly degrades system performance; for this reason MapReduce is mostly used for batch tasks. Spark's RDDs, by contrast, can be kept in memory across operations, which is what makes Spark faster for interactive and iterative queries.
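The RDD behavior described above can be sketched in miniature: transformations such as map and filter only record lineage, nothing runs until an action such as collect is called, and the data stays in memory between steps instead of being written to disk. The `ToyRDD` class is purely illustrative, not Spark's API.

```python
# Minimal sketch of the RDD idea: lazy transformations recorded as lineage,
# executed in memory only when an action (collect) is called.

class ToyRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []            # recorded lineage, not yet executed

    def map(self, fn):
        return ToyRDD(self.data, self.ops + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self.data, self.ops + [("filter", pred)])

    def collect(self):
        # The action: replay the recorded lineage over the in-memory data.
        out = self.data
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4, 5]).map(lambda x: x * x).filter(lambda x: x > 5)
result = rdd.collect()                  # [9, 16, 25]
```

Because the lineage is just a recorded recipe, a lost partition can be recomputed from it, which is also how real RDDs achieve fault tolerance without replication.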

Hive

Hive is a data warehouse tool based on Hadoop. It can map structured data files onto a database table, provides simple SQL query functionality, and translates SQL statements into MapReduce jobs to run. Its advantage is a low learning cost: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing dedicated MapReduce applications, so it is very well suited to statistical analysis over a data warehouse.
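This is the kind of query Hive makes easy: a SQL GROUP BY that Hive would compile into a MapReduce job behind the scenes. Here sqlite3 stands in for the SQL layer just to show the query shape; the table name and data are illustrative assumptions, and HiveQL syntax differs in places.

```python
import sqlite3

# A GROUP BY aggregation of the kind Hive translates into MapReduce.
# sqlite3 is used only as a stand-in SQL engine for illustration.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, user TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", [
    ("/home", "alice"), ("/home", "bob"), ("/docs", "alice"),
])
rows = conn.execute(
    "SELECT page, COUNT(*) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
# rows == [('/docs', 1), ('/home', 2)]
```

Conceptually, the GROUP BY key plays the role of the MapReduce key, and COUNT(*) is the reduce function, which is why such statements translate so directly.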

Impala

Impala is a massively parallel processing (MPP) query engine running on Hadoop. It provides high-performance, low-latency SQL queries over Hadoop cluster data, using HDFS as the underlying storage. Its rapid query response makes interactive exploration and tuning of analytical queries possible, something difficult to accomplish with traditional SQL-on-Hadoop techniques oriented toward long batch jobs. Impala's biggest highlight is its execution speed: officials claim that in most cases it returns query results within seconds or minutes, whereas the same Hive query usually takes dozens of minutes or even hours, so Impala is well suited to analytical queries over data in the Hadoop file system. Impala defaults to the Parquet file format, which is efficient for the large queries typical of data warehouse scenarios.

HBase

HBase is a distributed storage system for structured data. Unlike a typical relational database, HBase is suited to unstructured data storage; another difference is that HBase is column-based rather than row-based. HBase is a scalable, highly reliable, high-performance, distributed, column-oriented, dynamic-schema database. It uses BigTable's data model: an enhanced sparse sorted mapping table (key/value), where the key is composed of a row key, a column key, and a timestamp. HBase provides random, real-time read and write access to large-scale data, and the data stored in HBase can be processed with MapReduce, which combines data storage and parallel computing nicely.
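The (row key, column key, timestamp) → value model described above can be sketched as a small versioned map. The class, row names, and timestamps are illustrative assumptions, not HBase's client API.

```python
# Sketch of HBase's BigTable-style data model: a map whose key is
# (row key, column key, timestamp) and whose value is an uninterpreted cell.

class ToyHTable:
    def __init__(self):
        self.cells = {}                 # (row, column, timestamp) -> value

    def put(self, row, column, timestamp, value):
        # Writes never overwrite: each put adds a new timestamped version.
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column):
        # Return the newest version of the cell, as HBase does by default.
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

t = ToyHTable()
t.put("user1", "info:city", 100, "Beijing")
t.put("user1", "info:city", 200, "Shanghai")   # a newer version of the cell
city = t.get("user1", "info:city")              # 'Shanghai'
```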

Apache Kylin

Apache Kylin is an open-source distributed analysis engine that provides a SQL query interface and multidimensional analysis (OLAP) capability over Hadoop to support extremely large-scale data. Originally developed by eBay Inc. and contributed to the open-source community, it can query huge Hive tables in sub-second time.
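Kylin achieves sub-second queries by precomputing an OLAP cube: for every subset of the dimensions (every "cuboid"), the measure is aggregated ahead of time so that queries become lookups. A tiny sketch of that precomputation (the dimensions, rows, and flat-dict cube layout are illustrative assumptions):

```python
from itertools import combinations

# Sketch of OLAP-cube precomputation: aggregate the measure for every
# subset of the dimensions (every cuboid) so queries become lookups.

def build_cube(rows, dimensions, measure):
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            agg = {}
            for row in rows:
                key = tuple(row[d] for d in dims)
                agg[key] = agg.get(key, 0) + row[measure]
            cube[dims] = agg
    return cube

rows = [
    {"year": 2017, "region": "CN", "sales": 10},
    {"year": 2017, "region": "US", "sales": 5},
    {"year": 2018, "region": "CN", "sales": 7},
]
cube = build_cube(rows, ["year", "region"], "sales")
total = cube[()][()]                       # grand total: 22, precomputed
by_year = cube[("year",)][(2017,)]         # sales for 2017: 15
```

The trade-off is storage for speed: with n dimensions there are 2^n cuboids, which is why Kylin spends build time and disk up front to answer queries instantly.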

Flume

Flume is a highly available, highly reliable, distributed system for massive log collection, aggregation, and transmission, provided by Cloudera. Flume supports customizing all kinds of data senders in a logging system to collect data; at the same time, it can do simple processing of the data and write it to various (customizable) data receivers.
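Flume's pipeline is built from three roles: a source collects events, a channel buffers them, and a sink delivers them to a receiver. A minimal sketch of that flow (the in-memory deque channel, the batch size, and the list receiver are illustrative assumptions; real Flume channels can be durable):

```python
from collections import deque

# Sketch of Flume's source -> channel -> sink pipeline.

def source(log_lines, channel):
    # Source: collect raw events into the buffering channel.
    for line in log_lines:
        channel.append(line)

def sink(channel, receiver, batch_size=2):
    # Sink: drain the channel in small batches, as a sink transaction would.
    while channel:
        batch = [channel.popleft()
                 for _ in range(min(batch_size, len(channel)))]
        receiver.extend(batch)

channel, receiver = deque(), []
source(["GET /home", "GET /docs", "POST /login"], channel)
sink(channel, receiver)
# receiver == ['GET /home', 'GET /docs', 'POST /login']
```

The channel is what decouples collection from delivery: if the receiver is slow or down, events accumulate in the channel instead of being lost.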

Reference articles

A preliminary understanding of big data
http://lxw1234.com/archives/2016/11/779.htm

Miscellaneous remarks on big data
http://lxw1234.com/archives/2016/12/823.htm

Recommended articles

Learning Hadoop from scratch
http://blog.csdn.net/qazwsxpcm/article/details/78460840

HBase application scenarios
http://blog.csdn.net/lifuxiangcaohui/article/details/39894265

Hadoop hardware selection
http://bigdata.evget.com/post/1969.html

Illustrated Spark: Core Technology and Case Practice
http://www.cnblogs.com/shishanyuan/category/925085.html

Architecture design and implementation of a big data project
http://www.360doc.com/content/17/0603/22/22712168_659649698.shtml

Related documents

Hadoop-10-years
Link: http://pan.baidu.com/s/1nvBppQ5  Password: 7i7m

Hadoop: The Definitive Guide
Link: http://pan.baidu.com/s/1skJEzj3  Password: 0ryw

Hadoop in Action
Link: http://pan.baidu.com/s/1dEQi29V  Password: ddc7

Hadoop source code analysis
Link: http://pan.baidu.com/s/1bp8RTcN  Password: ju63

The best learning path for Spark
Link: http://pan.baidu.com/s/1i5MmJVv  Password: qfbt

Understanding Big Data in Depth + Big Data Processing and Programming Practice
Link: http://pan.baidu.com/s/1dFq6OSD  Password: 7ggl

Copyright notice:

Author: nothingness

Blog Garden (cnblogs): http://www.cnblogs.com/xuwujing

CSDN: http://blog.csdn.net/qazwsxpcm

Personal blog: http://www.panchengming.com

Original writing is not easy; please indicate the source when reprinting. Thank you!
