The big data technology stack is large and complex. Its foundations span data collection, data preprocessing, distributed storage, NoSQL databases, data warehousing, machine learning, parallel computing, visualization and other technical categories at different levels. This article first lays out a general big data processing framework, divided into the following stages: data acquisition and preprocessing, data storage, data cleaning, data query and analysis, and data visualization.
I. Data acquisition and preprocessing
Data comes from many sources, including mobile Internet data, social network data and so on. This structured and unstructured mass of data is fragmented into so-called data silos, and in that state it is of little use. Data acquisition writes these data into a data warehouse, integrating the scattered pieces so they can be analyzed together. It covers file log collection, database log collection, relational database extraction and application access. When the volume of data is small, a simple scheduled script can write logs into the storage system, but as the volume grows these methods cannot guarantee data safety and are hard to operate and maintain, so a stronger solution is needed.
Flume NG is a real-time log collection system that supports customizing all kinds of data senders in the log system to collect data, performs simple processing on the data, and writes it to various data receivers (such as text, HDFS, HBase, etc.). Flume NG uses a three-tier architecture: an Agent layer, a Collector layer and a Store layer, each of which can scale out horizontally. An Agent contains a Source, a Channel and a Sink: the Source consumes (collects) data from the data source into the Channel, the Channel acts as intermediate temporary storage holding everything the Source produced, and the Sink reads data from the Channel and deletes it from the Channel only after a successful read.
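To make the Source -> Channel -> Sink flow concrete, here is a minimal toy sketch in Python. It is not Flume's actual API, only a conceptual illustration of the three Agent components: the source stages events in the channel, and the sink removes an event only once the downstream write has succeeded.

```python
# Toy illustration (not Flume's API) of an Agent's Source -> Channel -> Sink flow.
from queue import Queue

class Channel:
    """In-memory buffer standing in for a Flume channel."""
    def __init__(self):
        self._events = Queue()
    def put(self, event):
        self._events.put(event)
    def take(self):
        return self._events.get()

def source(lines, channel):
    # A source "consumes" raw records and stages them in the channel.
    for line in lines:
        channel.put(line)

def sink(channel, n, write):
    # A sink takes events from the channel; here write() is assumed to succeed,
    # which is when a real sink would remove the event from the channel.
    for _ in range(n):
        event = channel.take()
        write(event)

if __name__ == "__main__":
    ch = Channel()
    logs = ["GET /index", "POST /login", "GET /search"]
    source(logs, ch)
    sink(ch, len(logs), write=print)
```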
NDC (Netease Data Canal) is NetEase's platform solution for real-time data migration, synchronization and subscription of structured databases. It integrates NetEase's past tools and experience in the field of data transmission, and chains stand-alone databases, distributed databases, OLAP systems and downstream applications together through data links. Beyond ensuring efficient data transmission, NDC follows a unified, platform-oriented design philosophy.
Logstash is an open source server-side data processing pipeline that can collect data from multiple sources simultaneously, transform it, and send it to your favorite "repository"; a commonly used repository is Elasticsearch. Logstash supports a variety of inputs, can capture events from many commonly used data sources at the same time, and can continuously collect data from your logs, metrics, web applications, data stores and various AWS services.
Sqoop is a tool for transferring data between relational databases and Hadoop. It can import data from a relational database (such as MySQL or Oracle) into Hadoop (HDFS, Hive, HBase), and export data from Hadoop (HDFS, Hive, HBase) back into a relational database (MySQL, Oracle). Sqoop uses MapReduce jobs (highly fault-tolerant distributed parallel computing) to execute the transfer tasks. Another advantage of Sqoop is that the process of moving large amounts of structured or semi-structured data is fully automated.
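As a hedged sketch of what such an import looks like, the snippet below launches a typical `sqoop import` from Python via `subprocess`. The connection string, credentials, table name and target directory are hypothetical placeholders, not values from this article.

```python
# Launch a Sqoop import (MySQL table -> HDFS directory) from Python.
import subprocess

sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/shop",  # hypothetical MySQL instance
    "--username", "etl",
    "--password", "etl_password",                  # in practice prefer --password-file
    "--table", "orders",                           # source table to import
    "--target-dir", "/warehouse/orders",           # HDFS destination directory
    "--num-mappers", "4",                          # parallel map tasks for the transfer
]

# Sqoop turns this command into a MapReduce job that performs the copy.
subprocess.run(sqoop_import, check=True)
```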
Streaming computing is a hot topic in the industry. Streaming computing cleans, aggregates and analyzes multiple high-throughput data sources in real time, and can quickly process and react to data streams from social networking sites, news feeds and so on. There are many stream analysis tools available, such as the open source Storm and Spark Streaming.
A Storm cluster has a master-slave structure composed of a master node (nimbus) and multiple worker nodes (supervisors). The master node is either statically specified or dynamically elected at run time. Nimbus and the supervisors are both background daemons provided by Storm, and communication between them goes through ZooKeeper's state change and monitoring notifications. The nimbus process is mainly responsible for managing, coordinating and monitoring the topologies running on the cluster (including topology publishing, task assignment and reassigning tasks when failures occur). A supervisor process waits for nimbus to assign tasks and then spawns and monitors workers (JVM processes) to execute them. Supervisors and workers run in different JVMs; if a worker started by a supervisor exits abnormally due to an error (or is killed), the supervisor will try to start a new worker process in its place.
When calculating, aggregating and analyzing data from upstream modules, a messaging system, especially a distributed messaging system, can be used. Kafka, written in Scala, is a distributed, publish/subscribe-based messaging system. One of Kafka's design goals is to support both offline and real-time processing, as well as real-time replication of data to another data center. Kafka can have many producers and consumers sharing multiple topics, with messages grouped by topic; a program that publishes messages to Kafka is called a producer, and a program that subscribes to topics and consumes messages is called a consumer. When Kafka runs as a cluster it can consist of one or more services, each called a broker. At runtime, producers send messages to the Kafka cluster over the network, and the cluster serves them to consumers. Kafka uses ZooKeeper to manage cluster configuration, elect leaders and rebalance when a consumer group changes. Producers publish messages to brokers using a push model, and consumers subscribe to and consume messages from brokers using a pull model. Kafka works well with Flume: if you need to move streaming data from Kafka into Hadoop, you can use a Flume agent with Kafka as its source, so that data is read from Kafka and written to Hadoop.
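A minimal producer/consumer sketch using the third-party kafka-python package (pip install kafka-python) illustrates the push/pull pattern described above. The broker address, topic name and group id are hypothetical.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: push messages into a topic on the cluster.
producer = KafkaProducer(bootstrap_servers="broker1:9092")
producer.send("click-events", b'{"user": 42, "page": "/home"}')
producer.flush()

# Consumer: pull messages from the topic as part of a consumer group.
consumer = KafkaConsumer(
    "click-events",
    bootstrap_servers="broker1:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.offset, message.value)
```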
ZooKeeper is a distributed, open source coordination service for distributed applications that provides data synchronization. Its main functions are configuration management, naming, distributed locks and cluster management. Configuration management means that when the configuration is modified in one place, every party interested in that configuration is notified of the change, eliminating tedious manual copying of configuration and ensuring reliability and consistency of the data. At the same time, ZooKeeper can resolve names to information such as the addresses of resources or services, and can monitor changes of machines in the cluster, implementing functionality similar to a heartbeat mechanism.
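The configuration-management and watch pattern can be sketched with the kazoo client (pip install kazoo); the ZooKeeper address and znode path below are hypothetical.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181")
zk.start()

# Store a piece of shared configuration under a znode.
zk.ensure_path("/config/app")
zk.set("/config/app", b"batch_size=500")

# Watch the znode: interested clients are notified whenever it changes,
# which replaces copying configuration files around by hand.
@zk.DataWatch("/config/app")
def on_config_change(data, stat):
    print("config is now:", data, "version:", stat.version)

zk.stop()
```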
II. Data storage
Hadoop is an open source framework designed for offline, large-scale data analysis, and HDFS, its core storage engine, is widely used for data storage.
HBase is a distributed, column-oriented open source database. It can be thought of as a wrapper over HDFS and is essentially a data store and a NoSQL database. HBase is a key/value system deployed on top of HDFS that overcomes HDFS's weakness in random reads and writes. Like Hadoop, HBase relies mainly on scale-out, increasing computing and storage capacity by adding cheap commodity servers.
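The key/value access pattern can be sketched with the happybase client, which talks to HBase through its Thrift gateway (pip install happybase; requires a running Thrift server). The host, table and row key are hypothetical.

```python
import happybase

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("user_profile")

# Write: a row key plus column-family:qualifier cells.
table.put(b"user#42", {b"info:name": b"alice", b"info:city": b"hangzhou"})

# Random read by row key, the access pattern HBase adds on top of HDFS.
row = table.row(b"user#42")
print(row[b"info:name"])

connection.close()
```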
Phoenix is the equivalent of Java middleware: it helps development engineers access the NoSQL database HBase in the same way that JDBC is used to access relational databases.
YARN is Hadoop's resource manager; it provides unified resource management and scheduling for upper-level applications, and its introduction has brought great benefits to the cluster in terms of utilization, unified resource management and data sharing. YARN consists of the following major components: a global ResourceManager, a NodeManager agent for the ResourceManager on each node, an ApplicationMaster representing each application, and, for each ApplicationMaster, multiple Containers running on NodeManagers.
Mesos is an open source cluster management software that supports application architectures such as Hadoop, ElasticSearch, Spark, Storm and Kafka.
Redis is a very fast non-relational database that can store mappings between keys and five different types of values, persist key-value pairs held in memory to disk, use replication to extend read performance, and use client-side sharding to extend write performance.
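A short sketch with the redis-py client (pip install redis) shows several of those value types in action; the host name and keys are made up for illustration.

```python
import redis

r = redis.Redis(host="cache-host", port=6379, db=0)

r.set("page:home:views", 1)              # string value
r.incr("page:home:views")                # atomic counter on a string
r.lpush("recent:users", "alice")         # list
r.sadd("tags:article:7", "bigdata")      # set
r.hset("user:42", "name", "alice")       # hash field
r.zadd("leaderboard", {"alice": 120})    # sorted set with score

print(r.get("page:home:views"))
```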
Atlas is a middleware layer between applications and MySQL. To the back-end DB, Atlas behaves like a client connecting to it; to the front-end application, Atlas behaves like a DB. Atlas communicates with applications as a server and with MySQL as a client, implementing both the client and server sides of the MySQL protocol. It shields the application from DB details and maintains a connection pool to reduce the load on MySQL. After Atlas starts, multiple threads are created: one main thread and the rest worker threads. The main thread listens for all client connection requests, while the worker threads only listen for command requests from the main thread.
Kudu is a storage engine built around the Hadoop ecosystem and shares the ecosystem's design philosophy: it runs on ordinary servers, can be deployed in large distributed clusters, and meets industrial high-availability requirements. Its design goal is fast analytics on fast data. As an open source storage engine, it provides both low-latency random reads and writes and efficient data analysis. Kudu offers row-level insert, update and delete APIs as well as batch scans with performance close to Parquet. With a single copy of the data it supports both random read/write and analytical workloads. Kudu has a wide range of applications, such as real-time data analysis and time-series applications where data may change.
When the tables involved in data storage have hundreds of columns and queries are varied and complex, columnar storage formats such as Parquet and ORC are recommended for compressing the data. Parquet supports flexible compression options and significantly reduces on-disk storage.
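As a small sketch of columnar storage with compression, the snippet below writes a toy table to Parquet using pandas with the pyarrow engine (pip install pandas pyarrow); the column names and data are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "event": ["view", "click", "view"],
    "ts": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
})

# Columnar layout plus snappy compression keeps on-disk size small and lets
# query engines read only the columns a query actually touches.
df.to_parquet("events.parquet", compression="snappy")

back = pd.read_parquet("events.parquet", columns=["user_id", "event"])
print(back)
```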
III. Data cleaning
MapReduce, Hadoop's computation engine, is used for parallel computing over large-scale data sets; "Map" and "Reduce" are its central ideas. It makes it much easier for programmers to run their programs on a distributed system without writing distributed parallel code themselves.
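The classic word-count example below is a local Python simulation of the model, not a full Hadoop job: the "Map" step emits (word, 1) pairs and the "Reduce" step sums the counts per word.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word in every input record.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: sum the counts per key (the shuffle/sort step is implicit here).
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

if __name__ == "__main__":
    docs = ["big data is big", "data pipelines move data"]
    print(reduce_phase(map_phase(docs)))
    # {'big': 2, 'data': 3, 'is': 1, 'pipelines': 1, 'move': 1}
```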
As the volume of business data grows, the data that needs to be trained and cleaned becomes more and more complex. At this point a task scheduling system such as Oozie or Azkaban is needed to schedule and monitor the key tasks.
Oozie is a workflow scheduling engine for the Hadoop platform. It provides a RESTful API for accepting user submissions (submitting workflow jobs), and once a workflow is submitted, the workflow engine takes charge of executing it and managing its state transitions. The user deploys the job (an MR job) on HDFS and submits the workflow to Oozie, and Oozie submits the job to Hadoop asynchronously. This is why calling Oozie's RESTful interface to submit a job returns a JobId immediately: the user program does not have to wait for the job to finish (some large jobs may run for hours or even days). Oozie works asynchronously in the background, submitting the Actions of the workflow to Hadoop for execution.
Azkaban is also a workflow control engine that can be used to resolve dependencies among multiple offline computing tasks such as Hadoop or Spark jobs. Azkaban consists of three main parts: a relational database, the Azkaban Web Server and the Azkaban Executor Server. Azkaban stores most of its state in MySQL. The Azkaban Web Server provides the Web UI and is Azkaban's main manager, covering project management, authentication, scheduling and monitoring of workflow execution. The Azkaban Executor Server schedules workflows and tasks and records their logs.
Sloth, NetEase's processing platform for streaming computing tasks, is the company's first self-developed streaming computing platform, built to meet the growing demand for stream computing across its products. As a computing service platform, it is easy to use, real-time and reliable; it saves users the investment in technology (development, operations and maintenance) and helps them focus on the stream computing needs of the product itself.
IV. Data query and analysis
The core job of Hive is to translate SQL statements into MR programs; it can map structured data onto a database table and provides HQL (Hive SQL) queries. Hive itself neither stores nor computes data; it depends entirely on HDFS and MapReduce. You can think of Hive as a client-side tool that converts SQL operations into the corresponding MapReduce jobs and then runs them on Hadoop. Hive supports standard SQL syntax, eliminating the need to write MapReduce programs, which lets users who are proficient in SQL but unfamiliar with MapReduce, have weak programming skills or do not know Java query, summarize and analyze large-scale data sets on HDFS.
Hive was born for big data batch processing, and its appearance relieved the bottleneck that traditional relational databases (MySQL, Oracle) hit with big data. Hive compiles an execution plan into a map -> shuffle -> reduce -> map -> shuffle -> reduce ... model. If a query is compiled into multiple rounds of MapReduce, more intermediate results are written out, and because of the characteristics of the MapReduce execution framework, too many intermediate stages increase the overall query execution time. When using Hive, users only need to create tables, import data and write the SQL analysis statements; the Hive framework automatically handles the rest.
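A hedged sketch of that workflow from Python uses the PyHive client (pip install 'pyhive[hive]') to submit HQL to HiveServer2; the host, database, table and partition column are hypothetical.

```python
from pyhive import hive

conn = hive.Connection(host="hiveserver2-host", port=10000, database="dw")
cursor = conn.cursor()

# Hive translates this HQL into one or more MapReduce jobs behind the scenes.
cursor.execute("""
    SELECT city, COUNT(*) AS orders
    FROM orders
    WHERE dt = '2020-01-01'
    GROUP BY city
""")
for city, orders in cursor.fetchall():
    print(city, orders)

conn.close()
```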
Impala complements Hive with efficient SQL queries; it implements SQL on Hadoop for real-time query and analysis of big data, letting you manipulate big data in the familiar SQL style of traditional relational databases while storing the data in HDFS or HBase. Instead of slow Hive + MapReduce batches, Impala uses a distributed query engine similar to those in commercial parallel relational databases (made up of a Query Planner, Query Coordinator and Query Exec Engine), and can query data directly from HDFS or HBase with SELECT, JOIN and aggregate functions, greatly reducing latency. Impala turns the whole query into an execution plan tree rather than a series of MapReduce tasks, so it has no MapReduce startup overhead compared with Hive.
Hive is suited to long-running batch query analysis, while Impala is suited to real-time interactive SQL queries. Impala gives data analysts a big data analysis tool for quickly experimenting with and validating ideas: you can first use Hive for data transformation, then use Impala for fast analysis on the data sets Hive has produced. In short, Impala represents a query as a complete execution plan tree, which can be distributed naturally to each Impalad for execution, rather than being combined into a pipelined map -> reduce pattern as in Hive; this gives Impala better concurrency and avoids unnecessary intermediate sort and shuffle steps. However, Impala does not support UDFs, which places some limits on the problems it can handle.
Spark has the characteristics of Hadoop MapReduce, but it keeps a job's intermediate output in memory, so there is no need to read and write HDFS between stages. Besides interactive queries, Spark's in-memory distributed datasets optimize iterative workloads. Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, which lets Scala manipulate distributed datasets as easily as local collection objects.
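The in-memory, iterative-friendly style can be sketched with PySpark (pip install pyspark); the HDFS input path below is a hypothetical placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.read.text("hdfs:///logs/app.log").rdd.map(lambda r: r[0])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())  # keep the intermediate dataset in memory for reuse

print(counts.take(5))
spark.stop()
```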
Nutch is an open source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.
Solr is a standalone enterprise full-text search server written in Java that runs in a Servlet container such as Apache Tomcat or Jetty. It exposes a Web-service-like API: users can submit XML files in a defined format to the search server over HTTP to build indexes, and can issue search requests via HTTP GET and receive results in XML.
Elasticsearch is an open source full-text search engine built on the Lucene search library that can store, search and analyze huge amounts of data quickly. It is designed for the cloud, offers near real-time search, and is stable, reliable, fast, and easy to install and use.
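An indexing and search sketch with the official elasticsearch Python client (pip install elasticsearch; the keyword arguments follow recent client versions) looks roughly like this; the host and index names are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es-host:9200")

# Store a document; Elasticsearch builds the inverted index via Lucene.
es.index(index="articles", id="1",
         document={"title": "Big data pipeline", "body": "flume kafka spark"})

# Near real-time full-text search over the indexed documents.
result = es.search(index="articles", query={"match": {"body": "kafka"}})
for hit in result["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```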
Machine learning tools and frameworks are also involved. For example, the main goal of Mahout is to create scalable machine learning algorithms that developers can use free of charge under the Apache license; other examples are the deep learning framework Caffe and TensorFlow, an open source software library that uses data flow graphs for numerical computation. Commonly used machine learning algorithms include naive Bayes, logistic regression, decision trees, neural networks and collaborative filtering.
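As a small sketch of one of the algorithms listed above, the snippet below fits a logistic regression classifier with scikit-learn (pip install scikit-learn) on a built-in toy dataset; it stands in for any of the listed algorithms rather than a specific pipeline from this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)  # higher max_iter to ensure convergence
model.fit(X_train, y_train)

print("test accuracy:", model.score(X_test, y_test))
```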
V. Data visualization
Connect to BI platforms and visualize the analysis results to guide decision-making. Mainstream BI platforms include foreign agile BI tools such as Tableau, Qlikview and Power BI, domestic tools such as SmallBI, and NetEase's emerging offering.
At each of the above stages, ensuring the security of the data is a problem that cannot be ignored.
Kerberos is a network authentication protocol used to authenticate communications securely over insecure networks. It allows an entity communicating in an insecure network environment to prove its identity to another entity in a secure way.
Ranger controls permissions. It is a permission framework for Hadoop clusters that provides operation, monitoring and management of complex data permissions. It offers a centralized management mechanism covering all data permissions in the YARN-based Hadoop ecosystem, and can perform fine-grained access control on Hadoop components such as Hive and HBase. Through the Ranger console, administrators can easily configure policies to control user access to HDFS folders, HDFS files, databases, tables and columns. These policies can be set for individual users or groups, and the permissions integrate seamlessly with Hadoop.