
Introduction to the Hadoop Service Roles of the DKHadoop Big Data Framework


Looking back through my recent posts, I have covered almost everything about the DKHadoop distribution: download, installation, runtime environment deployment, and so on. Some parts may not be written in great detail; please bear with my limited personal understanding. When I wrote about deploying the DKHadoop runtime environment, though, I left out the Hadoop service roles. This article fills in that gap; otherwise it would keep nagging at me.

To run a DKHadoop service in a cluster, you designate one or more nodes to perform that service's specific functions; this is role assignment. Role assignment is mandatory: a cluster without assigned roles will not work properly. Before assigning roles, you need to understand what each role means.

Hadoop service roles:

1. ZooKeeper role: the ZooKeeper service is a cluster-management framework made up of one or more nodes. For the cluster, ZooKeeper maintains configuration information, provides naming, and provides distributed synchronization for HyperBase. It is recommended that a ZooKeeper ensemble contain at least three nodes.
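To make this more concrete, here is a minimal sketch using the ZooKeeper Java client; the connect string, session timeout, and znode path are placeholders of my own, not DKHadoop defaults:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        // Connect to a three-node ensemble; the lambda is a no-op watcher.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {});
        // Publish a piece of shared configuration as a persistent znode.
        if (zk.exists("/demo-config", false) == null) {
            zk.create("/demo-config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        System.out.println(new String(zk.getData("/demo-config", false, null)));
        zk.close();
    }
}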

2. JDK role: the JDK is the software development kit for the Java language and the core of all Java development. It includes the Java runtime environment, the Java tools, and the Java base class libraries.

3. Apache-Flume role: Flume is a highly available, highly reliable, distributed system, originally from Cloudera, for collecting, aggregating, and moving massive amounts of log data. Flume lets you plug custom data senders into the logging system to collect data; at the same time, it can do simple processing on the data and write it out to a variety of (customizable) data receivers.
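As a sketch of the "data sender" side, the snippet below uses Flume's client SDK to append one event to an agent's Avro source; the host name and port are assumptions, not DKHadoop defaults:

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeSendDemo {
    public static void main(String[] args) throws Exception {
        // Assumes a Flume agent with an Avro source listening on port 41414.
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        Event event = EventBuilder.withBody("one application log line", StandardCharsets.UTF_8);
        client.append(event); // the agent's channel and sink take it from here
        client.close();
    }
}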

4. Apache-Hive role: Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables, provides a simple SQL query capability, and converts SQL statements into MapReduce tasks for execution.
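A minimal sketch over the HiveServer2 JDBC driver shows the "SQL in, MapReduce out" idea; the host, credentials, and logs table are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this statement into MapReduce tasks behind the scenes.
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM logs")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}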

5. Apache-Storm role: Storm computes in memory: data is fed directly into memory over the network, and reading and writing memory is orders of magnitude faster than reading and writing disk. When the computing model fits stream processing, Storm saves the time that batch processing spends accumulating data.
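Here is a minimal topology sketch (Storm 2.x style, with Storm's built-in TestWordSpout as the data source) showing tuples flowing through memory one at a time instead of in batches:

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class StreamingDemo {
    public static class ExclaimBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            // Each word is processed in memory the moment it arrives.
            collector.emit(new Values(tuple.getString(0) + "!"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 1);
        builder.setBolt("exclaim", new ExclaimBolt(), 2).shuffleGrouping("words");
        try (LocalCluster cluster = new LocalCluster()) { // in-process cluster for a quick test
            cluster.submitTopology("demo", new Config(), builder.createTopology());
            Thread.sleep(10_000);
        }
    }
}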

6. Elasticsearch role: Elasticsearch is developed in Java and released as open source under the Apache License; it is currently a popular enterprise search engine. Designed for the cloud, it delivers real-time search and is stable, reliable, fast, and easy to install and use.
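For a quick taste, the sketch below talks to Elasticsearch's REST API with the JDK 11+ HttpClient, so no Elasticsearch client library is needed; the host, index, and document are made up:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsDemo {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // Index one JSON document.
        HttpRequest put = HttpRequest.newBuilder()
                .uri(URI.create("http://es-host:9200/articles/_doc/1"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString("{\"title\":\"hadoop roles\"}"))
                .build();
        System.out.println(http.send(put, HttpResponse.BodyHandlers.ofString()).body());
        // Search for it; new documents become searchable in near real time.
        HttpRequest search = HttpRequest.newBuilder()
                .uri(URI.create("http://es-host:9200/articles/_search?q=title:hadoop"))
                .build();
        System.out.println(http.send(search, HttpResponse.BodyHandlers.ofString()).body());
    }
}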

7. NameNode role: the NameNode is the HDFS node that maintains the directory structure of every file in the file system and tracks which DataNodes each file's data is stored on. When a client wants a file from HDFS, it first asks the NameNode which DataNodes hold the data it needs. A Hadoop cluster has only one active NameNode, and the NameNode should not be assigned any other role.
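The sketch below lists an HDFS directory with the Hadoop Java client. A directory listing is pure metadata, so the NameNode alone answers it; only reading file contents would touch DataNodes. The address is an assumed placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // NameNode RPC address (assumed)
        try (FileSystem fs = FileSystem.get(conf)) {
            // Paths and sizes come from the NameNode; no DataNode is contacted.
            for (FileStatus s : fs.listStatus(new Path("/"))) {
                System.out.println(s.getPath() + "  " + s.getLen() + " bytes");
            }
        }
    }
}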

8. DataNode role: in HDFS, the DataNode is the node that stores the actual blocks of data.

9. Secondary NameNode role: a node that creates periodic checkpoints of the NameNode's metadata. It regularly downloads the current NameNode image and edit log files, merges the logs into a new image file, and uploads the result back to the NameNode. A machine assigned the NameNode role should not also be assigned the Secondary NameNode role.

10. Standby NameNode role: a NameNode in standby mode keeps its metadata (both the namespace information and the block map) synchronized with the metadata of the active NameNode, so the moment it switches to active mode it can provide NameNode service immediately.

11. JournalNode role: the Standby NameNode and the Active NameNode communicate through JournalNodes to keep their information synchronized.

12. HBase role: HBase is a distributed, column-oriented open-source database that provides BigTable-like capabilities on top of Hadoop; it is a subproject of Apache's Hadoop project. Unlike a typical relational database, HBase is suited to storing unstructured data; another difference is that it is column-based rather than row-based.
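A minimal sketch with the HBase Java client (hbase-client, 1.x/2.x-style API); the table demo_table with column family cf is assumed to exist already:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // HBase is located via ZooKeeper
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("demo_table"))) {
            // Column-oriented: values live under a column family rather than fixed table columns.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("title"), Bytes.toBytes("hadoop roles"));
            table.put(put);
            Result r = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("title"))));
        }
    }
}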

13. Kafka role: Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the activity-stream data of a consumer-scale website. This activity (page views, searches, and other user actions) is a key ingredient of many social features on the modern web. Because of throughput requirements, such data is usually handled through log processing and log aggregation. For log data that feeds offline analysis systems like Hadoop but also needs real-time processing, Kafka is a feasible solution. Kafka aims to unify online and offline message processing through Hadoop's parallel loading mechanism, and to provide real-time consumption across a cluster.
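As a minimal producer sketch with the Kafka Java client (kafka-clients), where the broker address and topic name are placeholders of mine, each user action becomes one message on a topic that online and offline consumers read alike:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-host:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One page view published to the "page-views" topic, keyed by user.
            producer.send(new ProducerRecord<>("page-views", "user42", "/index.html"));
        } // close() flushes any buffered messages
    }
}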

14. Redis role: Redis is an open-source, log-structured key-value database written in C; it is network-enabled, can run purely in memory or with persistence, and provides APIs in many languages.
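A tiny sketch with the Jedis client (redis.clients:jedis); the host, key, and TTL are my own examples:

import redis.clients.jedis.Jedis;

public class RedisDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("redis-host", 6379)) {
            jedis.set("session:42", "alice");  // in-memory key-value write
            jedis.expire("session:42", 3600);  // optional one-hour TTL
            System.out.println(jedis.get("session:42"));
        }
    }
}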

15. Scala role: Scala is a multi-paradigm programming language, similar to Java, designed as a scalable language that integrates the features of object-oriented and functional programming.

16. Sqoop role: Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (such as MySQL, Oracle, or Postgres) into Hadoop's HDFS, and export HDFS data back into a relational database.

17. Impala role: Impala is a query system led by Cloudera that provides SQL semantics for querying PB-scale big data stored in Hadoop's HDFS and in HBase. Hive also provides SQL semantics, but because Hive executes on the MapReduce engine underneath, it remains a batch process and struggles to deliver interactive queries. Impala's biggest feature and selling point, by contrast, is its speed.

18. Crawler role: Crawler is a proprietary component of DKHadoop, a crawler system that collects both dynamic and static data.

19. Spark role: Spark is an open-source cluster computing environment similar to Hadoop, but with some useful differences that make it superior for certain workloads: besides interactive queries, Spark keeps distributed datasets in memory, which optimizes iterative workloads. Spark is implemented in Scala and uses Scala as its application framework; unlike with Hadoop, Spark and Scala integrate tightly, letting Scala manipulate distributed datasets as easily as local collection objects.
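A minimal sketch with Spark's Java API (spark-core): it builds an in-memory distributed dataset (RDD) and chains two operations on it without touching disk in between; local[*] just runs it in-process for a quick try:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            // The dataset stays distributed and in memory between the two steps.
            int sumOfSquares = nums.map(x -> x * x).reduce(Integer::sum);
            System.out.println(sumOfSquares); // 55
        }
    }
}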

20. HUE role: HUE is a set of web applications for interacting with your Hadoop cluster. HUE applications let you browse HDFS and jobs, manage the Hive metastore, run Hive queries, browse HBase, export data with Sqoop, submit MapReduce programs, build custom search engines with Solr, and schedule repetitive workflows.
