
What's the difference between HPCC and Hadoop?

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

In this issue, the editor walks through the differences between HPCC and Hadoop. The article compares the two systems point by point from a professional perspective; I hope you get something out of it.

Hardware environment

Both systems are usually built on clusters of blade servers with Intel or AMD CPUs, and older, discontinued hardware can be reused to reduce costs. Each node has its own local memory and disks, and nodes are connected through high-speed switches (typically gigabit Ethernet); hierarchical switching can be used when the cluster has many nodes. Cluster nodes are generally homogeneous peers (all nodes can share the same configuration), although this is not strictly required.

Operating system

Linux or Windows.

System configuration

HPCC clusters are implemented in two configurations: a data processing cluster (Thor), which is comparable to Hadoop's MapReduce cluster, and a data delivery engine (Roxie), which provides separate high-performance online query processing and data warehouse capabilities. Both configurations provide a distributed file system, but each optimizes it in a different way. An HPCC environment typically consists of several clusters of both configuration types. Although the file system on each cluster is independent, a cluster can access files in the file systems of other clusters in the same environment.

The Hadoop system software implements the cluster using the MapReduce processing paradigm; the same cluster also serves as a distributed file system running HDFS. Additional capabilities, such as HBase and Hive, are layered on top of Hadoop's MapReduce and file system software.
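To make the MapReduce paradigm concrete, here is a minimal word-count sketch in Java against the standard org.apache.hadoop.mapreduce API. The class names and the word-count use case are illustrative assumptions, not part of the comparison above.

// Minimal word-count classes illustrating Hadoop's MapReduce paradigm
// (org.apache.hadoop.mapreduce API; names here are illustrative).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts that the framework grouped by word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}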

Licensing and maintenance costs

HPCC: the community edition is free. Enterprise license fees currently depend on the size of the cluster and the type of system configuration.

Hadoop: free, but there are several vendors that offer different paid maintenance services.

Core software

HPCC: with the Thor configuration, the core software includes the operating system and the services installed on each cluster node that enable job execution and distributed file system access. A separate server named Dali provides the file system name service and manages the workunits used to track jobs in the HPCC environment. A Thor cluster is configured with a primary (master) node and multiple worker nodes. A Roxie cluster is a cluster of peer nodes, each of which can run server and agent tasks that execute queries and perform key and file processing. The Roxie cluster's file system stores indexes and data in distributed B+ trees and provides keyed access to the data. Additional middleware components are required to operate Thor and Roxie clusters.

Hadoop: the core software includes the operating system plus Hadoop's MapReduce cluster and HDFS software. Each worker (slave) node runs a TaskTracker service and a DataNode service. The master node runs a JobTracker service, which can be configured on its own hardware node or run on one of the worker nodes. Similarly, HDFS requires a master NameNode service to provide the name service, and it can likewise run on a separate node or on one of the worker nodes.
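As a rough illustration of those master roles, the following sketch shows how a Hadoop 1.x-era client is pointed at the NameNode and JobTracker through a Configuration object; the host names, ports, and class name are assumptions for illustration only.

// Sketch: pointing a Hadoop 1.x client at the two master services
// described above -- the NameNode (HDFS name service) and the JobTracker
// (job tracking). Host names and ports are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClusterClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // HDFS name service provided by the NameNode (master role).
        conf.set("fs.default.name", "hdfs://namenode-host:9000");
        // MapReduce job tracking provided by the JobTracker (master role).
        conf.set("mapred.job.tracker", "jobtracker-host:9001");

        // DataNode and TaskTracker daemons on the worker nodes register
        // with these masters; the client only needs the master addresses.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
    }
}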

Middleware

HPCC: the middleware includes an ECL code repository implemented on a MySQL server; an ECL server for compiling ECL programs and queries; the ECL agent, a client program that manages job execution on a Thor cluster; an ESP server (Enterprise Services Platform), which provides authentication, logging, security, and other services for job execution and the web services environment; and the Dali server, which acts as the system data store, holding workunit metadata for jobs and providing the name service for the distributed file system. The middleware can run flexibly on anywhere from one node to several, and multiple instances of these servers can provide redundancy and improve performance.

Hadoop: there is no middleware layer. Client software submits jobs directly to the JobTracker on the cluster's master node. A Hadoop Workflow Scheduler (HWS), which would run as a server to manage jobs that require multiple MapReduce sequences, is under development.
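A minimal driver sketch of that direct submission path, reusing the hypothetical WordCount classes from the earlier sketch; the input and output paths are illustrative assumptions.

// Sketch of a driver that submits a job directly from the client, with no
// separate middleware layer. Paths and class names are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        // The client submits the job and polls the cluster until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}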

System tools

HPCC: includes a suite of client and operations tools for managing, maintaining, and monitoring HPCC configurations and environments. The suite includes the ECL IDE program development environment, attribute migration tools, the distributed file utility (DFU), and environment and Roxie configuration applications; command-line versions are also available. ECL Watch is a web-based application for monitoring the HPCC environment, with queue management, distributed file system management, job monitoring, and system performance monitoring tools. Other tools are provided through a web services interface.

Hadoop: the dfsadmin tool reports file system status information; fsck is an application that checks the health of files in HDFS; the DataNode block scanner periodically verifies the blocks stored on a DataNode; and the balancer redistributes blocks from overloaded DataNodes to underloaded ones as needed. MapReduce's web user interface includes a JobTracker page that shows information about running and completed jobs; drilling down into a specific job shows its details, and there are also task pages showing Map and Reduce task information.
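For readers who prefer an API to the command-line tools, the following sketch retrieves roughly the kind of information dfsadmin -report exposes, through the HDFS client API; it assumes the default file system in the Configuration is HDFS, and the class name is illustrative.

// Sketch: file system capacity and per-DataNode usage via the HDFS client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class HdfsStatusReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Overall capacity and usage of the file system.
        FsStatus status = fs.getStatus();
        System.out.printf("capacity=%d used=%d remaining=%d%n",
                status.getCapacity(), status.getUsed(), status.getRemaining());

        // Per-DataNode details, similar to the node list in dfsadmin -report.
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                System.out.printf("%s used=%d remaining=%d%n",
                        node.getHostName(), node.getDfsUsed(), node.getRemaining());
            }
        }
    }
}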

Ease of deployment

HPCC: provides an environment configuration tool. A source server with a centralized repository distributes operating-system-level settings, services, and binaries to all network-bootable nodes in the configuration.

Hadoop: relies on third-party application wizards and online tools for assisted deployment; otherwise RPMs must be deployed manually.

Distributed file system

HPCC: Thor's distributed file system is record-oriented and uses the local Linux file system to store file parts. Files are initially loaded (sprayed) across the nodes, with each node holding a separate part of the distributed file; a part may be empty. Files are split on even record/document boundaries specified by the user. The master/slave structure keeps the name service and file mapping information on a separate server. Each node needs only a single local file to represent its part of a distributed file. Shared read/write access is supported between multiple clusters in the same environment. Special adapters allow files in external databases such as MySQL to be accessed, so transactional data can be merged with distributed file data and incorporated into batch jobs. The Roxie distributed file system uses distributed B+ tree index files, which hold key information and data stored in local files on each node.

Hadoop: block-oriented; most installations use blocks of 64 MB or 128 MB. Blocks are stored as independent units in local files of each node's Unix/Linux file system, and the metadata for each block is stored in a separate file. The master/slave structure uses a single NameNode to provide the name service and block mapping, together with multiple DataNodes. A file is divided into blocks that are distributed across the nodes of the cluster, so a distributed file is represented by multiple local files spread across the nodes (one file for the block data and one for its metadata per logical block).
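That block-oriented layout can be observed directly from the client API. The sketch below lists a file's block size and the DataNodes holding each block; the path and class name are illustrative assumptions.

// Sketch: inspecting how HDFS has split a file into blocks and where the
// replicas live. The path is an illustrative assumption.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLayout {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.dat");

        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size = " + status.getBlockSize()); // e.g. 64 MB or 128 MB

        // One BlockLocation per block, with the DataNodes holding each replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
    }
}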

Fault tolerance

HPCC: the Thor and Roxie distributed file systems (configurably) keep replicas of file parts on other nodes to protect against disk or node failures. The Thor system provides automatic or manual switchover and warm start after a node failure; a job either restarts or continues from the most recent checkpoint, and data is re-replicated to the new node automatically. The Roxie system continues to operate with a reduced number of nodes after node failures.

Hadoop: HDFS (configurably) stores multiple (user-specified) replicas of each block on other nodes to protect against disk or node failures, with automatic recovery. The MapReduce architecture includes speculative execution; when a slow or failed Map task is detected, additional Map tasks are started to take over the work from the failed node.
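A small sketch of the replication setting that underlies HDFS fault tolerance, showing both the cluster-wide default (dfs.replication) and a per-file override; the path, replica counts, and class name are illustrative assumptions.

// Sketch: configuring HDFS block replication, the basis of the fault
// tolerance described above. Values and path are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default: keep three copies of every block.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        Path important = new Path("/data/critical.dat");

        // Per-file override: ask HDFS to keep five replicas of this file.
        // The NameNode re-replicates blocks automatically if a DataNode fails.
        fs.setReplication(important, (short) 5);
        System.out.println("replication = "
                + fs.getFileStatus(important).getReplication());
    }
}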

The above are the differences between HPCC and Hadoop as shared by the editor. If you have had similar questions, the analysis above may help clarify them. If you would like to learn more, you are welcome to follow the industry information channel.
