Author | Bai Song
[Note: this article is original; for citation or reprinting, please contact the blogger.]
Giraph introduction:
Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google and described in a 2010 paper. Both systems are inspired by the Bulk Synchronous Parallel model of distributed computation introduced by Leslie Valiant. Giraph adds several features beyond the basic Pregel model, including master computation, sharded aggregators, edge-oriented input, out-of-core computation, and more. With a steady development cycle and a growing community of users worldwide, Giraph is a natural choice for unleashing the potential of structured datasets at a massive scale.
Principle:
Giraph is built on Hadoop: it wraps the computation in a MapReduce Mapper and uses no Reducer. The Mapper iterates multiple times, and each iteration corresponds to a superstep in the BSP model; one Hadoop job corresponds to one BSP job. The basic architecture is shown in the figure below.
[Figure: Giraph basic architecture (https://cdn.xitu.io/2019/7/19/16c082e33d47141a)]
The functions of each part are as follows:
1. ZooKeeper: responsible for computation state
- partition/worker mapping
- global state: #superstep
- checkpoint paths, aggregator values, statistics
2. Master: responsible for coordination
- assigns partitions to workers
- coordinates synchronization
- requests checkpoints
- aggregates aggregator values
- collects health statuses
3. Worker: responsible for vertices
- invokes active vertices' compute() function
- sends, receives and assigns messages
- computes local aggregation values
Description
(1) Experimental environment
Three servers: test165, test62, test63. test165 acts as both JobTracker and TaskTracker.
Test example: the SSSP (single-source shortest paths) program that ships with Giraph; the input data is self-generated simulation data.
Run command: hadoop jar giraph-examples-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner org.apache.giraph.examples.SimpleShortestPathsVertex -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/giraph/SSSP -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/giraph/output-sssp-debug-7 -w 5
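For orientation, the following is a small, self-contained illustration of how SSSP looks in the vertex-centric BSP model that the SimpleShortestPathsVertex example implements. It simulates supersteps and messages in plain Java and does not use the Giraph API, so all names here are illustrative only:

    import java.util.*;

    // Conceptual illustration of SSSP in the BSP/Pregel style: in each superstep every
    // vertex takes the minimum of its incoming messages, and if that improves its value
    // it sends updated distances along its out-edges; the job ends when no messages flow.
    public final class BspSsspSketch {
      public static Map<Long, Double> shortestPaths(Map<Long, Map<Long, Float>> graph, long source) {
        Map<Long, Double> value = new HashMap<>();        // per-vertex value (current distance)
        Map<Long, List<Double>> inbox = new HashMap<>();  // messages delivered this superstep
        for (Long v : graph.keySet()) {
          value.put(v, Double.MAX_VALUE);
          inbox.put(v, new ArrayList<>());
        }
        inbox.get(source).add(0.0);                       // superstep 0: the source "receives" 0

        boolean anyMessage = true;
        while (anyMessage) {                              // one loop iteration == one superstep
          Map<Long, List<Double>> outbox = new HashMap<>();
          for (Long v : graph.keySet()) outbox.put(v, new ArrayList<>());
          anyMessage = false;
          for (Long v : graph.keySet()) {                 // "compute()" for each vertex
            double minDist = value.get(v);
            for (double m : inbox.get(v)) minDist = Math.min(minDist, m);
            if (minDist < value.get(v)) {
              value.put(v, minDist);                      // update the vertex value
              for (Map.Entry<Long, Float> e : graph.get(v).entrySet()) {
                // send the candidate distance along each out-edge
                outbox.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(minDist + e.getValue());
                anyMessage = true;
              }
            }
            // the vertex "votes to halt"; it is reactivated only if it receives a message
          }
          inbox = outbox;                                 // barrier: messages are visible next superstep
        }
        return value;
      }
    }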
(2) To save space, only core code snippets are shown below.
(3) hadoop.tmp.dir in core-site.xml is set to /home/hadoop/hadooptmp.
(4) This article was written over the course of many debugging runs, so the JobIDs shown in the text differ; the reader can treat them as the same JobID.
(5) The following articles also follow the rules above.
org.apache.giraph.graph.GraphMapper class
Giraph defines the org.apache.giraph.graph.GraphMapper class, which extends Hadoop's org.apache.hadoop.mapreduce.Mapper class and overrides its setup(), map(), cleanup(), and run() methods. The GraphMapper class is described as follows:
This mapper will execute the BSP graph tasks allotted to this worker. All tasks will be performed by calling the GraphTaskManager object managed by this GraphMapper wrapper class. Since this mapper will not be passing data by key-value pairs through the MR framework, the Mapper parameter types are irrelevant and are set to Object type.
The BSP execution logic is encapsulated in the GraphMapper class, which holds a GraphTaskManager object that manages the job's tasks. Each GraphMapper object corresponds to one compute node in BSP.
The setup() method of the GraphMapper class creates a GraphTaskManager object and calls its setup() method to perform initialization. The map() method is empty, because all of the work is encapsulated in the GraphTaskManager class; the run() method calls the GraphTaskManager object's execute() method to carry out the BSP iterative computation.
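To make the structure concrete, here is a minimal sketch of the pattern just described, assuming a hypothetical TaskManager stand-in for GraphTaskManager; it illustrates the delegation pattern and is not the actual GraphMapper source:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of the structure described above: the Mapper passes no key/value data and
    // delegates all work to a task-manager object. The nested TaskManager type is a
    // hypothetical stand-in, not Giraph's GraphTaskManager.
    public class BspStyleMapper extends Mapper<Object, Object, Object, Object> {

      /** Hypothetical stand-in for Giraph's GraphTaskManager. */
      static class TaskManager {
        void setup(Configuration conf) { /* locate ZK jar, start ZooKeeper, decide the task's role */ }
        void execute()                 { /* run BSP supersteps until all vertices halt */ }
        void cleanup()                 { /* stop services, flush output */ }
      }

      private final TaskManager taskManager = new TaskManager();

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        taskManager.setup(context.getConfiguration());   // initialization, as described above
      }

      @Override
      protected void map(Object key, Object value, Context context) {
        // Intentionally empty: no data is passed through the MapReduce key/value path.
      }

      @Override
      public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        taskManager.execute();                           // the BSP iterative computation
        cleanup(context);
      }

      @Override
      protected void cleanup(Context context) {
        taskManager.cleanup();
      }
    }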
org.apache.giraph.graph.GraphTaskManager class
Function: The Giraph-specific business logic for a single BSP compute node in whatever underlying type of cluster our Giraph job will run on. Owning object will provide the glue into the underlying cluster framework and will call this object to perform Giraph work.
Its setup() method is described below.
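As a rough guide, the flow of setup() can be sketched as follows. The sketch is assembled from the method names discussed in the numbered items below and is not taken verbatim from the Giraph source; signatures and details differ:

    import java.util.List;

    // Sketch of the GraphTaskManager.setup() flow as described in this section.
    // Method bodies are placeholders; only the calling order reflects the discussion below.
    public class GraphTaskManagerSetupSketch {

      public void setup(List<String> zkPathList) {
        String zkJar = locateZookeeperClasspath(zkPathList); // 1. find the local ZooKeeper jar
        startZooKeeperManager(zkJar);                        // 2./3. create and set up the ZooKeeperManager
        determineGraphFunctions();                           // 4. decide this task's role (master/worker/ZooKeeper)
      }

      private String locateZookeeperClasspath(List<String> zkPathList) {
        return zkPathList.isEmpty() ? "" : zkPathList.get(0);
      }

      private void startZooKeeperManager(String zkJar) { /* initialize and configure ZooKeeperManager */ }

      private static void determineGraphFunctions() { /* see item 4 below */ }
    }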
The role of each method is introduced in turn:
1. locateZookeeperClasspath(zkPathList)
Finds the local copy of the ZooKeeper jar, at /home/hadoop/hadooptmp/mapred/local/taskTracker/root/jobcache/job_201403270456_0001/jars/job.jar, which is used to start the ZooKeeper service.
2. startZooKeeperManager(): initializes and configures the ZooKeeperManager.
The definition is as follows:
3. org.apache.giraph.zk.ZooKeeperManager class
Functions: Manages the election of ZooKeeper servers, starting/stopping the services, etc.
The setup() flow of the ZooKeeperManager class is described below.
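Sketched from the three methods discussed next, the flow looks roughly like this. Whether getZooKeeperServerList() is invoked directly from setup() or from inside onlineZooKeeperServers() is not shown here, so the ordering is an assumption:

    // Sketch of the ZooKeeperManager setup flow as this section describes it; the call
    // order is an assumption based on the narrative below, not the verbatim source.
    public class ZooKeeperManagerSetupSketch {

      public void setup() throws Exception {
        createCandidateStamp();      // every task writes an empty "<hostname> <taskPartition>" file
        getZooKeeperServerList();    // task 0 builds the server list and writes zkServerList_<host> <task>
        onlineZooKeeperServers();    // task 0 starts ZooKeeper; the other tasks wait for the _zkServer marker
      }

      private void createCandidateStamp() throws Exception { }
      private void getZooKeeperServerList() throws Exception { }
      private void onlineZooKeeperServers() throws Exception { }
    }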
The createCandidateStamp() method creates an empty file for each task under the _bsp/_defaultZkManagerDir/job_201403301409_0006/_task directory on HDFS; the file name is the local hostname plus the taskPartition. The run specifies five workers (-w 5), plus one master, so six task files appear in this directory.
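As an illustration of the mechanism (not the Giraph source), creating such an empty marker file on HDFS looks roughly like this; the directory layout and the "hostname taskPartition" naming follow the description above:

    import java.net.InetAddress;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch: each task writes an empty marker file named
    // "<hostname> <taskPartition>" under the job's _task directory on HDFS.
    // The name alone carries the information; the file content stays empty.
    public final class CandidateStampSketch {

      public static void createCandidateStamp(Configuration conf, String taskDirectory,
          int taskPartition) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        String hostname = InetAddress.getLocalHost().getHostName();
        Path myCandidacyPath = new Path(taskDirectory, hostname + " " + taskPartition);
        fs.createNewFile(myCandidacyPath);
      }
    }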
In the getZooKeeperServerList() method, the task whose taskPartition is 0 calls the createZooKeeperServerList() method to create the ZooKeeper server list; this likewise produces an empty file whose name describes the ZooKeeper server.
It first lists the files in the taskDirectory (_bsp/_defaultZkManagerDir/job_201403301409_0006/_task); for every file found, the hostname and taskPartition encoded in the file name (hostname + taskPartition) are stored in hostNameTaskMap. After each scan of taskDirectory, if the size of hostNameTaskMap has reached serverCount (the ZOOKEEPER_SERVER_COUNT constant in GiraphConstants.java, whose default is 1), the outer loop stops. The outer loop is needed because the files under taskDirectory are created by many tasks running in a distributed fashion: when task 0 builds the server list, the other tasks may not have created their task files yet. By default Giraph starts one ZooKeeper service per job, which means only one task starts the ZooKeeper service.
Across many test runs, task 0 is always selected as the ZooKeeper server: when it scans taskDirectory, usually only its own task file exists (the other task files have not been generated yet), so it exits the for loop, finds that hostNameTaskMap has size 1, and leaves the while loop immediately. Hence test162 0 is chosen here.
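The scan-and-wait logic just described can be sketched as follows; this is an illustration, not the Giraph source, and serverCount, pollMsecs, and the path handling are simplified:

    import java.util.Map;
    import java.util.TreeMap;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative sketch of the server-list election scan: poll the _task directory until
    // at least serverCount "<hostname> <taskPartition>" files are visible; the first entries
    // in the map then become the ZooKeeper server(s).
    public final class ServerListScanSketch {

      public static Map<String, Integer> scan(FileSystem fs, Path taskDirectory,
          int serverCount, long pollMsecs) throws Exception {
        Map<String, Integer> hostNameTaskMap = new TreeMap<String, Integer>();
        while (true) {
          for (FileStatus status : fs.listStatus(taskDirectory)) {
            String[] parts = status.getPath().getName().split(" ");
            hostNameTaskMap.put(parts[0], Integer.valueOf(parts[1]));  // hostname -> taskPartition
            if (hostNameTaskMap.size() >= serverCount) {
              break;                      // enough candidates seen inside this scan
            }
          }
          if (hostNameTaskMap.size() >= serverCount) {
            return hostNameTaskMap;       // with the default serverCount of 1, task 0 usually
          }                               // sees only its own file and returns immediately
          Thread.sleep(pollMsecs);        // other tasks' files may not exist yet; scan again
        }
      }
    }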
Finally, the following file is created: _bsp/_defaultZkManagerDir/job_201403301409_0006/zkServerList_test162 0
In onlineZooKeeperServers(), based on the zkServerList_test162 0 file, task 0 first generates the zoo.cfg configuration file and uses ProcessBuilder to launch the ZooKeeper service process; task 0 then connects to that process over a socket, and finally creates the file _bsp/_defaultZkManagerDir/job_201403301409_0006/_zkServer/test162 0 to mark that the master's part of the work is done. Each worker loops, checking whether the master has created _bsp/_defaultZkManagerDir/job_201403301409_0006/_zkServer/test162 0; in other words, the workers wait until the ZooKeeper service on the master has started.
The command to start the ZooKeeper service is as follows:
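In general shape, it is a plain java invocation of ZooKeeper's standard server entry point against the generated zoo.cfg. An illustrative form is shown below; the actual classpath, JVM options, and paths are taken from the job's local directories and are omitted here as placeholders:

    java -cp <job.jar and ZooKeeper classes> org.apache.zookeeper.server.quorum.QuorumPeerMain <zkDir>/zoo.cfg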
4. determineGraphFunctions()
The GraphTaskManager class holds a CentralizedServiceMaster object and a CentralizedServiceWorker object, corresponding to the master and the worker respectively. The logic that decides which role each BSP compute node plays is as follows:
a) If master and worker are not split, every task does everything and may also run ZooKeeper.
b) If master and worker are split, the masters also run ZooKeeper.
c) If master and worker are split and giraph.zkList is set, the master will not instantiate a ZK instance, but will assume a quorum is already active on the cluster for Giraph to use.
This decision is made in the static method determineGraphFunctions() of the GraphTaskManager class.
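A simplified sketch of that decision, mirroring rules a)-c) above, is shown below; the enum and parameter names are stand-ins rather than Giraph's internal types:

    // Simplified sketch of the role decision described above. Only the branching mirrors
    // rules a)-c); the enum and parameters are illustrative, not Giraph's internal types.
    public final class GraphFunctionsSketch {

      enum Role { ALL, MASTER_ZOOKEEPER, MASTER_ONLY, WORKER_ONLY }

      static Role determineRole(boolean splitMasterWorker, boolean externalZkListConfigured,
          boolean thisTaskRunsZooKeeper, boolean thisTaskIsMaster) {
        if (!splitMasterWorker) {
          return Role.ALL;                   // a) every task does everything (and may run ZK)
        }
        if (externalZkListConfigured) {      // c) giraph.zkList set: reuse the existing quorum
          return thisTaskIsMaster ? Role.MASTER_ONLY : Role.WORKER_ONLY;
        }
        // b) split master/worker without an external quorum: the master also runs ZooKeeper
        return thisTaskRunsZooKeeper ? Role.MASTER_ZOOKEEPER : Role.WORKER_ONLY;
      }
    }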
By default, Giraph splits the master and worker roles, and the ZooKeeper service is started on the master rather than on the workers. Task 0 is therefore master plus ZooKeeper, and the other tasks are workers.