
Spark Application Development, Part 1: Analyzing Big Data with Hadoop

2025-01-16 Update From: SLTechnology News&Howtos

Before learning and using a technology, you should first understand its background and the problems it is meant to solve. So before talking about Spark, the first thing to understand is the processing of massive data and the Hadoop technology that grew up around it.

A running system produces large amounts of log data, including, but not limited to, the logs we normally generate with log4j or logback during development to record what the system is doing. For example, an Internet service provider's equipment may record users' online and offline times, the web addresses they visit, response times, and other data. By extracting and analyzing the information recorded in these data files, we can derive many useful metrics and thereby provide a data basis for improving the network structure and the quality of service. However, this data becomes very large, and conventional techniques and solutions struggle to analyze it, so the emergence of a new computing model and framework for handling massive data became urgent.

The first problem to solve when dealing with massive data is storage. We need to store the collected log files somewhere for later analysis. But the capacity of a single machine is always limited, and as the data grows we cannot expand one server's storage indefinitely, so we need to spread the collected data across many machines and manage it uniformly through some scheme.

The second problem to solve when dealing with massive data is computation. The computing power of a single server is limited, constrained directly by its CPU and memory. As the amount of data grows, we cannot expand these indefinitely either, so, just as with storage, we need to combine the computing power of many machines to finish the work together. Each computer is an independent machine, and the code running on one knows nothing about the others, so we also need a scheme to coordinate their execution and make them behave, logically, like a single enormous computer.

Based on the ideas of GFS (the Google File System), Hadoop's distributed file system HDFS was developed. HDFS is a distributed file system built on top of each computer's local file system, which means HDFS ultimately stores file data on the local file systems of the machines in the cluster (although, of course, we cannot view the file contents there directly).

HDFS solves the data storage problem described above. In general, each computer that stores data (called a DataNode node) runs a single DataNode process that manages its local data. The DataNode process communicates with the NameNode process on the master node (called the NameNode node), reporting the status of its data blocks and sending heartbeats. The NameNode is a central server responsible for managing and maintaining the file system's namespace and for handling client access to files.
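To make the client's view of this architecture concrete, here is a minimal sketch of reading a file through the HDFS Java API. The NameNode address (hdfs://namenode:9000) and the file path are hypothetical examples, not taken from the article; the point is that the client only needs to know the NameNode, which then tells it which DataNodes hold the file's blocks.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the client where the NameNode is; DataNodes are discovered through it.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Open a file; the NameNode resolves which DataNodes hold its blocks,
        // and the client then streams the block contents from those DataNodes.
        Path logFile = new Path("/logs/access.log");
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(logFile)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}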

Note: the namespace is the organization of the file system's files and directories. It is an important part of any file system, giving users a visual, understandable view of the stored data and thus reducing the semantic gap between humans and computers. The tree structure most closely resembles how things are organized in the real world and is the most widely accepted, so most file systems organize their directories as trees, including disk file systems (EXTx, XFS, JFS, ReiserFS, ZFS, Btrfs, NTFS, FAT32, etc.), network file systems (NFS, AFS, CIFS/SMB, etc.), cluster file systems (Lustre, pNFS, PVFS, GPFS, PanFS, etc.), and distributed file systems (GoogleFS, HDFS, MFS, KFS, TaobaoFS, FastDFS, etc.).

Now let's turn to Map/Reduce, the parallel computing framework for batch analysis of big data. The framework divides data processing into two independent phases, Map and Reduce, corresponding to two methods, map and reduce:

/*
* @key Since the framework must serialize keys and sort by key, the key type must implement the WritableComparable interface.
* @value A specific record of data; since it must be serialized, its type must implement the Writable interface.
* @out The output collector for the mapped key-value pairs; call its collect(WritableComparable, Writable) method, passing the key and the value respectively.
* @reporter Applications can use the Reporter to report progress, set application-level status messages, update Counters, or simply indicate that they are alive.
*/
map(WritableComparable key, Writable value, OutputCollector out, Reporter reporter)

/*
* @key The output key from the previous (map) stage.
* @values The values emitted for this key in the previous stage, already sorted and grouped (the outputs of different mappers may share the same key).
*/
reduce(WritableComparable key, Iterator values, OutputCollector out, Reporter reporter)
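To show how these signatures are used in practice, here is a minimal word-count sketch written against the classic org.apache.hadoop.mapred API, whose map and reduce signatures match the ones above; the class names and tokenization are illustrative, not taken from the article.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

    // map: key is the byte offset of the line, value is the line itself.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                out.collect(word, one); // emit (word, 1)
            }
        }
    }

    // reduce: values holds every count emitted for the same key, already grouped and sorted.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }
}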

In the Map phase, the Hadoop framework first obtains the files to be processed at the specified HDFS path, then divides them into InputSplits and assigns one Map task to each split. Before the map method is called, the framework uses an object of type InputFormat to handle the splitting. For each InputSplit, an object of type RecordReader is created to read the records in that split, and the map method is then called to process each record.

The InputFormat logically divides the file into splits, each of which records information such as the offset and size of its data. However, this splitting can cut a single line of data across two or more splits, which would make the data processed later incorrect. This is the problem the RecordReader has to solve. Take LineRecordReader as an example: if a split is the first split of the file, it starts reading from the first byte; otherwise it starts from the second line of the split (skipping the partial first line). If a split is the last split of the file, reading stops at the end of the split; otherwise it continues until the end of the first line of the next split. In this way, for data divided by lines, every line read is guaranteed to be complete.

Staying with LineRecordReader as the example: it reads each line of data in the split and calls the map function with a key-value pair, where the key is the byte offset of the line within the file and the value is the line of data itself.
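The small, self-contained snippet below (plain Java, not Hadoop code) illustrates the key-value pairs LineRecordReader would hand to map for a tiny invented log file: the key is the byte offset at which each line starts, and the value is the line itself.

public class LineOffsetDemo {
    public static void main(String[] args) {
        // Invented sample content standing in for a log file on HDFS.
        String fileContent = "alice 10:02 /index.html\n"
                           + "bob 10:03 /login\n"
                           + "carol 10:04 /search\n";
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            // Hadoop would call map(offset, line) with exactly these values.
            System.out.println("key=" + offset + "  value=" + line);
            offset += line.getBytes().length + 1; // +1 for the trailing '\n'
        }
    }
}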

The Hadoop framework places the output of each call to the map function into a buffer, and when the buffer usage reaches a specified threshold, a thread spills this data to a temporary file. Before the spill, several operations are performed on the data:

1. Partition the data by key using the partitioner, which determines which reducer will process each record (a partitioner sketch appears after the next paragraph).

2. Sort the data by key.

3. Combine values with the same key (optional, specified by the user as needed).

After completing these steps, the data is spilled to a temporary file. While a split is being processed, many such spill files may be produced, and they then have to be merged into one complete file ("complete" meaning the portion of data handled by that split). During the merge the data must again be sorted and combined; an external sort is used here because the files may be too large to load into memory and sort all at once. The data in the resulting file is grouped by partition and sorted within each partition.
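As a concrete illustration of step 1 above, here is a minimal partitioner sketch in the classic mapred API that mirrors the logic of Hadoop's default HashPartitioner; the class is written out here only for illustration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Same formula as the default HashPartitioner: mask off the sign bit,
        // then take the remainder so every key maps to exactly one reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    @Override
    public void configure(JobConf job) {
        // No configuration is needed for this sketch.
    }
}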

This is the end of the Map phase; next comes the Reduce phase. Before the reduce method is actually called, there is a shuffle step that preprocesses the data. Whenever a map task finishes, the Reduce task is notified and records the location of the data it must fetch in mapLocations; these locations are then filtered and de-duplicated into scheduledCopies, after which several threads copy the data in parallel and perform sort and merge operations on it.

Finally, the reduce method is called on the merged data and the result is written to HDFS.
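To tie the pieces together, here is a minimal driver sketch using the classic JobConf API that reuses the Map and Reduce classes from the word-count sketch above; the input and output paths are hypothetical.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        conf.setMapperClass(WordCount.Map.class);      // Map phase
        conf.setCombinerClass(WordCount.Reduce.class); // optional local merge (step 3 above)
        conf.setReducerClass(WordCount.Reduce.class);  // Reduce phase

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setInputFormat(TextInputFormat.class);    // splits + LineRecordReader
        conf.setOutputFormat(TextOutputFormat.class);  // results written back to HDFS

        FileInputFormat.setInputPaths(conf, new Path("/logs/input"));
        FileOutputFormat.setOutputPath(conf, new Path("/logs/output"));

        JobClient.runJob(conf); // submit the job and wait for it to finish
    }
}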
