What are the scenarios where Hadoop is not applicable?

2025-07-09 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)05/31 Report--

This article explains the scenarios where Hadoop is not applicable. The notes below first cover how MapReduce (MR) and HDFS work, because the limitations follow from their design.

1. MR and relational data

MR and traditional relational databases handle different kinds of data. Relational databases work best with structured data and handle semi-structured and unstructured data poorly; MR complements them in exactly those areas. The keys and values fed to MR are not inherent attributes of the data but are chosen by the analyst. For now the two are complementary: Hive provides a SQL-like interface on top of MR, while plain MR remains the more flexible of the two, and relational databases will keep evolving as well. It is also worth noting that MR processes data as a stream and targets unstructured data, so it applies to more kinds of input than SQL does with structured data, and this programming model scales out more easily.
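The point that keys are chosen by the analyst rather than fixed by the data can be illustrated with a toy, single-process sketch. This is not Hadoop code: `run_mapreduce`, the sample log lines, and the two map functions are hypothetical names invented for the illustration.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process MapReduce: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Semi-structured log lines (hypothetical sample data).
logs = [
    "2014-01-03 host-a GET /index.html 200",
    "2014-01-03 host-b GET /missing 404",
    "2014-01-04 host-a GET /index.html 200",
]

def map_by_status(line):
    # The analyst decides that the HTTP status is the key.
    yield line.split()[-1], 1

def map_by_host(line):
    # Same data, different question: key on the host instead.
    yield line.split()[1], 1

def count(key, values):
    return sum(values)

by_status = run_mapreduce(logs, map_by_status, count)
by_host = run_mapreduce(logs, map_by_host, count)
print(by_status)
print(by_host)
```

The input never changes; only the map function does, which is what makes the model adaptable to data without a fixed schema.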

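2. The core of MR is data locality: computation runs on the node where the data is stored. MR streams through large volumes of data, so network bandwidth is a particularly precious resource and moving data between nodes should be avoided.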

3. MR uses a shared-nothing architecture in which each node is relatively independent, so the framework can detect a failed node and reschedule its work. The reduce (R) phase takes the output of the map (M) phase as its input, so if R cannot get the result of an M task, that M task is executed again. The output directory of a job must not exist before the job runs, otherwise the job refuses to start. This avoids the very annoying situation where the result of a long-running job is overwritten.
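The output-directory rule can be sketched on the local filesystem. This is only an analogy for the behavior described above, not the Hadoop API: `submit_job` and `OutputExistsError` are hypothetical names.

```python
import os
import tempfile

class OutputExistsError(Exception):
    """Raised when the job's output directory is already present."""

def submit_job(output_dir, run):
    # Refuse to start if the output directory exists, so that an
    # earlier result is never silently overwritten.
    if os.path.exists(output_dir):
        raise OutputExistsError(f"output directory {output_dir} already exists")
    os.makedirs(output_dir)
    with open(os.path.join(output_dir, "part-00000"), "w") as f:
        f.write(run())
    return output_dir

base = tempfile.mkdtemp()
out = os.path.join(base, "result")
submit_job(out, lambda: "first run\n")

try:
    submit_job(out, lambda: "second run\n")
    second_run_rejected = False
except OutputExistsError:
    second_run_rejected = True

with open(os.path.join(out, "part-00000")) as f:
    surviving = f.read()
```

The second submission fails before doing any work, and the first result is left untouched.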

4. MR has two very important roles, the JobTracker and the TaskTracker. By analogy, the JobTracker (J) is an administrator responsible for assigning tasks, and the TaskTrackers (T) are a group of employees responsible for executing them. An employee can only work on a task after being assigned one by J.

5. Hadoop divides the input data into fixed-size pieces called input splits. It creates one map task per split, and that task runs the user-defined map function on each record in the split.

6. When splits are small, two overheads come to dominate the total job time: (1) the time spent managing the splits, and (2) the time spent creating map tasks.
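Points 5 and 6 can be made concrete with a minimal sketch of fixed-size splitting. It assumes a 300 MB input and a 128 MB split size, and `compute_splits` is a hypothetical helper; the real split logic in Hadoop has additional details that are ignored here.

```python
MB = 1024 * 1024

def compute_splits(file_length, split_size):
    """Return (offset, length) pairs covering the file in fixed-size splits."""
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits

# One map task is created per split.
splits = compute_splits(file_length=300 * MB, split_size=128 * MB)
print(len(splits), "map tasks")

# Shrinking the split size multiplies the number of tasks to manage.
tiny_splits = compute_splits(file_length=300 * MB, split_size=1 * MB)
print(len(tiny_splits), "map tasks")
```

With 1 MB splits the same input needs a hundred times as many tasks, and the bookkeeping for those tasks is the overhead point 6 refers to.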

7. Data locality optimization: each node runs map tasks on the data it stores. The best split size tends to be the HDFS block size, because a block is the largest unit of input guaranteed to be stored on a single node. The appropriate block size depends on the hardware, the cluster, and the amount of data read from disk. The R phase has no locality advantage: its input is usually the output of M, which has to be transferred over the network to the node running the reduce task. This consumes bandwidth, so the data sent from M to R should be reduced as much as possible, for example with a combiner, with parallel input, or by compressing the data stream.
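The effect of a combiner on the data sent from M to R can be shown with a toy word count. This is a single-process sketch with hypothetical function names, not Hadoop code; it only counts how many key/value pairs would cross the network.

```python
from collections import Counter

def map_words(split):
    # Emit (word, 1) for every word in the split.
    return [(word, 1) for word in split.split()]

def combine(pairs):
    # Local aggregation on the map side, before anything is transferred.
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return sorted(totals.items())

def reduce_counts(all_pairs):
    totals = Counter()
    for word, n in all_pairs:
        totals[word] += n
    return dict(totals)

input_splits = ["a b a a", "b b a c"]
map_outputs = [map_words(s) for s in input_splits]

# Pairs shuffled to the reducer without and with a combiner.
plain = [pair for out in map_outputs for pair in out]
combined = [pair for out in map_outputs for pair in combine(out)]

print(len(plain), len(combined))
print(reduce_counts(combined))
```

The final result is identical either way because summation is associative and commutative; a combiner is only safe for operations with that property.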

8. Scenarios where Hadoop is currently not applicable: (1) low-latency data access, for which Storm or Spark can be considered; (2) large numbers of small files, mainly because the namenode holds all file and block metadata in memory, so too many small files exhaust it; (3) files written by multiple users or modified at arbitrary positions, because Hadoop is designed to be written once and read many times.
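The small-files limit can be estimated roughly. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of namenode memory per file or block object and a 128 MB block size; both numbers are assumptions for illustration, not measurements, and `namenode_bytes` is a hypothetical helper.

```python
MB = 1024 * 1024
GB = 1024 * MB
TB = 1024 * GB

BYTES_PER_OBJECT = 150  # assumed rule of thumb, not a measured value

def ceil_div(a, b):
    return -(-a // b)

def namenode_bytes(total_bytes, file_size, block_size=128 * MB):
    """Estimate namenode memory for total_bytes stored as equal-size files."""
    num_files = ceil_div(total_bytes, file_size)
    blocks_per_file = ceil_div(file_size, block_size)
    # One object per file plus one object per block.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

small = namenode_bytes(1 * TB, file_size=1 * MB)
large = namenode_bytes(1 * TB, file_size=1 * GB)
print(small // MB, "MB of metadata for 1 MB files")
print(large // 1024, "KB of metadata for 1 GB files")
```

Under these assumptions the same terabyte costs a few hundred megabytes of namenode memory as 1 MB files and only about a megabyte as 1 GB files.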

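9. HDFS blocks are much larger than disk blocks in order to reduce the cost of seeks: with a large block, the time to transfer the data dominates the time to locate the start of the block.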
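A quick calculation shows why large blocks reduce the relative cost of seeking. The 10 ms seek time and 100 MB/s transfer rate below are assumed round numbers, and `block_size_for_seek_fraction` is a hypothetical helper.

```python
def block_size_for_seek_fraction(seek_time_s, transfer_rate_bytes_per_s, seek_fraction):
    """Block size at which seek time is seek_fraction of the transfer time.

    transfer_time = block_size / rate and seek_time = seek_fraction * transfer_time,
    so block_size = seek_time * rate / seek_fraction.
    """
    return seek_time_s * transfer_rate_bytes_per_s / seek_fraction

# Assumed hardware figures: 10 ms seek, 100 MB/s sequential transfer.
size = block_size_for_seek_fraction(
    seek_time_s=0.010,
    transfer_rate_bytes_per_s=100e6,
    seek_fraction=0.01,
)
print(round(size / 1e6), "MB")
```

With these figures a block of roughly 100 MB keeps the seek at about 1% of the transfer time, which is in line with HDFS block sizes in the tens to hundreds of megabytes.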

10. The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all files and directories in it, and this information is stored persistently on the local disk. The namenode also knows which datanodes hold the blocks of each file, but it does not save this on the local disk; the block locations are rebuilt from the datanodes on every restart. More seriously, the namenode is a single point of failure: there is only one namenode in a cluster, so once it has a problem the whole cluster is paralyzed. A standby node can be set up with keepalived, but some data loss is then unavoidable. Hadoop 2.x is said to have solved the problem by supporting a standby namenode for high availability.
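The split between persisted namespace and in-memory block locations can be modeled with a toy class. `ToyNameNode`, its methods, and the JSON image format are all hypothetical; the real namenode uses its own on-disk formats and protocols.

```python
import json

class ToyNameNode:
    def __init__(self):
        self.namespace = {}        # path -> list of block ids (persisted)
        self.block_locations = {}  # block id -> set of datanodes (memory only)

    def add_file(self, path, block_ids):
        self.namespace[path] = list(block_ids)

    def block_report(self, datanode, block_ids):
        # Datanodes tell the namenode which blocks they hold.
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def save_image(self):
        # Only the namespace is written out, never the block locations.
        return json.dumps(self.namespace)

    @classmethod
    def restart(cls, image):
        node = cls()
        node.namespace = json.loads(image)
        return node

nn = ToyNameNode()
nn.add_file("/logs/a", ["blk_1", "blk_2"])
nn.block_report("dn1", ["blk_1", "blk_2"])
nn.block_report("dn2", ["blk_1"])

restarted = ToyNameNode.restart(nn.save_image())
locations_after_restart = dict(restarted.block_locations)

# Locations come back only once the datanodes report in again.
restarted.block_report("dn1", ["blk_1", "blk_2"])
restarted.block_report("dn2", ["blk_1"])
```

After the restart the file tree is intact but the location map is empty until the block reports arrive.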

11. The HDFS client obtains an FSDataInputStream object through DistributedFileSystem, which gets the locations of the file's blocks on the datanodes from the namenode. DFSOutputStream handles communication with the datanodes and the namenode.

12. Coherency model: data written through FSDataOutputStream becomes visible to readers one block at a time. The block currently being written is not visible until it is complete, so its contents are effectively buffered. If the cluster fails at that point, the block is lost, which is not acceptable in a production environment. HDFS therefore provides a way to force all buffers to be synchronized to the datanodes: calling sync() on FSDataOutputStream guarantees that everything written so far is consistent and visible. This adds some performance overhead to the application, so whether and how often to call it is a trade-off against cluster performance.
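The block-at-a-time visibility rule can be modeled with a small in-memory class. `ToyOutputStream` is a hypothetical stand-in, not the FSDataOutputStream API, and the 4-byte block size is chosen only to keep the example short.

```python
class ToyOutputStream:
    """In-memory model: readers see only complete blocks unless sync() is called."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.data = b""
        self.visible_len = 0

    def write(self, chunk):
        self.data += chunk
        # Only whole blocks become visible automatically.
        full_blocks = len(self.data) // self.block_size
        self.visible_len = max(self.visible_len, full_blocks * self.block_size)

    def sync(self):
        # Force everything written so far to become visible.
        self.visible_len = len(self.data)

    def reader_view(self):
        return self.data[:self.visible_len]

out = ToyOutputStream(block_size=4)
out.write(b"abcdef")
before_sync = out.reader_view()   # only the first complete block
out.sync()
after_sync = out.reader_view()    # everything written so far
```

A crash before `sync()` in this model would lose the two bytes of the unfinished block, which is the risk the synchronization call is there to remove.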

13. Parallel copy: the typical use is transferring data between two HDFS clusters. This method (distcp, which runs as a MapReduce job) can copy large amounts of data into or out of a Hadoop filesystem. One optimization is to use as few map tasks as possible and let each of them copy as much data as possible.
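The "fewer maps, more data per map" idea can be sketched as a simple planning function. `plan_copy_maps` and the 256 MB and 40-map figures are illustrative assumptions, not the tool's actual defaults or logic.

```python
MB = 1024 * 1024
GB = 1024 * MB

def plan_copy_maps(total_bytes, min_bytes_per_map, max_maps):
    """Pick a map count so each map copies at least min_bytes_per_map."""
    wanted = total_bytes // min_bytes_per_map
    # At least one map, and never more than the configured ceiling.
    return max(1, min(max_maps, wanted))

# Assumed figures: at least 256 MB per map, at most 40 maps in total.
small_copy = plan_copy_maps(100 * MB, 256 * MB, 40)
medium_copy = plan_copy_maps(1 * GB, 256 * MB, 40)
large_copy = plan_copy_maps(1024 * GB, 256 * MB, 40)
print(small_copy, medium_copy, large_copy)
```

Giving each map a sizeable amount of data keeps the task start-up cost small relative to the copying itself.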

Thank you for reading. That covers the scenarios where Hadoop is not applicable; how they apply to a specific workload still needs to be verified in practice.
