This article introduces common Hadoop interview questions through practical cases. Many people run into exactly these questions, so let the editor walk you through how to handle them. I hope you read it carefully and come away with something!
1 Single-Choice Questions
1.1 Which of the following programs is responsible for HDFS data storage?
A) NameNode
B) Jobtracker
C) Datanode
D) secondaryNameNode
E) tasktracker
Answer: C (DataNode)
1.2 How many copies of a block does HDFS keep by default?
A) 3 copies
B) 2 copies
C) 1 copy
D) uncertainty
Answer: A (the default is 3 copies)
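The replication factor is controlled by the dfs.replication property and can also be changed per file through the HDFS Java API. A minimal sketch (path and values are illustrative, not from the original article):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "2");  // override the default of 3 for new files
            FileSystem fs = FileSystem.get(conf);
            // Change the replication factor of an existing file (hypothetical path)
            fs.setReplication(new Path("/data/example.txt"), (short) 2);
            fs.close();
        }
    }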
1.3 Which of the following programs is usually started on the same node as the NameNode?
A) SecondaryNameNode
B) DataNode
C) TaskTracker
D) Jobtracker
Answer: D. Analysis:
A Hadoop cluster is based on the master/slave model: the NameNode and JobTracker are masters, while the DataNodes and TaskTrackers are slaves. There is only one master, but many slaves. The SecondaryNameNode has memory requirements of the same order of magnitude as the NameNode, so the SecondaryNameNode (running on a separate physical machine) and the NameNode usually run on different machines.
The JobTracker corresponds to the NameNode and the TaskTracker corresponds to the DataNode: the DataNode and NameNode handle data storage, while the JobTracker and TaskTracker handle MapReduce execution. MapReduce involves several key concepts, and its execution can be divided into three main threads of control: JobClient, JobTracker, and TaskTracker.
The JobClient packages the application's configuration parameters, uploads them to HDFS via the JobClient class, and submits the path to the JobTracker. The JobTracker then creates each Task (that is, MapTask and ReduceTask) and distributes them to the TaskTracker services for execution.
The JobTracker is a master service. After it starts, it receives jobs, schedules each subtask of a job to run on a TaskTracker, monitors them, and reruns any task that fails. In general, the JobTracker should be deployed on a separate machine.
The TaskTracker is a slave service that runs on multiple nodes. It actively communicates with the JobTracker, receives tasks, and is responsible for executing each task directly. TaskTrackers need to run on HDFS DataNodes.
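As a rough illustration of this flow, here is a minimal job submission using the classic org.apache.hadoop.mapred API described above (job name and paths are hypothetical). JobClient.runJob packages the configuration, uploads it to HDFS, and submits it to the JobTracker:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitJobDemo {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitJobDemo.class);
            conf.setJobName("demo-job");                                // hypothetical name
            FileInputFormat.setInputPaths(conf, new Path("/input"));   // hypothetical paths
            FileOutputFormat.setOutputPath(conf, new Path("/output"));
            // JobClient uploads the job and its config to HDFS and hands the
            // path to the JobTracker, which creates MapTasks/ReduceTasks and
            // dispatches them to TaskTrackers.
            JobClient.runJob(conf);
        }
    }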
1.4 Who is the author of Hadoop?
A) Martin Fowler
B) Kent Beck
C) Doug Cutting
Answer: C (Doug Cutting)
1.5 What is the default HDFS block size?
A) 32MB
B) 64MB
C) 128MB
Answer: B (64 MB was the default in Hadoop 1.x; Hadoop 2.x raised it to 128 MB, so the answer depends on the version.)
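For reference, the block size can be inspected and overridden through the client configuration. A small sketch, assuming a Hadoop 2.x client (property names differ across versions: dfs.block.size in 1.x, dfs.blocksize in 2.x+):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024); // request 128 MB blocks
            FileSystem fs = FileSystem.get(conf);
            // Print the default block size the client will use for new files
            System.out.println(fs.getDefaultBlockSize(new Path("/")));
            fs.close();
        }
    }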
1.6 Which of the following is usually the main bottleneck of a cluster?
A) CPU
B) Network
C) disk IO
D) memory
Answer: C (disk I/O)
Analysis:
The purpose of a cluster is to save costs by replacing minicomputers and mainframes with cheap PCs. Minicomputers and mainframes are characterized by strong CPU processing power and ample memory, so the cluster's bottleneck cannot be A or D.
The network is a scarce resource, but it is not the bottleneck.
Big data means huge volumes of data, which require disk I/O to read and write, plus redundancy: Hadoop typically keeps 3 copies of the data, so disk I/O takes a further hit.
1.7 Which of the following statements about the SecondaryNameNode is true?
A) it is a hot backup for NameNode
B) it has no memory requirements
C) its purpose is to help the NameNode merge edit logs, reducing NameNode startup time
D) the SecondaryNameNode should be deployed on the same node as the NameNode
Answer C
2 Multiple-Choice Questions
2.1 Which of the following can be used for cluster management?
A) Puppet
B) Pdsh
C) Cloudera Manager
D) Zookeeper
Answer: ABD
2.2 Which of the following statements about configuring rack awareness are correct?
A) if there is a problem with a rack, it will not affect data reading and writing
B) when data is written, it is written to DataNodes on different racks
C) MapReduce obtains data that is close to it on the network, based on the rack topology
Answer ABC
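Rack awareness is not on by default; it is enabled by pointing Hadoop at a topology script that maps each node's address to a rack id. A hedged sketch (the script path is hypothetical; the property name is net.topology.script.file.name in Hadoop 2.x):

    import org.apache.hadoop.conf.Configuration;

    public class RackAwarenessDemo {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // The script receives a host/IP and prints a rack id such as /rack1;
            // block placement and MapReduce scheduling then become rack-aware.
            conf.set("net.topology.script.file.name", "/etc/hadoop/rack-topology.sh");
            System.out.println(conf.get("net.topology.script.file.name"));
        }
    }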
2.3 Which of the following is true when a Client uploads a file?
A) data is passed to DataNode via NameNode
B) the Client splits the file into Blocks and uploads them in sequence
C) Client only uploads data to one DataNode, and then NameNode is responsible for Block replication
Answer: B. Analysis:
The Client initiates a file-write request to the NameNode.
Based on the file size and block configuration, the NameNode returns information about the DataNodes it manages to the Client.
The Client divides the file into multiple Blocks and writes them sequentially to each DataNode according to the DataNode address information.
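From the application's point of view, all of this happens behind FileSystem.create: the client library contacts the NameNode, splits the stream into Blocks, and writes each Block to the DataNodes itself; no file data passes through the NameNode. A minimal sketch (path and content are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // create() asks the NameNode for target DataNodes; the stream then
            // writes Blocks directly to those DataNodes in sequence.
            try (FSDataOutputStream out = fs.create(new Path("/tmp/demo.txt"))) {
                out.writeUTF("hello hdfs");
            }
            fs.close();
        }
    }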
2.4 Which of the following are modes in which Hadoop can run?
A) standalone (local) mode
B) pseudo-distributed mode
C) fully distributed mode
Answer ABC
2.5 Which methods does Cloudera provide for installing CDH?
A) Cloudera manager
B) Tarball
C) Yum
D) Rpm
Answer: ABCD
3 True/False Questions
3.1 Ganglia can not only monitor but also send alerts. (Correct)
Analysis: This question tests understanding of Ganglia. Strictly speaking, the statement is correct. As one of the most commonly used monitoring tools in Linux environments, Ganglia is good at collecting data from nodes at low cost according to the user's needs.
However, Ganglia is not good at alerting or notifying users of events. Newer versions of Ganglia do include some of these features, but Nagios is still better at it: Nagios is a tool that specializes in alerting and notification. By combining Ganglia and Nagios, using the data collected by Ganglia as Nagios's data source and letting Nagios send the alert notifications, a complete monitoring and management system can be built.
3.2 The Block Size cannot be modified. (False)
Analysis: The block size is set in Hadoop's basic configuration files, which can be modified. By default, when a Job is created, the Job's Config is built by first reading the configuration in hadoop-default.xml and then the configuration in hadoop-site.xml (a file that is initially empty). The settings in hadoop-site.xml override the system-level defaults from hadoop-default.xml.
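The same layered-override behavior survives in today's Configuration class (where the files are named core-default.xml and core-site.xml). A small sketch of how a per-job setting overrides the defaults:

    import org.apache.hadoop.conf.Configuration;

    public class ConfigOverrideDemo {
        public static void main(String[] args) {
            // new Configuration() loads *-default.xml first, then *-site.xml;
            // later resources override earlier ones, just as hadoop-site.xml
            // overrides hadoop-default.xml in older releases.
            Configuration conf = new Configuration();
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // per-job override
            System.out.println(conf.get("dfs.blocksize"));
        }
    }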
3.3 Nagios cannot monitor a Hadoop cluster because it does not provide Hadoop support. (False)
Analysis: Nagios is a cluster monitoring tool, one of the "big three" tools of cloud computing, and it can be used to monitor a Hadoop cluster.
3.4 If the NameNode terminates unexpectedly, the SecondaryNameNode will take its place and keep the cluster working. (False)
Analysis: The SecondaryNameNode exists to help with recovery, not to act as a replacement.
3.5 Cloudera CDH requires payment. (False)
Analysis: Only Cloudera Enterprise is a paid product. Cloudera unveiled Cloudera Enterprise at the Hadoop Summit held in California, USA; it bundles a number of proprietary management, monitoring, and operational tools that enhance Hadoop. Fees are set by contract, and the price varies with the size of the Hadoop cluster.
3.6 Hadoop is developed in Java, so MapReduce only supports writing jobs in Java. (False)
Analysis: RHadoop, for instance, is developed in the R language. MapReduce is a framework, and can be understood as a computational idea; jobs can be developed in other languages, for example through Hadoop Streaming.
3.7 Hadoop supports random reads and writes of data. (False)
Analysis: Lucene supports random reads and writes, whereas HDFS supports only random reads. HBase can make up for this: it provides random reads and writes to solve problems that Hadoop itself cannot handle. HBase has focused on scalability since its low-level design: tables can be "tall", with billions of rows; "wide", with millions of columns; and they can be horizontally partitioned and automatically replicated across thousands of commodity nodes. The table schema is a direct reflection of the physical storage, which lets the system efficiently serialize, store, and retrieve data.
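The read side of this asymmetry is visible in the HDFS API: FSDataInputStream supports seek() for random reads, while there is no corresponding seek-and-overwrite on the write path. A brief sketch (file path and offsets are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RandomReadDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataInputStream in = fs.open(new Path("/tmp/demo.txt"))) {
                in.seek(128);                  // random read: jump to any offset
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println("read " + n + " bytes");
            }
            // Writes, by contrast, are write-once/append-only; HBase layers
            // random writes on top with its own storage structures.
            fs.close();
        }
    }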
3.8 The NameNode is responsible for managing metadata. For every read or write request from the client, it reads the metadata information from disk or writes it to disk, and then returns feedback to the client. (False)
Analysis:
The NameNode does not need to read metadata from disk: all of it is held in memory. What resides on disk is only a serialized snapshot of that metadata, which is read only when the NameNode starts.
1) File writing
The Client initiates a file-write request to the NameNode.
Based on the file size and block configuration, the NameNode returns information about the DataNodes it manages to the Client.
The Client divides the file into multiple Blocks and writes them sequentially to each DataNode block according to the DataNode address information.
2) File reading
The Client initiates a file-read request to the NameNode.
3.9 The NameNode's local disk holds the Block location information. (Personally, I think this is correct; other views are welcome.)
Analysis: The DataNode is the basic unit of file storage. It stores Blocks in the local file system, keeps each Block's metadata, and periodically sends a report of all existing Blocks to the NameNode. When a file is read, the NameNode returns the information about the DataNodes that store it, and the Client then reads the file data from those DataNodes.
3.10 The DataNode maintains communication with the NameNode over a persistent (long) connection. (Disputed)
Opinions differ on this one. The following information is provided for reference; first, let's clarify the concepts:
(1) Long connection
The Client and Server first establish a communication connection; once established, the connection stays open, and messages are then sent and received over it. Because the connection is always there, this method is commonly used in point-to-point communication.
(2) Short connection
The Client and Server establish a new connection for each message exchange and disconnect as soon as the transaction completes. This method is often used for point-to-multipoint communication, such as many Clients connecting to one Server.
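To make the distinction concrete, here is a generic Java sketch of the two styles (host and port are hypothetical; this is not Hadoop's actual RPC code):

    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.net.Socket;

    public class ConnectionStylesDemo {
        // Long connection: open once, exchange many messages over it.
        static void longConnection() throws IOException {
            try (Socket s = new Socket("example-server", 9000)) {
                DataOutputStream out = new DataOutputStream(s.getOutputStream());
                for (int i = 0; i < 3; i++) {
                    out.writeUTF("heartbeat " + i);   // same connection reused
                }
            }
        }

        // Short connection: connect, send one message, disconnect immediately.
        static void shortConnection(String msg) throws IOException {
            try (Socket s = new Socket("example-server", 9000)) {
                new DataOutputStream(s.getOutputStream()).writeUTF(msg);
            } // socket closed right after the exchange
        }
    }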
This concludes "What are the common Hadoop interview questions?". Thank you for reading!