
What are the latest big data interview questions in 2021?


This article explains the latest big data interview questions of 2021. The explanations are simple, clear, and easy to learn and understand; follow along step by step to study them in depth.

1. Multiple choice questions

1.1. Which of the following programs is responsible for HDFS data storage?

a) NameNode

b) JobTracker

c) DataNode

d) SecondaryNameNode

e) TaskTracker

Answer C: DataNode

1.2. How many replicas of a block does HDFS keep by default?

a)3

b)2

c)1

d) Uncertain

Answer A: 3 by default.

1.3. Which of the following programs is usually started on the same node as the NameNode?

a)SecondaryNameNode

b)DataNode

c)TaskTracker

d) JobTracker

Answer D: JobTracker.

1.4. What is the default HDFS block size?

a)32MB

b)64MB

c)128MB

Answer: B (64 MB was the default in Hadoop 1.x; Hadoop 2.x and later default to 128 MB).

1.5. Which of the following is usually the main performance bottleneck in a cluster?

a)CPU

b) Networks

c) Disk IO

d) Memory

Answer C: disk I/O.

1.6. Which statement about the SecondaryNameNode is correct?

a) It is a hot standby for NameNode

b) It has no memory requirements

c) Its purpose is to help NameNode merge edit logs and reduce NameNode startup time

d) The SecondaryNameNode should be deployed on the same node as the NameNode

Answer C.

1.7. Which of the following can be used to manage a cluster?

a)Puppet

b)Pdsh

c)Cloudera Manager

d)Zookeeper

Answer ABD

1.8. Which of the following is true when a client uploads a file?

a) Data is passed to DataNode via NameNode

b) The Client divides the file into blocks and uploads them in turn.

c)Client only uploads data to a DataNode, and NameNode is responsible for Block replication.

Answer B: The client initiates a file-write request to the NameNode. Based on the file size and the block configuration, the NameNode returns information about the DataNodes it manages. The client then splits the file into blocks and, using the DataNode address information, writes them in sequence to the DataNodes.
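A minimal client-side sketch of this write path, assuming a reachable HDFS cluster and the standard Hadoop client library on the classpath (the path and payload are made up for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // create() sends the write request to the NameNode; the returned stream
        // then writes the blocks directly to the DataNodes the NameNode assigned.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            out.writeBytes("hello hdfs\n"); // data flows client -> DataNodes, not through the NameNode
        }
        fs.close();
    }
}
```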

1.9. Which of the following are modes in which Hadoop can run?

a) Stand-alone version

b) Pseudo-distributed

c) Distributed

Answer ABC. The stand-alone and pseudo-distributed modes are used only for learning.

2. Interview Questions

2.1. What is the core configuration of Hadoop?

The core configuration of Hadoop used to be done through two XML files: 1. hadoop-default.xml and 2. hadoop-site.xml. These files are in XML format, so each entry has attributes including a name and a value; however, these files no longer exist in current versions.

2.2. How should they be configured now?

Hadoop now has three configuration files: 1. core-site.xml, 2. hdfs-site.xml, and 3. mapred-site.xml. These files are stored in the conf/ subdirectory.
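A quick way to see what these files resolve to is to load them through the client's Configuration class. A minimal sketch, assuming a Hadoop 2.x-style client on the classpath (in 1.x the address key was fs.default.name rather than fs.defaultFS, and the dfs.* keys resolve only if hdfs-site.xml is also on the classpath):

```java
import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Configuration loads core-default.xml and core-site.xml from the classpath
        Configuration conf = new Configuration();
        System.out.println("fs.defaultFS    = " + conf.get("fs.defaultFS"));
        System.out.println("dfs.replication = " + conf.get("dfs.replication"));
        System.out.println("dfs.blocksize   = " + conf.get("dfs.blocksize"));
    }
}
```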

2.3. What is the use of the "jps" command?

This command checks whether the NameNode, DataNode, TaskTracker, and JobTracker processes are running properly.

2.4. How does MapReduce work? (See the WordCount sketch under 2.6 below.)

2.5. What is the HDFS storage mechanism?

Write process:

1. The client connects to the NameNode to request storing data.

2. The NameNode records the data's location information (metadata) and tells the client where to store it.

3. The client writes the data blocks (64 MB by default) to the DataNodes through the HDFS API.

4. The DataNodes replicate the data among themselves, and after replication completes they acknowledge to the client.

5. The client notifies the NameNode that the block has been stored.

6. The NameNode synchronizes the metadata into memory.

7. The process above repeats for each subsequent block.

Read process:

1. The client connects to the NameNode, reads the metadata, and finds where the data is stored.

2. The client reads the data concurrently through the HDFS API (see the sketch below).

3. The connection is closed.
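A matching read-path sketch (same assumptions as the write example in 1.8 above, reading the path made up there):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() consults the NameNode for the block locations (metadata),
        // then the stream pulls the blocks from the DataNodes directly.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/tmp/example.txt"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close(); // step 3: close the connection
    }
}
```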

2.6. Can you give a simple example of how MapReduce works?

The classic example is WordCount, as sketched below.
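A minimal WordCount sketch against the org.apache.hadoop.mapreduce API; the input and output paths come from the command line and are placeholders:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // combiner: same logic as the reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```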

2.7. Use MapReduce to implement the following requirement.

There are 10 folders, each containing 100,000 URLs. Find the top 100,000 URLs.

Answer: this is a top-k problem.

(You can also use a TreeMap: keep inserting entries until it holds 100,000, then for each additional entry delete the smallest; see the sketch below.)
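A sketch of the TreeMap idea in plain Java. The counting map, the URLs, and k are made-up inputs; in practice the counts would come from a MapReduce aggregation over the folders:

```java
import java.util.*;

public class TopK {
    // Keep only the k highest-count URLs, using a TreeMap keyed by count.
    static List<String> topK(Map<String, Long> counts, int k) {
        // TreeMap sorts ascending by count, so firstEntry() is the current minimum
        TreeMap<Long, List<String>> top = new TreeMap<>();
        int size = 0;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            top.computeIfAbsent(e.getValue(), c -> new ArrayList<>()).add(e.getKey());
            size++;
            if (size > k) { // over capacity: drop one URL with the smallest count
                Map.Entry<Long, List<String>> min = top.firstEntry();
                min.getValue().remove(min.getValue().size() - 1);
                if (min.getValue().isEmpty()) top.remove(min.getKey());
                size--;
            }
        }
        List<String> result = new ArrayList<>();
        for (List<String> urls : top.descendingMap().values()) result.addAll(urls);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("http://a.example", 5L);
        counts.put("http://b.example", 9L);
        counts.put("http://c.example", 1L);
        System.out.println(topK(counts, 2)); // [http://b.example, http://a.example]
    }
}
```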

2.8. What is the role of Combiner in hadoop?

A Combiner is an implementation of reduce that runs on the map side: it performs the computation on the map output locally, reducing the amount of data the map side sends over the network.

Its role is optimization.

However, a combiner can only be used when the reduce operation's input and output types are the same and applying it partially does not change the result (as with summing in WordCount); see the sketch below.
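Wiring a combiner is a single call on the Job. A sketch reusing the IntSumReducer from the WordCount example above (the setup method is hypothetical, not a fixed recipe):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    // Only safe when the operation is associative and commutative (e.g. sum, max)
    // and the reducer's input/output key-value types match the map output types.
    public static Job newJob() throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner demo");
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // runs on the map side
        job.setReducerClass(WordCount.IntSumReducer.class);  // runs on the reduce side
        return job;
    }
}
```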

2.9. Hadoop installation

2.10. Please list the Hadoop process names

2.11. Fix the following error

2.12. Write the following command

2.13. Briefly describe the Hadoop schedulers

2.14. List the languages in which you have developed MapReduce jobs

2.15. Write a program

2.16. Advantages and disadvantages of different languages

2.17. What are the ways Hive stores metadata, and what are their characteristics?

2.18. Combiner and Partitioner

2.19. Difference between Hive internal and external tables

Internal table: the data is loaded into the HDFS directory where Hive resides. When the table is dropped, both the metadata and the data files are deleted.

External table: the data is not loaded into the HDFS directory where Hive resides. When the table is dropped, only the table structure (metadata) is deleted.

2.20. How should an HBase row key be created? How should column families be created?

When HBase stores data, it sorts it by the lexicographic order of the row key, so order keys appropriately when designing them.

Exploit this property by storing rows that are often read together next to each other (locality). A column family corresponds to a file in the underlying storage, so put columns that are often queried together into the same column family, and use as few column families as possible to reduce file addressing time. A sketch follows.
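A hedged sketch of both rules with the HBase client API. The "userId + reversed timestamp" key scheme and the one-letter family name are illustrative design choices, not anything HBase prescribes:

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyDesign {
    // Rows sort lexicographically by key, so "userId + reversed timestamp" keeps
    // all events of one user adjacent, with the newest event first.
    static byte[] rowKey(String userId, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        return Bytes.add(Bytes.toBytes(userId + "#"), Bytes.toBytes(reversed));
    }

    public static void main(String[] args) {
        Put put = new Put(rowKey("user42", System.currentTimeMillis()));
        // One small column family ("d") holding the columns queried together.
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("url"),
                Bytes.toBytes("http://a.example"));
        System.out.println(put);
    }
}
```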

2.21. How do you deal with data skew in MapReduce?

2.22. How can the Hadoop framework be optimized?

2.23. When we develop jobs, can we remove the reduce phase?

Yes. Set the number of reduce tasks to 0, which makes it a map-only job; see the snippet below.
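In the Java API this is one call; a minimal sketch using Job.getInstance as in the earlier examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJob {
    public static Job newJob() throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only");
        // 0 reducers: the shuffle/sort and reduce phases are skipped, and each
        // mapper's output is written straight to the output directory.
        job.setNumReduceTasks(0);
        return job;
    }
}
```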

2.24. Under what circumstances will a DataNode's data not be replicated?

A DataNode's data is not replicated when the cluster is forcibly shut down or loses power abnormally.

2.25. In which phase does the Combiner appear?

It appears in the map phase, after the map method has produced its output.

2.26. HDFS architecture

HDFS consists of a NameNode, a SecondaryNameNode, and DataNodes.

It is an n + 1 model: n DataNodes plus one NameNode.

The NameNode manages the DataNodes and records the metadata.

The SecondaryNameNode is responsible for merging the edit logs.

DataNode stores data

2.27. What happens if one of the three DataNodes holding a block fails?

The data on that DataNode will be replicated again onto other DataNodes.

2.28. Describe where the cache mechanism is used in Hadoop and what its function is.

After a MapReduce job is submitted and assigned its job id, files placed in the distributed cache are copied to the task nodes so they can be shared by all map and reduce tasks; see the sketch below.
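A sketch of registering a cache file through the Hadoop 2.x Job API (the HDFS path is a placeholder); inside a task the registered files are available via context.getCacheFiles():

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetup {
    public static Job newJob() throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache demo");
        // The file is copied once to every task node before the tasks start,
        // so all mappers and reducers can read it locally.
        job.addCacheFile(new URI("hdfs:///shared/lookup.txt"));
        return job;
    }
}
```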

2.29. How do you determine the health of a Hadoop cluster?

By monitoring the daemons' web pages and by monitoring scripts.

2.30. Why are external tables recommended in a production environment?

1. Because external tables do not load the data into Hive's directory, data transfer is reduced and the data can be shared with other tools.

2. Hive does not modify the underlying data, so there is no need to worry about data corruption.

3. When the table is dropped, only the table structure is deleted; the data is kept.

Thank you for reading. The above is the content of "What are the latest big data interview questions in 2021?". After studying this article, you should have a deeper understanding of the latest 2021 big data interview questions; their specific use still needs to be verified in practice.
