
What are the Hadoop interview questions and answers?


This article introduces a set of common Hadoop interview questions and answers. Many people run into these topics in real projects, so let the editor walk you through how to handle them. I hope you read it carefully and take something away from it!

What is Hadoop?

Hadoop is an open-source software framework for storing large amounts of data and processing / querying that data concurrently on clusters built from many commodity-hardware (that is, low-cost hardware) nodes. In summary, Hadoop includes the following:

HDFS (Hadoop Distributed File System): HDFS lets you store large amounts of data in a distributed and redundant way. For example, a 1 GB (that is, 1024 MB) text file can be split into 8 blocks of 128 MB each and stored on eight different nodes in the Hadoop cluster. Each block can be replicated three times for fault tolerance, so that if one node fails there is still a backup. HDFS is suited to sequential "write once, read many" access.

MapReduce: a computing framework. It processes large amounts of data in a distributed and parallel manner. When you run a query over the 1 GB file above for all users over 18 years old, eight map tasks run in parallel, each extracting the users older than 18 from its own 128 MB split, and then a reduce function combines the individual outputs into a single final result (a minimal sketch of this map / reduce split appears after this list).

YARN (Yet Another Resource Negotiator): a framework for job scheduling and cluster resource management.

A Hadoop ecosystem of more than 15 frameworks and tools, such as Sqoop, Flume, Kafka, Pig, Hive, Spark and Impala, used to feed data into HDFS, to work on data inside HDFS (i.e., transform, enrich, aggregate, etc.), and to query data from HDFS for business intelligence and analytics.

Some tools, such as Pig and Hive, are abstraction layers on top of MapReduce, while others, such as Spark and Impala, are improved architectures / designs beyond MapReduce that significantly reduce latency to support near-real-time (i.e., NRT) and real-time processing.
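To make the map / reduce split above concrete, here is a minimal sketch of the over-18 query in Java. It is only an illustration: the CSV field layout (age in the third column) and class names such as Over18Count are assumptions, not anything fixed by the question.

// A rough sketch of the query described above, assuming the 1 GB file holds
// lines like "userId,name,age" (field layout and class names are illustrative).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Over18Count {

    // Each map task scans its own 128 MB split and emits a 1 for every user over 18.
    public static class AgeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text OVER_18 = new Text("over18");
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length >= 3 && Integer.parseInt(fields[2].trim()) > 18) {
                context.write(OVER_18, ONE);
            }
        }
    }

    // The reduce function combines the per-split counts into one final result.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            context.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "over-18 count");
        job.setJarByClass(Over18Count.class);
        job.setMapperClass(AgeMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}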

What are the benefits of a Hadoop-based data center?

As data volume and complexity grow, the overall SLAs (service level agreements) can still be met, thanks, for example, to the shared-nothing architecture, parallel processing, memory-intensive processing frameworks such as Spark and Impala, and resource preemption in the YARN capacity scheduler.

1. Scaling a data warehouse can be expensive: adding extra high-end hardware capacity and licensing data warehouse tools can increase costs significantly. Hadoop-based solutions are not only cheaper, thanks to commodity hardware nodes and open-source tools, but can also complement the data warehouse by offloading data transformations to Hadoop tools such as Spark and Impala, giving more efficient parallel processing of big data. This also frees up data warehouse resources.

2. Explore new channels and leads: Hadoop can give data scientists an exploratory sandbox in which to discover potentially valuable data from social media, log files, e-mails, etc., data that is often not available in the data warehouse.

3. Better flexibility: changes in business requirements usually also require changes in architecture and reporting. Hadoop-based solutions provide the flexibility to handle not only evolving schemas but also semi-structured and unstructured data from different sources, such as social media, application log files, images, PDFs and documents.

4. Briefly describe how to install and configure an open-source (Apache) version of Hadoop. A description is enough and you do not need to list every step, though listing the steps is better.

1. Install the JDK and configure the environment variables (/etc/profile).

2. Turn off the firewall.

3. Configure the hosts file so that Hadoop nodes can be reached by hostname (/etc/hosts).

4. Set up passwordless SSH login.

5. Extract the Hadoop installation package and configure the environment variables.

6. Modify the configuration files under $HADOOP_HOME/conf: hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml.

7. Format the HDFS file system (hadoop namenode -format).

8. Start Hadoop ($HADOOP_HOME/bin/start-all.sh).

9. Use jps to check that the processes are running (a small verification sketch follows this list).
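Beyond jps, one quick way to confirm that HDFS actually came up is to list its root directory through the FileSystem API. This is a minimal sketch under the assumption that the cluster's core-site.xml and hdfs-site.xml are on the classpath; the class name HdfsSmokeTest is just an example.

// Minimal HDFS smoke test, assuming $HADOOP_HOME/conf (core-site.xml, hdfs-site.xml)
// is on the classpath so that Configuration picks up the cluster settings.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the NameNode
        System.out.println("Connected to " + fs.getUri());
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());   // list the HDFS root directory
        }
    }
}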

5. In a working Hadoop cluster, please list which processes Hadoop needs to start and what their roles are; be as comprehensive as possible.

1. NameNode: the HDFS daemon. It records how files are split into data blocks and which DataNodes store those blocks, and it centrally manages the file system metadata (held in memory) and I/O.

2. Secondary NameNode: an auxiliary daemon that communicates with the NameNode and periodically saves snapshots of the HDFS metadata.

3. DataNode: responsible for reading and writing HDFS blocks on the local file system.

4. JobTracker: responsible for assigning tasks and monitoring all running tasks.

5. TaskTracker: responsible for executing individual tasks and interacting with the JobTracker.

6. Please list the Hadoop schedulers you know and briefly describe how they work.

Three popular schedulers are: the default FIFO scheduler, the Capacity Scheduler, and the Fair Scheduler.

1. FIFO scheduler: Hadoop's default scheduler, which runs jobs on a first-in, first-out basis.

2. Capacity Scheduler: jobs with smaller resource consumption and higher priority are selected to run first.

3. Fair Scheduler: jobs in the same queue share the queue's resources fairly.

7. In which ways can Hive store its metadata, and what are the characteristics of each?

1. The embedded Derby database: small, and not commonly used.

2. A local MySQL database: more commonly used.

3. A remote MySQL database: not commonly used.

8. Briefly describe several ways to implement a join in Hadoop.

1. Reduce-side join

Reduce-side join is the simplest way to implement a join. Its main idea is as follows:

In the map phase, the map function reads the two files File1 and File2 at the same time. To distinguish the key/value pairs coming from the two sources, each record is labeled with a tag, for example tag=0 meaning it comes from File1 and tag=2 meaning it comes from File2. In other words, the main task of the map phase is to tag the data from the different files.

In the reduce phase, the reduce function receives, for each key, the value list gathered from both File1 and File2, and then joins the File1 data with the File2 data for that key (a Cartesian product). That is, the actual join is performed in the reduce phase.
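Below is a hedged sketch of this tagging approach. The driver wiring (setting the mapper, reducer and input paths) is omitted, and the assumption that the join key is the first comma-separated field, as well as class names such as ReduceSideJoin, are illustrative only.

// Illustrative reduce-side join: tag records by source file in map(),
// then join the two tagged groups per key in reduce().
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {

    public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
        private String tag;

        @Override
        protected void setup(Context context) {
            // Tag by source file, mirroring the tag=0 / tag=2 idea described above.
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            tag = fileName.startsWith("File1") ? "0" : "2";
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String record = line.toString();
            String joinKey = record.split(",", 2)[0];        // assumed: key is the first field
            context.write(new Text(joinKey), new Text(tag + "\t" + record));
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> fromFile1 = new ArrayList<>();
            List<String> fromFile2 = new ArrayList<>();
            for (Text value : values) {
                String[] parts = value.toString().split("\t", 2);
                (parts[0].equals("0") ? fromFile1 : fromFile2).add(parts[1]);
            }
            // Cartesian product of the two sides for this key = the actual join.
            for (String left : fromFile1) {
                for (String right : fromFile2) {
                    context.write(key, new Text(left + " | " + right));
                }
            }
        }
    }
}

The driver would simply add both File1 and File2 as input paths so that the same TagMapper sees records from both files.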

2. Map-side join

Reduce-side join exists because you cannot obtain all the required join fields during the map phase: the fields belonging to the same key may sit in different map tasks. Reduce-side join is very inefficient because a large amount of data has to be transferred during the shuffle phase.

Map-side join is an optimization for the scenario where one of the two tables to be joined is very large while the other is small enough to fit entirely in memory. In that case we can replicate the small table so that each map task holds a copy in memory (for example in a hash table), and then scan only the large table: for each key/value record in the large table, look the key up in the hash table, and if a match is found, emit the joined record.

To support this file replication, Hadoop provides the DistributedCache class, which can be used as follows:

(1) The user calls the static method DistributedCache.addCacheFile() to specify the file to be copied; its argument is the URI of the file (for a file on HDFS this looks like hdfs://namenode:9000/home/XXX/file, where 9000 is your configured NameNode port). The JobTracker obtains this list of URIs before the job starts and copies the corresponding files to the local disk of each TaskTracker.

(2) The user calls DistributedCache.getLocalCacheFiles() to obtain the local file paths and then reads the files with the standard file I/O APIs.
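Here is a hedged sketch of the pattern using exactly those two calls. The HDFS path, the "key,value" layout of the small table, and the class name MapSideJoinMapper are assumptions; note also that DistributedCache has since been deprecated in favor of Job.addCacheFile() and context.getCacheFiles(), but the older calls match the description above.

// Illustrative map-side join built on DistributedCache.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> smallTable = new HashMap<>();

    // In the driver, before submitting the job:
    //   DistributedCache.addCacheFile(
    //       new java.net.URI("hdfs://namenode:9000/home/XXX/small_table"),
    //       job.getConfiguration());

    @Override
    protected void setup(Context context) throws IOException {
        // Each map task loads its local copy of the small table into memory.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",", 2);            // assumed "key,value" layout
                smallTable.put(parts[0], parts.length > 1 ? parts[1] : "");
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Scan only the large table and probe the in-memory hash table for each record.
        String record = line.toString();
        String joinKey = record.split(",", 2)[0];
        String match = smallTable.get(joinKey);
        if (match != null) {
            context.write(new Text(joinKey), new Text(record + " | " + match));
        }
    }
}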

3. Semi-join (SemiJoin)

SemiJoin, i.e. semi-join, is a technique borrowed from distributed databases. Its motivation: in a reduce-side join, the amount of data transferred between machines is very large and becomes the bottleneck of the join; if the data that will not participate in the join can be filtered out on the map side, a great deal of network I/O can be saved.

The implementation is simple: take the small table, say File1, extract the keys that will participate in the join, and save them to a file File3. File3 is usually very small and fits in memory. In the map phase, use DistributedCache to copy File3 to every TaskTracker and filter out the File2 records whose keys are not in File3; the rest, including the reduce phase, works the same way as a reduce-side join.
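A hedged sketch of that map-phase filter follows; File3 is assumed to be distributed the same way as in the previous example (one key per line), and the class name SemiJoinFilterMapper is illustrative.

// Illustrative semi-join filter: only records whose join key appears in File3
// (the key list extracted from File1) are passed on to the reduce-side join.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SemiJoinFilterMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Set<String> joinKeys = new HashSet<>();

    @Override
    protected void setup(Context context) throws IOException {
        // File3 was added to the DistributedCache by the driver, one join key per line.
        Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
        try (BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()))) {
            String key;
            while ((key = reader.readLine()) != null) {
                joinKeys.add(key.trim());
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String record = line.toString();
        String joinKey = record.split(",", 2)[0];
        if (joinKeys.contains(joinKey)) {           // drop records that cannot join
            context.write(new Text(joinKey), new Text(record));
        }
    }
}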

4. Reduce-side join + BloomFilter

In some cases, even the key set extracted from the small table for the semi-join cannot fit in memory; a BloomFilter can then be used to save space.

The most common use of a BloomFilter is to test whether an element belongs to a set, and its two most important operations are add() and contains(). Its key property is that it produces no false negatives: if contains() returns false, the element is definitely not in the set. It does, however, have a certain false positive rate: if contains() returns true, the element may or may not be in the set.

So you can store the keys of the small table in a BloomFilter and use it to filter the large table in the map phase. A few records whose keys are not actually in the small table may slip through (but no record whose key is in the small table will be filtered out); that is acceptable, as it only adds a small amount of extra network I/O.
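For illustration, Hadoop ships a Bloom filter in the org.apache.hadoop.util.bloom package; its membership check is called membershipTest() rather than contains(), but the behavior is exactly as described above. A minimal sketch, with arbitrary example sizing values:

// Minimal Bloom filter sketch using Hadoop's org.apache.hadoop.util.bloom classes;
// the vector size and hash count below are arbitrary example values, not tuned figures.
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class SmallTableBloomFilter {
    public static void main(String[] args) {
        // Build the filter from the small table's join keys (here, a toy list).
        BloomFilter filter = new BloomFilter(100000, 5, Hash.MURMUR_HASH);
        for (String joinKey : new String[] {"u001", "u002", "u003"}) {
            filter.add(new Key(joinKey.getBytes(StandardCharsets.UTF_8)));
        }

        // In the map phase, probe the filter before emitting a large-table record.
        Key probe = new Key("u002".getBytes(StandardCharsets.UTF_8));
        if (filter.membershipTest(probe)) {
            // The key is probably in the small table (false positives are possible),
            // so the record is kept and the real join finishes on the reduce side.
            System.out.println("keep record for key u002");
        }
        // If membershipTest() returns false, the key is definitely not in the small
        // table (no false negatives), so the record can safely be dropped in map().
    }
}

Since this BloomFilter implements Writable, a filter built from the small table in one job can be written out and shipped to the map tasks through the DistributedCache, just like File3 in the semi-join case.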

This concludes "What are the Hadoop interview questions and answers?". Thank you for reading. If you want to learn more about the industry, you can follow the site; the editor will keep publishing practical, high-quality articles for you!
