What are the Hadoop big data interview questions?


This article explains in detail the common Hadoop big data interview questions. The content is practical, so it is shared here for reference; I hope you get something out of it after reading.

1. Briefly describe how to install and configure an open-source Apache Hadoop cluster. A high-level outline is enough; there is no need to list every concrete step.

1. Log in with the root account
2. Modify the IP address
3. Modify the hostname
4. Configure passwordless SSH login
5. Turn off the firewall
6. Install the JDK
7. Extract the Hadoop installation package
8. Configure Hadoop's core files: hadoop-env.sh, core-site.xml, mapred-site.xml, hdfs-site.xml
9. Configure the Hadoop environment variables
10. Format the NameNode: hadoop namenode -format
11. Start the cluster: start-all.sh

2. In a normally running Hadoop cluster, which daemons need to be started and what are their roles? List them in as much detail as possible.

Answer: NameNode: manages the metadata of files and blocks in HDFS, responds to client requests, balances file blocks across DataNodes, and maintains the number of replicas.

SecondaryNameNode: mainly performs the checkpoint operation; it can also serve as a cold backup and take snapshot backups of data within a certain range.

DataNode: stores data blocks and serves client I/O requests for those blocks.

JobTracker: manages jobs and assigns tasks to TaskTrackers.

TaskTracker: executes the tasks assigned by the JobTracker.

ResourceManager: in a YARN cluster, allocates and schedules cluster resources among applications.

NodeManager: manages the containers and resource usage on a single node and reports to the ResourceManager.

JournalNode: in an HA setup, stores the shared edit log used by the active and standby NameNodes.

ZooKeeper: coordinates the cluster and supports the NameNode HA election.

ZKFC (ZKFailoverController): monitors NameNode health and triggers automatic failover via ZooKeeper.

3. Write the shell commands for the following tasks:

(1) Kill a job.

(2) Delete the /tmp/aaa directory on HDFS.

(3) Add a new storage node, and give the commands to run when decommissioning a node.

Answer: (1) hadoop job -list lists the job IDs; then hadoop job -kill jobId kills the job with the given ID.

(2) hadoop fs -rmr /tmp/aaa

(3) To add a new node, execute on the new node:

hadoop-daemon.sh start datanode

hadoop-daemon.sh start tasktracker (or nodemanager under YARN)

To decommission a DataNode, list its hostname in the excludes file in the conf directory, then execute on the master node:

hadoop dfsadmin -refreshNodes (takes the DataNode offline)

To decommission a TaskTracker/NodeManager, you likewise only need to execute on the master node:

hadoop mradmin -refreshNodes

4. List the Hadoop schedulers you know and briefly describe how they work.

Answer: FIFO Scheduler: the default; jobs are executed on a first-in, first-out basis.

Capacity Scheduler: the capacity scheduler; it prefers the queue with the lowest resource usage and the higher-priority jobs within it, and so on.

Fair Scheduler: fair scheduling; all jobs get a fair share of the resources.

5. List the languages you have used to develop MapReduce in your work.

Answer: Java, Hive (HiveQL), and Hadoop Streaming (Python, C++).

6. The current log sample format is:

A, b, c, d

B, b, f, e

A, a, c, f

Write a MapReduce program in the language you are most familiar with that counts the occurrences of the values in the fourth column.

Answer:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount1 {

    public static final String INPUT_PATH = "hdfs://hadoop0:9000/in";
    public static final String OUT_PATH = "hdfs://hadoop0:9000/out";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);
        if (fileSystem.exists(new Path(OUT_PATH))) {
            fileSystem.delete(new Path(OUT_PATH), true);
        }
        Job job = new Job(conf, WordCount1.class.getSimpleName());

        // 1. Read the input files and parse them into key/value pairs
        FileInputFormat.setInputPaths(job, new Path(INPUT_PATH));

        // 2. Apply our own map logic: turn each input value into new key/value pairs
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // 3. Partition the map output
        // 4. Sort and group the partitioned data; values with the same key form one group
        // 5. (Optional) combine the grouped data on the map side
        // 6. Copy the map output to the reduce nodes over the network (shuffle)

        // 7. Apply our own reduce logic to the data coming from the map side
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // 8. Write out the result
        FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
        job.waitForCompletion(true);
    }

    static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable k1, Text v1, Context context)
                throws IOException, InterruptedException {
            // Emit the fourth column of each log line with a count of 1
            String[] split = v1.toString().split("\t");
            context.write(new Text(split[3]), new LongWritable(1));
        }
    }

    static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text k2, Iterable<LongWritable> v2, Context context)
                throws IOException, InterruptedException {
            // Sum the counts for each distinct value of the fourth column
            long count = 0L;
            for (LongWritable times : v2) {
                count += times.get();
            }
            context.write(k2, new LongWritable(count));
        }
    }
}

7. What do you think are the respective advantages of developing MapReduce with Java, Streaming, and Pipes?

I have only used Java and HiveQL.

Java can express complex logic in MapReduce, but it is tedious when the requirement is simple.

HiveQL is written mainly against table data in Hive; it is easy to write but hard to use for complex logic.

8. In what ways can Hive store its metadata, and what are their characteristics?

Three ways: the embedded Derby database, which is very small, only supports a single session, and is rarely used;

MySQL, which is the common choice;

The formal terms are embedded (single-user) mode, local (multi-user) mode, and remote mode.

9. Briefly describe how Hadoop implements secondary sort (i.e., sorting on both key and value).

The first method is for the Reducer to buffer all the values of a given key and then sort them inside the Reducer. Because the Reducer must hold every value of that key, this can exhaust memory.

The second method is to append part or all of the value to the original key to form a composite key. Both methods have trade-offs: the first is easy to write, but it is slow (and risks running out of memory) when a key carries a large amount of data, even at low concurrency.

The second method hands the sorting work to the MapReduce shuffle, which is more in line with the Hadoop MapReduce design, so it is the one chosen here. We write a Partitioner that sends all records with the same original key (excluding the appended part) to the same Reducer, and a grouping Comparator so that data arriving at the Reducer is grouped by the original key.
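As a minimal sketch of this composite-key approach, assume the map output key is a Text of the form "originalKey\tvalue"; the class and method names below are illustrative, not from the original answer. The partitioner hashes only the original key, and the grouping comparator groups reducer input by the original key, so the shuffle itself puts the values in order:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortHelpers {

    // Partition on the original key only, so every composite key that shares it
    // lands on the same reducer.
    public static class NaturalKeyPartitioner extends Partitioner<Text, LongWritable> {
        @Override
        public int getPartition(Text compositeKey, LongWritable value, int numPartitions) {
            String naturalKey = compositeKey.toString().split("\t")[0];
            return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group reducer input by the original key only; within a group the records are
    // already sorted by the full composite key.
    public static class NaturalKeyGroupingComparator extends WritableComparator {
        public NaturalKeyGroupingComparator() {
            super(Text.class, true);
        }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String keyA = a.toString().split("\t")[0];
            String keyB = b.toString().split("\t")[0];
            return keyA.compareTo(keyB);
        }
    }
}

They would be wired in with job.setPartitionerClass(SecondarySortHelpers.NaturalKeyPartitioner.class) and job.setGroupingComparatorClass(SecondarySortHelpers.NaturalKeyGroupingComparator.class).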

10. Briefly describe several ways Hadoop can implement a join.

Map-side join: suited to joining a small table with a large table; the small table can be shipped via the distributed cache (see the sketch after this list).

Reduce-side join: records from both tables are tagged in the map phase and joined by key in the reduce phase.
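A minimal map-side join sketch follows, assuming a small dimension table on HDFS at the hypothetical path /dim/users.txt (tab-separated id and name) that fits in memory; in practice the file would usually be distributed via the distributed cache. Each mapper loads the small table once in setup() and joins it against the large table's records, so no reduce phase is needed:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private static final String SMALL_TABLE_PATH = "/dim/users.txt"; // illustrative path
    private final Map<String, String> userById = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Load the small table once per map task and keep it in memory.
        FileSystem fs = FileSystem.get(context.getConfiguration());
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(SMALL_TABLE_PATH))));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split("\t");
            userById.put(fields[0], fields[1]); // id -> name
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        String name = userById.get(fields[0]);
        if (name != null) {
            // Emit the joined record; real code would decide how to handle missing keys.
            context.write(new Text(value.toString() + "\t" + name), NullWritable.get());
        }
    }
}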

11. Implement a non-recursive binary search in Java.
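Question 11 is left without an answer in the original; here is one iterative (non-recursive) binary search over a sorted int array, returning the index of the target or -1 when it is absent:

public class BinarySearch {
    public static int search(int[] sorted, int target) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2; // avoids overflow of (low + high)
            if (sorted[mid] == target) {
                return mid;
            } else if (sorted[mid] < target) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return -1; // not found
    }
}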

12. Briefly describe the roles of the combiner and the partitioner in MapReduce.

Answer: The combiner runs at the end of the map phase and is essentially a small reducer. Its main purpose is to reduce the amount of data sent to the reduce side, easing the network-transfer bottleneck and improving the efficiency of the reducers.

The partitioner distributes the key/value pairs produced in the map phase across the different reduce tasks, so that the processing load of the reduce phase is shared.
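A minimal sketch of how the two are wired into a job. LongSumReducer is a stock Hadoop reducer used here as the combiner only because summing LongWritable counts is associative and commutative; HashPartitioner is simply the default partitioner shown explicitly:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class CombinerPartitionerWiring {
    public static void configure(Job job) {
        // Combiner: a local "mini reduce" applied to map output before the shuffle.
        // Safe here because the map output values are LongWritable counts being summed.
        job.setCombinerClass(LongSumReducer.class);
        // Partitioner: decides which reduce task receives each key (hash of the key by default).
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(4); // the number of partitions equals the number of reduce tasks
    }
}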

13. Differences between internal and external tables in Hive.

When Hive loads data into an internal (managed) table, it moves the data into the directory the data warehouse points to; for an external table, the storage location of the data is specified by the user when the table is created.

When an internal table is dropped, its metadata and data are deleted together.

Dropping an external table deletes only the metadata, not the data.

External tables are therefore relatively safer, allow more flexible data organization, and make it easy to share the source data.

14. How should an HBase rowkey be designed? How should column families be designed?

Answer:

A rowkey is best designed with a regular pattern, i.e., preferably ordered.

Data that often needs to be read in bulk should have contiguous rowkeys.

Fields that are frequently used as query conditions should be built into the rowkey.

Column family design:

Classify the data according to business characteristics and place different categories in different column families.

15. How do you handle data skew in MapReduce?

Essence: make the data of each partition evenly distributed.

You can design an appropriate partitioning strategy based on the characteristics of the business data.

If the data distribution is not known in advance, use a random sampler to sample the data, generate the partitioning strategy from the sample, and then reprocess; a sketch follows.
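A minimal sketch of the sampling approach, assuming the job's map output keys are Text and of the same type as the sampled input keys (for example a SequenceFile already keyed by the partitioning key); the partition-file path is illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class SamplingPartitionSetup {
    public static void configure(Job job) throws Exception {
        // Route keys to reducers using boundaries learned from a random sample of the input.
        job.setPartitionerClass(TotalOrderPartitioner.class);
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
                new Path("/tmp/partitions.lst"));
        // Sample about 1% of the keys, at most 10,000 samples from at most 100 splits.
        InputSampler.RandomSampler<Text, Text> sampler =
                new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 100);
        InputSampler.writePartitionFile(job, sampler);
    }
}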

16. How do you optimize the Hadoop framework?

This can be approached from many angles: for example, how to optimize HDFS, how to optimize a MapReduce program, how to optimize YARN's job scheduling, how to optimize HBase, and how to optimize Hive.

17. What are the internal mechanisms of HBase?

HBase is a database system that can serve online business.

Physical storage: HBase persists its data on HDFS.

Storage management: a table is split into many regions, and these regions are distributed across many RegionServers.

A region is further divided into stores; a store consists of a MemStore and StoreFiles.

Version management: updates in HBase essentially add new versions, and files are merged across versions through compaction.

Region splitting.

Cluster management: ZooKeeper + HMaster (and its responsibilities) + HRegionServer (and its responsibilities).

18. Can we drop the reduce phase when developing a distributed computing job?

Answer: Yes. For example, if the cluster is designed purely to store files and no computation over the data is involved, the reduce phase of MapReduce can be omitted.

For example, the behavior-trajectory enhancement part of a traffic-operations project.

How do you drop the reduce phase?

Once it is dropped, there is no sorting and no shuffle operation.
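A minimal sketch of how it is dropped: setting the number of reduce tasks to zero makes each map task write its output straight to HDFS, and the framework skips sort and shuffle entirely.

import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJobSetup {
    public static void configure(Job job) {
        // Zero reduce tasks: map output goes directly to the output directory,
        // with no partitioning, sorting, or shuffling.
        job.setNumReduceTasks(0);
    }
}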

19. Data compression algorithms commonly used in Hadoop.

LZO

Gzip

Default (DEFLATE)

Snappy

If you want to compress the data, it is best to first convert the raw data to SequenceFile or Parquet files (for Spark).
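A minimal sketch of enabling compressed output as a block-compressed SequenceFile; Gzip is used here only because its codec ships with Hadoop, while LZO and Snappy require their codec libraries to be installed:

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressedOutputSetup {
    public static void configure(Job job) {
        // Write the job output as a SequenceFile with block-level compression.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
    }
}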

20. The scheduling model of MapReduce (an ambiguous question; it can be understood as YARN's scheduling model or as the internal workflow of MapReduce).

Answer: The AppMaster acts as the scheduling director, managing the map tasks and reduce tasks.

The AppMaster is responsible for starting and monitoring the map tasks and reduce tasks.

When a map task finishes, the AppMaster detects it and notifies the reduce tasks of its output; the reduce tasks then pull the files from the map side and process them.

When the reduce phase completes, the AppMaster unregisters itself from the ResourceManager.

21. How does Hive interact with the underlying storage and database?

Hive's query capability is implemented by combining HDFS and MapReduce.

The relationship between Hive and MySQL: Hive simply borrows MySQL to store the metadata of its tables, in a component called the metastore.

22. How do HBase filters work?

You can talk about the filter parent classes (comparison filters and special-purpose filters).

The purpose of filters:

Enhance the ability to query data in HBase.

Reduce the amount of data the server returns to the client.
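A minimal sketch of a filtered scan using the HBase 1.x-style client API; the table, column family ("info"), qualifier ("status"), and value are illustrative. Because the filter is evaluated on the RegionServers, only matching rows travel back to the client:

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterScanExample {
    public static void scanActiveUsers(Table table) throws Exception {
        // Keep only rows whose info:status column equals "active".
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("info"), Bytes.toBytes("status"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes("active"));
        Scan scan = new Scan();
        scan.setFilter(filter);
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            System.out.println(Bytes.toString(result.getRow()));
        }
        scanner.close();
    }
}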

23. How does the amount of data output after reduce compare with the input (discuss with specific scenarios, such as computing pi)?

Enhancement-type jobs: e.g., log enhancement, where the output grows (roughly 1.5 TB in, 2 TB out).

Filter-type MapReduce jobs: the output is smaller than the input.

Analysis-type MapReduce jobs: the output can be larger than the input (e.g., finding common friends).

24. On-site problem solving tests your mastery of MapReduce and of the HiveQL language.

25. Under what circumstances does a DataNode not replicate data?

Answer: When the client uploads the file with the number of replicas set to 1.

26. In which phase does the combiner run?

Answer: During the shuffle.

Specifically, it may be invoked multiple times as the map task's output spills from memory to disk.

Use the combiner with special care: it must not affect the final logical result.

27. The architecture of HDFS.

Answer:

Cluster architecture:

NameNode, DataNode, SecondaryNameNode

(in HA mode: active NameNode, standby NameNode, JournalNodes, ZKFC)

Internal working mechanism:

Data is stored in a distributed way.

A unified directory tree is exposed to the outside world.

A dedicated responder (the NameNode) handles requests.

Block mechanism and replica mechanism.

Responsibilities and mechanisms of the NameNode and DataNodes.

Read and write data flows.

28. The flush process.

Answer: Flushing is based on memory: when a file is written, it is first written to memory; when the memory buffer is full, its contents are written to disk in one pass and the cached data is cleared.

29. What is a queue?

Answer: It is a scheduling structure whose mechanism is first in, first out.

30. The difference between List and Set.

Answer: List and Set are both interfaces. Each has its own implementation classes, both ordered and unordered.

The biggest difference is that a List allows duplicate elements while a Set does not.

A List suits frequent appends, inserts, and deletions, but random access can be relatively slow.

A Set suits lookups, inserts, and deletions, but traversal is relatively slow.
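A small illustration of the duplicate-handling difference: the List keeps both "a" entries, while the Set silently drops the second one.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ListVsSet {
    public static void main(String[] args) {
        List<String> list = new ArrayList<String>();
        list.add("a");
        list.add("a");
        Set<String> set = new HashSet<String>();
        set.add("a");
        set.add("a");
        System.out.println(list.size()); // 2: a List allows duplicates and keeps insertion order
        System.out.println(set.size());  // 1: a Set rejects duplicates
    }
}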

31. The three normal forms of data.

Answer:

First normal form (1NF): no repeating columns.

Second normal form (2NF): non-key attributes fully depend on the primary key (eliminates partial functional dependencies).

Third normal form (3NF): attributes do not depend on other non-key attributes (eliminates transitive dependencies).

32. What happens when one of three DataNodes has an error?

Answer:

The NameNode detects through the heartbeat mechanism that the DataNode is offline.

New replicas of the blocks on that DataNode are created elsewhere in the cluster to restore the file's replication factor.

The operations team is alerted to respond quickly, inspect and repair the offline DataNode, and bring it back online.

33. When Sqoop imports data into MySQL, how do you avoid importing duplicate data? If there is a data problem, how does Sqoop handle it?

Answer: a typical error is FAILED java.util.NoSuchElementException

The reason for this error is that the fields Sqoop parses from the file do not match the fields of the MySQL table. You therefore need to pass Sqoop the file's field delimiter as a parameter so that it can parse the file's fields correctly.

Hive's default field delimiter is '\001'.

34. Where are caching mechanisms used in Hadoop, and what are they for?

Answer:

In the shuffle phase.

In HBase, on both the client and the RegionServer.

35. MapReduce optimization experience.

Answer: (1) Set reasonable numbers of map and reduce tasks, and set a reasonable block size.

(2) Avoid data skew.

(3) Use a combine function.

(4) Compress the data.

(5) Optimize small-file handling: merge small files into large files in advance, use CombineTextInputFormat (a sketch follows this list), and on HDFS merge small files into a large SequenceFile with a MapReduce job (key: file name, value: file content).

(6) Tune parameters.
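A minimal sketch of the input-side small-file optimization from point (5): CombineTextInputFormat packs many small files into fewer, larger splits so the job does not launch one map task per tiny file; the 128 MB split ceiling is illustrative.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFileInputSetup {
    public static void configure(Job job) {
        // Combine many small input files into larger splits.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at roughly one HDFS block (128 MB here).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);
    }
}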

36. List the files under /etc/ that you have modified, and explain what problem each modification solves.

Answer: /etc/profile: mainly used to configure environment variables, so that the hadoop command can be executed from any directory.

/etc/sudoers

/etc/hosts

/etc/sysconfig/network

/etc/inittab

37. Describe how you analyze and optimize the performance of the programs above during development.

38. There are 100 million evenly distributed integers. Find the largest 1,000 of them (the top 1K) using the optimal algorithm.

See "The Interview Collection of Massive Data Algorithms"; a sketch of the standard approach follows.

39. The general flow of MapReduce.

Answer: roughly eight steps:

1. Plan the input splits of the files.

2. Start an appropriate number of map task processes.

3. The RecordReader of the FileInputFormat reads a line of data and wraps it as k1/v1.

4. The custom map function is called and passed the k1/v1 pair.

5. The map output is collected, partitioned, and sorted.

6. The reduce tasks start and pull data from the map side.

7. Each reduce task calls the custom reduce function to process the data.

8. The RecordWriter of the OutputFormat writes out the result data.

40. Implement the SQL statement select count(x) from a group by b with MapReduce.
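Question 40 has no written answer in the original; below is a minimal sketch, assuming tab-separated rows of table a with column x at index 0 and the GROUP BY column b at index 1 (both positions are illustrative). The map emits (b, 1) for every row whose x is not NULL, and the reduce sums the ones, which matches SELECT COUNT(x) FROM a GROUP BY b.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GroupByCount {

    public static class GroupByMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            String x = fields[0];
            String b = fields[1];
            // COUNT(x) skips NULLs; Hive text files represent NULL as \N.
            if (!x.isEmpty() && !"\\N".equals(x)) {
                context.write(new Text(b), ONE);
            }
        }
    }

    public static class GroupByReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text b, Iterable<LongWritable> ones, Context context)
                throws IOException, InterruptedException {
            long count = 0;
            for (LongWritable one : ones) {
                count += one.get();
            }
            context.write(b, new LongWritable(count));
        }
    }
}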

41. In a Hadoop cluster, which services run on the master and which on the slaves?

Answer: The master runs the master daemons (NameNode and JobTracker/ResourceManager), and the slaves run the worker daemons (DataNode and TaskTracker/NodeManager).

42. Hadoop parameter tuning.

43. What are the differences between the syntax of Pig Latin and Hive?

44. Describe the process of setting up HBase and ZooKeeper.

45. The operating principle of Hadoop.

Answer: The core of Hadoop consists of two parts, HDFS and MapReduce. HDFS is a distributed file storage system that splits a large file into multiple blocks and stores them on multiple servers.

MapReduce uses a JobTracker and TaskTrackers to execute jobs: the map phase fans the work out into tasks, and the reduce phase aggregates and summarizes the results.

46. The principle of MapReduce.

Answer: A MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master schedules all the tasks that make up a job onto the slaves, monitors their execution, and re-executes any tasks that fail; the slaves only execute the tasks assigned by the master.

47. The HDFS storage mechanism.

Answer: HDFS is a distributed file storage system. The NameNode receives the user's requests, and large files are split into blocks, according to the file size and the configured block size, and stored on DataNodes.

48. Give an example of how MapReduce runs.

WordCount.

49. How do you check the health of a Hadoop cluster?

Answer: Use a complete cluster monitoring system (Ganglia, Nagios), plus commands such as:

hdfs dfsadmin -report

hdfs haadmin -getServiceState nn1

50. In a MapReduce job where you do not want reduce to produce the output, what can replace the function of reduce?

51. How do you tune Hive?

Answer: Hive queries are ultimately converted into MapReduce jobs, so tuning Hive is essentially tuning MapReduce. It can be tuned in the following ways: solve data skew, reduce the number of jobs, set reasonable numbers of map and reduce tasks, merge small files, and optimize with the whole workload in mind, since the optimum for a single task is not necessarily the overall optimum. Partition the data according to sensible rules.

52. How does Hive control permissions?

Our company did not set this up; we did not need it.

53. What is the principle of writing data to HBase?

Answer: The client sends the write to the RegionServer that holds the target region; the write is first appended to the write-ahead log (HLog) and then to the MemStore, and when the MemStore is full it is flushed to a StoreFile on HDFS.

54. Can Hive create multiple databases like a relational database?

Answer: Of course it can.

55. How does HBase handle downtime?

Answer: Downtime falls into HMaster downtime and HRegionServer downtime. If an HRegionServer goes down, the HMaster redistributes the regions it managed to other live RegionServers. Since the data and logs are persisted in HDFS, this operation does not lose data, so the consistency and safety of the data are guaranteed.

If the HMaster goes down, there is no single point of failure: multiple HMasters can be started in HBase, and ZooKeeper's Master Election mechanism ensures that one Master is always running. That is, ZooKeeper guarantees there will always be an HMaster providing service.

56. Suppose the company is going to build a data center. How would you plan it?

First carry out requirements research and analysis.

Design the functional breakdown.

Architecture design

Estimation of throughput

Type of technology adopted

Software and hardware selection

Cost-benefit analysis

Project management

Scalability

Security, stability

This concludes the article "What are the Hadoop big data interview questions?". I hope the content above is helpful and helps you learn more. If you think the article is good, please share it for more people to see.
