This article explains in detail a set of Hadoop big data interview questions. The editor finds them very practical and shares them here for reference; I hope you get something out of reading them.
1. Briefly describe how to install and configure Apache's open-source Hadoop. You do not have to list the specific steps, though it is better if you can.
1. Log in with the root account
2. Modify the IP address
3. Modify the hostname
4. Configure SSH password-free login
5. Turn off the firewall
6. Install the JDK
7. Extract the Hadoop installation package
8. Configure Hadoop's core files: hadoop-env.sh, core-site.xml, mapred-site.xml, hdfs-site.xml
9. Configure the Hadoop environment variables
10. Format the namenode: hadoop namenode -format
11. Start the cluster: start-all.sh
2. Please list the processes that need to be started in a normal Hadoop cluster and describe their roles, in as much detail as possible.
Answer: Namenode: manages the metadata of the blocks in HDFS, responds to client requests, manages the balance of file blocks across datanodes, and maintains the replica count.
Secondarynamenode: mainly performs the checkpoint operation; within certain limits it can also serve as a cold backup and take snapshot backups of the data.
Datanode: stores data blocks and serves client I/O requests for those blocks.
Jobtracker: manages jobs and assigns tasks to the tasktrackers.
Tasktracker: executes the tasks assigned by the jobtracker.
In a YARN or HA cluster there are also: Resourcemanager, Nodemanager, Journalnode, Zookeeper, and Zkfc.
3. Please write the following shell commands:
(1) Kill a job
(2) Delete the /tmp/aaa directory on HDFS
(3) The commands executed to add a new storage node and to decommission a node
Answer: (1) hadoop job -list gets the job ids; then hadoop job -kill jobId kills the job with the given jobId.
(2) hadoop fs -rmr /tmp/aaa
(3) To add a new node, execute on the new node:
hadoop-daemon.sh start datanode
hadoop-daemon.sh start tasktracker (or yarn-daemon.sh start nodemanager)
To take a datanode offline, list the hostname of the machine to be decommissioned in the excludes file in the conf directory, then execute on the master node:
hadoop dfsadmin -refreshNodes
To decommission a tasktracker/nodemanager, you also only need to execute on the master node:
hadoop mradmin -refreshNodes
4. Please list the Hadoop schedulers you know and briefly describe how they work.
Answer: FIFO scheduler: the default; jobs run on a first-in, first-out basis.
Capacity scheduler: the computing-capacity scheduler; it picks the queue with the smallest footprint and the highest priority to run first, and so on.
Fair scheduler: fair scheduling; all jobs get an equal share of resources.
5. Please list the languages you have used to develop MapReduce in your work.
Answer: Java and Hive; also Python and C++ via Hadoop Streaming.
6. The current log sample format is:
a, b, c, d
b, b, f, e
a, a, c, f
Please write a MapReduce program, in the language you are most familiar with, that counts the occurrences of each element in the fourth column.
Answer:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount1 {
    public static final String INPUT_PATH = "hdfs://hadoop0:9000/in";
    public static final String OUT_PATH = "hdfs://hadoop0:9000/out";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);
        // Delete the output directory if it already exists.
        if (fileSystem.exists(new Path(OUT_PATH))) {
            fileSystem.delete(new Path(OUT_PATH), true);
        }
        Job job = new Job(conf, WordCount1.class.getSimpleName());
        // 1. Read the input files and parse them into key/value pairs.
        FileInputFormat.setInputPaths(job, new Path(INPUT_PATH));
        // 2. Apply our own logic to the input values and emit new key/value pairs.
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // 3. Partition the map output.
        // 4. Sort and group the partitioned data; values with the same key go into one collection.
        // 5. (Optionally) combine the grouped data.
        // 6. Copy the map output to the reduce nodes over the network (shuffle).
        // 7. Apply our own reduce logic to the data output by the map phase.
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
        job.waitForCompletion(true);
    }

    static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable k1, Text v1, Context context)
                throws IOException, InterruptedException {
            // Split the line into columns (adjust the delimiter to match the actual log format).
            String[] split = v1.toString().split("\t");
            // Emit the fourth column with a count of 1.
            context.write(new Text(split[3]), new LongWritable(1L));
        }
    }

    static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text k2, Iterable<LongWritable> v2, Context context)
                throws IOException, InterruptedException {
            long count = 0L;
            // Sum the counts for this fourth-column value.
            for (LongWritable times : v2) {
                count += times.get();
            }
            context.write(k2, new LongWritable(count));
        }
    }
}
7. What do you think are the advantages of developing map/reduce with Java, Streaming, and Pipes?
I have only used Java and HiveQL.
Java can implement complex logic in MapReduce, but it is tedious when the requirement is simple.
HiveQL is basically written against table data in Hive; it is easy to write but hard to use for complex logic.
8. What are the ways of storing Hive metadata, and what are their advantages?
Three kinds: the built-in Derby database, which is very small, can only be used on a single node, and is not commonly used; MySQL, which is the common choice; and a remote metastore.
Their formal names (which you can look up) are: single-user mode, multi-user mode, and remote mode.
9. Please briefly describe how Hadoop implements secondary sorting (that is, sorting on both key and value).
The first method is for the reducer to cache all the values of a given key and then sort them inside the reducer. However, because the reducer must hold all the values of that key, this may exhaust memory.
The second method is to append part or all of the value to the original key to form a composite key. The two methods each have their advantages: the first is simple to write, but it is slow when concurrency is low and the data volume is large (and it risks running out of memory).
The second method hands the sorting work over to the MapReduce framework's shuffle, which is more in line with the Hadoop/MapReduce design philosophy, so it is the one chosen here. We write a Partitioner to ensure that all records with the same key (the original key, excluding the appended part) are sent to the same reducer, and we write a grouping Comparator so that, once the data reaches the reducer, it is grouped by the original key.
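As a minimal sketch of the second method (assuming the map side emits a composite Text key of the form "naturalKey#value", where the "#" separator is an illustrative choice and not from the original text), the Partitioner and grouping Comparator could look like this:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortSketch {

    // Partition only on the natural key, so every composite key that shares it
    // lands on the same reducer.
    public static class NaturalKeyPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String naturalKey = key.toString().split("#")[0];
            return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group reducer input on the natural key only, so one reduce() call sees all the
    // values for that key, already sorted by the full composite key.
    public static class NaturalKeyGroupingComparator extends WritableComparator {
        protected NaturalKeyGroupingComparator() {
            super(Text.class, true);
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String k1 = a.toString().split("#")[0];
            String k2 = b.toString().split("#")[0];
            return k1.compareTo(k2);
        }
    }
}

In the driver these would be registered with job.setPartitionerClass(NaturalKeyPartitioner.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class).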
10. Briefly introduce the ways Hadoop can implement a join.
Map-side join: for the small-table/large-table join scenario; the small table can be shipped with the distributed cache (see the sketch below).
Reduce-side join.
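A minimal map-side join sketch, assuming a small lookup file has been shipped to every map task through the distributed cache with something like job.addCacheFile(new URI("hdfs://hadoop0:9000/cache/small.txt#small.txt")) — the path and the "#small.txt" symlink name are illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> smallTable = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException {
        // The "#small.txt" fragment above creates a local symlink with that name
        // in the task's working directory; load it into memory once per task.
        BufferedReader reader = new BufferedReader(new FileReader("small.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split("\t");
            smallTable.put(fields[0], fields[1]);
        }
        reader.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Join each big-table record against the in-memory small table; no reduce phase needed.
        String[] fields = value.toString().split("\t");
        String joined = smallTable.get(fields[0]);
        if (joined != null) {
            context.write(new Text(fields[0]), new Text(value.toString() + "\t" + joined));
        }
    }
}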
11. Please implement a non-recursive binary search in Java.
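The original text gives no answer here; the following is a standard iterative (non-recursive) binary search over a sorted int array:

public class BinarySearch {
    // Returns the index of target in the sorted array, or -1 if it is not present.
    public static int binarySearch(int[] sorted, int target) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2;   // avoids overflow of (low + high)
            if (sorted[mid] == target) {
                return mid;
            } else if (sorted[mid] < target) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] data = {1, 3, 5, 7, 9, 11};
        System.out.println(binarySearch(data, 7));   // prints 3
        System.out.println(binarySearch(data, 4));   // prints -1
    }
}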
12. Please briefly describe the roles of combine and partition in MapReduce.
Answer: The combiner runs at the tail end of the map phase and is in principle a small reducer. Its main purpose is to reduce the amount of data sent to the reducers, relieve the network-transfer bottleneck, and improve reducer efficiency.
The main role of the partitioner is to distribute all the key/value pairs produced in the map phase across the different reducer tasks, so that the processing load of the reduce phase is shared.
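As a small illustration (reusing the Text/LongWritable types from the WordCount1 example above; the routing rule itself is arbitrary), a custom Partitioner might look like this:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, LongWritable> {
    // Send keys starting with "a" to reducer 0 and hash everything else.
    @Override
    public int getPartition(Text key, LongWritable value, int numPartitions) {
        if (numPartitions > 1 && key.toString().startsWith("a")) {
            return 0;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

In the driver, the combiner and the partitioner are wired in with job.setCombinerClass(MyReducer.class) (a reducer that is safe to apply to partial map output) and job.setPartitionerClass(FirstLetterPartitioner.class).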
13. Differences between internal and external tables in Hive.
When Hive loads data into an internal (managed) table, it moves the data to the path the data warehouse points to; for an external table, the storage directory of the data is specified by the user when the table is created.
When an internal table is dropped, its metadata and its data are deleted together.
Dropping an external table deletes only the metadata, not the data.
External tables are therefore relatively safer, data organization is more flexible, and it is easier to share the source data.
14. How do you design a rowKey and the column families for HBase?
Answer:
A rowKey is best designed with a regular pattern, that is, preferably ordered.
Data that often needs to be read in bulk should have contiguous rowKeys.
Fields that are often used as query conditions should be built into the rowKey.
Column family design:
Classify the data according to business characteristics and put different categories into different column families.
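As one small sketch of the "ordered, contiguous" idea above (the field names userId and eventTime are illustrative assumptions), a rowKey could combine a business key with a reversed timestamp:

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeySketch {
    // Fixed business key first, then a reversed timestamp, so all rows of one user are
    // contiguous and a scan returns the newest events first.
    public static byte[] buildRowKey(String userId, long eventTime) {
        long reversedTs = Long.MAX_VALUE - eventTime;
        return Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(reversedTs));
    }
}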
15. How do you handle data skew with MapReduce?
The essence: make the data in each partition evenly distributed.
You can design an appropriate partitioning strategy according to the characteristics of the business data.
If the distribution of the data is not known at all in advance, use a random sampler to sample the data, generate a partitioning strategy from the sample, and then process the data with it.
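As a sketch of the "random sampler" idea (paths and sampling parameters are illustrative), Hadoop ships InputSampler and TotalOrderPartitioner for exactly this: a sample of the input is used to write a partition file, which then splits the keys into roughly evenly loaded ranges:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class SkewSamplingSketch {
    public static void configure(Job job) throws Exception {
        job.setPartitionerClass(TotalOrderPartitioner.class);
        // Tell the partitioner where the partition boundaries will be stored.
        Path partitionFile = new Path("/tmp/partitions.lst");
        TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
        // Sample 1% of records, up to 10,000 samples, from at most 10 splits.
        InputSampler.RandomSampler<Text, NullWritable> sampler =
                new InputSampler.RandomSampler<Text, NullWritable>(0.01, 10000, 10);
        InputSampler.writePartitionFile(job, sampler);
    }
}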
16. How do you optimize the Hadoop framework?
It can be approached from many angles: for example, how to optimize HDFS, how to optimize the MapReduce program, how to optimize YARN's job scheduling, how to optimize HBase, and how to optimize Hive.
17. What is the internal mechanism of HBase?
HBase is a database system that can serve online business.
Physical storage: HBase's persistent data is stored on HDFS.
Storage management: a table is divided into many regions, and these regions are distributed across many regionservers.
A region is further divided into stores; a store contains a memstore and storefiles.
Version management: data updates in HBase are essentially the addition of new versions; files are merged across versions through compaction operations.
Region splitting.
Cluster management: zookeeper + hmaster (and its responsibilities) + hregionserver (and its responsibilities).
18. Can we drop the reduce phase when developing a distributed computing job?
Answer: Yes. For example, if our cluster is only meant to store files and no computation over the data is involved, the reduce phase of MapReduce can be omitted.
For example, the behavior-trajectory enhancement part of the traffic-operation project.
How to drop the reduce phase: set the number of reduce tasks to 0 (see the sketch below).
Once it is removed, there is no sorting and no shuffle operation.
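A minimal map-only driver sketch (the identity Mapper and the argument-based paths are illustrative, not the project's actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "map-only-example");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(Mapper.class);   // identity mapper, just for illustration
        job.setNumReduceTasks(0);           // no reduce: no sort, no shuffle
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}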
19. Data compression algorithms commonly used in Hadoop:
LZO
Gzip
Default (the DEFLATE-based DefaultCodec)
Snappy
If you want to compress the data, it is best to first convert the raw data to SequenceFile or Parquet files (Spark).
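A sketch of switching on compression for a job, assuming the Snappy codec is available on the cluster (any of the codecs listed above could be substituted):

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressionSketch {
    public static void configure(Job job) {
        // Compress intermediate (map) output to cut shuffle traffic.
        job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
        job.getConfiguration().setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);
        // Write the final output as a block-compressed SequenceFile.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
    }
}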
20. The scheduling mode of MapReduce (the question is ambiguous; it can be understood as YARN's scheduling mode or as the internal workflow of MR).
Answer: The appmaster acts as the scheduling director, managing the maptasks and reducetasks.
The appmaster is responsible for starting and monitoring the maptasks and reducetasks.
When a maptask finishes, the appmaster detects it and notifies the reducetasks of its output location; the reducetasks then pull the files from the map side and process them.
When the reduce phase completes, the appmaster unregisters itself from the resourcemanager.
21. The principle of interaction between Hive's underlying layer and the database.
Hive's query functionality is implemented by the combination of HDFS and MapReduce.
The relationship between Hive and MySQL: Hive simply borrows MySQL to store the metadata of its tables; this component is called the metastore.
22. Implementation principles of HBase filters.
You can talk about the parent classes of filters (comparison filters and dedicated filters).
The purpose of filters:
Enhance the ability to query data in HBase.
Reduce the amount of data the server returns to the client.
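A small client-side sketch of the idea (using the classic HTable API; the table name "user_events" and the rowKey prefix are illustrative assumptions): only rows whose key starts with the given prefix are returned, so the filtering happens on the regionservers and less data crosses the network.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilterScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_events");
        Scan scan = new Scan();
        // Only rows whose key starts with "user123" are sent back to the client.
        scan.setFilter(new PrefixFilter(Bytes.toBytes("user123")));
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            System.out.println(Bytes.toString(result.getRow()));
        }
        scanner.close();
        table.close();
    }
}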
23. What does the data output by reduce look like (combined with specific scenarios, such as pi)?
The log-enhancement stage: the output grows (from about 1.5 TB to 2 TB).
A filtering MR program outputs less than its input.
An analytical MR program may output more than its input (for example, finding common friends).
24. On-site problems test your mastery of MapReduce and of the Hive QL language.
25. Under what circumstances will a datanode not back up data?
Answer: When the client uploads the file with the number of replicas set to 1.
26. In which phase does combine appear?
Answer: During the shuffle phase.
Specifically, when the data output by a maptask spills from memory to disk; it may be called multiple times.
The combiner must be used with special care so that it does not affect the final logical result.
27. The architecture of HDFS.
Answer:
Cluster architecture:
namenode, datanode, secondarynamenode
(in HA mode: active namenode, standby namenode, journalnode, zkfc)
Internal working mechanism:
Data is stored in a distributed manner.
A unified directory structure is presented to the outside world.
A specific responder (the namenode) handles requests.
Block mechanism and replica mechanism.
The responsibilities and mechanisms of the namenode and the datanodes.
The read and write data flows.
28. The process of flush.
Answer: Flushing is done from memory: when a file is written, it is first written into memory, and when the memory buffer is full all of it is written to disk at once and the cached data is cleared.
29. What is a queue?
Answer: It is a scheduling strategy whose mechanism is first in, first out.
30. The difference between List and Set.
Answer: Both List and Set are interfaces, and each has its own implementation classes, both ordered and unordered.
The biggest difference is that a List can contain duplicates while a Set cannot.
A List is suitable for frequently appending, inserting, and deleting data, but random access is relatively inefficient.
A Set is suitable for frequent storage, insertion, and deletion, but traversal is relatively inefficient.
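A tiny illustration of the duplicate-element difference described above:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ListVsSet {
    public static void main(String[] args) {
        List<String> list = new ArrayList<String>();
        Set<String> set = new HashSet<String>();
        list.add("a");
        list.add("a");      // duplicates are kept
        set.add("a");
        set.add("a");       // duplicate is silently ignored
        System.out.println(list.size());   // 2
        System.out.println(set.size());    // 1
    }
}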
31. The three normal forms of data.
Answer:
First normal form (1NF): no repeating columns.
Second normal form (2NF): attributes depend completely on the primary key (eliminate partial functional dependencies).
Third normal form (3NF): attributes do not depend on other non-key attributes (eliminate transitive dependencies).
32. What happens when one of three datanodes has an error?
Answer:
The namenode senses that the datanode has gone offline through the heartbeat mechanism.
New copies of the blocks on that datanode are replicated elsewhere in the cluster to restore the files' replica counts.
The operations team responds quickly, sends colleagues to diagnose and repair the offline datanode, and then brings it back online.
33. When Sqoop imports data into MySQL, how do you avoid importing data repeatedly? If there is a data problem, how does Sqoop handle it?
Answer: FAILED java.util.NoSuchElementException
The reason for this error is that the fields Sqoop parses from the file do not correspond to the fields of the table in the MySQL database. You therefore need to pass extra parameters to Sqoop when running it, telling it the file's field delimiter so that it can parse the file's fields correctly.
Hive's default field delimiter is '\001'.
34. Describe where caching mechanisms are used in Hadoop and what their functions are.
Answer:
In the shuffle.
In HBase, on both the client and the regionserver.
35. MapReduce optimization experience.
Answer: (1) Set a reasonable number of map and reduce tasks, and set a reasonable block size.
(2) Avoid data skew.
(3) Use a combine function.
(4) Compress the data.
(5) Optimize the handling of small files: merge small files into large files in advance, use CombineTextInputFormat, or use MapReduce to merge small files on HDFS into a large SequenceFile (key: file name, value: file content); see the sketch after this list.
(6) Tune the parameters.
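A sketch of point (5), using CombineTextInputFormat so that many small files are packed into a few input splits instead of one map task per file (the 128 MB cap is an illustrative choice):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFileSketch {
    public static void configure(Job job) {
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Pack small files into splits of at most 128 MB.
        CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);
    }
}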
36. Please list the files under /etc/ that you have modified and explain what problems the modifications solved.
Answer: /etc/profile: mainly used to configure environment variables, so that the hadoop command can be executed from any directory.
/etc/sudoers
/etc/hosts
/etc/sysconfig/network
/etc/inittab
37. Please describe how to analyze and optimize the performance of the above program during the development process.
38. There are 100 million evenly distributed integers; if you want to find the 1K largest numbers, what is the optimal algorithm?
See "The Interview Collection of Massive Data Algorithms".
39. The general flow of MapReduce.
Answer: It is mainly divided into eight steps:
1. Plan the input splits for the files.
2. Start the appropriate number of maptask processes.
3. Call the RecordReader of the FileInputFormat to read a line of data and encapsulate it as k1/v1.
4. Call the user-defined map function and pass it k1/v1.
5. Collect the map output, then partition and sort it.
6. The reduce tasks start and pull data from the map side.
7. The reduce tasks call the user-defined reduce function to process the data.
8. Call the RecordWriter of the OutputFormat to write out the result data.
40. Implement the SQL statement "select count(x) from a group by b" with MapReduce.
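A sketch of this query as a MapReduce job: the mapper emits (b, 1) for every row whose x column is present, and the reducer sums the ones per value of b. The column positions (x in column 0, b in column 1) and the tab delimiter are illustrative assumptions.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GroupByCount {
    static class GroupByMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1L);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] cols = value.toString().split("\t");
            String x = cols[0];
            String b = cols[1];
            if (!x.isEmpty()) {                      // count(x) ignores null/empty x
                context.write(new Text(b), ONE);
            }
        }
    }

    static class GroupByReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text b, Iterable<LongWritable> ones, Context context)
                throws IOException, InterruptedException {
            long count = 0L;
            for (LongWritable one : ones) {
                count += one.get();
            }
            context.write(b, new LongWritable(count));   // one (b, count) row per group
        }
    }
}

In the driver, the reducer can also be registered as the combiner with job.setCombinerClass(GroupByReducer.class), since summing counts is associative.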
41. In a Hadoop cluster, which services run on the master and which on the slaves?
Answer: The master mainly runs our master-node services (such as the namenode and jobtracker/resourcemanager), and the slaves mainly run our slave-node services (such as the datanodes and tasktrackers/nodemanagers).
42. Hadoop parameter tuning.
43. What is the difference between the syntax of Pig Latin and that of Hive?
44. Describe the process of setting up HBase and ZooKeeper.
45. The operating principle of Hadoop.
Answer: The core of Hadoop consists of two parts, HDFS and MapReduce. HDFS is a distributed file storage system that splits a large file into multiple small blocks and stores them on multiple servers.
MapReduce uses a JobTracker and TaskTrackers to execute jobs: map spreads the task out, and reduce summarizes the processed results.
46. The principle of MapReduce.
Answer: A MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling all the tasks that make up a job onto the slaves, monitoring their execution, and re-executing tasks that fail; the slaves are only responsible for executing the tasks assigned by the master.
47. The HDFS storage mechanism.
Answer: HDFS is a distributed file storage system. The namenode receives the user's operation requests, and a large file is split into multiple blocks, according to the file size and the configured block size, and saved on the datanodes.
48. Give an example of how MapReduce works.
Wordcount.
49. How do you check the health of a Hadoop cluster?
Answer: With a complete cluster monitoring system (Ganglia, Nagios), plus commands such as:
hdfs dfsadmin -report
hdfs haadmin -getServiceState nn1
50. In a MapReduce job, if you do not want reduce to produce the output, what can replace the function of reduce?
51. How do you tune Hive?
Answer: Hive is ultimately converted into MapReduce jobs, so tuning Hive is really tuning MapReduce. It can be approached from the following angles: solve data skew, reduce the number of jobs, set reasonable numbers of map and reduce tasks, and merge small files; when optimizing, look at the whole picture, because the best single task is not as good as the best overall plan; and partition according to appropriate rules.
52. How does Hive control permissions?
Our company did not do this; we do not need it.
53. What is the principle of HBase writing data?
Answer:
54. Can Hive create multiple databases like a relational database?
Answer: Of course it can.
55. How is HBase downtime handled?
Answer: Downtime is divided into HMaster downtime and HRegionServer downtime. If an HRegionServer goes down, the HMaster redistributes the regions it managed to other live RegionServers. Because the data and logs are persisted in HDFS, this operation does not cause data loss, so the consistency and safety of the data are guaranteed.
If the HMaster goes down, there is no single point of failure: multiple HMasters can be started in HBase, and ZooKeeper's Master Election mechanism ensures that one Master is always running and providing service.
56. Suppose the company is going to build a data center; how would you plan it?
First carry out requirements research and analysis.
Design the functional divisions.
Design the architecture.
Estimate the throughput.
Choose the types of technology to adopt.
Select the software and hardware.
Do a cost-benefit analysis.
Project management.
Scalability.
Security and stability.
This is the end of the article on Hadoop big data interview questions. I hope the content above is helpful and that you can learn more from it; if you think the article is good, please share it for more people to see.