1. What information does a Kafka message contain?
A Kafka message consists of a fixed-length header and a variable-length message body.
The header consists of a one-byte magic (file format) and a four-byte CRC32 (used to check whether the message body is intact). When magic is 1, there is one extra byte between magic and the CRC32: attributes (which stores properties such as whether the message is compressed and the compression format); when magic is 0, the attributes field does not exist.
The body consists of N bytes and contains the actual key/value payload.
2. How to view a Kafka offset
With version 0.9 or above, you can use the new consumer client: consumer.seekToEnd() / consumer.position() can be used to get the latest offset, as in the sketch below:
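A minimal sketch of this with the Kafka Java consumer API (shown here from Scala); the broker address, topic, and partition are assumptions for illustration:

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

object LatestOffset {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("group.id", "offset-checker")          // hypothetical group id
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    val tp = new TopicPartition("my-topic", 0)        // hypothetical topic and partition
    consumer.assign(java.util.Arrays.asList(tp))
    consumer.seekToEnd(java.util.Arrays.asList(tp))   // jump to the end of the log
    val latest = consumer.position(tp)                // latest offset of this partition
    println(s"latest offset of $tp = $latest")
    consumer.close()
  }
}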
3. The Hadoop shuffle process
1) Map-side shuffle
The Map side processes the input data and produces intermediate results, which are written to the local disk rather than to HDFS. The output of each Map is first written to a memory buffer; when the buffered data reaches a configured threshold, a background thread writes the buffer to disk, a process called spill.
Before the spill is written, the data is sorted twice: first by the partition it belongs to, then by key within each partition. The purpose of partitioning is to divide the records among the Reducers for load balancing; each Reducer later reads its own data according to the partition. The combiner then runs, if one is configured; a combiner is essentially a Reducer whose purpose is to pre-process the data about to be written to disk so that less data hits the disk. Finally the data is written to the local disk as a spill file (spill files are stored in the directory specified by {mapred.local.dir} and are deleted when the Map task finishes).
Each Map task may produce multiple spill files; before the Map task completes, these are merged into a single file by a multi-way merge. At that point, the Map side of the shuffle is finished.
2) Reduce-side shuffle
The Reduce-side shuffle consists of three stages: copy, sort (merge), and reduce.
The first step is to copy the output files generated on the Map side to the Reduce side. How does each Reducer know which data it should process? Because partitioning on the Map side effectively assigns the data each Reducer will handle (a partition corresponds to a Reducer), each Reducer only needs to copy the data in its own partitions. A Reducer may handle one or more partitions, but it must copy the data of those partitions from the output of every Map.
Next comes the sort phase, also called the merge phase, because its main job is merge sorting. The data copied from the Map side to the Reduce side is already sorted, so it is well suited to merge sorting. The result is a larger file on the reduce side that serves as the input to Reduce.
Finally, the Reduce phase runs, produces the final output, and writes it to HDFS.
4. Spark cluster deployment modes
Spark has many modes. The simplest is single-machine local mode, and there is also a single-machine pseudo-distributed mode. The more complex ones run on a cluster: Spark currently runs well on YARN and Mesos, and it also has its own Standalone mode. For most cases Standalone mode is sufficient, and if the enterprise already has a YARN or Mesos environment, deploying there is also very convenient.
Standalone (cluster mode): a typical master/slave architecture. The Master is a single point of failure, but Spark supports ZooKeeper-based HA.
On YARN (cluster mode): runs on the YARN resource manager, with YARN responsible for resource management and Spark responsible for task scheduling and computation.
On Mesos (cluster mode): runs on the Mesos resource manager, with Mesos responsible for resource management and Spark responsible for task scheduling and computation.
On cloud (cluster mode): for example AWS EC2, which makes it easy to access Amazon S3; Spark supports a variety of distributed storage systems, such as HDFS and S3.
5. The process of reading and writing data in HDFS
Read:
1. The client communicates with the namenode to query the metadata and find the datanode servers that hold the file's blocks.
2. It selects a datanode (nearest first, then random) and requests a socket stream.
3. The datanode starts sending the data (reading it from disk into the stream, with checksums verified per packet).
4. The client receives the data packet by packet, caches it locally, and then writes it to the target file.
Write:
1. The client communicates with the namenode to request uploading a file; the namenode checks whether the target file already exists and whether its parent directory exists.
2. The namenode returns whether the upload is allowed.
3. The client asks which datanode servers the first block should be transferred to.
4. The namenode returns three datanode servers: A, B, and C.
5. The client asks one of the three datanodes, A, to upload the data (essentially an RPC call that establishes a pipeline). On receiving the request, A calls B, and B calls C, completing the pipeline, which is then acknowledged back to the client step by step.
6. The client starts uploading the first block to A (first reading the data from disk into a local memory cache), packet by packet: A passes each packet it receives on to B, and B passes it to C; every packet A sends is placed in a response queue to wait for an acknowledgement.
7. When one block has been transferred, the client again asks the namenode which datanodes to upload the second block to. (A minimal client-side sketch of the read and write paths follows below.)
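A minimal client-side sketch of both paths using the Hadoop FileSystem API; the namenode URI and file path are assumptions for illustration, and the pipeline and packet mechanics described above happen inside create() and open():

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsReadWrite {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode:8020")  // assumed namenode address
    val fs = FileSystem.get(conf)

    // Write: the client asks the namenode for target datanodes and streams packets
    // through the datanode pipeline; all of that is hidden behind create().
    val out = fs.create(new Path("/tmp/demo.txt"))    // hypothetical path
    out.writeUTF("hello hdfs")
    out.close()

    // Read: the client gets block locations from the namenode and streams the data
    // from the nearest datanode, verifying checksums packet by packet.
    val in = fs.open(new Path("/tmp/demo.txt"))
    println(in.readUTF())
    in.close()
    fs.close()
  }
}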
6. Which performs better in an RDD, reduceByKey or groupByKey, and why?
reduceByKey: reduceByKey merges the values for each key locally on every mapper before the results are sent to the reducer, similar to a combiner in MapReduce. The benefit is that after this map-side reduce, the amount of data is greatly reduced, which cuts network transfer and lets the reduce side compute its result faster.
groupByKey: groupByKey gathers the values for each key into a sequence (an Iterator), and this happens on the reduce side, so all of the data must be shipped over the network, which is unnecessary waste. If the data volume is very large, it may also cause an OutOfMemoryError.
From this comparison, reduceByKey is recommended for reduce operations over large amounts of data: it is faster and avoids the memory-overflow problems that groupByKey can cause.
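A minimal sketch contrasting the two operators on a word-count-style pair RDD; local mode and the toy data are assumptions for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object ReduceVsGroup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reduce-vs-group").setMaster("local[*]"))
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

    // reduceByKey: values are combined on the map side first, so only partial
    // sums cross the network during the shuffle.
    val byReduce = words.reduceByKey(_ + _)

    // groupByKey: every (key, value) pair is shuffled and summed afterwards,
    // which moves far more data and can exhaust memory for hot keys.
    val byGroup = words.groupByKey().mapValues(_.sum)

    println(byReduce.collect().toList)
    println(byGroup.collect().toList)
    sc.stop()
  }
}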
7. Understanding of Spark 2.0
Simpler: ANSI SQL and more reasonable API
Faster: using Spark as the compiler
Smarter: Structured Streaming
8. How are RDD wide and narrow dependencies distinguished?
Wide dependency: a partition of the parent RDD is used by multiple partitions of the child RDD. Operations such as groupByKey, reduceByKey, and sortByKey produce wide dependencies and result in a shuffle.
Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD. Operations such as map, filter, and union create narrow dependencies. (A small lineage example follows below.)
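A small lineage example, assuming a local SparkContext: map creates a narrow dependency, while reduceByKey introduces a wide dependency, which shows up as a ShuffledRDD stage in toDebugString:

import org.apache.spark.{SparkConf, SparkContext}

object Dependencies {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("deps").setMaster("local[*]"))
    val nums = sc.parallelize(1 to 10)

    val narrow = nums.map(n => (n % 3, n))   // map: narrow dependency, no shuffle
    val wide   = narrow.reduceByKey(_ + _)   // reduceByKey: wide dependency, shuffle

    // toDebugString prints the lineage; the ShuffledRDD marks the shuffle boundary.
    println(wide.toDebugString)
    sc.stop()
  }
}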
9. The two ways Spark Streaming reads Kafka data
The two ways are:
Receiver-based
This approach is implemented with Kafka's high-level consumer API. The data the receiver pulls from Kafka is stored in Spark executor memory, and the jobs started by Spark Streaming then process that data. With the default configuration, however, this approach may lose data when the underlying node fails. To get a highly reliable, zero-data-loss guarantee, you must enable Spark Streaming's write-ahead log (WAL) mechanism, which synchronously writes the received Kafka data to a write-ahead log on a distributed file system such as HDFS, so that even if a node fails, the data can be recovered from the log.
Direct
The direct approach was introduced in Spark 1.3. Instead of using a receiver, it periodically queries Kafka for the latest offset of each topic+partition and uses that to define the offset range of each batch. When the job that processes the data runs, it uses Kafka's simple consumer API to read the data in the specified offset ranges, as sketched below.
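A minimal sketch of the direct approach using the spark-streaming-kafka-0-10 integration; the broker address, group id, and topic are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object DirectStreamDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-kafka").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",   // assumed broker address
      "key.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "value.deserializer" -> classOf[org.apache.kafka.common.serialization.StringDeserializer],
      "group.id" -> "direct-demo",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // createDirectStream tracks offsets per topic+partition itself instead of
    // relying on a receiver, so each batch covers an explicit offset range.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

    stream.map(record => record.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}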
10. Is Kafka data stored in memory or on disk?
The core idea of Kafka is to use the disk rather than memory. Most people assume that memory must be faster than disk, and I was no exception; but after studying Kafka's design, consulting the relevant material, and running our own tests, it turns out that sequential disk read/write speed is on par with memory.
Linux also has many optimizations for disk I/O, including read-ahead, write-behind, and the disk cache. Doing these operations in memory instead has two drawbacks: the memory overhead of Java objects is high, and as the amount of data on the heap grows, Java GC pauses become very long. Using the disk has the following benefits:
The disk cache is maintained by the Linux kernel, which saves the programmer a lot of work.
Sequential disk read/write speed exceeds random memory read/write.
JVM garbage collection is inefficient and memory-hungry; using the disk avoids this problem.
After a cold start of the system, the disk cache is still usable.
11. How to prevent Kafka data loss
Producer side:
At a high level, to guarantee data reliability and safety, the data must be backed up according to the number of partitions, i.e. an appropriate number of replicas must be configured.
Broker side:
A topic is given multiple partitions, and the partitions are placed adaptively across machines. To distribute the partitions evenly across the brokers, the number of partitions should be greater than the number of brokers.
A partition is Kafka's unit of parallel reading and writing, and it is the key to Kafka's speed.
Consumer side:
Message loss on the consumer side is relatively simple: if the offset is committed before the message has been fully processed, data loss is possible. Since the Kafka consumer auto-commits offsets by default, you must make sure the message has been processed before the offset is committed in the background; heavy processing logic is therefore not recommended here, and if processing takes a long time it should be moved to another thread. To avoid data loss, there are two suggestions (a minimal consumer sketch follows the list):
Set enable.auto.commit=false to turn off automatic offset commits.
Commit the offset manually only after the message has been fully processed.
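A minimal consumer sketch of these two suggestions (newer Kafka client versions); the broker address, group id, and topic name are assumptions for illustration:

import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

object ManualCommitConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")  // assumed broker address
    props.put("group.id", "manual-commit-demo")       // hypothetical group id
    props.put("enable.auto.commit", "false")          // suggestion 1: turn off auto commit
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Arrays.asList("events")) // hypothetical topic

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (record <- records.asScala) {
        // process the record first ...
        println(s"${record.partition}/${record.offset}: ${record.value}")
      }
      // ... then commit (suggestion 2): a crash before this line only causes
      // re-processing, never loss.
      consumer.commitSync()
    }
  }
}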
12. What's the difference between fsimage and edits?
We all know the relationship between the namenode and the secondary namenode: when they synchronize metadata, they use fsimage and edits. fsimage stores a checkpoint of the latest metadata, while changes made after the checkpoint are recorded in the edits log; when the edits log grows to a certain size, it is rolled into a new file, and a new fsimage is produced by merging the old fsimage with the edits.
13. List a few configuration-file optimizations.
1) Optimization of core-site.xml and related configuration files:
a) fs.trash.interval, default 0. This enables moving deleted HDFS files to the trash; the value is the number of minutes before trashed files are permanently removed. It is generally better to turn this on in case important files are deleted by mistake.
b) dfs.namenode.handler.count, default 10. The number of handler threads started by the namenode; it can be changed to 40. You can also experiment with this value to find the setting that works best.
c) mapreduce.tasktracker.http.threads, default 40. Map and reduce transfer data over HTTP; this sets the number of parallel transfer threads.
14. When a datanode joins the cluster for the first time and the log reports an incompatible file version requiring the namenode to be formatted, what is the reason?
1) Formatting is not the right fix, because formatting the namenode formats the file system: when the namenode is formatted, all files under the two directories beneath dfs/name are emptied and files are then recreated under the dfs.name.dir directory.
2) The versions are incompatible because the namespaceID/clusterID recorded by the namenode and the datanode are inconsistent. The fix is to locate the two IDs and change them so that they match.
15. At which stage does sorting take place in MapReduce? Can these sorts be avoided? Why?
1) A MapReduce job consists of two parts, the Map phase and the Reduce phase, and both sort the data; in that sense, the MapReduce framework is essentially a distributed sort.
2) In the Map phase, each Map Task writes a file on its local disk that is sorted by key using quicksort (several intermediate files may be produced, but they are eventually merged into one). In the Reduce phase, each Reduce Task sorts the data it receives so that the data is grouped by key before being handed to reduce() for processing.
3) Many people mistakenly believe that the Map phase does not sort when no Combiner is used; this is wrong. Whether or not a Combiner is used, a Map Task's output is always sorted (unless there are no Reduce Tasks, in which case no sorting is performed; the Map-side sort exists precisely to lighten the sorting load on the Reduce side).
4) Because these sorts are performed automatically by MapReduce and cannot be controlled by the user, they cannot be avoided or turned off in Hadoop 1.x; in Hadoop 2.x they can be turned off.
16. Hadoop optimization
1) Optimization can be approached from the configuration files, the system design, and the code.
2) Configuration file optimization: adjust appropriate parameters, and test while adjusting.
3) Code optimization: keep the combiner consistent with the reduce where possible and keep the data types consistent, which reduces the overhead of unpacking and packing.
4) System optimization: you can raise the Linux limit on the maximum number of open files and tune the network bandwidth and MTU configuration.
5) Adding a combiner to the job can greatly reduce the amount of data copied from the map tasks to the remote reduce tasks during the shuffle phase. Generally speaking, the combiner is the same as the reduce.
6) Use StringBuffer instead of String in development: String is read-only, and modifying it creates temporary objects, whereas StringBuffer is mutable and does not create temporary objects.
7) Configuration changes: the following changes are made in the mapred-site.xml file.
A. Increase the maximum number of slots: the number of slots is set in mapred-site.xml on each tasktracker; the defaults are 2.
mapred.tasktracker.map.tasks.maximum (default: 2)
mapred.tasktracker.reduce.tasks.maximum (default: 2)
B. Adjust the heartbeat interval: when the cluster has fewer than 300 nodes, set the heartbeat interval to 300 milliseconds.
mapreduce.jobtracker.heartbeat.interval.min: the base heartbeat interval
mapred.heartbeats.in.second: for every this many additional nodes in the cluster, the interval grows by the factor below
mapreduce.jobtracker.heartbeat.scaling.factor: how much the heartbeat interval increases for each such increase in cluster size
C. Enable the out-of-band heartbeat
mapreduce.tasktracker.outofband.heartbeat: defaults to false
D. Configure multiple disks
mapreduce.local.dir
E. Configure the number of RPC handlers
mapred.job.tracker.handler.count: defaults to 10; it can be changed to 50, depending on the machine's capacity.
F. Configure the number of HTTP threads
tasktracker.http.threads: defaults to 40; it can be changed to 100, depending on the machine's capacity.
G. Choose an appropriate compression codec, taking snappy as an example:
mapred.compress.map.output = true
mapred.map.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec
17. Design questions
1) Collect the logs produced by nginx. The log format is: user ip, time, url, htmlId; hundreds of millions of records are generated every day. Save the data to HDFS and provide real-time queries (response time under 3 seconds) for:
a) the number of times a user visits a given URL in a day
b) the total number of visits to a given URL on a given day
The real-time approach is: Logstash + Kafka + Spark Streaming + Redis + a reporting platform.
The offline approach is: Logstash + Kafka + Elasticsearch + Spark Streaming + a relational database.
For both a) and b), the data are filtered in Spark Streaming and the matching records are saved to Redis. (A rough Spark Streaming sketch follows below.)
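A rough sketch of the real-time path for a) and b): parse log lines from a DStream and increment per-(user, url, day) and per-(url, day) counters in Redis through the Jedis client. The input source, log field layout, and Redis key names are all illustrative assumptions; in the real pipeline the DStream would come from Kafka, as in question 9:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import redis.clients.jedis.Jedis

object NginxLogCounters {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("nginx-counters").setMaster("local[2]"), Seconds(5))

    // A socket source keeps the sketch short; swap in a Kafka direct stream in practice.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Assumed line layout: "ip time url htmlId", where the ip identifies the user
    // and the first 10 characters of the time field are the date.
    val visits = lines.map(_.split(" ")).filter(_.length >= 3)
      .map(f => (f(0), f(2), f(1).take(10)))   // (user ip, url, day)

    visits.foreachRDD { rdd =>
      rdd.foreachPartition { part =>
        val jedis = new Jedis("localhost", 6379)   // one connection per partition
        part.foreach { case (user, url, day) =>
          jedis.hincrBy(s"uv:$day:$user", url, 1L) // a) visits of a url by one user
          jedis.hincrBy(s"pv:$day", url, 1L)       // b) total visits of a url that day
        }
        jedis.close()
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}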
18. There are 10 files, each 1 GB in size; every line of each file stores a user query, and the queries in each file may be duplicated. Sort the queries by their frequency. This is a classic Top-K problem.
Solutions are as follows:
1) option 1:
Read the 10 files sequentially and, according to hash(query) % 10, write each query into one of 10 new files, so that each newly generated file is about 1 GB (assuming the hash function is roughly uniform). Then, on a machine with about 2 GB of memory, use a hash_map(query, query_count) to count the occurrences of each query in each file in turn, sort by the number of occurrences using quick/heap/merge sort, and output the sorted queries and their query_count to a file. This yields 10 sorted files, which are finally merge-sorted together (a combination of internal and external sorting).
2) option 2:
Generally speaking, the total number of distinct queries is limited, but the number of repetitions is large, so it may be possible to fit all the distinct queries in memory at once. In that case we can count the occurrences of each query directly with a trie or a hash_map, and then sort by the number of occurrences using quick/heap/merge sort.
3) option 3:
Similar to option 1, but after hashing the data into multiple files, those files can be processed in parallel with a distributed architecture (such as MapReduce), and the results merged at the end. (A single-machine sketch of the count-and-heap step follows below.)
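A minimal single-machine sketch of the count-then-heap step from option 1, assuming the queries of one hashed bucket fit in memory; the file name is made up:

import scala.io.Source
import scala.collection.mutable

object TopKQueries {
  def topK(path: String, k: Int): Seq[(String, Int)] = {
    // 1. Count occurrences with a hash map.
    val counts = mutable.HashMap.empty[String, Int]
    for (line <- Source.fromFile(path).getLines())
      counts(line) = counts.getOrElse(line, 0) + 1

    // 2. Keep the k most frequent queries with a min-heap of size k.
    val minHeap = mutable.PriorityQueue.empty[(String, Int)](
      Ordering.by[(String, Int), Int](_._2).reverse)
    for (entry <- counts) {
      minHeap.enqueue(entry)
      if (minHeap.size > k) minHeap.dequeue()   // drop the current lowest count
    }
    minHeap.dequeueAll.reverse                  // most frequent first
  }

  def main(args: Array[String]): Unit =
    topK("bucket-0.txt", 10).foreach { case (q, c) => println(s"$q\t$c") }
}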
19. Find the non-repeated integers among 250 million integers. Note: there is not enough memory to hold all 250 million integers.
1) Option 1: use a 2-bit bitmap (each number gets 2 bits: 00 means not present, 01 means seen once, 10 means seen more than once, 11 is meaningless). The total memory required is 2^32 * 2 bits = 1 GB, which is acceptable. Then scan the 250 million integers and update the corresponding bits in the bitmap: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. After the scan, walk the bitmap and output the integers whose bits are 01.
2) Option 2: a method similar to the previous question can also be used: split the data into small files, find the non-repeated integers in each small file and sort them, and then merge the results while removing duplicates. (A sketch of option 1's 2-bit bitmap follows below.)
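A minimal sketch of option 1's 2-bit bitmap, assuming the integers arrive as unsigned 32-bit values carried in Longs; the sample input stands in for the 250 million integers, and the array really does allocate 1 GB:

object TwoBitBitmap {
  // 2 bits per value over the full 32-bit range: 2^32 * 2 bits = 1 GB,
  // stored as an array of Longs (32 two-bit slots per Long).
  private val bits = new Array[Long](1 << 27)

  private def get(v: Long): Int =
    ((bits((v >>> 5).toInt) >>> ((v & 31) * 2)) & 3L).toInt

  private def set(v: Long, state: Int): Unit = {
    val idx = (v >>> 5).toInt
    val shift = (v & 31) * 2
    bits(idx) = (bits(idx) & ~(3L << shift)) | (state.toLong << shift)
  }

  def main(args: Array[String]): Unit = {
    val input = Iterator[Long](5, 7, 5, 9, 7, 5)  // stand-in for the real input
    for (v <- input) get(v) match {
      case 0 => set(v, 1)   // 00 -> 01: seen once
      case 1 => set(v, 2)   // 01 -> 10: seen more than once
      case _ =>             // 10 stays 10
    }
    // Walk the whole range and print the values whose state is 01 (non-repeated).
    for (v <- 0L until (1L << 32) if get(v) == 1) println(v)
  }
}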
20. Tencent interview question: given 4 billion unsorted, non-repeated unsigned int integers and then another number, how do you quickly determine whether that number is among the 4 billion?
1) Option 1: allocate 512 MB of memory, with one bit representing each unsigned int value. Read the 4 billion numbers and set the corresponding bits; then read the number to be queried and check whether its bit is 1: 1 means it exists, 0 means it does not.
2) Option 2: this problem is well described in Programming Pearls; the idea is as follows. Since 2^32 is more than 4 billion, the given number may or may not be present. Represent each of the 4 billion numbers in 32-bit binary and assume they are stored in a file. Then split the 4 billion numbers into two categories:
1. the highest bit is 0
2. the highest bit is 1
Write these two categories to two separate files; one of the files then contains at most 2 billion numbers (which halves the search space). Compare the highest bit of the number being sought, go into the corresponding file, and split that file again into two categories:
1. the second-highest bit is 0
2. the second-highest bit is 1
Write these two categories to two separate files again; one of the files then contains at most 1 billion numbers (halving again). Compare the second-highest bit of the number being sought and go into the corresponding file to continue the search.
...
And so on until the number is found; the time complexity is O(log n). That completes option 2.
3) Addendum: a brief introduction to the bitmap method. The bitmap method is used to determine whether an integer array contains duplicates; checking a collection for duplicates is a common programming task, and when the collection is large we usually want to avoid extra scans, so the nested double loop is not desirable.
The bitmap method fits this situation well. The approach is to create a new array of length max+1, where max is the largest element in the set, and then scan the original array: whenever a value is encountered, set the corresponding position in the new array to 1 (for example, encountering 5 sets the sixth element of the new array to 1); the next time 5 is encountered, the sixth element is already 1, which shows that this value duplicates an earlier one. This method of initializing a new array to zero and then setting ones is similar to how bitmaps are processed, hence the name. Its worst case is 2N scans. If the maximum value of the array is known and the length of the new array is set in advance, the efficiency can be doubled. (A sketch of option 1's bit array follows below.)
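A minimal sketch of option 1 from question 20: a 512 MB bit array covering the full unsigned 32-bit range, answering membership queries in constant time; the sample numbers are illustrative:

object BitmapMembership {
  // 2^32 bits = 512 MB, stored as 2^26 Longs (64 bits per Long).
  private val bits = new Array[Long](1 << 26)

  def add(v: Long): Unit = {
    val idx = (v >>> 6).toInt
    bits(idx) = bits(idx) | (1L << (v & 63))
  }

  def contains(v: Long): Boolean =
    (bits((v >>> 6).toInt) & (1L << (v & 63))) != 0

  def main(args: Array[String]): Unit = {
    val numbers = Iterator[Long](3, 4000000000L, 123456789L) // stand-in for the 4 billion numbers
    numbers.foreach(add)
    println(contains(4000000000L)) // true
    println(contains(42L))         // false
  }
}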
21. How do you find the most repeated item in a huge amount of data?
1) Option 1: hash the data first, then map it into small files by taking the hash modulo; find the most repeated item in each small file and record its count; the item with the largest count among those candidates is the answer (refer to the previous question).
22. Given tens of millions or hundreds of millions of records (with duplicates), count the N records that appear most often.
1) Option 1: tens of millions or hundreds of millions of records should fit in the memory of a current machine, so consider using a hash_map / binary search tree / red-black tree to count the occurrences, and then take the N most frequent records using the heap mechanism mentioned in the earlier questions.
23. A text file has about 10,000 lines, one word per line. Count the 10 most frequent words; give the approach and a time-complexity analysis.
1) Option 1: this question is about time efficiency. Use a trie to count the occurrences of each word; the time complexity is O(n*le) (where le is the average word length). Then find the 10 most frequent words, which can be done with a heap as in the previous question, with time complexity O(n*lg10). The total time complexity is therefore the larger of O(n*le) and O(n*lg10).
24. Find the largest 100 numbers among 1,000,000 (100w) numbers.
1) Option 1: as mentioned in the previous question, use a min-heap of 100 elements. The complexity is O(100w*lg100).
2) Option 2: use the idea of quicksort: after each partition, only consider the part larger than the pivot; when the part larger than the pivot has just over 100 elements, sort it with a traditional sorting algorithm and take the first 100. The complexity is O(100w*100).
3) Option 3: use partial elimination. Take the first 100 elements and sort them as sequence L. Then scan the remaining elements one at a time, comparing each element x with the smallest of the 100 sorted elements; if x is larger, remove the smallest element and insert x into L using insertion sort. Repeat until all elements have been scanned. The complexity is O(100w*100). (A small heap sketch of option 1 follows below.)
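A minimal sketch of option 1: keep a min-heap of the 100 largest numbers seen so far; random data stands in for the 1,000,000 numbers:

import scala.collection.mutable
import scala.util.Random

object Top100 {
  def main(args: Array[String]): Unit = {
    val numbers = Iterator.fill(1000000)(Random.nextInt())
    val minHeap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse) // min-heap
    for (n <- numbers) {
      if (minHeap.size < 100) minHeap.enqueue(n)
      else if (n > minHeap.head) { minHeap.dequeue(); minHeap.enqueue(n) } // replace current minimum
    }
    println(minHeap.dequeueAll.reverse.take(10)) // the largest numbers first (showing a few)
  }
}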
25. There are 10 million text messages, with duplicates, saved in a text file, one message per line. Find the 10 most repeated messages within 5 minutes.
1) Analysis: the conventional method is to sort first and then find the 10 most repeated items in a single pass, but the lowest complexity of a comparison sort is n*lgn.
2) Instead, design a hash_table (hash_map): read the 10 million messages one by one, load them into the hash table while counting repetitions, and maintain a table of the 10 most repeated messages at the same time. This finds the top 10 in a single pass, and the complexity of the algorithm is O(n).