Answers to Frequently Asked Questions in Big Data Interviews

Generally speaking, big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. This article summarizes common questions and answers from big data interviews for your reference:

1. Can Spark replace Hadoop?

A: Hadoop includes Common, HDFS, YARN, and MapReduce. Spark has never claimed to replace Hadoop; at most it replaces MapReduce. In fact, Hadoop has evolved into an ecosystem, and that ecosystem has embraced other excellent frameworks such as Spark (which integrates seamlessly with HDFS and runs well on YARN). On the other hand, Spark is not limited to the Hadoop ecosystem: it also works with other systems such as Elasticsearch and Cassandra. So the Hadoop ecosystem is not a prerequisite for using Spark, even though Spark integrates into it very well.
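
To make that integration concrete, here is a minimal sketch, assuming PySpark is installed and an HDFS namenode is reachable at the hypothetical address hdfs://namenode:8020. The same job runs under YARN when submitted with --master yarn; no MapReduce is involved.

```python
from pyspark.sql import SparkSession

# On a cluster this would be submitted with: spark-submit --master yarn wordcount.py
spark = SparkSession.builder.appName("hdfs-wordcount").getOrCreate()

# Read a text file straight from HDFS (address and path are hypothetical).
lines = spark.read.text("hdfs://namenode:8020/data/input.txt")
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())   # split lines into words
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))         # word -> count
print(counts.take(10))
spark.stop()
```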

2. Talk about Flink. How does it compare with Spark?

A: First of all, as a Spark evangelist in China, I must admit that I followed Flink very early. :) Flink's current popularity in Europe is decent. Like Spark, Flink aims to be a unified computing engine (batch processing, stream processing, machine learning, graph computing, and so on), and both integrate well into the Hadoop ecosystem (for example, seamless integration with HDFS and support for YARN as the scheduling framework). At first glance the two look similar, but they actually take different paths: when Spark launched, it emphasized "in-memory computing", while Flink followed a path closer to MPP. On the other hand, Flink claims to do true real-time computation, whereas Spark only does micro-batching. The incremental iteration that Flink supports is also interesting: during an iteration it only processes the data that changed, so later iterations touch only a small subset. But none of that is why I paid attention to Flink in the first place; the reason was its unique approach to memory management. Flink decided to manage memory itself, dividing the heap into three parts: network buffers, the memory manager pool, and the remaining heap. If you are interested, you can look up the details yourself. Of course, Spark has a trump card of its own: Project Tungsten, which, besides doing its own memory management, applies other very aggressive optimizations; these land in versions 1.4, 1.5, and 1.6. Oh, and finally: Spark only supports DAGs, while Flink supports cyclic dataflows. That's all for this topic; there will probably be a dedicated article later.

3. What do you think of HBase and Cassandra?

A: First of all, some engineers in China hold an inexplicable prejudice against Cassandra, assuming that once Facebook "abandoned" it, it must be no good. Personally, I have plenty of hands-on experience with HBase, though I have not used it for some time (the last time was running it as the backend for OpenTSDB over a year ago). HBase and Cassandra are similar in many ways. First, both are column-oriented stores. Second, both write first to a log, then into an in-memory structure, and finally flush to disk, and they even use a similar underlying structure: the LSM tree. Put simply, the HBase path is Data -> HLog -> MemStore -> StoreFile (HFile); the Cassandra path is Data -> CommitLog -> memtable -> SSTable. Third... I will stop there; the list is long. Now the differences: HBase needs ZooKeeper, while Cassandra is self-sufficient; HBase needs a Master, while Cassandra only needs seed nodes; HBase gets cluster information from ZooKeeper, while Cassandra nodes communicate via gossip; HBase requires HDFS underneath, Cassandra does not; HBase does not natively support secondary indexes, Cassandra does; HBase has had coprocessors for a long time, Cassandra has not. I won't enumerate everything. The biggest difference is that HBase is essentially centralized, while Cassandra is decentralized (peer-to-peer), which is why I may prefer the latter. For now it is hard to say which is better or worse; the choice is yours. :)
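
The shared write path is easy to see in a toy sketch. This is not either system's real code, just an illustration of the LSM idea; the file names and flush threshold are invented.

```python
import json

class ToyLSM:
    """Append to a log, buffer in memory, flush sorted runs to disk."""

    def __init__(self, flush_threshold=4):
        self.log = open("commit.log", "a")   # HLog / CommitLog
        self.memtable = {}                   # MemStore / memtable
        self.flush_threshold = flush_threshold
        self.sstable_id = 0

    def put(self, key, value):
        # Durability first: the write hits the log before memory.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Keys are written in sorted order: a StoreFile (HFile) / SSTable.
        with open(f"sstable-{self.sstable_id}.json", "w") as f:
            for k in sorted(self.memtable):
                f.write(json.dumps([k, self.memtable[k]]) + "\n")
        self.sstable_id += 1
        self.memtable.clear()

db = ToyLSM()
for i in range(10):
    db.put(f"row{i}", i)
```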

4. Distributed message queues are very important in stream processing systems. Which message queue would you choose, and can you briefly explain why?

A: Kafka, without a doubt! At most, put a Flume in front of it. The reason behind any technology choice is your requirements: fast, scalable, and durable is what I need, and Kafka satisfies all three perfectly. A few details, most of which you probably know already. Kafka writes data to disk, but in practice the writes land in the OS page cache, and on the read path it uses sendfile to push data to the NIC very efficiently. Kafka is also very scalable: just add brokers. Kafka's logical model is very clear, too: data from different business domains is written to different topics, and each topic can be split into several partitions for parallel processing. Since Kafka 0.9, only the brokers need ZooKeeper; consumers no longer record offsets in ZooKeeper, which greatly reduces the load on ZooKeeper and indirectly eases scaling. Kafka also has friendly deletion policies: you can delete by max age or max size, or compact by key, which basically covers the common needs. On top of that, the Kafka community is very active, and almost all popular streaming frameworks support Kafka, such as Spark Streaming and Storm. By the way, there is a tool called Camus that can periodically move data from Kafka to HDFS, which several friends have recommended. Again, it all depends on your requirements and on whether the system can handle your load.
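
A hedged sketch of the topic/partition model using the third-party kafka-python client (my choice for illustration, not something the article prescribes; the broker address, topic name, and group id are hypothetical).

```python
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# One business domain, one topic; messages with the same key land in the
# same partition, and partitions are consumed in parallel.
for i in range(100):
    producer.send("clickstream", key=str(i % 4).encode(),
                  value=("event-%d" % i).encode())
producer.flush()

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",            # offsets tracked by Kafka, not ZooKeeper
    auto_offset_reset="earliest",
)
for msg in consumer:
    print(msg.partition, msg.offset, msg.value)
```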

5. What do you think of Tachyon?

A: I tried Tachyon very early on. Note: tried! It is now backed by the commercial company Tachyon Nexus. Oh, by the way, to be clear, Tachyon is written in Java, not Scala. Personally, I am very optimistic about Tachyon's future. It occurs to me I haven't even said what Tachyon is: Tachyon is a distributed in-memory file system, and as memory keeps getting cheaper, as long as Tachyon itself is of high quality, there will clearly be no shortage of users. Briefly, on its principles and characteristics: as a memory-based file system, it obviously uses memory very aggressively, so people worry about losing in-memory data when a node dies. In fact there is no need to worry: there is normally an underlying file system beneath Tachyon, in most cases HDFS (other file systems are also supported), and Tachyon regularly checkpoints data to it. Tachyon also has the concept of a journal (image + edit log). Very interestingly, Tachyon borrows Spark's concept of lineage and relies on lineage to recover files. As mentioned earlier, it is hard for a framework to gain adoption without integrating with Hadoop, so Tachyon very amicably implements the HDFS interface, meaning MapReduce, Spark, and even Flink can use Tachyon with almost no code changes. Another excellent point is that Tachyon supports tables: users can put columns with high query density into Tachyon to improve query efficiency. A few more Tachyon use cases: sharing data between different Spark jobs; sharing data between different frameworks; avoiding the loss of Spark's entire cache when the BlockManager dies; and enabling memory reuse. I will stop here rather than expand further. (The above content is reproduced from ChinaScala.)
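
Because Tachyon implements the HDFS interface, pointing an existing job at it is usually just a URI change. A hedged sketch (the master address and path are hypothetical; tachyon:// with port 19998 was, as far as I recall, the default scheme, and the Tachyon client jar must be on Spark's classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tachyon-demo").getOrCreate()
# Identical to reading from HDFS; only the URI scheme changes.
lines = spark.read.text("tachyon://master:19998/data/input.txt")
print(lines.count())
spark.stop()
```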

6. There are 10 files of 1 GB each. Every line of every file stores a user's query, and the queries in each file may be duplicated. Sort the queries by frequency.

Option 1:

1) Read the 10 files sequentially and write each query into one of 10 new files according to hash(query) % 10. Each newly generated file will be about 1 GB in size (assuming the hash function distributes queries evenly).

2) Find a machine with about 2 GB of memory and use a hash_map(query, query_count) to count the number of times each query appears in each file. Sort by occurrence count using quicksort / heapsort / merge sort, and write the sorted queries with their counts to a file. This yields 10 sorted files.

3) Merge-sort the 10 sorted files (a combination of internal and external sorting).

Option 2:

In general, the total number of distinct queries is limited, but the repetition count is high, so it may be possible to fit all distinct queries in memory at once. In that case we can count the occurrences of each query directly with a trie or hash_map, and then sort by occurrence count using quicksort / heapsort / merge sort.

Option 3:

Similar to Option 1, but after hashing into multiple files, those files can be processed in parallel by multiple machines using a distributed framework (such as MapReduce), with a final merge at the end.
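
A hedged Python sketch of Option 1 (file names are invented). Because hash partitioning sends every copy of a query to the same bucket, the per-bucket counts are globally correct and the sorted runs can simply be merged.

```python
import heapq
from collections import Counter

N_BUCKETS = 10

# 1) Partition: route each query to a bucket file by hash(query) % 10.
buckets = [open(f"bucket-{i}.txt", "w") for i in range(N_BUCKETS)]
for i in range(10):
    with open(f"queries-{i}.txt") as f:
        for query in f:
            buckets[hash(query) % N_BUCKETS].write(query)
for b in buckets:
    b.close()

# 2) Count each bucket (each fits in memory) and sort by frequency.
runs = []
for i in range(N_BUCKETS):
    with open(f"bucket-{i}.txt") as f:
        counts = Counter(line.rstrip("\n") for line in f)
    runs.append(sorted((-c, q) for q, c in counts.items()))

# 3) External merge of the 10 sorted runs, most frequent first.
for neg_count, query in heapq.merge(*runs):
    print(query, -neg_count)
```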

7. There is a 1 GB file in which each line is a word of at most 16 bytes; the memory limit is 1 MB. Return the 100 most frequent words.

Solution: Read the file sequentially; for each word x, compute hash(x) % 5000 and write x to the corresponding one of 5000 small files, so each file is about 200 KB. If any file exceeds 1 MB, keep splitting it in the same way until every piece is under 1 MB. For each small file, count the words that appear and their frequencies (using a trie or hash_map), take the 100 most frequent words (a min-heap of 100 nodes works), and save those 100 words with their frequencies to a file, which yields another 5000 files. The final step is to merge these 5000 files (similar to merge sort).
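
A hedged sketch of the per-file step (the file name is invented). Because hash partitioning puts all copies of a word into one file, the per-file top-100 lists can be merged safely afterwards.

```python
import heapq
from collections import Counter

def top_100(path):
    with open(path) as f:
        counts = Counter(word.strip() for word in f)
    # heapq.nlargest keeps a bounded 100-node min-heap internally.
    return heapq.nlargest(100, counts.items(), key=lambda kv: kv[1])

partial = top_100("words-0042.txt")  # repeat for each of the 5000 files,
                                     # then merge the partial lists
```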

8. Use massive log data to extract the IP that visited Baidu the most on a given day.

Solution: First restrict attention to that day: extract the IPs from the logs of visits to Baidu and write them one by one into a large file. Note that an IP address is 32 bits, so there are at most 2^32 distinct IPs. We can then use a mapping method, such as hash(IP) % 1000, to split the large file into 1000 small files, find the most frequent IP in each small file (using a hash_map for the frequency counts) together with its frequency, and finally, among those 1000 candidates, pick the IP with the highest frequency. That is the answer.
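
A hedged sketch of the counting half (the partition file names are invented):

```python
from collections import Counter

best_ip, best_count = None, 0
for i in range(1000):
    with open(f"ips-{i:03d}.txt") as f:
        counts = Counter(line.strip() for line in f)
    ip, count = counts.most_common(1)[0]
    if count > best_count:               # partitions are disjoint by hash,
        best_ip, best_count = ip, count  # so per-file maxima are comparable
print(best_ip, best_count)
```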

9. Find the non-repeating integers among 250 million integers, where memory cannot hold all 250 million integers at once.

Solution 1: Use a 2-Bitmap, allocating 2 bits per value: 00 means not seen, 01 means seen once, 10 means seen multiple times, and 11 is unused. Assuming 32-bit integers, this requires 2^32 x 2 bits = 1 GB of memory, which is acceptable. Scan the 250 million integers and update the corresponding bits in the bitmap: 00 becomes 01, 01 becomes 10, and 10 stays 10. After the scan, walk the bitmap and output every integer whose bits are 01.
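
A hedged sketch of the 2-Bitmap, shown at toy scale (values below 2^16) so it runs quickly; the real problem would size the array for 2^32 values, i.e. the 1 GB above.

```python
UNIVERSE = 2 ** 16
bitmap = bytearray(UNIVERSE // 4)         # four 2-bit slots per byte

def get(v):
    return (bitmap[v // 4] >> ((v % 4) * 2)) & 0b11

def bump(v):
    state = get(v)
    if state < 2:                         # 00 -> 01, 01 -> 10, 10 stays 10
        shift = (v % 4) * 2
        bitmap[v // 4] &= ~(0b11 << shift) & 0xFF
        bitmap[v // 4] |= (state + 1) << shift

for v in [5, 9, 5, 7, 9, 9]:
    bump(v)

print([v for v in range(UNIVERSE) if get(v) == 0b01])   # -> [7]
```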

Option 2: As in the previous questions, split the data into small files, find the non-duplicate integers within each small file and sort them, then merge the files, taking care to remove duplicate elements during the merge.

10. How do you find the item that repeats the most in a huge amount of data?

Solution: First hash each item and map it into a small file by the hash value modulo the number of files; find the item with the most repetitions in each small file and record its count. Then among those candidates, the one with the largest count from the previous step is the answer (see the previous question).
