
What are the real web big data interview questions?


This article introduces real web big data interview questions. Many people have doubts about these questions in daily work, so the editor has consulted a variety of materials and sorted out simple, easy-to-follow explanations. I hope it helps resolve your doubts about real web big data interview questions. Now, please follow the editor and study!

First question: big data written test question - Java related ("Static Father"). Write the output of the following program:

class Father {
    static { System.out.println("Static Father"); }
    { System.out.println("Non-static Father"); }
    public Father() { System.out.println("Constructor Father"); }
}

public class Son extends Father {
    static { System.out.println("Static Son"); }
    { System.out.println("Non-static Son"); }
    public Son() { System.out.println("Constructor Son"); }

    public static void main(String[] args) {
        System.out.println("First Son");
        new Son();
        System.out.println("Second Son");
        new Son();
    }
}

Running result:

Static Father
Static Son
First Son
Non-static Father
Constructor Father
Non-static Son
Constructor Son
Second Son
Non-static Father
Constructor Father
Non-static Son
Constructor Son

Analysis:

This program examines the concepts of static code blocks, construction code blocks, and constructors in Java.

Static code block static {}:

It executes when the class is loaded: the JVM runs it exactly once, when the class is initialized, and it is not run again afterwards. Its execution priority is higher than that of the non-static initialization block. It can only initialize class-level state, that is, static data members.

Non-static code block, also known as construction code block {}:

It runs once each time an object is created and initializes the instance variables of the class. If the class also has a static code block, the static block runs first (once, at class loading). The construction code block runs as part of object construction, before the body of the constructor executes.

Constructor:

A constructor is called as soon as an object is created; in other words, the constructor does not run unless an object is created. For each object created, the constructor runs exactly once, whereas an ordinary method can be called on that object many times.

Now look at the program. When it runs, executing main() first loads the class that contains main(), the Son class. Because Son extends Father, the parent class is loaded and initialized first, so the static block of Father executes, then Son itself is initialized and its static block executes. Once class loading and initialization are complete, the statements in main() run: "First Son" is printed, then new Son() creates an object. Constructing a Son first constructs the Father part; the static blocks are not executed again because they run only once at class initialization. For the parent, the construction code block runs before the constructor body; then the same happens for Son itself. After that, "Second Son" is printed and the second new Son() repeats the object-construction steps: the construction code block and the constructor run on every new, but the static blocks do not.

Question 2: big data interview question - JVM related (Fengnest Technology) Q: explain how the stack, the heap and the static storage area are used in memory.

A: variables of primitive data types, object references, and the frames saved for function calls all use the stack space in memory; objects created with the new keyword and a constructor are placed in the heap; literals written directly in the program, such as 100 or "hello", and constants are placed in the static storage area. The stack is the fastest to operate on but is small; large numbers of objects live in the heap, and the heap can use all of memory, including virtual memory on disk.

String str = new String("hello");

In the statement above, str is placed on the stack, the String object created with new is placed on the heap, and the literal "hello" is placed in the static storage area.

Addition: newer versions of Java use a technique called escape analysis, which can place some local objects on the stack to improve performance. (It is supported in Java SE 6u23 and later and is enabled by default, so no extra JVM parameter is needed.)
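A small illustration of the kind of code escape analysis can help with (the Point class here is made up for the example): the object allocated in sum() never leaves the method, so the JVM may place it on the stack or eliminate the allocation entirely.

// Hypothetical example: the Point allocated in sum() never escapes the method,
// so with escape analysis (on by default since Java SE 6u23) the JVM is free
// to stack-allocate it or scalar-replace it instead of allocating on the heap.
public class EscapeDemo {
    static class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static int sum(int a, int b) {
        Point p = new Point(a, b); // reference never leaves this stack frame
        return p.x + p.y;
    }

    public static void main(String[] args) {
        System.out.println(sum(1, 2)); // prints 3
    }
}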

Question 3: big data interview question - massive data related (Tencent) Q: given 4 billion distinct, unsorted unsigned int integers, and then one more number, how do you quickly judge whether that number is among the 4 billion?

Answer: Plan 1: allocate 512 MB of memory. 512 MB is 2^32 (about 4.29 billion) bits, and one bit represents one unsigned int value. Read the 4 billion numbers and set the corresponding bits; then read the number to be queried and check its bit: 1 means it exists, 0 means it does not.
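A minimal Java sketch of plan 1, assuming the 4 billion numbers can be streamed in from some source and the heap is large enough to hold the 512 MB long[] bitmap (class and method names are illustrative only).

// Sketch of plan 1: a 2^32-bit bitmap stored in a long[] (2^26 longs = 512 MB).
// Run with a heap comfortably above 512 MB, e.g. -Xmx1g.
public class BitmapLookup {
    private final long[] bits = new long[1 << 26]; // 2^32 bits total

    void set(long unsignedInt) {
        int word = (int) (unsignedInt >>> 6);      // which long holds this bit
        bits[word] |= 1L << (unsignedInt & 63);    // set the bit within that long
    }

    boolean contains(long unsignedInt) {
        int word = (int) (unsignedInt >>> 6);
        return (bits[word] & (1L << (unsignedInt & 63))) != 0;
    }

    public static void main(String[] args) {
        BitmapLookup bitmap = new BitmapLookup();
        bitmap.set(4_000_000_000L);                           // mark one of the 4 billion numbers
        System.out.println(bitmap.contains(4_000_000_000L));  // true
        System.out.println(bitmap.contains(7L));              // false
    }
}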

Plan 2: this problem is described well in Programming Pearls, and the following line of thought is worth exploring. Because 2^32 is more than 4.2 billion, the given number may or may not be among the 4 billion. Represent each of the 4 billion numbers as a 32-bit binary value, and suppose they start out in one file. Then split the 4 billion numbers into two classes:

The highest bit is 0

The highest bit is 1.

Write the two classes into two files, one of which contains at most 2 billion numbers (at most half); compare the highest bit of the number you are looking for with this split, enter the corresponding file, and divide that file into two classes again:

The next-highest bit is 0

The next-highest bit is 1.

Write these two classes into two files again, one of which contains at most 1 billion numbers (at most half); compare the next-highest bit of the target number, enter the corresponding file, and continue in the same way. Repeating this, the number is found or ruled out after at most 32 rounds, one per bit, so the number of rounds is logarithmic in the size of the value range.
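A compact sketch of plan 2's idea, assuming the candidate numbers sit in a binary file of 4-byte big-endian values (file names and layout are assumptions of this example); each round splits the current file by one bit and recurses into the half that matches the target.

// Sketch of plan 2: split the file by one bit per round and follow the target's bit.
// Usage (hypothetical file): search(new File("numbers.bin"), 123456789L, 31)
import java.io.*;

public class BitPartitionSearch {

    static boolean search(File inputFile, long target, int bit) throws IOException {
        if (bit < 0) {
            // No bits left to split on: scan the remaining (now tiny) file directly.
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(inputFile)))) {
                while (in.available() > 0) {
                    long value = in.readInt() & 0xFFFFFFFFL;   // read as unsigned
                    if (value == target) return true;
                }
            }
            return false;
        }
        File zeros = File.createTempFile("bit" + bit + "_0", ".bin");
        File ones  = File.createTempFile("bit" + bit + "_1", ".bin");
        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(new FileInputStream(inputFile)));
             DataOutputStream out0 = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(zeros)));
             DataOutputStream out1 = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(ones)))) {
            while (in.available() > 0) {
                int v = in.readInt();
                if (((v >>> bit) & 1) == 0) out0.writeInt(v); else out1.writeInt(v);
            }
        }
        // Descend into the half that matches the target's bit (temp files kept for brevity).
        File next = (((target >>> bit) & 1) == 0) ? zeros : ones;
        return search(next, target, bit - 1);
    }
}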

Question 4: big data interview question - Hadoop related (Ali) Q: in which stages does sorting take place in MapReduce? Can these sorts be avoided? Why?

A:

A MapReduce job consists of two parts, the Map phase and the Reduce phase, and both sort the data. In this sense, the MapReduce framework is essentially a distributed sort.

In the Map phase, each Map Task writes a file on its local disk sorted by key (using quick sort); multiple intermediate files may be produced along the way, but they are eventually merged into one. In the Reduce phase, each Reduce Task sorts the data it receives (using merge sort), so the data is grouped by key and then handed to reduce() for processing.

Many people believe, mistakenly, that there is no sorting in the Map phase if no Combiner is used. Whether or not a Combiner is used, the data produced by a Map Task is sorted (unless there are no Reduce Tasks, in which case it is not sorted; the Map-side sort exists precisely to reduce the sorting load on the Reduce side).

Because this sorting is done automatically by MapReduce and is not under the user's control, it cannot be avoided or turned off in Hadoop 1.x; in Hadoop 2.x it can be turned off.

Question 5: big data interview question - Kafka related (Shangtang Technology) Q: the relationship between Kafka data partitions and consumers, Kafka's data offset reading process, and how to guarantee ordering within Kafka.

A:

The relationship between Kafka data partitions and consumers: within a consumer group, one partition can be consumed by only one consumer, and the consumers in the same group balance the partitions among themselves.

Kafka's data offset reading process:

Connect to the ZK cluster and obtain from ZK the partition information of the corresponding topic and the Leader of each partition

Connect to the broker corresponding to that Leader

The consumer sends the offset it has saved to the Leader

The Leader locates the segment (index file and log file) based on the offset and other information

Based on the contents of the index file, locate the start position corresponding to the offset in the log file, read data of the corresponding length, and return it to the consumer

How to guarantee ordering within Kafka: Kafka can only guarantee order within a partition; order across partitions is not possible. Iqiyi's search architecture, for example, keeps ordering at the business level by routing records that must stay ordered to the same partition.
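A minimal producer sketch of this idea, assuming a hypothetical topic "orders" and a broker at localhost:9092; because records that share a key hash to the same partition, the three events for one order are consumed in the order they were sent.

// Sketch: records with the same key go to the same partition, so Kafka preserves
// their relative order (topic name and broker address are assumptions).
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String orderId = "order-42";   // business key that must stay ordered
            producer.send(new ProducerRecord<>("orders", orderId, "created"));
            producer.send(new ProducerRecord<>("orders", orderId, "paid"));
            producer.send(new ProducerRecord<>("orders", orderId, "shipped"));
            // All three events hash to the same partition and are read back in order.
        }
    }
}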

Question 6: big data interview question - distributed systems related (Ali) Q: talk about three kinds of distributed locks.

A:

Implement a distributed lock based on a database: (poor performance, risk of locking the whole table, non-blocking, failures must be retried by polling, which costs CPU)

Pessimistic lock

Take advantage of the exclusive lock of select ... where ... for update.

Note: other auxiliary details are basically the same as in the usual implementation. What needs attention here is the condition such as "where name = lock": the name field must be indexed, otherwise the whole table will be locked. In some cases, for example on a small table, the MySQL optimizer may decide not to use the index, which leads to table-locking problems.

Optimistic lock

The biggest difference between an optimistic lock and a pessimistic lock is that the optimistic lock is based on the idea of CAS: it is not mutually exclusive, does not make other requests wait on a lock or consume resources, and assumes there is no conflict during the operation; a conflict is only detected when the update of the version number fails. Flash-sale and seckill systems use this kind of implementation to prevent overselling. An optimistic lock is implemented by adding an incrementing version-number field, as sketched below.
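A hedged JDBC sketch of the version-number idea (the table and column names are made up for illustration): the update succeeds only if the version read earlier is still current, so losing a race shows up as an update count of 0.

// Sketch of an optimistic lock via a version column (table/column names are assumptions).
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class OptimisticLockDemo {
    // Tries to deduct stock for one item; returns false if another
    // transaction changed the row first (the version no longer matches).
    static boolean deductStock(Connection conn, long itemId) throws SQLException {
        long version;
        try (PreparedStatement read = conn.prepareStatement(
                "SELECT version FROM item_stock WHERE id = ?")) {
            read.setLong(1, itemId);
            try (ResultSet rs = read.executeQuery()) {
                if (!rs.next()) return false;
                version = rs.getLong(1);
            }
        }
        try (PreparedStatement update = conn.prepareStatement(
                "UPDATE item_stock SET stock = stock - 1, version = version + 1 " +
                "WHERE id = ? AND version = ? AND stock > 0")) {
            update.setLong(1, itemId);
            update.setLong(2, version);
            return update.executeUpdate() == 1;   // 0 rows => concurrent conflict: retry or give up
        }
    }
}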

Implement a distributed lock based on a cache (Redis, etc.): (the expiration time is difficult to control, non-blocking, failures require polling, which costs CPU)

When acquiring the lock, use setnx to set the lock key and use the expire command to add a timeout to it; after the timeout the lock is released automatically. The value of the lock is a randomly generated UUID, which is checked when the lock is released.

When acquiring the lock, also set an overall acquisition timeout; if it is exceeded, give up acquiring the lock.

When releasing the lock, use the UUID to check whether the lock is still this client's; if it is, execute delete to release it (see the sketch below).
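A sketch of the cache-based lock using the Jedis client (key name, timeout values and the Jedis 3.x-style API are assumptions of this example); it folds setnx and expire into a single SET ... NX EX call to avoid the gap between the two commands, and uses a small Lua script so that the check-UUID-then-delete on release is atomic.

// Sketch of a Redis lock with the Jedis client (key name, timeout and Jedis 3.x API assumed).
import java.util.Collections;
import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedisLockDemo {
    private static final String RELEASE_SCRIPT =
        "if redis.call('get', KEYS[1]) == ARGV[1] then " +
        "  return redis.call('del', KEYS[1]) else return 0 end";

    // Single acquisition attempt; returns the lock token if acquired, otherwise null.
    static String tryLock(Jedis jedis, String lockKey, int expireSeconds) {
        String token = UUID.randomUUID().toString();
        // NX: only set if absent; EX: expire automatically so a crashed holder cannot block forever.
        String reply = jedis.set(lockKey, token, SetParams.setParams().nx().ex(expireSeconds));
        return "OK".equals(reply) ? token : null;
    }

    // Releases the lock only if the stored value is still our token (atomic via Lua).
    static void unlock(Jedis jedis, String lockKey, String token) {
        jedis.eval(RELEASE_SCRIPT, Collections.singletonList(lockKey), Collections.singletonList(token));
    }
}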

Implement a distributed lock based on Zookeeper: (highly available, reentrant, a blocking lock)

General idea: when a client wants to lock a function, it creates a unique ephemeral sequential node under the directory node designated for that function on Zookeeper. Determining whether the lock has been acquired is simple: just check whether the client's node has the smallest sequence number among the ordered nodes. To release the lock, just delete that ephemeral node. Because the node is ephemeral, a lock left unreleased by a crashed service disappears automatically, which avoids deadlock.

Advantages: the lock is safe; zk data can be persisted, and the state of the client holding the lock is monitored in real time. Once the client goes down, its ephemeral node disappears and zk can release the lock promptly. This also removes the timeout-handling logic needed when implementing locks with a distributed cache.

Disadvantages: the performance overhead is relatively high, because ephemeral nodes must be dynamically created and destroyed to implement the lock, so it is not well suited for direct use in high-concurrency scenarios.

Implementation: distributed locks can be implemented easily with Curator, the third-party Zookeeper client library (see the sketch below).

Applicable scenarios: scenarios where reliability requirements are very high and concurrency is not, such as scheduled full or incremental synchronization of core data.
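A sketch using the Curator recipe mentioned above (the connection string, retry values and lock path are assumptions); InterProcessMutex creates the ephemeral sequential node and handles the watching and notification details internally.

// Sketch of a Zookeeper lock via Apache Curator's InterProcessMutex
// (connection string, retry policy values and lock path are assumptions).
import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ZkLockDemo {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        InterProcessMutex lock = new InterProcessMutex(client, "/locks/order-sync");
        if (lock.acquire(10, TimeUnit.SECONDS)) {   // blocking acquire with a timeout
            try {
                // critical section: e.g. the full/incremental sync mentioned above
            } finally {
                lock.release();                     // deletes the ephemeral sequential node
            }
        }
        client.close();
    }
}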

Question 7: big data interview question - Hadoop and Spark related (JD.com Finance) Q: what are the similarities and differences between Hadoop and Spark?

A:

The bottom layer of Hadoop uses the MapReduce computing model, which offers only map and reduce operations and therefore lacks expressiveness; in the course of an MR job it repeatedly reads and writes HDFS, producing a large amount of disk I/O, so it is suited to high-latency batch computing.

Spark is a memory-based distributed computing framework that provides a richer set of dataset operations, divided mainly into transformations and actions, including map, reduce, filter, flatMap, groupByKey, reduceByKey, union and join. Data analysis is faster, so it is suited to low-latency computing.

The biggest difference between Spark and Hadoop lies in the iterative computing model. A Hadoop MapReduce job has only two stages, map and reduce, and ends when they finish, so the processing that can be done in one job is very limited. Spark's computing model is an in-memory iterative one: a job can be divided into n stages according to the RDD operators the user has written, and after one stage finishes it can continue with many more, not just two. Compared with MapReduce, Spark's computing model is therefore more flexible and can provide more powerful functionality.

But Spark also has disadvantages. Because Spark computes in memory, although development is easy, when it really faces big data without tuning, all kinds of problems can appear, such as OOM (out-of-memory) errors that keep the Spark program from running at all; MapReduce, by contrast, runs slowly but at least slowly finishes.

Question 8: big data interview question - Yarn related (Tesla) Q: how does an application execute on a Yarn cluster?

A:

When a jobclient submits an application to YARN, YARN runs it in two phases: the first phase starts the ApplicationMaster; in the second phase, the ApplicationMaster creates the application, requests resources for it, and monitors its execution until it finishes.

The specific steps are as follows:

The user submits an application to YARN, specifying the ApplicationMaster program, the command to start the ApplicationMaster, and the user program.

The RM allocates the first Container to the application and communicates with the corresponding NM, asking it to start the application's ApplicationMaster in this Container.

The ApplicationMaster registers with the RM, then splits the application into internal subtasks, requests resources for each of them, and monitors their execution until they finish.

The AM applies for and receives resources from the RM by polling.

The RM allocates resources to the AM, returned in the form of Containers.

After the AM obtains a resource, it communicates with the corresponding NM and asks the NM to start the task.

The NodeManager sets up the running environment for the task, writes the task startup command into a script, and starts the task by running that script.

Each task reports its status and progress to the AM so that the AM can restart the task if it fails.

When the application is complete, the ApplicationMaster unregisters from the ResourceManager and shuts itself down.

Question 9: big data interview question - data quality related (Ant Financial Services Group) Q: how do you monitor data quality?

A:

For example, check that the number of records in a table is within a known range, or that its fluctuation up or down does not exceed a certain threshold:

SQL result: var data_volume = select count(*) from table where <time and other filter conditions>

Alarm trigger condition: if the data volume is not within [lower limit, upper limit], trigger an alarm.

Week-over-week change: if ((this week's data volume - last week's data volume) / last week's data volume * 100) is not within [lower ratio limit, upper ratio limit], trigger an alarm.

Day-over-day change: if ((today's data volume - yesterday's data volume) / yesterday's data volume * 100) is not within [lower ratio limit, upper ratio limit], trigger an alarm.

Alarm trigger conditions must be configured; without a configured threshold there is no monitoring. Metrics typically monitored this way include daily, weekly and monthly active users, retention (daily, weekly, monthly), conversion rate (daily, weekly, monthly), GMV (daily, weekly, monthly), and repurchase rate (daily, weekly, monthly).

Single table null value detection

The number of records in which the field is null is within a range, or its percentage of the total is within a threshold range.

Target field: select the field to monitor; "none" cannot be selected.

SQL result: var abnormal_data_volume = select count(*) from table where <target field> is null

Single detection: if the abnormal data volume is not within [lower limit, upper limit], trigger an alarm.

Single table duplicate value detection

Whether one or more fields meet certain rules

Target field: the first step is to count the total number of rows normally: select count(*) from table

The second step is to count after deduplication, for example: select count(distinct <field>) from table

Subtract the second value from the first and check whether the difference lies within the upper and lower thresholds.

Single detection: if the abnormal data volume is not within [lower limit, upper limit], trigger an alarm.

Cross-table data comparison

Mainly for synchronization processes: monitor whether the row counts of the two tables are the same.

SQL result: count (this table)-count (associated table)

The threshold configuration is the same as null detection.
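A hedged JDBC sketch of the generic pattern behind all of the rules above: run the counting SQL and alarm when the result falls outside [lower limit, upper limit] (the table name, the date filter and the alarm hook are assumptions of this example).

// Sketch of the generic "SQL count vs. threshold" check (table name, filter
// and the alarm() hook are assumptions of this example).
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class DataQualityCheck {
    static void checkRowCount(Connection conn, long lower, long upper) throws Exception {
        long count;
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT COUNT(*) FROM dw_orders WHERE dt = CURRENT_DATE")) {
            rs.next();
            count = rs.getLong(1);
        }
        if (count < lower || count > upper) {
            alarm("row count " + count + " outside [" + lower + ", " + upper + "]");
        }
    }

    static void alarm(String message) {
        // placeholder: send to whatever alerting channel the team actually uses
        System.err.println("DATA QUALITY ALARM: " + message);
    }
}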

Question 10: big data interview question - massive data related (Baidu) Q: from massive log data, extract the IP that visited Baidu the most times on a certain day.

A: this kind of problem belongs to the Top-K category, and the solutions are all similar.

Take the IPs that appear in that day's Baidu access logs and write them one by one into a large file. Note that an IP is 32 bits, so there are at most 2^32 distinct IPs. We can use a mapping method, for example hashing modulo 1000, to split the large file into 1000 small files, then find the most frequent IP in each small file (a HashMap can be used for frequency counting) together with its frequency. Finally, among the 1000 candidate IPs, find the one with the highest frequency; that is the answer.

Algorithm idea: divide and conquer + Hash

There are at most 2^32 = 4G distinct IP addresses, so they cannot all be loaded into memory at once.

Consider divide and conquer: store the massive IP log in 1024 small files according to the value of Hash(IP) % 1024, so that each small file contains at most about 4M distinct IP addresses. Here is why Hash(IP) % 1024 is used instead of splitting the file arbitrarily: with an arbitrary split, the same IP could appear in every small file, and it would not necessarily be the most frequent IP in any of them, so the final answer could be wrong. Hashing on the IP guarantees that all occurrences of one IP land in the same file; different IPs may of course share a hash value and end up in the same small file, which is fine.

For each small file, build a HashMap with the IP as key and the number of occurrences as value, and record the IP address that occurs most often.

This yields the most frequent IP in each of the 1024 small files; then, with an ordinary comparison or sort, find the IP with the most occurrences overall.
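A condensed sketch of the divide and conquer + Hash idea (the log file layout is an assumption and error handling is omitted): bucket the IPs into 1024 files by hash, count each bucket with a HashMap, keep each bucket's winner, and take the overall maximum.

// Sketch of hash-bucketing plus per-bucket HashMap counting (file layout is an assumption;
// keeping 1024 writers open may require a raised open-file limit).
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.HashMap;
import java.util.Map;

public class TopIp {
    public static void main(String[] args) throws IOException {
        Path logFile = Paths.get("access.log");          // one IP per line (assumption)
        Path bucketDir = Files.createTempDirectory("ip-buckets");

        // Step 1: split the huge log into 1024 bucket files by hash(IP) % 1024,
        // so every occurrence of one IP lands in the same bucket.
        BufferedWriter[] buckets = new BufferedWriter[1024];
        for (int i = 0; i < 1024; i++) {
            buckets[i] = Files.newBufferedWriter(bucketDir.resolve("bucket-" + i), StandardCharsets.UTF_8);
        }
        try (BufferedReader reader = Files.newBufferedReader(logFile, StandardCharsets.UTF_8)) {
            String ip;
            while ((ip = reader.readLine()) != null) {
                int bucket = (ip.hashCode() & 0x7fffffff) % 1024;
                buckets[bucket].write(ip);
                buckets[bucket].newLine();
            }
        }
        for (BufferedWriter w : buckets) w.close();

        // Step 2: count each bucket with a HashMap and keep the overall winner.
        String topIp = null;
        long topCount = 0;
        for (int i = 0; i < 1024; i++) {
            Map<String, Long> freq = new HashMap<>();
            try (BufferedReader reader = Files.newBufferedReader(bucketDir.resolve("bucket-" + i), StandardCharsets.UTF_8)) {
                String ip;
                while ((ip = reader.readLine()) != null) {
                    freq.merge(ip, 1L, Long::sum);
                }
            }
            for (Map.Entry<String, Long> e : freq.entrySet()) {
                if (e.getValue() > topCount) {
                    topCount = e.getValue();
                    topIp = e.getKey();
                }
            }
        }
        System.out.println(topIp + " appeared " + topCount + " times");
    }
}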

At this point, the study of "what are the real web big data interview questions" is over. I hope it has resolved your doubts. Combining theory with practice helps you learn better, so go and try it! If you want to keep learning more related knowledge, please continue to follow the website; the editor will keep working hard to bring you more practical articles!
