2025-04-05 Update From: SLTechnology News&Howtos
This article walks through common Python interview questions for crawler (web-scraping) positions. The questions are short, practical, and easy to try out, so interested readers are encouraged to follow along.
I. Basic skills of Python
1. Briefly describe the characteristics and advantages of Python.
Python is an open-source interpreted language. Compared with languages such as Java and C++, Python is dynamically typed and very flexible.
2. What are the data types of Python?
Python has six built-in data types: the immutable ones are Number, String, and Tuple, and the mutable ones are List, Dict (dictionary), and Set.
3. The difference between list and tuple
Lists and tuples are both iterable objects that support looping, slicing, and so on, but tuples are immutable. This immutability is what allows a tuple to be used as a dictionary (Dict) key.
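A minimal sketch of why this matters in practice (the point_names dictionary is an invented example):

```python
# Tuples are hashable because they are immutable, so they can be dict keys
point_names = {(0, 0): "origin", (1, 0): "unit-x"}
print(point_names[(0, 0)])  # origin

# Lists are mutable and therefore unhashable
try:
    bad = {[0, 0]: "origin"}
except TypeError as err:
    print("lists cannot be keys:", err)
```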
4. How does Python run
CPython:
When a Python program runs, the source in the .py file is first compiled into bytecode, and the result is stored in a PyCodeObject in memory, which the Python virtual machine then interprets and executes. When the program finishes, the interpreter saves the PyCodeObject to a .pyc file. On each subsequent run, Python first looks for a .pyc file with the same name; if one exists, it compares the modification records to decide whether to run the cached bytecode directly or to recompile, finally regenerating the .pyc file.
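The compile step can be observed with the standard dis module, which disassembles the bytecode CPython produced for a function (the add function here is an invented example; exact opcode names vary between CPython versions):

```python
import dis

def add(a, b):
    return a + b

# List the opcode names of the bytecode compiled for add()
for ins in dis.get_instructions(add):
    print(ins.opname)
```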
5. The reason for the slow running speed of Python
a) Python is not a statically typed language, so whenever the interpreter encounters a variable it must check its data type at run time, and perform type conversions, comparisons, and reference lookups as needed.
b) Python's compiler starts faster than Java's, but it compiles the code almost every time the program runs.
c) Python's object model leads to inefficient memory access. A NumPy array's pointer points directly to the cached data values, whereas Python's pointers point to cached objects, which in turn point to the data.
6. Is there any solution to the problem of slow Python
a) Use a different interpreter, such as PyPy or Jython.
b) For performance-critical code with many statically typed variables, use Cython.
c) For IO-heavy applications, use the asyncio module that Python provides to improve asynchronous capability.
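A small sketch of point c), assuming nothing beyond the standard library: two simulated IO waits overlap under asyncio instead of running back-to-back.

```python
import asyncio

async def fetch(name, delay):
    # Simulates an IO-bound task; await yields control to the event loop
    await asyncio.sleep(delay)
    return name

async def main():
    # Both "requests" wait concurrently, so total time is ~0.1s, not ~0.2s
    return await asyncio.gather(fetch("a", 0.1), fetch("b", 0.1))

results = asyncio.run(main())
print(results)  # ['a', 'b']
```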
7. Describe the global interpreter lock GIL
Each thread must acquire the GIL before executing, which guarantees that only one thread runs Python bytecode at a time, i.e. only one thread uses the CPU at any moment, so multithreading is not truly parallel. During IO operations, however, the GIL is released (which is why Python can still be asynchronous). To take advantage of multiple CPU cores, use multiple processes.
8. Deep copy and shallow copy
A deep copy recursively duplicates the object and everything it contains, while a shallow copy duplicates only the outer object and keeps references to the inner objects. So when a nested object is modified, the deep copy is unaffected, while the shallow copy sees the change.
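A quick sketch with the standard copy module illustrating the distinction:

```python
import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)    # copies the outer list, shares the inner lists
deep = copy.deepcopy(original)   # recursively copies everything

original[0].append(99)           # mutate a nested object

print(shallow[0])  # [1, 2, 99] -- the shallow copy sees the change
print(deep[0])     # [1, 2]     -- the deep copy is unaffected
```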
9. The difference between is and =
Is represents the object identifier (object identity), while = = indicates equality.
The function of is is to check whether the identifiers of objects are consistent, that is, to compare whether the addresses of two objects in memory are the same, while = = is used to check whether two objects are equal. But to improve system performance, Python retains a copy of its value for smaller strings, and points directly to that copy when creating a new string. Such as:
a = 8
b = 8
a is b  # True
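For contrast, a sketch where == and is diverge (note that caching of small integers is a CPython implementation detail, not a language guarantee):

```python
a = 8
b = 8
print(a is b)  # True in CPython: small integers are cached

x = [1, 2]
y = [1, 2]
print(x == y)  # True: the two lists have equal values
print(x is y)  # False: they are distinct objects in memory
```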
10. File reading and writing
Briefly describe the differences between read, readline, and readlines when reading a file.
They differ not only in how much of the file they read, but also in the type of value they return.
read() reads the entire file into one string variable and returns str.
readline() reads a single line into a string variable and returns str.
readlines() reads the whole file line by line into a list, one element per line, and returns list.
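A self-contained sketch comparing the three methods on a throwaway temp file (the file name sample.txt is arbitrary):

```python
import os
import tempfile

# Create a small three-line sample file
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as f:
    f.write("line1\nline2\nline3\n")

with open(path) as f:
    whole = f.read()       # one str holding the entire file
with open(path) as f:
    first = f.readline()   # one str holding just the first line
with open(path) as f:
    lines = f.readlines()  # a list of str, one element per line

print(repr(whole))   # 'line1\nline2\nline3\n'
print(repr(first))   # 'line1\n'
print(lines)         # ['line1\n', 'line2\n', 'line3\n']
```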
11. Please implement it with one line of code
Using an anonymous function and a comprehension, square each element of [0, 1, 2, 3, 4, 5] and print the result as a tuple.
print(tuple(map(lambda x: x * x, [0, 1, 2, 3, 4, 5])))
print(tuple(i * i for i in [0, 1, 2, 3, 4, 5]))
12. Please implement it with one line of code
Use reduce to compute the factorial of n (n! = 1 × 2 × 3 × … × n).
from functools import reduce
print(reduce(lambda x, y: x * y, range(1, n + 1)))
13. Please implement it with one line of code
Filter out and print the set of numbers below 100 that are divisible by 3.
print(set(filter(lambda n: n % 3 == 0, range(1, 100))))
14. Please implement it with one line of code
text = 'Obj{"Name": "pic", "data": [{"name": "async", "number": 9, "price": "$3500"}, {"name": "Wade", "number": 3, "price": "$5500"}], "Team": "Hot"}'
Print the players' price values from the text as a tuple, e.g. ('$3500', '$5500').
import json
import re
print(tuple(i.get("price") for i in json.loads(re.search(r'\[(.*)\]', text).group(0))))
15. Please write down the basic skeleton of recursion
def recursion(n):
    if n == 1:  # exit condition
        return 1
    return n * recursion(n - 1)  # continue recursing
16. Slice
Please write down the output of the following:
tpl = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
print(tpl[3:])    # [15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
print(tpl[:3])    # [0, 5, 10]
print(tpl[::5])   # [0, 25, 50, 75]
print(tpl[-3])    # 85
print(tpl[3])     # 15
print(tpl[::-5])  # [95, 70, 45, 20]
print(tpl[:])     # [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
del tpl[3:]
print(tpl)        # [0, 5, 10]
print(tpl.pop())  # 10
tpl.insert(3, 3)
print(tpl)        # [0, 5, 3]
17. File path
Print the directory path containing the current file:
import os
print(os.path.dirname(os.path.abspath(__file__)))
Print the current file's full path:
import os
print(os.path.abspath(__file__))
Print the path of the directory two levels above the current file:
import os
print(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
18. Please write down the running result and answer the questions
tpl = (1, 2, 3, 4, 5)
apl = (6, 7, 8, 9)
print(tpl.__add__(apl))
Question: has the value of tpl changed?
The running results are as follows:
(1, 2, 3, 4, 5, 6, 7, 8, 9)
Answer: no. Tuples are immutable; __add__ generates a new object, so tpl is unchanged.
19. Please write down the running result and answer the questions
name = ('James', 'Wade', 'Kobe')
team = ['A', 'B', 'C']
tpl = {name: team}
print(tpl)
apl = {team: name}
print(apl)
Question: can this code finish running? Why? What is the result of its operation?
Answer: this code does not run to completion; it throws an exception at apl, because a dictionary key must be an immutable (hashable) object, and a list is mutable, so it cannot be used as a key. The output is:
{('James', 'Wade', 'Kobe'): ['A', 'B', 'C']}
TypeError: unhashable type: 'list'
20. Decorator
Please write the decorator code skeleton
def log(func):
    def wrapper(*args, **kw):
        print('call %s():' % func.__name__)
        return func(*args, **kw)
    return wrapper
Briefly describe the role of decorator in Python:
It adds new functionality to a function without changing the function's original code.
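A usage sketch of such a decorator (the greet function is an invented example):

```python
def log(func):
    def wrapper(*args, **kw):
        print('call %s():' % func.__name__)
        return func(*args, **kw)
    return wrapper

@log
def greet(name):
    return 'hello, %s' % name

result = greet('Wade')  # prints "call greet():" before running greet
print(result)           # hello, Wade
```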
21. Multiprocess and multithreading
Is multiprocess more stable or multithreaded more stable? Why?
Multiprocessing is more stable: processes run independently, so one crashing does not affect the others.
What is the fatal disadvantage of multithreading?
Because all threads share the process's memory, a single crashing thread can bring down the entire process.
What are the ways to communicate between processes?
Shared variables, queues, and pipes.
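A minimal queue sketch, assuming a POSIX system (the "fork" start method used here is not available on Windows):

```python
from multiprocessing import get_context

def worker(q):
    # The child process sends a message back through the shared queue
    q.put("hello from child")

ctx = get_context("fork")  # fork keeps this snippet self-contained on POSIX
q = ctx.Queue()
p = ctx.Process(target=worker, args=(q,))
p.start()
msg = q.get()  # blocks until the child has put its message
p.join()
print(msg)     # hello from child
```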
II. Details of Python
1. Use join or + to connect the string
When strings are concatenated with the + operator, each + allocates a new block of memory and then copies the result of the previous operation and the right-hand operand into it, so building a string with repeated + involves many allocations and copies. join, by contrast, first computes how much memory the result needs, then allocates it once and copies each piece a single time, which is why join outperforms +. So when concatenating a sequence of strings, prefer join.
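A small sketch of the two patterns (the parts list is an invented example):

```python
parts = ["py", "thon", " ", "inter", "view"]

# += may allocate a new string and copy both operands on every iteration
s = ""
for p in parts:
    s += p

# join computes the final size once, then copies each piece a single time
joined = "".join(parts)

print(s)             # python interview
print(s == joined)   # True: same result, but join scales better
```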
2. Python garbage collection mechanism
Reference https://blog.csdn.net/xiongchengluo1129/article/details/80462651
Garbage collection in Python relies primarily on reference counting, supplemented by mark-and-sweep and generational collection. The main flaw of reference counting is that it cannot handle circular references.
In Python, if an object has 0 references, the Python virtual machine will reclaim the object's memory.
The principle of reference counting is that each object maintains an ob_refcnt field recording how many references currently point to it. The object's reference counter is incremented by 1 when the object is created, when it is referenced, when it is passed into a function, and when it is stored in a container:
The object is created: a = 14
The object is referenced: b = a
The object is passed as an argument into a function: func(a)
The object is stored in a container as an element: lst = [a, "a", "b", 2]
Correspondingly, the object's reference counter is decremented by 1 when its alias is destroyed with del, when the reference is rebound to a new object, when the object leaves its scope, and when it is removed from a container:
The object's alias is explicitly destroyed: del a
The object's name is rebound to a new object: a = 26
The object leaves its scope; for example, when the func function finishes executing, the reference counters of its local variables are decremented (global variables are not).
The element is removed from its container, or the container itself is destroyed.
When an object's reference counter drops to 0, its memory is freed by the Python virtual machine.
sys.getrefcount(a) returns the reference count of the object a, but the result is 1 higher than the true count, because passing a into the function call temporarily adds one more reference.
Advantages of reference counting:
1. High efficiency
2. No pauses at runtime: as soon as the last reference disappears, the memory is freed immediately; there is no waiting for a scheduled collection pass as in other mechanisms. This real-time behavior brings another benefit: the time spent reclaiming memory is spread out over normal execution.
3. The object has a definite life cycle.
4. Easy to implement
Disadvantages of reference counting:
1. Maintaining the reference counts consumes resources, and the work is proportional to the number of reference assignments, unlike mark-and-sweep, where the work is roughly proportional to the amount of memory reclaimed.
2. It cannot solve circular references. If A and B refer to each other and no external reference to either remains, their reference counts are both 1, yet they clearly should be reclaimed.
# example of a circular reference
list1 = []
list2 = []
list1.append(list2)
list2.append(list1)
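The gc module's collector can reclaim exactly this kind of cycle; a minimal sketch:

```python
import gc

gc.collect()         # start from a clean slate

list1 = []
list2 = []
list1.append(list2)  # list1 -> list2
list2.append(list1)  # list2 -> list1: a reference cycle

del list1, list2     # both refcounts stay at 1, so refcounting alone leaks

unreachable = gc.collect()  # the cycle detector finds and frees them
print(unreachable >= 2)     # True: at least the two lists were unreachable
```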
To address these two shortcomings, Python introduces two additional mechanisms: mark-and-sweep and generational collection.
Mark and sweep
The mark-sweep (Mark-Sweep) algorithm is a garbage collection algorithm based on tracing collection (tracing GC). It works in two phases: in the marking phase, the GC marks all "active" objects; in the sweep phase, it reclaims the unmarked "inactive" objects. So how does the GC decide which objects are active and which are not?
Objects connected by references (pointers) form a directed graph: objects are the nodes and reference relationships are the edges. Starting from the root objects (global variables, the call stack, and registers) and traversing along the directed edges, every reachable object is marked as active, and every unreachable object is inactive and will be cleared.
In the figure from the original post (omitted here), a small black circle represents a global variable, i.e. a root object. Starting from it, object 1 is directly reachable and is marked; objects 2 and 3 are indirectly reachable and are marked; objects 4 and 5 are unreachable. Thus 1, 2, and 3 are active objects, while 4 and 5 are inactive and will be reclaimed by the GC.
As an auxiliary garbage collection technique in Python, mark-and-sweep mainly handles container objects such as list, dict, tuple, and class instances, since string and numeric objects can never form circular references.
Python organizes these container objects in a doubly linked list. This simple mark-sweep algorithm has an obvious drawback, though: before clearing inactive objects it must scan the entire heap sequentially, touching every object even when only a few active ones remain.
Generational collection
Generational collection is also an auxiliary garbage collection technique in Python, and likewise targets container objects.
Logic of a GC pass
Allocate memory -> find that the threshold has been exceeded -> trigger garbage collection -> link all collectable-object lists together -> traverse and compute effective reference counts -> split objects into two sets, effective count = 0 and effective count > 0 -> move the > 0 set into the older generation -> for the = 0 set, perform collection -> for each element held by a container, decrement that element's reference count (thereby breaking circular references) -> execute the decrement; if an object's reference count reaches 0, trigger memory reclamation -> Python's underlying memory management mechanism reclaims the memory.
In Python, a generation is a linked list: all memory blocks belonging to the same generation are linked into one list. The structure representing a generation is gc_generation, which holds the list head, the object-count threshold, and the current object count.
By default Python defines three generations. The higher the index, the longer its objects have survived; newly created objects are added to generation 0. _PyObject_GC_Malloc is where the Python GC can be triggered: each newly allocated object checks whether generation 0 is full, and if so, garbage collection starts.
Generational collection trades space for time. Python groups memory into sets by object lifetime, and each set is called a generation. There are three: the young generation (generation 0), the middle generation (generation 1), and the old generation (generation 2). They correspond to three linked lists, and their collection frequency decreases as object lifetime increases. Newly created objects are allocated into the young generation; when its list reaches the threshold, the garbage collector is triggered, reclaimable objects are reclaimed, and survivors are promoted to the middle generation, and so on. Old-generation objects live the longest, possibly for the entire life of the program. Generational collection is itself built on mark-and-sweep.
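The three generations and their thresholds can be inspected with the standard gc module (default values may differ between Python versions):

```python
import gc

# Collection of generation n is considered when its counter passes threshold n
print(gc.get_threshold())  # e.g. (700, 10, 10)
print(gc.get_count())      # current allocation counters for the 3 generations
```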
3. Recursion
What is the default depth of Python recursion? What is the reason for the recursive depth limit?
Python's recursion limit can be read with sys.getrecursionlimit() (and changed with sys.setrecursionlimit()); the default is typically 1000.
The limit exists because unbounded recursion would overflow the C stack and crash Python.
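A short sketch of hitting and raising the limit (the countdown function is an invented example; raising the limit trades safety for depth):

```python
import sys

print(sys.getrecursionlimit())  # commonly 1000 by default

def countdown(n):
    return n if n == 0 else countdown(n - 1)

try:
    countdown(10**6)  # far beyond the default limit
except RecursionError:
    print("recursion limit hit")

# The limit can be raised, at the risk of overflowing the C stack
sys.setrecursionlimit(3000)
print(sys.getrecursionlimit())  # 3000
```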