Performance Optimization Analysis of Python

2025-01-19 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article introduces the relevant knowledge of "Python performance Optimization Analysis". Many people run into these problems in real-world work, so let the editor walk you through how to deal with these situations. I hope you read carefully and come away with something useful!

Why does python have poor performance?

When we talk about the efficiency of a programming language, it usually has two meanings: one is development efficiency, the time it takes programmers to complete the code; the other is running efficiency, the time it takes the computer to complete a computing task. Coding efficiency and running efficiency are often a trade-off, and it is difficult to have both at once. Different languages have different priorities, and there is no doubt that the python language cares more about coding efficiency: life is short, we use python.

Although programmers who use python have to accept the fact that it is slow, python is being used in more and more fields, such as scientific computing and web servers, and programmers naturally want python to be faster and more powerful.

First of all, how slow is python compared to other languages? That depends on the scenario and the test case; different cases will certainly give different results. This URL gives a performance comparison of different languages under various cases, and this page compares python3 and C++. Here are two cases:

What are the specific reasons for python's low efficiency? Here are a few:

First: python is a dynamic language

The type of the object a variable points to is determined at run time, so the compiler cannot predict anything or optimize for it. Take a simple example: r = a + b. This adds a and b, but their types are only known at run time, and different types implement addition differently, so every execution has to determine the types of a and b and then perform the corresponding operation. In a static language such as C++, the code to run is determined at compile time.
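As an illustration (a minimal Python 3 sketch; the article's own samples are Python 2), the bytecode for a + b contains only a single generic addition opcode, so the interpreter must dispatch on the operand types at every execution:

```python
import dis
import io

def add(a, b):
    # the compiler cannot know the types of a and b here; the same
    # bytecode must handle ints, floats, strings, lists, ...
    return a + b

# one function, many types: dispatch happens at run time
assert add(1, 2) == 3
assert add('py', 'thon') == 'python'

# the disassembly shows a single generic opcode for the addition
# (BINARY_ADD on CPython <= 3.10, BINARY_OP on 3.11+)
buf = io.StringIO()
dis.dis(add, file=buf)
print(buf.getvalue())
```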

Second: python is interpreted and does not support JIT (just-in-time compilation). The famous google project Unladen Swallow tried to add one, but it failed in the end.

Third: everything in python is an object, and each object needs to maintain a reference count, which adds extra work.

Fourth: python GIL

GIL is the most criticized point of Python: because of the GIL, multithreaded python code cannot run truly in parallel. In IO-bound scenarios this is not a big problem, but in CPU-bound scenarios it is fatal. For this reason the author rarely uses python multithreading at work, and generally uses multiple processes (pre-fork), or coroutines on top of that. Even in a single thread the GIL has a noticeable performance impact, because by default python tries to switch threads every 100 opcodes it executes (configurable via sys.setcheckinterval()); the relevant source code is in ceval.c::PyEval_EvalFrameEx.

Fifth: garbage collection, which is probably a common problem of all garbage-collected languages. Python uses a mark-and-sweep plus generational garbage collection strategy, and each collection interrupts the running program, causing the so-called stutter. An article on infoq mentions that Instagram's performance improved by 10% after disabling Python's GC mechanism. Interested readers can read it carefully.
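For latency-sensitive sections, the gc module lets you pause automatic collection and trigger it at a moment of your choosing; a small Python 3 sketch (disabling GC for long periods trades memory growth for fewer pauses, as in the Instagram case):

```python
import gc

gc.disable()                        # no automatic collections from here on
assert not gc.isenabled()

# ... allocation-heavy, latency-critical work runs without GC pauses ...
junk = [[i] for i in range(10000)]

gc.enable()                         # re-enable, then collect when convenient
unreachable = gc.collect()          # returns the number of unreachable objects found
```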

Be pythonic

We all know that premature optimization is the root of all evil and that all optimization should be based on profiling. Nevertheless, as a python developer you should be pythonic, and pythonic code is often more efficient than non-pythonic code. For example:

Use iterators, for example:

dict's iteritems instead of items (likewise itervalues, iterkeys)

Use generators, especially when the loop may break early

To check whether two references are the same object, use is instead of ==

To check whether an object is in a collection, use a set instead of a list

Take advantage of short-circuit evaluation: put the condition most likely to short-circuit first. Other lazy-evaluation ideas apply as well.

When accumulating a large number of strings, use join

Use the for...else (while...else) syntax

Exchange the values of two variables using: a, b = b, a
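A few of the tips above in one runnable Python 3 sketch (the iteritems/itervalues advice is Python 2 only; in Python 3, items() and values() already return views):

```python
# join instead of repeated += (which copies the string each time)
parts = [str(i) for i in range(5)]
assert ''.join(parts) == '01234'

# set membership is O(1) on average; list membership is O(n)
allowed = {'read', 'write'}
assert 'write' in allowed

# is compares identity, == compares value
a = b = []
assert a is b and a == b

# swap two variables without a temporary
x, y = 1, 2
x, y = y, x
assert (x, y) == (2, 1)

# for ... else: the else branch runs only when the loop did not break
def contains_even(nums):
    for n in nums:
        if n % 2 == 0:
            break
    else:
        return False
    return True

assert contains_even([1, 3, 4])
assert not contains_even([1, 3, 5])
```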

Optimization based on profile

Even if our code is very pythonic, it may not be as efficient as expected. We also know the 80/20 rule: most of the time is spent in a small number of code snippets, and the key to optimization is finding these bottlenecks. There are many approaches: sprinkle logs with timestamps everywhere, or test suspect functions separately with timeit, but the most effective is to use profiling tools.

Python profilers

For python programs there are three well-known profiling tools: profile, cProfile, and hotshot. profile is implemented in pure python; cProfile reimplements part of profile natively; hotshot is also implemented in C. The difference between hotshot and cProfile is that hotshot has less impact on the running of the target code, at the cost of more post-processing time, and hotshot has stopped being maintained. Note that profile (and cProfile, hotshot) is only suitable for single-threaded python programs.

For multithreading you can use yappi, which not only supports multithreading but can also measure CPU time accurately.

For greenlets you can use greenletprofiler, which is based on yappi and uses greenlet context hooks in place of the thread context.

Below is a piece of deliberately "inefficient" code; we use cProfile to illustrate how to profile and the kinds of performance bottlenecks we may encounter.

# -*- coding: UTF-8 -*-
from cProfile import Profile
import math

def foo():
    return foo1()

def foo1():
    return foo2()

def foo2():
    return foo3()

def foo3():
    return foo4()

def foo4():
    return "this call tree seems ugly, but it always happens"

def bar():
    ret = 0
    for i in xrange(10000):
        ret += i * i + math.sqrt(i)
    return ret

def main():
    for i in range(100000):
        if i % 10000 == 0:
            bar()
        else:
            foo()

if __name__ == '__main__':
    prof = Profile()
    prof.runcall(main)
    prof.print_stats()
    # prof.dump_stats('test.prof')  # dump profile result to test.prof

The running results are as follows:

For the above output, each field has the following meaning:

ncalls: total number of calls to the function

tottime: time spent inside the function itself (excluding subfunctions)

percall: tottime / ncalls

cumtime: time spent in the function, including its subfunctions

percall: cumtime / ncalls

filename:lineno(function): file name, line number, and function name

The output in the code above is very simple; you can use pstats to process profile results in a variety of ways, as detailed in the official document python profiler.
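A minimal Python 3 sketch of driving cProfile and pstats programmatically (the profiled function and the sort key are arbitrary choices for illustration):

```python
import cProfile
import io
import pstats

def work():
    return sum(i * i for i in range(10000))

prof = cProfile.Profile()
prof.enable()
work()
prof.disable()

# render the statistics into a string, sorted by cumulative time
buf = io.StringIO()
stats = pstats.Stats(prof, stream=buf)
stats.strip_dirs().sort_stats('cumulative').print_stats(10)
report = buf.getvalue()
print(report)
```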

Profile GUI tools

Although cProfile's output is already fairly intuitive, we still tend to save the profile result and then use graphical tools to analyze it from different dimensions, or to compare code before and after optimization. There are many tools for viewing profile results, such as visualpytune, qcachegrind, and runsnakerun; this article uses visualpytune. For the code above, uncomment the dump_stats line, rerun to generate the test.prof file, and open it directly with visualpytune, as follows:

The fields mean basically the same as in the text output, with the convenience that you can sort by clicking on a field name. The callers of the current function are listed at the bottom left, and the time spent inside the current function and its subfunctions is listed at the bottom right. The result above is sorted by cumtime (the total time taken by the function and its subfunctions).

Performance bottlenecks are usually caused by functions that are called frequently, functions that are expensive per call, or both. In our previous example, foo is a case of high-frequency calls and bar is a case of high per-call cost; these are the places we need to optimize.

Python-profiling-tools describes the use of qcachegrind and runsnakerun, two graphical tools that are much more powerful than visualpytune; refer to the original article for how to use them. The following figure shows the result of opening test.prof with qcachegrind.

qcachegrind is indeed more powerful than visualpytune. As can be seen from the picture above, it is roughly divided into three parts. The first part is similar to visualpytune: the time spent by each function, where Incl corresponds to cumtime and Self corresponds to tottime. The second and third parts have many tabs, and different tabs show the results from different angles; for example, the "call graph" tab in the third part shows the function's call tree with the time percentage of each subfunction, which is clear at a glance.

Profile for optimization

Once you know the hot spots, you can optimize in a targeted way. Such optimization is often closely tied to the specific business; there is no silver bullet, so analyze each problem on its own terms. In the author's experience, the most effective optimization is to discuss the requirement with the product manager: perhaps the need can be met in another way, or the product manager can accept a slight compromise. The next best is to change the implementation, for example replacing an easy-to-understand but inefficient algorithm that has become a bottleneck with a more efficient but perhaps harder-to-understand one, or using a dirty-flag pattern. These methods need to be combined with concrete cases, so this article will not elaborate.

Next, combined with features of the python language, we introduce some techniques that make python code less pythonic but can improve performance.

First: reduce the depth of function calls

Each layer of function call brings overhead. For a call tree that is called frequently but does little work per call, the accumulated call overhead is very high, and you can consider flattening (inlining) it.
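A toy Python 3 sketch of the idea: the nested and flattened versions compute the same thing, but the nested one pays for three extra frames on every call (compare them with timeit if you like):

```python
def foo3():
    return 42
def foo2():
    return foo3()
def foo1():
    return foo2()

def nested():
    return foo1()            # three extra call frames per invocation

def flattened():
    return 42                # the call chain expanded by hand

assert nested() == flattened() == 42

# import timeit
# print(timeit.timeit(nested, number=100000))    # typically noticeably slower
# print(timeit.timeit(flattened, number=100000))
```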

In the profiled code above, the foo call tree is very simple but called at high frequency. Modify the code to add a plain_foo() function that returns the final result directly. The key output is as follows:

Compared with the previous results:

As you can see, the optimized version is almost 3 times faster.

Second: optimize attribute search

As mentioned above, python's attribute lookup is inefficient. If a piece of code accesses an attribute frequently (for example in a for loop), consider caching the attribute in a local variable first.
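A small Python 3 sketch of hoisting an attribute lookup out of a hot loop (math.sqrt is just an arbitrary example attribute):

```python
import math

def slow(n):
    s = 0.0
    for i in range(n):
        s += math.sqrt(i)    # looks 'sqrt' up on the math module every iteration
    return s

def fast(n):
    sqrt = math.sqrt         # one lookup, cached in a fast local variable
    s = 0.0
    for i in range(n):
        s += sqrt(i)
    return s

assert slow(1000) == fast(1000)
```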

Third: turn off GC

As mentioned earlier in this article, turning off GC can improve python's performance, and the pauses GC causes are unacceptable in scenarios with strict real-time requirements. But turning off GC is not easy: python's reference counting can only handle cases without circular references, and circular references need the GC to clean up. In the python language it is easy to write circular references. For example:

# case 1: mutual references
a, b = SomeClass(), SomeClass()
a.b, b.a = b, a

# case 2: a container that contains itself
lst = []
lst.append(lst)

# case 3: a bound method stored on its own instance
self.handler = self.some_func

Of course, you may say: who would be so stupid as to write such code? True, the cases above are too obvious; once there are a few more levels in between, you get "indirect" circular references. The OrderedDict in python's standard library collections is an instance of case 2:

The first way to resolve circular references is to use weak references (weakref); the second is to break the cycles by hand.
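A minimal Python 3 sketch of the weakref approach: the back-reference no longer keeps the cycle alive, so reference counting alone can reclaim the objects (Node is a made-up class for illustration):

```python
import weakref

class Node(object):
    pass

parent, child = Node(), Node()
parent.child = child
child.parent = weakref.ref(parent)   # weak back-reference: no strong cycle

assert child.parent() is parent      # dereference a weakref by calling it
del parent                           # refcount drops to zero; no GC needed (CPython)
assert child.parent() is None        # the weak reference was cleared
```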

Fourth: setcheckinterval

If the program is known to be single-threaded, you can raise checkinterval to a larger value, as described here.
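Note that sys.setcheckinterval is a Python 2 API; Python 3 replaced the opcode-count check with a time-based switch interval. A small sketch of both:

```python
import sys

# Python 2: sys.setcheckinterval(10000)   # try to switch every 10000 opcodes
# Python 3: a time-based interval, 5 ms by default
default = sys.getswitchinterval()
sys.setswitchinterval(0.05)              # switch threads less often
assert abs(sys.getswitchinterval() - 0.05) < 1e-9
sys.setswitchinterval(default)           # restore the default
```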

Fifth: use __slots__

The main purpose of __slots__ is to save memory, but it can also improve performance to some extent. We know that for a class defining __slots__, python reserves just enough space for the instance's attributes and no longer creates a __dict__ automatically. Of course, there are many caveats to using __slots__; most importantly, every class in the inheritance chain must define __slots__ (see the detailed description in the python doc). Let's look at a simple test example:

import time

class BaseSlots(object):
    __slots__ = ['e', 'f', 'g']

class Slots(BaseSlots):
    __slots__ = ['a', 'b', 'c', 'd']
    def __init__(self):
        self.a = self.b = self.c = self.d = self.e = self.f = self.g = 0

class BaseNoSlots(object):
    pass

class NoSlots(BaseNoSlots):
    def __init__(self):
        super(NoSlots, self).__init__()
        self.a = self.b = self.c = self.d = self.e = self.f = self.g = 0

def log_time(s):
    begin = time.time()
    for i in xrange(10000000):
        s.a, s.g
    return time.time() - begin

if __name__ == '__main__':
    print 'Slots cost', log_time(Slots())
    print 'NoSlots cost', log_time(NoSlots())

Output result:

Slots cost 3.12999987602
NoSlots cost 3.48100018501

Python C extension

Perhaps through profiling we have found a performance hotspot, but the hotspot is a large amount of computation that can neither be cached nor avoided. Then it is time for a python C extension: reimplement part of the python code in C or C++, compile it into a dynamic link library, and expose an interface for other python code to call. Because C is far more efficient than python, C extensions are very common; for example, the cProfile mentioned earlier is a thin wrapper around _lsprof.so. Well-known python libraries that need performance all use or provide C extensions, such as gevent, protobuf, and bson.

The author has tested the pure-python bson against cbson, and in a comprehensive case cbson is almost 10 times faster!

Writing python C extensions is itself a fairly complex topic; this article only gives a few points to note:

First: manage reference counts correctly

This is the most difficult and complicated point. We all know that python manages object lifecycles through reference counting, and if reference counting goes wrong in an extension, the result is either a crash or a memory leak. Worse, reference counting bugs are very hard to debug.

The three key terms for reference counting in C extensions are: stolen reference, borrowed reference, and new reference. It is recommended to read the relevant official python documentation before writing extension code.

Second: C extension and multithreading

The multithreading here refers to C threads created inside the extension, not python threads. See the introduction in the python doc, and also the relevant chapter of "Python Cookbook".

Third: application scenarios for C extensions

C extensions are only suitable for logic that is not closely tied to business code; if a piece of code heavily accesses business-related object attributes, it is difficult to move it into a C extension.

The process of wrapping a C extension into an interface callable from python code is called binding. CPython itself provides a set of native APIs, which is the most widely used approach but has rather complex conventions. Many third-party libraries wrap this to varying degrees to make life easier for developers, such as boost.python, cython, ctypes, and cffi (which supports both pypy and cpython); for how to use them, google.
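Of the binding options above, ctypes is in the standard library and needs no compilation step; a minimal sketch calling sqrt from the C math library (assumes a POSIX system where libm is available):

```python
import ctypes
import ctypes.util

# load the C math library; on some linux setups find_library returns None,
# in which case CDLL(None) searches the symbols already loaded in the process
libm = ctypes.CDLL(ctypes.util.find_library('m'))
libm.sqrt.restype = ctypes.c_double       # declare the C signature so that
libm.sqrt.argtypes = [ctypes.c_double]    # arguments and result convert correctly

assert libm.sqrt(9.0) == 3.0
```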

Beyond CPython

Although python's performance leaves something to be desired, its ease of learning and use has won it more and more users, and industry heavyweights have never given up on optimizing python. The optimizations here are reflections on, or enhancements of, the design or implementation of the python language. Some of these projects have been abandoned, and some are still being improved; this chapter introduces some of the projects that have fared well so far.

Cython

As mentioned earlier, cython can be used to write bindings for C extensions, but it can do much more than that.

Cython's main purpose is to speed up python, yet it is nowhere near as complex as the C extensions in the previous section: in Cython, writing a C extension is about as easy as writing python code (thanks to Pyrex). Cython is a superset of the python language that adds support for calling C functions and declaring C types. From this point of view, cython converts dynamic python code into statically compiled C code, which is why it is efficient. Like a C extension, cython code needs to be compiled into a dynamic link library; in a linux environment you can use either the command line or distutils.

If you want to learn cython systematically, it is recommended to start with the cython documentation, which is quite good. Here is a simple example showing the usage and performance of cython (in a linux environment).

First, install cython:

pip install Cython

Here is the python code for the test. You can see that both cases are examples of high computational cost:

def f(x):
    return x ** 2 - x

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

def main():
    import time
    begin = time.time()
    for i in xrange(10000):
        for j in xrange(10000):
            f(10)
    print 'call f cost:', time.time() - begin
    begin = time.time()
    for i in xrange(10000):
        integrate_f(1.0, 100.0, 1000)
    print 'call integrate_f cost:', time.time() - begin

if __name__ == '__main__':
    main()

Running result:

call f cost: 0.215116024017
call integrate_f cost: 4.33698010445

You can enjoy the performance improvement brought by cython without changing any python code, as follows:

Step1: rename the file from cython_example.py to cython_example.pyx

Step2: add a setup.py file and add the following code:

from distutils.core import setup
from Cython.Build import cythonize

setup(
    name = 'cython_example',
    ext_modules = cythonize("cython_example.pyx"),
)

Step3: execute python setup.py build_ext --inplace

You can see that two files have been generated, corresponding to the intermediate C source and the final dynamic link library.

Step4: execute the command python -c "import cython_example; cython_example.main()" (note: make sure there is no cython_example.py in the current environment)

Running result:

call f cost: 0.0874309539795
call integrate_f cost: 2.92381191254

Performance improves noticeably (f is more than twice as fast). Let's try cython's static typing: modify the core code of cython_example.pyx, replacing the implementations of f() and integrate_f() as follows:

def f(double x):  # static type for the parameter
    return x ** 2 - x

def integrate_f(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

Then rerun the third and fourth steps above: the results are as follows

call f cost: 0.042387008667
call integrate_f cost: 0.958620071411

The code above only introduces static types for the parameters; below, static types are introduced for the return values as well.

Replace the implementations of f() and integrate_f() as follows:

cdef double f(double x):  # the return value is now typed as well
    return x ** 2 - x

cdef double integrate_f(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

Then rerun the third and fourth steps above: the results are as follows

call f cost: 1.19209289551e-06
call integrate_f cost: 0.187038183212

Amazing!

Pypy

Pypy is an alternative implementation of CPython whose main advantage is speed. Here are the test results from the official website:

Tested in a real project, pypy is about 3 to 5 times faster than cpython! Pypy's performance improvement comes from its JIT compiler. Google's Unladen Swallow project mentioned earlier also intended to introduce a JIT into CPython, and after that project failed, many developers joined the development and optimization of pypy instead. In addition, pypy uses less memory and supports stackless, which is roughly equivalent to coroutines.

Pypy's disadvantage is that its support for C extensions is not great; you need to use cffi to do the binding. Widely used libraries are generally supported by pypy, but niche or self-developed C extensions need to be rewrapped.

This is the end of "Python performance Optimization Analysis". Thank you for reading. If you want to learn more about the industry, you can follow the site, where the editor will keep putting out more high-quality practical articles!
