2025-02-24 Update From: SLTechnology News&Howtos
Before Linux implemented the epoll event-driven mechanism, concurrent server programs were generally built on I/O multiplexing calls such as select or poll. Newer Linux kernels provide a replacement mechanism: epoll.
The select() and poll() I/O Multiplexing Models
Disadvantages of select:
1. A single process can monitor only a limited number of file descriptors, usually 1024. The limit can be raised, but because select scans file descriptors by polling, performance degrades as the number of descriptors grows. (The Linux kernel headers contain the definition: #define __FD_SETSIZE 1024)
2. Kernel/user-space memory copies: select must copy large descriptor-set structures between user space and the kernel on every call, which is expensive.
3. select returns the entire descriptor set, and the application must traverse the whole set to find out which descriptors have events.
4. select is level-triggered: if the application does not complete the I/O on a ready file descriptor, every subsequent select call will report that descriptor again.
Compared with select, poll stores file descriptors in a linked list, so there is no hard limit on the number of monitored files, but the other three disadvantages remain.
Suppose our server needs to support 1 million concurrent connections. With __FD_SETSIZE at 1024, we would need at least about 1,000 processes to reach 1 million connections. Besides the cost of inter-process context switching, the system can hardly bear the massive kernel/user-space memory copies and array polling. A server built on the select model therefore struggles to reach even 100,000 concurrent connections.
Implementation Mechanism of the epoll I/O Multiplexing Model
Because epoll's implementation is completely different from that of select/poll, the shortcomings of select listed above do not apply to epoll.
Imagine a scenario in which 1 million clients each maintain a TCP connection to one server process, but at any moment only a few hundred or a few thousand of those connections are active (which is the case in most real deployments). How can such high concurrency be achieved?
In the select/poll era, the server process handed all 1 million connections to the operating system on every call (copying from user space to kernel space) and asked the kernel to check whether any of these sockets had events. After the kernel polled them, the descriptor data was copied back to user space, and the application polled it again to handle the network events that had occurred. This consumes enormous resources, so select/poll can generally handle only a few thousand concurrent connections.
epoll's design and implementation are completely different. epoll registers a simple file system in the Linux kernel (what data structure do file systems usually use? B+ trees). The single select/poll call is split into three parts:
1) Call epoll_create() to create an epoll object (allocating resources for this handle in the epoll file system).
2) Call epoll_ctl() to add the sockets of these 1 million connections to the epoll object.
3) Call epoll_wait() to collect the connections on which events have occurred.
In this way, to implement the scenario above you only need to create one epoll object when the process starts, and then add or remove connections to and from that object as needed. epoll_wait itself is also very efficient: calling it does not copy the descriptor data for all 1 million connections into the kernel, and the kernel does not need to traverse all the connections.
Epoll implementation mechanism
When a process calls the epoll_create method, the Linux kernel creates an eventpoll structure with two members that are closely related to how epoll is used. The eventpoll structure is as follows:
```c
struct eventpoll {
    ....
    /* Root of the red-black tree that stores every monitored
       event added to this epoll instance */
    struct rb_root rbr;
    /* Doubly linked list of ready events that will be returned
       to the user through epoll_wait */
    struct list_head rdlist;
    ....
};
```
Each epoll object has its own eventpoll structure, which stores the events registered on it through the epoll_ctl method. These events are mounted on the red-black tree, so duplicate registrations can be detected efficiently (red-black tree insertion takes O(log n) time, where n is the number of elements in the tree).
Every event added to epoll establishes a callback relationship with the device (e.g. network card) driver; the callback runs when the corresponding event occurs. In the kernel this callback is called ep_poll_callback, and it appends the events that occur to the rdlist doubly linked list.
In epoll, for each event, an epitem structure is created, as follows:
```c
struct epitem {
    struct rb_node rbn;        /* red-black tree node */
    struct list_head rdllink;  /* doubly linked list node */
    struct epoll_filefd ffd;   /* event handle (fd) information */
    struct eventpoll *ep;      /* the eventpoll object this item belongs to */
    struct epoll_event event;  /* the event types we expect to occur */
};
```
When epoll_wait is called to check whether an event has occurred, it only has to check whether the rdlist doubly linked list in the eventpoll object contains any epitem elements. If rdlist is not empty, the events that occurred are copied to user space and the number of events is returned to the caller.
The red-black tree and doubly linked list data structures, combined with the callback mechanism, are what make epoll efficient.
Interface of epoll
1.epoll_create
Create an epoll handle
Function declaration: int epoll_create(int size)
Parameter: size tells the kernel roughly how many descriptors will be monitored (since Linux 2.6.8 this hint is ignored, but it must be greater than zero).
Return value: the created epoll handle.
The epoll handle itself occupies an fd; under Linux you can see it in /proc/<pid>/fd/. After you are done with epoll you must therefore call close() on it, or fds may eventually be exhausted.
2.epoll_ctl
Adds a descriptor to, or removes one from, the epoll handle, or modifies the events being monitored.
Function declaration: int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)
Parameters:
epfd: the return value of epoll_create()
op: the operation to perform; one of:
EPOLL_CTL_ADD: register a new fd with epfd
EPOLL_CTL_MOD: modify the monitored events of an already registered fd
EPOLL_CTL_DEL: remove a fd from epfd
fd: the file descriptor to operate on / monitor
event: tells the kernel which events to listen for. struct epoll_event is defined as follows:
```c
typedef union epoll_data {
    void *ptr;
    int fd;
    __uint32_t u32;
    __uint64_t u64;
} epoll_data_t;

struct epoll_event {
    __uint32_t events;   /* epoll events */
    epoll_data_t data;   /* user data variable */
};
```
The events member can be a combination of the following macros:
EPOLLIN: the corresponding file descriptor has data available to read (this includes the case where the peer socket has closed normally)
EPOLLOUT: the corresponding file descriptor can be written to
EPOLLPRI: the corresponding file descriptor has urgent data to read (typically indicating the arrival of out-of-band data)
EPOLLERR: an error occurred on the corresponding file descriptor
EPOLLHUP: the corresponding file descriptor was hung up
EPOLLET: put epoll into edge-triggered mode, as opposed to the default level-triggered mode
EPOLLONESHOT: monitor the event only once; if you need to keep monitoring the socket after the event fires, you must re-arm it in the epoll set
Example:
```c
struct epoll_event ev;
/* the fd associated with the event to be handled */
ev.data.fd = listenfd;
/* the event types to be handled */
ev.events = EPOLLIN | EPOLLET;
/* register the epoll event */
epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);
```
3.epoll_wait
Waits for events on the socket fds registered on epfd; when events occur, the ready fds and their event types are placed into the events array.
Function prototype: int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout)
Parameters:
epfd: the epoll file descriptor created by epoll_create
events: the array in which ready events are returned to the caller
maxevents: the maximum number of events to handle in one call
timeout: how many milliseconds to wait for an event; -1 blocks indefinitely, 0 returns immediately (non-blocking). -1 is the usual choice.
The working mode of epoll
ET (edge-triggered): the high-speed mode; it only supports non-blocking sockets. In this mode the kernel notifies you through epoll when a descriptor goes from not ready to ready. It then assumes you know the descriptor is ready and sends no further readiness notifications for it until something makes the descriptor not ready again. (That is, you are notified only once when data becomes ready; if you do not read it, you are not notified again until new data arrives.)
LT (level-triggered): the default mode; it supports both blocking and non-blocking sockets. In LT mode the kernel tells you whenever a file descriptor is ready, and you can then perform I/O on it. If you do nothing, the kernel keeps notifying you; as long as unread data remains, notifications continue until the buffer is drained!
An example:
1. We add a file descriptor (RFD) representing the read end of a pipe to the epoll descriptor.
2. 2 KB of data is written to the other end of the pipe.
3. We call epoll_wait(2); it returns RFD, indicating it is ready for reading.
4. We read 1 KB of the data.
5. We call epoll_wait(2) again...
ET mode of operation:
If we used the EPOLLET flag when adding RFD to the epoll descriptor in step 1, then after the write in step 2, the epoll_wait in step 3 returns RFD exactly once. Because the read in step 4 does not consume all of the data in the input buffer, the epoll_wait call in step 5 may block indefinitely. When epoll works in ET mode it must therefore be used with non-blocking sockets, so that a blocking read or write on one file descriptor cannot starve the task that is handling many descriptors.
With ET you should go back to waiting only when read(2) or write(2) returns EAGAIN (meaning the buffer has been drained). However, this does not mean every read() must loop until it produces EAGAIN: when read() returns fewer bytes than requested (less than sizeof(buf)), you can already conclude that the buffer is empty and treat the read event as complete.
LT mode of operation:
In LT mode the epoll interface behaves like a faster poll(2): whether or not you consume the data afterwards, readiness keeps being reported, so the two are functionally equivalent.
Example:

```c
/* file: epollTest.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/epoll.h>

#define MAXEVENTS 64

/* Function: create and bind a TCP socket.
 * Parameter: port. Return value: the created socket, or -1 on error. */
static int create_and_bind(char *port)
{
    struct addrinfo hints;
    struct addrinfo *result, *rp;
    int s, sfd;

    memset(&hints, 0, sizeof(struct addrinfo));
    hints.ai_family = AF_UNSPEC;     /* Return IPv4 and IPv6 choices */
    hints.ai_socktype = SOCK_STREAM; /* We want a TCP socket */
    hints.ai_flags = AI_PASSIVE;     /* All interfaces */

    s = getaddrinfo(NULL, port, &hints, &result);
    if (s != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(s));
        return -1;
    }

    for (rp = result; rp != NULL; rp = rp->ai_next) {
        sfd = socket(rp->ai_family, rp->ai_socktype, rp->ai_protocol);
        if (sfd == -1)
            continue;
        s = bind(sfd, rp->ai_addr, rp->ai_addrlen);
        if (s == 0) {
            /* We managed to bind successfully! */
            break;
        }
        close(sfd);
    }

    if (rp == NULL) {
        fprintf(stderr, "Could not bind\n");
        return -1;
    }

    freeaddrinfo(result);
    return sfd;
}

/* Function: set a socket to non-blocking mode. */
static int make_socket_non_blocking(int sfd)
{
    int flags, s;

    /* Get the file status flags */
    flags = fcntl(sfd, F_GETFL, 0);
    if (flags == -1) {
        perror("fcntl");
        return -1;
    }

    /* Set the file status flags */
    flags |= O_NONBLOCK;
    s = fcntl(sfd, F_SETFL, flags);
    if (s == -1) {
        perror("fcntl");
        return -1;
    }
    return 0;
}

int main(int argc, char *argv[])
{
    int sfd, s;
    int efd;
    struct epoll_event event;
    struct epoll_event *events;

    /* The port is specified by the parameter argv[1] */
    if (argc != 2) {
        fprintf(stderr, "Usage: %s [port]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    sfd = create_and_bind(argv[1]);
    if (sfd == -1)
        abort();

    s = make_socket_non_blocking(sfd);
    if (s == -1)
        abort();

    s = listen(sfd, SOMAXCONN);
    if (s == -1) {
        perror("listen");
        abort();
    }

    /* Apart from ignoring the size parameter, this function is
       exactly the same as epoll_create */
    efd = epoll_create1(0);
    if (efd == -1) {
        perror("epoll_create");
        abort();
    }

    event.data.fd = sfd;
    event.events = EPOLLIN | EPOLLET; /* read, edge-triggered mode */
    s = epoll_ctl(efd, EPOLL_CTL_ADD, sfd, &event);
    if (s == -1) {
        perror("epoll_ctl");
        abort();
    }

    /* Buffer where events are returned */
    events = calloc(MAXEVENTS, sizeof event);

    /* The event loop */
    while (1) {
        int n, i;

        n = epoll_wait(efd, events, MAXEVENTS, -1);
        for (i = 0; i < n; i++) {
            if ((events[i].events & EPOLLERR) ||
                (events[i].events & EPOLLHUP) ||
                (!(events[i].events & EPOLLIN))) {
                /* An error has occured on this fd, or the socket is not
                   ready for reading (why were we notified then?) */
                fprintf(stderr, "epoll error\n");
                close(events[i].data.fd);
                continue;
            } else if (sfd == events[i].data.fd) {
                /* We have a notification on the listening socket, which
                   means one or more incoming connections. */
                while (1) {
                    struct sockaddr in_addr;
                    socklen_t in_len;
                    int infd;
                    char hbuf[NI_MAXHOST], sbuf[NI_MAXSERV];

                    in_len = sizeof in_addr;
                    infd = accept(sfd, &in_addr, &in_len);
                    if (infd == -1) {
                        if ((errno == EAGAIN) || (errno == EWOULDBLOCK)) {
                            /* We have processed all incoming connections. */
                            break;
                        } else {
                            perror("accept");
                            break;
                        }
                    }

                    /* Convert the address to a host name or service name;
                       the flags request the numeric host and service forms */
                    s = getnameinfo(&in_addr, in_len,
                                    hbuf, sizeof hbuf,
                                    sbuf, sizeof sbuf,
                                    NI_NUMERICHOST | NI_NUMERICSERV);
                    if (s == 0) {
                        printf("Accepted connection on descriptor %d "
                               "(host=%s, port=%s)\n", infd, hbuf, sbuf);
                    }

                    /* Make the incoming socket non-blocking and add it
                       to the list of fds to monitor. */
                    s = make_socket_non_blocking(infd);
                    if (s == -1)
                        abort();

                    event.data.fd = infd;
                    event.events = EPOLLIN | EPOLLET;
                    s = epoll_ctl(efd, EPOLL_CTL_ADD, infd, &event);
                    if (s == -1) {
                        perror("epoll_ctl");
                        abort();
                    }
                }
                continue;
            } else {
                /* We have data on the fd waiting to be read. Read and
                   display it. We must read whatever data is available
                   completely, as we are running in edge-triggered mode
                   and won't get a notification again for the same data. */
                int done = 0;

                while (1) {
                    ssize_t count;
                    char buf[512];

                    count = read(events[i].data.fd, buf, sizeof(buf));
                    if (count == -1) {
                        /* If errno == EAGAIN, that means we have read all
                           data. So go back to the main loop. */
                        if (errno != EAGAIN) {
                            perror("read");
                            done = 1;
                        }
                        break;
                    } else if (count == 0) {
                        /* End of file. The remote has closed the
                           connection. */
                        done = 1;
                        break;
                    }

                    /* Write the buffer to standard output */
                    s = write(1, buf, count);
                    if (s == -1) {
                        perror("write");
                        abort();
                    }
                }

                if (done) {
                    printf("Closed connection on descriptor %d\n",
                           events[i].data.fd);
                    /* Closing the descriptor will make epoll remove it
                       from the set of descriptors which are monitored. */
                    close(events[i].data.fd);
                }
            }
        }
    }

    free(events);
    close(sfd);
    return EXIT_SUCCESS;
}
```
After compiling the code, run ./epollTest 8888 in one terminal. In another terminal, run telnet 192.168.1.161 8888 (where 192.168.1.161 is the IP of the machine running the test program). Any characters typed into the telnet terminal followed by Enter are displayed in the test terminal.
Summary
That concludes this detailed walk-through of the Linux epoll mechanism. I hope it is helpful; interested readers can refer to other related topics on this site, and if anything is lacking, please leave a comment to point it out. Thank you for your support!