epoll: IO multiplexing

2025-01-17 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

1. About epoll

For the IO multiplexing model, we covered the select and poll functions earlier. select gives the user a data structure, fd_set, for storing the events whose readiness is being waited on, divided into read, write and exception sets. poll, on the other hand, manages each event's file descriptor and the operations of interest with a pollfd structure, and reports an event's readiness to the user through the structure's output member revents.

Both functions, however, require the user to traverse the entire event set to determine which events are ready for processing, so when many events are being waited on, the data copying and traversal overhead makes them inefficient. To address these shortcomings of select and poll, a comparatively efficient alternative for IO multiplexing appeared: epoll.

2. Using the epoll functions

First of all, unlike select and poll, epoll is not a single function named epoll; instead it provides three functions: epoll_create, epoll_ctl and epoll_wait.

1. epoll_create

The epoll_create function creates an epoll "instance" and asks the kernel to allocate space for the backing store of events. Its prototype is int epoll_create(int size). The size parameter was once a hint about how the kernel should size its internal structures, but it is now ignored (it merely has to be greater than zero) and need not be worried about.

On success the function returns a file descriptor referring to the newly created epoll instance; this descriptor is passed to the other epoll functions later. When it is no longer needed, it should be closed with close, whereupon the kernel destroys the epoll instance and releases the associated resources. On failure the function returns -1 and sets errno accordingly.

2. epoll_ctl

Its prototype is int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event). In the function parameters:

epfd is the epoll file descriptor returned by epoll_create, identifying the instance to operate on.

op specifies the operation to perform on the epoll instance; it takes one of three macros:

EPOLL_CTL_ADD registers a new event with the epoll instance identified by epfd.

EPOLL_CTL_MOD changes the events of interest for an already-registered file descriptor.

EPOLL_CTL_DEL removes a registered event that no longer needs to be watched.

fd is the file descriptor on which the IO will be performed, that is, the descriptor the user wants to act on.

event is a pointer to an epoll_event structure that stores information about the fd being operated on.

In the structure:

events describes the operations of interest for the file descriptor fd, expressed as a bit mask built from the following macros, the main ones being:

EPOLLIN indicates that fd is ready for reading.

EPOLLOUT indicates that fd is ready for writing.

EPOLLPRI indicates that urgent data is available for reading.

EPOLLERR indicates that an error occurred on the descriptor.

EPOLLHUP indicates that the other end hung up.

EPOLLET sets the file descriptor to edge-triggered mode; the default is level-triggered. The LT and ET modes are discussed below.

The data member of the structure is a union representing user data associated with the file descriptor:

ptr is a pointer to a data buffer.

fd is the file descriptor being operated on.

On success epoll_ctl returns 0; on failure it returns -1 and sets errno accordingly.

3. epoll_wait

If epoll_create and epoll_ctl above are the preparation, then epoll_wait is the function that actually waits on multiple events, just as select and poll do. Its prototype is int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout). In the function parameters:

epfd is the file descriptor of the epoll instance created with epoll_create.

events is a pointer to the structure described above, usually the first address of an array, and acts as an input-output parameter: as input, the user supplies the system with address space for storing ready events; as output, the system fills it with the ready events for the user to pick up. It must therefore not be NULL.

maxevents is the capacity of events.

timeout sets how long to wait, in milliseconds.

It is worth noting that, since epoll is an improvement over select and poll, its efficiency shows mainly in the return value of epoll_wait:

On failure the function returns -1 and sets errno accordingly.

A return value of 0 indicates a timeout: no event became ready within the allotted time.

A return value greater than 0 tells the user how many IO events in the watched set are ready, and those events are packed into the user-supplied events array starting at index 0. Unlike select and poll, there is no need to traverse the entire event set to find the ready ones; visiting the first return-value entries of the array yields every ready event.

3. Example time

Similarly, with the epoll interface functions we can write a TCP server of our own. The basic steps are:

First, create a listening socket, bind the local address information, and put the socket into the listening state. Before binding, call setsockopt to set the SO_REUSEADDR option so the address information can be reused if the server restarts.

Call epoll_create to obtain a file descriptor for an epoll instance, which later epoll calls will operate on.

Call epoll_ctl to register the listening socket with the epoll instance.

Define an array of epoll_event structures, of a size chosen by the user, for the system to store ready IO events in.

Call epoll_wait to wait for events and collect its return value.

When epoll_wait returns, examine and handle the returned events one by one. If the listening socket is ready, a connection request needs to be accepted and the new socket added to the epoll instance; if any other socket is ready, its data can be read or written.

When one end closes the connection, or the epoll instance is no longer needed, call close on the corresponding file descriptor to reclaim its resources.

The server program is designed as follows:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/epoll.h>

#define _BACKLOG_ 5        // maximum length of the pending-connection queue
#define _MAX_NUM_ 20       // capacity of the ready-event array
#define _DATA_SIZE_ 1024   // data buffer size

// data in epoll_event is a union, so fd and ptr cannot be used at the same
// time; store them together in a separate structure instead
typedef struct data_buf
{
    int _fd;
    char _buf[_DATA_SIZE_];
} data_buf_t, *data_buf_p;

// check the command line arguments
void Usage(const char *argv)
{
    assert(argv);
    printf("Usage: %s [ip] [port]\n", argv);
    exit(0);
}

// create the listening socket
static int CreateListenSock(int ip, int port)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0); // create a new socket
    if (sock < 0)
    {
        perror("socket");
        exit(1);
    }

    // SO_REUSEADDR lets the server rebind right away when it is the side
    // that closed first and the address is sitting in TIME_WAIT
    int opt = 1;
    if (setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt)) < 0)
    {
        perror("setsockopt");
        exit(2);
    }

    // fill in the local network address information
    struct sockaddr_in server;
    server.sin_family = AF_INET;
    server.sin_port = htons(port);
    server.sin_addr.s_addr = ip;

    // bind the socket to the local address
    if (bind(sock, (struct sockaddr*)&server, sizeof(server)) < 0)
    {
        perror("bind");
        exit(3);
    }

    // put the socket into the listening state
    if (listen(sock, _BACKLOG_) < 0)
    {
        perror("listen");
        exit(4);
    }
    return sock;
}

// run epoll
void epoll_server(int listen_sock)
{
    // create an epoll instance; the size argument is arbitrary (it is ignored)
    int epoll_fd = epoll_create(256);
    if (epoll_fd < 0)
    {
        perror("epoll_create");
        exit(5);
    }

    // register the listening socket's read events with the epoll instance
    struct epoll_event ep_ev;
    ep_ev.events = EPOLLIN;
    ep_ev.data.fd = listen_sock;
    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listen_sock, &ep_ev) < 0)
    {
        perror("epoll_ctl");
        exit(6);
    }

    // a fixed amount of space for the system to store the ready events in
    struct epoll_event evs[_MAX_NUM_];
    int maxnum = _MAX_NUM_; // size of the supplied space
    int timeout = 10000;    // timeout in ms; -1 would block indefinitely
    int ret = 0;            // return value of epoll_wait: the ready count

    while (1)
    {
        switch ((ret = epoll_wait(epoll_fd, evs, maxnum, timeout)))
        {
        case -1: // error
            perror("epoll_wait");
            break;
        case 0:  // timeout
            printf("timeout...\n");
            break;
        default: // at least one event is ready
        {
            int i = 0;
            for (; i < ret; ++i)
            {
                // the listening socket is ready: accept the connection request
                if ((evs[i].data.fd == listen_sock) && (evs[i].events & EPOLLIN))
                {
                    struct sockaddr_in client;
                    socklen_t client_len = sizeof(client);
                    // handle the request and obtain a new communication socket
                    int accept_sock = accept(listen_sock,
                                             (struct sockaddr*)&client, &client_len);
                    if (accept_sock < 0)
                    {
                        perror("accept");
                        continue;
                    }
                    printf("connect with a client...[fd]:%d [ip]:%s [port]:%d\n",
                           accept_sock, inet_ntoa(client.sin_addr),
                           ntohs(client.sin_port));
                    // add the new event to the epoll instance
                    ep_ev.events = EPOLLIN;
                    ep_ev.data.fd = accept_sock;
                    if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, accept_sock, &ep_ev) < 0)
                    {
                        perror("epoll_ctl");
                        close(accept_sock);
                    }
                }
                else // an IO socket other than the listening one
                {
                    if (evs[i].events & EPOLLIN) // ready for reading
                    {
                        // space to hold the fd and the buffer together
                        data_buf_p _data = (data_buf_p)malloc(sizeof(data_buf_t));
                        if (!_data)
                        {
                            perror("malloc");
                            continue;
                        }
                        _data->_fd = evs[i].data.fd;
                        printf("read from fd: %d\n", _data->_fd);
                        // read the data out of the buffer
                        ssize_t size = read(_data->_fd, _data->_buf,
                                            sizeof(_data->_buf) - 1);
                        if (size < 0) // read error
                            printf("read error...\n");
                        else if (size == 0) // the peer closed the connection
                        {
                            printf("client closed...\n");
                            // clean up: remove the event from the epoll
                            // instance, close the fd, avoid a memory leak
                            epoll_ctl(epoll_fd, EPOLL_CTL_DEL, _data->_fd, NULL);
                            close(_data->_fd);
                            free(_data);
                        }
                        else // read succeeded: output, then switch to writing
                        {
                            (_data->_buf)[size] = '\0';
                            printf("client# %s", _data->_buf);
                            fflush(stdout);
                            // change the same event in the epoll instance to
                            // care about write events, for the write-back
                            ep_ev.data.ptr = _data;
                            ep_ev.events = EPOLLOUT;
                            epoll_ctl(epoll_fd, EPOLL_CTL_MOD, _data->_fd, &ep_ev);
                        }
                    }
                    else if (evs[i].events & EPOLLOUT) // ready for writing
                    {
                        data_buf_p _data = (data_buf_p)evs[i].data.ptr;
                        // write the data back into the buffer
                        write(_data->_fd, _data->_buf, strlen(_data->_buf));
                        // the write-back completes one exchange: finish up
                        epoll_ctl(epoll_fd, EPOLL_CTL_DEL, _data->_fd, NULL);
                        close(_data->_fd);
                        free(_data);
                    }
                    else
                    {
                    }
                }
            }
        }
            break;
        }
    }
}

int main(int argc, char *argv[])
{
    if (argc != 3) // check the command line arguments
        Usage(argv[0]);
    // get the port number and IP address
    int port = atoi(argv[2]);
    int ip = inet_addr(argv[1]);
    // obtain the listening socket
    int listen_sock = CreateListenSock(ip, port);
    // run the epoll loop
    epoll_server(listen_sock);
    close(listen_sock); // close the file descriptor
    return 0;
}
```

To explain: for epoll operations the system actually maintains a balanced binary search tree (of registered descriptors) and a linked list (of ready events). If the space the user supplies in one call is not enough to hold all the ready events, the system simply delivers the rest on the next call, so the size of the structure array passed to epoll_wait is nothing to worry about.

Run the program:

The left side is the server; the right side is a telnet client requesting a connection.

Because the design follows a question-and-answer pattern, once the server has accepted the connection and received data, it reads the data out and writes it back to the requesting side, and one exchange is considered complete.

The setup can also be tested with a browser. When the browser makes a connection request, the server again reads the received data and writes a reply back, but here the content of the reply matters: most browsers speak HTTP, so what the browser receives should be an HTTP response message. An HTTP response consists of three parts, the status line, the message headers and the response body, with the status line in the format "protocol version + response status code + reason phrase". Going further is beyond the scope of this article. In short, the content written back by the server should take the following form:

```c
const char *msg = "HTTP/1.1 200 OK\r\n\r\nHello, what can i do for you? :)\r\n";
write(_data->_fd, msg, strlen(msg));
```

Run the server program, open a browser and enter the IP and port number:

When the browser connects, the server receives the browser's request information and returns the response message; the browser then extracts the body from the response it received and displays it, as shown on the right (the local loopback address 127.0.0.1 is used for testing).

4. Level trigger and edge trigger

When epoll_wait is waiting on multiple events and data arrives in a buffer, the corresponding event becomes ready and the call must return to tell the user "there is data to process". The way the system delivers this notification comes in two modes: level trigger and edge trigger:

Level trigger (Level Trigger, LT) notifies the user when data arrives. If the user does not drain the buffer and leaves part of the data behind, the system still considers the event ready on the next epoll_wait for that event and keeps notifying the user to fetch the rest. The characteristic of level trigger is that as long as there is data in the buffer, the IO event stays ready and epoll_wait keeps returning it to the user program.

Edge trigger (Edge Trigger, ET) likewise returns to notify the user when data arrives, but unlike level trigger, if the IO is not finished after the notification, that is, data remains in the buffer after the first pass, the next epoll_wait will no longer report the event as ready. Only when new data arrives on the event will the user program be notified to process it again. The characteristic of edge trigger is therefore that the system notifies the user program exactly once per arrival of data; if data is left over, the event becomes ready again only when another arrival occurs and epoll_wait returns to notify the user program.

Note that with edge trigger the system notifies the user program only once per arrival of data. If the IO interface works in blocking mode, one event blocking can mean other ready events get their single notification but never get processed, and unhandled data accumulates. When using edge trigger, therefore:

It is best to set the IO interface to non-blocking mode.

When handling a ready IO event, it is best to process all the data in the buffer at once. For reading, this means looping and reading a fixed length each time; when a read returns fewer bytes than that length, the buffer can be taken as drained and the loop ends. If the last chunk happens to be exactly the fixed length, however, the next read finds the buffer empty: on a non-blocking descriptor it then fails with errno set to EAGAIN, which can serve as the loop's termination condition.

The error code EAGAIN has the value 11, as can be seen in /usr/include/asm-generic/errno.h and errno-base.h:

Its error description is "Resource temporarily unavailable": the resource is temporarily unavailable, and the operation can simply be retried.

To make an IO interface non-blocking, call the fcntl function, whose prototype is int fcntl(int fd, int cmd, ... /* arg */):

In the function parameters:

fd is the file descriptor to operate on.

cmd is the operation to perform.

The remaining arguments depend on cmd.

To set the file interface to non-blocking, first call fcntl with cmd set to F_GETFL, which fetches the current flags of the file descriptor; these are needed for the reset. Then call fcntl again with cmd set to F_SETFL to reset the descriptor's flags, adding the O_NONBLOCK option.

The return value of the fcntl function varies depending on the operation:

Comparing the two, level trigger is safer and more reliable for data processing, while edge trigger is more efficient; which notification mode to choose depends on the situation.

In the program above, epoll_wait notifies in the default LT, level-triggered mode. To change it to the more efficient ET edge-triggered mode, the non-blocking and read-everything conditions described above must be met:

First, set the event's IO interface to non-blocking mode; the following function must be called when the listening socket is created and each time a new connection request yields a new IO file descriptor:

```c
// additionally requires <fcntl.h>, <errno.h> and <sys/wait.h>
int set_non_block(int fd)
{
    // get the current flags of the file descriptor
    int old_fl = fcntl(fd, F_GETFL);
    if (old_fl < 0)
    {
        perror("fcntl");
        return -1;
    }
    // add O_NONBLOCK to put the descriptor into non-blocking mode
    if (fcntl(fd, F_SETFL, old_fl | O_NONBLOCK))
    {
        perror("fcntl");
        return -1;
    }
    return 0;
}
```

Second, wrap functions of our own that loop to read from or write to the buffer until no data is left. This avoids the pile-up of unprocessed data that the edge-trigger behavior would otherwise cause:

```c
// read data
ssize_t MyRead(int fd, char *buf, size_t size)
{
    assert(buf);
    int index = 0;
    ssize_t ret = 0;
    // read returns 0 when the peer closes the connection, ending the loop
    // and returning 0; any non-zero value (data or an error) enters the loop
    while ((ret = read(fd, buf + index, size - index)))
    {
        if (errno == EAGAIN) // EAGAIN means the buffer has been drained
        {
            printf("read errno: %d\n", errno);
            perror("read");
            break;
        }
        index += ret;
    }
    return (ssize_t)index; // return the total amount of data read
}

// write data
ssize_t MyWrite(int fd, char *buf, size_t size)
{
    assert(buf);
    int index = 0;
    ssize_t ret = -1;
    // as with reading: a return value of 0 ends the loop directly,
    // any other value enters it
    while ((ret = write(fd, buf + index, size - index)))
    {
        if (errno == EAGAIN) // EAGAIN is set once everything has been written
        {
            printf("write errno: %d\n", errno);
            perror("write");
            break;
        }
        index += ret;
    }
    return (ssize_t)index; // return the total amount of data written
}
```

After adding these changes to the example above and running the program, the result shows that the first connection works fine and yields the question-and-answer exchange. From the second connection on, however, the data sent can no longer be received by the server; instead the server concludes that the peer has closed, so it closes the connection and removes the event. What is going on?

The cause lies in the wrapped read function. When the first connection reads, it drains the buffer, and the final read fails with errno set to EAGAIN. errno is a global variable. When a later connection reads data, read returns a value greater than zero, but by the time the loop evaluates if (errno == EAGAIN), errno still holds the EAGAIN left over from the first connection; everything runs in the same process, so the condition is always true, the loop breaks, and the function returns 0. The caller then concludes that no data was read and closes the corresponding file descriptor.

This is a textbook case of a function made non-reentrant by its use of a global variable.

1. One way to solve it is to strengthen the condition:

```c
if ((ret < 0) && (errno == EAGAIN))
{
    printf("read errno: %d\n", errno);
    perror("read");
    break;
}
```

A failed read must be handled separately from a successful one. The global errno cannot be avoided, but checking the return value of read as well makes the test reliable.

2. Another approach is to use multiple processes, which makes errno a variable private to each process: whenever an IO read event becomes ready, fork a child process to handle the buffer IO. The event-handling code after epoll_wait becomes:

```c
else
{
    if (evs[i].events & EPOLLIN) // a read event is ready
    {
        data_buf_p _data = (data_buf_p)malloc(sizeof(data_buf_t));
        if (!_data)
        {
            perror("malloc");
            continue;
        }
        _data->_fd = evs[i].data.fd;
        printf("read from fd: %d\n", _data->_fd);
        // create a child process to do the actual IO
        pid_t id = fork();
        if (id < 0)        // fork failed
            perror("fork");
        else if (id == 0)  // child process
        {
            printf("child proc: %d\n", getpid());
            ssize_t size = MyRead(_data->_fd, _data->_buf,
                                  sizeof(_data->_buf) - 1);
            if (size < 0)
                printf("read error...\n");
            else if (size == 0)
            {
                printf("client closed...\n");
                exit(12);
                // cleanup stays commented out here; see the note below
                //epoll_ctl(epoll_fd, EPOLL_CTL_DEL, _data->_fd, NULL);
                //close(_data->_fd);
                //free(_data);
            }
            else
            {
                (_data->_buf)[size] = '\0';
                printf("client# %s", _data->_buf);
                fflush(stdout);
                // change the event to care about writing, edge-triggered
                ep_ev.data.ptr = _data;
                ep_ev.events = EPOLLOUT | EPOLLET;
                epoll_ctl(epoll_fd, EPOLL_CTL_MOD, _data->_fd, &ep_ev);
            }
        }
        else               // parent process: wait, then clean up
        {
            pid_t ret = wait(NULL);
            if (ret < 0)
                perror("wait");
            else
                printf("wait success : %d\n", ret);
            epoll_ctl(epoll_fd, EPOLL_CTL_DEL, _data->_fd, NULL);
            close(_data->_fd);
            free(_data);
        }
    }
    else if (evs[i].events & EPOLLOUT) // a write event is ready
    {
        data_buf_p _data = (data_buf_p)evs[i].data.ptr;
        MyWrite(_data->_fd, _data->_buf, strlen(_data->_buf));
        // cleanup stays commented out here as well; see the note below
        //epoll_ctl(epoll_fd, EPOLL_CTL_DEL, _data->_fd, NULL);
        //close(_data->_fd);
        //free(_data);
        exit(11);
    }
}
```

A note on this: when a child process is created, it copies the parent's PCB and naturally receives copies of the corresponding file descriptors to operate on. When content needs to change, such as a file descriptor or the epoll registration, the child performs a copy-on-write. Simply closing the file descriptor and freeing the space in the child therefore has no real effect; it would only clear the copied content. That is why the cleanup in the child is commented out in the program above and carried out in the parent instead. The parent also has to wait for the child; without the wait, operations on the same IO event would run out of order and the intended effect would be lost.

Run the program:

The question of reentrancy naturally leads to thinking about thread safety: could the program above be changed to use multithreading instead?

For threads, a process's resources are shared. Note, though, that on modern POSIX systems errno is in fact thread-local: each thread gets its own copy, precisely so that one thread's errors do not corrupt another's. In any case, the problem above is not caused by racing over a critical resource, because the for loop handles the IO events one by one; the failure comes from an earlier call's leftover change to a global variable affecting a later call, which is the typical mark of a non-reentrant function. Reentrancy is not the same thing as thread safety: a reentrant function must draw all the variables it uses from its own stack space, so merely moving to multithreading or adding thread mutual exclusion would not, by itself, change anything here.

"finish"

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Network Security

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report