Security in DevSecOps Operation and Maintenance Mode

2025-07-01 Update From: SLTechnology News&Howtos

In this article I discuss, from a technical point of view, my understanding of security in the DevSecOps operation and maintenance model of cloud computing data centers, and my work over the past few years on business continuity management for cloud services.

Public cloud service providers are now all turning to the DevSecOps model. DevSecOps is an extension of DevOps practice that treats information security as a fundamental concern at every stage of software development. Security involves not only isolation and compliance checks at every level, but also ensuring business continuity at the technical level. In the ISO/IEC 27001 information security management system, "business continuity management" is a very important part of security management; it aims to reduce interruptions to business activities, protect critical business processes from major failures or natural disasters, and ensure timely recovery. "Business continuity management" is a security governance term; its counterpart in computer products is "reliability, availability, and serviceability (RAS)", where serviceability is also referred to as maintainability.

I. Decentralization

Every cloud computing data center has some centralized shared services, such as firewalls, DNS, core routing, load balancers, and distributed storage. Although the IT infrastructure is designed and implemented with high availability and high throughput in mind, there are always exceptions. For example, when we upgraded a firewall, an intermittent bug prevented the peer from taking over all traffic, which caused an unplanned interruption of many services.

After that, breaking the IT infrastructure down from a centralized structure into a large number of smaller failure domains became one of our key considerations in designing and improving cloud computing data centers. Our cloud infrastructure is distributed across dozens of regions. The data center in each region is physically divided into three availability domains, each with its own independent infrastructure. Availability domains are isolated from each other, fault tolerant, and very unlikely to fail at the same time. Because availability domains do not share infrastructure (such as power or cooling) or the internal availability domain network, a failure of one availability domain in a region is unlikely to affect customers of the other availability domains in the same region. Within each availability domain, we decentralize further by grouping hardware into multiple fault domains. A fault domain is a set of hardware and infrastructure. By making proper use of fault domains, our customers can improve the availability of applications running on Oracle Cloud Infrastructure. For example, if a customer has two Web servers and a clustered database, we recommend placing one Web server and one database node in one fault domain and the other Web server and database node in another fault domain. This ensures that the failure of any single fault domain will not interrupt the application.
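The placement recommendation above can be illustrated with a minimal sketch. It assumes a simple round-robin per tier; the function names, node names, and fault domain labels (`FD-1`, `FD-2`) are hypothetical and are not part of any Oracle Cloud Infrastructure API.

```python
from collections import defaultdict


def spread_across_fault_domains(nodes, fault_domains):
    """Round-robin the nodes of each tier across the given fault domains.

    nodes: list of (node_name, tier) tuples.
    Returns a dict mapping fault domain -> list of node names.
    """
    placement = defaultdict(list)
    next_index = defaultdict(int)  # per-tier round-robin counter
    for name, tier in nodes:
        fd = fault_domains[next_index[tier] % len(fault_domains)]
        placement[fd].append(name)
        next_index[tier] += 1
    return dict(placement)


def survives_single_failure(nodes, placement):
    """True if every tier still has a node after any one fault domain fails."""
    tier_of = dict(nodes)
    all_tiers = set(tier_of.values())
    for failed in placement:
        surviving = {
            tier_of[name]
            for fd, names in placement.items()
            if fd != failed
            for name in names
        }
        if surviving != all_tiers:
            return False
    return True


# Hypothetical example: two Web servers and a two-node clustered database.
nodes = [("web-1", "web"), ("web-2", "web"), ("db-1", "db"), ("db-2", "db")]
placement = spread_across_fault_domains(nodes, ["FD-1", "FD-2"])
print(placement)
print(survives_single_failure(nodes, placement))
```

Placing both Web servers in one fault domain and both database nodes in the other would fail the second check, because losing either domain removes a whole tier.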

In addition to the fault domains described above, we have also set a specific target for Oracle SaaS services (Oracle's ERP, CRM, HCM and other industry solutions, which currently have more than 25,000 enterprise customers): no disaster event affecting any single component should disrupt as many as 10% of the customers in the data center, or as many as 100 customers. To achieve this goal, our team designed and implemented a decentralization improvement program a few years ago. It is an infrastructure optimization solution aimed at zero downtime, covering firewalls, DNS, load balancers, Web front ends, storage, IMAP, and so on.
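A target like this can be checked mechanically against an inventory of which customers depend on which component. The sketch below is only an illustration: the component names and counts are invented, and treating "reaching either threshold" as a violation is an assumption about how the target is read.

```python
def exceeds_blast_radius(customers_on_component, total_customers,
                         max_fraction=0.10, max_customers=100):
    """Return True if losing this component would disrupt too many customers.

    The thresholds follow the text (10% of the data center's customers,
    or 100 customers). Counting "reaches the threshold" as a violation
    is an assumption of this sketch.
    """
    if total_customers <= 0:
        raise ValueError("total_customers must be positive")
    too_large_share = customers_on_component >= max_fraction * total_customers
    too_many = customers_on_component >= max_customers
    return too_large_share or too_many


# Hypothetical inventory: component -> number of dependent customers.
dependents = {"firewall-a": 40, "load-balancer-3": 120, "dns-1": 15}
total = 1000

violations = sorted(
    name for name, n in dependents.items() if exceeds_blast_radius(n, total)
)
print(violations)
```

Components that appear in `violations` are candidates for being split into smaller failure domains.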

II. Backup and disaster recovery

Backup and disaster recovery are unavoidable topics when ensuring the security and availability of services. Although backup and disaster recovery are very costly, we still provide solutions for a variety of scenarios for customers to choose from.

Backup data is rarely used. In a production environment, I receive data recovery requests at an average rate of less than 2 in 1,000 per quarter, mainly for customer test environments. For SaaS services in real production environments, the average rate is less than 2 in 10,000 per quarter. To be ready for that 2-in-10,000 case, the operation and maintenance department takes a certain proportion of backups every week and tests and verifies data recovery according to a specific security process, to ensure that the backups are valid.
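The weekly sampling step could look roughly like the following sketch. The text does not state what proportion is tested, so the `percent` parameter and the backup identifiers are assumptions; the restore test itself is outside the scope of the example.

```python
import random


def pick_backups_to_verify(backup_ids, percent, seed=None):
    """Choose a random sample of backups for this week's restore test.

    percent: whole-number percentage of backups to verify (an assumed
    parameter; the real proportion is not given in the text).
    At least one backup is chosen whenever any exist, so a small
    percentage never produces an empty verification run.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    ids = list(backup_ids)
    if not ids:
        return []
    # Integer ceiling of len(ids) * percent / 100.
    sample_size = max(1, (len(ids) * percent + 99) // 100)
    rng = random.Random(seed)
    return sorted(rng.sample(ids, sample_size))


# Hypothetical backup catalogue for one week.
catalogue = [f"backup-{i:03d}" for i in range(200)]
this_week = pick_backups_to_verify(catalogue, percent=5, seed=2025)
print(len(this_week), this_week[:3])
```

Passing a seed makes a given week's selection reproducible for audit purposes; omitting it gives a fresh random sample each run.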

My colleagues and I have also developed an implementation plan for Oracle SaaS DR. Customers who purchase this service can quickly switch the production environment from one data center to another in a few simple steps through the Oracle Site Guard web GUI. Mr. Zhao Cheng, Director of Technical Services at Mogujie, discussed the difficulty of cold backup in his article "Do disaster recovery, cold backup is not a good plan". Technically, our DR solution focuses on data synchronization, clearing abnormal lock files, updating load balancers, updating application configuration, switching databases with Data Guard, and, after the primary node has been restored, reversing the synchronization and automatically switching back to the configuration that was in place before the unplanned interruption. For the RTO (Recovery Time Objective) and RPO (Recovery Point Objective) of our DR solution, search for "Disaster Recovery for Oracle SaaS Public Cloud Services" on Google and consult the official documentation. In fact, the figures validated in our production environment are much better than the published ones.

III. Continuously improve access control and find a balance between efficiency and security

I summarize the scope of access control as follows: a specific person authorized by the customer accesses desensitized content, within a specified period of time and in a verified and secure manner, and all channels and nodes that customer data passes through are encrypted as far as possible.

(1) Customer authorization. Based on each customer's industry and data security requirements, we have customized access control approval workflows for the security audit departments of multiple customers. The authorization process covers the nationality of SRE engineers, third-party background checks, security training related to customer data protection, the hard disk encryption status of laptops, and so on. An access authorization may be valid for a single use, a few days, or a month, depending on the characteristics of the industry and the customer's needs.
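A time-limited authorization of this kind can be modeled as a grant with a start time and a duration. The sketch below is illustrative only; the `AccessGrant` fields and the names used are hypothetical and do not describe the actual approval workflow.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class AccessGrant:
    engineer: str
    approved_by: str       # customer-side approver (hypothetical field)
    starts_at: datetime
    duration: timedelta    # e.g. a few hours, a few days, or a month

    def is_active(self, now):
        """A grant is valid from starts_at up to, but not including, its expiry."""
        return self.starts_at <= now < self.starts_at + self.duration


def may_access(engineer, grants, now):
    """True if the engineer holds at least one grant that is active right now."""
    return any(g.engineer == engineer and g.is_active(now) for g in grants)


start = datetime(2025, 7, 1, 9, 0, tzinfo=timezone.utc)
grants = [
    AccessGrant("sre-alice", "customer-audit", start, timedelta(days=3)),
]

print(may_access("sre-alice", grants, start + timedelta(days=1)))
print(may_access("sre-alice", grants, start + timedelta(days=4)))
print(may_access("sre-bob", grants, start + timedelta(days=1)))
```

Because expiry is computed from the grant itself, access lapses automatically without anyone having to revoke it.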

(2) Fine-grained access control. On the technical side, in addition to VPN and Bastion (also known as Jumpbox), we introduced the Oracle Break Glass scheme, which lets external customers approve and authorize Oracle's SRE engineers to access their systems and services for management purposes, providing additional security at the application layer. Break Glass access is time-limited and protects customer data by giving Oracle support personnel only temporary access. We also introduced hardware security modules (HSM) to strengthen the management of digital keys in the cloud service environment. In the new generation of Oracle SaaS services, any SQL operation that an engineer runs against the database is automatically suspended and an SR (service request) requiring approval is generated automatically; the statement is executed only after its security has been reviewed and approved.
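The suspend-then-approve flow for SQL statements can be sketched as a small state machine. This is an assumption-laden illustration, not Oracle's implementation: the class and method names are invented, and the rule that an engineer cannot approve their own statement is an added assumption.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class SqlRequest:
    sr_id: int
    engineer: str
    statement: str
    status: str = "pending"  # pending -> approved or rejected -> executed


class SqlApprovalGate:
    """Holds every submitted statement until a reviewer approves it."""

    def __init__(self, execute):
        self._execute = execute  # callable that actually runs the SQL
        self._ids = count(1)
        self.requests = {}

    def submit(self, engineer, statement):
        """Suspend the statement and open a request; nothing is executed yet."""
        req = SqlRequest(next(self._ids), engineer, statement)
        self.requests[req.sr_id] = req
        return req.sr_id

    def review(self, sr_id, reviewer, approve):
        req = self.requests[sr_id]
        if req.status != "pending":
            raise ValueError(f"request {sr_id} has already been reviewed")
        if reviewer == req.engineer:
            # Assumption of this sketch: no self-approval.
            raise PermissionError("an engineer cannot review their own statement")
        req.status = "approved" if approve else "rejected"

    def run(self, sr_id):
        req = self.requests[sr_id]
        if req.status != "approved":
            raise PermissionError(f"request {sr_id} is {req.status}, not approved")
        result = self._execute(req.statement)
        req.status = "executed"
        return result


executed = []
gate = SqlApprovalGate(execute=executed.append)
sr = gate.submit("sre-alice", "UPDATE orders SET status = 'closed' WHERE id = 42")
gate.review(sr, reviewer="security-bob", approve=True)
gate.run(sr)
print(gate.requests[sr].status, len(executed))
```

In a real deployment the `execute` callable would be the database client; here it only records the statement so that the example runs without a database.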

(3) Data encryption. In addition to this controlled access, we use Oracle's Transparent Data Encryption (TDE) and Database Vault to protect and audit data at rest. Customers can control the TDE master encryption key and manage its lifecycle.

(4) Penetration testing, security assessment, remediation, and hardening. In addition, from a technical point of view we periodically review the security of the authentication and authorization protocols of the various components, the security of transport layer encryption and network isolation, and fine-grained data access control, using vulnerability scanning, penetration testing, and assessment. Potential weaknesses that are found are addressed promptly through automated remediation and hardening.

IV. Continuously verify and improve the reliability, availability and maintainability of each component from the point of view of operation and maintenance

When talking about reliability, chaos engineering is often mentioned. Personally, I think chaos engineering is mainly for the consumers of cloud services. Cloud service consumers often lack an understanding of the underlying technologies, so they need chaos engineering to trigger server instance failures, network failures, and application failures, so that the services their R&D engineers deploy on the public cloud can tolerate failures while still ensuring adequate quality of service.

As public cloud service providers, we also have to follow the expert model: introduce destructive testing, and continuously verify and improve the reliability, availability and maintainability of each component from the point of view of operation and maintenance, especially the recovery solutions for possible failures, so that the system can be restored to a running state in less time after a failure.

We usually decompose the IT infrastructure of the entire service into several components, and then analyze and improve the recovery solution of each component from the following seven dimensions.

(1) Single point of failure: for example, hardware components, software processes, hard disk hot plug, whether a bad disk causes zero I/O, whether a chatty disk causes zero I/O, disk resilvering, the system boot disk, and the disk shelf (enclosure).

(2) Cluster framework: for example, crash, hang, panic, manual cluster switchover, manual cluster failback, cluster split brain, cluster heartbeat failure, cluster takeover under high load, distributed lock failure tests, and data consistency verification failure tests.

(3) Shared services: for example, when multiple entries are configured, adding or deleting an entry in DNS, NTP, AD, LDAP, or NIS should not affect access to the data access and management interfaces.

(4) Data corruption: for example, triggering split brain or RAID corruption, observing whether data corruption occurs, and finding a solution for data service recovery.

(5) Infrastructure service failure.

(6) Reliability of the management and monitoring interfaces.

(7) Performance and diagnostic problems introduced by overlay technology, and the corresponding service recovery solutions.
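One simple way to keep track of this analysis is a component-by-dimension matrix that shows which combinations still lack a destructive test. The sketch below is only an illustration; the component names and the recorded tests are hypothetical.

```python
# The seven dimensions listed above, as short labels.
DIMENSIONS = [
    "single point of failure",
    "cluster framework",
    "shared services",
    "data corruption",
    "infrastructure service failure",
    "management and monitoring interface",
    "overlay performance and diagnostics",
]


def coverage_gaps(components, tested):
    """Return the (component, dimension) pairs with no recorded test.

    tested: set of (component, dimension) pairs that have been exercised.
    """
    return [
        (component, dimension)
        for component in components
        for dimension in DIMENSIONS
        if (component, dimension) not in tested
    ]


# Hypothetical components and test records.
components = ["storage", "load balancer"]
tested = {
    ("storage", "single point of failure"),
    ("storage", "data corruption"),
    ("load balancer", "cluster framework"),
}

gaps = coverage_gaps(components, tested)
print(len(gaps))
for component, dimension in gaps[:3]:
    print(component, "->", dimension)
```

The remaining pairs form a backlog of destructive tests and recovery procedures still to be written for each component.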

Thanks to in-depth research and preparation in the technical area of each component, my SRE team has largely achieved "response, data collection and analysis within 15 minutes, and a solution within a further 15 minutes" for escalated cloud service performance and availability issues (P1 Escalation).

In short, security in the DevSecOps operation and maintenance model of cloud computing data centers is a process of continuous improvement. We should fully consider decentralization and backup and disaster recovery, continuously improve access control, and introduce destructive testing to improve the system's ability to return quickly to a running state after a failure.

This article briefly describes my understanding, as an IT system architect, of "Sec" (security) in the current DevSecOps operation and maintenance model of cloud computing data centers, as well as some of the exploration I have done in my work. I hope it will prompt discussion of how to improve the security of cloud service data centers and ensure business continuity. Some of these views are not necessarily correct; criticism and correction are welcome.

You are welcome to leave a message listing your company's experience in improving "business continuity" from a security perspective.