This article introduces the main knowledge points of Hadoop system security: the basic Hadoop framework, system-level security, operational security, architectural security, and practical recommendations.
I. Hadoop basic framework
Before we talk about security, some basic background is needed. Many security engineers know little or nothing about the Hadoop framework.
In Hadoop deployments, the business cares first about performance scaling and cluster storage expansion, and the approach relies mostly on open source. The point of understanding the Hadoop framework is that every component in it is a potential risk. The figure below is a simplified view of Hadoop's underlying MapReduce system:
As the diagram shows, Hadoop is highly scalable: it supports parallel processing, elastic addition of nodes, and multi-tenancy across applications. Security-wise, however, there is a problem. The nodes here are peers that communicate freely with one another, covering a lot of ground: correct data replication, nodes joining and leaving, storage optimization, and so on. In other words, they trust each other implicitly, and that trust is a risk.
MapReduce is just the distributed task-processing system at the bottom of Hadoop; the framework of the whole Hadoop stack looks like this:
You can think of it like a LAMP stack: you add components according to your business needs. For example, you might use HBase to store data for millisecond-level retrieval, but since HBase is weak at combined queries over multiple fields, you might add Solr for those. Sqoop helps you import data from traditional relational databases, and Pig provides higher-level MapReduce functionality. Spark, Drill, Impala, and Hive can all be used for SQL queries. The whole Hadoop stack is flexible, customizable, and modular.
Flexibility brings complexity, and complexity makes security harder. Each module has its own versions and configuration, and may even require separate authentication, so each module is a risk point. Tools such as Sentry now let you authorize the whole Hadoop ecosystem uniformly, and more security modules are likely to appear in the future.
II. System security
At the system level of Hadoop, the main issues are:
1. Authentication and authorization
Role-based access control (RBAC) is the core of access control: RBAC associates roles with permissions over groups, tables, labels, and other data objects. In large enterprises, authentication and authorization demand cross-team cooperation, such as integration with SSO and IT systems, and field-level, classification-aware control on the data platform. In the Hadoop ecosystem, identity is a complicated subject; Hadoop stays as loosely coupled to authoritative identity sources as it can. Hadoop uses Kerberos as its default authentication protocol, but Kerberos alone is not enough for more advanced identity-management needs.
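To make the Kerberos side concrete, here is a minimal sketch of a Java client logging in to a Kerberized cluster through Hadoop's UserGroupInformation API; the principal name and keytab path are placeholders for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Tell the Hadoop client to use Kerberos instead of simple auth.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Principal and keytab path are placeholders for illustration.
        UserGroupInformation.loginUserFromKeytab(
                "etl-service@EXAMPLE.COM",
                "/etc/security/keytabs/etl-service.keytab");
        System.out.println("Logged in as: " + UserGroupInformation.getCurrentUser());
    }
}
```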
2. Static data protection
Protecting static data essentially means encryption. HDFS supports native at-rest encryption to prevent data from being read directly off disk. But sensitive data lives not only in the HDFS layer; it also ends up in logs, swap files, message queues, metadata databases, and other places.
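A minimal sketch of turning on at-rest encryption for one directory via an HDFS encryption zone follows. The NameNode URI, path, and key name are placeholders; the key must already exist in the KMS (e.g. created with `hadoop key create pii-key`); and the two-argument createEncryptionZone shown here is deprecated in newer Hadoop releases in favor of an overload that takes flags.

```java
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

public class CreateEncryptionZone {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        HdfsAdmin admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), conf);
        // Everything written under /data/pii is then encrypted transparently,
        // using the "pii-key" managed by the KMS.
        admin.createEncryptionZone(new Path("/data/pii"), "pii-key");
    }
}
```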
3. Multi-tenant
Hadoop usually serves multiple tenants. At my company, for example, tenants include different business groups, acquired and invested companies, and external partners. Data should be isolated or encrypted between tenants; some companies control this with ACLs, while on the cloud it is mostly handled with per-region or per-tenant keys.
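For the ACL approach, here is a hedged sketch of granting one tenant's group access to its own directory through the HDFS ACL API. It assumes dfs.namenode.acls.enabled=true on the NameNode; the group name and path are placeholders.

```java
import java.io.IOException;
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

public class TenantAcl {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // Grant the "tenant-a" group read/execute on its own directory only.
        AclEntry entry = new AclEntry.Builder()
                .setScope(AclEntryScope.ACCESS)
                .setType(AclEntryType.GROUP)
                .setName("tenant-a")
                .setPermission(FsAction.READ_EXECUTE)
                .build();
        fs.modifyAclEntries(new Path("/data/tenant-a"),
                Collections.singletonList(entry));
    }
}
```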
4. Node communication
Hadoop and MongoDB, for example, communicate insecurely by default, using unencrypted RPC over TCP/IP. TLS and SSL are available, but they are rarely enabled between nodes.
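These protections are normally switched on in core-site.xml and hdfs-site.xml; the sketch below sets the same properties programmatically just to illustrate the property names.

```java
import org.apache.hadoop.conf.Configuration;

public class WireEncryptionConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "privacy" authenticates, integrity-checks, AND encrypts RPC traffic;
        // the default ("authentication") leaves payloads in cleartext.
        conf.set("hadoop.rpc.protection", "privacy");
        // Encrypt the DataNode block-transfer channel as well.
        conf.setBoolean("dfs.encrypt.data.transfer", true);
        System.out.println("rpc protection: " + conf.get("hadoop.rpc.protection"));
    }
}
```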
5. Client interaction
The client interacts with the resource manager and the nodes, as shown in figure 1, so a client can mount attacks such as maliciously hogging resources or exploiting vulnerabilities. And because Hadoop is a distributed architecture, traditional controls such as firewalls are a poor fit.
6. Distributed nodes
There is a classic saying that moving computation is cheaper than moving data: computation happens wherever the data and resources are, which enables massive parallel processing. But distribution makes the environment more complex and creates a larger attack surface; patching, configuration management, authentication, data-at-rest protection, and consistency all become problems.
III. Operational security
Besides system security, there is operations-and-maintenance security. Operators generally want capabilities such as configuration management, patch updates, and policy maintenance. Hadoop did not have these at first, and even now it lacks mature operational tooling. The problems in this area are:
1. Authentication and authorization
Identity and authentication are at the core of security, and Hadoop has done a lot of integration work here: from providing no authentication at first, to integrating LDAP, Active Directory, Kerberos, and X.509; on top of these, authorization can be mapped to roles, extended to finer-grained schemes (such as Apache Sentry), and further customized.
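As one example of such integration, the sketch below shows the configuration properties (normally set in core-site.xml) that make Hadoop resolve a user's groups from LDAP instead of the local OS; the hostnames and distinguished names are placeholders.

```java
import org.apache.hadoop.conf.Configuration;

public class LdapGroupMappingConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Resolve user groups from LDAP/Active Directory rather than /etc/group.
        conf.set("hadoop.security.group.mapping",
                 "org.apache.hadoop.security.LdapGroupsMapping");
        conf.set("hadoop.security.group.mapping.ldap.url",
                 "ldaps://ldap.example.com:636");
        conf.set("hadoop.security.group.mapping.ldap.bind.user",
                 "cn=hadoop,ou=svc,dc=example,dc=com");
        conf.set("hadoop.security.group.mapping.ldap.base",
                 "dc=example,dc=com");
        // The bind password is normally kept in a credential store, not plaintext.
    }
}
```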
2. Privileged access
In a typical company, one administrator runs the operating system and another runs Hadoop, and both can access files in the cluster, so administrative roles need to be separated to minimize unnecessary access. Direct access to data can be managed through a combination of role-based authorization and access control lists, and administrative duties can be split along separation-of-powers lines. Stronger still is encryption with proper key management, such as HDFS encryption.
3. Configuration and patch management
A cluster may have hundreds of nodes, so configuring and patching them uniformly is a hard problem; for example, new nodes may end up configured differently from the original ones. Existing configuration-management tools target the underlying platform, with nothing equivalent for NoSQL systems, and there is no scanner on the market that performs Hadoop-specific checks.
4. Software dependence
Hadoop has many different components, each with its own configuration, patches, and validation methods. Fortunately, container technology such as Docker can alleviate this problem to a great extent.
5. Authentication of applications and nodes
If attackers can add a node to the cluster, they can infiltrate the data layer, bypassing authentication entirely; and if they then obtain a Kerberos keytab file from the node, they can forge identities. Certificates are worth considering here: deploying them adds complexity but improves security.
6. Log audit
If there is a data leak, can it be traced through the logs? There are open-source tools for the big-data environment, such as Facebook's Scribe and Logstash. Logs can be stored in the cluster itself, but then they risk tampering, so many companies upload them to a dedicated platform such as Splunk. Hadoop has many components with different log formats, so log aggregation is also needed. And user and IP alone are not enough; you also need to know what the query statement was.
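As a toy illustration of pulling structure out of such logs, here is a sketch that extracts the key fields from an HDFS-audit-style log line; the exact format varies across Hadoop versions, so the regex and sample line are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuditLineParser {
    // Matches the key=value pairs emitted by the HDFS audit logger;
    // multi-token values such as "ugi=alice (auth:KERBEROS)" are cut at
    // the first space, which is fine for a sketch.
    private static final Pattern FIELD =
            Pattern.compile("(allowed|ugi|ip|cmd|src|dst)=(\\S+)");

    public static void main(String[] args) {
        String line = "allowed=true ugi=alice (auth:KERBEROS) ip=/10.0.0.1 "
                    + "cmd=open src=/data/pii/cards.csv dst=null";
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            System.out.println(m.group(1) + " -> " + m.group(2));
        }
    }
}
```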
7. Monitoring and blocking
The most common scenario is a compute platform starved of resources because someone is running junk jobs on it and hogging capacity. And if you spot a malicious query, you need the means to stop it. Monitoring tools are typically embedded in services such as Hive or Spark to filter malicious queries.
IV. Architectural security
For security, authentication, authorization, encryption, key management, and audit logging are the cornerstones of the architecture; but how these technologies are combined, and how and where they are deployed, matters greatly. The figure below shows the system-level risks and the measures that can be considered.
For example, there are two ways to protect the transport layer. One is SSL, but it requires you to manage certificates. The other is network separation to keep attackers out; it lacks internal protection but is easier to implement.
In terms of authentication and authorization, you can consider Apache Ranger or Sentry.
The next figure shows the operational risks and measures, each with its own benefits and costs. Take privilege management, where several encryption approaches are mentioned: under ideal circumstances encryption is the best choice, but in reality many big-data platforms have already been running for some time when encryption is attempted, and retrofitting it disrupts upstream and downstream production; one careless step means an incident. In that case encryption is not the best choice; consider dynamic masking, field tagging, or tokenization instead, which make the project cheaper and easier to move forward.
At the architectural level, the security team has more to consider: which methods are effective, cost-effective to operate, and supportable by the business units. In large Internet companies, plans put forward by the security department get challenged by the business. For example, if the security department proposes field-level encryption across Hadoop with a fresh key from KMS for every operation, the business will push back hard. Security must always be grounded in reality: don't rigidly insist on whatever is "most secure"; assurance can come from a mix of methods.
Some realistic choices: authenticate to Hadoop through an application gateway, keep IAM transparent to users, and defer encryption. If the cluster is heavily multi-tenant, TLS must be considered, and fine-grained dynamic controls (masking, tokenization, and the like) become urgent.
The most common approach is the one shown below, which resembles a moat: the whole cluster sits inside, access is strictly controlled, and the security of the entire infrastructure rests on perimeter measures such as firewalls and authentication. The advantage is simplicity: the business won't challenge it, and it's cheap and easy to implement. The disadvantage is that once an attacker gets past authentication and authorization, it's smooth sailing all the way in.
As for security inside the cluster: unlike a relational database, Hadoop's internal operations, such as inter-node communication and cluster replication, are transparent to users. Protecting them requires integrating many native and third-party security tools, and security should be part of the cluster's overall architecture. The tools may include SSL/TLS for secure communication, Kerberos for inter-node authentication, at-rest data protection, and identity and authorization. The placement of the various measures is shown in the picture below.
In addition, large Internet companies generally have countless data sources, and it is hard for the security team to know where data is active and what protections it has. But the data all lives on the data center's infrastructure, so you can choose some baseline protections: tokenization, masking, and encryption. These ensure the data is protected to some degree no matter where it is used. Tokenization replaces data with a token, a bit like an arcade game token: it isn't cash, but it can be used to play the claw machine. In data protection, tokens stand in for sensitive values such as bank card numbers; the token itself is meaningless and merely maps to the real value. Masking partially hides or replaces data, for example substituting random digits for an ID number, so the original value never appears in query results even though the real data remains in the table.
All this is done because neither internal nor external users can be fully trusted; you never know when data will be shared with partners or anyone else. Dynamic masking can additionally be conditioned on IP, device type, and time, while static masking permanently conceals the sensitive data while retaining its analytic value. You need to choose according to the situation; a toy sketch of tokenization follows.
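To make tokenization concrete, here is a toy, in-memory sketch; a real deployment would use a hardened token vault or format-preserving encryption rather than a HashMap, and the card number is made up.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

public class TokenVault {
    private final Map<String, String> tokenToValue = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    // Replace a sensitive value with a random token; the token carries
    // no information about the original value.
    public String tokenize(String cardNumber) {
        StringBuilder token = new StringBuilder("tok_");
        for (int i = 0; i < 16; i++) {
            token.append(random.nextInt(10));
        }
        tokenToValue.put(token.toString(), cardNumber);
        return token.toString();
    }

    // Only the vault can map a token back to the real value.
    public String detokenize(String token) {
        return tokenToValue.get(token);
    }

    public static void main(String[] args) {
        TokenVault vault = new TokenVault();
        String token = vault.tokenize("6222020200112233445");
        System.out.println("stored/shared downstream: " + token);
        System.out.println("real value (vault only): " + vault.detokenize(token));
    }
}
```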
The number of security solutions compatible with, or designed for, Hadoop has grown over the years, with the biggest contributions coming from the open-source community; some vendors have also contributed directly to solve users' pain points. A few tools worth introducing:
1. Apache Ranger
Ranger is a centralized security management solution for Hadoop clusters, covering auditing, key management, and fine-grained data access control. It integrates with the security mechanisms of modules such as HDFS, Hive, YARN, Solr, and Kafka. Crucially, Ranger is one of the few tools that provides a central management view: you can set file and directory permissions in HDFS and SQL policies in Hive, forming one coherent security control.
2. HDFS encryption
HDFS provides "transparent" encryption embedded in Hadoop files, and data is transparently encrypted when stored in the file system, which is less expensive for application clusters. Support for encryption of regions, files, and directories, each with a different key, so that it can support multi-tenancy and integrate with KMS.
3. Apache Knox
Think of Knox as a firewall for Hadoop; more precisely, it is an API gateway. Knox handles HTTP and RESTful requests, performing authentication and policy enforcement. Combined with network partitioning and domain separation, it further reduces the attack surface.
4. Apache Atlas
An open-source governance framework that manages core metadata capabilities such as data lineage, data dictionaries, and data classification; in short, it enables data discovery and access control. But Atlas is still young, and maturity can be a problem.
5. Apache Ambari
A tool for managing Hadoop clusters that can push configuration out to the entire cluster; it seems to be rarely used in China. You can of course write your own scripts for custom functionality, but for small and medium-sized companies Ambari is a quick way to get consistent cluster management running.
6. Monitoring
Monitoring has two parts: real-time analysis and interception. Modules such as Hive, Pig, Impala, and Spark expose SQL or pseudo-SQL syntax, which can be protected with the tokenization and masking described earlier, or with finer-grained authorization that rewrites or filters query results. Monitoring from the data's perspective yields more information than monitoring from the application's perspective.
V. Suggestions
When considering a security solution for a Hadoop cluster, start from a few principles:
1. Do not break the cluster's functionality
2. Be architecturally consistent; do not conflict with the Hadoop architecture
3. Actually address the security threats
Why say this? Because the market is full of companies claiming to offer big-data security solutions while still building their products on MySQL, and because the whole Hadoop ecosystem is still young; not all tools are mature.
Technically, I recommend considering the following parts; think of them as the security baseline for a Hadoop cluster:
1. Kerberos for node authentication
More and more companies are doing this, and integrating Kerberos is much easier now than it used to be. Kerberos authentication is one of the most effective node-level security controls, and it is built into the Hadoop infrastructure. Recommended.
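For reference, these are the kinds of properties (normally set in core-site.xml and hdfs-site.xml) that switch a cluster over to Kerberos; the realm and keytab paths are placeholders, shown programmatically only to name the knobs.

```java
import org.apache.hadoop.conf.Configuration;

public class KerberosClusterConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("hadoop.security.authorization", "true");
        // Each service authenticates with its own principal and keytab;
        // _HOST is expanded to the node's hostname at runtime.
        conf.set("dfs.namenode.kerberos.principal", "nn/_HOST@EXAMPLE.COM");
        conf.set("dfs.namenode.keytab.file",
                 "/etc/security/keytabs/nn.service.keytab");
        conf.set("dfs.datanode.kerberos.principal", "dn/_HOST@EXAMPLE.COM");
        conf.set("dfs.datanode.keytab.file",
                 "/etc/security/keytabs/dn.service.keytab");
    }
}
```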
2. File layer encryption
Simply put, this protects data at rest, preventing unauthorized access by administrators, malicious users, and other tenants, and it keeps data safe if a disk is stolen. Moreover, many jurisdictions now require that data be stored encrypted, so compliance must be considered. File-layer encryption is consistent across operating systems, platforms, and storage media, scales seamlessly as the cluster grows, and is transparent to applications and platforms. So encryption is not merely a recommendation; it should be mandatory.
3. KMS
If attackers can get the keys, your encryption is meaningless. Many administrators and developers store keys on their hard drives; I know because I have actually scanned developers' and DBAs' disks. So an independent key-management platform is needed to distribute keys and certificates. In large enterprises especially, managing many keys across many tenants is the real problem.
4. Apache Ranger
It was introduced earlier.
5. Automated deployment
Some companies use scripts and source control, some use traditional patch-management systems, and some use methods that are a nightmare. Consider configuration-automation tools such as Chef and Puppet: install from trusted images, and use them for upgrades, key distribution, and so on.
6. Logging and monitoring
You can use Hadoop's built-in capabilities to create logs and the cluster itself to store events, and bring in tools such as Logstash, Log4J, and Kafka for data-flow management and search.
7. Communication encryption
Use SSL/TLS for secure communication between nodes and between nodes and applications; encrypting everything is recommended. There is a slight performance impact, but the overhead is spread across all nodes, so the pressure is not great.
Encryption, authentication, and platform-management tools greatly improve the security of Hadoop clusters, and integrated authentication plus fine-grained access control make the security work easier. The Hadoop community has moved fast; honestly, faster than I expected. But frankly, at many companies I know of, Hadoop security is still at the level of network isolation, hoping the attacker can't get in: hard on the outside, soft on the inside. Under compliance requirements, and in an open data ecosystem, network isolation is far from enough.