What is the cause of the problem of Puppet monitoring and its solution? 04/19 Update SLTechnology News&Howtos

2025-04-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Many people who are new to Puppet do not know how to quickly track down the cause of a problem that monitoring reveals, or how to fix it. This article summarizes the causes of typical problems and their solutions.

Puppet is a centralized configuration management system based on a C/S (client/server) architecture. Using its own declarative language, it can manage configuration files, users, scheduled tasks, software packages and system services, and it keeps the basic configuration of large-scale clusters consistent.

We use Puppet to manage thousands of servers. After several rounds of optimization, monitoring and automated grayscale release now keep the basic configuration of all clusters consistent. This article discusses how to monitor a Puppet system and shares the typical problems we encountered and their solutions.

Monitoring tool selection

Foreman provides a comprehensive set of interfaces, including a Web front end, a CLI and a RESTful API. On top of these we can build a monitoring and management system and implement features such as alerting.

Core business process

You can simply abstract the workflow of Puppet into four parts:

Request phase: the Agent sends its own information to the Server over SSL.

Response phase: the Server resolves the matching configuration based on the client information and sends the compiled catalog back to the Agent.

Execution phase: the Agent receives the catalog and executes commands or updates files.

Reporting phase: the Agent reports the results to the Server.

Figure 1 Puppet workflow

Monitoring Overview

The core monitoring of Puppet mainly covers the following points:

Whether communication between Agent and Master is normal

Whether policy execution on the Agent takes effect

When and where the policies issued by Puppet take effect

The running status of the Master and the clusters it manages

Black box monitoring

When Puppet's black-box monitoring indicators do not meet expectations, the cluster is not working properly. Black-box monitoring indicators include: whether all policies are effective, whether the effective scope of each policy is in line with expectations, and whether the effective result of each policy is in line with expectations.

Are all policies effective?

Note: add a batch of test nodes to the production Puppet cluster and run check scripts on them regularly to verify that all policies are effective.

Effective scope of the policy

Note: after the policy is launched, you need to confirm whether its effective scope is in line with expectations, that is, whether the policy only takes effect on the specified node.

Implementation: run a periodic check task through Puppet's MCollective that compares the list of machines where the policy is in effect with the list of machines in the service tree. As shown in the figure below, 98% of the machines in the cluster hn-xdata meet expectations and 2% do not.

Figure 2 Puppet policy effective scope monitoring
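
The comparison behind this check can be thought of as a set difference between the hosts the service tree expects and the hosts where the policy is actually in effect. Below is a minimal Python sketch of that idea. The function name, the host names and the way both lists are obtained are assumptions made for illustration; this is not the authors' actual MCollective task.

```python
def scope_report(expected_hosts, effective_hosts):
    """Compare the hosts a policy should cover with the hosts where it is in effect."""
    expected = set(expected_hosts)
    effective = set(effective_hosts)
    missing = sorted(expected - effective)      # expected, but policy not in effect
    unexpected = sorted(effective - expected)   # policy in effect outside its scope
    matched = len(expected & effective)
    # An empty expectation list is treated as fully matched.
    match_pct = 100.0 * matched / len(expected) if expected else 100.0
    return {"match_pct": match_pct, "missing": missing, "unexpected": unexpected}


if __name__ == "__main__":
    # Hypothetical host lists: one from the service tree, one from the check task.
    service_tree = ["web01", "web02", "web03", "web04"]
    in_effect = ["web01", "web02", "web03", "db01"]
    print(scope_report(service_tree, in_effect))
```

The "unexpected" list is the one that matters most for scope monitoring, since it names machines outside the specified nodes where the policy took effect anyway.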

Is the effective result of the policy in line with expectations?

Note: after the policy is online, you need to make sure that all policies take effect on all machines.

Implementation: run a periodic check task through Puppet's MCollective (checking whether the list of machines in effect is consistent with the list of machines in the service tree). As shown in the figure below, each policy has its own pie chart.

Figure 3 Puppet policy result monitoring

White box monitoring

White-box monitoring supplements black-box monitoring and is used for fault location. It covers four aspects: cluster capacity, traffic, delay and errors.

Data collection method:

Through Foreman API

Master log analysis

Table 1 Overview of white box indicators acquired through Foreman API

Metric: Description

No reports: Hosts that have not submitted a report

Error: The Agent connected, but an error occurred while executing the policy

Out of sync: Policy execution timed out, the hostname is duplicated, or the host cannot be connected

Active: The Agent pulls policies normally

Pending: Capacity indicator; requests that the Master has not been able to handle

No changes: The Agent pulled the policy normally, but nothing changed

puppet_report_time_total: Total time for the Agent to execute policies

Visits per minute

Capacity

CPU usage, number of network connections and NIC metrics of the instance where the Master resides

Traffic

Agent PV (page views, that is, the number of Agent requests), calculated from the Puppet Master access log puppetserver-access.log

Figure 4 Agent PV traffic diagram
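
Counting Agent PV from the access log comes down to grouping log lines by minute. The sketch below shows one way to do that in Python. It assumes that each line carries a common-log-format timestamp such as [19/Apr/2025:10:00:01 +0800]; the real layout depends on the access-log pattern configured for the Master, and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Captures the minute part of a common-log-format timestamp,
# e.g. [19/Apr/2025:10:00:01 +0800] -> "19/Apr/2025:10:00".
# The timestamp layout is an assumption about the configured log pattern.
_MINUTE_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}):\d{2} [+-]\d{4}\]")


def requests_per_minute(lines):
    """Count access-log lines per minute; lines without a timestamp are skipped."""
    counts = Counter()
    for line in lines:
        match = _MINUTE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample entries, not taken from a real log.
    sample = [
        '10.0.0.1 - - [19/Apr/2025:10:00:01 +0800] "GET /puppet/v3/node/web01 HTTP/1.1" 200 512',
        '10.0.0.2 - - [19/Apr/2025:10:00:30 +0800] "GET /puppet/v3/node/web02 HTTP/1.1" 200 512',
        '10.0.0.1 - - [19/Apr/2025:10:01:05 +0800] "GET /puppet/v3/node/web01 HTTP/1.1" 200 512',
    ]
    for minute, count in sorted(requests_per_minute(sample).items()):
        print(minute, count)
```

For a real log, the same function can be fed an open file object for puppetserver-access.log, since it only iterates over lines.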

Delay

Time required for a single Agent to update its policies: puppet_report_time_total

Description: puppet_report_time_total is the total time from the Agent connecting to the Master until it sends its report to the Master. 50% of reports complete within 3s, 90% within 11s and 99% within 15s.

Figure 5 Agent delay
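
Figures such as "90% within 11s" are percentiles of the puppet_report_time_total samples. As an illustration, the sketch below computes the 50th, 90th and 99th percentiles with the nearest-rank method. The function names are hypothetical, and nearest-rank is only one of several percentile definitions; the source does not say which one its charts use.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(durations):
    """Summarize report durations (in seconds) as p50, p90 and p99."""
    return {f"p{p}": percentile(durations, p) for p in (50, 90, 99)}


if __name__ == "__main__":
    # Hypothetical durations in seconds, one per Agent report.
    print(latency_summary([1, 2, 2, 3, 3, 4, 6, 9, 11, 15]))
```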

Error

No reports: number of unreported instances

Error agent: the number of instances with errors in policy enforcement

Out of sync: the number of instances where policy execution timed out, the hostname is duplicated, or the host is not connected to the Master.

Figure 6 Foreman error monitoring metrics
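
These three error metrics lend themselves to simple threshold alerting. The sketch below assumes that a host summary has already been fetched from the Foreman API as a dictionary. The key names (reports_missing, bad_hosts, out_of_sync_hosts) are an assumption based on Foreman's dashboard summary and should be verified against the Foreman version in use; the thresholds are hypothetical.

```python
# Assumed mapping from the metrics named in this article to keys of the
# Foreman host summary; verify the key names against your Foreman version.
METRIC_KEYS = {
    "no_reports": "reports_missing",
    "error": "bad_hosts",
    "out_of_sync": "out_of_sync_hosts",
}


def error_alerts(summary, thresholds):
    """Return one alert message per error metric whose value exceeds its threshold."""
    alerts = []
    for metric, key in METRIC_KEYS.items():
        value = int(summary.get(key, 0))  # missing keys count as zero hosts
        limit = thresholds.get(metric)
        if limit is not None and value > limit:
            alerts.append(f"{metric}: {value} hosts (threshold {limit})")
    return alerts


if __name__ == "__main__":
    # Hypothetical summary and thresholds.
    summary = {"reports_missing": 3, "bad_hosts": 0, "out_of_sync_hosts": 12}
    thresholds = {"no_reports": 0, "error": 0, "out_of_sync": 10}
    for message in error_alerts(summary, thresholds):
        print(message)
```

Keeping the key mapping in one table makes it easy to adjust if the API returns different field names.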

Problems found by Puppet monitoring

Agent coverage of all machines

Problem: there is no guarantee that the Agent is running properly on every machine.

Solution: using the service tree or a CMDB-style system as the source of the machine list, add every machine to Agent process monitoring.

Agent policy execution timed out

Problem: a timeout alarm occurs when large files are downloaded concurrently.

Troubleshooting method: run the command "puppet agent -t --debug" on the Agent; the output shows that the timeout occurs while pulling the file. Because the file is large and many Agents pull it from the Master at the same time, the pulls time out.

Solution: store large files on cloud storage to improve download speed.

Grouping is not limited to existing Facter attributes

Problem: the existing Facter attributes do not meet the needs of policy grouping and grayscale release grouping.

Reason: as more and more services are onboarded, the number of service groups keeps growing.

Solution: define custom facts.
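
One way to add a grouping attribute is an external fact: an executable script that Facter runs from its facts.d directory and that prints key=value pairs. Custom facts can also be written in Ruby; the sketch below uses the external-fact mechanism only because it allows a Python example. The fact names service_group and release_stage and the hostname convention <service>-<stage>-<index> are assumptions made for illustration and do not come from the original setup.

```python
#!/usr/bin/env python3
import socket


def facts_from_hostname(hostname):
    """Derive grouping facts from a hostname shaped like <service>-<stage>-<index>."""
    short = hostname.split(".")[0]          # drop the domain part, if any
    parts = short.split("-")
    if len(parts) < 3:
        # Hosts that do not follow the convention fall into a default group.
        return {"service_group": "default", "release_stage": "default"}
    return {"service_group": parts[0], "release_stage": parts[1]}


def render(facts):
    """Format facts as key=value lines, the plain-text form external facts use."""
    return "\n".join(f"{key}={value}" for key, value in sorted(facts.items()))


if __name__ == "__main__":
    print(render(facts_from_hostname(socket.gethostname())))
```

Once such a script is installed as an executable external fact, the resulting values can be used in manifests like any built-in fact to select the policy group or the grayscale stage.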

Agent out of sync (Out of Sync)

Problem: Agent report is out of sync.

Reasons and solutions:

Table 2

Reason: Solution

Duplicate hostname: Change the Agent's hostname, then re-authenticate it.

Host renamed after authentication: Delete the machine registered under the original name directly in the Foreman console.

Agent service exception: Restart the Puppet service on the Agent.

Agent disk is full: Clean up the disk; the Agent then starts and recovers by itself.

Agent-side certificate error: Delete the /etc/puppetlabs/puppet/ssl folder on the Agent, then run puppet agent -t to request a new certificate.

Agent-side puppet.conf file is empty: Write the corresponding [agent] configuration back to puppet.conf to restore it.

Master-side puppet.conf file is empty: Write the corresponding [master] configuration back to puppet.conf to restore it.

Foreman service is down: Run service httpd restart and service foreman restart on the Foreman machine.

Could not request certificate: 1) the Agent and Master clocks are out of sync (synchronize the time with ntpdate master-IP); 2) the Agent has no network connectivity to the Master; 3) port 8140 on the Master is not available.
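
For the "Could not request certificate" case, a quick first step is to check from the Agent whether the Master's port 8140 accepts TCP connections at all. The sketch below does this with Python's standard socket module. The function name and the host name in the usage example are hypothetical, and a successful connection only rules out causes 2) and 3); it says nothing about clock skew.

```python
import socket


def can_connect(host, port=8140, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # "puppet-master.example.com" is a placeholder for the real Master address.
    print(can_connect("puppet-master.example.com"))
```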

Policy released to an unexpected cluster

Problem: the policy takes effect in the wrong scope.

Reason: the Puppet Master uses a single entry file, site.pp. Because there are many policy groups, the grayscale release stage produces many corresponding branches, so operations engineers are prone to mistakes.

Solution: manage site.pp as a policy module that contains the default groupings and the groups to be released in grayscale. The site.pp under the manifests folder then only needs to include this module.

Figure 7 Default grouping strategy after site.pp optimization

Figure 8 Grouping of grayscale stages of policy release

Function monitoring found unexpected synchronized files

Problem: the Masters are deployed as a cluster, and during a policy change the data on the Masters may be temporarily out of sync. In this case, the files pulled by a single Agent may be inconsistent.

Reason: there are multiple Masters behind a load balancer (LB) that forwards requests round-robin, and one of the Masters has not yet updated the file. An Agent's initial request may reach Master A while its file pull reaches Master B, and the data on the two Masters differs.

Solution: change the LB policy to source IP hash.
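
The difference between the two LB policies can be illustrated with a small Python sketch. It is a simplified model and not the configuration of any particular load balancer; the backend names and the hash function are assumptions chosen for illustration.

```python
import hashlib
import itertools


def pick_by_source_ip(client_ip, backends):
    """Map a client IP to one backend deterministically (source IP hash)."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


def round_robin(backends):
    """Yield backends in rotation, regardless of which client is asking."""
    return itertools.cycle(backends)


if __name__ == "__main__":
    masters = ["master-a", "master-b"]  # hypothetical backend names
    rr = round_robin(masters)
    # Round robin: two consecutive requests from one Agent reach different Masters.
    print(next(rr), next(rr))
    # Source IP hash: every request from the same Agent IP reaches the same Master.
    print(pick_by_source_ip("10.0.0.7", masters), pick_by_source_ip("10.0.0.7", masters))
```

With source IP hash, all requests in one Agent run reach the same Master, so the catalog and the files it references come from the same copy of the data. The Masters themselves still need to be synchronized for all Agents to converge on the new policy.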
