How to Use Cloudera Data Engineering to Analyze Paycheck Protection Program Data

2025-01-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article shows how to use Cloudera Data Engineering to analyze Paycheck Protection Program data.

The Paycheck Protection Program (PPP), implemented by the US federal government, gives businesses a direct incentive to keep their workers on the payroll, particularly during the Covid-19 pandemic. PPP helps qualified businesses retain their workforce and pay related business expenses. Data on the US Treasury's website show which companies received PPP loans and how many jobs were retained through the program. The US Treasury approved about 1 million PPP loans across the United States.

Analyzing this data presents three challenges. First, the data is large, so extracting, collating, transforming, retrieving, and reporting on it takes time. Second, the dataset may evolve, which consumes additional development time and resources. Finally, in such a multi-stage process, things can break. The ability to quickly identify errors or bottlenecks helps to meet SLAs consistently. What follows shows how to use Apache Spark in Cloudera Data Engineering (CDE) to build reports from PPP data while addressing all of these challenges.

Objective

A mock scenario for the Texas Legislative Budget Board (LBB) is set up below to help a data engineer manage and analyze PPP data. The data engineer's main goal is to provide two final reports to the LBB:

Report 1: Breakdown of all cities in Texas that retained jobs

Report 2: Breakdown of the types of companies that retained jobs
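As a rough illustration of what the two reports compute, the sketch below totals jobs retained per city and per company type and keeps the top entries. It is plain Python rather than Spark so that it runs without a cluster. The field names (`city`, `business_type`, `jobs_retained`), the helper `top_by_jobs`, and the sample rows are all hypothetical and are not taken from the Treasury data.

```python
from collections import Counter


def top_by_jobs(rows, key, n):
    """Sum jobs_retained per distinct value of `key` and return the top n."""
    totals = Counter()
    for row in rows:
        totals[row[key]] += row["jobs_retained"]
    # most_common sorts by total, largest first
    return totals.most_common(n)


# Made-up rows for illustration only, not real PPP records.
rows = [
    {"city": "Houston", "business_type": "Corporation", "jobs_retained": 120},
    {"city": "Austin", "business_type": "Limited Liability Company", "jobs_retained": 45},
    {"city": "Houston", "business_type": "Limited Liability Company", "jobs_retained": 30},
    {"city": "Dallas", "business_type": "Corporation", "jobs_retained": 80},
]

report_1 = top_by_jobs(rows, "city", 10)          # jobs retained by city
report_2 = top_by_jobs(rows, "business_type", 5)  # jobs retained by company type

print(report_1)
print(report_2)
```

In a Spark job the same idea would typically be expressed as a group-by on the city or company-type column followed by a sum and a descending sort.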

Cloudera Data Engineering (CDE)

This is where Cloudera Data Engineering (CDE) running Apache Spark can help. CDE is a service in Cloudera Data Platform (CDP) that allows data engineers to create, manage, and schedule Apache Spark jobs, while providing useful tools to monitor job performance, access log files, and orchestrate workflows through Apache Airflow. Apache Spark is a data processing framework that can run large-scale data processing quickly.

The U.S. Treasury provides two different datasets, one for approved loans greater than $150,000 and the other for approved loans under $150,000. To generate the two final reports for the LBB, the following steps are performed (see figure 1).

First, the two separate datasets are loaded into an S3 bucket.

A Spark job is created for each dataset to extract and filter the data from the S3 bucket.

These two Spark jobs transform and clean the data, then load it into a Hive data warehouse for retrieval.

A third Spark job is created to process the data from the Hive data warehouse and produce the two reports.
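The extract, filter, and clean steps above can be sketched as follows. This is a plain-Python stand-in so that it runs without a cluster; in CDE the same logic would live in a Spark job reading from S3. The column names (`State`, `City`, `JobsRetained`), the function `extract_texas_rows`, and the sample rows are assumptions for illustration and may not match the actual Treasury files.

```python
import csv
import io

# Hypothetical column names; the real Treasury files may label these differently.
STATE_COL, CITY_COL, JOBS_COL = "State", "City", "JobsRetained"


def extract_texas_rows(csv_text, state="TX"):
    """Yield cleaned rows for one state from raw CSV text."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Filter: keep only rows for the requested state.
        if (row.get(STATE_COL) or "").strip().upper() != state:
            continue
        # Clean: drop rows whose job count is missing or not a whole number.
        jobs_raw = (row.get(JOBS_COL) or "").strip()
        if not jobs_raw.isdigit():
            continue
        yield {
            # Transform: normalize whitespace and capitalization of city names.
            "city": (row.get(CITY_COL) or "").strip().title(),
            "jobs_retained": int(jobs_raw),
        }


# Made-up sample rows for illustration only, not real PPP records.
sample = """State,City,JobsRetained
TX,houston,120
tx, Austin ,45
CA,Los Angeles,300
TX,Dallas,
"""

cleaned = list(extract_texas_rows(sample))
print(cleaned)
```

Dropping rows with a missing job count is one possible cleaning policy; a real job might instead keep them and treat the count as unknown.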

After the jobs run, CDE provides a graphical representation of the various stages of each Spark job (see figure 2). This makes it easy for data engineers to see which parts of a job are likely to take the most time, so they can refine and improve the code to best meet the customer's SLA.

Figure 1: Data journey to generate the two final reports.

Figure 2: CDE graphical representation of the various Spark stages.

Conclusion

The main goal of generating two final reports from the records of one million approved applicants has been achieved. The graphical summary of the first report (see figure 3) shows a sample of the top 10 Texas cities by number of jobs retained, and the second report (see figure 4) shows a sample of the top 5 company types by number of jobs retained. With these reports, the LBB can infer, for example, that cities with the fewest jobs retained per capita may need resources to mitigate any economic impact.

Figure 3: Top 10 cities in Texas that retained the most jobs, 2020.

Figure 4: Top 5 company types that retained the most jobs in Texas, 2020.
