
How to use Amazon Athena and Amazon QuickSight for visual analysis of weather data

2025-04-06 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

Many newcomers are unclear about how to use Amazon Athena and Amazon QuickSight to visually analyze weather data. To help, this article walks through the process in detail. We hope you find it useful.

As enterprises undergo digital transformation, their business systems generate enormous amounts of data, and making good use of that data has become a major challenge for enterprise IT operations. Traditionally, enterprises had to build their own Hadoop clusters; because big data analysis places heavy demands on storage and compute, the hardware requirements, and therefore the overall project cost, were very high.

Moving to cloud servers solves the operations and maintenance problems of a local data center, and high-specification hardware and data centers improve the stability of the overall system. However, deploying on cloud servers alone only addresses server operations; it does not take full advantage of the cloud's managed big data analysis services.

Beyond EMR, its managed Hadoop service, AWS provides many big data analysis services, such as the Redshift data warehouse, Kinesis for real-time data processing, and the Glue ETL service.

With AWS-managed services, you pay only for the resources you use. This pay-as-you-go model avoids the heavy capital investment of building a local data center and the waste of cloud servers sitting idle.

Today we will demonstrate a rapid big data analysis scenario that requires no servers and uses SQL to query data in real time.

Background and challenges

Global climate analysis is essential for researchers to assess the impact of climate change on the earth's natural capital and ecosystem resources. This activity requires high-quality climate data sets, which can be challenging because of their scale and complexity.

To have confidence in their findings, researchers must trust the source of the climate data they study. For example, a researcher asking "will climate change in a particular food-producing region affect food security?" must be able to easily query authoritative, well-managed datasets.

The National Centers for Environmental Information (NCEI) maintains a climate dataset based on observations from weather stations around the world. The Global Historical Climatology Network Daily (GHCN-D) is its central repository of daily weather summaries from ground stations. It consists of millions of quality-assured observations that are updated every day.

NCEI provides the weather data in CSV format through an FTP server, organized by year, which means a complete copy spans more than 255 files. Traditionally, a researcher had to download the entire dataset locally to study it, and to make sure the latest data was used for analysis, had to re-download it every day.

Solution

Through AWS's big data collaboration with NOAA, daily snapshots of the GHCN-D dataset are now available on AWS, publicly accessible through an Amazon S3 bucket. Getting the data this way has several benefits:

The data is globally accessible through S3; users do not need to download it in full to use it, and everyone works from the same consistent copy.

Analysis time is reduced: with the Athena and QuickSight services demonstrated here, analysis can start immediately.

Research costs are reduced: researchers do not need to set up servers or Hadoop clusters, and resources can be shut down as soon as the analysis is complete.

This article shows a workflow using Amazon S3, Amazon Athena, AWS Glue, and Amazon QuickSight, demonstrating how quickly one can gain insight from this dataset.

The workflow follows these steps:

Extract the data file from the NOAA bucket and provide the data as a table

Use SQL to query data in a table

Demonstrate how to speed up analysis by creating tables from queries and storing the results in a private Amazon S3 bucket

Visualize the data for easy display

Workflow

Extract the annual .csv file and add it to the Athena table

Extract site data and add it to a separate table in Athena

Annual file extraction

The complete daily weather observations are organized by year, in CSV format, in a folder in an Amazon S3 bucket. The path to the data is s3://noaa-ghcn-pds/csv/.

Each file is named after its year with a .csv extension, beginning in 1763 and continuing through the current year.
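As a quick illustration, the per-year object keys can be generated programmatically. A minimal Python sketch, assuming the csv/<year>.csv layout quoted above (the function name and the end year are ours):

```python
BUCKET = "noaa-ghcn-pds"  # the public NOAA bucket quoted in the article

def annual_csv_keys(first_year=1763, last_year=2024):
    """Return the S3 object key for each year's daily-summary CSV file."""
    return [f"csv/{year}.csv" for year in range(first_year, last_year + 1)]

print(annual_csv_keys(1763, 1765))  # ['csv/1763.csv', 'csv/1764.csv', 'csv/1765.csv']
```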

Enter the Athena console, click AWS Glue Data Catalog to enter the Glue console, select the Tables menu, and choose to add a table manually.

Set a table name and add a database

Next, for the data location, select the option for a path in another account and enter NOAA's public bucket location s3://noaa-ghcn-pds/csv/.

Next, define Schema

Add the following columns, all with type string:

id

year_date

element

data_value

m_flag

q_flag

s_flag

obs_time

Click Finish, then return to the Athena console; the created table appears on the left. Some preparation is needed before running our first Athena query: set up an S3 bucket to hold the query results.

At the same time, we also create some folders for data storage.

[your_bucket_name]/stations_raw/

[your_bucket_name]/ghcnblog/

[your_bucket_name]/ghcnblog/stations/

[your_bucket_name]/ghcnblog/allyears/

[your_bucket_name]/ghcnblog/1836usa/
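The bucket and folder layout above can also be scripted. A hedged boto3 sketch, assuming AWS credentials are configured and the bucket is in us-east-1 (other regions need a CreateBucketConfiguration); the function name is ours:

```python
# The prefixes ("folders") listed above for the query-results bucket.
PREFIXES = [
    "stations_raw/",
    "ghcnblog/",
    "ghcnblog/stations/",
    "ghcnblog/allyears/",
    "ghcnblog/1836usa/",
]

def create_layout(bucket_name):
    """Create the results bucket and mark each prefix with an empty object."""
    import boto3  # imported here so PREFIXES is usable without boto3 installed
    s3 = boto3.client("s3")
    s3.create_bucket(Bucket=bucket_name)
    for prefix in PREFIXES:
        # S3 has no real folders; a zero-byte object whose key ends in "/"
        # makes the prefix show up as a folder in the console.
        s3.put_object(Bucket=bucket_name, Key=prefix)
```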

Once that is set up, we can use the table's Preview option to generate a query statement.

Run the query to see the data in our table

Use CTAS to speed up queries

The query above did not run very fast. We can use CREATE TABLE AS SELECT (CTAS) to create a new table and speed up queries.

The reason is that during this process we extract the data only once and store the extracted data in a private Amazon S3 bucket in a columnar format (Parquet).

To illustrate the increase in speed, here are two examples:

A query that counts the distinct ids (the unique weather stations) takes about 55 seconds and scans about 88 GB of data.

The same query on the converted data takes about 13 seconds and scans about 5 GB.
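The smaller scan also translates directly into cost, since Athena bills by data scanned. A rough Python sketch, assuming Athena's published on-demand rate of $5 per TB scanned (verify against current AWS pricing; the function name is ours):

```python
PRICE_PER_TB_USD = 5.00  # assumed Athena on-demand rate; check current pricing

def scan_cost_usd(gb_scanned):
    """Approximate Athena cost for a query that scans gb_scanned gigabytes."""
    return round(gb_scanned / 1024 * PRICE_PER_TB_USD, 4)

print(scan_cost_usd(88))  # raw CSV table scan from the example above
print(scan_cost_usd(5))   # the same query on the CTAS Parquet table
```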

Here are the steps:

Open the Athena console

Create a new query and replace the bucket name with your own

/* Convert the data to Parquet and store it in a private bucket */
CREATE TABLE ghcnblog.tblallyears_qa
WITH (format = 'PARQUET',
      external_location = 's3://[your-bucket-name]/ghcnblog/allyearsqa/') AS
SELECT *
FROM ghcnblog.tblallyears
WHERE q_flag = ''

Be sure to replace the bucket name with your own and use the table name you just created.

After running, you will see that a new table appears in the library on the left, and then we will continue to work on this new table.

Extract the station data and add it to a table in Athena

The stations text file contains information about each weather station, such as its location, country, and ID. This data is kept in a separate file from the annual observation data, and we need to import it in order to see the geographical distribution of weather observations. Although working with this file is slightly more involved, the steps for importing it into Athena are similar to what we have already done.

To import this data, we will take the following steps:

1. Download the ghcnd-stations text file.

2. Open the file with a spreadsheet editor, such as Excel.

3. Save it as a CSV file.

4. Upload the CSV file to the [your_bucket_name]/stations_raw/ folder created earlier.

5. Use Glue to add a table, as we did before.

In the add-column step, add the following columns:

id

latitude

longitude

elevation

state

name

gsn_flag

hcn_flag

wmo_id

After clicking Finish, preview the table to confirm that the data was imported successfully.

6. Then, as before, use CTAS to store the data in Parquet format.

/* Convert the data to Parquet and store it in a private bucket */
CREATE TABLE ghcnblog.tblghcnd_stations_qa
WITH (format = 'PARQUET',
      external_location = 's3://athena-cx-bucket/ghcnblog/stations/') AS
SELECT *
FROM ghcnblog.tblghcnd_stations

At this point, the data is prepared and imported into Athena.

Simple data analysis

Next, we will demonstrate several examples of data analysis.

1. Query the number of observations since 1763

SELECT count(*) AS Total_Number_of_Observations
FROM ghcnblog.tblallyears_qa

2. Query the number of ground stations

SELECT count(*) AS Total_Number_of_Stations
FROM ghcnblog.tblghcnd_stations_qa

3. Global average meteorological parameters

The following query calculates the global average maximum temperature (degrees Celsius), average minimum temperature (degrees Celsius), and average rainfall (millimeters) since 1763. In the query we must cast the data value from a string to a real, and divide by 10, because temperature and precipitation are recorded in tenths of their respective units.

For more about these details and the element codes (TMIN, TMAX, and PRCP), see the dataset's README file.

SELECT element,
       round(avg(CAST(data_value AS real) / 10), 2) AS value
FROM ghcnblog.tblallyears_qa
WHERE element IN ('TMIN', 'TMAX', 'PRCP')
GROUP BY element
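The cast-and-divide step mirrors how GHCN-D encodes values: TMAX/TMIN in tenths of a degree Celsius and PRCP in tenths of a millimeter. A small Python sketch of the same conversion (the function name is ours):

```python
def to_physical_units(data_value: str) -> float:
    """Convert a raw GHCN-D value (a string, in tenths of a unit) to its unit."""
    return round(float(data_value) / 10, 2)

print(to_physical_units("289"))  # a TMAX of 289 is 28.9 degrees Celsius
print(to_physical_units("-52"))  # a TMIN of -52 is -5.2 degrees Celsius
```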

It would be convenient to run a simple query like this and take the results at face value. But the query implicitly assumes that weather stations have been evenly distributed around the world since 1763; in fact, their number and distribution vary over time.

4. Visualize the increase in the number of weather stations on the earth

Next we introduce Amazon QuickSight for data visualization. Before the experiment you must configure QuickSight, including signing up and authorizing QuickSight to access Athena and S3.

Open the QuickSight console

Click on the top right corner and select manage QuickSight from the account menu.

Select Security & permissions and confirm that S3 and Athena are connected.

Then go back to the main menu, select New analysis, then New dataset; choose Athena as the data source, enter the name of the database we created in Athena (ghcnblog), and click Create data source.

Choose Use custom SQL and enter the following statement:

SELECT count(DISTINCT id) AS number_of_stations,
       substr(year_date, 1, 4) AS year
FROM ghcnblog.tblallyears_qa
GROUP BY substr(year_date, 1, 4)
ORDER BY substr(year_date, 1, 4)

Select Confirm query, choose to query the data directly, and click Visualize. Choose the line chart, put year on the X axis and number_of_stations as the value, and you get a curve showing the growth of the number of weather stations worldwide.

Analysis of the number of weather stations in the United States

The United States set up its first weather station in 1836, and data collection began that year. To look more closely at the growth of US observations, we extract a subset of US data from the main table (tblallyears_qa), sampling every 30 years from 1836 to 2016. This query produces a large result set, so to improve performance we use the procedure described earlier to save the query as a table stored in an Amazon S3 bucket. Execute the following statement in Athena.

CREATE TABLE ghcnblog.tbl1836every30thyear
WITH (format = 'PARQUET',
      external_location = 's3://[your-bucket-name]/ghcnblog/1836every30years/') AS
SELECT TA.id AS id,
       substr(TA.year_date, 1, 4) AS year,
       TS.state,
       CAST(TS.longitude AS real) AS longitude,
       CAST(TS.latitude AS real) AS latitude,
       element,
       CAST(data_value AS real) AS data_value
FROM "ghcnblog".tblallyears_qa AS TA,
     "ghcnblog".tblghcnd_stations_qa AS TS
WHERE substr(TA.year_date, 1, 4) IN ('1836', '1866', '1896', '1926', '1956', '1986', '2016')
  AND substr(TA.id, 1, 2) = 'US'
  AND state <> 'PI'
  AND TRIM(TA.id) = TRIM(TS.id)
GROUP BY TA.id, substr(TA.year_date, 1, 4), state, longitude, latitude, element, data_value
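The hard-coded year list is just a 30-year sampling from 1836 to 2016. A small Python sketch that generates it (the helper name is ours):

```python
def sample_years(start=1836, end=2016, step=30):
    """Return the sampled years as strings, e.g. for building a SQL IN (...) list."""
    return [str(y) for y in range(start, end + 1, step)]

print(sample_years())  # ['1836', '1866', '1896', '1926', '1956', '1986', '2016']
```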

Then create a new analysis in QuickSight, using the created new table as the data source

And use the following custom SQL

SELECT count(DISTINCT id) AS number_of_stations,
       year,
       state
FROM ghcnblog.tbl1836every30thyear
GROUP BY year, state
ORDER BY year

In the visualization interface, choose the points-on-map visual type, use state as the geospatial field, number_of_stations (aggregated as a total) for size, and year for color.

Metrics such as the average temperature of each US state can also be calculated with custom SQL.

Summary

This article demonstrates that by combining Athena, Glue, and QuickSight, we can quickly build a visual big data analysis platform without setting up servers or maintaining Hadoop clusters. For quickly validating analytical data and informing business decisions, it is very convenient.

These services also bill according to the amount of data scanned during analysis, which greatly reduces the cost of idle resources. After analysis, the data can be archived in lower-cost S3 Glacier to save even further.
