Amazon Kinesis processes large volumes of streaming data on the AWS data platform. Kinesis Streams is used to collect the data, and the Kinesis Client Library is used to build custom applications that process or analyze the streaming data. Kinesis can capture and store terabytes of data per hour from hundreds of thousands of sources, such as website clickstreams, financial transactions, media feeds, and IT logs. Use IAM to restrict the access of users and roles to Kinesis; using the temporary security credentials of roles improves security. Kinesis components can only be accessed over SSL-encrypted connections.

Kinesis Data Firehose loads large amounts of streaming data into AWS services. The data is stored in S3 by default, and from S3 it can be transferred onward to Redshift. The data can also be written to Elasticsearch, with a backup kept in S3 at the same time.

Kinesis Data Streams: build custom applications with the AWS SDK that analyze streaming data in real time. Data can be processed while it is still moving through the stream, which gives near-real-time results; to stay close to real time, the processing is usually kept lightweight. Producers continuously push data into the Data Stream. A Data Stream consists of a group of shards, each shard being an ordered sequence of data records, and the stream can be resharded as throughput changes. Consumers process the contents of the Data Stream in real time and push the results to other AWS services. Data in a stream is temporary: it is retained for 24 hours by default and for at most 7 days.
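As an illustration of the producer and consumer roles described above, here is a minimal Python sketch using boto3. It assumes valid AWS credentials, a chosen region, and a pre-created stream named clickstream-demo (all of these are placeholders, not values from this article); it is not a production consumer such as one built on the Kinesis Client Library.

import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is an assumption

STREAM = "clickstream-demo"  # hypothetical, pre-created stream

# Producer: push one clickstream event into the stream.
kinesis.put_record(
    StreamName=STREAM,
    Data=json.dumps({"user": "u-123", "page": "/home", "ts": int(time.time())}).encode(),
    PartitionKey="u-123",  # records with the same key land in the same shard
)

# Consumer: read from the first shard, starting at the oldest available record.
shard_id = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

resp = kinesis.get_records(ShardIterator=iterator, Limit=10)
for record in resp["Records"]:
    print(record["PartitionKey"], record["Data"])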
Kinesis Data Analytics analyzes streaming data in real time using standard SQL. Kinesis Video Streams captures, processes, and stores video streams for analytics and machine learning. Suitable scenarios: ingesting massive streaming data and processing it in real time.

Elastic MapReduce (EMR) provides a fully managed, on-demand Hadoop framework. When you start an EMR cluster you specify the instance type of the cluster nodes, the number of nodes in the cluster, and the Hadoop version to run. A very important factor is whether the cluster is persistent or transient: a cluster that needs to run continuously and analyze data is a persistent cluster, while a transient cluster is started on demand and stopped as soon as the work completes. By default the number of EMR clusters is not limited, but the total number of EMR nodes is limited to 20; you can request a limit increase. EMR can retrieve data from S3 and other locations. Hadoop log files are stored in S3 by default and are compressed. EMR supports Spot Instances. EMR is deployed within a single Availability Zone; cross-AZ deployment is not supported, and it is generally recommended to run the cluster in the region where the data is located. A cluster can be launched and begin processing data within about 15 minutes. EMR allows the use of magnetic, SSD, and Provisioned IOPS SSD EBS volumes. Suitable scenarios: log processing, clickstream analysis, genomics and life sciences.

File systems: HDFS is the standard Hadoop file system; all data is replicated across multiple instances to ensure durability. HDFS can use EBS storage so that data is not lost when the cluster is shut down, which makes it well suited to persistent clusters. EMRFS is an implementation of HDFS on Amazon S3; keeping the data in S3 lets you use the whole Hadoop ecosystem of tools, which makes it well suited to transient clusters.

EMR Notebooks provide a managed Jupyter Notebook-based environment in which data scientists, analysts, and developers can prepare and visualize data, collaborate with peers, build applications, and perform interactive analysis using EMR clusters. You can use EMR Notebooks to build Apache Spark applications and easily run interactive queries on an EMR cluster. Multiple users can create serverless notebooks directly from the console, attach them to an existing shared EMR cluster, or provision a cluster of one or more nodes directly from the console and immediately start experimenting with Spark.

Security settings: EMR creates two EC2 security groups by default, one for the master node and one for the slave nodes. The master security group opens a port for communication with the service and opens the SSH port so that the SSH key specified at launch can be used to log in to the instance. By default the instances cannot be reached from outside, and the slave security group only allows interaction with the master instance. Data is sent to S3 over SSL by default. A cluster can carry up to 10 tags, but tag-based IAM permissions are not supported. Use IAM permissions and roles to control access to EMR; you can set permissions that allow non-Hadoop users to submit jobs to a cluster, and you can place EMR in a private VPC for additional protection.
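To make the persistent-versus-transient distinction concrete, the following boto3 sketch launches a transient cluster that runs a single Spark step and terminates when it finishes. The cluster name, release label, instance types, S3 paths, and script location are illustrative placeholders, and the default EMR IAM roles are assumed to already exist in the account.

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # pick the region where the data lives

response = emr.run_job_flow(
    Name="transient-spark-demo",                    # hypothetical cluster name
    ReleaseLabel="emr-6.10.0",                      # example release label
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-emr-logs/demo/",                # placeholder log bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,       # transient: shut down after the steps finish
        "TerminationProtected": False,
    },
    Steps=[{
        "Name": "spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/etl_job.py"],  # placeholder script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",              # assumed default roles
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster started:", response["JobFlowId"])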
AWS Data Pipeline reliably processes and moves data between AWS compute and storage resources and on-premises data sources at specified intervals. You can deploy pipelines quickly and easily, without the distraction of managing day-to-day data operations, which lets you focus on getting the information you need out of the data. You only need to specify the data sources, schedules, and processing activities required for your pipeline. Compared with SWF, Data Pipeline is specifically designed to simplify the steps that are common to most data-driven workflows, for example running an activity once the input data meets specific readiness criteria, easily copying data between different data stores, and scheduling chained transformations. This highly specific focus means that Data Pipeline workflow definitions can be created quickly and require no code or programming knowledge. Typical use: regularly access stored data, process it at scale, and deliver the results to AWS services.
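As a rough illustration of "data sources, schedules, and processing activities", the sketch below creates a pipeline with boto3, uploads a minimal definition (a daily schedule, an EC2 resource, and a shell-command activity), and activates it. All object names, field values, roles, instance types, and S3 paths are assumptions made for the example, not values from this article.

import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline_id = dp.create_pipeline(name="daily-demo-pipeline", uniqueId="daily-demo-001")["pipelineId"]

# Minimal definition: default settings, a daily schedule, an EC2 resource, and one activity.
objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "DailySchedule"},
        {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        {"key": "pipelineLogUri", "stringValue": "s3://my-pipeline-logs/demo/"},
    ]},
    {"id": "DailySchedule", "name": "DailySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 day"},
        {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
    ]},
    {"id": "WorkerInstance", "name": "WorkerInstance", "fields": [
        {"key": "type", "stringValue": "Ec2Resource"},
        {"key": "instanceType", "stringValue": "t2.micro"},
        {"key": "terminateAfter", "stringValue": "1 Hour"},
    ]},
    {"id": "CopyActivity", "name": "CopyActivity", "fields": [
        {"key": "type", "stringValue": "ShellCommandActivity"},
        {"key": "command", "stringValue": "aws s3 cp s3://my-source/in.csv s3://my-dest/in.csv"},
        {"key": "runsOn", "refValue": "WorkerInstance"},
    ]},
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=objects)
dp.activate_pipeline(pipelineId=pipeline_id)
print("Activated pipeline:", pipeline_id)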
Using a pipeline definition, you can schedule tasks to run every 15 minutes, every day, or every week. Data nodes are where the pipeline reads and writes data; they can be AWS stores such as S3, MySQL, or Redshift, or local storage. A pipeline usually works with other services such as EMR and EC2 to carry out the predefined tasks, and shuts those services down automatically after execution. The orchestrated process supports conditional statements. If an activity fails it is retried again and again by default, so you should configure a limit on the number of retries and the action to take if it still does not succeed. Each account can have 100 pipelines by default, and a single pipeline can contain 100 objects; you can request a limit increase.

A pipeline is the AWS Data Pipeline resource itself: it contains the definitions of the data sources, the destinations, and the chain of predefined or custom data-processing activities required to execute your business logic. Data nodes represent your business data; for example, a data node can represent a specific Amazon S3 path. AWS Data Pipeline supports an expression language that makes it easier to reference data that is generated on a regular basis. An activity is an action that AWS Data Pipeline initiates on your behalf as part of the pipeline; example activities include EMR or Hive jobs, copies, SQL queries, and command-line scripts. A precondition is a readiness check that can optionally be associated with a data source or an activity. If a data source has a precondition check, the check must complete successfully before any activity that consumes that data source can start; if an activity has a precondition, the check must complete successfully before the activity can run. A schedule defines when pipeline activities run and how often the service expects the data to be available. You can choose a schedule end date, after which the AWS Data Pipeline service performs no further activities. When you associate a schedule with an activity, the activity runs on that schedule; when you associate a schedule with a data source, you tell the AWS Data Pipeline service that you expect the data to be updated on that schedule. Data Pipeline is suited to regular batch ETL processes rather than continuous data streams.

Amazon Elastic Transcoder is an online media transcoding tool that converts video from a source format into other formats and resolutions for playback on mobile phones, tablets, PCs, and so on. Typically, the media files to be transcoded are placed in an S3 bucket, and a pipeline and jobs are created to transcode the files into specific formats; the output files are then written to another S3 bucket. You can also use preset templates to convert media formats. Elastic Transcoder can work with a Lambda function: after a new file is uploaded to S3 the function code is triggered, it calls Elastic Transcoder, and the media file is transcoded automatically.
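The S3-to-Lambda-to-Elastic Transcoder flow just described could look roughly like the following Lambda handler. The pipeline ID, the preset ID, and the output key naming are placeholders that would need to match an actual Elastic Transcoder pipeline configured with separate input and output buckets.

import os
import urllib.parse

import boto3

transcoder = boto3.client("elastictranscoder", region_name="us-east-1")

PIPELINE_ID = os.environ.get("PIPELINE_ID", "1111111111111-abcde1")  # placeholder pipeline ID


def lambda_handler(event, context):
    """Triggered by an S3 ObjectCreated event; submits a transcoding job for each new file."""
    for record in event["Records"]:
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        transcoder.create_job(
            PipelineId=PIPELINE_ID,
            Input={"Key": key},
            Outputs=[{
                "Key": "720p/" + key,
                "PresetId": "1351620000001-000010",  # assumed: a generic 720p system preset
            }],
        )
    return {"submitted": len(event["Records"])}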
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage and you pay only for the queries you run. Athena is easy to use: simply point to your data in Amazon S3, define the schema, and start querying with standard SQL; most results are returned within seconds. With Athena there is no need for complex ETL jobs to prepare data for analysis, so anyone with SQL skills can quickly and easily analyze large datasets. Supported data formats include JSON and Apache Parquet, among others (a small boto3 sketch appears at the end of this section).

Amazon Elasticsearch Service is a fully managed service that makes it easy to deploy, secure, and operate Elasticsearch at scale without downtime. The service offers the open-source Elasticsearch APIs, managed Kibana, and integration with Logstash and other AWS services, enabling you to securely ingest data from any source and perform real-time search, analysis, and visualization. With Amazon Elasticsearch Service you pay only for what you actually use, with no upfront costs or usage requirements, and you get the ELK stack you need without the operational overhead.

AWS X-Ray helps developers analyze and debug distributed production applications, such as those built with a microservices architecture. With X-Ray you can understand how your application and its underlying services are performing, so you can identify and troubleshoot the root causes of performance problems and errors. X-Ray provides an end-to-end view of requests as they travel through the application and shows a map of the application's underlying components. You can use X-Ray to analyze applications in development and in production, from simple three-tier applications to complex microservices applications made up of thousands of services.
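Coming back to Athena, a query against data already catalogued in a database could be run from Python along these lines. The database name, the SQL statement, and the S3 output location are assumptions, and a production caller would poll with a backoff and error handling rather than a fixed one-second sleep.

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start the query; results are written to the given S3 location.
execution_id = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page LIMIT 10",  # example SQL
    QueryExecutionContext={"Database": "weblogs"},                                   # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/demo/"},          # placeholder bucket
)["QueryExecutionId"]

# Wait for the query to finish (simplified polling).
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])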