What is MaxCompute?


Many newcomers are not entirely clear about what MaxCompute is. To help answer this question, the following article explains it in detail; readers with this need are welcome to learn from it, and I hope you will gain something.

Many users who have just come into contact with MaxCompute find it hard to quickly form a complete picture of the product from the large volume of product documentation and community articles. At the same time, many developers with big data experience hope to use their background knowledge to map MaxCompute's capabilities onto open source projects and commercial software, so they can quickly judge whether MaxCompute meets their needs and draw on that experience to learn and use the product more easily.

This article therefore takes a more macro perspective and introduces MaxCompute topic by topic, in the hope that readers can quickly build an understanding of the product.

Concepts

Product name: Big Data Computing Service (English name: MaxCompute)

Product description: MaxCompute (formerly ODPS) is a big data computing service that provides a fast, fully managed, PB-scale data warehouse solution, enabling you to analyze and process massive data economically and efficiently.

The first half of the description defines MaxCompute as a big data computing service, which can be understood as a cloud-based service whose role is to support big data computing. The second half explains its applicable scenarios: large-scale data warehousing and massive data processing and analysis.

From this alone, we cannot yet tell what computing capabilities the service provides or what form the service takes. Since "data warehouse" appears in the definition, we can infer that MaxCompute handles large-scale (PB-level) structured data. Whether "massive data processing" also covers unstructured data still needs to be verified, as does whether "analysis" includes complex analysis capabilities beyond ordinary SQL.

With these questions in mind, we continue the introduction and hope to answer them clearly in the following sections.

Architecture

Before introducing the features, this outline starts from the overall logical structure of the product so that readers can see the full picture.

MaxCompute provides a cloud-native, multi-tenant service architecture. Its computing services and service interfaces are pre-built on top of large-scale underlying computing and storage resources, together with supporting security controls, development tools and management tools. The product works out of the box.

Users can activate the service and create a MaxCompute project on the Aliyun console within minutes, with no need to provision underlying resources, deploy software, or operate and maintain infrastructure; version upgrades and bug fixes are carried out automatically by Aliyun's professional team.

Features

Data storage

Supports large-scale storage and computing, suitable for needs from the TB scale up to the EB level. A single MaxCompute project can accommodate the data scale of an enterprise as it grows from a startup team into a unicorn.

Data is stored in a distributed manner with multi-copy redundancy; storage exposes only a table operation interface and provides no file system access interface.

Self-developed storage structure: table data is stored in columnar form with high compression by default, and an ORC-compatible Ali-ORC storage format will be provided later.

Supports external tables, mapping data stored in OSS object storage and OTS table storage into two-dimensional tables.

Supports partitioned and bucketed storage (Partition and Bucket); a short PyODPS sketch follows this list.

The underlying layer is not HDFS but Pangu, Alibaba's self-developed file system; still, the file layout behind a table and the task concurrency mechanism can be understood by analogy with HDFS.

In use, storage is decoupled from computing, so there is no need to scale out computing resources just to add storage.
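
To make the table-only, partitioned storage model concrete, here is a minimal PyODPS sketch that creates a partitioned table. It assumes PyODPS is installed and that the Aliyun credentials, endpoint, project and table names (all placeholders below) have been prepared; none of them come from this article.

```python
# Minimal sketch: create a partitioned MaxCompute table via PyODPS.
# Credentials, endpoint, project and table names are placeholders.
from odps import ODPS

o = ODPS('<access_id>', '<secret_access_key>', project='my_project',
         endpoint='https://service.odps.aliyun.com/api')

# Columns and partition columns are declared separately; the data is
# only reachable through the table interface, never as raw files.
table = o.create_table(
    'user_events',
    ('user_id bigint, event string, amount double', 'ds string'),
    if_not_exists=True,
)
print(table.schema)
```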

Multiple computing models

It is worth noting that in a traditional data warehouse scenario, most practical data analysis needs can be met with SQL plus UDFs. However, as enterprises place more value on their data and more roles begin to use it, they also require richer computing capabilities to serve different scenarios and different users.

MaxCompute therefore provides more than the SQL data analysis language: on top of unified data storage and a unified permission system, it supports multiple types of computation.

MaxCompute SQL:

Supports 100% of TPC-DS. The syntax is highly compatible with Hive, so developers with a Hive background can use it directly, and performance is strong, especially at big data scale.

A fully self-developed compiler makes language feature development more flexible, iteration faster, and syntax and semantic checking more flexible and efficient.

A cost-based optimizer that is smarter, more powerful, and better suited to complex queries.

LLVM-based code generation makes execution more efficient.

Supports complex data types (array, map, struct).

Supports UDF / UDAF / UDTF in Java and Python.

Syntax: VALUES, CTE, SEMI JOIN, queries starting with FROM, subquery operations, set operations (UNION / INTERSECT / MINUS), SELECT TRANSFORM, user-defined types, GROUPING SETS (CUBE / ROLLUP / GROUPING SETS), script mode, and parameterized views.

Supports external tables (external data sources + StorageHandler, covering unstructured data); a SQL submission sketch follows this list.
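
As a sketch of how these SQL capabilities are exercised in practice, the snippet below submits a query with a CTE and GROUPING SETS through PyODPS's execute_sql (described in the PyODPS section below) and reads the result set. The client object `o` and the `user_events` table are the illustrative ones from the storage sketch above, not part of any real project.

```python
# Hedged sketch: submit MaxCompute SQL (CTE + GROUPING SETS) via PyODPS.
# `o` is a configured ODPS client; `user_events` is an illustrative table.
sql = """
WITH daily AS (
    SELECT ds, event, COUNT(*) AS cnt
    FROM user_events
    GROUP BY ds, event
)
SELECT ds, event, SUM(cnt) AS total
FROM daily
GROUP BY GROUPING SETS ((ds), (ds, event))
"""

instance = o.execute_sql(sql)           # blocks until the job finishes
with instance.open_reader() as reader:  # read the query result set
    for record in reader:
        print(record['ds'], record['event'], record['total'])
```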

MapReduce:

Supports the MapReduce programming interface (providing the optimized and enhanced MaxCompute MapReduce as well as a highly Hadoop-compatible version of MapReduce).

The file system is not exposed; inputs and outputs are tables.

Jobs are submitted through the MaxCompute client tool or DataWorks.

MaxCompute Graph model:

MaxCompute Graph is an iteration-oriented graph computing framework. A graph computing job is modeled as a graph composed of vertices (Vertex) and edges (Edge), both of which carry weights (Value).

The graph is modified and evolved through iterations until the final result is obtained.

Typical applications include PageRank, single-source shortest path, the K-means clustering algorithm, and so on.

Graph programs are written with the Java SDK provided by MaxCompute Graph and submitted through the MaxCompute client tool with the jar command.

PyODPS:

Use familiar Python to process MaxCompute data with MaxCompute's large-scale computing power.

PyODPS is the Python SDK of MaxCompute. It also provides a DataFrame framework with pandas-like syntax that can harness MaxCompute's processing power to handle extremely large-scale data.

PyODPS provides access to ODPS objects such as tables, resources and functions.

SQL can be submitted through run_sql / execute_sql.

Data can be uploaded and downloaded through open_writer and open_reader, or through the native Tunnel API.

The DataFrame API offers a pandas-like interface while computation is carried out by MaxCompute.

PyODPS DataFrame provides many pandas-like interfaces but extends the syntax, for example adding a MapReduce API to suit big data environments.

Convenient methods such as map, apply and map_reduce let users write functions on the client side and call third-party libraries such as pandas, scipy, scikit-learn and nltk inside them (a DataFrame sketch follows this list).
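
A short, hedged PyODPS DataFrame sketch of the points above: a table-backed DataFrame, a lazy pandas-like aggregation compiled into a MaxCompute job, and a map call whose function runs on the cluster. `o` and `user_events` are the illustrative client and table used earlier.

```python
# Hedged sketch of the PyODPS DataFrame API; `o` is a configured ODPS
# client and `user_events` is an illustrative table.
from odps.df import DataFrame

df = DataFrame(o.get_table('user_events'))

# Expressions are lazy; the aggregation below is compiled into a
# MaxCompute job only when the result is actually requested.
top_events = (
    df[df.amount > 0]
    .groupby('event')
    .agg(total=df.amount.sum())
    .sort('total', ascending=False)
)
print(top_events.head(10))

# Functions passed to map/apply are shipped to and executed on the
# cluster; the output type is declared because it differs from the input.
event_len = df.event.map(lambda v: len(v), 'int')
print(event_len.head(5))
```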

Spark:

MaxCompute provides the Spark on MaxCompute solution, which makes MaxCompute compatible with the open-source Spark computing service. It offers a Spark computing framework built on a unified permission system for computing resources and datasets, and lets users submit and run Spark jobs in the development style they are familiar with.

Supports native, multi-version Spark jobs: Spark 1.x and Spark 2.x jobs can both be run.

Open-source experience: jobs are submitted with spark-submit (interactive spark-shell / spark-sql is not supported), and the native Spark WebUI is available for users to view.

Enables more complex ETL by accessing external data sources such as OSS, OTS and databases, and supports unstructured processing of data on OSS.

Spark can also be used for machine learning on data inside and outside MaxCompute, broadening the application scenarios (a PySpark sketch follows this list).
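
Below is a hedged PySpark sketch of what a Spark on MaxCompute job might look like. It assumes that, when the job runs inside a MaxCompute project, project tables are visible to spark.sql(); the table names are illustrative, and the job would be packaged and submitted with spark-submit as described above.

```python
# Hedged PySpark sketch for Spark on MaxCompute; table names are
# illustrative, and table visibility via spark.sql() is an assumption
# about the Spark on MaxCompute runtime.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("spark-on-maxcompute-sketch")
    .getOrCreate()
)

# Read a project table, aggregate, and write the result back as a table.
events = spark.sql("SELECT user_id, event, amount FROM user_events")
summary = events.groupBy("event").sum("amount")
summary.write.mode("overwrite").saveAsTable("event_amount_summary")

spark.stop()
```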

Interactive analysis (Lightning)

The interactive query service of MaxCompute, with the following features:

PostgreSQL compatibility: a JDBC/ODBC interface compatible with the PostgreSQL protocol, so any tool or application that supports PostgreSQL databases can connect to a MaxCompute project with the default driver. Mainstream BI and SQL client tools such as Tableau, FanRuan BI, Navicat and SQL Workbench/J are supported.

Significantly improved query performance: within a certain data scale, query results come back in seconds, supporting scenarios such as BI analysis, ad-hoc queries and online services (a connection sketch follows this list).
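
Because Lightning speaks the PostgreSQL protocol, any PostgreSQL client library should be able to connect. The sketch below uses Python's psycopg2 purely as an illustration; the endpoint host, port and credential values are placeholders that would come from the Aliyun console and the Lightning documentation, not from this article.

```python
# Hedged sketch: query MaxCompute Lightning through its
# PostgreSQL-compatible endpoint with psycopg2. Host, port and
# credentials below are placeholders, not real values.
import psycopg2

conn = psycopg2.connect(
    host='<lightning-endpoint>',   # region-specific Lightning endpoint
    port=443,                      # placeholder; use the documented port
    dbname='my_project',           # MaxCompute project name
    user='<access_id>',
    password='<access_key>',
    sslmode='require',
)

with conn.cursor() as cur:
    cur.execute("SELECT event, COUNT(*) FROM user_events GROUP BY event")
    for event, cnt in cur.fetchall():
        print(event, cnt)

conn.close()
```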

Machine learning:

MaxCompute has built-in support for hundreds of machine learning algorithms. At present, MaxCompute's machine learning capabilities are provided uniformly through the PAI product. PAI also provides deep learning frameworks, Notebook development environments, GPU computing resources, and elastic prediction services for online model deployment. PAI integrates seamlessly with MaxCompute in terms of projects and data.

Comparison with open source

To help readers, especially those with open source community experience, quickly build an understanding of MaxCompute's main functions, here is a simple mapping:

| Item | MaxCompute | Comparison with the open source ecosystem |
| --- | --- | --- |
| SQL | MaxCompute SQL | Alibaba's self-developed SQL engine; syntax is compatible with Hive, with better functionality and performance |
| MapReduce | MaxCompute MR | Self-developed; similar to and compatible with Hadoop MapReduce, with MaxCompute Open MR offering optimizations and improvements |
| Interactive analysis | MaxCompute Lightning | Serverless interactive query service, functionally similar to open-source Presto, HAWQ, etc. |
| Spark | Spark on MaxCompute | Runs native Spark on MaxCompute, similar in form to Spark on YARN |
| Machine learning | PAI | Unlike open-source algorithm libraries, PAI offers richer algorithms, very large-scale processing capacity, and the platform services needed across the whole ML/DL workflow |
| Storage | Pangu | Alibaba's self-developed distributed storage service, similar to HDFS; MaxCompute currently exposes only the table interface, with no direct file system access |
| Resource scheduling | Fuxi | Alibaba's self-developed resource scheduling system, similar to YARN |
| Data upload and download | Tunnel | The file system is not exposed; batch data is uploaded and downloaded through Tunnel |
| Streaming ingestion | DataHub | MaxCompute's streaming data ingestion service, roughly similar to Kafka; topic data can be archived to MaxCompute tables with simple configuration |
| User interface | CLT/SDK | Unified command-line tool and Java/Python SDKs |
| Development & diagnostics | DataWorks / Studio / Logview | Supporting tools for data synchronization, job development, workflow scheduling, job operations and diagnostics; in the open source community, Sqoop, Kettle and Oozie are commonly used for data synchronization and scheduling |
| As a whole | Not isolated features but a complete enterprise-grade service | No need to integrate, tune or customize multiple components; works out of the box |

FAQ

What are the relationship and differences between DataWorks and MaxCompute?

They are two different products. MaxCompute handles data storage, analysis and processing, while DataWorks is an IDE suite that integrates data integration, data development and debugging, job scheduling and operations, metadata management, data quality management, data API services and other functions. It is somewhat like the relationship between Spark and HUE, though I am not sure how accurate that comparison is.

Is it expensive to test and try out MaxCompute?

No; in fact the cost is very low. MaxCompute offers a pay-per-job model in which the cost of a single job is closely tied to the amount of data it processes. Activate the pay-as-you-go service, create one project, and you can start testing by creating tables and uploading test data with the MaxCompute client tool (odpscmd) or in DataWorks. With a small amount of data, 10 yuan can last a long time.

Of course, MaxCompute also offers a dedicated-resource mode with prepaid billing, which some users choose for its predictable cost.

In addition, MaxCompute will soon launch a "developer edition" that gives developers a certain amount of free credit each month for development and learning.

MaxCompute storage currently exposes only tables; can it handle unstructured data?

Yes. Unstructured data can be stored on OSS. One approach is to use external tables with a custom Extractor that implements the logic for turning unstructured data into structured data. Another is to use Spark on MaxCompute to access OSS, extract and transform the files in an OSS directory with a Spark program, and write the results to a MaxCompute table.
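
As a rough illustration of the external-table route, the sketch below maps CSV files on OSS into an external table using the built-in CSV storage handler and then queries it like a normal table. The handler class name, OSS path and columns are illustrative assumptions, and authorization details (such as the role that grants MaxCompute access to OSS) are omitted; a custom Extractor written in Java would replace the handler for arbitrary formats.

```python
# Hedged sketch: expose CSV files on OSS as a MaxCompute external table.
# Handler name, OSS path, columns and authorization setup are illustrative
# assumptions; consult the MaxCompute external table documentation.
o.execute_sql("""
CREATE EXTERNAL TABLE IF NOT EXISTS oss_raw_logs (
    log_time string,
    user_id  bigint,
    message  string
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'
LOCATION 'oss://<bucket>/logs/'
""")

# Once mapped, the external table is queried like any internal table.
with o.execute_sql("SELECT COUNT(*) FROM oss_raw_logs").open_reader() as r:
    for record in r:
        print(record[0])
```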

Which data sources can be connected to MaxCompute?

Through the DataWorks data integration service, or with DataX, you can connect to various offline data sources on Aliyun, such as databases, HDFS, FTP and so on.

You can also use the MaxCompute Tunnel tool or SDK to upload and download data in batches via commands or code (a PyODPS sketch follows this list).

Streaming data can be written to DataHub using the Flume/Logstash plug-ins provided by MaxCompute, and then archived to MaxCompute tables.

Data from the Aliyun SLS and DTS services can also be written to MaxCompute tables.
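
For the batch path, here is a minimal sketch of Tunnel-backed upload and download through the PyODPS open_writer / open_reader interfaces mentioned earlier; `o` is a configured ODPS client, and the table, partition and rows are illustrative.

```python
# Minimal sketch: batch upload/download over Tunnel via PyODPS.
# `o` is a configured ODPS client; table, partition and rows are
# illustrative.
table = o.get_table('user_events')

# Batch upload into a partition, creating it if it does not exist.
with table.open_writer(partition='ds=20240101', create_partition=True) as w:
    w.write([[1001, 'click', 0.0], [1002, 'pay', 9.9]])

# Batch download from the same partition.
with table.open_reader(partition='ds=20240101') as reader:
    for record in reader:
        print(record['user_id'], record['event'], record['amount'])
```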
