In this issue, the editor shares how to use odpscmd in MaxCompute. The article covers the topic in some depth from a practical point of view, and I hope you will take something away from it after reading.
I. The positioning of the command-line tool odpscmd in the MaxCompute ecosystem
odpscmd is the name of MaxCompute's command-line tool, and it sits at the very top of the MaxCompute stack. As shown in the figure below, the MaxCompute ecosystem is layered from bottom to top, and what is actually exposed to users is a set of RESTful APIs, which form the core interface. Both the Java and Python SDKs call these core APIs. The command-line tool is a further wrapper around MaxCompute's open RESTful APIs: it lets users submit jobs by command from the client side, and those jobs are submitted through the interface to the MaxCompute cluster for management and development. You may be more familiar with the combination of MaxCompute and DataWorks, because DataWorks has to be activated before MaxCompute can be used. In fact, MaxCompute also has ecosystem tools of its own, such as odpscmd and MaxCompute Studio.
[Figure: the layered MaxCompute ecosystem, from the RESTful API up to client tools such as odpscmd]
II. Getting started quickly: a complete, simple example
This section walks through the various stages of data processing with the odpscmd client, using a simple but complete example. Each step is only described briefly here; the detailed hands-on operations are covered in the accompanying video.
As MaxCompute's client tool, odpscmd is similar to client tools such as the Hive CLI and psql: it is a "black screen" (command-line) operation and management tool. The following is a complete but simple example that covers all the stages of big data processing: environment preparation, data ingestion, data processing, and data consumption. In a typical big data scenario, business data is scattered across databases and other production systems, and data collection or synchronization jobs run periodically to bring the incremental data from the production environment into the data warehouse; this is the data ingestion step. After that, data processing runs periodically in the warehouse. Sometimes ordinary SQL is enough; MaxCompute also provides programming frameworks such as MapReduce (MR), and users with deeper requirements can implement complex business logic through UDFs. During processing, the execution of jobs needs to be monitored, so there is also a need to view and manage jobs, for example their progress and success status. Once the jobs have finished, the data has essentially been cleaned and aggregated and can be handed over to data consumers. For data consumers, the processed data often needs to be written back to the business systems to support online applications, or connected to BI tools through the JDBC interface for visual analysis, and business analysts often want to download some of the data for further processing.
1. Download, installation, and configuration
You can find the download address for odpscmd on the official website. What the browser downloads is a ZIP package; after decompression you can see the directories that make up odpscmd. Linux users can also install the corresponding package from a yum repository. After the download is complete, you need to edit the odps_config.ini configuration file: as shown in the blue box below, you fill in the project name and the authentication information of the account used to log in, such as access_id and access_key. Note that the end_point domain name of MaxCompute is the same across the domestic regions, while tunnel_endpoint is region-specific, so the tunnel_endpoint you enter differs from region to region. Once the configuration file is filled in, you can start the odpscmd client.
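For reference, a minimal odps_config.ini might look like the sketch below; the project name and keys are placeholders, and the two endpoint values are only examples for a domestic region, so replace them with the values documented for your own region:

    project_name=my_project
    access_id=<your_access_id>
    access_key=<your_access_key>
    end_point=http://service.cn.maxcompute.aliyun.com/api
    tunnel_endpoint=http://dt.cn-hangzhou.maxcompute.aliyun.com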
2. Data environment preparation
Users of the odpscmd client often combine it closely with shell scripts and open-source tools. Suppose, for example, that the business database has a daily business table that stores the day's click logs and newly created orders. A common scenario is to synchronize this data into the data warehouse. The process requires a data synchronization tool to load the data into warehouse tables periodically, and usually a corresponding partitioned table is created so that each batch of data lands in the right partition. This task can be accomplished with the open-source tool DataX, which inserts the data into the warehouse tables. When you configure the DataX job, parameters such as the partition field tend to be dynamic, so they also need to be passed into the DataX script as dynamic parameters, as sketched below.
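A minimal sketch of such a daily script is shown below. The table name, the DataX installation path, and the job file are illustrative, odpscmd is assumed to be on the PATH, and the DataX job JSON is expected to reference the parameter as ${dt}:

    #!/bin/bash
    dt=$(date +%Y%m%d)
    # make sure the target partition exists in the warehouse table
    odpscmd -e "alter table ods_order_log add if not exists partition (dt='${dt}');"
    # hand the same date to DataX as a dynamic parameter
    python /opt/datax/bin/datax.py -p "-Ddt=${dt}" mysql_to_odps_order_log.json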
3. Data processing
Once data synchronization is complete, in many scenarios the partitioned tables need further processing. In the example shown in the figure below, a new table is created (or an existing one is rewritten with INSERT OVERWRITE), the newly synchronized incremental partition is aggregated, and the results are written into the new table. When a job runs for a long time, odpscmd also provides the job-monitoring command "show p", which lists historical jobs. Every job has its own instance_id; for MaxCompute, the most basic unit of work is the instance, and each instance corresponds to one submitted job. Given an instance_id, the corresponding Logview can also be retrieved afterwards. In short, odpscmd provides complete capabilities for submitting jobs, reviewing them after the fact, and viewing the details of a specified job. In this example, Tunnel is used to download the result data set from MaxCompute so that it can be analyzed in Excel or other tools: executing tunnel download saves the result data set to a local file.
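For example, the statements below aggregate the newly synchronized partition into a result table, list recent jobs with show p, inspect one instance with wait, and download the result with tunnel; all table, column, and instance-id values are illustrative:

    insert overwrite table dws_order_daily partition (dt='20200601')
    select shop_id, count(*) as order_cnt
    from ods_order_log
    where dt = '20200601'
    group by shop_id;

    show p;
    wait 20200601033000123gxxxxxxxx;
    tunnel download dws_order_daily/dt="20200601" result.csv;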
III. Capability framework provided by the client
The purpose of the content above is to review, through a simple example, the daily (and often complex) big data processing flow and its links, and to show that the MaxCompute client tool can support every part of that daily work. So what exactly are the functions of the MaxCompute client tool that cover each of these links? The functions of odpscmd include management of project spaces; operations on tables, views, and partitions; management of resources and functions; management of job instances; data channels for uploading and downloading data; and other operations such as security and permission management. These are introduced in turn below.
Project space related operations
Before you can connect, a MaxCompute project must already have been created. When working with projects you can use a command such as "use <project_name>;", much like switching databases in Hive, which lets you move quickly between multiple projects. After "use <project_name>;", all subsequent commands apply directly to the selected project, as in the example below.
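For example (the project name is illustrative):

    use my_project_dev;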
Table-related operations
Table-related commands are easy to run in the odpscmd client, including creating and deleting tables, as well as modifying them, for example renaming columns, modifying partitions, changing the table owner, and clearing the data of non-partitioned tables. Other operations such as show tables are also compatible with Hive usage habits. The table-related commands and their help documentation are listed in the figure below, and a few representative commands follow.
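A few representative commands, as a sketch (table and column names are illustrative):

    create table if not exists ods_order_log (order_id string, shop_id string, amount double) partitioned by (dt string);
    show tables;
    desc ods_order_log;
    alter table ods_order_log change column amount rename to order_amount;
    alter table ods_order_log rename to ods_order_log_v2;
    truncate table dim_shop;
    drop table if exists tmp_order_log;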
View and partition related operations
For views, odpscmd lets you re-encapsulate complex processing logic behind a view so that it is easier to expose; create, modify, delete, and view operations are all provided. For partitions, what people care about most is how to inspect a table's partitions: show partitions lists how many partitions the table has and what their names are, and alter table can be used to drop a partition or rename it.
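For example (names are illustrative; the partition-rename syntax in particular should be double-checked against the documentation):

    create view if not exists v_big_orders as select * from ods_order_log where order_amount > 1000;
    show partitions ods_order_log;
    alter table ods_order_log add if not exists partition (dt='20200601');
    alter table ods_order_log drop if exists partition (dt='20200501');
    alter table ods_order_log partition (dt='20200601') rename to partition (dt='20200602');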
Resource and function related operations
Advanced users will find that the built-in functions cannot always meet their logic requirements; they often need UDFs for complex calculations, or MR for more flexible computation logic. In these cases the user needs to upload a custom-developed package, which for MaxCompute is a resource (Resource). Resources in a project can be uploaded and viewed through odpscmd. As for functions, when creating a UDF the user can simply use the create function command.
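For example (jar, file, and class names are illustrative):

    add jar my_udfs.jar;
    add file lookup_dict.txt;
    list resources;
    drop resource lookup_dict.txt;
    create function str_clean as 'com.example.udf.StrClean' using 'my_udfs.jar';
    list functions;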
Instance-related operations
As for instances: you may run a lot of jobs from the client and want to check, at some point, whether a job has finished, but you cannot remember its exact ID. In that case the show p and show instances commands list the submitted historical jobs and let you filter them by time and other conditions. Once you have found the instance in the list, you can use the "wait" command to view the details of that specific task.
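For example (the instance id is illustrative; the additional arguments for time filtering can be found via the help output):

    show p;
    show instances;
    wait 20200601033000123gxxxxxxxx;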
Tunnel related operations
Most of the commands above submit control, management, and query jobs. Tunnel-related operations, by contrast, handle the upload and download of the data itself; unlike the job submission discussed above, they are much more demanding in terms of data throughput, so odpscmd integrates the Tunnel tool to upload and download data from the command line. A problem often encountered here is that many developers synchronize data through Tunnel in their own production environments, where the ability to resume transfers from a breakpoint becomes especially important.
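A few typical commands, as a sketch (file, table, and partition values are illustrative, and the history/resume subcommands used for breakpoint continuation should be verified against the tunnel help output):

    tunnel upload log_20200601.csv ods_order_log/dt="20200601";
    tunnel download dws_order_daily/dt="20200601" result.csv;
    tunnel show history;
    tunnel resume;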
Security and permission related operations
Many users are familiar with DataWorks, which has relatively simple and clear capabilities for managing users, roles, and authorization. On the other hand, people with a database background are often more used to the "black screen" way of working, that is, managing security and permissions by command.
Role-related rights management
For role-related permission management, for example, you can use create role to create roles in a big data project, grant privileges to the roles, add a user to a role or remove one, and list the existing roles.
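For example (role, table, and account names are illustrative):

    create role dev_role;
    grant select on table ods_order_log to role dev_role;
    grant dev_role to ALIYUN$dev_user@example.com;
    revoke dev_role from ALIYUN$dev_user@example.com;
    list roles;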
User-related rights management
For user-related permission management, the most common task is to add an Alibaba Cloud (Aliyun) account to the project and assign it a specific role so that it obtains the corresponding permissions.
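For example (the account and role names are illustrative):

    add user ALIYUN$new_member@example.com;
    grant dev_role to ALIYUN$new_member@example.com;
    list users;
    remove user ALIYUN$new_member@example.com;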
Data protection of project space
Some administrators have high requirements for protecting the project space, and MaxCompute has done a lot of work on multi-tenant support. For example, you can forbid downloading project data and only allow data to be shared among a few authorized projects, so the permission-protection capabilities related to the project space are also available in odpscmd.
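A hedged sketch of the relevant switches (the exact names, especially the trusted-project commands, should be verified against the security documentation; the project name is illustrative):

    set ProjectProtection=true;
    add trustedproject partner_project;
    list trustedprojects;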
Permission View
At the same time, odpscmd also provides commands for viewing permissions.
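For example (the account name is illustrative):

    whoami;
    show grants;
    show grants for ALIYUN$dev_user@example.com;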
Other actions
odpscmd also provides some other common operations. You may often run into performance-tuning scenarios, for example splitting a large table into more scan splits to increase the parallelism of a job; the switches for such optimizations can be set quickly from the command line. You can also estimate the cost of a SQL statement, and help information is easy to obtain. In short, odpscmd is a powerful and complete client-side tool.
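As a hedged sketch: the flag below controls the input split size (and therefore how many mappers a job gets), and the cost sql command estimates the cost of a statement; the exact flag name, its unit, and the table names should be checked against the documentation:

    set odps.sql.mapper.split.size=128;
    cost sql select shop_id, count(*) from ods_order_log where dt='20200601' group by shop_id;
    help;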
IV. Description of key scenarios on the client
Next, we will focus on several key scenarios in which the client is used.
Scenario 1: call the tunnel command from a shell script to upload and download files
In this scenario, uploads use a default delimiter. When users run into non-standard delimiters, they can quickly adapt the column delimiter with the -fd option. Many users also want to schedule the upload dynamically through shell scripts, which involves passing in dynamic parameters. As shown in the example in the figure below, the date can be passed to the Tunnel command dynamically, so that newly generated log files are uploaded, or downloaded into the corresponding directory, periodically.
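A sketch of such a script, under the assumption that the log path, table name, and "|" delimiter are illustrative and odpscmd lives at the given path:

    #!/bin/bash
    # today's date is used both in the log file name and as the partition value
    dt=$(date +%Y%m%d)
    /path/to/odpscmd -e "tunnel upload /data/logs/click_${dt}.log ods_click_log/dt=${dt} -fd \"|\";"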
Scenario 2: debug odpscmd commands
In non-interactive scenarios, odpscmd supports the -f parameter, so a command file can be executed directly as "odpscmd -f <script>" from shell scripts or other programs. odpscmd also supports the -e parameter, in which case the SQL command is embedded in quotation marks.
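For example (the script file and table name are illustrative):

    # run a prepared file of SQL / odpscmd commands
    odpscmd -f daily_report.sql
    # run a single statement passed in quotation marks
    odpscmd -e "select count(*) from ods_order_log where dt='20200601';"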
Scenario 3: run data query / data processing jobs with a UDF
UDF and MR are natively supported in odpscmd. The figure below shows the process of running a job that uses a UDF.
In the first two steps, you package the UDF code into a JAR offline and upload it to the target project as a resource with "add jar" on the command line. Then, with the create function command, you name a custom function and associate it with the main class in the uploaded resource, which actually creates the function. After that, you can call the UDF you just created whenever you test or use it. This is the complete process of creating a UDF through odpscmd.
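A minimal sketch of the whole flow (jar, class, function, table, and column names are illustrative):

    add jar my_udfs.jar;
    create function str_clean as 'com.example.udf.StrClean' using 'my_udfs.jar';
    select str_clean(remark) from ods_order_log where dt='20200601' limit 10;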
Scenario 4: run data query / data processing jobs with MR
For MR jobs, you first write and package the MR program in your development environment, register it as a resource in the project by executing "add jar" in odpscmd, then run the MR job from the command line and collect the results.
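A hedged sketch of the command sequence (the jar name, local path, main class, and input/output table names are illustrative):

    add jar wordcount_mr.jar;
    jar -resources wordcount_mr.jar -classpath /local/path/wordcount_mr.jar com.example.mr.WordCount wc_in wc_out;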
V. Common problems
Finally, here are a few common problems. The first is passing dynamic parameters: because odpscmd is a command-line tool, it is usually invoked from shell, which raises the question of how to pass in dynamic parameters; in practice, dynamic information can be passed in indirectly through the shell, as in the earlier scripts. Another common problem is that "running the odpscmd command manually works, but calling it from a shell script reports an error." In that case, check whether the Java environment variables are set inside the shell script.
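For the second problem, a typical fix is to set the Java environment at the top of the script, for example (the JDK and odpscmd paths are illustrative):

    #!/bin/bash
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
    export PATH="$JAVA_HOME/bin:$PATH"
    /path/to/odpscmd -e "show tables;"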
This is how odpscmd is used in MaxCompute, as shared by the editor. If you happen to have similar questions, you may refer to the analysis above. If you want to learn more, you are welcome to follow the industry information channel.