SQL is something like the "Mandarin" of data processing and analysis: it is almost a required skill. But using SQL is inseparable from a relational database system (RDBMS), just as Mandarin is mainly spoken in China. Of course, learning Python or Hadoop to go job hunting is like learning English in order to go to the UK or the US: a worthwhile investment. But if all you want to do is analyze ordinary text data or Excel tables, that is more like visiting a small country such as Kiribati: do you really have to start memorizing vocabulary and grammar just to go shopping happily? At a time like this, the first thing that comes to mind is probably to go find a good translator.
For this problem, the aggregator is exactly that kind of indispensable "translator"!
In fact, using SQL to handle well-structured text or Excel tabular data is not just a lazy idea; it is a very natural way of thinking. A file or sheet consists of rows of data, and each row is either a set of items separated by a fixed delimiter (space, comma, tab, and so on) or a set of items each occupying a fixed width. This representation is almost identical to a table in a relational database, and even the variable-length and fixed-length fields look alike. The differences are that a file has no concept of primary key, data type, or nullability, and that the relationships between files are not described as explicitly as in a database: they usually exist only as business rules or experience, in the user's head or in documents written for people to read.
The aggregator works in exactly this way: it maps a file to a "table" by automatically parsing structured text or Excel files, and on that basis it fully supports SQL syntax and functions.
All right, enough small talk; let's get to the point. Taking two related files as examples, let's see how to query and analyze them easily, without going through "install a database -> create tables -> import data":
First, take a look at the sample data. There are two files: employee information (employee.txt) and basic state information (state.xlsx). Note that these are two different kinds of files: one is formatted TXT text, the other is an Excel spreadsheet. In other words, the aggregator can connect to different types of data sources at the same time. Surprised?
Even better, the aggregator automatically identifies and reads four file types according to the file suffix: text (txt), Excel (xls, xlsx), and csv files.
The following two figures show sample data for employee information and state information. The two files are associated through the STATE item in the employee information (column 5) and the STATEID item in the state information (column 1).
Employee information data sample:
Sample of State Information data:
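The original screenshots are not reproduced here, so the following is only a rough sketch of the two layouts, inferred from the fields referenced later in this article (the column order beyond what the text states is an assumption, and no real rows are shown):

employee.txt (delimited text; STATE is column 5)
EID  NAME  SURNAME  GENDER  STATE  BIRTHDAY  DEPT  SALARY

state.xlsx (Excel sheet; STATEID is column 1)
STATEID  NAME  POPULATION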
All right, let's get to work. Start with the simplest single-table query: find the female employees (GENDER='F') whose salary is greater than 10000 (SALARY > 10000), with the output sorted by employee number (EID). The aggregator code is as follows:
A1=connect()
A2=A1.query("select * from c:/sql/employee.txt where gender='F' and salary > 10000 order by eid")
Yes, it is that simple, and that familiar! Step 1 connects to the data source; since no parameter is specified here, it connects directly to the file system. Step 2 uses the query() function to execute the SQL, and the SQL here is exactly the same as a database query except that the table name after from is replaced by a file name! The query results are as follows:
Note that on Windows, file paths in the aggregator use a slash "/" rather than a backslash "\", which is consistent with the Java language.
See how similar it is? Now let's look at employees with salaries greater than 10000 who were born no earlier than January 1, 1980.
A1 $()select * from c:/sql/employee.txt where BIRTHDAY >= date('1980-01-01') and SALARY > 10000
Quite simply, using $() is equivalent to calling the connect() function, and the SQL is written directly after it. In fact, different data source names can be written inside the parentheses to connect to multiple data sources at the same time.
In addition, this example uses the SQL string-to-date conversion function date().
Next comes the key capability that distinguishes a SQL database from a single file: the join query. For the female employees who earn more than 10000, let's see which state each of them is in:
A1=connect()
A2=A1.query("select t1.eid eid,t1.name name,t1.gender gender,t2.name state,t2.population population,t1.salary salary from c:/sql/employee.txt t1 left join c:/sql/state.xlsx t2 on t1.state=t2.stateid where t1.gender='F' and t1.salary > 10000")
Well, replacing table names with file names makes the SQL a bit long, so we used aliases. The result is as follows:
Besides using aliases instead of absolute file paths, for very long paths or a large number of files you can also configure a home directory in the aggregator menu (Tools > Options), so that file names or relative paths can be used directly in the SQL, which is both easier to write and clearer to read. Doesn't that feel like specifying a database and accessing its tables directly?
The configuration method is shown in the following figure:
After the home directory is configured, a query looks like this. Query the headcount and total salary of each department whose total payroll is greater than 100000:
A1=connect()
A2=A1.query("select dept,count(1) c,sum(salary) s from employee.txt group by dept having sum(salary) > 100000")
The query results are as follows:
Next, let's go into some details:
1) The aggregator supports logical operators such as and, or, and not. For example, query the male employees outside the Sales department whose surname is Smith or Robinson:
A1=connect()
A2=A1.query("select * from employee.txt where (surname='Smith' or surname='Robinson') and gender='M' and not dept='Sales'")
2) The aggregator supports using is null to test whether a value is empty and is not null to test whether it is not empty. For example, to find employees whose surname is empty:
A1=connect()
A2=A1.query("select EID,NAME,SURNAME from employee.txt where surname is null")
The coalesce function is also supported for handling null values. For example, if an employee's surname field is empty, display it as "UNKNOWN" in the result:
A1=connect()
A2=A1.query("select EID,NAME,SURNAME,coalesce(SURNAME,'UNKNOWN') as SURNAME1 from employee.txt")
The query results are as follows:
Note: a field alias in the aggregator cannot duplicate a field name in the file.
3) The aggregator supports case when. For example, if the gender field is "F" it should be displayed as "female", and if it is "M" it should be displayed as "male".
A1=connect()
A2=A1.query("select EID,NAME,GENDER,(case gender when 'F' then 'female' else 'male' end) as GENDER1 from employee.txt")
The query results are as follows:
4) The aggregator supports fuzzy queries with the like keyword. For example, query the employees whose surname field contains "son":
A1=connect()
A2=A1.query("select * from employee.txt where surname like '%son%'")
Where "%" is a wildcard character that represents one or more characters. In addition, "_" represents a character. If the query ends with "son" and is preceded by three characters, you can write surname like'_ _ son'; "[WJ]" to indicate a list of characters containing "W" and "J". Surname like'[WJ]% 'indicates that the surname begins with "W" or "J". Surname like'[! WJ]% 'indicates that the surname does not start with "W" or "J".
5) The aggregator supports querying against multiple values with the in keyword. For example, query the employees of the three departments Finance, Sales, and R&D:
A1=connect()
A2=A1.query("select * from employee.txt where dept in ('Finance','Sales','R&D')")
6) The aggregator supports defining an external table through with T as (...). For example, left join the STATE field in employee.txt with the STATEID field of the states table in another data source (the demo database) to find the state name and population for each employee:
A1=connect()
A2=A1.query("with T2 as (connect(\"demo\").query(\"select * from states\")) select t1.eid eid,t1.name name,t1.gender gender,t2.name state,t2.population population,t1.salary salary from employee.txt t1 left join T2 on t1.STATE=t2.STATEID")
In this SQL:
with T2 as (connect(\"demo\").query(\"select * from states\")) defines an external table T2, which connects to the demo data source (in fact, the hsql demo database that comes with the aggregator) and executes the SQL "select * from states" with the query function. (The \" is a double quote escaped inside a string.)
The following "select t1.eid … left join T2 on t1.STATE=t2.STATEID" uses the defined T2 and employee.txt left links to find out the name and population of each employee's state.
This query is a typical federated query across a database and a text file. In fact, the with keyword can define data retrieved from all kinds of data sources, so it is a very flexible way to implement federated queries across heterogeneous data sources.
7) The aggregator supports writing query results to a file with into. For example, query the headcount and total salary of departments whose total payroll is greater than 100000, and write the results into deptResult.xlsx. Here the new file plays a role similar to a newly created table in a relational database.
A1=connect()
A2=A1.query("select dept,count(1) c,sum(salary) s into deptResult.xlsx from employee.txt group by dept having sum(salary) > 100000")
Having seen all of this, it is clear that with the aggregator we can use SQL easily and directly on structured text data (txt, csv, etc.) and Excel files (xls, xlsx).
Of course, the aggregator does not "translate" every SQL capability. Subqueries in SQL are not supported directly; instead they are handled with a more flexible, convenient, and intuitive step-by-step calculation. And for some special join calculations, the aggregator is slower than a traditional database.
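As a rough sketch of what that step-by-step alternative looks like (the requirement and the record access A2(1).a are my own illustration, not taken from the article): instead of a subquery such as select * from employee.txt where SALARY > (select avg(SALARY) from employee.txt), the average can be computed first and then passed in as a parameter:

A1=connect()
A2=A1.query("select avg(SALARY) a from employee.txt")
A3=A1.query("select * from employee.txt where SALARY > ?", A2(1).a)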
Finally, let's look at the additional benefits you can get from doing SQL calculations through the aggregator:
1) Dynamic calculation according to input parameters
When querying data, we often need to calculate according to different conditions, which is what we call dynamic execution. For this we can define "grid parameters" to reserve places for the conditions that may change. For example, suppose we want to find the higher-paid young employees in the company, but the age cutoff and the salary threshold are not fixed. We can define two parameters, birthday and salary, in the aggregator IDE menu "Program / Grid parameters":
Then use the placeholder "?" in the query statement when writing the SQL, and specify the corresponding grid parameter names as inputs in order:
A1=connect()
A2=A1.query("select * from employee.txt where BIRTHDAY >= ? and SALARY > ?", birthday,salary)
If you give the grid parameters specific values when defining them and do not check "set parameters before each run", running the script uses those values directly. If "set parameters before each run" is checked, the "Set parameter values" window pops up each time the script is run, so we can enter whatever parameter values we need at the time, and the query results change accordingly:
2) Use SQL to query files on the command line
On Windows or Linux, we can also query file data directly by calling a pre-written aggregator script from the command line. Combined with the operating system's scheduled task mechanism, batch data calculations can be completed at a specified time.
Let's first look at an example that does not return a result set: regularly provide the finance department with the headcount and total salary of departments whose total payroll exceeds 100000, writing the results into deptResult.xlsx (which can then be sent to the relevant people by email or other means).
First, write the aggregator script and save it as deptResult.dfx.
A1=connect()
A2=A1.query("select dept,count(1) c,sum(salary) s into deptResult.xlsx from employee.txt group by dept having sum(salary) > 100000")
A3 >output("create deptResult.xlsx successfully!")
Then execute the esprocx.exe command (found in the bin folder of the aggregator installation directory) on the command line. The execution result:
C:\Program Files\raqsoft\esProc\bin>esprocx.exe deptResult.dfx
create deptResult.xlsx successfully!
The second line is the prompt printed by the output() function, which can be used to monitor program execution and to help with debugging.
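Picking up the scheduled-task idea mentioned above, on Windows this call could, for example, be registered with the Task Scheduler; the task name, time, and script path below are only illustrative:

schtasks /create /tn "DeptSalaryReport" /sc daily /st 06:00 /tr "\"C:\Program Files\raqsoft\esProc\bin\esprocx.exe\" C:\sql\deptResult.dfx"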
Now let's look at an example that returns a result set. The query requirement is the same, but instead of writing to a file we view the results directly. This time we name the aggregator script deptQuery.dfx.
A1=connect()
A2=A1.query("select dept,count(1) c,sum(salary) s from employee.txt group by dept having sum(salary) > 100000")
A3 return A2
Execute on the command line and view the results:
Furthermore, the aggregator can take a complete SQL statement written directly on the command line and return the query result from the file right away. Isn't that as convenient as a database's command-line query tool?
First define a parameter sql, which is used to pass in the SQL statement that needs to be queried.
Then write the following aggregator script and save it as query.dfx
A1=connect()
A2=A1.query(sql)
A3 return A2
When executing the command, write the SQL statement directly on the command line. The result looks like this:
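The screenshot of the result is not included here; assuming the SQL string is simply supplied as the value of the sql parameter after the script name (quoting rules depend on your shell), the call would look roughly like:

C:\Program Files\raqsoft\esProc\bin>esprocx.exe query.dfx "select * from employee.txt where salary > 10000"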
Combined with the parameter-based dynamic calculation described earlier, a degree of interaction is also possible with command-line calculation. Take the earlier example of finding the higher-paid young employees in the company:
In the aggregator IDE menu "Program / Grid parameters", define two parameters: birthday and salary.
Write the following aggregator script and save it as empQueryParam.dfx
A1=connect()
A2=A1.query("select * from employee.txt where BIRTHDAY >= ? and SALARY > ?", birthday,salary)
A3 return A2
When you execute the command, provide values for the two parameters in order, and the result is as follows:
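Again, the result screenshot is omitted; based on the description above, the parameter values follow the script name in the order the parameters were defined, something like this (the values are illustrative):

C:\Program Files\raqsoft\esProc\bin>esprocx.exe empQueryParam.dfx 1980-01-01 10000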
At this point we have a complete picture of how the aggregator is used: with the "diamond drill" of SQL we can take on the "porcelain work" of data files. In fact, in this story the aggregator is the real diamond drill! Beyond treating data files directly as "tables" as described in this article, the aggregator's arsenal is far larger. With this lightweight data analysis tool, data in databases and in file systems alike can be processed easily and messy jobs sorted out quickly!