In this issue, the editor introduces the DolphinDB scripting language for big data analysis. The article is rich in content and written from a professional point of view; I hope you gain something from reading it.
To develop big data applications, we need not only a distributed database that can hold massive data and a distributed computing framework that makes efficient use of multiple cores and nodes, but also a programming language that integrates organically with the distributed database and the distributed computing framework, delivers high performance, is easy to extend, has strong expressive power, and meets the needs of rapid development and modeling. DolphinDB drew inspiration from the popular Python and SQL languages to design its scripting language for big data processing.
When it comes to database languages, the standard SQL language comes to mind first. Unlike standard SQL, the DolphinDB programming language is feature-complete and highly expressive, and it smoothly supports multiple programming paradigms: imperative programming, vectorized programming, functional programming, SQL programming, remote procedure call (RPC) programming, metaprogramming, and more. Its syntax and idioms are very similar to those of Python and SQL, so anyone with some knowledge of Python and SQL can pick it up easily. By comparison, mastering the Q language of the in-memory time-series database kdb+ is much harder.
The DolphinDB programming language meets data scientists' needs for rapid development and modeling. It is concise, flexible, and highly expressive, which greatly improves development efficiency. DolphinDB supports vectorized computing and distributed computing and runs extremely fast. The distinctive features of the language are described in detail below.
1. Imperative programming
Like mainstream scripting languages such as Python and JavaScript, and strongly typed compiled languages such as C, C++, and Java, DolphinDB supports imperative programming, that is, reaching the final goal by executing one statement after another. In DolphinDB, imperative code is mainly used for processing and scheduling in upper-level modules. In big data analysis, the volume of data to be processed is huge, so processing data row by row in imperative style is very inefficient and degrades performance; we therefore recommend the other programming paradigms in DolphinDB for batch data processing.
// DolphinDB supports assignment to single and multiple variables
x = 1 2 3
y = 4 5
y += 2
x, y = y, x      // swap the values of x and y
x, y = 1 2 3, 4 5

// cumulative sum from 1 to 100
s = 0
for (x in 1:101) s += x
print s

// sum of the elements of an array
s = 0
for (x in 1 3 5 9 15) s += x
print s

// print the mean of each column of a matrix
m = matrix(1 2 3, 4 5 6, 7 8 9)
for (c in m) print c.avg()

// calculate the sales of each product in the product table
t = table(["TV set", "Phone", "PC"] as productId, 1200 600 800 as price, 10 20 7 as qty)
for (row in t) print row.productId + ": " + row.price * row.qty
2. Vectorized programming
Like MATLAB and R, DolphinDB supports vectorized programming. The Q language of the kdb+ database mentioned earlier is also a vector language, and it shows good performance and high efficiency in complex computations. DolphinDB optimizes many vectorized algorithms, for example sliding-window computations over time-series data, which greatly improves the efficiency of its vector functions.
// add two vectors of length 10 million; vectorized code is more concise
// and far faster than an imperative for statement
n = 10000000
a = rand(1.0, n)
b = rand(1.0, n)

// imperative programming with a for statement takes about 12 seconds
c = array(DOUBLE, n)
for (i in 0:n) c[i] = a[i] + b[i]
Time elapsed: 12341.043 ms

// vectorized programming takes only about 37 milliseconds
c = a + b
Time elapsed: 36.901 ms
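The sliding-window optimization mentioned above can be seen in the built-in moving functions, which maintain each window incrementally instead of recomputing it from scratch. A minimal sketch using mavg (illustrative code, not from the original article):

// moving average over a 20-element sliding window, in one vectorized call
x = rand(1.0, 10000000)
ma = mavg(x, 20)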
Vectorized programming usually loads an entire vector into contiguous memory. Sometimes, because of memory fragmentation, no contiguous block can be found and the vector becomes unusable. DolphinDB provides the big array data type specifically for this problem: a big array composes physically discontinuous memory blocks into a logically contiguous vector, so even very large vectors remain usable in DolphinDB, which improves the availability of the system.
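A minimal sketch of the big array type, assuming the bigarray constructor takes a data type, an initial size, and a reserved capacity; the sizes here are illustrative:

// behaves like an ordinary vector, but is backed by physically
// discontinuous memory blocks, so no large contiguous allocation is needed
v = bigarray(DOUBLE, 0, 2000000000)
v.append!(rand(1.0, 10000000))
v.avg()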
3. Functional programming
DolphinDB supports most features of functional programming, including pure functions, user-defined functions, lambda functions, higher-order functions, partial application, and closures. It also provides more than 400 built-in functions covering a variety of data types, data structures, and system calls.
The pure-function nature of DolphinDB reduces the side effects of functions: a user-defined function cannot reference variables defined outside its own body. This restriction greatly improves code readability and software quality.
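A minimal sketch of this restriction (illustrative code, not from the original article):

a = 5
def addA(x) { return x + a }     // error: a is defined outside the function and is not visible
def addFive(x) { return x + 5 }  // fine: the function depends only on its own parameter
addFive(3)                       // returns 8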
3.1 Custom function
// define a function that returns the weekdays among the given dates
def getWorkDays(dates) {
    return dates[def(x): weekday(x) between 1:5]
}
getWorkDays(2018.07.01 2018.08.01 2018.09.01 2018.10.01)
[2018.08.01, 2018.10.01]
The example above defines a function getWorkDays that takes a vector of dates and returns those falling between Monday and Friday. The implementation uses vector filtering: the vector accepts a Boolean unary function to filter its elements.
3.2 Higher-order functions
In the following example, based on one day of simulated tick-level stock quotes, we use three higher-order functions, pivot, each, and cross, to compute the pairwise correlations between stocks in just three lines of code.
// generate 10 million data points (stock symbol, trading time, and price)
n = 10000000
syms = rand(`FB`GOOG`MSFT`AMZN`IBM, n)
time = 09:30:00.000 + rand(21600000, n)   // random times within the trading day
price = 500.0 + rand(500.0, n)

// use the pivot function to generate a pivot table:
// one row per minute, one column per stock
priceMatrix = pivot(avg, price, time.minute(), syms)

// each together with ratios: generate the per-minute return series
// of each stock (each column of the matrix)
retMatrix = each(ratios, priceMatrix) - 1

// cross together with corr: calculate the pairwise correlations between stocks
corrMatrix = cross(corr, retMatrix, retMatrix)

     AMZN      FB        GOOG      IBM       MSFT
     --------- --------- --------- --------- ---------
AMZN|1         0.015181  -0.056245 0.005822  0.084104
FB  |0.015181  1         -0.028113 0.034159  -0.117279
GOOG|-0.056245 -0.028113 1         -0.039278 -0.025165
IBM |0.005822  0.034159  -0.039278 1         -0.049922
MSFT|0.084104  -0.117279 -0.025165 -0.049922 1
3.3 Partial application
The function passed to a higher-order function usually has to satisfy constraints on its parameters, and partial application lets us fix some parameters in advance so that the constraints are met. For example, given a vector a = 12 14 18, we want its correlation with each column of a matrix. Since every column must be processed, the higher-order function each is a natural choice; but corr takes two arguments, and iterating over the matrix supplies only one of them, so the other must be fixed in advance, which is exactly what partial application does. We could also solve this with a for statement, but the code would be longer and less efficient.
a = 12 14 18
m = matrix(5 6 7, 1 3 2, 8 7 11)

// use each and partial application to calculate the correlation
// between each column of the matrix and the given vector
each(corr{a}, m)

// solve the same problem with a for statement: longer and less efficient
cols = m.columns()
c = array(DOUBLE, cols)
for (i in 0:cols)
    c[i] = corr(a, m[i])
Another use of partial application is to keep a function in a given state. In stream processing, for example, the user usually supplies a message handler that accepts a new message and returns a result. Suppose we want the handler to return the running average of all the numbers received so far; partial application solves this, as shown below.
def cumavg(mutable stat, newNum) {
    stat[0] = (stat[0] * stat[1] + newNum) / (stat[1] + 1)
    stat[1] += 1
    return stat[0]
}
msgHandler = cumavg{0.0 0.0}
each(msgHandler, 1 2 3 4 5)
4. SQL programming
The DolphinDB programming language not only supports standard SQL, but also extends SQL for time-series data with features such as grouped computation (context by), data pivoting (pivot by), window functions, asof join, and window join, making time-series analysis much more convenient. A pure SQL engine has limited expressive power and struggles with more complex data analysis and algorithm implementation, which hurts development efficiency; in DolphinDB, the scripting language and the SQL language are fully integrated.
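Among the extensions just listed, pivot by rearranges a query result into a matrix-like layout. A hedged sketch, assuming a hypothetical trades table with sym, time, and price columns:

// one row per minute, one column per stock symbol, each cell an average price
select avg(price) from trades pivot by time.minute(), sym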
4.1 Integration of SQL and programming languages
// generate an employee wage table
emp_wage = table(take(1..10, 100) as id, take(2017.10M + 1..10, 100).sort() as month, take(5000 5500 6000 6500, 100) as wage)

// calculate the average wage of a given group of employees;
// the employee list is stored in a local variable empIds
empIds = 3 4 6 7 9
select avg(wage) from emp_wage where id in empIds group by id

id avg_wage
-- --------
3  5500
4  6000
6  6000
7  5500
9  5500

// display the employee's name in addition to the average wage;
// names are obtained from a dictionary, empName
empName = dict(1..10, `Alice`Bob`Jerry`Jessica`Mike`Tim`Henry`Anna`Kevin`Jones)
select empName[first(id)] as name, avg(wage) from emp_wage where id in empIds group by id

id name    avg_wage
-- ------- --------
3  Jerry   5500
4  Jessica 6000
6  Tim     6000
7  Henry   5500
9  Kevin   5500
In the example above, the where clause and the select clause of the SQL statements directly use an array and a dictionary defined in the surrounding context, so problems that would otherwise require subqueries and multi-table joins are solved with simple hash table lookups. If the SQL involves a distributed database, these context variables are automatically serialized to the nodes that need them. This not only makes the code cleaner and more readable, but also improves performance. In big data analysis, table associations are frequent, and even with heavy work by the SQL optimizer they inevitably bring performance problems.
4.2 context by: friendly support for panel data
DolphinDB provides context by, which is similar to the window functions of other database systems. Compared with window functions, however, context by has a more concise syntax and fewer restrictions, and it can be used with either select or update.
// group by stock symbol to calculate each stock's daily return,
// assuming the data is arranged in chronological order
update trades set ret = ratios(price) - 1.0 context by sym

// group by date to rank each stock's return within each day in descending order
select date, symbol, ret, rank(ret, false) + 1 as rank from trades where isValid(ret) context by date

// select the top 10 stocks by return for each day
select date, symbol, ret from trades where isValid(ret) context by date having rank(ret, false) < 10
4.3 asof join and window join: friendly support for time-series data
t1 = table(09:30m 09:31m 09:33m 09:34m as minute, 29.2 28.9 29.3 30.1 as price)
t2 = table(09:30m 09:31m 09:34m 09:36m as minute, 51.2 52.4 51.9 52.8 as price)
select * from aj(t1, t2, `minute)

minute price t2_minute t2_price
------ ----- --------- --------
09:30m 29.2  09:30m    51.2
09:31m 28.9  09:31m    52.4
09:33m 29.3  09:31m    52.4
09:34m 30.1  09:34m    51.9
In the example above, t2 has no record for minute 09:33m, so asof join (aj) takes the record in t2 with the latest time before 09:33m, namely the 09:31m record.
p = table(1 2 3 as id, 2018.06M 2018.07M 2018.07M as month)
s = table(1 2 1 2 1 2 as id, 2018.04M 2018.04M 2018.05M 2018.05M 2018.06M 2018.06M as month, 4500 5000 6000 5000 6000 4500 as wage)
select * from wj(p, s, -3:-1, <avg(wage)>, `id`month)

id month    avg_wage
-- -------- -----------
1  2018.06M 5250
2  2018.07M 4833.333333
3  2018.07M
The example above illustrates window join (wj). wj first takes the first row of table p, that is, id=1, month=2018.06M; it then selects the records in table s with id=1 and month between (2018.06M - 3) and (2018.06M - 1), that is, 2018.03M to 2018.05M, and computes avg(wage) over them. Hence avg_wage = (4500 + 6000)/2 = 5250. The remaining rows are processed in the same way.
asof join and window join are widely used in financial analysis. A classic application is to join a trades table with a quotes table to calculate the transaction cost of individual stocks; for details, see the tutorial on using window join to quickly estimate the transaction costs of individual stocks.
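A hedged sketch of that application, assuming hypothetical trades and quotes tables with sym, time, price, bid, and ask columns:

// for each trade, aj picks the latest quote at or before the trade time;
// cost is measured as the deviation of the trade price from the mid quote
select avg(abs(price - (bid + ask) / 2.0)) as cost from aj(trades, quotes, `sym`time) group by sym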
4.4 Other SQL extensions
To meet the requirements of big data analysis, DolphinDB extends SQL in many other ways. For example, user-defined functions can be used directly in SQL without compilation, packaging, or deployment. As another example, DolphinDB supports composite columns (Composite Column), which can output the multiple return values of a complex analysis function into a single row of a table.
factor1 = 3.2 1.2 5.9 6.9 11.1 9.6 1.4 7.3 2.0 0.1 6.1 2.9 6.3 8.4 5.6
factor2 = 1.7 1.3 4.2 6.8 9.2 1.3 1.4 7.8 7.9 9.9 9.3 4.6 7.8 2.4 8.7
t = table(take(1 2 3, 15).sort() as id, 1..15 as y, factor1, factor2)

// output the regression statistics together with the parameters,
// using a custom function to wrap the output
def myols(y, x) {
    r = ols(y, x, true, 2)
    return r.Coefficient.beta join r.RegressionStat.statistics[0]
}
select myols(y, [factor1, factor2]) as `alpha`beta1`beta2`R2 from t group by id

id alpha     beta1     beta2     R2
-- --------- --------- --------- --------
1  1.063991  -0.258685 0.732795  0.946056
2  6.886877  -0.148325 0.303584  0.992413
3  11.833867 0.272352  -0.065526 0.144837
5. Remote procedure call programming
Compared with other systems, DolphinDB's advantages in remote procedure calls (RPC) are twofold. First, both user-defined and built-in functions can be sent to other nodes for execution through a remote procedure call, whereas other systems cannot remotely invoke anything that involves user-defined functions. Second, DolphinDB's remote procedure calls require no compilation or deployment: the system automatically serializes the relevant function definitions and the required data to the remote node. Data scientists and analysts writing RPC-related code therefore do not need engineers to compile and deploy it, and can use it online directly, which greatly improves development and analysis efficiency.
The following example uses remoteRun to execute a function on a remote node:
h = xdb("localhost", 8081)

// execute a script on the remote node
remoteRun(h, "sum(1 3 5 7)")
16

// the above remote call can also be simplified as
h("sum(1 3 5 7)")
16

// execute a function registered on the remote node
h("sum", 1 3 5 7)
16

// execute a locally defined function on the remote node
def mysum(x): reduce(+, x)
h(mysum, 1 3 5 7)
16

// create a shared table sales on the remote node (localhost:8081)
h("share table(2018.07.02 2018.07.02 2018.07.03 as date, 1 2 3 as qty, 10 15 7 as price) as sales")

// if a local custom function has dependencies, the dependent
// custom functions are also serialized to the remote node
defg salesSum(tableName, d): select mysum(price*qty) from objByName(tableName) where date=d
h(salesSum, "sales", 2018.07.02)
40
DolphinDB also provides functions for distributed computing: mr and imr are used to develop distributed algorithms based on map-reduce and on iterative map-reduce, respectively. The user only needs to specify a distributed data source and the core functions, such as the map function, reduce function, and final function. Below we first create a distributed table and load some simulated data, then demonstrate developing linear regression and median computations on it.
// simulate and generate the distributed table sample, partitioned by id
// y = 0.5 + 3x1 - 0.5x2
n = 10000000
x1 = pow(rand(1.0, n), 2)
x2 = norm(3.0, 1.0, n)
y = 0.5 + 3 * x1 - 0.5 * x2 + norm(0.0, 1.0, n)
t = table(rand(10, n) as id, y, x1, x2)

login(`admin, "123456")
db = database("dfs://testdb", VALUE, 0..9)
db.createPartitionedTable(t, "sample", "id").append!(t)
Using the user-defined map function myOLSMap, the built-in reduce function add (+), the user-defined final function myOLSFinal, and the built-in map-reduce framework function mr, we can quickly build a function myOLSEx that runs linear regression on a distributed data source.
def myOLSMap(table, yColName, xColNames) {
    x = matrix(take(1.0, table.rows()), table[xColNames])
    xt = x.transpose()
    return xt.dot(x), xt.dot(table[yColName])
}

def myOLSFinal(result) {
    xtx = result[0]
    xty = result[1]
    return xtx.inv().dot(xty)[0]
}

def myOLSEx(ds, yColName, xColNames) {
    return mr(ds, myOLSMap{, yColName, xColNames}, +, myOLSFinal)
}

// calculate the linear regression coefficients with the distributed
// algorithm developed above and a distributed data source
sample = loadTable("dfs://testdb", "sample")
myOLSEx(sqlDS(<select * from sample>), `y, `x1`x2)
[0.4991, 3.0001, -0.4996]

// calculate the coefficients with the built-in function ols on
// undistributed data; the result is the same
ols(y, [x1, x2], true)
[0.4991, 3.0001, -0.4996]
In the following example, we construct an algorithm that computes the approximate median of a column over a distributed data source. The basic idea is to use the bucketCount function on each node to count how many data points fall into each of a set of buckets, and then sum the counts across nodes to determine which interval the median falls into. If that interval is not yet small enough, it is subdivided further, until its width is below the given precision. The median algorithm requires multiple iterations, so we use the iterative computing framework imr.
def medMap(data, range, colName): bucketCount(data[colName], double(range), 1024, true)

def medFinal(range, result) {
    x = result.cumsum()
    index = x.asof(x[1025] / 2.0)
    ranges = range[1] - range[0]
    if (index == -1)
        return (range[0] - ranges*32) : range[1]
    else if (index == 1024)
        return range[0] : (range[1] + ranges*32)
    else {
        interval = ranges / 1024.0
        startValue = range[0] + (index - 1) * interval
        return startValue : (startValue + interval)
    }
}

def medEx(ds, colName, range, precision) {
    termFunc = def(prev, cur): cur[1] - cur[0] <= precision
    // the midpoint of the final interval approximates the median
    return imr(ds, range, medMap{,,colName}, +, medFinal, termFunc).avg()
}
The DolphinDB programming language was born for data analysis: it handles massive volumes of data natively, and it is both powerful and easy to use.
The above is the editor's walkthrough of how to analyze the DolphinDB scripting language in big data. If you happen to have similar doubts, you may refer to the analysis above. If you want to know more about it, you are welcome to follow the industry information channel.