This article introduces map-reduce in Clojure through PigPen: what it is, how to write and unit test queries with it, and how it works under the hood.
What is PigPen?
A map-reduce language that looks and feels a lot like clojure.core
Write map-reduce queries as programs, not as scripts
Strong support for unit testing and iterative development
Note: if you are not familiar with Clojure, we highly recommend that you try the tutorials here, here or here to learn some basics.
Is it really another map-reduce language?
If you know Clojure, you already know PigPen.
The main goal of PigPen is to take the language out of the equation. PigPen's operators are designed to be as close as possible to their Clojure counterparts, and there are no special user-defined functions (UDFs). You simply define functions (anonymous or named) and use them the same way you would in a regular Clojure program.
Here is a common example of word count:
(require '[pigpen.core :as pig])

(defn word-count [lines]
  (->> lines
       (pig/mapcat #(-> % first
                        (clojure.string/lower-case)
                        (clojure.string/replace #"[^\w\s]" "")
                        (clojure.string/split #"\s+")))
       (pig/group-by identity)
       (pig/map (fn [[word occurrences]] [word (count occurrences)]))))
This code defines a function that returns a PigPen query expression. The query takes a sequence of lines as input and returns the number of occurrences of each word. Notice that this is just the word-count logic; it says nothing about anything external, such as where the data comes from or where the output goes.
Can you compose it?
Absolutely. PigPen queries are written as compositions of functions: data in, data out. You write the logic once and never have to copy and paste it everywhere.
Now let's use the word-count function defined above, plus the load and store commands, to form a PigPen query:
(defn word-count-query [input output]
  (->>
    (pig/load-tsv input)
    (word-count)
    (pig/store-tsv output)))
This function returns the PigPen representation of the query. By itself it does nothing; we still have to execute it locally or generate a script from it (more on that later).
You like unit testing? We do too.
With PigPen, you can mock out input data to write unit tests for your queries. No more crossing your fingers and guessing what will happen when you submit to the cluster, and no more truncating files to test sample input and output.
Mocking data is really easy. With pig/return and pig/constantly, you can inject arbitrary data as the starting point of your script.
A common pattern is to use pig/take to sample a few rows from the actual data source and wrap the result with pig/return to get mock data.
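As a rough sketch of that pattern (illustrative only; the file name and row count are placeholders), sampling a couple of real rows and turning them into mock input might look like this:

(def sample-data
  (->> (pig/load-tsv "input.tsv")   ; real data source
       (pig/take 2)                 ; sample a couple of rows
       (pig/dump)                   ; run locally to get Clojure data back
       (pig/return)))               ; wrap the rows back up as mock input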
(use 'clojure.test)

(deftest test-word-count
  (let [data (pig/return [["The fox jumped over the dog."]
                          ["The cow jumped over the moon."]])]
    (is (= (pig/dump (word-count data))
           [["moon" 1] ["jumped" 2] ["dog" 1] ["over" 2]
            ["cow" 1] ["fox" 1] ["the" 4]]))))
The pig/dump operator executes the query locally.
Closures
Wouldn't it be a pain to pass parameters to your query? Not at all: any variables in the scope of the enclosing function, or bound by a let, are available to use inside your functions.
(defn reusable-fn [lower-bound data]
  (let [upper-bound (+ lower-bound 10)]
    (pig/filter (fn [x] (< lower-bound x upper-bound)) data)))

Note that lower-bound and upper-bound are captured when the script is generated and are still available when the function executes on the cluster.

So how do I use it?

Just tell PigPen where to write a query as a Pig script:

(pig/write-script "word-count.pig"
  (word-count-query "input.tsv" "output.tsv"))

This gives you a Pig script that you can submit to the cluster. The script uses pigpen.jar, an uberjar that bundles all of the dependencies, so make sure that jar is submitted along with it. You can also package your whole project as an uberjar and submit that instead; just remember to rename it accordingly. See the tutorial for how to build an uberjar.

As we saw earlier, we can use pig/dump to run a query locally and get Clojure data back:

=> (def data (pig/return [["The fox jumped over the dog."]
                          ["The cow jumped over the moon."]]))
#'pigpen-demo/data
=> (pig/dump (word-count data))
[["moon" 1] ["jumped" 2] ["dog" 1] ["over" 2] ["cow" 1] ["fox" 1] ["the" 4]]
If you want to start now, please refer to getting started & tutorials.
Why do I need Map-Reduce?
Map-Reduce is useful for processing data that won't fit on a single machine. With PigPen, you can process huge amounts of data in much the same way you would process data locally. Map-Reduce works by spreading the data across potentially thousands of cluster nodes, each of which processes a small portion, all in parallel, so a job finishes far faster than it would on one machine. Operations such as join and group, which require coordinating data across the whole dataset, work by partitioning the data on a common key so that every record with the same key value ends up on the same machine. Once all the possible values for a key are on one machine, you can perform the join or do whatever else is interesting.
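As a purely illustrative sketch (plain Clojure, not how Hadoop actually assigns partitions), routing records to nodes by hashing the join key looks something like this:

(defn node-for [join-key n-nodes]
  ;; every record with the same key hashes to the same node
  (mod (hash join-key) n-nodes))

(node-for 42 8)   ; all records with key 42 land on the same node
(node-for 42 8)   ; deterministic: the same node every time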
If you want to see how PigPen does joins, take a look at pig/cogroup. cogroup takes any number of datasets and groups them by a common key. Suppose we have data like this:
foo:
  {:id 1, :a "abc"}
  {:id 1, :a "def"}
  {:id 2, :a "abc"}

bar:
  [1 42]
  [2 37]
  [2 3.14]

baz:
  {:my_id "1", :c [1 2 3]}
If you want to group according to id, you can do this:
(pig/cogroup (foo by :id)
             (bar by first)
             (baz by #(-> % :my_id Long/valueOf))
             (fn [id foos bars bazs] ...))
The first three arguments are the datasets to join, each paired with a function that selects the key from that source. The last argument is a function that combines the grouped results. In our example, that function is called twice:
[1 ({:id 1, :a "abc"} {:id 1, :a "def"})
   ([1 42])
   ({:my_id "1", :c [1 2 3]})]
[2 ({:id 2, :a "abc"})
   ([2 37] [2 3.14])
   ()]
This brings together all of the values with an id of 1 and all of the values with an id of 2, and different key values can be processed on different machines independently. By default a key does not have to be present in every source, but there are options to require that it is.
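Purely as an illustration (reusing the call above with a hypothetical combining function), the grouped collections could, for example, be reduced to per-key counts:

(pig/cogroup (foo by :id)
             (bar by first)
             (baz by #(-> % :my_id Long/valueOf))
             (fn [id foos bars bazs]
               [id (count foos) (count bars) (count bazs)]))
;; for the data above this would be expected to yield [1 2 1 1] and [2 1 2 0]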
Hadoop provides the low-level interface for running map-reduce jobs, but even so it is limited: it runs only one round of map-reduce at a time, with no concept of data flow or complex queries. Pig adds a layer of abstraction on top of Hadoop, but so far it is still just a scripting language, and you still need UDFs to do anything interesting with your data. PigPen takes the abstraction one step further and turns map-reduce into a language.
If you are new to map-reduce, we recommend you take a look here.
The motivation behind PigPen
**Code reuse.** We want to be able to define a piece of logic once and apply it to different jobs by passing in parameters.
**Consolidation.** We don't want to write UDFs in scripts in a different language, or have to think about how data types map between languages.
**Code organization.** We want to organize our code across multiple files however it makes sense, not constrained by which job the file belongs to.
**Unit testing.** We want our sample data to live next to our unit tests, and we want our unit tests to exercise business logic without touching real data.
**Fast iteration.** We want to be able to inject mock data at any point, and to test a query without waiting for a JVM to start.
**Name only what you want to name.** Most map-reduce languages require you to name and specify schemas for intermediate results, which makes it hard to test individual jobs with mock data. We want to organize our business logic and name things where it makes sense to us, not where the language dictates.
We're tired of writing scripts. We want to write programs.
Note: PigPen is not a Clojure wrapper for writing Pig scripts; the scripts it generates are unlikely to be readable by humans.
Design and features
The design of PigPen stays as close to Clojure as possible. Map-Reduce is functional programming, so why not take advantage of an existing, powerful functional language? Not only does this keep the learning curve low, most of the concepts also translate more naturally to big data.
In PigPen, a query is treated as an expression tree. Each operator is represented by a map describing the desired behavior, and these maps nest together to form the tree expression of a complex query; each command holds a reference to its ancestor commands. When executed, the query tree is converted into a directed acyclic query graph. This makes it easy to merge duplicate commands, optimize the order of related commands, and attach debug information to the query.
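Purely to illustrate the idea (this is not PigPen's actual internal format), such a nested command map might look like:

{:type     :filter
 :pred     even?
 :ancestor {:type     :map
            :f        (fn [x] (* x x))
            :ancestor {:type     :load
                       :location "input.clj"}}}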
Optimization
Once we represent the query as a graph of operations, removing duplicate work becomes straightforward. Clojure gives us value equality: two objects are equal if their contents are the same. If two operations have the same representation, they are identical, so you don't have to worry about creating duplicate commands while writing a query; they are merged before execution.
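For example, in plain Clojure, two maps built independently but with the same contents are equal, which is what makes detecting identical sub-queries cheap:

(= {:type :load, :location "input.clj"}
   {:type :load, :location "input.clj"})
;; => true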
For example, suppose we have two queries:
(let [even-squares (->>
                     (pig/load-clj "input.clj")
                     (pig/map (fn [x] (* x x)))
                     (pig/filter even?)
                     (pig/store-clj "even-squares.clj"))
      odd-squares (->>
                    (pig/load-clj "input.clj")
                    (pig/map (fn [x] (* x x)))
                    (pig/filter odd?)
                    (pig/store-clj "odd-squares.clj"))]
  (pig/script even-squares odd-squares))
In this query we load data from a file, compute the square of each number, and split the results into even and odd numbers. The operation graph looks like this: (figure: the unoptimized query graph)
This matches our query, but it does a lot of extra work: we load input.clj twice and compute every square twice. That may not seem like much, but when you do it over a lot of data, simple operations add up. To optimize the query, we look for identical operations. At first glance the square operations look like candidates, but they have different parents, so they can't be merged yet. The two load commands, however, have no parents and load the same file, so they can be merged.
Now our graph looks like this: (figure: the graph after merging the two load commands)
Now we only load the data once, which saves some time, but we still compute the squares twice. Because there is now a single load command, the two map operations are identical and can be merged: (figure: the fully optimized graph)
This gives us an optimized query in which every operation is unique. Because we only ever merge one command at a time, we never change the logic of the query. You can generate queries freely without worrying about duplicated execution; PigPen will only execute the duplicated parts once.
Serialization
When we are done processing data with Clojure, the data has to be serialized into binary bytes so that Pig can move it between machines in the cluster. This is an expensive but necessary step for PigPen. Fortunately, a script often contains many consecutive operations that can be combined into one, which saves a lot of unnecessary serialization and deserialization. For example, any consecutive map, filter, and mapcat operations can be rewritten as a single mapcat operation.
Let's use some examples to illustrate:
In this example, we start with a serialized value (blue) 4, deserialize it (orange), execute our map function, and then serialize it.
Now let's try a slightly more complex (more realistic) example. In this example, we execute a map, a mapcat, and a filter function.
If you haven't used mapcat before: it is an operation that applies a function to a value and returns a sequence of values; that sequence is then flattened, and each resulting value is passed on to the next step. In Clojure it is the composition of map and concat; in Scala it is called flatMap, and in C# it is called SelectMany.
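For example, in plain Clojure:

(mapcat (fn [x] [(dec x) x (inc x)]) [16])
;; => (15 16 17)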
In the figure below, the flow on the left is the query before optimization and the one on the right is the query after. As in the first example, we start with 4 and square it, then apply a function that returns the value minus one, the value itself, and the value plus one. Pig takes that collection of values and flattens it, so each value becomes an input to the next step. Note that we have to serialize and deserialize whenever we hand data to Pig; as the figure shows, the data is serialized and deserialized between every pair of steps. The third and final step filters the data; in this case we keep only the odd values.
The figure on the right shows the optimized result. Each operation now returns a sequence of elements: the map operation returns a sequence containing the single element 16, the mapcat likewise returns a sequence, and the filter returns a sequence of zero or one elements. By making these commands uniform, we can merge them easily. We flatten more sequences of values within the combined command, but there is no serialization cost between the steps. Although each individual step is now more complex, the optimized version executes much faster.
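The following is a rough sketch in plain Clojure (not PigPen's internals) of why consecutive map, mapcat, and filter steps can be fused into a single mapcat: each step can be expressed as a function from one value to a sequence of values, and such functions compose.

(defn map->mapcat [f]
  ;; a map step, expressed as "value -> sequence of one value"
  (fn [x] [(f x)]))

(defn filter->mapcat [pred]
  ;; a filter step, expressed as "value -> sequence of zero or one values"
  (fn [x] (if (pred x) [x] [])))

(defn fuse [f g]
  ;; compose two mapcat-style steps into a single one
  (fn [x] (mapcat g (f x))))

(def fused
  (-> (map->mapcat (fn [x] (* x x)))           ; square
      (fuse (fn [x] [(dec x) x (inc x)]))      ; x-1, x, x+1
      (fuse (filter->mapcat odd?))))           ; keep only odd values

(fused 4)
;; => (15 17)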
Testing, local execution, and debugging
Interactive development, testing, and debuggability are key features of PigPen. If you have a job that runs for days at a time, the last thing you want is a bug that shows up eleven hours in. PigPen has a local execution mode based on rx (Reactive Extensions), which lets us write unit tests for our queries. That way we can be far more confident that a job won't blow up at runtime and will return the expected values. Even better, this feature makes interactive development possible.
Usually, we start by selecting a few records from the data source to unit test against. Because PigPen returns data in the REPL, we don't need to construct extra test data. Through the REPL we can apply map, filter, join, and reduce operations to the mock data as needed, verifying at each step that the result is what we expect. This approach produces more reliable code than writing one long script and hoping for the best. Another benefit is that complex queries can be broken into smaller functional units. Map-reduce queries can expand or contract the size of the data dramatically; when you test a script as a whole, you might have to read a huge amount of data only to end up with a handful of rows. By splitting the query into smaller units, you can test one unit that reads 100 rows and produces two, then use those two rows as a template to generate another hundred for testing the second unit.
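A rough sketch of that workflow, reusing the word-count pieces from earlier (the mock line is just a placeholder):

(def mock-lines (pig/return [["The fox jumped over the dog."]]))

;; check the tokenizing step on its own before adding group-by and counting
(pig/dump
  (pig/mapcat #(-> % first
                   (clojure.string/lower-case)
                   (clojure.string/split #"\s+"))
              mock-lines))
;; inspect the words this produces, then build up the next step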
Debug mode is useful for tracking down problems. When enabled, it writes the result of every operation in the script to disk, in addition to the normal output. This is very useful in an environment like Hadoop, where you can't step through the code and each step can take hours. Debug mode can also be combined with the graph visualization, so you can visually match the execution plan against the actual output of each operation.
To enable debug mode, refer to the options for pig/write-script and pig/generate-script, which will write additional debug output in the specified directory.
Example of enabling debug mode:
(pig/write-script {:debug "/debug-output/"} "my-script.pig" my-pigpen-query)
To enable visualization mode, take a look at pig/show and pig/dump&show.
Examples of visualization:
(pig/show my-pigpen-query)        ;; Shows a graph of the query
(pig/dump&show my-pigpen-query)   ;; Shows a graph and runs the query locally
Extending PigPen
A useful feature of PigPen is how easy it is to create your own operators. For example, we can define set and multiset operators such as difference and intersection. These are just variations on operators like cogroup, but being able to define them once, test them, and never think about the underlying logic again is what makes them valuable.
The same applies to more complex operations. For aggregating data we have reusable statistical operators such as sum, avg, min, max, sd, and quantiles, as well as a pivot operator that groups multidimensional data and counts each group.
These operations are individually simple, but once you abstract them out of your queries, your queries become much simpler. You can then spend your time thinking about how to attack the problem instead of re-implementing basic statistics every time.
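As a hedged sketch of what such a reusable operator might look like, built only from the PigPen operators shown earlier (the name sum-all is hypothetical and this is not the library's actual implementation):

(defn sum-all [data]
  ;; collapse every record into a single group, then add the values up
  (->> data
       (pig/group-by (constantly :all))
       (pig/map (fn [[_ values]] (reduce + values)))))

;; usage sketch:
;; (pig/dump (sum-all (pig/return [1 2 3 4])))  ; expected to yield 10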
Why use Pig?
We chose Pig because we didn't want to rewrite all of the optimization work that has gone into Pig. Setting the language aside, Pig does a good job of moving big data around. Our strategy is to use Pig's DataByteArray binary type to move serialized Clojure data. In most cases Pig doesn't need to know anything about the underlying representation of the data: byte arrays can be compared very quickly, so for join and group operations Pig only needs to compare the serialized binary, and if the serialized output is identical, the values are equal in Clojure. This doesn't work for sorting, however: ordering binary blobs is not useful, and the result does not match the order of the original data. To sort, you have to convert back to the original values, and only simple types can be sorted. This is one of the few constraints Pig imposes on PigPen.
We evaluated other languages before deciding to build PigPen. The first requirement was that it be a real programming language, not a scripting language plus a pile of UDFs. We took a brief look at Scalding, which looks promising, but our team primarily uses Clojure; you could say PigPen is to Clojure what Scalding is to Scala. Cascalog is the usual choice for writing map-reduce in Clojure, but from past experience it has drawbacks for day-to-day work: you have to learn a complex new syntax and many new concepts; implicit joins by aligning variable names are not an ideal abstraction; ordering operators the wrong way can cause serious performance problems; Cascalog flattens intermediate results (which can be wasteful); and composing queries feels awkward.
We also considered building PigPen on a different host. A similar abstraction could be built on top of Hive, but defining a schema for every intermediate result doesn't fit the Clojure mindset, and Hive, being SQL-like, is harder to translate to from a functional language. Relational-model languages such as SQL and Hive are a world apart from functional languages such as Clojure and Pig. In the end, the most direct solution was a layer of abstraction over Pig.
That concludes this introduction to map-reduce in Clojure with PigPen. Theory works best when paired with practice, so go and try it out; the getting started guide and tutorials are the natural next step.