What are the basic concepts of Flume, Kafka and Spark 07/11 Update SLTechnology News&Howtos

What are the basic concepts of Flume, Kafka and Spark

2025-07-11 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Shulou(Shulou.com)05/31 Report--

This article introduces the relevant knowledge of "what are the basic concepts of Flume, Kafka and Spark". In the operation of actual cases, many people will encounter such a dilemma, so let the editor lead you to learn how to deal with these situations. I hope you can read it carefully and be able to achieve something!

Flume

Flume is a highly available, highly reliable, distributed massive log collection, aggregation and transmission system provided by Cloudera. Flume supports customizing all kinds of data sender in the log system to collect data; at the same time, Flume provides the ability to simply process the data and write to various data receivers (customizable).

Flume provides the ability to simply process data and write to various data recipients (customizable). Flume provides the ability to collect data from data sources such as console (console), RPC (Thrift-RPC), text (file), tail (UNIX tail), syslog (syslog log system, supporting TCP and UDP modes) and exec (command execution).

Flume-og adopts a multi-Master approach. In order to ensure the consistency of the configuration data, Flume introduces ZooKeeper to save the configuration data. ZooKeeper itself can ensure the consistency and high availability of the configuration data. In addition, ZooKeeper can notify the Flume Master node when the configuration data changes. Gossip protocol is used to synchronize data between Flume Master.

The most obvious change in Flume-ng is that it removes the centralized management configuration of Master and Zookeeper and becomes a pure transport tool. Another major difference in Flume-ng is that read-in and write-out data are now handled by different worker threads (called Runner). In Flume-og, the read-in thread also does the write job (except for a failure retry). If the write is slow (not a complete failure), it will block Flume's ability to receive data. This asynchronous design allows the read thread to work smoothly without paying attention to any downstream problems.

Kafka

Kafka is a high-throughput distributed publish and subscribe messaging system with the following features:

The persistence of messages is provided through the disk data structure of O (1), which can maintain stable performance for a long time even for message storage with TB.

High throughput: even very common hardware Kafka can support millions of messages per second.

Support for partitioning messages through Kafka servers and consumer machine clusters.

Supports Hadoop parallel data loading.

Storm

Twitter officially opened up Storm, a distributed, fault-tolerant real-time computing system hosted on GitHub and compliant with Eclipse Public License 1.0. Storm is a real-time processing system. Storm is basically written in Clojure (and some in java).

Storm provides a set of common primitives for distributed real-time computing, which can be used in "stream processing" to process messages and update databases in real time. This is another way to manage queues and worker clusters. Storm can also be used for "continuous computing" (continuous computation) to continuously query the data stream and output the results to the user in the form of a stream during the calculation. It can also be used in "distributed RPC" to run expensive operations in parallel. Nathan Marz, lead engineer at Storm, said:

Storm can easily write and extend complex real-time computing in a computer cluster. Storm is used for real-time processing, just as Hadoop is used for batch processing. Storm ensures that every message is processed, and it is fast-- in a small cluster, millions of messages can be processed per second. What's even better is that you can do development in any programming language.

The main features of Storm are as follows:

A simple programming model. Similar to MapReduce, which reduces the complexity of parallel batch processing, Storm reduces the complexity of real-time processing.

A variety of programming languages are available. You can use various programming languages on top of Storm. Clojure, Java, Ruby, and Python are supported by default. To increase support for other languages, you only need to implement a simple Storm communication protocol.

Fault tolerance. Storm manages worker processes and node failures.

Scale horizontally. Computing is performed in parallel among multiple threads, processes, and servers.

Reliable message processing. Storm ensures that each message is fully processed at least once. When a task fails, it is responsible for retrying the message from the message source.

fast. The design of the system ensures that the message can be processed quickly, and uses the dead MQ as its underlying message queue.

Local mode. Storm has a "local mode" that completely simulates the Storm cluster during processing. This allows you to do development and unit testing quickly.

The Storm cluster consists of a master node and multiple working nodes. The primary node runs a daemon called "Nimbus" for assigning code, scheduling tasks, and fault detection. Each worker node runs a daemon called "Supervisor" to listen for work, start and terminate the worker process. Both Nimbus and Supervisor can fail quickly and are stateless, so they become very robust, and the coordination between the two is done by Apache ZooKeeper.

The terms Storm include Stream, Spout, Bolt, Task, Worker, Stream Grouping and Topology. Stream is the data being processed. Spout is the data source. Bolt processes data. Task is a thread running in Spout or Bolt. Worker is the process that runs these threads. Stream Grouping specifies what Bolt receives as input data. Data can be randomly assigned (the term is Shuffle), or distributed according to the field value (the term is Fields), or broadcast (the term is All), or always sent to a Task (the term is Global), the data can be ignored (the term is None), or it can be determined by custom logic (the term is Direct). Topology is a network of Spout and Bolt nodes connected by Stream Grouping. These terms are described in more detail on the Storm Concepts page.

Scala

Scala is a multi-paradigm programming language, a programming language similar to java, which is designed to implement a scalable language and integrate the features of object-oriented programming and functional programming.

Scala has several key features that demonstrate its object-oriented nature. For example, every value in Scala is an object, including basic data types (that is, Boolean values, numbers, and so on), and even functions are objects. In addition, classes can be subclassed, and Scala provides mixin-based composition (mixin-based composition).

Scala has a broader sense of class reuse than languages that only support single inheritance. Scala allows you to reuse "a new member definition in a class (that is, the difference from its parent class) when defining a new class." Scala calls it a combination of mixin classes.

Scala also contains several key concepts of functional languages, including higher-order functions (Higher-Order Function), local application (Currying), nested functions (Nested Function), sequence interpretation (Sequence Comprehensions), and so on.

Scala is statically typed, which allows it to provide generic classes, inner classes, and even polymorphic methods (Polymorphic Method). It is also worth mentioning that Scala is specially designed to interoperate with Java and .NET. The current version of Scala does not run on .NET (although the previous version can be-_-b), but it is scheduled to run on .NET in the future.

Scala is interoperable with Java. It uses the scalac compiler to compile the source file into Java's class file (that is, bytecode running on JVM). You can call all Java libraries from Scala, as well as Scala code from Java applications. In David Rupp's words,

It also has access to an endless number of existing Java libraries, which makes it (potentially) easier to migrate to Scala.

Impala

This is a massively parallel processing (MPP) SQL big data analysis engine (Note:

Impala, like Dremel, draws lessons from the idea of MPP (Massively Parallel Processing) parallel database, abandons MapReduce, which is not suitable for SQL query, and allows Hadoop to support interactive workloads.

Impala claims to be 3-30 times higher in performance than Hive, and even predicts that it may one day surpass the utilization rate of Hive and become the most popular real-time computing platform on Hadoop.

The purpose of Impala is not to replace existing MapReduce tools such as Hive, but to provide a unified platform for real-time queries. In fact, the operation of Impala also depends on the metadata of Hive.

Like Hive, Impala can interact directly with HDFS and HBase libraries. It's just that Hive and other frameworks built on MapReduce are suitable for long-running batch tasks. For example, those who batch extract, convert, load (ETL) type of Job. Impala is mainly used for real-time query.

Hive

Hive is a data warehouse tool based on Hadoop, which can map structured data files to a database table, provide simple sql query function, and transform sql statements into MapReduce tasks to run. Its advantage is that the learning cost is low, simple MapReduce statistics can be quickly realized through SQL-like statements, and there is no need to develop special MapReduce applications, so it is very suitable for statistical analysis of data warehouse.

Hive is a data warehouse infrastructure built on Hadoop. It provides a series of tools that can be used for data extraction, transformation loading (ETL), a mechanism that can store, query, and analyze large-scale data stored in Hadoop. Hive defines a simple SQL-like query language called HQL, which allows users who are familiar with SQL to query data. At the same time, the language also allows familiar with MapReduce developers to develop custom mapper and reducer to handle complex analytical work that cannot be done by built-in mapper and reducer.

Hive has no special data format. Hive works well on top of Thrift, controls delimiters, and allows users to specify data formats.

Spark

Spark is a general parallel framework like Hadoop MapReduce opened by UC Berkeley AMP lab. Spark has the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate output of Job can be saved in memory, so there is no need to read and write HDFS, so Spark can be better applied to MapReduce algorithms that need iteration, such as data mining and machine learning.

Spark is an open source cluster computing environment similar to Hadoop, but there are some differences between the two. These useful differences make Spark superior in some workloads. In other words, Spark enables in-memory distributed datasets to optimize iterative workloads in addition to interactive queries.

Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark and Scala can be tightly integrated, where Scala can manipulate distributed datasets as easily as local collection objects.

Although Spark was created to support iterative jobs on distributed datasets, it is actually a complement to Hadoop and can be run in parallel in the Hadoop file system. This behavior can be supported through a third-party cluster framework called Mesos. Developed by AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, Spark can be used to build large, low-latency data analysis applications.

This is the end of the introduction of "what are the basic concepts of Flume, Kafka and Spark". Thank you for your reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.