This article surveys the most common types of Hadoop and Spark projects. It is fairly detailed and should be a useful reference; if you work with these systems, it is worth reading through.
Project 1: Data integration
Call it an "enterprise data hub" or a "data lake." The idea is that you have disparate data sources and you want to analyze them together. This kind of project consists of ingesting data (in real time or in batch) from all of those sources and storing it in Hadoop. Sometimes this is the first step toward becoming a "data-driven company"; sometimes all you want is nice reports. An enterprise data hub usually consists of files in HDFS and tables in Hive or Impala. Going forward, HBase and Phoenix will play a bigger role in this kind of integration, opening up new ways of working with the data.
Salespeople like to say "schema on read," but in truth, to be successful you must have a clear idea of what your use cases will be (the Hive schema will not end up looking very different from what you would build in an enterprise data warehouse). The real reason for a data lake is horizontal scalability and a much lower cost than Teradata or Netezza. For front-end analysis, many people use Tableau and Excel; more sophisticated companies put "data scientists" in front of Zeppelin or IPython notebooks.
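To make the HDFS-plus-Hive pattern concrete, here is a minimal sketch of registering raw files, already landed in HDFS by an ingestion job, as a queryable Hive table. It assumes PySpark with Hive support; the path, table name, and columns are hypothetical, not from the article.

```python
from pyspark.sql import SparkSession

# Assumes a configured Hive metastore; names below are made up for illustration.
spark = (SparkSession.builder
         .appName("data-lake-integration")
         .enableHiveSupport()
         .getOrCreate())

# Register an external table over files an ingestion job already wrote to HDFS.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (
        order_id   STRING,
        amount     DOUBLE,
        order_date STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'hdfs:///data/lake/sales_raw'
""")

# Analysts can now query the lake like any warehouse table.
spark.sql("""
    SELECT order_date, SUM(amount) AS total
    FROM sales_raw
    GROUP BY order_date
""").show()
```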
Project 2: Specialized analytics
Many data integration projects actually begin with a specific need and the analysis of a single data set. These are often incredibly domain-specific, such as liquidity risk and Monte Carlo simulation in banking. In the past, such specialized analyses relied on dated, proprietary packages that could not scale with the data and were frequently limited to a narrow feature set (in part because the software vendor could not know as much about the domain as the institution itself).
In the Hadoop and Spark world, these systems look roughly like data integration systems, but they tend to have more HBase, more custom non-SQL code, and fewer data sources (if not only one). Increasingly, they are built on Spark.
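To make the Monte Carlo example concrete, here is a minimal sketch of how such a simulation might be distributed with Spark. The one-factor return model, its parameters, and the trial count are all invented for illustration; a real risk system would be far more elaborate.

```python
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("monte-carlo-risk").getOrCreate()
sc = spark.sparkContext

NUM_TRIALS = 1_000_000  # hypothetical trial count

def simulate_return(_):
    # Toy model: daily portfolio return drawn from a normal distribution;
    # mu and sigma are made-up parameters, not calibrated to anything real.
    return random.gauss(0.0005, 0.02)

# Distribute the trials across the cluster, then pull back only the tail.
returns = sc.parallelize(range(NUM_TRIALS), numSlices=100).map(simulate_return)
worst_tail = returns.takeOrdered(NUM_TRIALS // 20)  # worst 5% of outcomes

# 95% value-at-risk estimate: the boundary of the worst 5% of trials.
print(f"Estimated 95% VaR (daily return): {worst_tail[-1]:.4f}")
```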
Project 3: Hadoop as a service
Any large organization with a few "specialized analytics" projects (and, ironically, one or two "data integration" projects) will inevitably start to feel the "joy" (that is, pain) of managing several differently configured Hadoop clusters, sometimes from different vendors. Next they will say, "Maybe we should consolidate these into one resource pool," rather than leaving most of the nodes idle most of the time. They could go to the cloud, but many companies can't or won't, often for security reasons (read: internal politics and job protection). This usually means a lot of Docker containers and package management.
I haven't used it, but Bluedata appears to have a solution here, and it should also appeal to smaller businesses that lack the resources to deploy Hadoop as a service themselves.
Project 4: Streaming analytics
Many people would call this "streaming," but streaming analytics is different from streaming from devices. Often, streaming analytics is a real-time version of something the organization already does in batch. Take anti-money-laundering or fraud detection: why not catch it as the transaction happens rather than at the end of a cycle? The same goes for inventory management and plenty of other problems.
In some cases, this is a new type of transactional system that analyzes data bit by bit as you shunt it, in parallel, into an analytics system. Such systems manifest as Spark or Storm with HBase as the usual data store. Note that streaming analytics does not replace every other form of analysis; you will still want to examine historical trends, or look at past data for questions you had never thought to ask.
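As a concrete sketch of the fraud-detection example, here is a minimal Spark Structured Streaming job that flags large transactions as they arrive instead of waiting for an end-of-day batch. The Kafka topic, broker address, schema, and threshold rule are all assumptions for illustration, and the job needs the spark-sql-kafka connector on its classpath.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("streaming-fraud-check").getOrCreate()

# Hypothetical schema for transaction events arriving as JSON.
schema = StructType([
    StructField("account", StringType()),
    StructField("amount", DoubleType()),
])

# Topic and broker names are made up; requires the spark-sql-kafka package.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "transactions")
       .load())

events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Toy rule: flag any transaction above a made-up threshold, in real time.
flagged = events.filter(F.col("amount") > 10000.0)

query = (flagged.writeStream
         .format("console")   # a production job might write to HBase instead
         .outputMode("append")
         .start())
query.awaitTermination()
```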
Project 5: Complex event processing
Here we are talking about subsecond, real-time event processing. While this is not fast enough for ultra-low-latency (picosecond or nanosecond) applications such as high-end trading systems, you can expect millisecond response times. Examples include real-time rating of call data records for telecom carriers, or processing events from the Internet of things. Sometimes you see such systems built on Spark and HBase, but they generally fall on their faces and have to be rewritten in Storm, which is based on the Disruptor pattern developed by the LMAX exchange.
In the past, such systems were built on custom messaging software, or on high-performance, off-the-shelf client-server messaging products, but today's data volumes are too much for either. I haven't used it yet, but the Apex project looks promising and claims to be faster than Storm.
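To illustrate what "complex event processing" means at the logic level, independent of Storm or Apex, here is a toy in-memory sketch that rates call events against a sliding time window. The record format and the rule (more than 5 calls in 10 seconds) are invented for illustration; a real CEP engine distributes this state and sustains millisecond latencies at far higher volumes.

```python
import time
from collections import deque, defaultdict

WINDOW_SECONDS = 10   # hypothetical sliding window
MAX_CALLS = 5         # hypothetical rate rule

recent_calls = defaultdict(deque)  # caller -> timestamps of recent calls

def process_event(caller: str, ts: float) -> bool:
    """Return True if this call event trips the rate rule."""
    window = recent_calls[caller]
    window.append(ts)
    # Evict timestamps that have fallen out of the sliding window.
    while window and ts - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) > MAX_CALLS

# In a real system events would arrive from a message bus; here we just
# feed a synthetic burst from one caller.
for i in range(8):
    if process_event("555-0100", time.time()):
        print(f"alert: call {i} from 555-0100 exceeded the rate rule")
```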
Project 6: Streaming ETL
Sometimes you want to capture streaming data and warehouse it. These projects usually coincide with No. 1 or No. 2, but add their own scope and characteristics. (Some people think they are doing No. 4 or No. 5, but they are really dumping data to disk and analyzing it later.) These are almost always Kafka and Storm projects. Spark is sometimes used too, but without justification, because you don't actually need in-memory analytics.
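Here is a minimal sketch of the capture-to-disk side of such a pipeline using the kafka-python client. The topic name, broker address, and output path are assumptions for illustration; a real pipeline would land the data in HDFS, partitioned by time.

```python
from kafka import KafkaConsumer  # pip install kafka-python

# Topic, broker, and group id below are made up for illustration.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    group_id="etl-archiver",
)

# No in-memory analytics here: the whole point of streaming ETL is simply
# to capture the stream durably for later batch analysis.
with open("events.log", "ab") as out:
    for message in consumer:
        out.write(message.value + b"\n")
```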
Project 7: Replacing or augmenting SAS
SAS is fine and good, but SAS is also expensive, and we don't need to buy seats just so your data scientists and analysts can "play" with the data. Besides, you can do things differently from what SAS does, or produce prettier graphs. This is your "data lake," fronted by IPython notebooks (now) and Zeppelin (later). The results still get stored in SAS.
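As a small sketch of the notebook-style exploration that replaces or sits alongside SAS here, the first cell of a hypothetical IPython or Zeppelin notebook pointed at the lake might look like this; the file path and column names are invented for illustration.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("notebook-exploration").getOrCreate()

# Hypothetical curated data set sitting in the lake.
df = spark.read.parquet("hdfs:///data/lake/sales_clean")

# Quick profiling an analyst might otherwise do in SAS.
df.describe("amount").show()

# Top regions by revenue, with distinct order counts.
(df.groupBy("region")
   .agg(F.sum("amount").alias("total"),
        F.countDistinct("order_id").alias("orders"))
   .orderBy(F.desc("total"))
   .show(10))
```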
I see other types of Hadoop, Spark, or Storm projects every day; these are the normal ones. If you use Hadoop, you probably recognize them. I implemented some of these projects years ago, using other technologies.
If you're an old-timer too scared of the "big" in big data or the "do" in Hadoop, don't worry. The more things change, the more they stay the same. You'll find plenty of parallels between the stuff you used to deploy and the hip technologies swirling around the Hadooposphere.
Original author: Andrew C. Oliver. Andrew C. Oliver is a professional cat herder who moonlights as a software consultant. He is the president and founder of Mammoth Data (formerly Open Software Integrators), a big data consulting firm based in Durham, North Carolina.
That covers everything in "What are the common Hadoop and Spark projects?" Thank you for reading, and I hope it was helpful.