

What are the five open source processing technologies in big data?


Many inexperienced readers do not know what the five open source processing technologies in big data are, so this article summarizes them; I hope that after reading it you will be able to answer the question yourself.

Did you know that more than 250,000 open source projects are on the market today? All around us, these systems keep growing more complex.

Even choosing conservatively, we still face a huge number of options. Which ones fit your goals? Which will power the next Fortune 2000 company? Which are reliable candidates for real production use? Which deserve special attention? We have done detailed research and testing; let's take a look at five new technologies that are shaking up big data, a handful of new tool sets worth knowing.

Storm and Kafka are the future of data stream processing, and they are already used in production at a number of large companies, including Groupon, Alibaba, and The Weather Channel. Storm, born inside Twitter, is a distributed real-time computation system; where Hadoop is mainly meant for batch processing, Storm is designed for real-time computation.

Kafka is a messaging system developed at LinkedIn that serves as the foundational piece of a data processing pipeline. When you use the two together, you can ingest data in real time with linear scalability.

Why should you care?

Storm and Kafka keep data stream processing linear, guaranteeing that every message is captured reliably and in real time. Paired front to back, Storm and Kafka can smoothly process 10,000 messages per second.
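
To make the pipeline concrete, here is a minimal sketch in Python using the kafka-python client. The broker address, the topic name "events", and the message format are assumptions for illustration only; in a real deployment a Storm topology, rather than this toy consumer, would sit on the reading side of the topic.

    # Minimal Kafka pipeline sketch using the kafka-python package.
    # Assumes a broker on localhost:9092 and a topic named "events"
    # (both are assumptions for this example).
    from kafka import KafkaProducer, KafkaConsumer

    # Producer side: push messages into the pipeline as they occur.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for i in range(3):
        producer.send("events", f"event-{i}".encode("utf-8"))
    producer.flush()  # make sure the messages are actually sent

    # Consumer side: a stream processor (e.g. a Storm spout) would read here.
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
    )
    for message in consumer:
        print(message.value.decode("utf-8"))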

Data stream processing solutions such as Storm and Kafka have made many enterprises take notice and aspire to excellent ETL (extract, transform, load) data integration solutions. Storm and Kafka are also good at in-memory analytics and real-time decision support. It is not surprising that enterprises using Hadoop for batch processing find it cannot meet real-time business requirements. Real-time data stream processing is a necessary module in an enterprise big data solution because it handles the "3Vs" beautifully: volume, velocity, and variety. Storm and Kafka are the two technologies we (Infochimps) recommend most strongly, and they will ship as formal components of our platform.

Drill and Dremel enable fast, low-load, large-scale, ad hoc query and search over data. They make it possible to scan petabytes of data in seconds to answer ad hoc queries and power predictions, and they provide strong visualization support.

Drill and Dremel put powerful query capability in the hands of more than just data engineers; everyone on the business side will love them. Drill is an open source version of Google's Dremel, the technology Google built to support big data queries and uses to power its own tools, which is why everyone is paying such close attention to Drill. Although the project is only just starting, strong interest from the open source community is making it mature quickly.

Why should you care?

Drill and Dremel are better suited to ad hoc query analysis than Hadoop, which provides only a batch data processing workflow; that is Hadoop's drawback.

The Hadoop ecosystem has worked hard to make MapReduce an approachable and useful tool for ad hoc analysis. From Sawzall to Pig to Hive, many interface layers have been built to make Hadoop friendlier and closer to the business. Yet, like all SQL-style layers over a non-SQL system, these abstractions ignore an important fact: MapReduce (and thereby Hadoop) exists to systematize data processing workflows. What if you are not sure what job you want to run? What if you do not know your questions in advance and want to explore the data for answers? That is ad hoc exploration: once you have the data, how fast can you interrogate it? You should not have to launch a new job and wait; it is often better to think of a question and ask it on the spot.

Set against this workflow-based methodology, many business-driven BI and analytics queries are fundamentally ad hoc: interactive, low-latency analyses. Writing Map/Reduce workflows is prohibitive for many business analysts, and waiting minutes for jobs to start and hours for them to finish destroys the interactive experience of comparing, contrasting, and zooming in and out that ultimately yields fundamentally new insights. Some data scientists have long speculated that Drill and Dremel will beat Hadoop here; some are still weighing the question, and a small number of enthusiasts have embraced the change immediately, but the main advantages show up in query-oriented, low-latency scenarios. At Infochimps we like to use the Elasticsearch full-text indexing engine for database search, but for truly big data processing we think Drill will become the mainstream.
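
To show what "ad hoc" feels like in practice, here is a sketch that submits a SQL query to today's Apache Drill through its REST endpoint, using Python's requests library. The local address, Drill's default port 8047, and the sample file path are all assumptions for illustration, not anything prescribed by this article.

    # Ad hoc query sketch against a local Apache Drill instance.
    # Assumes Drill is running with its default REST port (8047) and that
    # /tmp/events.json is a sample file readable through the dfs plugin.
    import requests

    query = "SELECT COUNT(*) AS n FROM dfs.`/tmp/events.json`"
    resp = requests.post(
        "http://localhost:8047/query.json",
        json={"queryType": "SQL", "query": query},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["rows"])  # results come back as a list of row objects

No job to write, no workflow to schedule: the query runs directly against raw files, which is exactly the interactive experience the paragraph above describes.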

R is a powerful open source statistical programming language. Since 1997, more than 2 million statistical analysts have used R. Born at Bell Labs as a modern dialect of the S statistical language, R quickly became the new standard language of statistics. R makes complex data science cheaper: it is gaining serious ground on SAS and SPSS and is a tool of choice for working statisticians.

Why should you care?

Because R is backed by an extraordinarily strong community, you can find an R library for almost anything and analyze scientific data of every type without writing new code. R is exciting because of the people who maintain it and the new work they create every day; the R community is one of the most exciting places in the big data field, and R is a great technology that will not go out of date there. In recent months, analysts have contributed thousands of new features to an ever more open knowledge base. Moreover, R works well with Hadoop, and the pairing has proven itself as part of big data processing. Keep an eye on Julia, an interesting alternative to R that does away with R's slow interpreter. The Julia community is not very strong yet, but if you do not need it immediately, it can wait.

Gremlin and Giraph help enhance graph analysis. They are used with graph databases such as Neo4j and InfiniteGraph, and Giraph works together with Hadoop. GoldenOrb is another high-profile example of a graph-based stream processing project worth a look. Graph databases are a fascinating niche: they differ from relational databases in many interesting ways, and from the start you will want to reason in graph theory rather than relational theory.

Another similar graph-based technology is Google's Pregel, to which Gremlin and Giraph are the open source alternatives; in fact, these are yet more examples of open source implementations following a Google technology. Graphs play an important role in computing, network modeling, and social networks, and they can connect arbitrary data. Another frequent application is mapping and geographic computation: calculating the shortest distance from A to B. Graphs are also widely used in biology and physics computing; for example, they can render unusual molecular structures. Massive graphs, graph databases, and graph analysis languages and frameworks are all part of big data in the real world. Graph theory is a killer application. Why? Any problem over a large network of nodes is solved by working the paths between the nodes. Many creative scientists and engineers are clearly using the right tools to solve their problems; make sure those tools run well and spread widely.
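
The from-A-to-B example above is easy to make concrete. Below is a small sketch using Python's networkx library; the graph and its edge weights are invented purely for illustration, and a tool like Gremlin would express the same traversal in its own query language.

    # Shortest-path sketch with networkx; the graph is made up
    # to illustrate the "from A to B" example in the text.
    import networkx as nx

    G = nx.Graph()
    G.add_weighted_edges_from([
        ("A", "C", 2.0),
        ("C", "B", 2.0),
        ("A", "D", 1.0),
        ("D", "B", 5.0),
    ])

    # Dijkstra's algorithm over the edge weights: A -> C -> B costs 4.0,
    # beating the A -> D -> B route, which costs 6.0.
    path = nx.shortest_path(G, source="A", target="B", weight="weight")
    cost = nx.shortest_path_length(G, source="A", target="B", weight="weight")
    print(path, cost)  # ['A', 'C', 'B'] 4.0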

SAP Hana is a fully in-memory analytics platform that includes an in-memory database and related tools for creating analysis workflows and standardizing data input and output into the correct formats.

Why should you care?

SAP has broken with its habit of developing powerful products only for entrenched enterprise customers and is making Hana available for free use. It is unusual for SAP to court startups and push them to adopt Hana, and to license community-built solutions; these uncharacteristic moves all revolve around Hana.

Hana assumes that other platforms are not fast enough for the problems at hand, such as financial modeling and decision support, website personalization, and fraud detection. Hana's biggest drawback is precisely that it is "all in memory": accessing volatile memory is fast and clean, but compared with disk storage it is expensive. According to its backers, though, you need not worry about operating costs: Hana is a fast, low-latency big data processing tool.

D3 did not make the list, but its kinship with these projects makes it worth mentioning. D3 is a document-oriented JavaScript visualization library. It is powerful and innovative, letting us see information directly and interact with it naturally. It was written by Michael Bostock, a graphics interface designer at the New York Times. For example, you can use D3 to build an HTML table from an arbitrary array of numbers, or create interactive bar charts from any data, and so on. One practical example of D3 is the interactive visualization of Obama's 2013 budget proposal. With D3, programmers can create interactive interfaces and organize all kinds of data.

Although this article is not long, it still took me a while to translate, and I hope you will point out any shortcomings in the translation. When I read the original I simply wanted to share it with people who would enjoy it. Thanks to its open environment, the American IT field always manages to surprise us, and of course we have to keep up.

It has been nearly a year since I started formally using Hadoop. In that time, from Baidu to my current company BitWare, different companies have used different technologies to solve their problems, but in essence the problems come down to the same few. Of course, many companies are now beginning to try Hadoop, which is understandable given the general climate.

After reading the above, do you now know what the five open source processing technologies in big data are? If you want to learn more skills or go deeper, you are welcome to follow the industry information channel. Thank you for reading!
