Hadoop Common: contains the utility classes shared across Hadoop, renamed from the original Hadoop Core module. It mainly includes the system configuration tool Configuration, the remote procedure call (RPC) framework, the serialization mechanism, and the Hadoop abstract file system FileSystem. These provide the basic services for building a cloud computing environment on commodity hardware and supply the APIs needed by software running on the platform.
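To make the Configuration utility mentioned above concrete, here is a minimal Java sketch that loads the default Hadoop configuration and reads a couple of well-known properties. It assumes a standard Hadoop client classpath; the property names and fallback values are only illustrative.

```java
import org.apache.hadoop.conf.Configuration;

public class ConfigExample {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath.
        Configuration conf = new Configuration();

        // Read properties with fallback defaults if they are not configured.
        String defaultFs = conf.get("fs.defaultFS", "file:///");
        int replication = conf.getInt("dfs.replication", 3);

        System.out.println("fs.defaultFS = " + defaultFs);
        System.out.println("dfs.replication = " + replication);
    }
}
```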
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput, highly scalable, and highly fault-tolerant access to application data. It is the foundation of data storage management in the Hadoop system. It is a highly fault-tolerant system that can detect and respond to hardware failures and is designed to run on low-cost commodity hardware. HDFS simplifies the file consistency model, provides high-throughput access to application data through streaming data access, and is well suited to applications with large datasets.
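Below is a minimal sketch of the FileSystem abstraction and streaming access described above, assuming a file system reachable through fs.defaultFS (HDFS or local); the path used here is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Resolves to HDFS when fs.defaultFS points at a NameNode, e.g. hdfs://namenode:8020
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hdfs-example.txt"); // hypothetical path

        // Streaming write.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Streaming read.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```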
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop MapReduce: a YARN-based system for parallel processing of large datasets. It is a computing model for processing large volumes of data. Hadoop's MapReduce implementation, together with Common and HDFS, made up the three components of Hadoop in its early days. MapReduce divides an application into two steps, Map and Reduce: Map performs a specified operation on each independent element of the dataset and produces intermediate results as key-value pairs, and Reduce then combines all values that share the same key to produce the final result. This functional decomposition makes MapReduce very well suited to data processing in a distributed parallel environment composed of a large number of machines.
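The Map and Reduce steps described above are easiest to see in the canonical word-count job. The following sketch follows the standard Hadoop MapReduce tutorial example; the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all counts seen for the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```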
Other modules:
Ambari: a Web-based tool that supports provisioning, managing, and monitoring Apache Hadoop clusters. Ambari currently supports most Hadoop components, including HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Sqoop, and HCatalog, and manages them centrally. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and lets you inspect MapReduce, Pig, and Hive applications visually and diagnose their performance characteristics in a user-friendly way. It is one of the best-known Hadoop management tools.
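Ambari drives cluster management through a REST API, which is also what its Web dashboard uses. A minimal sketch follows, assuming an Ambari server on localhost:8080 with the default admin account; the host, port, and credentials are assumptions, while /api/v1/clusters is a standard Ambari endpoint that lists managed clusters.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AmbariClusters {
    public static void main(String[] args) throws Exception {
        // Assumed Ambari server location and credentials; adjust for a real deployment.
        URL url = new URL("http://localhost:8080/api/v1/clusters");
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Authorization", "Basic " + auth);
        // Header Ambari expects on state-changing API calls; harmless on reads.
        conn.setRequestProperty("X-Requested-By", "ambari");

        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // JSON description of the managed clusters
            }
        }
    }
}
```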
Avro: a data serialization system, led by Doug Cutting. Like other serialization mechanisms, Avro converts data structures or objects into a format that is easy to store and transmit. It is designed to support data-intensive applications and is suited to large-scale data storage and exchange. Avro provides rich data structure types, a fast and compressible binary data format, a container file format for storing persistent data, remote procedure call (RPC), and simple integration with dynamic languages.
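As a small illustration of Avro's binary container format and schema handling, the sketch below writes and reads a generic record with the Avro Java library; the schema and file name are made up for the example.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // A small record schema defined inline; a real project would keep this in a .avsc file.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        File file = new File("users.avro");

        // Write the record into a compact, self-describing Avro container file.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read it back; the schema travels with the file.
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec);
            }
        }
    }
}
```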
Cassandra: a scalable multi-master database with no single point of failure. It is an open source, distributed NoSQL database system. Originally developed by Facebook to store simply formatted data such as inboxes, it combines the data model of Google BigTable with the fully distributed architecture of Amazon Dynamo. Facebook open-sourced Cassandra in 2008, and thanks to its good scalability it has since been adopted by Digg, Twitter, and other well-known Web 2.0 sites, becoming a popular distributed structured-data storage solution.
Cassandra is a hybrid, non-relational database, similar to Google's BigTable. Its feature set is richer than Dynamo (a distributed key-value store), but its document support is not as strong as MongoDB (an open source product that sits between relational and non-relational databases; among non-relational databases it is the most feature-rich and the most like a relational database, with a very loose data structure stored in a JSON-like BSON format that can hold fairly complex data types). Cassandra was originally developed by Facebook and later became an open source project, and it is an ideal database for online social cloud computing. It builds on Amazon's fully distributed Dynamo design, adopts Google BigTable's column-family data model, and uses P2P decentralized storage; in many ways it can be called Dynamo 2.0.
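For a feel of Cassandra's CQL interface, here is a minimal sketch using the DataStax Java driver, assuming a single-node Cassandra instance listening on localhost:9042; the keyspace and table names are hypothetical.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;

public class CassandraExample {
    public static void main(String[] args) {
        // With no explicit contact point the driver connects to 127.0.0.1:9042.
        try (CqlSession session = CqlSession.builder().build()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class':'SimpleStrategy','replication_factor':1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                    + "(user_id text PRIMARY KEY, name text)");

            // Insert and read back a row keyed by user_id.
            session.execute("INSERT INTO demo.users (user_id, name) VALUES ('u1', 'alice')");
            ResultSet rs = session.execute("SELECT name FROM demo.users WHERE user_id = 'u1'");
            Row row = rs.one();
            System.out.println(row == null ? "not found" : row.getString("name"));
        }
    }
}
```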
Chukwa: a data collection system for managing large distributed systems (clusters with more than 2,000 nodes that generate monitoring data on the order of terabytes per day). Built on top of Hadoop's HDFS and MapReduce, it inherits Hadoop's scalability and robustness. Chukwa includes a powerful and flexible toolset that covers data generation, collection, sorting, deduplication, analysis, and display, making it a useful tool for Hadoop users, cluster operators, and administrators.
HBase: a distributed, column-oriented open source database, based on the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang. Just as Bigtable builds on the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase is a subproject of Apache's Hadoop project. Unlike a typical relational database, HBase is suited to storing unstructured data; another difference is that HBase is column-based rather than row-based.
HBase is a scalable, highly reliable, high-performance, distributed, column-oriented dynamic-schema database for structured data. Unlike traditional relational databases, HBase uses BigTable's data model: an enhanced sparse, sorted mapping table (key/value), where the key is composed of a row key, a column key, and a timestamp. HBase provides random, real-time read and write access to large-scale data, and the data stored in HBase can be processed with MapReduce, neatly combining data storage and parallel computing.
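The row key / column family / column qualifier / timestamp model described above maps directly onto the HBase client API. Below is a minimal Java sketch, assuming an HBase cluster reachable through hbase-site.xml and an existing table "users" with column family "info" (both names are hypothetical).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath (ZooKeeper quorum, etc.).
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key, column family, column qualifier, value.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Random read by row key; each returned cell also carries a timestamp.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```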
Hive: a Hadoop-based data warehouse tool that maps structured data files to database tables, provides simple SQL-like query capabilities, and converts SQL statements into MapReduce jobs to run. Its advantage is a low learning cost: simple MapReduce statistics can be produced quickly with SQL-like statements, with no need to develop dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.
Hive is an important subproject of Hadoop, first designed at Facebook. It is a data warehouse architecture built on Hadoop that provides many data warehouse management functions, including data ETL (extract, transform, load) tools, data storage management, and query and analysis of large datasets. Hive imposes structure on its data and offers a SQL-like language similar to that of traditional relational databases, HiveQL, through which data analysts can easily run analysis jobs.
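A common way to run HiveQL from Java is through the HiveServer2 JDBC interface. The sketch below assumes a HiveServer2 instance on the default port 10000 without authentication and a hypothetical "employees" table; Hive compiles the query into MapReduce (or Tez) jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // Explicit registration for older driver versions; recent drivers auto-register.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumes HiveServer2 on the default port with no authentication.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {

            // A simple aggregation over a hypothetical table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT department, COUNT(*) AS cnt FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString("department") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```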
Mahout: an open source Apache project that provides scalable implementations of classic machine learning algorithms, designed to help developers create intelligent applications more easily and quickly. Mahout includes many implementations covering clustering, classification, recommendation (collaborative filtering), and frequent itemset mining. In addition, Mahout can scale out to the cloud by using the Apache Hadoop library.
Mahout started in 2008 as a subproject of Apache Lucene. It progressed rapidly and is now a top-level Apache project. Mahout's main goal is to build scalable implementations of classic machine learning algorithms to help developers create intelligent applications more easily and quickly. Mahout now includes widely used data mining methods such as clustering, classification, recommendation engines (collaborative filtering), and frequent itemset mining. Beyond the algorithms, Mahout also includes data input/output tools and integrations with other storage systems such as databases, MongoDB, and Cassandra.
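As one concrete example of the recommendation support mentioned above, the sketch below uses Mahout's classic Taste (collaborative filtering) API; it has been deprecated in recent Mahout releases but still illustrates the idea. The ratings file and user ID are hypothetical.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv is a hypothetical file of "userID,itemID,preference" lines.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based collaborative filtering: similarity, neighborhood, recommender.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```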
Pig: a platform that runs on Hadoop for analyzing and evaluating large datasets. It simplifies data analysis on Hadoop by providing a high-level, domain-specific abstract language, Pig Latin. With Pig Latin, data engineers can encode complex, interrelated data analysis tasks as data-flow scripts built from Pig operations, and those scripts are executed on Hadoop by being translated into chains of MapReduce jobs. Like Hive, Pig lowers the bar for analyzing and evaluating large datasets.
Apache Pig is a high-level procedural language suited to querying large semi-structured datasets on the Hadoop and MapReduce platforms. Pig simplifies the use of Hadoop by allowing SQL-like queries over distributed datasets.
Before Pig, MapReduce was used directly for data analysis. When the business logic is complex, using MapReduce becomes very cumbersome; for example, a lot of preprocessing or transformation of the data may be needed just to fit the MapReduce processing model, and writing MapReduce programs and publishing and running jobs is time-consuming. Pig fills this gap: it lets you focus on the data and the business itself rather than on data format conversion and on writing MapReduce programs. Under the hood, when you process data with Pig, Pig generates a series of MapReduce operations to perform the task, but this process is transparent to the user.
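To show how Pig Latin expresses a data flow without hand-written MapReduce, here is a minimal sketch that embeds a small Pig Latin script in Java via PigServer, running in local mode; the input file and output directory are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Local mode keeps everything on one machine; ExecType.MAPREDUCE would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin data flow: load, group, count. words.txt is a hypothetical input file.
        pig.registerQuery("lines = LOAD 'words.txt' AS (word:chararray);");
        pig.registerQuery("grouped = GROUP lines BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(lines);");

        // Storing an alias is what triggers Pig to compile and run the underlying jobs.
        pig.store("counts", "word_counts_out");

        pig.shutdown();
    }
}
```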
Spark: a fast and general computing engine for Hadoop data. Spark provides a simple programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Apache Spark is a fast and general computing engine specially designed for large-scale data processing.
Spark is a general parallel framework in the style of Hadoop MapReduce, open-sourced by UC Berkeley's AMPLab (the AMP Lab at the University of California, Berkeley). Spark has the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate output of a job can be kept in memory, so it no longer needs to read and write HDFS between stages. Spark is therefore better suited to algorithms that require iteration, such as those used in data mining and machine learning.
Spark is an open source cluster computing environment similar to Hadoop, but the differences between the two make Spark superior for some workloads: in addition to interactive queries, Spark keeps distributed datasets in memory to optimize iterative workloads.
Spark is implemented in the Scala language and uses Scala as its application framework. Unlike Hadoop, Spark integrates tightly with Scala, which can manipulate distributed datasets as easily as local collection objects.
Although Spark was created to support iterative jobs over distributed datasets, it is in fact a complement to Hadoop and can run alongside it on the Hadoop file system, a deployment supported through the third-party cluster framework Mesos. Developed by AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, Spark can be used to build large, low-latency data analysis applications.
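The in-memory caching that distinguishes Spark from classic MapReduce can be sketched with Spark's Java API. A minimal local-mode example follows, in which a cached dataset is reused across iterations instead of being re-read each time; the computation itself is only a toy.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkIterativeExample {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM; on a cluster the master would be YARN or standalone.
        SparkConf conf = new SparkConf().setAppName("iterative-demo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // cache() keeps the dataset in memory, so repeated passes avoid re-reading from disk/HDFS.
        JavaRDD<Double> data = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0)).cache();

        double estimate = 0.0;
        for (int i = 0; i < 10; i++) {
            // Each iteration reuses the cached RDD instead of rebuilding it.
            double mean = data.reduce((a, b) -> a + b) / data.count();
            estimate = (estimate + mean) / 2.0;
        }
        System.out.println("estimate = " + estimate);

        sc.stop();
    }
}
```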
Tez: an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while preserving MapReduce's ability to scale to petabytes of data. It is a computing framework that supports DAG (directed acyclic graph) jobs and derives directly from the MapReduce framework. Its core idea is to split the Map and Reduce operations further: Map is split into Input, Processor, Sort, Merge, and Output, while Reduce is split into Input, Shuffle, Sort, Merge, Processor, and Output. These decomposed elementary operations can then be combined flexibly into new operations, which control logic assembles into a single large DAG job.
ZooKeeper: a distributed, open source coordination service for distributed applications, an open source implementation of Google's Chubby, and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, group services, and so on.
The goal of ZooKeeper is to encapsulate these complex and error-prone key services and to offer users simple, easy-to-use interfaces on top of an efficient and stable system.
How to agree on a value (reach consensus) in a distributed system is a fundamental problem. As a distributed service framework, ZooKeeper solves this consistency problem for distributed computing, and on that basis it can be used to handle data management problems that distributed applications often encounter, such as unified naming, state synchronization, cluster management, and management of distributed application configuration items. ZooKeeper frequently serves as a key component of other Hadoop-related projects and plays an increasingly important role.
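As a small illustration of the kind of coordination ZooKeeper provides, the sketch below uses the standard ZooKeeper Java client to store and read a configuration value under a znode, assuming a server on localhost:2181; the znode path and value are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Assumes a ZooKeeper server on the default port 2181.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // wait until the session is established

        // Store a small piece of configuration under a znode, then read it back.
        String path = "/demo-config"; // hypothetical znode
        if (zk.exists(path, false) == null) {
            zk.create(path, "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, false, null);
        System.out.println("config value: " + new String(data));

        zk.close();
    }
}
```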
Conclusion
Thank you for reading. If there are any shortcomings above, criticism and corrections are welcome.