This article mainly explains what open-source query engines and frameworks exist for big data. The tools introduced here are widely used and practical, so let's take a look at each in turn.
Apache Hive
Apache Hive is the flagship data warehouse tool of the Hadoop ecosystem. It maps structured data files to database tables and provides an SQL-like query language, HiveQL (HQL), converting HQL statements into MapReduce jobs for execution.
Developed by Facebook, it entered the Apache Incubator in 2008 and became an Apache top-level project in September 2010. The idea behind it is to use the familiar SQL model to process data stored on HDFS. The learning curve is low: simple MapReduce statistics can be expressed quickly as HQL statements, with no need to develop dedicated MapReduce applications.
Hive makes it convenient to model and build an enterprise-wide data warehouse, and the Hive SQL model can then be used to analyze the data in that warehouse.
However, because Hive translates queries into MapReduce, whose shuffle stage relies on disk, it can only handle offline analysis and is relatively inefficient, which the developer community finds hard to accept for interactive work. Organizations therefore typically use Hive to build the data warehouse itself.
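As a concrete illustration, here is a minimal sketch of querying Hive through its JDBC interface, assuming a HiveServer2 instance on localhost:10000; the table name, schema, and HDFS path are invented for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, and database are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Map files already on HDFS to a table (hypothetical path and schema).
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS page_views "
                       + "(user_id BIGINT, url STRING) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
                       + "LOCATION '/data/page_views'");
            // An HQL aggregation that Hive compiles into MapReduce jobs.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}
```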
Apache SparkSQL
Apache Spark SQL is the main Spark component for processing structured data. Released in 2014, it incorporated the earlier Hive-on-Spark effort (Shark) and is now the most widely used Spark module. It provides a programmable abstract data model called DataFrames and can act as a distributed SQL query engine.
Spark SQL can replace Hive's query engine while remaining compatible with the Hive ecosystem. Compared with the lower-level Spark RDD API, the interfaces of Spark SQL give Spark more information about the structure of both the data and the computation. Spark SQL sits on top of Spark Core, and you can switch easily between SQL and the DataFrame API.
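For instance, a minimal Java sketch of that switching between the DataFrame API and SQL might look like this; the input file and column names are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-sketch")
                .master("local[*]")              // local mode, just for the demo
                .getOrCreate();

        // Load structured data into a DataFrame (hypothetical file and schema).
        Dataset<Row> events = spark.read().json("/data/events.json");

        // The same data can be queried through the DataFrame API...
        events.groupBy("url").count().show();

        // ...or through plain SQL, switching freely between the two.
        events.createOrReplaceTempView("events");
        spark.sql("SELECT url, COUNT(*) AS hits FROM events GROUP BY url").show();

        spark.stop();
    }
}
```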
Presto
Presto is a distributed SQL query engine that never stores data itself. Instead, it provides access to multiple data sources and supports federated queries that join across different sources. It is a distributed interactive SQL query engine, also developed by Facebook, started in 2012 and open-sourced in 2013.
Presto is an OLAP tool suited to complex analysis of massive data sets, not to OLTP scenarios. It provides only computation and analysis functions and cannot be used as a database system.
Compared with Hive, Presto is an in-memory compute engine with low latency and high concurrency, and its execution efficiency is much higher than Hive's. Its MPP (massively parallel processing) model can handle petabyte-scale data. In simple terms, Presto loads a batch of data into memory, computes on it, emits the result, and moves on to the next batch, in a pipelined processing pattern.
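Below is a minimal sketch of such a federated query through the Presto JDBC driver, assuming a coordinator on localhost:8080 with hypothetical hive and mysql catalogs; all table and column names are invented for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class PrestoSketch {
    public static void main(String[] args) throws Exception {
        // Presto coordinator endpoint; host, port, and catalog names are assumptions.
        String url = "jdbc:presto://localhost:8080/hive/default";
        Properties props = new Properties();
        props.setProperty("user", "analyst");   // Presto requires a user name
        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement();
             // A federated query joining a Hive table with a MySQL table;
             // both catalogs and all table/column names are hypothetical.
             ResultSet rs = stmt.executeQuery(
                 "SELECT u.name, COUNT(*) AS orders "
               + "FROM hive.default.orders o "
               + "JOIN mysql.shop.users u ON o.user_id = u.id "
               + "GROUP BY u.name")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + ": " + rs.getLong("orders"));
            }
        }
    }
}
```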
Apache Kylin
Apache Kylin is an open-source distributed analysis engine for the Hadoop ecosystem. It provides an SQL query interface and OLAP capabilities on Hadoop/Spark for very large-scale data. It uses cube-based precomputation, which lets it answer SQL queries over big data quickly and efficiently. Kylin was developed by eBay and entered the Apache Incubator in November 2014.
Kylin emerged to make TB-scale data analyzable. It precomputes the data stored in Hive, using Hadoop's MapReduce framework for the computation, and can then query very large Hive tables in seconds.
The two most critical processes in Kylin are:
the precomputation of cubes, and
the translation of SQL queries into cube lookups.
Aggregate results are computed ahead of time and simply fetched at query time, avoiding direct scans of the raw data.
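As a sketch, Kylin exposes a JDBC driver, so an aggregation that hits a precomputed cube can be issued like an ordinary SQL query. The host, port, project name, and credentials below are assumptions, and the table resembles Kylin's bundled sample data.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class KylinSketch {
    public static void main(String[] args) throws Exception {
        // Kylin JDBC endpoint; host, port, project, and credentials are assumptions.
        Properties props = new Properties();
        props.setProperty("user", "ADMIN");
        props.setProperty("password", "KYLIN");
        String url = "jdbc:kylin://localhost:7070/my_project";
        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement();
             // Kylin routes this aggregation to a precomputed cube instead of
             // scanning the raw Hive table (table and columns are hypothetical).
             ResultSet rs = stmt.executeQuery(
                 "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getBigDecimal(2));
            }
        }
    }
}
```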
Apache Impala
Compared with the other frameworks here, Apache Impala is a real-time interactive SQL query engine for big data. It is an MPP SQL query tool developed by Cloudera. Inspired by Google's Dremel, it was open-sourced in October 2012 and became an Apache top-level project on November 28, 2017.
Impala integrates into the Hadoop ecosystem in a completely open way, allowing its users to process large amounts of data in the Hadoop ecosystem with SQL.
Currently, it supports many types of storage options, such as:
Apache Kudu
Amazon S3
Microsoft ADLS
Local storage
It was originally built to support interactive analysis of large HDFS data sets. Its flexibility and leading analytic-database performance have led to large-scale deployments in enterprises worldwide.
It provides efficient BI and interactive SQL analytics for enterprise business and has allowed a third-party ecosystem to grow rapidly.
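As an illustration, Impala daemons speak the HiveServer2 protocol, so the Hive JDBC driver can be used against them. The host, the port (commonly 21050), and the unsecured auth=noSasl setting below are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaSketch {
    public static void main(String[] args) throws Exception {
        // Impala daemon endpoint via the HiveServer2 protocol; host, port,
        // and auth=noSasl are assumptions for an unsecured test cluster.
        String url = "jdbc:hive2://localhost:21050/default;auth=noSasl";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // Interactive query over the same tables Hive sees (names are hypothetical).
             ResultSet rs = stmt.executeQuery(
                 "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```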
Apache Druid
Apache Druid is an open-source tool for real-time data analysis, designed to process large-scale data quickly. Its distributed, real-time architecture handles complex analytical tasks through fast queries over very large data sets.
Druid entered the Apache Incubator on February 28, 2018. It provides interactive access to data: once data enters the Druid system, it is ingested in real time and can be queried immediately. Ingested data is almost immutable and usually consists of factual events in chronological order.
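As a sketch, recent Druid versions expose an SQL endpoint over HTTP on the broker or router. The host, port, datasource, and columns below are assumptions (the datasource resembles Druid's tutorial data).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidSketch {
    public static void main(String[] args) throws Exception {
        // Druid SQL over HTTP; host, port, datasource, and columns are assumptions.
        String body = "{\"query\": \"SELECT channel, COUNT(*) AS edits "
                    + "FROM wikipedia WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY "
                    + "GROUP BY channel\"}";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8888/druid/v2/sql"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // JSON rows
    }
}
```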
Elasticsearch
Elasticsearch (ES) is a distributed, scalable, real-time search and analytics engine. It was built by Shay Banon in 2010 and later open-sourced. It offers full-text search capabilities together with distributed multi-user support through a RESTful web interface.
The working principle of ES can be divided into the following steps.
First, the user writes documents into the ES database.
Then an analyzer (the word-segmentation controller) tokenizes the corresponding text.
The resulting terms and their weights are stored in the index.
When a user searches for specific data, results are scored and ranked by weight and returned to the user; a concrete sketch of this flow appears after the client list below. ES is developed entirely in Java and is currently a popular enterprise search engine.
It is stable, reliable, fast and easy to install, and is designed for use in cloud computing environments.
Official clients are available for the following languages:
Java
.net (C#)
PHP
Python
Apache Groovy
Ruby
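To make the index-then-search flow above concrete, here is a minimal sketch that talks to ES's REST API directly with Java's built-in HTTP client; the host, port, index name, and field names are assumptions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsSketch {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Index a document; ES analyzes (tokenizes) the text fields on ingest.
        // Host, port, index name, and fields are assumptions.
        HttpRequest index = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/articles/_doc/1?refresh=true"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(
                        "{\"title\": \"Elasticsearch is a distributed search engine\"}"))
                .build();
        http.send(index, HttpResponse.BodyHandlers.ofString());

        // Full-text search; hits come back ranked by relevance score.
        HttpRequest search = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/articles/_search?q=title:distributed"))
                .build();
        System.out.println(http.send(search, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```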
Apache HAWQ
Apache HAWQ (Hadoop With Query) is a Hadoop-native parallel SQL analysis engine. It began as a commercially licensed high-performance SQL engine launched by Pivotal in 2012.
As Hadoop's native SQL query engine, it combines the technical advantages of an MPP database with the huge scalability and convenience of Hadoop.
Benchmarks have reportedly shown HAWQ's OLAP performance to be more than four times that of Hive and Impala. It is well suited to quickly building a data warehouse system on the Hadoop platform.
HAWQ has the following features:
Massively parallel processing
Full SQL compatibility
Support for stored procedures and transactions
It can also be easily integrated with open-source data-mining libraries such as MADlib.
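Since HAWQ derives from the PostgreSQL/Greenplum lineage, it can typically be reached with the standard PostgreSQL JDBC driver. This is a minimal sketch under that assumption; the host, port, database, user, and table are all invented.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HawqSketch {
    public static void main(String[] args) throws Exception {
        // HAWQ speaks the PostgreSQL wire protocol, so the standard PostgreSQL
        // JDBC driver works; host, port, database, and user are assumptions.
        String url = "jdbc:postgresql://localhost:5432/gpadmin";
        try (Connection conn = DriverManager.getConnection(url, "gpadmin", "");
             Statement stmt = conn.createStatement();
             // Full SQL over data managed on HDFS (table and columns are hypothetical).
             ResultSet rs = stmt.executeQuery(
                 "SELECT region, AVG(amount) FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getBigDecimal(2));
            }
        }
    }
}
```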
Apache Lucene
Apache Lucene is an open-source, Java-based full-text search toolkit. It is powerful and widely used, but Lucene is not a complete search engine: it is a full-text search engine architecture, on top of which complete search engine products can be built. It provides complete index creation and querying as well as a text-analysis engine.
The goal of Lucene is to provide an easy-to-use toolkit that lets software developers add full-text search to their systems, or even build a complete full-text search engine on this foundation. It offers a simple but powerful application programming interface (API) for full-text indexing and search.
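Here is a minimal sketch of that API, indexing one document in memory and searching it. Class names follow the Lucene 8.x era, and the field name and text are arbitrary.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory();   // in-memory index for the demo
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "Lucene is a full-text search library", Field.Store.YES));
            writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("search library");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```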
Apache Solr
Apache Solr is an open-source enterprise search platform built on the Apache Lucene architecture. It was released in 2004 and graduated from the Apache Incubator on January 17, 2007.
Its high reliability, scalability, and fault tolerance provide distributed indexing, replication, load-balanced queries, automated failover and recovery, and centralized configuration. It is a standalone full-text search server written entirely in Java that runs as a servlet inside a Java container (Apache Tomcat or Jetty).
Solr relies on the Lucene Java search library for full-text indexing and search, and exposes HTTP/XML and JSON APIs for REST-like operations. Solr's powerful external configuration lets it be adapted to many kinds of software without writing Java code. Solr powers the search and navigation features of many large Internet sites.
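As a sketch of those HTTP/JSON operations, a query against a hypothetical core named articles might look like this; the host and port follow the common default 8983, but all names are assumptions.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class SolrSketch {
    public static void main(String[] args) throws Exception {
        // Query a Solr core over its HTTP/JSON API; host, port, core name,
        // and field names are assumptions.
        String q = URLEncoder.encode("title:search", StandardCharsets.UTF_8);
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8983/solr/articles/select?q=" + q + "&wt=json"))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // JSON result with ranked docs
    }
}
```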
Apache Phoenix
Apache Phoenix is an SQL framework on top of HBase. The Apache Phoenix JDBC API replaces the traditional HBase client API: it creates tables, inserts data, and queries HBase data. Essentially it is a Java middle tier that lets developers work with tables in HBase the way they would with a relational database (for example, MySQL).
Phoenix compiles an SQL query statement into a series of HBase Scan operations and produces a JDBC result set that is returned to the caller. It makes direct use of low-level HBase features such as coprocessors and filters. Small queries respond in milliseconds, and even queries over tens of millions of rows respond in seconds.
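Here is a minimal sketch of that JDBC usage, assuming a local ZooKeeper quorum on localhost:2181; the table and columns are invented for illustration. Note Phoenix's UPSERT syntax and its off-by-default autocommit.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSketch {
    public static void main(String[] args) throws Exception {
        // Phoenix connects through the ZooKeeper quorum of the HBase cluster;
        // the quorum address and all table/column names are assumptions.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS users "
                       + "(id BIGINT PRIMARY KEY, name VARCHAR)");
            // Phoenix uses UPSERT rather than INSERT.
            stmt.executeUpdate("UPSERT INTO users VALUES (1, 'alice')");
            conn.commit();   // autocommit is off by default in Phoenix
            // Compiled into HBase Scan operations under the hood.
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM users")) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + "\t" + rs.getString("name"));
                }
            }
        }
    }
}
```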
At this point, I believe you have a deeper understanding of what open-source query engines and frameworks exist for big data. You might as well try them out in practice. Follow us and continue to learn!