2025-01-18 Update From: SLTechnology News & Howtos > Development
Shulou (Shulou.com) 06/03 Report
This article explains how to use the Apache Kylin framework. The approach described here is simple and practical; follow along to learn how the framework works.
What is Apache Kylin?
Apache Kylin™ is an open source distributed analysis engine that provides a SQL query interface and multidimensional analysis (OLAP) capabilities on top of Hadoop, supporting extremely large datasets. It was originally developed by eBay Inc. and contributed to the open source community. It can query huge Hive tables at sub-second latency.
Introduction to Apache Kylin Framework
The secret behind Apache Kylin's low (sub-second) latency is precomputation: for a data cube with a star-schema topology, Kylin pre-computes the measures for combinations of dimensions, stores the results in HBase, and exposes JDBC, ODBC, and REST API query interfaces for real-time querying. Kylin reads source data from Hive, builds it into an OLAP cube through the Cube Build Engine, and saves the cube in HBase. When a user executes a SQL query, the query engine translates the SQL statement into an OLAP cube query and returns the result to the user.
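To make the precomputation idea concrete, here is a minimal, self-contained Python sketch. This is not Kylin code, and the fact table and dimension names are invented; it only illustrates the principle that aggregates over every dimension combination are computed once at build time, so a "group by" query at read time becomes a lookup rather than a scan:

```python
from collections import defaultdict
from itertools import combinations

# Tiny invented fact table: (date, category, price) -- stand-ins, not Kylin data.
rows = [
    ("2014-01-01", "Books", 10.0),
    ("2014-01-01", "Toys", 5.0),
    ("2014-01-02", "Books", 7.0),
]
dims = ("date", "category")

# Build phase: for every subset of dimensions (a "cuboid"),
# pre-aggregate sum(price) keyed by that subset's values.
cube = defaultdict(float)
for date, category, price in rows:
    values = {"date": date, "category": category}
    for r in range(len(dims) + 1):
        for combo in combinations(dims, r):
            cube[(combo, tuple(values[d] for d in combo))] += price

# Query phase: a "group by" answer is a dictionary lookup, not a table scan.
def query_sum(group_values):
    combo = tuple(d for d in dims if d in group_values)
    return cube[(combo, tuple(group_values[d] for d in combo))]

print(query_sum({"date": "2014-01-01"}))  # 15.0
print(query_sum({}))                      # 22.0 (grand total)
```

Kylin applies the same trade-off at Hadoop scale: build time and storage are spent up front so that query time stays flat regardless of raw row count.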
Core concepts of Apache Kylin
Table: the definition of the Hive tables that serve as the source of cubes. Tables must be synchronized into Kylin before a cube can be built.
Model: describes the data structure of a star schema, defining the join and filter relationships between a fact table (Fact Table) and multiple lookup tables (Lookup Table).
Cube description: the definition and configuration options of a Cube instance, including which data model it uses, which dimensions and measures it contains, how the data is partitioned, how automatic merging is handled, and so on.
Cube instance: produced by building a Cube description; it contains one or more Cube Segments.
Partition: a user can designate a DATE or STRING column as the partition column in a Cube description, dividing the cube into multiple segments by date.
Cube segment: the data carrier produced by a cube build. Each segment maps to a table in HBase, and each build of a cube instance generates a new segment. If the source data of a built cube changes, only the segments covering the affected time period need to be refreshed.
Aggregation group: each aggregation group is a subset of the dimensions; cuboids are built internally from combinations of the dimensions in the group.
Job: a build request on a cube instance generates a job, which records the information for each step of the build. The job's status reflects the build result: RUNNING means the cube instance is being built, FINISHED means the build succeeded, and ERROR means the build failed.
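The cuboid combinatorics behind aggregation groups can be illustrated with a short Python sketch. This is a simplified model, not Kylin's actual pruning algorithm, and the dimension names are borrowed from the sample schema only for flavor: n dimensions yield 2^n possible cuboids, which is why splitting dimensions into aggregation groups matters for keeping build cost down.

```python
from itertools import combinations

def cuboids(dimensions):
    """Enumerate every cuboid (dimension subset) of a full cube."""
    for r in range(len(dimensions) + 1):
        yield from combinations(dimensions, r)

dims = ["part_dt", "leaf_categ_id", "lstg_site_id", "seller_id"]
all_cuboids = list(cuboids(dims))
print(len(all_cuboids))  # 2^4 = 16 cuboids for a full cube

# Simplified pruning model: splitting the dimensions into two aggregation
# groups of 2 means each group only materializes its own 2^2 cuboids.
g1, g2 = dims[:2], dims[2:]
pruned = len(list(cuboids(g1))) + len(list(cuboids(g2)))
print(pruned)  # 8 instead of 16
```

Real Kylin pruning also accounts for mandatory, hierarchy, and joint dimensions, so actual cuboid counts differ, but the exponential growth this sketch shows is the underlying motivation.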
Getting started with sample Cube
Kylin provides a script to create a sample cube. Execute $KYLIN_HOME/bin/sample.sh, wait for the script to finish, and the sample project is imported. The script creates five sample Hive tables that record sales data; we will analyze sales-related information based on these tables.
After importing the sample project, you need to reload the metadata or restart the Kylin service for the project to take effect. Here we reload the metadata.
Visit http://hostname:7070 and log in to the Kylin web UI with the default username and password, ADMIN/KYLIN.
Select the learn_kylin project in the project drop-down box (upper left); the Hive tables you just imported appear under this project.
Connect to Hive and inspect the tables in the default database; the five sample tables are listed there.
In the web UI, open "System" and click "Reload Metadata" to reload the metadata.
Once the reload completes, you can view the imported models. There are two; this walkthrough focuses on "kylin_sales_cube".
Select the data model and click "Build" to build the cube.
For the data partition range, choose an end date after 2014-01-01 so that all 10,000 sample records are covered.
Click "Monitor" to view the cube build jobs, refresh to check the build progress, and wait for the build to complete.
This step is time-consuming because the pre-computation runs, by default, as a MapReduce job.
If it has not completed after a long time, click Refresh again to update the status. An error was encountered during this run:
After investigation, the cause was that the Hadoop job history server had not been started. Start it with the following command:
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
Run jps to check for a JobHistoryServer process; if the process exists, the history server started successfully.
After the history server starts, click "Action" on the pending job and resume it so that the build returns to the running state.
Build completed
Now query the built cube by executing SQL in the "Insight" tab. First, run a SQL statement to count the rows of the KYLIN_SALES table.
select count(*) from KYLIN_SALES
The first execution takes 1.77 seconds.
Repeat executions are much faster; the fourth execution takes 0.11 seconds.
Next, we run a SQL statement for business analysis, counting total daily sales and the number of distinct sellers. The result comes back quickly.
select part_dt, sum(price) as total_sold, count(distinct seller_id) as sellers
from kylin_sales
group by part_dt
order by part_dt
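For readers without a running Kylin cluster, the aggregation semantics of this query can be checked locally with Python's built-in sqlite3 module. The table rows below are invented stand-ins for the sample data, so only the shape of the result matters, not the numbers:

```python
import sqlite3

# In-memory table mimicking the columns the query touches (invented rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kylin_sales (part_dt TEXT, price REAL, seller_id INTEGER)")
conn.executemany(
    "INSERT INTO kylin_sales VALUES (?, ?, ?)",
    [("2014-01-01", 10.0, 1), ("2014-01-01", 5.0, 2), ("2014-01-02", 7.0, 1)],
)

# Same aggregation as the Kylin query: daily totals plus distinct sellers.
result = conn.execute(
    "SELECT part_dt, SUM(price) AS total_sold, COUNT(DISTINCT seller_id) AS sellers "
    "FROM kylin_sales GROUP BY part_dt ORDER BY part_dt"
).fetchall()
print(result)  # [('2014-01-01', 15.0, 2), ('2014-01-02', 7.0, 1)]
```

The difference in Kylin is that this answer is read from the precomputed cube rather than computed by scanning the fact table.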
Extending the SQL above, we join multiple tables to run a multidimensional analysis of the sales data, checking sales across different categories. The multi-table join query is still very fast.
SELECT sum(KYLIN_SALES.PRICE) AS price_sum,
       KYLIN_CATEGORY_GROUPINGS.META_CATEG_NAME,
       KYLIN_CATEGORY_GROUPINGS.CATEG_LVL2_NAME
FROM KYLIN_SALES
INNER JOIN KYLIN_CATEGORY_GROUPINGS
   ON KYLIN_SALES.LEAF_CATEG_ID = KYLIN_CATEGORY_GROUPINGS.LEAF_CATEG_ID
  AND KYLIN_SALES.LSTG_SITE_ID = KYLIN_CATEGORY_GROUPINGS.SITE_ID
GROUP BY KYLIN_CATEGORY_GROUPINGS.META_CATEG_NAME,
         KYLIN_CATEGORY_GROUPINGS.CATEG_LVL2_NAME
ORDER BY KYLIN_CATEGORY_GROUPINGS.META_CATEG_NAME ASC,
         KYLIN_CATEGORY_GROUPINGS.CATEG_LVL2_NAME DESC
Click "Visualization" to display the results graphically; you can select the graph type, dimensions, and measures.
At this point, you should have a deeper understanding of how to use the Apache Kylin framework. Try it out in practice.