Related materials of Hadoop

2025-04-21 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

1 HDFS

1.1 concept

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

1.2 Features

- High fault tolerance

- Low hardware requirements

- High-throughput data access

1.3 File system Command Line

1.3.1 get help

hadoop fs -help

1.3.2 ls command

hadoop fs -ls /
hadoop fs -ls -R /user

1.3.3 getconf command

hdfs getconf -help
hdfs getconf -namenodes

1.3.4 version Information

hdfs version

Note: these commands work much like their Linux counterparts; see the official links at the end of the article for details.

2 MapReduce

2.1 introduction to MapReduce

MapReduce is a programming model for parallel computation over large datasets (larger than 1 TB).

2.2 how it works

Suppose a plate holds a mix of black beans, soybeans, mung beans and red beans, and you want to pick out the red beans.

The MapReduce approach is:

Step 1: Assemble a team to do the work (equivalent to a cluster of servers).

Step 2: Distribute the beans evenly among the team members (equivalent to allocating data to the servers in the cluster).

Step 3: Have every team member pick out red beans at the same time (equivalent to the cluster's machines processing data in parallel).

Step 4: Gather the red beans picked out by all team members (equivalent to the cluster aggregating and outputting the results).
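
The four steps can be sketched on a single machine. This is only an analogy under stated assumptions: a thread pool stands in for the cluster, and the names split_evenly, pick_red and pick_red_beans are hypothetical helpers, not part of any Hadoop API.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, n_workers):
    """Step 2: deal the items out round-robin so each worker gets a near-equal share."""
    return [items[i::n_workers] for i in range(n_workers)]


def pick_red(share):
    """Step 3: one team member keeps only the red beans from their share."""
    return [bean for bean in share if bean == "red"]


def pick_red_beans(plate, n_workers=4):
    shares = split_evenly(plate, n_workers)
    # Step 1: the worker pool plays the role of the team (the cluster).
    with ThreadPoolExecutor(max_workers=n_workers) as team:
        picked = list(team.map(pick_red, shares))
    # Step 4: aggregate every member's red beans into one result.
    return [bean for part in picked for bean in part]


plate = ["black", "red", "soy", "mung", "red", "black", "red", "soy"]
red_beans = pick_red_beans(plate)
print(len(red_beans))  # 3
```

A real MapReduce job distributes the shares across machines and handles failures and data locality, which this sketch leaves out.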

3 Hive

3.1 introduction to Hive

3.1.1 concept

Hive is a data warehouse platform based on Hadoop.

3.1.2 the role of Hive

Hive makes ETL work straightforward.

Hive defines a SQL-like query language, HQL.

Hive converts the HQL statements written by users into the corresponding MapReduce programs and executes them on Hadoop.
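
To make the translation idea concrete, the sketch below shows by hand how a query such as `select word, count(*) from t group by word` could be expressed as a map step and a reduce step. It is a conceptual illustration only: the table t, the function names and the in-memory shuffle are assumptions, and it does not reproduce Hive's actual query planner.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit a (key, 1) pair for each row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["word"], 1


def shuffle(pairs):
    """Shuffle: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: apply the aggregate (here, a count) to each key's values."""
    return {key: sum(values) for key, values in grouped.items()}


rows = [{"word": "one"}, {"word": "three"}, {"word": "one"}]
counts = reduce_phase(shuffle(map_phase(rows)))
print(counts)  # {'one': 2, 'three': 1}
```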

3.1.3 History of Hive projects

Hive is a data warehouse framework open-sourced by Facebook in August 2008. Its goals are similar to those of Pig, but it has some mechanisms that Pig does not yet support.

For example: a richer type system, a more SQL-like query language, persistent table metadata, and so on.

4 Impala

4.1 Introduction to Impala

Impala is a real-time, interactive SQL query tool for big data, developed by Cloudera and inspired by Google's Dremel. Instead of the slow Hive + MapReduce batch processing, Impala uses a distributed query engine similar to those in commercial parallel relational databases (composed of a Query Planner, a Query Coordinator and a Query Exec Engine). It queries data directly in HDFS or HBase with SELECT, JOIN and aggregate functions, which greatly reduces latency.

4.2 shell for Impala

4.2.1 start shell

impala-shell

4.2.2 version query

select version();

4.3 Operation of the library

4.3.1 query the database

show databases;

4.3.2 create a database

create database testdb;
create database testdb2;

Database storage path:

hdfs dfs -ls /user/hive/warehouse/

4.3.3 using the database

use testdb;

4.3.4 display the current database

select current_database();

4.3.5 Delete the database

drop database testdb;

4.4 Table operation

4.4.1 create a table

create table t1 (x int);
create table t3 (id int, word string);
create table city (id int, name string, countrycode string, district string, population int);

4.4.2 display tables in the database

show tables;
show tables in testdb;
show tables in testdb like 'tweets';

4.4.3 Table structure description

describe city;

4.4.4 modify the table name

alter table t3 rename to t2;

4.4.5 insert data

insert into t1 values (1), (3), (2), (4);
insert into t2 values (1, "one"), (3, "three"), (5, 'five');

4.4.6 data query

select min(x), max(x), sum(x), avg(x) from t1;
select word from t1 join t2 on (t1.x = t2.id);
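
To see what the two queries in 4.4.6 should return for the rows inserted in 4.4.5, the sketch below replays them against an in-memory SQLite database. SQLite is only a stand-in here (an assumption made so the example runs without a cluster), so types and dialect details differ from Impala; the `order by` clause is added just to make the output order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Same tables and rows as in sections 4.4.1 to 4.4.5 (t2 is the renamed t3).
cur.execute("create table t1 (x int)")
cur.execute("create table t2 (id int, word text)")
cur.executemany("insert into t1 values (?)", [(1,), (3,), (2,), (4,)])
cur.executemany("insert into t2 values (?, ?)",
                [(1, "one"), (3, "three"), (5, "five")])

# Aggregate query: one row with min, max, sum and average of x.
stats = cur.execute("select min(x), max(x), sum(x), avg(x) from t1").fetchone()

# Join query: only x values that also appear as t2.id produce a row.
words = [row[0] for row in cur.execute(
    "select word from t1 join t2 on (t1.x = t2.id) order by t2.id")]

conn.close()
print(stats)  # (1, 4, 10, 2.5)
print(words)  # ['one', 'three']
```

The row with id 5 has no matching x in t1, so "five" does not appear in the join result.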

5 Sentry

5.1 enable permissions

5.1.1 enable permissions

Hive/Impala > Configuration > Service-Wide > Sentry Service > select "sentry"

5.1.2 specify authentication server

Hive > Configuration > Service-Wide > Advanced > Server Name for Sentry Authorization (hive.sentry.server) > fill in the Sentry server name or IP address

5.1.3 set up privileged users

Hive > Configuration > Service-Wide > Security > Bypass Sentry Authorization Users (sentry.metastore.service.users) > enter the Linux user names that bypass Sentry (hive, impala, hue, hdfs, etc.)

5.1.4 configure the proxy user for Hive

HDFS > Configuration > Service-Wide > Proxy > Hive Proxy User Groups (hadoop.proxyuser.hive.groups) > fill in the Linux user names to be proxied (hive, impala, hue, hdfs, etc.)

5.1.5 restart the service

Restart the Hive and Impala services.

5.2 authorization

5.2.1 create database users and groups

groupadd gp1
useradd user1 -G gp1
useradd user2 -G gp1

5.2.2 switching execution user

su - impala

5.2.3 create a database

Switch to the hive shell

hive

Create the database

create database testdb;

Exit the hive shell

quit;

5.2.4 creating roles

Switch to the impala shell

impala-shell

Create a role

create role ro1;

5.2.5 confirm the role created

show roles;

5.2.6 Association of user groups and roles

grant role ro1 to group gp1;

5.2.7 role authorization

grant all on database testdb to role ro1;
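
The grants in 5.2 form a chain: users belong to groups, groups are granted roles, and roles hold privileges. The sketch below is a simplified in-memory model of that chain, not Sentry's API; the dictionaries and the functions privileges_of and can_access are hypothetical names used only for illustration.

```python
# Hypothetical in-memory model of the user -> group -> role -> privilege chain.
group_members = {"gp1": {"user1", "user2"}}                 # from 5.2.1
group_roles = {"gp1": {"ro1"}}                              # from 5.2.6
role_privileges = {"ro1": {("database", "testdb", "all")}}  # from 5.2.7


def privileges_of(user):
    """Collect every privilege a user inherits through group membership."""
    granted = set()
    for group, members in group_members.items():
        if user in members:
            for role in group_roles.get(group, ()):
                granted |= role_privileges.get(role, set())
    return granted


def can_access(user, database):
    """True if the user holds the 'all' privilege on the given database."""
    return ("database", database, "all") in privileges_of(user)


print(can_access("user1", "testdb"))  # True
print(can_access("user3", "testdb"))  # False
```

Because privileges attach to roles rather than to users, adding a user to gp1 is enough to give that user access to testdb.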

References:

Docs:

http://hadoop.apache.org/docs/current/

Hadoop Common Guide:

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

File System Shell Guide:

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#Overview

MapReduce Common Guide:

http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html

Hive Docs

http://hive.apache.org

LanguageManual:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual

GettingStarted:

https://cwiki.apache.org/confluence/display/Hive/GettingStarted

User Documentation:

https://cwiki.apache.org/confluence/display/Hive/Home#Home-UserDocumentation

Impala Docs

Impala SQL

http://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_langref_sql.html#langref_sql

Impala Tutorials

http://www.cloudera.com/documentation/enterprise/latest/topics/impala_tutorial.html

Impala Explore

http://www.cloudera.com/documentation/enterprise/latest/topics/impala_tutorial.html#tutorial_explore

Sentry Docs

Overview of Impala Security

http://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_security.html#security

Enabling Sentry Authorization for Impala

http://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_authorization.html#authorization

Impala Grant

http://www.cloudera.com/documentation/enterprise/5-6-x/topics/impala_grant.html#grant

Hive Grant

http://www.cloudera.com/documentation/enterprise/5-6-x/topics/sg_hive_sql.html#concept_c2q_4qx_p4__col_level_auth_sentry

Disabling Hive CLI

http://www.cloudera.com/documentation/enterprise/5-6-x/topics/sg_sentry_overview.html

Other references:

The concept of ETL:

http://www.cnblogs.com/elaron/archive/2012/04/09/2438372.html

Introduction to Apache Sentry Architecture

http://blog.javachen.com/2015/04/29/apache-sentry-architecture.html

Enable Kerberos authentication

http://www.cloudera.com/documentation/enterprise/latest/topics/cm_sg_intro_kerb.html#xd_583c10bfdbd326ba--6eed2fb8-14349d04bee--76dd

Introduction to the architecture of Impala

http://www.mutouxiaogui.cn/blog/?p=319
