How to Build a Big Data Development Environment Based on Docker

This article explains how to build a big data development environment based on Docker. The content is simple and clear and easy to follow; read along to learn how such an environment can be set up.
Big data development relies heavily on the runtime environment and on data. For example, Spark applications often depend on Hive, but the local development environment has no Hive, so copying code back and forth between the local machine and the server is inefficient. My idea was to use Docker to build a stand-alone big data cluster locally and then copy the code into a container for testing. I explored this myself: an image with Hadoop, Hive, Spark, and other components installed can basically meet the requirements, but it also has problems, such as needing to adjust the configuration to stay consistent with the production environment; that can be done, but it is a lot of work.
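To make the workflow concrete, here is a minimal sketch of the develop-locally, test-in-container loop; the container name, jar path, and main class are illustrative placeholders, not taken from any particular setup:

# Hypothetical example: copy a locally built Spark jar into the container
# and run it there with spark-submit (all names and paths are placeholders).
docker cp target/my-spark-app.jar sandbox-hdp:/tmp/
docker exec -it sandbox-hdp \
  spark-submit --master local[2] --class com.example.Main /tmp/my-spark-app.jar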
In fact, both CDH and HDP provide similar stand-alone images. The component versions in HDP are relatively new and match my company's technology stack, so it is worth exploring; if the experience is good, I will use it for related development in the future.
Obtaining the Sandbox

System requirements
Docker 17.09 or later is installed
On Windows and Mac, Docker needs to be allocated more than 10 GB of memory
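Before downloading anything, both requirements can be checked from the shell; this is just a sanity check I find convenient, not part of the official instructions:

# Check the Docker version (should report 17.09 or later).
docker --version
# Check the memory available to the Docker VM, in bytes (should exceed 10 GB).
docker info --format '{{.MemTotal}}'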
Script download and execution
You can visit the https://www.cloudera.com/downloads/hortonworks-sandbox/hdp.html page in a browser to download the deploy scripts, or download them directly from the command line with wget:
$ wget --no-check-certificate https://archive.cloudera.com/hwx-sandbox/hdp/hdp-3.0.1/HDP_3.0.1_docker-deploy-scripts_18120587fc7fb.zip
Extract and execute the script:
$ unzip HDP_3.0.1_docker-deploy-scripts_18120587fc7fb.zip
Archive:  HDP_3.0.1_docker-deploy-scripts_18120587fc7fb.zip
   creating: assets/
  inflating: assets/generate-proxy-deploy-script.sh
  inflating: assets/nginx.conf
  inflating: docker-deploy-hdp30.sh
$ sh docker-deploy-hdp30.sh
After the script runs, it starts pulling the Docker image. Tens of gigabytes of data need to be downloaded, so wait patiently.
Sandbox verification
After the script finishes, docker ps shows that two containers have been started:
CONTAINER ID   IMAGE                           COMMAND                  CREATED             STATUS             PORTS                        NAMES
daf0f397ff6c   hortonworks/sandbox-proxy:1.0   "nginx -g 'daemon of..."   About an hour ago   Up About an hour   0.0.0.0:1080->1080/tcp, ...  sandbox-proxy
b925f92f368d   hortonworks/sandbox-hdp:3.0.1   "/usr/sbin/init"           About an hour ago   Up About an hour   22/tcp, 4200/tcp, 8080/tcp   sandbox-hdp
sandbox-proxy is just the proxy container and can be ignored; the one to pay attention to is sandbox-hdp, in which all the HDP components run.
UI verification
Because the port mappings are already in place, each UI can be reached at the corresponding localhost port. Start by visiting the splash page at localhost:1080.
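If you are unsure which ports have been mapped, Docker can list them directly; this is a generic check rather than anything the deploy script prints:

# List the host-to-container port mappings of the proxy container.
docker port sandbox-proxy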
This page is a wizard. Clicking Launch Dashboard on the left opens the Ambari login page and the HDP Tutorial page, while clicking Quick Links on the right opens another wizard with jump links for Ambari, Zeppelin, Atlas, Ranger, and other components.
The Ambari login credentials can be found on the https://www.cloudera.com/tutorials/learning-the-ropes-of-the-hdp-sandbox.html page; different users can be chosen for different purposes:
User         Role                        Password
admin        Ambari Admin                initialized with the ambari-admin-password-reset command
maria_dev    Spark and SQL Developer     maria_dev
raj_ops      Hadoop Warehouse Operator   raj_ops
holger_gov   Data Steward                holger_gov
amy_ds       Data Scientist              amy_ds
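The admin password is not preset; it is initialized by running the reset command mentioned above inside the container, roughly like this:

# Set the Ambari admin password interactively (the command ships with the sandbox).
docker exec -it sandbox-hdp ambari-admin-password-reset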
You can verify each Web UI one by one; here, let's move on to verifying the underlying storage and compute.
Functional verification
Enter the container from the command line:
docker exec -it sandbox-hdp bash

HDFS verification
A simple ls:
[root@sandbox-hdp /]# hdfs dfs -ls /
Found 13 items
drwxrwxrwt   - yarn    hadoop   0 2018-11-29 17:56 /app-logs
drwxr-xr-x   - hdfs    hdfs     0 2018-11-29 19:01 /apps
drwxr-xr-x   - yarn    hadoop   0 2018-11-29 17:25 /ats
drwxr-xr-x   - hdfs    hdfs     0 2018-11-29 17:26 /atsv2
drwxr-xr-x   - hdfs    hdfs     0 2018-11-29 17:26 /hdp
drwx------   - livy    hdfs     0 2018-11-29 17:55 /livy2-recovery
drwxr-xr-x   - mapred  hdfs     0 2018-11-29 17:26 /mapred
drwxrwxrwx   - mapred  hadoop   0 2018-11-29 17:26 /mr-history
drwxr-xr-x   - hdfs    hdfs     0 2018-11-29 18:54 /ranger
drwxrwxrwx   - spark   hadoop   0 2021-02-06 07:19 /spark2-history
drwxrwxrwx   - hdfs    hdfs     0 2018-11-29 19:01 /tmp
drwxr-xr-x   - hdfs    hdfs     0 2018-11-29 19:21 /user
drwxr-xr-x   - hdfs    hdfs     0 2018-11-29 17:51 /warehouse
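Beyond listing directories, a quick write/read round trip confirms that HDFS is actually writable; the file name here is an arbitrary example:

# Inside the container: write a small file into HDFS, read it back, then clean up.
echo "hello hdfs" > /tmp/probe.txt
hdfs dfs -put /tmp/probe.txt /tmp/probe.txt
hdfs dfs -cat /tmp/probe.txt
hdfs dfs -rm /tmp/probe.txt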
Hive verification

The Sandbox already has some test data built in, so we can simply query it.
Start the hive command line first:
[root@sandbox-hdp /]# hive
See which databases are available:
0: jdbc:hive2://sandbox-hdp.hortonworks.com:2> show databases;
+----------------------+
| database_name        |
+----------------------+
| default              |
| foodmart             |
| information_schema   |
| sys                  |
+----------------------+
Select foodmart and see which tables are available:
0: jdbc:hive2://sandbox-hdp.hortonworks.com:2> use foodmart;
0: jdbc:hive2://sandbox-hdp.hortonworks.com:2> show tables;
+-----------+
| tab_name  |
+-----------+
| account   |
| ...       |
+-----------+
You can see that there are many tables, so we choose the account table:
0: jdbc:hive2://sandbox-hdp.hortonworks.com:2> select * from account limit 1;
+---------------------+-------------------------+------------------------------+-----------------------+-------------------------+-------------------------+
| account.account_id  | account.account_parent  | account.account_description  | account.account_type  | account.account_rollup  | account.custom_members  |
+---------------------+-------------------------+------------------------------+-----------------------+-------------------------+-------------------------+
| 1000                | NULL                    | Assets                       | Asset                 | ~                       |                         |
+---------------------+-------------------------+------------------------------+-----------------------+-------------------------+-------------------------+
Works as expected.
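The same check can also be run non-interactively from the host; the HiveServer2 URL below is an assumption based on the default port rather than something taken from the sandbox documentation:

# Hypothetical non-interactive variant of the query above.
docker exec -it sandbox-hdp \
  beeline -u jdbc:hive2://localhost:10000 -n hive \
  -e "select count(*) from foodmart.account;"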
Spark verification
Query the account table after starting spark-sql:
spark-sql> select * from foodmart.account limit 1;
Error in query: Table or view not found: `foodmart`.`account`; line 1 pos 14;
'GlobalLimit 1
+- 'LocalLimit 1
   +- 'Project [*]
      +- 'UnresolvedRelation `foodmart`.`account`
Strange.
spark-sql> show databases;
default
Only the default database is visible.
After some searching, it appears that the way Spark accesses Hive tables changed greatly after HDP 3.0: Hive and Spark now keep separate metastore catalogs, so Spark no longer sees Hive's tables by default. The verification of Spark needs further research.
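For readers who want to experiment, one workaround that is often mentioned is pointing Spark at the Hive catalog; the config key below is an unverified assumption on my part, and managed ACID tables are still supposed to go through the Hive Warehouse Connector:

# Possible workaround (unverified assumption): have spark-sql read the "hive"
# metastore catalog instead of its default "spark" catalog.
spark-sql --conf spark.hadoop.metastore.catalog.default=hive \
  -e "show databases; select * from foodmart.account limit 1;"
# Managed ACID tables should still be accessed via the Hive Warehouse Connector.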
Sandbox management

Stop Sandbox
Use the docker stop command:
docker stop sandbox-hdp
docker stop sandbox-proxy

Restart Sandbox
Use the docker start command:
docker start sandbox-hdp
docker start sandbox-proxy
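Since the two containers always start and stop together, a small wrapper can save typing; this script is my own convenience sketch, not part of the HDP deploy scripts:

#!/usr/bin/env bash
# sandbox.sh - start or stop both sandbox containers (hypothetical helper).
set -euo pipefail
case "${1:-}" in
  start) docker start sandbox-hdp sandbox-proxy ;;
  stop)  docker stop sandbox-proxy sandbox-hdp ;;
  *)     echo "usage: $0 {start|stop}" >&2; exit 1 ;;
esac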
Clean up Sandbox

Stop the containers first, then remove them:
docker stop sandbox-hdp
docker stop sandbox-proxy
docker rm sandbox-hdp
docker rm sandbox-proxy
If you also want to delete the image:
docker rmi hortonworks/sandbox-hdp:3.0.1

That covers how to build a big data development environment based on Docker. I believe you now have a deeper understanding of the approach; the specifics are best verified in practice. Thank you for reading.