What are the techniques that can handle big data?
In this article, the editor shares the techniques that can handle big data. Most people don't know much about them, so this article is shared for your reference. I hope you learn a lot from reading it. Let's get into it!
Techniques that can handle big data:
Hadoop (offline computing), Spark (real-time computing), Storm (streaming computing)
1. Hadoop background
Apache Hadoop is reliable and scalable software for distributed computing.
Apache Hadoop can be understood as a framework. It allows you to process large distributed data sets (massive data) across clusters of machines using a simple programming model.
Which modules are included:
Hadoop Common: the common utility modules that support the other Hadoop modules
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data
Hadoop YARN: a framework for job scheduling and cluster resource management
Hadoop MapReduce: a YARN-based system for processing large data sets (a distributed computing framework)
Ps: the theory behind MapReduce is to move the computation to the data, because moving massive data across the network is slow.
Each of the above modules has its own independent function, and the modules are related to one another.
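To make the MapReduce model concrete, here is a minimal sketch that runs the WordCount example job shipped with the Hadoop distribution. The input file, the HDFS paths, and the exact name of the examples jar are assumptions that depend on your installation:

hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put ./sample.txt /wordcount/input
# run the bundled WordCount job (map: split lines into words; reduce: sum the counts)
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /wordcount/input /wordcount/output
# the reducers write their results as part-r-* files
hdfs dfs -cat /wordcount/output/part-r-00000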
2. The position and relationship of Hadoop in big data and cloud computing
Cloud computing is the product of integrating traditional computing technologies with Internet technologies: distributed computing, parallel computing, grid computing, multi-core computing, network storage, virtualization, load balancing, and so on.
At present, the two underlying technologies that support cloud computing are virtualization and big data technology.
Hadoop is a ready-made solution for the cloud computing platform.
Ps: IaaS (Infrastructure as a Service), PaaS (Platform as a Service), SaaS (Software as a Service)
3. Hadoop use cases:
1. Log analysis for the web servers of a large website: the site's web servers produce as much as 800 GB of click logs every 5 minutes, with peaks of up to 9 million clicks per second. Every 5 minutes the data is loaded into memory, the site's hot URLs are computed at high speed, and this information is fed back to the front-end cache servers to improve the cache hit rate.
2. Carrier traffic analysis: 2 TB-5 TB of traffic data per day is copied to HDFS, where an interactive analysis engine can run hundreds of complex data-cleaning and reporting tasks; the total runtime is 2-3 times faster than a small cluster of similar hardware running DB2.
3. Real-time analysis of video surveillance from urban traffic checkpoints: streaming analysis of video surveillance data from traffic checkpoints across a province provides real-time analysis, alerting, and statistics. Province-wide, vehicles that have skipped their annual inspection or carry flagged license plates can be identified and alerted on in about 300 milliseconds.
4. The Hadoop ecosystem
Important components:
1. HDFS: a distributed file system
2. MapReduce: a distributed computing framework
3. Hive: a SQL data warehouse tool built on big data technology (file system + computing framework)
4. HBase: a distributed database for massive data built on Hadoop (a NoSQL, non-relational, column-oriented store)
5. Zookeeper: a basic component for distributed coordination services
6. Oozie: a workflow scheduling framework
7. Sqoop: a data import and export tool
8. Flume: a log data collection framework
9. Mahout: a machine learning algorithm library that runs on distributed frameworks such as MapReduce, Spark, and Flink
Distributed systems
1. Distributed software systems
A distributed software system is a system composed of a group of computer nodes that communicate over a network and coordinate to accomplish a common task. Distributed systems emerged so that cheap, ordinary machines can complete the computing and storage tasks that used to require a single large computer; the goal is to make full use of many computers to handle more tasks.
2. Examples of commonly used distributed software systems:
Web server clusters: the performance and resources of a single server are limited, and there is an upper bound on the number of concurrent connections it can support, so a cluster of many servers is needed to provide higher concurrency and computing speed.
All of the web servers are reached through a single domain name; the same domain name must lead to the same entry point.
Baidu, for example, has thousands (or even more) of web servers, yet we access them through the single entry www.baidu.com. Which server actually serves a given request is decided by a technology implemented at the underlying level: load balancing.
Analysis process of offline data
Web log data mining
Case study:
A website clickstream log data mining system
Requirements:
The web clickstream log contains important information about how the website operates. By analyzing the log, we can learn the number of visits to the site, which pages are visited most, which pages are the most valuable, the advertising conversion rate, where visitors come from, and which terminals visitors use.
Data sources:
To collect the data, embed a JavaScript snippet in the page in advance and bind event listeners to the page's tags; whenever a user clicks or triggers an event, the user's information can be captured and a log file generated.
Data processing flow:
1. Data collection: write a custom collection program, or use Flume
2. Data preprocessing: develop custom MapReduce programs that run on the Hadoop cluster
3. Data warehouse computing: use Hive (built on Hadoop) to complete the data cleaning (ETL) in the data warehouse
4. Data export: use Sqoop to export the data, as sketched below
5. Data visualization: done by web developers. Ps: Oozie can be used to assist with scheduling during development
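As a hedged sketch of step 4, a Sqoop export from the warehouse to a relational database might look like this; the JDBC URL, credentials, table name, and HDFS directory are all hypothetical:

sqoop export \
  --connect jdbc:mysql://dbhost:3306/report \
  --username report_user \
  --password 'secret' \
  --table page_views \
  --export-dir /warehouse/page_views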
HDFS distributed file system
HDFS originates from a technical paper from Google on GFS; HDFS is a clone of GFS. The full name of HDFS is the Hadoop Distributed File System. It runs on large numbers of ordinary, cheap machines, provides a fault-tolerance mechanism, and provides file access services with good performance to large numbers of users.
Advantages and disadvantages of HDFS
Advantages:
1. High reliability: Hadoop's ability to store and process data bit by bit is trustworthy
2. High scalability: Hadoop distributes data across the available computer clusters to complete computing tasks
3. Efficiency: Hadoop can move data dynamically between nodes and keep each node dynamically balanced
4. High fault tolerance: Hadoop automatically keeps multiple replicas of the data and automatically reassigns failed tasks
Disadvantages:
1. Not suitable for low-latency data access
2. Unable to store large numbers of small files efficiently
3. Does not support multiple concurrent writers or arbitrary modification of files
Important features of HDFS
1. Files in HDFS are physically split into blocks. The block size can be set with a parameter (dfs.blocksize); the default is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x.
2. The HDFS file system presents the client with a unified abstract directory tree, and the client accesses files through paths in this tree.
3. The directory structure and file block information (the metadata) are managed by the NameNode. The NameNode is the master node of the HDFS cluster; it maintains the directory tree of the entire HDFS file system and, for each path (file), the corresponding block information (the block IDs and the DataNode servers that hold them).
4. The blocks of each file are stored and managed by the DataNodes. A DataNode is a slave node of the HDFS cluster; each block can be stored as multiple replicas on multiple DataNodes (the number of replicas can be set with dfs.replication). The fsck sketch below shows this mapping in practice.
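To see this metadata in practice, hdfs fsck can list a file's blocks and the DataNodes that hold each replica; the path below is an assumption:

hdfs fsck /data/bigfile.log -files -blocks -locations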
The storage model of HDFS in Hadoop
HDFS is file-oriented: a file is split linearly into blocks, and each block has a byte offset that marks which part of the file it covers. That is, the first byte of each block corresponds to a definite position in the large file, and this position is the block's offset. Within a file, the defined block size is fixed, so all of the cut blocks are the same size; only the final remainder of the file may differ, in which case the last block stores just the actual remaining size rather than a full block. Replicas (copies) of a block are scattered across different nodes, and the number of replicas should not exceed the number of nodes; the default number of replicas in HDFS is 3. The purpose of replicas is to guarantee that when a file block is lost on one node, the same information can still be obtained from another node, so two replicas of the same block should not sit on the same node. The block size and the number of replicas can be set when the file is uploaded; the number of replicas of already-uploaded blocks can be adjusted later, but the block size cannot. A block can be written only once and read many times; if you want to append data, you can only append to the last block.
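A small sketch of adjusting these settings from the shell, assuming a hypothetical file and path: the block size must be chosen at upload time, while the replica count can be changed afterwards:

# upload with a 256 MB block size (268435456 bytes) instead of the default
hdfs dfs -D dfs.blocksize=268435456 -put ./bigfile.log /data/bigfile.log
# raise the replica count to 3; -w waits until replication completes
hdfs dfs -setrep -w 3 /data/bigfile.log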
HDFS read and write process
The HDFS read process:
1. The client connects to the NameNode, looks up the metadata, and finds out where the data blocks are stored.
2. The client reads the blocks concurrently through the HDFS API.
3. The connection is closed.
The HDFS write process:
1. The client connects to the NameNode and asks to store data.
2. The NameNode records the data's location information (the metadata) and tells the client where to store it.
3. The client uses the HDFS API to store the blocks (64 MB by default in Hadoop 1.x) on the DataNodes.
4. The DataNodes replicate the data among themselves, and report back to the client once the backup is done.
5. The client notifies the NameNode that the block has been stored.
6. The NameNode synchronizes the metadata into memory.
7. The remaining blocks repeat the process above.
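A quick round trip over both paths from the shell; the file name and target path are assumptions:

echo "hello hdfs" > hello.txt
# write path: the client asks the NameNode, then streams the block to DataNodes
hdfs dfs -put hello.txt /tmp/hello.txt
# read path: the client fetches block locations from the NameNode, then reads from DataNodes
hdfs dfs -cat /tmp/hello.txt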
Read and write permissions in the HDFS file system:
r -> read, w -> write, x -> execute
Each group of three bits can be regarded as an octal digit: 1 means the permission is granted, 0 means it is not.
rwx | r-- | -w- -> in numeric form 111 | 100 | 010 -> 742
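For example, to apply the 742 value above to a hypothetical path:

hdfs dfs -chmod 742 /data/demo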
Shell commands for HDFS:
Ps: whether you use the hdfs dfs form or the hadoop fs form, you can perform operations on HDFS.
1. Upload files to HDFS
put: copies one or more files from a source path in the local file system to a target path in the HDFS file system
hdfs dfs -put <local file path> <HDFS file system path>
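For example, with a hypothetical local log file and target directory:

hdfs dfs -put ./access.log /logs/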
2. Download files from the HDFS file system
get: copies files from the HDFS file system to the local file system
hdfs dfs -get <HDFS file system path> <local file system path>
Ps: HDFS has methods similar to put and get: copyFromLocal is equivalent to put, and copyToLocal is equivalent to get.
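For example, downloading the hypothetical file uploaded above:

hdfs dfs -get /logs/access.log ./access.log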
3. View the contents of files in the HDFS file system
cat: views the contents of files in the HDFS file system
hdfs dfs -cat <path of the file in the HDFS file system>
Ps: do not use it on non-files (directories). While viewing a file you can also append its contents to a local path, as in the sketch below.
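For example, with a hypothetical path; the shell's >> redirection appends the viewed content to a local file:

hdfs dfs -cat /logs/access.log
hdfs dfs -cat /logs/access.log >> ./local_copy.log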
4. Copy operations in the HDFS file system
cp: copies files from one HDFS path to another HDFS path
hdfs dfs -cp <source file path in the HDFS file system> <destination path in the HDFS file system>
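For example, with hypothetical paths:

hdfs dfs -cp /logs/access.log /backup/access.log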
5. Move files in the HDFS file system
mv: moves source files to a destination path. This command allows multiple source paths, in which case the destination must be a folder (directory). Moving files between different file systems is not allowed.
hdfs dfs -mv <source file path in the HDFS file system> <destination path in the HDFS file system>
This is equivalent to cut and paste.
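For example, with hypothetical paths:

hdfs dfs -mv /logs/access.log /archive/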
6. View the size of files in the HDFS file system
hdfs dfs -du <path of a file in the HDFS file system>
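For example, with a hypothetical path; the -h flag prints human-readable sizes:

hdfs dfs -du -h /logs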
7. Create a folder in the HDFS file system
mkdir: creates a folder
hdfs dfs -mkdir <path in the HDFS file system>
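For example, with a hypothetical path; the -p flag creates missing parent directories as needed:

hdfs dfs -mkdir -p /logs/2016/01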
8. View all files under a path in the HDFS file system
hdfs dfs -ls <HDFS file system path>
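For example, with a hypothetical path; the -R flag lists recursively:

hdfs dfs -ls -R /logs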
9. Delete directories or files from the HDFS file system
hdfs dfs -rm <HDFS file system path>
Ps: this form can only delete a single file or an empty directory.
If the target is a folder containing multiple files, add -r: hdfs dfs -rm -r <HDFS file system path>
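For example, with hypothetical paths:

hdfs dfs -rm /logs/access.log
hdfs dfs -rm -r /logs/2016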
10. Change the permissions of a file
r readable, w writable, x executable
Each group of three bits can be treated as an octal digit.
For example: rwx | rwx | rwx -> 111 | 111 | 111 -> 777
hdfs dfs -chmod <permission value> <HDFS file system path>
If you need to change the permissions of everything under a directory, add -R:
hdfs dfs -chmod -R <permission value> <folder path in the HDFS file system>
Ps: all subfiles and subfolders under the folder will be modified as well
11. The Recycle Bin
Hadoop's Recycle Bin (trash) is disabled by default; it is recommended to enable it (this is done by setting fs.trash.interval in core-site.xml to a value greater than 0, in minutes).
Ps: by default you may not have permission to operate on the Recycle Bin, so grant permission before operating on it:
hdfs dfs -chmod -R 777 <path of the Recycle Bin in the HDFS file system>
Example: hdfs dfs -chmod -R 777 /user. Suppose that after deleting a file you discover it was deleted by mistake and want to recover it:
hdfs dfs -mv <path of the file under the Recycle Bin in the HDFS file system> <original HDFS file system path>
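A hypothetical recovery; the user name and file path are assumptions (trash keeps deleted files under /user/<user>/.Trash/Current):

hdfs dfs -mv /user/root/.Trash/Current/logs/access.log /logs/access.log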
Empty the Recycle Bin:
hdfs dfs -rm -r <path of the Recycle Bin in the HDFS file system>
For example: hdfs dfs -rm -r /user/root/.Trash
The above is the full content of this article, "What are the techniques that can handle big data?" Thank you for reading! I believe you now have some understanding of the topic, and I hope the content shared here helps you. If you want to learn more, welcome to follow the industry information channel!