This article explains how to quickly import log files and binary files into HDFS. The explanation is simple, clear, and easy to follow; work through it step by step to study how to quickly import log files and binaries into HDFS.
Preferred data movement method
If you are running an older Hadoop environment, you may need tools to move data into it; all of the tools described in this chapter can help. Using Kafka as the data transfer mechanism decouples producers from consumers while allowing multiple consumers to operate on the data in different ways. In this case, we can use Kafka both to land data in Hadoop and to feed real-time streaming systems such as Storm or Spark Streaming, which can then perform near-real-time computation. For example, the Lambda architecture computes aggregates over small increments of data in real time and uses a batch layer to correct errors and add new data points, thereby combining the advantages of real-time and batch systems.
Practice: using Flume to push Syslog messages to HDFS
Faced with a pile of log files generated by multiple applications and systems across multiple servers, we may feel overwhelmed. There is no doubt that valuable information can be mined from these logs, but the first challenge is moving them to the Hadoop cluster so that analysis can be performed.
Version considerations
The Flume examples here use version 1.4. As with all software, there is no guarantee that the techniques, code, and configuration described here work out of the box with other versions of Flume. In addition, Flume 1.4 requires some updates to work with Hadoop 2.
Problem
You want to push Syslog files from all production servers to HDFS.
Solution
Use Flume, a data collection system, to push Linux log files to HDFS.
Discussion
The core of Flume is the collection and distribution of log files; here it is used to collect system logs and transfer them to HDFS. The first step in this technique is to capture all data appended to /var/log/messages and transfer it to HDFS. We will run a single Flume agent (described in more detail later), which will do all the work.
A Flume agent needs a configuration file that tells it what to do, and the following code defines one for this use case:
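A minimal sketch of such a configuration is shown below. The original listing is not reproduced here, so the agent name, component names, and HDFS path are illustrative; only the overall shape (an exec source, a memory channel, and an HDFS sink) follows the discussion later in this section.

    # tail-hdfspart1.conf -- illustrative Flume agent configuration
    agent1.sources  = tail
    agent1.channels = memory1
    agent1.sinks    = hdfs1

    # Source: tail the syslog file and stamp each event with a timestamp
    agent1.sources.tail.type = exec
    agent1.sources.tail.command = tail -F /var/log/messages
    agent1.sources.tail.channels = memory1
    agent1.sources.tail.interceptors = ts
    agent1.sources.tail.interceptors.ts.type = timestamp

    # Channel: buffer up to 100,000 events in memory
    agent1.channels.memory1.type = memory
    agent1.channels.memory1.capacity = 100000

    # Sink: write events into date/time-partitioned directories in HDFS
    agent1.sinks.hdfs1.type = hdfs
    agent1.sinks.hdfs1.channel = memory1
    agent1.sinks.hdfs1.hdfs.path = /flume/messages/%Y-%m-%d/%H%M
    agent1.sinks.hdfs1.hdfs.fileType = DataStream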
For the example to work, you need to make sure that you are using a host that can access the Hadoop cluster and that the HADOOP_HOME is configured correctly, download and install Flume, and set FLUME_HOME to point to the installation directory.
Copy the previous file to the Flume conf directory using the file name tail-hdfspart1.conf. When you are finished, you can start the Flume agent instance:
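Assuming the configuration above and the illustrative agent name agent1, the launch command looks something like the following (the flume-ng options shown are standard, but the agent name must match your configuration):

    $ $FLUME_HOME/bin/flume-ng agent \
        --conf $FLUME_HOME/conf \
        --conf-file $FLUME_HOME/conf/tail-hdfspart1.conf \
        --name agent1 \
        -Dflume.root.logger=INFO,console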
This should produce a lot of output, but eventually you should see something similar to the following, indicating that everything is all right:
At this point, you should see some data that appears in HDFS:
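For example, assuming the illustrative sink path used above, a listing along these lines shows the files Flume is writing (the date in the path will reflect the current time):

    $ hadoop fs -ls /flume/messages/2014-06-01/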
The .tmp suffix indicates that Flume has the file open and is still writing to it. Once the file is closed, Flume renames it and removes the suffix.
You can cat this file to check its contents, which should match the output of tail /var/log/messages.
So far, we have completed the first data movement with Flume!
Dissecting the Flume agent
Let's go back and examine what we did. There are two main parts: defining the Flume configuration file and running the Flume agent. The configuration file contains details about the source, channel, and sink, which are the concepts that make up the Flume data flow. Figure 5.4 shows these concepts in a Flume agent.
Let's introduce these concepts step by step, including the purpose and how they work.
Sources
A Flume source is responsible for reading data from external clients or from other Flume sinks. The unit of data in Flume is an event, which is essentially a payload plus an optional set of metadata. A Flume source sends these events to one or more Flume channels, which handle storage and buffering.
Figure 5.4 Flume components in the context of an agent
Flume has a wide set of built-in sources, including HTTP, JMS, and RPC. Let's take a look at the source-specific configuration properties we set:
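From the illustrative configuration above, the source-specific properties are the exec type, the command to run, and a timestamp interceptor:

    agent1.sources.tail.type = exec
    agent1.sources.tail.command = tail -F /var/log/messages
    agent1.sources.tail.interceptors = ts
    agent1.sources.tail.interceptors.ts.type = timestamp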
The exec source allows Unix commands to be executed, and every line emitted on standard output is captured as an event (standard error is discarded by default). In the previous example, the tail -F command is used to capture system messages as they are generated. If you have more control over the files (for example, if you can move them into a directory after all writes are complete), consider using Flume's spooled directory source (called spooldir), because it provides reliability semantics that the exec source cannot offer.
Testing with tail only
The use of tail for anything other than testing is not encouraged.
Another feature highlighted in this configuration is the interceptor, which allows you to add metadata to events. Recall that the data in HDFS is organized by timestamp: the first part of the path is the date, and the second part is the time.
This is possible because each event is modified by a timestamp interceptor, which inserts the time (in milliseconds) at which the source processed the event into the event header. The Flume HDFS sink then uses this timestamp to determine where the event is written.
To summarize Flume sources, let's take a look at the features it provides:
Transactional semantics, which allow data to be moved reliably with at-least-once semantics. Not all data sources support this.
Interceptors, which provide the ability to modify or drop events. They are useful for annotating events with the host, time, and a unique identifier, which helps with deduplication.
Selectors, which allow events to be fanned out or multiplexed in various ways: events can be fanned out by copying them to multiple channels, or routed to different channels based on event headers.
Channels
Flume channels provide data storage facilities inside the agent. Sources add events to a channel, and sinks remove events from it. Channels provide varying degrees of durability, and you select one based on the capacity and throughput your application needs.
Flume bundles three channels:
The memory channel stores events in an in-memory queue. This is useful for high-throughput data flows, but it offers no durability guarantees, which means that if the agent fails, you lose data.
The file channel persists events to disk. The implementation uses an efficient write-ahead log and has strong durability properties.
The JDBC channel stores events in a database. This provides the strongest durability and recoverability, but at a cost in performance.
In the previous example, we used a memory channel and limited the number of events it stores to 100,000. Once a memory channel reaches its maximum number of events, it starts rejecting requests from the source to add more. Depending on the source type, this means the source either retries or drops the event (the exec source drops events):
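The channel portion of the illustrative configuration shown earlier looks like this:

    agent1.channels.memory1.type = memory
    agent1.channels.memory1.capacity = 100000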
Sinks
Flume sinks take events from one or more Flume channels and either forward them to another Flume source (in a multi-hop flow) or handle them in a sink-specific manner. Flume has many built-in sinks, including HDFS, HBase, Solr, and Elasticsearch.
In the previous example, we configured the flow to use an HDFS sink:
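The sink portion of the illustrative configuration shown earlier looks like this:

    agent1.sinks.hdfs1.type = hdfs
    agent1.sinks.hdfs1.channel = memory1
    agent1.sinks.hdfs1.hdfs.path = /flume/messages/%Y-%m-%d/%H%M
    agent1.sinks.hdfs1.hdfs.fileType = DataStream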
We configure the sink to write files based on the timestamp (note %Y and the other timestamp escape sequences). We can do this because events are marked with a timestamp by the interceptor on the exec source. In fact, you can use any header value to determine where an event is written (for example, you could add a host interceptor and then write files based on the host that generated the event).
You can configure the HDFS sink in a variety of ways to determine how files are rolled. When the sink reads its first event, it opens a new file (if one is not already open) and writes to it. By default, the sink keeps the file open for about 30 seconds, after which the file is closed and a new one is rolled; this behavior can be changed using the properties in Table 5.5.
Table 5.5 Rollover properties for the Flume HDFS sink
The default HDFS sink settings should not be used in production, because they result in a large number of potentially very small files. It is recommended that you increase these values or use a downstream compaction job to merge the small files.
The HDFS sink allows you to specify how events are serialized when they are written to a file. By default, they are serialized in text format, and any headers added by interceptors are not written. If, for example, you want to write the data (including event headers) in Avro, you can use the serializer configuration to do so. When you do, you can also specify the Hadoop compression codec that Avro uses internally to compress the data:
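A hedged sketch of such a sink configuration, continuing the illustrative component names used earlier, is:

    agent1.sinks.hdfs1.serializer = avro_event
    agent1.sinks.hdfs1.serializer.compressionCodec = snappy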
Summary
Reliability in Flume depends on the type of channel used, whether the data source can retransmit events, and whether events are multiplexed to multiple sources to mitigate unrecoverable node failures. In this technique, a memory channel and an exec source were used, neither of which provides reliability in the face of failure. One way to add reliability is to replace the exec source with the spooled directory source and the memory channel with the file channel.
We ran Flume on a single machine with a single agent consisting of one source, one channel, and one sink, but Flume also supports fully distributed setups, with agents running on multiple hosts and multiple agent hops between the source and the final destination. Figure 5.5 shows how Flume runs in a distributed environment.
The goal of this technique was to move data into HDFS. However, Flume supports a variety of data sinks, including HBase, file roll, Elasticsearch, and Solr. Using Flume to write to Elasticsearch or Solr enables powerful near-real-time indexing.
Therefore, Flume is a very powerful data movement tool that makes it easy to move data into HDFS and many other locations. It can move data continuously, supports various levels of resilience to system failures, and can be run with only simple configuration.
Figure 5.5 A Flume setup that uses load balancing and fan-in to move log4j logs into HDFS
What Flume is not really optimized for is working with binary data. It can move binary data, but it loads the entire event into memory, so moving files that are gigabytes in size or larger will not work well.
Practice: an automatic mechanism for copying files to HDFS
You may have learned how to use log collection tools such as Flume to move data into HDFS automatically. However, these tools do not handle semi-structured or binary data well. In this practice, we will look at how to move such files into HDFS automatically.
Enterprise production environments often have network silos, with the Hadoop cluster partitioned away from other production applications. In that case, the Hadoop cluster may be unable to pull data from other data sources, so the data has to be pushed to Hadoop instead.
A mechanism is needed to automate the process of copying files in any format to HDFS, similar to the Linux tool rsync. This mechanism should be able to compress files written in HDFS and provide a way to dynamically determine HDFS destinations for data partitioning.
Existing file transfer mechanisms, such as Flume, Scribe, and Chukwa, are designed for log files. What if the files are in a different format, such as semi-structured or binary data? And if the files are isolated in a way that the Hadoop slave nodes cannot reach directly, Oozie cannot be used to assist with the ingest either.
Problem
You need to automate the process of copying files from a remote server to HDFS.
Solution
The open source HDFS File Slurper project can copy files in any format into or out of HDFS. This technique covers how to configure it and use it to copy data into HDFS.
Discussion
You can use the HDFS File Slurper (https://github.com/alexholmes/hdfs-file-slurper) to help automate this process. The HDFS File Slurper is a simple utility that copies files from a local directory into HDFS and vice versa.
Figure 5.6 provides a high-level overview of the Slurper and an example of how it is used to copy files. The Slurper reads all files that exist in the source directory and can optionally consult a script to determine each file's location in the target directory. It then writes the file to the target, optionally followed by a verification step. After all steps complete successfully, the Slurper moves the source file into a completion folder.
Figure 5.6 HDFS File Slurper data flow for copying files
With this technique, you need to make sure the following challenges are addressed:
How can I effectively partition writes to HDFS so that everything is not consolidated into one directory?
How do I determine if the data in HDFS is ready for processing (to avoid reading intermediate replicated files)?
How do I execute the utility automatically and periodically?
The first step is to download the latest HDFS File Slurper tarball from https://github.com/alexholmes/hdfs-file-slurper/releases and install it on a host that can access the Hadoop cluster and the local Hadoop installation:
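The steps are roughly as follows (the release filename and version are illustrative, and the final path matches the /usr/local/hdfs-slurper location used below):

    $ cd /usr/local
    $ sudo tar -xzf /path/to/hdfs-slurper-<version>-package.tar.gz
    $ sudo ln -s hdfs-slurper-<version> hdfs-slurper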
Configuration
Before running the code, you need to edit /usr/local/hdfs-slurper/conf/slurper-env.sh and set the location of the hadoop script. The following code is an example of a slurper-env.sh file, assuming you followed the Hadoop installation instructions:
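A minimal sketch, assuming a Hadoop installation under /usr/local/hadoop (the variable name may differ between Slurper versions):

    # slurper-env.sh -- location of the hadoop script (path is an assumption)
    export HADOOP_BIN=/usr/local/hadoop/bin/hadoop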
Slurper is bundled with the / usr/local/hdfs-slurper/conf/slurper.conf file, which contains details of the source and destination directories, as well as other options. This file contains the following default settings, which you can change:
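The settings below are a sketch of what such a file contains, using the option names described next; the directory values are illustrative and follow the file:/ and hdfs:/ URI convention explained afterwards:

    DATASOURCE_NAME = test
    SRC_DIR = file:/tmp/slurper/in
    WORK_DIR = file:/tmp/slurper/work
    COMPLETE_DIR = file:/tmp/slurper/complete
    ERROR_DIR = file:/tmp/slurper/error
    DEST_STAGING_DIR = hdfs:/tmp/slurper/stage
    DEST_DIR = hdfs:/tmp/slurper/dest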
Let's take a closer look at these settings:
DATASOURCE_NAME specifies the name of the data being transferred. The name is used in the log filename when the Slurper is launched via the Linux init daemon.
SRC_DIR specifies the source directory. Any file moved into this directory is automatically copied to the destination directory (with intermediate hops through the work and staging directories).
WORK_DIR is the work directory. Files are moved here from the source directory before the copy to the destination begins.
COMPLETE_DIR specifies the complete directory. When a copy finishes, the file is moved from the work directory into this directory. Alternatively, the --remove-after-copy option can be used to delete the source file, in which case the --complete-dir option should not be supplied.
ERROR_DIR is the error directory. Any error encountered during the copy causes the source file to be moved into this directory.
DEST_DIR sets the final destination directory for source files.
DEST_STAGING_DIR specifies the destination staging directory. A file is first copied into this directory, and once the copy succeeds, the Slurper moves it to the final destination. This prevents the destination directory from containing partially written files if a failure occurs.
You will notice that all of the directory names are HDFS URIs; this is how HDFS distinguishes between different filesystems. A file:/ URI refers to a path on the local filesystem, and an hdfs:/ URI refers to a path in HDFS. In fact, the Slurper supports any Hadoop filesystem, as long as Hadoop is configured correctly.
Running
Create a local directory called /tmp/slurper/in, write an empty file into it, and then run the Slurper:
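A sketch of those steps, assuming the illustrative configuration above (the launcher script name and --config-file flag are assumptions; check your installation's bin directory):

    $ mkdir -p /tmp/slurper/in
    $ touch /tmp/slurper/in/test-file.txt
    $ /usr/local/hdfs-slurper/bin/slurper.sh \
        --config-file /usr/local/hdfs-slurper/conf/slurper.conf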
A key feature of the Slurper's design is that it must not be used with partially written files. Files must be moved atomically into the source directory (file moves within both the Linux and HDFS filesystems are atomic). Alternatively, you can write to a filename that starts with a period (.), which the Slurper ignores, and rename the file to drop the period prefix once the write is complete.
Note that copying multiple files with the same filename causes the destination to be overwritten; it is the user's responsibility to ensure that filenames are unique to prevent this.
Dynamic destination routing
The previous approach works well if you move a small number of files to HDFS each day. However, if you are dealing with a large volume of files, you will want to partition them into separate directories. The advantage is finer-grained control over the input data for MapReduce jobs, and it helps keep the data organized in the filesystem overall (you wouldn't want all the files on your computer sitting in a single directory either).
How do you get more dynamic control over the target directory and filename that the Slurper uses? The Slurper configuration file has a SCRIPT option (mutually exclusive with the DEST_DIR option) in which you can specify a script that provides a dynamic mapping from source files to target files.
Suppose the file you are using contains the date in the file name, and you have decided to organize the data in HDFS by date. Then, you can write a script to perform this mapping activity. The following example is a Python script to do this:
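The book's original script is not reproduced here; the following is a minimal sketch in Python. It assumes the Slurper's script protocol of reading the source path on standard input and printing the destination path on standard output, and it assumes filenames embed a yyyy-MM-dd date:

    #!/usr/bin/env python
    # Read a local source file path from stdin, extract the yyyy-MM-dd date
    # embedded in the filename, and print the HDFS destination path
    # partitioned by that date. Protocol and paths are assumptions.
    import os
    import re
    import sys

    for line in sys.stdin:
        src = line.strip()
        if not src:
            continue
        filename = os.path.basename(src)
        match = re.search(r"(\d{4}-\d{2}-\d{2})", filename)
        if not match:
            sys.exit("no date found in filename: " + filename)
        date = match.group(1)
        # e.g. hdfs:/data/apache/2014-06-01/apache-2014-06-01.log
        print("hdfs:/data/apache/%s/%s" % (date, filename))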
You can now update / usr/local/hdfs-slurper/conf/slurper.conf, set SCRIPT, and comment out DEST_DIR, which will generate the following entries in the file:
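Assuming the script above was saved as /usr/local/hdfs-slurper/conf/route-by-date.py (a hypothetical name), the relevant slurper.conf entries would look roughly like this:

    # DEST_DIR = hdfs:/tmp/slurper/dest
    SCRIPT = /usr/local/hdfs-slurper/conf/route-by-date.py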
If you run Slurper again, you will notice that the target path is now partitioned by date by the Python script:
Data compression and verification
What if you want to compress the output file in HDFS and verify that the copy is correct? Compression requires the COMPRESSION_CODEC option, whose value is a class that implements the CompressionCodec interface. If the codec is LZO or LZOP, you can also add the CREATE_LZO_INDEX option so that LZOP indexes are created. (Please read Chapter 4 for details; for a link, see the end of the article.)
The validation feature re-reads the target file after the replication is complete and ensures that the checksum of the target file matches the source file. This results in longer processing time, but adds an additional guarantee of successful replication.
The following configuration snippet shows the LZOP codec, LZO indexing, and enabled file validation:
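A hedged sketch of those settings follows; COMPRESSION_CODEC and CREATE_LZO_INDEX are the options named above, while the name of the verification option (shown here as VERIFY) is an assumption:

    COMPRESSION_CODEC = com.hadoop.compression.lzo.LzopCodec
    CREATE_LZO_INDEX = true
    VERIFY = true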
Let's run Slurper again:
Continuous operation
Now that you have mastered the basic mechanism, the final step is to run the tool as a daemon to constantly find the files to be transferred. To do this, you can use a script called bin/slurper-inittab.sh, which is designed to be used with inittab respawn.
This script does not create a PID file or wrap itself in nohup, neither of which makes sense in a respawn context, because inittab manages the process. It uses the DATASOURCE_NAME configuration value to build the log filename, which means multiple Slurper instances can be launched with different configuration files, each logging to its own file.
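A hedged example of an inittab entry (the identifier, runlevels, and flag are assumptions):

    slurper:2345:respawn:/usr/local/hdfs-slurper/bin/slurper-inittab.sh --config-file /usr/local/hdfs-slurper/conf/slurper.conf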
Summary
The Slurper is a convenient tool for moving data from a local filesystem into HDFS, and it also supports the reverse, copying data from HDFS to a local filesystem. It is useful when MapReduce cannot access the filesystem directly and the files being transferred are in a form that is not suitable for tools such as Flume.
Practice: using Oozie to schedule periodic data extraction
If the data lives on a filesystem, a web server, or any other system reachable from the Hadoop cluster, we need a way to pull it into Hadoop on a regular schedule. Tools exist for pushing log files and for extracting from databases, but if you need to interact with some other system, you will probably have to handle the ingest process yourself.
This technology uses Oozie version 4.0.0.
This data ingest has two parts: importing data from another system into Hadoop, and making that transfer happen on a regular schedule.
Problem
Automate daily tasks to download content from the HTTP server to HDFS.
Solution
Oozie can be used to move data into HDFS and to trigger follow-on processing, such as launching a MapReduce job over the acquired data. Oozie, now an Apache project, is a Hadoop workflow engine that manages data processing activities. Oozie also has a coordinator engine that starts workflows based on data and time triggers.
Discussion
In this practice, we will download content from several URLs every 24 hours, using Oozie to manage the workflow and the schedule. The flow of this technique is shown in Figure 5.7; we will use Oozie's time-based trigger to kick off a MapReduce job every 24 hours.
Figure 5.7 data flow of Oozie technology
The first step is to look at the coordinator XML configuration file. Oozie's coordinator engine uses this file to determine when the workflow should be started. Oozie uses a template engine and an expression language to perform parameterization, as shown in the following code. Create a file named coordinator.xml with the following content:
Code 5.1 Using Oozie's template engine to parameterize the coordinator
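The original listing is not reproduced here; the following is a minimal sketch of such a coordinator. The application name, the property passed to the workflow, and the use of coord:formatTime are illustrative choices, not the book's exact definition:

    <coordinator-app name="http-download" frequency="${coord:days(1)}"
                     start="${startTime}" end="${endTime}" timezone="UTC"
                     xmlns="uri:oozie:coordinator:0.4">
      <action>
        <workflow>
          <app-path>${workflowAppUri}</app-path>
          <configuration>
            <property>
              <!-- pass the nominal date to the workflow so output can be
                   partitioned by day -->
              <name>date</name>
              <value>${coord:formatTime(coord:nominalTime(), 'yyyy-MM-dd')}</value>
            </property>
          </configuration>
        </workflow>
      </action>
    </coordinator-app>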
What can be confusing about Oozie scheduling is that the start and end times have nothing to do with the wall-clock time at which a job actually executes. Instead, they refer to the nominal dates for which each workflow instance is created, which is useful when data is generated periodically and you want to be able to go back to a point in time and run some operation over that data. In this example, you want to run a job every 24 hours, so you can set the start date to yesterday and the end date to some date far in the future.
Next, we need to define the actual workflow that is executed each time an interval fires. To do this, create a file called workflow.xml containing what is shown in the next listing.
Code 5.2 The workflow definition launched by the Oozie coordinator
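Again, this is a hedged sketch rather than the original listing. It runs a map-reduce action using the "old" API; the mapper class name, input path, and output layout are hypothetical:

    <workflow-app name="http-download-wf" xmlns="uri:oozie:workflow:0.4">
      <start to="download"/>
      <action name="download">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <!-- mapper that fetches each input URL (class name is hypothetical) -->
            <property>
              <name>mapred.mapper.class</name>
              <value>hip.ch5.http.HttpDownloadMap</value>
            </property>
            <property>
              <name>mapred.input.dir</name>
              <value>/input/urls</value>
            </property>
            <!-- partition output by the date passed in from the coordinator -->
            <property>
              <name>mapred.output.dir</name>
              <value>/output/${date}</value>
            </property>
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Download failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
      </kill>
      <end name="end"/>
    </workflow-app>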
Oozie wants the map and reduce classes to use the "old" MapReduce API. If you want to use the "new" API, you need to specify additional properties:
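These are the standard Hadoop properties that switch a job to the new API; with them in place you would use mapreduce.map.class and mapreduce.reduce.class instead of the mapred.*.class properties shown above:

    <property>
      <name>mapred.mapper.new-api</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.reducer.new-api</name>
      <value>true</value>
    </property>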
The final step is to define the properties file, which specifies the HDFS and MapReduce endpoints (the NameNode and the JobTracker or ResourceManager) as well as the location in HDFS of the two XML files created above. Create a file named job.properties, as shown in the following code:
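A sketch under the following assumptions: a pseudo-distributed cluster on localhost, a YARN ResourceManager (see the note below for Hadoop 1.x), an application directory under the user's HDFS home, and an arbitrary start date; the 2026 end date matches the one mentioned later:

    nameNode=hdfs://localhost:8020
    jobTracker=localhost:8032
    queueName=default

    # HDFS location of coordinator.xml and workflow.xml
    oozie.coord.application.path=${nameNode}/user/${user.name}/http-download
    workflowAppUri=${oozie.coord.application.path}

    # nominal window for the coordinator
    startTime=2014-05-31T00:00Z
    endTime=2026-01-01T00:00Z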
JobTracker properties for different Hadoop versions
If you are using Hadoop version 1.X, you should use the JobTracker RPC port in the jobTracker attribute (the default is 8021). Otherwise, the YARN ResourceManager RPC port is used (default is 8032).
In the previous code snippet, the location in HDFS refers to where the coordinator.xml and workflow.xml files written earlier in this chapter live. Now you need to copy the XML files, the input file, and the JAR containing the MapReduce code into HDFS:
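A sketch of the copy, assuming the illustrative paths above; the JAR name, input filename, and lib-directory convention (Oozie picks up JARs from a lib directory under the application path) are assumptions:

    $ hadoop fs -mkdir -p /user/$USER/http-download/lib
    $ hadoop fs -put coordinator.xml workflow.xml /user/$USER/http-download/
    $ hadoop fs -put target/http-download.jar /user/$USER/http-download/lib/
    $ hadoop fs -mkdir -p /input/urls
    $ hadoop fs -put urls.txt /input/urls/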
Finally, run the job in Oozie:
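Assuming an Oozie server running on its default port, the command looks like this:

    $ oozie job -oozie http://localhost:11000/oozie -config job.properties -run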
You can use the Job ID to get some information about the job:
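For example (the placeholder stands in for the job ID printed by the previous command):

    $ oozie job -oozie http://localhost:11000/oozie -info <job-id>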
The output of this command shows a run of the job, including its nominal time. The overall status is RUNNING, which means the job is waiting for the next interval to occur. When the entire job completes (after the end date), its state changes to SUCCEEDED.
You can confirm that the output directory in HDFS corresponds to a specific date:
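Assuming the illustrative output layout used in the workflow sketch above:

    $ hadoop fs -ls /output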
As long as the job is running, it will continue to execute until the end date, which in this example has been set to 2026. If you want to stop the job, use the -suspend option:
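For example (using the same placeholder job ID):

    $ oozie job -oozie http://localhost:11000/oozie -suspend <job-id>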
Oozie can also resume a suspended job with the -resume option and kill a workflow with the -kill option.
Thank you for reading. That covers how to quickly import log files and binaries into HDFS. After studying this article, you should have a deeper understanding of the topic, though the specifics still need to be verified in practice. The editor will continue to publish related articles; you are welcome to follow along.