This article introduces how to use Linux data analysis tools. Many people have questions about these tools in daily work, so this guide collects the relevant material into simple, easy-to-follow steps. I hope it helps answer your questions; follow along to learn more.
Linux data analysis tools include: 1. Hadoop, a software framework for distributed processing of large amounts of data; 2. Storm, which can reliably process large data streams and complements Hadoop's batch processing; 3. RapidMiner, used for data mining and visual modeling; 4. wc, and others.
The operating environment for this tutorial: Linux 5.9.8 system, Dell G3 computer.
6 Linux big data processing and analysis tools
1. Hadoop
Hadoop is a software framework capable of distributed processing of large amounts of data, and it handles that data in a reliable, efficient, and scalable way.
Hadoop is reliable because it assumes that compute elements and storage will fail, so it maintains multiple copies of working data and can redistribute processing around failed nodes.
Hadoop is efficient because it works in parallel, speeding up processing through parallelism.
Hadoop is also scalable and can handle petabytes of data. In addition, because Hadoop runs on clusters of inexpensive commodity servers, it is cheap enough for anyone to use.
Hadoop is a distributed computing platform that is easy for users to build on and use. Users can easily develop and run applications that process huge amounts of data on Hadoop. Its main advantages are:
High reliability. Hadoop's ability to store and process data can be trusted.
High scalability. Hadoop distributes data and computing tasks across clusters of available machines and can easily scale out to thousands of nodes.
High efficiency. Hadoop can move data dynamically between nodes and keeps each node dynamically balanced, so processing is very fast.
High fault tolerance. Hadoop automatically keeps multiple copies of data and automatically reassigns failed tasks.
Hadoop comes with a framework written in Java, so it is well suited to running on Linux production platforms. Applications on Hadoop can also be written in other languages, such as C++.
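To make the workflow concrete, here is a minimal sketch of using a Hadoop cluster from the shell, assuming a working installation with the hdfs and hadoop commands on the PATH; the HDFS paths and the exact name of the bundled examples jar are assumptions and vary by version:
$ hdfs dfs -mkdir -p /user/demo/input                   # create an input directory in HDFS (hypothetical path)
$ hdfs dfs -put jan2017articles.csv /user/demo/input/   # copy a local file into HDFS
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/demo/input /user/demo/output   # run the bundled WordCount MapReduce job over the input directory
$ hdfs dfs -cat /user/demo/output/part-r-00000 | head   # look at the first few lines of the job output
The job runs as parallel map and reduce tasks across the cluster, which is where the efficiency and fault tolerance described above come from.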
2. HPCC
HPCC is the abbreviation of High Performance Computing and Communications. In 1993, the U.S. Federal Coordinating Council for Science, Engineering and Technology submitted to Congress the report "Grand Challenges: High Performance Computing and Communications", known as the HPCC program, the President's science strategy project, whose aim was to solve a number of important scientific and technological challenges by strengthening research and development. HPCC was the American program for implementing the information superhighway, expected to cost 10 billion US dollars. Its main goals were to develop scalable computing systems and supporting software to sustain terabit-level network transmission performance, to develop gigabit network technology, and to expand research and education institutions and the capacity of their network connections.
The project is mainly composed of five parts:
High Performance Computer Systems (HPCS), covering research on future generations of computer systems, system design tools, advanced prototype systems, and the evaluation of existing systems.
Advanced Software Technology and Algorithms (ASTA), covering software support for grand challenges, new algorithm design, software branches and tools, and computing and high-performance computing research centers.
National Research and Education Network (NREN), covering research and development of intermediate stations and gigabit-level (billion-bit) transmission.
Basic Research and Human Resources (BRHR), covering basic research, training, education, and curriculum materials. It is designed to increase the stream of innovative ideas in scalable high-performance computing by rewarding investigators and supporting long-term research, to enlarge the pool of skilled and trained personnel through improved education and high-performance computing training and communications, and to provide the infrastructure needed to support these research activities.
Information Infrastructure Technology and Applications (IITA), which aims to ensure the leading position of the United States in the development of advanced information technology.
3. Storm
Storm is free, open source software: a distributed, fault-tolerant real-time computation system. Storm can reliably process large data streams and is used to complement Hadoop's batch processing. Storm is simple, supports many programming languages, and is fun to use. Storm was open-sourced by Twitter; other well-known companies using it include Groupon, Taobao, Alipay, Alibaba, Le Element, Admaster, and so on.
Storm has many use cases: real-time analytics, online machine learning, continuous (non-stop) computing, distributed RPC (remote procedure call, a way to request services from a program on a remote computer over the network), ETL (short for Extraction-Transformation-Loading, that is, data extraction, transformation, and loading), and so on. Storm's processing speed is impressive: each node can process one million data tuples per second. Storm is scalable, fault-tolerant, and easy to set up and operate.
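As a rough, hedged illustration of day-to-day use (not taken from the original text), a Storm topology is packaged as a jar and managed from the shell with the storm client; the jar name, main class, and topology name below are hypothetical:
$ storm jar my-topology.jar com.example.WordCountTopology word-count   # submit the packaged topology to the cluster under the name "word-count"
$ storm list                                                           # show running topologies and their status
$ storm kill word-count                                                # deactivate and remove the topology
Once submitted, the topology keeps running and processing the incoming stream until it is explicitly killed, which is the continuous-computation model described above.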
4. Apache Drill
To help business users find more effective and faster ways to query Hadoop data, the Apache Software Foundation launched an open source project called Drill. Apache Drill implements Google's Dremel.
According to Tomer Shiran, product manager at Hadoop vendor MapR Technologies, Drill has been run as an Apache Incubator project and will continue to be promoted to software engineers around the world.
The project will create an open source version of Google's Dremel tool for Hadoop (which Google uses to speed up Internet applications of its Hadoop data analysis tools), and Drill will help Hadoop users query massive data sets faster.
The Drill project is also inspired by Google's Dremel project, which helps Google analyze and process massive data sets, including analyzing crawled Web documents, tracking application data installed on Android Market, analyzing spam, analyzing test results from Google's distributed build system, and so on.
By developing the Drill Apache open source project, organizations are expected to establish Drill's APIs and a flexible, powerful architecture that supports a wide range of data sources, data formats, and query languages.
5. RapidMiner
RapidMiner is a world-leading data mining solution with a high degree of advanced technology. It covers a wide range of data mining tasks, including a variety of data mining techniques, and can simplify the design and evaluation of data mining processes.
Functions and features
Provides data mining technology and libraries free of charge
100% Java code (can run on any operating system)
The data mining process is simple, powerful, and intuitive
Internal XML guarantees a standardized format for representing and exchanging data mining processes
Large-scale processes can be automated with a simple scripting language
Multi-level data views ensure that data is valid and transparent
Interactive prototyping with a graphical user interface
Command-line (batch mode) operation for automated large-scale applications
Java API (application programming interface)
Simple plug-in and extension mechanisms
A powerful visualization engine for visual modeling of much cutting-edge high-dimensional data
More than 400 data mining operators supported
RapidMiner (formerly known as YALE) has been successfully applied in many different application fields, including text mining, multimedia mining, functional design, data stream mining, integrated development methods, and distributed data mining.
6. Pentaho BI
The Pentaho BI platform is different from traditional BI products: it is a process-centric, solution-oriented (Solution) framework. Its purpose is to integrate a series of enterprise BI products, open source software, APIs, and other components to make it easy to develop business intelligence applications. Its emergence allows a series of independent business intelligence products, such as Jfree, Quartz, and so on, to be integrated into complex, complete business intelligence solutions.
The Pentaho BI platform, the core architecture and foundation of the Pentaho Open BI suite, is process-centric because its central controller is a workflow engine. The workflow engine uses process definitions to define business intelligence processes that are executed on the BI platform. Processes can be easily customized or new processes can be added. The BI platform contains components and reports to analyze the performance of these processes. At present, the main elements of Pentaho include report generation, analysis, data mining, workflow management and so on. These components are integrated into the Pentaho platform through J2EE, WebService, SOAP, HTTP, Java, JavaScript, Portals and other technologies. The release of Pentaho is mainly in the form of Pentaho SDK.
The Pentaho SDK consists of five parts: the Pentaho platform, the Pentaho sample database, a stand-alone Pentaho platform, a Pentaho solution example, and a preconfigured Pentaho web server. Among them, the Pentaho platform is the most important part, containing the main body of the Pentaho platform source code. The Pentaho database provides data services for the normal operation of the platform, including configuration information, solution-related information, and so on; it is not strictly required by the platform and can be replaced by another database service through configuration. The stand-alone Pentaho platform is an example of the platform's stand-alone operation mode, demonstrating how to make the Pentaho platform run independently without application server support. The Pentaho solution example is an Eclipse project that demonstrates how to develop business intelligence solutions for the Pentaho platform.
The Pentaho BI platform is built on servers, engines and components. These provide the system's J2EE server, security, portal, workflow, rule engine, chart, collaboration, content management, data integration, analysis and modeling functions. Most of these components are standards-based and can be replaced with other products.
9 command-line tools for Linux data analysis
1. head and tail
First, let's start with file handling. What is in the file? What is its format? You can use the cat command to display a file in the terminal, but that is obviously not suitable for files with long content.
Enter head and tail, which display a specified number of lines from the beginning or the end of a file. If you do not specify the number of lines, 10 lines are displayed by default.
$ tail -n 3 jan2017articles.csv
02 Jan 2017,Nesbitt,3 tips for effectively using wikis for documentation,1,/article/17/1/tips-using-wiki-documentation,"Documentation, Wiki",710
02 Jan 2017,Baker,What is your open source New Year's resolution?,1,/poll/17/1/what-your-open-source-new-years-resolution,186
02 Jan 2017,Jen Wike Huger,The Opensource.com preview for January,0,/article/17/1/editorial-preview-january,358
In these last three lines, I can see the date, the author's name, the title, and some other information. However, due to the lack of column headers, I do not know the specific meaning of each column. Let's check the headers of each column:
$ head -n 1 jan2017articles.csv
Post date,Content type,Author,Title,Comment count,Path,Tags,Word count
Now everything is clear: each row records the publication date, content type, author, title, comment count, related URL, the article's tags, and the word count.
2. wc
But what if we need to analyze hundreds or even thousands of articles? This is where the wc command comes in; its name is short for "word count". wc can count the bytes, characters, words, or lines of a file. In this example, we want to know the number of lines in the file.
$ wc -l jan2017articles.csv
93 jan2017articles.csv
This file has 93 lines, and since the first line contains the column headers, we can infer that it is a list of 92 articles.
3. grep
Here's a new question: how many of these articles are related to security topics? To find out, let's assume that the articles we need mention the word security in the title, tags, or elsewhere. The grep tool can search a file for a specific string or other search patterns. It is an extremely powerful tool, because we can even use regular expressions to build extremely precise matching patterns. Here, however, we just need to look for a simple string.
$grep-I "security" jan2017articles.csv30 Jan 2017ArticleJournal Tiberius Hefflin,4 ways to improve your security online right now,3,/article/17/1/4-ways-improve-your-online-security,Security and encryption,124228 Jan 2017ArticleJournal subhashish Panigrahi,How communities in India support privacy and software freedom,0,/article/17/1/how-communities-india-support-privacy-software-freedom,Security and encryption,45327 Jan 2017ArticleJournal Alan Smithee,Data Privacy Day 2017: Solutions for everyday privacy,5 / article/17/1/every-day-privacy, "Big data, Security and encryption", 142404 Jan 2017 Big data, Security and encryption, 2017Daniel J Walsh,50 ways to avoid getting hacked in, 2017Magi 14 GetWord ("17"); 17lash; 1Accordant yearbookwash50, Wayhouse, avoiding, buying, hacked, "Yearbook, 2016 Open Source Yearbook, Security and encryption, Containers, Docker, Linux", 2143.
The format we use is grep, plus the -i flag (telling grep to be case-insensitive), plus the pattern we want to search for, and finally the location of the target file. We find four security-related articles. If the search scope were more specific, we could use a pipe to combine grep with the wc command to see how many lines mention security.
$grep-I "security" jan2017articles.csv | wc-l 4
In this way, wc takes the output of the grep command as its input. Obviously, with this combination plus a little shell scripting, the terminal instantly becomes a powerful data analysis tool.
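The regular-expression support mentioned above means one pattern can cover several keywords at once. For example, a hypothetical broader count of rows mentioning either security or privacy could look like this:
$ grep -iE "security|privacy" jan2017articles.csv | wc -l   # -E enables extended regular expressions; wc -l counts the matching rows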
4. tr
In most analysis scenarios we are faced with CSV files, but how can we convert them to other formats for different uses? For example, you might need tab-separated values, or HTML so the data can be used in a table. The tr command can help you achieve this goal: it translates one type of character into another. As with the other commands, you can also use a pipe to connect its input and output.
Next, let's try a multi-step example: creating a TSV (tab-separated values) file that contains only the articles published on January 20th.
$grep "20 Jan 2017" jan2017articles.csv | tr','/ t'> jan20only.tsv
First, we use grep to select the rows for that date. We pipe the result to the tr command and use it to replace every comma with a tab (written as '\t'). But where did the output go? Here the > character redirects the output into a new file instead of printing it on the screen. This way, we can be sure that the jan20only.tsv file contains the data we expect.
$ cat jan20only.tsv
20 Jan 2017  Article  Kushal Das  5 ways to expand your project's contributor base  2  /article/17/1/expand-project-contributor-base  Getting started  690
20 Jan 2017  Article  D Ruth Bavousett  How to write web apps in R with Shiny  2  /article/17/1/writing-new-web-apps-shiny  Web development  218
20 Jan 2017  Article  Jason Baker  "Top 5: Shell scripting, the Cinnamon Linux desktop environment, and more"  0  /article/17/1/top-5-january-20  Top 5  214
20 Jan 2017  Article  Tracy Miranda  How is your community promoting diversity?  1  /article/17/1/take-action-diversity-tech  Diversity and inclusion  1007
5. sort
What if we want to find the row with the largest value in a particular column? Suppose we need to know which article in the January 20 list is the longest. Working with the January 20 article list from earlier, we can use the sort command to sort on the word-count column. In this case we do not strictly need an intermediate file and could keep using pipes; however, splitting a long command chain into shorter steps often simplifies the whole operation.
$ sort -nr -t$'\t' -k8 jan20only.tsv | head -n 1
20 Jan 2017  Article  Tracy Miranda  How is your community promoting diversity?  1  /article/17/1/take-action-diversity-tech  Diversity and inclusion  1007
This is a long command, so let's break it apart. First, we use the sort command to sort by word count. The -nr option tells sort to perform a numeric sort and to return the results in reverse order (largest to smallest). The -t$'\t' part tells sort that the delimiter is a tab: the $'...' form asks the shell to process the quoted string and interpret \t as a tab character. The -k8 part tells the sort command to use the eighth column, which is the word-count column in this example.
Finally, the output is piped to head, and the result shows the title of the article with the highest word count in the file.
6. sed
You may also need to select specific lines in a file; for that you can use sed. If you want to merge multiple files that all contain header lines and keep only one set of headers in the combined file, you need to remove the extra ones; and if you want to extract only a specific range of lines, you can also use sed. In addition, sed can do a good job of batch search and replace.
The following example takes the earlier article list and creates a new file without the header line, ready to be merged with other files (for example, if we generate such a file every month and need to combine the months).
$ sed '1d' jan2017articles.csv > jan17no_headers.csv
The "1D" option requires sed to delete the first line.
7. cut
Now that we know how to delete lines, how do we delete columns? Or how do we select just one column? Let's try to create a new list of authors from the file we just generated.
$ cut -d',' -f3 jan17no_headers.csv > authors.txt
Here, cut with -d',' splits the file on commas, -f3 selects the third column (the author), and the result is redirected to a new file named authors.txt.
8. uniq
The author list is done, but how do we know how many different authors it contains, and how many articles each author has written? This is where uniq comes in. We sort the file with sort, use uniq to find the unique values and count the number of articles per author, and write the result to a new file.
$ sort authors.txt | uniq -c > authors-sorted.txt
Now you can see the number of articles for each author. Check the last three lines to make sure the results are correct:
$ tail -n3 authors-sorted.txt
1 Tracy Miranda
1 Veer Muchandi
3 VM (Vicky) Brasseur
9. awk
Finally, let's look at the last tool, awk. awk is an excellent replacement tool, but of course it can do much more than that. Let's go back to the January 20 article list TSV file and use awk to create a new list showing the author of each article and the number of words each author wrote.
$awk-F "/ t"'{print $3 "" $NF} 'jan20only.tsvKushal Das 690D Ruth Bavousett 218Jason Baker 214Tracy Miranda 1007
The-F "/ t" is used to tell awk that it is currently working with data separated by tab. Within the curly braces, we provide the execution code for awk. $3 means it is required to output the third line, while $NF represents the last line of output (that is, the abbreviation for 'number of fields'), and adds two spaces between the two results to make a clear division.
Although the example here is small and might not seem to require these tools, if you expand the scope to a file with 93,000 lines, it is obviously difficult to handle with a spreadsheet program.
With these simple tools and a few small scripts, you can avoid database tools and easily do a lot of data statistics. Whether you are a professional or an amateur, their usefulness should not be ignored.
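As a final sketch of how these pieces combine, the following one-liner builds on the jan20only.tsv file created earlier: awk sums the word counts per author, and sort ranks the totals from largest to smallest:
$ awk -F '\t' '{words[$3] += $NF} END {for (a in words) print words[a], a}' jan20only.tsv | sort -nr   # prints total words per author, largest first, e.g. "1007 Tracy Miranda"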
This concludes the study of how to use Linux data analysis tools. I hope it has answered your questions; pairing theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, stay tuned for more practical articles.