
25 big data terms

Updated 2025-01-19. From: SLTechnology News&Howtos


Shulou (Shulou.com), 06/03

Big data

1. Algorithm. How does "algorithm" relate to big data? Although "algorithm" is a general term, big data analytics has made it far more popular and widely used in recent years.

2. Analytics. At the end of the year, you may receive a year-end statement from your credit card company listing all of your transactions for the year. What if you want to dig further into what you spent on food, clothing, entertainment, and so on? Then you are doing "analytics": you are learning from a pile of raw data to help you make decisions about spending in the coming year. What if you did the same exercise on the Twitter or Facebook posts of an entire city? Then we are talking about big data analytics. The essence of big data analytics is using large amounts of data to draw inferences and tell a story. There are three different types of big data analytics, which the next three entries discuss in turn.


3. Descriptive analytics. If you simply told me that last year you spent 25% on food, 35% on clothing, 20% on entertainment, and the rest on miscellaneous items, that is descriptive analytics. Of course, you could go into much more detail.
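To make the idea concrete, here is a minimal Python sketch of that spending breakdown. The transaction amounts and the spending_shares function are made up for illustration; they were chosen so the result matches the percentages quoted above.

```python
from collections import defaultdict

# Hypothetical transactions as (category, amount) pairs, invented for this example.
transactions = [
    ("food", 250.0),
    ("clothing", 350.0),
    ("entertainment", 200.0),
    ("miscellaneous", 200.0),
]


def spending_shares(rows):
    """Return each category's share of total spending, in percent."""
    totals = defaultdict(float)
    for category, amount in rows:
        totals[category] += amount
    grand_total = sum(totals.values())
    return {category: 100.0 * amount / grand_total
            for category, amount in totals.items()}


shares = spending_shares(transactions)
for category, pct in shares.items():
    print(f"{category}: {pct:.0f}%")
```

Nothing here predicts or recommends anything; it only summarizes what already happened, which is the whole point of descriptive analytics.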

4. Predictive analytics. If you analyze your credit card history for the past five years and the split across categories is fairly consistent, you can predict with high probability that next year's spending will look much like that of the past few years. The detail to note is that this is not "predicting the future" but estimating the "probability" of what may happen. In big data predictive analytics, data scientists may use advanced techniques such as machine learning and advanced statistical methods (some of these terms are introduced later) to forecast the weather, economic changes, and so on.
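As a rough sketch of the "probability, not prophecy" idea, the Python below turns five years of history into an expected value plus a plausible range. The yearly figures and the forecast function are assumptions for illustration, and the band of two standard deviations is a crude rule of thumb, not a rigorous prediction interval.

```python
from statistics import mean, stdev

# Hypothetical share of spending that went to food in each of the past
# five years, in percent. Invented numbers, for illustration only.
food_share_by_year = [24.0, 26.0, 25.0, 27.0, 23.0]


def forecast(history):
    """Return (expected value, (low, high)) for next year.

    The range is the mean plus or minus two sample standard deviations,
    which is only a rough band around the expected value.
    """
    center = mean(history)
    spread = stdev(history)
    return center, (center - 2 * spread, center + 2 * spread)


expected, (low, high) = forecast(food_share_by_year)
print(f"Expected food share next year: about {expected:.1f}%")
print(f"Plausible range: {low:.1f}% to {high:.1f}%")
```

Real predictive models replace the simple average with regression or machine learning, but the output has the same shape: a likely value and a measure of uncertainty.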

5. Prescriptive analytics. Sticking with the credit card example, you may want to find out which areas of spending (food, clothing, entertainment, and so on) have the largest impact on your overall spending. Prescriptive analytics builds on predictive analytics by adding a record of "actions" (such as cutting spending on food, clothing, or entertainment) and analyzing their outcomes to "prescribe" the best category to target in order to reduce overall spending. Extend this to big data and you can imagine how executives make data-driven decisions by looking at the impact of various actions.
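A toy version of that "try each action and compare outcomes" loop might look like the Python below. The predicted amounts, the feasible cut percentages, and the best_action function are all hypothetical; a real system would take its predictions from a model rather than a hard-coded dictionary.

```python
# Hypothetical predicted spending for next year, per category.
predicted_spending = {"food": 3000.0, "clothing": 4200.0, "entertainment": 2400.0}

# Hypothetical actions: the fraction each category could realistically be cut.
candidate_cuts = {"food": 0.05, "clothing": 0.20, "entertainment": 0.25}


def best_action(predicted, cuts):
    """Simulate each cut and return the category that saves the most."""
    savings = {category: predicted[category] * fraction
               for category, fraction in cuts.items()}
    best = max(savings, key=savings.get)
    return best, savings


best, savings = best_action(predicted_spending, candidate_cuts)
print(f"Cut {best} first; it saves the most ({savings[best]:.2f}).")
```

The step from predictive to prescriptive is exactly this extra layer: the prediction is an input, and the output is a recommended action.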

6. Batch processing. Although batch data processing has been around since the mainframe era, it has gained additional significance with big data because of the large data sets involved. Batch processing is an efficient way to handle large volumes of data when a set of transactions is collected over a period of time. Hadoop, introduced later, is focused on batch data processing.
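The core pattern is simple enough to sketch in a few lines of Python: collect records, then process them a chunk at a time instead of one by one. The batches helper and the sample amounts are illustrative assumptions and have nothing to do with how Hadoop itself is implemented.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical transaction amounts collected over a period of time.
amounts = [12.5, 40.0, 7.25, 19.0, 3.0, 55.0, 8.0]

# Process the collected data one batch at a time.
batch_totals = [sum(batch) for batch in batches(amounts, 3)]
print(batch_totals)
```

In a real batch system each chunk would be far larger and would typically be handed to a different worker.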

7. Cassandra. Cassandra is a popular open source database management system managed by the Apache Software Foundation. Many big data technologies come from Apache, and Cassandra is designed to handle large volumes of data across distributed servers.

8. Cloud computing. Cloud computing has become so ubiquitous that it may not need repeating here, but it is included for completeness. In essence, cloud computing means software and/or data hosted and running on remote servers and accessible from anywhere on the Internet.

9. Cluster computing. This is a fancy term for computing that uses a "cluster" of the pooled resources of multiple servers. Getting more technical, we might also discuss nodes, cluster management, load balancing, and parallel processing.

10. Dark data. In my opinion, this term was coined to scare senior management out of their wits. Fundamentally, dark data is data that enterprises collect and process but never use for any meaningful purpose, which is why it is called "dark" and may stay buried forever. It could be social network feeds, call center logs, meeting notes, and so on. Various estimates suggest that 60-90% of all enterprise data may be "dark data", but no one really knows.

11. Data lake. When I first heard this term, I really thought someone was playing an April Fool's joke on me. But it is a real term! A data lake is a large repository of enterprise-wide data in its raw format. While we are on the subject, it is worth mentioning data warehouses, which are similar in concept: they too are repositories of enterprise-wide data, but the data is held in a structured format after being cleaned and integrated with other sources. Data warehouses are typically (but not exclusively) used for conventional data. A data lake supposedly makes it easy to access enterprise-wide data, provided you really know what you are looking for, how to process it, and how to make intelligent use of it.

12. Data mining. Data mining is the use of sophisticated pattern recognition techniques to find meaningful patterns and derive insights from large sets of data. It is closely related to the term "analytics" discussed earlier, where you analyzed your personal data. To extract meaningful patterns, data miners use statistics (yes, good old math), machine learning algorithms, and artificial intelligence.

13. Data scientist. Talk about a hot career! A data scientist is someone who can take raw data (from the data lake mentioned earlier, perhaps?), process it, and come up with new insights. Data scientists need skills worthy of Superman: analytics, statistics, computer science, creativity, storytelling, and an understanding of the business context. No wonder they are paid so well.

14. Distributed file system. Because big data is too large to store on a single system, a distributed file system provides a way to store large volumes of data across multiple storage devices, and helps reduce the cost and complexity of storing it.
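The basic trick can be sketched in Python: cut a file into fixed-size blocks and spread copies of each block over several machines. The function names, node names, and the round-robin placement below are assumptions for illustration; real systems such as HDFS use much larger blocks and smarter placement policies.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


# Hypothetical file contents and storage nodes.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["node-a", "node-b", "node-c"], replication=2)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on more than one node, losing a single machine does not lose the file, and reads can be spread across the cluster.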

15. ETL. ETL is an acronym for extract, transform, and load. It refers to the whole process of "extracting" raw data, "transforming" it into a usable form by cleaning and enriching it, and then "loading" it into an appropriate repository for the system to use. Although the concept of ETL originated with data warehouses, it now applies in other scenarios as well, such as ingesting data from external sources into a big data system.
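The three steps map neatly onto three small functions. The Python sketch below uses only the standard library, with an in-memory SQLite database standing in for the target repository; the sample CSV, the table name spending, and the cleaning rules are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with messy values, invented for this example.
RAW_CSV = """date,category,amount
2024-01-05, Food ,12.50
2024-01-06,clothing,40.00
2024-01-07,FOOD,not-a-number
2024-01-08,entertainment,19.99
"""


def extract(text):
    """Extract: read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize category names and drop rows with bad amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        clean.append((row["date"], row["category"].strip().lower(), amount))
    return clean


def load(rows, connection):
    """Load: write the cleaned rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS spending (date TEXT, category TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO spending VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
row_count = connection.execute("SELECT COUNT(*) FROM spending").fetchone()[0]
print(f"Loaded {row_count} clean rows")
```

Keeping the three stages separate makes it easy to swap the source, the cleaning rules, or the destination independently.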


16. Hadoop. When people think of big data, they immediately think of Hadoop. Hadoop (with its lovable elephant logo) is an open source software framework whose main component is the Hadoop Distributed File System (HDFS). It is deployed on distributed hardware to support the storage, retrieval, and analysis of large data sets. If you really want to impress people, you can also talk about YARN (Yet Another Resource Negotiator), which, as the name hints, is a resource scheduler. I sincerely admire the people who come up with these names. The Apache Software Foundation, home of Hadoop, also hosts Pig, Hive, and Spark (yes, those are all names of software). Aren't you impressed by these names?

17. In-memory computing. In general, any computation that can be done without accessing I/O is expected to be faster than one that needs it. In-memory computing is a technique that moves the entire working data set into the collective memory of a cluster and avoids writing intermediate results to disk. Apache Spark is an in-memory computing system, which gives it a huge speed advantage over I/O-bound systems such as Hadoop MapReduce.

18. IoT. The latest buzzword is the Internet of Things (IoT). IoT is the interconnection, through the Internet, of computing devices embedded in everyday objects (sensors, wearables, cars, refrigerators, and so on) so that they can send and receive data. IoT generates huge amounts of data, which creates many more opportunities for big data analytics.

19. Machine learning. Machine learning is a method of designing systems that can learn, adjust, and improve based on the data fed to them. Using predictive and statistical algorithms, these systems learn to home in on "correct" behavior and insights, and they keep improving as more data flows through the system. Typical applications include fraud detection and online personalized recommendations.

20. MapReduce. The concept of MapReduce can be a little confusing, but let me give it a try. MapReduce is a programming model, and the best way to understand it is to note that Map and Reduce are two separate steps. The programming model first breaks the big data set into pieces (technically "tuples", but let's not get too technical here) so that they can be distributed across different computers in different locations (that is, the cluster computing described earlier); this is essentially the Map part. The model then collects all the results and "reduces" them into a single report. MapReduce's data processing model goes hand in hand with Hadoop's distributed file system.
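The classic illustration is a word count. The Python sketch below runs everything in one process, so it only imitates the shape of the model: each chunk of text stands in for a piece sent to a different machine, and the function names map_phase, shuffle, and reduce_phase are labels chosen for this example rather than any framework's API.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each group of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input, already split into chunks as if spread over a cluster.
chunks = ["big data is big", "data about data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because map_phase looks at one chunk at a time and shares no state, the chunks could be processed on different machines in parallel, which is what makes the model scale.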

21. NoSQL. At first glance it sounds like a protest against SQL (Structured Query Language), the bread and butter of traditional relational database management systems (RDBMS), but NoSQL actually stands for "Not Only SQL". NoSQL refers to database management systems designed to handle large volumes of data that has no fixed structure, or what is technically called a "schema" (of the kind that tables in relational databases have). NoSQL databases are generally well suited to big data systems because of their flexibility and the distributed architecture that large unstructured data stores require.
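To show what "no fixed schema" means in practice, the Python sketch below uses a plain list of dictionaries as a stand-in for a document-style store. The records and the find helper are hypothetical and do not correspond to the API of any particular NoSQL database.

```python
# Hypothetical documents: each record carries whatever fields it needs,
# so there is no shared table definition to migrate when a new field appears.
documents = [
    {"_id": 1, "type": "email", "subject": "Q3 report", "attachments": 2},
    {"_id": 2, "type": "post", "text": "Loving the new phone", "likes": 17},
    {"_id": 3, "type": "post", "text": "Traffic again", "hashtags": ["commute"]},
]


def find(collection, **criteria):
    """Return the documents whose fields match every given criterion."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


posts = find(documents, type="post")
print([doc["_id"] for doc in posts])
```

In a relational table, the likes and hashtags fields would each need a column (mostly empty) or a separate table; here the two posts simply differ.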

22. R. Can anyone think of a worse name for a programming language? Yes, "R" is a programming language, and one that works very well for statistical computing. If you don't know R, you're not a data scientist. (And if you don't know R, please don't send me hate mail.) R is one of the most popular languages in data science.

23. Spark (Apache Spark). Apache Spark is a fast in-memory data processing engine that efficiently executes streaming, machine learning, or SQL workloads that require fast iterative access to data sets. Spark is generally much faster than the MapReduce we discussed earlier.

24. Stream processing. Stream processing is designed to act on real-time, streaming data through "continuous" queries. Combined with streaming analytics (that is, the ability to continuously compute mathematical or statistical analytics on the fly within the stream), stream processing solutions can handle very large volumes of data in real time.
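A "continuous" query can be imitated in Python with a generator that emits an updated answer every time a new value arrives. The sliding_average function and the sample readings are assumptions for illustration; a production system would read from a message queue or socket rather than a list.

```python
from collections import deque


def sliding_average(stream, window):
    """Yield the average of the last `window` values after each new arrival."""
    recent = deque(maxlen=window)  # old values fall off automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


# Hypothetical sensor readings arriving one at a time.
readings = iter([10.0, 20.0, 30.0, 40.0])

averages = list(sliding_average(readings, window=2))
print(averages)
```

Unlike the batch approach, the code never holds the full data set: it keeps only the last few values, so it can run indefinitely on an endless stream.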

25. Structured and unstructured data. This is the "Variety" among the 5 Vs of big data. Structured data is essentially anything that can be put into a relational database and organized in such a way that it relates to other data through tables. Unstructured data is everything that cannot be stored directly in a relational database, such as email messages, social media posts, and recorded human speech.


*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.


© 2024 shulou.com SLNews company. All rights reserved.
