
A brief introduction to some basic concepts of big data


I. Big data

1. What is big data?

Big data (also called megadata or massive data) refers to data sets so large that they cannot be captured, managed, processed, and organized into a form that humans can interpret within a reasonable time.

2. The characteristics of big data

① Volume: the amount of data is huge; collection, storage, and computation all operate at very large scale. Big data is usually measured starting at the petabyte level: 1 PB is about 1,000 TB, 1 EB about 1 million TB, and 1 ZB about 1 billion TB (a quick unit sketch follows this list).

② Variety: many types and sources of data, including structured, semi-structured, and unstructured data such as web logs, audio, video, images, and geolocation information. This diversity places higher demands on data-processing capability.


③ Value: the value density of the data is relatively low; finding value in it is like panning for gold in sand, yet that value is precious. With the wide adoption of the Internet and the Internet of Things, information is sensed everywhere and data is abundant, but its value density is low. How to combine business logic with powerful machine algorithms to mine the value in the data is the most important problem to solve in the big data era.

④ Velocity: data grows fast, must be processed fast, and timeliness requirements are high. For example, a search engine must let users find news published only minutes earlier, and personalized recommendation algorithms aim to recommend in real time as far as possible. This is a notable feature that distinguishes big data from traditional data mining.

⑤ Veracity: the accuracy and trustworthiness of the data, that is, data quality.
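As a quick illustration of the Volume scales above, here is a minimal Python sketch; the decimal unit definitions match the list, and the function name is ours:

```python
# Decimal storage units, matching the list above (1 PB = 1,000 TB).
UNITS = {"TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}

def to_bytes(amount: float, unit: str) -> float:
    """Convert an amount expressed in the given unit to bytes."""
    return amount * UNITS[unit]

# 1 EB expressed in TB is 1,000,000 TB, as stated under Volume.
print(to_bytes(1, "EB") / UNITS["TB"])  # -> 1000000.0
```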

II. Data warehouse

1. What is a data warehouse?

In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis and is regarded as a core component of business intelligence. A DW is a central repository of integrated data from one or more disparate sources. It stores current and historical data in one place and is used to create analytical reports for workers throughout the enterprise.

2. The characteristics of the two operation modes of a data warehouse

① Online analytical processing (OLAP) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the measure of effectiveness. OLAP applications are widely used in data mining. An OLAP database stores aggregated historical data in a multidimensional schema, usually a star schema. OLAP systems typically have a data latency of a few hours, whereas data marts expect latency closer to one day. The OLAP approach is used to analyze multidimensional data from multiple sources and perspectives. The three basic OLAP operations are roll-up (consolidation), drill-down, and slicing and dicing (see the sketch after this list).

② Online transaction processing (OLTP) is characterized by a large number of short online transactions (INSERT, UPDATE, DELETE). OLTP systems emphasize very fast query processing and maintaining data integrity in multi-access environments. For OLTP systems, effectiveness is measured in transactions per second. An OLTP database contains detailed, current data. The schema used to store transactional databases is the entity model (usually 3NF), and normalization is the norm for data modeling in such systems.
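To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table, columns, and values are all made up, and it only illustrates the two access patterns, not a real warehouse design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")

# OLTP-style access: many short transactions that touch individual rows.
with conn:  # the context manager commits (or rolls back) the transaction
    cur.execute("INSERT INTO sales VALUES ('north', 'widget', 10.0)")
    cur.execute("INSERT INTO sales VALUES ('south', 'widget', 25.0)")
    cur.execute("INSERT INTO sales VALUES ('north', 'gadget', 40.0)")

# OLAP-style access: an aggregate query over many rows.
# GROUP BY region is a simple roll-up; the WHERE clause is a slice.
for row in cur.execute(
    "SELECT region, SUM(amount) FROM sales "
    "WHERE product = 'widget' GROUP BY region"
):
    print(row)  # ('north', 10.0) then ('south', 25.0)
```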

III. The difference between ETL and DM

ETL (Extraction-Transformation-Loading) is the process used to move data from a DB into a DW. It "extracts" the state of the DB at a point in time, "transforms" the data format to fit the DW's storage model, and then "loads" it into the DW. It should be emphasized that the DB model is an ER model following normalized design principles, while the DW data model is a snowflake or star schema following a subject-oriented, problem-oriented design; because the model structures of the DB and the DW differ, a transformation step is needed.
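Here is a minimal, illustrative sketch of the three ETL steps in Python; the source table, the transformation rule, and the output file name are all hypothetical:

```python
import csv
import sqlite3

# Extract: take a point-in-time snapshot of rows from the operational DB.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1050), (2, 2500)])
rows = src.execute("SELECT id, amount_cents FROM orders").fetchall()

# Transform: reshape to the warehouse's storage model (here, cents -> dollars).
transformed = [(order_id, cents / 100.0) for order_id, cents in rows]

# Load: write into the warehouse, modeled here as a CSV fact table.
with open("fact_orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "amount_dollars"])
    writer.writerows(transformed)
```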

DM (Data Mining): this mining is not simple statistics; based on probability theory and other statistical principles, it analyzes the large volumes of data in the DW to discover regularities that we cannot see intuitively.
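One of the simplest mining tasks is finding items that frequently occur together across transactions, a building block of association-rule mining. A toy Python sketch, with made-up data and an arbitrary support threshold:

```python
from collections import Counter
from itertools import combinations

# Toy baskets; in practice these would be read from the DW.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "butter"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Report pairs whose co-occurrence count meets the threshold.
min_support = 3
for pair, count in pair_counts.items():
    if count >= min_support:
        print(pair, count)  # ('bread', 'milk') 3
```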

IV. Hadoop

1. What is Hadoop?

On Wikipedia, Hadoop is defined as a software framework, written in Java, that facilitates distributed storage and processing of large data sets. Put simply, it is open-source software in the computing field: any developer can read its source code and compile it. Its emergence has made storing and processing big data much faster and cheaper.
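To make the idea of distributed computing concrete, here is the classic word-count example written for Hadoop Streaming, which lets a MapReduce job use any executables that read stdin and write stdout; a minimal sketch (the file names are ours):

```python
# mapper.py -- emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sum the counts per word; Hadoop delivers input sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The job would then be submitted with the hadoop-streaming JAR that ships with Hadoop, passing these two scripts as the -mapper and -reducer options; the exact JAR path depends on the installation.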

2. What are the characteristics of Hadoop?

① Efficient: distributed computing runs on large clusters of servers built on the standard x86 architecture; each module is a discrete processing unit, parallel computing techniques are used, and the load is balanced across the compute nodes of the cluster. When one node's load becomes too high, work can be intelligently shifted to other nodes, and the cluster supports smooth linear scaling by adding nodes. Distributed storage uses the local disks of the x86 servers through a distributed file system, with each piece of data stored on at least 3 nodes to ensure both performance and reliability (see the capacity sketch after this list).

② Reliable: automatically maintains multiple copies of the data and automatically redeploys computing tasks when a task fails.

③ Scalable: reliably stores and processes petabytes of data.

④ Economical: data can be distributed and processed across server farms of ordinary commodity machines, which in total can reach thousands of nodes.
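As a back-of-the-envelope illustration of the 3-copy replication mentioned under "Efficient", here is a tiny Python sketch; the node count and disk size are made-up figures:

```python
# Hypothetical cluster: 100 nodes with 10 TB of raw disk each.
nodes = 100
raw_tb_per_node = 10
replication_factor = 3  # each piece of data is stored on at least 3 nodes

raw_capacity_tb = nodes * raw_tb_per_node
usable_capacity_tb = raw_capacity_tb / replication_factor

print(raw_capacity_tb)     # 1000 TB of raw disk
print(usable_capacity_tb)  # ~333 TB of logical, replicated storage
```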
