
What are the common questions in big data data modeling?

Many newcomers are not very clear about the common questions in big data data modeling. To help with this, the editor explains them in detail below; anyone who needs this is welcome to read on, and I hope you gain something from it.

1. In a big data environment, can any modeling techniques be used to improve query performance?

Improving query performance depends on the tools you use. The following guidelines can help:

1) Ensure that the best storage is selected for the end user's queries. For example, if you are running many short queries, consider HBase. For long-running analytical queries, Kudu may be a better fit. Ideally, review the queries you need to run and determine the appropriate storage and file format for those use cases (see the sketch after this list).

2) Use the right query engine for the workload. For example, Hive on LLAP is great for long-running queries, dashboards, or the standard reports that traditionally run in an enterprise data warehouse. Impala, on the other hand, is very well suited to ad hoc queries, even over data beyond 100 TB. When configuring the query engine, also ensure that partitions are set up, statistics are collected, joins are properly designed, and query performance reports are reviewed and optimized accordingly.

3) Make sure you choose the right data-retrieval tool for each use case: for example, tools such as Phoenix or the HBase API for keyed operational queries, and Impala or Hive on LLAP for analytical queries over the same data.
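
As a rough illustration of the first guideline, the sketch below pairs a Kudu table (for short, keyed queries via Impala) with a Parquet table on HDFS (for long-running analytical scans). This is only a sketch; the table and column names are hypothetical, not from the article.

    -- Kudu table for fast keyed lookups (short queries), created via Impala.
    CREATE TABLE call_lookup (
      phone_number STRING,
      call_time    TIMESTAMP,
      called_party STRING,
      duration_sec INT,
      PRIMARY KEY (phone_number, call_time)
    )
    PARTITION BY HASH (phone_number) PARTITIONS 16
    STORED AS KUDU;

    -- Parquet table on HDFS for long-running analytical scans.
    CREATE TABLE call_history (
      phone_number STRING,
      called_party STRING,
      duration_sec INT,
      city         STRING
    )
    PARTITIONED BY (call_date DATE)
    STORED AS PARQUET;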

2. Our data scientists like denormalized tables, or "feature files". Can we retain this concept when modeling a big data system?

Absolutely. This is a core feature of a modern data warehouse, called an Analytical Base Table (ABT). Imagine we are a major telecommunications company with tables for service usage, incoming calls, network elements, and so on. To build a customer churn model across all these tables, we create an ABT for customer data and build the data science model on top of it. We can segment by customer, by cell tower, by revenue model, and so on. An ABT is like a data mart built on top of the data warehouse, whether or not that warehouse is a star schema, so tools such as SAS, R, or anything else that needs a flattened structure can run without reorganizing the data, while we keep a more traditional fact-and-dimension data model without abandoning other use cases.
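
As a minimal sketch of the idea, the statement below flattens several source tables into a single ABT that a tool like SAS or R could consume directly. All table and column names here are assumptions for illustration only.

    -- Hypothetical ABT for churn modeling, materialized as a flat Parquet table.
    CREATE TABLE customer_churn_abt STORED AS PARQUET AS
    SELECT c.customer_id,
           c.tenure_months,
           u.total_data_mb,
           cl.call_count,
           cl.avg_call_sec
    FROM customers c
    LEFT JOIN (SELECT customer_id, SUM(data_mb) AS total_data_mb
               FROM service_usage GROUP BY customer_id) u
           ON u.customer_id = c.customer_id
    LEFT JOIN (SELECT customer_id,
                      COUNT(*)          AS call_count,
                      AVG(duration_sec) AS avg_call_sec
               FROM calls GROUP BY customer_id) cl
           ON cl.customer_id = c.customer_id;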

3. Are there industry data models for the Internet of Things and big data warehouses?

Pre-built, predefined industry-specific data models used to be so important that many major data warehouse vendors provided them as part of their solutions. Although we can still see some of these models today, the world has changed a lot since they were created in the 1990s and 2000s. The changing nature of the data we use today forces us to question rigid, predefined structures. A good example can be found in this blog post (https://qntm.org/support), which describes the consequences of the U.S. Supreme Court's marriage rulings for data models: a model built on heteronormative assumptions decades ago had to change not only to accommodate same-sex marriage, but also to handle the larger problems of divorce, remarriage, and even a change of gender after marriage for one or both partners. Doing that with rigid traditional structures can be a real challenge. So the answer to modeling industry standards in the big data world is that we do not model the entire industry up front; we model for end-user requirements, so that multiple models, changing anytime and anywhere, can easily be derived from the data, and we allow multiple structures over the same data to accommodate each use case, rather than sticking to a one-size-fits-all approach.

For example, in a telecommunications company, call data may be stored in three or four different formats. The first lets a monitoring agency see who is calling whom, which can be stored as a graph. The second lets you query HBase or Kudu storage by phone number to retrieve the last 10 to 30 calls, a very discrete query. HDFS can also be used for long-term analysis, such as total daily traffic for a given city or region. Ultimately it is all the same data, stored in three ways for three use cases to ensure the best results. The industry data model itself is not obsolete, but it needs to be complemented by more flexible data modeling at the use-case level. Remember that in big data we can define structure after data ingestion, and define it as needed, which lets us benefit from these more modern methods.
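
To make the second and third patterns concrete, here are two illustrative queries; the table and column names (call_lookup, call_history, city, and so on) are hypothetical.

    -- Discrete keyed lookup (HBase/Kudu pattern): the last 30 calls for one subscriber.
    SELECT call_time, called_party, duration_sec
    FROM call_lookup
    WHERE phone_number = '15550001234'
    ORDER BY call_time DESC
    LIMIT 30;

    -- Long-term analysis (HDFS pattern): total daily traffic for one city.
    SELECT call_date, SUM(duration_sec) AS total_call_seconds
    FROM call_history
    WHERE city = 'Los Angeles'
    GROUP BY call_date;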

4. When modeling relational structures, we usually rely on indexes to speed up searches. In big data modeling, do we still need to worry about indexing mechanisms?

Generally, no. It all depends on the file format and the data. For example, when using Hadoop HDFS, the storage technology makes searches fast through massive parallelism, so you do not have, or need, traditional indexes. ORC does have the concept of an index, and it also uses Bloom filters. For example, in the telecom data model we have a primary key defined as the subscriber's phone number, and in ORC there are columns such as customer type, customer city, customer address, and so on. We can create a Bloom filter on any of these columns, and when you select records from the table the filter kicks in, so only the ORC files that may contain the search value are read (for example, where the city is Los Angeles). Remember that in a big data system we distribute the data across hundreds of partitioned files.
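
A minimal sketch of the idea in Hive, assuming hypothetical table and column names; 'orc.bloom.filter.columns' and 'orc.bloom.filter.fpp' are standard ORC table properties.

    -- ORC table with Bloom filters on the columns used for selective lookups.
    CREATE TABLE subscribers (
      phone_number  STRING,
      customer_type STRING,
      customer_city STRING,
      customer_addr STRING
    )
    STORED AS ORC
    TBLPROPERTIES (
      'orc.bloom.filter.columns' = 'phone_number,customer_city',
      'orc.bloom.filter.fpp'     = '0.05'
    );

    -- A selective predicate lets readers skip files and stripes whose Bloom
    -- filter rules out the value:
    SELECT * FROM subscribers WHERE customer_city = 'Los Angeles';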

5. What kind of partitioning (or storage partitioning) is required to join fact and dimension tables for reporting?

Partitioning can be useful, depending on the storage you use. In a big data environment, partitions are very helpful in reducing the number of files that must be scanned to return search results (see the note on Bloom filters above). For example, we usually partition fact tables by date, or even by hour for very large datasets. Dimensions can be partitioned by use case; for example, if our users regularly look for results in their own area, we can partition by geography. However, you are not limited to one partitioning scheme, because you can also do logical partitioning, which is very helpful: the same data will be used by different users with different motivations, so we can maintain multiple partitioned layouts to serve different business needs.
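
A small sketch of the two partitioning patterns mentioned above, with hypothetical table and column names:

    -- Fact table partitioned by date (an hour column could be added for very
    -- large daily volumes).
    CREATE TABLE fact_calls (
      customer_sk   BIGINT,
      phone_number  STRING,
      duration_sec  INT,
      cell_tower_id STRING
    )
    PARTITIONED BY (call_date DATE)
    STORED AS PARQUET;

    -- Dimension partitioned by geography for users who filter by region.
    CREATE TABLE dim_customer (
      customer_sk   BIGINT,
      phone_number  STRING,
      customer_name STRING
    )
    PARTITIONED BY (region STRING)
    STORED AS PARQUET;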

6. When modeling big data, does a surrogate key give better join performance than a natural key?

Yes, surrogate keys can definitely help. In general, we find that joins on surrogate keys are noticeably faster, especially when the natural keys are string columns, since integers are cheaper to compare during a join. There are other advantages as well: surrogate keys insulate you from source-system changes. For example, if you move from an in-house salesperson management tool to a cloud-based tool, you do not have to map the old natural keys to the new ones everywhere; the surrogate keys can remain the same, which helps keep the data feeds consistent, and the warehouse's final reports do not have to change.
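
As an illustration, assuming the hypothetical fact_calls and dim_customer tables sketched earlier, where customer_sk is an integer surrogate key and phone_number is the string natural key:

    -- Join on the integer surrogate key rather than the string natural key.
    SELECT d.customer_name,
           SUM(f.duration_sec) AS total_call_seconds
    FROM fact_calls f
    JOIN dim_customer d
      ON f.customer_sk = d.customer_sk          -- integer comparison
      -- ON f.phone_number = d.phone_number     -- slower string comparison
    GROUP BY d.customer_name;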

7. Can we join a large fact table of nearly a billion records with multiple dimension tables, some of which have more than a million records each?

Yes, this is where a modern data warehouse really shines, especially with the latest versions of the Cloudera platform, where these kinds of joins can be done very quickly. Overall performance depends on the data and the configuration, so we recommend using tools such as Cloudera Workload XM to help, or consulting experts to design a data warehouse for workloads of this size.
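
One concrete, low-effort step that helps the planner with joins of this size is collecting table statistics, for example in Impala; the table names below are hypothetical.

    -- Gather table and column statistics so the planner can pick a good
    -- join order and strategy (broadcast vs. partitioned join).
    COMPUTE STATS fact_calls;
    COMPUTE STATS dim_customer;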

8. The data model changes over time. I know how we manage schema versioning in a relational database in a production system. Is version control different when dealing with big data modeling?

Data model version control is not so different from version control in a traditional environment. For example, in Parquet and ORC it is easy to add a new column, but not easy to delete one. Changing a data type may require a function to convert the stored data (such as strings to integers). In general, for major changes you may have to recreate the dimension or fact table. However, just as in a relational system, there are techniques that make this easier: for instance, instead of changing a column's data type, you can simply add a new column with the new data type. Keep in mind that in the big data world, adding a column only adds the column definition to the metadata; data is stored only when a value is actually set in a row.
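
A small sketch of that technique, with hypothetical table and column names:

    -- Adding a column is a metadata-only change.
    ALTER TABLE dim_customer ADD COLUMNS (loyalty_tier STRING);

    -- Instead of changing an existing column's type in place, add a new
    -- column with the new type and backfill it during loads:
    ALTER TABLE dim_customer ADD COLUMNS (signup_date_ts TIMESTAMP);
    -- ...populated with, e.g., CAST(signup_date_str AS TIMESTAMP).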

9. Is a data warehouse based on big data basically the same concept as Data Vault 2.0?

Data Vault 2.0 is not a big-data-based data warehouse, nor is it a substitute for normalization and dimensional modeling. Data Vault 2.0 is a new way to define the staging area, but you still need a traditional model for the data warehouse itself, because otherwise you cannot use your favorite SQL-based BI and analytics tools to report against it; you need a data model to understand the data.

10. Is the traditional data warehouse dying?

The traditional data warehouse is not dying; what is happening is that data warehousing as a discipline is evolving. It is adapting to change. If you recall, building data warehouses top-down in the past often led to a high failure rate, which by some estimates reached 70 to 80 percent. Imagine spending two to three years developing a traditional data warehouse with a full development effort, only to find that it failed. This means we need to develop data warehouses in a more agile way to keep pace with the changing needs of business users: leaner, faster, and ready to adapt. Driven by project requirements, working bottom-up with rapid development and deployment, rinse and repeat, we make the data warehouse agile, adaptable, and ready in days or weeks, compared with months or years in the past.
