How to build a dataset

Updated 2025-01-18. From: SLTechnology News&Howtos (Shulou), Database section.

Shulou (Shulou.com), 05/31 report

This article introduces how to build a dataset. Many people have questions about this in their day-to-day work, so the editor has consulted a range of materials and collected simple, practical methods. Hopefully it answers your questions about how to build a dataset. Read on to work through it with the editor.

01 Where to find it

For common problems, there are many places to start your search.

Just as Google Scholar is for research papers, Google Dataset Search is for datasets. Google's search tools are ubiquitous, and this one is an excellent starting point for finding out what is available on a particular topic. Google also curates its own repository of general public data, called Google Public Data, and Amazon maintains the Registry of Open Data on AWS.

Kaggle.com is an online community dedicated to data science. It hosts a large repository of datasets contributed by the community and by organizations, covering a wide range of topics. The site is also a valuable resource for learning the details of data analysis through its competitions and discussions.

Research institutions often release scientific data for public use. This is especially useful if you need sensitive human data (provided you can be sure that it has been properly anonymized). In Australia, there are institutions such as the Australian Bureau of Statistics and the Commonwealth Scientific and Industrial Research Organisation (CSIRO), and even an online portal called data.gov.au that provides access to government data.

In other parts of the world, well-known institutions include NASA, NOAA, NIST, CDC, WHO, UNICEF, CERN, the Max Planck Institute, CNR, EPA and so on.

Similarly, many countries have central government data repositories, such as data.gov (United States), open.canada.ca, data.govt.nz, data.europa.eu, and data.gov.uk.

Some organizations whose purpose is not primarily scientific also release data repositories once they reach a scale at which they can, or are required to, conduct research in-house. Good examples are the World Bank and the International Monetary Fund (IMF), which have become major sources of open financial and public data.

Where allowed, purchasing data from reputable organizations is an excellent way to ensure accuracy, coverage, and applicable value types and formats.

News sites such as FiveThirtyEight and BuzzFeed publish data from public surveys as well as data collected for key articles. This ranges from important social and political data that may be relevant to public well-being (online censorship, government surveillance, guns, health care, and so on) to lighter material such as sports scores and opinion polls.

Reddit's /r/datasets is a good place to share information. You can browse the interesting things people post, or ask for help with a specific problem. There is even some good meta-information, such as a detailed list of open data portals that someone has published. /r/MachineLearning is also worth a look while you are browsing Reddit.

Sometimes individual enthusiasts do the work for you. The author's personal favorite is Jonathan's Space Home Page, run by an astrophysicist at the Harvard-Smithsonian Center for Astrophysics, which keeps an extensive list of objects launched into space. It is only a side project, which makes it all the more impressive.

Another good source of slightly unusual data is the On-Line Encyclopedia of Integer Sequences (OEIS), a large collection of number sequences together with additional information about them, such as diagrams or the formulas used to generate them. So if you are curious about the Catalan numbers or want to know about the busy beaver problem, OEIS has you covered.
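As a small illustration of the kind of generating formula an OEIS entry records, the Catalan numbers (sequence A000108 in OEIS) follow the closed form C(n) = binomial(2n, n) / (n + 1). The sketch below only shows that formula in Python; the function name is chosen for this example.

```python
from math import comb

def catalan(n):
    """n-th Catalan number via the closed form binomial(2n, n) / (n + 1)."""
    # The division is always exact, so integer division is safe here.
    return comb(2 * n, n) // (n + 1)

# The first few terms, starting at n = 0.
first_terms = [catalan(n) for n in range(8)]
print(first_terms)  # [1, 1, 2, 5, 14, 42, 132, 429]
```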

There are also numerous websites dedicated to serving as central registries for datasets in areas such as open government data and the academic data behind important research publications.

All of this illustrates the point: data is everywhere. We are creating more of it all the time, and many people and organizations are committed to making it useful to everyone. Preferences for particular data sources develop with time and experience, so explore and experiment widely.

02 What are you looking for

Before you start searching, have a clear plan of what you need in order to model the problem you want to solve. For each candidate dataset, consider the following factors:

The values present in the data and their types.

The person or organization that collected the data.

The method used to collect the data (if known).

The time frame within which the data was collected.

Whether this collection alone is enough to solve your problem. If not, how easily can it be merged with other sources?
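Several of the factors above (values, types, time frame) can be checked mechanically before committing to a dataset. The following is a minimal sketch using only the Python standard library. The sample CSV, the column names, and the `profile` and `infer_type` helpers are all made up for illustration, and the date check assumes ISO `YYYY-MM-DD` strings.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample standing in for a downloaded CSV file.
SAMPLE_CSV = """station,recorded_at,temp_c
A1,2024-01-03,21.5
A1,2024-02-10,19.0
B7,2023-11-30,
"""

def _parse_date(value):
    return datetime.strptime(value, "%Y-%m-%d")

def infer_type(values):
    """Guess a column type from its non-empty values."""
    present = [v for v in values if v != ""]
    if not present:
        return "empty"
    for name, parse in (("int", int), ("float", float), ("date", _parse_date)):
        try:
            for v in present:
                parse(v)
            return name
        except ValueError:
            continue
    return "text"

def profile(csv_text, date_column=None):
    """Summarize row count, per-column type and missing values, and time frame."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    report = {"rows": len(rows), "columns": {}}
    for col in (rows[0].keys() if rows else []):
        values = [r[col] for r in rows]
        report["columns"][col] = {
            "type": infer_type(values),
            "missing": sum(1 for v in values if v == ""),
        }
    if date_column and rows:
        # ISO dates sort correctly as plain strings.
        dates = sorted(r[date_column] for r in rows if r[date_column])
        report["time_frame"] = (dates[0], dates[-1])
    return report

summary = profile(SAMPLE_CSV, date_column="recorded_at")
print(summary)
```

A report like this does not say who collected the data or how, but it quickly shows whether the value types and time frame match what the problem needs.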

Prebuilt datasets often need to be modified to suit a new purpose. So even if you can assume that the data is already clean (which you should verify anyway), some transformation may still be required. To ensure the quality of the output, follow the usual data preparation steps from this point on.

Keep in mind that at some point additional or differently formatted information may be required to produce the desired results. A prebuilt dataset is a good starting point, but it should never be exempt from scrutiny: modify or replace an unsuitable dataset, even if that means more work in the short term.
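Checking that a supposedly clean dataset really is clean can start with a few simple tests. The sketch below looks for missing required fields, out-of-range numbers, and exact duplicate rows. The `validate` function, the field names, and the allowed range are assumptions made for this example, not part of any particular library.

```python
def validate(records, required, ranges):
    """Return (row_index, problem) pairs; an empty list means nothing was flagged."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        # Required fields must be present and non-empty.
        for field in required:
            if rec.get(field) in (None, ""):
                problems.append((i, f"missing {field}"))
        # Numeric fields must fall inside their allowed range.
        for field, (low, high) in ranges.items():
            value = rec.get(field)
            if isinstance(value, (int, float)) and not low <= value <= high:
                problems.append((i, f"{field} out of range: {value}"))
        # Exact duplicates of an earlier row are flagged.
        key = tuple(sorted(rec.items()))
        if key in seen:
            problems.append((i, "duplicate row"))
        seen.add(key)
    return problems

# Hypothetical records with one problem of each kind.
records = [
    {"id": "a", "age": 34},
    {"id": "b", "age": 340},
    {"id": "", "age": 29},
    {"id": "a", "age": 34},
]
issues = validate(records, required=["id"], ranges={"age": (0, 120)})
for row, problem in issues:
    print(row, problem)
```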

03 Build a dataset

To create a dataset from scratch, you have to get the raw data from somewhere. These efforts usually fall into three main categories: recording data, collating data, and scraping data.

Disclaimer

Each country has its own laws and regulations on the collection, storage and maintenance of datasets. Some of the methods described in this section may be legal in one jurisdiction but illegal in another. Never take any action to obtain a dataset without first checking that doing so is legal.

Scraping or tracking online content that you do not own can carry severe penalties in some parts of the world, regardless of whether you knew it was illegal or what your purpose was. It is not worth the risk.

Other methods may be legally unclear, such as collecting photos or videos in public places, or the ownership of data that was provided for another purpose.

Even if a dataset comes with a license that permits the use you need, carefully consider how it was collected and what responsibilities come with holding it. The law in your jurisdiction always takes precedence over a license that grants you access to the data.

As a rule of thumb, if you don't create the data yourself, you don't own it (even if you do create it, you may still not own it). Therefore, you cannot collect or use it unless you have explicit permission.

1. Data recording

Data recording is first-hand data collection: you observe phenomena and their attributes and record your own unique data. This can be done with a physical device (such as a sensor or camera) or a digital observation tool (such as a web tracker or crawler).

You can collect data about actions or environmental conditions that occur at a particular location, record images of different objects you want to identify, or record traffic from Web services to predict user behavior.

You can use these methods to create highly targeted datasets on topics that may never have been observed before, but this is the most time-consuming approach. The quality of the data also depends on the equipment or method used to collect it, so some domain expertise is recommended.
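A recording loop can be sketched in a few lines. In the example below, `read_sensor` is a hypothetical stand-in for a real device driver, the sensor ID and column names are invented, and a fixed start time is used instead of the system clock so that the output is reproducible. A real recorder would read the current time and sleep between samples.

```python
import csv
import io
import random
from datetime import datetime, timedelta, timezone

def read_sensor(rng):
    """Hypothetical stand-in for a real sensor driver call."""
    return round(20.0 + rng.uniform(-0.5, 0.5), 2)

def record(n_samples, interval_s, out, seed=0):
    """Write timestamped readings as CSV rows to the file-like object `out`."""
    rng = random.Random(seed)
    writer = csv.writer(out)
    writer.writerow(["timestamp_utc", "sensor_id", "temp_c"])
    # Fixed start time for reproducibility (an assumption of this sketch).
    t = datetime(2025, 1, 1, tzinfo=timezone.utc)
    for _ in range(n_samples):
        writer.writerow([t.isoformat(), "sensor-01", read_sensor(rng)])
        t += timedelta(seconds=interval_s)

buffer = io.StringIO()
record(n_samples=5, interval_s=60, out=buffer)
lines = buffer.getvalue().splitlines()
print("\n".join(lines))
```

Storing the timestamp, the device identifier, and the unit (here in the column name) alongside each value keeps the record usable later, when the collection context is no longer obvious.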

2. Data collation

Data collation is the practice of combining multiple information sources to create new data for analysis. It can involve extracting data from reports, merging data from different online sources, or querying APIs. It brings together data that exists in many places into a useful form.
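One common way to collate is to join records from two sources on a shared key. The sketch below assumes one source was transcribed from a report and the other came back from an API; the `collate` helper, the field names, and the values are all invented for illustration.

```python
def collate(primary, secondary, key):
    """Left-join two lists of dicts on `key`, noting which rows found a match."""
    lookup = {row[key]: row for row in secondary}
    merged = []
    for row in primary:
        extra = lookup.get(row[key])
        combined = dict(row)
        if extra is not None:
            combined.update({k: v for k, v in extra.items() if k != key})
        # Keep track of gaps instead of silently dropping or inventing data.
        combined["_matched"] = extra is not None
        merged.append(combined)
    return merged

# Hypothetical figures transcribed from a report.
reported = [
    {"region": "north", "population": 1200},
    {"region": "south", "population": 800},
    {"region": "east", "population": 450},
]
# Hypothetical rows returned by an API.
api_rows = [
    {"region": "north", "area_km2": 30.0},
    {"region": "south", "area_km2": 16.0},
]

merged = collate(reported, api_rows, key="region")
unmatched = [r["region"] for r in merged if not r["_matched"]]
print(unmatched)  # ['east']
```

Recording which rows failed to match makes it easier to judge afterwards whether the combined dataset covers enough of the problem.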

In some cases, collating data is almost as time-consuming as recording or generating your own, but it makes it possible to assemble data about phenomena that occur in hard-to-reach places, such as overseas or inside private organizations.

A company that does not share its original dataset for a problem may still publish several papers that together contain all of the data. Likewise, a site that does not let you download a record of every user who has done Y may still let you ask, as many times as you like, whether user X has done Y.

The quality of collated data depends on how much care you take when merging sources. Some collation errors can endanger the entire project, such as merging sources that use different units of measurement, or simple transcription errors.
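The unit problem is easy to guard against by converting every value to one agreed unit at the moment it is merged, and by refusing values whose unit is unknown. The sketch below does this for lengths; the function names and sample records are hypothetical, while the conversion factors are the standard definitions (1 ft = 0.3048 m, 1 mi = 1609.344 m).

```python
# Conversion factors to meters.
TO_METERS = {"m": 1.0, "km": 1000.0, "ft": 0.3048, "mi": 1609.344}

def to_meters(value, unit):
    """Convert a length to meters, failing loudly on an unrecognized unit."""
    try:
        return value * TO_METERS[unit]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None

def merge_distances(*sources):
    """Combine (name, value, unit) records from several sources, all in meters."""
    return {
        name: to_meters(value, unit)
        for source in sources
        for name, value, unit in source
    }

# Hypothetical sources that report in different units.
source_a = [("route-1", 2.0, "km")]
source_b = [("route-2", 1.0, "mi")]

distances = merge_distances(source_a, source_b)
print(distances)
```

Raising an error on an unknown unit is a deliberate choice here: a merge that stops is cheaper than a dataset that quietly mixes scales.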

3. Data scraping

Data scraping is a way to collect a large amount of information that already exists but is perhaps not being observed in a way that produces structured data suitable for use. It used to be the main form of social media analysis (especially by third parties), but many platforms now limit people's ability to access data or to use data obtained from their services.

Scraping is performed with software that loads, views, and downloads large amounts of content, often indiscriminately, from web targets; the content can then be adapted for use. Data scraping should be purposeful.
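The "adapt for use" step usually means turning downloaded markup into structured records. The sketch below does that for an HTML table using Python's standard `html.parser`, and first checks a robots.txt policy with `urllib.robotparser`. No network access takes place: the robots.txt text, the HTML, the URLs, and the `dataset-bot` user agent are all hypothetical stand-ins for content you have permission to fetch. Passing a robots.txt check is not proof that collection is legal.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the target site.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

# Hypothetical page content, as if already downloaded with permission.
PAGE_HTML = """
<table>
<tr><th>name</th><th>launch_year</th></tr>
<tr><td>Probe A</td><td>1999</td></tr>
<tr><td>Probe B</td><td>2004</td></tr>
</table>
"""

class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

# Check the site's stated crawling policy before touching a URL.
robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())
allowed = robots.can_fetch("dataset-bot", "https://example.com/public/launches")
blocked = robots.can_fetch("dataset-bot", "https://example.com/private/users")

records = []
if allowed:
    extractor = TableExtractor()
    extractor.feed(PAGE_HTML)
    header, *body = extractor.rows
    records = [dict(zip(header, row)) for row in body]
print(records)
```

Extracting only the table that is needed, rather than keeping every downloaded page, is one way to keep the collection purposeful.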

That concludes this look at how to build a dataset. Hopefully it has answered your questions. Combining theory with practice is the best way to learn, so go and try it. If you want to learn more on related topics, keep following the website; the editor will continue to bring you more practical articles.
