This article walks through "What are the machine learning questions for data scientists?". Many people run into these situations in real projects, so the editor will lead you through how to handle them. I hope you read it carefully and get something out of it!
Introduction
Machine learning is one of the most popular skills these days. We have organized various skills tests so that data scientists can examine their own key skills. These tests include machine learning, deep learning, time series problems and probability.
Total score
Here is the score distribution, which will help you evaluate your own result.
More than 210 people took the skills test, and the highest score was 36. Here are some statistics about the scores.
Average score: 19.36
Median score: 21
Mode score: 27
Questions and Solutions
Question context: Feature F1 represents the grade of a college student and can take the values A, B, C, D, E and F.
1) In this case, which of the following statements is true?
A) Feature F1 is an example of a nominal variable. B) Feature F1 is an example of an ordinal variable. C) It does not belong to either of the above categories. D) Both of the above
Solution: (B)
Ordinal variables are variables whose categories have some order. For example, grade A should be considered a higher grade than grade B.
2) Which of the following is an example of a deterministic algorithm?
A) PCA
B) K-Means
C) none of the above
Solution: (A)
A deterministic algorithm is one whose output does not change across different runs. If we run it again, PCA gives the same result, but K-Means does not.
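As an illustration (a small sketch assuming scikit-learn and NumPy are available; the data is synthetic), fitting PCA twice yields identical components, while two K-Means runs with different random initializations may not:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(200, 5)  # synthetic data, purely for illustration

# PCA is deterministic: two independent fits give the same components
pca_1 = PCA(n_components=2).fit(X)
pca_2 = PCA(n_components=2).fit(X)
print(np.allclose(pca_1.components_, pca_2.components_))  # True

# K-Means depends on the (random) initialization of the centroids
km_1 = KMeans(n_clusters=3, n_init=1, random_state=1).fit(X)
km_2 = KMeans(n_clusters=3, n_init=1, random_state=2).fit(X)
print(np.allclose(np.sort(km_1.cluster_centers_, axis=0),
                  np.sort(km_2.cluster_centers_, axis=0)))  # usually False
```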
3) The Pearson correlation between two variables is zero, but their values can still be related to each other.
A) True
B) False
Solution: (A)
For example, Y = X². Note that they are not only related, one variable is a function of the other, and yet the Pearson correlation between them is zero.
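A quick numerical check (a sketch using NumPy and SciPy) shows this: for X symmetric around zero and Y = X², the Pearson correlation is essentially zero even though Y is fully determined by X:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.linspace(-1, 1, 101)   # values symmetric around zero
y = x ** 2                    # y is a deterministic function of x

r, _ = pearsonr(x, y)
print(round(r, 6))            # ~0.0: no linear relationship, yet full dependence
```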
4) Which of the following statements is true for gradient descent (GD) or stochastic gradient descent (SGD)?
In GD and SGD, you update a set of parameters iteratively to minimize the error function.
In SGD, you have to traverse all the samples in the training set to update the parameters once in each iteration.
In GD, you can use the entire data or a subset of training data to update parameters in each iteration.
A) 1 only
B) 2 only
C) 3 only
D) 1 and 2
E) 2 and 3
F) 1, 2 and 3
Solution: (A)
In SGD, each iteration typically uses a single random training sample (or a small random batch), whereas each GD iteration uses all the training observations.
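As a rough illustration (a minimal NumPy sketch on synthetic linear-regression data; the learning rate and iteration counts are arbitrary), the only real difference between the two update rules is how many samples feed each gradient step:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.randn(100)

lr = 0.1

# Batch gradient descent: every iteration uses ALL training observations
w = np.zeros(2)
for _ in range(100):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= lr * grad

# Stochastic gradient descent: every iteration uses ONE random sample
w_sgd = np.zeros(2)
for _ in range(1000):
    i = rng.randint(len(y))
    grad = 2 * X[i] * (X[i] @ w_sgd - y[i])
    w_sgd -= lr * grad

print(w, w_sgd)   # both approach the true coefficients [2, -1]
```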
5) Which of the following hyperparameters, when increased, may cause a random forest to overfit the data?
Number of trees
Tree depth
Learning rate
A) 1 only
B) 2 only
C) 3 only
D) 1 and 2
E) 2 and 3
F) 1, 2 and 3
Solution: (B)
Usually, increasing the depth of the trees leads to overfitting. Learning rate is not a hyperparameter of a random forest. Increasing the number of trees leads to underfitting.
6) Imagine that you are using "Analytics Vidhya" and you want to develop a machine learning algorithm that can predict the number of times an article is viewed.
Your analysis is based on features such as the author's name, the number of articles the same author has written on Analytics Vidhya in the past, and other features. In this case, which of the following evaluation indicators would you choose?
Mean squared error
Accuracy
F1 score
A) 1 only
B) 2 only
C) 3 only
D) 1 and 3
E) 2 and 3
F) 1 and 2
Solution: (A)
The number of views of an article is a continuous target variable, so this is a regression problem. Therefore, mean squared error is the appropriate evaluation metric.
7) Three images (1, 2, 3) are given below. Which of the following options is true for these images?
(The three activation function plots for images 1-3 are not reproduced here.)
A) 1 is tanh, 2 is ReLU, 3 is the SIGMOID activation function.
B) 1 is SIGMOID, 2 is ReLU, 3 is the tanh activation function.
C) 1 is ReLU, 2 is tanh, 3 is the SIGMOID activation function.
D) 1 is tanh, 2 is SIGMOID, 3 is the ReLU activation function.
Solution: (D)
The range of the SIGMOID function is [0, 1].
The range of the tanh function is [-1, 1].
The range of the ReLU function is [0, infinity).
Therefore, option D is the correct answer.
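The ranges can be checked directly with a small NumPy sketch of the three functions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # range (0, 1)

def tanh(z):
    return np.tanh(z)                 # range (-1, 1)

def relu(z):
    return np.maximum(0, z)           # range [0, +infinity)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))   # approx [0.0067 0.5    0.9933]
print(tanh(z))      # approx [-0.9999  0.      0.9999]
print(relu(z))      # [0. 0. 5.]
```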
8) The following are the eight actual values of the target variable in the training file. What is the entropy of the target variable?
A) -(5/8 log(5/8) + 3/8 log(3/8))
B) 5/8 log(5/8) + 3/8 log(3/8)
C) 3/8 log(5/8) + 5/8 log(3/8)
D) 5/8 log(3/8) - 3/8 log(5/8)
Solution: (A)
The formula for entropy is -Σ p(i) log(p(i)), which here gives -(5/8 log(5/8) + 3/8 log(3/8)).
So the answer is A.
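As a worked check (a sketch that assumes the eight values split into 5 of one class and 3 of the other, which is what option A implies), the entropy can be computed directly:

```python
import math

# assumed class counts for illustration: 5 of one class, 3 of the other
counts = [5, 3]
n = sum(counts)

entropy = -sum((c / n) * math.log2(c / n) for c in counts)
print(round(entropy, 4))   # 0.9544 bits, i.e. -(5/8*log(5/8) + 3/8*log(3/8))
```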
9) Suppose you are working with categorical features and have not yet looked at the distribution of the categorical variable in the test data. You want to apply one-hot encoding (OHE) to the categorical features. What challenges might you face if you apply OHE to a categorical variable of the training dataset?
A) All categories of the categorical variable are not present in the test dataset.
B) The frequency distribution of the categories differs between the training set and the test set.
C) The training set and test set always have the same distribution.
D) Both A and B
E) None of these
Solution: (D)
Both are correct. OHE cannot encode categories that appear in the test set but not in the training set, which is one of the main challenges of applying OHE. The challenge in option B also exists: if the frequency distributions differ between training and test, you need to be more careful when applying OHE.
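A short pandas sketch (with made-up data) shows both issues: a category that appears only in the test set gets no column from the training encoding, and the category frequencies can differ between the two sets:

```python
import pandas as pd

train = pd.DataFrame({"city": ["A", "A", "B", "B", "B"]})
test = pd.DataFrame({"city": ["A", "B", "C"]})   # "C" was never seen in training

train_ohe = pd.get_dummies(train["city"])
test_ohe = pd.get_dummies(test["city"])

# Columns learned from training do not cover the unseen category "C"
print(list(train_ohe.columns))   # ['A', 'B']
print(list(test_ohe.columns))    # ['A', 'B', 'C']

# Re-aligning the test set to the training columns silently drops "C"
test_aligned = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(test_aligned)
```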
10) The skip-gram model is one of the best models for learning word embeddings in the Word2vec algorithm. Which model below describes the skip-gram model?
A) A
B) B
C) A and B
D) None of these
Solution: (B)
Two models (model1 and model2) are used in the Word2vec algorithm. Model1 represents the CBOW model, while Model2 represents the Skip gram model.
11) Suppose you are using activation function X in the hidden layers of a neural network. At a particular neuron, for a given input, you get the output "-0.0001". Which of the following activation functions could X be?
A) ReLU
B) tanh
C) SIGMOID
D) None of these
Solution: (B)
The function is tanh, because its output range is (-1, 1) and, of the options, only tanh can produce a small negative value such as -0.0001.
12) The logarithmic loss (log loss) evaluation metric can have negative values.
A) True B) False
Solution: (B)
Logarithmic loss cannot be negative.
13) Which of the following statements is true about Type 1 and Type 2 errors?
Type 1 is known as a false positive and Type 2 is known as a false negative.
Type 1 is known as a false negative and Type 2 is known as a false positive.
Type 1 errors occur when we reject a null hypothesis that is actually true.
A) 1 only
B) 2 only
C) 3 only
D) 1 and 2
E) 1 and 3
F) 2 and 3
Solution: (E)
In statistical hypothesis testing, a Type I error is the incorrect rejection of a true null hypothesis (a "false alarm"), while a Type II error is the failure to reject a false null hypothesis (a "miss").
14) Which of the following are important steps in preprocessing text for NLP-based projects?
Stemming
Stop-word removal
Object standardization
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) 1, 2 and 3
Solution: (D)
Stemming is a basic, rule-based process of removing suffixes ("ing", "ly", "es", "s", etc.) from words.
Stop words are words that carry little information about the data context, such as is / am / are.
Object standardization is also a good way to preprocess text.
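A minimal sketch of the first two steps, assuming NLTK is installed (its PorterStemmer needs no extra downloads); the stop-word list here is a tiny hand-written one, and object standardization is domain-specific, so it is omitted:

```python
from nltk.stem import PorterStemmer

stop_words = {"is", "am", "are", "the", "a", "an"}   # tiny illustrative stop-word list
stemmer = PorterStemmer()

text = "The models are learning quickly and generalizing nicely"
tokens = text.lower().split()

no_stops = [t for t in tokens if t not in stop_words]   # stop-word removal
stems = [stemmer.stem(t) for t in no_stops]             # stemming: suffixes like "ing", "ly", "s" are stripped
print(stems)   # e.g. ['model', 'learn', 'quickli', ...]
```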
15) Suppose you want to project high-dimensional data into a lower-dimensional space. The two most well-known dimensionality reduction algorithms here are PCA and t-SNE. Suppose you apply each algorithm to data "X" and obtain the datasets "X_projected_PCA" and "X_projected_tSNE". Which of the following statements is true for "X_projected_PCA" and "X_projected_tSNE"?
A) X_projected_PCA will have an interpretation in the nearest-neighbor space.
B) X_projected_tSNE will have an interpretation in the nearest-neighbor space.
C) Both will have an interpretation in the nearest-neighbor space.
D) Neither will have an interpretation in the nearest-neighbor space.
Solution: (B)
The t-SNE algorithm considers nearest-neighbor points when reducing the dimensionality of the data, so after applying t-SNE the reduced dimensions retain an interpretation in the nearest-neighbor space. This is not the case for PCA.
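Both projections can be produced with scikit-learn (a sketch on synthetic data; the variable names simply mirror the question):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.rand(100, 10)   # high-dimensional synthetic data "X"

# PCA: linear projection onto directions of maximum variance (global structure)
X_projected_PCA = PCA(n_components=2).fit_transform(X)

# t-SNE: non-linear embedding that tries to preserve nearest-neighbor structure
X_projected_tSNE = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X)

print(X_projected_PCA.shape, X_projected_tSNE.shape)   # (100, 2) (100, 2)
```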
Question: 16-17
Below are three scatter plots of two features.
16) In the figure above, which of the following are examples of multi-collinear features?
A) Features in image 1
B) Features in image 2
C) Features in image 3
D) Features in images 1 and 2
E) Features in images 2 and 3
F) Features in images 3 and 1
Solution: (D)
In image 1 the features have a high positive correlation, while in image 2 they have a high negative correlation, so in both images the feature pairs are examples of multi-collinear features.
17) In the previous question, suppose you have identified multi-collinear features. Which of the following actions would you perform next?
Remove both collinear variables.
Remove only one of the two collinear variables.
Deleting related variables may result in loss of information. In order to retain these variables, we can use penalty regression models, such as ridge regression or lasso regression.
A) 1 only
B) 2 only
C) 3 only
D) 1 or 3
E) 2 or 3
Solution: (E)
You cannot remove both features, because deleting both would lose all their information, so you should remove only one of them, or use a regularization method such as L1 or L2.
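A small scikit-learn sketch of the third option, using synthetic data in which two features are nearly collinear (an assumption made purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.RandomState(0)
x1 = rng.randn(200)
x2 = x1 + 0.01 * rng.randn(200)          # x2 is almost perfectly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.randn(200)

# Plain least squares: coefficients become unstable under collinearity
print(LinearRegression().fit(X, y).coef_)

# Penalized models keep both variables but shrink and stabilize the coefficients
print(Ridge(alpha=1.0).fit(X, y).coef_)
print(Lasso(alpha=0.1).fit(X, y).coef_)
```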
18) Adding an unimportant feature to a linear regression model may result in _.
Increase in R-squared
Decrease in R-squared
A) Only 1 is correct
B) Only 2 is correct
C) 1 or 2
D) None of these
Solution: (A)
After adding a feature to the feature space, R-squared always increases on the training data, whether the feature is important or not.
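This is easy to check empirically (a scikit-learn sketch in which the added feature is pure noise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.0, 2.0, 0.5]) + rng.randn(100)

r2_before = LinearRegression().fit(X, y).score(X, y)

# Add an irrelevant (pure noise) feature and refit
X_more = np.column_stack([X, rng.randn(100)])
r2_after = LinearRegression().fit(X_more, y).score(X_more, y)

print(r2_before, r2_after)   # r2_after >= r2_before on the training data
```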
19) Suppose you are given three variables X, Y and Z. The Pearson correlation coefficients of (X, Y), (Y, Z) and (X, Z) are C1, C2 and C3, respectively.
Now you add 2 to all values of X (so the new values are X + 2), subtract 2 from all values of Y (so the new values are Y - 2), and leave Z unchanged. The new Pearson correlation coefficients of (X, Y), (Y, Z) and (X, Z) are D1, D2 and D3, respectively. How do the values of D1, D2 and D3 relate to C1, C2 and C3?
A) D1 = C1, D2 < C2, D3 > C3
B) D1 = C1, D2 > C2, D3 > C3
C) D1 = C1, D2 > C2, D3 < C3
D) D1 = C1, D2 < C2, D3 < C3
E) D1 = C1, D2 = C2, D3 = C3
F) Cannot be determined
Solution: (E)
If you add or subtract a constant from a feature, the correlations between the features do not change.
20) Imagine you are solving a classification problem with highly imbalanced classes: the majority class is observed 99% of the time in the training data. Your model achieves 99% accuracy on the test data. In this case, which of the following is true?
The accuracy metric is not a good idea for imbalanced class problems.
The accuracy metric is a good idea for imbalanced class problems.
Precision and recall are useful metrics for imbalanced class problems.
Precision and recall are not useful metrics for imbalanced class problems.
A) 1 and 3
B) 1 and 4
C) 2 and 3
D) 2 and 4
Solution: (A)
See question 4 in this article:
https://www.analyticsvidhya.com/blog/2016/09/40-interview-questions-asked-at-startups-in-machine-learning-data-science/
21) In ensemble learning, you aggregate the predictions of weak learners so that the ensemble gives better predictions than any single model. Which of the following statements is true for the weak learners used in an ensemble model?
They usually do not overfit.
They have high bias, so they cannot solve complex learning problems.
They usually overfit.
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) 1 only
E) 2 only
F) None of the above
Solution: (A)
Weak learners capture specific parts of the problem. They usually do not overfit, which means weak learners have low variance and high bias.
22) Which of the following options is true for K-fold cross-validation?
Increasing K means cross-validation takes longer to run.
Higher values of K give higher confidence in the cross-validation result than lower values of K.
If K = N, it is called leave-one-out cross-validation, where N is the number of observations.
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3
Solution: (D)
A larger k means less bias toward overestimating the true expected error (since the training folds are closer to the full dataset) and a longer running time (as you approach the limiting case of leave-one-out cross-validation). When choosing k, we also need to consider the variance between the per-fold accuracies.
Question context: 23-24
Cross-validation is an important step of hyperparameter tuning in machine learning. Suppose you are tuning the hyperparameter "max_depth" of a tree-based GBM by choosing among 10 different depth values (all greater than 2) using 5-fold cross-validation.
For the model with max depth 2, training on 4 folds takes 10 seconds and predicting on the remaining fold takes 2 seconds.
Note: ignore hardware dependencies.
23) Which of the following options is true for the overall execution time of 5-fold cross-validation over 10 different "max_depth" values?
A) Less than 100 seconds
B) 100-300 seconds
C) 300-600 seconds
D) Greater than or equal to 600 seconds
E) None of the above
F) Cannot be estimated
Solution: (D)
At depth "2", each fold of the 5-fold cross-validation takes 10 seconds of training and 2 seconds of testing.
So 5 folds take 12 * 5 = 60 seconds. Since we are searching over 10 depth values, the algorithm takes 60 * 10 = 600 seconds.
However, training and testing a model at depths greater than 2 takes longer than at depth "2", so the overall time will be greater than 600 seconds.
24) In the previous question, suppose you train the same algorithm to tune 2 hyperparameters, say "max depth" and "learning rate".
You want to select the right values for max depth (from the 10 given depth values) and learning rate (from 5 given learning rates). In this case, which of the following represents the total time?
A) 1000-1500 seconds
B) 1500-3000 seconds
C) Greater than or equal to 3000 seconds
D) None of these
Solution: (D)
Same reasoning as question 23.
25) The training error (TE) and validation error (VE) of a machine learning algorithm M1 for different values of a hyperparameter (H) are given below. You want to choose H based on TE and VE.
H  TE   VE
1  105  90
2  200  85
3  250  96
4  105  85
5  300  100
Which value of H would you choose based on the table above?
Solution: (D)
According to the table, option D (H = 4) is best: it has the lowest validation error together with a low training error.
26) What would you do in PCA to obtain the same projections as SVD?
A) Transform the data to have zero mean
B) Transform the data to have zero median
C) Not possible
D) None of these
Solution: (A)
When the data has zero mean, the PCA projections are the same as those of SVD; otherwise, you must center the data before taking the SVD.
Question 27-28
Suppose there is a black-box algorithm that takes training data with many observations (t1, t2, t3, ..., tn) and a new observation (q1). The black box outputs the nearest neighbor of q1 (say ti) and its corresponding class label ci. You can also assume that this black-box algorithm is the same as 1-NN (1-nearest neighbor).
27) It is possible to construct a k-NN classification algorithm based only on this black box.
Note: n (the number of training observations) is very large compared to k.
A) True
B) False
Solution: (A)
In the first step, you pass an observation (q1) to the black-box algorithm, which returns the nearest-neighbor observation and its class label.
In the second step, you remove that nearest observation from the training data and pass the observation (q1) in again. The black box returns the next nearest-neighbor observation and its class label.
You repeat this process k times.
28) Instead of the 1-NN black box, suppose we want to use a j-NN (j > 1) algorithm as the black box. Which of the following options is true for finding k-NN using j-NN?
j must be a proper factor of k
j > k
Not possible
A) 1
B) 2
C) 3
Solution: (A)
Same idea as question 27: query the j-NN black box, remove the j returned neighbors from the training data, and repeat; after k / j rounds you have collected the k nearest neighbors, which requires j to be a proper factor of k.
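A sketch of the construction from question 27; the 1-NN "black box" below is a plain nearest-neighbor lookup written only so the example runs end to end:

```python
import numpy as np

def one_nn_black_box(train_X, train_y, q):
    """Stand-in for the 1-NN black box: returns index, point and label of the nearest neighbor."""
    i = np.argmin(np.linalg.norm(train_X - q, axis=1))
    return i, train_X[i], train_y[i]

def knn_from_black_box(train_X, train_y, q, k):
    """k-NN built only from repeated calls to the 1-NN black box."""
    X, y = train_X.copy(), train_y.copy()
    labels = []
    for _ in range(k):
        i, _, label = one_nn_black_box(X, y, q)
        labels.append(label)
        X = np.delete(X, i, axis=0)   # remove the returned neighbor and query again
        y = np.delete(y, i, axis=0)
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]  # majority vote over the k collected labels

train_X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 1.0], [0.9, 1.1]])
train_y = np.array([0, 0, 1, 1, 1])
print(knn_from_black_box(train_X, train_y, np.array([1.0, 0.9]), k=3))   # 1
```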
29) Suppose you are given seven scatter plots numbered 1 to 7 (from left to right), and you want to compare the Pearson correlation coefficients of the variables in each scatter plot.
Which of the following is the correct order?
1 < 2 < 3 < 4
1 > 2 > 3 > 4
7 < 6 < 5 < 4
7 > 6 > 5 > 4
A) 1 and 3
B) 2 and 3
C) 1 and 4
D) 2 and 4
Solution: (B)
The correlation from image 1 to image 4 decreases (in absolute value). From image 4 to image 7 the magnitude of the correlation increases again, but the values are negative (for example, 0, -0.3, -0.7, -0.99).
30) You can use different metrics (such as accuracy, log loss, F-score) to evaluate the performance of a binary classification problem. Suppose you are using log loss as the evaluation metric. Which of the following is true for interpreting log loss as an evaluation metric?
If a classifier is confident about an incorrect classification, log loss penalizes it heavily.
If, for a particular observation, the classifier assigns a very small probability to the correct class, the corresponding contribution to the log loss will be very large.
The lower the logarithmic loss, the better the model.
A) 1 and 3
B) 2 and 3
C) 1 and 2
D) 1, 2 and 3
Solution: (D)
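A small worked example with scikit-learn's log_loss shows how a single confident mistake dominates the metric (and that the value never goes negative):

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 1]

# predicted probabilities for classes [0, 1]
well_calibrated = [[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
confident_mistake = [[0.9, 0.1], [0.1, 0.9], [0.99, 0.01]]  # very sure, but wrong, on the last point

print(log_loss(y_true, well_calibrated))     # ~0.105: low loss, better model
print(log_loss(y_true, confident_mistake))   # ~1.605: the 0.01 assigned to the true class dominates
```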
Question 31-32
Here are five samples given in the dataset.
Note: the visual distance between the points in the image represents the actual distance.
31) Which of the following is the leave-one-out cross-validation accuracy of 3-NN (3 nearest neighbors)?
A) 0
B) 0.4
C) 0.8
D) 1
Solution: (C)
In leave-one-out cross-validation, we select (n - 1) observations for training and 1 observation for validation. Treat each point in turn as the validation point and find its 3 nearest neighbors.
If you repeat this process for all points, the positive-class points shown in the figure above will be classified correctly, but the negative-class point will be misclassified, so you get 80% accuracy.
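The procedure itself can be sketched with scikit-learn; the five points below are made up, since the original figure is not reproduced here:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# illustrative stand-in for the five plotted points and their classes
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]], dtype=float)
y = np.array([1, 1, 1, 0, 0])

scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print(scores.mean())   # leave-one-out accuracy of 3-NN on these made-up points
```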
32) Which value of K gives the lowest leave-one-out cross-validation accuracy?
A) 1NN
B) 3NN
C) 4NN
D) All have the same leave-one-out error.
Solution: (A)
With 1-NN, every point is misclassified, which means you get an accuracy of 0%.
33) Suppose you are given the following data, and you want to apply a logistic regression model to classify it into the two given classes.
You are using logistic regression with L1 regularization.
Where C is the regularization parameter, and W1 and W2 are the coefficients of x1 and x2.
Which of the following is true when you increase the value of C from zero to a very large value?
A) first w2 becomes zero, then w1 becomes zero
B) first w1 becomes zero, then w2 becomes zero
C) both become zero
D) even if the C value is large, both cannot be zero
Solution: (B)
Looking at the image, we can see that the classification can be performed effectively using x2 alone. So w1 will become zero first; as the regularization parameter increases further, w2 will also get closer and closer to zero.
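A sketch of the idea with scikit-learn, on synthetic data where only x2 is informative. Note that scikit-learn's C is the inverse of the penalty weight used in the question, so stronger regularization corresponds to a smaller C here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
x2 = rng.randn(300)
x1 = 0.1 * rng.randn(300)                 # x1 carries almost no signal
X = np.column_stack([x1, x2])
y = (x2 > 0).astype(int)                  # the classes are separable using x2 alone

for c in [10.0, 1.0, 0.1, 0.01]:          # decreasing C = stronger L1 penalty in scikit-learn
    w1, w2 = LogisticRegression(penalty="l1", solver="liblinear", C=c).fit(X, y).coef_[0]
    print(f"C={c:5}: w1={w1:7.3f}  w2={w2:7.3f}")   # w1 hits zero before w2
```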
34) Suppose we have a dataset that can be fit with 100% accuracy with the help of a decision tree of depth 6. Now consider the following statements and choose an option based on them.
Note: all other hyperparameters are the same and other factors are unaffected.
1. A tree of depth 4 will have high bias and low variance.
2. A tree of depth 4 will have low bias and low variance.
A) 1 only
B) 2 only
C) 1 and 2
D) none of the above
Solution: (A)
If such data is fit with a decision tree of depth 4, it is likely to underfit. An underfit model has higher bias and lower variance.
35) Which of the following options can be used to obtain the global minimum in the k-Means algorithm?
1. Try running the algorithm with different centroid initializations.
2. Adjust the number of iterations.
3. Find the optimal number of clusters.
A) 2 and 3
B) 1 and 3
C) 1 and 2
D) All of the above
Solution: (D)
You can adjust all options to find the global minimum.
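All three options map onto arguments of scikit-learn's KMeans (a sketch):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = KMeans(
    n_clusters=4,      # 3. choose the best number of clusters (e.g. via the elbow method)
    n_init=10,         # 1. run the algorithm with several different centroid initializations
    max_iter=500,      # 2. allow enough iterations for each run to converge
    random_state=0,
).fit(X)
print(km.inertia_)     # keep the run with the lowest within-cluster sum of squares
```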
36) Suppose you are working on a binary classification problem. You trained a model on the training dataset and obtained the following confusion matrix on the validation dataset.
Based on the above confusion matrix, which of the following options gives you correct predictions?
1. The accuracy is about 0.91.
2. The misclassification rate is about 0.91.
3. The false positive rate is about 0.95.
4. The true positive rate is about 0.95.
A) 1 and 3
B) 2 and 4
C) 1 and 4
D) 2 and 3
Solution: (C)
The accuracy (fraction of correct classifications) is (50 + 100) / 165, which is approximately 0.91.
The true positive rate is the fraction of actual positives that you predict correctly, so it is 100 / 105 = 0.95; it is also known as "sensitivity" or "recall".
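Using only the counts quoted in the solution (150 correct predictions out of 165 observations, and 100 true positives out of 105 actual positives), the two selected metrics work out as follows:

```python
correct = 50 + 100          # correctly classified negatives + positives (from the solution)
total = 165                 # total observations in the validation set
true_positives = 100
actual_positives = 105

accuracy = correct / total
true_positive_rate = true_positives / actual_positives   # sensitivity / recall

print(round(accuracy, 2), round(true_positive_rate, 2))  # 0.91 0.95
```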
37) For which of the following hyperparameters of a decision tree algorithm is a higher value better?
1. The number of samples used for a split
2. Tree depth
3. The number of samples at a leaf node
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3
E) unable to judge
Solution: (E)
For none of the three options does simply increasing the value of the parameter guarantee better performance. For example, if the tree depth is very large, the resulting tree may overfit the data and fail to generalize; on the other hand, if the value is low, the tree may underfit the data. Therefore, we cannot say with certainty that "the higher, the better".
Question 38-39
Imagine you have a 28 * 28 image and you run a 3 * 3 convolutional neural network on it, with an input depth of 3 and an output depth of 8.
Note: the stride is 1, and you are using "same" padding.
38) What is the size of the output feature map with the given parameters?
A) width 28, height 28 and depth 8
B) width 13, height 13 and depth 8
C) width 28, height 13 and depth 8
D) width 13, height 28 and depth 8
Solution: (A)
The formula for calculating the output size is
Output size = (N - F) / S + 1
where N is the input size, F is the filter size, and S is the stride.
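A quick worked check of the formula, extended with a zero-padding term P (an addition to the formula quoted above) so that both the "same" and "valid" cases can be computed:

```python
def conv_output_size(n, f, s, p):
    """Output width/height for an n x n input, f x f filter, stride s and padding p."""
    return (n - f + 2 * p) // s + 1

# "Same" padding for a 3x3 filter means p = 1: the spatial size is preserved
print(conv_output_size(28, 3, 1, 1))   # 28 -> 28 x 28 x 8 (depth = number of filters)

# "Valid" padding (p = 0) with stride 2 would instead give 13
print(conv_output_size(28, 3, 2, 0))   # 13
```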
39) What is the size of the output feature map when using the following parameters?
A) width 28, height 28 and depth 8
B) width 13, height 13 and depth 8
C) width 28, height 13 and depth 8
D) width 13, height 28 and depth 8
Solution: (B)
Same as the above question.
40) Suppose we are plotting visualizations for different values of C (the penalty parameter) in the SVM algorithm. For some reason, we forgot to label the visualizations with their C values. In that case, for a radial basis function kernel, which of the following options best describes the C values of the images below?
(From left to right, the C value is C1 for image 1, C2 for image 2, and C3 for image 3.)
A) C1 = C2 = C3
B) C1 > C2 > C3
C) C1 < C2 < C3
D) none of these are
Solution: (C)
C is the penalty parameter of the error term. It controls the trade-off between a smooth decision boundary and classifying the training points correctly. For a larger value of C, the optimization will choose a smaller-margin hyperplane.
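A sketch of how C changes the fitted boundary for an RBF kernel, using scikit-learn on synthetic two-moons data:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

for c in [0.01, 1.0, 100.0]:                       # increasing C = heavier penalty on errors
    clf = SVC(kernel="rbf", C=c).fit(X, y)
    # larger C -> tighter fit to the training points, smaller-margin decision boundary
    print(f"C={c:6}: train accuracy={clf.score(X, y):.2f}, support vectors={clf.n_support_.sum()}")
```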
This is the end of "What are the machine learning questions for data scientists?". Thank you for reading. If you want to learn more about the industry, you can follow this site; the editor will keep publishing practical, high-quality articles for you!