
Author: wenzhang1

Jul. 02, 2024

Top 50 Data Mining Interview Questions & Answers

Data Mining is the process of extracting useful information from data warehouses or from bulk data. This article contains the most popular and frequently asked Data Mining interview questions along with their detailed answers. These will help you crack any interview for a data scientist job. So let's get started.


1. What is Data Mining?

Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns.

2. What are the different tasks of Data Mining?

The following activities are carried out during data mining:

  • Classification

  • Clustering

  • Association Rule Discovery

  • Sequential Pattern Discovery

  • Regression

  • Deviation Detection

3. Discuss the Life cycle of Data Mining projects?

The life cycle of Data mining projects:

  • Business understanding:

    Understanding the project's objectives from a business perspective and defining the data mining problem.

  • Data understanding:

    Initial data collection and familiarization with the data.

  • Data preparation:

    Constructing the final data set from raw data.

  • Modeling:

    Select and apply data modeling techniques.

  • Evaluation:

    Evaluate model, decide on further deployment.

  • Deployment:

    Create a report and carry out actions based on the new insights.

4. Explain the process of KDD?

Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Others view data mining as simply an essential step in the process of knowledge discovery, in which intelligent methods are applied in order to extract data patterns.

Knowledge discovery from data consists of the following steps:

  • Data cleaning (to remove noise or irrelevant data).

  • Data integration (where multiple data sources may be combined).

  • Data selection (where data relevant to the analysis task are retrieved from the database).

  • Data transformation (where data are transformed or consolidated into forms appropriate for mining, for example by performing summary or aggregation functions).

  • Data mining (an important process where intelligent methods are applied in order to extract data patterns).

  • Pattern evaluation (to identify the fascinating patterns representing knowledge based on some interestingness measures).

  • Knowledge presentation (where knowledge representation and visualization techniques are used to present the mined knowledge to the user).

5. What is Classification?

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. Classification can be used for predicting the class label of data items. However, in many applications, one may wish to predict some missing or unavailable data values rather than class labels.

6. Explain Evolution and deviation analysis?

Data evolution analysis describes and models regularities or trends for objects whose behavior varies over time. Although this may involve discrimination, association, classification, characterization, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, periodicity pattern matching, and similarity-based data analysis.

In the analysis of time-related data, it is often required not only to model the general evolutionary trend of the data but also to identify data deviations that occur over time. Deviations are differences between measured values and corresponding references such as previous values or normative values. A data mining system performing deviation analysis, upon the detection of a set of deviations, may do the following: describe the characteristics of the deviations, try to explain the reason behind them, and suggest actions to bring the deviated values back to their expected values.

7. What is Prediction?

Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled object, or to estimate the value or value ranges of an attribute that a given object is likely to have. In this interpretation, classification and regression are the two major types of prediction problems, where classification is used to predict discrete or nominal values, while regression is used to predict continuous or ordered values.

8. Explain the Decision Tree Classifier?

A Decision tree is a flow chart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test and each leaf node (or terminal node) holds a class label. The topmost node of a tree is the root node.

A Decision tree is a classification scheme that generates a tree and a set of rules, representing the model of different classes, from a given data set. The set of records available for developing classification methods is generally divided into two disjoint subsets namely a training set and a test set. The former is used for originating the classifier while the latter is used to measure the accuracy of the classifier. The accuracy of the classifier is determined by the percentage of the test examples that are correctly classified.

In the decision tree classifier, we categorize the attributes of the records into two different types. Attributes whose domain is numerical are called the numerical attributes and the attributes whose domain is not numerical are called categorical attributes. There is one distinguished attribute called a class label. The goal of classification is to build a concise model that can be used to predict the class of the records whose class label is unknown. Decision trees can simply be converted to classification rules.
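
As a sketch, here is a tiny hand-built decision tree on made-up weather records; the attribute names, thresholds, and class labels are invented for illustration, and a real classifier would induce the tree from a training set:

```python
# A hand-built decision tree for a toy weather dataset.
# Internal nodes test an attribute; leaves hold a class label.

def classify(record):
    # Root node: test the categorical attribute "outlook".
    if record["outlook"] == "sunny":
        # Internal node: test the numerical attribute "humidity".
        if record["humidity"] > 75:        # numerical attribute test
            return "no"                    # leaf node
        return "yes"
    elif record["outlook"] == "overcast":
        return "yes"
    else:  # rainy
        return "no" if record["windy"] else "yes"

# The same tree converted into classification rules:
#   IF outlook = sunny AND humidity > 75  THEN class = no
#   IF outlook = sunny AND humidity <= 75 THEN class = yes
#   IF outlook = overcast                 THEN class = yes

# Accuracy measured on a (toy) disjoint test set:
test_set = [
    ({"outlook": "sunny", "humidity": 80, "windy": False}, "no"),
    ({"outlook": "overcast", "humidity": 60, "windy": True}, "yes"),
    ({"outlook": "rainy", "humidity": 70, "windy": True}, "no"),
]
accuracy = sum(classify(r) == label for r, label in test_set) / len(test_set)
```

Note how directly the nested tests read off as rules, which is why decision trees are considered comprehensible.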

 9. What are the advantages of a decision tree classifier?

  • Decision trees are able to produce understandable rules.

  • They are able to handle both numerical and categorical attributes.

  • They are easy to understand.

  • Once a decision tree model has been built, classifying a test record is extremely fast.

  • Decision tree depiction is rich enough to represent any discrete value classifier.

  • Decision trees can handle datasets that may have errors.

  • Decision trees can handle datasets that may have missing values.

  • They do not require any prior assumptions. Decision trees are self-explanatory and, when compacted, easy to follow: if the decision tree has a reasonable number of leaves, it can be grasped by non-professional users, and since decision trees can be converted to a set of rules, this sort of representation is considered comprehensible.

10. Explain Bayesian classification in Data Mining?

A Bayesian classifier is a statistical classifier. Bayesian classifiers can predict class membership probabilities, for instance, the probability that a given sample belongs to a particular class. Bayesian classification is based on Bayes' theorem. A simple Bayesian classifier, known as the naive Bayesian classifier, has been found to be comparable in performance with decision tree and neural network classifiers. Bayesian classifiers have also displayed high accuracy and speed when applied to large databases.
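
A minimal naive Bayesian classifier can be sketched in a few lines of Python; the toy records and the add-one smoothing scheme below are illustrative assumptions, not a production implementation:

```python
from collections import Counter, defaultdict

# Toy training data: categorical attributes plus a class label.
train = [
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "overcast", "windy": "no"}, "play"),
]

class_counts = Counter(label for _, label in train)
# feature_counts[class][attribute][value] -> count
feature_counts = defaultdict(lambda: defaultdict(Counter))
for record, label in train:
    for attr, value in record.items():
        feature_counts[label][attr][value] += 1

def posterior(record, label):
    # P(class) * product over attributes of P(value | class),
    # with add-one (Laplace-style) smoothing for unseen values.
    p = class_counts[label] / len(train)
    for attr, value in record.items():
        counts = feature_counts[label][attr]
        p *= (counts[value] + 1) / (sum(counts.values()) + len(counts) + 1)
    return p

def predict(record):
    # Choose the class with the highest posterior probability.
    return max(class_counts, key=lambda c: posterior(record, c))
```

Real implementations work in log-space to avoid underflow and add handling for numeric attributes.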

11. Why Fuzzy logic is an important area for Data Mining?

Rule-based systems for classification have the disadvantage that they involve exact values for continuous attributes. Fuzzy logic is useful for data mining systems performing classification. It provides the benefit of working at a high level of abstraction. In general, the usage of fuzzy logic in rule-based systems involves the following:

  • Attribute values are changed to fuzzy values.

  • For a given new sample, more than one fuzzy rule may apply. Every applicable rule contributes a vote for membership in the categories. Typically, the truth values for each projected category are summed.

  • The sums obtained above are combined into a value that is returned by the system. This process may be done by weighting each category by its truth sum and multiplying by the mean truth value of each category. The calculations involved may be more complex, depending on the difficulty of the fuzzy membership graphs.
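
The voting scheme above can be sketched as follows; the membership functions, attribute, and category names are invented purely for illustration:

```python
# Fuzzy rule voting: fuzzify a continuous attribute, let each
# applicable rule vote with its truth value, and pick the category
# with the largest summed truth value.

def medium_income(x):
    # Triangular membership: peaks at 50, fades to 0 at 20 and 80.
    if 20 <= x <= 50:
        return (x - 20) / 30
    if 50 < x <= 80:
        return (80 - x) / 30
    return 0.0

def high_income(x):
    # Ramp membership: 0 below 50, rising to 1 at 90.
    return min(max((x - 50) / 40, 0.0), 1.0)

def classify_credit(income):
    votes = {
        "approve": high_income(income),    # IF income is high THEN approve
        "review":  medium_income(income),  # IF income is medium THEN review
    }
    return max(votes, key=votes.get), votes
```

For an income of 70, both rules fire with partial truth, and the category with the larger summed truth value wins.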

 12. What are Neural networks?

A neural network is a set of connected input/output units where each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input samples. Neural network learning is also referred to as connectionist learning due to the connections between units. Neural networks involve long training times and are therefore more appropriate for applications where this is feasible. They require a number of parameters that are typically best determined empirically, such as the network topology or "structure". Neural networks have been criticized for their poor interpretability, since it is difficult for humans to interpret the symbolic meaning behind the learned weights. These features initially made neural networks less desirable for data mining.

The advantages of neural networks, however, include their high tolerance to noisy data as well as their ability to classify patterns on which they have not been trained. In addition, several algorithms have recently been developed for the extraction of rules from trained neural networks. These factors contribute to the usefulness of neural networks for classification in data mining. The most popular neural network algorithm is the backpropagation algorithm, proposed in the 1980s.

13. How Backpropagation Network Works?

Backpropagation learns by iteratively processing a set of training samples, comparing the network's estimate for each sample with the actual known class label. For each training sample, the weights are modified to minimize the mean squared error between the network's prediction and the actual class. These changes are made in the "backward" direction, i.e., from the output layer, through each hidden layer, down to the first hidden layer (hence the name backpropagation). Although it is not guaranteed, in general the weights will eventually converge, and the learning process stops.
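
A minimal sketch of this loop on a tiny 2-2-1 network trained on toy XOR data; the learning rate, epoch count, and initialization are arbitrary illustrative choices:

```python
import math
import random

random.seed(0)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# weights: [bias, w1, w2] per neuron; one hidden layer of 2 units.
hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
output = [random.uniform(-1, 1) for _ in range(3)]

samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def forward(x):
    h = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for w in hidden]
    o = sigmoid(output[0] + output[1] * h[0] + output[2] * h[1])
    return h, o

def total_error():
    return sum((forward(x)[1] - y) ** 2 for x, y in samples)

before = total_error()
lr = 0.5
for _ in range(5000):
    for x, y in samples:
        h, o = forward(x)
        # Output-layer delta: error derivative times sigmoid'.
        d_o = (o - y) * o * (1 - o)
        # Hidden-layer deltas, propagated backward through the
        # output weights (hence "backpropagation").
        d_h = [d_o * output[j + 1] * h[j] * (1 - h[j]) for j in range(2)]
        # Gradient-descent weight updates.
        output[0] -= lr * d_o
        for j in range(2):
            output[j + 1] -= lr * d_o * h[j]
            hidden[j][0] -= lr * d_h[j]
            hidden[j][1] -= lr * d_h[j] * x[0]
            hidden[j][2] -= lr * d_h[j] * x[1]
after = total_error()
```

The squared error after training should be lower than before, illustrating the iterative weight adjustment described above.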

14. What is a Genetic Algorithm?

The genetic algorithm is a part of evolutionary computing, which is a rapidly growing area of artificial intelligence. The genetic algorithm is inspired by Darwin's theory of evolution: the solution to a problem solved by a genetic algorithm is evolved. In a genetic algorithm, a population of strings (called chromosomes, or the genotype of the genome), which encode candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem, is evolved toward better solutions. Traditionally, solutions are represented as binary strings composed of 0s and 1s, although other encoding schemes can also be applied.
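
A minimal genetic algorithm sketch on the classic toy "OneMax" problem (maximize the number of 1s in a binary string); the population size, generation count, and mutation rate are arbitrary illustrative choices:

```python
import random

random.seed(42)
LENGTH, POP, GENERATIONS = 20, 30, 40

def fitness(chromosome):
    return sum(chromosome)  # number of 1s in the binary string

# Initial random population of binary chromosomes.
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP)]
initial_best = max(map(fitness, population))

for _ in range(GENERATIONS):
    # Selection: keep the fitter half of the population (elitism).
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]
    children = []
    while len(children) < POP - len(parents):
        a, b = random.sample(parents, 2)
        point = random.randrange(1, LENGTH)   # one-point crossover
        child = a[:point] + b[point:]
        i = random.randrange(LENGTH)          # occasional point mutation
        child[i] ^= random.random() < 0.1
        children.append(child)
    population = parents + children

final_best = max(map(fitness, population))
```

Because the fittest individuals survive each generation, the best fitness never decreases as the population evolves toward the all-ones string.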

15. What is Classification Accuracy?

Classification accuracy, or the accuracy of the classifier, is determined by the percentage of the test data set examples that are correctly classified. The classification accuracy of a classification tree = (1 - generalization error).

16. Define Clustering in Data Mining?

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group and dissimilar to the data points in other groups. It is basically a collection of objects on the basis of similarity and dissimilarity between them.

17. Write a difference between classification and clustering?[IMP]

Classification vs. clustering, by parameter:

  • Type: Classification is used for supervised learning; clustering is used for unsupervised learning.

  • Basic: Classification is the process of assigning input instances to their corresponding class labels; clustering groups instances based on their similarity, without the help of class labels.

  • Need: Classification uses labels, so training and test datasets are needed to verify the model; clustering needs no training or test dataset.

  • Complexity: Classification is more complex than clustering; clustering is less complex than classification.

  • Example algorithms: Classification uses logistic regression, the naive Bayes classifier, support vector machines, etc.; clustering uses the k-means, fuzzy c-means, and Gaussian (EM) clustering algorithms, among others.

18. What is Supervised and Unsupervised Learning?[TCS interview question]

Supervised learning, as the name indicates, has the presence of a supervisor as a teacher. Basically, supervised learning is when we teach or train the machine using data that is well labeled, meaning some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyzes the training data (the set of training examples) and produces a correct outcome from the labeled data.

Unsupervised learning is the training of a machine using information that is neither classified nor labeled and allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training of data. 

Unlike supervised learning, no teacher is provided, which means no training will be given to the machine. The machine must therefore find the hidden structure in unlabeled data by itself.

19. Name areas of applications of data mining?

  • Data Mining Applications for Finance

  • Healthcare

  • Intelligence

  • Telecommunication

  • Energy

  • Retail

  • E-commerce

  • Supermarkets

  • Crime Agencies

  • Businesses Benefit from data mining

20. What are the issues in data mining?

A number of issues need to be addressed by any serious data mining package:

  • Uncertainty Handling

  • Dealing with Missing Values

  • Dealing with Noisy data

  • Efficiency of algorithms

  • Constraining the discovered knowledge to only what is useful

  • Incorporating Domain Knowledge

  • Size and Complexity of Data

  • Data Selection

  • Understandability of discovered knowledge: consistency between data and discovered knowledge.

21. Give an introduction to data mining query language?

DMQL, the Data Mining Query Language, was proposed by Han, Fu, Wang, et al. This language works on the DBMiner data mining system. DMQL queries are based on SQL (Structured Query Language). We can use this language for databases and data warehouses as well. This query language supports ad hoc and interactive data mining.

22. Differentiate Between Data Mining And Data Warehousing?

Data Mining: It is the process of finding patterns and correlations within large data sets to identify relationships between data. Data mining tools allow a business organization to predict customer behavior. Data mining tools are used to build risk models and detect fraud. Data mining is used in market analysis and management, fraud detection, corporate analysis, and risk management.
It is a technology that aggregates structured data from one or more sources so that it can be compared and analyzed, rather than used for transaction processing. 

Data Warehouse: A data warehouse is designed to support the management decision-making process by providing a platform for data cleaning, data integration, and data consolidation. A data warehouse contains subject-oriented, integrated, time-variant, and non-volatile data.

A data warehouse consolidates data from many sources while ensuring data quality, consistency, and accuracy. A data warehouse improves system performance by separating analytics processing from transactional databases. Data flows into a data warehouse from the various databases. A data warehouse works by organizing data into a schema that describes the layout and type of data. Query tools analyze the data tables using the schema.

23. What is Data Purging?

The term purging can be defined as erasing or removing. In the context of data mining, data purging is the process of permanently removing unnecessary data from the database and cleaning the data to maintain its integrity.

24. What Are Cubes?

A data cube stores data in a summarized version which helps in faster analysis of the data. The data is stored in such a way that it allows easy reporting. For example, using a data cube, a user may want to analyze the weekly and monthly performance of an employee. Here, month and week could be considered dimensions of the cube.

25. What are the differences between OLAP and OLTP?[IMP]

  • Data: OLAP (Online Analytical Processing) consists of historical data from various databases; OLTP (Online Transaction Processing) consists only of application-oriented, day-to-day operational current data.

  • Orientation: OLAP is subject-oriented and is used for data mining, analytics, decision making, etc.; OLTP is application-oriented and is used for business tasks.

  • Purpose: OLAP data is used in planning, problem-solving, and decision-making; OLTP data is used to perform day-to-day fundamental operations.

  • View: OLAP provides a multi-dimensional view of different business tasks; OLTP reveals a snapshot of present business tasks.

  • Size: OLAP stores a large amount of data, typically in TB or PB; OLTP data is relatively small since historical data is archived, for example in MB or GB.

  • Speed: OLAP is relatively slow as the amount of data involved is large, and queries may take hours; OLTP is very fast, as the queries operate on only about 5% of the data.

  • Backup: OLAP only needs backup from time to time; in OLTP the backup and recovery process is maintained religiously.

  • Users: OLAP data is generally managed by the CEO, MD, or GM; OLTP data is managed by clerks and managers.

  • Operations: OLAP involves mostly read and only rarely write operations; OLTP involves both read and write operations.

26. Explain Association Algorithm In Data Mining?

Association analysis is the finding of association rules showing attribute-value conditions that occur frequently together in a given set of data. Association analysis is widely used for market basket or transaction data analysis. Association rule mining is a significant and exceptionally dynamic area of data mining research. One method of association-based classification, called associative classification, consists of two steps. In the first step, association rules are generated using a modified version of the standard association rule mining algorithm known as Apriori. The second step constructs a classifier based on the association rules discovered.
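
The support and confidence measures underlying association rule mining can be sketched on a toy market basket; the transactions below are invented for illustration:

```python
from itertools import combinations

# Toy market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # conf(A => B) = support(A union B) / support(A)
    return support(set(antecedent) | set(consequent)) / support(antecedent)

# Frequent item pairs at a minimum support threshold of 0.4,
# the kind of candidate set Apriori would enumerate level by level.
items = sorted(set().union(*transactions))
frequent_pairs = [p for p in combinations(items, 2) if support(p) >= 0.4]
```

Apriori makes this counting tractable by pruning: an itemset can only be frequent if all of its subsets are frequent.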

27. Explain how to work with data mining algorithms included in SQL server data mining?

SQL Server data mining offers Data Mining Add-ins for Office that permit finding the patterns and relationships in the data, which helps in improved analysis. The add-in called the Data Mining Client for Excel is used to first prepare data and then create, manage, and analyze models and their results.

28. Explain Over-fitting?

The concept of over-fitting is very important in data mining. It refers to the situation in which the induction algorithm generates a classifier that perfectly fits the training data but has lost the capability of generalizing to instances not presented during training. In other words, instead of learning, the classifier just memorizes the training instances. In decision trees, over-fitting usually occurs when the tree has too many nodes relative to the amount of training data available. By increasing the number of nodes, the training error usually decreases, while at some point the generalization error becomes worse. Over-fitting can arise when there is noise in the training data or when the number of training instances is small: the error of the fully built tree on the training data is zero, while the true error is likely to be larger.

There are many disadvantages of an over-fitted decision tree:

  • Over-fitted models are inaccurate on unseen data.

  • Over-fitted decision trees require more space and more computational resources.

  • They require the collection of unnecessary features.

29. Define Tree Pruning?

When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of over-fitting the data; pruning is a technique that removes the over-fitting problem. Such methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data. The pruning phase eliminates some of the lower branches and nodes to improve performance, and the pruned tree is easier to understand.

30. What is a Sting?

Statistical Information Grid is called STING; it is a grid-based multi-resolution clustering strategy. In the STING strategy, the data space is divided into rectangular cells; these cells are kept at different levels of resolution, and these levels are organized in a hierarchical structure.

31. Define Chameleon Method?

Chameleon is another hierarchical clustering technique that uses dynamic modeling. Chameleon was introduced to overcome the disadvantages of the CURE clustering technique. In this technique, two clusters are merged if the interconnectivity between the two clusters is greater than the interconnectivity between the objects within each cluster.

 32. Explain the Issues regarding Classification And Prediction?

Preparing the data for classification and prediction:

  • Data cleaning

  • Relevance analysis

  • Data transformation

  • Comparing classification methods

  • Predictive accuracy

  • Speed

  • Robustness

  • Scalability

  • Interpretability

33. Explain the use of data mining queries, or why are data mining queries more helpful?

Data mining queries are primarily applied to a model to produce single or multiple outcomes for new data, and they also permit us to supply input values. A query can retrieve information effectively if a particular pattern is defined correctly. It draws on the statistical memory of the training data and fetches the specific pattern and rule representing a typical case in the model. It helps in extracting regression formulas and other calculations, and it can also recover the details of the individual cases used in the model, including data that is not used in the analysis. It keeps the model up to date by adding new data, performing the task, and cross-verifying the results.

34. What is a machine learning-based approach to data mining?

This is one of the higher-level data mining interview questions. Machine learning is widely used in data mining because it covers automatic programmed processing systems based on logical or binary operations. Machine learning generally follows rules that allow us to handle more general data types, including cases where the number and type of attributes may vary. It is one of the most popular techniques used for data mining, and in artificial intelligence as well.

35. What is the K-means algorithm?

K-means clustering is the simplest unsupervised learning algorithm that solves clustering problems. The k-means algorithm partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
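
A minimal k-means sketch on one-dimensional toy data with a fixed, deterministic initialization (real implementations handle many dimensions and use random restarts):

```python
# Toy 1-D data with two obvious groups, around 1 and around 8.
points = [1.0, 1.2, 0.8, 8.0, 8.4, 7.6]
centroids = [0.0, 10.0]  # deliberate, deterministic initialization

for _ in range(10):  # alternate assignment and update steps
    clusters = [[], []]
    for p in points:
        # Assignment step: attach each point to the nearest centroid.
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster,
    # so the mean serves as the cluster's prototype.
    centroids = [sum(c) / len(c) for c in clusters]
```

After a few iterations the centroids settle near 1.0 and 8.0, the means of the two groups.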


36. What are precision and recall?[IMP]

Precision is the most commonly used error metric in classification. Its range is from 0 to 1, where 1 represents 100%. Precision is the fraction of predicted positives that are actually positive:

Precision = (True positive)/(True positive + False positive)

Recall can be defined as the fraction of the actual positives that our model labels as positive (true positives). Recall and the true positive rate are identical. Here's the formula for it:

Recall = (True positive)/(True positive + False negative)
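
Both metrics can be computed directly from predicted and actual labels; the labels below are a made-up example:

```python
# Toy binary labels: 1 = positive class, 0 = negative class.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

precision = tp / (tp + fp)   # TP / (TP + FP)
recall    = tp / (tp + fn)   # TP / (TP + FN)
```

Here three of the four predicted positives are correct (precision 0.75), and three of the four actual positives are found (recall 0.75).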

37. What are the ideal situations in which t-test or z-test can be used?

It is standard practice to use a t-test when the sample size is under 30 and a z-test when the sample size exceeds 30.

38. What is the simple difference between standardized and unstandardized coefficients?

Standardized coefficients are interpreted in terms of standard deviation units, while unstandardized coefficients are measured in the actual units of the values present in the dataset.

39. How are outliers detected?

Numerous approaches can be utilized for detecting outliers, but the two most generally used techniques are the following:

  • Standard deviation strategy: Here, the value is considered as an outlier if the value is lower or higher than three standard deviations from the mean value.

  • Box plot technique: Here, a value is viewed as an outlier if it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.
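
Both strategies can be sketched on toy data with one planted outlier:

```python
import statistics

# Toy data: many values near 11-12 plus one planted outlier, 50.
data = [10, 12, 11, 13, 12, 11, 10, 12] * 3 + [50]

# Standard deviation strategy: flag values more than three
# standard deviations from the mean.
mean, sd = statistics.mean(data), statistics.pstdev(data)
sd_outliers = [x for x in data if abs(x - mean) > 3 * sd]

# Box plot (IQR) technique: flag values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
iqr_outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
```

Note that a single extreme value inflates the standard deviation itself, which is one reason the IQR rule is often preferred for skewed data.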

40. Why is KNN preferred when determining missing numbers in data?

K-Nearest Neighbour (KNN) is preferred here because of the fact that KNN can easily approximate the value to be determined based on the values closest to it.

The k-nearest neighbor (K-NN) classifier is considered an example-based classifier, which means that the training documents are used for comparison rather than an exact class representation, like the class profiles utilized by other classifiers. As such, there's no real training phase. Once a new document has to be classified, the k most similar documents (neighbors) are found, and if a large enough proportion of them are assigned to a particular class, the new document is also assigned to that class; otherwise not. Additionally, finding the nearest neighbors can be accelerated using traditional index structures.
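
A minimal sketch of KNN-based estimation of a missing value (toy records, k = 2, distance on a single shared attribute; real systems use all shared attributes with scaling):

```python
# Toy records: two young/low-income and two older/high-income people.
records = [
    {"age": 25, "income": 30.0},
    {"age": 27, "income": 32.0},
    {"age": 60, "income": 90.0},
    {"age": 58, "income": 88.0},
]
incomplete = {"age": 26, "income": None}  # income is missing

def distance(a, b):
    # Distance on the observed attribute only (age, here).
    return abs(a["age"] - b["age"])

# Fill the missing value with the average of the k nearest
# complete records, since KNN approximates from the closest values.
k = 2
neighbors = sorted(records, key=lambda r: distance(incomplete, r))[:k]
incomplete["income"] = sum(r["income"] for r in neighbors) / k
```

The two nearest neighbors by age (25 and 27) yield an imputed income of 31.0, close to the values around it.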

41. Explain Prepruning and Post pruning approach in Classification?

Prepruning: In the prepruning approach, a tree is "pruned" by halting its construction early (e.g., by deciding not to further split or partition the subset of training samples at a given node). Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset samples, or the probability distribution of those samples. When constructing a tree, measures such as statistical significance, information gain, etc., can be used to assess the goodness of a split. If partitioning the samples at a node would result in a split that falls below a pre-specified threshold, then further partitioning of the given subset is halted. There are problems, however, in choosing a proper threshold: high thresholds could result in oversimplified trees, while low thresholds could result in very little simplification.

Postpruning: The postpruning approach removes branches from a "fully grown" tree. A tree node is pruned by removing its branches; the cost complexity pruning algorithm is an example of the postpruning approach. The pruned node becomes a leaf and is labeled by the most frequent class among its former branches. For every non-leaf node in the tree, the algorithm calculates the expected error rate that would occur if the subtree at that node were pruned. Next, the expected error rate occurring if the node were not pruned is calculated using the error rates for each branch, combined by weighting according to the proportion of observations along each branch. If pruning the node leads to a greater expected error rate, then the subtree is retained; otherwise, it is pruned. After generating a set of progressively pruned trees, an independent test set is used to estimate the accuracy of each tree. The decision tree that minimizes the expected error rate is preferred.

42. How can one handle suspicious or missing data in a dataset while performing the analysis?

If there are any inconsistencies or uncertainty in the data set, a user can proceed with any of the following techniques:

  • Creating a validation report with insights regarding the data in question.

  • Escalating the issue to an experienced data analyst to take a look at it and make the call.

  • Replacing the invalid information with corresponding valid and up-to-date information.

  • Using multiple methodologies together to discover missing values, with approximation estimates if necessary.

43. What is the simple difference between Principal Component Analysis (PCA) and Factor Analysis (FA)?

Among numerous differences, the significant one is that PCA is designed to explain the total variance in the data with a small number of components, while factor analysis models the covariance (the shared variance) between variables in terms of underlying latent factors.

44. What is the difference between Data Mining and Data Analysis?

Data Mining vs. Data Analysis:

  • Data mining is used to perceive patterns in stored data; data analysis is used to arrange and organize raw information in a meaningful manner.

  • Mining is performed on clean and well-documented data; analysis includes data cleaning, so the information is not always available in a well-documented format.

  • Results extracted from data mining are difficult to interpret; results extracted from data analysis are not difficult to interpret.

45. What is the difference between Data Mining and Data Profiling?

  • Data Mining:

    Data mining refers to the analysis of data in order to discover relations that have not been found before. It mainly focuses on the detection of unusual records, dependencies, and cluster analysis.

  • Data Profiling:

    Data profiling can be described as the process of analyzing individual attributes of data. It mostly focuses on providing significant information about data attributes, such as data type, frequency, and so on.

46. What are the important steps in the data validation process?

As the name suggests, data validation is the process of validating data. This step mainly has two methods associated with it: data screening and data verification.

  • Data Screening:

    Different kinds of algorithms are utilized in this step to screen the entire data set for any inaccurate values.

  • Data Verification:

    Every suspected value is evaluated against various use cases, and then a final decision is made on whether the value must be included in the data or not.

47. What is the difference between univariate, bivariate, and multivariate analysis?

The main differences between univariate, bivariate, and multivariate analysis are as follows:

  • Univariate:

    A statistical analysis that involves a single variable at a given instance of time.

  • Bivariate:

    This analysis is utilized to discover the relationship between two variables at a time.

  • Multivariate:

    The analysis of more than two variables is known as multivariate. This analysis is utilized to understand the effect of the variables on the responses.

48. What is the difference between variance and covariance?

Variance and covariance are two mathematical terms used frequently in statistics. Variance measures how far numbers are spread out from the mean. Covariance refers to how two random variables change together; it is essentially used to compute the correlation between variables.
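
Both quantities can be computed by hand on toy data:

```python
# Two toy variables that move together perfectly.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 5.0, 7.0]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Variance: average squared distance of x from its mean.
variance_x = sum((xi - mean_x) ** 2 for xi in x) / len(x)

# Covariance: average product of paired deviations, i.e. how
# x and y vary together (positive here, since they rise together).
covariance_xy = sum(
    (xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)
) / len(x)
```

Dividing the covariance by the product of the two standard deviations gives the correlation coefficient, which is how covariance feeds into correlation.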

49. What are different types of Hypothesis Testing?

The various kinds of hypothesis testing are as per the following:  

  • T-test

    : A T-test is utilized when the standard deviation is unknown and the sample size is nearly small.  

  • Chi-Square Test for Independence:

    These tests are utilized to discover the significance of the association between all categorical variables in the population sample.  

  • Analysis of Variance (ANOVA):

    This test is used to analyze differences between the means of different groups. It is applied similarly to a t-test, but for more than two groups.

Welch's t-test: This test is used to test the equality of means between two samples without assuming equal variances.
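As a hedged illustration, Welch's t-statistic can be computed directly from the formula t = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂); the two samples below are invented, and a real analysis would also compute a p-value (e.g., via `scipy.stats.ttest_ind` with `equal_var=False`):

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t-statistic: compares two means without assuming
    equal variances (used when the standard deviation is unknown)."""
    va, vb = statistics.variance(sample_a), statistics.variance(sample_b)
    na, nb = len(sample_a), len(sample_b)
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    return mean_diff / math.sqrt(va / na + vb / nb)

a = [5.1, 4.9, 5.3, 5.0, 5.2]   # invented sample A
b = [4.2, 4.0, 4.4, 4.1]        # invented sample B
t = welch_t(a, b)               # positive: mean of a exceeds mean of b
```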

50. Why should we use data warehousing and how can you extract data for analysis?

Data warehousing is a key technology on the way to establishing business intelligence. A data warehouse is a collection of data extracted from the operational or transactional systems in a business, transformed to clean up any inconsistencies in identification coding and definition, and then arranged to support rapid reporting and analysis.

Here are some of the benefits of a data warehouse:

  • It is separate from the operational database.

  • Integrates data from heterogeneous systems.

  • Stores a huge amount of data, more historical than current data.

  • Does not require data to be highly accurate.

Bonus Interview Questions & Answers

1. What is Visualization?

Visualization is the depiction of data, used to gain intuition about the data being observed. It assists analysts in selecting display formats, viewer perspectives, and data representation schemas.

2. Name some data mining tools.

  • DBMiner


  • GeoMiner

  • Multimedia miner

  • WeblogMiner

3. What are the most significant advantages of Data Mining?

There are many advantages of Data Mining. Some of them are listed below:

  • Data Mining is used to polish the raw data and make us able to explore, identify, and understand the patterns hidden within the data.

  • It automates finding predictive information in large databases, thereby helping to identify the previously hidden patterns promptly.

  • It assists in faster and better decision-making, which later helps businesses take necessary actions to increase revenue and lower operational costs.

  • It also helps with data screening and validation, to understand where the data is coming from.

  • Using the Data Mining techniques, the experts can manage applications in various areas such as Market Analysis, Production Control, Sports, Fraud Detection, Astrology, etc.

  • The shopping websites use Data Mining to define a shopping pattern and design or select the products for better revenue generation.

  • Data Mining also helps in data optimization.

  • Data Mining can also be used to determine hidden profitability.

4. What are 'Training set' and 'Test set'?

In various areas of information science, such as machine learning, a set of data used to discover potentially predictive relationships is known as the 'Training Set'. The training set is the set of examples given to the learner, while the Test set is used to test the accuracy of the hypotheses generated by the learner and is the set of examples held back from the learner. The training set is distinct from the Test set.
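A minimal sketch of creating a Training set and a Test set, assuming a simple shuffle-and-split policy with a 75/25 ratio (libraries such as scikit-learn provide this as `train_test_split`):

```python
import random

def train_test_split(data, test_ratio=0.25, seed=42):
    """Shuffle the data and hold back `test_ratio` of it as the Test set;
    the remainder becomes the Training set."""
    rng = random.Random(seed)   # fixed seed for reproducibility
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

examples = list(range(100))
train, test = train_test_split(examples)
# The two sets are disjoint and together cover every example.
assert not set(train) & set(test)
```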

5. Explain the functions of 'Unsupervised Learning'.

  • Find clusters of the data

  • Find low-dimensional representations of the data

  • Find interesting directions in data

  • Interesting coordinates and correlations

  • Find novel observations/ database cleaning

6. In what areas Pattern Recognition is used?

Pattern Recognition can be used in

  • Computer Vision

  • Speech Recognition

  • Data Mining

  • Statistics

  • Information Retrieval

  • Bio-Informatics

7. What is ensemble learning?

To solve a particular computational problem, multiple models, such as classifiers or experts, are strategically generated and combined. This process is known as ensemble learning. Ensemble learning is used when we build component classifiers that are more accurate and independent of each other. It is used to improve classification, prediction of data, and function approximation.

8. What is the general principle of an ensemble method and what is bagging and boosting in the ensemble method?

The general principle of an ensemble method is to combine the predictions of several models built with a given learning algorithm in order to improve robustness over a single model. Bagging is an ensemble method for improving unstable estimation or classification schemes, and it reduces error mainly by reducing the variance term. Boosting methods, in contrast, fit models sequentially to reduce the bias of the combined model.
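Bagging can be sketched as bootstrap resampling plus majority voting. The toy 1-D data set and the choice of a 1-nearest-neighbour base learner below are illustrative assumptions, not a prescribed recipe:

```python
import random
from collections import Counter

# Minimal bagging sketch: train one 1-nearest-neighbour "model" per
# bootstrap sample and combine them by majority vote.

def one_nn_predict(sample, x):
    """Predict the label of the training point closest to x."""
    return min(sample, key=lambda p: abs(p[0] - x))[1]

def bagging_predict(data, x, n_models=25, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap: sample the data with replacement, same size.
        bootstrap = [rng.choice(data) for _ in data]
        votes.append(one_nn_predict(bootstrap, x))
    # Majority vote across the ensemble.
    return Counter(votes).most_common(1)[0][0]

# Hypothetical labeled points: class "a" near 0, class "b" near 1.
data = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.9, "b"), (1.0, "b"), (1.1, "b")]
label = bagging_predict(data, 0.15)
```

Because each model sees a different resample, the averaged vote is more stable than any single base learner, which is exactly the variance-reduction argument above.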

9. What are the components of relational evaluation techniques?

The important components of relational evaluation techniques are

  • Data Acquisition

  • Ground Truth Acquisition

  • Cross-Validation Technique

  • Query Type

  • Scoring Metric

  • Significance Test

10. What are the different methods for Sequential Supervised Learning?

 The different methods to solve Sequential Supervised Learning problems are

  • Sliding-window methods

  • Recurrent sliding windows

  • Hidden Markov models

  • Maximum entropy Markov models

  • Conditional random fields

  • Graph transformer networks
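The sliding-window method, for instance, can be sketched as follows; the window size, padding token, and part-of-speech example are assumptions for illustration:

```python
# Sliding-window method: turn a sequential supervised learning problem
# (one label per position) into ordinary supervised examples by pairing
# each label with a window of surrounding inputs.

def sliding_windows(sequence, labels, half_width=1, pad="<PAD>"):
    """Return (window, label) pairs, padding the sequence edges."""
    padded = [pad] * half_width + list(sequence) + [pad] * half_width
    examples = []
    for i, label in enumerate(labels):
        window = tuple(padded[i : i + 2 * half_width + 1])
        examples.append((window, label))
    return examples

tokens = ["the", "cat", "sat"]
tags = ["DET", "NOUN", "VERB"]
pairs = sliding_windows(tokens, tags)
```

Each pair can then be fed to any ordinary classifier; recurrent sliding windows additionally feed previous predictions back in as features.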

11. What is a Random Forest?

Random forest is an ensemble machine learning method that can perform both regression and classification tasks. It is also used for treating missing values and outlier values.

12. What is reinforcement learning?

Reinforcement Learning is a learning mechanism for mapping situations to actions, with the goal of maximizing a numerical reward signal. In this method, a learner is not told which action to take but must instead discover which actions yield the maximum reward. This method is based on a reward/penalty mechanism.

13. Is it possible to capture the correlation between continuous and categorical variables?

Yes, we can use the analysis of the covariance technique to capture the association between continuous and categorical variables.

14. What is Visualization?

Visualization is for the depiction of information and to acquire knowledge about the information being observed. It helps the experts in choosing format designs, viewer perspectives, and information representation patterns.

15. Name some best tools which can be used for data analysis.

The most common useful tools for data analysis are:

  • Google Search Operators

  • KNIME

  • Tableau

  • Solver

  • RapidMiner

  • Io

  • NodeXL

16. Describe the structure of Artificial Neural Networks?

An artificial neural network (ANN), also referred to simply as a 'Neural Network' (NN), is a computational model based on biological neural networks. Its structure consists of an interconnected collection of artificial neurons. An ANN is an adaptive system that changes its structure based on the information flowing through the network during a learning phase. The ANN relies on the principle of learning by example. There are two classical types of neural networks: the perceptron and the multilayer perceptron. Here we will focus on the perceptron algorithm.
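A minimal perceptron sketch, learning the logical AND function; the learning rate and epoch count are arbitrary choices:

```python
# Single-neuron perceptron trained with the classic mistake-driven
# update rule: w <- w + lr * (target - prediction) * x.

def perceptron_train(examples, lr=0.1, epochs=20):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in examples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - pred          # nonzero only on mistakes
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# AND is linearly separable, so the perceptron is guaranteed to converge.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = perceptron_train(AND)

def predict(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
```

A multilayer perceptron stacks several such units with nonlinear activations, which is what allows it to learn non-linearly-separable functions like XOR.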

17. Do you think 50 small decision trees are better than a large one? Why?

Yes, 50 small decision trees are better than a single large one because 50 trees make a more robust model (less subject to over-fitting) and one that is simpler to interpret.



varshachoudhary


40 Questions & Answers on Clustering Techniques for Data Science Professionals

Introduction

The idea of creating machines that learn by themselves (i.e., artificial intelligence) has been driving humans for decades now. Unsupervised learning and clustering are the keys to fulfilling that dream. Unsupervised learning provides more flexibility but is more challenging as well. This skill test will focus on clustering techniques.

Clustering plays an important role in drawing insights from unlabeled data. Clustering machine learning algorithms classify large datasets in similar groups, which improves various business decisions by providing a meta-understanding. Recently deep learning models with neural networks are also used in clustering.

About the Skill Test

In this skill test, we tested our community on clustering techniques. A total of people registered for this skill test. If you missed taking the test, we have provided the questions and answers here. This is your opportunity to find out how many questions you could have answered correctly. These can also be useful as part of data science interview preparation.

Below is the distribution of scores to help you evaluate your performance:

You can view the leaderboard here. More than 390 people participated in the skill test; the highest score was 33. Here are a few statistics about the distribution.

Overall distribution

Mean Score: 15.11 | Median Score: 15 | Mode Score: 16

Helpful Resources

Many people wish to become data scientists and data analysts these days and wonder whether they can achieve it without a background in computer science. Rest assured, that is totally possible! There are plenty of resources, courses, and tutorials available online that cover various data science topics, such as data analysis, data mining, big data, data analytics, data modeling, data visualization, and more. Here are some of our best recommended online resources on clustering techniques.

If you are just getting started with Unsupervised Learning, here are some comprehensive resources to assist you in your journey:

Clustering Techniques Skill Test Questions & Answers

Q1. Movie recommendation systems are an example of:

  1. Classification
  2. Clustering
  3. Reinforcement Learning
  4. Regression

Options:
A. 2 Only
B. 1 and 2
C. 1 and 3
D. 2 and 3
E. 1, 2, and 3
F. 1, 2, 3, and 4

Solution: (E)

Generally, movie recommendation systems cluster the users in a finite number of similar groups based on their previous activities and profile. Then, at a fundamental level, people in the same cluster are made similar recommendations.

In some scenarios, this can also be approached as a classification problem for assigning the most appropriate movie class to the user of a specific group of users. Also, a movie recommendation system can be viewed as a reinforcement learning problem where it learns from its previous recommendations and improves future recommendations.

Q2. Sentiment Analysis is an example of:

  1. Regression
  2. Classification
  3. Clustering
  4. Reinforcement Learning

Options:
A. 1 Only
B. 1 and 2
C. 1 and 3
D. 1, 2 and 3
E. 1, 2 and 4
F. 1, 2, 3 and 4

Solution: (E)

At the fundamental level, Sentiment analysis classifies the sentiments represented in an image, text, or speech into a set of defined sentiment classes like happy, sad, excited, positive, negative, etc. It can also be viewed as a regression problem for assigning a sentiment score of, say, 1 to 10 for a corresponding image, text, or speech.

Another way of looking at sentiment analysis is to consider it using a reinforcement learning perspective where the algorithm constantly learns from the accuracy of past sentiment analysis performed to improve future performance.

Q3. Can decision trees be used for performing clustering?

A. True
B. False

Solution:  (A)

Decision trees (and also random forests) can be used to form clusters in the data, but clustering often generates natural clusters and is not dependent on any objective function.

Q4. Which of the following is the most appropriate strategy for data cleaning before performing clustering analysis, given less than the desirable number of data points?

  1. Capping and flooring of variables
  2. Removal of outliers

Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of the above

Solution: (A)

Removal of outliers is not recommended if the data points are few in number. In this scenario, capping and flooring of variables is the most appropriate strategy.

Q5. What is the minimum no. of variables/ features required to perform clustering?

Options:
A. 0
B. 1
C. 2
D. 3

Solution: (B)

At least a single variable is required to perform clustering analysis. Clustering analysis with a single variable can be visualized with the help of a histogram.

Q6. For two runs of K-Mean clustering, is it expected to get the same clustering results?

A. Yes
B. No

Solution: (B)

The K-Means clustering algorithm converges to local minima, which might coincide with the global minimum in some cases, but not always. Therefore, it's advised to run the K-Means algorithm multiple times before drawing inferences about the clusters.

However, note that it's possible to receive the same clustering results from K-Means on every run by setting the same seed value. But that is done simply by making the algorithm choose the same set of random numbers for each run.

Q7. Is it possible that the assignment of observations to clusters does not change between successive iterations in K-Means?

Options:
A. Yes
B. No
C. Can't say
D. None of these

Solution: (A)

When the K-Means machine learning model has reached the local or global minima, it will not alter the assignment of data points to clusters for two successive iterations.

Q8. Which of the following can act as possible termination conditions in K-Means?

  1. For a fixed number of iterations.
  2. The assignment of observations to clusters does not change between iterations, except for cases with a bad local minimum.
  3. Centroids do not change between successive iterations.
  4. Terminate when RSS falls below a threshold.

Options:
A. 1, 3 and 4
B. 1, 2 and 3
C. 1, 2 and 4
D. All of the above

Solution: (D)

All four conditions can be used as possible termination conditions in K-Means clustering:

  1. This condition limits the runtime of the clustering algorithm, but in some cases, the quality of the clustering will be poor because of an insufficient number of iterations.
  2. Except for cases with a bad local minimum, this produces a good clustering, but runtimes may be unacceptably long.
  3. This also ensures that the algorithm has converged at the minima.
  4. Terminate when RSS falls below a threshold. This criterion ensures that the clustering is of the desired quality after termination. Practically, combining it with a bound on the number of iterations to guarantee termination is a good practice.

Q9. Which of the following clustering algorithms suffers from the problem of convergence at local optima?

  1. K- Means clustering algorithm
  2. Agglomerative clustering algorithm
  3. Expectation-Maximization clustering algorithm
  4. Divisive clustering algorithm

Options:
A. 1 only
B. 2 and 3
C. 2 and 4
D. 1 and 3
E. 1,2 and 4
F. All of the above

Solution: (D) 
Out of the options given, only the K-Means clustering algorithm and EM clustering algorithm have the drawback of converging at local minima.

Q10. Which of the following algorithm is most sensitive to outliers?

Options:
A. K-means clustering algorithm
B. K-medians clustering algorithm
C. K-modes clustering algorithm
D. K-medoids clustering algorithm

Solution: (A)

Out of all the options, the K-Means clustering algorithm is most sensitive to outliers as it uses the mean of cluster data points to find the cluster center.

Q11. After performing K-Means Clustering analysis on a dataset, you observed the following dendrogram. Which of the following conclusion can be drawn from the dendrogram?

Options:
A. There were 28 data points in the clustering analysis
B. The best no. of clusters for the analyzed data points is 4
C. The proximity function used is Average-link clustering
D. The above dendrogram interpretation is not possible for K-Means clustering analysis

Solution: (D)

A dendrogram is not possible for K-Means clustering analysis. However, one can create a clustergram based on K-Means clustering analysis.

Q12. How can Clustering (Unsupervised Learning) be used to improve the accuracy of the Linear Regression model (Supervised Learning)?

  1. Creating different models for different cluster groups.
  2. Creating an input feature for cluster ids as an ordinal variable.
  3. Creating an input feature for cluster centroids as a continuous variable.
  4. Creating an input feature for cluster size as a continuous variable.

Options:
A. 1 only
B. 1 and 2
C. 1 and 4
D. 3 only
E. 2 and 4
F. All of the above

Solution: (F)

Creating an input feature for cluster ids as ordinal variables or creating an input feature for cluster centroids as a continuous variable might not convey any relevant information to the regression model for multidimensional data. But for clustering in a single dimension, all of the given methods are expected to convey meaningful information to the regression model. For example, to cluster people in two groups based on their hair length, storing clustering IDs as ordinal variables and cluster centroids as continuous variables will convey meaningful information.

Q13. What could be the possible reason(s) for producing two different dendrograms using an agglomerative clustering algorithm for the same dataset?

A. Proximity function used
B. No. of data points used
C. No. of variables used
D. B and C only
E. All of the above

Solution: (E)

A change in any of the proximity function, the no. of data points, or the no. of variables will lead to different clustering results and hence different dendrograms.

Q14. In the figure below, if you draw a horizontal line on the y-axis for y=2. What will be the number of clusters formed?

Options:
A. 1
B. 2
C. 3
D. 4

Solution: (B)

Since the number of vertical lines intersecting the red horizontal line at y=2 in the dendrogram is 2, therefore, two clusters will be formed.

Q15. What is the most appropriate no. of clusters for the data points represented by the following dendrogram?

Options:
A. 2
B. 4
C. 6
D. 8

Solution: (B)

The number of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice is the no. of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.

In the above example, the best choice of no. of clusters will be 4 as the red horizontal line in the dendrogram below covers the maximum vertical distance AB.

Q16. In which of the following cases will K-Means clustering fail to give good results?

  1. Data points with outliers
  2. Data points with different densities
  3. Data points with round shapes
  4. Data points with non-convex shapes

Options:
A. 1 and 2
B. 2 and 3
C. 2 and 4
D. 1, 2 and 4
E. 1, 2, 3 and 4

Solution: (D)

The K-Means clustering algorithm fails to give good results when the data contains outliers, the density spread of data points across the data space is different, and the data points follow non-convex shapes.

Q17. Which of the following metrics do we have for finding dissimilarity between two clusters in hierarchical clustering?

  1. Single-link
  2. Complete-link
  3. Average-link

Options:
A. 1 and 2
B. 1 and 3
C. 2 and 3
D. 1, 2 and 3

Solution: (D)

All three methods, i.e., single link, complete link, and average link, can be used for finding dissimilarity between two clusters in hierarchical clustering (all are available in the Python library scikit-learn).

Q18. Which of the following is/are true?

  1. Clustering analysis is negatively affected by the multicollinearity of features
  2. Clustering analysis is negatively affected by heteroscedasticity

Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of them

Solution: (A)

Clustering analysis is not negatively affected by heteroscedasticity, but the results are negatively impacted by multicollinearity of the features/variables used in clustering, as a correlated feature/variable will carry more weight in the distance calculation than desired.

Context for Question 19: Given are six points with the following attributes

Q19. Which of the following clustering representations and dendrogram depicts the use of MIN or single link proximity function in hierarchical clustering?

A. 

B. 

C. 

D. 

Solution: (A)

For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined to be the minimum distance between any two points in the different clusters. For instance, from the table, we see that the distance between points 3 and 6 is 0.11, and that is the height at which they are joined into one cluster in the dendrogram. As another example, the distance between clusters {3, 6} and {2, 5} is given by dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = min(0., 0., 0., 0.) = 0..

Context for Question 20: Given are six points with the following attributes
 

Q20. Which of the following clustering representations and dendrogram depicts the use of MAX or complete link proximity function in hierarchical clustering?

A.   

B.  

C.  

D.  

Solution: (B)

For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined as the maximum distance between any two points in the different clusters. Similarly, here points 3 and 6 are merged first. However, {3, 6} is merged with {4}, instead of {2, 5}. This is because dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4)) = max(0., 0.) = 0., which is smaller than dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = max(0., 0., 0., 0.) = 0. and dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1)) = max(0., 0.) = 0..

Context for Question 21: Given are six points with the following attributes
 

Q21. Which of the following clustering representations and dendrogram depicts the use of the group average proximity function in hierarchical clustering?

A.  

B.  
C. 

D. 

Solution: (C)

For the group average version of hierarchical clustering, the proximity of two clusters is defined to be the average of the pairwise proximities between all pairs of points in the different clusters. This is an intermediate approach between MIN and MAX. This is expressed by the following equation:

Here are the distances between some of the clusters: dist({3, 6, 4}, {1}) = (0. + 0. + 0.)/(3 × 1) = 0.. dist({2, 5}, {1}) = (0. + 0.)/(2 × 1) = 0.. dist({3, 6, 4}, {2, 5}) = (0. + 0. + 0. + 0. + 0. + 0.)/(3 × 2) = 0.. Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), these two clusters are merged at the fourth stage.

Context for Question 22: Given are six points with the following attributes

Q22. Which of the following clustering representations and dendrogram depicts the use of Ward's method proximity function in hierarchical clustering?

A. 

B. 

C.  

D. 

Solution: (D)

Ward's method is similar to the centroid method, which calculates the proximity between two clusters as the distance between their centroids. For Ward's method, the proximity between two clusters is defined as the increase in the squared error that results when the two clusters are merged. Applying Ward's method to the sample data set of six points produces a clustering somewhat different from those produced by MIN, MAX, and the group average.

Q23. What should be the best choice of no. of clusters based on the following results?

Options:
A. 1
B. 2
C. 3
D. 4

Solution: (C)

The silhouette coefficient is a measure of how similar an object is to its own cluster compared to other clusters. The number of clusters for which the silhouette coefficient is highest represents the best choice of the number of clusters.

Q24. Which of the following is/are valid iterative strategies for treating missing values before clustering analysis?

Options:
A. Imputation with mean
B. Nearest Neighbor assignment
C. Imputation with Expectation Maximization algorithm
D. All of the above

Solution: (C)

All of the mentioned techniques are valid for treating missing values before clustering analysis, but only imputation with the EM algorithm is iterative in its functioning.

Q25. The K-Means algorithm has some limitations. For example, it makes hard assignments of points to clusters (a point either completely belongs to a cluster or does not belong at all).

Note: Soft assignment can be considered as the probability of being assigned to each cluster: say K = 3 and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1)
Which of the following algorithm(s) allows soft assignments?

  1. Gaussian mixture models
  2. Fuzzy K-means

Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of these

Solution: (C)

Both Gaussian mixture models and Fuzzy K-means allow soft assignments.

Q26. Assume you want to cluster 7 observations into 3 clusters using the K-Means clustering algorithm. After the first iteration, clusters C1, C2, C3 have following observations:

C1: {(2,2), (4,4), (6,6)}
C2: {(0,4), (4,0)}
C3: {(5,5), (9,9)}
What will be the cluster centroids if you want to proceed with the second iteration?

Options:
A. C1: (4,4), C2: (2,2), C3: (7,7)
B. C1: (6,6), C2: (4,4), C3: (9,9)
C. C1: (2,2), C2: (0,0), C3: (5,5)
D. None of these

Solution: (A)

Finding centroid for data points in cluster C1 = ((2+4+6)/3, (2+4+6)/3) = (4, 4)
Finding centroid for data points in cluster C2 = ((0+4)/2, (4+0)/2) = (2, 2)
Finding centroid for data points in cluster C3 = ((5+9)/2, (5+9)/2) = (7, 7)
Hence, C1: (4,4),  C2: (2,2), C3: (7,7)
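The same centroid computation can be checked in a few lines of Python:

```python
# Recompute the second-iteration centroids from the first-iteration
# clusters given in the question.

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

C1 = [(2, 2), (4, 4), (6, 6)]
C2 = [(0, 4), (4, 0)]
C3 = [(5, 5), (9, 9)]
centroids = [centroid(C1), centroid(C2), centroid(C3)]
# Matches option A: C1 -> (4, 4), C2 -> (2, 2), C3 -> (7, 7)
```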

Q27. Assume you want to cluster 7 observations into 3 clusters using the K-Means clustering algorithm. After first iteration, clusters C1, C2, C3 have following observations:

C1: {(2,2), (4,4), (6,6)}
C2: {(0,4), (4,0)}
C3: {(5,5), (9,9)}
What will be the Manhattan distance for observation (9, 9) from cluster centroid C1 in the second iteration?

Options:
A. 10
B. 5*sqrt(2)
C. 13*sqrt(2)
D. None of these

Solution: (A)

Manhattan distance between centroid C1, i.e., (4, 4) and (9, 9) = (9-4) + (9-4) = 10
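In code, the same Manhattan (L1) distance is:

```python
# Manhattan distance between centroid C1 = (4, 4) and the point (9, 9).

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

d = manhattan((4, 4), (9, 9))   # |9-4| + |9-4| = 10
```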

Q28. If two variables, V1 and V2, are used for clustering. Which of the following are true for K means clustering with k =3?

  1. If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight line
  2. If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight line

Options:
A. 1 only
B. 2 only
C. 1 and 2
D. None of the above

Solution: (A)

If the correlation between the variables V1 and V2 is 1, then all the data points will be in a straight line. Hence, all three cluster centroids will form a straight line as well.

Q29. Feature scaling is an important step before applying the K-Mean algorithm. What is the reason behind this?

Options:
A. In distance calculation, it will give the same weights for all features
B. You always get the same clusters whether or not you use feature scaling
C. In Manhattan distance, it is an important step, but in Euclidean distance, it is not
D. None of these

Solution: (A)

Feature scaling ensures that all the features get the same weight in the clustering analysis. Consider clustering people based on their weights (in kg), with a range of 55-110, and heights (in feet), with a range of 5.6 to 6.4. In this case, the clusters produced without scaling can be very misleading, as the range of weight is much higher than that of height. It is therefore necessary to bring the features to the same scale so that they have equal weightage in the clustering result.
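The effect can be demonstrated with min-max scaling on three illustrative people; before scaling, the squared weight difference dominates the squared Euclidean distance almost entirely:

```python
import math

# Three hypothetical people: (weight 55-110, height 5.6-6.4).
weights = [55.0, 110.0, 80.0]
heights = [5.6, 6.4, 6.0]

def minmax(column):
    """Rescale a feature column to the [0, 1] range."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

raw = list(zip(weights, heights))
scaled = list(zip(minmax(weights), minmax(heights)))

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Fraction of the squared distance between persons 0 and 1 that comes
# from the weight feature alone.
share_raw = abs(raw[0][0] - raw[1][0]) ** 2 / euclid(raw[0], raw[1]) ** 2
share_scaled = abs(scaled[0][0] - scaled[1][0]) ** 2 / euclid(scaled[0], scaled[1]) ** 2
```

Unscaled, weight accounts for over 99% of the squared distance; after min-max scaling, the two features contribute equally for this pair.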

Q30. Which of the following methods is used for finding the optimal number of clusters in the K-Means algorithm?

Options:
A. Elbow method
B. Manhattan method
C. Euclidean method
D. All of the above
E. None of these

Solution: (A)

Out of the given options, only the elbow method is used for finding the optimal number of clusters. The elbow method looks at the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters such that adding another cluster doesn't give a much better modeling of the data.

Q31. What is true about K-Mean Clustering?

  1. K-means is extremely sensitive to cluster center initializations
  2. Bad initialization can lead to Poor convergence speed
  3. Bad initialization can lead to bad overall clustering

Options:
A. 1 and 3
B. 1 and 2
C. 2 and 3
D. 1, 2 and 3

Solution: (D)

All three of the given statements are true. K-means is extremely sensitive to cluster center initialization; bad initialization can lead to poor convergence speed as well as bad overall clustering.

Q32. Which of the following can be applied to get good results for the K-means algorithm corresponding to global minima?

  1. Try to run the algorithm for different centroid initialization
  2. Adjust the number of iterations
  3. Find out the optimal number of clusters

Options:
A. 2 and 3
B. 1 and 3
C. 1 and 2
D. All of above

Solution: (D)

All of these are standard practices that are used in order to obtain good clustering results.

Q33. What should be the best choice for the number of clusters based on the following results?

Options:
A. 5
B. 6
C. 14
D. Greater than 14

Solution: (B)

Based on the above results, the best choice of the number of clusters using the elbow method is 6.

Q34. What should be the best choice for the number of clusters based on the following results?

Options:
A. 2
B. 4
C. 6
D. 8

Solution: (C)

Generally, a higher average silhouette coefficient indicates better clustering quality. In this plot, the average silhouette coefficient is highest at k = 2, which would suggest 2 clusters. However, the SSE of that clustering solution (k = 2) is too large. At k = 6, the SSE is much lower, and the average silhouette coefficient is also very high, just below that of k = 2. Thus, the best choice is k = 6.

Q35. Which of the following sequences is correct for a K-Means algorithm using the Forgy method of initialization?

  1. Specify the number of clusters
  2. Assign cluster centroids randomly
  3. Assign each data point to the nearest cluster centroid
  4. Re-assign each point to the nearest cluster centroid
  5. Re-compute cluster centroids

Options:
A. 1, 2, 3, 5, 4
B. 1, 3, 2, 4, 5
C. 2, 1, 3, 4, 5
D. None of these

Solution: (A)

The methods used for initialization in K-Means are Forgy and Random Partition. The Forgy method randomly chooses k observations from the data set and uses these as the initial means. The Random Partition method randomly assigns a cluster to each observation and then proceeds to the update step, computing the initial mean as the centroid of the cluster's randomly assigned points.
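Both initialization schemes can be sketched as follows; the 1-D toy data is an assumption for illustration:

```python
import random

def forgy_init(data, k, seed=0):
    """Forgy: pick k distinct observations as the initial means."""
    return random.Random(seed).sample(data, k)

def random_partition_init(data, k, seed=0):
    """Random Partition: assign every point to a random cluster, then
    use each non-empty cluster's centroid as its initial mean."""
    rng = random.Random(seed)
    clusters = {i: [] for i in range(k)}
    for x in data:
        clusters[rng.randrange(k)].append(x)
    return [sum(pts) / len(pts) for pts in clusters.values() if pts]

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
forgy_means = forgy_init(data, 2)              # actual data points
rp_means = random_partition_init(data, 2)      # centroids of random splits
```

Note the qualitative difference: Forgy means are spread like the data itself, while Random Partition means tend to land near the overall mean of the data.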

Q36. If you are using Multinomial mixture models with the expectation-maximization algorithm for clustering a set of data points into two clusters, which of the assumptions are important?

Options:
A. All the data points follow two Gaussian distribution
B. All the data points follow n Gaussian distribution (n >2)
C. All the data points follow two multinomial distribution
D. All the data points follow n multinomial distribution (n >2)

Solution: (C)

In the EM algorithm for clustering, it's essential that the chosen no. of clusters matches the no. of different distributions the data points are expected to be generated from, and the distributions must all be of the same type (here, multinomial).

Q37. Which of the following is/are not true about Centroid based K-Means clustering algorithm and Distribution based expectation-maximization clustering algorithm?

  1. Both start with random initializations
  2. Both are iterative algorithms
  3. Both have strong assumptions that the data points must fulfill
  4. Both are sensitive to outliers
  5. The expectation-maximization algorithm is a special case of K-Means
  6. Both require prior knowledge of the no. of desired clusters
  7. The results produced by both are non-reproducible

Options:
A. 1 only
B. 5 only
C. 1 and 3
D. 6 and 7
E. 4, 6 and 7
F. None of the above

Solution: (B)

All of the above statements are true except the 5th: it is instead K-Means that is a special case of the EM algorithm, in which only the centroids of the cluster distributions are calculated at each iteration.

Q38. Which of the following is/are not true about the DBSCAN clustering algorithm?

  1. For data points to be in a cluster, they must be in a distance threshold to a core point
  2. It has strong assumptions for the distribution of data points in the dataspace
  3. It has substantially high time complexity of order O(n³)
  4. It does not require prior knowledge of the no. of desired clusters
  5. It is robust to outliers

Options:
A. 1 only
B. 2 only
C. 4 only
D. 2 and 3
E. 1 and 5
F. 1, 3 and 5

Solution: (D)

DBSCAN can form clusters of any arbitrary shape and does not make strong assumptions about the distribution of data points in the data space. Its time complexity is also far below O(n³): with a spatial index, the average running time is of order O(n log n) (and O(n²) in the worst case).
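Both correct properties (no preset number of clusters, outliers left as noise) show up in a compact sketch of DBSCAN. This version uses a quadratic all-pairs neighbor scan for clarity; a spatial index such as a k-d tree is what gives the O(n log n) average behavior mentioned above. All names are mine:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch (illustrative). Points never reached from a
    core point keep the label -1, i.e. they are treated as noise."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(row <= eps) for row in dist]
    labels = np.full(len(X), -1)
    cluster = 0
    for i in range(len(X)):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                  # already visited, or not a core point
        labels[i] = cluster           # seed a new cluster at core point i
        stack = list(neighbors[i])
        while stack:                  # expand by density-reachability
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:   # j is also a core point
                    stack.extend(neighbors[j])
        cluster += 1
    return labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.15, size=(30, 2)),
               rng.normal(5, 0.15, size=(30, 2)),
               [[50.0, 50.0]]])
print(dbscan(X, eps=1.0, min_pts=4)[-1])  # -1: the lone far point is noise
```

Note that the number of clusters is never passed in; it emerges from the density structure of the data.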

Q39. Which of the following gives the lower and upper bounds of the F-score?

Options:
A. [0,1]
B. (0,1)
C. [-1,1]
D. None of the above

Solution: (A)

The lowest and highest possible values of the F-score are 0 and 1. A score of 1 means that every data point has been assigned to the correct cluster, while a score of 0 means that the precision and/or recall of the clustering analysis is 0. In clustering analysis, a high F-score is desired.
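The bounds follow directly from the harmonic-mean formula; a tiny illustrative helper makes them explicit (the zero-denominator convention F1 = 0 is assumed here):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(1.0, 1.0))  # 1.0  (upper bound: perfect clustering)
print(f1_score(0.0, 0.0))  # 0.0  (lower bound)
```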

Q40. Following are the results observed for clustering data points into 3 clusters: A, B, and C:

[The results table from the original article is not reproduced here.]
What is the F1-Score with respect to cluster B?

Options:
A. 3
B. 4
C. 5
D. 6

Solution: (C)

Here,
False Negative, FN = 400 + 400 = 800
Therefore,
Precision = TP / (TP + FP) = 0.5
Recall = TP / (TP + FN) = 0.6
(From Recall = 0.6 with FN = 800 it follows that TP = 1200, and from Precision = 0.5 that FP = 1200; TN is not needed to compute the F1-score.)
Hence,
F1 = 2 × (Precision × Recall) / (Precision + Recall) = 0.545 ≈ 0.5
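The arithmetic is quick to verify with the precision and recall given in the solution:

```python
precision, recall = 0.5, 0.6
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.545
```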

Conclusion

You have successfully completed our skill test focused on conceptual and practical knowledge of clustering fundamentals and its various techniques. I hope taking this test and finding the solutions has helped you gain knowledge and boost your confidence in the topic.

If you are preparing for a data science job interview, I suggest you also check out our guides to important interview questions on logistic regression, SQL, TensorFlow, k-nearest neighbors, and Naive Bayes.

Sauravkaushik8 Kaushik

18 May,

Saurav is a Data Science enthusiast, currently in the final year of his graduation at MAIT, New Delhi. He loves to use machine learning and analytics to solve complex data problems.
