Best Data Science Interview Questions
Below I am sharing top data science interview questions, and this time I am not providing the answers. Now it is your turn: try to answer them, then share your answers in the comments. Trust me, this is the best practice for any interview preparation. So, here are the questions:
Q.1 Tell us about your favorite machine learning algorithm. Why do you like it?
Q.2 As a data scientist, how will you collect data? What will be your data acquisition and retention strategy?
Q.3 What uncommon skills can you add to a data science team?
Q.4 How do you upgrade your analytical skills? Tell us your practices.
Q.5 If I give you a data set, what will you do with it to determine whether it suits your business needs?
Q.6 Tell us how to effectively represent data using 5 dimensions.
Q.7 What do you know about an exact test?
Q.8 What makes a good data scientist?
Q.9 Which tools will help you succeed in your role as a data scientist?
Q.10 How would you resolve a dispute with a colleague?
Q.11 Have you ever changed someone's opinion at work?
Q.12 In your view, what makes data science so popular?
These were some of the most frequently asked data science interview questions. I hope you will frame the answers on your own and post them in the comments. Let's check how much you know about Data Science, Machine Learning, and R.
What Is The Difference Between Machine Learning And Deep Learning
Machine and deep learning are important subsets of artificial intelligence. They both involve the study of computer algorithms. Data scientists should understand and know how to use both.
Example: "Machine learning uses algorithms that allow computers to learn without being explicitly programmed. The three types of machine learning are supervised, unsupervised, and reinforcement learning. Deep learning is a type of machine learning that uses algorithms inspired by the neural networks of the brain. Machine learning makes decisions based on what it has learned through its algorithms, while deep learning layers algorithms so that the model can make decisions on its own."
What's The Difference Between Gaussian Mixture Model And K-Means
Let's say we are aiming to break the data points into three clusters. K-means starts with the assumption that a given data point belongs to exactly one cluster.
Choose a data point. At a given point in the algorithm, we are certain that it belongs to the red cluster. In the next iteration, we might revise that belief and be certain that it belongs to the green cluster. But in each iteration, we are absolutely certain about which cluster the point belongs to. This is a "hard assignment".
What if we are uncertain? What if we think, well, I can't be sure, but there is a 70% chance it belongs to the red cluster, a 10% chance it's in the green one, and a 20% chance it might be blue? That's a soft assignment. The mixture of Gaussians model helps us express this uncertainty. It starts with some prior belief about how certain we are about each point's cluster assignment. As it goes on, it revises those beliefs, but it always incorporates the degree of uncertainty we have about the assignment.
K-means: find the cluster k that minimizes ||x − μ_k||^2
Gaussian Mixture: find the cluster k that minimizes ||x − μ_k||^2 / σ_k^2
The difference is the denominator σ_k^2, which means GM takes variance into consideration when it calculates the measurement. K-means only calculates the conventional Euclidean distance. In other words, K-means calculates distance, while GM calculates a weighted distance.
K-means also:
- Hard assigns each data point to one particular cluster on convergence.
- Makes use of the L2 norm when optimizing its objective.
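The hard-versus-soft contrast can be sketched numerically. The centers, variances, and test point below are my own illustrative values, not from the original post:

```python
import numpy as np

# Two fixed 1-D cluster centers with equal weight and unit variance
centers = np.array([0.0, 4.0])
sigmas = np.array([1.0, 1.0])
x = 1.0  # the point we want to assign

# K-means style hard assignment: the nearest center wins outright
hard = int(np.argmin((x - centers) ** 2))

# Gaussian-mixture style soft assignment: responsibilities computed
# from Gaussian densities, normalized to sum to 1
dens = np.exp(-0.5 * ((x - centers) / sigmas) ** 2) / sigmas
soft = dens / dens.sum()
```

For x = 1, K-means commits fully to the first cluster, while the mixture model assigns it about 98% responsibility and keeps a small residual belief in the second.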
What Is The Sliding Window Method For Time Series Forecasting
The sliding window method is also called the lag method: previous time steps are used as inputs, and the next time step is the output. How many previous steps are used depends on the window width. The sliding window method is widely used for univariate forecasting because it converts a time series dataset into a supervised learning problem.
For example, given a sequence and a window width of three, each input consists of three consecutive values and the output is the value that follows them.
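A minimal sketch of this transformation; the example sequence [1, 2, 3, 4, 5] and the function name are my own, since the post's example sequence is missing:

```python
def sliding_window(series, width):
    """Turn a univariate series into (input window, next value) pairs."""
    pairs = []
    for i in range(len(series) - width):
        pairs.append((series[i:i + width], series[i + width]))
    return pairs

# With width 3, each input is the three previous steps, the output the next one
pairs = sliding_window([1, 2, 3, 4, 5], width=3)
# -> [([1, 2, 3], 4), ([2, 3, 4], 5)]
```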
Problem #: Second Highest Salary
Write a SQL query to get the second highest salary from the Employee table. For example, given the Employee table, the query should return 200 as the second highest salary. If there is no second highest salary, the query should return null.
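One common solution uses a subquery. The sketch below runs it against a throwaway in-memory SQLite table; the rows are illustrative, since the original Employee data is not shown in the post:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Id INTEGER, Salary INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)",
                 [(1, 100), (2, 200), (3, 300)])

# MAX over salaries strictly below the overall MAX gives the second highest;
# the aggregate returns NULL when no such salary exists.
query = """
SELECT MAX(Salary) AS SecondHighestSalary
FROM Employee
WHERE Salary < (SELECT MAX(Salary) FROM Employee)
"""
second = conn.execute(query).fetchone()[0]
# -> 200
```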
Question #: Which Data Modeling Techniques Do You Prefer And Why
How to answer: Turning data into understandable and actionable information is a critical part of the data scientist’s job. This question allows employers to understand your data modeling skills and background. List and discuss your preferred data modeling techniques, including benefits such as ease of use, flexibility, etc.
Why Do You Use Feature Selection
Feature selection is the process of selecting a subset of relevant features for use in model construction. It mostly acts as a filter, muting features that aren't useful alongside your existing features. Feature selection methods help you create an accurate predictive model by choosing features that give you as good or better accuracy while requiring less data. They can identify and remove unneeded, irrelevant, and redundant attributes that do not contribute to the accuracy of the model, or may in fact decrease it. Fewer attributes are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.
Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by score and either kept or removed from the dataset. These methods are often univariate and consider each feature independently, or with regard to the dependent variable. Examples of filter methods include the chi-squared test, information gain, and correlation coefficient scores.
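As a sketch of one filter method, the snippet below scores features by absolute Pearson correlation with the target and keeps the top two. The data and variable names are my own synthetic example, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.normal(size=n)
X = np.column_stack([
    y + 0.1 * rng.normal(size=n),   # strongly related to the target
    rng.normal(size=n),             # pure noise
    -y + 0.5 * rng.normal(size=n),  # related, with more noise
])

# Univariate scoring: |correlation| of each feature with the target
scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                   for j in range(X.shape[1])])
top2 = np.argsort(scores)[::-1][:2]  # indices of the two best features
```

The noise column gets the lowest score and would be filtered out before model construction.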
Why Is R Used In Data Visualization
R is widely used in Data Visualizations for the following reasons-
- We can create almost any type of graph using R.
- R has multiple visualization libraries, such as lattice, ggplot2, and leaflet, as well as many inbuilt plotting functions.
- It is easier to customize graphics in R compared to Python.
- R is used in feature engineering and in exploratory data analysis as well.
Additional Data Scientist Technical Interview Questions
- Have you worked on a data science project that required a substantial programming component? What did you take away from the experience?
- Describe how to effectively represent data with five dimensions.
- You need to generate a predictive model using multiple regression. What's your process for validating this model?
- How do you ensure that the changes you're making to an algorithm are an improvement?
- Please describe your method for handling an imbalanced data set that's being used for prediction.
- What's your approach to validating a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?
- You have two different models of comparable computational performance and accuracy. Please explain how you decide which to choose for production and why.
- You are given a data set in which a substantial portion of the values is missing. What's your approach?
How To Prepare For The Data Scientist Interview
Regardless of the company and business field, you can't possibly answer data scientist interview questions without knowledge and technical skills such as:
- Relational databases and SQL
- Machine Learning
- Deep Learning frameworks
- NLP algorithms.
Speaking of preparation, if you're ready to start your career in data science but need to improve your skillset, you can register for the complete Data Science Program today. Start with the fundamentals in our Statistics, Maths, and Excel courses, and build up step-by-step experience with SQL, Python, R, Power BI, Tableau, and more.
Q64 Explain Svm Algorithm In Detail
SVM stands for support vector machine. It is a supervised machine learning algorithm that can be used for both regression and classification. If you have n features in your training data set, SVM plots each example in n-dimensional space, with the value of each feature being the value of a particular coordinate. SVM uses hyperplanes to separate the different classes, based on the provided kernel function.
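The post gives no code, so here is a minimal from-scratch sketch of a linear SVM trained with sub-gradient descent on the hinge loss. The toy data and hyperparameters are my own; a real project would use a library such as scikit-learn:

```python
import numpy as np

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=200):
    """Hinge-loss sub-gradient descent; y must contain labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in range(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # Point violates the margin: push the hyperplane toward it
                w += lr * (y[i] * X[i] - 2 * lam * w)
                b += lr * y[i]
            else:
                # Only apply L2 regularization shrinkage
                w += lr * (-2 * lam * w)
    return w, b

# Linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
preds = np.sign(X @ w + b)
```

The hyperplane found here is linear; in practice a kernel function lets SVM draw nonlinear boundaries in the original feature space.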
How Do You Deal With Outliers In Your Data
For the most part, if your data is affected by these extreme cases, you can bound the input to a historical representative of your data that excludes outliers. That could be a number of items, or a lower or upper bound on your order value.
If the outliers come from a relatively unique data set, analyze them for your specific situation. Analyze the data both with and without them, and perhaps with a replacement alternative if you have a reason for one, and report the results of this assessment. Another option is to try a transformation: square root and log transformations both pull in high values, which can make model assumptions work better when the outlier is in a dependent variable.
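Both tactics, bounding and transforming, can be sketched on a synthetic order-value column (the numbers below are my own illustration):

```python
import numpy as np

orders = np.array([12.0, 15.0, 14.0, 13.0, 16.0, 900.0])  # 900 is an outlier

# Bounding (winsorizing): clip values to the 1st/99th percentiles
lo, hi = np.percentile(orders, [1, 99])
bounded = np.clip(orders, lo, hi)

# Log transform: compresses high values instead of removing them
logged = np.log1p(orders)
```

Bounding keeps every row but caps the outlier's influence; the log transform keeps the ordering of values while shrinking the gap between typical and extreme ones.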
How Can We Handle Missing Data
To be able to handle missing data, we first need to know the percentage of data missing in a particular column so that we can choose an appropriate strategy to handle the situation.
For example, if in a column the majority of the data is missing, then dropping the column is the best option, unless we have some means to make educated guesses about the missing values. However, if the amount of missing data is low, then we have several strategies to fill them up.
One way would be to fill them all up with a default value or a value that has the highest frequency in that column, such as 0 or 1, etc. This may be useful if the majority of the data in that column contains these values.
Another way is to fill up the missing values in the column with the mean of all the values in that column. This technique is usually preferred as the missing values have a higher chance of being closer to the mean than to the mode.
Finally, if we have a huge dataset and only a few rows have missing values in some columns, the easiest and fastest way is to drop those rows. Since the dataset is large, dropping a few rows should not be a problem.
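The mean-imputation strategy described above can be sketched in a few lines; the column values and helper name are my own toy example:

```python
import numpy as np

def impute_mean(col):
    """Replace NaNs in a numeric column with the column mean."""
    col = np.asarray(col, dtype=float)
    filled = col.copy()
    filled[np.isnan(filled)] = np.nanmean(col)  # mean ignores the NaNs
    return filled

filled = impute_mean([1.0, np.nan, 3.0, np.nan, 5.0])
# -> [1.0, 3.0, 3.0, 3.0, 5.0]
```

Filling with the most frequent value instead would amount to swapping `np.nanmean` for a mode computation over the non-missing entries.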
Out Of Collaborative Filtering And Content-Based Filtering, Which Is Better
Content-based filtering is considered to be better than collaborative filtering for generating recommendations. This does not mean that collaborative filtering generates bad recommendations.
However, since collaborative filtering is based on the likes and dislikes of other users, we cannot rely on it much. Also, users' likes and dislikes may change in the future.
For example, there may be a movie that a user likes right now but did not like 10 years ago. Moreover, users who are similar in some features may not have the same taste in the kind of content that the platform provides.
In content-based filtering, we make use of a user's own likes and dislikes, which are much more reliable and yield more positive results. This is why platforms such as Netflix, Amazon Prime, and Spotify make use of content-based filtering when generating recommendations for their users.
Q: Walk Me Through The Probability Fundamentals
Eight rules of probability
- Rule #1: For any event A, 0 ≤ P(A) ≤ 1; in other words, the probability of an event can range from 0 to 1.
- Rule #2: The sum of the probabilities of all possible outcomes always equals 1.
- Rule #3: P(not A) = 1 − P(A). This rule explains the relationship between the probability of an event and its complement. A complement event is one that includes all possible outcomes that aren't in A.
- Rule #4: If A and B are disjoint events (they cannot both occur), then P(A or B) = P(A) + P(B); this is called the addition rule for disjoint events.
- Rule #5: P(A or B) = P(A) + P(B) − P(A and B); this is called the general addition rule.
- Rule #6: If A and B are two independent events, then P(A and B) = P(A) × P(B); this is called the multiplication rule for independent events.
- Rule #7: The conditional probability of event B given event A is P(B | A) = P(A and B) / P(A).
- Rule #8: For any two events A and B, P(A and B) = P(A) × P(B | A); this is called the general multiplication rule.
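The addition, conditional, and multiplication rules can be verified on a concrete sample space. The fair-die example below is my own:

```python
from fractions import Fraction as F

# Fair six-sided die: A = "even roll", B = "roll greater than 3"
outcomes = range(1, 7)
A = {o for o in outcomes if o % 2 == 0}  # {2, 4, 6}
B = {o for o in outcomes if o > 3}       # {4, 5, 6}
p = lambda event: F(len(event), 6)       # equally likely outcomes

addition = p(A) + p(B) - p(A & B)        # general addition rule (Rule 5)
union = p(A | B)                         # direct P(A or B) for comparison

cond_b_given_a = p(A & B) / p(A)         # conditional probability (Rule 7)
general_mult = p(A) * cond_b_given_a     # general multiplication rule (Rule 8)
```

Both sides of each rule agree: the addition rule reproduces P(A or B) = 2/3, and the general multiplication rule reproduces P(A and B) = 1/3.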
Factorial Formula: n! = n × (n − 1) × (n − 2) × … × 2 × 1. Use this when the number of items equals the number of places available. E.g., find the total number of ways 5 people can sit in 5 empty seats: 5 × 4 × 3 × 2 × 1 = 120.
Fundamental Counting Principle: use this method when repetitions are allowed and the number of ways to fill an open place is not affected by previous fills. E.g., there are 3 types of breakfasts, 4 types of lunches, and 5 types of desserts, so the total number of combinations is 3 × 4 × 5 = 60.
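The two worked examples above can be checked directly in code:

```python
import math

# Factorial formula: 5 people in 5 seats
seatings = math.factorial(5)  # 5 x 4 x 3 x 2 x 1 -> 120

# Fundamental counting principle: independent choices multiply
meals = 3 * 4 * 5  # breakfasts x lunches x desserts -> 60
```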
Q: Give Me 3 Types Of Statistical Biases And Explain Each Of Them With An Example
- Sampling bias refers to a biased sample caused by non-random sampling. For example, imagine there are 10 people in a room and you ask whether they prefer grapes or bananas. If you surveyed only the three women and concluded that the majority of people like grapes, you'd have demonstrated sampling bias.
- Confirmation bias: the tendency to favour information that confirms one's beliefs.
- Survivorship bias: the phenomenon where only those that survived a long process are included in an analysis, thus creating a biased sample.
Since You Have Experience In The Deep Learning Field Can You Tell Us Why Tensorflow Is The Most Preferred Library In Deep Learning
TensorFlow is a very famous deep learning library, and the reason is pretty simple: it provides both C++ and Python APIs, which makes it much easier to work with. TensorFlow also has a fast compilation speed compared to Keras and Torch. Apart from that, TensorFlow supports both GPU and CPU computing. Hence, it is a major success and a very popular library for deep learning.
What Will You Do, And Have You Experienced Such An Issue Before
For this type of question, we first need to ask which ML model has to be trained. The approach then depends on whether we are training a model based on neural networks or SVM.
The steps for Neural Networks are given below:
- Use a NumPy memory-mapped array to access the data. It never stores the entire dataset in RAM; it just creates a mapping to the data on disk.
- To get the desired data, pass an index (or slice) into the NumPy array.
- Pass this data as input to the neural network while maintaining a small batch size.
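The memory-mapping idea in the steps above can be sketched as follows; the file path and array shape are illustrative, not from the post:

```python
import os
import tempfile
import numpy as np

# Write a sample array to disk to stand in for a large dataset
path = os.path.join(tempfile.mkdtemp(), "train.npy")
np.save(path, np.arange(10_000, dtype=np.float32).reshape(-1, 10))

# mmap_mode="r" creates a read-only mapping, not an in-RAM copy
mapped = np.load(path, mmap_mode="r")

# Slicing reads only the requested rows from disk: a small batch
batch = np.asarray(mapped[:256])
```

Only the sliced rows are actually read, so batches can be fed to a network even when the full file would not fit in memory.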
The steps for SVM are given below:
- For SVM, obtain small data sets by dividing the big data set.
- Pass one subset of the data set as input when using the partial-fit function.
- Repeat the partial-fit step for the other subsets as well.
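The subset loop above can be sketched as follows. Here `train_on_subset` is a hypothetical placeholder for a model's incremental-fit step (e.g. scikit-learn's `partial_fit`), and the sizes are my own:

```python
def chunk_indices(n_rows, chunk_size):
    """Yield (start, end) index pairs covering all rows in order."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

seen = []
def train_on_subset(lo, hi):
    # Placeholder: a real implementation would call model.partial_fit
    # on rows lo..hi of the dataset here.
    seen.append((lo, hi))

for lo, hi in chunk_indices(1000, 300):
    train_on_subset(lo, hi)
# -> subsets (0, 300), (300, 600), (600, 900), (900, 1000)
```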
Finally, describe the situation if you have faced such an issue in your own machine learning or data science projects.
Describe An Experience In Which You Had Moving Deadlines For Several Projects But Felt That One Key Project Element Needed Extra Attention How Did You Balance Your Daily Responsibilities With This New Element
Interviewers want to know how you adhere to deadlines and seek approval for prioritization of your tasks. First, you might talk about how you worked with team members and your manager to prioritize tasks. Second, you'd want to discuss how you balance competing priorities.
Your answer might look something like: I create a schedule for the week of all the tasks I have to do, based on highest to lowest priority. I also schedule a list for each day, while providing space for ad hoc projects or to work ahead.
One time, I had to deliver code for an assignment by the end of the week, but I felt that the code needed an additional five hours to be fully optimized. Each day, I left one hour free to perform code optimization, which helped to improve our final output.
Q: Explain What A Long-Tailed Distribution Is
A long-tailed distribution is a type of heavy-tailed distribution with a tail that drops off gradually and asymptotically.
Three practical examples include the power law, the Pareto principle (the 80/20 rule), and product sales.
It's important to be mindful of long-tailed distributions in classification and regression problems because the least frequently occurring values make up the majority of the population. This can change how you deal with outliers, and it conflicts with machine learning techniques that assume the data is normally distributed.
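One way to see the long tail concretely is to sample from a power-law-style distribution and compare the mean with the median; the shape parameter and seed below are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
# NumPy's pareto draws from a Pareto II (Lomax) distribution with shape a
sample = rng.pareto(a=3.0, size=100_000)

mean = sample.mean()
median = float(np.median(sample))
# In a long right tail, rare large values drag the mean well above the median
```

For normally distributed data the mean and median would roughly coincide, which is why normality-assuming techniques can mislead on long-tailed data.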