Data Science Coding Interview Questions

What Is A Confusion Matrix

A confusion matrix is used to evaluate the performance of a classification algorithm. It is useful because classification accuracy alone can be misleading when there are more than two classes, or when the classes do not have an equal number of observations.

The process for creating a confusion matrix is as follows:

Create a validation dataset for which you have certain expected values as outcomes.

Predict the result for each row that is present in the dataset.

Now count the number of correct and incorrect predictions for each class.

Organize that data into a matrix so that each row represents a predicted class and each column an actual class.

Fill the counts obtained from the third step into the table.

The matrix that results from this process is known as a confusion matrix.
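The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class labels and the sample predictions are made up, the function name `confusion_matrix` is chosen here for illustration, and rows are taken to be predicted classes and columns actual classes, as described above (some libraries use the opposite orientation).

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are predicted classes, columns are actual classes."""
    # Count how often each (predicted, actual) pair occurs.
    counts = Counter(zip(predicted, actual))
    return [[counts[(p, a)] for a in labels] for p in labels]

# Hypothetical validation data: expected outcomes and model predictions.
actual = ["cat", "cat", "dog", "dog", "dog", "bird"]
predicted = ["cat", "dog", "dog", "dog", "cat", "bird"]
labels = ["bird", "cat", "dog"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correct predictions for each class; every other cell counts one specific kind of mistake.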

Python Coding Interview Question #: Apartments In New York City And Harlem

Try and solve the question by Airbnb:

Find the search details of 50 apartment searches in the Harlem neighborhood of New York City.

Here are some hints. You need to set three conditions so that you get only the apartment category, only listings in Harlem, and only those where the city is NYC. All three conditions can be set using the == operator. You don't need to show all apartments, so use the head function to limit the number of rows in the output.
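A minimal pandas sketch of these hints follows. The DataFrame contents and the column names (`property_type`, `neighbourhood`, `city`) are assumptions made for illustration; the real dataset may name them differently.

```python
import pandas as pd

# Hypothetical sample data; column names and values are assumptions.
searches = pd.DataFrame({
    "property_type": ["Apartment", "House", "Apartment", "Apartment"],
    "neighbourhood": ["Harlem", "Harlem", "Chelsea", "Harlem"],
    "city": ["NYC", "NYC", "NYC", "NYC"],
    "price": [120, 300, 210, 95],
})

# Three conditions, each built with the == operator and combined with &.
mask = (
    (searches["property_type"] == "Apartment")
    & (searches["neighbourhood"] == "Harlem")
    & (searches["city"] == "NYC")
)

# head(50) keeps at most the first 50 matching rows.
result = searches[mask].head(50)
print(result)
```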

In Your Choice Of Language Write A Program That Prints The Numbers Ranging From One To 50

But for multiples of three, print “Fizz” instead of the number, and for the multiples of five, print “Buzz.” For numbers which are multiples of both three and five, print “FizzBuzz”

The code is shown below:
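The following is a minimal Python sketch of one possible solution, not necessarily the original listing. The helper name `fizzbuzz` is an illustrative choice.

```python
def fizzbuzz(n):
    """Return the FizzBuzz word for n, or n itself as a string."""
    if n % 15 == 0:  # multiple of both three and five
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

# range(1, 51) covers one to 50 inclusive.
lines = [fizzbuzz(i) for i in range(1, 51)]
print("\n".join(lines))
```

Checking the multiple of 15 first matters: if the test for three came first, 15 would print "Fizz" and never reach the combined case.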

Note that a range with an upper bound of 51 on its own runs from zero to 50. The question asks for one to 50, so the range should start at one rather than zero.

What Are The Differences Between Correlation And Covariance

Although these two terms are used for establishing a relationship and dependency between any two random variables, the following are the differences between them:

Correlation: This technique measures the quantitative relationship between two variables, expressed in terms of how strongly the variables are related.
Covariance: It represents the extent to which two variables change together. It describes the systematic relationship between a pair of variables, where changes in one are accompanied by changes in the other.

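Mathematically, consider two random variables X and Y with means $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$. Then the covariance is $\operatorname{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$ and the correlation is $\operatorname{corr}(X, Y) = \operatorname{cov}(X, Y) / (\sigma_X \sigma_Y)$.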
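To make the difference concrete, here is a small sketch that computes both quantities from scratch. It uses the population form (dividing by n) and made-up sample values; the function names are illustrative.

```python
from math import sqrt

def covariance(xs, ys):
    """Population covariance: mean of the products of deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    # covariance(v, v) is the variance of v.
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

cov_xy = covariance(xs, ys)
corr_xy = correlation(xs, ys)
print(cov_xy, corr_xy)
```

Covariance depends on the units of the variables, while correlation is scaled to lie between -1 and 1, which is why the two values differ even for perfectly linear data.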

Technical Data Science Interview Questions

24. Explain overfitting and underfitting.

In order to make reliable predictions on unseen data in machine learning and statistics, a model must first be fitted to a set of training data. Overfitting and underfitting are two of the most common modeling errors that occur while doing so.

A statistical model suffering from overfitting describes random error or noise instead of the underlying relationship. Overfitting can result when a statistical model or machine learning algorithm is excessively complex, for example when it has too many parameters relative to the total number of observations.

When underfitting occurs, a statistical model or machine learning algorithm fails to capture the underlying trend of the data. Underfitting occurs, for example, when trying to fit a linear model to non-linear data.

Although both overfitting and underfitting yield poor predictive performance, the way in which each one of them does so is different. While the overfitted model overreacts to minor fluctuations in the training data, the underfit model under-reacts to even bigger fluctuations.

25. What is batch normalization?

Batch normalization is a technique for improving the performance and stability of a neural network. It works by normalizing the inputs to each layer so that the mean output activation is 0 and the standard deviation is 1.
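The normalization step can be sketched as follows for a single feature across one batch. This is a simplified illustration: the function name and sample activations are made up, and the learnable scale and shift parameters that real batch normalization layers apply afterwards are omitted.

```python
from math import sqrt

def batch_normalize(values, eps=1e-5):
    """Shift a batch to zero mean and scale it to (almost) unit variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    # eps guards against division by zero when the variance is tiny.
    return [(v - mean) / sqrt(variance + eps) for v in values]

activations = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations for one feature
normalized = batch_normalize(activations)
print(normalized)
```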

26. What do you mean by cluster sampling and systematic sampling?

How To Avoid Overfitting Your Model

An overfitted model fits only the small amount of data it was trained on and fails to generalize to the wider data. There are three ways to avoid overfitting:

Keep the model simple: take fewer variables into account and remove noise from the training data.
Use k-fold cross-validation.
Penalize models prone to overfitting with regularization techniques such as LASSO.
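As an illustration of the cross-validation point, the sketch below splits sample indices into k folds so that each fold serves once as the validation set. The function name `k_fold_indices` is a hypothetical helper, and shuffling is left out to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

A model that scores well on its training folds but poorly on the held-out folds is likely overfitting.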

Python Coding Interview Question #1: Positions Of Letter ‘a’

This question by Amazon asks you to:

Find the position of the letter ‘a’ in the first name of the worker ‘Amitah’. Use 1-based indexing, e.g. position of the second letter is 2.

There are two main concepts in the solution. The first is filtering the worker Amitah using the == operator. The second is using the find function on a string to get the position of the letter a. Because find returns a zero-based index, add one to the result to get the 1-based position.
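A plain-Python sketch of both ideas is shown below. The `workers` list and its `first_name` key are assumptions standing in for the real table.

```python
# Hypothetical worker records; the real data would come from a table.
workers = [
    {"first_name": "Monika"},
    {"first_name": "Amitah"},
    {"first_name": "Vivek"},
]

# Filter with ==, then locate the lowercase letter with str.find.
matches = [w for w in workers if w["first_name"] == "Amitah"]

# str.find is case-sensitive and zero-based, so the capital 'A' is skipped
# and one is added to convert the index to a 1-based position.
position = matches[0]["first_name"].find("a") + 1
print(position)  # 5
```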

As a data scientist, you'll be working with dates a lot. Depending on the data available, you could be asked to convert data to datetime, extract a certain period of time, or manipulate datetime in any other way that's suitable.

Gear Up For Your Next Data Science Interview

If you need help with your prep, join Interview Kickstart's Data Science Interview Course, the first-of-its-kind, domain-specific tech interview prep program designed and taught by FAANG+ instructors.

IK is the gold standard in tech interview prep. Our programs include a comprehensive curriculum, unmatched teaching methods, FAANG+ instructors, and career coaching to help you nail your next tech interview.

Q: What Are The Differences Between Lists And Tuples

Ans: Lists and tuples can both hold values of any data type, but there are some differences between them.

The basic difference between lists and tuples is that lists are mutable whereas tuples are immutable.
Lists are generally slower than tuples.
Lists are written with square brackets while tuples are enclosed in parentheses.
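The mutability difference is easy to demonstrate. In this short sketch the variable names are illustrative:

```python
numbers_list = [1, 2, 3]   # square brackets: a list
numbers_tuple = (1, 2, 3)  # parentheses: a tuple

# Lists are mutable, so item assignment works.
numbers_list[0] = 10

# Tuples are immutable, so the same assignment raises TypeError.
try:
    numbers_tuple[0] = 10
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False

print(numbers_list, numbers_tuple, tuple_was_modified)
```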

What Is Root Cause Analysis

Root cause analysis was initially developed to analyze industrial accidents but is now widely used in other areas. It is a problem-solving technique used for isolating the root causes of faults or problems. A factor is called a root cause if its removal from the problem-fault-sequence prevents the final undesirable event from recurring.

Conduct A Practice Interview

Consider asking a friend or another trusted peer to help you conduct a practice interview. Begin by providing them with a list of potential questions. Then proceed as though you were actually sitting for an interview. Afterward, ask your partner for feedback about your performance.

Research Common Interview Questions

When preparing for your data scientist interview, look up common interview questions in addition to technical questions. Research questions related to your strengths, weaknesses, behaviors and habits. Thinking in advance about how you might address such topics can help you devise stronger responses during the interview.

What Is Generative Adversarial Network

This approach can be understood with the famous example of the wine seller. Let us say that there is a wine seller who has his own shop. This wine seller purchases wine from dealers who sell it to him at a low cost so that he can sell it at a high cost to his customers. Now, let us say that the dealers he is purchasing from are selling him fake wine. They do this because the fake wine costs far less than the original wine, and the fake and the real wine are indistinguishable to a normal consumer. The shop owner has some friends who are wine experts, and he sends his wine to them every time before putting the stock on sale in his shop. His friends, the wine experts, give him feedback that the wine is probably fake. Since the wine seller has been purchasing wine from the same dealers for a long time, he wants to make sure that their feedback is right before he complains to the dealers about it. Now, let us say that the dealers have also got a tip from somewhere that the wine seller is suspicious of them.

So, in this situation, the dealers will try their best to sell the fake wine, whereas the wine seller will try his best to identify the fake wine.

In GAN terms, a noise vector enters the generator, which produces the fake wine, and the discriminator has to distinguish between the fake wine and the real wine. This is a Generative Adversarial Network.

What Is The Difference Between Data Analytics And Data Science

Data science involves transforming data using various technical analysis methods to extract meaningful insights, which a data analyst can then apply to business scenarios.
Data analytics deals with checking existing hypotheses and information, and answers questions to support better and more effective business decision-making.
Data science drives innovation by answering questions that build connections and solutions for future problems. Data analytics focuses on extracting present meaning from existing historical context, whereas data science focuses on predictive modeling.
Data science can be considered a broad subject that uses various mathematical and scientific tools and algorithms to solve complex problems, whereas data analytics can be considered a specific field dealing with specific, concentrated problems using fewer tools from statistics and visualization.

= 0.8 * 0.8 * 0.8 * 0.8 = 0.4096, or about 0.41

So, the probability that we will see at least one shooting star in the time interval of an hour is 1 - 0.41 = 0.59

So, there is approximately a 60% chance that we will see a shooting star in the time span of an hour.
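The arithmetic can be checked with a few lines of Python. The reading assumed here, inferred from the calculation itself, is that 0.8 is the probability of seeing no shooting star in one of four equal intervals that make up the hour.

```python
# Assumption: 0.8 is the chance of seeing no star in one interval,
# and the hour consists of four such independent intervals.
p_no_star_per_interval = 0.8
intervals = 4

p_none = p_no_star_per_interval ** intervals  # no star in the whole hour
p_at_least_one = 1 - p_none                   # complement rule

print(round(p_none, 4), round(p_at_least_one, 4))
```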

Explain The Roc Curve

ROC curves are graphs that depict how a classification model performs at different classification thresholds. The graph is plotted with the True Positive Rate (TPR) on the y-axis and the False Positive Rate (FPR) on the x-axis.

The TPR is expressed as the ratio between the number of true positives and the sum of the number of true positives and false negatives. The FPR is the ratio between the number of false positives in a dataset and the sum of the number of false positives and true negatives.
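The two ratios can be computed directly from labels and scores. The sketch below uses made-up labels and scores and a hypothetical helper `tpr_fpr`; each threshold yields one point on the ROC curve. It assumes the data contains at least one positive and one negative example, otherwise the divisions would fail.

```python
def tpr_fpr(actual, scores, threshold):
    """Return (TPR, FPR) when scores at or above the threshold count as positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(actual, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    # TPR = TP / (TP + FN), FPR = FP / (FP + TN)
    return tp / (tp + fn), fp / (fp + tn)

actual = [1, 1, 1, 0, 0, 0]              # hypothetical true labels
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # hypothetical model scores

roc_points = [tpr_fpr(actual, scores, t) for t in (0.25, 0.5, 0.75)]
print(roc_points)
```

Lowering the threshold moves the point up and to the right: more true positives are caught, at the cost of more false positives.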

Mention The Steps In Creating A Decision Tree

Consider the whole data as input.
Calculate the entropy of the target variable, and of the target within each predictor attribute.
Calculate the information gain for all attributes.
Take the attribute with the highest information gain as the root node.
Repeat the above process on every division until the decision node for every branch is finalized.
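The entropy and information gain calculations behind these steps can be sketched as follows. The tiny dataset, its attribute names, and the helper functions are all made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical toy data: should we play outside?
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
root = max(gains, key=gains.get)  # attribute with the highest gain
print(gains, root)
```

Here `outlook` separates the classes perfectly, so it has the highest gain and would be chosen as the root node.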

What Are The Major Industries In Chicago

Chicago has created a brand for itself in a multitude of sectors. Manufacturing, transportation, information technology, and health services & technologies are all booming industries in Chicago these days. IT is quickly becoming the fastest expanding sector, and there is a high need for people with Data Science certification in Chicago.

Are You Familiar With Date Manipulations

Date manipulation is one of the most important technical concepts that data scientists use. The interviewer may ask about date manipulations to ensure you have the necessary expertise to gather, sort and interpret data. In your response, discuss a specific example of when you used date manipulation and the outcome you achieved.

Example: “I often work with date manipulations to aggregate data based on what my clients want to know. For example, I recently worked with a pizza shop that wanted to determine what hour of the day they receive the most orders so they could adjust employee hours accordingly. With the monthly and daily data the shop owners provided me, I used date manipulation to determine which hours had the highest averages for order volume.”
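The kind of aggregation described in the example answer might look like the sketch below. The timestamps are invented, and counting orders per hour is a simplification of the per-hour averages mentioned in the answer.

```python
from collections import Counter
from datetime import datetime

# Hypothetical order timestamps from a pizza shop.
order_times = [
    "2023-03-01 12:15",
    "2023-03-01 18:40",
    "2023-03-02 18:05",
    "2023-03-02 19:30",
    "2023-03-03 18:55",
]

# Parse each string into a datetime and keep only the hour of the day.
orders_per_hour = Counter(
    datetime.strptime(t, "%Y-%m-%d %H:%M").hour for t in order_times
)

busiest_hour, order_count = orders_per_hour.most_common(1)[0]
print(busiest_hour, order_count)
```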

What Is The Admission Process For This Data Science Certification Program

There are three manageable phases to the Data Science Certification Program admission:

All interested applicants must apply online using the application form.

Candidates will be shortlisted by an admissions panel based on their application.

The shortlisted candidates will get an offer of admission, which they must accept by paying the fees.

Q78 During Analysis How Do You Treat Missing Values

First, identify the variables with missing values and the extent of the missing values in each. If any patterns are identified, the analyst should concentrate on them, as they could lead to interesting and meaningful business insights.

If no patterns are identified, the missing values can be substituted with the mean or median value, or they can simply be ignored. Another option is to assign a default value, which can be the mean, minimum or maximum value. Whichever option is used, it is important to look into the data first.

If it is a categorical variable, the missing value is assigned a default value. If the data follows a normal distribution, the missing value is assigned the mean.

If 80% of the values for a variable are missing, then you can answer that you would drop the variable instead of treating the missing values.
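A minimal sketch of mean and median substitution, together with a check of how much of a variable is missing, is shown below. Missing values are represented by `None`, and the function names and sample values are illustrative.

```python
from statistics import mean, median

def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def impute(values, strategy="mean"):
    """Replace None with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

incomes = [10, None, 20, 90]  # hypothetical variable with one missing value

print(missing_fraction(incomes))
mean_filled = impute(incomes, "mean")
median_filled = impute(incomes, "median")
print(mean_filled, median_filled)
```

With skewed data like this, the mean and the median give noticeably different fill values, which is why looking at the distribution first matters.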

Python Coding Interview Question #1: Business Name Lengths

The next question is by the City of San Francisco:

Find the number of words in each business name. Avoid counting special symbols as words. Output the business name and its count of words.

When answering the question, you should first find only distinct businesses using the drop_duplicates function. Then use the replace function to replace all the special symbols with blanks, so you don't count them later. Use the split function to split the text into a list, and then use the len function to count the number of words.
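The same sequence of steps can be sketched on plain Python strings rather than a DataFrame. The business names are invented, `dict.fromkeys` stands in for drop_duplicates, and the regular expression that defines a "special symbol" is an assumption.

```python
import re

# Hypothetical business names, including one duplicate.
business_names = [
    "Joe's Pizza & Pasta",
    "Sushi Zone",
    "Joe's Pizza & Pasta",
    "A-1 Auto Repair",
]

# Keep each name once, preserving the original order.
unique_names = list(dict.fromkeys(business_names))

def word_count(name):
    """Strip special symbols, split on whitespace, and count the pieces."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", name)
    return len(cleaned.split())

word_counts = {name: word_count(name) for name in unique_names}
for name, count in word_counts.items():
    print(name, count)
```

Removing the symbols before splitting is what keeps a standalone "&" from being counted as a word.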

Question : Calculate The Churn Rate Percentage

Calculate the churn rate, that is, the percentage of customers who churned, rounded to the nearest integer, and visualize it with a plot.

From the plot above, 70% decided to churn and 29% decided to stay. Another simpler way of doing this is to display the calculated percentages and plot in a simple barplot.
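The calculation itself is short. The sketch below uses a made-up list of customers with a boolean `churned` flag and does not reproduce the figures quoted above; plotting is left out.

```python
# Hypothetical customer records; True means the customer churned.
customers = [
    {"id": 1, "churned": True},
    {"id": 2, "churned": False},
    {"id": 3, "churned": False},
    {"id": 4, "churned": True},
    {"id": 5, "churned": False},
    {"id": 6, "churned": False},
    {"id": 7, "churned": False},
]

churned = sum(c["churned"] for c in customers)

# Percentage of customers who churned, rounded to the nearest integer.
churn_rate = round(100 * churned / len(customers))
stay_rate = 100 - churn_rate

print(f"churned: {churn_rate}%  stayed: {stay_rate}%")
```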

How Do You Build A Random Forest Model

A random forest is built up of a number of decision trees. If you split the data into different packages and make a decision tree in each of the different groups of data, the random forest brings all those trees together.

Steps to build a random forest model:

Randomly select ‘k’ features from a total of ‘m’ features where k << m

Among the ‘k’ features, calculate the node ‘d’ using the best split point

Split the node into daughter nodes using the best split

Repeat steps two and three until leaf nodes are finalized

Build forest by repeating steps one to four for ‘n’ times to create ‘n’ number of trees

Q25 What Is Correlation And Covariance In Statistics

Covariance and correlation are two mathematical concepts that are widely used in statistics. Both establish the relationship and measure the dependency between two random variables. Although the two do similar work, they are mathematically different from each other.

Correlation: Correlation is a technique for measuring and estimating the quantitative relationship between two variables. It measures how strongly two variables are related.

Covariance: Covariance is a measure that indicates the extent to which two random variables change together. It is a statistical term that describes the systematic relation between a pair of random variables, wherein a change in one variable is matched by a corresponding change in the other.

Technical Concepts Tested In Salesforce Data Scientist Coding Interview Questions

This article discusses some of the most recent Salesforce Data Scientist interview questions that candidates have been asked in SQL. Traditionally, the SQL interview questions at Salesforce were rather diverse and covered topics such as comparing individual data points with averages, counting occurrences per timeframe, finding data based on criteria, sorting, finding duplicates or connecting tables.

However, most recently the majority of candidates face a variety of interview questions based on the same simple table. The Salesforce data scientist interview questions relate to analysing changes of the user engagement in time. In this article, we will analyse this recent dataset and discuss the solutions to the questions you may expect to get asked.

Q100 What Are The Different Layers On Cnn

There are four layers in a CNN:

Convolutional Layer: the layer that performs a convolutional operation, creating several smaller picture windows to go over the data.

ReLU Layer: it brings non-linearity to the network and converts all the negative pixels to zero. The output is a rectified feature map.

Pooling Layer: pooling is a down-sampling operation that reduces the dimensionality of the feature map.

Fully Connected Layer: this layer recognizes and classifies the objects in the image.
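Two of these layers are simple enough to sketch in plain Python. The example below applies ReLU and then 2x2 max pooling to a made-up 4x4 feature map. Max pooling with a 2x2 window and a stride of 2 is one common choice and is an assumption here, as are the function names.

```python
def relu(feature_map):
    """Replace every negative value with zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        pooled.append([
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ])
    return pooled

# Hypothetical output of a convolutional layer.
feature_map = [
    [1, -2, 3, 0],
    [-1, 5, -3, 2],
    [0, -4, -1, -2],
    [7, 1, -6, 4],
]

rectified = relu(feature_map)
pooled = max_pool_2x2(rectified)
print(rectified)
print(pooled)
```

The 4x4 map shrinks to 2x2 after pooling, which is the dimensionality reduction described above.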

How Do You Work Towards A Random Forest

The underlying principle of this technique is that several weak learners combine to provide a strong learner. The steps involved are:

Build several decision trees on bootstrapped training samples of data

On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors

Rule of thumb: At each split, $m \approx \sqrt{p}$

Predictions: By majority vote
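The three ingredients listed above (bootstrapped samples, a random subset of predictors at each split, and majority voting) can be sketched without building full trees. The helper names, the feature names, and the use of the integer square root for m are illustrative assumptions.

```python
import random
from collections import Counter
from math import isqrt

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def candidate_features(features, rng):
    """Pick about sqrt(p) of the p features, without replacement."""
    m = max(1, isqrt(len(features)))
    return rng.sample(features, m)

def majority_vote(predictions):
    """Return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = list(range(10))  # stand-in for ten training rows
features = ["age", "income", "tenure", "plan", "region",
            "usage", "support_calls", "contract", "payment"]

sample = bootstrap_sample(rows, rng)
split_candidates = candidate_features(features, rng)
vote = majority_vote(["churn", "stay", "churn"])
print(sample, split_candidates, vote)
```

In a full implementation, each tree would be trained on its own bootstrap sample, draw fresh split candidates at every split, and contribute one prediction to the vote.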

This exhaustive list is sure to strengthen your preparation for data science interview questions.
