Saturday, June 22, 2024

# How To Crack Amazon Data Scientist Interview

## Q80 What Is An Autoencoder


Ans. Autoencoders are feedforward neural networks trained to reproduce their input at the output. The encoder compresses the data into a lower-dimensional representation, and the decoder reconstructs the input from that representation while keeping the reconstruction error as small as possible.


## What Are Dimensionality Reduction And Its Benefits

Dimensionality reduction refers to the process of converting a data set with a large number of dimensions into one with fewer dimensions that conveys similar information concisely.

This reduction helps compress data and reduce storage space. It also reduces computation time, since fewer dimensions mean less computation. Finally, it removes redundant features: for example, there is no point in storing the same value in two different units.

## What Is A Kernel Function In Svm

In the SVM algorithm, a kernel function is a function that takes two data points as input and returns a measure of their similarity. That value equals the inner product of the two points after they have been mapped into a higher-dimensional feature space, but it is computed without carrying out the mapping explicitly; this shortcut is known as the kernel trick. Using a kernel function, data that is not linearly separable in its original space can become linearly separable in the transformed space.
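To make the kernel trick concrete, here is a minimal sketch in Python. It assumes a degree-2 polynomial kernel on 2-D points; the function names `poly_kernel` and `feature_map` are illustrative and not part of any SVM library. It checks that the kernel value matches the inner product computed after an explicit mapping into three dimensions.

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) into 3-D."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Plain inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical sample points.
x, y = (1.0, 2.0), (3.0, 4.0)

kernel_value = poly_kernel(x, y)                       # computed in 2-D
explicit_value = dot(feature_map(x), feature_map(y))   # computed in 3-D

print(kernel_value, explicit_value)
```

Both routes give the same number (up to floating-point rounding), but the kernel never builds the 3-D vectors, which is what makes high-dimensional feature spaces affordable.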


## Two Candidates Aman And Mohan Appear For A Data Science Job Interview The Probability Of Aman Cracking The Interview Is 1/8 And That Of Mohan Is 5/12 What Is The Probability That At Least One Of Them Will Crack The Interview

The probability of Aman getting selected is 1/8, so $P(A) = 1/8$. The probability of Mohan getting selected is 5/12, so $P(B) = 5/12$.

Now, the probability of at least one of them getting selected is the probability of the union of A and B:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \qquad (1)$$

where $P(A \cap B)$ stands for the probability of both Aman and Mohan getting selected for the job. To calculate the final answer, we first have to find the value of $P(A \cap B)$. Assuming the two outcomes are independent,

$$P(A \cap B) = P(A) \cdot P(B) = \frac{1}{8} \cdot \frac{5}{12} = \frac{5}{96}$$

Now, put the value of $P(A \cap B)$ into equation (1):

$$P(A \cup B) = \frac{1}{8} + \frac{5}{12} - \frac{5}{96} = \frac{12 + 40 - 5}{96} = \frac{47}{96}$$

So, the answer is 47/96.
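The arithmetic can be double-checked with exact fractions. The sketch below assumes, as the calculation does, that the two outcomes are independent; the variable names are illustrative.

```python
from fractions import Fraction

# Probabilities given in the question.
p_aman = Fraction(1, 8)
p_mohan = Fraction(5, 12)

# Assumption: the two outcomes are independent, so P(A and B) = P(A) * P(B).
p_both = p_aman * p_mohan

# Inclusion-exclusion for "at least one of them".
p_at_least_one = p_aman + p_mohan - p_both

# Cross-check through the complement: 1 - P(neither is selected).
p_complement_route = 1 - (1 - p_aman) * (1 - p_mohan)

print(p_both, p_at_least_one, p_complement_route)
```

Both routes agree on 47/96, which is a quick way to catch sign errors in the inclusion-exclusion step.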

## Q59 What Is Supervised Learning

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples.

Algorithms: Support Vector Machines, Regression, Naive Bayes, Decision Trees, K-nearest Neighbor Algorithm and Neural Networks

For example, if you build a fruit classifier, the labels are "this is an orange", "this is an apple" and "this is a banana", and the classifier learns from labeled examples of apples, oranges and bananas.


## Q67 Explain Decision Tree Algorithm In Detail

A decision tree is a supervised machine learning algorithm mainly used for regression and classification. It breaks down a data set into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision tree can handle both categorical and numerical data.
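How a single decision node is chosen can be sketched in a few lines of Python. This is a simplified illustration, assuming one numeric feature and Gini impurity as the split criterion (one common choice among several); the helper names and the toy fruit data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, float("inf"))
    # Splitting at the largest value would leave the right side empty, so skip it.
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: fruit weight in grams and its label.
weights = [110, 120, 130, 150, 160, 170]
fruits = ["apple", "apple", "apple", "orange", "orange", "orange"]

threshold, impurity = best_threshold(weights, fruits)
print(threshold, impurity)
```

A full tree repeats this search recursively on each resulting subset until the leaves are pure enough or a depth limit is reached.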

## Why Are Coding Questions Asked In Ds Interviews

What exactly is a coding interview? We use the phrase "coding interview" to refer to any technical session that involves coding in any programming language other than a query language like SQL. In today's market, you can expect a coding interview with just about any data science job.

Why? Coding is an essential part of a DS career. Here are three reasons:

• DS is a technical subject. The bulk of a data science job involves collecting, cleaning, and processing data into usable formats. Therefore, to get work done, basic programming proficiency is a must.
• Lots of real-world data science projects are highly collaborative, involving multiple stakeholders. Data scientists who are equipped with stronger fundamental CS skills will find it easier to work closely with engineers and other partners.
• In many companies, data scientists are responsible for shipping production code, such as data pipelines and machine learning models. Strong programming skills are essential for projects of this type.

To sum up, strong coding skills are necessary to perform well in many data science positions. If you cannot show that you possess those skills in the coding interview, you will not get the job.


## Other Helpful Resources For All Aspiring Data Scientists

So far we have covered the question and answer part of the interview process. But even having that knowledge might not be enough if you don't follow the tips and behavioural guidelines covered in this section! Things like body language, the way you structure your thoughts, your awareness of the industry, your domain knowledge and how caught up you are with the latest developments in machine learning all matter a great deal.

#### 8.1 Beware: The Interviewer for the Analytics Job is Observing you Closely!

As an analyst, getting into details and studying them carefully almost becomes second nature to you. In an interview, you will likely be interviewed by someone who has been an analyst for longer than you have. Hence, you should expect a thorough and close examination of minute details. The tips mentioned here will prove to be very handy.

#### 8.2 Definitive Guide to prepare for an analytics interview

This article lays down the general structure of an analytics interview. It covers aspects like the different points the employer judges you on, the different stages of an interview, how a technical interview is conducted, etc. This guide is meant to help you ace the next analytics interview you sit for!

## Make A Scatter Plot Between Price And Carat Using Ggplot Price Should Be On Y


We will implement the scatter plot using ggplot.

The ggplot2 package is based on the grammar of graphics, which lets us stack multiple layers on top of each other.

So, we will start with the data layer, and on top of the data layer we will stack the aesthetic layer. Finally, on top of the aesthetic layer we will stack the geometry layer.

Code:

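`ggplot(data = diamonds, aes(x = carat, y = price)) + geom_point()`

This assumes the `diamonds` data set that ships with ggplot2, which has `carat` and `price` columns.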

## Introduce 25 Percent Missing Values In This Iris Dataset And Impute The Sepallength Column With Mean And The Petallength Column With Median

To introduce missing values, we will be using the missForest package:

`library(missForest)`

Using the prodNA function, we will be introducing 25 percent of missing values:

`iris.mis <- prodNA(iris, noNA = 0.25)`

For imputing the Sepal.Length column with mean and the Petal.Length column with median, we will be using the Hmisc package and the impute function:

`library(Hmisc)`

`iris.mis$Sepal.Length <- with(iris.mis, impute(Sepal.Length, mean))`

`iris.mis$Petal.Length <- with(iris.mis, impute(Petal.Length, median))`

## How To Calculate The Accuracy Of A Binary Classification Algorithm Using Its Confusion Matrix

In a binary classification algorithm, we have only two labels, which are True and False. Before we can calculate the accuracy, we need to understand a few key terms:

• True positives: Number of observations correctly classified as True
• True negatives: Number of observations correctly classified as False
• False positives: Number of observations incorrectly classified as True
• False negatives: Number of observations incorrectly classified as False

To calculate the accuracy, we divide the number of correctly classified observations by the total number of observations. This can be expressed as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
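As a small illustration, accuracy can be computed directly from the four confusion-matrix counts. The function name and the counts below are hypothetical, chosen only to show the calculation.

```python
def accuracy(tp, tn, fp, fn):
    """Share of observations that were classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    return (tp + tn) / total

# Hypothetical counts from a binary classifier evaluated on 100 observations.
acc = accuracy(tp=50, tn=35, fp=10, fn=5)
print(acc)
```

Keep in mind that accuracy can look flattering on imbalanced data, which is why it is usually reported alongside other metrics.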


## Cracking The Data Scientist Interview

After interviewing with over 50 companies for Data Scientist/Machine Learning Engineer roles, I am going to frame my experiences in a Q&A format and try to debunk any myths that beginners may have in their quest to become a Data Scientist.

By Ajit Samudrala, Data Scientist at Symantec

After completing my Data Science internship at Sirius in August 2018, I started searching for a full-time position in Data Science. My initial search was haphazard, with a mediocre resume and LinkedIn profile. Unsurprisingly, it took me a month to get the ball rolling. Forty days into my search, I received my first response, from Google, for a Data Scientist position in one of their Engineering teams. I was simmering with excitement, as I didn't expect a call from Google even in my wildest dreams. I couldn't make it to the onsite, but it was a great learning experience. Thereafter, I interviewed with Apple, SAP, Visa, Walmart, Nielsen, Symantec, Swiss Re, AppNexus, Catalina, Cerego, and 40 other companies for Data Scientist/Machine Learning Engineer roles. Finally, I joined Symantec at their Mountain View campus.

1. What was the toughest part of the whole job search process?

2. What was the best part of the whole job search process?

3. What are the primary skills required to ace the interviews?

12. Random Tips

## Q25 What Is Correlation And Covariance In Statistics

Covariance and correlation are two mathematical concepts that are widely used in statistics. Both describe the relationship between two random variables and measure how they depend on each other. Although they serve a similar purpose, they are mathematically different.

Correlation: Correlation measures how strongly two variables are linearly related, and in which direction. It is a unit-free value between -1 and 1, which makes it a convenient way to quantify the relationship between two variables.

Covariance: Covariance indicates the extent to which two random variables change together. It describes the systematic relationship between a pair of random variables, in which a change in one variable is accompanied by a corresponding change in the other. Unlike correlation, its value depends on the units of the variables.
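The difference between the two measures shows up when the units change. The sketch below is a minimal illustration with hand-written helpers and hypothetical height and weight data; it uses the sample covariance (dividing by n - 1) and the Pearson correlation.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: the same heights expressed in metres and in centimetres.
heights_m = [1.5, 1.6, 1.7, 1.8]
heights_cm = [h * 100 for h in heights_m]
weights_kg = [50.0, 58.0, 66.0, 74.0]

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)

print(cov_m, cov_cm)    # covariance scales with the unit of measurement
print(corr_m, corr_cm)  # correlation stays the same
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged, which is why correlation is easier to compare across data sets.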


## Q115 What Is A Boltzmann Machine

Boltzmann machines have a simple learning algorithm that allows them to discover interesting features that represent complex regularities in the training data. They are typically used to optimise the weights and quantities for a given problem. The learning algorithm is very slow in networks with many layers of feature detectors. A Restricted Boltzmann Machine has a single layer of feature detectors, which makes it faster to train.

## Q116 What Is Dropout And Batch Normalization

Dropout is a technique of randomly dropping out hidden and visible units of a network during training to prevent overfitting. It roughly doubles the number of iterations needed for the network to converge.

Batch normalization is the technique to improve the performance and stability of neural networks by normalizing the inputs in every layer so that they have mean output activation of zero and standard deviation of one.
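Both ideas can be sketched on a plain list of activations. This is a simplified illustration rather than a framework implementation: it assumes "inverted" dropout (rescaling the surviving units, a common variant) and leaves out the learnable scale and shift parameters that batch normalization layers normally carry. The function names are illustrative.

```python
import math
import random

def batch_normalize(values, eps=1e-5):
    """Shift and scale a batch of activations to mean 0 and std close to 1."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def dropout(values, drop_prob, rng):
    """Inverted dropout: zero each unit with probability drop_prob, rescale the rest."""
    keep = 1.0 - drop_prob
    return [v / keep if rng.random() < keep else 0.0 for v in values]

# Hypothetical activations of one layer for a batch of four examples.
activations = [2.0, 4.0, 6.0, 8.0]

normalized = batch_normalize(activations)
dropped = dropout(activations, drop_prob=0.5, rng=random.Random(0))

print(normalized)
print(dropped)
```

Dropout is applied only during training; at inference time all units are kept, which is why the surviving activations are rescaled here.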

## Mtech In Data Science And Machine Learning

The M.Tech in Data Science and Machine Learning is a 21-month program offered in Full-Time / Weekend-Classroom formats, which enables participants to gain an in-depth understanding of data science and analytics techniques and tools that are widely used by companies. Upon successful completion of all requirements, the participants in this program will earn an M.Tech degree from PES University. The classes will be held at the PES University Electronics City Campus and on the Great Learning online platform.

Who Can Participate In This Program?

This course is for candidates with 0 to 5 years of experience.

Job Roles One Can Break Into:

Data Scientist, Data Analyst, Machine Learning Engineer, Data Science Generalist, etc.

Hands-on Learning:

Through the duration of this course, the candidates will be trained on Python, SQL, Tableau, Data Science and Machine Learning.


## How Can You Select K For K-Means

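We can use the elbow method to select k for k-means clustering. The idea is to run k-means clustering on the data set for a range of values of k, where k is the number of clusters, and to compute the within-cluster sum of squares (WSS) for each value. When WSS is plotted against k, the point where the curve bends and further increases in k bring only small improvements (the "elbow") is taken as the number of clusters.

The within-cluster sum of squares is defined as the sum of the squared distances between each member of a cluster and its centroid.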
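The elbow method can be sketched with a very small k-means written from scratch. This is an illustration under simplifying assumptions: the data is one-dimensional, the starting centroids are picked deterministically from the sorted data instead of at random, and the function name `kmeans_wss` and the data are hypothetical.

```python
def kmeans_wss(points, k, iterations=20):
    """Run a basic 1-D k-means and return the within-cluster sum of squares."""
    ordered = sorted(points)
    # Deterministic start: pick k evenly spaced points from the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum((p - centroids[i]) ** 2 for i, c in enumerate(clusters) for p in c)

# Hypothetical data with three visible groups.
data = [1, 2, 3, 11, 12, 13, 21, 22, 23]

wss_by_k = {k: kmeans_wss(data, k) for k in range(1, 5)}
for k, wss in wss_by_k.items():
    print(k, round(wss, 2))
```

On this data the WSS falls sharply up to k = 3 and barely improves afterwards, so the elbow sits at three clusters. In practice a library implementation with several random restarts would be the safer choice.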

## What Is A Confusion Matrix


The confusion matrix is a table that is used to evaluate the performance of a classification model. For a binary classifier, it tabulates the actual values against the predicted values in a 2×2 matrix:

• True Positive: the actual value is true and the predicted value is also true.
• False Negative: the actual value is true, but the predicted value is false.
• False Positive: the actual value is false, but the predicted value is true.
• True Negative: the actual value is false and the predicted value is also false.

The correctly classified records are therefore the true positives and the true negatives taken together.
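The four cells can be tallied from two lists of labels. The sketch below assumes boolean labels; the function name and the sample lists are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP and TN for two equal-length lists of boolean labels."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["TP"] += 1      # actual true, predicted true
        elif a and not p:
            counts["FN"] += 1      # actual true, predicted false
        elif not a and p:
            counts["FP"] += 1      # actual false, predicted true
        else:
            counts["TN"] += 1      # actual false, predicted false
    return counts

# Hypothetical labels for eight records.
actual = [True, True, True, False, False, False, True, False]
predicted = [True, False, True, False, True, False, True, False]

counts = confusion_counts(actual, predicted)
print(counts)
```

Arranging these counts with actual values on one axis and predicted values on the other gives the 2×2 matrix.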


## What Is Dimensionality Reduction

Dimensionality reduction is the process of converting a dataset with a high number of dimensions to a dataset with a lower number of dimensions. This is done by dropping some fields or columns from the dataset. However, this is not done haphazardly. In this process, the dimensions or fields are dropped only after making sure that the remaining information will still be enough to succinctly describe similar information.

A gradient measures how much the output of a function changes if you change the inputs a little bit. In a neural network, it measures the change in the error with regard to a change in each of the weights. You can also think of a gradient as the slope of a function.

Gradient descent can be thought of as climbing down to the bottom of a valley, instead of climbing up a hill. This is because it is a minimization algorithm: it minimizes a given function by repeatedly stepping in the direction opposite to the gradient.
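The "walking downhill" idea can be sketched on a one-variable function. The example assumes the function f(x) = (x - 3)^2, whose gradient is 2(x - 3) and whose minimum is at x = 3; the learning rate and step count are arbitrary illustrative values.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from 'start'."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Gradient of the hypothetical objective f(x) = (x - 3)^2.
def grad_f(x):
    return 2 * (x - 3)

minimum = gradient_descent(grad_f, start=0.0)
print(minimum)
```

The result lands very close to 3. A learning rate that is too large would overshoot the valley floor, while one that is too small would need many more steps.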

## What Are The Differences Between Supervised And Unsupervised Learning

Supervised Learning

• Uses known and labeled data as input
• Supervised learning has a feedback mechanism
• The most commonly used supervised learning algorithms are decision trees, logistic regression, and support vector machine

Unsupervised Learning

• Uses unlabeled data as input
• Unsupervised learning has no feedback mechanism
• The most commonly used unsupervised learning algorithms are k-means clustering, hierarchical clustering, and apriori algorithm

## How To Crack A Data Science Interview

The data science and analytics sector in India has witnessed a sharp increase in demand for highly-skilled professionals who understand both the business world as well as the tech world. Data Science is considered one of the most lucrative jobs in the industry right now.

However, the industry still faces a lot of challenges in terms of talent, which is why organisations have started pouring substantial amounts of money into building their data science and analytics teams. Organisations, regardless of their market position, have been using data science and analytics to garner insights from data.

#### Need For Young Professionals:

The Data Scientist role was called the "Sexiest Job of the 21st Century" by Harvard Business Review, and the position has proved to be one of the most appealing and sought-after among job seekers. However, there has been a shortage of Data Scientists in the field. The reason behind this is the technological challenges that limit the skills of employees. Most senior-level managers started off in software or coding roles, since the sector wasn't evolved enough to include the designation of data scientist.

An entry-level data scientist is someone who has less than four years of experience working as a business analyst with knowledge in Python. The entry-level role also applies to senior software engineers looking for opportunities to work in analytics and machine learning projects.


## Q121 What Is A Generative Adversarial Network

Suppose there is a wine shop purchasing wine from dealers, which they resell later. But some dealers sell fake wine. In this case, the shop owner should be able to distinguish between fake and authentic wine.

The forger will try different techniques to sell fake wine and make sure specific techniques get past the shop owner's check. The shop owner would probably get some feedback from wine experts that some of the wine is not original. The owner would then have to improve how he determines whether a wine is fake or authentic.

The forger's goal is to create wines that are indistinguishable from the authentic ones, while the shop owner intends to tell accurately whether the wine is real or not.


There is a noise vector coming into the forger who is generating fake wine.

Here the forger acts as a Generator.

The shop owner acts as a Discriminator.

The Discriminator gets two inputs: one is the fake wine, while the other is the real, authentic wine. The shop owner has to figure out whether each one is real or fake.

So, a Generative Adversarial Network has two primary components:

• Generator

• Discriminator

The generator is a CNN that keeps producing images that are closer and closer in appearance to the real images, while the discriminator tries to determine the difference between real and fake images. The ultimate aim is to make the discriminator learn to identify real and fake images.

## Q95 What Is Natural Language Processing

Ans. Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language. It focuses on processing human communications, dividing them into parts, and identifying the most relevant elements of the message. It covers both the comprehension and the generation of natural language.
