
Data Pipeline Design Interview Questions


How Do You Handle Duplicate Data In SQL


You might want to clarify a question like this one and ask some follow-up questions of your own. Specifically, you might be interested in (a) what kind of data is being processed and (b) what types of values are most likely to be duplicated.

With some clarity, you'll be able to suggest more relevant strategies. For example, you might propose using the DISTINCT keyword or a UNIQUE constraint to reduce duplicate data. Or you could walk the interviewer through how the GROUP BY clause could be used.
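For instance, a minimal sketch, assuming a hypothetical contacts table whose email column contains duplicates:

-- return each email only once
SELECT DISTINCT email
FROM contacts;

-- or surface the duplicates themselves with GROUP BY
SELECT email, COUNT(*) AS copies
FROM contacts
GROUP BY email
HAVING COUNT(*) > 1;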

Most Frequently Asked Azure Data Factory Interview Questions And Answers

1. What is Azure Data Factory used for?

Azure Data Factory (ADF) is the data orchestration service provided by the Microsoft Azure cloud. ADF is mainly used for the following use cases:

  • Migrating data from one data source to another
  • Migrating data from on-premises to the cloud
  • Automating data flows

In short, when a huge amount of data is laid out there and you want to move it from one location to another in an automated way, whether within the cloud or from on-premises to Azure, Azure Data Factory is the best service available.

2. What are the main components of Azure Data Factory?

These are the main components of Azure Data Factory:

• Pipelines
• Activities
• Datasets
• Linked services
• Integration runtimes
• Triggers

3. What is the pipeline in ADF?

A pipeline is a set of activities specified to run in a defined sequence. To achieve any task in Azure Data Factory, we create a pipeline that contains the various types of activities required to fulfill the business purpose. Every pipeline must have a valid name and an optional list of parameters.

4. What is the data source in Azure Data Factory?

5. What is the integration runtime in Azure Data Factory?

6. What are the different types of integration runtime?

There are three types of integration runtime available in Azure Data Factory, and we can choose the one best fitted to a specific scenario based on our requirements. The three types are:

• Azure integration runtime
• Self-hosted integration runtime
• Azure-SSIS integration runtime

You're Given Two Tables, A Users Table And A Neighborhoods Table. Write A Query That Returns All Of The Neighborhoods That Have 0 Users

Whenever a question asks about finding 0 values, e.g. users or neighborhoods, start thinking LEFT JOIN! An INNER JOIN returns only the rows that match in both tables; a LEFT JOIN keeps every row from the left table, with NULLs where the right table has no match.

With this question, our task is to find all the neighborhoods without users. To do this, we must do a left join from the neighborhoods table to the users table. Here's an example solution:

SELECT n.name
FROM neighborhoods AS n
LEFT JOIN users AS u
  ON n.id = u.neighborhood_id
WHERE u.id IS NULL
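The same anti-join can also be written with NOT EXISTS, a variant worth mentioning in an interview; a sketch against the same two tables:

SELECT n.name
FROM neighborhoods AS n
WHERE NOT EXISTS (
    SELECT 1
    FROM users AS u
    WHERE u.neighborhood_id = n.id
);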

    30. Write a query to get the current salary data for each employee.

31. What is the difference between DELETE and TRUNCATE? (See the sketch after this list.)

    32. Write a query to find the nominee who has won the most awards.

    33. What are aggregate functions in SQL?

    34. What SQL commands can be used in ETL?

35. How do you change a column name by writing a query in SQL? (See the sketch after this list.)

    36. How would you design the database for a recommendation engine?

37. What's the difference between WHERE and HAVING?

    38. What is an index in SQL? When would you use an index?
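For questions 31 and 35 above, minimal sketches, assuming a hypothetical employees table (RENAME COLUMN syntax varies slightly by database):

-- 31. DELETE is DML: it removes rows one at a time, can take a WHERE
--     clause, and is logged, so it can be rolled back. TRUNCATE is DDL:
--     it deallocates all rows at once and cannot be filtered.
DELETE FROM employees WHERE department = 'Sales';
TRUNCATE TABLE employees;

-- 35. Renaming a column (PostgreSQL / MySQL 8+ syntax)
ALTER TABLE employees RENAME COLUMN emp_name TO full_name;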


Can You Tell Me About NameNode? What Happens If NameNode Crashes Or Goes Down

It is the centerpiece, or central node, of the Hadoop Distributed File System, and it does not store actual data; it stores metadata, for example, on which rack and in which DataNode the data held by the DataNodes is stored. It tracks the different files present in the cluster. Generally, there is a single NameNode, so when it crashes, the system may become unavailable.

Using The Following SQL Table Definitions And Data, How Would You Construct A Query That Shows…


    A data engineer needs to be able to construct and execute queries in order to understand the existing data, and to verify data transformations that are part of the data pipeline.

    You can ask a few questions covering SQL to ensure the data engineer candidate has a good handle on the query language.

    Here are some examples:

    • With a product table defined with a name, SKU, and price, how would you construct a query that shows the lowest priced item?
    • With an order table defined with a date, a product SKU, price, quantity, tax rate, and shipping rate, how would you construct a query that shows the average order cost?

You can use the SQL below to set up the examples above:

-- Reconstructed schema: the column definitions follow the table
-- descriptions in the bullets above (parts of the original were lost
-- in extraction), and the sample rows are illustrative.
CREATE TABLE products (
  sku VARCHAR(32) NOT NULL,
  name VARCHAR(255) NOT NULL,
  price DECIMAL NOT NULL,
  PRIMARY KEY (sku)
);

CREATE TABLE orders (
  order_date DATE NOT NULL,
  sku VARCHAR(32) NOT NULL,
  price DECIMAL NOT NULL,
  quantity INT NOT NULL,
  tax_rate DECIMAL NOT NULL,
  shipping_rate DECIMAL NOT NULL,
  FOREIGN KEY (sku) REFERENCES products (sku)
);

INSERT INTO products VALUES ('SKU-1', 'Widget', 9.99);
INSERT INTO products VALUES ('SKU-2', 'Gadget', 19.99);
INSERT INTO orders VALUES ('2024-01-01', 'SKU-1', 9.99, 2, 0.08, 4.99);
INSERT INTO orders VALUES ('2024-01-02', 'SKU-2', 19.99, 1, 0.08, 4.99);
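For reference, hedged example answers to the two bullet questions above, written against the reconstructed schema (treating "order cost" as the item total plus tax and shipping is one reasonable interpretation):

-- lowest-priced item
SELECT name, sku, price
FROM products
ORDER BY price ASC
LIMIT 1;

-- average order cost
SELECT AVG(price * quantity * (1 + tax_rate) + shipping_rate) AS avg_order_cost
FROM orders;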


    Compare Azure Data Lake Gen1 Vs Azure Data Lake Gen2

Azure Data Lake Gen1:

• File-system storage: a hierarchical file system that distributes data in blocks.
• The hot/cold storage tiers are not available.
• It does not support storage redundancy.

Azure Data Lake Gen2:

• Includes a hierarchical file system, for efficiency and reliability, together with flexible object storage.
• The hot/cold storage tiers are available.
• It supports storage redundancy.

What Is A Trigger In SQL

In SQL, a trigger refers to a set of statements in the system catalog that runs whenever DML commands are executed against a table. It is a special stored procedure that gets called automatically in response to an event. Triggers allow a batch of code to be executed whenever an INSERT, UPDATE, or DELETE command runs against a specific table. You can create a trigger by using the CREATE TRIGGER statement. The general syntax is:

CREATE TRIGGER trigger_name
AFTER INSERT   -- or BEFORE; the triggering event may be INSERT, UPDATE, or DELETE
ON table_name FOR EACH ROW
BEGIN
   -- statements to run for each affected row
END;
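As a concrete illustration, a minimal sketch in MySQL syntax, assuming hypothetical orders and order_audit tables:

CREATE TRIGGER log_new_order
AFTER INSERT ON orders
FOR EACH ROW
  -- record every new order in an audit table
  INSERT INTO order_audit (order_id, logged_at)
  VALUES (NEW.id, NOW());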


What Do You Mean By The Slice Operation And How Many Dimensions Does It Use

A slice operation is a filtration process in a data warehouse: it selects a particular dimension from a given cube and provides a new sub-cube. Only a single dimension is used, so if a particular dimension of a multi-dimensional data warehouse needs further analytics or processing, the slice operation is applied to that data warehouse.
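In SQL terms, a slice amounts to fixing a single dimension at one value and keeping the rest of the cube; a sketch against a hypothetical sales_cube table:

-- slice: fix the year dimension at a single value, leaving a sub-cube
SELECT product, region, SUM(amount) AS total_sales
FROM sales_cube
WHERE sales_year = 2023
GROUP BY product, region;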

    What Tools Did You Use In A Recent Project


    Interviewers want to assess your decision-making skills and knowledge about different tools. Therefore, use this question to explain your rationale for choosing specific tools over others.

    • Walk the hiring managers through your thought process, explaining your reasons for considering the particular tool, its benefits, and the drawbacks of other technologies.
• If you find that the company works with technologies you have previously used, weave your experience into those similarities.

    Don’t Miss: How To Interview A Programmer

    What Is A Memorable Data Pipeline Performance Issue That You Solved

This question will give you insight into the candidate's past experience with data pipeline implementation and how they were able to improve performance. Performance issues in a data pipeline not only slow down the gathering of data, but can also disrupt and slow down data analysis. This can have a direct impact on business decisions.

    Here are some examples of experiences candidates could discuss:

    • how they improved the performance of a specific SQL query
    • how they upgraded a database from one type to another
    • how they reduced the time it took to run a set of queries
    • how they improved performance of importing or exporting of data
    • how they improved retrieval of data from a backup system

    If you want to know if the candidate has ideas on how to improve the performance of your data pipeline, also ask this as a question!

    You can also ask a candidate how they have solved issues with malformed data and incorrect taxonomies.


    Differentiate Between Oltp And Olap

• OLTP stands for Online Transaction Processing system.
• OLTP systems maintain the transactional-level data of the organization, and they are generally highly normalized.
• OLAP stands for Online Analytical Processing system.
• OLAP systems serve heavy analysis and reporting purposes, and they are kept in a de-normalized form, typically a star or snowflake schema design.
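To make the schema contrast concrete, here is a minimal star-schema sketch; the table and column names are hypothetical:

-- fact table at the center of a star schema
CREATE TABLE fact_sales (
  date_key INT NOT NULL,
  product_key INT NOT NULL,
  amount DECIMAL NOT NULL
);

-- de-normalized dimension (star design); a snowflake design would
-- normalize further, e.g. moving category into its own table
CREATE TABLE dim_product (
  product_key INT PRIMARY KEY,
  product_name VARCHAR(255),
  category VARCHAR(255)
);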

    Don’t Miss: Erp Interview Questions And Answers

Can You Elaborate On Reducer In Hadoop MapReduce? Explain The Core Methods Of Reducer

    Reducer is the second stage of data processing in the Hadoop Framework. The Reducer processes the data output of the mapper and produces a final output that is stored in HDFS.

    The Reducer has 3 phases:

  • Shuffle: the output from the mappers is shuffled and acts as the input for the Reducer.
  • Sort: sorting is done simultaneously with shuffling, and the output from the different mappers is sorted.
  • Reduce: in this step, the Reducer aggregates the key-value pairs and gives the required output, which is stored on HDFS and is not sorted further.

There are three core methods in Reducer:

  • setup(): configures various parameters, such as input data size.
  • reduce(): the main operation of the Reducer; in this method, a task is defined for the associated key.
  • cleanup(): cleans up temporary files at the end of the task.

What Is Cluster Analysis? What Is The Purpose Of Cluster Analysis


Cluster analysis is a process in which objects are grouped without any labels being assigned to them in advance. It uses statistical data analysis techniques to carry out data mining jobs. Using cluster analysis, knowledge discovery proceeds as an iterative process of trials.

    The purpose of cluster analysis:

• It can deal with different sets of attributes
    • High dimensionality
    • Interpretability



    What Is The Meaning Of Skewed Tables In Hive

    Skewed tables are the tables in which values appear in a repeated manner. The more they repeat, the more the skewness.

Using Hive, a table can be marked as SKEWED when creating it. By doing this, the heavily repeated (skewed) values are written to separate files first, and the remaining values go to another file.
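A minimal HiveQL sketch, assuming a hypothetical page_views table in which a few user_id values dominate:

CREATE TABLE page_views (
  user_id BIGINT,
  url STRING
)
SKEWED BY (user_id) ON (0, 42)  -- the heavily repeated values
STORED AS DIRECTORIES;          -- optional: one directory per skewed value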

As A Data Engineer, How Have You Handled A Job-Related Emergency

Data engineers have a lot of responsibilities, and it's a genuine possibility that you'll face challenges while on the job, or even emergencies. Just be honest and let the interviewer know what you did to solve the problem. If you have yet to encounter an urgent issue on the job, or this is your first data engineering role, tell your interviewer what you would do in a hypothetical situation. For example, you can say that if data were to get lost or corrupted, you would work with IT to make sure data backups were ready to be loaded and that other team members have access to what they need.

    Don’t Miss: How To Overcome Interview Anxiety

    Tell Me About Yourself

What they're really asking: What makes you a good fit for this job?

This question is asked so often in interviews that it can seem generic and open-ended, but it's really about your relationship with data engineering. Keep your answer focused on your path to becoming a data engineer. What attracted you to this career or industry? How did you develop your technical skills?

    The interviewer might also ask:

    • Why did you choose to pursue a career in data engineering?

    • Describe your path to becoming a data engineer.

    Does The Job Assistance Program Guarantee Me A Job


No. Our job assistance program is aimed at helping you land your dream job. It offers a potential opportunity for you to explore various competitive openings in the corporate world and find a well-paid job matching your profile. The final decision on hiring will always be based on your performance in the interview and the requirements of the recruiter.


    What Is Meant By Coshh

    COSHH is the abbreviation for Classification and Optimization-based Scheduling for Heterogeneous Hadoop systems. As the name suggests, it provides scheduling at both the cluster and the application levels to directly have a positive impact on the completion time for jobs.


    What Is Data Modeling

    Data modeling is a technique that defines and analyzes the data requirements needed to support business processes. It involves creating a visual representation of an entire system of data or a part of it. The process of data modeling begins with stakeholders providing business requirements to the data engineering team.


Tips To Crack Amazon's Data Engineering Interview

    Take note of the following tips to nail your next Amazon data engineer interview:

    • Start your prep at least 10 weeks before your interview
    • Practice coding on a whiteboard for the onsite interview
    • Practice mock interviews with professionals from FAANG companies
• Think out loud while solving problems to give the hiring manager a window into your analytical approach
    • Create a project portfolio and list your projects in the STAR format
    • Brush up on concepts in your programming language

    What Are The Different Power Bi Tools And How Are They Used


Some of the common tools in Power BI are built-in connectors, Power Query, AI-powered Q&A, Machine Learning models, quick insights, and Cortana integration. The built-in connectors in Power BI allow the user to connect with both on-premises and on-cloud data sources, including Salesforce, SQL Server, Microsoft products, and more.

Power Query helps you combine and transform data for your reports. Cortana integration allows you to run queries by giving voice commands. In addition, Power BI offers advanced analysis, Machine Learning, and other AI tools to create live dashboards and check your performance in real time.


    What Is The Level Of Granularity Of A Fact Table

A fact table is usually designed at a low level of granularity. This means we must find the lowest level of detail that can be stored in a fact table. For example, "employee performance" is a very high level of granularity. In contrast, "employee performance daily" and "employee performance weekly" are lower levels of granularity, because the data is recorded much more frequently. Granularity is the lowest level of information stored in the fact table; in the date dimension, the depth of the data level is known as the granularity.

The level of granularity could be a year, month, quarter, period, week, or day, so the day is the lowest level and the year is the highest. The process consists of the following two steps: determining the dimensions to be included, and determining where to find the hierarchy of each of those dimensions. These determinations are revisited as the requirements change.
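As an illustration, a hedged sketch with hypothetical names, keeping the fact table at the lowest (daily) grain discussed above:

-- one row per employee per day: the day is the lowest level of granularity
CREATE TABLE fact_employee_performance_daily (
  employee_key INT NOT NULL,
  date_key DATE NOT NULL,   -- week, month, quarter, and year roll up from here
  tasks_completed INT NOT NULL,
  PRIMARY KEY (employee_key, date_key)
);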

    Subquery And Derived Tables

Calculating the number of clicks, logins, and purchases per user session for active users, using a subquery and derived tables, is shown below. (Parts of this query were lost in extraction, so the CASE expressions and the final IN filter are reconstructed and illustrative.)

select userSessionMetrics.userId,
       userSessionMetrics.sessionId,
       userSessionMetrics.numclicks,
       userSessionMetrics.numlogins,
       userSessionMetrics.numPurchases
from (
        -- users with at least one purchase
        select userId,
               sum(case when eventType = 'purchase' then 1 else 0 end) as numPurchases
        from clickstream
        group by userId
        having numPurchases >= 1
     ) purchasingUsers
join (
        -- click/login/purchase counts per user session
        select userId,
               sessionId,
               sum(case when eventType = 'click' then 1 else 0 end) as numclicks,
               sum(case when eventType = 'login' then 1 else 0 end) as numlogins,
               sum(case when eventType = 'purchase' then 1 else 0 end) as numPurchases
        from clickstream
        group by userId,
                 sessionId
     ) userSessionMetrics
    on purchasingUsers.userId = userSessionMetrics.userId
where purchasingUsers.userId in (
        -- movingUsers: the original filter was lost in extraction;
        -- "more than one session per user" is an illustrative stand-in
        select userId
        from clickstream
        group by userId
        having count(distinct sessionId) > 1
      )

We can see the query plan by running EXPLAIN followed by the above query in your SQL terminal.

From the query plan, we can see that the query planner decided to:

  • calculate movingUsers and userSessionMetrics in parallel and join them,
  • while simultaneously filtering clickstream data to generate purchasingUsers,
  • and finally join the datasets from the above two points.

You can see that the query plan is very similar to the CTE approach.


Well, Why Should Data Scientists Worry

This is mainly because applying machine learning becomes highly practical when some system design fundamentals are applied to it. Therefore, it is essential that data scientists understand some of these concepts. For a start, let us ask a few system design questions.

    • What is horizontal and vertical scaling?
    • What are various load balancing algorithms?
    • What are various cache eviction policies?
• What is the advantage/disadvantage of adding an index to a database? (See the sketch after this list.)
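For the last question above, a minimal sketch, assuming a hypothetical users table:

-- An index speeds up reads that filter on the indexed column...
CREATE INDEX idx_users_email ON users (email);

-- ...so this lookup can avoid a full table scan:
SELECT id, name
FROM users
WHERE email = 'a@example.com';

-- The trade-off: every INSERT/UPDATE/DELETE on users must also maintain
-- the index, and the index consumes extra storage.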

We will cover many of these questions here.

In the future, we will also add more questions on system design on the same website.

    Data Engineer Job Growth And Demand In 2022


When compared to data science, data engineering does not receive as much media coverage. However, data engineering is a career field that is rapidly expanding and in great demand. It can be a highly exciting career for people who enjoy assembling the "pieces of a puzzle" that build complex data pipelines to ingest raw data, convert it, and then optimize it for various data users. According to a LinkedIn search as of June 2022, there are over 229,000 data engineering jobs in the United States and over 41,000 in India.

Based on Glassdoor, the average salary of a data engineer in the United States is $112,493 per annum. In India, the average data engineer salary is ₹925,000. According to Indeed, Data Engineer is the 5th highest-paying job in the United States across all sectors. These stats make it clear that the demand for data engineers will only increase, with lucrative paychecks.

