Sunday, June 16, 2024

Big Data Coding Interview Questions


Miscellaneous Coding Interview Questions


Apart from data structure-based questions, most programming job interviews also include algorithm, design, bit manipulation, and general logic-based questions, which I'll describe in this section.

It's important that you practice these concepts because they can become tricky to solve in the actual interview. Having practiced them beforehand not only makes you familiar with them but also gives you more confidence in explaining the solution to the interviewer.

  • How is a bubble sort algorithm implemented?
  • How is an iterative quicksort algorithm implemented?
  • How do you implement an insertion sort algorithm?
  • How is a merge sort algorithm implemented?
  • How do you implement a bucket sort algorithm?
  • How do you implement a counting sort algorithm?
  • How is a radix sort algorithm implemented?
  • How do you swap two numbers without using the third variable?
  • How do you check if two rectangles overlap with each other?
  • How do you design a vending machine?
    If you need more such coding questions, books like Cracking the Coding Interview by Gayle Laakmann McDowell, which presents 189 programming questions and solutions, are a good way to prepare for programming job interviews in a short time.

    By the way, the more questions you solve in practice, the better your preparation will be. So, if you think 50 is not enough and you need more, then check out these additional 50 programming questions for telephone interviews and these books and courses for more thorough preparation.

    What Is The Difference Between A Box Plot And A Histogram

    Both box plots and histograms visually represent how the values of a feature are distributed.

    Box plots are more often used to compare several datasets. Compared to histograms, they take less space and contain fewer details, because they show only summary statistics such as the median and quartiles. Histograms are used to understand the probability distribution underlying a dataset.
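To make the contrast concrete, here is a minimal sketch of the numbers each chart is drawn from: a five-number summary for a box plot and per-bin counts for a histogram. The function names are hypothetical, and the quartile method is whatever `statistics.quantiles` uses by default, which is an assumption rather than something the text specifies.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)


def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # avoid a zero width if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not to a bin past the end.
        idx = min(int((v - low) / width), bins - 1)
        counts[idx] += 1
    return counts


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(data)
counts = histogram_counts(data, bins=3)
```

The summary compresses the data to five numbers, while the bin counts keep the shape of the distribution, which is why the histogram carries more detail.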

    The diagram above denotes a boxplot of a dataset.

    What Can I Do To Find Out Whether This Data Science Program Is Good For Me

    It’s always beneficial to learn new talents and broaden your knowledge. This Data Science program was created in collaboration with Purdue University and is an excellent combination of a world-renowned curriculum and industry-aligned training, making this postgraduate in data science a superb choice.


    Using The Sample Superstore Dataset Display The Top 5 And Bottom 5 Customers Based On Their Profit

    • Drag Customer Name field on to Rows, and Profit on to Columns.
    • Right-click on the Customer Name column to create a set
    • Give a name to the set and select the top tab to choose the top 5 customers by sum
    • Similarly, create a set for the bottom five customers by sum
    • Select both the sets, right-click to create a combined set. Give a name to the set and choose All members in both sets.
    • Drag top and bottom customers set on to Filters, and Profit field on to Colour to get the desired result.

    How Would You Implement The Insertion Sort Algorithm

    • We treat the first element of the array as already sorted. The second element is stored separately as the key and compared with the first, which sorts the first two elements. We then take the third element and compare it with the elements to its left, moving it into its correct position. This process repeats for each remaining element until the whole array is sorted.

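    int[] a = {5, 2, 4, 6, 1, 3};  // sample input; any int array works

    // Grow the sorted prefix a[0..m-1] by one element per pass.
    for (int m = 1; m < a.length; m++) {
        int n = m;
        // Swap a[n] leftward until the element before it is not larger.
        while (n > 0 && a[n - 1] > a[n]) {
            int k = a[n];
            a[n] = a[n - 1];
            a[n - 1] = k;
            n--;
        }
    }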
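The description above stores the current element in a key and shifts larger elements to the right. A minimal Python sketch of that key-based variant follows; the function name and the choice to return a new list rather than sort in place are assumptions made for illustration.

```python
def insertion_sort(values):
    """Return a sorted copy of `values` using key-based insertion sort."""
    result = list(values)
    for i in range(1, len(result)):
        key = result[i]  # element to insert into the sorted prefix
        j = i - 1
        # Shift every larger element of the sorted prefix one slot right.
        while j >= 0 and result[j] > key:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = key  # drop the key into the gap
    return result


print(insertion_sort([5, 2, 4, 6, 1, 3]))
```

Shifting instead of swapping writes each element once per step, which is the usual reason to prefer the key-based form.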


    Have Your Questions Ready

    While it's important to be thinking about the questions you'll have to answer, it's also essential to have some questions ready that you will ask at the end of the interview.

    Many overlook this, but it is an excellent way for you to find out more about the role, decide whether it is definitely for you, and show your interest in the position and company. Some examples of questions include:

    • What is the metric on which my performance will be evaluated?

    • How will the projects I work on align with key business goals?

    • What are the top three reasons you like working here?

    • What are the most immediate projects that need to be addressed?


    Given A List Of Timestamps In Sequential Order Return A List Of Lists Grouped By Week Using The First Timestamp As The Starting Point

    This question sounds like it should be a SQL question, doesn't it? Weekly aggregation implies a form of GROUP BY in a regular SQL or Pandas question. In either case, aggregating a dataset of this form by week would be fairly trivial.

    But as a scripting question, this task is trying to pry out if the candidate is comfortable dealing with unstructured data, as data scientists may be forced to deal with a lot of unstructured data depending on their specific role or company.

    In this function, we have to do a few things:

  • Loop through all of the datetimes.
  • Set a beginning timestamp as our reference point.
  • Check if the next timestamp in the array is more than seven days ahead.
      a. If so, set the new timestamp as the reference point.
      b. If not, continue to loop through and append the last value.
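A minimal sketch of one reading of this task: the first timestamp stays the fixed origin, and a new group starts whenever a timestamp falls into a later seven-day window. This is a variation on the steps above, which move the reference point forward instead. The function name, the string input, and the `%Y-%m-%d` format are assumptions.

```python
from datetime import datetime


def group_by_week(timestamps, fmt="%Y-%m-%d"):
    """Group sorted timestamp strings into 7-day windows from the first one."""
    if not timestamps:
        return []
    start = datetime.strptime(timestamps[0], fmt)
    groups = []
    current_week = None
    for ts in timestamps:
        # Index of the 7-day window this timestamp falls into.
        week = (datetime.strptime(ts, fmt) - start).days // 7
        if week != current_week:
            groups.append([])  # first timestamp seen in a new window
            current_week = week
        groups[-1].append(ts)
    return groups


weeks = group_by_week(["2019-01-01", "2019-01-02", "2019-01-08", "2019-02-01"])
```

Windows that contain no timestamps simply produce no group, so the output never holds empty lists.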
    This Python question explores the concept of stemming, which is the heuristic of chopping off the end of a word to clean and bucket it into an easier feature set.


    roots = ["cat", "bat", "rat"]
    sentence = "the cattle was rattled by the battery"

    Output: "the cat was rat by the bat"
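A minimal sketch of one way to solve this, assuming the roots are "cat", "bat", and "rat" (inferred from the expected output) and that a word is replaced by the shortest root it starts with. The function name is hypothetical.

```python
def replace_words(roots, sentence):
    """Replace each word with the shortest root that is a prefix of it."""
    result = []
    for word in sentence.split():
        matches = [root for root in roots if word.startswith(root)]
        # Keep the word unchanged when no root is a prefix of it.
        result.append(min(matches, key=len) if matches else word)
    return " ".join(result)


stemmed = replace_words(["cat", "bat", "rat"], "the cattle was rattled by the battery")
```

For a large root list, a trie would avoid scanning every root for every word; the linear scan here keeps the sketch short.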


    Python Coding Interview Question #1: Business Name Lengths

    The next question is by the City of San Francisco:

    Find the number of words in each business name. Avoid counting special symbols as words. Output the business name and its count of words.


    When answering the question, first keep only distinct businesses using the drop_duplicates function. Then use the replace function to replace all the special symbols with an empty string, so you don't count them later. Use the split function to split the text into a list of words, and then use the len function to count the words.
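The same four steps can be sketched in plain Python without pandas, which keeps the example self-contained. The function name and the sample names are hypothetical, and treating everything except letters, digits, and spaces as a special symbol is an assumption.

```python
import re


def count_words(business_names):
    """Map each distinct business name to its number of words."""
    distinct = dict.fromkeys(business_names)  # drops duplicates, keeps order
    counts = {}
    for name in distinct:
        # Remove special symbols so a lone "&" or "-" is not counted as a word.
        cleaned = re.sub(r"[^A-Za-z0-9 ]", "", name)
        counts[name] = len(cleaned.split())
    return counts


result = count_words(["Joe's Cafe & Grill", "Joe's Cafe & Grill", "Pizza - Hut"])
```

In pandas, the corresponding calls would be `drop_duplicates`, `str.replace` with a regular expression, `str.split`, and `str.len` on the name column.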

    Questions On Product Sense And Business Applications


    These questions are specific to the business and how you would use data science. Answering these questions well can demonstrate your ability to apply your data science knowledge to a business capacity, rather than just understanding theory. Questions will likely be particular to the role, but use the following as a guide:

    • “We are looking to improve a new feature for our product. What metrics would you track to make sure it's a good idea?”

    • “If we were looking to grow X metric on X feature, how might we achieve that?”

    • “Tell me about a time you set about aligning data projects with company goals.”

    • “When measuring the impact of a search toolbar change, which metric would you use?”


    Difference Between An Error And A Residual Error

    The difference between an error and a residual error is defined below:

    An error is the difference between an observed value and the true value it estimates.

    Some of the popular ways of measuring error in data science, computed in practice from the differences between actual and predicted values, are:

    • Root Mean Squared Error
    • Mean Absolute Error
    • Mean Squared Error
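The three metrics listed above can be sketched in a few lines. The function names are hypothetical, and the inputs are assumed to be equal-length sequences of numbers.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the MSE, in the same units as the data."""
    return math.sqrt(mean_squared_error(actual, predicted))


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual = [3, 5, 7]
predicted = [2, 5, 9]
mse = mean_squared_error(actual, predicted)       # (1 + 0 + 4) / 3
rmse = root_mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)      # (1 + 0 + 2) / 3
```

Squaring penalizes large misses more heavily than the absolute value does, so MSE and RMSE are more sensitive to outliers than MAE.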

    A residual error is the difference between an observed value and the value estimated from the sample, such as the sample mean or a model's prediction.

    An error is generally unobservable, because the true population value is unknown.

    A residual error can be computed from the data and represented using a graph.

    A residual error shows how the observed data differ from the sample estimate.

    An error shows how the observed data differ from the actual population value.

    Write A Function To Return A 5

    More context: let's say we have a five-by-five matrix num_employees where each row is a company and each column represents a department. Each cell of the matrix displays the number of employees working in that particular department at each company.

    To reconstruct the new array, loop through every cell in a department and divide by the total number of employees of the whole company, which is the sum of the whole row.
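The row-wise division described above can be sketched as follows. The function name is hypothetical, and returning zeros for a company with no employees is an assumption made to avoid dividing by zero.

```python
def normalize_rows(num_employees):
    """Convert each cell to its share of the row (company) total."""
    result = []
    for row in num_employees:
        total = sum(row)  # total employees at this company
        # A company with zero employees gets a row of zeros.
        result.append([cell / total if total else 0.0 for cell in row])
    return result


shares = normalize_rows([[1, 1], [2, 6]])
```

The sketch works for a matrix of any size, so the five-by-five case needs no special handling.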



    closest_key-> 'm'

    With this question, ask: Is your computed distance always positive? Negative values for distance will interfere with getting an accurate result.



    The idea is that we need to try every matching substring of string1 and string2. So, for example, if we have string1 = "abbc" and string2 = "acc", we can take the first letter of string1, "a", and look for a match in string2. Once we find one, we are left with the same problem on a smaller portion of the two strings. The remaining part of string1 will be "bbc" and that of string2 "cc", and we repeat the process.

    • In the second iteration, we don't find a match for the first "b" of "bbc" in "cc".
    • In the third iteration, we don't find a match for the second "b" of "bbc" in "cc".
    • Finally, the "c" of "bbc" matches the first "c" of "cc".
    • We have finished string1, and the result is "ac".
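The walk-through above can be sketched as a greedy left-to-right scan. This is an assumption about the intended approach: it reproduces the steps shown for "abbc" and "acc", but a greedy scan is not guaranteed to find the longest common subsequence for every input. The function name is hypothetical.

```python
def greedy_common_subsequence(string1, string2):
    """Match each character of string1 against the unused rest of string2."""
    result = []
    start = 0  # first position of string2 that is still available
    for ch in string1:
        idx = string2.find(ch, start)
        if idx != -1:
            result.append(ch)
            start = idx + 1  # continue after the matched character
    return "".join(result)


common = greedy_common_subsequence("abbc", "acc")
```

Trying every possible match, as the first sentence of the explanation suggests, would call for recursion or dynamic programming instead of this single pass.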


    To write code that returns the first non-repeated letter, we can use a LinkedHashMap to store the character counts. This map preserves insertion order, so characters appear in it in the same order as in the string. After scanning the string, we iterate over the LinkedHashMap and pick the first entry whose value is 1.

    Another way to approach this problem, here called firstNonRepeatingChar, identifies the first non-repeated character in a single pass. This approach uses two storage structures to save an iteration. It stores non-repeated and repeated characters separately, and when the iteration ends, the required character is the first element in the list of non-repeated characters.
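A minimal Python sketch of the count-then-scan idea: `collections.Counter` stands in for the LinkedHashMap, and the second loop walks the string in its original order. The function name and the choice to return `None` when every character repeats are assumptions.

```python
from collections import Counter


def first_non_repeated(text):
    """Return the first character that occurs exactly once, or None."""
    counts = Counter(text)  # first pass: count every character
    for ch in text:         # second pass: scan in original order
        if counts[ch] == 1:
            return ch
    return None


answer = first_non_repeated("swiss")
```

Scanning the string itself on the second pass gives the insertion order for free, so no ordered map is strictly required in Python.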

    2. How can you remove duplicates from arrays?

    One option is to use a LinkedHashSet, which drops duplicates while retaining the original insertion order of the elements. Interviewers may also ask you to solve this kind of question with loops or recursion instead.

    The main difficulty when dealing with arrays is not finding the duplicate elements but removing them. Arrays are static data structures of fixed length, so they cannot be resized. To delete elements from an array, you need to create a new array and copy the content you want to keep into it.
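In Python, a dict plays the role that LinkedHashSet plays in Java, because dict keys are unique and keep their insertion order. A minimal sketch, with a hypothetical function name and the assumption that the elements are hashable:

```python
def remove_duplicates(items):
    """Return a new list with duplicates removed, keeping first occurrences."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(items))


unique = remove_duplicates([3, 1, 3, 2, 1])
```

As in the array case described above, the original sequence is left untouched and the result is built as a new list.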


    • replace
    • replace
    • replaceFirst

    + Top Mcqs On Big Data And Answers


    Multiple Choice Questions on Big-Data.

    1. As companies move past the experimental phase with Hadoop, many cite the need for additional capabilities, including _______________
    a) Improved data storage and information retrieval
    b) Improved extract, transform and load features for data integration
    c) Improved data warehousing functionality
    d) Improved security, workload management, and SQL support

    Answer: d
    Clarification: Adding security to Hadoop is challenging because not all of the interactions follow the classic client-server pattern.

    2. Point out the correct statement.
    a) Hadoop does need specialized hardware to process the data
    b) Hadoop 2.0 allows live stream processing of real-time data
    c) In the Hadoop programming framework, output files are divided into lines or records
    d) None of the mentioned

    Answer: b
    Clarification: Hadoop batch processes data distributed over a number of computers ranging in the 100s and 1000s.

    3. According to analysts, for what can traditional IT systems provide a foundation when they're integrated with big data technologies like Hadoop?
    a) Big data management and data mining
    b) Data warehousing and business intelligence
    c) Management of Hadoop clusters
    d) Collecting and storing unstructured data

    Answer: a
    Clarification: Data warehousing integrated with Hadoop would give a better understanding of data.

    Answer: a
    Clarification: To use Hive with HBase you'll typically want to launch two clusters, one to run HBase and the other to run Hive.


    Q110 What Are The Variants Of Back Propagation

    • Stochastic Gradient Descent: We use only a single training example for calculation of gradient and update parameters.

    • Batch Gradient Descent: We calculate the gradient for the whole dataset and perform the update at each iteration.

    • Mini-batch Gradient Descent: It's one of the most popular optimization algorithms. It's a variant of Stochastic Gradient Descent in which a mini-batch of samples is used instead of a single training example.
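The three variants differ only in how many examples feed each parameter update. A minimal sketch on a one-parameter model y = w * x with squared-error loss follows; the model, the function name, and the learning-rate and epoch defaults are all assumptions, and the data is not shuffled between epochs, which a real stochastic implementation normally would do.

```python
def fit_slope(xs, ys, batch_size, lr=0.01, epochs=200):
    """Fit y = w * x by gradient descent with the given batch size."""
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # batch_size == 1 -> stochastic, batch_size == n -> batch,
        # anything in between -> mini-batch.
        for start in range(0, n, batch_size):
            bx = xs[start:start + batch_size]
            by = ys[start:start + batch_size]
            # Gradient of the mean squared error over this batch.
            grad = sum(2 * (w * x - y) * x for x, y in zip(bx, by)) / len(bx)
            w -= lr * grad
    return w


xs, ys = [1, 2, 3, 4], [2, 4, 6, 8]
w_sgd = fit_slope(xs, ys, batch_size=1)
w_mini = fit_slope(xs, ys, batch_size=2)
w_batch = fit_slope(xs, ys, batch_size=len(xs))
```

On this noise-free data all three settings approach the same slope; they differ in how many updates they make per pass over the data.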

    Tensorflow, Pytorch

    Binary Tree Coding Interview Questions

    So far, we have looked only at linear data structures, but not all information in the real world can be represented in a linear fashion, and that's where the tree data structure helps.

    A tree is a data structure that allows you to store your data in a hierarchical fashion. Depending on how you store data, there are different types of trees, such as a binary tree, where each node has, at most, two child nodes.

    Along with its close cousin, the binary search tree, it's also one of the most popular tree data structures. Therefore, you will find a lot of questions based on them, such as how to traverse them, count nodes, find depth, and check if they are balanced or not.

    A key point to solving binary tree questions is a strong knowledge of theory, e.g. what is the size or depth of the binary tree, what is a leaf, and what is a node, as well as an understanding of the popular traversing algorithms, e.g. pre-, post-, and in-order traversal.
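The three traversal orders named above differ only in when the current node is visited relative to its subtrees. A minimal recursive sketch follows; the `Node` class and function names are hypothetical.

```python
class Node:
    """A binary tree node with at most two children."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def preorder(node):
    """Visit the node, then its left subtree, then its right subtree."""
    if node is None:
        return []
    return [node.value] + preorder(node.left) + preorder(node.right)


def inorder(node):
    """Visit the left subtree, then the node, then the right subtree."""
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)


def postorder(node):
    """Visit the left subtree, then the right subtree, then the node."""
    if node is None:
        return []
    return postorder(node.left) + postorder(node.right) + [node.value]


def depth(node):
    """Number of nodes on the longest path from this node down to a leaf."""
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))


root = Node(2, Node(1), Node(3))
```

For a binary search tree such as `root`, the in-order traversal returns the values in sorted order.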

    Here is a list of popular binary tree-based coding questions from software engineer or developer job interviews:

  • How is a binary search tree implemented?
  • How do you perform preorder traversal in a given binary tree?
  • How do you traverse a given binary tree in preorder without recursion?
  • How do you perform an inorder traversal in a given binary tree?
  • How do you print all nodes of a given binary tree using inorder traversal without recursion?
  • How do you implement a postorder traversal algorithm?

    Q26 What Is The Difference Between Point Estimates And Confidence Interval

    Point Estimation gives us a particular value as an estimate of a population parameter. Method of Moments and Maximum Likelihood estimator methods are used to derive Point Estimators for population parameters.

    A confidence interval gives us a range of values which is likely to contain the population parameter. The confidence interval is generally preferred, as it tells us how likely this interval is to contain the population parameter. This likelihood or probability is called the confidence level or confidence coefficient and is represented by 1 - alpha, where alpha is the level of significance.
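A minimal sketch of both ideas for a population mean: the sample mean is the point estimate, and the interval is built around it with a normal approximation. The function name is hypothetical, and using the normal quantile rather than a t quantile is an assumption that is reasonable only for larger samples.

```python
import math
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Return (point_estimate, (low, high)) for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)  # point estimate
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value for confidence level 1 - alpha.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return mean, (mean - z * std_err, mean + z * std_err)


estimate, (low, high) = mean_confidence_interval([2, 4, 4, 4, 5, 5, 7, 9])
```

Raising the confidence level lowers alpha and widens the interval, which is the trade-off between certainty and precision.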

    Q27. What is the goal of A/B Testing?

    It is a hypothesis test for a randomized experiment with two variants, A and B.

    The goal of A/B testing is to identify which changes to a web page maximize or increase the outcome of interest. A/B testing is a useful method for figuring out the best online promotional and marketing strategies for your business. It can be used to test everything from website copy to sales emails to search ads.

    An example of this could be identifying the click-through rate for a banner ad.

    Q28. What is p-value?

    When you perform a hypothesis test in statistics, a p-value can help you determine the strength of your results. p-value is a number between 0 and 1. Based on the value it will denote the strength of the results. The claim which is on trial is called the Null Hypothesis.

    Probability of not seeing any shooting star in 15 minutes is

    = 1 - P(seeing a shooting star in 15 minutes) = 1 - 0.2 = 0.8

    Explain The Difference Between Structured Data And Unstructured Data


    Data engineers must turn unstructured data into structured data for data analysis using different methods for transformation. First, you can explain the difference between the two.

    Structured data is made up of well-defined data types with patterns that make them easily searchable, whereas unstructured data is a bundle of files in various formats, such as videos, photos, texts, audio, and more.

    Unstructured data exists in unmanaged file structures, so engineers collect, manage, and store it in database management systems, turning it into structured data that is searchable. Unstructured data might be entered manually or loaded through coded batch processing, and an ELT (extract, load, transform) process is then used to transform and integrate it into a cloud-based data warehouse.

    Second, you can share a situation in which you transformed data into a structured format, drawing from learning projects if you're lacking professional experience.


    Difference Between Normalisation And Standardization



    • Standardization is the technique of rescaling data so that it has a mean of 0 and a standard deviation of 1. It is also known as z-score scaling.
    • Normalization is the technique of rescaling all data values to lie between 0 and 1. It is also known as min-max scaling.
    • Standardization centers and scales the data but does not change the shape of its distribution, so the result follows the standard normal distribution only if the original data was normally distributed.
    • Normalization guarantees that every value falls in the 0 to 1 range.
    • Normalization formula:

    X' = (X - Xmin) / (Xmax - Xmin)

    Xmin: the feature's minimum value,

    Xmax: the feature's maximum value.

    • Standardization formula:

    X' = (X - μ) / σ

    μ: the feature's mean,

    σ: the feature's standard deviation.
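Both rescalings can be sketched in a few lines of plain Python. The function names are hypothetical, and two choices here are assumptions: the population standard deviation is used for standardization, and a constant feature is mapped to zeros to avoid dividing by zero.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly into the range 0 to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant feature: no spread to scale
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


normalized = min_max_normalize([1, 2, 3])  # 0, 0.5 and 1
z_scores = standardize([1, 2, 3])
```

Min-max scaling is sensitive to outliers, since a single extreme value sets the range, whereas z-scores are not bounded to a fixed interval.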

    What Tools Did You Use On The Project

    What they're really asking: How did you arrive at your decision to use certain tools?

    Data engineers must manage huge swaths of data, so they need to use the right tools and technologies to gather and prepare it all. If you have experience using different tools such as Hadoop, MongoDB, and Kafka, you'll want to explain which one you used for that particular project.

    You can go into detail about the ETL systems you used to move data from databases into a data warehouse, such as Stitch, Alooma, Xplenty, and Talend. Some tools work better for back-end, so if you can communicate strong decision-making abilities, then you'll shine as a candidate who's confident in their skills.

    The interviewer might also ask:

    • What are your favorite tools to use, and why?

    • Compare and contrast two or three tools that you used on a recent project.
