One Way To Improve Software Design Skills Is To Understand The Architecture Of Famous Systems
One way to learn system design is to read the technical papers of famous systems. Unfortunately, reading a paper is generally considered hard, and with the system design interview in mind, we are not interested in many of the details mentioned in the papers. This course extracts the most relevant details about the system architecture that the creators had in mind while designing the system. Keeping system design interviews in mind, we will focus on the various tradeoffs that the original developers considered and what prompted them to choose a certain design given their constraints.
Furthermore, systems grow over time; thus, original designs are revised, adapted, and enhanced to cater to emerging requirements. This means reading the original papers is not enough. This course will cover criticism of the original designs and the architectural changes that followed to overcome design limitations and satisfy growing needs.
What to expect

The course has two parts: System Design Case Studies and System Design Patterns.
Snitch Keeps Track Of The Network Topology Of Cassandra Nodes
The Snitch determines the proximity of nodes within the ring and also monitors read latencies to avoid reading from nodes that have slowed down. Each node in Cassandra uses this information to route requests efficiently. Cassandra's replication strategy uses the information provided by the Snitch to spread replicas across the cluster intelligently. Cassandra will do its best not to place more than one replica on the same rack.
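The rack-spreading behavior described above can be sketched as follows. This is an illustrative simplification, not Cassandra's actual NetworkTopologyStrategy; the node and rack names are made up:

```python
# Sketch of rack-aware replica placement: walk the ring in order and
# prefer nodes on racks that do not yet hold a replica.

def place_replicas(ring, racks, replication_factor):
    """ring: node ids in ring order; racks: node -> rack name."""
    chosen, used_racks = [], set()
    # First pass: pick nodes on distinct racks.
    for node in ring:
        if len(chosen) == replication_factor:
            break
        if racks[node] not in used_racks:
            chosen.append(node)
            used_racks.add(racks[node])
    # Second pass: if there are fewer racks than replicas, fill up
    # with remaining nodes even if a rack repeats.
    for node in ring:
        if len(chosen) == replication_factor:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen

ring = ["n1", "n2", "n3", "n4"]
racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r3"}
print(place_replicas(ring, racks, 3))  # ['n1', 'n3', 'n4'] -- three distinct racks
```

Note that n2 is skipped on the first pass because its rack already holds a replica; a rack repeats only when there are fewer racks than replicas.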
The PACELC Theorem States That In A System That Replicates Data:
if there is a partition (P), a distributed system can trade off between availability (A) and consistency (C); else (E), when the system is running normally in the absence of partitions, the system can trade off between latency (L) and consistency (C).
The first part of the theorem is the same as the CAP theorem, and the ELC is the extension. The whole thesis assumes we maintain high availability by replication. So, when there is a failure, the CAP theorem prevails. But if not, we still have to consider the tradeoff between consistency and latency of a replicated system.
Examples

- Dynamo and Cassandra are PA/EL systems: they choose availability over consistency when a partition occurs; otherwise, they choose lower latency.
- BigTable and HBase are PC/EC systems: they will always choose consistency, giving up availability and lower latency.
- MongoDB is PA/EC: in case of a network partition, it chooses availability, but otherwise guarantees consistency.
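These classifications can be captured in a small lookup table (systems and letters taken from the examples above):

```python
# PACELC classifications from the examples above.
# "PA/EL" reads: on a Partition choose Availability; Else choose Latency.
PACELC = {
    "Dynamo":    ("A", "L"),   # PA/EL
    "Cassandra": ("A", "L"),   # PA/EL
    "BigTable":  ("C", "C"),   # PC/EC
    "HBase":     ("C", "C"),   # PC/EC
    "MongoDB":   ("A", "C"),   # PA/EC
}

def describe(system):
    p, e = PACELC[system]
    return f"P{p}/E{e}"

print(describe("MongoDB"))  # PA/EC
```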
Who Are The Authors Of This Course
This course was developed by Design Gurus, a group of senior engineers and hiring managers who've been working at Facebook, Google, Amazon, and Microsoft. The authors have extensive experience conducting system design interviews and know exactly what is asked in these interviews. Other courses developed by the same team can be found on their website.

Let's evaluate different aspects of this course.
To Boost Read Performance, Cassandra Provides Three Optional Forms Of Caching:
1. Row cache: The row cache stores frequently read rows. It keeps a complete data row, which can be returned directly to the client if requested by a read operation. This can significantly speed up read access for frequently accessed rows, at the cost of more memory usage.
2. Key cache: The key cache stores a map of recently read partition keys to their SSTable offsets. This facilitates faster read access into SSTables stored on disk and improves read performance, but could slow down writes, as we have to update the key cache for every write.
3. Chunk cache: The chunk cache stores uncompressed chunks of data read from SSTable files that are accessed frequently.
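The key cache described in point 2 behaves roughly like a bounded LRU map from partition keys to SSTable offsets. A minimal sketch, assuming a simple capacity-bounded cache (not Cassandra's actual implementation):

```python
from collections import OrderedDict

# Toy key cache: an LRU map from partition key to SSTable offset, so a
# read can seek directly instead of scanning the on-disk index.
class KeyCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> SSTable offset

    def put(self, key, offset):
        self.entries[key] = offset
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

    def get(self, key):
        if key not in self.entries:
            return None  # cache miss: fall back to the on-disk index
        self.entries.move_to_end(key)
        return self.entries[key]

cache = KeyCache(capacity=2)
cache.put("user:1", 4096)
cache.put("user:2", 8192)
cache.put("user:3", 12288)   # evicts user:1
print(cache.get("user:1"))   # None
print(cache.get("user:2"))   # 8192
```

The write-path cost mentioned in the text shows up here as the `put` call that every write must perform to keep offsets current.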
Jeopardy: If A Client's Local Lease Timeout Expires, It Becomes Unsure Whether The Master Has Terminated Its Session
Grace period: When a session is in jeopardy, the client waits for an extra time called the grace period, 45s by default. If the client and master manage to exchange a successful KeepAlive before the end of the client's grace period, the client enables its cache once more. Otherwise, the client assumes that the session has expired.
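The lease/jeopardy/grace-period states can be sketched as a small function. This is an assumed simplification of Chubby's protocol, using the 45-second default from the text:

```python
# Client-side session states: past the lease expiry the session is in
# jeopardy; a KeepAlive within the grace period recovers it, otherwise
# the session is treated as expired.

GRACE_PERIOD = 45  # seconds, the default mentioned in the text

def session_state(now, lease_expiry, last_keepalive):
    if now <= lease_expiry:
        return "valid"
    if last_keepalive > lease_expiry:
        return "valid"      # a KeepAlive during jeopardy re-enabled the session
    if now <= lease_expiry + GRACE_PERIOD:
        return "jeopardy"   # cache disabled, waiting for a KeepAlive
    return "expired"

print(session_state(now=30, lease_expiry=60, last_keepalive=0))   # valid
print(session_state(now=80, lease_expiry=60, last_keepalive=0))   # jeopardy
print(session_state(now=120, lease_expiry=60, last_keepalive=0))  # expired
```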
A Chunk Is Prioritized Based On How Far It Is From Its Replication Goal
Deleted chunks are garbage collected after a few days. Replicas of deleted files can exist for a few days as well.
The master rebalances replicas regularly to achieve load balancing and better disk space usage. It may move replicas from one ChunkServer to another to bring disk usage in a server closer to the average. Any new ChunkServer added to the cluster is filled up gradually by the master rather than flooded with heavy write traffic.
The Following List Contains Criticism Of Dynamo's Design:
- Each Dynamo node contains the entire Dynamo routing table. This is likely to affect the scalability of the system, as the routing table grows larger and larger as nodes are added.
- Dynamo implies that it strives for symmetry, where every node in the system has the same set of roles and responsibilities, but later it designates some nodes as seeds. Seeds are special nodes that are externally discoverable and are used to help prevent logical partitions in the Dynamo ring. This seems to violate Dynamo's symmetry principle.
- Although security was not a concern, as Dynamo was built for internal use only, DHTs can be susceptible to several different types of attacks. While Amazon can assume a trusted environment, buggy software can sometimes act in a manner quite similar to a malicious actor.
- Dynamo's design can be described as a leaky abstraction, where client applications are often asked to manage inconsistency and the user experience is not 100% seamless. For example, inconsistencies in shopping cart items may lead users to think that the website is buggy or unreliable.
What Happens When The NameNode Fails
Metadata backup

On a NameNode failure, the metadata would be unavailable, and a disk failure on the NameNode would be catastrophic because the file metadata would be lost; there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes. For this reason, it is crucial to make the NameNode resilient to failure, and HDFS provides two mechanisms for this:
In Terms Of The CAP Theorem, BigTable Is A CP System, i.e., It Has Strictly Consistent Reads And Writes
BigTable was developed at Google and has been in use since 2005 in dozens of Google services. Because of the large scale of its services, Google could not use commercial databases. Also, the cost of using an external solution would have been too high. That is why Google chose to build an in-house solution. BigTable is a highly available and high-performing database that powers multiple applications across Google, where each application has different needs in terms of the size of data to be stored and the latency with which results are expected.
Though BigTable is not open-source itself, its paper was crucial in inspiring powerful open-source databases like Cassandra, HBase, and Hypertable.
Google built BigTable to store large amounts of data and perform thousands of queries per second on that data. Examples of BigTable data are billions of URLs with many versions per page, petabytes of Google Earth data, and billions of users' search data.
BigTable is suitable for storing large datasets that are greater than one TB where each row is less than 10MB. Since BigTable does not provide ACID properties or transaction support, Online Transaction Processing (OLTP) applications with transactional workloads should not use BigTable. For BigTable, data should be structured in the form of key-value pairs or rows and columns. Unstructured data like images or movies should not be stored in BigTable.
Lazy Space Allocation
One disadvantage of having a large chunk size is the handling of small files. Since a small file has one or a few chunks, the ChunkServers storing those chunks can become hotspots if a lot of clients access the same file. To handle this scenario, GFS stores such files with a higher replication factor and also adds a random delay in the start times of the applications accessing these files.
Let’s explore how GFS manages the filesystem metadata.
Making Replica Placement Decisions
The master acquires locks over a namespace region to ensure proper serialization and to allow multiple concurrent operations at the master. GFS does not have an i-node-like tree structure for directories and files. Instead, it has a hash map that maps a filename to its metadata, and reader-writer locks are applied on each node of the hash table for synchronization.
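The per-node reader-writer locking can be illustrated by computing which locks an operation must take. This sketch follows the GFS paper's rule (read locks on every ancestor path, a write lock on the full pathname); it is not GFS code:

```python
# For a mutation on /d1/d2/.../leaf, GFS takes read locks on each
# ancestor path and a write lock on the full pathname. The read lock on
# an ancestor conflicts with a write lock on it, so e.g. a directory
# cannot be deleted while a file is being created inside it.

def locks_for(path):
    """Return (read_locks, write_locks) for a mutation on `path`."""
    parts = path.strip("/").split("/")
    read_locks = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    write_locks = ["/" + "/".join(parts)]
    return read_locks, write_locks

r, w = locks_for("/home/user/file")
print(r)  # ['/home', '/home/user']
print(w)  # ['/home/user/file']
```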
Why Does Chubby Export A Unix-Like File System Interface
Recall that Chubby exports a file system interface similar to Unix but simpler. It consists of a strict tree of files and directories in the usual way, with name components separated by slashes.
The main reason why Chubby's naming structure resembles a file system is to make it available to applications both with its own specialized API and via interfaces used by other file systems, such as the Google File System. This significantly reduced the effort needed to write basic browsing and namespace manipulation tools, and reduced the need to educate casual Chubby users. However, only a very limited number of operations can be performed on these files, e.g., Create, Delete, etc.
A Chubby path has three components. The 'ls' prefix is common to all Chubby paths and stands for 'lock service'. The second component is the name of the Chubby cell; it is resolved to one or more Chubby servers via DNS lookup. The remainder path is interpreted within the Chubby cell.
Unfortunately, It Is Not Easy To Model The Data As CRDTs In Many Cases
Instead of vector clocks, Dynamo also offers ways to resolve conflicts automatically on the server side. Dynamo often uses a simple conflict resolution policy: last-write-wins (LWW), based on the wall-clock timestamp. LWW can easily end up losing data. For example, if two conflicting writes happen simultaneously, it is equivalent to flipping a coin on which write to throw away.
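A minimal sketch of last-write-wins resolution, showing where the data loss comes from (illustrative, not Dynamo's code):

```python
# Last-write-wins: keep the version with the higher wall-clock timestamp
# and silently drop the other. With equal timestamps the tie-break is
# arbitrary, which is where the "coin flip" data loss hides.

def lww_merge(a, b):
    """Each version is (timestamp, value); the 'older' write is discarded."""
    return a if a[0] >= b[0] else b

# Two conflicting writes with the same timestamp: one simply vanishes.
print(lww_merge((100, "cart=[book]"), (100, "cart=[pen]")))
# (100, 'cart=[book]') -- cart=[pen] is lost with no trace
```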
Here Are The Advantages Of Lazy Deletion
- Simple and reliable. If the chunk deletion message is lost, the master does not have to retry. The ChunkServer can perform the garbage collection with the subsequent heartbeat messages.
- GFS merges storage reclamation into the regular background activities of the master, such as the regular scans of the filesystem or the exchange of HeartBeat messages. Thus, it is done in batches, and the cost is amortized.
- Garbage collection takes place when the master is relatively free.
- Lazy deletion provides safety against accidental, irreversible deletions.
Chubby Is A Highly Available And Persistent Distributed Locking Service
Chubby usually runs with five active replicas, one of which is elected as the master to serve requests. To remain alive, a majority of Chubby replicas must be running. BigTable depends on Chubby so much that if Chubby is unavailable for an extended period of time, BigTable will also become unavailable. Chubby uses the Paxos algorithm to keep its replicas consistent in the face of failure.
Chubby provides a namespace consisting of files and directories. Each file or directory can be used as a lock. Read and write access to a Chubby file is atomic. Each Chubby client maintains a session with the Chubby service. A client's session expires if it is unable to renew its session lease within the lease expiration time. When a client's session expires, it loses any locks and open handles. Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration.

In BigTable, Chubby is used to:

- Ensure there is only one active master. The master maintains a session lease with Chubby and periodically renews it to retain its status as master.
- Store the bootstrap location of BigTable data
- Discover new Tablet servers as well as the failure of existing ones
- Store BigTable schema information
- Store Access Control Lists
Let’s explore the components that constitute BigTable.
[Figure: publishers publishing messages (M0-M7) to the 'Campaigns' topic, with subscribers 4 and 5 consuming them]
Producers

Producers are applications that publish records to Kafka.
Consumers

Consumers are the applications that subscribe to data from Kafka topics. Consumers subscribe to one or more topics and consume published messages by pulling data from the brokers.
In Kafka, producers and consumers are fully decoupled and agnostic of each other, which is a key design element to achieve the high scalability that Kafka is known for. For example, producers never need to wait for consumers.

At a high level, applications send messages to a Kafka broker, and these messages are read by other applications called consumers. Messages get stored in a topic, and consumers subscribe to the topic to receive new messages.

Kafka is run as a cluster of one or more servers, where each server is responsible for running one Kafka broker.
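The decoupled topic/pull model above can be sketched with a toy in-memory broker. This is illustrative only and is not Kafka's API; names like `Broker.pull` are made up:

```python
# Toy broker: producers append to a topic's log; each consumer pulls at
# its own pace using its own offset, so producers never wait on consumers.

class Broker:
    def __init__(self):
        self.topics = {}  # topic name -> append-only list of messages

    def publish(self, topic, message):
        self.topics.setdefault(topic, []).append(message)

    def pull(self, topic, offset, max_messages=10):
        """Return (batch, next_offset); the consumer owns its offset."""
        log = self.topics.get(topic, [])
        batch = log[offset:offset + max_messages]
        return batch, offset + len(batch)

broker = Broker()
for i in range(3):
    broker.publish("campaigns", f"M{i}")

offset = 0  # each consumer tracks its own position independently
batch, offset = broker.pull("campaigns", offset, max_messages=2)
print(batch, offset)  # ['M0', 'M1'] 2
```

Because the log is retained rather than handed off, a second consumer can start from offset 0 and read the same messages without affecting the first.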
ZooKeeper

ZooKeeper is a distributed key-value store used for coordination and storing configurations. It is highly optimized for reads. Kafka uses ZooKeeper to coordinate between Kafka brokers; ZooKeeper maintains metadata information about the Kafka cluster. We will be looking into this in detail later.
In Kafka, Quotas Are Byte-Rate Thresholds Defined Per Client
The broker does not return an error when a client exceeds its quota but instead attempts to slow the client down. When the broker calculates that a client has exceeded its quota, it slows the client down by holding the client's response for enough time to keep the client under the quota. This approach keeps the quota violation transparent to clients. It also prevents clients from having to implement special back-off and retry behavior.
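The delay computation can be sketched as follows. This is an assumed simplification of Kafka's quota enforcement, not its actual code:

```python
# Hold the response just long enough that the client's average byte rate
# over (window + delay) drops back to the quota:
#   bytes_sent / (window + delay) == quota  =>  delay = bytes/quota - window

def throttle_delay(bytes_sent, window_seconds, quota_bytes_per_sec):
    """Seconds to hold the response; zero if the client is under quota."""
    delay = bytes_sent / quota_bytes_per_sec - window_seconds
    return max(0.0, delay)

# 5 MB sent in 1 s against a 1 MB/s quota -> hold the response 4 s.
print(throttle_delay(5_000_000, 1.0, 1_000_000))  # 4.0
```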
Kafka performance
What Is Clock Skew
On a single machine, all we need to know about is the absolute or wall-clock time: suppose we perform a write to key k with timestamp t1, and then perform another write to k with timestamp t2. Since t2 > t1, the second write must have been newer than the first write, and therefore the database can safely overwrite the original value.
In a distributed system, this assumption does not hold. The problem is clock skew, i.e., different clocks tend to run at different rates, so we cannot assume that time t on node a happened before time t + 1 on node b. The most practical techniques that help with synchronizing clocks, like NTP, still do not guarantee that every clock in a distributed system is synchronized at all times. So, without special hardware like GPS units and atomic clocks, just using wall-clock timestamps is not enough.
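A small demonstration of how skew breaks timestamp ordering; the 10-tick offset is an arbitrary assumption:

```python
# Node B's clock runs 10 ticks behind node A's. B writes *after* A in
# real time, yet B's timestamp is lower, so a timestamp comparison
# orders the writes wrongly.

SKEW_B = -10  # node B's clock offset relative to true time (assumed)

def stamp(true_time, offset=0):
    return true_time + offset

a_write = (stamp(100), "value-from-A")          # happens at real time 100
b_write = (stamp(105, SKEW_B), "value-from-B")  # happens at real time 105

newer = max(a_write, b_write)  # compare by timestamp, as LWW would
print(newer)  # (100, 'value-from-A') -- the truly older write "wins"
```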
To Learn More About Automatic Failover, Take A Look At Apache ZooKeeper
This lesson will explore some important aspects of the HDFS architecture.
HDFS provides a permissions model for files and directories which is similar to POSIX. Each file and directory is associated with an owner and a group. Each file or directory has separate permissions for the owner, other users who are members of its group, and all other users. There are three types of permission: read (r), write (w), and execute (x).
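This owner/group/others model can be illustrated with a minimal POSIX-style permission check (a hypothetical helper for illustration, not HDFS code):

```python
# Each file carries rwx bits for owner, group, and others, e.g. 0o640
# means owner rw-, group r--, others ---.

R, W, X = 4, 2, 1  # read, write, execute bits

def can_access(perm_octal, wanted, user, file_owner, file_group, user_groups):
    """Pick the bit triple that applies to `user`, then test `wanted`."""
    if user == file_owner:
        bits = (perm_octal >> 6) & 7
    elif file_group in user_groups:
        bits = (perm_octal >> 3) & 7
    else:
        bits = perm_octal & 7
    return bits & wanted == wanted

# alice owns the file (rw-), bob is in its group (r--), carol is "others"
print(can_access(0o640, R, "bob", "alice", "staff", {"staff"}))  # True
print(can_access(0o640, W, "bob", "alice", "staff", {"staff"}))  # False
print(can_access(0o640, R, "carol", "alice", "staff", set()))    # False
```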
To Improve Read Performance Tablet Servers Employ Two Levels Of Caching:
Scan Cache: It caches the key-value pairs returned by the SSTable interface and is useful for applications that read the same data multiple times.
Block Cache: It caches SSTable blocks read from GFS and is useful for applications that tend to read data which is close to the data they recently read.
Bloom filters
Any read operation has to read from all SSTables that make up a Tablet. If these SSTables are not in memory, the read operation may end up doing many disk accesses. To reduce the number of disk accesses, BigTable uses Bloom filters.
Bloom filters are created for SSTables. They help reduce the number of disk accesses by predicting whether an SSTable may contain data corresponding to a particular row/column pair. Bloom filters take a small amount of memory but can improve read performance drastically.
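A minimal Bloom filter sketch (illustrative, not BigTable's implementation). A negative answer is always correct, which is what lets a read skip an SSTable without touching disk:

```python
import hashlib

# Bloom filter: k hash positions per key are set in a bit array. If any
# position is unset for a lookup key, the key is definitely absent (no
# false negatives); if all are set, the key is *probably* present.

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # the bit array, stored as one big int

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("row1:col1")
print(bf.might_contain("row1:col1"))  # True  (never a false negative)
print(bf.might_contain("row9:col9"))  # almost certainly False
```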
Instead of maintaining separate commit log files for each Tablet, BigTable maintains one log file per Tablet server. This gives better write performance. Since each write has to go to the commit log, writing to a large number of log files would be slow, as it could cause a large number of disk seeks.
One disadvantage of having a single log file is that it complicates the Tablet recovery process. When a Tablet server dies, the Tablets that it served will be moved to other Tablet servers. To recover the state for a Tablet, the new Tablet server needs to reapply the mutations for that Tablet from the commit log written by the original Tablet server. However, the mutations for these
Client Stores A Hint For Server 4
Now, when the node that was down comes online again, how should we write data to it? Cassandra accomplishes this through hinted handoff.
When a node is down or does not respond to a write request, the coordinator node writes a hint in a text file on its local disk. This hint contains the data itself along with information about which node the data belongs to. When the coordinator node discovers from the Gossiper that a node for which it holds hints has recovered, it forwards the write requests for each hint to the target node. Furthermore, every ten minutes, each node checks to see if the failed nodes for which it is holding hints have recovered.
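The hint-then-replay flow can be sketched as follows (illustrative, not Cassandra's implementation; class and method names are made up):

```python
# Writes for a down replica are stored as hints on the coordinator and
# replayed when gossip reports that the replica has recovered.

class Coordinator:
    def __init__(self, replicas_up):
        self.replicas_up = set(replicas_up)
        self.hints = []      # (target_node, data) pairs held locally
        self.delivered = []  # (target_node, data) actually written

    def write(self, node, data):
        if node in self.replicas_up:
            self.delivered.append((node, data))
        else:
            self.hints.append((node, data))  # hold a hint for the down node

    def node_recovered(self, node):
        """Gossip says `node` is back: replay its hints."""
        self.replicas_up.add(node)
        remaining = []
        for target, data in self.hints:
            if target == node:
                self.delivered.append((target, data))  # forward the hint
            else:
                remaining.append((target, data))
        self.hints = remaining

coord = Coordinator(replicas_up={"n1", "n2"})
coord.write("n3", "row-42")  # n3 is down -> hint stored, nothing delivered
print(coord.hints)           # [('n3', 'row-42')]
coord.node_recovered("n3")   # n3 comes back -> hint replayed
print(coord.delivered)       # [('n3', 'row-42')]
```

This also illustrates the fragility discussed next: the hints live only on the coordinator, so if the coordinator dies before replay, they are gone.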
With consistency level Any, if all the replica nodes are down, the coordinator node will write hints for all the nodes and report success to the client. However, this data will not reappear in any subsequent reads until one of the replica nodes comes back online and the coordinator node successfully forwards the write requests to it. This assumes that the coordinator node is up when the replica node comes back. It also means that we can lose our data if the coordinator node dies and never comes back. For this reason, we should avoid using the Any consistency level.
One thing to remember: when the cluster cannot meet the consistency level specified by the client, Cassandra fails the write request and does not store a hint.