Docsity
Docsity

Prepare for your exams
Prepare for your exams

Study with the several resources on Docsity


Earn points to download
Earn points to download

Earn points by helping other students or get them with a premium plan


Guidelines and tips
Guidelines and tips

The system design interviews, Summaries of Systems Design

Summary of system design interviews

Typology: Summaries

2023/2024

Available from 04/08/2024

US-Summery
US-Summery 🇮🇹

4.2

(15)

937 documents

1 / 164

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
lOMoARcPSD|39591929
P. Marwedel,
The system design interviews
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27
pf28
pf29
pf2a
pf2b
pf2c
pf2d
pf2e
pf2f
pf30
pf31
pf32
pf33
pf34
pf35
pf36
pf37
pf38
pf39
pf3a
pf3b
pf3c
pf3d
pf3e
pf3f
pf40
pf41
pf42
pf43
pf44
pf45
pf46
pf47
pf48
pf49
pf4a
pf4b
pf4c
pf4d
pf4e
pf4f
pf50
pf51
pf52
pf53
pf54
pf55
pf56
pf57
pf58
pf59
pf5a
pf5b
pf5c
pf5d
pf5e
pf5f
pf60
pf61
pf62
pf63
pf64

Partial preview of the text

Download The system design interviews and more Summaries Systems Design in PDF only on Docsity!

lOMoARcPSD|

P. Marwedel,

The system design interviews

System Design Interviews: A step by step guide

A lot of software engineers struggle with system design interviews (SDIs) primarily because of three reasons:

  • The unstructured nature of SDIs, where they are asked to work on an open-ended design problem that doesn’t have a standard answer.
  • Their lack of experience in developing large scale systems.
  • They did not prepare for SDIs. Like coding interviews, candidates who haven’t put a conscious effort to prepare for SDIs, mostly perform poorly especially at top companies like Google, Facebook, Amazon, Microsoft, etc. In these companies, candidates who don’t perform above average, have a limited chance to get an offer. On the other hand, a good performance always results in a better offer (higher position and salary), since it shows the candidate’s ability to handle a complex system. In this course, we’ll follow a step by step approach to solve multiple design problems. First, let’s go through these steps:

Step 1: Requirements clarifications

It is always a good idea to ask questions about the exact scope of the problem we are solving. Design questions are mostly open-ended, and they don’t have ONE correct answer, that’s why clarifying ambiguities early in the interview becomes critical. Candidates who spend enough time to define the end goals of the system always have a better chance to be successful in the interview. Also, since we only have 35-40 minutes to design a (supposedly) large system, we should clarify what parts of the system we will be focusing on. Let’s expand this with an actual example of designing a Twitter-like service. Here are some questions for designing Twitter that should be answered before moving on to the next steps:

  • Will users of our service be able to post tweets and follow other people?
  • Should we also design to create and display the user’s timeline?
  • Will tweets contain photos and videos?
  • Are we focusing on the backend only or are we developing the front-end too?
  • Will users be able to search tweets?
  • Do we need to display hot trending topics?
  • Will there be any push notification for new (or important) tweets? All such question will determine how our end design will look like.

support a huge number of reads. We will also need a distributed file storage system for storing photos and videos.

Step 6: Detailed design

Dig deeper into two or three components; interviewer’s feedback should always guide us what parts of the system need further discussion. We should be able to present different approaches, their pros and cons, and explain why we will prefer one approach on the other. Remember there is no single answer, the only important thing is to consider tradeoffs between different options while keeping system constraints in mind.

  • Since we will be storing a massive amount of data, how should we partition our data to distribute it to multiple databases? Should we try to store all the data of a user on the same database? What issue could it cause?
  • How will we handle hot users who tweet a lot or follow lots of people?
  • Since users’ timeline will contain the most recent (and relevant) tweets, should we try to store our data in such a way that is optimized for scanning the latest tweets?
  • How much and at which layer should we introduce cache to speed things up?
  • What components need better load balancing?

Step 7: Identifying and resolving bottlenecks

Try to discuss as many bottlenecks as possible and different approaches to mitigate them.

  • Is there any single point of failure in our system? What are we doing to mitigate it?
  • Do we have enough replicas of the data so that if we lose a few servers we can still serve our users?
  • Similarly, do we have enough copies of different services running such that a few failures will not cause total system shutdown?
  • How are we monitoring the performance of our service? Do we get alerts whenever critical components fail or their performance degrades?

Summary

In short, preparation and being organized during the interview are the keys to be successful in system design interviews. The above-mentioned steps should guide you to remain on track and cover all the different aspects while designing a system. Let’s apply the above guidelines to design a few systems that are asked in SDIs.

Designing a URL Shortening service like

TinyURL

Let's design a URL shortening service like TinyURL. This service will provide short aliases redirecting to long URLs. Similar services: bit.ly, goo.gl, qlink.me, etc. Difficulty Level: Easy

1. Why do we need URL shortening?

URL shortening is used to create shorter aliases for long URLs. We call these shortened aliases “short links.” Users are redirected to the original URL when they hit these short links. Short links save a lot of space when displayed, printed, messaged, or tweeted. Additionally, users are less likely to mistype shorter URLs. For example, if we shorten this page through TinyURL: https://www.educative.io/collection/page/5668639101419520/5649050225344512/ 916475904/ We would get: http://tinyurl.com/jlg8zpc The shortened URL is nearly one-third the size of the actual URL. URL shortening is used for optimizing links across devices, tracking individual links to analyze audience and campaign performance, and hiding affiliated original URLs.

Storage estimates: Let’s assume we store every URL shortening request (and associated shortened link) for 5 years. Since we expect to have 500M new URLs every month, the total number of objects we expect to store will be 30 billion: 500 million * 5 years * 12 months = 30 billion Let’s assume that each stored object will be approximately 500 bytes (just a ballpark estimate–we will dig into it later). We will need 15TB of total storage: 30 billion * 500 bytes = 15 TB Bandwidth estimates: For write requests, since we expect 200 new URLs every second, total incoming data for our service will be 100KB per second: 200 * 500 bytes = 100 KB/s For read requests, since every second we expect ~20K URLs redirections, total outgoing data for our service would be 10MB per second: 20K * 500 bytes = ~10 MB/s Memory estimates: If we want to cache some of the hot URLs that are frequently accessed, how much memory will we need to store them? If we follow the 80-20 rule, meaning 20% of URLs generate 80% of traffic, we would like to cache these 20% hot URLs. Since we have 20K requests per second, we will be getting 1.7 billion requests per day: 20K * 3600 seconds * 24 hours = ~1.7 billion To cache 20% of these requests, we will need 170GB of memory. 0.2 * 1.7 billion * 500 bytes = ~170GB One thing to note here is that since there will be a lot of duplicate requests (of the same URL), therefore, our actual memory usage will be less than 170GB. High level estimates: Assuming 500 million new URLs per month and 100:1 read:write ratio, following is the summary of the high level estimates for our service: New URLs 200/s URL redirections 20K/s Incoming data 100KB/s Outgoing data 10MB/s Storage for 5 years 15TB Memory for cache 170GB

4. System APIs

Once we've finalized the requirements, it's always a good idea to define the system APIs. This should explicitly state what is expected from the system.

We can have SOAP or REST APIs to expose the functionality of our service. Following could be the definitions of the APIs for creating and deleting URLs: createURL(api_dev_key, original_url, custom_alias=None, user_name=None, expire_date=None) Parameters: api_dev_key (string): The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota. original_url (string): Original URL to be shortened. custom_alias (string): Optional custom key for the URL. user_name (string): Optional user name to be used in encoding. expire_date (string): Optional expiration date for the shortened URL. Returns: (string) A successful insertion returns the shortened URL; otherwise, it returns an error code. deleteURL(api_dev_key, url_key) Where “url_key” is a string representing the shortened URL to be retrieved. A successful deletion returns ‘URL Removed’. How do we detect and prevent abuse? A malicious user can put us out of business by consuming all URL keys in the current design. To prevent abuse, we can limit users via their api_dev_key. Each api_dev_key can be limited to a certain number of URL creations and redirections per some time period (which may be set to a different duration per developer key).

5. Database Design

Defining the DB schema in the early stages of the interview would help to understand the data flow among various components and later would guide towards data partitioning. A few observations about the nature of the data we will store:

  1. We need to store billions of records.
  2. Each object we store is small (less than 1K).
  3. There are no relationships between records—other than storing which user created a URL.
  4. Our service is read-heavy. Database Schema: We would need two tables: one for storing information about the URL mappings, and one for the user’s data who created the short link.
  1. What if parts of the URL are URL-encoded? e.g., http://www.educative.io/distributed.php? id=design, and http://www.educative.io/distributed.php%3Fid%3Ddesign are identical except for the URL encoding. Workaround for the issues: We can append an increasing sequence number to each input URL to make it unique, and then generate a hash of it. We don’t need to store this sequence number in the databases, though. Possible problems with this approach could be an ever-increasing sequence number. Can it overflow? Appending an increasing sequence number will also impact the performance of the service. Another solution could be to append user id (which should be unique) to the input URL. However, if the user has not signed in, we would have to ask the user to choose a uniqueness key. Even after this, if we have a conflict, we have to keep generating a key until we get a unique one. Request flow for shortening of a URL 1 of 9

b. Generating keys offline

We can have a standalone Key Generation Service (KGS) that generates random six letter strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to shorten a URL, we will just take one of the already-generated keys and use it. This approach will make things quite simple and fast. Not only are we not encoding the URL, but we won’t have to worry about duplications or collisions. KGS will make sure all the keys inserted into key-DB are unique Can concurrency cause problems? As soon as a key is used, it should be marked in the database to ensure it doesn’t get used again. If there are multiple servers reading keys concurrently, we might get a scenario where two or more servers try to read the same key from the database. How can we solve this concurrency problem? Servers can use KGS to read/mark keys in the database. KGS can use two tables to store keys: one for keys that are not used yet, and one for all the used keys. As soon as KGS gives keys to one of the servers, it can move them to the used keys table. KGS can always keep some keys in memory so that it can quickly provide them whenever a server needs them. For simplicity, as soon as KGS loads some keys in memory, it can move them to the used keys table. This ensures each server gets unique keys. If KGS dies before assigning all the loaded keys to some server, we will be wasting those keys–which is acceptable, given the huge number of keys we have.

KGS also has to make sure not to give the same key to multiple servers. For that, it must synchronize (or get a lock on) the data structure holding the keys before removing keys from it and giving them to a server What would be the key-DB size? With base64 encoding, we can generate 68.7B unique six letters keys. If we need one byte to store one alpha-numeric character, we can store all these keys in: 6 (characters per key) * 68.7B (unique keys) = 412 GB. Isn’t KGS a single point of failure? Yes, it is. To solve this, we can have a standby replica of KGS. Whenever the primary server dies, the standby server can take over to generate and provide keys. Can each app server cache some keys from key-DB? Yes, this can surely speed things up. Although in this case, if the application server dies before consuming all the keys, we will end up losing those keys. This can be acceptable since we have 68B unique six letter keys. How would we perform a key lookup? We can look up the key in our database or key-value store to get the full URL. If it’s present, issue an “HTTP 302 Redirect” status back to the browser, passing the stored URL in the “Location” field of the request. If that key is not present in our system, issue an “HTTP 404 Not Found” status or redirect the user back to the homepage. Should we impose size limits on custom aliases? Our service supports custom aliases. Users can pick any ‘key’ they like, but providing a custom alias is not mandatory. However, it is reasonable (and often desirable) to impose a size limit on a custom alias to ensure we have a consistent URL database. Let’s assume users can specify a maximum of 16 characters per customer key (as reflected in the above database schema). High level system design for URL shortening

7. Data Partitioning and Replication

To scale out our DB, we need to partition it so that it can store information about billions of URLs. We need to come up with a partitioning scheme that would divide and store our data to different DB servers.

Request flow for accessing a shortened URL 1 of 11

9. Load Balancer (LB)

We can add a Load balancing layer at three places in our system:

  1. Between Clients and Application servers
  2. Between Application Servers and database servers
  3. Between Application Servers and Cache servers Initially, we could use a simple Round Robin approach that distributes incoming requests equally among backend servers. This LB is simple to implement and does not introduce any overhead. Another benefit of this approach is that if a server is dead, LB will take it out of the rotation and will stop sending any traffic to it. A problem with Round Robin LB is that server load is not taken into consideration. If a server is overloaded or slow, the LB will not stop sending new requests to that server. To handle this, a more intelligent LB solution can be placed that periodically queries the backend server about its load and adjusts traffic based on that.

10. Purging or DB cleanup

Should entries stick around forever or should they be purged? If a user-specified expiration time is reached, what should happen to the link? If we chose to actively search for expired links to remove them, it would put a lot of pressure on our database. Instead, we can slowly remove expired links and do a lazy cleanup. Our service will make sure that only expired links will be deleted, although some expired links can live longer but will never be returned to users.

  • Whenever a user tries to access an expired link, we can delete the link and return an error to the user.
  • A separate Cleanup service can run periodically to remove expired links from our storage and cache. This service should be very lightweight and can be scheduled to run only when the user traffic is expected to be low.
  • We can have a default expiration time for each link (e.g., two years).
  • After removing an expired link, we can put the key back in the key-DB to be reused.
  • Should we remove links that haven’t been visited in some length of time, say six months? This could be tricky. Since storage is getting cheap, we can decide to keep links forever.

Detailed component design for URL shortening

11. Telemetry

How many times a short URL has been used, what were user locations, etc.? How would we store these statistics? If it is part of a DB row that gets updated on each view, what will happen when a popular URL is slammed with a large number of concurrent requests? Some statistics worth tracking: country of the visitor, date and time of access, web page that refers the click, browser, or platform from where the page was accessed.

12. Security and Permissions

Can users create private URLs or allow a particular set of users to access a URL? We can store permission level (public/private) with each URL in the database. We can also create a separate table to store UserIDs that have permission to see a specific URL. If a user does not have permission and tries to access a URL, we can send an error (HTTP 401) back. Given that we are storing our data in a NoSQL wide-column database like Cassandra, the key for the table storing permissions would be the ‘Hash’ (or the KGS generated ‘key’). The columns will store the UserIDs of those users that have permissions to see the URL.

Designing Pastebin

Let's design a Pastebin like web service, where users can store plain text. Users of the service will enter a piece of text and get a randomly generated URL to access it. Similar Services: pastebin.com, pasted.co, chopapp.com Difficulty Level: Easy

4. Capacity Estimation and Constraints

Our services will be read-heavy; there will be more read requests compared to new Pastes creation. We can assume a 5:1 ratio between read and write. Traffic estimates: Pastebin services are not expected to have traffic similar to Twitter or Facebook, let’s assume here that we get one million new pastes added to our system every day. This leaves us with five million reads per day. New Pastes per second: Paste reads per second: 1M / (24 hours * 3600 seconds) ~= 12 pastes/sec 5M / (24 hours * 3600 seconds) ~= 58 reads/sec Storage estimates: Users can upload maximum 10MB of data; commonly Pastebin like services are used to share source code, configs or logs. Such texts are not huge, so let’s assume that each paste on average contains 10KB. At this rate, we will be storing 10GB of data per day. 1M * 10KB => 10 GB/day If we want to store this data for ten years we would need the total storage capacity of 36TB. With 1M pastes every day we will have 3.6 billion Pastes in 10 years. We need to generate and store keys to uniquely identify these pastes. If we use base64 encoding ([A-Z, a-z, 0-9, ., - ]) we would need six letters strings: 64^6 ~= 68.7 billion unique strings If it takes one byte to store one character, total size required to store 3.6B keys would be: 3.6B * 6 => 22 GB 22GB is negligible compared to 36TB. To keep some margin, we will assume a 70% capacity model (meaning we don’t want to use more than 70% of our total storage capacity at any point), which raises our storage needs to 51.4TB. Bandwidth estimates: For write requests, we expect 12 new pastes per second, resulting in 120KB of ingress per second. 12 * 10KB => 120 KB/s As for the read request, we expect 58 requests per second. Therefore, total data egress (sent to users) will be 0.6 MB/s. 58 * 10KB => 0.6 MB/s

Although total ingress and egress are not big, we should keep these numbers in mind while designing our service. Memory estimates: We can cache some of the hot pastes that are frequently accessed. Following the 80 - 20 rule, meaning 20% of hot pastes generate 80% of traffic, we would like to cache these 20% pastes Since we have 5M read requests per day, to cache 20% of these requests, we would need: 0.2 * 5M * 10KB ~= 10 GB

5. System APIs

We can have SOAP or REST APIs to expose the functionality of our service. Following could be the definitions of the APIs to create/retrieve/delete Pastes: addPaste(api_dev_key, paste_data, custom_url=None user_name=None, paste_name=None, expire_date=None) Parameters: api_dev_key (string): The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota. paste_data (string): Textual data of the paste. custom_url (string): Optional custom URL. user_name (string): Optional user name to be used to generate URL. paste_name (string): Optional name of the paste expire_date (string): Optional expiration date for the paste. Returns: (string) A successful insertion returns the URL through which the paste can be accessed, otherwise, it will return an error code. Similarly, we can have retrieve and delete Paste APIs: getPaste(api_dev_key, api_paste_key) Where “api_paste_key” is a string representing the Paste Key of the paste to be retrieved. This API will return the textual data of the paste. deletePaste(api_dev_key, api_paste_key) A successful deletion returns ‘true’, otherwise returns ‘false’.

6. Database Design

A few observations about the nature of the data we are storing:

  1. We need to store billions of records.
  2. Each metadata object we are storing would be small (less than 100 bytes).
  3. Each paste object we are storing can be of medium size (it can be a few MB).
  4. There are no relationships between records, except if we want to store which user created what Paste.

8. Component Design

a. Application layer

Our application layer will process all incoming and outgoing requests. The application servers will be talking to the backend data store components to serve the requests. How to handle a write request? Upon receiving a write request, our application server will generate a six-letter random string, which would serve as the key of the paste (if the user has not provided a custom key). The application server will then store the contents of the paste and the generated key in the database. After the successful insertion, the server can return the key to the user. One possible problem here could be that the insertion fails because of a duplicate key. Since we are generating a random key, there is a possibility that the newly generated key could match an existing one. In that case, we should regenerate a new key and try again. We should keep retrying until we don’t see failure due to the duplicate key. We should return an error to the user if the custom key they have provided is already present in our database. Another solution of the above problem could be to run a standalone Key Generation Service (KGS) that generates random six letters strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to store a new paste, we will just take one of the already generated keys and use it. This approach will make things quite simple and fast since we will not be worrying about duplications or collisions. KGS will make sure all the keys inserted in key-DB are unique. KGS can use two tables to store keys, one for keys that are not used yet and one for all the used keys. As soon as KGS gives some keys to an application server, it can move these to the used keys table. KGS can always keep some keys in memory so that whenever a server needs them, it can quickly provide them. As soon as KGS loads some keys in memory, it can move them to the used keys table, this way we can make sure each server gets unique keys. If KGS dies before using all the keys loaded in memory, we will be wasting those keys. We can ignore these keys given that we have a huge number of them. Isn’t KGS a single point of failure? Yes, it is. To solve this, we can have a standby replica of KGS and whenever the primary server dies it can take over to generate and provide keys. Can each app server cache some keys from key-DB? Yes, this can surely speed things up. Although in this case, if the application server dies before consuming all the keys, we will end up losing those keys. This could be acceptable since we have 68B unique six letters keys, which are a lot more than we require. How does it handle a paste read request? Upon receiving a read paste request, the application service layer contacts the datastore. The datastore searches for the key, and if it is found, returns the paste’s contents. Otherwise, an error code is returned.

b. Datastore layer

We can divide our datastore layer into two:

  1. Metadata database: We can use a relational database like MySQL or a Distributed Key-Value store like Dynamo or Cassandra.
  2. Object storage: We can store our contents in an Object Storage like Amazon’s S3. Whenever we feel like hitting our full capacity on content storage, we can easily increase it by adding more servers. Detailed component design for Pastebin

9. Purging or DB Cleanup

Please see Designing a URL Shortening service.

10. Data Partitioning and Replication

Please see Designing a URL Shortening service.

11. Cache and Load Balancer

Please see Designing a URL Shortening service.

12. Security and Permissions

Please see Designing a URL Shortening service.