How to design a scalable temporary email service?

The discussion outlines the design of a disposable email service focused on anonymity, scalability, and ephemerality. Key functional requirements include generating unique email addresses without registration, receiving and storing emails with attachments, providing a web interface for access, and automatically deleting all data after 24 hours. Non-functional requirements target handling 1 million emails daily (50 GB of data) with high availability (99.9%), sub-200ms latency for inbox display, robust spam filtering, and cost-effectiveness. The architecture comprises seven main components: a load balancer, address generator, SMTP servers, NoSQL storage with built-in TTL, a web interface, spam filter, and cleanup service. Critical design differentiators from traditional email include no user authentication (access is solely via the generated address), short data retention, and high churn of ephemeral data. Security emphasizes cryptographically strong address generation using UUIDs and domain rotation to mitigate reputation risks. The system employs a queue-based pipeline where SMTP servers quickly offload emails to a message queue (e.g., Kafka), allowing asynchronous processing by workers for parsing, spam filtering, and storage, ensuring reliability under load.

Transcription

Unpacking the Core Requirements for a Disposable Email Service

Speaker 1 Welcome back to the Deep Dive. Today, we're switching gears a bit, stepping away from the news cycle.
Speaker 2 Yeah, doing something different.
Speaker 1 We're diving into a pure system design challenge. Think of it like a collaborative session, almost like a technical interview setting.
Speaker 2 OK, I like it. What's the problem?
Speaker 1 We're designing a temporary email service. You know, the kind where you get an address instantly, no sign-up, use it for something quick, and then it disappears. Scalable. Anonymous.
Speaker 2 Disposable email, yeah. Very useful for privacy, sign-ups, avoiding spam, testing things. I get it. It's an interesting design space because it's sort of the opposite of regular email.
Speaker 1 Exactly. Regular email is all about keeping stuff forever. Strong logins. This is fast and fleeting.
Speaker 2 Right, the value's in the ephemerality and the anonymity. That changes everything design-wise.
Speaker 1 So let's treat this seriously, like we're really building it. Where do we start? What are the absolute must-haves, the requirements?
Speaker 2 Good place to start. Always requirements first. The big one seems obvious from the name: temporary. How temporary?
Speaker 1 Let's say 24 hours. Emails stick around for exactly one day, then poof.
Speaker 2 OK, 24-hour retention. That's a huge constraint. It drives a lot of tech choices downstream, especially storage. What else? Functionally, what does it need to do?
Speaker 1 Well, users need to get an address, right? Without registering.
Speaker 2 Generate unique addresses, yeah.
Speaker 1 Then obviously it needs to receive email sent to that address.
Speaker 2 And store them. Got it.
Speaker 1 And the user needs a way to see those emails. A simple web page, probably.
Speaker 2 Yep, a web interface to display the inbox.
Speaker 1 Needs to handle attachments. I assume people expect that.
Speaker 2 Support attachments, OK. And critically, automatic deletion after that 24-hour window, no leftovers.
Speaker 1 Right, self-destructing messages. Anything else fundamental, maybe?
Speaker 2 Basic forwarding could be useful, and it definitely needs to handle multiple people potentially looking at the same shared address inbox, so concurrency matters.
Speaker 1 OK, good list. Now the non-functionals, the "-ilities". What kind of scale are we talking? If this takes off, usage could explode.
Speaker 2 We need to plan for scale from the get-go. Let's throw out some numbers based on typical services like this. Say 100,000 daily active users.
Speaker 1 OK, 100K DAU. How many emails per user?
Speaker 2 Let's be a bit generous. Maybe 10 emails per user per day. Some might get one, some might get 50 if they're signing up for lots of things.
Speaker 1 So 100,000 users times 10 emails. That's a million emails a day.
Speaker 2 A million emails daily. And what about size? Emails aren't huge, but they add up. Average maybe 50 kilobytes with headers and a bit of content, sometimes more with attachments, sometimes less.
Speaker 1 OK, 1,000,000 emails times 50 KB, carry the one, that's 50 gigabytes of new data every single day.
Speaker 2 Exactly. 50 GB written per day and, importantly, 50 GB deleted per day, 24 hours later. That churn is key. High write throughput, fast retrieval, and efficient deletion.
Speaker 1 That scale definitely screams distributed system. What other non-functionals? Availability? Latency?
Speaker 2 Availability has to be high. If you grab a temp address for a verification link, the service needs to be up. Let's aim for standard high availability, like 99.9%, three nines.
Speaker 1 And latency? How fast should that inbox load? People using temp email are often in a hurry.
Speaker 2 Yeah, that's crucial for the user experience. Waiting kills it. Let's set an aggressive target. Displaying emails in the web interface should take less than 200 milliseconds.
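
The scale figures quoted above can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch: the inputs are the speakers' numbers, decimal units (1 GB = 1,000,000 KB) are assumed, and the per-second averages ignore bursts, which is the reason a queue is introduced later in the discussion.

```python
# Back-of-envelope capacity estimate using the figures quoted in the discussion.
DAILY_ACTIVE_USERS = 100_000
EMAILS_PER_USER_PER_DAY = 10
AVG_EMAIL_SIZE_KB = 50
AVAILABILITY_TARGET = 0.999  # "three nines"
SECONDS_PER_DAY = 24 * 60 * 60

emails_per_day = DAILY_ACTIVE_USERS * EMAILS_PER_USER_PER_DAY

# Assumption: decimal units, so 1 GB = 1,000,000 KB.
data_per_day_gb = emails_per_day * AVG_EMAIL_SIZE_KB / 1_000_000

# Averages only; real traffic is bursty, so peak rates will be higher.
avg_emails_per_second = emails_per_day / SECONDS_PER_DAY
avg_write_kb_per_second = avg_emails_per_second * AVG_EMAIL_SIZE_KB

# Downtime budget implied by the availability target.
allowed_downtime_minutes_per_day = (1 - AVAILABILITY_TARGET) * SECONDS_PER_DAY / 60

if __name__ == "__main__":
    print(f"emails/day:        {emails_per_day:,}")
    print(f"new data/day:      {data_per_day_gb:.0f} GB")
    print(f"avg emails/second: {avg_emails_per_second:.1f}")
    print(f"avg write rate:    {avg_write_kb_per_second:.0f} KB/s")
    print(f"downtime budget:   {allowed_downtime_minutes_per_day:.2f} min/day")
```

The average rate works out to roughly a dozen emails per second, so the difficulty lies less in the steady state than in bursts and in deleting as much data as is written each day.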
Speaker 1 Sub-200 ms display. Wow. OK, that's demanding.
Speaker 2 It is. It dictates choices later. And of course scalability, to handle that million emails, potentially much more. Security and privacy are paramount, especially without a login. And spam filtering: we have to filter spam or the service becomes useless fast. Cost-effectiveness too, can't forget that.
Speaker 1 OK, that's a solid set of requirements.

Tracing an Email's Journey Through Our Anonymous System

Speaker 1 High volume, high churn, low latency, high availability, strong privacy. Now let's sketch the big picture. What are the main building blocks? The core components?
Speaker 2 Right. High level, I'm thinking we need probably seven key pieces. First, handling traffic coming in, both web users and incoming email. So a load balancer at the front door.
Speaker 1 Makes sense, distributes the load.
Speaker 2 And the core services. We need something to actually create the addresses: the email address generator.
Speaker 1 Got it.
Speaker 2 We need the system that actually receives mail from the Internet: the SMTP server component. Needs to handle volume.
Speaker 1 The mail receiver, check.
Speaker 2 Then obviously somewhere to put the emails: the email storage service. That's where the 50 GB a day goes.
Speaker 1 The database, essentially.
Speaker 2 Yep. And the part the user sees, the web interface. Needs to be fast. Remember, sub-200 ms?
Speaker 1 User-facing part, OK.
Speaker 2 We need that spam filter. Probably works closely with the SMTP server and storage. Crucial. And finally, something has to manage that 24-hour expiry. Let's call it the cleanup service. Though ideally the storage system helps a lot here.
Speaker 1 Load balancer, generator, SMTP server, storage, web interface, spam filter, cleanup service. That is comprehensive. But you know, many of those components sound like what Gmail or any big email provider has. What makes this system fundamentally different architecturally?
Speaker 2 That's the core question, really.
The biggest difference? No user authentication. Think about Gmail, Outlook. They live on logins, user accounts, passwords, identity verification. That's their whole model.
Speaker 1 Right. Here, there's none of that. You just get an address.
Speaker 2 Exactly. Access is controlled only by knowing the address itself. If you know [email protected] you can see the inbox, period.
Speaker 1 So the security burden shifts entirely. It's not about protecting passwords, it's about making those generated addresses incredibly hard to guess.
Speaker 2 Precisely. That's difference number one. Number two is the retention period. Gmail stores data potentially forever. We store it for 24 hours max. This leads to drastically different storage strategies. We optimize for fast writes and very efficient deletes, not long-term durability.
Speaker 1 High churn, like you said.
Speaker 2 Yep. And number three is the sheer volume of ephemeral data relative to, say, active users. While Gmail has massive storage, it's tied to persistent user accounts. We have potentially millions of short-lived inboxes created and destroyed daily. It's a different traffic pattern.
Speaker 1 OK, that clarifies the core distinctions. Let's walk through an email's life. Trace the data flow using those seven components.
Speaker 2 Sure. So step one: the user hits our website, and the web interface talks to the email address generator.
Speaker 1 Generator does its magic.
Speaker 2 Right, creates a unique random address, something like super-random-id@tempdomain-b.com, and gives it back to the user via the web interface.
Speaker 1 OK, user has the address. Now someone sends an email to it.
Speaker 2 OK. The external sender's mail server looks up the MX record for tempdomain-b.com, finds our infrastructure, and the email hits our main load balancer.
Speaker 1 Which directs it to one...
Speaker 2 ...of our available SMTP servers. The SMTP server receives the connection and takes the email data. Quick, basic validation happens here.
Speaker 1 Like checking size limits?
Speaker 2 Exactly. Size, basic format checks. Assuming it passes, the SMTP server doesn't hang on to it for long. It needs to accept mail quickly, so it immediately puts the raw email into a queue like Kafka.
Speaker 1 Decoupling. Smart.
Speaker 2 Essential for handling bursts. So the email's in the queue. Now separate processing workers pick it up. These workers run it through the spam filter.
Speaker 1 The filter does its analysis.
Speaker 2 Content, sender reputation, maybe some basic virus scanning on attachments. If it passes the spam checks...
Speaker 1 It gets stored.
Speaker 2 It gets processed, parsed into headers, body, and attachment info, and then written to the email storage service. Critically, when we write it, we attach that 24-hour time-to-live metadata.
Speaker 1 The TTL flag, OK.
Speaker 2 Meanwhile, the user's looking at the web interface for that address. Maybe they have a WebSocket open. The storage service or the processing worker sends a notification: new email arrived.
Speaker 1 And the web interface displays it, ideally in under 200 ms.
Speaker 2 That's the goal. User reads it, time passes. 24 hours after it was stored...
Speaker 1 The cleanup service kicks in, or the database itself.
Speaker 2 Ideally, the database handles the TTL expiry automatically. The cleanup service might just monitor that things are actually getting deleted correctly, maybe force-delete stragglers, but the heavy lifting should be built into the storage layer. The data vanishes.
Speaker 1 That flow makes sense, but that point about access being tied only to the address? That feels like the biggest security headache. If those addresses aren't truly random and unique, it's game over.

Ensuring Unpredictable Addresses and Managing Domain Health

Speaker 2 Yeah, the address generator has to be cryptographically strong. Guessability is the enemy. This also directly influences another key design choice early on: using multiple domains.
Speaker 1 Right, you mentioned that.
Why not just one super fast domain, thebest-temp-mail.com?
Speaker 2 Reputation, reputation, reputation. In the email world, your sending domain's reputation is everything. Even though we're mostly receiving, if bad actors use our service and those addresses end up on spam lists associated with thebest-temp-mail.com, then legitimate services might start blocking all mail from our servers. Or worse, other providers might blacklist our domain entirely.
Speaker 1 So if Gmail sees tons of spam associated with addresses from our domain, they might just block any email trying to reach an address at that domain.
Speaker 2 Precisely. Or they might just junk everything. By having a pool of domains, temp-mail-a.com, temp-mail.j.net, ephemeral-post.org, we can distribute the addresses. If one domain starts getting a bad reputation, you just...
Speaker 1 Stop generating new addresses for it and switch to another.
Speaker 2 Exactly. Rotate them out, let the bad one cool down, or retire it. It isolates the reputation risk and keeps the overall service healthy and deliverable. It's a necessary operational complexity.
Speaker 1 That makes a lot of sense. And you also hinted that standard relational databases, even sharded ones like MySQL or Postgres, might struggle here. Why lean towards NoSQL right away for the email storage service?
Speaker 2 Primarily the write load and the TTL requirement. Trying to write potentially thousands of emails per second across many relational shards, manage indexes, and run massive delete operations constantly for the 24-hour expiration is just really hard operationally. You spend all your time managing the sharding and the vacuuming and deletion process.
Speaker 1 So NoSQL handles that better.
Speaker 2 Certain types of NoSQL databases are practically built for this kind of workload. They often have much better horizontal scaling for writes: just add more nodes. They handle flexible schemas easily, which is good for email.
And crucially, many have built-in time-to-live (TTL) support at the record level. That simplifies the cleanup immensely.
Speaker 1 OK, let's dig into that address generation piece more. You said unpredictability is key. What format are we talking about to achieve that?
Speaker 2 We need something with extremely high entropy. The standard approach here is using UUIDs, universally unique identifiers, specifically version 4 UUIDs, which are based on random numbers.
Speaker 1 So the address looks like uuid-string@domain.com.
Speaker 2 Pretty much. Maybe some random prefix followed by the UUID, @selected-domain.com. The UUID part gives you, what, 2^122 possible combinations? It's astronomically unlikely anyone will guess one, or that you'll have a collision.
Speaker 1 But astronomically unlikely isn't zero. Do we need collision detection?
Speaker 2 Yes, absolutely. Even though it's rare, when the generator creates a UUID string, it should do a quick check against active addresses in, say, a Bloom filter, or even a direct database check, just to be 100% certain it hasn't generated one that's currently in use. It's a tiny overhead for guaranteed uniqueness.
Speaker 1 Makes sense. And the domain management: you talked about rotating domains based on reputation. How does that actually work in practice? Is it manual?
Speaker 2 Oh no, definitely automated. We need an internal monitoring system. It constantly tracks metrics for each domain in our pool. Things like bounce rates, spam complaint ratios reported via feedback loops from ISPs, maybe checks against public blacklists.
Speaker 1 Like a health score for each domain.
Speaker 2 Exactly. If a domain's health score drops below a certain threshold, the monitoring system flags it as quarantined. The address generator is configured to only pull from the pool of healthy domains.
Speaker 1 And the quarantined ones?
Speaker 2 They're taken out of rotation immediately. They might sit in quarantine for a cool-down period, weeks, maybe months.
Or if they're really burned, we might just retire them permanently and acquire new ones. It has to be dynamic.
Speaker 1 OK, that handles domain health, but what about direct abuse of the generator itself? What stops someone writing a script to request, like, a million addresses in 10 minutes? That could flood our system or exhaust our domain pool quickly.
Speaker 2 Right. Rate limiting is absolutely essential at the generation endpoint. It needs multiple layers. Layer one: basic IP-based rate limiting. You know, allow maybe 5 address requests per minute from a single IP, using something like the token bucket algorithm to allow small bursts but cap the sustained rate.
Speaker 1 But bots can cycle through IPs.
Speaker 2 True. So layer two: if an IP starts hitting the limit frequently, or if we detect suspicious patterns, we introduce a challenge. Maybe a lightweight CAPTCHA, or a proof-of-work challenge that's easy for a browser but costly for a script hitting us thousands of times.
Speaker 1 Making it computationally expensive for the bots.
Speaker 2 Exactly. And maybe layer three: more sophisticated detection using browser fingerprinting or analyzing request patterns to identify automated tools versus genuine human users. We want the barrier for abuse to be high, but friction for legitimate users generating one or two addresses needs to be minimal.

Building a Robust Queue-Based Email Processing Pipeline

Speaker 1 So thinking about the life cycle of an address: it gets generated, enters an active state, people send emails to it, then it expires 24 hours after the first email arrives, or maybe 24 hours after generation. Which one?
Speaker 2 Good question. Simpler is usually better. Let's say the address itself is valid for 24 hours from the moment it's generated. Any email sent to it within that window is accepted and stored with its own 24-hour TTL from its arrival time. After the address's 24 hours are up, the SMTP server starts rejecting mail sent to it.
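
The generation and lifecycle rules discussed so far (a random version 4 UUID as the local part, a domain drawn only from the healthy pool, a collision check, and a 24-hour validity window that starts at generation) can be sketched as follows. Everything here is illustrative: the class names, the in-memory dictionary standing in for the Bloom filter or database check, and the two-domain pool are assumptions made for the sketch, not details given by the speakers.

```python
import secrets
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

ADDRESS_LIFETIME = timedelta(hours=24)

# Hypothetical pool; in the design above this would come from domain health monitoring.
HEALTHY_DOMAINS = ["temp-mail-a.com", "ephemeral-post.org"]


@dataclass(frozen=True)
class TempAddress:
    address: str
    expires_at: datetime


class AddressGenerator:
    def __init__(self, healthy_domains):
        self._domains = list(healthy_domains)
        # Stand-in for the Bloom filter / database lookup of active addresses.
        self._active = {}

    def generate(self, now=None):
        now = now or datetime.now(timezone.utc)
        while True:
            # uuid4 carries 122 random bits; .hex gives a 32-character local part.
            local_part = uuid.uuid4().hex
            domain = secrets.choice(self._domains)
            address = f"{local_part}@{domain}"
            if address not in self._active:  # collision check, retry if taken
                break
        record = TempAddress(address, now + ADDRESS_LIFETIME)
        self._active[address] = record
        return record

    def accepts_mail(self, address, now=None):
        """What the SMTP layer would ask before accepting a message."""
        now = now or datetime.now(timezone.utc)
        record = self._active.get(address)
        return record is not None and now < record.expires_at


if __name__ == "__main__":
    generator = AddressGenerator(HEALTHY_DOMAINS)
    issued = generator.generate()
    print(issued.address, "valid until", issued.expires_at.isoformat())
```

Each stored email would carry its own 24-hour TTL from arrival, independent of the address expiry checked here.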
Speaker 1 OK, so the address has a lifespan, and each email received during that lifespan has its own 24-hour lifespan from when it landed.
Speaker 2 Yeah, that seems manageable. Address state: created, active for 24 hours, expired. Email state: received, visible for 24 hours from receipt, deleted. Keeps the logic cleaner.
Speaker 1 All right. Address generated, the user has it. Now the fire hose part: handling that million emails a day hitting our SMTP servers. This is the front line for incoming mail. How do we make it rock solid?
Speaker 2 Starts with DNS, actually. We need correctly configured MX records, mail exchange records, for all our active domains, pointing them to the load balancers in front of our SMTP servers. That's how the Internet knows where to send the mail.
Speaker 1 Standard email setup.
Speaker 2 Yeah. Then behind the load balancer, we need multiple SMTP server instances running, not just one. For redundancy: if one server fails or needs maintenance, others take over. And for load balancing: spreading that million emails across several machines. This directly supports the 99.9% availability goal.
Speaker 1 When an email first connects to one of those SMTP servers, what's the absolute first thing it does before accepting the data?
Speaker 2 Immediate basic checks. Sanity checks. Is the sending server trying to send something ridiculously large? We enforce a strict size limit, maybe 25 MB, maybe 50 MB, and reject anything bigger right away. That prevents simple denial of service. Is the email format basically valid? Check basic header structure. Reject obvious junk immediately. Don't waste cycles downstream.
Speaker 1 OK, basic filtering at the gate, but the core challenge you mentioned: reliability under load.
If we get a sudden flood of email, say a big newsletter blast hits thousands of our temporary addresses at once, the SMTP server can't afford to get bogged down doing complex spam checks or waiting for the database. It needs to accept the email fast.
Speaker 2 You nailed it. That's where the queue-based processing architecture is non-negotiable. The SMTP server's main job is incredibly simple. Accept the connection, perform those very basic checks, receive the email data, and then immediately hand it off.
Speaker 1 Hand it off where?
Speaker 2 To a message queue, a highly reliable, high-throughput queue like Apache Kafka, or maybe RabbitMQ or AWS SQS. The SMTP server just puts the raw email message onto the Kafka topic and tells the sending server "250 OK", meaning "I've got it". Then it's ready for the next connection.
Speaker 1 So it accepts responsibility very quickly without doing the heavy lifting itself.
Speaker 2 Exactly. It decouples ingestion from processing. The queue acts as a massive buffer. If a huge spike of emails arrives, the SMTP servers just keep putting them into Kafka. Kafka is built to absorb that kind of load. The actual processing can catch up later. This prevents the SMTP servers from getting overwhelmed and dropping connections, ensuring reliability.
Speaker 1 Brilliant. OK, so the email is safe in Kafka. Now who handles it? What does that email processing pipeline look like, pulling messages off the queue?
Speaker 2 That's where dedicated processing workers come in. These are separate services, maybe running as containerized applications, whose only job is to read messages from the Kafka queue.
Speaker 1 And what do they do with each message?
Speaker 2 A multi-step process. First, parsing. They take the raw email data and parse it properly. Extract the headers: From, To, Subject, Date. Separate the body, text and HTML. Identify and potentially extract any attachments.
Speaker 1 Get it into a structured format.
Speaker 2 Right. Step two: spam detection.
This is critical. The parsed content goes through our spam filter component. This might involve multiple techniques: checking the sender's IP against known spam blacklists (DNSBLs), analyzing the content for spammy keywords or patterns, maybe checking URLs in the body against phishing databases. It could even involve a machine learning model trained to score emails for spamminess.
Speaker 1 A multi-layered defense.
Speaker 2 Has to be. Step three: maybe basic virus scanning, especially on attachments. Integrate with something like ClamAV.
Speaker 1 OK, parsed, scanned for spam and viruses. What else? You mentioned email authentication standards earlier.
Speaker 2 Yes, that's important for context and potentially for filtering. The workers should check the email's authentication results. Did it pass SPF, Sender Policy Framework? This checks if the sending IP was authorized by the domain owner. Did it pass DKIM, DomainKeys Identified Mail? This uses cryptographic signatures to verify the message hasn't been tampered with and came from the claimed domain.
Speaker 1 And DMARC?
Speaker 2 And DMARC, Domain-based Message Authentication, Reporting and Conformance, tells us what the sender wants us to do if SPF or DKIM fail. Should we reject it, quarantine it in a spam folder, or just let it through? We don't have a traditional spam folder here, so failing DMARC might heavily increase its spam score, potentially leading to rejection. Understanding these helps the system behave like a good email citizen.
Speaker 1 Got it. So if it passes all checks?
Speaker 2 Yeah, if it passes the spam filter and other checks, the worker then prepares the final structured data (metadata, body, attachment references) and writes it to the email storage service, making sure to set that 24-hour TTL. Then the worker acknowledges the message in Kafka by committing its offset, so it isn't processed again.
Speaker 1 And just like we rate limit address generation, do we need to rate limit incoming emails from specific senders?
Can one external server flood us?
Speaker 2 Absolutely. The SMTP servers themselves, or perhaps an intelligent load balancer, should track incoming connection rates and email volume per sending IP or sending domain. If one source suddenly starts hammering us with thousands of emails per minute, throttle them or even temporarily block them at the edge. It's essential self-preservation to prevent one bad actor from overwhelming the entire processing pipeline or storage system.

Optimizing Email Storage and Deletion with Cassandra

Speaker 1 OK, makes sense. Let's pivot to that email storage service. This is where the rubber meets the road for handling 50 gigabytes a day and deleting it efficiently. You leaned towards NoSQL, specifically mentioning Cassandra. Why Cassandra over, say, MongoDB, which is also popular?
Speaker 2 It's a good comparison. MongoDB is a fantastic document database. Very flexible, often easier to get started with. It's great for many use cases, but when you look at extremely high-volume continuous writes distributed across many nodes, and needing efficient built-in TTL-based deletion, Cassandra's architecture often shines brighter.
Speaker 1 What's different about Cassandra's architecture?
Speaker 2 The core difference is its log-structured merge (LSM) tree storage engine. Traditional databases, like Mongo with WiredTiger or relational databases using B-trees, often need to update data in place on disk, which involves read-modify-write cycles and locking.
Speaker 1 Which can be slow under heavy load.
Speaker 2 Exactly. LSM trees, like in Cassandra, handle writes differently. Writes are typically fast sequential appends to in-memory tables (memtables), which are then flushed to immutable files on disk (SSTables). Reads might need to check multiple files, but the write path is highly optimized for throughput. Deletes are also just appends, writing a tombstone marker.
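
The write path described above can be illustrated with a toy model. This is a deliberately simplified sketch of the LSM idea, not how Cassandra is implemented: there is no commit log, no compaction, and no on-disk format, and the class and parameter names are made up for the example. It shows only that writes and deletes both land in a memtable, that flushed tables are never modified, and that reads consult the newest data first.

```python
from types import MappingProxyType

TOMBSTONE = object()  # marker written in place of a value when a key is deleted


class ToyLSMStore:
    def __init__(self, memtable_limit=2):
        self.memtable = {}            # mutable, in-memory
        self.sstables = []            # oldest first; each one is read-only
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A delete is just another write: it records a tombstone.
        self.put(key, TOMBSTONE)

    def flush(self):
        if self.memtable:
            frozen = MappingProxyType(dict(sorted(self.memtable.items())))
            self.sstables.append(frozen)
            self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then SSTables from newest to oldest.
        for table in [self.memtable, *reversed(self.sstables)]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None


if __name__ == "__main__":
    store = ToyLSMStore()
    store.put("email-1", "hello")
    store.put("email-2", "world")   # second write triggers a flush
    store.delete("email-1")         # tombstone sits in the memtable
    print(store.get("email-1"), store.get("email-2"))
```

The old value of a deleted key stays in its original table until something rewrites that table, which is the job of the compaction step discussed later.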
Speaker 1 So for our scenario, a constant stream of new email writes and constant deletion, that LSM approach is a better fit.
Speaker 2 It's practically designed for it. Cassandra scales writes horizontally very well by adding more nodes, and the way it handles deletes via tombstones and background compaction fits perfectly with our TTL requirement. It avoids the massive overhead of actively seeking out and deleting a million records every day.
Speaker 1 OK, convinced on Cassandra for now. How do we model the data? If a user accesses random-xyz@domain.com, how do we quickly fetch only the emails for that address?
Speaker 2 Data modeling is key in Cassandra. You design your tables around your queries. Our main query is "get all emails for address X, sorted by time". So we create a table, let's call it emails_by_address. The partition key would be the email address itself.
Speaker 1 What does the partition key do?
Speaker 2 It determines which node, or set of replica nodes, the data lives on. All emails for the same address will be stored together on the same physical nodes. This makes reading all emails for one address extremely fast, usually hitting only one or a few nodes.
Speaker 1 OK, so partitioning by address gives us fast lookups. How do we get them in order? Newest first?
Speaker 2 Within each partition, i.e. for each email address, we use clustering keys. The most natural clustering key here would be a unique identifier for the email that also sorts chronologically. A TimeUUID is perfect for this. It's unique like a UUID, but also encodes the timestamp.
Speaker 1 So partition key: email address. Clustering key: email TimeUUID.
Speaker 2 Exactly. Cassandra will automatically store the rows within each partition sorted by that email TimeUUID. We'd declare the clustering order as descending, so newest-first reads are super efficient. The table might also have columns for sender, subject, body preview, attachment references, et cetera.
Speaker 1 Let's talk attachments.
Emails can have tiny text files or huge 20 MB videos. Storing big blobs directly in Cassandra isn't usually recommended, right?
Speaker 2 Correct. Cassandra is optimized for lots of smaller records, not huge blobs. Trying to stuff large attachments directly into rows can cause performance issues, network bottlenecks, and memory pressure during compaction.
Speaker 1 So what's the strategy?
Speaker 2 A hybrid approach. The email metadata (sender, subject, et cetera) and maybe a small preview of the body go into the Cassandra table, but the actual attachment files, especially if they're over a certain size threshold, say 500 KB...
Speaker 1 We store them elsewhere.
Speaker 2 Store them in dedicated object storage like Amazon S3, Google Cloud Storage, or Ceph. Object storage is cheap, highly scalable, and designed for storing large files.
Speaker 1 And how do we link them?
Speaker 2 The Cassandra row for the email simply stores a reference to the object in S3, basically the S3 object key or a pre-signed URL. When the user clicks to download the attachment in the web interface, the back end fetches it directly from S3 using that reference.
Speaker 1 Keeps the database lean and uses the right tool for each job. Can we compress attachments in S3?
Speaker 2 Absolutely. Compressing attachments before storing them in S3 is a great way to save on storage costs and potentially speed up downloads. S3 also has its own life cycle policies that we'd align with our main 24-hour logic.
Speaker 1 OK, let's revisit the cleanup. You said Cassandra's TTL and LSM structure simplify this hugely compared to running delete commands. Explain that again. How does TTL really work with compaction?
Speaker 2 Right. When you insert a row into Cassandra with a TTL value, say 86,400 seconds for 24 hours, Cassandra stores that expiration timestamp alongside the data. When that time passes, the data isn't immediately wiped from the disk file, the SSTable it's in, because those files are immutable.
Instead, when Cassandra reads that data later, it checks the timestamp and sees it's expired. It treats it as if it doesn't exist. It won't return it in queries.
Speaker 1 So it's logically gone, but not physically yet.
Speaker 2 Exactly. The physical removal happens during a background process called compaction. Cassandra periodically runs compaction to merge SSTables, clean up old data, and reorganize things for read efficiency. During compaction, when it encounters data whose TTL has expired, it simply doesn't write that data into the new merged SSTable. Poof, it's physically gone.
Speaker 1 And the tombstone markers for explicit deletes get cleaned up similarly.
Speaker 2 Tombstones also get purged during compaction after a certain grace period. Using TTL is generally more efficient than creating lots of tombstones, especially for our use case where everything expires. Our cleanup service then becomes more of an auditor: verifying the TTLs are working, monitoring compaction health, maybe dealing with rare edge cases, rather than actively deleting gigabytes of data itself.
Speaker 1 That's a huge win. Massive operational simplification and cost saving.

Delivering Instant Updates and Powerful Search Capabilities

Speaker 1 OK, back-end storage feels solid. Let's jump to the front: the web interface and that sub-200 ms latency goal for seeing new emails. How do we achieve that responsiveness?
Speaker 2 The key is avoiding user actions for updates. We can't rely on the user hitting refresh. Even traditional polling, asking the server "anything new?" every few seconds, introduces too much delay and overhead.
Speaker 1 So push, not pull.
Speaker 2 Exactly. We need WebSockets. When a user opens the inbox page for a specific temporary address, the browser establishes a persistent WebSocket connection back to our web server layer.
Speaker 1 A dedicated pipe for that user session.
Speaker 2 Right. Now, remember our processing pipeline? When a worker successfully processes an email, filters it, and stores it in Cassandra...
Speaker 1 It needs to notify the web layer.
Speaker 2 Precisely. That worker, or maybe a notification service it calls, finds the active WebSocket connection associated with that email address and pushes a small notification message down the WebSocket: new email arrived, ID 123, sender, subject.
Speaker 1 And the JavaScript running in the user's browser receives that message instantly.
Speaker 2 Instantly. The front-end code listens for messages on the WebSocket. When it gets a new-email notification, it can immediately fetch the full details, or maybe the notification already contains enough, and dynamically insert the new email into the inbox list on the page without a full page reload. That's how you get that near-real-time feel and hit the sub-200 ms target from arrival to display.
Speaker 1 OK, WebSockets for real time. What about scale on the front end? If someone uses a temp address to sign up for, like, 100 newsletters at once, their inbox could get flooded. We can't load 500 emails onto the page instantly.
Speaker 2 No, definitely not. Standard web practice: pagination. When the user first loads the inbox, we only fetch and display, say, the latest 20 or maybe 50 emails from Cassandra.
Speaker 1 Based on that TimeUUID clustering key we set up.
Speaker 2 Exactly. Cassandra makes fetching the top N very efficient. If the user wants to see older emails, they click a "load more" button or scroll down, and the front end makes another request to fetch the next page, the next 20 or 50 emails. This keeps the initial load fast and the payload small.
Speaker 1 Makes sense. Now one more key feature for any inbox: search. People might want to find that verification email from Example Corp or search for a subject containing "discount". Cassandra isn't great at that kind of query, right?
Speaker 2 Not at all.
Cassandra's query capabilities are deliberately limited, focused on primary key lookups for performance. Asking it to do a full-text search across subjects or e-mail bodies for millions of emails would be incredibly slow and inefficient, if possible at all. Speaker 1 So we need another tool. Speaker 2 Yes, for flexible, powerful full-text search you bring in a specialized search engine. The standard choice here is Elasticsearch or OpenSearch. Speaker 1 Another component in the architecture. How does it fit in without slowing down the main e-mail flow? Speaker 2 We use the same decoupling pattern: asynchronous indexing via Kafka. Remember our processing worker? After it successfully stores the e-mail in Cassandra... Speaker 1 It does another thing. Speaker 2 It does one more thing. It prepares a simplified version of the e-mail data, maybe sender, subject, a sanitized version of the body, the e-mail ID, and puts that onto another Kafka topic. Let's call it the Elasticsearch indexing queue. Speaker 1 So storage and indexing happen in parallel, asynchronously. Speaker 2 Exactly. Separate Elasticsearch indexing workers consume messages from this queue. Their only job is to take that data and push it into the Elasticsearch cluster, making it searchable. Speaker 1 So the main e-mail ingestion path (SMTP pickup, processing, Cassandra) isn't blocked waiting for Elasticsearch. Speaker 2 Correct. Indexing happens slightly delayed, maybe by a few seconds, but it doesn't impact the critical path of receiving and storing the e-mail. When a user performs a search in the web interface, say "subject contains voucher"... Speaker 1 The query goes to Elasticsearch, not Cassandra. Speaker 2 Right, the web back end forwards the search query to Elasticsearch. Elasticsearch returns a list of matching e-mail IDs. The back end might then fetch the full details for those specific e-mail IDs from Cassandra to display the results. It leverages the strengths of both systems.
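The store-then-index flow described above can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: a dict plays the role of Cassandra, a `deque` plays the Kafka indexing topic, and a tiny inverted index plays Elasticsearch. None of the function names come from the discussion.

```python
from collections import defaultdict, deque

primary_store = {}                 # stand-in for Cassandra: email id -> full record
index_queue = deque()              # stand-in for the Kafka indexing topic
inverted_index = defaultdict(set)  # stand-in for Elasticsearch: term -> email ids


def store_email(email_id, sender, subject, body):
    """Critical path: persist the email, then hand a slim copy to the indexer."""
    primary_store[email_id] = {"sender": sender, "subject": subject, "body": body}
    index_queue.append({"id": email_id, "subject": subject})


def run_indexing_worker():
    """Off the critical path: drain the queue and index subject terms."""
    indexed = 0
    while index_queue:
        doc = index_queue.popleft()
        for term in doc["subject"].lower().split():
            inverted_index[term].add(doc["id"])
        indexed += 1
    return indexed


def search_subject(term):
    """Ask the index for matching ids, then fetch full records from the store."""
    ids = sorted(inverted_index.get(term.lower(), set()))
    return [primary_store[i] for i in ids if i in primary_store]
```

Until `run_indexing_worker()` has run, a freshly stored email is readable by ID but not yet searchable, which mirrors the few seconds of indexing delay the speakers accept.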
Speaker 1 Cassandra for fast writes, primary key reads, and TTL. Elasticsearch for complex search queries. A nice separation of concerns. Speaker 2 It's essential for meeting all the functional and non-functional requirements, especially performance at scale. Navigating the Ethical and Technical Trade-offs in Design OK, we've designed a pretty sophisticated system here: scalable, resilient, fast, ephemeral. Let's take a step back and recap some of the key decisions and trade-offs, especially around security and data integrity. Speaker 2 Sure, security and privacy were drivers from the start. Since anonymity is the goal, data encryption is mandatory, both TLS for data in transit and encryption for data at rest in Cassandra and S3. A critical policy decision is no user tracking or logging beyond what's strictly necessary for operational monitoring, like system load, error rates, queue depths. We don't log who generated what address, or the content beyond its 24-hour life. Speaker 1 And the short retention helps with privacy compliance. Speaker 2 Massively. The 24-hour TTL is our biggest compliance feature. By design, we minimize the data we hold and how long we hold it. This inherently aligns well with principles like data minimization found in GDPR and similar regulations. The data simply doesn't exist long enough to become a major liability. Speaker 1 We chose Cassandra, which favors availability and partition tolerance in the CAP theorem sense. That means eventual consistency is a factor. How do we manage that risk? We can't have emails just disappear or take minutes to show up across nodes. Speaker 2 That's the classic trade-off. We mitigate it by tuning consistency levels. For the critical path, writing e-mail metadata to Cassandra and reading it for the inbox view, we'd likely use quorum reads and writes. This means a write must be acknowledged by a majority of replica nodes before it's considered successful, and a read must also query a majority.
It significantly increases the likelihood of reading the most recent write, giving strong consistency for the core user experience. Speaker 1 At the cost of slightly higher latency compared to reading from just one node. Speaker 2 Potentially yes, but it's a worthwhile trade-off for correctness here. For less critical things, maybe like updating usage statistics or even propagating attachment references to S3, we might relax the consistency to gain performance, accepting eventual consistency there. Speaker 1 Makes sense. And fault tolerance? We've layered it in multiple places. Speaker 2 Yeah, it's built in throughout. Multiple domains prevent a single point of failure for reputation. Multiple SMTP servers sit behind a load balancer. Kafka acts as a massive buffer, tolerant of downstream worker failures. Cassandra itself is designed for fault tolerance through replication: if a node dies, data is still available on others. And our processing workers should be designed with retries and idempotency, using circuit breakers, so a temporary failure in, say, the spam-checking service doesn't bring down the whole pipeline. Speaker 1 It really is engineered for continuous operation despite failures: high velocity, rapid destruction, built to withstand bumps. Speaker 2 That's the goal: an ephemeral system designed for resilience. Speaker 1 OK, this leads to a really fascinating final thought for you, our listener, to ponder. We've just spent all this time meticulously designing defenses: strong address generation, IP rate limiting, CAPTCHAs, domain rotation, sophisticated multi-layered spam filtering, maybe even looking at sender behavior analysis down the line. We built all these walls. Speaker 2 Yeah, a lot of effort goes into defense. Speaker 1 But the fundamental premise of the service is easy, anonymous access. That makes it inherently attractive to bad actors, right? They'll want to use it to test spam campaigns, register fake accounts, receive phishing confirmations, whatever.
Despite all our filters and limits, some abuse will get through. Speaker 2 It's inevitable. The classic cat-and-mouse game. Speaker 1 So here's the real challenge, beyond the initial design: how does the team running this service decide where to draw the line? Every time you make the spam filter more aggressive, or the rate limits tighter, or the CAPTCHA harder, you might block more abuse, but you also risk blocking legitimate users or adding friction for people genuinely seeking privacy. Speaker 2 Yeah, the false positive problem. Speaker 1 Where is that ethical and technical balance point? How much potential abuse is acceptable to preserve frictionless anonymous access, and how does that balance shift over time as attackers get smarter? It seems like a constant ongoing calibration with no easy answer. Speaker 2 You're absolutely right. That's not just a launch decision. It's a continuous operational and ethical dilemma. The architecture we designed gives the team the tools to fight that battle. They can tune the filters, adjust limits, swap components, but how aggressively they use those tools is a constant judgement call. There's no perfect setting. Speaker 1 A fascinating challenge that sits right at the intersection of technology, privacy, and security. Thank you for diving deep into this complex design with us today. Speaker 2 My pleasure. It was a great problem to think through.
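The quorum trade-off discussed in this section comes down to simple arithmetic: if the number of replicas acknowledging a write (W) plus the number consulted on a read (R) exceeds the replication factor (N), every read set shares at least one replica with every write set. The sketch below checks this by brute force. The replication factor of 3 used in the note is an assumption, not a figure from the discussion, and both function names are hypothetical.

```python
from itertools import combinations


def quorum(replication_factor):
    """Majority of replicas: floor(N / 2) + 1."""
    return replication_factor // 2 + 1


def always_overlap(n, w, r):
    """True if every possible write set of size w intersects every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )
```

With N = 3, `quorum(3)` is 2, and `always_overlap(3, 2, 2)` holds, whereas writing and reading a single replica each (`always_overlap(3, 1, 1)`) does not, which is the relaxed setting the speakers reserve for less critical data.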

Podcast Summary

Key Points:

  1. The system is designed as a scalable, anonymous disposable email service with 24-hour email retention and no user authentication.
  2. Core requirements include handling 1 million emails daily (50 GB data), high availability (99.9%), sub-200ms latency for inbox display, strong spam filtering, and automatic deletion.
  3. Key architectural components are a load balancer, email address generator, SMTP servers, a NoSQL storage service with TTL support, a web interface, spam filter, and a cleanup service.
  4. Security relies on cryptographically strong, unpredictable address generation (using UUIDs) and domain rotation to manage sender reputation risks.
  5. A queue-based processing pipeline (e.g., Kafka) decouples email ingestion from processing to handle traffic spikes and ensure reliability.
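The scale figures in the key points can be turned into rough averages with a little arithmetic. This is a back-of-the-envelope sketch: it assumes decimal gigabytes and a perfectly even arrival rate, and the replication factor of 3 is an assumption that does not appear in the discussion.

```python
EMAILS_PER_DAY = 1_000_000
BYTES_PER_DAY = 50 * 10**9        # 50 GB, assuming decimal gigabytes
SECONDS_PER_DAY = 24 * 60 * 60    # also the retention window, 24 hours
REPLICATION_FACTOR = 3            # assumed, not stated in the source

# Average ingest rate if traffic were perfectly even (real peaks will be higher).
avg_emails_per_second = EMAILS_PER_DAY / SECONDS_PER_DAY

# Average size of one email including attachments.
avg_email_bytes = BYTES_PER_DAY / EMAILS_PER_DAY

# With 24-hour retention, roughly one day of data is live at any time.
live_bytes = BYTES_PER_DAY
live_bytes_replicated = live_bytes * REPLICATION_FACTOR
```

That works out to roughly 11.6 emails per second on average and about 50 KB per email, so the live data set stays near 50 GB before replication.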

Summary:

The discussion outlines the design of a disposable email service focused on anonymity, scalability, and ephemerality. Key functional requirements include generating unique email addresses without registration, receiving and storing emails with attachments, providing a web interface for access, and automatically deleting all data after 24 hours. Non-functional requirements target handling 1 million emails daily (50 GB of data) with high availability (99.9%), sub-200ms latency for inbox display, robust spam filtering, and cost-effectiveness.

The architecture comprises seven main components: a load balancer, address generator, SMTP servers, NoSQL storage with built-in TTL, a web interface, spam filter, and cleanup service. Critical design differentiators from traditional email include no user authentication (access is solely via the generated address), short data retention, and high churn of ephemeral data. Security emphasizes cryptographically strong address generation using UUIDs and domain rotation to mitigate reputation risks.

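The system employs a queue-based pipeline where SMTP servers quickly offload emails to a message queue (e.g., Kafka), allowing asynchronous processing by workers for parsing, spam filtering, and storage, ensuring reliability under load.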

FAQs

The service must generate unique email addresses without registration, receive and store emails sent to those addresses, display them via a web interface, support attachments, and automatically delete all data after 24 hours.

It must handle high scale (e.g., 1 million emails daily), ensure high availability (99.9%), provide low latency (under 200ms for inbox display), maintain strong privacy and security, and be cost-effective.

Addresses are generated using cryptographically strong random UUIDs to prevent guessing, with collision detection to guarantee uniqueness. Access is solely based on knowing the address, as there is no user authentication.
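A minimal sketch of this generation step, assuming the set of already issued addresses can be checked in memory. The function name `generate_address`, the `taken` set, and the domain `example.test` in the note are hypothetical; a real deployment would check uniqueness against its datastore.

```python
import uuid


def generate_address(domain, taken):
    """Return a new address whose local part is a random (version 4) UUID.

    `taken` is the set of addresses already issued; it is updated in place.
    """
    while True:
        # In CPython, uuid4() draws its random bits from os.urandom.
        local_part = uuid.uuid4().hex  # 32 hexadecimal characters
        address = f"{local_part}@{domain}"
        if address not in taken:       # collision detection
            taken.add(address)
            return address
```

Calling `generate_address("example.test", issued)` twice yields two different addresses and records both in `issued`. The retry loop exists only to guarantee uniqueness, since collisions between random UUIDs are extremely unlikely.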

Using a pool of domains helps manage reputation risks. If one domain gets flagged for spam, it can be rotated out of use, isolating damage and ensuring overall service deliverability remains high.
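One way to picture the rotation is a pool that hands out only active domains and lets operators retire a flagged one. This is a sketch under assumptions: the `DomainPool` class and its method names are hypothetical, and how a domain gets flagged is outside its scope.

```python
import secrets


class DomainPool:
    """Tracks which domains may be used for newly generated addresses."""

    def __init__(self, domains):
        self._active = list(domains)
        self._retired = []

    @property
    def active(self):
        return tuple(self._active)

    def pick(self):
        """Choose an active domain at random for a new address."""
        if not self._active:
            raise RuntimeError("no active domains left")
        return secrets.choice(self._active)

    def retire(self, domain):
        """Rotate a flagged domain out so no new addresses use it."""
        if domain in self._active:
            self._active.remove(domain)
            self._retired.append(domain)
```

After `retire()` is called for a flagged domain, `pick()` never returns it again, so any reputation damage stays with the addresses already issued on that domain.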

A NoSQL database is preferred due to its ability to handle high write throughput, efficient horizontal scaling, flexible schemas for email data, and built-in TTL support for automatic 24-hour expiration.

SMTP servers perform basic validation and then immediately place raw email data into a high-throughput message queue (like Kafka). This decouples ingestion from processing, preventing overload and ensuring reliability during traffic spikes.
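The hand-off can be sketched with the standard library, using `queue.Queue` as a stand-in for Kafka. The size cap, the function names, and the shape of the validation are all assumptions made for illustration. The point is only that the receiving side does cheap checks and enqueues raw bytes, while parsing happens later in a worker.

```python
import queue
from email import message_from_bytes

MAX_RAW_BYTES = 10 * 1024 * 1024  # assumed size cap, not from the source

ingest_queue = queue.Queue()      # stand-in for the Kafka topic


def smtp_receive(recipient, raw, known_addresses):
    """Ingestion side: basic validation only, then enqueue the raw message."""
    if recipient not in known_addresses or len(raw) > MAX_RAW_BYTES:
        return False              # rejected without any parsing
    ingest_queue.put((recipient, raw))
    return True


def process_one():
    """Worker side: take one raw message off the queue and parse its headers."""
    recipient, raw = ingest_queue.get_nowait()
    message = message_from_bytes(raw)
    return {"to": recipient, "from": message["From"], "subject": message["Subject"]}
```

Because `smtp_receive` returns as soon as the message is queued, a slow parser or spam filter only lengthens the queue and does not stall the SMTP servers.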
