This is Software Engineering Radio, the podcast for professional developers on the web at
se-radio.net.
SE Radio is brought to you by the IEEE Computer Society and by IEEE Software magazine, online
at computer.org/software.
Hello everyone, welcome to this episode of Software Engineering Radio.
Our guest today is Qian Li.
Qian Li is a co-founder and chief architect at DBOS.
Before founding DBOS, Qian completed her PhD in computer science at Stanford in 2023.
Her PhD research has focused on abstractions for efficient and reliable cloud computing.
Qian is also the co-organiser of South Bay Systems, which is an independent
talk series focusing on systems programming.
Before we actually talk a little bit more about DBOS and its research origins, I'd
like to point listeners to episode 596, which is Maxim Fateev on Durable Execution with
Temporal.
There are also a few other episodes we have done on related topics, and I'll put those
in the show notes.
These are episodes 351, 223 and 198.
So happy to have you here, Qian, to talk about DBOS and Durable Execution.
Welcome to the show.
Would you like to add something to your bio before we jump right in?
Thanks for inviting me to the show.
Yeah, so originally DBOS started as a joint research project.
It's a collaboration between Stanford and MIT.
We started the project in 2020.
It was also led by the Postgres creator, Professor Mike Stonebraker, and the Spark creator and
Databricks co-founder, Matei Zaharia.
So during the research project, we really tested the capability of databases to see
how databases can help you create reliable programs and how databases can help you make
your programs more observable and debuggable.
So during the research project, we built several prototypes and we wrote several papers.
And when we presented it, people were really excited about the capability that DBOS can
bring.
So when we graduated in 2023, we decided to co-found DBOS based on the research project.
And DBOS as a company right now, we focus on building durable software.
And we believe that all software should be reliable and observable and scalable by default.
So now DBOS stands for durable backend, observable, and scalable.
So you mentioned the use of databases for basically running your applications.
But why is that a new concept?
Everybody uses databases for running apps, or most people do.
So what is the difference here?
What is the secret sauce that you were talking about?
Yeah, so it's true that people have been storing their business critical data in databases
for like 30, 40 years.
But the new concept is that we also want to persist a program's execution state in the database.
Say here's your program and it has multiple steps.
We want to persist each step's input and output into the database so that if your
program crashes or the machine fails, we'll be able to resume from exactly where you left
off.
So the idea is to, in addition to your application data, we also store your program execution
state in the database, especially if you're having long running and dynamic workflows
in your programs.
You really don't want to restart from scratch every time you hit a system error or you have
a machine failure.
So Qian, maybe you can explain to listeners what exactly you define as a workflow, as opposed
to a service?
Yeah, so to get started, I think we can talk about what a workflow is in this context.
So traditionally, people think of workflows as state machines.
You have to define a DAG, stuff like that.
But actually in DBOS, anything can be a workflow.
So to give a concrete example, a workflow is a sequence of operations or function calls.
So a very typical example in, like, job execution is, let's say, a checkout service.
So if you're implementing a checkout service, usually you have to call, say, reserve inventory.
You have to update a database to make sure that you have enough inventory.
And then after that, if you successfully reserve the inventory, you will call out to the payment
processor.
For example, this could be an external service like Stripe or PayPal.
You will say, I want to charge a user this much.
And after that, you need to wait for the response from those services.
And then based on the result, you will decide whether to fulfill the order and send a confirmation
email to the user, or you have to undo your reservation for the inventory and then cancel
the order and send a cancellation email.
So this process can be abstracted as a workflow.
And then what guarantees do we want for this workflow?
First, we want to make sure that once I click that checkout button, all steps will eventually
succeed or complete, right?
I don't want to say I paid for my purchase but never received my item, or I reserved the
inventory but never charged the user.
So that's the first guarantee.
And the second guarantee is that I want effectively exactly-once execution.
So if I charge a user, I want to charge them once.
And if I reserve the inventory, I also want to only reserve it once.
So you talked about workflow and you mentioned transaction.
What's the relationship?
So in this context, a transaction is essentially an interaction with a local database.
And then the workflow can be composed of transactions and also external services, or we just call
them steps in general.
So in DBOS, a workflow is a sequence of steps.
And some steps could be transactions that talk to a database.
And other normal steps, well, normal steps can be any functions, right?
They can talk to external APIs or can have any non-deterministic operations
in that step.
But overall, the workflow needs to be deterministic.
Determinism and idempotency are two foundations for durable workflows.
And essentially, what we build is a library that you can easily install and then
add annotations to your program.
So I described the checkout workflow.
To make it durable in DBOS, you essentially put the annotation @DBOS.workflow in front
of the overall orchestration workflow function.
And then you put @DBOS.step at the function definition of every individual step.
Then when we call that function, essentially we'll first, we basically put wrappers around
those functions to make them durable.
So to make it more concrete, basically when you start a workflow, when you call the workflow
function, the wrapper would first checkpoint the inputs of the workflow in the database.
And then for each step we first check, have we executed this step before?
If so, we directly read the recorded output from the database and return the output instead
of re-executing it.
So this way we guarantee exactly once.
So the thing is for external systems, the way to guarantee exactly once is to guarantee
at least once plus an idempotency key.
And DBOS basically automatically generates an idempotency key per workflow and per step.
So we can guarantee that each step will execute effectively exactly once.
And then after each step finishes, we'll checkpoint the output into the database and
so on and so forth until we reach the end of the workflow.
And say if something crashes in the middle, when we recover, the DBOS library will look
into the database and see which workflow is still pending.
And then if the workflow is still pending, we'll just essentially replay the workflow
from the start and walk through all the steps.
And then for the finished steps, we'll already see the database record, and then we'll skip
those steps.
And then we'll eventually land at the last completed step and then we'll resume
the workflow from where it left off.
So that's basically a high-level idea of how DBOS works.
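
The checkpoint-and-replay behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the library's actual implementation: it uses an in-memory SQLite database as a stand-in for Postgres, and the table name step_outputs, the helper run_step, and the explicit step numbers are all hypothetical.

```python
import json
import sqlite3

# Stand-in for Postgres so the sketch runs anywhere; table and column names are hypothetical.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE step_outputs ("
    " workflow_id TEXT, step_seq INTEGER, output TEXT,"
    " PRIMARY KEY (workflow_id, step_seq))"
)

calls = []  # records which steps really ran, to show what replay skips


def run_step(workflow_id, step_seq, fn, *args):
    """Return the recorded output if this step already ran, else run and checkpoint it."""
    row = db.execute(
        "SELECT output FROM step_outputs WHERE workflow_id = ? AND step_seq = ?",
        (workflow_id, step_seq),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # finished step: skip re-execution
    result = fn(*args)
    db.execute(
        "INSERT INTO step_outputs VALUES (?, ?, ?)",
        (workflow_id, step_seq, json.dumps(result)),
    )
    db.commit()
    return result


def reserve_inventory(item):
    calls.append("reserve")
    return {"item": item, "reserved": True}


def charge_user(amount, fail=False):
    calls.append("charge")
    if fail:
        raise RuntimeError("simulated crash")
    return {"charged": amount}


def checkout(workflow_id, item, amount, fail_on_charge=False):
    """Deterministic orchestration: same steps in the same order on every replay."""
    reservation = run_step(workflow_id, 0, reserve_inventory, item)
    payment = run_step(workflow_id, 1, charge_user, amount, fail_on_charge)
    return {**reservation, **payment}


# First attempt crashes during the second step.
try:
    checkout("wf-1", "book", 20, fail_on_charge=True)
except RuntimeError:
    pass

# Recovery replays from the start; step 0 is served from the checkpoint.
result = checkout("wf-1", "book", 20)
```

In this sketch the reservation runs once while the charge is attempted twice, which is why a step that talks to an external system also needs to be idempotent.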
You mentioned the idempotency key. Could you explain that further?
Yeah.
So essentially, an idempotency key is a way to guarantee that each step is executed once
and only once.
So essentially, when I work with an external API, say Stripe payments, right, you want to say
I only want to charge this user once.
And then you can piggyback an idempotency key in your request to Stripe.
So if Stripe receives the same key again, it will not issue another invoice to the user.
Stripe will be able to look up its own database and say, hey, I already have an outstanding
invoice for this checkout session, so they will only charge the user once.
So I think idempotency is a very important concept in terms of durable execution.
And in DBOS, we automatically generate an idempotency key for a workflow if you don't
already provide one, and then for each step, we'll basically append the step sequence number
to it.
So every step in the workflow will also be uniquely identified.
So the reason we need an idempotency key is that you can't really control external systems,
right?
Say, even if we only call Stripe API once, we don't know whether the network connection
will fail or maybe Stripe will fail or there are some intermediate transient errors that
may happen.
So like, if we don't get the result back, we may think it failed.
But in order to make sure that the checkout goes through, we may want to retry it.
And if we retry it, we can't generate a new session because maybe the Stripe payment
already succeeded, but the network was cut when Stripe tried to send the result back.
So we have to retry it.
And when we retry it, the way to tell Stripe that this is the same checkout request I sent
before is to attach the idempotency key.
So the idempotency key is the key to guarantee correctness and exactly-once execution when
you work with external systems.
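
To make the retry scenario concrete, here is a small sketch of at-least-once delivery combined with an idempotency key. Everything in it is hypothetical: FakePaymentProvider is a toy stand-in rather than Stripe's API, and the key format produced by step_key (workflow key plus step sequence number) is only one plausible way to follow the description above.

```python
class FakePaymentProvider:
    """Hypothetical stand-in for an external payment API that deduplicates by key."""

    def __init__(self):
        self.charges = {}  # idempotency key -> recorded charge

    def charge(self, idempotency_key, user, amount):
        # A repeated key returns the original charge instead of creating a new one.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {"user": user, "amount": amount}
        return self.charges[idempotency_key]


def step_key(workflow_id, step_seq):
    """Derive a per-step key by appending the step sequence number to the workflow key."""
    return f"{workflow_id}-{step_seq}"


def charge_with_retry(provider, workflow_id, step_seq, user, amount, drop_first_response=True):
    key = step_key(workflow_id, step_seq)
    for attempt in range(3):
        response = provider.charge(key, user, amount)
        if drop_first_response and attempt == 0:
            continue  # simulate a lost response: the charge happened, but we never saw it
        return response
    raise RuntimeError("payment did not complete")


provider = FakePaymentProvider()
receipt = charge_with_retry(provider, "checkout-42", 1, "alice", 20)
```

The request is sent twice because the first response is lost, yet the provider records a single charge because both requests carry the same key.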
Could you talk more about the DBOS architecture?
You did talk about the workflow and the step.
Are there any other key elements of the architecture you'd like people to know about,
especially if they have origins in the research that you had done in the past?
Yeah.
So DBOS workflows and steps and transactions are implemented directly as SQL interactions
in the library.
And then beyond that, we also have several other primitives.
One that is widely used is called DBOS queues.
So we also implement queues on top of Postgres.
With queues, basically you will be able to group your invocations.
So instead of directly invoking or executing a workflow synchronously, you can enqueue
a workflow and then other workers will be able to pick it up.
So the queue is also just a database record, right?
You enqueue by saying, here's the task, here's a queue name, and then the workers
can just pull from each queue.
So the benefit of a queue is that it makes it really easy to control concurrency and rate
limiting.
Say, for example, we can have different queues like an OpenAI queue, a Claude queue, stuff like
that.
And we want to say, for the OpenAI queue, we only want to send requests five times
per minute.
So this is a way to group your invocations and to control how many outstanding requests
you send to external systems.
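
A queue that is "just a database record" with a rate limit can be sketched as follows. This is an assumption-laden illustration, not the DBOS queue implementation: SQLite stands in for Postgres, and the queue_tasks table, the enqueue and dequeue helpers, and the explicit now argument (used to keep the example deterministic) are all hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE queue_tasks ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " queue_name TEXT, payload TEXT, started_at REAL)"  # started_at is NULL until dequeued
)


def enqueue(queue_name, payload):
    db.execute(
        "INSERT INTO queue_tasks (queue_name, payload) VALUES (?, ?)",
        (queue_name, payload),
    )
    db.commit()


def dequeue(queue_name, limit, period_seconds, now):
    """Start the oldest waiting tasks without exceeding `limit` starts per period."""
    recent = db.execute(
        "SELECT COUNT(*) FROM queue_tasks"
        " WHERE queue_name = ? AND started_at > ?",
        (queue_name, now - period_seconds),
    ).fetchone()[0]
    budget = max(0, limit - recent)
    rows = db.execute(
        "SELECT id, payload FROM queue_tasks"
        " WHERE queue_name = ? AND started_at IS NULL ORDER BY id LIMIT ?",
        (queue_name, budget),
    ).fetchall()
    for task_id, _ in rows:
        db.execute("UPDATE queue_tasks SET started_at = ? WHERE id = ?", (now, task_id))
    db.commit()
    return [payload for _, payload in rows]


for i in range(7):
    enqueue("openai", f"prompt-{i}")

# At most five starts per 60 seconds on the "openai" queue.
first = dequeue("openai", limit=5, period_seconds=60, now=1000.0)
second = dequeue("openai", limit=5, period_seconds=60, now=1010.0)
third = dequeue("openai", limit=5, period_seconds=60, now=1061.0)
```

The first call starts five tasks, the second starts none because the budget for the minute is spent, and the third picks up the remaining two. With several workers polling the same table, the select-and-update would also need row locking, which this single-process sketch leaves out.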
Another thing that is based on the research is the messaging system.
So workflows and steps, basically they are all within the DBOS territory, but when
you need to interact with external systems, you need a way to communicate with a workflow.
So to give an example of this,
take the checkout workflow, right?
When you call Stripe, you don't block and wait until the user pays.
So what actually happens is that Stripe will directly return a confirmation saying, okay,
we are processing your payment.
But when your payment is done, we'll send you a webhook invocation to say this payment
is done.
And now how to do it in DBOS is that, in a workflow, we can have a DBOS.receive on
the specific topic waiting for the specific Stripe payment session, and then we can implement
a webhook that listens for the Stripe callback.
When a Stripe callback arrives, because as we mentioned we have the idempotency key,
we can uniquely identify which workflow we should call back to.
And then we will be able to call, say, DBOS.send with that idempotency key and send the Stripe
payment result, either paid or payment rejected, stuff like that.
And then the workflow, which is waiting on the receive, will be able to get that information.
So this is really useful. I think almost all payment systems, for
example, use this type of pattern.
And also if you want anything that involves, like, human in the loop, say, you want to
send a human verification email and then you want to wait for the human confirmation to
come back, you can't just block there and wait.
You have to wait for a callback.
And the DBOS send and receive primitives allow you to do that.
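
The callback pattern above can be sketched with a toy mailbox. This is not the DBOS messaging API: the send and recv functions below are hypothetical stand-ins backed by in-memory queues (a durable version would keep messages in the database), and the callback payload fields are assumptions for illustration.

```python
import queue
import threading

# Hypothetical in-memory mailboxes keyed by (workflow id, topic).
_lock = threading.Lock()
mailboxes = {}


def _mailbox(workflow_id, topic):
    with _lock:
        return mailboxes.setdefault((workflow_id, topic), queue.Queue())


def send(workflow_id, message, topic):
    _mailbox(workflow_id, topic).put(message)


def recv(workflow_id, topic, timeout_seconds):
    """Wait for a message on the topic; return None if nothing arrives in time."""
    try:
        return _mailbox(workflow_id, topic).get(timeout=timeout_seconds)
    except queue.Empty:
        return None


def checkout_workflow(workflow_id):
    # ... reserve inventory and start the payment session here ...
    status = recv(workflow_id, "payment", timeout_seconds=5)
    return "fulfill order" if status == "paid" else "cancel order"


def payment_webhook(callback):
    # The callback carries the key, which identifies the waiting workflow.
    send(callback["idempotency_key"], callback["status"], "payment")


outcome = {}
worker = threading.Thread(
    target=lambda: outcome.update(result=checkout_workflow("checkout-42"))
)
worker.start()
payment_webhook({"idempotency_key": "checkout-42", "status": "paid"})
worker.join()
```

The workflow thread waits on its topic while the webhook handler, running elsewhere, delivers the payment result to exactly that workflow by using the key from the callback.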
And this library is what you call DBOS Transact.
Anybody can use it as long as they annotate their code.
Yes.
So DBOS Transact is an open source library. It's MIT-licensed, so anyone will be able
to use it.
It's currently available in TypeScript and Python, and we are adding more language support.
So for the audience, if you have any feedback on what languages you wish to see, please
contact us and we'll add more.
So for the Transact library, you can install it and run it anywhere.
What is DBOS Cloud?
Right.
So DBOS Cloud is a serverless hosting platform for Transact applications. You can run
your Transact applications anywhere, on your own laptop or on a Kubernetes cluster, but
if you don't have any resources and if you don't want to manage any clusters, you can
deploy your app to DBOS Cloud.
So if I do use the DBOS Transact library, I must get a Postgres. If I wanted to use another
database system, could I?
Usually DBOS Transact supports any Postgres-compatible database, so you can use it with your own
Postgres server.
Many of our users simply add DBOS Transact because they already have a Postgres server,
so they just use the same server to store their execution state data.
But you can also use other offerings.
Each cloud provider has some Postgres offering, and we're also compatible with newer serverless
Postgres like Neon, or Supabase, or CockroachDB, or YugabyteDB. There are a lot of options
here.
So is that just because that's the database you tested with, or are there any specific
features of Postgres or Postgres-compatible databases that you leverage?
Yeah, I guess it goes back to the question of why we chose Postgres.
So there are a couple of reasons we chose Postgres.
First is that the ecosystem is huge.
As I said, there are a lot of providers in the cloud, or there are a lot of on-prem solutions
as well.
People know how to operate Postgres.
It's a very mature and battle-tested technology, and people really trust Postgres.
So that's one reason.
And the second reason is that it's a relational database, so it has built-in transactions.
It's really reliable, so we don't need to worry about that; it has backups, and it has
replication if you want it.
And finally, the extension ecosystem is great.
Some of our users actually use Postgres as a vector store.
So some people also use Postgres to store time-series data.
So really, with a single database, you can achieve a lot of things.
You can store basically all data, from transactional data to analytics data to vector data.
So we really like this versatility of Postgres.
A lot of our listeners, I think, are really interested in the how, like, peering inside
the black box.
So if they have access to the Postgres database where the state is getting stored,
if they peek in there, what will they see, and where should they peek?
Yeah, so DBOS stores all the information in a separate logical database inside the
Postgres server.
So you can think of a logical database as another namespace within the Postgres server
to isolate it from your main application database, so that we don't interrupt or we don't disturb
your normal tables or your normal queries.
So in that database, under the DBOS schema, you will be able to find several tables that
store the information.
So there are several tables you want to look into.
The first is the workflow status table.
That's the core table that stores workflow information.
And then when you first execute a workflow, it will say the workflow is pending.
And then as your workflow progresses and eventually succeeds or fails, we'll update the workflow
status to error with the actual error, or success with the actual output.
So that table is the core that drives DBOS and stores the state changes of DBOS.
And then the second table you want to look at is the operation outputs table.
That's where we store the serialized output of each step.
And by looking into that table, you will see the workflow ID and then the step ID inside
a workflow and the result of every single step.
So by looking at those two tables, you can piece together what workflow has executed
when and what steps have executed at what time.
Besides that, we also have the workflow queues table.
The queues table basically groups workflows into different queues.
And by different queues, we just mean that we assign a different name in the queue
names table, and that will allow us to quickly look up what workflows are assigned to be
executed in that queue.
So I would say the real beauty is in SQL.
You can use SQL to manage your workflows, and you can also use SQL to simply query them
and observe what happened in your system.
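
As a rough illustration of observing workflows with plain SQL, the sketch below builds two small tables modeled on the description above and queries them. The table layouts, column names, and status values are assumptions made for the example and may differ from the real schema; SQLite is used only so the snippet runs without a Postgres server.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    -- Hypothetical, simplified versions of the two tables described above.
    CREATE TABLE workflow_status (workflow_id TEXT PRIMARY KEY, name TEXT, status TEXT);
    CREATE TABLE operation_outputs (workflow_id TEXT, step_id INTEGER, output TEXT);
    INSERT INTO workflow_status VALUES
        ('wf-1', 'checkout', 'SUCCESS'),
        ('wf-2', 'checkout', 'PENDING'),
        ('wf-3', 'checkout', 'ERROR');
    INSERT INTO operation_outputs VALUES
        ('wf-1', 0, 'reserved'), ('wf-1', 1, 'charged'),
        ('wf-2', 0, 'reserved');
    """
)

# Which workflows have not finished successfully?
unfinished = db.execute(
    "SELECT workflow_id, status FROM workflow_status"
    " WHERE status != 'SUCCESS' ORDER BY workflow_id"
).fetchall()

# How many steps has each workflow completed?
progress = db.execute(
    "SELECT s.workflow_id, COUNT(o.step_id)"
    " FROM workflow_status s"
    " LEFT JOIN operation_outputs o ON o.workflow_id = s.workflow_id"
    " GROUP BY s.workflow_id ORDER BY s.workflow_id"
).fetchall()
```

Joining the status table with the step outputs is enough to answer questions such as which workflows are stuck and how far each one got.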
Do you provide out-of-the-box visualizations for what you see in these tables?
Yes.
If you log into the DBOS console, at console.dbos.dev, you will be able to see we have a table visualization.
You can basically select or filter based on the workflow names or based on the workflow
status.
And we are also actively developing a graph visualization of the workflow execution graph.
So, say, what workflow has started and how many steps it has finished, and what the
parent-child relationship between different workflows is. Say, one workflow can invoke a
sub-workflow that executes other tasks, so you'll be able to connect the dots by looking
at the graph.
Qian, do you have any real-world examples, maybe from talking to your customers, where unreliable
workflow executions caused them really problematic system failures or inefficiencies?
Yeah.
So let's see.
There are several use cases.
My favorite one is that one of our customers has to persist data across multiple systems.
Say Shopify sends some data through Kafka, and each message contains some
customer data.
They want to persist it in their local Postgres database.
They want to persist it in their CRM, and they want to persist it in their ERP system.
So they want to make sure that the data is consistent across all systems.
So they chose DBOS. If you don't have correct customer data, you can probably lose
customers, so that's what they really don't want to see. And DBOS guarantees that whenever
you receive a message from Kafka, the message will appear across all systems.
Maybe that answers one of my questions, which was why would I not roll my own or use code
generators to help me write the code for durable execution?
I think what you say tells me that there are a lot of guarantees and compliance
that using DBOS might help me with.
Is that fair?
Are you going through some kind of compliance process that people can leverage?
Yeah.
So I think we can talk a bit more about the observability goal of DBOS.
So because everything is stored in Postgres, it's really easy to query and visualize what's
going on.
So my favorite quote from customers is that they said, "DBOS is great because everything
is in Postgres.
I can just use SQL queries to see what workflows have run, what workflows have failed, and what
happened at each step."
And based on the information we store in Postgres, we're also developing more observability
features like graph visualization of a workflow and its steps, and the workflow may be spawning
multiple workflows there.
So we are visualizing the parent-child relationship there.
And more than that, what we can provide is actually management over your workflows.
So because those are just database records, if you want to, say, resume a workflow, you
can just re-enqueue it, put it back into the queues table.
If you want to cancel a workflow, then you can just mark the workflow as canceled, and
then the downstream process will just cancel the operation.
And, we can talk more about this later when we talk about operations
with DBOS,
but if you want to restart a workflow with a new version of your code, you can just
copy the workflow input information and assign it a new idempotency key, and then just
start execution on the new version.
And because data is in Postgres, it's a relational database, it's structured data.
And structured data is really easy to analyze and to observe.
A little bit of a segue here: I would like to contrast DBOS with solutions already
in the market.
And I did mention our earlier episodes, the latest one being with Maxim Fateev on Temporal.
How does DBOS contrast with Temporal?
And is your definition of workflow consistent with Temporal's definition of workflow and
durable execution?
Yeah.
So that's a good question, and we get asked that question a lot.
So I think the workflow definitions in DBOS and Temporal are essentially the same.
But the core difference is how we perceive the implementation of this durable execution.
So for Temporal, we call the pattern the "external orchestration" pattern.
External orchestration means you have to start a Temporal server, and then
when you run your workflow, instead of directly executing each step, your workflow function
will need to enqueue a step with the Temporal server.
And then the Temporal server will instead
notify a worker node to process that; they call it an activity, we call it a step.
And then after the worker node finishes a step or activity, it will send the result back to the Temporal
server.
The Temporal server will persist the result and then send it back to the main workflow function.
So basically for every invocation of your step, there are multiple network hops.
And then by contrast in DBOS, we essentially embed durable execution in your program.
So when you call the workflow, it's just a function call.
And when a workflow calls a step, it's again a function call, but the function call
is intercepted by the DBOS library.
So everything will happen just in your program.
So we believe the benefit of DBOS is that it's really simple.
All you need is your program and your Postgres database.
You don't have to deploy an extra orchestration server, and you don't have to deploy like
distributed workers to process your steps.
I get the part about not having to deploy an orchestration server, but then how do you
achieve orchestration?
So DBOS implements durable execution as orchestration.
So just to clarify that, we achieve durable execution by intercepting the function calls.
So you can think of it as when you call a DBOS-decorated function, instead of directly
executing the function, we wrap around the workflow function, for example, to say we first persist
the input and then call the function.
When we execute the workflow function, the workflow function will call each step.
And again, each step is also wrapped by DBOS.
So in the wrapper, DBOS will first check if the step has finished before.
If not, it executes the step and persists the result.
If it has been executed before, it directly returns the recorded result.
So everything happens in the language layer.
So that's why we don't need to send it over to a separate worker, to a separate message
queue to achieve this.
Are there any other areas where it's important to compare and contrast with Temporal?
I think there are two main areas. One is simplicity.
In order to add DBOS to your program, it's really simple.
You install it as a library, add it to your existing program by decorating your functions, and
you're done.
And then to implement something in Temporal, you have to basically restructure your program;
like, you have to think in a distributed systems way.
Every time you call out to a step, it is essentially an RPC call to another worker
to execute it, and then you wait for the result to come back.
So we think simplicity is the number one differentiator, and then it's also very simple to operate.
So say, like, to run DBOS in production, all you need is a Postgres server.
And then basically people already know how to operate Postgres servers.
So we don't add much operational overhead.
By contrast, if you use any external orchestration services, you have to host their service or
you have to rely on their cloud offerings.
For example, every time you want to call a step, you have to communicate with their cloud
and say, "I want to execute a step," and their cloud will send a message to the worker
to execute it, so on and so forth.
So that also has implications for performance, right?
In DBOS, for every step we basically add a database write, which is, like, a few
milliseconds.
Whereas if you do everything over the network, that will easily take a few hundred milliseconds.
Other than the simplicity of use, do you have any thoughts about when a developer
who needs durable execution should consider DBOS versus Temporal?
Yeah.
Temporal is really great technology, like, it's used by a lot of companies.
I think one benefit of the Temporal model is that, if you want to have a workflow that
consists of different steps written in different languages, it could be easier to
use the Temporal model.
Say you have a workflow written in Python, but maybe some steps are written in Go, others
in Java and Rust. If you have such a heterogeneous workflow, it will currently be easier to do
it in Temporal.
On the other hand, if your program is just in Python or TypeScript, it will definitely
be easier to do in DBOS.
In fact, like, a current trend is what we call full-stack applications.
So when users write their front-end code in TypeScript, they actually also want to write
their back-end in TypeScript.
So that's why DBOS, in this case, makes more sense because it's very lightweight.
You just add it as another TypeScript library and then use it in your program.
How much of this comparison also applies to some of the other technologies like AWS Step
Functions?
Yeah.
So the external orchestration part is the same.
So we actually ran some performance benchmarks against Step Functions, and we found that for
DBOS, each step is a few milliseconds, whereas in Step Functions, every time you
have to schedule a step, it will push to a queue, and then you have to dequeue and wait
for results, so it will come back in, like, 200 milliseconds.
So the performance gap is pretty large.
And another difference is that Step Functions requires you to use a JSON description language
to basically specify the DAG.
And that could be another overhead, I would say.
Though Step Functions also provides a graphical interface.
You can drag and drop your workflows into a DAG.
This is really nice, and this can be used by, for example, developers or people without
too much coding experience.
However, the problem is that it's very easy to use when you have simple tasks, but when
you have, like, more complicated tasks, it's hard to manage.
Just to share an anecdote from our conversations with users:
someone was switching from Step Functions to DBOS because they said eventually they
gave up on code review because they had to review 3,000 lines of JSON code.
And then in that case, using DBOS is better because all your workflow logic is essentially
just your code.
So you write workflows as code, and then you can do code review, you can do your debugging
normally as you would when you develop your functions.
A follow-up question there.
In some of the previous episodes on this topic, the pros and cons of the graphical representation
were discussed.
Certainly, there was a sentiment expressed that developers are not very fond of the
graphical representation, but more business users find it useful.
What is your experience there?
And if I did want to communicate that with the business user, and I was, as a developer,
using DBOS, what are my options?
Yeah, that's a great question.
I think I have the same experience when talking to developers.
They usually say they want code, they want to see code, but when we talk to more business-focused
people, they want to see what's going on.
They want to have a visual or a graph visualization of what happened, like how many steps happened,
how many workflows happened, and if something failed, which step failed.
So in DBOS, we're actually actively developing a graphical visualization
based on the information stored in the database.
So you can't define workflows as a graph, but we provide observability
into what happened, what has executed before.
I think that is a good balance, I'd say, because developers usually use code to develop their
workflows,
while business-focused people usually need to see what happened, like the execution of
those workflows.
I think that's a good combination.
People had talked earlier about scheduling, how one would achieve that.
Can you talk about rollback with DBOS, because rollback was another big plus of a
workflow system, being able to achieve that.
How would I as a developer achieve that if I was using DBOS?
Yeah, so if you use DBOS, you can just handle exceptions normally as you would when you
develop your API or your service.
So in the checkout example,
if the workflow sees that the payment failed, then it will need
to call the undo inventory reservation function, and it also needs to call some functions to
cancel the order.
So in DBOS, you basically express them in code as error handling: if you see this error,
call this sequence of functions.
If you see other types of error, do something else.
So we do ask users to explicitly specify the rollback actions, and the rollback actions
are also steps.
So the benefit of using DBOS is that we guarantee that all
the steps in the workflow will run to completion.
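
Rollback written as ordinary exception handling might look like the sketch below. All function names and the PaymentFailed exception are hypothetical, and the functions are plain Python here; in a durable setting each of them, including the compensating actions, would be a step.

```python
log = []  # records the order in which steps ran


class PaymentFailed(Exception):
    pass


def reserve_inventory(order):
    log.append("reserve")


def charge(order):
    log.append("charge")
    if order["card_declined"]:
        raise PaymentFailed(order["id"])


def undo_reservation(order):
    log.append("undo_reservation")


def cancel_order(order):
    log.append("cancel_order")


def fulfill(order):
    log.append("fulfill")


def checkout(order):
    reserve_inventory(order)
    try:
        charge(order)
    except PaymentFailed:
        # Rollback is ordinary code: each compensating action is itself a step.
        undo_reservation(order)
        cancel_order(order)
        return "cancelled"
    fulfill(order)
    return "fulfilled"


status = checkout({"id": "o-1", "card_declined": True})
```

Because the compensating actions are explicit, different errors can trigger different sequences of undo steps, and the durability guarantee then applies to the rollback path as well as the happy path.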
How do I as a developer take an existing body of code and add the annotations that DBOS wants?
What process would I go through?
Yeah, so we think it's pretty easy.
So you first install the library.
If you use Python, it's pip install.
If you use TypeScript, it's npm install.
And then after that, you can just import DBOS, and then say you have a function you want
to be a workflow.
You just decorate it with @DBOS.workflow.
And then within that workflow function, you look at what function calls it makes.
And then you decorate those functions with @DBOS.step.
I think my question was more like, how do I figure out which functions I should now go
and decorate?
All right.
So I think you will essentially decorate any functions that will talk to external APIs or
talk to the database, anything that will generate side effects outside of your program.
Thanks, Qian.
So I'd like to now go into a different section where we talk about more of the use cases
that we see for durable execution, especially with AI agents.
Before we go there, is there anything else you'd like to add?
Yeah.
So actually, I like to talk a bit more about workflow recovery and failure recovery when
you use in production.
So as we talked earlier, it's true that developers can write checkpointing code manually, but
you want to use a durable execution system or a library like D-Boss or others.
Because you want to have automatic recovery.
So give you a concrete example.
In production, you usually, you have to deploy your code in multiple, if you use Kubernetes,
you will deploy to multiple containers or pods.
And in a large production development, you could probably develop maybe 100 or 1000 pods
that each will serve, well, you'll load balance between those pods to serve requests.
So the problem is that at any point in time or within any hour, some pods will definitely
fail.
So it's a probability to fail as high when you have a large deployment.
And if you already use Kubernetes, you may think, yes, I can just restart those pods.
But the problem is that when you restart a pod, you still need some application logic
to decide how to resume the work you were doing before the crash.
And if you use D-Boss, we provide a service called Conductor.
So this service basically connects to your running applications.
And then it will detect when some of the workers are failing.
So if it detects some workers failed, it will try to redistribute the pending workflows
running on that worker to other healthy workers.
And this way we automatically recover from failures.
And we guarantee that all workflows will eventually run to completion.
So I think this is one reason that you really want to delegate this kind of recovery to
some library or services like D-Boss.
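The redistribution idea described here can be pictured with a small Python sketch. This is only a toy model under stated assumptions: the `Worker` class and `reassign_pending` function are hypothetical, health detection is reduced to a boolean flag, and nothing here reflects how Conductor is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical view of one application instance (for example, a pod)."""
    name: str
    healthy: bool = True
    pending: list = field(default_factory=list)  # IDs of unfinished workflows


def reassign_pending(workers):
    """Move pending workflow IDs from failed workers to healthy ones, round-robin."""
    healthy = [w for w in workers if w.healthy]
    if not healthy:
        raise RuntimeError("no healthy workers available")
    moved = 0
    for w in workers:
        if w.healthy:
            continue
        while w.pending:
            target = healthy[moved % len(healthy)]
            target.pending.append(w.pending.pop(0))
            moved += 1
    return moved


workers = [
    Worker("pod-a", pending=["wf-1"]),
    Worker("pod-b", healthy=False, pending=["wf-2", "wf-3", "wf-4"]),
    Worker("pod-c"),
]
moved = reassign_pending(workers)
```

Once a healthy worker owns a reassigned workflow ID, it can resume that workflow from its recorded checkpoints rather than starting over.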
How does that coexist with the recovery mechanisms in Kubernetes?
Yeah.
So in Kubernetes, basically when you restart, it will restart your application, but it doesn't
have any application semantics.
It doesn't know what workflows failed before.
Essentially, D-Boss as a library will have some background threads that listen for tasks
saying you have to recover this and that.
So D-Boss will dispatch certain recovery commands to your workers.
And this works perfectly with other mechanisms in, say, Kubernetes or other deployments.
So what you mean is that I develop my application with the D-Boss Transact library.
And then when I deploy it to production, I can leverage D-Boss Conductor whether or
not I'm deploying to Kubernetes.
Yes.
Exactly.
Do you use D-Boss conductor automatically in the D-Boss cloud?
Yes.
We have the same capability in D-Boss cloud.
So if you deploy to D-Boss cloud, you have all the automatic recovery plus auto scaling.
All right.
Let's spend some time on AI agents.
That was one big use case on your website.
Why is the problem surfacing now with AI agents and requiring durable execution specifically
for that?
So I think this is a really new and emerging use case for D-Boss.
So actually, besides AI agents, we also have another AI use case, which is AI data pipelines.
If we have time, we can dive into that as well.
And for AI agents, basically, the program will be driven by AI or LLMs.
So say, instead of developers coding what functions to call, the function calling or
tool calling will be decided based on the LLM responses.
So I actually built a very interesting refund agent using LangGraph and D-Boss.
So basically what the agent does is that when it receives a refund request, it will talk
to the LLM to decide which tool to use. Maybe based on the customer record and the
order, it will call a refund workflow.
So very simplistically, am I right in that the agent is essentially an LLM with tools
at the most basic level?
Yeah.
Okay.
Sorry.
Please continue.
Yeah.
So then the agent can say, basically, oh, if the purchase was above a threshold, I will
need to send an email to an admin to verify whether you want to approve this refund or
not.
The refund workflow will have to wait for a human input and then decide what to do next.
So I would say with AI agents, the most challenging part is, first, dynamic execution of workflows.
You can no longer have a very static workflow; workflow branches or function calls
will be very dynamic based on the LLM output.
And second, it's really unreliable in many ways.
From my point of view, we treat AI as an unreliable service in the stack.
So your AI call may fail.
And then every time you call it, it may give you a different result.
And the third thing is that human in the loop is really tricky.
How to make AI work together with humans is a new and challenging topic
these days.
Can you talk more about the agent you developed and how you use D-Boss there?
Yeah.
And for the agent I developed, basically, I decorate my tool function as a D-Boss workflow
so that it guarantees that if I refunded an order before, it will only be refunded once.
So this guarantees that if the user asks, I want to refund it again, then based on the recorded
information we will say you've already been refunded before.
So we'll skip the refund.
And it will guarantee that once we kick off the refund workflow, it will always finish.
So say if something happens to the program, if it crashes in the middle, we will just
resume the refund process from where it left off.
So say, if we already processed the payment and returned the money to the user,
we'll also restore the inventory and so on.
Basic question though.
So if the agent is the LLM with tools, and the LLM is the one calling the tool, which in many cases
is an API call, how does D-Boss now get involved in that part of the workflow?
Yeah, exactly.
So the cool thing is that, as I introduced before, D-Boss is a library with annotations
that you can wrap around your functions.
So when the LLM calls the function, it will call the wrapped function.
And that wrapped function is a durable workflow.
So that's why it's super easy to add D-Boss to any AI framework.
Instead of passing your bare function, you pass the annotated function into your
AI tools.
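To make the "pass the wrapped function as the tool" idea concrete, here is a minimal Python sketch. It assumes a hypothetical `durable` decorator that runs a function at most once per idempotency key, and a plain dictionary standing in for an agent framework's tool registry; neither is the D-Boss or LangGraph API.

```python
import functools

completed = {}  # idempotency key -> recorded result (stand-in for a database)


def durable(fn):
    """Hypothetical decorator: run fn at most once per idempotency key."""
    @functools.wraps(fn)
    def wrapper(key, *args, **kwargs):
        if key not in completed:
            completed[key] = fn(*args, **kwargs)
        return completed[key]
    return wrapper


refunds_issued = []  # records the real side effect


@durable
def refund_order(order_id, amount):
    refunds_issued.append((order_id, amount))
    return f"refunded {amount} for {order_id}"


# The registry holds the wrapped function, not the bare one.
tools = {"refund_order": refund_order}


def handle_tool_call(call):
    """Dispatch a tool call chosen by the LLM (represented here as a dict)."""
    return tools[call["name"]](call["key"], **call["args"])


call = {
    "name": "refund_order",
    "key": "order-17",
    "args": {"order_id": "order-17", "amount": 30},
}
first = handle_tool_call(call)
second = handle_tool_call(call)  # the model asks again; no second refund happens
```

Because `functools.wraps` preserves the function's name and docstring, a framework that inspects tools by name would still see `refund_order`.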
That's interesting.
Are you going to talk about data pipelines as well?
Is there an example there?
Yeah.
So data pipelines are one huge use case for D-Boss. The typical pipeline
goes something like this.
The first step of the workflow is to scrape websites or other data sources and
store the PDFs or images in some S3 bucket.
And then the second step will, in parallel, use LLMs to analyze those images or PDFs.
And then the third step will be to persist the return values from the LLMs into multiple data
stores, either a vector database or Postgres or other data stores.
And then maybe another step is to make some decision or run some analytics based on the
results from previous steps.
A concrete use case is, for example, stock market monitoring.
So some of our use cases would be to scrape the websites of stock markets and then feed that
into an LLM to do some analysis, and then make a business decision as a final step: do I want
to invest in the stock, do I want to sell the stock?
Why durable execution?
Yeah, durable execution is essential because you don't want to lose data.
And usually there are two reasons. One, the data volume is huge,
right?
So you can easily process 1,000 documents.
And if something fails in the middle, or not just 1,000, maybe 10,000 documents, if something
fails in the middle, you don't want to restart from the beginning.
Like, you don't want to say, oh, now something happened to my pipeline, either AI failed
or I hit some AI rate limit.
I don't want to say I have to reprocess all 10,000 documents again.
I want to resume from where it left off.
So that's the first thing.
The second thing is, you want everything to complete.
Say if I process 10,000 documents, I want all of them to complete, otherwise my business
decision, for example, which stock to trade may be based on incomplete data.
That can cause financial loss and other consequences.
So in this specific example, if I did decorate the loading step with D-Boss, and the error
was on, let's say, the 100,000th document in my set of 500,000, what is the behavior?
So basically in this case, you want to use D-Boss queues to parallelize those tasks.
You don't want to process all those documents in one giant step.
You want to basically say each step is a queued task spawned by my workflow.
For documents 1,000 to 10,000, I will enqueue a task, and then it will become a fan-out
pattern.
And then we will wait until all tasks finish, so we'll fan in by waiting for those results.
So if any of the tasks failed, then when we recover the workflow, we'll just say, okay,
now I have to check all those queued tasks.
And if some tasks failed, then we'll restart execution of those tasks.
But for those tasks that have already finished, we'll just get the results from the database
and then return them.
So that's how you can correctly recover from those failures.
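The fan-out and fan-in recovery pattern described here can be sketched in plain Python. This is an illustration only, assuming a dictionary in place of the database and a thread pool in place of D-Boss queues; `analyze`, `run_pipeline`, and the `fail_on` parameter are hypothetical names used to simulate a failure.

```python
from concurrent.futures import ThreadPoolExecutor

results = {}   # doc_id -> recorded result (stand-in for the database)
attempts = {}  # doc_id -> how many times the task actually ran


def analyze(doc_id, fail_on=()):
    """One queued task; fail_on simulates an LLM error or rate limit."""
    attempts[doc_id] = attempts.get(doc_id, 0) + 1
    if doc_id in fail_on:
        raise RuntimeError(f"rate limited on {doc_id}")
    return f"summary of {doc_id}"


def run_pipeline(doc_ids, fail_on=()):
    """Fan out one task per unfinished document, then fan in the results."""
    todo = [d for d in doc_ids if d not in results]  # skip finished tasks
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {d: pool.submit(analyze, d, fail_on) for d in todo}
    failed = []
    for doc_id, future in futures.items():
        if future.exception() is None:
            results[doc_id] = future.result()  # checkpoint the finished task
        else:
            failed.append(doc_id)
    if failed:
        raise RuntimeError(f"{len(failed)} task(s) failed")
    return [results[d] for d in doc_ids]


docs = [f"doc-{i}" for i in range(10)]
try:
    run_pipeline(docs, fail_on={"doc-7"})  # first run: one task fails
except RuntimeError:
    pass
summaries = run_pipeline(docs)  # recovery: only doc-7 is processed again
```

On the second run, the nine finished documents are read back from the recorded results and only the failed task executes again.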
So you had earlier mentioned using Conductor for D-Boss in production.
What are the other recommendations for being able to run D-Boss at scale?
So the question is mostly about the best practices for using D-Boss.
So I think to run D-Boss at scale, you first really want to make sure that your workflows
are deterministic.
That's the key to guaranteeing your workflows are recoverable.
Because if you have nondeterminism in your workflow, when we recover it, we may go through
another execution path that we don't really know how to recover.
And then the second thing is use queues wisely.
So for example, when you deal with LLMs, they typically have rate limits and they also
have concurrency limits,
that is, how many outstanding requests you can have to those LLMs.
So in this case, use D-Boss queues.
So with queues, you can add a rate limiter, say I want at most five API calls within 30 seconds.
And I want at most 10 outstanding requests at a time.
So with those primitives, you can build your apps with parallel tasks but also stay within the
limits of those LLMs.
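The two limits mentioned here (at most five calls per 30 seconds, and at most 10 outstanding requests) can be modeled with a small Python class. This is a sketch of the general technique, a sliding-window rate limit combined with an in-flight counter, and not the D-Boss queue implementation; the `Throttle` class and its method names are hypothetical, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from collections import deque


class Throttle:
    """Allow at most `limit` starts per `period` seconds and at most
    `concurrency` tasks in flight at once."""

    def __init__(self, limit, period, concurrency, clock=time.monotonic):
        self.limit = limit
        self.period = period
        self.concurrency = concurrency
        self.clock = clock
        self.starts = deque()  # start times inside the current window
        self.in_flight = 0

    def try_start(self):
        """Return True and reserve a slot if a task may start now."""
        now = self.clock()
        while self.starts and now - self.starts[0] >= self.period:
            self.starts.popleft()  # forget starts that left the window
        if self.in_flight >= self.concurrency or len(self.starts) >= self.limit:
            return False
        self.starts.append(now)
        self.in_flight += 1
        return True

    def finish(self):
        """Release the slot held by a completed task."""
        self.in_flight -= 1


# Rate limit: five starts per 30 seconds, checked with a fake clock.
now = [0.0]
throttle = Throttle(limit=5, period=30, concurrency=10, clock=lambda: now[0])
burst = [throttle.try_start() for _ in range(6)]
now[0] = 29.9
still_blocked = throttle.try_start()
now[0] = 30.0
allowed_again = throttle.try_start()

# Concurrency limit: two outstanding requests at a time.
gate = Throttle(limit=100, period=30, concurrency=2, clock=lambda: now[0])
a, b, c = gate.try_start(), gate.try_start(), gate.try_start()
gate.finish()
d = gate.try_start()
```

A queue worker would call `try_start` before dequeuing a task and `finish` when the task completes, leaving rejected tasks in the queue for a later attempt.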
Is there anything else you'd like to add for folks that would like to get started with
D-Boss Transact?
Yeah, so for D-Boss Transact, download it, install it, run it locally on your laptop.
The cool thing is that because it's a library, we can also leverage the debugger.
So we actually developed a time travel debugger that you can install from the VS Code Marketplace,
where you will be able to replay traces of anything that happened in the past.
Say I want to investigate why a workflow that executed yesterday executed the way it did.
I will just pick the workflow and then tell our debugger extension that I want
to re-execute it.
And when we re-execute, instead of writing to the database, we'll just pull the information
from the database: this is the workflow, this was the input, and those were the outputs
of each step.
So, say, if your workflow contains sending email or calling
out to Stripe, we'll not actually call those external APIs. We'll instead return the recorded
information, and then you'll be able to step through your workflow as if it happened in
the past.
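The replay behavior described here, returning recorded step outputs instead of calling external services, can be sketched as follows. This is a simplified model and not the D-Boss debugger: `LiveRunner`, `ReplayRunner`, and the example workflow are hypothetical, and the recorded trace is an in-memory list rather than database rows.

```python
external_calls = []  # every real call to an outside service lands here


def charge(amount):
    external_calls.append(("charge", amount))
    return f"charged ${amount}"


def send_email(to):
    external_calls.append(("email", to))
    return "email sent"


class LiveRunner:
    """Executes each step for real and records its output."""

    def __init__(self):
        self.trace = []

    def step(self, fn, *args):
        output = fn(*args)
        self.trace.append(output)
        return output


class ReplayRunner:
    """Returns recorded outputs in order; never calls the step function."""

    def __init__(self, trace):
        self.trace = list(trace)
        self.position = 0

    def step(self, fn, *args):
        output = self.trace[self.position]
        self.position += 1
        return output


def refund_workflow(run, amount, to):
    receipt = run.step(charge, amount)
    status = run.step(send_email, to)
    return receipt, status


live = LiveRunner()
original = refund_workflow(live, 20, "a@example.com")
calls_after_live = len(external_calls)

# Replaying the same workflow code produces the same values with no side effects.
replayed = refund_workflow(ReplayRunner(live.trace), 20, "a@example.com")
```

Replay of this kind only lines up with the recorded trace if the workflow takes the same path each time, which is one reason deterministic workflows matter.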
That's very cool.
Thanks so much for today, Qian.
I think a lot of the goal of the conversation, at least what I was trying to achieve, was to
distinguish between workflow as a graphical representation aid for business users versus
grouping together the series of steps or services that need to be durable.
And how does that play into the present, where a lot of the code generation is happening
with AI as well, and where exactly does D-Boss fit into it?
You've explained that you can definitely use the simplicity of D-Boss and the performance
of D-Boss in grouping together steps that you need as a workflow.
And I presume one would be able to use AI to then introspect that code and generate the
graphical representation as required for business users.
But then the question would be, why would you not use AI to even write all the framework
that comes with D-Boss?
Could you just maybe spend a few minutes on your thinking about how AI fits into all this?
Yeah.
It's a really interesting question, and I've been thinking a lot about it recently, especially
now that we've seen a lot of AI-generated code.
It's true that someone may say, OK, AI may be able to generate that checkpointing code,
but AI won't be able to generate automatic workflow recovery.
It won't be able to generate the control plane that can automatically recover on failure.
So we think D-Boss and other workflow engines still play an essential role here in the AI
era.
Right.
And with AI-generated code, I think we really want to make sure it is reliable.
And by reliable, I mean, yes, that code may be buggy.
So you want to have a way to say, if I caught a bug, I want to be able to investigate the
bug.
I want to be able to, say, restart my workflow from a specific step and then fix the bug.
So those are the capabilities where those durable execution engines really shine, because
with that unreliable AI-generated code, you want a way to correctly store that information
for debugging.
And I think one advantage of D-Boss is that because we store which
step has executed, and its timestamp, in a relational database, we really
have this structured data for AI to analyze.
It's easy for a human to analyze, and it will also be easy for AI to analyze what's going
wrong.
So I think it's a really exciting new capability that we are exploring.
And I think simplicity and structured data logging are important.
So on simplicity, we actually tried to add a prompt for D-Boss. We give a prompt
that incorporates the latest version of our code and basically gives AI instructions
on how to add D-Boss to your code.
So with that, we can just tell AI, make this code durable, and then AI will correctly do
that in one shot.
And that's really interesting, because if we put durable execution in a library, we will
be able to make AI-generated code durable very easily.
And then when you execute that code, we automatically checkpoint data in a database, so it will
be very easy for a human or for AI to investigate what's going on.
And then because everything is in a database, it also allows users or AI to automatically
fix that code, because to fix it, you just need several SQL statements to modify
the table to change the recorded output, and
then continue from where it left off.
So we are really exploring the synergy between your code, your database, and the way to modify
your code to generate new traces, to modify the database, to observe databases.
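The idea of patching a recorded output with SQL and then continuing can be sketched with sqlite3. Everything here is an assumption made for illustration: the `step_outputs` table, its columns, and the `resume` helper are hypothetical and do not describe the actual D-Boss schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE step_outputs ("
    "workflow_id TEXT, step_no INTEGER, output TEXT, "
    "PRIMARY KEY (workflow_id, step_no))"
)
# A past run recorded two steps; step 1 stored a buggy value.
db.executemany(
    "INSERT INTO step_outputs VALUES (?, ?, ?)",
    [("wf-9", 0, "price=100"), ("wf-9", 1, "discount=-5")],
)

# The fix: one SQL statement that rewrites the recorded output of step 1.
db.execute(
    "UPDATE step_outputs SET output = ? WHERE workflow_id = ? AND step_no = ?",
    ("discount=5", "wf-9", 1),
)
db.commit()


def resume(workflow_id, steps):
    """Reuse recorded outputs; run only the steps that have no record yet."""
    outputs = []
    for step_no, fn in enumerate(steps):
        row = db.execute(
            "SELECT output FROM step_outputs WHERE workflow_id=? AND step_no=?",
            (workflow_id, step_no),
        ).fetchone()
        if row is None:
            output = fn(outputs)
            db.execute(
                "INSERT INTO step_outputs VALUES (?, ?, ?)",
                (workflow_id, step_no, output),
            )
            db.commit()
        else:
            output = row[0]
        outputs.append(output)
    return outputs


def already_recorded(previous):
    raise AssertionError("recorded steps must not run again")


def compute_total(previous):
    price = int(previous[0].split("=")[1])
    discount = int(previous[1].split("=")[1])
    return f"total={price - discount}"


outputs = resume("wf-9", [already_recorded, already_recorded, compute_total])
```

The remaining step sees the patched value, so the workflow continues from where it left off with the corrected data.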
So it's a really exciting area.
How can listeners contact you if they have any feedback or questions?
Yeah, so to contact me, you can visit my website, qianli.dev, or you can follow me on LinkedIn,
follow me on Bluesky or Twitter.
But also, don't forget to visit dbos.dev, this is our main website, and we're looking
forward to your feedback.
Thank you so much, Qian, for coming on.
Thanks, Kanchan, for inviting me.
Thanks for listening to SE Radio, an educational program brought to you by IEEE Software magazine.
For more about the podcast, including other episodes, visit our website at se-radio.net.
To provide feedback, you can comment on each episode on the website, or reach us on LinkedIn,
Facebook, Twitter, or through our Slack channel at se-radio.slack.com.
You can also email us at
[email protected].
This and all other episodes of SE Radio are licensed under Creative Commons license 2.5.
Thanks for listening.