Modern Distributed Applications with Stephan Ewen

What this episode covers

A major challenge with creating distributed applications is achieving resilience, reliability, and fault tolerance. It can take considerable engineering time to address non-functional concerns like retries, state synchronization, and distributed coordination. Event-driven models aim to simplify these issues, but often introduce new difficulties in debugging and operations. Stephan Ewen is the founder of Restate, which aims to simplify modern distributed applications.

TRANSCRIPT · AUTO-GENERATED

A major challenge with creating distributed applications is achieving resilience, reliability, and fault tolerance. It can take considerable engineering time to address non-functional concerns like retries, state synchronization, and distributed coordination. Event-driven models aim to simplify these issues, but often introduce new difficulties in debugging and operations. Stephan Ewen is the founder of Restate, which aims to simplify modern distributed applications.

He is also the co-creator of Apache Flink, which is an open-source framework for unified stream processing and batch processing. Stephan joins the show with Sean Falconer to talk about distributed applications and his work with Restate. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.

Stephan, welcome to the show.

Thanks for having me. Hi, Sean.

Yeah, absolutely. Thanks for doing this. I'm excited to get into it. I want to start off with a bit of your background. What was your journey from working on Flink to now being the CEO and founder of Restate?

Yeah, most of my professional life so far was Apache Flink.

I was part of the team that started it in 2014, and in a way I'm probably responsible for a lot of the early architecture of Apache Flink: the way the data plane, coordination, snapshots, and all of that worked. The journey actually started even earlier, when I was still at university in grad school. We were working at the intersection between databases and some of the very, very early steps of stream processing that had just come up; Storm was a new thing back then. After that we took the project we had worked on at university and turned it into an open-source project. It was a mix of a pipeline-based processing system and maybe some early steps of a streaming system. As part of the open-source journey we found our sweet spot with users in stream processing, turned it into a stream processor, and then kept riding that wave with Apache Flink: the advent of real-time stream processing, stateful stream processing, unified batch and stream processing, and so on.

I left that space around 2021 to focus on something new. The set of problems that caught my eye back then was kind of similar to the problems we were trying to address with Flink. Flink is an analytical system for robust real-time analytical pipelines, and we were more and more being asked how to build more transactional event-driven applications: not the applications that aggregate and join events and feed dashboards, recommenders, and so on, but the type of pipelines that in the end actually process payments and invoicing, orchestrate orders and shipments, and so on. These are the types of applications that folks were stitching together manually with databases and queues and lots of custom logic. It felt like folks were looking for a solution, and they were turning to systems like Flink to implement that. It's not a great match; as a general rule of thumb, you don't use an analytical system for transactional processing. But this question came up more and more, and we thought, okay, we should probably start looking into that space and build something. That is when we started working on Restate.

We've been doing this for about three years now, and here we are.

So how do you describe Restate? Do you term it, or bucket it, into this class of durable execution frameworks?

Yeah, it definitely has durable execution as one of the main ingredients on its list.

So you're right. There's this big bucket of durable execution engines. It almost seems like there's a Cambrian explosion of those right now; there's a new one every few months. And Restate is definitely a good candidate if you're looking for a durable execution engine. It's a little more than that, though.

It's really, I would say, a more holistic platform for building distributed, resilient applications. It doesn't just include durable execution, as in being able to journal the different steps in your process and reliably recover them, which is this notion of workflow-style logic implemented in general-purpose code rather than in a DSL. Restate goes quite a bit beyond that. It tackles the more holistic problem: what if we apply this idea of durable execution not just to a single workflow, but also incorporate concepts like distributed communication and state that outlives an individual workflow or an individual durable execution? How would all those things interact? How do you build a more general platform that applies that level of durability and resilience to distributed services in general and not just to an individual workflow?

Going back to the rise of durable execution frameworks and the idea that there seems to be a new one every six months or so: why do you think that is the case? This is something people are investing time into, and it seems like there's growing interest in it.

Yeah, I think this is because the state of the art is just not feasible. More and more developers and companies are reaching the conclusion that the challenges you face today in implementing this are not something many software development teams can handle. And the ones that can handle it are not really using their time well, because they're spending most of their time on problems that have nothing to do with the business logic: figuring out race conditions, how to avoid split brains, how to avoid lost updates if a zombie process appears, and all those things. They should be focusing on adding features to the application, not on creating workarounds for distributed systems problems. I think this has gotten particularly bad with the rise of microservices, and I think it's a big part of why there's a bit of a backlash against microservices right now. For all the benefits they give you, I think many people realize how challenging distributed infrastructures with lots of microservices are. Some are just saying, okay, let's go back to the monolith, that was just a bad idea in the first place. And then there's a whole group of people who say, no, we actually like a lot of the benefits that microservices give us; we just want a stronger foundation to build them on. We want something that frees us from dealing with many of these problems. I think this is where the whole wave of durable execution systems got started.

And I'd say it's actually getting more and more necessary to have those systems, because applications are increasingly distributed. It's not only the services you build yourself; more and more of the functionality you access is hidden behind APIs provided by SaaS vendors and so on. These are all services you interact with, and they become part of your microservice architecture even if you don't own them. They add to the complexity of the problem, and that's a trend that's only increasing. I think that's not going back.

Yeah, even if I said, I hate microservices, down with microservices, I'll go back to the monolith: even if I build that monolith, deploy it, and am able to manage it, I doubt that application exists in isolation. It's going to have interdependencies on third-party services, which are then going to reintroduce essentially all these distributed systems problems. So if I'm connecting up microservices, or even calling a third-party API, there are all kinds of things in a distributed system that can go wrong; there can be outages. Without using some sort of framework to help solve those problems, what are teams typically doing? Are they just making those requests and, at best, doing some sort of retry scheme with exponential backoff to see if they can push it through, and they're okay with that sometimes not happening? What are companies doing to try to solve these problems now?
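
The retry scheme with exponential backoff that the question refers to can be sketched roughly as follows. This is a minimal illustration, not taken from the episode: `call_with_backoff` and `make_flaky_operation` are hypothetical names, and the flaky operation stands in for a remote call. Note that the loop lives only in the calling process's memory, so a crash of that process loses the pending retry.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call `operation` until it succeeds or the attempts run out.

    The retry state exists only in this process: if the process crashes
    between attempts, nothing remembers that a retry was pending.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            # Exponential backoff with jitter, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


def make_flaky_operation(failures_before_success):
    """Hypothetical stand-in for a remote call that fails a few times first."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionError("simulated outage")
        return f"ok after {state['calls']} calls"

    return operation


if __name__ == "__main__":
    # Sleeping is stubbed out so the demo finishes immediately.
    print(call_with_backoff(make_flaky_operation(2), sleep=lambda _: None))
```

The `sleep` parameter is injectable only so the sketch can be exercised without waiting; a real caller would leave the default.
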

Yeah, I think there are a lot of different approaches to that problem. The first observation I would throw in is that a lot of companies don't really get it right. Just the fact that there are still so many websites that tell you, don't hit reload while you're in the middle of a booking or order process, is one indication that they can't really handle things like concurrent requests well. You still see a lot of artifacts where, if you're a developer, you can tell that something is going wrong in the backend. I would say, first of all, a lot of the time it's not actually getting solved correctly. I actually heard a quote on another podcast from somebody who works at a food delivery startup who said that for many, many years the solution was just like you can order a nant about you until it's really expensive, you can know the problems and send out.

I would say, if you want to actually solve the problem, one of the ways to do it is typically to stitch together different systems. The typical ingredients are a queue and a database, with your own retry loops with backoff. But it's not as simple as implementing a retry with backoff, right? You always have to worry about, okay, is the retry actually happening? If you're triggering this as an RPC call, you might be retrying it, but the whole process might actually go away. So then people put a queue in front, to say that even if the process goes away, the event gets redelivered somewhere else. Now you have to worry about the fact that you might have two processes that work on the same event twice. Are they overwriting each other with their retries? So you want to throw in a lock, you want to introduce versioning and conditional updates. And then you might be interacting with APIs: you might call them, get a result, crash afterwards, call them again, and get a different result the second time. So the next retry will actually follow a different control flow than the first one, and things go completely haywire.

It doesn't stop with a retry. I would say very often you start with a queue and a retry, and then you incrementally add bits to guard against this bug or that bug you discovered, and it incrementally grows really complex. And it makes hidden assumptions that this is exactly how that queue behaves and that's exactly how that API behaves; then somebody changes that, everything breaks again, and you're back to fixing it. I don't think there are really good solutions. If you go to the extreme end and say, okay, here's an extremely sensitive, high-value process, then sometimes folks throw in workflow engines as one solution. Let's say here's an order process that we really don't want to go wrong, because that can actually cost a lot of money; you might pull in a heavyweight workflow orchestrator. But it's really not something you typically pull in for small microservice logic, first of all because it's really a complicated component to have, and second of all because it's really a foreign citizen; it doesn't interact well with a lot of the other logic.

Yeah, and it's probably a little heavyweight for the majority of the types of calls you might be making between microservices or even to external services.

Yeah, and this is actually one of the interesting things that durable execution brings in, specifically in an implementation like the one we're working on at Restate. If you can make durability cheap, cheap as in low latency, so that having a durable step introduces only a very moderate latency overhead, then you can start assuming these workflow guarantees for a lot of code in your application. It's no longer prohibitive to do that, from two sides. It's no longer so slow and expensive that you say, oh, I really don't want this here, this is in the synchronous path of the user interaction, it's going to make everything really sluggish; it's going to still feel fast. And the second thing is you're still running code. It still fits in with all your tools, your deployment pipelines, all your versioning, all your schema registries; you can keep using all that. So it feels like you can keep doing mostly what you were doing; you're just adding this fine-grained reliability to your functions and getting a lot of the problems out of the door. That's actually the ultimate goal of our system.

Okay, going back to 2021 when you started working on this: where did that project start? How did you even begin to try to tackle this problem?

Yeah, so we initially started trying to solve this from the Apache Flink side. As I mentioned, we were working with a bunch of users on analytical pipelines, and then this question came up of users building transactional event-driven pipelines on Flink. We really didn't find that a good match, and it had very moderate success. There were a few who could make it work with specific approaches, but it wasn't really a good experience. Then we started a subproject in Flink, it's kind of still around, called Stateful Functions. The idea was to take the things that folks really like, which is reliable communication and transactional state, and encapsulate them into an individual piece, an individual function. Think of it like a Lambda function, but when it's invoked it has contextual state. It's invoked in the context of a key, so it's attached to a key, and when it's invoked it's hydrated with the state of that key and can interact with and modify it. It can produce a set of RPCs or messages that go out to other functions, and this is all transactionally committed: the messages sent to other functions and the state get committed together. It's almost like a stateful, disaggregated, serverless actor system; I think of it like that.

In principle that raised a lot of interest. A lot of folks liked it as an abstraction and could see that this is great for building anything from digital twins to transactional state machines that represent orders, even invoicing, payments, and so on. Some actually built payment processors on that; we only learned that years later, which was quite crazy. But there was one limitation to it, and that is that it was built on Flink as an analytical system. Flink is really throughput-optimized. It's low latency for an analytical system, but that is mostly if you use at-least-once semantics, meaning you push events right when they come. If you want transactional results, you're introducing a huge latency, namely: Flink is checkpoint-based, and you have to wait until the next checkpoint happens. That's the latency introduced if you don't want to take a second step before the first step is really durable, and that is in the seconds, right? Imagine using this as a foundation for workflow-style logic; it means every workflow step has, let's say, 10 seconds of latency. It's completely impossible to do that, and it's also something that the Flink architecture could never fully remedy. So we thought, if we really want to make that vision happen, of durable execution being so low latency that you can use it without worrying about latency overhead, even in latency-critical paths like synchronous interaction paths, then we'd really have to build a new stack. We'd have to start from the bottom, building a low-latency architecture that emphasizes fast durability and not analytical throughput. And that's how we then started with Restate.

So what are the core building blocks of Restate, both from a user perspective, as in what am I stitching together in the developer experience, and what is the architecture behind the scenes that's helping support that in a way that's going to shield me against the kinds of outages or other issues you might run into in distributed systems?

So yeah, there are different levels at which to look at it. Let's look at it first from the infrastructure side: where does Restate actually sit in your infrastructure? You can think of it as taking a similar place as a message queue or message broker, or a workflow orchestrator. It's kind of a marriage of, let's say, the Kafka-esque event-driven application world and the Temporal-esque durable execution workflow world. So it sits where a broker would sit. You're writing your logic as service handlers. We tried to keep microservices as the abstraction, so you're writing services almost as if it were a Spring Boot application or an Express.js application: handlers grouped into services. And then Restate is the queue through which those services get triggered. If you want to trigger a handler, you put an event in that queue that's supposed to invoke the handler, and then Restate invokes that service. In that sense it's a classical queue in front of the service. But the abstraction that we expose is not really that of an event; it's more that Restate, when it looks at the services and their handlers, re-exports them and becomes a reverse proxy. So we're really trying to get people away from thinking in terms of queues and events, and to keep them thinking in terms of synchronous and async calls and so forth.

So as the person doing the implementation, I just focus on the work that I need to do, and that's abstracted away by Restate.

I guess that's one way to think about it. I think if we go into the details of what the programming model is, maybe we'll see this. But in general you can think of it as a level up from Kafka-style event-driven applications: you're not thinking in terms of queues and events, you're thinking in terms of durable, stateful, resilient invocations of functions. That sounds like maybe an academic detail, but it actually is a world of difference, because it means that Restate takes on a lot more responsibility. It doesn't just take on the responsibility of saying, okay, I'll deliver the event and make sure the function is triggered, and retriggered on a failure. It also understands: how do I fence retries against earlier executions? How do I lock and attach contextual state? How do I track progress if I have multiple steps that happen as part of a function invocation, where I want to record the result of the previous step before I start the next one, just to give ourselves an easier life when it comes to implementing complex control flow that would be thrown off by getting different results during different retries? All those things, if you implement them manually, you're typically not just looking at a queue like Kafka; you'd be looking at combining a queue with a locking service, with a database, with a scheduler, and so on. Restate wraps that all together and says: we're going from queue and event to durable, stateful, resilient function executions.

And then, as I mentioned before, the core programming model is services that are meant to mimic RPC-style service frameworks. The simplest building block you have is really service handlers that get durable execution, and then Restate layers a few things on top of that. One concept is virtual objects, which are stateful handlers that remember state across individual invocations and are sharded around keys. And then there are more high-level workflow constructs, where you can add signal handlers and query handlers and so on. But all of that is built on top of the general service abstraction.

So if I'm implementing one of these handlers and it gets executed, what is the life of that process behind the scenes?

Let's assume you're implementing a payment processing handler or something like this, and that gets invoked. Let's say the logic you have in there has a first step to check the status: was that already processed, was it maybe canceled, was it blocked before? So let's say the payment is identified by an ID. I might want to call the fraud detector, write to a database, send out a message, and so on. The life cycle of executing this would be the following. Some external trigger, an external client, says, I want to execute that function. That enters Restate, the Restate server, the broker component, as an event, and Restate will understand where that service lives. You can think of it this way: the service has to be registered with Restate, the service endpoint. Where is that deployed? Is that an AWS Lambda, is that an HTTP/2 server endpoint, is it a Kubernetes deployment, and so on? You have to register that at the server, and then the server connects and pushes the invocation. If you've worked with, for example, Amazon EventBridge or things like that, it's a very similar model.

So Restate will then look up, okay, that handler lives on that endpoint, and I'm connecting to that. Let's assume it's an endpoint on Kubernetes or so. In this case it would open a streaming HTTP/2 connection, push the invocation, and then hold on to the connection. That's the lifeline to that single invocation or execution attempt, which allows the service to stream back things like journal progress, state updates, and outgoing messages. The function, let's say our payment handler, would also get something when it's invoked: if it's a stateful handler, a virtual object, Restate would attach all the contextual state it knows for that individual handler to the invocation, so the handler could directly look up things like, okay, what's the previous status that was committed? Okay, it's still new, so let's start and execute that payment. And then let's say we call an external fraud detector API, we get the result, and we say, okay, this is actually a durable step. Then the handler would put the result of that step into the stream that goes back to Restate. Restate internally has a consensus log that persists all the things it receives, and it has a bunch of logic around this to understand: does that information still come from a valid execution attempt, or does it come from an attempt that has been fenced off in the past? It has a fairly elaborate consensus log that supports a conditional append of that operation to the journal, and it links that entry, the result of calling the fraud detector API, to the original event.

And if we say there's a failure after that point, and that failure could be that the connection is ruptured, the process goes away, or there's a timeout, then the Restate server would understand, okay, that event hasn't been completed, I didn't get an acknowledgement back for that execution yet. It would send the event to another process, or retry sending it to that endpoint, and it would attach everything it has to that event. That's the contextual state, like last time, but now it would also attach the journal entries that it already collected, like, here's the result from the previous step. It wraps that all up and sends it there. That lets the service basically say, as I'm going through the code again, I can skip over steps that have already completed. This is what the SDK library basically does for you. It understands, okay, that's already found in the journal, we can ignore it; this is a new step, we actually add an action or an event for that to the journal; and then it goes on. That applies to pretty much any operation: recording the result of an API call, updating state, sending out a message. All these things basically become events that are streamed to the Restate server, and the Restate server understands how to process them. They all get attached to the original invocation, but sometimes they also represent an outgoing event that is then routed to another service, or a state update, which is applied to an internal state index, and so on. So it's generally an extensible event-driven architecture on the server side that synchronizes over a streaming protocol with the service.

So I install this SDK, I set up this client, and I'm wrapping my call essentially around some of the SDK semantics or whatever. That's going to call the Restate server, and Restate is going to do its magic to make sure that call can be facilitated in a way that's reliable, durable, and so on. How do you make sure that the call from the client to the server is done in a way that is reliable?

So if you just look at the initial event, the initial call that triggers a durable handler, you have a bunch of ways to do this. You can do it through HTTP, through a client library, or you can actually just connect Kafka, and then it pulls the events from Kafka that represent these invocations.
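
The journal-and-replay cycle described in the previous answer (record each step's result, re-send the journal on retry, skip steps already recorded) can be sketched in memory as follows. This is a toy illustration under stated assumptions: `Context`, `run`, `invoke`, and `make_payment_handler` are hypothetical names chosen for this sketch and are not the Restate SDK's API, and the journal here is a plain list rather than a consensus log.

```python
class Crash(Exception):
    """Simulates the handler's process dying partway through an attempt."""


class Context:
    """Hypothetical SDK-like context; an illustration, not Restate's real API."""

    def __init__(self, journal):
        self.journal = journal  # step results recorded by earlier attempts
        self.position = 0       # index of the next step in this attempt

    def run(self, name, action):
        if self.position < len(self.journal):
            # Already journaled by an earlier attempt: skip the action, reuse the result.
            recorded_name, result = self.journal[self.position]
            if recorded_name != name:
                raise RuntimeError("replay diverged from the journal")
        else:
            result = action()
            self.journal.append((name, result))  # record before the next step starts
        self.position += 1
        return result


def invoke(handler, payload, max_attempts=5):
    """Plays the server's role: owns the journal and re-sends it on each retry."""
    journal = []
    for _ in range(max_attempts):
        try:
            return handler(Context(journal), payload)
        except Crash:
            continue  # retry the invocation with the journal collected so far
    raise RuntimeError("invocation did not complete")


def make_payment_handler(crashes):
    """Builds a handler that crashes `crashes` times after the fraud check."""
    calls = []  # external calls that really happened, in order
    remaining = {"crashes": crashes}

    def handler(ctx, payment):
        def check_fraud():
            calls.append("fraud-check")
            return "approved"

        def charge():
            calls.append("charge")
            return f"charged {payment['amount']}"

        verdict = ctx.run("fraud-check", check_fraud)
        if remaining["crashes"] > 0:
            remaining["crashes"] -= 1
            raise Crash()
        return verdict, ctx.run("charge", charge)

    return handler, calls


if __name__ == "__main__":
    handler, calls = make_payment_handler(crashes=1)
    print(invoke(handler, {"amount": 42}))
    print(calls)  # each external call appears once despite the retry
```

In this sketch the fraud check runs once even though the handler body executes several times, which is the property the answer describes: a retry follows the same control flow because earlier results come from the journal.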

There are a few ingredients in there that make this reliable. Number one, the Restate server will not acknowledge anything before it has persisted it in its internal consensus log. Even the original event has to go through the consensus log first, before even an asynchronous submit is acknowledged. So we already have that durable. The second thing is that you can attach idempotency keys to the invocation, and the event processor inside the Restate server can use them to deduplicate invocation events. All the goodness of deduplicating steps inside the durable handler doesn't really help you much if you can't deduplicate the invocations, so the idempotency key support is there to do that. And if you integrate this with Kafka, it does that automatically: it maps Kafka offsets to our idempotency mechanisms, which basically gives you an end-to-end exactly-once situation.

So if I want to start using something like Restate and I have an existing project, do I have to think about re-architecting everything to start with, or can I do it bit by bit, based on where my most critical workflows are, like a payment system, for example?
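
Deduplicating invocations by idempotency key, as described in the answer above, can be sketched like this. It is a minimal in-memory illustration: `InvocationLog` and `kafka_idempotency_key` are hypothetical names, and deriving the key from topic, partition, and offset is an assumption about how an offset-to-key mapping could look, not a description of Restate's internals.

```python
class InvocationLog:
    """Toy event processor that deduplicates invocations by idempotency key."""

    def __init__(self):
        self.results = {}    # idempotency key -> result of the first invocation
        self.executions = 0  # how many times a handler actually ran

    def submit(self, idempotency_key, handler, payload):
        if idempotency_key in self.results:
            # Duplicate submission: return the stored result, do not run again.
            return self.results[idempotency_key]
        self.executions += 1
        result = handler(payload)
        self.results[idempotency_key] = result
        return result


def kafka_idempotency_key(topic, partition, offset):
    """Assumed key derivation: one key per record position in a Kafka topic."""
    return f"{topic}/{partition}/{offset}"


def charge(payment):
    """Hypothetical handler standing in for a payment step."""
    return f"charged {payment['amount']}"


if __name__ == "__main__":
    log = InvocationLog()
    key = kafka_idempotency_key("payments", 0, 17)
    print(log.submit(key, charge, {"amount": 42}))
    print(log.submit(key, charge, {"amount": 42}))  # redelivered record
    print(log.executions)
```

A record redelivered from the same position produces the same key, so the handler runs once and the second submission returns the stored result.
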
Yeah so we've really built it to avoid having to re-architect everything and that kind of shows in the sort of in many of the core abstractions like we do have to adjust the code to use the SDK to have access to some of the like durable execution mechanisms like execute this code block as a durable step or you know access the access the built-in transactional state or let restate deliver that message to another service like you have to use the SDK the library to do that so there's some adjustment in the code but the way you're deploying this the way you're generally packaging this is meant to be very much in line with what you're doing anyway it's hence the sort of you know idea to abstract it like to give it the shape of microservice service handler is the way you deploy it from the outside you can very often just just say okay this was a non-restate service you know I'm importing the reset SDK I'm starting to use this I'm starting to use using these restate actions inside my code connecting this to the reset server which becomes the reverse proxy and now these services that initially used to call the service directly they call the restate server which becomes the reverse proxy for the service it's really meant to sort of allow you to plug it in incrementally just look at it one service at a time there are a few things that really become very powerful only once you start attaching a few more services like between services that are attached to the same reset server you get kind of and exactly once RPC messaging which is pretty nice but even in the absence of that you're still getting a lot of goodies so yeah it's totally meant for incremental adoption for some like adopting this or for teams that are adopting this approach like does it take some work for them in terms of their thought process and the way they traditionally develop to kind of come around to this mode of like operating calling services yeah I think it does a bit and I would say mostly it almost 
requires unlearning a few things that they have learned in the past so if you're coming from a traditional workflow system we often have folks asking okay I'm writing this but like where is you know how do I make something a persistent activity now or yeah just looking for the concept of like a work on activity and then the interesting thing is like in reset every durable step is like an activity or if you want to separate it out then make it a separate service that you call and you don't really need work with a special construct anymore it sort of gets similar guarantees just from your regular service abstraction even the same observability and telemetry you just get out of that so you have to kind of maybe take back a step from looking for exactly the concepts you might know and just understand that a lot of the reasons why you were using those concepts the guarantees you were really looking for they're sort of like everywhere now in almost all the code you write with restates you don't have to go to like these special constructs anymore the second thing is understanding that many of the operations you do are now durable across failures and crashes for example if i'm doing something like an RPC call sequential RPC call request response to another system and the caller actually fails and gets recovered into a different node that's not something that you usually assume still works not because you know like the network call might be lost or like even if something gets sent back to the code actually issue the call is not waiting for the response was recovered in a completely different process but that it still works in case of reset because all the all the building blocks are actually durable persistent just with the building blocks the RPC is basically connected to a persistent future that gets recovered in a different process i'm going to complete it there so the entire code that made the call gets recovered restored to the point where it made the call and then 
completed with the result of that calling just like works even if it moves around so this is something that a lot of people don't expect to work and that's why they're trying to code ways around that and then they come into the discord and say like okay i'm not really connecting the dots here and you basically tell them not just like delete that that just works that's an interesting experience are there certain kinds of projects that this makes more sense for than others like what stage it makes sense to go with and approach like this versus you know something alternatively i think there's a few cases where it does not make sense and then there are a few cases where probably lots of durable execution systems could make sense and then there's some cases where i would say that's a really good use reset use case in particular i mean in general durable execution makes sense for workloads that's for orchestrate many steps that updates stuff like if you have mostly you know read heavy workloads or read only workloads just like doesn't really make sense to plug in a system like this you know then there that there are workflows that with something like durable execution is a nice convenient piece because it helps you you know and keeps it restart retries you don't have to do them yourself it helps you to implement asynchronously primitives down a bit easier and you know but if anything goes wrong and some state gets lost and everything gets retried and recomputed there's really no big deal like a lot of them let's let's say let's say a rack pipeline or so which we log into generation like you know if you lose something you recomputed versus cases you call your LMM a few more times than that so like half a century of bill maybe that's not a big deal and then there's cases that without actually really matters we will absolutely care about transactional correctness where you say like no matter what kind of funky failure happen i can never go back before previous step or cases 
where you do explicitly need transactional state, or where you build workflows that you can rely on and that other services can integrate with. This is a very good Restate use case, because we've kind of architected it with that level of resilience in mind. Restate is implementing really its own stack. It doesn't build on a database; it implements its own consensus log and a processor on top of that. It's a complete, self-contained single binary. You just deploy it and run it, and it has internally an extremely well-thought-through consensus architecture that allows you to make very strong assumptions on your semantics. I think payment processing is a good example. If you want that, there's a good use case for Restate. I would say specifically also when you want something that works in the cloud but also has, I guess, a credible story for self-hosting. With the single-binary architecture it is actually feasible to self-host. It's not just like, theoretically, it's open source and you can host it, but it's actually fun to operate.

You mentioned a RAG pipeline there maybe not being the ideal use case, because if it fails, you know, you can run it again or something like that.

It's a use case, it's a good use case even. It's very convenient to do that on top of Restate. It's just not a use case where you would rely on strong transactional correctness, I guess.

But what about a user-facing application that leverages, you know, a foundation model of some sort, especially if I'm doing something where I'm making multiple inference calls in sort of an agentic workflow? I would think it would make a ton of sense there, because you could be calling tools, you know, various data systems, multiple models, and so forth. Is that a use case that you're seeing?

Yeah, I think that's actually a very interesting one. As soon as you come more into the AI agent space, I think it becomes a lot more interesting for a couple of reasons.
Like, number one, I think agents are a good match with durable execution in general, because they are a bit like dynamic workflows: workflows where the control flow is not known up front, it's kind of determined by the responses of the LLM. And durable execution has this flexibility that you don't need to define the sequence of steps and the control flow up front. You can create dynamic control flow, just record it, and replay it after a failure. So I think durable execution in general matches agents very well. And then the second thing is, agents are usually contextually stateful, so they map really well to the virtual objects concept that we have in Restate, where you have this exclusively scoped state that you have access to, that you can use to remember basically not just previous steps but also previous context. But it's still not something that is just hidden in the workflow. It's still an open state that you can probe from other services, that you can even interact with and put additional context into from other services, if that comes up. So the whole abstraction just matches very nicely.

What are some of the unexpected use cases that you've seen of people applying Restate?

Yeah, so there's some very expected use cases, like, you know, classical workflows, sagas, distributed state machines. The unexpected ones: it seems there are lots of folks that have fairly complicated distributed queuing setups, where they're starting with something like Kafka and then they're also pulling in RabbitMQ, and they have, I don't know, some routers and some actors in between. I think often this is kind of a workaround to build something where you have maybe a common log and then you fan this out into more fine-grained entities that you interact with. And we have a bunch of users that basically could replace a whole zoo of distributed queue orchestration with just a single Restate service. That's something we hadn't quite expected
to happen so often. The second one that I've found fascinating is that we've seen folks do in fact build a lot of custom workflow engines and custom rule engines. Apparently that's a thing many companies build for internal processes, internal tools, and so on. So that's a quite common use case that we've seen. My favorite one is actually folks building a custom workflow and rule engine that they ship into factories to evaluate sensor data and trigger actions that control machines. That was not on my list as one of the early use cases, so that was quite fun to see.

What would you say are the biggest challenges that you faced when designing and implementing Restate?

I mean, there are technical challenges, right? The mission is extremely ambitious: to say we're building a full stack that starts at the bottom with a consensus log that has low latency, but that you can also deploy in extremely complicated setups, across availability zones, across regions. It tries to make good use of modern cloud architecture like object stores and, at the same time, you know, bridge the gap on latency. That's a technical challenge that we've worked on for quite some time, actually over two years now, making this happen. I would say beyond that, the biggest challenge really is, let's say, education of the space. Durable execution is becoming more and more known, but it still is not necessarily a mainstream concept. A lot of folks still associate it primarily with workflows. If you're doing durable execution for workflows, maybe you get more and more folks that nod, like, okay, yeah, I know that. But if you're trying to say, okay, now we're actually talking about durable execution in a more general way, it also includes state and communication, think of it as a microservice paradigm, then it's like, okay, I need to think about that a bit. I think this education is something that is a big
challenge, but I would also look at it positively. It's also something that is making progress. Most folks, after they've gone through the initial "okay, I hadn't expected that, let me think through it a bit", once they actually crack it, they usually get quite excited about it. So that does help spread the word, so that's good.

Yeah, I mean, I think that's par for the course with any sort of new category creation. If this is not the way that people are used to doing things, then it's hard for people to even know that they have a problem they could solve by doing something different, so they're not necessarily actively searching for it until you cross the barrier of this educational awareness, essentially.

Definitely. I'm not sure I would go so far as to say this is a brand new category that we're creating. Durable execution as a category existed before we started. I think we're bringing a bit of a new twist into it, definitely. You know, treating it as more than a workflow paradigm is probably something new, and then adding these low-latency capabilities that actually allow you to use it in places where you might previously not have thought of it being applicable is maybe something new as well that people need to wrap their heads around. But yeah, we're also working with other folks that have worked on creating this durable execution category, and we're basically leveraging their work, for sure.

What do you think overall the impact will be on how we design distributed systems in the future if more and more people adopt this approach of durable execution?

I would venture a guess and say these types of solutions are going to be very, very widely adopted in a couple of years. I think they're going to replace a lot of workflow, queuing, and other sorts of distributed orchestration systems that are out there, just because they're a nicer, more approachable way of solving these problems. And yeah, they just interact better with the rest
of your application stack, and they can actually support use cases that you might not have been thinking of before. And vice versa, not using these systems, as we said before, is just getting harder and harder. This is one of the drivers. I would actually throw in a second element for why I think this is going to be extremely widely adopted in the future, and that is, if you look at the whole AI trend and AI code generation, you can actually see that these systems are getting increasingly good at doing things like even complicated business logic, where, you know, assuming you have all the domain context, you really need a bunch of steps, a bunch of non-trivial steps, to happen. But those systems are not the ones that solve distributed race conditions for you, or understand, okay, here is a case where, you know, that process stalls just here, and then a retry happens and that spawns off a copy here, and then they're going to interfere in a weird manner. I don't see that happening. Even if you think they can conceptually do that, that's probably a waste of compute power. I think a foundation like durable execution is just an incredibly good target or foundation for AI-generated code, because it has solid semantics for a lot of the problems that you really don't want anything unexplainable or semi-unpredictable to be reasoning about. And then you put the much simpler generated business logic on top of that. It's a nice package.

Yeah, that'd be great. So what's next for, you know, Restate?

So at the moment we're working very hard on releasing the next version, which is our first distributed release. I guess by the time this comes out, it's probably going to be released already, so we're talking like two to three weeks from now. At the moment, if you use Restate, you can think of it as deploying like a single-node database, like a Postgres: you give it a persistent volume and you're good. The next version gives you the
complete distributed deployment, with distributed replication, scale-out, and everything. So that's actually a big thing that we've seen a lot of excitement building up for, and we're pretty anxious to get it out there. That's the biggest immediate step. And then after that, we're at the moment at the phase where we're really just excited to be working with as many users as possible, to learn from them what they're using it for, what they see as good use cases, how they think about the problem, how they would explain it to others, how they would explain this category, how they would explain the abstraction and the mental model you have to have, and really, you know, share this with the world and work with whoever's excited to work with us.

Awesome. Well, Stefan, thanks so much for being here.

Thank you for having me.

Cheers.
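
The record-and-replay idea from the conversation (steps are recorded as they complete, and after a failure the recorded results are replayed instead of re-executed) can be sketched in a few lines of Python. This is a toy illustration only, not Restate's implementation or SDK: `Journal`, `Context`, `run_durably`, and the example handler are hypothetical names invented here, and the journal lives in memory rather than in a replicated consensus log.

```python
from typing import Any, Callable, List


class Journal:
    """Toy journal: remembers the result of every completed step."""

    def __init__(self) -> None:
        self.entries: List[Any] = []


class Context:
    """Runs steps, replaying recorded results when they already exist."""

    def __init__(self, journal: Journal) -> None:
        self.journal = journal
        self.position = 0  # index of the next step within this attempt

    def run(self, action: Callable[[], Any]) -> Any:
        if self.position < len(self.journal.entries):
            # Step finished in an earlier attempt: reuse the recorded result.
            result = self.journal.entries[self.position]
        else:
            # New step: execute it and record the result before moving on.
            result = action()
            self.journal.entries.append(result)
        self.position += 1
        return result


def run_durably(handler: Callable[[Context], Any], journal: Journal,
                max_attempts: int = 5) -> Any:
    """Re-invoke the handler after a simulated crash until it completes."""
    for _ in range(max_attempts):
        try:
            return handler(Context(journal))
        except RuntimeError:
            continue  # retry; completed steps are replayed from the journal
    raise RuntimeError("handler did not complete")


# Hypothetical example: the next step depends on an earlier step's result,
# the way an agent's control flow depends on a model response.
calls = {"decide": 0, "charge": 0}
crashed_once = {"done": False}


def decide() -> str:
    calls["decide"] += 1
    return "charge"


def charge() -> str:
    calls["charge"] += 1
    return "receipt-1"


def handler(ctx: Context) -> str:
    choice = ctx.run(decide)
    if choice == "charge":
        receipt = ctx.run(charge)
        if not crashed_once["done"]:
            crashed_once["done"] = True
            raise RuntimeError("simulated crash after charging")
        return receipt
    return "nothing to do"


journal = Journal()
result = run_durably(handler, journal)
print(result, calls)
```

In this sketch the handler crashes once after the charge step, yet each step executes exactly once: the second attempt replays both results from the journal. A real system would also have to persist the journal durably and require that the handler code between steps is deterministic, which the toy version simply assumes.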


Frequently Asked Questions

How long is this episode of Podcast Archives - Software Engineering Daily?

This episode is 40 minutes long.

When was this Podcast Archives - Software Engineering Daily episode published?

This episode was published on June 5, 2025.

