EPISODE · Aug 7, 2015 · 1H 10M

Prometheus and service monitoring

from Changelog Master Feed

Julius Volz from SoundCloud joined the show to talk about Prometheus, an open-source service monitoring system written in Go.

TRANSCRIPT · AUTO-GENERATED

Welcome back, everyone. This is The Changelog, and I'm your host, Adam Stacoviak. This is episode 168, and we're joined today by Julius Volz from SoundCloud to talk about Prometheus, an open source service monitoring system written in Go. Super awesome conversation today.

We talk about the data model, the query language, and all the in-betweens. We have three awesome sponsors for the show, Codeship, Toptal, and DigitalOcean. Our first sponsor is Codeship. They're your hosted continuous delivery service focusing on speed, security, and customizability.

And they launched a brand new feature called Organizations. Now you can create teams, set permissions for specific team members, and improve collaboration in your continuous delivery workflows. Maintain centralized control over your organization's projects and teams with Codeship's new Organizations plans. You can save 20% off any premium plan you choose for three months by using this code, thechangelogpodcast.

Again, that code is thechangelogpodcast, and you'll save 20% off any premium plan you choose for three months. Head to codeship.com slash thechangelog to get started. And now on to the show. All right, everybody, we're back.

We've got a great show lined up today, one we've actually been waiting on for a bit. It was recommended by Peter Bourgon. We just talked about him and Go kit and GopherCon, all that stuff, but Peter was recommending this.

Jerod, our last guest was saying that this was their, you know, Prometheus was their tech to play with. So we had to get Julius Volz on the line here. So Julius, welcome to the show. Hi, pleasure to be here.

And also we've got Jerod hanging out in the wings there. Say what's up, Jerod. What's up, Jerod. So Jerod, we were at GopherCon not long ago.

So we met Julius and also Bjorn, who couldn't make this call, but we were excited to finally get a chance to get Prometheus and this conversation about metric tracking and stuff like that on the show. So what's the best way to open this one up? You want to talk about Julius a bit? You want to go right into the tech?

Well, first, let me say that, you know, we kind of did the hallway track at GopherCon. We were out interviewing people and talking with everybody. And there were two things people were excited about. One was Ben Johnson, who we'll have coming up here pretty soon, and the stuff that he's been up to.

And the other one that everybody was excited about was Prometheus. In fact, I think Julius, you guys even got a shout out during one of the keynotes. Is that correct? Yeah, we got a bunch of shout outs, I think, from Peter's talk, from Tomas' talk, the keynote.

So yeah, really, really exciting. Very cool. So we're excited to hear about it. We want to know all the details.

But I think, Adam, maybe if we start with the history and kind of see, you know, why Prometheus even exists. Do you want to start there? Let's do that. So Julius, you've been with SoundCloud for a bit, and before that with Google.

What was going on to make Prometheus a thing for you? Yeah. So when I was at Google, I was actually doing something completely different. I was working on Google's production offline storage system.

So basically we had many tens of data centers with huge tape libraries backing up all production data that Google had. So basically an exabyte-scale backup system, globally. So monitoring wasn't really my specialty there, but I definitely came in contact with it as a site reliability engineer on that service. And when I left Google and joined SoundCloud back in 2012, it went as it often goes when Googlers left Google at around that time, especially.

They felt a bit naked in terms of what the open source world has provided them in terms of infrastructure. Because at Google you have like an awesome cluster scheduler. You've got awesome monitoring systems, awesome storage systems and so on. Suddenly you get like thrown out into the wild and you miss all of that stuff and you feel this, this urgent need to be building a lot of that yourself again.

But when I joined SoundCloud, a month prior to that, another ex-Googler was also joining SoundCloud, Matt Proud. And he felt even more strongly about this. And he was particularly unhappy with the state of open source monitoring systems. So he had actually already in his free time started building client libraries for instrumenting services with metrics.

And his grand vision was to build a whole monitoring system. So when I joined a month later, he kind of pulled me on board and we started building something in our free time that eventually became Prometheus. So just in the first months, end of 2012, that was really just our free time. Finally, we got enough of it working in such a way that we could expose data from services, collect it, query it, and maybe, you know, even show it in a graph.

And that was the point when we decided, okay, this is actually going somewhere. Let's give it a name. Let's call it Prometheus. And we, like briefly afterwards, we started formally introducing that at SoundCloud.

And yeah, nowadays it has become SoundCloud's standard monitoring system and time series database. Now, topic aside, I've got to ask the question: one of my favorite movies out there, by Ridley Scott, is a movie called Prometheus. Is there any correlation? I have never watched that movie actually.

Will we see aliens come out of the code at some point? Right. So that, that was actually funny. I think it actually came out around the same time, but it wasn't really on my radar back then.

I think I just briefly had heard about it, but it wasn't really any, it wasn't really connected to this. Okay. All right. Yeah.

Prometheus, the movie came out in 2012 and I remember loving the name and not loving the movie so much, Adam. So maybe that's a separate show, but we could, uh, we could do that. I heard, I heard a lot of bad things about that movie. We can pause for a minute when you rant.

I mean, we could start another show. Maybe I should go a bit more into what we had at SoundCloud back then, because that was kind of the big motivation to build Prometheus. Well, you said that you felt naked as a Googler, you felt naked coming out of Google with some of the things missing. So this was obviously one of those things missing, right?

Yeah. Yeah. But, but you might ask there, there, there were many open source monitoring systems, right? Why, why were we not happy with those?

That's my question. I like that question. I actually had that question queued up. Yes.

That's the next question. Cool. So, I mean, back then, uh, SoundCloud was doing this migration that a lot of companies do migrating from one monolithic web application to a set of microservices just because, uh, the initial monolithic application has grown too big, too complex. People don't want to maintain it anymore.

You can't have independent groups deploying independent things. So SoundCloud pretty early on actually started adopting Go and built their own kind of Heroku-style in-house cluster scheduler called Bazooka. That was already a container scheduling system, in a very early form, and we're still using it, actually. That was before Docker came out and before Kubernetes and so on came out.

And the challenge was now that we had these hundreds of microservices running on these Bazooka clusters, with thousands of instances, and developers, whenever they built a new revision, maybe even every day, scaled down the old revision and scaled up the new revision. And all these instances would land on random hosts and on random ports. And somehow we needed to monitor them. So what SoundCloud did back then was use StatsD and Graphite as the main time-series-based monitoring system.

So StatsD and Graphite had several problems. When I joined, I remember the StatsD server almost falling over because it was a single-threaded Node.js application running on a huge beefy machine, but it could only use one core. So it was actually throwing away UDP packets left and right. I don't know if you know how StatsD works.

The general working model is that, let's say you have a set of web servers, let's say an API server, and you have 100 instances of that. Then if you want to count the number of HTTP requests that happen in that entire service, every one of these instances, for every request that it handles, sends a UDP packet to StatsD. StatsD counts up all these counter packets from these hundred instances over usually a 10-second interval, then finally sums them all up and writes a single data point out to Graphite. So Graphite is a time series storage system, and StatsD sits in front of it to aggregate counter data into a final count per 10 seconds. And you can do some stuff there.

Like you can say on the service side, please only send every 10th UDP packet or something. So you alleviate the load somewhat, but the main pattern here is that you're doing the counting on the StatsD side. And, yeah, that StatsD wasn't really scalable. It was throwing away UDP packets and wasn't really working that well anymore.
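To make the counting pattern described above concrete, here is a minimal sketch of StatsD-style aggregation in Python. It is not StatsD's actual code: the packet tuple layout, the `aggregate` function, and the `api.requests` metric name are assumptions made for illustration. It sums counter packets per 10-second window and scales sampled packets back up, so a client that sends only every 10th packet still contributes roughly its true count.

```python
from collections import defaultdict

FLUSH_INTERVAL_SECONDS = 10  # the interval mentioned in the conversation


def aggregate(packets, flush_interval=FLUSH_INTERVAL_SECONDS):
    """Sum counter packets into one data point per metric per flush window.

    packets: iterable of (timestamp_seconds, metric_name, count, sample_rate).
    A sample_rate of 0.1 means the client sent only every 10th event, so each
    received packet is scaled up by 1 / sample_rate.
    """
    totals = defaultdict(float)
    for timestamp, metric, count, sample_rate in packets:
        window_start = int(timestamp // flush_interval) * flush_interval
        totals[(window_start, metric)] += count / sample_rate
    return dict(totals)


# Hypothetical packets from several instances; the third one is sampled at 1 in 10.
packets = [
    (0.5, "api.requests", 1, 1.0),
    (3.2, "api.requests", 1, 1.0),
    (9.9, "api.requests", 1, 0.1),
    (10.1, "api.requests", 1, 1.0),
]
points = aggregate(packets)
```

Note that all the counting happens in the aggregator, which is why a single overloaded aggregator that drops packets undercounts the whole service.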

Um, and the other problem was Graphite's inherent data model. So in Graphite for service monitoring and for host monitoring, we had Ganglia. And Ganglia is pretty much completely, you know, you have the host as a dimensional key there, but not much else. Of course, also the metric name, but there's no query language.

There's no nice graphing interface and so on. You get these pretty static dashboards with host metrics. And yeah, so we use also Nagios, of course, for the host alerting then. This might be a little bit premature, but I just went to the Nagios, and we're going to say it different ways, by the way.

Nagios, Nagios, Nagios. They say they're the industry standard for IT infrastructure monitoring. What is the goal or what was the goal with Prometheus? Was it to, you know, redo what everyone had been doing not quite so well because you have opinions and you know, obviously some skills to do it, but was it the goal to sort of unseat some of these existing players or is it to just sort of like rebuild something new that made sense for SoundCloud?

Yeah, definitely. So for us, it was the goal to replace StatsD, to replace Graphite, to replace Nagios in the end with a new kind of ecosystem that is more powerful and more integrated and allows you to do more stuff in a more modern way. So yeah, definitely. We hope to make people depend less on those old tools, I would say.

So we kind of sometimes jokingly call it the next generation monitoring system. And it does try to cover all the aspects from instrumenting your services, collecting the data, showing the data in the dashboards, alerting on the data if something is wrong, and then sending those notifications to you. So yeah, it tries to cover basically the whole field. What it does not do is event-based monitoring.

So if you want to do per-account, per-request accounting, let's say, you want to really collect every individual event, you know, a use case like logging or a use case like Elasticsearch where you can really put every individual record of what happened in there, that's not really what we're trying to do. Prometheus is really in the business of collecting purely numeric time series that have a metric name and a set of key value dimensions. And those, the metric name and the key value dimensions, uniquely identify every series and that you can then actually use together with a query language to do really powerful queries to aggregate and slice and dice based on whatever dimension you're currently interested in during the query, actually. And yeah.
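As a rough illustration of the point that a metric name plus a set of key-value dimensions uniquely identifies a series, here is a small Python sketch. It is not Prometheus's storage code: the `series_key` and `append` helpers and the example metric and label names are hypothetical. It shows that the same name and labels always land in the same series, regardless of the order the labels are written in, while changing any label value creates a new series.

```python
def series_key(metric_name, labels):
    """Build a hashable identity from a metric name and its label set."""
    return (metric_name, tuple(sorted(labels.items())))


samples = {}  # series identity -> list of (timestamp, numeric value)


def append(metric_name, labels, timestamp, value):
    """Append one numeric sample to the series identified by name and labels."""
    samples.setdefault(series_key(metric_name, labels), []).append((timestamp, value))


# Same name and labels (in a different order) extend the same series.
append("http_requests_total", {"method": "GET", "status": "200"}, 1000, 17.0)
append("http_requests_total", {"status": "200", "method": "GET"}, 1015, 21.0)
# A different label value is a different series.
append("http_requests_total", {"method": "GET", "status": "500"}, 1000, 2.0)
```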

So you started building this in your free time or you and your buddies started building it. I'm curious, just kind of the inner workings of SoundCloud, where they're at with open source and how much freedom they give you as an engineer. Was this something that you had to sell to your boss or to the company or was it just like, well, we're doing this now and whatever you guys think is the best solution must be right. Yeah, so this was definitely an interesting history.

I think at the beginning, we just took the liberty ourselves to do that in our free time. There was a lot of resistance at the beginning to introduce that at SoundCloud, which totally makes sense to me, especially in retrospect, because to be honest, at the beginning, nothing was really working. It was, I mean, late 2012, early 2013. The main server was pretty immature.

It wasn't really performing well. A lot of ecosystem components were missing and there was no real dashboarding solution yet and so on. But as time went on, we just kind of, you know, I think we took quite some liberties there in just pushing this project on and it became better and better. And I would say probably one and a half years in, we had the main server that collects the time series and makes them available.

We had that pretty mature and stable. We had PromDash, which is the Prometheus dashboard builder. So finally, people were actually able to build dashboards on top of the data that they collected. And we also had one of our really first killer use cases where we got instrumentation about all the containers that were running on Bazooka, our in-house Heroku-style system.

So you could get, for every application, revision, and proc type, keyed by those dimensions and more, the current CPU usage, the memory usage, the memory limit, and so on and so on. And that really started convincing people that this was really worth it. And then I think that was kind of the tipping point where shortly after the strategic bet was made in SoundCloud to really switch to that. And in terms of open sourcing, that was interesting because when we started this initially, we just put it on GitHub, in its own organization, without asking anyone.

And so it's kind of a weird status, I guess. It was a private project. It still arguably, I mean, it was definitely started in the free time. Matt even started before he joined SoundCloud.

And we've been trying since then to keep it as independent as possible from any single company. So we really want this to be an open community project without, you know, one company controlling too much of the direction and so on. And before, so we put it on GitHub back then, but we really didn't make any noise about it. So we only told a couple of friends, especially also other ex-Googlers because, so I guess I have to say Prometheus is kind of inspired by a lot of what we learned about monitoring at Google and a lot of people who quit Google then asked, either asked us, hey, do you know anything similar?

Or they just discovered Prometheus and kind of noticed that it was very similar to what they'd been used to. So before we even, you know, went more public about Prometheus, we had kind of an insider circle of people using it and testing it already. One of our ex-colleagues from SoundCloud then went to Docker, and he started using it at Docker. And another colleague used it at Boxever, which is a Dublin-based company.

And so he's in Dublin. And in terms of open sourcing, so it was open source, but only in the beginning of this year, for the record, since it's a podcast, this year is 2015. In January, we decided, okay, it's finally, it's really ready enough to share with a broader audience. So just leading up to that, we had a lot of discussions with, you know, internal departments about how we should communicate this and what's the status around that.

In the end, everything was pretty relaxed. And we had, you know, blog posts on the SoundCloud's backstage blog and on Boxever's blog. And I think on my Docker colleagues' private blog back then. And yeah, and then it really took off.

And that was... So it took some work then. It took some commitment from you and Matt and others that were sort of seeing the way over where this can go and then... I was gonna say, you just run it concurrently alongside your statsd stuff until it showed its value.

And then you were able to eventually cut over or you're still running your statsd stuff as well. So yeah, that's what we did. And statsd is still running because, you know, you never turn off old systems in practice. But practically nobody's using that anymore.

Very few people are using that. So if you're building a new service at SoundCloud, it's going to use Prometheus. There's some legacy stuff on statsd and graphite still. And there's some stuff that was hard to convert.

But yeah, for the most part, it's all Prometheus now. And yeah, it's been really a ride, especially since being more vocal about it at the beginning of the year. We've really, I mean, the community has grown crazily. We have contributors from all kinds of companies.

We get a lot of contributions. Basically, we get contributions almost every day, if not multiple. I think Google is now, Google's Kubernetes is now natively instrumented with Prometheus metrics. So if you want to monitor Kubernetes, you don't even need to have kind of any kind of adapter to get Prometheus metrics out of there.

You have CoreOS adopting it quite a lot for their components. So etcd is one notable mention there. That is already sprinkled with Prometheus metrics. Then you have DigitalOcean completely adopting it for their internal monitoring right now.

I don't know how much I can say about that, but I think these are the three companies where they're like reasonably public about what they're doing with Prometheus. I know of a bunch more, but I, I'm not sure, you know, how much you can say about those. Sure. Well, there's definitely tons of details that any system that looks to replace a handful of legacy systems will have many moving parts and you have an architecture, you have a data model, there's a query language.

There's lots of details. We want to ask you about all of them. First, we're gonna take a quick sponsor break, hear a word from our awesome sponsor, and then we will be back with all the nitty gritty details of Prometheus. Be right back.

Toptal is by far the best place to work as a freelance software developer. I had a chance to sit down and talk with Breanden Beneschott, the co-founder and COO of Toptal. And I asked Breanden to share some details about the foundation of Toptal, what makes Toptal different and what makes their network of elite engineers so strong. Take a listen.

Hey, I'm one of the co-founders and I'm an engineer. I studied chemical engineering to pay for this super expensive degree. I was freelancing as a software developer. And by the time I finished, realized that being a software developer was pretty awesome.

And so I kept doing that. And my co-founder is in a similar situation as well. And so we wanted to solve a problem as engineers and do it as...

It is not yet possible to read back from those through Prometheus via Prometheus's query language. So if you want to get data out of those again, currently you would still have to then head to those other systems.

But that's on the long-term roadmap. We definitely want to have a long-term storage that we can read back from. The local storage is good for, you know, a couple of weeks or maybe even months, maybe longer depending on how much data you have. But it's not really meant as a forever storage.

So, yeah. So that's the simplicity decision, just because you guys want it to be simple? Yeah, on one hand, it's much simpler to implement, of course, than a distributed system. And we also believe that through the simplicity, hopefully, you'll get more reliability out of this in the end.

So if let's say you wanted to have HA, high availability, you would simply run two identically configured Prometheus servers scraping exactly the same data. And if one goes down, you still have the other one to go to. But they are not clustered. So they're completely independent of each other.

And if you want to investigate state during an outage, you just need one of them up and you can go to either one and see what's actually happening. Okay, so normally instrumented jobs are one of the three types of things that Prometheus can collect data from. But you might also have something like a Linux host machine or HAProxy or Nginx, things that you cannot easily instrument directly, at least. You probably wouldn't want to go into the Linux kernel and build a module that exports Prometheus metrics over HTTP, right?

So for that, we have a set of export servers. We call them exporters, which are basically just little jobs, little binaries that you run close to whatever you're interested in monitoring. And they know how to extract the native metrics from that system. So for example, in the case of the host exporter, it would go to the proc file system and give you a lot of information about the networking and the disks and so on and so on.

And these little exporters then transform what they collect locally into a set of Prometheus metrics, which they again expose on an HTTP endpoint for Prometheus to scrape. And that's how Prometheus can get information from these kinds of systems. And we have a lot of exporters for all kinds of systems there already. Finally, the third kind of thing you might want to monitor and which can be a challenge is things like batch jobs or things that are just too short-lived to be exposing metrics and to be scraped reliably by Prometheus.
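The exporter pattern described above can be sketched in a few lines of Python. This is a toy illustration, not the real host exporter: the `example_memory_` metric prefix, the sample input, and the `MetricsHandler` class are assumptions, and the output uses only the simplest `name value` line shape. The handler is defined but never started here; in real use `read_source` would read a file under `/proc`.

```python
from http.server import BaseHTTPRequestHandler

# Stand-in for the contents of a /proc file, so the sketch runs anywhere.
SAMPLE_MEMINFO = """MemTotal:       16384 kB
MemFree:         4096 kB
"""


def parse_meminfo(text):
    """Parse /proc/meminfo-style lines into {field_name: bytes}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if not parts:
            continue
        amount = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            amount *= 1024
        values[name.strip()] = amount
    return values


def render_metrics(values, prefix="example_memory_"):
    """Render one 'name value' line per metric for a scraper to read."""
    lines = [f"{prefix}{name} {value}" for name, value in sorted(values.items())]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics over HTTP on every GET request."""

    read_source = staticmethod(lambda: SAMPLE_MEMINFO)

    def do_GET(self):
        body = render_metrics(parse_meminfo(self.read_source())).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)


output = render_metrics(parse_meminfo(SAMPLE_MEMINFO))
```

The exporter itself stays stateless: it reads the native values at scrape time and translates them, and the monitoring server does the storing.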

So in that case, let's say you have a daily batch job which deletes some users or so on. And you want to track the last time it ran successfully and how many users it deleted. For that, we have something called the push gateway, which is kind of the glue between the push and the pull world, which you're only really supposed to be using when you really have to. And the batch job could then push these metrics, usually at the end of its run, the last run and the deleted users, to that push gateway.

And the push gateway would simply hold on to those metrics forever. And the Prometheus server can then come by and scrape them from the push gateway. And yeah, so that's kind of the data ingestion side of things. Further along in the architecture, after the data is collected and stored, we can do two interesting things with the data.
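Here is a minimal sketch of the push gateway idea from the batch job example: a short-lived job pushes its final metrics, and the gateway keeps them so a later scrape can still see them. The `PushGateway` class, `run_batch_job` function, and metric names are hypothetical and say nothing about the real push gateway's API.

```python
class PushGateway:
    """Holds the last pushed value per (job, metric) until it is overwritten."""

    def __init__(self):
        self._metrics = {}

    def push(self, job, metric, value):
        """Called by a short-lived job, typically once at the end of its run."""
        self._metrics[(job, metric)] = value

    def scrape(self):
        """Return everything currently held; scraping does not clear it."""
        return dict(self._metrics)


def run_batch_job(gateway, users_to_delete, now):
    """Pretend to delete users, then push the outcome before exiting."""
    deleted = len(users_to_delete)  # stand-in for the real deletion work
    gateway.push("delete_users", "last_success_timestamp_seconds", now)
    gateway.push("delete_users", "deleted_users", deleted)


gateway = PushGateway()
run_batch_job(gateway, ["alice", "bob", "carol"], now=1438905600)

# The job is gone, but every later scrape still sees its last pushed values.
first_scrape = gateway.scrape()
second_scrape = gateway.scrape()
```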

We can look at it as a human on a dashboard or directly on the Prometheus server. So for dashboarding, we have a couple of solutions. We have PromDash, the Prometheus dashboard builder, which is really kind of a UI-based, click-based dashboard builder, similar to Grafana. When I started building PromDash, Grafana, to my knowledge, didn't really exist yet or not at all.

But it's roughly comparable to that. But since then, Grafana now also has experimental Prometheus graphing support. And there's a third visualization option where you can serve dynamic HTML templates directly from the Prometheus server. That's kind of a power user use case where you can build any kinds of HTML-based dashboards.

And these templates then have access to the query language of Prometheus. So they allow you to build even dynamic layouts depending on the data that you have in your Prometheus instance. So that's visualization. And then the last part that we do in Prometheus is alerting.

So you have collected a lot of data now about all your systems, your hosts, and your services. And now you can actually make use of that data to see if something is wrong somewhere. To see if a batch job hasn't run for a while, to see if the request rate of some service is too low or errors are spiking up. And you can actually use the same powerful query language that you can use to display stuff.

You can use the same language to formulate alert conditions under which people should get notified. And since you might have multiple of these Prometheus servers that each compute these alert conditions in the company, you might want to do some correlation between them and alert routing and so on. And that's better done in a central place. So you'll usually have one or a few alert managers in your company.

That's a separate binary again that you run usually once that all the Prometheuses in your organization send currently firing alerts to. And the alert manager then can do things like inhibit one alert if another one is firing. It knows how to route alerts based on the key value dimensions on the alerts to specific notification configurations, to specific teams and so on. And it supports a range of notification mechanisms like PagerDuty, email, Slack, and so on.
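The inhibition and label-based routing described above can be sketched as follows. This is an illustration only, not the alert manager's implementation or configuration format: the alert names, the `source`/`target` rule shape, and the receiver strings are made up for the example.

```python
def matches(labels, matcher):
    """True if every key/value pair in matcher is present in labels."""
    return all(labels.get(key) == value for key, value in matcher.items())


def inhibit(alerts, rules):
    """Drop alerts matching a rule's target while an alert matching its source fires."""
    kept = []
    for alert in alerts:
        suppressed = any(
            matches(alert, rule["target"])
            and any(matches(other, rule["source"]) for other in alerts if other is not alert)
            for rule in rules
        )
        if not suppressed:
            kept.append(alert)
    return kept


def route(alert, routes, default_receiver):
    """Return the receiver of the first route whose matcher fits the alert."""
    for matcher, receiver in routes:
        if matches(alert, matcher):
            return receiver
    return default_receiver


firing = [
    {"alertname": "DatacenterDown", "datacenter": "dc1"},
    {"alertname": "HighErrorRate", "job": "api", "team": "backend"},
    {"alertname": "BatchJobStale", "job": "delete_users", "team": "data"},
]
inhibit_rules = [
    {"source": {"alertname": "DatacenterDown"}, "target": {"alertname": "HighErrorRate"}},
]
routes = [
    ({"team": "backend"}, "pagerduty:backend"),
    ({"team": "data"}, "slack:data"),
]

notifications = [
    (alert["alertname"], route(alert, routes, "email:ops"))
    for alert in inhibit(firing, inhibit_rules)
]
```

In this run the error-rate alert is suppressed because the datacenter alert is firing, and the remaining alerts are routed purely by their labels.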

So that's kind of the overall overview over Prometheus. Just one question on the visualization side. What's the purpose of having a separate like the PromDash aspect and then also built-in graphing and querying? Is one for a certain use case and one's for a different use case?

Yeah, definitely. So the built-in graphing is really more useful for ad hoc exploration, really of data that is in one Prometheus server. And that's good. You know, even if your PromDash is down and you really just want to see what's happening in one Prometheus server, you can go there.

You can do very rudimentary graphing. So it doesn't have all the bells and whistles that PromDash has, you know, like stacked. It does have stacked graphs, but it doesn't have like multiple axes, multiple expressions in one graph, different color schemes and things like that. So it's quite simple, but it allows you in the worst case to still, you know, explore the data in that Prometheus server.

And PromDash is really a dashboard builder. So that's for when you really want to persist a dashboard forever and for other people to see and to share. And especially it's very useful. Let's say, I think in SoundCloud, we have maybe roughly 50 Prometheus servers.

And we have one central PromDash installation, which just knows about all these Prometheus servers. And in there, you can then have dashboards or even single graphs where you show time series or query expressions from multiple different servers in one graph. So yeah, it's more of this nice wall dashboard use case. So the alert management would be part of the built-in UI.

The configuration of your alerts and stuff would be what you use the built-in UI for or PromDash for. Yeah. So for alerting, that's partially in the Prometheus server and partially in the alert manager. So in the Prometheus server, you can define rules, basically alerting rules that get executed, let's say every 30 seconds or one minute commonly, depending on what you configure.

And what happens there is that it really just executes a query expression and sees if there are any results from that expression. We maybe should go a bit into how the query language works. And if there are any results from that expression, they get transformed into labeled alerts and get transferred to the alert manager where they can then be deduped, silenced, routed, and so on. And this is kind of interesting because this whole labeled key value data model goes all the way from the instrumented services to the storage, to the querying, and all the way to the alert manager.
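A rough sketch of that flow, under stated assumptions: a rule is modeled as a plain Python predicate over current series values rather than a real query expression, and `evaluate_rule`, `dedupe`, and the label names are hypothetical. Every series that satisfies the condition becomes an alert carrying the series labels, and identical alerts from two identically configured servers collapse into one on the receiving side.

```python
def evaluate_rule(alert_name, condition, series):
    """Turn every series whose current value satisfies the condition into a labeled alert."""
    alerts = []
    for labels, value in series:
        if condition(value):
            alert = dict(labels)  # the alert keeps the series labels
            alert["alertname"] = alert_name
            alerts.append(alert)
    return alerts


def dedupe(alerts):
    """Collapse alerts with identical label sets, keeping first-seen order."""
    seen = set()
    unique = []
    for alert in alerts:
        key = tuple(sorted(alert.items()))
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


# Hypothetical current values of an error-ratio expression, one per job.
error_ratio = [
    ({"job": "api", "status_class": "5xx"}, 0.12),
    ({"job": "search", "status_class": "5xx"}, 0.01),
]


def too_many_errors(value):
    return value > 0.05


# Two identically configured servers evaluate the same rule on the same data.
from_server_a = evaluate_rule("HighErrorRatio", too_many_errors, error_ratio)
from_server_b = evaluate_rule("HighErrorRatio", too_many_errors, error_ratio)
received = dedupe(from_server_a + from_server_b)
```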

So you really have that chain of dimensional information to work with at every point in the chain. Yeah. It sounds like everything builds off the query language and the query language builds off of the data model. Exactly.

So maybe the data model is probably the next place to dig in and tell us, you know, what it is and how it all works. And maybe if that's unique to Prometheus or something you took from somewhere else, just go into details on how the actual data is modeled. Sure. So Prometheus stores time series and time series have a metric name and they have a set of key value dimensions, which we just call labels.

So you might have something like a metric name, http_requests_total, which tracks the total number of HTTP requests that have been handled by a certain service instance since it started. But then you might be interested in drilling down, right? You would want to know which of these are GET requests, which path handlers have been hit and so on. And for that, you can use the label dimensions.

So, for example, you might have method="GET" on there and you might have status="200" for the outcome and so on. And these dimensions then get stored and they allow you to query time series by these dimensions. So you could say, you know, sum over all the dimensions except the status code dimension. Then you would get the total number of requests over all your service instances, but keyed by the status code.
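The aggregation just described, summing over every dimension except the status code, can be sketched in plain Python. The `sum_by` helper and the sample values are made up for illustration; in the Prometheus query language the equivalent would look roughly like `sum(http_requests_total) by (status)`.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Sum values across all series, preserving only the label names in keep."""
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)


# Hypothetical current values of one metric across instances, methods, and statuses.
series = [
    ({"instance": "a:8080", "method": "GET", "status": "200"}, 120.0),
    ({"instance": "a:8080", "method": "GET", "status": "500"}, 3.0),
    ({"instance": "b:8080", "method": "GET", "status": "200"}, 95.0),
    ({"instance": "b:8080", "method": "POST", "status": "200"}, 10.0),
]

by_status = sum_by(series, keep=["status"])
```

The instance and method dimensions are summed away, while the status dimension survives as the key of each result.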

So that dimension would be preserved. Or you could just select the specific dimension or you can even do. So let's say you have one metric...

Or the content of MySQL queries, the actual query string, then you're probably better served with something like a log-based system, InfluxDB, or Elasticsearch, and so on, that really can store individual events, individual things with arbitrary metadata. So I can see where the labels might get a little bit where there's better and worse practices with them, whereas with a more just a key value namespacing thing, it's pretty easy to just come up with the next name you drill down one dimension.

But as you add dimensions, I can see where it would get difficult and you're, in fact, warning against things not to do. Is there a place to go where it's like, hey, how do I do this in a typical situation? Because I think across many organizations, the type of metrics are similar. Do you guys have best practices or things you've learned at SoundCloud, best ways to use Prometheus labels?

Oh, yeah, definitely. So we actually have a whole section on best practices at the very bottom of our website about metric and label naming and how to build good consoles, dashboards, and alerting, and so on. I think one thing that really just happened sometimes at SoundCloud is that people mistakenly, either by not yet knowing the Prometheus data model well enough or just by making a simple mistake in the code, have set some of these label dimensions, let's say, to a track ID or a user ID, and that then creates millions and millions of time series. I mean, Prometheus, a single Prometheus server can handle millions of time series, but if you just overdo it a bit and you're not careful about what you stick into label values, then you can really easily blow up a Prometheus server.
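To see why an unbounded label value is dangerous, here is a back-of-the-envelope sketch. The `series_count` helper and the specific numbers are assumptions chosen for illustration: the worst-case number of series for one metric is the number of instances multiplied by the number of distinct values of each label.

```python
from math import prod


def series_count(instances, label_cardinalities):
    """Worst-case number of time series for one metric.

    label_cardinalities maps each label name to its number of distinct values.
    """
    return instances * prod(label_cardinalities.values())


# Bounded labels: 100 instances x 4 methods x 5 status codes.
bounded = series_count(100, {"method": 4, "status": 5})

# The same metric with a user ID label that can take a million values.
unbounded = series_count(100, {"method": 4, "status": 5, "user_id": 1_000_000})
```

With these assumed numbers the bounded case is 2,000 series, while adding the user ID label pushes the same metric to 2 billion potential series.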

So keep those label dimensions to sane, bounded things. Prometheus automatically attaches some of them anyway. So you get the name of the job, which is the name of the service. It's just terminology, I guess.

The name of the service, which we call job, and the host and port of the instance, by default. And that already gives you some dimensionality even if you don't have any labels on the side of your service, right? So if you have 100 instances, you get at least 100 time series for this one metric, which could be the number of HTTP requests, and then you have to multiply that by all the other dimensions that you add. That could easily add up for a single metric.

You can easily get thousands or even tens of thousands of time series. Well, certainly lots of moving parts when we talk about Prometheus. So I'm going to assume that based on this conversation, many people are like, I want to try it out. I want to get started.

So we're going to take a quick break, and when we come back, we're going to talk about just that. We'll be right back. I have yet to meet a single person who doesn't love DigitalOcean. If you've tried DigitalOcean, you know how awesome it is, and here at The Changelog, everything we have runs on blazing fast SSD file servers from DigitalOcean, and I want you to use the code Changelog when you sign up today to get a free month, run a server with one gig of RAM and 30 gigs of SSD drive space totally for free on DigitalOcean.

Use the code Changelog. Again, that code is Changelog. Use that when you sign up for a new account at digitalocean.com to sign up and tell them The Changelog sent you. All right, we're back with Julius Volz talking about Prometheus.

And while we're on that break, we realized that getting started is a good step to go towards next, but we forgot we want to kind of go back a little bit on this religious piece of push versus pull when it comes to Prometheus. So Julius, why don't you lead us through that piece there? Sure. So this is funny because it's a bit of a religious thing.

And push can be, you know, pull can be sometimes better, sometimes push is better, depending on the type of environment you're using Prometheus in. But one of our team members even wrote a blog post about push versus pull for monitoring. He's Brian from Dublin. And you can find that in our FAQ actually.

But I think some points are interesting. So if you do, so I think first let's start with one advantage of push. Push is really easy to get through firewalls. If your monitoring system is easily reachable from everywhere, you know, you only need to make one point, one network point available on the internet or in your local, in your company's network or whatever.

And then everyone just needs to be able to push somehow to that. With pull, sometimes people run into the problem that let's say, you know, if they have setups where they need to pull from various endpoints on the internet and they should be secured and so on, you know, they have to have a bit more, they need to now secure and make available n endpoints instead of one. So that's often what pains people when they can't use pull. But for us, especially in these kind of modern web company environments where you have your own data centers or your own virtual private clouds and you have internally trusted environments where you can just pull from every target, pull really has a number of advantages.

So one thing that's really, really nice is that you can just manually go per HTTP to a target and get the current state of the target. So by default, if you go to an end, a Prometheus endpoint on a service, you will get a text-based format that will tell you the current state of all the metrics and you don't even need a server for that. So that's one nice thing. You can run a complete copy of production monitoring on your laptop or anywhere.
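As a rough illustration of what "just go per HTTP to a target" looks like, the sketch below serves a counter in the Prometheus text exposition format using only the Python standard library. The metric name, label values, and port are assumptions made up for this example; a real service would normally use one of the Prometheus client libraries instead of formatting lines by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory state of the process. The metric and label values are hypothetical.
REQUESTS = {("get", "200"): 1027, ("post", "500"): 3}


def render_metrics(requests):
    """Render the current counter state in the text exposition format."""
    lines = [
        "# HELP http_requests_total Total number of HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, code), value in sorted(requests.items()):
        lines.append(
            f'http_requests_total{{method="{method}",code="{code}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the conventional /metrics path is served in this sketch.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve the metrics page on localhost (port is an assumption)."""
    HTTPServer(("localhost", port), MetricsHandler).serve_forever()
```

Calling `serve()` and opening `http://localhost:8000/metrics` in a browser or with curl shows the current state of the counters as plain text, with no monitoring server involved.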

You can just bring up a second copy of all of it to do experiments, to try out new alerting rules, and so on. And that copy will get the exact same data as your production version of monitoring without you having to configure the actual services to send data somewhere else. And we kind of argue that if you're doing service monitoring and alerting, you kind of need to know, your monitoring system kind of needs to know anyways where your services live and which services should currently be there because otherwise it can't really alert you about a target being down or so on because it doesn't know if it should be gone, if it was deprovisioned, or if it was just crash looping, for example. So with that kind of argument, the monitoring system should be knowing what your targets are anyways.

So the knowledge is already there, so that also makes it easier to pull the data and makes it easier to tell in monitoring and alerting whether a target is currently down. And yeah, so we don't think otherwise that is like a huge issue whether you do push or pull, especially in terms of scalability, it doesn't really matter that much. But yeah, it kind of depends on your environment. I think there would be some scalability aspects of pulling as you add more services, more hosts.

I guess, you know, you had your statsd servers dropping UDP packets. It seems like catching UDP packets is a lot easier than going out and requesting data. Have you found in practice that that's just not a big issue? Yeah, so that's actually an interesting point.

So that's really not an issue at all. So the actual pulling side of things has never been a bottleneck for us. But it's also very important to point out here that the whole fundamental way of how data is transferred is quite different in the statsd model to the Prometheus model. As I said earlier, in the statsd model, you send UDP packets basically proportionally to the amount of user traffic you get, right?

Like for every HTTP request or every tenth or so on, you send a UDP packet, please count this, please count this, please count this. Why don't you just increment a number in memory on your web server and then every 15 seconds or so, transfer the current counter state? So that's Prometheus's philosophy. The nice thing is there, it uses way less traffic, like orders of magnitude less traffic.

It uses less computation in the client, especially if you have services that do many thousands or even more requests per second. You might have some multi-core, high-performance web routers which can do hundreds of thousands or more requests. And there, you know, sending a UDP packet for every request would actually be quite prohibitive. And the other thing is that if these counter UDP packets in the statsd world get lost, you just get a lower total request rate displayed in your monitoring system and you have no clue that these packets were actually lost.

With the Prometheus model, if a scrape fails one time, it doesn't really matter so much because, let's say, the next scrape works, you will still not lose any of these counter increments that have happened because they're tracked on the service side, right? And every instance, these counters are just continuously incrementing from the start of the instance. And every time I come by, I just see what's the current state. And that's also a very good argument for not doing any kind of rate pre-computation on the service side, but doing that on the Prometheus server side.
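The difference between the two transfer models can be simulated in a few lines. This is a toy model under stated assumptions (a steady 100 requests per second for one minute, every tenth UDP packet dropped, one failed scrape); the function names are invented for the example and do not correspond to statsd or Prometheus code.

```python
def statsd_total(events, packet_lost):
    """Push model: one packet per event. A lost packet is simply never counted."""
    return sum(1 for i in range(events) if not packet_lost(i))


def scraped_increase(counter_at, scrape_times, scrape_failed):
    """Pull model: the service keeps a cumulative counter in memory and the
    server reads it at each scrape, keeping only the successful reads."""
    samples = [counter_at(t) for t in scrape_times if not scrape_failed(t)]
    return samples[-1] - samples[0]


def counter_at(t):
    # Assumed workload: a steady 100 requests per second.
    return 100 * t


events = 100 * 60  # one minute of traffic

# Assume every 10th UDP packet is dropped on the way to the statsd server.
pushed = statsd_total(events, lambda i: i % 10 == 0)

# Scrape every 15 seconds and assume the scrape at t=30 fails.
pulled = scraped_increase(counter_at, [0, 15, 30, 45, 60], lambda t: t == 30)

print(pushed)  # undercounts, with no indication that anything was lost
print(pulled)  # still covers every increment despite the failed scrape
```

In this toy run the push side sends 6,000 packets and silently undercounts, while the pull side needs five scrapes and still sees every increment once the next scrape succeeds.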

So in your service, really just count things up. Don't expose rates because, let's say, if you do expose rates, you know, this is kind of a derivative of a counter. Then you might really, if you miss a scrape, then you might really miss a peak in the rate. And if you miss a scrape with a counter, you just get a bit less, a bit worse time resolution of that data, but you would never miss any increments of that counter after a while.
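The point about counters versus pre-computed rates can be sketched as follows. The traffic numbers are hypothetical, and `rates_from_counter` is a simplified stand-in for what a monitoring server might do; Prometheus's own rate calculation also deals with things like counter resets, which this sketch ignores.

```python
def rates_from_counter(samples):
    """Per-second rate between consecutive (time, counter_value) samples."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(samples, samples[1:])
    ]


# Hypothetical requests per second in four 15-second intervals,
# with a spike in the second interval.
true_rates = [10, 200, 10, 10]
interval = 15

# Cumulative counter as the service would expose it at t = 0, 15, ..., 60.
counter_samples = [(0, 0)]
for i, r in enumerate(true_rates, start=1):
    previous = counter_samples[-1][1]
    counter_samples.append((i * interval, previous + r * interval))

# Alternative design: the service exposes the rate of the latest interval.
gauge_samples = [(i * interval, r) for i, r in enumerate(true_rates, start=1)]

# Assume the scrape at t = 30 fails in both designs.
missed = 30
seen_counter = [s for s in counter_samples if s[0] != missed]
seen_gauge = [s for s in gauge_samples if s[0] != missed]

counter_rates = rates_from_counter(seen_counter)
total_seen = seen_counter[-1][1] - seen_counter[0][1]
peak_seen = max(v for _, v in seen_gauge)

print(counter_rates)  # the spike shows up as a higher average over 30 seconds
print(total_seen)     # every increment is still accounted for
print(peak_seen)      # the exposed-rate design never saw the spike
```

With the counter, the missed scrape only coarsens the time resolution: the spike is averaged over a 30-second window but the total is intact. With the exposed rate, the one sample that carried the spike is gone for good.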

That makes sense. Awesome. That makes sense on the theory of why you go which path is on one side you can lose data and on the other side you're just kind of missing some time. Exactly.

Yep. Hopefully, that will be less painful. And let's see. Yeah, but I mean, that's basically as easy as this. You need to download the latest binary, unpack it, drop in a config file, and just start it, and it's running.

And by default, one of the default configuration files here is set up in such a way that Prometheus collects data on its own metrics exposition endpoint. So Prometheus instruments itself via one of the Prometheus client libraries, so you can monitor itself, basically. So that's a nice use case to get started. If you just want to look at some very simple Prometheus metrics without having any services.

Another thing that's really nice to get started with is, because everyone has this, is the Node exporter, which basically, by the way, has nothing to do with Node.js, but a host. So the Node exporter is a host exporter. It exports host metrics. And that's a really nice thing to get started with.

You just start it. You don't, I mean, you can set a lot of command line flags, but if you don't specify anything by default, it will do the right thing. And you configure Prometheus to scrape that either statically or via some kind of service discovery. And yeah, and then you get host metrics about either your local machine or your data center machines and so on.

That's pretty easy, too. While we're talking about getting started, I got to imagine that people are saying, OK, when I get started, I also want to have a community to sort of hang around. So you've got a Twitter handle, of course. You've got a mailing list, and you've got IRC.

So those are three ways that people can hang out and sort of catch up. I was on the mailing list recently and just see that it's pretty lively and active. So when you're getting started, if you have any questions, then there's this mailing list to look at as well, to link up in the show notes, of course. And definitely stop by the IRC channel.

So we are there basically every day, very active. A lot of people are coming there asking questions. And we're always super happy to answer. And yeah.

So that's kind of the fastest channel to reach us. And the mailing list is good for longer questions and more persistent communication. So it's Prometheus on freenode and then Prometheus developers as a Google group, which we'll link out to. So don't worry about trying to say that URL.

That's not readable. That's not pretty. Which URL? Well, the Google group.

It's not quite as easy to... Yeah, you don't want to read that out in the podcast, no. No, that's boring. changelog.com slash 168.

You'll get all the links we even found. That blog post of Push versus Pull that you referenced. So we'll have that in there as well. Yep.

Or just head over to prometheus.io, click on the Community tab, and you have all the channels there. That's true. And yeah, we're very, very happy about any contributors.

And I think who we could especially use, because we're all backend people, is someone who really likes doing front-end stuff. That traditionally was always lacking in these kind of infrastructure projects. That's a good segue there, Jared, to the call to arms then. That's right.

Sounds like one. Julius, if you were going to request help or give a call to arms to the open source community, would you say front-end developers is what we're after? What would you say to the open source community, how we can help you out? Yeah, generally it would be great to have more front-end interested people in the infrastructure world, right?

And that goes for Prometheus as well. We've been coding a lot of the, you know, PromDash is very front-end-y and the graphing interface in Prometheus itself. But it would be really great to get people who feel like really strongly about infrastructure and nice front-ends and help us, you know, refactor a lot of things there, improve the UI, make it shiny. That's definitely always very nice thing to have.

But, you know, any other kind of contributions are great too. I think two of the areas that are currently still lacking and that will get the most attention in the future are the alert manager, which we are currently redesigning and re-implementing over the next months to be more production ready and more powerful. But also any kind of long-term storage integration. So we have these ways of writing out data to currently OpenTSDB or InfluxDB, but it would be really great to have a full readback implementation where you can query the long-term storage through the Prometheus server again.

And, you know, if either someone wants to implement that for an existing backend system or wants to even maybe create a completely new Prometheus-specific long-term storage, that would be interesting as well. But there's a lot of stuff to do. Maybe head to the different issue trackers on the various Prometheus GitHub projects, which are all under github.com slash Prometheus and check out if there's anything that looks interesting to you. So yeah, that sounds like lots of different ways to get involved.

And while we're asking our closing questions, Julius, we would be remiss not to ask the one everybody loves, which is, who is your programming hero? Yeah, I hoped you would not ask that one. Okay. No, definitely Bjorn.

There you go. Bjorn is one of my partners in crime on Prometheus. We're quite a bunch now, actually. Also, actually, this is funny because we also hired an intern right now who we are going to transform to be a full-timer at SoundCloud.

And we found him through Prometheus contributions. And he's very young, like 23, and he outcodes me every day. He's very, very, very smart. And I actually, every day I'm astounded by the...

What's his name? Fabian. And, yeah, every day I'm astounded by the quality and the quantity of his coding, but also of his communication in the community. Really, really great person.

I guess more in terms of traditional programming heroes. I guess when I was a child, I really was like, had a bit of a coding crush on John Carmack, you know, with the early id games in the 90s, Doom and so on. Definitely in the Go community, Rob Pike. And you probably have heard about or even met at GopherCon Dmitry Vyukov.

He's from Google. He's not really on the Go team, but he's on the dynamic tools team of Google. But he has contributed so many awesome, awesome features to the Go runtime and tooling around that. The race detector, the new tracing framework, this fuzzing framework that also just now found actually a bug in Prometheus' query language.

Really great. And a lot of these really hardcore tools for getting, you know, dynamic information about your code. And he found hundreds of bugs with that. So I was really impressed when I heard about that.

And he also gave a really great talk about that at GopherCon that I can highly recommend. Yeah, that was another one I didn't mention on top of the show. Ben Johnson's open source database stuff. And then Dmitry, and specifically his talk on race, like you just said, was one that everybody was kind of raving about as it came out of the conference room.

So you're not the only one who's seen it pretty awesome. All right, Julius. Well, it was great having you on the show. Definitely something we've been wanting to get you on the show before to talk about Prometheus and everything it's doing and what y'all are doing at SoundCloud.

So definitely fun having you on the show today. I want to thank our awesome sponsors for the show, Codeship, Toptal, and DigitalOcean, for making this show possible. Also want to thank our awesome listeners and remind everyone that's not a member yet that we are member supported. You can join the community and get access to the member-only Slack channel as well as many other awesome benefits of supporting The Changelog.

Go to changelog.com slash membership. And while you're there, you might also sign up for Changelog Weekly and Changelog Nightly, which are our weekly and nightly emails, respectively at slash weekly and slash nightly. Jared, what's the next week's show? We do have one show scheduled.

What is that next week's show? Don't put me on the spot, man. I think it might be Ben Johnson on Gopher Databases. I know he's coming up.

And then he's on August. He records on August 14th. So we'll have a show between him and now, but we don't know who it is. We don't know who it is.

We're going to try and tease out what the next show is, but nonetheless, we have lots of awesome shows coming up soon. But until then, let's say goodbye. See you. Thank you.

Thank you.


Frequently Asked Questions

How long is this episode of Changelog Master Feed?

This episode is 1 hour and 10 minutes long.

When was this Changelog Master Feed episode published?

This episode was published on August 7, 2015.

What is this episode about?

Julius Volz from SoundCloud joined the show to talk about Prometheus, an open-source service monitoring system written in Go.
