This week we talked about backups: what they mean to us, how we apply them as a principle, what it all boils down to, and more.
In the tools section, Omer mentioned:
Neovim remote container: https://github.com/jamestthompson3/nvim-remote-containers
I'm ready. Yes, I am. And action, action, action, action.
I need to come more prepared. Every time we say we'll come prepared, and then nothing happens. Okay, so today's topic. First of all, this is our 16th session.
Whoa.
That's an accomplishment. Every time we come back with one more episode I'm like, wow, another one. So, today's topic. It's going to be very interesting: we're going to talk about backups. Okay, what's the first thing that comes to your mind when I say backups and, let's say, DevOps?
I already used the joke last week of "I don't have anything", so I won't do that. To be honest, for most people in the world, the first thing that comes to mind is probably server backup, right? Creating some kind of data backup, because I'm going to lose something. I need to back up my stuff, I need to back up my databases, I need to back up the images for my machines. That's not backup for me, by the way. My backup is going somewhere completely different, somewhere maybe you didn't even think about. Backup for me is mostly about disaster recovery, and by that I mean backing up my infrastructure, and by that I mean having everything templated, which takes me to another topic, infrastructure as code. But not only that: what I want to say is that I have everything backed up as code, not backed up as data, which is another aspect we can touch on. To me, that means having everything as code, every change, whether that's the application code, the code base of whatever I'm managing, obviously the infrastructure, all my automated tooling, everything around my Kubernetes clusters, my serverless Lambdas, even my EC2s. As long as I have things managed as code, templated, infrastructure as code again, I'm fine. I feel like I'm backed up in that sense. So that's the first thing that comes to my mind. How about you?
I'll go for database backup. People say "backup" and I jump to: oh yeah, I need to back up my database. That's what I'm thinking of. And also, it's not really a backup, but if this is an association game: backing up my code. I'm thinking like in the 90s, backing up my code and the stuff that I did. You remember, before Git, we used to do version one, version two, version three, where you renamed the files to back them up? So this is what comes to my mind when I think about backups.
Okay, so now let's do a deep dive and double-click on a few things about backups. I think we could talk for hours on this topic, so our challenge here is to stay focused rather than to find what to talk about. You talked about backup and infrastructure as code. So let's say I have infrastructure as code; I'm using Terraform, CloudFormation, whatever. What do you mean by backup in infrastructure as code? If I already have a stack in infrastructure as code, how do I use backup as code?
Yeah, so what I'm referring to is the literal sense of the word backup. When you speak about backups in the world of software and infrastructure, you're probably speaking about something more like what you said: backing up my data, backing up my database, and I have something to say about that too. But the word backup means something is backing me up when I'm going to have an issue. And in my world of terminology and things that I deal with, when it comes to the feeling of being backed up by something, more intuitively,
I go to the world of having my things templated. If you speak with a DBA and tell them something really bad is going to happen, I think intuitively they're going to think about their database being deleted, flushed, scratched, whatever. For me, it's infrastructure getting deleted, or infrastructure getting bugged with mistakes, things people didn't mean to change, stuff they didn't mean to push. Now I need to revive my backup, and my way to revive things is to run my templates, and to make sure they're up to date and that everything is aligned and in sync. This way I can control things; this way I feel backed up. Does that make sense to you?
Yeah. I still want some example of that, though. So I understand that you're talking about infrastructure as code, that this is what backs you up: if something fails, you can spawn it again or recreate it, right?
That's the thing.
All right. So I have a question. Let's say I have an RDS; it doesn't really matter which, DynamoDB, maybe Elasticsearch, any other database, any form of database. How do you test your disaster recovery mechanism? What are your steps? How do you approach that? Before the disaster, how do you test your disaster recovery plan for your database, no matter which one?
It's a loaded subject. Easiest thing in the world. Okay, let me take it a step further, or a step backwards rather. In my professional life, quote-unquote, I generally tend to go for managed services, and this is one of the main reasons: because I want to have backups. And not only backups, because you can back up anything, especially if you're running on a cloud provider.
I want to have automated backups. So let's take your example with RDS, was it, or DynamoDB?
It doesn't really matter.
Fine, both are managed services. Regardless of the intricacies of what each of them can do, they provide you with a mechanism for automating your backups. In RDS, for example, you can set a weekly window or a daily window, whatever you choose, in which it's going to take a snapshot and keep it, obviously with a rotation policy where you decide how far back in time you want to keep your backups. That gives me the safety of being able to revive my stuff. You asked how I plan for that, or how I operate the plan I set in place. It's as simple as taking a snapshot and reviving it, whether that's in the same environment, in another region, or in another AWS account. As long as my snapshots are being taken regularly, that is something I can do. Now, we can take it a step further. I'm not really doing that, but you can copy your snapshots across regions, or availability zones, or even just accounts, to make sure that if the account gets deleted and completely scratched, you can revive it somewhere else. And I think AWS has this best-practice architecture. They have a white paper and they call it a pilot light; I hope I'm not butchering the term.
The Well-Architected Framework, something like that?
Exactly that. Thank you, my friend.
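To make the automated backup settings described here concrete, the following is a minimal sketch, not the speaker's actual setup. It only builds the keyword arguments that boto3's RDS `modify_db_instance` and `copy_db_snapshot` calls accept; the instance name, window, and snapshot names are hypothetical, and the helper function names are made up for illustration.

```python
from typing import Any, Dict


def backup_settings(instance_id: str, retention_days: int, window_utc: str) -> Dict[str, Any]:
    """Keyword arguments for RDS ModifyDBInstance that turn on automated backups."""
    # RDS accepts 0 to 35 days; 0 disables automated backups, so require at least 1.
    if retention_days not in range(1, 36):
        raise ValueError("retention_days must be between 1 and 35")
    return {
        "DBInstanceIdentifier": instance_id,
        "BackupRetentionPeriod": retention_days,
        "PreferredBackupWindow": window_utc,  # daily window, "hh24:mi-hh24:mi" in UTC
        "ApplyImmediately": True,
    }


def cross_region_copy(source_snapshot_arn: str, target_name: str, source_region: str) -> Dict[str, Any]:
    """Keyword arguments for RDS CopyDBSnapshot, sent by a client in the destination region."""
    return {
        "SourceDBSnapshotIdentifier": source_snapshot_arn,
        "TargetDBSnapshotIdentifier": target_name,
        "SourceRegion": source_region,
    }


if __name__ == "__main__":
    # Hypothetical names; with boto3 this would be rds.modify_db_instance(**settings).
    print(backup_settings("prod-db", 14, "03:00-03:30"))
```

Keeping the arguments as plain dictionaries means the same values can live in a template or be reviewed in a pull request before anything touches the account. An encrypted snapshot would also need a KMS key in the destination region, which this sketch leaves out.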
So in the Well-Architected Framework, one of the concepts of being able to operate in disaster recovery is having a pilot light. I don't think it specifically speaks to database backups, but if I'm reducing it or extrapolating to the same subject, it means having a small, tiny thing that you can spin up immediately. I think they were referring more to containers or servers: just having a small server sitting in place with an auto scaling group, not doing much, like a micro instance. Then once something happens and your production is in flames, you can scale it up on demand and have the same infrastructure operating again in another place. So, now that we're speaking, I'm thinking you can do that with backups too. I'm not saying I'm doing it. My method is having snapshots taken in an automated manner through the managed service and then reviving them. And yes, we have to test it, I think quarterly. We're obliged to do that. So that's what I do.
Mm-hmm. Okay, and do you ever play Chaos Monkey? Let's say you wake up in the morning and say, today we're going to have a game, and you don't tell anyone: today I'm going to delete the dev database, the development stuff.
You know what you just did? We could have had episode 17 of our podcast speaking about chaos engineering, and now we're going into that. So let's say a word about it and keep it as a topic. What do you say?
Sure.
Okay. So, to your question: no, I don't. I don't, but I really want to. I used to do it in the past, where I felt like I could. I don't think I'm at a point where I can do that now. You know, working at a startup, things are built well, but not to the point that I can run a Chaos Monkey that will completely demolish subnets or availability zones. So I'm not there yet.
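The quarterly restore test mentioned above could be sketched roughly as follows. This is an illustration under assumptions, not the speaker's procedure: `rds` is expected to behave like a boto3 RDS client (`describe_db_snapshots`, `restore_db_instance_from_db_snapshot`, and the `db_instance_available` waiter), while the function names and instance names are hypothetical.

```python
import time
from typing import Any, Dict, List


def latest_available(snapshots: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the newest snapshot whose status is 'available'."""
    usable = [s for s in snapshots if s.get("Status") == "available"]
    if not usable:
        raise LookupError("no available snapshot to restore from")
    return max(usable, key=lambda s: s["SnapshotCreateTime"])


def restore_drill(rds: Any, source_instance: str, scratch_instance: str) -> Dict[str, Any]:
    """Restore the latest automated snapshot into a scratch instance and time it."""
    found = rds.describe_db_snapshots(
        DBInstanceIdentifier=source_instance, SnapshotType="automated"
    )  # first page only; a long snapshot history would need pagination
    snapshot = latest_available(found["DBSnapshots"])
    started = time.monotonic()
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=scratch_instance,
        DBSnapshotIdentifier=snapshot["DBSnapshotIdentifier"],
    )
    # Block until the restored instance reports "available".
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=scratch_instance)
    return {
        "snapshot": snapshot["DBSnapshotIdentifier"],
        "restore_seconds": time.monotonic() - started,
    }
```

Recording `restore_seconds` on every drill gives a measured recovery time to compare against whatever downtime the business can tolerate. The scratch instance still has to be deleted afterwards, which is left out here.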
I really mean just that database. You know, the best way to feel safe is to fall in shallow water, right? To drown in shallow water. So let's say you take the development database and you just break it, take it down. Then you can really practice and see how much time is wasted, how much time it takes until the first developer notices. You can really measure stuff if you just do it randomly without telling anyone. Would you do that?
Would I do that in production today, if you told me to test it now? No.
No, no, not even production. I'm talking just about development. We'll get to production in a second. So let's say you're in a team with, I don't know, 20 developers. Some of them are working locally, some of them are working with the development environment, QA are working with staging, and so on. And then you say: okay, I don't feel comfortable with my development environment. I feel like if production breaks down, then our rollback, our disaster recovery plan for the database, is not so good. So I have an idea. Let's delete the data in the development database and see what happens. What do you say about that? Is it a good thing?
Okay, you're asking whether I do that, whether I'm prepared, or whether that's a good thing to have?
Two things. Are you doing that? And whether yes or no, is it a good thing? Is it something you think people should do?
Okay, I'm not doing that, not intentionally. You know, by the way, what's better than coffee in the morning?
What?
Deleting a production database. It will keep you awake for long hours.
Yeah.
So no, I'm not doing that intentionally. Yes, I do think it's a good idea.
No, I don't think it's a good idea as a general practice, because at the end of the day, we're all monkeys. If that's the only thing I'm going to test, at some point it becomes some kind of a KPI. Me or the other engineers on the team would start aligning ourselves around this one database, or the few RDS instances that I'm going to throw out of the window on the dev environment. So they'll get focused on that. But what happens if I kill Elastic, or kill a cache instance, or delete a table on DynamoDB? Am I ready for that? The answer is yes, I am ready, because I try to use managed services and I try to automate my backups. But no, I'm not sure it'll be as easy. You know what I'm saying? So I try to keep myself kind of distributed, and I try to manage everything in an automated manner so I can control that and be confident that it'll be all right.
Okay, moving on to production. And I'm talking about a real situation, so don't be gentle. Let's tell the truth, okay? It's hard, but we need to tell the truth. So let's say your production database, for some reason, I don't know, some security policy, the CI/CD process had administrator permissions, maybe a user with the right privileges was able to reach the database, and the database is gone. When I say gone, I mean the data is gone. They didn't actually go to AWS and delete the whole RDS, but it's gone. Or maybe they even did. And you still have a snapshot of that database. How do you restore it? You're now paying money: for each minute that you are down, you owe customers money. So how do you do it as fast as you can, if that happened?
Pretty sure manually. I don't have a good idea for that. I do have one thing to say about that, though.
If you need to restore a database, I don't see any way around having the database change its connection string, right? Because it's going to be a new endpoint.
Yeah.
And to mitigate that, you can do two things. One, obviously, you can set a DNS record, hopefully an internal one. If we're speaking about AWS, or wherever you're running, set an internal DNS record that points to the endpoint of your database. Things do change, even if the changes are planned, right? At some point you will probably replace that connection string. And if you want to eliminate work for developers in the process, they can target only the DNS record, and there are no issues, because what you're doing is under the hood: you're replacing the connection string, but you're updating the DNS record accordingly. So that's one thing.
So it's not that the DNS is updated automatically; you're saying let's have one point of update, right?
Exactly what you said. We're speaking about a manual change. Another option is obviously to set your DNS record to point, actually, I'm not sure, speaking about RDS, whether there's an option to keep it automated. Maybe there is. I'm not aware of it.
Automated for what?
You know, having an internal record that points to an RDS instance and changes accordingly if something happens.
I don't think so. I mean, I'm doing something like that for ECS, where something gets updated whenever the service changes. But that's with Lambda functions, you know.
That's right. And it can be connected to a target group. So anyway, let's go back to the point. One option is to have that. Yes, the changes are going to be manual. Another option, or rather another layer that you can set, is an RDS Proxy. It's not meant for that, but it's an option.
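The single point of update described above, an internal DNS record that applications use instead of the raw database endpoint, could look roughly like this. It is a sketch assuming a Route 53 private hosted zone and a client that behaves like boto3's Route 53 client; the record name, zone ID, and function names are hypothetical.

```python
from typing import Any, Dict


def cname_upsert(record_name: str, target: str, ttl: int = 60) -> Dict[str, Any]:
    """Build a Route 53 ChangeBatch that creates or overwrites one CNAME record."""
    return {
        "Comment": "Repoint database alias after a restore",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,  # a short TTL lets clients pick up the new endpoint quickly
                    "ResourceRecords": [{"Value": target}],
                },
            }
        ],
    }


def repoint(route53: Any, hosted_zone_id: str, record_name: str, new_endpoint: str) -> Any:
    """Point the stable internal name at the endpoint of the restored database."""
    return route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=cname_upsert(record_name, new_endpoint),
    )
```

Applications keep connecting to something like `db.internal.example.com`, so a restore changes one record rather than every connection string. The change is still triggered by a person, which matches the manual step discussed in the conversation.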
The proxy serves in this scenario as another layer, because developers will be pointed at that proxy, and you can replace the instance and put it under that proxy again, and the proxy has the same connection string. It has the same endpoint, and that communicates with the database. So you can replace the database under the hood and they keep connecting to the same thing. But a proxy is a resource, so it costs money to keep it alive. And I think the reason to have a proxy in the first place is to mitigate the number of connections being held against the database. So if you're working in a serverless-intensive environment and, because of that, you have a lot of connections opened against your databases, a proxy is one way to mitigate that.
So that's RDS Proxy?
It's a managed service by AWS. Search for RDS Proxy and you'll see the service. I was surprised to find it this year. I didn't know it was an option. I always knew there were different kinds of SQL proxies, even open source ones that you can set up yourself. But that would put me in the chain, and I would be the single point of failure. I didn't want to do that. But ever since I found out there's a service, I immediately started using it.
So you're using RDS Proxy right now?
Yes.
Oh, nice.
Yeah, but basically to mitigate the number of connections.
Okay, so I think we covered the what-happens-if. Let's summarize: if production goes down, I think I'm also inclined to say the same, that you restore stuff manually. It's not optimal, it's not the best solution, but if we're paying money because customers are screaming and saying "listen, fix it", we'll probably fix it manually as fast as we can, in most cases. And I wonder if people in the crowd are saying: but isn't that the whole purpose of DevOps, that everything is automated?
Why are you talking about doing stuff manually? How can you challenge what we just said? I think it's a conflict, because we've been in the industry for a few years. We know how to work. We do our stuff, right? Why are we saying that we are going to do it manually? What's wrong with us?
That's a great question. I'll give the short answer. To begin with, we have our own capacity. We can't do everything, and there are too many things to be automated. So yes, while you can and probably should automate everything, it's hard to get to everything. That's one part of the equation. The other part is an opinion, obviously; other people can have different opinions. I had a mentor once who told me that the one thing in the world he doesn't automate is DNS records, for the simple reason that they were the center of the organization. The way he built stuff was all connected to what I said earlier, having a connection string under an internal DNS record. That was the center of how he built infrastructure, and if that's failing, everything he built around it is failing. So the one thing that wasn't automated on his end was the DNS records. I tend to follow that to some degree. That's the short answer; let's keep it short.
Okay. I also want to know if you are like that too. It's a bit of steering off the backups, I'm just wondering, because we talked about automated and manual. I also try to avoid creating ACM certificates, you know, HTTPS certificates, with my infrastructure as code. I tend to do it manually in the AWS console. Do you also do that?
I have to do that. There's a good reason.
Yeah.
Yeah, and we're totally deviating from the subject here, but ACM, for the listener who isn't familiar, is the AWS Certificate Manager, I hope. Anyway, it's a manager for certificates that you can provision on AWS. And the reason is that when you create a certificate anywhere, an SSL certificate, you have to pass some kind of challenge in order for it to provision successfully. That can be a DNS record that you set; again, going back to DNS. It can be, I think, an email to the domain you control. You need to prove in some way that the domain is yours and that you can create a certificate. And for that reason, having it templated doesn't help you much, because you can start the process, but what happens in the gap? Sometimes it can take hours until it's approved.
Yeah.
Because of the challenge. And then I thought to myself, maybe I need to automate the challenge, and then I'm like, okay, I don't have time for this. So let's just do it manually. It's a one-time thing; I'm not going to do it again.
Okay. So we touched on backup. Yeah, go ahead.
So I have another thing to say about backups, which I think intuitively comes up for listeners, and I want to know what you think about it.
Sure.
We touched on the standard things. We touched on data, we touched on infrastructure as code. And I think the next thing that comes to mind is: what happens if I have EC2 instances, right? So the first thing that I think of is creating AMIs, maybe periodically, maybe not. I'm not happy to say this, but if you have a Jenkins server, you're probably backing it up that way.
You're saying AMIs. Why not create a snapshot? Why are you going to an AMI? Why not a snapshot?
I'm making another point. Let's speak about that; let's put a pin in it. Okay. So where I'm going with this is, yes, you're right.
You can take a snapshot. You mean a volume snapshot? That's what you mean?
Yeah.
Okay, sure. That's another option, for someone who doesn't understand. Again, let's put a pin in that; we'll touch on it in a second. What I was saying is: why don't you think about how to decouple the data completely from the instance? That's not always the case, not always an option. And I know I'm talking about my scenario, where most of my things are managed services. If you're running your own service, that's not as easy. But I think in a lot of scenarios, you can find yourself being able to decouple the data from the instance. That way you don't have to create an AMI. You don't have to create a snapshot of the volume, because the volume is what you call ephemeral. You can lose the data and it'll be all right, because it's only a cache mechanism, maybe, for something that's stored elsewhere. Again, that's the point. To your next point, yes, of course. The next step is to not create an entire AMI of the entire instance, because that would probably be a bit of a waste. The smarter thing to do is what you just said: create a snapshot of the volume and back up the data itself, because hopefully you keep your data stored on the attached EBS volume and the instance doesn't matter. And now to the worst scenario. If I'm not mistaken, when you run a Jenkins server, that's going to be an issue. Because yes, your data, which holds the pipelines and, I hope, the configurations too, will be held there. But I think if you create a snapshot only of the volume, and I'm not sure, it's been years since I've worked with it, it may become an issue, because the configuration of the server itself can get lost. I'm not sure about what I'm saying.
In Jenkins, can you save everything like, okay, so in Nexus, right? You can have S3-backed data. So you can save your data in S3, where it's backed up, instead of saving it on the volume.
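For the volume snapshot option discussed above (snapshotting the attached data volume rather than building an AMI of the whole instance), here is a minimal sketch. It assumes `instance` is one instance dictionary as returned by EC2 `describe_instances` and that `ec2` behaves like a boto3 EC2 client; the function names are hypothetical.

```python
from typing import Any, Dict, List


def data_volume_ids(instance: Dict[str, Any]) -> List[str]:
    """Volume IDs of all attached EBS volumes except the root device."""
    root = instance.get("RootDeviceName")
    return [
        mapping["Ebs"]["VolumeId"]
        for mapping in instance.get("BlockDeviceMappings", [])
        if "Ebs" in mapping and mapping.get("DeviceName") != root
    ]


def snapshot_data_volumes(ec2: Any, instance: Dict[str, Any], note: str) -> List[str]:
    """Start one EBS snapshot per data volume and return the snapshot IDs."""
    snapshot_ids = []
    for volume_id in data_volume_ids(instance):
        created = ec2.create_snapshot(VolumeId=volume_id, Description=note)
        snapshot_ids.append(created["SnapshotId"])
    return snapshot_ids
```

Skipping the root device reflects the idea in the conversation that the instance itself should be disposable and only the data volume matters. If the application keeps configuration on the root volume, as suspected for Jenkins above, this would miss it.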
So maybe it's the same in Jenkins. Maybe you can save everything backed up in S3, like in JFrog, like they have in Nexus.
Exactly. So JFrog Artifactory, Nexus, they're all built around having backups and having the data stored elsewhere. Whereas Jenkins is built around bad practices.
But you can also use, and I'm not sure if Jenkins needs very high read and write speed, right? I don't think so. But you can use AWS Elastic File System if you want to preserve the data, and you also have point-in-time recovery for that. So if everything gets lost, you can go back to a point in time and recover. So it's nice.
So that's a good point. Specifically on EFS, I'll say one sentence about it. I try to shy away from it. It has its benefits. It's elastic. It can grow and shrink. You can multi-attach it to many servers, if you like. But the performance is a bit of an issue. So it depends.
In my cases at least, it depends.
Exactly. In my scenarios, where I use EBS volumes, it's mostly for servers that actually need them and need to be performant, and EFS might not be the best solution in that case. So that's a bit of a tricky one.
But if you have money, and I hope you have money, you can have some dedicated throughput. So you can pay, but it costs a lot of money. I think it's something like 10 dollars for one megabyte per second or something. I'm not sure about the price; we need to check with AWS. The pricing is high, but you can have very high throughput if you want.
You're pushing me into a corner to say that if you don't have money, just call Zesty, because that's what we do. We'll help you with having your file system elastic. But that's beside the point, and yes, you're right. You can provision it.
So that's the sentence: if you don't have money, just call Zesty. Okay, okay. So we talked about EFS; let's tie it back to your Jenkins issue.
So generally speaking, I love what you said about the EC2 instances and decoupling stuff. So every time you hear "EC2 instance", you tend to go with: maybe save my data in S3, or maybe save it on EFS if I don't care about the performance. Also, S3 is not very performant either. It's only for backups, you know?
Right, totally. It's an object store, not where you want to keep your data as a hot source.
Okay, so we're 25 minutes in, and I think we could keep talking about backups for every AWS service ever. We didn't even start touching on database clusters and how you handle Elastic.
Yes, wow, okay.
So I say we need to move to the corner. Are you ready for the corner?
Totally, yes.
You know, it's funny, because after 10 episodes we just decided that we have the corner. So, corner of the week.
I think you also made up a term. I don't think that's an actual English word, but yes.
But yeah, that's the thing. It's the corner of the week. So corner of the week, choo, choo, choo, choo, choo, choo, choo.
Yeah, sound effects, okay.
So now that we're in the corner of the week, Omer and I will share our experience this week. So Omer, would you like to share an experience or technology or anything else that you experienced this past week?
I'd be very happy to share.
Yes, go ahead and do that. Can I share too?
You can share too. Okay. Do you know VS Code's ability to create a remote environment, where you can use a headless VS Code and run it in a remote container? Do you know that ability?
I know there is a server, the VS Code server. You're referring to that, the open source VS Code server?
I'm not sure that's what I'm referring to, but VS Code has a plugin where you can connect your IDE from your localhost to a remote environment and work on top of the server, while having the IDE experience locally. That's what I'm referring to.
Okay, so no. Okay, talk about it. Share, share, share.
In any case, that's a plugin. VS Code has that functionality. I discovered it through someone who actually contacted me on LinkedIn, and they have an open source project that does the same for Neovim. And if you know me, you know I like Neovim, so Neovim can now provide the same functionality. Neovim, by the way, can work in a headless mode in any case; that's just a functionality. But you can install that plugin and work remotely on a server. So if you have some kind of remote server on AWS or any other cloud platform, and for some reason you need to run very heavy processes and you want to run them there, but still have your local working environment, that's something you can do. That's the one thing.
You got me confused a little bit, because you started talking about VS Code, and I was like, what's Omer doing? And then you moved to Neovim, and I was like, oh, okay, got it.
Just for reference, of course, yes. I don't have it installed on my machine.
Yeah, I was like, what is he talking about? What is it? Okay, yeah.
No, no, my computer doesn't know VS Code. Okay, so that was the first thing. The other thing is a really cool project that I've just seen today. I'm not yet sure how to utilize it, but there is a project called Testcontainers, and you can find it in many languages: Java, Go, Python, Ruby; they have one for everything. It's something you can integrate with your unit tests. In your unit test, you can import the library and run a container. You can tell it what kind of image you want, what kind of dependencies, and it will run your code as part of the unit tests inside a container, if that makes sense. It's just a shared library that you can use for testing inside a container, and it's really cool.
Oh, cool, okay. So you'll put that in the description, right?
Definitely.
Okay, so are you ready for mine? Because mine is a bit of a challenge.
Go ahead.
It's a challenge for you and the viewers, because I don't think anyone has ever experienced that. And if they did, they didn't talk about it, okay?
Okay.
So let's say you have an environment, and this happened in a side project, okay? A WordPress website, which is hosted on AWS ECS, with a database hosted on RDS, okay? AWS RDS, all right? And the data is saved on EFS. Bear with me on this one: it's ECS with EFS.
Is that possible? Can you do that?
Mm-hmm, yeah.
Can you scale it?
Yeah, it works great. I mean, the website is WordPress, you know? So even if you have a very high load of traffic, you put a content delivery network, a CDN, in front of it.
Okay.
It takes all the traffic, the static files and whatever.
That's for sure, yeah.
So that's the infrastructure. Now, are you ready for the challenge? Do you know what happens if you don't pay AWS? Let's say for, I don't know, one month or two months, like if your payment method expired, and that's it. So you didn't pay AWS for one month. Do you know what happens then?
Nothing?
So I can tell you. That's the challenge. Now you think, okay, first of all, why would that happen? But it might, because maybe somebody stole your credit card and you forgot to update it.
Yeah, and you didn't notice the emails, whatever.
So apparently, when you go to your website, you find out the website is down. So you go to your AWS account, and you cannot really log in; you have to log in as root, right? Because you didn't pay. And then after you fix the payment method, you realize that your ECS service is not working, right?
Because you're not paying AWS, they won't provision resources for you.
Right.
So what do you do? Your ECS service is set to a desired count of one, okay? One desired task, but it doesn't work because AWS took it down. So you'd probably say, okay, you can just set it to zero, wait, and then set it back to one, and that will reset it. No, amigo, no. You have to delete the service and then recreate it, okay? Another thing, the database. You would think that they just stop your database and that's it, you just need to start it again, right? Because you can stop RDS for like two weeks or something like that, I'm not sure. But apparently they just delete your database. So you have to restore the database from one of the automated snapshots.
Okay, isn't that a perfect ending to an episode that speaks about backups?
Yeah.
Look at your payment method, definitely.
Yeah, so apparently, if you forget to update your payment method, it's a big issue. Because if you're in a production environment, think about the time that it takes to realize all this, okay? It takes time to realize you need to delete the service and recreate it, and that you need to restore the database. And also, you talked about internal DNS records. If I restore the database and create a new one, my RDS endpoint has changed, right? So I need to go back to ECS, create a new task definition, and change the environment variable for the RDS endpoint. So it takes time.
So use that DNS entry, man.
Yeah. So it takes me back to what we talked about in the beginning: an internal DNS entry for RDS might be a good thing to have, especially if you don't pay.
So I think you can write a book of your edge cases with AWS.
Yeah, I don't think anyone has ever experienced or will ever experience that, but just in case you wondered. It's like a unicorn, you know? It can happen, but you never see it.
Yeah. Nice. So, for a summary of an episode that speaks about backups: back up your shit.
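The last recovery step in this story, registering a new task definition whose environment variable points at the restored database, could be sketched like this. It assumes `ecs` behaves like a boto3 ECS client (`describe_task_definition`, `register_task_definition`, `update_service`); the variable name, cluster, and service names are hypothetical, and the sketch sets the variable on every container in the task for simplicity.

```python
import copy
from typing import Any, Dict, List

# Task-level settings copied to the new revision when the current one has them.
CARRIED_OVER = ("taskRoleArn", "executionRoleArn", "networkMode", "volumes",
                "requiresCompatibilities", "cpu", "memory")


def with_env(containers: List[Dict[str, Any]], name: str, value: str) -> List[Dict[str, Any]]:
    """Return a copy of the container definitions with one environment variable set."""
    updated = copy.deepcopy(containers)
    for container in updated:
        env = container.setdefault("environment", [])
        for entry in env:
            if entry["name"] == name:
                entry["value"] = value
                break
        else:  # variable not present yet, so add it
            env.append({"name": name, "value": value})
    return updated


def roll_endpoint(ecs: Any, cluster: str, service: str, family: str,
                  env_name: str, new_endpoint: str) -> str:
    """Register a revision with the new endpoint and point the service at it."""
    current = ecs.describe_task_definition(taskDefinition=family)["taskDefinition"]
    extra = {key: current[key] for key in CARRIED_OVER if key in current}
    registered = ecs.register_task_definition(
        family=current["family"],
        containerDefinitions=with_env(current["containerDefinitions"], env_name, new_endpoint),
        **extra,
    )
    new_arn = registered["taskDefinition"]["taskDefinitionArn"]
    ecs.update_service(cluster=cluster, service=service, taskDefinition=new_arn)
    return new_arn
```

With an internal DNS name in the environment variable instead of the raw RDS endpoint, this step would not be needed after a restore, which is the point both speakers come back to.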
Yeah, that's a slogan. Okay, so thank you, everyone. Thank you for listening, for coming, for attending, for viewing, whatever you do.
From all over the world, by the way. From London and from Spain this time, for a change.
Yeah, hola from Spain. So, let's talk again next week.
Yep, see you next week. Bye bye for now.