DevOps Topeaks

#27 - Chaos Engineering

July 21, 2023 Omer & Meir Season 1 Episode 27

This week we discussed Chaos Engineering: where it came from, what it means, and how you can incorporate it into daily work and plans.

Links

  • https://github.com/vmware-tanzu/velero
  • https://principlesofchaos.org/
  • https://aws.amazon.com/apprunner/
  • https://aws.amazon.com/blogs/machine-learning/llama-2-foundation-models-from-meta-are-now-available-in-amazon-sagemaker-jumpstart/

Meir's blog: https://meirg.co.il
Omer's blog: https://omerxx.com
Telegram channel: https://t.me/espressops

Transcript

Ready? Omer's ready. And... action! Hello everyone. Why do we always laugh when we start? It's like we're embarrassed or something. Because it's this awkward dance of clapping and waiting for things to happen, and I never know how to behave, so every time we just laugh. So, hello everyone. That's our intro, man: instead of playing music we're just awkward for ten seconds, and that's how people get into things. The awkwardness in the intro, that's our thing. We made it a thing instead of being embarrassed about it. It's good that we talked about it; I feel more comfortable now. Yeah, that was a nice therapy session.

So, welcome to DevOps Topeaks. Well, we don't talk about therapy, we talk about DevOps topics. And today's episode is 27. The dangerous number: lots of famous people died at the age of 27. I hope we make it past it. We will. I don't think we'll die, I just hope the podcast gets to episode 28. We will; let's record it right after this one, so we break the evil curse. Agreed.

Ready? Yes. In this episode, we are here for the chaos. We're going to talk about DevOps and chaos, how they're related, and what we do about chaos. So, Omer, when I say DevOps and chaos, what's the first thing that comes to your mind?

Okay, so we're talking about chaos engineering, right? It's a concept that was built over the last decade or two. At first there wasn't a name around it, kind of like DevOps: people were already doing something, and then someone put a name on it. I want to say Netflix, but I'm not 100% sure. I do know that Netflix took it to another level, when there was a leader there, a woman who helped drive the transition to the cloud. I think her name was Nora Jones, not the singer, the engineering manager. She wrote a book with a couple of other senior managers, at Netflix I think, and I believe it was called simply "Chaos Engineering". Let me just give the headline and then we can break it down. Basically, it's the concept of intentionally breaking stuff in your environment, and we can talk later about production or not production, but it's breaking stuff intentionally to make things more resilient, which sounds counterintuitive. The idea is to have developers and infrastructure engineers think about how they build things so that they're more resilient, because they know things can break, not only because of life, but because that's how the system is built: it's built to hurt itself, if you will. So that's the first thing that comes to mind. Quite a mouthful, but you asked.

Yeah. When I asked the question, what came to my mind was a bunch of engineers holding keyboards and sticks, smashing laptops and computers all around. That's what I saw when I said DevOps and chaos. Well, that's one way to do chaos engineering, and I think it would achieve the same goal; it would just cost a bit more. Sounds more fun, though. It is more fun, definitely.
So now we'll move to the first question, which as usual we invent in the moment; we always do things from now to now. So the first question is: how do you implement chaos engineering in your company, or in your position, maybe even company-wide? How do you try to do that, maybe as a chaos engineer?

Okay. To be fully honest, I'm not doing chaos engineering, air quotes, per se today, because doing chaos engineering properly means following the method. And the method involves something like this: you define what the steady state of the system is, meaning how things should look. For example, I have this many servers, I have this many live services, this many subnets, this many standbys that can take the traffic if something goes wrong. That's the steady state of the system. And now I want to break something, in what would be called the experiment group or something like that, and see how it can be broken without actually hurting the system as a whole. I hope that makes sense; I'm probably butchering the terms again, but this is the general idea: define a steady state, run experiments that break it, do your iterations, and rebuild things in a more resilient way.

For an easy example, take Kubernetes. What did Kubernetes change? In the old days we had machines and we just installed processes on them, and processes can break, right? Your application can throw an exception and die. Kubernetes wraps that process in a container, which is wrapped in a pod, and if the process dies, something notices the failing health check, the container dies, and another one comes up and starts running again. That's one form of making things more resilient.

Next step: when Netflix took up the concept of chaos engineering, they built a toolkit around it, I think they called it the Simian Army, and the famous tool in it was Chaos Monkey. It's a monkey running around breaking stuff. They started with just breaking servers, but I think they went on to breaking full subnets and VPCs, even taking down a whole region at some point. And it would actually run in production. In one of their talks, I think, they showed someone opening Netflix on a TV and getting a black screen with an error, and they said: you know that black screen? Any Netflix user knows it. That's us, that's Chaos Monkey, that's why you've seen it.

I remember taking Chaos Monkey and trying to implement it five or six years ago. I'm not even sure the project is still active, but I tried using it in staging. It was really cool, but really dangerous, because I wasn't fully aware of what I was doing. Still, it was a way to show engineers how to work. And the last thing I want to say: it's a good thing to think about even if you never run the tool.
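To make that steady-state loop concrete, here is a toy version of the pod experiment described above: a minimal sketch assuming the official kubernetes Python client, a kubeconfig that points at a staging cluster, and a hypothetical namespace and app label.

    import random
    import time

    from kubernetes import client, config

    # Assumes your current kubeconfig context points at a STAGING cluster.
    config.load_kube_config()
    v1 = client.CoreV1Api()

    NAMESPACE = "staging"      # hypothetical namespace
    LABEL = "app=demo-api"     # hypothetical app label

    def running_pods():
        pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL).items
        return [p for p in pods if p.status.phase == "Running"]

    # 1. Record the steady state: how many pods serve traffic right now.
    steady = len(running_pods())
    assert steady > 0, "no pods found; nothing to experiment on"

    # 2. Inject failure: kill one random pod.
    victim = random.choice(running_pods())
    v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)

    # 3. Hypothesis: the ReplicaSet reconciles back to the steady state.
    time.sleep(60)
    assert len(running_pods()) == steady, "did not return to steady state"

The point is not the tool but the loop: define the steady state, break one thing, verify the system returns to it.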
You don't have to actually implement a tool; you can write your own script, or even just manually delete stuff. But if engineers internalize the idea that something can get intentionally deleted, they'll think about how they build the application, DevOps engineers will think about how they build the infra, and everything gets built around that. So I think the key word here is resilience. You need to be prepared: if you want to do some chaos, you've got to do some initial work. You can't just say, okay, let's try Chaos Monkey and see what happens; you've got to prepare your environment for it. So maybe someone in the audience is saying, okay, I want to do this, how do you suggest approaching it? How can I prepare my environment for Chaos Monkey, or any chaos engineering? Any tips?

I don't know if I can give tips on something I'm not currently doing; the idea for this episode came from talking to you, and you told me you've started with it. But the process you mentioned is actually the right way to go about it. First of all, keep it wrapped in a very small bubble that you can control; there's probably a better word for it, compartmentalized in a way. Isolated? Isolated, contained, yeah, perfect. So keep it contained, both the environment and the tool itself. I wouldn't go running a full-blown Chaos Monkey if I'm not sure what it's going to delete and kill. Start with a small script that randomly deletes an EC2 instance, contained in a dev or staging environment, and make sure it runs only on certain tags that have nothing to do with anything super critical. Even staging can hold critical stuff; you probably don't want to delete an entire database with its backups. Maybe it's better to start with some supporting application, or, if you're on Kubernetes, just delete a pod. Deleting a pod should normally be easy: you delete it, it comes back up. But it can still send shock waves across the system; if that pod is serving other pods and those pods can't reach it, you want to see how the system behaves while it's gone.

So I'd start from there, contained, running only on certain labels or tags, in one environment, and from there expand to other services, other components, maybe other environments. You can go the entire way like Netflix did, up to subnets and VPCs. By the way, with subnets and VPCs you don't have to delete the entire thing. Right, you can't even delete a VPC while it still has resources in it. Exactly. But you can break stuff: detach a route table, delete a VPC peering, try removing an internet gateway attachment or a NAT gateway. Lots of things you can do to test the system. Something like the small script below is enough to get started.
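A chaos script in that contained spirit might look like this rough sketch, assuming boto3, credentials scoped to a staging account, and a hypothetical opt-in tag (chaos=allowed) so nothing critical is ever eligible:

    import random

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")  # staging credentials assumed

    # Only instances that explicitly opted in via tags are eligible victims.
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos", "Values": ["allowed"]},   # hypothetical opt-in tag
            {"Name": "tag:env", "Values": ["staging"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

    if instances:
        victim = random.choice(instances)
        print(f"Chaos monkey says goodbye to {victim}")
        ec2.terminate_instances(InstanceIds=[victim])
    else:
        print("No opted-in instances found; nothing to break today.")

Run it on a schedule, and widen the tag filter only as confidence grows.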
Now that we've talked about chaos, the thing that comes naturally to my mind is disaster recovery, because chaos causes disasters. And as DevOps engineers we want a disaster recovery plan, something that lets us recover from whatever disaster the chaos created, as fast as possible and with the least impact.

So now I'm trying to think about things we can offer the audience, practices you can take from this talk and implement in your organization, easy ones, and maybe you also do this where you work. I'll start, and we'll do a ping-pong like we do from time to time: I'll say one, you'll say one, it'll be fun. And once I finish, I'd be really happy to hear your take, because you mentioned, I don't know if it was last episode or two episodes ago, how you implemented something like this on databases in staging.

Yeah, actually, we did something like that a few weeks ago where I work. We wanted to make sure our database could be fully recovered. Of course we didn't test it in production; we did it on staging. It's an RDS database in one of our systems, you know, Amazon's managed Postgres. And I just started doing things manually. Remember we also talked about disaster recovery, how we always want it to be automated, but eventually, when something breaks, it's: okay, let's just do it. So I did it the way I thought I would if the database were actually gone.

I think the best thing is to start with a quick victory. I knew I had a snapshot of the database, so I could report back to everyone: okay, I'm going to restore to this point in time. I restored the whole cluster, and following that I got a new endpoint; I took that RDS endpoint and updated the application side. And this way I didn't even have to delete the old database; I just acted as if it didn't exist. So you don't really have to delete resources, you don't have to start destroying stuff, because the destruction isn't the part you care about; you care about whether you can recover. Sometimes just restoring from an existing resource shows you whether the way you think it's going to work is how it actually works.

One more thing to mention. I said restore, as in from a deleted resource, but really I restored from an existing database. What if the managed service keeps the snapshots as part of the cluster itself, tied to the database, so that if you delete the database all the snapshots are gone with it? There are services like that: Elasticsearch, or OpenSearch. If you delete the whole cluster, unless they've changed it recently, I don't know, the automated snapshots go with it; you've got to take manual snapshots stored outside the cluster, otherwise when the cluster is gone, so are they. So when you do those restoration drills, make sure you're doing it the way it will actually happen when the time comes and your database really is deleted, because what works for RDS won't work for OpenSearch; there you need another solution, and we can talk about that another time. The restore flow I went through looks roughly like the sketch below.
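The drill described above maps roughly to this sketch, assuming an RDS Postgres instance and boto3; the instance identifiers are hypothetical:

    from datetime import datetime, timedelta, timezone

    import boto3

    rds = boto3.client("rds", region_name="eu-west-1")

    # Restore to a brand-new instance from a point in time, here five minutes
    # ago, acting as if the original database no longer exists.
    rds.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier="staging-postgres",        # hypothetical source
        TargetDBInstanceIdentifier="staging-postgres-drill",  # hypothetical target
        RestoreTime=datetime.now(timezone.utc) - timedelta(minutes=5),
    )

    # Wait for the restored instance, then grab its new endpoint for the app.
    rds.get_waiter("db_instance_available").wait(
        DBInstanceIdentifier="staging-postgres-drill"
    )
    desc = rds.describe_db_instances(DBInstanceIdentifier="staging-postgres-drill")
    print("Point the application at:", desc["DBInstances"][0]["Endpoint"]["Address"])

Timing how long the waiter blocks is, incidentally, exactly the recovery KPI discussed later in the episode.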
Now you: any good practice, or any experience you've had with chaos and recovering from it?

So again, it's hard for me to say, because I'm not actually doing that today, but tapping into what you said, which is very important: it doesn't necessarily matter what you delete, or how you delete it, or what exactly goes wrong, because whatever you do will, I'm using the same term again, send shock waves. Yes, you as an ops engineer think: okay, the database is dead, let's go debug what happened. Maybe I can restore from a snapshot, maybe it's just DNS, maybe it's something about connectivity in the VPC. But on the other side, the engineers check their logs and see how their application behaves when something is inaccessible.

For example, we once had an application that, every time the Redis cache instance wasn't available, would crash so badly and take so long to recover that the team understood they might need some kind of retry mechanism around it (a minimal version is sketched after this story). And we actually found that in an interesting way. We were running tests during CI against a Redis instance; in many CI systems you can use the notion of services, basically a sidecar container that runs alongside the tests, because often you need Redis to support your tests: you want to check that the functions that save to and read from the cache work as expected. Sometimes the Redis container took a couple of seconds longer to start supporting the application, the application would immediately crash, and the tests would fail. And then they understood: this applies to the application itself, not only to the tests; maybe if I apply some kind of retry mechanism... That actually ended up saving us, literally saving us, in production.

So to touch on what you said, it has much more value than what you alone think about: everyone across the company, the whole R&D, sees the shock waves. And going back to the microservices on Kubernetes that I mentioned earlier, it's one thing when a pod dies, it's another when the entire service is unavailable, not a single pod. And again, you don't have to delete the Deployment or the Helm chart; you can just destroy the Service, literally the Service object in Kubernetes, make things inaccessible, and watch how other applications behave. Because sometimes, especially new ones that you haven't tested enough, they rely so much on having everything around them available, and when one critical service disappears they crash badly enough to hurt the entire system. You can learn a lot from that. That's just to add on the concept; in terms of tips, I don't think I have more to add.
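The retry mechanism mentioned above could be as small as this sketch, assuming the app uses the redis-py client; the host name and tunables are illustrative:

    import time

    import redis

    client = redis.Redis(host="cache.staging.internal", port=6379)  # hypothetical host

    def get_with_retry(key, attempts=5, backoff=0.5):
        """Read from the cache, backing off instead of crashing while Redis is down."""
        for attempt in range(1, attempts + 1):
            try:
                return client.get(key)
            except redis.exceptions.ConnectionError:
                if attempt == attempts:
                    raise  # out of attempts, let the caller decide
                time.sleep(backoff * attempt)  # grow the pause between retries

    value = get_with_retry("session:1234")  # a slow cache beats a crashed service

A few seconds of patience like this is what turned the flaky CI sidecar from a test nuisance into a production fix.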
Okay, so I'll take another one, and again I want to talk about recovery: change management. When something happens, that's a disaster, and when it happens you want a plan, a disaster recovery plan, as we usually call it. Change management is one of the most important parts of it, and when I say change management I mean communication, and also, as you like to say about security and the attack vector, knowing what exactly you want to test in the disaster, or rather in the recovery.

So: when I destroy a database, what do I want to check? How long does it take the on-call team, our actual engineers, to respond to this type of drill and say, listen, the database is down? How long does it take them to organize? Or do I want to check how long the database takes to recover? Because then you're actually testing AWS, and a huge database can take longer to recover simply because there's more data. So when you run a disaster recovery drill, think about what you want to check. And you also need to communicate it, because if you just destroy stuff, people will go: okay, it's not working, my work is stuck, and they won't respond the way you want. So when you do these things, tell people: listen, I want to test you, I want to see whether you notice the change, and so on. So, communication, and... I forgot the second one, because I see you smiling and buzzing. What do you want to say?

You made me think of so many things. To begin with, you reminded me of the same consultancy again: every time I was doing something, I was asked, what's the KPI? What are you measuring? Because if you're just doing stuff, and I think that's a DevOps problem, let's call it that, we're doing stuff every day, maybe to support disaster recovery, maybe to make deployments faster, maybe to make people happy, but that's not really measurable, right? Other than the fact that you're not getting fired, maybe. So I was always asked what the KPI is, and you just gave me one: how long does it take to recover. That's a very good KPI to measure. On one hand, it's simply good to know. It takes 60 minutes, or 120 minutes, or four hours, or a day, a rough estimate, so you have the ballpark. That's good.

But here's what I was thinking: I'm signed on literal documents that oblige me to meet some SLA for certain things, because that's what happens when you go through ISO and SOC and other certifications. I'm signed on a document that says I have, I think, four hours to recover critical database instances. And we did run a test, but that's it, we ran one test. How do you know it still holds as things keep moving: new applications, new infra, new stuff wrapping it? How do you know you're still compliant? One of the ways is running exactly what you said, and measuring it. And a drill doesn't have to mean the database is dead; it can mean a DNS entry was changed, not that it should happen, but something went wrong.

By the way, I think we were still working together when there were a few very major DNS crashes over the last couple of years. One of them was AWS, another Google, and another I don't remember; a huge DNS provider on the internet just died for a few hours. And when a major DNS provider dies, regardless of the fact that everything underneath is still alive, the servers are there, everything is still running, if you can't make DNS queries there's no way to reach anything. You're stuck. It happened to Facebook too; that one was contained to the Facebook network, but Facebook and Instagram and everything in there was unavailable.
So again, KPIs, plus just touching the conceptual side, like we said about chaos engineering. If developers build their applications while thinking about how they're deployed, they'll also think about the fact that things can go wrong, because they'll be used to it: Meir is killing their stuff every day and testing them, right? And if you do it long enough, and obviously in a safe, controlled manner, they'll have to think about it. It stays in the back of their mind: I'm writing an application, but it should also handle the extreme situations that don't happen often; Meir destroys so much stuff, maybe I should be careful. So that's one thing, and the other is KPIs. That's a good conceptual thing, mainly for the ops side I think, but also for developers: if you're doing something, how are you measuring it? How do you measure success, say a year from now, across all the projects you're doing?

In ops, by the way, I won't say it's easy, but usually we can say: it costs less money, it took less time to recover, it took less effort to recover, and the same applies to test and deployment times, time to market, all of those. Exactly. So the one thing I want to say is: think about what you're measuring. If you're going through the entire set of containers in your company, and you're not only removing vulnerabilities but also making sure nothing runs on Ubuntu and everything runs on Alpine, for example, reducing image sizes and yada yada because you think it's better, ask: what is better? Does it mean deployments run faster, and if so, instead of five minutes they take four? Does that matter? Maybe it does, but does it have enough effect to make a difference? Just think of the KPI and what you're measuring. That's it.

Now I have another hat for chaos. Maybe it's a different monkey, the crazy monkey, okay? I want to talk about a different use case. So far, just to recap, we've covered the Simian Army side: killing servers, killing databases, killing applications. Now I want to talk about chaos security. Chaos security meaning: telling all the developers, listen, the API key you used for this application, whatever it is, was leaked. We revoked it. Now what? So I'm wondering, do you also consider that chaos engineering, pretending to have a security leak? And when I say a secret, it can be a password, an API key, anything confidential, whatever.

Most definitely, yes. And not only secrets and parameters: the whole set of configurations that would force you to work better. By the way, that's a perfect KPI too, because if I have a measurable KPI, for example, you have 60 minutes to recover from a leaked database key, I will work my way toward that KPI. Because let's take, for example, a situation where you have a set of 1000 functions running in production, each of them with its own configuration holding that database key. Probably not all of them are reaching out to production. That's the first step.
Do you even know who's reaching out to the database in production and who isn't? What set of functions are you going to change? Do you have a central way to change the key? If you do, does that mean you need to deploy everything at once? I'm building it up, but say 500 of the 1000 functions reach out to your production database and you do have a way to change the key centrally: does that mean you now have to run 500 CI/CD pipelines? Your system would die. Maybe not, but do you have a way to do that? So that's a great example, yes, exactly, part of chaos. So we can consider it not only chaos ops but also chaos security, right? Definitely.

Under the hat, or the umbrella, of security, I also want to shove in another thing. Let's say there's a leak of a file, a picture, an image, something you didn't want to expose. I'll give an example, and maybe also a solution, and I want you to help me with that solution. And of course, consult professionals, because we're not super security professionals; we know a bit, but I wouldn't present myself as a pen-testing engineer, right? So let's say I have an S3 bucket, and my application sits behind CloudFront, the AWS CDN. And I have a file that I uploaded to the S3 bucket, and sadly this file wasn't supposed to be uploaded, because it contains a photo or something I didn't want exposed. But what can I do? CloudFront already fetched the file, so it's out there on the internet, and now search engines and everyone online can get to it. So the first thing, I think we'd both say, is: delete the file from S3. I'm talking about mitigation here. But before that, do you consider what I've just described chaos?

Definitely, it's the same point. It's a leak, just not of a secret; it's, how do you say it, proprietary? Proprietary, yeah, it's like losing proprietary material. Look, I think you can put anything under the umbrella of chaos: anything unexpected that requires engineers to step into action and do something they were probably unprepared for. So let's do it now: let's pretend we have a file in S3 that we want to make disappear from the internet.

So I'd say: delete the file, and flush the cache in the CDN. I also have another idea, but maybe you have one too. Actually, the very, very first step would be to rotate it, if it were a key. But here it's a file, an image, there's nothing to rotate, and it shouldn't be out there, right? Right, it's sensitive for some reason. An image is quite an extreme case, because there's no real way to unpublish it; if it's out there, it's out there. But usually the things that get leaked in our line of work aren't images, they're secrets. But let's say it's an image of a patient. Ooh, HIPAA. Well, that's a major one, and say you want to make it disappear from the internet, because it can happen. Let's just roll with it. Okay, let's play with it. So: you delete the object, and you invalidate the CDN cache, something like the sketch below.
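Those two steps, in boto3 terms, might look like this sketch; the bucket, key, and distribution ID are all hypothetical:

    import time

    import boto3

    s3 = boto3.client("s3")
    cloudfront = boto3.client("cloudfront")

    # 1. Delete the leaked object at the origin.
    s3.delete_object(Bucket="my-app-assets", Key="images/leaked.png")

    # 2. Invalidate the cached copy so CloudFront stops serving it.
    cloudfront.create_invalidation(
        DistributionId="E1234567890ABC",  # hypothetical distribution ID
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": ["/images/leaked.png"]},
            "CallerReference": str(time.time()),  # must be unique per request
        },
    )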
What else do you think we can do beyond that? Do you know robots.txt? I'm thinking about Google: I wouldn't want any robots crawling my website for it, so maybe I'd add the path of the image in my S3 to robots.txt, so Google stops indexing and caching it in the search engine. But other than things like that, whatever is out there is out there.

For example, we had a major leak once. I can talk about it now; it's not ongoing and it was a long while ago. By the way, this is another thing to scan for: a large set of keys was leaked into production. Literally, if you went to the front end, opened developer tools, and read the JavaScript, you could find production keys for authentication. And you'd be amazed how many funny things you can find in JavaScript, in production, in front-end code. It doesn't even have to be actual leaks: you'll find developers leaving comments and functions that probably shouldn't be there, you can find database endpoints, things that aren't exactly secret but that you shouldn't be able to see when you open developer tools. So that's just another thing to scan for.

To your example: I don't want to be part of a team that encounters that, it's not fun, but it's something to think about. And you know what, I have another thing to add on top. Say it happened; shit happens. We talked about the very first step: rotate it, if you can. The very last step is going back and understanding what process led to it. In your example, an image was leaked: what was the process leading to that? Was it manual work by an engineer? Maybe he shouldn't have access to S3 at all. If he does, did he literally put the image there by hand, or was there a pipeline with permissions to take it from somewhere? And if it was a pipeline, should the pipeline have permission to do that? Should a test run after it? Maybe the .gitignore wasn't updated to exclude SVG files, whatever. Exactly, it's all part of the concept. Just go back, understand the process that led to it, and fix it.

We tend to call it a postmortem, and in many cases people simply don't do it; there's no postmortem. Maybe outside the enterprises, in startups: shit happens, you fix it, you move on. If you don't learn from your mistakes... it sounds very pretentious to say, but try to learn from your mistakes. I know I do, but nobody's perfect.

So I'll give a tip for the viewers, for anyone listening: we all have to be confident in our systems. And if you're not confident that your database or application or a secret could crash or be stolen and you'd recover, maybe you should do something about it. Think about the things that keep you up at night, like: okay, if someone steals my password, what then? You can do this privately too, by the way, it doesn't have to be about work. Say you use 1Password and everything is managed in 1Password: how do I recover my master password? Is it even possible?
Because then you assess the risks; this is how you do a risk assessment. You say: okay, this one I can't handle, I can never let it happen, you know? And there are cases like that: if I lose my master key and password for 1Password, that's it, I'm gone. So maybe I need a hard copy of it. And then you start inventing solutions to your own problem, because it gets complicated.

I totally agree, and my analogy is Bitcoin, because with Bitcoin, at the end of the day, you want to secure a key, just a string that you want to keep safe. There are lots of ways to do that. You can write it with a pen on paper; obviously you want to keep it offline, so it's not even connected to the internet. You can use a hardware device, a small USB. Yeah, exactly, a cold wallet. But what happens if that goes away? So you want to store backup keys. Where do you store them, on paper? And if there's a flood or a fire at your house, is everything gone? So you back it up again. Where do you put it, in a safe at a bank? Doesn't that cost you? It can be endless, so you cut it off somewhere, but yes, do take risk measurements, to the best you can. It's an endless loop. Okay.

I also wanted to talk about Kubernetes chaos, but not just deleting pods. I actually used a tool a few weeks ago and I really liked it. Do you know Velero? Sounds familiar, but no, I don't think so. So Velero is a backup tool for Kubernetes where you can schedule backups. I'm bringing it up because we're all in the ops world here and I want to be practical. So let's say a namespace was deleted in Kubernetes; what do you do? I mean, that's crazy, think about it: someone with administrative access, maybe through CI/CD or whatever, somehow deleted the namespace, which includes so many things. How do you recover from that?

So apparently there's this tool called Velero, whose name reminds me of something else entirely, even though there's no connection. Velero ships as a Helm chart, right, and also a CLI: you install the operator with the Helm chart, and then you can tell it, once a day, take a backup of this namespace, or of everything with this label or whatever. It creates backups of your cluster, and you can just restore them and voila, everything is up. The fun thing to do with it, I think, is to practice recovering a namespace. Don't delete a real one; maybe create a throwaway Kubernetes cluster, even locally, and try it: see what happens if a Deployment is deleted, or a namespace, or maybe a load balancer, and then try to recover from that. Using Velero for this is great; I used it, it was great. So that's another tip for a disaster recovery plan: if you're using Kubernetes, Velero is definitely a tool you should try, open source and fit for the task. Lots of stars, active, it looks good. Perfect. A rough sketch of such a drill follows.
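A namespace drill with Velero could be scripted like this small sketch, assuming the velero CLI is installed and its server side is already deployed in the cluster; the names and schedule are examples:

    import subprocess

    def velero(*args):
        """Run a velero CLI command, failing loudly if it errors."""
        subprocess.run(["velero", *args], check=True)

    # Recurring protection: back up the namespace daily at 01:00 (cron syntax).
    velero("schedule", "create", "staging-daily",
           "--schedule", "0 1 * * *",
           "--include-namespaces", "staging")

    # Drill: take an on-demand backup, simulate the disaster, then restore.
    velero("backup", "create", "staging-drill", "--include-namespaces", "staging")
    # ... kubectl delete namespace staging  (the simulated disaster) ...
    velero("restore", "create", "--from-backup", "staging-drill")

Running this against a throwaway local cluster, as suggested above, keeps the blast radius at zero.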
So, okay, I think I've moved to the corner without actually moving to the corner. Exactly, that's what I'm saying. So let's move to the corner: the corner of the week.

Okay, so this week I already told you what I've been through, but I also want to add that I worked with Conan for iOS again. Remember Conan, the package manager for C++? How could I forget. So I finally migrated all of the platforms to working with Conan, and when I say all platforms, that means Linux, macOS, Windows, iOS, Android, and WebAssembly. All of those are built from C++, so yeah, it's a big thing, a major milestone. I love that word. Every time someone says milestone, my mind goes to the cornerstone, like from Harry Potter, even though it's not related at all. You mean the Philosopher's Stone; I don't think there's a cornerstone in Harry Potter. The Philosopher's Stone, right, okay. Anyway, an impressive road you took there. So Omer, what have you done this past week? Any challenge, any experience, any tools you want to share?

With me it's always tools and announcements. And I think this week is cool, because last week we mentioned two things. First of all, I asked you whether you think AWS is going to get into the AI game, remember? And then, maybe I'm a prophet, a few days later Meta, like Facebook, announced they're coming out with Llama 2, probably open source, or at least openly available: their own trained model, kind of a competitor to ChatGPT. And they're doing it in a strange partnership with both Microsoft and AWS. Microsoft has a large footprint in it, which is weird on its own, because they're one of the biggest investors in OpenAI and it's integrated into their own search engine; it feels like Microsoft is everywhere nowadays. And AWS was part of the announcement, saying they'll offer it through their platform. And I was really happy, because I thought: nice, an AI API through AWS is everything I wanted, because using OpenAI is not so easy. But they offer it through SageMaker. So you can start SageMaker, another niche service, with a runner, basically an instance that holds the model, and then you can talk to the model. But it's enclosed in your own VPC; it's not something you can consume as a service. Hopefully in the future, but not right now. So that was the first thing: a finding, plus a disappointment. Before you continue: I thought you'd say SkyNet is everywhere, and then, sorry, Microsoft. It did feel like that. Yeah, go on, sorry.

The second thing: last week we were talking about niche services, and one of the topics was how AWS offers a set of tools for, let's say, a single developer just starting out. You don't know a lot about AWS; how do you deploy your application? It's hard, right? So your two probably-best options were either Beanstalk, which is a little more scalable, more featureful maybe, or Lightsail. Those were your options. And then a friend who was listening to us told me: do you know about App Runner? And I said: App what? Apparently I'd heard of it, but I had no idea.
Just the name, maybe, or maybe because it's related to Amplify, I don't know. Okay, so App Runner. I've mentioned Fly.io a couple of times, which is basically like Heroku: essentially a very easy way for you, as a single developer or a small team, to take an application and just tell the platform: run it. I don't want to handle load balancers and containers and wiring and all that stuff, just run it, do something with it. And that's what App Runner is; it does that. Obviously it's not free; it costs a little, but very little. Honestly, if you're a single developer trying to deploy something, I would still not go with AWS, sorry; I'd go with Fly.io or something like that, because I think it just offers a better service, and they have a free tier. But I was really happy to hear about it, because it says a few things. First of all, AWS is listening, and they're starting to approach a different crowd, a different audience: single developers and very small teams. And that makes me happy, because it feels like they'll try, I don't know, maybe go the extra mile with every service and do the work we were always complaining they don't do. The usual AWS way is doing 80% and leaving the rest for ops people and juniors to finish; maybe that was our job security until today, but I'm really happy to hear about it. So that's the second thing. So the second thing is basically a friend telling you about App Runner. Exactly, yes. Okay, I just wanted to make sure I didn't miss anything. Any other thing you want to share before we wrap up?

No, I think that's it. That tool you mentioned, Velero, it looks very, very good, but it's VMware's, no? Does it belong to VMware? It says VMware; maybe they sponsor it. It's under vmware-tanzu, which sounds Chinese: cloud-native open source from VMware. Yeah. And by the way, I think some company bought VMware two weeks ago or so, I'm not sure who, maybe Nvidia, who knows. You can check before we go offline. Five, four, three... Broadcom. Broadcom, yeah. I remember, it was one of the monster deals, like 61 billion. Yeah. Nice. So thank you, everyone. Thank you, and see you next week. See you, bye bye!

Chapter Markers

Intro
Chaos Engineering
Practical Applications
What are you measuring
Security and Chaos
Tools of the week