DevOps&SRE Library – Telegram
A library of articles on DevOps and SRE.

Advertising: @ostinostin
Content: @mxssl

RKN (Roskomnadzor): https://www.gosuslugi.ru/snet/67704b536aa9672b963777b3
Implementing Scalable GitOps With Argo CD and ApplicationSets: A Case Study

https://aviadhaham.me/posts/implementing-gitops-with-argo-cd-and-applicationsets
Managing many Helm Charts with Kluctl

Learn how easy it is to manage multiple Helm Charts from one deployment project using Kluctl.


https://kluctl.io/blog/2023/02/28/managing-many-helm-charts-with-kluctl
Kubernetes on Proxmox

In this article we’ll create a two-node cluster with one control-plane node and one worker node as a proof-of-concept. As an extra challenge we’ll also take a look at how to do PCIe passthrough for the worker node.


https://blog.stonegarden.dev/articles/2024/03/proxmox-k8s-with-cilium
book6

A collaborative IPv6 book.

The intention is a practical introduction to IPv6 for technical people, kept up to date by active practitioners.


https://github.com/becarpenter/book6
Piloting through the Fog: A Tale of Migrating to a New Kubernetes Platform

It’s a tale as old as UNIX_MIN_TIMESTAMP. Your team owns a service that you treat like a black box as long as it’s working. Sure, there’s a small maintenance task here and there that the most tenured member of the team almost exclusively picks up. How they fix it might as well be a wizard’s incantation with a sprinkle of fairy dust. But this time around they’re busy on another task, or worse, gone from the company altogether. Here’s my story of such a maintenance task. In this post I go through my journey of migrating one such service from Klaviyo’s legacy Kubernetes platform to our new, spiffy, well-managed platform.


https://klaviyo.tech/piloting-through-the-fog-a-tale-of-migrating-to-a-new-kubernetes-platform-7fe5677310fa
How our data team handles incidents

Historically, data teams have not been closely involved in the incident management process (at least, not in the traditional “get woken up at 2AM by a SEV0” sense). But with a growing involvement of data (and therefore data teams) in core business processes, decision making, and user-facing products, data-related incidents are increasingly common, and more important than ever.

At incident.io, the Data team works across multiple areas of the business, enabling go-to-market and product teams alike to make data-driven decisions. Given our broad involvement, we’re no strangers to data incidents and are heavy users of our own product to monitor, triage, and respond to them. Here’s a quick run-through of how we’ve set this up.


https://incident.io/blog/how-our-data-team-handles-incidents
What is an SLA?

A Service Level Agreement (SLA) is a formal document that outlines the expectations, responsibilities, and performance metrics between a service provider and a customer.
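One performance metric that often appears in an SLA is an availability target. As a rough illustration (the targets and the 30-day period below are assumptions, not figures from the linked article), an availability percentage can be converted into the downtime it permits:

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability target permits.

    Both the target and the measurement period are illustrative assumptions;
    a real SLA defines its own metrics and measurement window.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Example: downtime allowed over a 30-day period for two hypothetical targets.
print(round(allowed_downtime_minutes(99.9), 1))   # 43.2
print(round(allowed_downtime_minutes(99.99), 2))  # 4.32
```

Each extra "nine" cuts the permitted downtime by a factor of ten, which is why the exact target matters so much to both parties.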


https://uptimerobot.com/blog/what-is-an-sla
Optimizing global message transit latency: a journey through TCP configuration

https://ably.com/blog/optimizing-global-message-transit-latency-a-journey-through-tcp-configuration
kubetrim

kubetrim tidies up old and broken cluster and context entries from your kubeconfig file.


https://github.com/alexellis/kubetrim
outline

The fastest knowledge base for growing teams. Beautiful, realtime collaborative, feature packed, and markdown compatible.


https://github.com/outline/outline
isaiah

Self-hostable clone of lazydocker for the web. Manage your Docker fleet with ease


https://github.com/will-moss/isaiah
wush

wush is a command line tool that lets you easily transfer files and open shells over a peer-to-peer WireGuard connection.


https://github.com/coder/wush
How to secure Terraform code with Trivy

In this blog post we will look at securing an AWS Terraform configuration using Trivy to check for known security issues. We will explore different ways of using Trivy, integrating it into your CI pipelines, practical issues you might face and solutions to those issues to get you started with improving the security of your IaC codebases.


https://verifa.io/blog/how-to-secure-terraform-trivy
Proper setup of IAM federation in Multi-account AWS Organization for Terragrunt

In this article I describe how to configure the IAM relationships in a multi-account AWS organization with AWS SSO so that infrastructure can be managed as code with Terragrunt/Terraform from both CI/CD runners and local PCs.


https://dev.to/yyarmoshyk/proper-setup-of-iam-federation-in-multi-account-aws-organization-for-terragrunt-3ape
Remotely Producing Terraform from an API

https://nitric.io/blog/terraform-api
terraform plan -light

Add a terraform plan -light flag such that only resources modified in code are targeted for planning.

This would reduce the scope of the pre-plan refresh down to the set of resources we know changed, which reduces overall plan times without the consistency risk of -refresh=false.

For Terraform to know what resources were modified in code, it would store the hash of the serialized sorted attribute map for each resource successfully applied. This would allow diff’ing “last-applied code” vs. “new code”, the result of which is the scope of the next -light plan.

Basically, -light autogenerates the -target list from code changes.
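The proposal above can be sketched in a few lines of Python. This is only an illustration of the idea, not Terraform code: the -light flag is a proposal, and the function names, the use of SHA-256 over sorted JSON, and the resource addresses are all assumptions made for the example.

```python
import hashlib
import json


def attribute_hash(attributes: dict) -> str:
    """Hash a resource's attribute map after serializing it with sorted keys."""
    serialized = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def light_targets(last_applied_hashes: dict, new_code: dict) -> list:
    """Return addresses of resources that are new or changed in code.

    last_applied_hashes: address -> hash stored after the last successful apply
    new_code: address -> attribute map taken from the current configuration
    Resources removed from code are not handled in this sketch.
    """
    return sorted(
        address
        for address, attributes in new_code.items()
        if last_applied_hashes.get(address) != attribute_hash(attributes)
    )


# Hypothetical state recorded after the last apply.
last_applied = {
    "aws_s3_bucket.logs": attribute_hash({"bucket": "logs", "acl": "private"}),
    "aws_instance.web": attribute_hash({"ami": "ami-123", "instance_type": "t3.micro"}),
}

# Hypothetical current code: one unchanged (keys reordered), one edited, one new.
new_code = {
    "aws_s3_bucket.logs": {"acl": "private", "bucket": "logs"},
    "aws_instance.web": {"ami": "ami-123", "instance_type": "t3.small"},
    "aws_sqs_queue.jobs": {"name": "jobs"},
}

targets = light_targets(last_applied, new_code)
print([f"-target={address}" for address in targets])
# ['-target=aws_instance.web', '-target=aws_sqs_queue.jobs']
```

Sorting the keys before hashing is what keeps a purely cosmetic reordering of attributes from counting as a change.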


https://www.bejarano.io/terraform-plan-light
Terraform Modules and Grand Finale Project: Orchestrating Complex Azure Infrastructure

Master Terraform modules for Azure infrastructure management. Learn to create, use, and optimize modules, build a multi-tier application, and implement best practices for large-scale projects. Ideal for DevOps engineers and cloud architects looking to enhance their Infrastructure as Code skills on Azure.


https://www.iamachs.com/p/azure-terraform/part-7-modules-grand-finale
Monitoring Inter-Pod Traffic at the AZ Level with Retina (an eBPF based tool)

Recently, I was tasked with analyzing cross-AZ traffic in Kubernetes to identify opportunities for reduction, as it is a significant contributor to our AWS bill. The first step was to understand how traffic flows between services and what portion consistently crosses Availability Zones (AZs). To optimize cross-AZ traffic, I considered using topology-aware routing for services. However, before implementing this solution, I needed a method to effectively analyze inter-pod traffic at the AZ level.

To achieve this, monitoring network traffic at the pod level is necessary. I decided to use eBPF (Extended Berkeley Packet Filter) technology, as it allows us to observe network interactions with minimal performance overhead.

In this article, I will explain what eBPF is, explore the tools available for using it, and provide a step-by-step guide on implementing monitoring for inter-pod traffic using Retina, Kube State Metrics, Prometheus, and Grafana.
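The kind of analysis described here boils down to grouping pod-to-pod traffic by source and destination AZ. The sketch below shows that aggregation on made-up data; it does not use Retina, Prometheus, or any of their APIs, and the pod names, zone names, and byte counts are hypothetical.

```python
from collections import defaultdict


def bytes_by_az_pair(flows, pod_to_az):
    """Sum bytes per (source AZ, destination AZ) pair.

    flows: iterable of (source pod, destination pod, byte count)
    pod_to_az: pod name -> availability zone
    Flows whose source or destination pod has no known AZ are skipped.
    """
    totals = defaultdict(int)
    for src_pod, dst_pod, byte_count in flows:
        if src_pod not in pod_to_az or dst_pod not in pod_to_az:
            continue
        totals[(pod_to_az[src_pod], pod_to_az[dst_pod])] += byte_count
    return dict(totals)


def cross_az_share(totals):
    """Return the fraction of bytes whose source and destination AZ differ."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    cross = sum(count for (src_az, dst_az), count in totals.items() if src_az != dst_az)
    return cross / total


# Hypothetical pod placement and observed flows.
pod_to_az = {"web-1": "az-a", "api-1": "az-a", "api-2": "az-b", "db-1": "az-b"}
flows = [
    ("web-1", "api-1", 600),  # stays inside az-a
    ("web-1", "api-2", 300),  # crosses from az-a to az-b
    ("api-2", "db-1", 100),   # stays inside az-b
]

totals = bytes_by_az_pair(flows, pod_to_az)
print(cross_az_share(totals))  # 0.3
```

A per-pair breakdown like `totals` is what shows which service paths would benefit most from topology-aware routing.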


https://medium.com/@j.aslanov94/monitoring-inter-pod-traffic-at-the-az-level-with-ebpf-based-tool-retina-7a79818e305b