DevOps & SRE notes – Telegram
Helpful articles and tools for DevOps & SRE

WhatsApp: https://whatsapp.com/channel/0029Vb79nmmHVvTUnc4tfp2F

For paid consultation (RU/EN), contact: @tutunak


Ways to support the channel: https://telegra.ph/How-support-the-channel-02-19
Securing multi-cluster ArgoCD setups without long-lived credentials requires a different approach to authentication and token management. This post explores how OpenUnison, kube-oidc-proxy, and ArgoCD's credential plugins can be combined into a centralized, secure GitOps platform that spans multiple Kubernetes clusters.

https://www.tremolo.io/post/securing-multi-cluster-argocd
Managing stateful workloads in Kubernetes often comes with challenges, particularly when scaling storage dynamically. This article introduces the PvcAutoscaler, a custom solution developed by City Storage Systems to enable volume expansion, shrinking, and modification for StatefulSets, improving cost efficiency and operational flexibility.

https://techblog.cloudkitchens.com/p/swapping-disks-in-kubernetes
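A minimal sketch of the sizing decision such a controller has to make, assuming a simple target-utilization policy. This is not the PvcAutoscaler's actual algorithm; `VolumeStatus`, `recommend_size_gib`, `classify`, and the default thresholds are hypothetical. Note that Kubernetes itself only supports growing a PVC (when the StorageClass sets `allowVolumeExpansion: true`); shrinking means migrating the data to a new, smaller volume.

```python
import math
from dataclasses import dataclass


@dataclass
class VolumeStatus:
    capacity_gib: int  # currently provisioned size
    used_gib: float    # space in use, e.g. from the kubelet's volume stats metrics


def recommend_size_gib(
    status: VolumeStatus,
    target_utilization: float = 0.7,  # hypothetical: aim for volumes ~70% full
    min_size_gib: int = 10,           # hypothetical floor
    step_gib: int = 10,               # round sizes up to multiples of this
) -> int:
    """Size that brings utilization back to the target (illustrative policy only)."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    ideal = status.used_gib / target_utilization
    rounded = math.ceil(ideal / step_gib) * step_gib
    return max(min_size_gib, rounded)


def classify(status: VolumeStatus, **policy) -> str:
    """Compare the recommendation with the current size."""
    new_size = recommend_size_gib(status, **policy)
    if new_size > status.capacity_gib:
        return "expand"
    if new_size < status.capacity_gib:
        return "shrink"
    return "keep"
```

With the defaults, a 100 GiB volume holding 90 GiB is recommended to grow to 130 GiB, while a 500 GiB volume holding 50 GiB would shrink to 80 GiB. A real controller would also need hysteresis so it does not flap between sizes.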
Optimizing Kubernetes cluster networking is essential for modern applications that require scalability, low latency, and efficient resource utilization. This blog post explores how LoxiLB uses eBPF to improve load balancing, observability, and security while avoiding the limitations of traditional proxy-based solutions like kube-proxy.

https://www.loxilb.io/post/loxilb-cluster-networking-elevating-k8s-networking-capabilities
Strong workload isolation is critical in Kubernetes environments for protecting sensitive operations and preventing container breakouts. This blog post explores how Kata Containers combine the efficiency of containers with the stronger isolation of virtual machines, enabling secure deployments on Amazon EKS with minimal configuration changes.

https://aws.amazon.com/blogs/containers/enhancing-kubernetes-workload-isolation-and-security-using-kata-containers/
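Running a pod under Kata is typically opt-in per workload through a Kubernetes RuntimeClass, which is why so little configuration has to change. A minimal sketch that builds the two manifests as Python dicts; the handler name `kata-qemu` is an assumption and must match a runtime handler actually configured in containerd on the nodes, so check what your Kata installation registers. `kubectl apply -f` accepts the printed JSON directly.

```python
import json
from typing import Optional


def runtime_class(name: str, handler: str) -> dict:
    """RuntimeClass that maps a Kubernetes-visible name to a CRI runtime handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": handler,  # assumption: must exist in the nodes' containerd config
    }


def pod(name: str, image: str, runtime_class_name: Optional[str] = None) -> dict:
    """Plain Pod; setting runtime_class_name is the only change needed for Kata."""
    spec: dict = {"containers": [{"name": name, "image": image}]}
    if runtime_class_name is not None:
        spec["runtimeClassName"] = runtime_class_name
    return {"apiVersion": "v1", "kind": "Pod", "metadata": {"name": name}, "spec": spec}


if __name__ == "__main__":
    # Hypothetical names; adjust to the handler your Kata install provides.
    items = [runtime_class("kata-qemu", "kata-qemu"), pod("demo", "nginx", "kata-qemu")]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```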
The challenge of making artificial intelligence more transparent is at the heart of Andrew Mallaband's exploration of the "black box" dilemma. This insightful editorial delves into the real-world implications of explainability in AI systems.

https://www.linkedin.com/pulse/explainability-black-box-dilemma-real-world-andrew-mallaband-ogvae/
Optimizing autoscaling in Kubernetes involves much more than monitoring CPU and memory, as Cristian Sepulveda demonstrates in this blog post through a practical application workflow. By using KEDA to scale on application-level metrics such as message queue length, teams can achieve faster, more cost-effective scaling tailored to their specific needs.

https://medium.com/@csepulvedab/how-to-optimize-autoscaling-in-kubernetes-using-metrics-based-on-application-workflows-7f899fdef4d9
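KEDA handles scaling between zero and one replica itself and delegates 1-to-N scaling to a HorizontalPodAutoscaler it creates, feeding it an external metric such as queue length. A minimal sketch of the arithmetic the HPA then applies, assuming an `AverageValue` target of N messages per replica and the HPA's default 10% tolerance; the function and parameter names are illustrative.

```python
import math


def desired_replicas(
    queue_length: int,
    target_per_replica: float,
    current_replicas: int,
    min_replicas: int = 1,
    max_replicas: int = 20,
    tolerance: float = 0.1,  # mirrors the HPA's default tolerance
) -> int:
    """HPA-style replica count for an AverageValue target on an external metric."""
    if current_replicas < 1:
        raise ValueError("0 -> 1 activation is handled by KEDA, not the HPA")
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    current_average = queue_length / current_replicas  # messages per running pod
    ratio = current_average / target_per_replica
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 100 queued messages with a target of 5 per replica and 2 running replicas yields 20 replicas, while 21 messages across 4 replicas (5.25 each) is within tolerance, so nothing changes. The real controller also applies stabilization windows and scaling policies, which this sketch leaves out.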
As the complexity of modern software systems grows, the meaning and practice of "observability" have become increasingly muddled. In this personal essay, Charity Majors argues that it's time to "version" observability—differentiating the traditional metrics-logs-traces approach (Observability 1.0) from a new, more flexible model built on wide, structured log events (Observability 2.0).

https://charity.wtf/2024/08/07/is-it-time-to-version-observability-signs-point-to-yes/
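A minimal sketch of the "one wide, structured event per unit of work" idea: instead of scattering separate log lines and metrics, the handler accumulates arbitrary key/value context on a single dict and emits it once as JSON when the request finishes. The `wide_event` helper, `handle_checkout`, and all field names are hypothetical, not an API from the essay or any vendor.

```python
import json
import time
import uuid
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def wide_event(emit: Callable[[str], None], **fields) -> Iterator[dict]:
    """Collect context for one unit of work and emit it as a single JSON event."""
    event = {"event_id": str(uuid.uuid4()), "timestamp": time.time(), **fields}
    start = time.perf_counter()
    try:
        yield event  # handler code adds fields as it learns them
        event.setdefault("outcome", "success")
    except Exception as exc:
        event["outcome"] = "error"
        event["error.type"] = type(exc).__name__
        event["error.message"] = str(exc)
        raise
    finally:
        event["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        emit(json.dumps(event, sort_keys=True))


def handle_checkout(user_id: str, items: list, emit: Callable[[str], None] = print) -> str:
    """Hypothetical request handler; items is a list of (name, price_cents) pairs."""
    with wide_event(emit, service="checkout", route="/cart/checkout") as ev:
        ev["user.id"] = user_id
        ev["cart.items"] = len(items)
        ev["cart.total_cents"] = sum(price for _, price in items)
        if not items:
            raise ValueError("empty cart")
        return "ok"
```

Because every field lands on the same event, questions like "which users with more than N items saw errors?" become a single query over high-cardinality fields rather than a join across separate signals.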
Designing a robust network architecture for K3s multi-cluster environments can be challenging, especially when integrating Layer 2 and BGP routing on UniFi UDM devices. In this guide, David Elizondo walks through practical considerations and strategies for planning private RFC 1918 address spaces and enabling communication between clusters using tools like Cilium with native routing.

https://medium.com/@david-elizondo/planning-a-k3s-multi-cluster-network-with-l2-and-bgp-on-unifi-udm-ae4480a7b4f7
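With native routing and BGP, pod and service IPs are routed directly between clusters, so their ranges must not overlap. K3s defaults to 10.42.0.0/16 for pods and 10.43.0.0/16 for services, which would collide if every cluster kept them. A minimal sketch using Python's `ipaddress` module to carve non-overlapping per-cluster ranges out of a private supernet; the cluster names, the /16-per-cluster size, and the half-and-half pod/service split are arbitrary assumptions, not the article's plan.

```python
import ipaddress

RFC1918_BLOCKS = [
    ipaddress.ip_network(n) for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(net: ipaddress.IPv4Network) -> bool:
    """True if the network lies entirely inside one of the RFC 1918 blocks."""
    return any(net.subnet_of(block) for block in RFC1918_BLOCKS)


def plan_clusters(names: list, supernet: str = "10.0.0.0/8", cluster_prefix: int = 16) -> dict:
    """Give each cluster its own block, split in half for pod and service CIDRs."""
    parent = ipaddress.ip_network(supernet)
    if not is_rfc1918(parent):
        raise ValueError(f"{parent} is not inside an RFC 1918 block")
    blocks = parent.subnets(new_prefix=cluster_prefix)  # lazy, in address order
    plan = {}
    for name in names:
        block = next(blocks, None)
        if block is None:
            raise ValueError(f"{parent} cannot hold {len(names)} /{cluster_prefix} blocks")
        pods, services = block.subnets(prefixlen_diff=1)  # two equal halves
        plan[name] = {"pods": pods, "services": services}
    return plan


def assert_no_overlap(plan: dict) -> None:
    """Raise if any two planned ranges overlap (sanity check before rollout)."""
    nets = [net for ranges in plan.values() for net in ranges.values()]
    for i, a in enumerate(nets):
        for b in nets[i + 1:]:
            if a.overlaps(b):
                raise ValueError(f"{a} overlaps {b}")
```

With the defaults, `plan_clusters(["home", "lab"])` gives `home` 10.0.0.0/17 for pods and 10.0.128.0/17 for services, and `lab` starts at 10.1.0.0/16. Each range would then be passed to its K3s server (for example via `--cluster-cidr` and `--service-cidr`) and advertised over BGP; the LAN subnets configured on the UDM must also stay outside these ranges.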
Learning from unexpected service failures can be a catalyst for long-term improvement, as Tines software engineer Shayon Mukherjee shares in this blog post. The story reveals how a Redis upgrade exposed a hidden point of failure in their webhook system, ultimately leading to stronger resilience and more comprehensive testing practices.

https://www.tines.com/blog/engineering-incidents-improvement/
Slow container startup times can cripple the productivity of Kubernetes teams managing large Docker images—sometimes dragging deployments out for hours. In this feature, Kazakov Kirill shares a practical strategy for pre-warming nodes and leveraging image caching, dramatically reducing cold starts and disk pressure during mass pod rollouts in Amazon EKS clusters.

https://hackernoon.com/how-to-optimize-kubernetes-for-large-docker-images
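One common way to pre-warm nodes, not necessarily the exact mechanism the article uses, is a DaemonSet whose init containers reference the large images so the kubelet pulls them onto every matching node before real workloads are scheduled. A minimal sketch that generates such a manifest; the DaemonSet name, the image list, and the assumption that each image ships `sh` are hypothetical.

```python
import json
from typing import Optional

PAUSE_IMAGE = "registry.k8s.io/pause:3.9"  # tiny container that just sleeps


def prepull_daemonset(name: str, images: list, node_selector: Optional[dict] = None) -> dict:
    """DaemonSet whose init containers force each large image onto every matching node."""
    labels = {"app": name}
    init_containers = [
        {
            "name": f"prepull-{i}",
            "image": image,
            "imagePullPolicy": "IfNotPresent",
            # Assumes the image contains sh; the container exits immediately,
            # its only purpose is to make the kubelet pull the image.
            "command": ["sh", "-c", "exit 0"],
        }
        for i, image in enumerate(images)
    ]
    pod_spec = {
        "initContainers": init_containers,
        "containers": [{"name": "pause", "image": PAUSE_IMAGE}],  # keeps the pod alive cheaply
    }
    if node_selector:
        pod_spec["nodeSelector"] = node_selector
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": name},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    }


if __name__ == "__main__":
    # Hypothetical image names; print JSON that kubectl apply -f can consume.
    print(json.dumps(prepull_daemonset("image-prewarm", ["example.com/big/app:1.0"]), indent=2))
```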