One end-to-end DevOps project to learn almost all tools together?
Hey everyone,
I’m a DevOps beginner. I’ve covered the theory, but now I want hands-on experience.
Instead of learning tools separately, I’m looking for ONE consolidated, end-to-end DevOps project where I can see how tools work together, like:
Git → CI/CD (Jenkins/GitLab) → Docker → Kubernetes → Terraform → Monitoring (Prometheus/Grafana) on AWS.
A YouTube series, GitHub repo, or blog + repo is totally fine.
Goal is to understand the real DevOps flow, not just run isolated commands.
If you know any solid project or learning resource like this, please share 🙏
Thanks!
https://redd.it/1qarrve
@r_devops
What DevOps and cloud practices are still worth adding to a live production app?
Hello everyone, I'm totally new to DevOps.
I have a question about applying DevOps and cloud practices to an application that is already in production and actively used by users.
Let's assume the application is already finished, stable, and running in production. I understand that not all DevOps or cloud practices are equally easy, safe, or worth implementing late, especially things like deep re-architecture, Kubernetes, or full containerization.
My question is: what DevOps and cloud concepts, practices, and tools are still considered late-friendly, low-risk, and truly worth implementing on a live production application? (This is for learning and hands-on practice, not a formal or professional engagement.)
Also, any advice on learning DevOps would be appreciated :))
https://redd.it/1qb9mcf
@r_devops
Observability for AI Models and GPU Inferencing
Hello Folks,
I need some help regarding observability for AI workloads. For those of you who run your own ML models and AI workloads on your own infrastructure, how are you doing observability for it? I'm specifically interested in the inferencing part: GPU load, VRAM usage, processing, and throughput. How are you achieving this?
What tools or stacks are you using? I'm currently working in an AI startup where we process a very high number of images daily. We have observability for CPU and memory, and APM for code, but nothing for the GPU and inferencing part.
What kind of tools can I use here to build a full GPU observability solution, or should I go with a SaaS product?
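For the DIY route, the core plumbing is small. Here is a minimal sketch of a GPU metrics exporter, assuming NVIDIA GPUs and the `nvidia-ml-py` (imported as `pynvml`) and `prometheus_client` packages; NVIDIA's dcgm-exporter covers the same ground off the shelf, and serving layers such as Triton expose inference-level metrics (latency, throughput) on their own metrics endpoint:

```python
# Minimal GPU metrics exporter sketch: polls NVML and exposes Prometheus
# gauges. Assumes NVIDIA drivers plus
# `pip install nvidia-ml-py prometheus-client`.
import time

import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_utilization_percent", "GPU compute utilization", ["gpu"])
VRAM_USED = Gauge("gpu_vram_used_bytes", "VRAM in use", ["gpu"])

def poll(interval_s: float = 5.0) -> None:
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        while True:
            for i in range(count):
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                GPU_UTIL.labels(gpu=str(i)).set(util.gpu)
                VRAM_USED.labels(gpu=str(i)).set(mem.used)
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    start_http_server(9400)  # Prometheus scrape target
    poll()
```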
Please suggest.
Thanks
https://redd.it/1qb51ph
@r_devops
Deterministic analysis of Java + Spring Boot + Kafka production logs
I’m working on a **Java tool that analyzes real production logs** from **Spring Boot + Apache Kafka** services.
This is **not an auto-fixing tool** and not a tutorial.
The goal is **fast incident classification + safe recommendations**, the way an experienced on-call / production engineer would reason.
**Example: Kafka consumer JSON deserialization failure**
**Input (real Kafka production log):**
`Caused by: org.apache.kafka.common.errors.SerializationException:`
`Error deserializing JSON message`
`Caused by: com.fasterxml.jackson.databind.exc.InvalidDefinitionException:`
`Cannot construct instance of com.mycompany.orders.event.OrderEvent`
`(no Creators, like default constructor, exist)`
`at [Source: (byte[])"{"orderId":123,"status":"CREATED"}"; line: 1, column: 2]`
**Output (tool result)**
`Category: DESERIALIZATION`
`Severity: MEDIUM`
`Confidence: HIGH`
`Root cause:`
`Jackson cannot construct target event class due to missing creator`
`or default constructor.`
`Recommendation:`
`Add a default constructor or annotate a constructor`
**Example fix:**
    public class OrderEvent {
        private Long orderId;
        private String status;

        // Jackson needs a no-args constructor (or an annotated creator)
        public OrderEvent() {}

        public OrderEvent(Long orderId, String status) {
            this.orderId = orderId;
            this.status = status;
        }
    }
# Design goals
* Known **Kafka / Spring / JVM failures** detected via **deterministic rules**:
  * Kafka rebalance loops
  * schema incompatibility
  * topic not found
  * JSON deserialization errors
  * timeouts
  * missing Spring beans
* **LLM assistance is strictly constrained**:
  * forbidden for infrastructure issues
  * forbidden for concurrency / threading
  * forbidden for binary compatibility (e.g. `NoSuchMethodError`)
* Some failures must **always** result in:
  * **No safe automatic fix, human investigation required.**
This project is **not about auto-remediation**
and explicitly avoids “AI guessing fixes”.
It’s about **reducing cognitive load during incidents** by:
* classifying failures fast
* explaining *why* they happened
* only suggesting fixes when they are provably safe
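To make the deterministic-rules idea concrete, here is a toy sketch of such a rule for the deserialization example above (hypothetical Python; the actual project is Java and its internals may differ):

```python
# Toy deterministic log-classification rule, in the spirit of the
# example above. Hypothetical sketch, not log-doctor's actual code.
import re
from dataclasses import dataclass

@dataclass
class Finding:
    category: str
    severity: str
    confidence: str
    root_cause: str
    recommendation: str

JACKSON_NO_CREATOR = re.compile(
    r"InvalidDefinitionException: Cannot construct instance of `?(?P<cls>[\w.]+)`?"
)

def classify(log: str) -> Finding | None:
    m = JACKSON_NO_CREATOR.search(log)
    if m:
        return Finding(
            category="DESERIALIZATION",
            severity="MEDIUM",
            confidence="HIGH",
            root_cause=f"Jackson cannot construct {m.group('cls')}: "
                       "no creator or default constructor.",
            recommendation="Add a default constructor or annotate a "
                           "constructor with @JsonCreator.",
        )
    return None  # unknown failure: hand off to a human (or constrained LLM)
```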
**GitHub (WIP):**
[https://github.com/mathias82/log-doctor](https://github.com/mathias82/log-doctor)
# Looking for feedback from DevOps / SRE folks on:
* Java + Spring Boot + Kafka-related failure coverage
* missing rule categories you see often on-call
* where LLMs should be **completely disallowed**
Production war stories very welcome 🙂
https://redd.it/1qbllc2
@r_devops
Azure VM auto-start app
Azure has auto‑shutdown for VMs, but no built‑in “auto‑start at 7am” feature. So I built an app for that - VMStarter.
It’s a small Go worker that:
• discovers all VMs across any Azure subscriptions it has access to
• sends a start request to each one — **no need to specify VM names**
• runs cleanly as a scheduled Azure Container Apps Job (cron)
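For readers who want to see the moving parts, here is roughly the same discover-and-start flow sketched in Python (the tool itself is Go; this assumes the `azure-identity` and `azure-mgmt-*` SDKs and a credential allowed to start VMs):

```python
# Rough Python equivalent of the discover-and-start flow above.
# Assumes: pip install azure-identity azure-mgmt-resource azure-mgmt-compute
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.resource import SubscriptionClient

def start_all_vms() -> None:
    cred = DefaultAzureCredential()
    for sub in SubscriptionClient(cred).subscriptions.list():
        compute = ComputeManagementClient(cred, sub.subscription_id)
        for vm in compute.virtual_machines.list_all():
            # The resource group is embedded in the VM's ARM resource ID:
            # /subscriptions/<id>/resourceGroups/<rg>/providers/...
            rg = vm.id.split("/")[4]
            compute.virtual_machines.begin_start(rg, vm.name)
            print(f"start requested: {sub.subscription_id}/{rg}/{vm.name}")

if __name__ == "__main__":
    start_all_vms()
```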
Deployment instructions: https://github.com/groovy-sky/vm-starter#deployment-script
Docker image: https://hub.docker.com/repository/docker/gr00vysky/vm-starter
Any feedback/PRs welcome.
https://redd.it/1qbmr2s
@r_devops
Spark stage cost breakdown on AWS (why distributed tracing isn't helping & how to fix it)
Tempo has been a total headache lately. I’ve been staring at Spark traces in there for weeks now, and I’m honestly coming up empty.
What I really want is simple: a clear picture of which Spark stages are actually driving up our costs.
Here’s the thing… poorly optimized Spark jobs can quietly rack up massive bills on AWS. I’ve seen real-world cases where teams cut infrastructure costs by over 100x on critical pipelines just by pinpointing inefficiencies, and others achieve 10x faster runtimes with dramatically lower spend.
We’re aiming to tie stage-level resource usage directly to real AWS dollar figures, so we can rank priorities and tackle the biggest optimizations first. Right now, though, it just feels like we’re gathering traces with no real insight.
I still can’t answer basic questions like:
Which stages are consuming the most CPU, memory, or disk I/O?
How do we accurately map that to actual spend on AWS?
Here’s what I’ve tried:
Running the OTel Java agent and exporting to Tempo -> massive trace volume, but the spans don’t align meaningfully with Spark stages or resource usage. Feels like we’re tracing the wrong things entirely.
Spark UI -> perfect for one-off debugging, but not practical for ongoing cost analysis across production jobs.
At this point, I’m seriously questioning whether distributed tracing is even the right approach for cost attribution.
Would we get further with metrics and Mimir instead? Or is there a smarter way to structure Spark traces in Tempo that actually enables proper cost breakdown?
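For what it's worth, the metrics-style calculation is small enough to prototype: pull stage-level aggregates from Spark's monitoring REST API and convert executor time into dollars. A rough sketch, where the history-server URL, app ID, and per-core price are placeholders you would supply:

```python
# Rough sketch: rank Spark stages by estimated cost using the Spark
# history server's REST API. URL, app ID, and pricing are placeholders.
import requests

HISTORY_SERVER = "http://spark-history:18080"
APP_ID = "app-20240101000000-0001"  # placeholder
USD_PER_CORE_HOUR = 0.05            # derive from your EC2 instance pricing

def stage_costs(app_id: str) -> list[tuple[str, float]]:
    stages = requests.get(
        f"{HISTORY_SERVER}/api/v1/applications/{app_id}/stages"
    ).json()
    costs = []
    for s in stages:
        # executorRunTime is summed task run time in milliseconds;
        # treating it as core-time gives a first-order cost estimate.
        core_hours = s["executorRunTime"] / 1000 / 3600
        costs.append((f"stage {s['stageId']}: {s.get('name', '')}",
                      core_hours * USD_PER_CORE_HOUR))
    return sorted(costs, key=lambda c: c[1], reverse=True)

for name, usd in stage_costs(APP_ID)[:10]:
    print(f"${usd:,.2f}  {name}")
```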
I’ve read all the docs, watched the talks, and even asked GPT, Claude, and Mistral for ideas… I’m still stuck.
Any advice or experience here would be hugely appreciated.
https://redd.it/1qbnszj
@r_devops
Built a Real Estate Market Intelligence Pipeline Dashboard using Python + Power BI (Learning Project)
This is a learning project where I attempted to build an end-to-end analytics pipeline and visualize the results using Power BI.
Project overview:
I designed a simple data pipeline using static real estate data to understand how different tools fit together in an analytics workflow, from raw data collection to business-facing dashboards.
Pipeline components:
• GitHub – used as the source for collecting and storing raw data
• Python – used for data cleaning, transformation, and basic processing
• Power BI – used for building the Market Intelligence dashboard
• n8n – used for pipeline orchestration (pipeline currently paused due to technical issues at the automation stage)
Current status:
The pipeline is partially implemented. Data extraction and processing were completed, and the final dashboard was built using the processed data. Automation via n8n is planned but temporarily halted.
Dashboard focus:
• Price overview (average, median, min, max)
• Location-wise price comparison
• Property distribution by number of bedrooms
• Average price per square foot
• Business-oriented insights rather than purely visual design
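For reference, the headline numbers in a dashboard like this reduce to a few aggregations in the Python step. A minimal pandas sketch (the column names are assumptions about the dataset):

```python
# Minimal pandas sketch of the dashboard metrics above.
# Column names (price, location, bedrooms, sqft) are assumed.
import pandas as pd

df = pd.read_csv("listings.csv")

price_overview = df["price"].agg(["mean", "median", "min", "max"])
price_by_location = df.groupby("location")["price"].mean().sort_values()
beds_distribution = df["bedrooms"].value_counts().sort_index()
avg_price_per_sqft = (df["price"] / df["sqft"]).mean()

print(price_overview)
print(f"avg $/sqft: {avg_price_per_sqft:,.0f}")
```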
This project was done independently as part of learning data pipelines and analytics workflows.
I’d appreciate constructive feedback—especially on pipeline design, tooling choices, and how this could be improved toward a more production-ready setup.
https://redd.it/1qboimq
@r_devops
My review of Orca Security for cloud-based vuln management
Been a Tenable shop for vuln management for years, brought on Orca about a year ago. Figured I'd share what I've found.
Context: 80+ AWS accounts at any given time. QoL for multi-account handling matters a lot - main reason we moved off Tenable.
Orca's been overall good, but not without faults. UI gets sluggish when you're filtering across everything - annoying but livable.
The query language took me longer than it should have to get comfortable with; I ended up bugging our CSM more than I wanted to early on.
Once you're past that though, day-to-day is good. Less painful than I expected at our scale.
As I said at the start, main use is vuln management and that hasn't let me down yet.
Agentless scanning works, good enough exploitability context, multi-account handling is better than what we had, or at least less annoying to deal with.
Alerting took some tuning to not be noisy as hell but once it's dialed it stays dialed.
Other stuff worth mentioning:
Exports: no weird formatting when pulling compliance reports, which is more than I can say for some tools
Deleted resources: clears out fast, not chasing ghosts
Attack paths: actually useful for explaining risk to non-security people, good for getting buy-in
Dashboards: CVE data populates clean, prioritization logic makes sense without having to customize everything
Overall, not a perfect tool but it's been a net positive. Does what I need it to do.
https://redd.it/1qbpaay
@r_devops
I need feedback about an open-source CLI that scans AI models (Pickle, PyTorch, GGUF) for malware, verifies HF hashes, and checks licenses
Hi everyone,
I've created a new CLI tool to secure AI pipelines. It scans models (Pickle, PyTorch, GGUF) for malware using stack emulation, verifies file integrity against the Hugging Face registry, and detects restrictive licenses (like CC-BY-NC). It also integrates with Sigstore for container signing.
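The integrity-verification part of such a pipeline is, at its core, stream-and-compare. A generic sketch (not Veritensor's actual code; on Hugging Face, LFS-tracked files expose a SHA-256 the expected digest could come from):

```python
# Generic integrity check sketch: compare a local model file's SHA-256
# against an expected digest (e.g. taken from the registry's metadata).
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Stream in chunks so multi-GB model files don't need to fit in RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path: Path, expected_sha256: str) -> bool:
    actual = sha256_of(path)
    if actual != expected_sha256:
        print(f"MISMATCH: {path} is {actual}, expected {expected_sha256}")
        return False
    return True
```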
GitHub: https://github.com/ArseniiBrazhnyk/Veritensor
Install: pip install veritensor
If you're interested, check it out and let me know what you think and whether it might be useful to you.
https://redd.it/1qbpjja
@r_devops
What are the basic tools you would suggest for a DevOps newbie?
Python, GitHub Actions, Terraform, Docker, K8s... anything else?
https://redd.it/1qbrvkt
@r_devops
What causes VS Code to bypass Husky hooks, and how can I force the Source Control commit button to behave exactly like a normal git commit from the terminal?
I have a Git project with Husky + lint-staged configured.
When I run git commit from the terminal, the pre-commit hook executes correctly.
However, when I commit using the VS Code Source Control UI, the Husky hook is completely skipped.
https://redd.it/1qbnfe9
@r_devops
I built the interactive Docker tutorial I wish I had when I was learning Docker
Hello Everyone,
I've always had a passion for teaching new technologies and concepts, so I decided to build this interactive tutorial for learning Docker.
Link to tutorial: https://learn-how-docker-works.vercel.app/
https://redd.it/1qbufky
@r_devops
Why 'works on my machine' means your build is broken
We’ve been using Nix derivations at work for a while now. Steep learning curve, no question, but once it clicks, it completely changes how you think about builds, CI, and reproducibility.
What surprised me most is how many “random” CI failures were actually self-inflicted: network access, implicit system deps, time, locale, you name it.
I tried to write down a tool-agnostic mental model of what makes a build hermetic and why it matters, before getting lost in Nix/Bazel specifics.
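A toy illustration of that mental model, independent of any build tool: archive formats embed timestamps, so "the same build" hashes differently unless every input, including time, is pinned.

```python
# Toy demo of (non-)reproducibility: a zip archive embeds file
# timestamps, so "the same build" differs unless that input is pinned.
import hashlib
import io
import time
import zipfile

def build(source: bytes, stamp=(1980, 1, 1, 0, 0, 0)) -> str:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # date_time is a hidden build input: pin it and the artifact is
        # bit-identical; let the wall clock leak in and it isn't.
        zf.writestr(zipfile.ZipInfo("app/main.txt", date_time=stamp), source)
    return hashlib.sha256(buf.getvalue()).hexdigest()

src = b"print('hello')"
assert build(src) == build(src)          # pinned inputs: reproducible
print(build(src, time.localtime()[:6]))  # clock leaks in: hash drifts
```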
If you’re curious, I put the outline here:
https://nemorize.com/roadmaps/hermetic-builds
https://redd.it/1qbx6lg
@r_devops
#devops #infrastructureascode #kvm #virtualization #opensource #linux #automation #cloudnative #techinnovation #github #virsh #bash #terraform #kickstart #qemu #kvm | Ahmad M Waddah
Check KVMSpinUps
https://redd.it/1qbyxld
@r_devops
Tabletop incident exercises feel so cringe
We have an incident response plan, on-call, alerts, and postmortems. But we’ve never done a proper tabletop exercise. Now bigger customers keep asking if we test IR, and I’m realizing they’re expecting something more formal than “we handle incidents.”
Don't get me wrong, I’m not against doing tabletops; it just feels like one more thing that turns into paperwork.
What’s the simplest way to do it without making it cringe or useless?
https://redd.it/1qbyjnh
@r_devops
Hosting a Hugo site and Laravel app in the same server
Hi guys,
I don't know whether this is the right sub to ask this. I have a DO droplet on which I want to host a Hugo static site and a Laravel app. Hugo generates routes automatically based on its content: for example, if you have /content/posts/about.md, the site will generate a route like example.com/posts/about.
I want that behaviour as well, plus I want to deploy my Laravel application on the same domain, like example.com/app. How can I do that? The subdomain approach is not possible for SEO reasons.
https://redd.it/1qc10v7
@r_devops
I built a way to make infrastructure safe for AI
I built a platform that lets AI agents work on infrastructure by wrapping KVM/libvirt with a Go API.
Most AI tools stop at the codebase because giving an LLM root access to prod is crazy. fluid.sh creates ephemeral sandboxes where agents can execute tasks like configuring firewalls, restarting services, or managing systemd units safely.
How it works:
- It uses qcow2 copy-on-write backing files to instantly clone base images into isolated sandboxes.
- The agent gets root access within the sandbox.
- Security is handled via an ephemeral SSH Certificate Authority; agents use short-lived certificates for authentication.
- As the agent works, it builds an Ansible playbook to replicate the task.
- You review the changes in the sandbox and the generated playbook before applying it to production.
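The copy-on-write cloning in the first step boils down to a single `qemu-img` invocation; a minimal sketch (paths illustrative, not fluid.sh's actual code):

```python
# Minimal sketch of a qcow2 copy-on-write sandbox disk: the overlay
# references the read-only base image and absorbs all writes.
# Paths are illustrative; this is not fluid.sh's actual code.
import subprocess
import uuid

BASE = "/var/lib/images/ubuntu-base.qcow2"  # read-only golden image

def create_sandbox_disk() -> str:
    overlay = f"/var/lib/sandboxes/{uuid.uuid4()}.qcow2"
    subprocess.run(
        ["qemu-img", "create",
         "-f", "qcow2",   # overlay format
         "-b", BASE,      # backing (base) file
         "-F", "qcow2",   # backing file format (required by newer qemu-img)
         overlay],
        check=True,
    )
    return overlay  # boot the VM from this; delete it to discard changes
```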
Tech: Go, libvirt/KVM, qcow2, Ansible, Python SDK.
GitHub: https://github.com/aspectrr/fluid.sh
Demo: https://youtu.be/nAlqRMhZxP0
Happy to answer any questions or feedback!
https://redd.it/1qc5ecx
@r_devops
Infra team reorg and impact on setup
I am an engineering manager working for a multinational organization. I'm part of the data analytics department, managing the data platform and AWS cloud platform teams. Due to an internal reorganisation, my teams and I have now moved to the tech department. The data and analytics department was around 50-60 people, whereas the tech department is about 500 people.
The new manager is proposing to split my role so that I’d focus on the cloud platform, while data platform reporting would move to another manager. He would also like to add Azure and GCP, plus additional DevX teams, to my portfolio.
I should state that my background is mainly in the data area: I started in data engineering and developed into managing the AWS cloud platform. I’ve done so for the last 3 years and have managed to keep cloud costs flat while the business topline grew by up to 50% and profitability doubled in this period.
1. I’m of the opinion that cloud and data platform leadership should remain close for better FinOps (50-70% of our cloud costs have a data footprint).
2. I believe adding Azure and GCP to my portfolio is more of a director (or head-of) level request.
A few points to consider: the organization is really not offering director or head-of roles, and they're downplaying the scope increase. For context, the Azure and GCP spend is 8-9 times the AWS spend, so those clouds have a much larger cost footprint, while the ROI on AWS is 2-3x that of the other hyperscalers.
Any tips or counterarguments on how I should navigate this? Experience sharing encouraged.
https://redd.it/1qbyth4
@r_devops
Deployments kept failing in production for the dumbest reason
Spent two months chasing phantom bugs that turned out not to be bugs at all. Our staging environment would work perfectly and all tests were green, but once we deployed to production everything exploded. And if we tried again with the same code, sometimes it'd work and sometimes it wouldn't; it made zero sense.
Figured out the issue was just services not knowing where to find each other. We had configs spread across different repos that would get updated at different times, so service A deploys on Monday expecting service B to be at one address, but service B already moved on Friday and nobody updated the config. We switched everything to figure out addresses at runtime instead of hardcoding them. We looked at a few options, like Consul for service discovery, Kubernetes DNS, or even just etcd for config management; in the end we went with Synadia because it handles service discovery plus the messaging we needed anyway. Now services find each other automatically. Sounds like an obvious solution in hindsight, but we wasted so much time thinking it was code problems.
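The before/after here is tiny in code terms. A toy contrast, assuming Kubernetes-style cluster DNS (names illustrative):

```python
# Toy contrast: hardcoded endpoint vs. resolving at runtime.
# The service name below assumes Kubernetes-style cluster DNS.
import socket

# Before: baked into a config file at deploy time; goes stale silently.
SERVICE_B = ("10.0.3.17", 8080)

# After: resolve the stable service name each time you connect.
def service_b_addr() -> tuple[str, int]:
    host = "service-b.default.svc.cluster.local"
    addrinfo = socket.getaddrinfo(host, 8080, proto=socket.IPPROTO_TCP)
    family, _, _, _, sockaddr = addrinfo[0]
    return sockaddr[0], sockaddr[1]
```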
Feel kind of stupid it took this long to figure out, but at least it's fixed now.
https://redd.it/1qc7kln
@r_devops