Reddit DevOps – Telegram
Getting started with devops/devsecops

I have been a pentester for 7 years, but my current role demands more from me. I am thinking of learning DevOps and DevSecOps, but I need a roadmap to get started. I would really appreciate it if someone could recommend some places to start. I am thinking of buying a KodeKloud subscription.

https://redd.it/1papm97
@r_devops
An entire sprint got stuck because of one obvious detail my ADHD brain just ignored

I don’t know if anyone else here deals with this, but yesterday I lost almost 5 hours on a bug that made absolutely no sense.
Everything was correct. Logic fine. Structures fine. Tests failing for no explainable reason.

I was already in that panic mode like:

> “How have I been working in this field for years and still fall for stuff like this?”

After running in circles, getting up, coming back, opening 20 tabs, closing 19, focusing for 15 minutes and losing focus for the next 45… I found out the issue was literally a configuration setting I forgot to disable.

It wasn’t the code.
It wasn’t the logic.
It wasn’t the test.
It was just my brain deciding to ignore the one detail that would’ve solved everything in 30 seconds.

I like to think developers with ADHD don’t have an intelligence problem — they have an organization problem.
We see 50 things at once, except the thing that actually matters.

The “good” part is that once I solved it, I jumped into hyperfocus and finished the rest of the ticket in record time.
The bad part is that the mental energy cost feels like it’s doubled.

If any of you work as a dev with ADHD, how do you deal with those “blind spots” that only show up hours later?
Seriously, I’d love to know if this is just me or if it comes with the package.


https://redd.it/1pavs9d
@r_devops
From SaaS Black Boxes to OpenTelemetry

> **TL;DR:** We needed metrics and logs from SaaS (Workday etc.) and internal APIs in the same observability stack as app/infra, but existing tools (Infinity, json_exporter, Telegraf) always broke for some part of the use-case. So I built [otel-api-scraper](https://github.com/aakashh242/otel-api-scraper) - an async, config-driven service that turns arbitrary HTTP APIs into OpenTelemetry metrics and logs (with auth, range scrapes, filtering, dedupe, and JSON→metric mappings). If "just one more cron script" is your current observability strategy for SaaS APIs, this is meant to replace that. [Docs](https://aakashh242.github.io/otel-api-scraper/)

I’ve been lurking in tech communities on Reddit for a while thinking, “One day I’ll post something.”
Then every day I’d open the feed, read cool stuff, and close the tab like a responsible procrastinator.
That changed during an observability project that got... interesting. The problem was simple on paper but got more annoying the deeper we dug into it. This is the story of how we tackled it.

---

So... hi. I’m a developer of ~9 years, heavy open-source consumer and an occasional contributor.

**The pain:** Business cares about signals you can’t see yet and the observability gap nobody markets to you

Picture this:

- The business wants data from SaaS systems (in our case Workday, but it could be anything: ServiceNow, Jira, GitHub...) in the same, centralized Grafana where they watch app metrics.
- Support and maintenance teams want connected views: app metrics and logs, infra metrics and logs, and "business signals" (jobs, approvals, integrations) from SaaS and internal tools, all on one screen.
- Most of those systems don’t give you a database, don’t give you Prometheus, don’t give you anything except REST APIs with varying auth schemes.

The requirement is simple to say and annoying to solve:

> We want to move away from disconnected dashboards in 5 SaaS products and see everything as connected, contextual dashboards in one place.

Sounds reasonable.

Until you look at what the SaaS actually gives you.

**The reality**

What we actually had:

- No direct access to underlying data.
  - No DB, no warehouse, nothing. Just REST APIs.
- APIs with weird semantics.
  - Some endpoints require a time range (start/end) or “give me last N hours”. If you don’t pass it, you get either no data or cryptic errors. Different APIs, different conventions.
- Disparate auth strategies.
  - Basic auth here, API key there, sometimes OAuth, sometimes Azure AD service principals.

We also looked at what exists in the open-source space but could not find a single tool that covered the entire range of our use-cases; each one fell short for one use-case or another.

- You can configure Grafana’s [Infinity data source](https://github.com/grafana/grafana-infinity-datasource) to hit HTTP APIs... but it doesn’t persist. It just runs live queries. You can’t easily look back at historical trends for those APIs unless you like screenshots or CSVs.
- Prometheus has [json_exporter](https://github.com/catawiki/json_exporter), which is nice until you want anything beyond simple header-based auth and you realize you’ve basically locked yourself into a Prometheus-centric stack.
- Telegraf has an [HTTP input plugin](https://docs.influxdata.com/telegraf/v1/input-plugins/http/) and it seemed best suited for most of our use-cases but it lacks the ability to scrape APIs that require time ranges.
- None of them emits logs, and capturing logs of jobs that ran in a SaaS system was one of our prime use-cases.

**Harsh truth:** For our use-case, nothing fit the full range of needs without either duct-taping scripts around them or accepting “half observability” and pretending it’s fine.

---

**The "let’s not maintain 15 random scripts" moment**

The obvious quick fix was:

> "Just write some Python scripts, curl the APIs, transform the data, push metrics somewhere. Cron it. Done."

We did that in the past. It works... until:

- Nobody remembers how each script works.
- One script silently breaks on an auth change and nobody notices until business asks “Where did our metrics go?”
- You try to onboard another system and end up copy-pasting a half-broken script and adding hack after hack.

At some point I realized we were about to recreate the same mess again: a partial mix of existing tools (json_exporter / Telegraf / Infinity) + homegrown scripts to fill the gaps. Dual stack, dual pain. So instead of gluing half-solutions together and pretending it was "good enough", I decided to build one generic, config-driven bridge:

> Any API → configurable scrape → OpenTelemetry metrics & logs.

We called the internal prototype `api-scraper`.

The idea was pretty simple:

- Treat HTTP APIs as just another telemetry source.
- Make the thing config-driven, not hardcoded per SaaS.
- Support multiple auth types properly (basic, API key, OAuth, Azure AD).
- Handle range scrapes, time formats, and historical backfills.
- Convert responses into OTEL metrics and logs, so we can stay stack-agnostic.
- Emit logs if users choose.
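
To make the range-scrape and backfill point concrete, here is a minimal Python sketch that splits a historical interval into fixed-size windows and renders each one as query parameters. The function names, the `start`/`end` parameter names, and the timestamp format are all assumptions for illustration; real APIs differ, which is exactly why this part needs to be configurable.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator


def scrape_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[tuple[datetime, datetime]]:
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        # The last window is clipped so it never runs past the requested end.
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


def to_query_params(window, fmt="%Y-%m-%dT%H:%M:%SZ"):
    """Render one window as query parameters (parameter names are hypothetical)."""
    window_start, window_end = window
    return {"start": window_start.strftime(fmt), "end": window_end.strftime(fmt)}


if __name__ == "__main__":
    backfill_start = datetime(2025, 1, 1, tzinfo=timezone.utc)
    backfill_end = datetime(2025, 1, 1, 5, tzinfo=timezone.utc)
    for window in scrape_windows(backfill_start, backfill_end, timedelta(hours=2)):
        print(to_query_params(window))
```

A regular scrape is then just the degenerate case of one window ending "now", while a backfill walks many windows in sequence.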

It's not revolutionary. It’s a boring async Python process that does the plumbing work nobody wants to hand-roll for the nth time.

---

**Why open-source a rewrite?**

Fast-forward a bit: I also started contributing to open source more seriously. At some point the thought was:

> We clearly aren’t the only ones suffering from 'SaaS API but no metrics' syndrome. Why keep this idea locked in?

So I decided to build a clean-room, enhanced, open-source rewrite of the concept - a general-purpose otel-api-scraper that:

- Runs as an async Python service.
- Reads a YAML config describing:
  - Sources (APIs),
  - Auth,
  - Time windows (range/instant),
  - How to turn records into metrics/logs.
- Emits OTLP metrics and logs to your existing OTEL collector - you keep your collector; this just feeds it.

I’ve added things that our internal version didn’t have:

- A proper configuration model instead of “config-by-accident”.
- Flexible mapping from JSON → gauges/counters/histograms.
- Filtering and deduping so you keep only what you want.
- Delta detection via fingerprints so overlapping data between scrapes doesn’t spam duplicates.
- A focus on keeping it stack-agnostic: OTEL out, it can plug in to your existing stack if you use OTEL.
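
The fingerprint-based delta detection mentioned in the list above can be sketched roughly like this. This is not the project's actual implementation: the class and function names are hypothetical, and the in-memory set is an assumption (a long-running service would likely need persistence or expiry for seen fingerprints).

```python
import hashlib
import json


def fingerprint(record: dict, keys: list[str]) -> str:
    """Stable hash over the fields that identify a record."""
    identity = {key: record.get(key) for key in keys}
    # sort_keys makes the hash independent of dict ordering.
    payload = json.dumps(identity, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class DeltaFilter:
    """Remembers fingerprints already emitted and drops repeats."""

    def __init__(self, keys: list[str]):
        self.keys = keys
        self.seen: set[str] = set()

    def new_records(self, records: list[dict]) -> list[dict]:
        fresh = []
        for record in records:
            digest = fingerprint(record, self.keys)
            if digest not in self.seen:
                self.seen.add(digest)
                fresh.append(record)
        return fresh


if __name__ == "__main__":
    delta = DeltaFilter(keys=["id", "status"])
    first_scrape = [{"id": 1, "status": "done"}, {"id": 2, "status": "running"}]
    second_scrape = [{"id": 2, "status": "running"}, {"id": 3, "status": "done"}]
    print(delta.new_records(first_scrape))
    print(delta.new_records(second_scrape))
```

Including `status` in the fingerprint keys is a design choice: a job that moves from `running` to `done` counts as a new record, while an unchanged job seen in two overlapping scrapes is emitted only once.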

And since I’ve used open source heavily for 9 years, it seemed fair to finally ship something that might be useful back to the community instead of just complaining about tools in private chats.

---

I enjoy daily.dev, but most of my daily work is hidden inside company VPNs and internal repos. This project finally felt like something worth talking about:

- It came from an actual, annoying real-world problem.
- Existing tools got us close, but not all the way.
- The solution itself felt general enough that other teams could benefit.

So:

- If you’ve ever been asked “Can we get that SaaS’ data into Grafana?” and your first thought was to write yet another script… this is for you.
- If you’re moving towards OpenTelemetry and want business/process metrics next to infra metrics and traces, not on some separate island, this is for you.
- If you live in an environment where "just give us metrics from SaaS X into Y" is a weekly request: same story.

The repo and documentation links:
👉 [API2OTEL(otel-api-scraper)](https://github.com/aakashH242/otel-api-scraper)
📜 [Documentation](https://aakashh242.github.io/otel-api-scraper/)

It’s early, but I’ll be actively maintaining it and shaping it based on feedback. Try it against one of your APIs. Open issues if something feels off (missing auth type, weird edge case, missing features). And yes, if it saves you a night of "just one more script", a star on the repo would genuinely be very motivating.

This is my first post on Reddit, so I’m also curious: if you’ve solved similar "API → telemetry" problems in other ways, I’d love to hear how you approached it.

https://redd.it/1panxw9
@r_devops
QA Engineer looking to transition into DevOps — advice?

Hello there! :)

I’m currently a QA Engineer and I’m looking to transition into DevOps, both for the challenge and the better pay.

For those who’ve made this transition, what tools should I focus on learning, and what kind of self-projects would be valuable for building a portfolio? Any other tips are very welcome.

For context, I work with embedded systems and my stack is Python, Robot Framework, and Jenkins (as a user, not maintaining the pipelines).

https://redd.it/1pb06e6
@r_devops
I'm building a tool to simplify internal DNS and I need your feedback

Hello community!!

I'm a developer/devops engineer who's fed up with the complexity surrounding internal DNS for development and staging environments. As a side project, I'm building a multi-tenant private DNS service with an API.

The idea is for it to be as simple to use as /etc/hosts, but with access control, logging, and scalability for teams. It's not ready for launch yet, so I want to make sure I'm addressing the right issues.

Based on my experience, I think it would help with:
- Avoiding reliance on public DNS servers or complex Consul configurations.
- Having a clear audit trail of who resolved or modified what.
- Isolating domains by project/team.

My question for you is: Does this resonate with any of your problems? What else drives you crazy about DNS management/service discovery that I should consider?

If the concept has potential, I'd love to keep you updated. Any feedback is welcome!

https://redd.it/1pb2neo
@r_devops
CKA certification Cyber Monday deals

The Linux Foundation is offering its regular Cyber Monday discounts.

CKA original price: $445
Discounted price: $223

CKAD original price: $445
Discounted price: $223

Coupon code: CW25CC

https://training.linuxfoundation.org/cyber-week-2025/

https://redd.it/1pb54ir
@r_devops
Our developers are moving faster, maybe too fast for our release model

Lately our release pace went crazy.
We used to ship twice a day. Now it’s ten… before anyone finishes their first coffee.

And it’s not magic. It’s Copilot/Cursor/Windsurf doing quiet fixes, bumping versions, patching stuff in the background. Honestly, at this point I’m not sure whether to thank the dev… or their AI agent.
(Yeah, classic Monday thoughts. Ignore me.)

never mind... anyway...
The problem is figuring out what’s actually in each release.
We tag Docker images with timestamps, but honestly, with this speed, the tags tell me nothing.
Sometimes I look at two images from the same morning and I have no idea what changed or why.

Are you seeing this too?
How do you keep track when dev + AI ship faster than you can understand the lineage of a single binary?

Because right now, I feel like the releases make sense to everyone except… me.

https://redd.it/1pb7dxv
@r_devops
KubeGUI - Release v1.9.7 with new features like dark mode, modal system instead of tabs, columns sorting (drag and drop), large lists support (7k+ pods battle tested), and new incarnation of network policy visualizer and sweet little changes like contexts, line height etc

KubeGUI is a free minimalistic desktop app for visualizing and managing Kubernetes clusters without any dependencies. You can use it for any personal or commercial needs for free (as in beer). Kubegui runs locally on Windows, macOS and Linux - just make sure you remember where your kubeconfig is stored.

Heads up - a bit of bad news first:

The Microsoft certificate on the app has expired, which means some PCs are going to flag it as “blocked.” If that happens, you’ll need to manually unblock the file.

You can do it using Method 2: Unblock the file via File Properties (right-click → Properties → check Unblock).

Quick guide here: https://www.ninjaone.com/blog/how-to-bypass-blocked-app-in-windows-10/

Now for the good news - a bunch of upgrades just landed:

>\+ Dark mode is here.
\+ Resource viewer column sorting added.
\+ All contexts are now parsed from the provided kubeconfigs.
\+ If KUBECONFIG is set locally, Kubegui will auto-import those contexts on startup.
\+ Resource viewer can now handle large amounts of data (tested on clusters with \~7k pods).
\+ Much simpler and more readable network policy viewer.
\+ Log search fix for Windows.
\+ Deployment logs added (to fetch the streams of all pods in the deployment).
\+ Lots of small UI/UX/performance fixes throughout the app.

\- Community \- r/kubegui

\- Site (download links on top): https://kubegui.io

\- GitHub: https://github.com/gerbil/kubegui (your suggestions are always welcome!)

\- To support the project (first goal: renewing the MS and Apple signing certs): https://github.com/sponsors/gerbil

Would love to hear your thoughts or suggestions — what’s missing, what could make it more useful for your day-to-day operations?

Check this out and share your feedback.

PS. no emojis this time! Pure humanized creativity xD

https://redd.it/1pb7ksm
@r_devops
Has DevOps become too complex? or are we just drowning in our own tooling?

Lately, it feels like every simple problem needs five different DevOps tools glued together. Is this normal now, or are we all quietly suffering?
How are you all keeping things sane in your setup?

https://redd.it/1pb9068
@r_devops
Network nerd trying to build a DevOps home lab with zero DevOps experience. I have two solid servers… but no idea where to start. Help me out?

Alright, so here’s my situation.

I’ve spent years breaking and fixing networks for fun. VLANs, firewalls, routers, VPNs, even running clusters in Proxmox… that’s my comfort zone. But DevOps? That whole universe feels like a different planet to me.

Still, I want to dive in.

Right now, I’ve got two decent machines sitting with me:
• One running Ubuntu 25
• One running Proxmox on good hardware (16 GB RAM, 256 GB and 3 TB of storage, and an i3-7100 CPU to run VMs)

And I thought… why not turn these into a proper DevOps playground?

The problem is simple.
I have no idea where to begin.
Like, literally no roadmap.

Everyone online throws around “CI/CD pipelines” and “Kubernetes clusters” like it’s a casual morning chai, but when I look at my setup I’m like:
Which server should run what? Should I start with Docker, Ansible, GitLab, Jenkins, Prometheus, Grafana, Rancher… or something totally different?

Since I’m coming from networking, I’m used to clear architecture. But here I feel like a first-year student again.

So I’m hoping the DevOps people here can point me in the right direction.

If you had:
• A Proxmox box with good specs
• An Ubuntu 25 server
• Zero DevOps experience but solid networking background
• And an interest in learning automation, CI/CD, containerization and monitoring

How would you design the starting layout?
What should I run on the Ubuntu machine?
What should I virtualize on the Proxmox machine?
And what’s the best beginner friendly path to actually learn all this without overwhelming myself?

Any guidance, starter stack ideas or “if I were you” suggestions are welcome.
Thanks in advance. I’m excited to get into this, just need someone to point me toward the first few steps.

https://redd.it/1pb8zjs
@r_devops
I built a public chatbox, do you want to try it?

It’s still a WIP, but it’s functional.
You can just type messages and they show up in real time.

There is nobody in the chat right now, that’s normal :D

For the moment it runs on my PC, but I think I’ll move the server to a Raspberry Pi or an old smartphone for the experience :) I think a smartphone used as a server would need very little energy.

If you want to try it, that would be extremely cool :) !!

>There is no saved chat history, and it’s anonymous.
A new random ID is generated for you to write under when you open the browser.

I just added a light anti-spam.

Thanks a lot in advance to anyone who tries it.
Sometimes the simplest tests make the biggest difference.

It’s very experimental. Do you have any ideas for where this app could go? :D

>The online version is on ***https://chat.glhf.be***

I have also created a Windows app in .NET, but it’s the same as the web page :)

Tell me what you think!

Cheeeeers
Joseph

https://redd.it/1pbb38h
@r_devops
AI intelligence debug tool. Need feedback.

I built a small tool to help debug test failures automatically. It pulls in your test runs and uses AI to surface flaky tests and failure clusters, and it provides a weekly dashboard, stability trends, and AI-powered run summaries.

If a few people can try it and tell me what sucks before I launch officially, I'd really appreciate it.

https://redd.it/1pbc7x9
@r_devops
$10K logging bill from one line of code - rant about why we only find these logs when it's too late (and what we did about it)

This is more of a rant than a product announcement, but there's a small open source tool at the end because we got tired of repeating this cycle.

Every few months we have the same ritual:

\- Management looks at the cost
\- Someone asks "why are logs so expensive?"
\- Platform scrambles to:
  \- tweak retention and tiers
  \- turn on sampling / drop filters

And every time, the core problem is the same:

\- We only notice logging explosions after the bill shows up
\- Our tooling shows cost by index / log group / namespace, not by lines of code
\- So we end up sending vague messages like "please log less" that don't actually tell any team what to change

In one case, when we finally dug into it properly, we realised:
\- The majority of the extra cost came from one or two log statements:
\- debug logs in hot paths
\- usage for that service gradually increased (so there were no spikes in usage)
\- verbose HTTP tracing we accidentally shipped into prod
\- payload dumps in loops

What we wanted was something that could say:

src/memory_utils.py:338 Processing step: %s
315 GB | $157.50 | 1.2M calls

i.e. "this exact line of code is burning $X/month", not just "this log index is expensive."

Because the current flow is:
\- DevOps/Platform owns the bill
\- Dev teams own the code
\- But neither side has a simple, continuous way to connect "this monthly cost" → "these specific lines"

At best, someone on the DevOps side greps through the logs, and the dev team might look at the results later if chased.

———
We ended up building a tiny Python library for our own services that:
\- wraps the standard logging module and print
\- records stats per file:line:level – counts and total bytes
\- does not store any raw log payloads (just aggregations)
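
To give a feel for the per-call-site aggregation idea, here is a minimal sketch built on a plain `logging.Handler`. It is not how the library itself is implemented (that wraps `logging` and `print`), and the class name `CallSiteStats` is made up for this example. It only keeps counts and byte totals per file:line:level, never the messages themselves.

```python
import logging
from collections import defaultdict


class CallSiteStats(logging.Handler):
    """Aggregates count and bytes per (file, line, level); stores no payloads."""

    def __init__(self):
        super().__init__()
        self.stats = defaultdict(lambda: {"count": 0, "bytes": 0})

    def emit(self, record: logging.LogRecord) -> None:
        key = (record.pathname, record.lineno, record.levelname)
        # Size of the formatted message; the message itself is discarded.
        size = len(record.getMessage().encode("utf-8"))
        entry = self.stats[key]
        entry["count"] += 1
        entry["bytes"] += size

    def top(self, n: int = 5):
        """Return the n call sites that logged the most bytes."""
        ranked = sorted(
            self.stats.items(), key=lambda item: item[1]["bytes"], reverse=True
        )
        return ranked[:n]


if __name__ == "__main__":
    handler = CallSiteStats()
    logger = logging.getLogger("demo")
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)

    for step in range(1000):
        logger.debug("Processing step: %s", step)
    logger.info("done")

    for (path, line, level), entry in handler.top():
        print(f"{path}:{line} [{level}] {entry['count']} calls, {entry['bytes']} bytes")
```

Keying on `pathname` and `lineno` from the `LogRecord` is what ties volume back to a specific line of code rather than to an index or namespace.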

Then we can run a service under normal load and get a report like this (plus Slack notifications):

Provider: GCP Currency: USD
Total bytes: 900,000,000,000 Estimated cost: 450.00 USD

Top 5 cost drivers:
\- src/memory\_utils.py:338 Processing step: %s... 157.5000 USD
...
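
For reference, the figures in this sample report are consistent with a flat rate of 0.50 USD per decimal GB (10^9 bytes): 900 GB comes to 450.00 USD and 315 GB comes to 157.50 USD. The rate is inferred from the sample numbers, not a quoted provider price, and the function names below are only illustrative. A minimal sketch of the arithmetic:

```python
BYTES_PER_GB = 1_000_000_000  # decimal GB; this is what makes the sample figures line up
PRICE_PER_GB_USD = 0.50       # assumption inferred from the sample report above


def estimated_cost_usd(total_bytes: int, price_per_gb: float = PRICE_PER_GB_USD) -> float:
    """Convert a byte total into an estimated ingestion cost."""
    return total_bytes / BYTES_PER_GB * price_per_gb


def share_of_total(part_bytes: int, total_bytes: int) -> float:
    """Fraction of the total volume attributable to one call site."""
    return part_bytes / total_bytes


if __name__ == "__main__":
    total = 900_000_000_000
    hot_line = 315 * BYTES_PER_GB
    print(f"total: {estimated_cost_usd(total):.2f} USD")
    print(f"hot line: {estimated_cost_usd(hot_line):.2f} USD")
    print(f"share: {share_of_total(hot_line, total):.0%}")
```

In this sample, the single `Processing step` line accounts for 35% of the total volume, which is the kind of specific statement that can be sent to the owning team.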

The interesting part for us wasn't "save money" in the abstract, it was:
\- Stop sending generic "log less" emails
\- Start sending very specific messages to teams:
"These 3 lines in your service are responsible for \~40% of the logging cost. If you change or sample them, you’ll fix most of the problem for this app."

\- It also fixes the classic DevOps problem of "I have no idea whether this log is important or not":

Platform can show cost and frequency,
Teams who own the code decide which logs are worth paying for.

It also runs continuously, so we don’t only discover the problem once the monthly bill arrives.

———

If anyone's curious, the Python piece we use is here (MIT): https://github.com/ubermorgenland/LogCost

It currently:

\- works as a drop-in for Python logging (Flask/FastAPI/Django examples, K8s sidecar, Slack notifications)
\- only exports aggregated stats (file:line, level, count, bytes, cost) – no raw logs

https://redd.it/1pbcuny
@r_devops
Best container image security tool for growing company?

Mentioned it here earlier but now leading a devops team following a quick departure by the person who hired me. That person completely ignored the Bitnami change to paid and now it’s up to me to figure out what to do. Not clear if this is one of the many reasons they were dismissed.

We’re using dozens of open source images like Python, ArgoCD, and Istio, and right now we're using Trivy for security scans but have been getting a crap ton of unnecessary vulnerability alerts.

I’m looking for something that handles vulnerability fatigue, CI/CD, etc., that doesn’t piss the team off. 

Are most of you just eating the cost of your base images on Bitnami and patching vulnerabilities yourself? If not, what container image tool are you using?

Dev count is \~50 and devops is 5 including myself.



https://redd.it/1pbd14u
@r_devops
Finally joined product company but in a bad team

I have always worked at mediocre companies in my career. I have faced issues on projects where the client was not happy, but I worked on them and ended up getting good feedback from the client.

I finally joined a product company, which I had always dreamed of. But this time I got into a very, very bad team, I'd say. It has been just a few months and I am not able to adjust. I start the day with anxiety and end it with no energy to do anything.

I have a good amount of cloud experience and have been in DevOps as well, but it feels very overwhelming. I am getting panic attacks in every scrum, retro, and refinement...



https://redd.it/1pbfifl
@r_devops
Is DevOps/R&D dynamics so tense for all of you?

I'm in the first year of my first DevOps position, and the relationship between us and the developers is so tense it's ridiculous. From my view, it seems like they are just lazy and not really owning their work. They’ll pick CPU and memory requests once in dev, ship, and then never think about it again. They don't load-test or profile, and then are very surprised when latency explodes at scale. I’m getting paged for their services because somehow the alerts are always “ops noise” instead of, you know, their code falling over.

A lot of my energy goes into being frustrated with them and their seeming inclination to first claim that anything wrong has to do with us. Then, if we check it and disagree, we need to make a court-worthy case in order to roll the problem back to them so they can fix whatever it is they didn't do well in the first place. Is it like that everywhere? Or is it just shitty culture in our org?

https://redd.it/1pbczlz
@r_devops
Event about shell scripting with NuShell in Ghent

I am posting here because I assume scripting with Bash (or other shells) is an essential part of DevOps. There is a new language on the block called NuShell that promises to improve on the Bash experience. I am giving a free interactive workshop in Ghent, Belgium on Wednesday. Anyone interested and in the area? You can sign up here

The content of the presentation (such as slides and exercises) is in a Git repository.



https://redd.it/1pbh9jj
@r_devops
The Hidden Cost of “Cold Starts”: Defeating EBS Lazy Loading in AI Pipelines

So I was working on ML workload optimization and ran into some issues around lazy start and first-touch latency. These are things we tend to miss when optimizing high-throughput pipelines, and a seemingly small detail can have a big impact. I wrote up my findings in this blog post. Hope it helps you guys.

https://dcgmechanics.medium.com/the-hidden-cost-of-cold-starts-defeating-ebs-lazy-loading-in-ai-pipelines-ff784febba74


https://redd.it/1pblub0
@r_devops
For those that declined a +50% raise from another company’s offer, why did you do it?

Current compensation including benefits insurance is $150k. I have an offer for $250k with room for growth, but the job would absolutely wreck me in regard to the amount of work I would be taking on. My current role is basically heaven in terms of workload. Both are fully remote.

https://redd.it/1pbowvm
@r_devops
Wanted: A simple event bus service similar to Eiffel

At my last employer we interfaced with an Eiffel service (https://e-pettersson-ericsson.github.io/eiffel-community.github.io/) to gather statistics about our CI/CD pipeline runs and then use that data to measure quality metrics for projects. It was an interesting setup, of which we were mostly just users and not the maintainers.

At my current employer we are trying to implement something similar for our CI/CD pipeline. A couple of my colleagues started with a PoC of the simplest possible use case (i.e. something custom). But I was wondering whether there is something out there that is *not* Eiffel, a bit simpler, but still open source, that we could look at instead of having to build and maintain something ourselves?

I spend quite a lot of time in the self-hosting community but haven't seen anything like it yet.

https://redd.it/1pbof06
@r_devops