Reddit DevOps – Telegram
I need advice on meaningful personal projects (developer + DevOps, tool-building focus)

I'm trying to decide what kind of personal project to build that will be meaningful for learning and possibly useful for job applications, but learning comes first. I've made many small projects while building my homelab, but I'm looking for something more, like actually creating my own tools.

I'm aiming for something that sits between developer and DevOps work.

I want to improve my coding skills and understand DevOps tools on a deeper level. I'm kind of sick of just using tools and not creating my own, if that makes sense.

Maybe I have the wrong take on this, but a comment I always get from older-generation engineers is how much they learned when they had to build their own tools. So I thought it would be cool to do the same.

I would be grateful for any guidance here; if my thinking is off, I'm open to hearing what I should focus on instead.


Some additional context: I've been a DevOps engineer for 4 years and recently became unemployed. I want to start a project, but everything I've seen online feels like something I've already done better versions of in real production environments.

https://redd.it/1q99x3q
@r_devops
Why the hell are devs still putting passwords in AI prompts? It's 2026!

Writing this because I keep seeing devs hardcode API keys and passwords directly in prompts during code reviews. Your LLM logs everything. Your prompts get cached. Your secrets end up in training data.

Use environment variables. Use secret managers. Sanitize inputs before they hit the model.

This should be basic security hygiene by now but apparently it needs saying.
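For the concrete version, a minimal sketch in Python; the env var name and the redaction regex here are illustrative, not any particular vendor's API:

```python
import os
import re

# The real key lives in the environment (or a secret manager), never in prompt text.
api_key = os.environ["LLM_API_KEY"]  # illustrative variable name

# Crude last line of defense: redact anything credential-shaped before it
# reaches the model (and therefore the provider's logs and caches).
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|password|token|secret)\b\s*[:=]\s*\S+")

def sanitize(prompt: str) -> str:
    return SECRET_PATTERN.sub(r"\1=[REDACTED]", prompt)

print(sanitize("deploy to prod, db password: hunter2"))
# -> deploy to prod, db password=[REDACTED]
```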

https://redd.it/1q9cw8r
@r_devops
Vendor selection: enterprise vs startup vs build your own?

Hey! Solopreneur here who just launched an observability SaaS.
Need honest feedback on how you make vendor decisions.

Three options with identical SLA and infrastructure:
Enterprise with high prices ($$$)
Small company/solo founder with moderate prices ($$)
Build your own (Prometheus, Grafana, Loki) ($)

Which do you choose and why?

Key questions:

How much does brand recognition matter (to you vs management)?
Hard requirements on vendor stability/longevity?
Is support team size important?
Build vs Buy: what tips the scale - control/customization or time-to-market/maintenance?

If self-hosted: how many FTEs maintaining your stack?

On integrations:
Unified dashboard - deal breaker or nice-to-have?
Alert integrations (PagerDuty, Slack)?
API access?

Appreciate any feedback, especially recent vendor selection or migration experiences

https://redd.it/1q9bu87
@r_devops
Open Source: Built a self-hosted PAM system - Looking for feedback

Hey r/devops!

I've been building Orion-Belt, an open-source Privileged Access Management system, and would love your feedback from folks who've dealt with SSH access at scale.

The problem we're solving:

After getting quoted $50k-$200k/year for commercial PAM solutions as a startup, we decided to build a self-hosted alternative that doesn't require enterprise budgets.

What it does:

- Zero inbound firewall rules: Agents use reverse SSH tunneling to dial out to the gateway (see the sketch after this list)

- Fine-grained access control: Specify which users can access which machines as which remote users (e.g., "Jane can SSH to prod-db as postgres")

- Session recording & audit trails: Full compliance logging for SOC2/ISO27001

- Temporary access workflows: Time-limited access with admin approval

- Standard SSH compatibility.
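For readers who haven't seen the pattern: the agent initiates the connection, so the protected host never accepts inbound traffic. A minimal illustration of the same idea with plain OpenSSH driven from Python (hostnames, ports, and usernames are made up; Orion-Belt itself does this natively with golang.org/x/crypto/ssh rather than shelling out):

```python
import subprocess

# The agent on the protected host dials OUT to the gateway (-N: no shell).
# -R asks the gateway's sshd to listen on port 2222 and relay every
# connection back through this outbound tunnel to the local sshd on :22.
subprocess.run(
    ["ssh", "-N", "-R", "2222:127.0.0.1:22", "agent@gateway.example.com"],
    check=True,
)
# An operator then reaches this machine via the gateway:
#   ssh -p 2222 someuser@gateway.example.com
```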

Tech stack:

- Backend: Go (Gin framework, golang.org/x/crypto/ssh)

- Permissions: ReBAC with OpenFGA

- Storage: PostgreSQL

- Deployment: Docker + systemd, multi-distro support

Current state: Core functionality working, deployed in production in our homelab/staging environments.

Why I'm posting: Before building more features, I want to validate we're solving real problems.

Questions for the community:

1. What's your current SSH access management strategy?

(SSH keys everywhere? Jump hosts? Commercial PAM? Something else?)

2. If you've looked at commercial PAM solutions, what stopped you from adopting them?

(Cost? Complexity? Vendor lock-in?)

3. What would make a tool like this worth adopting in your environment?

(Specific features? Integration points? Deployment model?)

GitHub: https://github.com/zrougamed/orion-belt

Looking for:

- Beta testers: Deploy it, break it, tell me what's missing

- Contributors: Go backend developers and Frontend/UI folks (currently no UI - WIP)

- Feedback: Honest criticism about architecture, features, docs

Happy to answer technical questions about the reverse tunneling implementation, session recording, or anything else!

https://redd.it/1q9k1kk
@r_devops
SMS as an alerting channel: who do you actually trust?

If SMS is your last-resort alert channel, which providers have actually been reliable for you in production?

https://redd.it/1q9af9l
@r_devops
Anyone else finding AI code review tools useless once you hit 10+ microservices?

We've been trying to integrate AI-assisted code review into our pipeline for the last 6 months. Started with a lot of optimism.

The problem: we run ~30 microservices across 4 repos. Business logic spans multiple services—a single order flow touches auth, inventory, payments, and notifications.

Here's what we're seeing:

- The tool reviews each service in isolation. Zero awareness that a change in Service A could break the contract with Service B.

- It chunks code for analysis and loses the relationships that actually matter. An API call becomes a meaningless string without context from the target service.

- False positives are multiplying. The tool flags verbose utility functions while missing actual security issues that span services.

We're not using some janky open-source wrapper—this is a legit, well-funded tool with RAG-based retrieval.

Starting to think the fundamental approach (chunking + retrieval) just doesn't work for distributed systems. You can't understand a microservices codebase by looking at fragments.

Anyone else hitting this wall? Curious if teams with complex architectures have found tools that actually trace logic across service boundaries.

https://redd.it/1q9tup1
@r_devops
Showcase: High-density architecture - Running 100+ containers on a single VPS with Traefik and FrankenPHP

Hi everyone,

I wanted to share a breakdown of the infrastructure I just built for a new SaaS project (a dependency health monitor).

As a DevOps consultant, I usually deal with K8s clusters, but for this project, I wanted to see how much performance I could squeeze out of a single VPS hosting multiple sites, using a Docker Compose stack.

The Architecture:
Currently running ~30 projects and close to 100 containers on one node with high availability.

Ingress/Routing: Traefik (Auto-discovery of new docker containers is a lifesaver).
Runtime: FrankenPHP + Laravel Octane. This runs the app as a long-running Go process rather than traditional PHP-FPM, keeping the application bootstrapped in memory.
Caching: aggressive 2-hour edge caching via Cloudflare to minimize the requests that reach the backend.
Storage: Redis for queues/cache.

The Workflow:
User Request -> Cloudflare (Edge) -> Traefik (VPS Ingress) -> FrankenPHP (App Container)

I wrote a blog post detailing the specific setup and how this stack handles the traffic:
**https://danielpetrica.com/how-i-built-a-high-performance-directory-with-laravel-octane-and-filament/**

Curious to hear your thoughts on pushing vertical scaling/Docker Compose this far versus moving to a small K8s cluster/Nomad setup. At what point do you usually force the switch?

https://redd.it/1q9y4tc
@r_devops
How liable are DevOps for redundancies in acquisitions (UK)?

Hi folks!

As the title says, my current company was acquired in the last week, and while this is an acquisition (financially), it is going to be a merger, i.e. our company merging into their company.

The next step in the integration phase, AFAIK, is a company restructure, and from what I've read, employees of the acquired company are more at risk than the acquirer's employees. That would make me more at risk.

The DevOps team I am in has 7 DevOps engineers, 1 DevOps tech lead and 1 team lead.

I believe their side has 4-5 DevOps engineers.

We host our product heavily on AWS, and from what I can see they use Azure.

My main questions here are:

1. Has anyone been in a similar situation?
2. If so, what happened? What side of the table were you on?
3. How "at risk" are DevOps engineers in a merger compared to other areas of the business?
4. Any other pointers you can give me? It's my first time in this situation.

I know that it differs company to company, but if I can get a general consensus of others' past experiences, then I can come to my own conclusion on whether or not I would be highly at risk.

Any comments are appreciated.

Thanks!

https://redd.it/1q9yrii
@r_devops
Headless browser sessions keep timing out after ~30 minutes. Has anyone managed to fix this?

I've been automating dashboard logins and data extraction with Puppeteer and Selenium for a while now. Single runs are solid, but once I scale to multiple tabs or let jobs run for hours, things start falling apart: sessions randomly expire, cookies disappear, tabs lose state, and accounts get logged out mid-flow. I've tried rotating proxies, custom user agents, persisted cookies, and even moved to headless=new. It helped a bit, but it's still not reliable enough for production workloads.

At this point I'm trying to understand what's actually causing the instability. Is it session isolation, anti-automation defenses, browser lifecycle issues, or something else entirely? Looking for approaches or tools that support long-lived, multi-account browser workflows without constant monitoring. Any real-world experience appreciated.
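For concreteness, one variable worth isolating is profile persistence: a full user-data dir keeps localStorage and service-worker state that exported cookies alone miss. A Selenium sketch with illustrative paths:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
# Persist the entire profile (cookies, localStorage, service workers) per
# account, so session state survives restarts instead of being rebuilt.
options.add_argument("--user-data-dir=/var/lib/bots/account-42")

driver = webdriver.Chrome(options=options)
driver.get("https://dashboard.example.com")
```

If runs are stable with one persisted profile per account but still fall apart with many tabs in a single browser, that points at session isolation rather than anti-automation defenses.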

https://redd.it/1qa1uvy
@r_devops
Self-host GitLab (GitOps) in k8s, or standalone?

Hi! Linux sysadmin and hobby programmer here. I'm learning IaC by converting my home infra using OpenTofu against Proxmox. I use workspaces to launch stages like dev (and staging etc. in the future). Figured it would be cool to orient everything around that... but since I'm going to learn/use Talos k8s next, I can't figure out how to deploy apps with the same workspace approach in mind, to avoid being repetitive and all that.

I've never automated via GitLab before, but I understand that what's called GitOps is used for automation, and it's baked into GitLab. So the thing I can't figure out is whether I should set up GitLab in k8s or standalone. The former means HA, but if k8s breaks then GitOps goes down with it, I assume. The latter skips the k8s dependency, but offers no HA.

Idk, maybe I'm overthinking this at such an early stage, but I would appreciate some insight into how others set up their self-hosted IaC-based IT.

Cheers!

https://redd.it/1qa67nj
@r_devops
Struggles at a new org

I've been a DevOps tech lead at an AWS shop for the past ~5 months, working with some senior engineers and a few juniors, and oh boy - the tech debt and org culture have me seriously reconsidering employment. I'm running into problems like:

- Company has a DevOps team that is treated exclusively like an Ops team. DevOps culture was never adopted and isn't practiced
- Lack of development ownership on product issues. Engineering management fails to hold their teams accountable and isn't responsive to issues in their domain
- Engineering team is comprised of a 50/50 split of contractors and full time engineers, with contractors taking a "that's not my job" approach to problems that bubble up outside of their usual work
- Some of the most spaghetti terraform I've ever had the displeasure of reading - in 0.11.15 no less
- No CI/CD - Terraform applies are done locally and software deployments are done by SSH'ing into a Jenkins host to run some wild chain of zsh scripts
- Chef 0.14.5 is being used to provision new EC2 instances
- Static SSH keys installed on hosts (no SSM)
- IAM users with an incomplete AWS SSO rollout
- A contracting DevOps company was hired to start an EKS migration, but they're at the point of throwing in the towel because of the complexity
- To top it all off, a manager with no technical experience and no spine. I'm not sure how he's still here given his passive nature and lack of ability to lead a team towards change

It would be easier if I were only solving technical issues, but this is both technical and cultural. It feels like a huge step back in my career to go back to managing EC2 instances like pets instead of cattle. As a lead, I'm trying my best to get my manager to understand what a DevOps team is and how it should operate, but I'm having a hard time reaching him. He literally manages his team communication through AI, as English isn't his first language; it's quite frustrating to say the least.

When I have time, I've been trying to get them off Terraform 0.11.15 and fix their drift so that there's a standard way for everyone to run things on their local machines, as well as a folder structure that makes sense - with CI following once things are more consistent. Outside of that, I've been "voluntold" onto a few tiger teams to help a few product features get off the ground, as I have the keys to the kingdom and can keep developers unblocked.

There's no platform and no structure.

With this situation, do others have experiences on how I could go about tackling the challenges at this org? I'm quite stressed at the moment. Thanks!

https://redd.it/1qb6w2v
@r_devops
Our CI strategy is basically "rerun until green" and I hate it

The current state of our pipeline is gambling.

Tests pass locally. Push to main. Pipeline fails. Rerun. Fails again. Rerun. Oh look it passed. Ship it.

We've reached the point where nobody even checks what failed anymore. Just click retry and move on. If it passes the third time, clearly there's no real bug, right?

I know this is insane. Everyone knows this is insane. But fixing flaky tests takes time and there's always something more urgent.

Tried adding more wait times. Tried running in Docker locally to match the CI environment. Nothing really helped. The tests are technically correct; they're just unreliable in ways I can't pin down.
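One way to make the problem measurable rather than anecdotal before debating tools: rerun the unchanged suite on an idle runner and count verdict flips. A crude sketch (the pytest invocation is illustrative):

```python
import subprocess

# Rerun the unchanged suite N times; any mix of green and red is flakiness
# by definition, and tracking the ratio over time tells you whether it's
# getting better or worse.
RUNS = 10
greens = sum(
    subprocess.run(["pytest", "-q", "tests/"]).returncode == 0
    for _ in range(RUNS)
)
print(f"{greens}/{RUNS} green; anything other than 0 or {RUNS} means flaky tests")
```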

One of the frontend devs keeps pushing to switch tools entirely. Been looking at options like Testim, Momentic, maybe even just rewriting everything in Playwright. At this point I'd try anything if it means people stop treating retry as a debugging strategy.

Anyone actually solved this or is flaky CI just something we all live with?

https://redd.it/1qas4ft
@r_devops
One end-to-end DevOps project to learn almost all tools together?

Hey everyone,

I'm a DevOps beginner. I've covered the theory, but now I want hands-on experience.

Instead of learning tools separately, I'm looking for ONE consolidated, end-to-end DevOps project where I can see how tools work together, like:

Git → CI/CD (Jenkins/GitLab) → Docker → Kubernetes → Terraform → Monitoring (Prometheus/Grafana) on AWS.

A YouTube series, GitHub repo, or blog + repo is totally fine.

Goal is to understand the real DevOps flow, not just run isolated commands.

If you know any solid project or learning resource like this, please share 🙏

Thanks!

https://redd.it/1qarrve
@r_devops
What DevOps and cloud practices are still worth adding to a live production app?

Hello everyone, I'm totally new to DevOps.
I have a question about applying DevOps and cloud practices to an application that is already in production and actively used.
Let's assume the application is already finished, stable, and running in production. I understand that not all DevOps or cloud practices are equally easy, safe, or worth implementing late, especially things like deep re-architecture, Kubernetes, or full containerization.
My question is: what DevOps and cloud concepts, practices, and tools are still considered late-friendly, low-risk, and truly worth implementing on a live production application? (This is for learning and hands-on practice, not a formal or professional engagement.)
Also, if anyone has advice on learning DevOps, that would be appreciated :))

https://redd.it/1qb9mcf
@r_devops
Observability for AI Models and GPU Inferencing

Hello Folks,

I need some help with observability for AI workloads. For those of you who run your own ML models and AI workloads on your own infrastructure, how are you handling observability? I'm specifically interested in the inferencing part: GPU load, VRAM usage, processing, and throughput. How are you achieving this?

What tools or stacks are you using? I'm currently working in an AI startup where we process a very high number of images daily. We have observability for CPU and memory, and APM for code, but nothing for the GPU and inferencing part.

What kind of tools can I use here to build a full GPU observability solution, or should I go with a SaaS product?
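For context, the usual self-hosted route is NVIDIA's DCGM exporter scraped by Prometheus into the Grafana you already have; the counters involved are the ones NVML exposes. A minimal sketch of reading them directly (assumes an NVIDIA GPU and the nvidia-ml-py package):

```python
from pynvml import (
    nvmlInit, nvmlShutdown, nvmlDeviceGetCount, nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo, nvmlDeviceGetUtilizationRates,
)

nvmlInit()
for i in range(nvmlDeviceGetCount()):
    handle = nvmlDeviceGetHandleByIndex(i)
    mem = nvmlDeviceGetMemoryInfo(handle)         # VRAM used/total, in bytes
    util = nvmlDeviceGetUtilizationRates(handle)  # GPU and memory busy %
    print(f"gpu{i}: util={util.gpu}% vram={mem.used}/{mem.total}")
nvmlShutdown()
```

For the inferencing side (throughput, latency, queue depth), serving stacks such as Triton and vLLM expose their own Prometheus metrics endpoints, which closes the gap between raw GPU counters and your existing APM.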

Please suggest.

Thanks

https://redd.it/1qb51ph
@r_devops
Deterministic analysis of Java + Spring Boot + Kafka production logs

I’m working on a **Java tool that analyzes real production logs** from **Spring Boot + Apache Kafka** services.

This is **not an auto-fixing tool** and not a tutorial.

The goal is **fast incident classification + safe recommendations**, the way an experienced on-call / production engineer would reason.

**Example: Kafka consumer JSON deserialization failure**

**Input (real Kafka production log):**

Caused by: org.apache.kafka.common.errors.SerializationException:
Error deserializing JSON message
Caused by: com.fasterxml.jackson.databind.exc.InvalidDefinitionException:
Cannot construct instance of `com.mycompany.orders.event.OrderEvent`
(no Creators, like default constructor, exist)
at [Source: (byte[])"{"orderId":123,"status":"CREATED"}"; line: 1, column: 2]

**Output (tool result)**

Category: DESERIALIZATION
Severity: MEDIUM
Confidence: HIGH
Root cause: Jackson cannot construct target event class due to missing creator or default constructor.
Recommendation: Add a default constructor or annotate a constructor

**Example fix:**

public class OrderEvent {

    private Long orderId;
    private String status;

    public OrderEvent() {}

    public OrderEvent(Long orderId, String status) {
        this.orderId = orderId;
        this.status = status;
    }
}

# Design goals

* Known **Kafka / Spring / JVM failures** detected via **deterministic rules** (see the sketch after this list)
  * Kafka rebalance loops
  * schema incompatibility
  * topic not found
  * JSON deserialization errors
  * timeouts
  * missing Spring beans
* **LLM assistance is strictly constrained**
  * forbidden for infrastructure issues
  * forbidden for concurrency / threading
  * forbidden for binary compatibility (e.g. `NoSuchMethodError`)
* Some failures must **always** result in:
  * **No safe automatic fix, human investigation required.**
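To make "deterministic rules" concrete, a much-simplified sketch of the idea in Python (the tool itself is Java; the categories and messages below are illustrative, borrowed from the examples in this post):

```python
import re

# Hypothetical rule table: a deterministic rule is just a signature regex
# mapped to (category, severity, recommendation). First match wins.
RULES = [
    (re.compile(r"InvalidDefinitionException: Cannot construct instance"),
     ("DESERIALIZATION", "MEDIUM", "Add a default constructor or a creator")),
    (re.compile(r"UnknownTopicOrPartitionException"),
     ("TOPIC_NOT_FOUND", "HIGH", "Check topic name and auto-create settings")),
    (re.compile(r"NoSuchMethodError"),
     ("BINARY_COMPATIBILITY", "HIGH",
      "No safe automatic fix, human investigation required.")),
]

def classify(log: str):
    for signature, verdict in RULES:
        if signature.search(log):
            return verdict
    return ("UNKNOWN", "N/A", "No safe automatic fix, human investigation required.")
```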

This project is **not about auto-remediation** and explicitly avoids "AI guessing fixes".
It’s about **reducing cognitive load during incidents** by:

* classifying failures fast
* explaining *why* they happened
* only suggesting fixes when they are provably safe



**GitHub (WIP):**
[https://github.com/mathias82/log-doctor](https://github.com/mathias82/log-doctor)

# Looking for feedback from DevOps / SRE folks on:

* Java + Spring Boot + Kafka related failure coverage
* missing rule categories you see often on-call
* where LLMs should be **completely disallowed**

Production war stories very welcome 🙂

https://redd.it/1qbllc2
@r_devops
Azure VM auto-start app

Azure has auto‑shutdown for VMs, but no built‑in “auto‑start at 7am” feature. So I built an app for that - VMStarter.

It’s a small Go worker that:

• discovers all VMs across any Azure subscriptions it has access to

• sends a start request to each one — **no need to specify VM names**

• runs cleanly as a scheduled Azure Container Apps Job (cron)
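For a feel of the control-plane calls involved, the same discover-and-start loop in Python with the Azure SDK would look roughly like this (untested sketch; the actual worker is Go):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import SubscriptionClient
from azure.mgmt.compute import ComputeManagementClient

cred = DefaultAzureCredential()

# Enumerate every subscription the identity can see, then every VM in each.
for sub in SubscriptionClient(cred).subscriptions.list():
    compute = ComputeManagementClient(cred, sub.subscription_id)
    for vm in compute.virtual_machines.list_all():
        rg = vm.id.split("/")[4]  # resource-group segment of the VM id
        compute.virtual_machines.begin_start(rg, vm.name)  # async start request
```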

How to deploy: https://github.com/groovy-sky/vm-starter#deployment-script

Docker image: https://hub.docker.com/repository/docker/gr00vysky/vm-starter

Any feedback/PRs welcome.

https://redd.it/1qbmr2s
@r_devops
Spark stage cost breakdown on AWS (why distributed tracing isn't helping & how to fix it)

Tempo has been a total headache lately. I’ve been staring at Spark traces in there for weeks now, and I’m honestly coming up empty.

What I really want is simple: a clear picture of which Spark stages are actually driving up our costs.

Here’s the thing… poorly optimized Spark jobs can quietly rack up massive bills on AWS. I’ve seen real-world cases where teams cut infrastructure costs by over 100x on critical pipelines just by pinpointing inefficiencies, and others achieve 10x faster runtimes with dramatically lower spend.

We’re aiming to tie stage-level resource usage directly to real AWS dollar figures, so we can rank priorities and tackle the biggest optimizations first. Right now, though, it just feels like we’re gathering traces with no real insight.

I still can’t answer basic questions like:

Which stages are consuming the most CPU, memory, or disk I/O?
How do we accurately map that to actual spend on AWS?

Here's what I've tried:

Running the OTel Java agent and exporting to Tempo -> massive trace volume, but the spans don’t align meaningfully with Spark stages or resource usage. Feels like we’re tracing the wrong things entirely.
Spark UI -> perfect for one-off debugging, but not practical for ongoing cost analysis across production jobs.

At this point, I’m seriously questioning whether distributed tracing is even the right approach for cost attribution.

Would we get further with metrics and Mimir instead? Or is there a smarter way to structure Spark traces in Tempo that actually enables proper cost breakdown?
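One fallback worth considering: the numbers the Spark UI renders are also exposed as JSON by Spark's monitoring REST API (the history server serves the same API), so stage-level cost ranking can be scripted without tracing at all. A rough sketch, where the driver address and the per-core-hour rate are assumptions:

```python
import requests

BASE = "http://spark-driver:4040/api/v1"  # or your history server
CORE_HOUR_USD = 0.05                      # illustrative blended rate

app_id = requests.get(f"{BASE}/applications").json()[0]["id"]
stages = requests.get(f"{BASE}/applications/{app_id}/stages").json()

# executorRunTime is summed task time for the stage, in milliseconds.
for st in sorted(stages, key=lambda s: s["executorRunTime"], reverse=True)[:10]:
    core_hours = st["executorRunTime"] / 3_600_000
    print(f"stage {st['stageId']:>4} ~${core_hours * CORE_HOUR_USD:.2f} {st['name'][:50]}")
```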

I’ve read all the docs, watched the talks, and even asked GPT, Claude, and Mistral for ideas… I’m still stuck.

Any advice or experience here would be hugely appreciated.

https://redd.it/1qbnszj
@r_devops
Built a Real Estate Market Intelligence Pipeline Dashboard using Python + Power BI (Learning Project)

This is a learning project where I attempted to build an end-to-end analytics pipeline and visualize the results using Power BI.

Project overview:

I designed a simple data pipeline using static real estate data to understand how different tools fit together in an analytics workflow, from raw data collection to business-facing dashboards.

Pipeline components:

• GitHub – used as the source for collecting and storing raw data

• Python – used for data cleaning, transformation, and basic processing

• Power BI – used for building the Market Intelligence dashboard

• n8n – used for pipeline orchestration (pipeline currently paused due to technical issues at the automation stage)

Current status:

The pipeline is partially implemented. Data extraction and processing were completed, and the final dashboard was built using the processed data. Automation via n8n is planned but temporarily halted.
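To give a feel for the Python step, a minimal pandas sketch of the kind of cleaning and aggregation behind the dashboard numbers (file and column names are assumed):

```python
import pandas as pd

# Assumed raw schema: price, location, bedrooms, sqft
df = pd.read_csv("listings.csv").dropna(subset=["price", "sqft"])

price_overview = df["price"].agg(["mean", "median", "min", "max"])
by_location = df.groupby("location")["price"].mean().sort_values(ascending=False)
by_bedrooms = df["bedrooms"].value_counts().sort_index()
price_per_sqft = (df["price"] / df["sqft"]).mean()
```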

Dashboard focus:

• Price overview (average, median, min, max)

• Location-wise price comparison

• Property distribution by number of bedrooms

• Average price per square foot

• Business-oriented insights rather than purely visual design

This project was done independently as part of learning data pipelines and analytics workflows.

I’d appreciate constructive feedback—especially on pipeline design, tooling choices, and how this could be improved toward a more production-ready setup.

https://redd.it/1qboimq
@r_devops
My review of Orca Security for cloud-based vuln management

Been a Tenable shop for vuln management for years, brought on Orca about a year ago. Figured I'd share what I've found.
Context: 80+ AWS accounts at any given time. QoL for multi-account handling matters a lot - main reason we moved off Tenable.

Orca's been overall good, but not without faults. UI gets sluggish when you're filtering across everything - annoying but livable.

Query language took me longer than it should have to get comfortable with, ended up bugging our CSM more than I wanted to early on.

Once you're past that though, day-to-day is good. Less painful than I expected at our scale.

As I said at the start, main use is vuln management and that hasn't let me down yet.

Agentless scanning works, good enough exploitability context, multi-account handling is better than what we had, or at least less annoying to deal with.

Alerting took some tuning to not be noisy as hell but once it's dialed it stays dialed.

Other stuff worth mentioning:

Exports: no weird formatting when pulling compliance reports, which is more than I can say for some tools
Deleted resources: clears out fast, not chasing ghosts
Attack paths: actually useful for explaining risk to non-security people, good for getting buy-in
Dashboards: CVE data populates clean, prioritization logic makes sense without having to customize everything

Overall, not a perfect tool but it's been a net positive. Does what I need it to do.

https://redd.it/1qbpaay
@r_devops
I need feedback on an open-source CLI that scans AI models (Pickle, PyTorch, GGUF) for malware, verifies HF hashes, and checks licenses

Hi everyone,

I've created a new CLI tool to secure AI pipelines. It scans models (Pickle, PyTorch, GGUF) for malware using stack emulation, verifies file integrity against the Hugging Face registry, and detects restrictive licenses (like CC-BY-NC). It also integrates with Sigstore for container signing.
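For context on why pickle scanning needs more than a grep: pickle is a small stack VM whose import/call opcodes are what make loading untrusted models dangerous. A toy stdlib illustration, unrelated to Veritensor's actual stack-emulation internals:

```python
import os
import pickle
import pickletools

# Pickling a function stores an import reference; unpickling would import
# and expose os.system. We only inspect the bytes, never pickle.loads them.
payload = pickle.dumps(os.system, protocol=0)

SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ"}
for opcode, arg, _pos in pickletools.genops(payload):
    if opcode.name in SUSPICIOUS:
        print("flagged:", opcode.name, arg)  # e.g. flagged: GLOBAL posix system
```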

GitHub: https://github.com/ArseniiBrazhnyk/Veritensor
Install: pip install veritensor

If you're interested, check it out and let me know what you think and whether it might be useful to you.

https://redd.it/1qbpjja
@r_devops