Reddit DevOps – Telegram
How do you catch cron jobs that "succeed" but produce wrong results?

I've been dealing with a frustrating problem: my cron jobs return exit code 0, but the actual results are wrong.

I'm seeing cases where scripts complete successfully but produce incorrect or incomplete results:

* Backup script completes successfully but creates empty backup files
* Data processing job finishes but only processes 10% of records
* Report generator runs without errors but outputs incomplete data
* Database sync completes but the counts don't match
* File transfer succeeds but the destination file is corrupted

The logs show "success" - exit code 0, no exceptions - but the actual results are wrong. The errors might be buried in logs, but I'm not checking logs proactively every day.

What I've tried:

1. Adding validation checks in scripts - works, but you have to modify every script, and changing thresholds requires code changes. Also, what if the file exists but is from yesterday? What if you need to check multiple conditions?
2. Webhook alerts - requires writing connectors for every script, and you still need to parse/validate the data somewhere
3. Error monitoring tools (Sentry, Datadog, etc.) - they catch exceptions, not wrong results. If your script doesn't throw an exception, they won't catch it
4. Manual spot checks - not scalable, and you'll miss things

The validation-in-script approach works for simple cases, but it's not flexible. You end up mixing monitoring logic with business logic. Plus, you can't easily:

* Change thresholds without deploying code
* Check complex conditions (size + format)
* Centralize monitoring rules across multiple scripts
* Handle edge cases like "file exists but is corrupted" or "backup is from yesterday"

I built a simple monitoring tool that watches job results instead of just execution status. You send it the actual results (file size, record count, status, etc.) via a simple API call, and it alerts if something's off. No need to dig through logs, and you can adjust thresholds without deploying code.
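
To make the idea concrete, here is a minimal sketch of result-based checking, not the actual tool. The `Rule` schema, the metric names (`file_size_bytes`, `age_hours`, `record_count`) and the threshold values are all assumptions made up for illustration. The point is only that rules live outside the job scripts, so thresholds can change without redeploying them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """One threshold for a reported metric (hypothetical schema)."""
    metric: str
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def check_results(results: dict, rules: list) -> list:
    """Return human-readable violations; an empty list means the run looks sane."""
    problems = []
    for rule in rules:
        value = results.get(rule.metric)
        if value is None:
            # A job that stops reporting a metric is itself suspicious.
            problems.append(f"{rule.metric}: not reported")
            continue
        if rule.min_value is not None and value < rule.min_value:
            problems.append(f"{rule.metric}: {value} is below minimum {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            problems.append(f"{rule.metric}: {value} is above maximum {rule.max_value}")
    return problems


# Assumed example rules; in practice these would be loaded from a config
# file or database rather than hardcoded next to the job.
BACKUP_RULES = [
    Rule("file_size_bytes", min_value=1_000_000),  # catches empty backups
    Rule("age_hours", max_value=24),               # catches "file is from yesterday"
    Rule("record_count", min_value=9_000),         # catches partial processing
]

if __name__ == "__main__":
    reported = {"file_size_bytes": 0, "age_hours": 2, "record_count": 10_000}
    for problem in check_results(reported, BACKUP_RULES):
        print("ALERT:", problem)
```

Each job only has to report its numbers after it finishes; combining several rules per job covers the "multiple conditions" case without touching the job's own logic.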

How do you handle similar cases in your environment?

https://redd.it/1qrjfqc
@r_devops
AWS vs Azure - learning curve.

So... sorry, I don't mean to hate on Azure, but why is it so hard to grasp?

Here's my example: I'm breaking into cloud architecture and have been trying to create serverless workflows. Mind you, I already have a solid understanding, as I currently work in IT.

Azure Functions gave me endless problems, and I never got it working. The function never got triggered. Azure provided no help in the form of tips etc. Certain function plans are not allowed on the free tier, so there are just so many hoops to jump through. Sifting through logs is daunting, as apparently you have to set up queries to see logs.


With AWS, on the other hand, I was able to get my app up and running within 2 hours. There was so much help just from AWS's basic tips and suggested help articles.

Am I the only one who feels this way about Azure?



https://redd.it/1qrl93k
@r_devops
Asked for honest feedback last month, got it, spent January actually fixing things

A few weeks ago I posted here about OpsCompanion. You told me where it sucked and what was cool. I appreciate everyone who took the time to try it.

I was an SRE at Cloudflare. I know that behind every issue is a real person just trying to do their job: keeping things secure, helping devs out, or dealing with stuff getting thrown over the fence.

And now everyone is vibe coding with zero context or concern about prod. Honestly I am a little worried about where this is all headed.

I see what we are all dealing with and I want to help. I would love to hear what would actually make your days easier. Really. Not just another AI SRE thing.

Check it out: https://opscompanion.ai/

If it still sucks, let me know and I will fix it.

https://redd.it/1qroxnu
@r_devops
Will this AWS security project add value to my resume?

Hi everyone,

I’d love your input on whether the following project would meaningfully enhance my resume, especially for DevOps/Cloud/SRE roles:

Automated Security Remediation System | AWS

* Engineered event-driven serverless architecture that auto-remediates high-severity security violations (exposed SSH ports, public S3 buckets) within 5 seconds of detection, reducing MTTR by 99%
* Integrated Security Hub, GuardDuty, and Config findings with EventBridge and Lambda to orchestrate remediation workflows and SNS notifications
* Implemented IAM least-privilege policies and CloudFormation IaC for repeatable deployment across AWS accounts
* Reduced potential attack surface exposure time from avg 4 hours to <10 seconds

Do you think this project demonstrates strong impact and would stand out to recruiters/hiring managers? Any suggestions on how I could frame it better for maximum resume value?

Thanks in advance!

https://redd.it/1qrw38y
@r_devops
How do you manage database access?

I've worked at a few different companies. Each place had a different approach for sharing database credentials for on-call staff for troubleshooting/support.

Each team had a set of read-only credentials, but credentials were openly shared (usually on a public password manager) and not rotated often. Most of them required VPNs though.

I'm building a tool for managed, credential-less database access (will not promote here).

I'm curious what other best practices teams follow.

https://redd.it/1qsjswf
@r_devops
From QA to DevOps - What’s your advice?

Hi everyone,

I’m currently working as a Software Quality Engineer with a background in test automation, and I’m planning to transition into a DevOps role within the next 1-2 years in the EU job market.

I already have hands-on experience with:

* Docker
* Linux
* Some Kubernetes basics
* Some basics of CI/CD pipelines (GitLab, GitHub Actions)
* Grafana & Prometheus
* Networking

My background is mainly in automation, scripting, and system reliability from a QA perspective. I’m now trying to identify the most effective next steps to become a solid DevOps candidate in Europe.

For those who’ve made a similar move (QA/SDET → DevOps), especially in the EU:

* Which skills or tools should I prioritize next (I am currently getting deeper into Kubernetes)?
* What kind of practical projects actually help in EU hiring processes?
* Are certifications (e.g. AWS, CKA, etc.) valued, or is experience king?
* How can I best position my QA background as an advantage?

https://redd.it/1qsi7kl
@r_devops
Honestly, would you recommend the DevOps path?

This isn't one of those "DevOps or other coolnoscript.txt?" questions per se. I'm wondering if you'd genuinely recommend the path to becoming a DevOps engineer. Are you happy where you are? Are the hours making you question your life choices? I'm looking forward to hearing genuine personal opinions.

I have a networking background and I currently work as a network engineer. I have several Cisco, AWS and Azure certifications and I have been doing this for a while. I fell in love with networking instantly and I still love it to this day. However it's a lot of the same and I have to travel/be away from my family more than I'd like. I have diagnosed ADHD which I am medicated for and it's been a blessing in my life. However, it's no secret that we get extra bored of repetitive tasks if there's nothing new and exciting.

Here I feel like the DevOps career is something that could be right up my alley, the amount of knowledge you need to have to just get started, the constantly changing environment, the never ending learning and the fact that there always seems to be something to do. Please correct me if I'm wrong.

I am now eligible for a "scholarship" of sorts to get a 2-year DevOps education for free, and I wonder if you'd take that chance if you were me. I was super excited until I realised that I have barely done any coding. Sure, there are coding courses in this education, but there are also many other things. Since I already have experience in some of the other topics covered, I could focus more on the coding aspect. Do you think two years will be enough to get into a junior DevOps role without being a burden to said company?

Thank you for your time.

/M

https://redd.it/1qssoqt
@r_devops
Astrological CPU Scheduler with eBPF

Someone built a Linux CPU scheduler that makes scheduling decisions based on planetary positions and zodiac signs, using eBPF and sched_ext... and it works! Obviously not something to run in production, but still a fun idea to play around with.

"Because if the universe can influence our lives, why not our CPU scheduling too?"


https://github.com/zampierilucas/scx_horoscope

https://redd.it/1qrzpbr
@r_devops
Getting pigeon-holed in my career - Need advice

A little background on myself: I have been working for the same company, on the same team, since I graduated a few years ago. I got an internship with them while I was studying CS and was lucky enough to get a full-time role with the same team as soon as I graduated. The issue is that this is a small team that purely does infrastructure automation for a big bank. I work with other infrastructure engineering teams, help automate many of their flows, and turn them into Ansible pipelines. My company doesn’t even have Terraform; we use Azure’s built-in Bicep for cloud IaC and Ansible for on-prem IaC. I have minimal exposure to cloud and have only done a few automations and integrations with it.

With this job I have become an Ansible expert, and I am now knowledgeable about the basics of infrastructure engineering, especially on-prem. However, I don’t see a path upwards in my career, and I want advice on how to break out of this pigeonhole as an Ansible automation expert and move into more conventional Cloud/DevOps engineering.

What are maybe some certs I can pursue? What are some other ways to take my skill and expand on it? Just feeling stuck…

https://redd.it/1qspv6s
@r_devops
Underground office has a sulfuric smell

I work in a windowless office one floor underground, about 3 m (10 ft) from a large server room. There’s also a large cabinet nearby with a tunnel underneath it carrying thick cables.

Last week the room smelled sulfurous—like fart gas or stovetop gas. Not the extremely putrid “rotten egg” H₂S smell, but a persistent sulfur/fart odor.

A building inspector initially dismissed it, but a safety professional later tested the room with a 4-gas detector (O₂, CO, NO₂, SO₂, H₂S). No detectable H₂S was found, even though the odor is still present after testing.

Building management claims renovations upstairs may have disturbed plumbing (e.g., drilling into a sewer pipe), and advised heavy ventilation. It didn't smell this morning, but it's almost lunch and I can smell it again. Now my head hurts.

My concern is whether it’s safe to continue working there. If this were H₂S, smell would not be a reliable warning at dangerous concentrations, so I’m trying to assess risk without being paranoid.

I've used a career tag cause this seems to be more metawork than actual work.

https://redd.it/1qsvcqe
@r_devops
Who owns GitHub/vcs policies and compliance at your company?

Like specific things in GitHub settings such as which branches should be protected (when you have multiple orgs and those orgs all disagree on which branches should be protected), etc.

https://redd.it/1qsp2jo
@r_devops
Update to my “AI was implemented as a trial in my company, and it's scary.”

I made a [post](https://www.reddit.com/r/devops/s/rgLaBXNe7W) here a couple of months ago about my company experimenting with implementing AI. This post is an update on how it went and what happened.

The company stopped hiring any “infra personnel” and started utilizing AI to do things like create and configure some AWS machines and VPCs by just talking with the agent (using the CLI) with specific IAM policies just in case.

I thought this was just a problem with the company I am in but everyone I know has almost the exact same thing. I am not working anymore, I either use AI or when I start to use my brain, everyone around me answers with AI. I am not an angel, I am a junior that can’t learn properly because no one wants to, everyone wants AI and less human error.

The only thing it failed at was deep architecture like database migration and specific clustering, but everything else it simply just does it and when it doesn’t, we only have to do maybe a single thing to fix it.

I am leaving DevOps as a field and getting into security (I was really interested in it before), but I genuinely feel like I was trolled and did nothing, and maybe soon even security will be replaced with AI.

This post may be stupid to seniors, but as a junior and people starting, this is reality. We don’t learn, we don’t grow, we are the ones getting replaced and I see no field being currently resistant to that. I will just get into moltbook and doom scroll.

Thank you to everyone who helped me pave my DevOps path. It is really one of the best fields I’ve ever been in, and I'm honored to have been here even if just for a short while. Hopefully where I live is the problem and not the entire planet.

https://redd.it/1qsy1ln
@r_devops
European infrastructure engineers - What's happening inside your companies regarding your dependency on US hyperscalers?

Everybody follows the news and sees what's going on.

In the Netherlands, this has sparked a debate on our dependence on US tech, specifically AWS, Azure, and GCP, for businesses and the government. Management at my workplace (a medium-sized SaaS business) has instructed the operations team to start planning an exit strategy.

We will probably stay with AWS for the time being but will slowly move everything towards OSS components as long as it's a feasible option. This shift was already initiated last year by moving towards Kubernetes, but we still use a dozen AWS services. It's going to take some time to move to a more portable architecture.

I'm wondering: what's going on in your company or team? Do you think this trend will last?

https://redd.it/1qsyjdw
@r_devops
Almost twice (2x) the salary but high workload. Should I accept the new offer?

I have around 4-5 years of experience, and I'm in my late 20s, not married. Recently, I got a job offer from a startup, and I’m wondering whether I should accept it. Let me be brief.

The new offer’s take-home salary is almost twice my current take-home salary: an 80% increase. It’s a big jump as I see it. For my experience, I’m pretty sure this is above the market range in my country, and it’s difficult to find this kind of job. The downsides are high workload and high risk.

So let me compare the current one and the new one.

Current job:

* 2 days per week in the office, with EPF, ETF, OPD, and insurance coverage.
* I’m a permanent employee with a 3-month notice period, so job security is high.
* The current company is large and spread across multiple countries, with 1500+ employees.
* Tech stack is good (Azure, ArgoCD, AKS, GitOps, LGTM stack, etc.).
* Culture is a bit toxic and not supportive at all. I’ve actually been looking for a good job for a while.
* Major releases happen 2 times per month.

New Job:

* Fully remote, USD salary, but no OPD/insurance coverage.
* Notice period is pretty short: 8 days during probation and 4 weeks after probation, so job security is pretty low as well.
* It’s a startup with a Sri Lankan team and employees in other countries as well. It seems to be growing okay and is funded.
* Tech stack is OK/good (AWS, ECS, GitHub Actions, CloudWatch, etc.).
* Culture: I’m not so sure. It seems better than the current job.
* Releases happen every week.

Both have a similar amount of weekend work, about once every 2 months.

What I know is that the salary increase is high (80%), and the workload is high as well. From what I heard, a few days per week I may have to work 12+ hours per day, maybe even more, since this is a startup.

The current job’s workload is also getting higher at times, but I believe the new one will be considerably higher. The new job’s security is also pretty low, with the shorter notice period.

For me it’s a high-risk, high-income, high-stress/high-workload job.

Should I accept the new offer? What’s your opinion? I’d like to hear from experienced people in the industry.

https://redd.it/1qt0aca
@r_devops
Linux packages - v2026.02.01 - Versions, files and directories

In operating systems with shared dependencies, we often don't know which package or version a particular file came from. This is a recurring problem in my daily work. That's why I created an index of all the packages from the Arch Linux, Artix Linux, BlackArch Linux, and CachyOS repositories.

It is in the public domain and is updated monthly.

https://archive.org/details/packages_202602

https://redd.it/1qsyygh
@r_devops
My team should be renamed to talkOps

Some days I spend more time talking about reliability than actually improving it.

Standups, syncs, postmortems, pre-mortems, planning, re-planning, alignment calls... and by the time I get a quiet hour, I'm already drained.

I get that communication matters, but at some point the work needs focus.

How do you protect deep work time without looking "unavailable"?

https://redd.it/1qvzhiv
@r_devops
Audits keep pulling senior engineers into work only they can explain

Growing tired of these audit cycles. We plan ahead and just when we think we’re ready senior engineers get dragged into explaining configs, workflows and edge cases that technically exist but aren’t documented in the most formal way.


It’s not wrong, but it’s disruptive and hard to schedule around delivery. We want audits to be predictable, not ifs, buts and maybes.


How do we relieve the eng team of this work?

https://redd.it/1qvtb82
@r_devops
Every AI code assistant assumes your code can touch the internet?

Getting really tired of this.

Been evaluating tools for our team and literally everything requires cloud connectivity. Cursor sends to their servers, Copilot needs GitHub integration, Codeium is cloud-only.

What about teams where code cannot leave the building? Defense contractors, finance companies, healthcare systems... do we just not exist?

The "trust our security" pitch doesn't work when compliance says no external connections. Period. Explaining why we can't use the new hot tool gets exhausting.

Anyone else dealing with this, or is it just us?

https://redd.it/1qwfo46
@r_devops
Currently using code-driven RAG for K8s alerting system, considering moving to Agentic RAG - is it worth it?

Hey everyone,

I'm building a system that helps diagnose Kubernetes alerts using runbooks stored in a vector database (ChromaDB). Currently it works, but I'm questioning my architecture and wanted to get some opinions.

**Current Setup (Code-Driven RAG):**

When an alert comes in (e.g., PodOOMKilled), my code:

1. Extracts keywords from the alert using a hardcoded list (`['error', 'failed', 'crash', 'oom', 'timeout']`)
2. Queries the vector DB with those keywords
3. Checks similarity scores against fixed thresholds:
   * Score ≥ 0.80 → Reuse existing runbook
   * Score ≥ 0.65 → Update/adapt runbook
   * Score < 0.65 → Generate new guidance
4. Passes the decision to the LLM agent.

The agent basically just executes what the code tells it to do.
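
For anyone who wants to picture it, the decision logic above boils down to something like this minimal sketch. The keyword list and thresholds are the ones from the post; the function names `extract_keywords` and `decide` are made up for illustration, and the vector DB query itself is left out.

```python
# Keyword list and thresholds as described in the post.
KEYWORDS = ["error", "failed", "crash", "oom", "timeout"]
REUSE_THRESHOLD = 0.80
ADAPT_THRESHOLD = 0.65


def extract_keywords(alert_text: str) -> list:
    """Return the hardcoded keywords found in the alert text (case-insensitive substring match)."""
    lowered = alert_text.lower()
    return [kw for kw in KEYWORDS if kw in lowered]


def decide(best_score: float) -> str:
    """Map the best similarity score to one of the three actions."""
    if best_score >= REUSE_THRESHOLD:
        return "reuse"      # reuse existing runbook
    if best_score >= ADAPT_THRESHOLD:
        return "adapt"      # update/adapt runbook
    return "generate"       # generate new guidance


if __name__ == "__main__":
    print(extract_keywords("PodOOMKilled"))
    print(decide(0.72))
```

The predictability comes from `decide` being a pure function of the score: the same alert and the same index always produce the same action.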

**What I'm Considering (Agentic RAG):**

Instead of hardcoding the decision logic, give the agent simple tools (`search_runbooks`, `get_runbook`) and let it:

* Formulate its own search queries
* Interpret the results
* Decide whether to reuse, adapt, or ignore runbooks
* Explain its reasoning

The decision-making moves from code to prompts.
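
A rough sketch of what the two tools could look like from the agent's side. This is an assumption-heavy stand-in: `RUNBOOKS` is a hypothetical in-memory dict replacing the vector DB, the runbook IDs and texts are invented, and the scoring is naive token overlap (Jaccard), not embedding similarity. Only the tool names come from the post.

```python
import re

# Hypothetical in-memory runbook store standing in for the vector DB.
RUNBOOKS = {
    "rb-oom": "Pod OOMKilled: check memory limits and recent usage, then raise limits or fix the leak.",
    "rb-crashloop": "CrashLoopBackOff: inspect previous container logs and recent deploys.",
}


def _tokens(text: str) -> set:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_runbooks(query: str, top_k: int = 3) -> list:
    """Tool: return up to top_k (runbook_id, score) pairs, best first.

    Scoring is Jaccard token overlap, a placeholder for real vector search.
    Runbooks with no overlap at all are omitted.
    """
    query_tokens = _tokens(query)
    scored = []
    for runbook_id, body in RUNBOOKS.items():
        body_tokens = _tokens(body)
        union = query_tokens | body_tokens
        score = len(query_tokens & body_tokens) / len(union) if union else 0.0
        if score > 0:
            scored.append((runbook_id, round(score, 3)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def get_runbook(runbook_id: str) -> str:
    """Tool: return the full runbook text, or an explicit miss the agent can reason about."""
    return RUNBOOKS.get(runbook_id, f"no runbook with id {runbook_id!r}")
```

The tools themselves stay deterministic; what becomes non-deterministic is which queries the agent chooses to send and how it reads the scores that come back.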

**My Questions:**

1. Is this actually better, or am I just adding complexity?
2. For those running agentic RAG in production - how do you handle the non-determinism? My code-driven approach is predictable, agent decisions aren't.
3. Are there specific scenarios where code-driven RAG is actually preferable?
4. Any gotchas I should know about before making this switch?

I've been going back and forth on this. The agentic approach seems more flexible (the agent can craft better queries than my keyword list), but I lose the predictability of "score ≥ 0.80 = reuse".

Would love to hear from anyone who's made this transition or has opinions either way.

Thanks!

https://redd.it/1qwh5t2
@r_devops
Is this enough to target a DevOps / Cloud role without a degree?

I’ve been freelancing in infra, cloud, and ops work for 3–4 years. I also co-founded a private limited company, but I’m shutting that down due to compliance and sales fatigue.

I don’t have a degree.

My experience is mostly practical:

* Windows installations, configurations
* Security hardening for Windows
* Linux server installation (Ubuntu, Red Hat)
* Email security (SPF, DKIM, DMARC)
* DNS setup (Cloudflare, Route 53)
* SSL installation
* LAMP/LEMP stack setup, maintenance, and support
* Server administration (Hetzner, DigitalOcean, AWS, Azure)
* Peripherals connectivity issues, driver issues
* Windows applications error troubleshooting
* Dependency management
* MySQL / PostgreSQL administration
* Deployed applications using Docker Compose
* Odoo / ERPNext administration
* SES mail server setup
* AWS deployments using Lightsail, EC2, RDS, VPN, S3, CloudFront, Lambda
* Git source code management
* Deployed static sites using Hugo and Cloudflare Pages
* Protected against data theft and hotlinking using BunnyCDN CORS rules
* Troubleshot Android OS, increased performance by using dev tools
* Google Workspace & Microsoft Outlook for Business administration
* Identified and blocked phishing emails by diagnosing email headers
* Removed cryptojacking malware from multiple compromised servers
* Automated repetitive processes using AutoHotKey
* Created a Python script to fetch all uploaded videos and create WordPress posts in bulk
* Prevented bots and malicious traffic using Cloudflare under attack mode
* Blocked traffic from restricted geos using Cloudflare WAF
* Filtered logs, JSON, and other data using basic regex
* Right-sized EC2 instances based on historic usage to save costs

Provisioned basic cloud infrastructure using Terraform (EC2, VPC, CIDR configuration) and worked with local Kubernetes environments (Minikube, KIND) to deploy and validate Nginx workloads based on official docs.

**Question:**

Does this map to DevOps / Cloud Engineer roles, or is it still sysadmin-heavy?
What skills would you expect before hiring someone with this background?

I’m currently pursuing IT support roles because I’ve heard that’s where most people start. If possible, I’d also appreciate some resume tips.

https://redd.it/1qwhhxe
@r_devops
Restricting external egress to a single API (ChatGPT) in Istio Ambient Mesh?

I'm working with Istio Ambient Mesh and trying to lock down a specific namespace (ai-namespace).

The goal: Apps in this namespace should only be allowed to send requests to the ChatGPT API (api.openai.com). All other external systems/URLs must be blocked.

I want to avoid setting the global outboundTrafficPolicy.mode to REGISTRY_ONLY because I don't want to break egress for every other namespace in the cluster.

What is the best way to "jail" just this one namespace using Waypoint proxies and AuthorizationPolicies? Has anyone done this successfully without sidecars?

https://redd.it/1qwflgn
@r_devops