Reddit DevOps – Telegram
Clarity from an experienced cloud architect/DevOps engineer

How secure is path-based routing and is it industry standard for a 3-tier cloud native application that makes use of ECS and CodePipeline for CI/CD?

https://redd.it/1on8nuk
@r_devops
From Linux System Engineer to DevOps - Looking for Advice and Experiences

Hi everyone, I’ve wanted to transition into DevOps for a long time, but I only started seriously working toward it in February this year, building up the necessary skills. In the meantime, I received an offer to work as a Linux System Engineer, and I’ve been in that role for about four months now. I accepted it thinking it would help me transition to DevOps because of the skill similarities. Before that, I completed a three-year System Administrator apprenticeship here in Germany (“Ausbildung zum Fachinformatiker für Systemintegration”), where I mainly worked with Windows servers until the company introduced a deployment pipeline for its software. Unfortunately, the only overlapping skills in my current role are scripting and Linux. The rest (Ansible, Kubernetes, CI/CD pipelines, etc.) is not part of my job. I recently told my boss that I had expected more hands-on work with tools like Ansible and Terraform, and I asked whether there’s a way for me to transition internally to a DevOps position or possibly take on a new DevOps-focused role. Has anyone here gone through a similar transition? If so, I’d really appreciate hearing your detailed experience and any good tips you might have.

https://redd.it/1onacpn
@r_devops
We had perfect observability but still struggled during incidents. Here's what fixed it

We built a solid observability stack. OpenTelemetry pipelines, unified metrics, logs, traces. Beautiful Grafana dashboards. Everything instrumented. We could see everything.

But when incidents hit, we still struggled. Alerts fired, but we didn't know: is this severe? What do we do? Who should respond? Everyone had different opinions. "2% error rate is fine" vs "2% is catastrophic." We were improvising every time.

The missing piece wasn't technical. It was organizational. We needed SLOs to define what "working" means (so severity isn't subjective), runbooks to codify remediation steps (so response isn't improvisation), and post-mortems to learn from failures systematically (so we don't repeat mistakes).

Here's what actually worked for us:

SLOs: We use availability SLIs from OpenTelemetry span-metrics in Prometheus. We calculate the percentage of successful requests by comparing successful calls (2xx/3xx) against total calls for each service. This gives us availability. We set 99.5% as our SLO, which creates a 0.5% error budget (about 3.6 hours of downtime in a 30-day month). Now we know when something is actually broken, not just "different." When we're burning error budget faster than expected, we slow feature releases.
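
As a rough illustration of the arithmetic, here is a minimal Python sketch of the availability and error-budget calculation. The function names and the request counts are made up for the example, and the 30-day window is an assumption; over that window a 0.5% budget works out to 0.005 × 720 h = 3.6 h.

```python
def availability(successful: int, total: int) -> float:
    """Fraction of requests that succeeded (2xx/3xx over all calls)."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return successful / total


def error_budget_hours(slo: float, window_days: int = 30) -> float:
    """Downtime the SLO allows over the window, in hours."""
    return (1.0 - slo) * window_days * 24


def budget_consumed(successful: int, total: int, slo: float) -> float:
    """Share of the error budget already used (1.0 means fully spent)."""
    if total == 0:
        return 0.0
    allowed_failures = (1.0 - slo) * total
    actual_failures = total - successful
    return actual_failures / allowed_failures


# Hypothetical numbers: 100,000 requests in the window, 300 of them failed.
slo = 0.995
print(f"availability:    {availability(99_700, 100_000):.4f}")
print(f"budget (hours):  {error_budget_hours(slo):.1f}")
print(f"budget consumed: {budget_consumed(99_700, 100_000, slo):.0%}")
```

A burn-rate alert is then just a threshold on how fast `budget_consumed` grows relative to how much of the window has elapsed.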

Runbooks: We connect runbooks directly to alerts via PagerDuty. When an alert fires, the notification includes what's broken (service name, error rate), current vs expected (SLO threshold), where to look (dashboard link, trace query), and what to do (runbook link). The on-call engineer clicks the runbook and follows steps. No guessing, no Slack archaeology trying to remember what worked last time.
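
To make the four parts of the notification concrete, here is a small Python sketch that renders them into a message body. This is not PagerDuty's API: the `SloAlert` class, its field names, and the example.com URLs are all hypothetical, and how the text gets attached to the page depends on your alerting setup.

```python
from dataclasses import dataclass


@dataclass
class SloAlert:
    service: str
    error_rate: float      # current error rate, e.g. 0.02 for 2%
    slo_error_rate: float  # error rate the SLO allows, e.g. 0.005
    dashboard_url: str     # hypothetical dashboard link
    trace_query: str       # hypothetical trace search expression
    runbook_url: str       # hypothetical runbook link


def render_notification(alert: SloAlert) -> str:
    """Build the text an on-call engineer sees when paged."""
    return "\n".join([
        f"What's broken: {alert.service} error rate {alert.error_rate:.2%}",
        f"Current vs expected: {alert.error_rate:.2%} vs {alert.slo_error_rate:.2%} allowed",
        f"Where to look: {alert.dashboard_url} (traces: {alert.trace_query})",
        f"What to do: {alert.runbook_url}",
    ])


print(render_notification(SloAlert(
    service="checkout",
    error_rate=0.02,
    slo_error_rate=0.005,
    dashboard_url="https://grafana.example.com/d/checkout",
    trace_query='service="checkout" status=error',
    runbook_url="https://wiki.example.com/runbooks/checkout-errors",
)))
```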

Post-mortems: We use a simple template: Impact (users affected, SLO impact), Timeline, Root Cause, What Went Well/Poorly, Action Items (with owners, priorities P0-P2, and due dates). The key is prioritizing action items in sprint planning. Otherwise post-mortems become theater where everyone nods, writes "we should monitor better" and changes nothing.

After implementing these practices, our MTTR dropped by 60% in three months. Not because we collected more data, but because we knew how to act on it.

I wrote about the framework, templates, and practical steps here: From Signals to Reliability: SLOs, Runbooks and Post-Mortems

What practices have helped your team move from reactive firefighting to proactive reliability?

https://redd.it/1ona979
@r_devops
How are you enforcing code-quality gates automatically in CI/CD?

Right now our CI just runs unit tests. We keep saying we’ll add coverage and complexity gates, but every time someone tries, the pipeline slows to a crawl or throws false positives. I’d love a way to enforce basic standards - test coverage > 80%, no new critical issues - without babysitting every PR.
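
For reference, the gate itself can be very small. Below is a minimal Python sketch, assuming the coverage percentage and the lists of critical issue IDs have already been extracted from whatever coverage and analysis tools the pipeline runs; the function names and the ID format are hypothetical. Comparing against a baseline (for example, the target branch) is what keeps pre-existing findings from failing every PR.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    reasons: list = field(default_factory=list)


def new_critical_issues(baseline_ids, current_ids):
    """Critical issues present now that were not in the baseline."""
    return set(current_ids) - set(baseline_ids)


def check_gate(coverage_pct, baseline_ids, current_ids, min_coverage=80.0):
    """Fail unless coverage is above the minimum and no new critical issue appeared."""
    reasons = []
    if coverage_pct <= min_coverage:
        reasons.append(f"coverage {coverage_pct:.1f}% is not above {min_coverage:.1f}%")
    introduced = new_critical_issues(baseline_ids, current_ids)
    if introduced:
        reasons.append("new critical issues: " + ", ".join(sorted(introduced)))
    return GateResult(passed=not reasons, reasons=reasons)


# Hypothetical PR: coverage is fine, but one new critical finding was introduced.
result = check_gate(83.2, baseline_ids={"A1"}, current_ids={"A1", "B2"})
print(result.passed, result.reasons)
```

In CI, the job would exit non-zero when `result.passed` is false, so the merge is blocked without anyone watching the PR.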

https://redd.it/1onb20l
@r_devops
Combining code review and SAST results - possible?

Security runs their scans separately, devs review manually, and we’re constantly duplicating effort. Ideally, reviewers should see security warnings inline with the code diff. Has anyone achieved that?

https://redd.it/1ona5yo
@r_devops
Anyone using AI for pull-request reviews yet?

Copilot is fine for writing code, but it doesn’t help during reviews. I’m wondering if anyone has used AI that can actually review a PR - like summarize changes, highlight risky logic, or point out missing edge cases.

https://redd.it/1onfv66
@r_devops
AI is a Corporate Fad where I work

The title says it all. In my workplace (big company) we have non-technical decision makers asking for integrations of technology that they don't understand with existing technologies that they don't understand. What could go wrong financially?

My only hope is that this fad replaces the existing fad of hiring swaths of inexpensive out of town engineers to provide "top notch" solution design that falls flat at the implementation phase.

What's your experience?



https://redd.it/1onilgi
@r_devops
Gprxy: Go based SSO-first, psql-compatible proxy

https://github.com/sathwick-p/gprxy

Hey all,
I built a PostgreSQL proxy for AWS RDS. I wrote it because the usual way to access and run queries on RDS is through DB users, and in bigger organizations it is impractical to maintain separate DB users for each user or team. Yes, IAM authentication exists in RDS for this same reason, but I personally did not find it the best fit, as it would require a bunch of configuration and changes in RDS.

The idea is that by connecting via this proxy you just run the login command, which does an SSO-based login that authenticates you through an IdP like Azure AD before connecting to the DB. It also gives me user-level audit logs.

I had been looking for an open-source solution but could not find one, so I rolled my own. It is currently deployed and in use on k8s.

Please check it out and let me know if you find it useful or have feedback, I’d really appreciate hearing from y'all.

Thanks!

https://redd.it/1oni3df
@r_devops
Just got $5K AWS credits approved for my startup

Didn’t expect this to still work in 2025, but I just got **$5,000 in AWS credits** approved for my small startup.

We’re not in YC or any accelerator, just a verified startup with:

* a **website**
* a **business email**
* and an actual product in progress

It took around 2–3 days to get verified, and the credits were added directly to the AWS account.

So if you’re building something and have your own domain, there’s still a valid path to get AWS credits even if you’re not part of Activate.

If anyone’s curious or wants to check if they’re eligible, DM me and I can share the steps.

https://redd.it/1onmg20
@r_devops
Migrating from Octopus Deploy to Gitlab. What are Pros and Cons?

Due to reasons I won't get into, we might need to move from Octopus Deploy to GitLab for CI/CD. I'm trying to come up with some pros and cons so I can convince management to keep Octopus (despite the cost). Here are some of the pros of Octopus that I have listed:

* Release management: If we need to roll back to a previously functioning version of our code, we can simply click on the previous release and then leisurely work on fixing the problem (issues aren't always visible in QA or Staging). GitLab doesn't seem to have this.
* Script Console: Octopus lets us send commands (e.g., iisreset) to an entire batch of VMs in one shot instead of having to write something that would loop through a list of VMs or, God forbid, remoting into each VM manually. GitLab doesn't seem to have that either. This comes in really handy when we need to quickly run a task in the middle of an outage.
* Variable Management and Substitution: Scoping variables with different values seems to be handled much better in Octopus compared to GitLab. Also, I could not find anything that says you can do variable substitution in files like .config and .json. There's no .NET variable substitution in GitLab either.
* Pipeline Design: GitLab pipelines seem to be all YAML, which means a lot of the tasks that Octo does for you, like IIS configuration, Kubernetes deployments, etc., will have to be scripted from scratch. (Correct me if I'm wrong on this.)

These are some of the pros of Octopus I could think of. Are there any more I can use to back up my argument?
Also, has anyone gone through the same exercise? What is your experience using GitLab after having Octopus for a while?

https://redd.it/1onlv3s
@r_devops
Curious how folks are feeling regarding ethics in the current political climate with regards to tech?

I'm asking the question in this sub on the basis that people have to have a reasonable level of experience to be in this field across disciplines. (I did the helpdesk - Sysad - DevOps route, for context I started that journey in the late 90s).

If not allowed (I understand politics is a sensitive subject) I will completely understand mods removing the post.

I'm between contracts at the moment, and in the last decade at least, whenever I'm not working I get offers to work for "gaming" (gambling) sites that I always turn down... I have a number of friends who wrecked their lives through gambling addiction - I wouldn't feel comfortable taking that paycheck.

(I'm not shitting on people who do, I get it. No judgement... Just a personal thing on my part).

I recently watched a pretty in-depth breakdown of the facial recognition AI stuff being trialled in the US (additional context, I'm not American nor do I live there anymore, but it's fair to assume it's coming everywhere soon), but I have previously worked for a number of US companies, including at least one that I know is involved in stuff that makes me feel pretty uncomfortable about the way the technology is progressing, and importantly, the things it is being used for.

I suppose I'm speaking more to the greybeards and grey hats in this community - but I'm curious to gauge how folks are feeling about developing and supporting this kind of thing?

https://redd.it/1onpv8x
@r_devops
The APM paradox

I've recently been thinking about how to get more developers (especially on smaller teams) to adopt observability practices, and put together some thoughts about how we're approaching it at the monitoring tool I'm building. We're a small team of developers who have been on-call for critical infrastructure for the past 13 years, and have found that while "APM" tools tend to be more developer-focused, we've generally found logging to be more essential for our own systems (which led us to build a structured logging tool that encourages wide events).

I'm curious what y'all think — how can we encourage more developers to learn about observability?

https://www.honeybadger.io/blog/apm-paradox/

https://redd.it/1onmrnj
@r_devops
Why do cron monitors act like a job "running" = "working"?


Most cron monitors are useless if the job executes but doesn't do what it's supposed to. I don't care if the script ran. I care if:
- it returned an error
- it output nothing
- it took 10x longer than usual
- it "succeeded" but wrote an empty file

All I get is "✓ ping received" like everything's fine.

Anything out there that actually checks exit status, runtime anomalies, or output sanity? Or does everyone just build this crap themselves?
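
For anyone building this themselves, the checks listed above fit in a short wrapper. Here is a minimal Python sketch, assuming the job is launched through the wrapper and that you know its usual runtime; `run_and_check`, `usual_seconds`, and `slow_factor` are names invented for the example, not part of any monitoring tool.

```python
import subprocess
import time
from pathlib import Path


def run_and_check(cmd, usual_seconds, slow_factor=10.0, output_file=None):
    """Run cmd and return a list of problems; an empty list means it looks healthy."""
    problems = []
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - start

    # It returned an error.
    if proc.returncode != 0:
        problems.append(f"exit status {proc.returncode}")
    # It output nothing.
    if not proc.stdout.strip():
        problems.append("no output")
    # It took far longer than usual (10x by default).
    if elapsed > usual_seconds * slow_factor:
        problems.append(f"took {elapsed:.1f}s, usual is {usual_seconds:.1f}s")
    # It "succeeded" but wrote an empty (or missing) file.
    if output_file is not None:
        path = Path(output_file)
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(f"{output_file} is missing or empty")
    return problems
```

The cron entry would call the wrapper instead of the job, and the wrapper would only send the "ping" (or send a failure with the problem list) based on what `run_and_check` returns.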

https://redd.it/1onrwrl
@r_devops
Custom Podman Container Dashboard?

I have a bunch of Docker containers (well, technically Podman containers) running on a Linux node, and it's getting to the point where it's annoying to keep track of them all. I have all the necessary identifying information (like requestor, poc, etc.) added as labels to each container.

I'm looking for a way to create something like a dashboard to present this information like Container name, status, label1, label2, label3 in a nice tabular form.

I've already experimented with Portainer and Cockpit but couldn't really create a customized view per my needs. Does anyone have any ideas?
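
One low-tech option is a small script over Podman's JSON output. The Python sketch below assumes `podman ps --all --format json` returns a list of objects with `Names`, `State`, and `Labels` fields, which should be checked against the Podman version in use; the label keys `requestor` and `poc` are taken from the description above, and the helper names are made up.

```python
import json
import subprocess


def load_containers():
    """Ask Podman for all containers as parsed JSON (requires podman on PATH)."""
    out = subprocess.run(
        ["podman", "ps", "--all", "--format", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)


def container_rows(containers, label_keys):
    """One row per container: name, status, then the requested label values."""
    rows = []
    for c in containers:
        names = c.get("Names") or ["?"]
        name = names[0] if isinstance(names, list) else str(names)
        labels = c.get("Labels") or {}
        rows.append([name, str(c.get("State", "?"))]
                    + [str(labels.get(k, "-")) for k in label_keys])
    return rows


def render_table(header, rows):
    """Left-align every column to its widest cell."""
    table = [list(header)] + rows
    widths = [max(len(r[i]) for r in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(r, widths)).rstrip()
        for r in table
    )
```

Calling `print(render_table(["NAME", "STATUS", "REQUESTOR", "POC"], container_rows(load_containers(), ["requestor", "poc"])))` from a shell alias or under `watch` gives a plain tabular view; the same rows could feed an HTML page if a browser dashboard is preferred.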

https://redd.it/1onsszc
@r_devops
How do you size VPS resources for different kinds of websites? Looking for real-world experience and examples.

I’m trying to understand how to estimate VPS resource requirements for different kinds of websites — not just from theory, but based on real-world experience.

Are there any guidelines or rules of thumb you use (or a guide you’d recommend) for deciding how much CPU, RAM, and disk to allocate depending on things like:

* Average daily concurrent visitors

* Site complexity (static site → lightweight web app → high-load dynamic site)

* Whether a database is used and how large it is

* Whether caching or CDN layers are implemented

I know “it depends” — but I’d really like to hear from people who’ve done capacity planning for real sites:

* What patterns or lessons did you learn?

* What setups worked well or didn’t?

* Any sample configurations you can share (e.g., “For a small Django app with \~10k daily visitors and caching, we used 2 vCPUs and 4 GB RAM with good performance.”)?

I’m mostly looking for experience-based insights or reference points rather than strict formulas.

Thanks in advance!

https://redd.it/1onlpxe
@r_devops
Dudes, I'm scared. I know it's a scam, but what if it's not? Have you ever received an email like this before, and how did it go?

Dudes, I'm just a hobbyist (if you look at my LinkedIn, you'll notice I'm not a programmer). I've learned programming, algorithms, and design patterns all by myself. I also publish articles on my Medium blog, documenting the concepts I've learned from my reading and online study.

I recently discovered an email written in Chinese in my Gmail spam folder and have translated its contents using Google Translate.

---
Tonghuashun AIME Program Invitation

Hello,

I am XXXXX, HR from Hexin Tonghuashun (Stock Code: 300033). We noticed your excellent background in development on GitHub, which highly matches the requirements for our AI Engineering Development position. The specific focus areas include, but are not limited to, algorithm application, algorithm engineering, and large AI model development.

Core Advantages of the Position

- Salary benchmarked against top-tier tech companies.

- Listed company stock incentives provided through the AIME Talent Double Hundred Plan.

- Participation in the R&D of AI products with millions of users.

- Tech Stack: Engineering (Java/Web/C++/etc.) and cutting-edge algorithms (Large Models/NLP/AIGC/Robotics/Speech/etc.).

Company Profile

Zhejiang Hexin Tonghuashun Network Information Co., Ltd. (Tonghuashun), established in 1995 and listed on the Shenzhen Stock Exchange in 2009 (Stock Code: 300033), is the first listed company in China's internet financial information services industry. We currently have over 7,000 employees, with our headquarters located in the beautiful and livable city of Hangzhou.

As an internet financial information provider, Tonghuashun's main business is to offer software products and system maintenance services, financial data services, and intelligent promotion services to various institutions, and to provide financial information and investment analysis tools to individual investors. To date, Tonghuashun has nearly 600 million registered users and over ten million daily active users. We have established business cooperation with over 90% of domestic securities companies, with a strong "moat" business ensuring stable cash flow for the company.

Supported Business

Based on comprehensive AI capabilities such as large models, NLP, speech, graphics, image, and vision, we currently cover multiple 2B and 2C application scenarios. Our numerous products include the intelligent investment advisory robot AIME, intelligent service, data intelligence, smart healthcare, AIGC, and digital humans. Targeting various regions including China, Europe and the US, the Middle East, and Southeast Asia, we are progressively realizing the path of technology commercialization and product marketization. The AI team has accumulated over ten years of experience, with hundreds of large model application scenarios implemented, trillions of user financial dialogue data points, and a self-built cluster of thousands of cards for computing power. We are one of the first domestic enterprises to receive cybersecurity administration approval for financial large models.

I look forward to discussing this further with you! You can contact me via:

- WeChat/Phone: XXXX

- Email: XXXX

If you are interested, please feel free to contact me at any time, and I will arrange a detailed conversation for you.

Wishing you the best of business!

HR XXX | Zhejiang Hexin Tonghuashun Network Information Co., Ltd.

Company Website: https://www.10jqka.com.cn/

----

Ok, now, I don't know what to do. I know this could be spam, but what if it isn't? I mean, the links look real.

here is my git if you're interested in what they've seen: https://github.com/EDBCREPO

Have you ever received a mail like this before?

----

EDIT: f@@k, the link and job look real: https://campus.10jqka.com.cn/job/list?type=school



https://redd.it/1ony2tl
@r_devops
How are you handling these AWS ECS (Fargate) issues? Planning to build an AI agent around this…

Hey Experts,

I’m exploring the idea of building an AI agent for AWS ECS (Fargate + EC2) that can help with some tricky debugging and reliability gaps — but before going too far, I’d love to hear how the community handles these today.

**Here are a few pain points I keep running into 👇**

* When a process slowly eats memory and crashes — and there’s no way to grab a heap/JVM dump *before* it dies.
* Tasks restart too fast to capture any “pre-mortem” evidence (logs, system state, etc.).
* Fargate tasks fill up ephemeral disk and just get killed, no cleanup or alert.
* Random DNS or network resolution failures that are impossible to trace because you can’t SSH in.
* A new deployment “passes health checks” but breaks runtime after a few minutes.


**I’m curious**

* Are you seeing these kinds of issues in your ECS setups?
* And if so, how are you handling them right now — scripts, sidecars, observability tools, or just postmortems?



Would love to get insights from others who’ve wrestled with this in production. 🙏

https://redd.it/1onzb40
@r_devops
Paid Study: Help us improve Virtual Machine Tools – $150 for a 60-minute interview

We’re conducting a paid research study to learn more about how professionals create, manage, and provision virtual machines (VMs) at work. Our goal is to better understand your workflows and challenges so we can make VM tools more efficient and user-friendly.

Details:

- Compensation: $150 USD for a 60-minute 1:1 conversation
- Format: Online interview via Zoom or Teams
- Who we’re looking for: Anyone who creates or uses virtual machines, at any experience level or for any type of application
- Priority: Participants with a LinkedIn profile linked to our platform will be considered first

If you’re interested, please send me a message or comment below and I’ll share the next steps.

Your feedback will directly help improve the tools used by thousands of professionals worldwide.

https://redd.it/1oo1fe7
@r_devops
Tired of applying everywhere - Looking for Fresher DevOps / Cloud Support / Linux Opportunity

Hey everyone,

I’m a recent Computer Science graduate actively looking for fresher roles in DevOps, Cloud Support, or Linux.
I’ve applied to many companies and portals, but most either ask for experience or never respond — it’s been really tough finding that first break.

I’ve learned and practiced:

- Linux
- AWS (EC2, S3, IAM, Lambda basics)
- Docker & Kubernetes
- Git/GitHub
- CI/CD concepts

I’m genuinely passionate about DevOps and Cloud, and I’m just looking for that first opportunity to prove myself.
Preferably looking for roles in Pune or remote.

If anyone here knows of openings or referrals, I’d really appreciate your help 🙏

Thanks a lot for reading and supporting freshers like me!

https://redd.it/1oo3d4e
@r_devops
India's largest automaker Tata Motors showed how not to use AWS keys

guy found two exposed aws keys on public sites, which gave access to \~70tb of internal data - customer info, invoices, fleet tracking, you name it

they also had a decryptable aws key (encryption that did nothing), a backdoor in tableau where you could log in as anyone with no password, and an exposed api key that could mess with their test-drive fleet

cert-in tried to get tata to fix it, but it took months of back-and-forth before the keys were finally rotated

link: https://eaton-works.com/2025/10/28/tata-motors-hack/ and https://news.ycombinator.com/item?id=45741569

https://redd.it/1oo402w
@r_devops