Join the on-call roster, it’ll change your life
Joining an on-call rotation might change the future of your career, and maybe even you as a person. Joining the roster 9 years ago definitely changed me. In this article, I share my experience being on-call.
Link: https://serce.me/posts/2025-12-09-join-oncall-it-will-change-your-life
https://redd.it/1pijp5t
@r_devops
Need brutally honest feedback: Am I employable as an internal tools/automation engineer with my background?
I'd really appreciate candid, unbiased feedback.
I’m based in Toronto and trying to understand where I realistically fit into the tech job market. My background is non-traditional, and I’ve developed a fear that I’m underqualified for most software roles despite being able to build a lot of things.
My background:
I was the main tech person at a small hedge fund that launched in 2021.
I built all the internal trading and operations tools from scratch:
PnL/exposure dashboards
Efficient trade executors
Signal engines built with insights from the PM, deployed on EC2, communicating with client-side (traders') scripts over sockets.
automated margin checks
reconciliation pipelines
Excel/Python hybrid tools for ops
Basically: if the team needed something automated or streamlined, I designed and built it.
Where I feel confident:
I’m very comfortable:
understanding messy business processes
abstracting them into clean systems
building reliable automations
shipping internal tools quickly
integrating APIs
automating workflows for non-technical users
designing guardrails so people don’t make mistakes
Across domains, I feel I could pick up any internal bottleneck and automate it.
Where I feel unprepared / insecure:
Because I was the only technical person:
I never learned Agile/Scrum
never used Jira or any formal ticketing
barely used SQL (everything was Python + Excel)
never worked with other engineers
didn’t learn proper software development patterns
no pull requests, no code reviews
no experience building public products or services
I worry that I’m mostly a “script kiddie” who built robust systems by intuition, but not a “proper software engineer.”
The fund manager was a trained software engineer but gave me full freedom as long as the tools worked — which I loved, but now I’m worried I skipped important foundational learning.
My questions for people working in tech today:
1. Is someone with my background employable for internal tools or automation engineering roles in Canada?
2. If not, what specific skills should I prioritize learning to become employable?
SQL?
TypeScript/React?
DevOps?
Software architecture?
3. What kinds of roles would someone like me realistically be competitive for?
Internal tools engineer?
Automation engineer?
Operations engineer?
AI automation roles?
4. Is it realistic for someone with mostly Python + automation experience (but little formal SWE experience) to land roles in the ~80–110k range in Canada?
5. If you were in my position, what would you do next to fix the gaps and move forward?
I’m not looking for comfort — I genuinely want realistic, even harsh feedback from people who understand the current job market.
Thanks in advance to anyone who takes the time to answer.
https://redd.it/1piihs4
@r_devops
Malware on application server
I’m a 3rd year IT student on the ops team for a DevOps class where the devs are building a .NET application.
Earlier today I noticed a suspicious process called b0s running from `/var/tmp/.b0s` and eating a ridiculous amount of CPU. After digging into it I realized the application server was actually compromised. There were:
* strange binaries dropped in `/var/tmp` and `/tmp`
* a fake sshd running from `/var/tmp/sshd`
* cronjobs that kept recreating themselves every minute
With some AI help I cleaned everything up: I killed the active malware processes and removed all the persistence mechanisms, so the server is stable again.
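A check for this class of infection can be sketched in a few lines: flag any running process whose binary lives in a world-writable scratch directory. The directory list and function names here are my own illustration, not part of the original cleanup:

```python
import os

# Directories where legitimate long-running binaries should never live.
SUSPECT_DIRS = ("/tmp/", "/var/tmp/", "/dev/shm/")

def is_suspect_exe(exe_path):
    """True if an executable path points into a scratch directory."""
    return exe_path.startswith(SUSPECT_DIRS)

def scan_procs():
    """Yield (pid, exe) for running processes whose binary is in a suspect dir."""
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            exe = os.readlink(f"/proc/{pid}/exe")
        except OSError:
            continue  # kernel threads, vanished PIDs, permission denied
        if is_suspect_exe(exe):
            yield int(pid), exe
```

Running something like this from cron (and alerting instead of just printing) would have caught the fake `/var/tmp/sshd` much earlier; it won't catch everything, but it's cheap.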
I built the application server with Ansible so rebuilding it tomorrow will be easy… still mad embarrassing though ngl.
https://redd.it/1piol32
@r_devops
Feel so hopeless and directionless
Just some backstory: I started off in DevOps straight away without any SWE background. I was working minimum-wage jobs and spent hours on tutorials while I worked. A friend referred me and helped me get a support engineer job, and I know how lucky I got there: I had take-home assignments that I finished perfectly and got the job (the manager was leaving the company and I think he just wanted to fill the position). But I struggle so much every day. The team does not help me; not a single person is interested in helping a junior learn or unblocking them. They don't even want to be DM'ed; they want you to ask every question in the company Slack in front of 500 people, which keeps me from asking at all at times. This was a couple of years ago and I still haven't learned or made any progress. Every day is a struggle: I switch from one problem to the next so fast that I never learn anything (that's support eng for you).
I feel like a complete newb in meetings or any discussions. I really really want to learn and find a direction for my learning. I have a few weeks off and I want to get somewhere in this time.
Here is my game plan:
Take the CKA course and pass the test: as I do this it will help me learn K8s (my job needs K8s knowledge). I'm working on the KodeKloud course.
AWS Solutions Architect course and test
Sysadmin handbook to get good at fundamentals: https://www.amazon.com/UNIX-Linux-System-Administration-Handbook/dp/0134277554 (if you're familiar with this book and know what can be skipped to save time, please let me know)
I think these three cover:
Container / Orchestration (k8s)
Cloud / Automation concepts (k8s / aws)
Observability (k8s)
Troubleshooting (book)
IaC (k8s)
Security (AWS)
Operating sys fundamentals (book)
Shell / scripting (book)
My goal is 3 hours on CKA, one hour on the book, and 2 hours on the AWS course daily.
If you think I should prioritize one above another or this looks good, let me know. Eager for some direction and advice.
https://redd.it/1piqj5d
@r_devops
Is anyone using MLflow for GenAI?
Heyyy. I'm sorry if my question is too naive or sounds under-researched, but I swear I read the whole internet :)
Is anyone here using MLflow for GenAI? I started learning MLOps coming from a pure R&D NLP engineer role. I'm working for a startup, and the evaluation pipeline right now is too vague and has drawn a lot of criticism for bad quality. I want to set up a CI/CD pipeline integrated with MLflow to make the evaluation process clear and transparent, and build a quality gate that checks quality and decides whether something should go to production.
While exploring MLflow, I found it difficult to organize different stages (dev/staging/prod), since everything goes into an Experiment. I also had difficulty distinguishing between experiments in dev (different configs, model prompts) and evaluation results destined for production. (Something like the champion model in traditional ML would be quite useful, but we don't have a champion config.)
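One way to make the quality gate concrete is a plain function that a CI job runs after the MLflow evaluation step; the metric names and thresholds below are invented for illustration, not MLflow API:

```python
def quality_gate(metrics, thresholds):
    """Compare eval metrics to minimum thresholds.

    Returns (passed, failures) where failures maps each failing metric
    name to its observed value (or None if the metric is missing).
    """
    failures = {name: metrics.get(name)
                for name, minimum in thresholds.items()
                if metrics.get(name) is None or metrics[name] < minimum}
    return (len(failures) == 0, failures)

# Hypothetical GenAI eval metrics a run might log:
thresholds = {"faithfulness": 0.85, "answer_relevance": 0.80}
```

For the dev/staging/prod split, a common workaround is one experiment per project with a `stage` tag on each run, then filtering in CI with something like `mlflow.search_runs(..., filter_string="tags.stage = 'prod'")`; worth verifying against the MLflow version you're on.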
Aww thank you so much for reading this:) (this is my 3rd post T.T)
https://redd.it/1pit7nd
@r_devops
Hi guys, been looking into building a
price discovery platform for checking various FinOps platforms, and applying the optimal combination from a lookup to an individual and/or renegotiating rates
I also have a couple of internal tools that I was thinking about open-sourcing; they use boto3 to map resource dependencies and VPCs/networks between resources.
Thoughts on what you'd like to see in something like this?
https://redd.it/1piti87
@r_devops
Using PSI + cgroups to debug noisy neighbors on Kubernetes nodes
I got tired of “CPU > 90% for N seconds → evict pods” style rules. They’re noisy and turn into musical chairs during deploys, JVM warmup, image builds, cron bursts, etc.
The mental model I use now:
* CPU% = how busy the cores are
* PSI = how much time things are actually *stalled*
On Linux, PSI shows up under `/proc/pressure/*`. On Kubernetes, a lot of clusters now expose the same signal via cAdvisor as metrics like `container_pressure_cpu_waiting_seconds_total` at the container level.
The pattern that’s worked for me:
1. Use PSI to confirm the node is actually under pressure, not just busy.
2. Walk cgroup paths to map PIDs → pod UID → {namespace, pod_name, QoS}.
3. Aggregate per pod and split into:
* “Victims” – high stall, low run
* “Bullies” – high run while others stall
That gives a much cleaner “who is hurting whom” picture than just sorting by CPU%.
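Steps 1–2 can be sketched in a few lines (the actual agent is Rust + eBPF; this is just an illustration of the parsing, and it assumes the common systemd cgroup-driver path layout — other drivers name paths differently):

```python
import re

def parse_psi(text):
    """Parse /proc/pressure/cpu content into {"some"/"full": {field: value}}."""
    out = {}
    for line in text.strip().splitlines():
        kind, rest = line.split(None, 1)  # "some" or "full"
        out[kind] = {k: float(v) for k, v in
                     (kv.split("=") for kv in rest.split())}
    return out

# systemd-driver paths contain "...-pod<uid>.slice", with the UID's
# hyphens rewritten as underscores (36 chars either way).
POD_RE = re.compile(r"pod([0-9a-f_]{36})")

def pod_uid_from_cgroup(cgroup_line):
    """Extract the pod UID from a /proc/<pid>/cgroup line, if any."""
    m = POD_RE.search(cgroup_line)
    return m.group(1).replace("_", "-") if m else None
```

From there, aggregating per pod is a dictionary keyed on the UID, and the victim/bully split falls out of comparing each pod's run time against node-level PSI stall time.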
I wrapped this into a small OSS node agent I’m hacking on (Rust + eBPF):
* `/processes` – per-PID CPU/mem + namespace/pod/QoS (basically `top` but pod-aware).
* `/attribution` – you give it `{namespace, pod}`, it tells you which neighbors were loud while that pod was active in the last N seconds.
Code: [https://github.com/linnix-os/linnix](https://github.com/linnix-os/linnix)
Write-up + examples: [https://getlinnix.substack.com/p/psi-tells-you-what-cgroups-tell-you](https://getlinnix.substack.com/p/psi-tells-you-what-cgroups-tell-you)
This isn’t an auto-eviction controller; I use it on the “detection + attribution” side to answer “who is hurting whom” before touching PDBs / StatefulSets / scheduler settings.
Curious what others are doing:
* Are you using PSI or similar saturation signals for noisy neighbors?
* Or mostly app-level metrics + scheduler knobs (requests/limits, PodPriority, etc.)?
* Has anyone wired something like this into automatic actions without it turning into musical chairs?
https://redd.it/1pitfnt
@r_devops
How much better is AI at coding than you really?
If you’ve been writing code for years, what’s it actually been like using AI day to day? People hype up models like Claude as if they’re on the level of someone with decades of experience, but I’m not sure how true that feels once you’re in the trenches.
I’ve been using Claude and Cosine a lot lately, and some days it feels amazing, like having a super fast coworker who just gets things. Other days it spits out code that leaves me staring at my screen wondering what alternate universe it learned this from.
So I’m curious, if you had to go back to coding without any AI help at all, would it feel tiring?
https://redd.it/1piy5c1
@r_devops
Built a visual debugger for my local agents because I was lost in JSON, would you use this?
I run local LLM agents with tools / RAG.
When a run broke, my workflow was basically: rerun with more logging, diff JSON, and guess which step actually screwed things up. Slow and easy to miss.
So I hacked a small tool for myself: it takes a JSON trace and shows the run as a graph + timeline.
Each step is a node with the prompt / tool / result, and there’s a basic check that highlights obvious logic issues (like using empty tool results as if they were valid).
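For what it's worth, the "empty tool result treated as valid" check is cheap to express as a pass over a JSON trace. The step schema below is invented for illustration, since the post doesn't specify one:

```python
def flag_empty_tool_results(trace):
    """Return indices of tool-call steps whose result came back empty.

    `trace` is a list of step dicts with a hypothetical schema like
    {"type": "tool_call", "tool": "search", "result": [...]}.
    """
    empties = (None, "", [], {})
    return [i for i, step in enumerate(trace)
            if step.get("type") == "tool_call" and step.get("result") in empties]
```

In a graph view, highlighting these nodes (and every downstream node that consumed them) is usually enough to spot where a run silently went off the rails.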
It’s already way faster for me than scrolling logs.
Long-term, I’d like this to become a proper “cognition debugger” layer on top of whatever logs/traces you already have, especially for non-deterministic agents where “what happened?” is not obvious.
It’s model-agnostic as long as the agent can dump a trace.
I’m mostly curious if anyone else here hits the same pain.
If this sounds useful, tell me what a debugger like this must show for you to actually use it.
I’ll drop a demo link in the comments 🔗.
https://redd.it/1piyll2
@r_devops
Your AI agents are a compliance disaster waiting to happen
Just got out of a meeting with legal and I need to vent somewhere.
We have like six agents running in production now. Different teams built them over the past year. They work fine, users like them, everyone was happy. Then legal started asking questions for some audit prep and everything fell apart.
Can you prove what data this agent accessed when it made that decision? No. Can you show me a trace of why it recommended X to this customer? Also no. Can you demonstrate that PII wasn't sent to OpenAI? Definitely no. Can you prove GDPR compliance for the EU users? Lmao.
None of this stuff was even on anyone's radar when we were building. We were just trying to get the damn things working. Now legal is talking about shutting down two of the agents entirely until we can prove they're compliant. Which we can't. Because we logged basically nothing.
The thing that kills me is this isn't even hard technically. Audit logs, decision traces, data lineage. We know how to build this stuff. We just didn't, because nobody asked and we were moving fast. Classic.
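To that point, a minimal decision-trace wrapper really is a few lines. Everything below (field names, the email-only redaction) is a hedged sketch of the shape, not a compliance solution:

```python
import json, re, time, uuid

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text):
    """Very rough PII scrub (emails only here) before text hits a log or a provider."""
    return EMAIL_RE.sub("[EMAIL]", text)

def traced(agent_name, log_path, call_model):
    """Wrap a model-call function so every decision lands in a JSONL audit log."""
    def wrapper(prompt, **kwargs):
        record = {
            "trace_id": str(uuid.uuid4()),  # join key for multi-step traces
            "agent": agent_name,
            "ts": time.time(),
            "prompt": redact(prompt),
        }
        output = call_model(prompt, **kwargs)
        record["output"] = redact(output)
        with open(log_path, "a") as f:  # append-only decision trail
            f.write(json.dumps(record) + "\n")
        return output
    return wrapper
```

Wrapping each agent's provider calls this way is retrofittable without touching agent logic, and it gives legal something concrete (per-decision input/output with timestamps) while the real lineage work happens.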
Now I'm looking at retrofitting observability into agents that were built by people who already left the company. Some of this code is held together with prayers and YAML. One agent calls three different LLM providers and nobody documented why.
Anyone else getting hit with this? How are you handling audit requirements for agent stuff? Our legal team wants full decision trails and I'm not even sure where to start without rebuilding half of this from scratch.
https://redd.it/1pize7u
@r_devops
Jenkins alternative for workflows and tools
We are currently using Jenkins for a lot of automation workflows, calling all kinds of tools with various parameters. What would be an alternative? GitOps is not suitable for all scenarios. For example, I need to restore a specific customer database from a backup. Instead of running a script locally, I want some sort of Jenkins-like pipeline/workflow where I can specify various parameters. What kinds of tools do you use for such scenarios?
https://redd.it/1pj0hw0
@r_devops
ever merged a "quick fix" and merged away your weekend?
Ever merged a "yeah, this looks like a quick fix" and merged away your entire weekend?
Or approved a PR on Friday because you didn't want to be "that guy" — and spent Saturday debugging production?
I keep thinking about how no tutorial teaches you this stuff. You just screw up and learn.
But what if senior devs just... told you their screw-ups? Short videos, real situations, what they'd do differently.
Would you actually pay for that? Or is that what Reddit and YouTube are for?
https://redd.it/1pj2a99
@r_devops
Self-hosted k3s GitHub pipeline
Hi all,
I'm trying to build a DIY CI/CD solution on my VPS using k3s, ArgoCD, Tekton, and Helm.
I'm avoiding PaaS solutions like Coolify/Dokploy because I want to learn how to handle automation and autoscaling manually. However, I'm really struggling with the integration part (specifically GitHub webhooks failing, and issues with my self-hosted registry and Tekton).
It feels like I might be over-engineering for a single server.
- What can I do to simplify this stack while keeping it "cloud-native"?
- Are there better/simpler alternatives to Tekton for a setup like this?
Thanks for any keywords or suggestions!
https://redd.it/1pj24v6
@r_devops
For the Europeans here: how do you deal with agentic compliance?
I've seen a few people complain about this, and with the EU AI Act it's only getting worse. How are you handling it?
https://redd.it/1pj40zu
@r_devops
CDKTF is abandoned.
https://github.com/hashicorp/terraform-cdk?tab=readme-ov-file#sunset-notice
They just archived it. Earlier this year we integrated it deep into our architecture. Sucks.
https://redd.it/1pj6732
@r_devops
How do I actually speedrun DevOps?
My main background is sysadmin; I've been doing it for about 10 years. A few years back I decided to switch to DevOps because I didn't want to do the physical stuff anymore. Mainly printers... I hate printers. Anyway, I started looking, found a DevOps job, and have been at it for 4+ years now.
The boss knew I didn't have actual DevOps experience, but based on my sysadmin background and willingness to learn and tinker, he hired me. (I told him about my whole homelab.)
Here's the thing: at this company, for the past 4 years I haven't really done any actual "DevOps" stuff, simply because of the platforms and environments the company has. We have a GitHub account with a few repos that are mostly AI-generated apps/sites. The rest of the stack is a bunch of websites on other platforms like SiteGround, Squarespace, etc. Basically, for the past 4 years I've been more of a WordPress web admin who occasionally troubleshot someone's Microsoft account/Azure issues. We also have an AWS account, but only use S3 for some images.
Every week, every month I would say to myself "tomorrow I'ma learn docker...or terraform...or I'ma setup a cool ci/cd pipeline in GitHub to learn devops" well everyday I was hella busy with the wp sites and other none DevOps duties that I would never get too do anything else.
Fast-forward to today and the company is being bought out and the tech dep will be gone. So I need to find a job.
While job hunting I realized(and forgot) that I needed actual DevOps experience 😢😅 everyone asking for AWS, GCP, azure, terraform, ansible..and I have NOT touched any of those.
So, how do I learn the most important things in, like, a week or so? I'm great at self-learning. Any project ideas I can whip up to speedrun DevOps?
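One concrete starter project of the kind people suggest for this: containerize a tiny app and wire up a minimal GitHub Actions pipeline. The sketch below assumes a repo with a Dockerfile and a `healthcheck.sh` inside the image; those names are placeholders, not anything from the post.

```yaml
# .github/workflows/ci.yml — minimal build-and-smoke-test pipeline
name: ci
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: docker build -t myapp:${{ github.sha }} .
      - name: Smoke test the image
        run: docker run --rm myapp:${{ github.sha }} ./healthcheck.sh
```

From there, the usual next steps are pushing the image to a registry and provisioning the host it runs on with Terraform, which covers most of the keywords in the job ads.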
My boss has told me to get certified in AWS or something, and yeah, I do want to. I also feel like I can study hard, learn what I need, and reframe everything I've done over the past 4 years as "I automated X on AWS to improve Y" during interviews.
Thoughts? Ideas?
Also, because of my 3 years of experience in basically WordPress and website design, I kind of want to start a side gig doing that. I became a WordPress/Elementor pro, basically. Oh, and I actually learned a lot of JavaScript/HTML/CSS. (I already knew enough Python/Bash from sysadmin work.)
Thanks in advance!
https://redd.it/1pj8cte
@r_devops
The Log Reading Commands That Save Me During On-call
Sharing a guide on the Ubuntu commands that help during log-heavy debugging sessions. These are the ones I use during outages or incident analysis. Might help someone on pager duty.
Link : https://medium.com/stackademic/the-15-ubuntu-commands-i-use-every-time-i-troubleshoot-logs-0858dd876572?sk=b7c55fa75369ceed88e9310a3c94456a
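Not from the linked article, but as a flavor of the genre, here are a couple of stock commands for isolating errors in a log. The sample file and messages below are made up for illustration.

```shell
# Illustrative only: create a small sample log to work against.
printf 'INFO start\nERROR db timeout\nINFO done\nERROR retry failed\n' > /tmp/app.log

# Show only the error lines
grep 'ERROR' /tmp/app.log

# Count them
grep -c 'ERROR' /tmp/app.log

# During a live incident you would more likely follow a real log, e.g.:
# tail -n 100 -f /var/log/syslog
```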
https://redd.it/1pj9u6j
@r_devops
Best way to create an offline iso proxmox with custom packages + zfs
I have tried the Proxmox automated installer and managed to create an ISO, but I have no idea how to make it include Python and Ansible and set up ZFS. Maybe there's a better way of doing it? I'm physically installing 50 Proxmox servers.
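For the ZFS part specifically, the Proxmox automated installer reads a TOML answer file, and ZFS can be selected there. A hedged sketch of what that file can look like follows; the exact keys vary by Proxmox VE version, and the hostname, disks, and password below are placeholders, so check the official automated-installation docs before using it.

```toml
# answer.toml — illustrative, verify keys against your PVE version
[global]
keyboard = "en-us"
country = "us"
fqdn = "pve-node01.example.com"
mailto = "root@example.com"
timezone = "UTC"
root_password = "change-me"

[network]
source = "from-dhcp"

[disk-setup]
filesystem = "zfs"
zfs.raid = "raid1"
disk_list = ["sda", "sdb"]
```

The ISO is then rebuilt with something along the lines of `proxmox-auto-install-assistant prepare-iso proxmox-ve.iso --fetch-from iso --answer-file answer.toml`. Recent versions also support embedding a first-boot script, which is one place an `apt-get install -y ansible python3` step could go instead of baking the packages into the ISO itself.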
https://redd.it/1pj6hxb
@r_devops
I built a unified CLI tool to query logs from Splunk, K8s, CloudWatch, Docker, and SSH with a single syntax.
Hi everyone,
I’m a dev who got tired of constantly context-switching between multiple Splunk UIs, multiple OpenSearch instances, `kubectl logs`, the AWS Console, and SSHing into servers just to debug a distributed issue. I’d rather have everything in my terminal.
I built a tool written in Go called **LogViewer**. It’s a unified CLI interface that lets you query multiple different log backends using a consistent syntax, extract fields from unstructured text, and format the output exactly how you want it.
**1. What does it do?** LogViewer acts as a universal client. You configure your "contexts" (environments/sources) in a YAML file, and then you can query them all the same way.
It supports:
* **Kubernetes**
* **Splunk**
* **OpenSearch / Elasticsearch / Kibana**
* **AWS CloudWatch**
* **Docker** (Local & Remote)
* **SSH / Local Files**
**2. How does it help?**
* **Unified Syntax:** You don't need to remember SPL (Splunk), KQL, or specific AWS CLI flags. One set of flags works for everything.
* **Multi-Source Querying:** You can query your `prod-api` (on K8s) and your `legacy-db` (on VM via SSH) in a single command. Results are merged and sorted by timestamp.
* **Field Extraction:** It uses Regex (named groups) or JSON parsing to turn raw text logs into structured data you can filter on (e.g., `-f level=ERROR`).
* **AI Integration (MCP):** It implements the **Model Context Protocol**, meaning you can connect it to Claude Desktop or GitHub Copilot to let AI agents query and analyze your infrastructure logs directly.
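The field-extraction bullet describes a generic technique worth illustrating: regex named groups turning an unstructured line into filterable fields. Here is a minimal Go sketch of that idea; the pattern and log line are made up for illustration and are not LogViewer's actual configuration.

```go
package main

import (
	"fmt"
	"regexp"
)

// extractFields matches line against a pattern with named capture groups
// and returns a map of group name -> captured value. An empty map means
// the line did not match.
func extractFields(pattern, line string) map[string]string {
	re := regexp.MustCompile(pattern)
	fields := map[string]string{}
	m := re.FindStringSubmatch(line)
	if m == nil {
		return fields
	}
	// SubexpNames()[i] is the name of capture group i ("" for unnamed).
	for i, name := range re.SubexpNames() {
		if name != "" {
			fields[name] = m[i]
		}
	}
	return fields
}

func main() {
	line := "2024-05-01T10:00:00Z ERROR trace_id=abc-123 payment failed"
	pattern := `^(?P<ts>\S+) (?P<level>\S+) trace_id=(?P<trace_id>\S+) (?P<msg>.*)$`
	f := extractFields(pattern, line)
	fmt.Println(f["level"], f["trace_id"], f["msg"])
}
```

Once a line is a map of fields, a filter like `-f level=ERROR` reduces to a simple map lookup.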
[Link to github repo](https://github.com/bascanada/logviewer)
VHS Demo: [https://github.com/bascanada/logviewer/blob/main/demo.gif](https://github.com/bascanada/logviewer/blob/main/demo.gif)
**3. How to use it?**
It comes with an interactive wizard to get started quickly:
`logviewer configure`
Once configured, you can query logs easily.
Basic query (last 10 mins) for the prod-k8s and prod-splunk contexts:
`logviewer -i prod-k8s -i prod-splunk --last 10m query log`
Filter by field (works even on text logs via regex extraction):
`logviewer -i prod-k8s -f level=ERROR -f trace_id=abc-123 query log`
Custom formatting:
`logviewer -i prod-docker --format "[{{.Timestamp}}] {{.Level}} {{KV .Fields}}: {{.Message}}" query log`
It’s open source (GPLv3) and I’d love to get feedback on the implementation or feature requests!
https://redd.it/1pj6d9i
@r_devops
How would you improve DevOps on a system not owned by the dev team
I work in a niche field with a vendor that manages our core system. It’s similar to Salesforce, but it’s a banking system that allows us to edit the files and write scripts in a proprietary programming language. So far no company I’ve worked for that uses this system has figured it out. The core software runs on IBM AIX, so containerizing is not an option.
Currently we have a single dev environment that every dev makes their changes on at the same time, with no source control used at all. When changes are approved to go live the files are simply manually moved from test to production.
Additionally there is no release schedule in our team. New features are moved from dev to prod as soon as the business unit says they are happy with the functionality.
I am not an expert in devops but I have been tasked with solving this for my organization. The problems I’ve identified that make our situation unique are as follows:
No way to create individual dev environments
The core system runs on an IBM PowerPC server running AIX. Dev machines are Windows or Mac, and from my research, there is no way to run locally. It is possible to create multiple instances on a single server, but the disk space on the server is quite limiting.
No release schedule
I touched on this above but there is no project management. We get a ticket, write the code, and when the business unit is happy with the code, someone manually copies all of the relevant files to production that night.
System is managed by an external organization
This one isn't too much of an issue but we are limited as to what can be installed on the host machines, though we are able to perform operations such as transferring files between the instances/servers via a console which can be accessed in any SSH terminal.
The code is not testable
I'd be happy to be told why this is incorrect, but the proprietary language is very bare-bones and doesn't even really have functions. It's basically SQL (but worse) if someone had decided you should also be able to build UIs with it.
As said in my last point, I'd be happy to be told that nothing about this is a particularly difficult problem to solve, but I haven't been able to find a clean solution.
My current draft for devops is as follows:
1. Keep all files that we want versioned in a git repository - this would be hosted on ADO.
2. Set up 3 environments: Dev, Staging, and Production. These would be 3 different servers, or at least Dev would be a separate server from Staging and Production.
3. Initialize all 3 environments to be copies of production and create a branch on the repo to correspond to each environment
4. When a dev receives a ticket, they will create a feature branch off of Dev. This is where I'm not sure how to continue. We may be able to create a new instance for each feature branch on the dev server, but it would be a hard sell to get my organization to purchase more disk space to make this feasible. At a previous organization, we couldn't do it, and the way that we got around that is by having the repo not actually be connected to dev. So devs would pull the dev branch to their local, and when they made changes to the dev environment they would manually copy the changed files into their local repo after every change and push to the dev branch from there. People eventually got tired of doing that and our repo became difficult to maintain.
5. When a dev completes their work, push it to Dev and make a PR to staging. At this point is there a way for us to set up a workflow that would automatically update the Staging environment when code is pushed to the Staging branch? I've done this with git workflows in .NET applications but we wouldn't want it to 'build' anything. Just move the files and run AIX console commands depending on the type of file being updated (i.e. some files need to be 'installed' which is an operation provided by the aforementioned console).
6. Repeat 5 but Staging to Production
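On the question in point 5: Azure DevOps pipelines can do a "copy files and run commands" deploy with no build step, using the stock SSH tasks. A hedged sketch follows; the service-connection name, paths, and console command are placeholders, and the vendor console invocation would need to come from your environment.

```yaml
# azure-pipelines.yml — runs on pushes to the Staging branch
trigger:
  branches:
    include: [staging]

pool:
  vmImage: ubuntu-latest

steps:
  # Copy the changed script files to the AIX staging host.
  - task: CopyFilesOverSSH@0
    inputs:
      sshEndpoint: staging-aix          # SSH service connection to the AIX host
      sourceFolder: $(Build.SourcesDirectory)/scripts
      targetFolder: /app/staging

  # Run whatever console operations the file types require.
  - task: SSH@0
    inputs:
      sshEndpoint: staging-aix
      runOptions: inline
      inline: |
        # e.g. invoke the vendor console's 'install' operation
        /app/console install /app/staging
```

The same pipeline, retargeted at the Production branch and host, would cover point 6; a conditional step keyed on file extension could decide which files get the 'install' treatment.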
So essentially I
am looking to answer two questions. First, how do I explain to the team that their current process is not up to standard? Many of them do not come from a technical background, have been updating these scripts this way for years, and are quite comfortable in their workflow; I experienced quite a bit of pushback trying to do this at my last organization. Is implementing a DevOps process even worth it in this case? Second, does my proposed process seem sound, and how would you address the concerns I raised in points 4 and 5 above?
Some additional info: If it would make the process cleaner then I believe I could convince my manager to move to scheduled releases. Also, I am a developer, so anything that doesn't just work out of the box, I can build, but I want to find the cleanest solution possible.
Thank you for taking the time to read!
https://redd.it/1pjd1mh
@r_devops