Reddit DevOps – Telegram
Do you know any open-source agent that can automatically collect traces like Dynatrace OneAgent?

I work at a large bank, and I’m facing challenges collecting trace data to understand how different components affect my applications. Dynatrace OneAgent is excellent since it automatically collects traces once installed on the server. However, its cost is very high, and I have security concerns because the data is sent over the internet.
We’ve tried using OpenTelemetry, but it requires modifying or re-coding the entire application. That’s fine for new systems, but it’s almost impossible for legacy or third-party applications.
Do you have any ideas or solutions for automatic trace collection in such environments?

https://redd.it/1o10dji
@r_devops
Full-Stack Developer exploring DevOps, DevSecOps, or MLOps, which path makes more sense long-term?

Hey everyone

I’m a Full-Stack Developer (C#, Java, React) with around 3 years of experience, mostly working on backend APIs and microservices in cloud environments (AWS + Kubernetes).

Lately, I’ve been getting more interested in the infrastructure and automation side of things, and I’m planning a career shift within the cloud/engineering space. I’ve narrowed it down to DevOps, DevSecOps, or MLOps, but I’m not sure which direction would be more valuable and sustainable in the long run.

Here’s what I’m trying to figure out:

1. How do DevOps, DevSecOps, and MLOps differ in day-to-day work and responsibilities?
2. What’s the best learning roadmap or certification path (especially on AWS or GCP) to get started?
3. If you’ve worked in more than one of these areas, how did you decide which to stick with?

TL;DR:

* 3 yrs full-stack experience (C#, Java, React, AWS).
* Exploring DevOps, DevSecOps, and MLOps.
* Want to pick one that fits and offers solid long-term growth.

Would love to hear from people working in these fields and what you wish you’d known before switching.

https://redd.it/1o121mq
@r_devops
Built a replit/lovable clone that allows my marketing interns to vibe code but deploys to GCP using my policy guardrails and Terraform - is this something you are asked to build in your org?

I’m experimenting with Claude Code as a DevOps interface.

It acts like Replit — you write code, it generates specs, and then Humanitec (a backend orchestrator, disclaimer I work there) handles the full deployment to GCP. No pipeline. No buttons. Just Claude + infra API.

🎥 Short demo (1 min): https://www.youtube.com/watch?v=jvx9CgBSgG0

Not saying this is production-ready for everyone, but I find the direction interesting. Curious what others here think.

https://redd.it/1o14bzn
@r_devops
Lazy-ECS for quickly managing ECS from command line

My little tool for quickly managing your ECS clusters got such a good response that I've now put quite a lot more effort into it. You can now quickly:

* tail logs from your containers
* compare task definitions
* show environment variables and secrets from your tasks
* force redeployments
* etc.

with a super simple interactive command line tool.

Install with brew or pipx, or skip the install and use the ready-made Docker container.

Yes, I know there are alternatives too. This just solved a bunch of things that annoyed me about the AWS UI and CLI, so I went ahead and wrote a little tool.

I'd love to get any feedback or feature requests.

https://github.com/vertti/lazy-ecs

https://redd.it/1o15ppb
@r_devops
How can I convert application metrics embedded in logs into Prometheus metrics?

I'm working in a remote environment with limited external access, where I run Python applications inside pods. My goal is to collect **application-level metrics** (not infrastructure metrics) and expose them to Prometheus on my backend (which is external to this limited environment).

The environment already uses Fluentd to stream logs to AWS Data Firehose, and I’d like to leverage this existing pipeline. However, Fluentd and Firehose don’t seem to support direct metric forwarding.

To work around this, I’ve started emitting metrics as structured logs, like this:

METRIC: {
    "metric_name": "func_duration_seconds_hist",
    "metric_type": "histogram",
    "operation": "observe",
    "value": 5,
    "timestamp": 1759661514.3656244,
    "labels": {
        "id": 123,
        "func": "func1",
        "sid": "123"
    }
}


These logs are successfully streamed to Firehose. Now I’m stuck on the next step:
How can I convert these logs into actual Prometheus metrics?

I considered using OpenTelemetry Collector as the Firehose stream's destination, to ingest and transform these logs into metrics, but I couldn’t find a straightforward way to do this. Ideally I would also prefer to not write a custom Python service.

I'm looking for a solution that:

* Uses existing tools (Fluentd, Firehose, OpenTelemetry, etc.)
* Can reliably transform structured logs into Prometheus-compatible metrics

Has anyone tackled a similar problem or found a good approach for converting logs to metrics in a Prometheus-compatible way? I'm also open to other suggestions and solutions.
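For reference, the conversion step itself is small if a custom consumer turns out to be unavoidable. The following is only a sketch in standard-library Python, not a recommendation over existing tools. It assumes each METRIC record arrives as a single line, handles only histogram `observe` operations, and uses made-up bucket boundaries (the post does not specify any). `parse_metric_line`, `HistogramStore`, and `BUCKETS` are hypothetical names, and label values are not escaped.

```python
import json
from collections import defaultdict

PREFIX = "METRIC: "
# Assumed bucket upper bounds in seconds; not taken from the post.
BUCKETS = (0.5, 1.0, 5.0, 10.0)


def parse_metric_line(line):
    """Return the decoded metric dict, or None if the line is not a metric log."""
    if not line.startswith(PREFIX):
        return None
    return json.loads(line[len(PREFIX):])


def format_labels(pairs):
    """Render (name, value) pairs as a Prometheus label set, e.g. {a="1",b="2"}."""
    return "{" + ",".join(f'{k}="{v}"' for k, v in pairs) + "}"


class HistogramStore:
    """Accumulates histogram observations per (metric name, label set)."""

    def __init__(self, buckets=BUCKETS):
        self.buckets = tuple(buckets)
        # One counter per bucket plus a final slot for +Inf.
        self.counts = defaultdict(lambda: [0] * (len(self.buckets) + 1))
        self.sums = defaultdict(float)

    def observe(self, event):
        # Ignore anything that is not a histogram observation.
        if event.get("metric_type") != "histogram" or event.get("operation") != "observe":
            return
        labels = tuple(sorted((k, str(v)) for k, v in event.get("labels", {}).items()))
        key = (event["metric_name"], labels)
        value = float(event["value"])
        # Buckets are cumulative: count the value in every bucket whose bound is >= value.
        for i, bound in enumerate(self.buckets):
            if value <= bound:
                self.counts[key][i] += 1
        self.counts[key][-1] += 1
        self.sums[key] += value

    def render(self):
        """Return the accumulated data in the Prometheus text exposition format."""
        lines = []
        seen = set()
        for (name, labels), counts in sorted(self.counts.items()):
            if name not in seen:  # emit the TYPE line once per metric family
                lines.append(f"# TYPE {name} histogram")
                seen.add(name)
            for bound, count in zip(list(self.buckets) + ["+Inf"], counts):
                bucket_labels = labels + (("le", str(bound)),)
                lines.append(f"{name}_bucket{format_labels(bucket_labels)} {count}")
            lines.append(f"{name}_sum{format_labels(labels)} {self.sums[(name, labels)]}")
            lines.append(f"{name}_count{format_labels(labels)} {counts[-1]}")
        return "\n".join(lines) + "\n"
```

The text returned by `render()` could be served on a `/metrics` endpoint for Prometheus to scrape. Note that labels such as `id` may create a very large number of series if their values are unbounded.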

https://redd.it/1o160w8
@r_devops
DevOps Bootcamp Recommendations

Hey everyone,

I’m new to the DevOps subreddit so let me introduce myself.

I come from a SysAdmin and NetEng background (Junior) and want to use my experience to transfer to the DevOps sphere.

I like the concept of DevOps and am passionate about infrastructure and automation. However, I am missing bits and pieces, and more than that, I struggle to understand the full scope of DevOps.

With that said, I’m looking into different bootcamps, 3-6 months (ideally 3), to really level up my knowledge and practical experience within the sphere. I want to hit the ground running.

The reason I want to do a bootcamp is that I struggle with setting up labs for myself and really getting the most out of them. I feel like I've reached the point where I need some guidance, mentoring, tutoring; I just need some help.

I’ve been looking into the TechWorld with Nana DevOps Bootcamp and it does sound very interesting. I like the fact that after the bootcamp you will have actual projects to present when looking for jobs.

Has anyone had any experience with that bootcamp? Would anyone have other options to recommend?

The budget is 3k tops, and I have the time to go through it intensely, so preferably I would want to do it in 3 months.

If you made it this far, thank you for reading!

/C

https://redd.it/1o17mtz
@r_devops
How do AEO platforms deploy .well-known/llms.txt/faq.json to customers’ domains? Looking for technical patterns (CNAME, Workers, FTP, plugins)

Hi everyone — I’m building an AEO/AI-visibility product and I’m trying to figure out how established providers handle per-customer hosting of machine-readable feeds (FAQ/Product/Profile JSON, llms.txt, etc.).

We need a reliable, scalable approach for hundreds+ customers and I’m trying to map real, battle-tested patterns. If you have experience (as a vendor, integrator, or client), I’d love to learn what you used and what problems you ran into.

Questions:

1. Do providers usually require customers to host feeds on their own domain (e.g. https://customer.com/.well-known/faq.json) or do they host on the vendor domain and rely on links/canonical? Which approach worked better in practice?
2. If they host on the client domain, how is that automated?
   * FTP/SFTP upload or HTTP PUT to the origin?
   * CMS plugin (WP/Shopify) that writes the files?
   * GitHub/Netlify/Vercel integration (PR or deploy hook)?
   * DNS/CNAME + edge worker (Cloudflare Worker, Lambda@Edge, Fastly) that serves provider content under client domain?
3. How do you handle TLS for custom domains? ACME automation / wildcard certs / CDN managed certs? Any tips on DNS verification and automation?
4. Did you ever implement reverse proxying with host header rewriting? Any issues with SEO, caching, or bot behaviour?
5. Any operational gotchas: invalidation, cache headers, rate limits, robot exclusions, legal issues (content rights), or AI bots not fetching .well-known at all?
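
To make the DNS/CNAME + edge worker option from question 2 concrete, here is a minimal sketch of the lookup such a worker would perform. It is written in Python purely for illustration and does not use any real edge platform's API. The `FEEDS` store, the `serve_feed` function, and the 300-second cache TTL are all assumptions.

```python
import json

# Hypothetical per-customer feed store, keyed by the customer's own hostname.
FEEDS = {
    "customer.com": {
        "/.well-known/faq.json": {
            "faqs": [{"q": "Example question?", "a": "Example answer."}]
        },
    },
}


def serve_feed(host, path, feeds=FEEDS):
    """Return (status, headers, body) for a request that arrived under a customer's domain."""
    hostname = host.split(":")[0].lower()  # drop an optional port, normalise case
    document = feeds.get(hostname, {}).get(path)
    if document is None:
        return 404, {"Content-Type": "text/plain"}, "not found"
    headers = {
        "Content-Type": "application/json",
        "Cache-Control": "public, max-age=300",  # assumed TTL
    }
    return 200, headers, json.dumps(document)
```

The point of the pattern is that the customer only changes DNS once, and every later feed update happens on the vendor side.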

If you can share links to docs, blog posts, job ads (infra hiring hints), or short notes on pros/cons — that’d be fantastic. Happy to DM for private details.

Thanks a lot!



https://redd.it/1o16fgz
@r_devops
Spacelift Intent MCP - Build Infra with AI Agents using Terraform Providers

Hey everyone, Kuba from Spacelift here!

We’ve built Spacelift Intent to make it much easier to build ad-hoc cloud infrastructure with AI. It’s an MCP server that uses Terraform/OpenTofu providers under the hood to talk directly to your cloud provider, and lets your AI agent create and modify cloud resources.

You can either use the open-source version which is just a binary, or the Spacelift-hosted version as a remote MCP server (there you also get stuff like policies, audit history, and credential management).

Compared to clickops/raw cloud cli invocations it also keeps track of all managed resources. This is especially useful across e.g. Claude Code sessions, as even though the conversation context is gone, the assistant can easily read the current state of managed resources, and you can pick up where you left off. This also makes it easy to later dump it all into a tf config + statefile.

Hope you will give it a try, and curious to hear your thoughts!

Here's the repo: https://github.com/spacelift-io/spacelift-intent


https://redd.it/1o1bl6i
@r_devops
Argo CD got us 80% of the way there… but what about the last mile?

Curious if others have run into this… Argo CD nails GitOps-driven deployments, rollbacks, visibility, etc. But once we started scaling across multiple environments and teams, the last mile (promotion between envs, audit/compliance, complex orchestration) became the real pain point… How are you handling the “glue” work around Argo?

Custom scripting? GitHub Actions / Jenkins? Octopus Deploy? Something else? Feels like everyone’s got their own duct-tape solution here. What’s worked (or blown up) for you?

https://redd.it/1o1di1u
@r_devops
Deciding on a database for our mobile application that has google API+embedded external hardware

Hello!

I'm developing an application for my graduation project using React Native, targeting Android phones. Now that I am considering my database, I have many options, including NoSQL (Firebase), SQL, or Supabase.

Besides the mobile application, we have embedded hardware (an ESP32 that communicates with other hardware and the phone) as well as a Google Calendar API integration in the application (if that matters).

Please recommend a suitable database approach for my requirements! I would appreciate it a lot!

https://redd.it/1o1eb17
@r_devops
Minimus vs Aqua Security: Which One Would You Pick?

I’m currently deep-diving into container security solutions and wanted to get some thoughts on two players that caught my attention: Minimus and Aqua Security.

Here is what I have got after digging in:

Minimus builds ultra-minimal images straight from upstream, stripping out anything unnecessary. That way, you get to start with way fewer CVEs. Less alert noise, faster triage. Integration is also pretty simple. On the downside, Minimus does not offer runtime protection.

Aqua’s the heavyweight. They provide full lifecycle security, scanning, runtime protection, compliance, etc. But it kinda feels reactive. You're securing bloated images, which can slow things down and flood you with alerts. On the upside, Aqua’s runtime protection is pretty solid.

So I’m torn: Do you start clean with Minimus and avoid most issues upfront, or go all-in with Aqua and deal with vulnerabilities as they come?

Anyone using either (or both)? Would love to hear how they fit into your workflows.

https://redd.it/1o1g4rz
@r_devops
AWS/AzDo: Export configuration

We have set up AWS Transfer using CloudFormation and automated the deployment through AzDo. We are now planning for DORA and want to know how best to keep all the configuration outside of AWS for disaster recovery.
Options we have thought of:
1. AzDo artifacts
2. AzDo library using variables
3. Having consumers manually edit the exported JSON file with all the config every time they run the pipeline, which has runtime parameters.

Note: This solution is consumed by non-technical teams who know neither AWS nor AzDo, so we designed it to be very simple. (The business is not ready to maintain a team to manage this solution, so we are just a build-and-hand-over team; it’s a decentralised solution using templates.)

Open to more suggestions

https://redd.it/1o1gv5w
@r_devops
Perspective on Agent Tooling adoption

I have been talking to a bunch of developers and enterprise teams lately, but I wanted to throw this out here to get a broader perspective from all.

Are enterprises actually preferring MCP (Model Context Protocol) servers for production use cases or are they still leaning towards general-purpose tool orchestration platforms?

Is this more about trust both in terms of security and reliability? Enterprises seem to like the tighter control and clearer boundaries MCPs provide, but I’m not sure if that’s actually playing out in production decisions or just part of the hype cycle right now.

Curious what everyone here has seen, especially from those integrating LLMs into enterprise stacks. Are MCPs becoming the go-to for production, or is everyone sticking with their own tools/tool providers?


https://redd.it/1o1gix4
@r_devops
How to learn cloud and K8s fundamentals?

Hey everyone, I know this question has been asked a million, if not a billion, times on this subreddit, but I really want to know good resources for learning cloud fundamentals (mostly AWS) and K8s. It just looks so scary tbh: the config file grows and grows without any logic to me. I've seen various videos explaining these things, but I forget them after a few days. I only feel comfortable in anything I do once I'm very good with the fundamentals. I can make things work with the help of googling and GPT, but that doesn't give me satisfaction. I really want to spend the time to get my concepts so solid that I could basically teach them to my dog. So please, can you all list where you studied these things and how you picked up the fine details of these complex concepts?
Thanks

https://redd.it/1o1m5qd
@r_devops
Why we stopped trusting devs to write good commits

Our dev team's commit history used to be a mess.
Stuff like “fix again,” “update stuff,” “final version real” (alright, maybe not literally like that, but you get the point).
It didn't bother me much until we had to write proper release notes; then it became a nightmare.

Out of curiosity, I got data from around 15k commits across our team repos.
- About 50% were too short to explain anything meaningful.
- Another 30% didn’t follow any convention at all.
- The rest was okay.
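
A breakdown like the one above could be computed along these lines. This is a rough sketch, not the analysis actually used: the 15-character minimum and the Conventional Commits pattern are assumptions, since the post states neither the length threshold nor the convention. `classify` and `summarize` are hypothetical names.

```python
import re
from collections import Counter

# Assumed threshold and convention; the post specifies neither.
MIN_LENGTH = 15
CONVENTIONAL = re.compile(
    r"^(feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\([\w./-]+\))?!?: \S"
)


def classify(message):
    """Label a commit message by inspecting its subject (first) line."""
    stripped = message.strip()
    subject = stripped.splitlines()[0] if stripped else ""
    if len(subject) < MIN_LENGTH:
        return "too_short"
    if not CONVENTIONAL.match(subject):
        return "no_convention"
    return "ok"


def summarize(messages):
    """Return the rounded percentage of messages in each category."""
    counts = Counter(classify(m) for m in messages)
    total = len(messages) or 1  # avoid division by zero on an empty history
    return {
        label: round(100 * counts[label] / total)
        for label in ("too_short", "no_convention", "ok")
    }
```

With these assumed rules, "fix again" and "update stuff" count as too short, "final version real" is long enough but follows no convention, and something like "fix(api): handle empty payload in upload endpoint" passes.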

My team tried enforcing commit guidelines, adding pre-commit hooks, all that, but devs (including myself) would just skip them or do the minimum to make them pass. The problem was that writing a clean message takes effort when you're already mentally done with the task.

So we built an internal tool that reads the staged diff and suggests a commit message automatically. It looks at the code, branch name, previous commits, etc., and tries to describe why the change was made, not just what changed.

It ended up being really useful. We added custom rules for our own commit conventions and some analytics for fun, and it turns out people started "competing" over having the cleanest commits. Code reviews got easier, history made sense, and even getting new devs into the team was easier.

Now we have turned that tool into a full platform. It’s got a cli, web dashboard, team spaces, analytics, etc.

Curious if anyone else has tried to fix this problem differently. Do you guys automate commits in any way, or do you just rely on discipline and PR reviews?

https://redd.it/1o1ojr4
@r_devops
ISSUE - Some users encounter an insecure connection while others have no issues

I have set up an AWS API Gateway which is connected to a CloudFront distribution. The distribution is then connected via a CNAME in Cloudflare (where my domain is).
The certificate is issued by Amazon and used in the CloudFront distribution.

I am not sure what I am doing wrong here. Most of our users have no issues accessing the domain URL (secure connection/HTTPS), while some users around the country (US) face the issue.

How can I fix or debug this issue?
Any kind of help is appreciated.
Thanks

https://redd.it/1o1ldqt
@r_devops
How to progress quickly - Cloud Eng 1

I am a chemical engineer by background who busted my ass to learn how to code and did many personal projects and side projects in my “real job” to get marketable experience. I have been hired as a Cloud Engineer 1 and have been working really hard to wrap my brain around cloud engineering. I know I’m smart because chem e is one of the harder degrees, but this job has me feeling like a dumbass. Some days I feel like I get it and other days I’m a deer in the headlights. Any tips to expedite my learning process? I’m at a Terraform-heavy shop, and that currently makes more sense to me than operating in the GUI. I appreciate any resources or advice (encouragement also welcome) you’d be willing to share. TIA

Edit: for context I’ve been in this job about 2 months.

https://redd.it/1o1r7xv
@r_devops
I’m thinking about learning to program at 38

I have an IT background. I learned HTML, PHP, and how to set up Linux servers in college. I work in tech support, solving issues on Windows and Mac. But it’s been years since I last coded. I want to relearn HTML and learn CSS and JavaScript. I have a Synology server and know a bit about containers. What do you think? Am I too old? I want to learn because I’d like to build apps to help my clients with certain tasks.

https://redd.it/1o1s3em
@r_devops
Looking for Career Advice

I started pursuing DevOps engineering three years ago, coming from a non-technical position as a civil engineer.

It started when I was looking for a career shift that led me to look into IT since I was an IT enthusiast who loved working with Linux and managing home servers.

And since IT was welcoming to non degree holders, I took online courses like CS50X, CS50P, then got into Cloud Computing Bootcamp that teaches AWS. Got certified as AWS SAA and continued upskilling with Basic CCNA concepts, Containerization, IaC, Linux administration, and CICD toward the DevOps concepts, tooling and culture implementation.

I was inclined towards cloud engineering and architecture only, but the job market kept pushing towards DevOps engineering and made no distinction between the cloud engineer and DevOps roles (the difference is only theoretical).

A year after upskilling and building a portfolio, I finally got a DevOps Engineer position. Although the company had no DevOps culture, I worked on implementing it: setting up a complete workflow for the development stages with CICD, using IaC for managing infra, managing Linux servers, and writing Dockerfiles.

I kept improving and showcasing my knowledge by building scalable infrastructure projects, including serverless, focusing on DevOps and GitOps culture, best practices, and cost optimization.

I was even able to run a whole production-level EKS infrastructure integrated with a GitOps workflow for IaC infra, Helm charts and ArgoCD.

I was laid off 6 months ago after 1.5 years of working, with a total of 2 years of experience.

I've been looking for a job for more than a year, with about 11 screening calls, 4 technical interviews, and 2 final interviews that I passed but was then ghosted.

I find it very difficult to find jobs now. There is huge competition and most jobs require 3.5+ years of experience, while every job description is different from the next, with a different stack.

Despite everything I built over these three years, I always feel my skills are not enough.

I am not entry level anymore, and I see my skills as comfortably those of a mid-level engineer. But I'm struggling with this loop of learning, hoping, applying, and being rejected.

I need advice on whether to continue in DevOps or move to a related role. I loved IT systems engineering and server management with automation. But now I'm flexible.

Thanks,

https://redd.it/1o1vxrq
@r_devops
Need advice: Stuck in a niche IT project, want to switch to DevOps – what’s the best approach?

Hi everyone,

I’ve been working in an IT company in Bangalore for the past 2 years as an **Electronic Software Engineer**. I joined a project that was supposed to last around 2 years, but I later realized it’s a very **specific, long-term project** that could continue for 8–10 years. The project is highly specialized and similar opportunities are hard to find in other companies.

Now I feel **stuck in my current role** and want to transition into a **DevOps Engineer role**, or possibly a broader **software development role**.

I came across a **paid DevOps course** that claims to offer placement after completion, but the fee is **₹90K** and I’m unsure whether it’s worth the investment. Internal transfer in my current company is difficult because I handle critical parts of this project, and even if they allow it, I may be pulled back when issues arise.

My questions for this community:

* Is it better to take a structured paid course for a career switch, or learn **DevOps skills independently** and apply directly?
* For someone with 2 years of experience in a niche project, which path is more realistic: **transitioning to DevOps** or **switching to development**?
* How can I **safely plan a career move** without risking financial loss or getting stuck again?

Any advice or personal experiences would be greatly appreciated. Thanks in advance! 🙏

https://redd.it/1o20fil
@r_devops