How liable are DevOps for redundancies in acquisitions (UK)?
Hi folks!
As the title says, my current company was acquired in the last week, and while this is financially an acquisition, in practice it will be a merger, i.e. our company merging into their company.
The next step in the integration phase, AFAIK, is a company restructure, and from what I have read, employees of the acquired company are usually more at risk than the acquirer's employees. That would make me more at risk.
The DevOps team I am in has 7 DevOps engineers, 1 DevOps tech lead and 1 team lead.
I believe their side has 4-5 DevOps engineers.
We host our product heavily on AWS, and from what I can see they use Azure.
My main questions here are:
1. Has anyone been in a similar situation?
2. If so, what happened? What side of the table were you on?
3. How "At Risk" are DevOps engineers in a merger compared to other areas of business?
4. Any other things / pointers you can give me? It is my first time in this situation.
I know it differs company to company, but if I can get a general consensus from others' past experiences, I can come to my own conclusion about whether or not I would be highly at risk.
Any comments are appreciated.
Thanks!
https://redd.it/1q9yrii
@r_devops
Headless browser sessions keep timing out after ~30 minutes. Has anyone managed to fix this?
I've been automating dashboard logins and data extraction using Puppeteer and Selenium for a while now. Single runs are solid, but once I scale to multiple tabs or let jobs run for hours, things start falling apart: sessions randomly expire, cookies disappear, tabs lose state, and accounts get logged out mid-flow. I've tried rotating proxies, custom user agents, persisted cookies, and even moved to headless=new. It helped a bit, but it's still not reliable enough for production workloads. At this point I'm trying to understand what's actually causing the instability. Is it session isolation, anti-automation defenses, browser lifecycle issues, or something else entirely? Looking for approaches or tools that support long-lived, multi-account browser workflows without constant monitoring. Any real-world experience appreciated.
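One pattern that tends to stabilize long-lived, multi-account work — offered as a sketch under assumptions, not a diagnosis of this exact setup — is giving each account its own persistent browser context, so cookies and storage live in a per-account profile directory and survive restarts instead of leaking between tabs. In Playwright for Python (paths, account labels, and the URL below are illustrative):

```python
from pathlib import Path
from playwright.sync_api import sync_playwright

ACCOUNTS = ["acct_a", "acct_b"]  # hypothetical account labels

with sync_playwright() as p:
    for account in ACCOUNTS:
        # One user-data dir per account: cookies and localStorage persist
        # here between runs, and accounts can't clobber each other's session.
        profile_dir = Path("profiles") / account
        context = p.chromium.launch_persistent_context(
            user_data_dir=str(profile_dir),
            headless=True,
        )
        page = context.new_page()
        page.goto("https://dashboard.example.com")  # placeholder URL
        # ... log in if needed, extract data ...
        context.close()  # flushes session state back to profile_dir
```

The same idea applies in Puppeteer via userDataDir; it at least separates "session isolation" problems from anti-automation defenses, which need different countermeasures entirely.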
https://redd.it/1qa1uvy
@r_devops
Self-host GitLab (GitOps) in k8s, or standalone?
Hi! Linux sysadmin and hobby programmer here. I'm learning IaC by converting my home infra using OpenTofu against Proxmox. I use workspaces to launch stages such as dev (and staging etc. in the future), and figured it would be cool to orient everything around that. But as I'm going to learn/use Talos k8s next, I can't figure out how to deploy apps with the same workspace approach in mind, to avoid being repetitive and all that.
I've never automated via GitLab before, but I understand that what's called GitOps is used for automation, and it's baked into GitLab. So the thing I can't figure out is whether I should set up GitLab in k8s or standalone. The former means HA, but if k8s breaks then GitOps goes down with it, I assume. The latter skips the k8s dependency, but gives no HA.
Maybe I'm overthinking this at such an early stage, but I would appreciate some insight into how others set up their self-hosted, IaC-based infrastructure.
Cheers!
https://redd.it/1qa67nj
@r_devops
Struggles at a new org
I've been a DevOps tech lead at an AWS shop for the past ~5 months, alongside some senior engineers and a few juniors, and oh boy - the tech debt and org culture have me seriously reconsidering employment. I'm running into problems like:
- Company has a DevOps team that is treated exclusively like an Ops team. DevOps culture was never adopted and isn't practiced
- Lack of development ownership on product issues. Engineering management fails to hold their teams accountable and isn't responsive to issues in their domain
- Engineering team is comprised of a 50/50 split of contractors and full time engineers, with contractors taking a "that's not my job" approach to problems that bubble up outside of their usual work
- Some of the most spaghetti terraform I've ever had the displeasure of reading - in 0.11.15 no less
- No CI/CD - terraform applies are done locally and software deployments are done by SSH'ing into a Jenkins host to run some wild chain of zsh noscripts
- Chef 0.14.5 is being used to provision new EC2 instances
- Static SSH keys installed on hosts (no SSM)
- IAM users still in use, with only a partial AWS SSO rollout
- A contracting DevOps company was hired to start an EKS migration, but they're at the point of throwing in the towel because of the complexity
- To top it all off, a manager with no technical experience and no spine. I'm not sure how he's still here given his passive nature and lack of ability to lead a team towards change
It would be easier if I were only solving technical issues, but this is both technical and cultural. It feels like a huge step back in my career to go back to managing EC2 instances like pets instead of cattle. As a lead, I'm trying my best to get my manager to understand what a DevOps team is and how it should operate, but I am having a hard time reaching him. He literally manages his team communication through AI, as English isn't his first language; it's quite frustrating to say the least.
When I have time, I've been trying to get them off Terraform 0.11.15 and fix their drift so that there's a standard way for everyone to run things on their local machines, along with a folder structure that makes sense - with CI to follow once things are more consistent. Outside of that, I've been "voluntold" onto a few tiger teams to help product features get off the ground, since I have the keys to the kingdom and can keep developers unblocked.
There's no platform and no structure.
With this situation, do others have experiences on how I could go about tackling the challenges at this org? I'm quite stressed at the moment. Thanks!
https://redd.it/1qb6w2v
@r_devops
Our CI strategy is basically "rerun until green" and I hate it
The current state of our pipeline is gambling.
Tests pass locally. Push to main. Pipeline fails. Rerun. Fails again. Rerun. Oh look it passed. Ship it.
We've reached the point where nobody even checks what failed anymore. Just click retry and move on. If it passes the third time, clearly there's no real bug, right?
I know this is insane. Everyone knows this is insane. But fixing flaky tests takes time and there's always something more urgent.
Tried adding more wait times. Tried running in Docker locally to match the CI environment. Nothing really helped. The tests are technically correct, they're just unreliable in ways I can't pin down.
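One class of flake that extra wait time never fixes is a fixed sleep racing a real event; bounded polling with a loud failure usually does better. A tool-agnostic sketch in Python (the helper and the example predicate are illustrative):

```python
import time

def wait_for(predicate, timeout=10.0, interval=0.2):
    """Poll predicate() until truthy instead of sleeping a fixed amount.

    On timeout, re-raise the last exception seen so the CI log points at
    a cause instead of looking like random flakiness.
    """
    deadline = time.monotonic() + timeout
    last_exc = None
    while time.monotonic() < deadline:
        try:
            if predicate():
                return
        except Exception as exc:  # the condition itself may race
            last_exc = exc
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout}s") from last_exc

# usage: wait_for(lambda: server_is_ready(), timeout=30)  # hypothetical check
```

Paired with retry *reporting* in CI (count reruns per test instead of hiding them), this tends to shrink the "rerun until green" pile down to a named list of offenders.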
One of the frontend devs keeps pushing to switch tools entirely. Been looking at options like Testim, Momentic, maybe even just rewriting everything in Playwright. At this point I'd try anything if it means people stop treating retry as a debugging strategy.
Anyone actually solved this or is flaky CI just something we all live with?
https://redd.it/1qas4ft
@r_devops
One end-to-end DevOps project to learn almost all tools together?
Hey everyone,
I’m a DevOps beginner. I’ve covered the theory, but now I want hands-on experience.
Instead of learning tools separately, I’m looking for ONE consolidated, end-to-end DevOps project where I can see how tools work together, like:
Git → CI/CD (Jenkins/GitLab) → Docker → Kubernetes → Terraform → Monitoring (Prometheus/Grafana) on AWS.
A YouTube series, GitHub repo, or blog + repo is totally fine.
Goal is to understand the real DevOps flow, not just run isolated commands.
If you know any solid project or learning resource like this, please share 🙏
Thanks!
https://redd.it/1qarrve
@r_devops
What DevOps and cloud practices are still worth adding to a live production app?
Hello everyone, I'm totally new to DevOps.
I have a question about applying DevOps and cloud practices to an application that is already in production and actively used.
Let's assume the application is already finished, stable, and running in production. I understand that not all DevOps or cloud practices are equally easy, safe, or worth implementing late, especially things like deep re-architecture, Kubernetes, or full containerization.
My question is: what DevOps and cloud concepts, practices, and tools are still considered late-friendly, low-risk, and truly worth implementing on a live production application? (This is for learning and hands-on practice, not a formal or professional engagement.)
Also, any advice on learning DevOps would be appreciated :))
https://redd.it/1qb9mcf
@r_devops
Observability for AI Models and GPU Inferencing
Hello Folks,
I need some help regarding observability for AI workloads. For those of you who run your own ML models and AI workloads on your own infrastructure, how are you doing observability for them? I'm specifically interested in the inferencing part: GPU load, VRAM usage, processing, throughput, etc. How are you achieving this?
What tools or stacks are you using? I'm currently working in an AI startup where we process a very high number of images daily. We have observability for CPU and memory, and APM for code, but nothing for the GPU and inferencing part.
What kind of tools can I use here to build a full GPU observability solution, or should I go with a SaaS product?
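For context on build-vs-buy: the self-hosted route usually boils down to exporting NVML counters to Prometheus and graphing them in Grafana — NVIDIA's DCGM exporter does this off the shelf, and the DIY shape of it is roughly the sketch below (assumes NVIDIA GPUs plus the nvidia-ml-py and prometheus-client packages; the port is arbitrary):

```python
import time

import pynvml
from prometheus_client import Gauge, start_http_server

# Gauges for Prometheus to scrape; Grafana dashboards sit on top of these.
GPU_UTIL = Gauge("gpu_utilization_percent", "GPU compute utilization", ["gpu"])
VRAM_USED = Gauge("gpu_vram_used_bytes", "VRAM currently in use", ["gpu"])

def main(interval: float = 5.0) -> None:
    pynvml.nvmlInit()
    start_http_server(9400)  # exposes /metrics on :9400
    while True:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            GPU_UTIL.labels(gpu=str(i)).set(util.gpu)
            VRAM_USED.labels(gpu=str(i)).set(mem.used)
        time.sleep(interval)

if __name__ == "__main__":
    main()
```

Per-request inference latency and throughput generally have to come from the serving layer itself (e.g. whatever Triton, vLLM, or your own service already exposes), since NVML only sees the device, not the requests.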
Please suggest.
Thanks
https://redd.it/1qb51ph
@r_devops
Deterministic analysis of Java + Spring Boot + Kafka production logs
I’m working on a **Java tool that analyzes real production logs** from **Spring Boot + Apache Kafka** services.
This is **not an auto-fixing tool** and not a tutorial.
The goal is **fast incident classification + safe recommendations**, the way an experienced on-call / production engineer would reason.
**Example: Kafka consumer JSON deserialization failure**
**Input (real Kafka production log):**
```
Caused by: org.apache.kafka.common.errors.SerializationException:
Error deserializing JSON message
Caused by: com.fasterxml.jackson.databind.exc.InvalidDefinitionException:
Cannot construct instance of `com.mycompany.orders.event.OrderEvent`
(no Creators, like default constructor, exist)
at [Source: (byte[])"{"orderId":123,"status":"CREATED"}"; line: 1, column: 2]
```
**Output (tool result)**
```
Category: DESERIALIZATION
Severity: MEDIUM
Confidence: HIGH
Root cause:
  Jackson cannot construct target event class due to missing creator
  or default constructor.
Recommendation:
  Add a default constructor or annotate a constructor
```
**Example fix:**
```java
public class OrderEvent {
    private Long orderId;
    private String status;

    public OrderEvent() {}

    public OrderEvent(Long orderId, String status) {
        this.orderId = orderId;
        this.status = status;
    }
}
```
# Design goals
* Known **Kafka / Spring / JVM failures** detected via **deterministic rules**
* Kafka rebalance loops
* schema incompatibility
* topic not found
* JSON deserialization errors
* timeouts
* missing Spring beans
* **LLM assistance is strictly constrained**
* forbidden for infrastructure issues
* forbidden for concurrency / threading
* forbidden for binary compatibility (e.g. `NoSuchMethodError`)
* Some failures must **always** result in:
* **No safe automatic fix, human investigation required.**
This project is **not about auto-remediation**
and explicitly avoids “AI guessing fixes”.
It’s about **reducing cognitive load during incidents** by:
* classifying failures fast
* explaining *why* they happened
* only suggesting fixes when they are provably safe
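To make "deterministic rules" concrete: each rule reduces to pattern → fixed verdict, with no model in the loop, so the same log always yields the same answer. A Python sketch of the concept — not the tool's actual implementation, which is Java (the patterns and verdicts below are illustrative):

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Verdict:
    category: str
    severity: str
    confidence: str
    recommendation: str

# Compiled pattern -> fixed verdict. Determinism comes from the table:
# no inference, no sampling, same input always classifies the same way.
RULES = [
    (re.compile(r"InvalidDefinitionException: Cannot construct instance"),
     Verdict("DESERIALIZATION", "MEDIUM", "HIGH",
             "Add a default constructor or an annotated creator")),
    (re.compile(r"UnknownTopicOrPartitionException"),
     Verdict("TOPIC_NOT_FOUND", "HIGH", "HIGH",
             "No safe automatic fix, human investigation required.")),
]

def classify(log: str) -> Optional[Verdict]:
    for pattern, verdict in RULES:
        if pattern.search(log):
            return verdict
    return None  # unknown failures fall through, e.g. to a constrained LLM
```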
**GitHub (WIP):**
[https://github.com/mathias82/log-doctor](https://github.com/mathias82/log-doctor)
# Looking for feedback from DevOps / SRE folks on:
* Java + Spring Boot + Kafka failure coverage
* missing rule categories you see often on-call
* where LLMs should be **completely disallowed**
Production war stories very welcome 🙂
https://redd.it/1qbllc2
@r_devops
Azure VM auto-start app
Azure has auto‑shutdown for VMs, but no built‑in “auto‑start at 7am” feature. So I built an app for that - VMStarter.
It’s a small Go worker that:
• discovers all VMs across any Azure subscriptions it has access to
• sends a start request to each one — **no need to specify VM names**
• runs cleanly as a scheduled Azure Container Apps Job (cron)
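For readers who'd rather see the shape of it than read the Go source, the core loop amounts to roughly this — a Python sketch of the same idea, not the project's actual code (assumes the azure-identity, azure-mgmt-resource, and azure-mgmt-compute packages and an identity with start permissions):

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.resource import SubscriptionClient

credential = DefaultAzureCredential()  # picks up managed identity in ACA

# Enumerate every subscription the identity can see, then every VM in it --
# which is why no VM names need to be specified anywhere.
for sub in SubscriptionClient(credential).subscriptions.list():
    compute = ComputeManagementClient(credential, sub.subscription_id)
    for vm in compute.virtual_machines.list_all():
        # The resource group is embedded in the VM's ARM resource ID:
        # /subscriptions/{sub}/resourceGroups/{rg}/providers/...
        resource_group = vm.id.split("/")[4]
        compute.virtual_machines.begin_start(resource_group, vm.name)
```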
Deployment instructions: https://github.com/groovy-sky/vm-starter#deployment-script
Docker image: https://hub.docker.com/repository/docker/gr00vysky/vm-starter
Any feedback/PRs welcome.
https://redd.it/1qbmr2s
@r_devops
Spark stage cost breakdown on AWS (why distributed tracing isn't helping & how to fix it)
Tempo has been a total headache lately. I’ve been staring at Spark traces in there for weeks now, and I’m honestly coming up empty.
What I really want is simple: a clear picture of which Spark stages are actually driving up our costs.
Here’s the thing… poorly optimized Spark jobs can quietly rack up massive bills on AWS. I’ve seen real-world cases where teams cut infrastructure costs by over 100x on critical pipelines just by pinpointing inefficiencies, and others achieve 10x faster runtimes with dramatically lower spend.
We’re aiming to tie stage-level resource usage directly to real AWS dollar figures, so we can rank priorities and tackle the biggest optimizations first. Right now, though, it just feels like we’re gathering traces with no real insight.
I still can’t answer basic questions like:
Which stages are consuming the most CPU, memory, or disk I/O?
How do we accurately map that to actual spend on AWS?
Here's what I've tried:
Running the OTel Java agent and exporting to Tempo -> massive trace volume, but the spans don’t align meaningfully with Spark stages or resource usage. Feels like we’re tracing the wrong things entirely.
Spark UI -> perfect for one-off debugging, but not practical for ongoing cost analysis across production jobs.
At this point, I’m seriously questioning whether distributed tracing is even the right approach for cost attribution.
Would we get further with metrics and Mimir instead? Or is there a smarter way to structure Spark traces in Tempo that actually enables proper cost breakdown?
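One crude alternative that sometimes gets further than tracing: pull per-stage executor time from the Spark history server's REST API and pro-rate a blended AWS rate by each stage's share. A sketch (the history server URL, application id, and $/core-hour figure are placeholders; executorRunTime comes from Spark's monitoring REST API and is cumulative task time in ms, i.e. roughly core-time):

```python
import requests

HISTORY = "http://history-server:18080/api/v1"  # placeholder host
APP_ID = "app-20240101010101-0001"              # placeholder application id
DOLLARS_PER_CORE_HOUR = 0.05                    # assumed blended EC2 rate

stages = requests.get(f"{HISTORY}/applications/{APP_ID}/stages").json()

# Convert each stage's cumulative task time to core-hours, then dollars.
# Crude -- it ignores memory/IO skew and idle executors -- but it ranks
# hotspots in money terms, which is usually enough to pick targets.
for s in sorted(stages, key=lambda s: s["executorRunTime"], reverse=True)[:10]:
    core_hours = s["executorRunTime"] / 3_600_000
    print(f"stage {s['stageId']:>5}  {s['name'][:45]:45} "
          f"~${core_hours * DOLLARS_PER_CORE_HOUR:,.2f}")
```

That pushes the problem toward metrics (the Mimir route) rather than traces, which matches the experience that spans don't line up with stages.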
I’ve read all the docs, watched the talks, and even asked GPT, Claude, and Mistral for ideas… I’m still stuck.
Any advice or experience here would be hugely appreciated.
https://redd.it/1qbnszj
@r_devops
Built a Real Estate Market Intelligence Pipeline Dashboard using Python + Power BI (Learning Project)
This is a learning project where I attempted to build an end-to-end analytics pipeline and visualize the results using Power BI.
Project overview:
I designed a simple data pipeline using static real estate data to understand how different tools fit together in an analytics workflow, from raw data collection to business-facing dashboards.
Pipeline components:
• GitHub – used as the source for collecting and storing raw data
• Python – used for data cleaning, transformation, and basic processing
• Power BI – used for building the Market Intelligence dashboard
• n8n – used for pipeline orchestration (pipeline currently paused due to technical issues at the automation stage)
Current status:
The pipeline is partially implemented. Data extraction and processing were completed, and the final dashboard was built using the processed data. Automation via n8n is planned but temporarily halted.
Dashboard focus:
• Price overview (average, median, min, max)
• Location-wise price comparison
• Property distribution by number of bedrooms
• Average price per square foot
• Business-oriented insights rather than purely visual design
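For the Python step, metrics like the ones above reduce to a handful of pandas aggregations — a sketch with assumed file and column names (price, location, bedrooms, sqft):

```python
import pandas as pd

df = pd.read_csv("listings_clean.csv")  # assumed output of the cleaning step

price_overview = df["price"].agg(["mean", "median", "min", "max"])
price_by_location = (
    df.groupby("location")["price"].mean().sort_values(ascending=False)
)
beds_distribution = df["bedrooms"].value_counts().sort_index()
df["price_per_sqft"] = df["price"] / df["sqft"]

# Export as tidy tables for Power BI to consume.
price_overview.to_csv("metric_price_overview.csv")
price_by_location.to_csv("metric_price_by_location.csv")
```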
This project was done independently as part of learning data pipelines and analytics workflows.
I’d appreciate constructive feedback—especially on pipeline design, tooling choices, and how this could be improved toward a more production-ready setup.
https://redd.it/1qboimq
@r_devops
My review of Orca security for cloud based vuln management
Been a Tenable shop for vuln management for years, brought on Orca about a year ago. Figured I'd share what I've found.
Context: 80+ AWS accounts at any given time. QoL for multi-account handling matters a lot - main reason we moved off Tenable.
Orca's been overall good, but not without faults. UI gets sluggish when you're filtering across everything - annoying but livable.
The query language took me longer than it should have to get comfortable with; I ended up bugging our CSM more than I wanted to early on.
Once you're past that though, day-to-day is good. Less painful than I expected at our scale.
As I said at the start, main use is vuln management and that hasn't let me down yet.
Agentless scanning works, good enough exploitability context, multi-account handling is better than what we had, or at least less annoying to deal with.
Alerting took some tuning to not be noisy as hell but once it's dialed it stays dialed.
Other stuff worth mentioning:
Exports: no weird formatting when pulling compliance reports, which is more than I can say for some tools
Deleted resources: clears out fast, not chasing ghosts
Attack paths: actually useful for explaining risk to non-security people, good for getting buy-in
Dashboards: CVE data populates clean, prioritization logic makes sense without having to customize everything
Overall, not a perfect tool but it's been a net positive. Does what I need it to do.
https://redd.it/1qbpaay
@r_devops
I need feedback on an open-source CLI that scans AI models (Pickle, PyTorch, GGUF) for malware, verifies HF hashes, and checks licenses
Hi everyone,
I've created a new CLI tool to secure AI pipelines. It scans models (Pickle, PyTorch, GGUF) for malware using stack emulation, verifies file integrity against the Hugging Face registry, and detects restrictive licenses (like CC-BY-NC). It also integrates with Sigstore for container signing.
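To make the Pickle part concrete: the naive, static version of such a scan walks the opcode stream and flags risky imports, and anything resolved dynamically (STACK_GLOBAL) is exactly what pushes real scanners toward the stack emulation mentioned above. A rough sketch, not Veritensor's actual implementation (the denylist is illustrative):

```python
import pickletools

RISKY_MODULES = {"os", "posix", "nt", "subprocess", "builtins"}  # illustrative

def scan_pickle(path):
    """Return (offset, detail) pairs for suspicious opcodes in a pickle."""
    findings = []
    with open(path, "rb") as f:
        data = f.read()
    for opcode, arg, pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # arg looks like "module qualname"
            module = str(arg).split()[0].split(".")[0]
            if module in RISKY_MODULES:
                findings.append((pos, f"imports {arg}"))
        elif opcode.name == "STACK_GLOBAL":
            # Import target is assembled on the stack at unpickle time,
            # so a static walk can't resolve it -- hence stack emulation.
            findings.append((pos, "dynamic import; unresolvable statically"))
    return findings

# usage: scan_pickle("model.pkl")  # hypothetical file
```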
GitHub: https://github.com/ArseniiBrazhnyk/Veritensor
Install: pip install veritensor
If you're interested, check it out and let me know what you think and whether it might be useful to you.
https://redd.it/1qbpjja
@r_devops
What are the basic tools you would suggest for a DevOps newbie ?
Python, GitHub Actions, Terraform, Docker, K8s... anything else?
https://redd.it/1qbrvkt
@r_devops
What causes VS Code to bypass Husky hooks, and how can I force the Source Control commit button to behave exactly like a normal git commit from the terminal?
I have a Git project with Husky + lint-staged configured.
When I run git commit from the terminal, the pre-commit hook executes correctly.
However, when I commit using the VS Code Source Control UI, the Husky hook is completely skipped.
https://redd.it/1qbnfe9
@r_devops
I built an interactive tutorial for learning docker I wish I had when I was learning Docker
Hello Everyone,
I've always had a passion for teaching new technologies and concepts, so I decided to build this interactive tutorial for learning Docker.
Link to tutorial: https://learn-how-docker-works.vercel.app/
https://redd.it/1qbufky
@r_devops
Why 'works on my machine' means your build is broken
We’ve been using Nix derivations at work for a while now. Steep learning curve, no question, but once it clicks, it completely changes how you think about builds, CI, and reproducibility.
What surprised me most is how many “random” CI failures were actually self-inflicted: network access, implicit system deps, time, locale, you name it.
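A cheap way to surface those self-inflicted failures, before adopting Nix or Bazel, is to run the build under a deliberately minimal, pinned environment so hidden dependencies fail fast. A sketch (the build command and the PATH entries are illustrative):

```python
import subprocess

# Deliberately minimal environment: anything the build implicitly takes
# from the developer's shell (extra PATH entries, locale, timezone,
# proxies) now fails loudly instead of passing locally and flaking in CI.
MINIMAL_ENV = {
    "PATH": "/usr/bin:/bin",
    "HOME": "/tmp/hermetic-home",
    "LANG": "C.UTF-8",
    "TZ": "UTC",
    "SOURCE_DATE_EPOCH": "0",  # pins "now" for tools that honor it
}

result = subprocess.run(["make", "build"], env=MINIMAL_ENV)
raise SystemExit(result.returncode)
```

It's nowhere near hermetic (no network or filesystem sandboxing), but it catches the PATH/locale/time class of failure the same way a real sandbox would.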
I tried to write down a tool-agnostic mental model of what makes a build hermetic and why it matters, before getting lost in Nix/Bazel specifics.
If you’re curious, I put the outline here:
https://nemorize.com/roadmaps/hermetic-builds
https://redd.it/1qbx6lg
@r_devops
#devops #infrastructureascode #kvm #virtualization #opensource #linux #automation #cloudnative #techinnovation #github #virsh #bash #terraform #kickstart #qemu #kvm | Ahmad M Waddah
Check out KVMSpinUps
https://redd.it/1qbyxld
@r_devops
Tabletop incident exercises feel so cringe
We have an incident response plan, on-call, alerts, and postmortems. But we’ve never done a proper tabletop exercise. Now bigger customers keep asking if we test IR, and I’m realizing they’re expecting something more formal than “we handle incidents.”
Don't get me wrong I’m not against doing tabletops, it just feels like one more thing that turns into paperwork.
What’s the simplest way to do it without making it cringe or useless?
https://redd.it/1qbyjnh
@r_devops
Hosting a Hugo site and a Laravel app on the same server
Hi guys,
I don't know whether this is the right sub to ask this. I have a DO droplet, and on it I want to host a Hugo static site and a Laravel app. Hugo generates routes automatically based on its content: for example, if you have /content/posts/about.md, the site will generate a route like example.com/posts/about.
I want that behaviour as well, plus I want to deploy my Laravel application on the same domain, like example.com/app. How can I do that? A subdomain approach is not possible for SEO reasons.
https://redd.it/1qc10v7
@r_devops