Reddit DevOps – Telegram
Custom Podman Container Dashboard?

I have a bunch of Docker containers (well, technically Podman containers) running on a Linux node, and it's getting to a point where it's annoying to keep track of them all. I have all the necessary identifying information (like requestor, POC, etc.) added as labels to each container.

I'm looking for a way to create something like a dashboard that presents this information (container name, status, label1, label2, label3) in a nice tabular form.

I've already experimented with Portainer and Cockpit but couldn't really create a customized view per my needs. Does anyone have any ideas?
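If nothing off the shelf fits, `podman ps --format json` already emits every container with its labels, so a short script can render exactly the columns you want. A minimal sketch; the embedded data is a hypothetical stand-in for what you'd pipe in from `podman ps --all --format json` (real output has the same `Names`, `State`, and `Labels` fields):

```python
import json

# Stand-in for the output of: podman ps --all --format json
SAMPLE = '''[
  {"Names": ["billing-api"], "State": "running",
   "Labels": {"requestor": "alice", "poc": "bob"}},
  {"Names": ["etl-worker"], "State": "exited",
   "Labels": {"requestor": "carol", "poc": "dan"}}
]'''

def render_table(podman_json: str, label_keys: list[str]) -> str:
    """Render name, status, and the chosen labels as an aligned table."""
    rows = [["NAME", "STATUS", *[k.upper() for k in label_keys]]]
    for c in json.loads(podman_json):
        labels = c.get("Labels") or {}
        rows.append([c["Names"][0], c["State"],
                     *[labels.get(k, "-") for k in label_keys]])
    widths = [max(len(r[i]) for r in rows) for i in range(len(rows[0]))]
    return "\n".join("  ".join(cell.ljust(w) for cell, w in zip(r, widths))
                     for r in rows)

print(render_table(SAMPLE, ["requestor", "poc"]))
```

Stick that behind a tiny web endpoint (or just cron it into a text file) and you have a custom dashboard without fighting Portainer's fixed views.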

https://redd.it/1onsszc
@r_devops
How do you size VPS resources for different kinds of websites? Looking for real-world experience and examples.

I’m trying to understand how to estimate VPS resource requirements for different kinds of websites — not just from theory, but based on real-world experience.

Are there any guidelines or rules of thumb you use (or a guide you’d recommend) for deciding how much CPU, RAM, and disk to allocate depending on things like:

* Average daily concurrent visitors

* Site complexity (static site → lightweight web app → high-load dynamic site)

* Whether a database is used and how large it is

* Whether caching or CDN layers are implemented

I know “it depends” — but I’d really like to hear from people who’ve done capacity planning for real sites:

* What patterns or lessons did you learn?

* What setups worked well or didn’t?

* Any sample configurations you can share (e.g., “For a small Django app with ~10k daily visitors and caching, we used 2 vCPUs and 4 GB RAM with good performance.”)?

I’m mostly looking for experience-based insights or reference points rather than strict formulas.
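For anyone who wants a starting point before the experience reports come in, the usual back-of-envelope is RAM = base OS + (workers x per-worker footprint) + DB/cache, then headroom on top. A toy calculator; every default here is an illustrative assumption, not a recommendation:

```python
def estimate_ram_mb(concurrent: int, users_per_worker: float = 5.0,
                    worker_mb: int = 150, os_mb: int = 300,
                    db_mb: int = 1024, headroom: float = 0.3) -> int:
    """Rough VPS RAM estimate: OS + app workers + database, plus headroom.

    Assumes each worker can serve users_per_worker concurrent users
    (very app-dependent: far higher for static, lower for heavy dynamic).
    """
    workers = max(2, round(concurrent / users_per_worker))
    total = os_mb + workers * worker_mb + db_mb
    return round(total * (1 + headroom))

# e.g. ~50 concurrent users on a Django-style app with a small Postgres:
print(estimate_ram_mb(50))
```

The point of writing it down isn't the number, it's making each assumption explicit so real measurements can replace it.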

Thanks in advance!

https://redd.it/1onlpxe
@r_devops
Dudes, I'm scared. I know it's a scam, but what if it's not? Have you ever received an email like this before? And how did it go?

Dudes, I'm just a hobbyist (if you look at my LinkedIn, you'll notice I'm not a programmer). I've learned programming, algorithms, and design patterns all by myself. I also publish articles on my Medium blog documenting the concepts I've learned from my reading and online study.

I recently discovered an email written in Chinese in my Gmail spam folder and have translated its contents using Google Translate.

---
Tonghuashun AIME Program Invitation

Hello,

I am XXXXX, HR from Hexin Tonghuashun (Stock Code: 300033). We noticed your excellent background in development on GitHub, which highly matches the requirements for our AI Engineering Development position. The specific focus areas include, but are not limited to, algorithm application, algorithm engineering, and large AI model development.

Core Advantages of the Position

- Salary benchmarked against top-tier tech companies.

- Listed company stock incentives provided through the AIME Talent Double Hundred Plan.

- Participation in the R&D of AI products with millions of users.

- Tech Stack: Engineering (Java/Web/C++/etc.) and cutting-edge algorithms (Large Models/NLP/AIGC/Robotics/Speech/etc.).

Company Profile

Zhejiang Hexin Tonghuashun Network Information Co., Ltd. (Tonghuashun), established in 1995 and listed on the Shenzhen Stock Exchange in 2009 (Stock Code: 300033), is the first listed company in China's internet financial information services industry. We currently have over 7,000 employees, with our headquarters located in the beautiful and livable city of Hangzhou.

As an internet financial information provider, Tonghuashun's main business is to offer software products and system maintenance services, financial data services, and intelligent promotion services to various institutions, and to provide financial information and investment analysis tools to individual investors. To date, Tonghuashun has nearly 600 million registered users and over ten million daily active users. We have established business cooperation with over 90% of domestic securities companies, with a strong "moat" business ensuring stable cash flow for the company.

Supported Business

Based on comprehensive AI capabilities such as large models, NLP, speech, graphics, image, and vision, we currently cover multiple 2B and 2C application scenarios. Our numerous products include the intelligent investment advisory robot AIME, intelligent service, data intelligence, smart healthcare, AIGC, and digital humans. Targeting various regions including China, Europe and the US, the Middle East, and Southeast Asia, we are progressively realizing the path of technology commercialization and product marketization. The AI team has accumulated over ten years of experience, with hundreds of large model application scenarios implemented, trillions of user financial dialogue data points, and a self-built cluster of thousands of cards for computing power. We are one of the first domestic enterprises to receive cybersecurity administration approval for financial large models.

I look forward to discussing this further with you! You can contact me via:

- WeChat/Phone: XXXX

- Email: XXXX

If you are interested, please feel free to contact me at any time, and I will arrange a detailed conversation for you.

Wishing you the best of business!

HR XXX | Zhejiang Hexin Tonghuashun Network Information Co., Ltd.

Company Website: https://www.10jqka.com.cn/

----

OK, now I don't know what to do. I know this could be spam, but what if it isn't? I mean, the links look real.

here is my git if you're interested in what they've seen: https://github.com/EDBCREPO

Have you ever received a mail like this before?

----

EDIT: f@@k, the link and job look real: https://campus.10jqka.com.cn/job/list?type=school



https://redd.it/1ony2tl
@r_devops
How are you handling these AWS ECS (Fargate) issues? Planning to build an AI agent around this…

Hey Experts,

I’m exploring the idea of building an AI agent for AWS ECS (Fargate + EC2) that can help with some tricky debugging and reliability gaps — but before going too far, I’d love to hear how the community handles these today.

**Here are a few pain points I keep running into 👇**

* When a process slowly eats memory and crashes — and there’s no way to grab a heap/JVM dump *before* it dies.
* Tasks restart too fast to capture any “pre-mortem” evidence (logs, system state, etc.).
* Fargate tasks fill up ephemeral disk and just get killed, no cleanup or alert.
* Random DNS or network resolution failures that are impossible to trace because you can’t SSH in.
* A new deployment “passes health checks” but breaks at runtime after a few minutes.
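On the first two points, one stopgap that needs no SSH is a watchdog inside the task that samples its own RSS and fires a dump callback at a soft threshold, before the hard limit kills it. A minimal sketch of the idea; the 90% ratio and the dump action are placeholders, and `ru_maxrss` is in kB on Linux:

```python
import resource

def rss_kb() -> int:
    """Peak resident set size of this process (kB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def check_memory(limit_kb: int, dump, soft_ratio: float = 0.9) -> bool:
    """Fire `dump` once usage crosses soft_ratio * limit; call on a timer.

    In a real task this runs in a background thread, and `dump` ships a
    heap/thread dump, df output, etc. to S3 while the process is alive.
    """
    used = rss_kb()
    if used >= limit_kb * soft_ratio:
        dump(used)  # capture pre-mortem evidence before the OOM kill
        return True
    return False

events = []
# Simulate: a tiny "limit" guarantees the soft threshold is already crossed
check_memory(limit_kb=1, dump=events.append)
print(events)
```

Same pattern works for ephemeral disk: swap `rss_kb` for a `statvfs` check and have the callback prune or alert.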


**I’m curious**

* Are you seeing these kinds of issues in your ECS setups?
* And if so, how are you handling them right now: scripts, sidecars, observability tools, or just postmortems?



Would love to get insights from others who’ve wrestled with this in production. 🙏

https://redd.it/1onzb40
@r_devops
Paid Study: Help us improve Virtual Machine Tools – $150 for a 60-minute interview

We’re conducting a paid research study to learn more about how professionals create, manage, and provision virtual machines (VMs) at work. Our goal is to better understand your workflows and challenges so we can make VM tools more efficient and user-friendly.

Details:

- Compensation: $150 USD for a 60-minute 1:1 conversation

- Format: Online interview via Zoom or Teams

- Who we’re looking for: Anyone who creates or uses virtual machines, at any experience level or for any type of application

- Priority: Participants with a LinkedIn profile linked to our platform will be considered first



If you’re interested, please send me a message or comment below and I’ll share the next steps.

Your feedback will directly help improve the tools used by thousands of professionals worldwide.

https://redd.it/1oo1fe7
@r_devops
Tired of applying everywhere - Looking for Fresher DevOps / Cloud Support / Linux Opportunity

Hey everyone,

I’m a recent Computer Science graduate actively looking for fresher roles in DevOps, Cloud Support, or Linux.
I’ve applied to many companies and portals, but most either ask for experience or never respond — it’s been really tough finding that first break.

I’ve learned and practiced:

Linux
AWS (EC2, S3, IAM, Lambda basics)
Docker & Kubernetes
Git/GitHub
CI/CD concepts
I’m genuinely passionate about DevOps and Cloud, and I’m just looking for that first opportunity to prove myself.
Preferably looking for roles in Pune or remote.

If anyone here knows of openings or referrals, I’d really appreciate your help 🙏

Thanks a lot for reading and supporting freshers like me!

https://redd.it/1oo3d4e
@r_devops
India's largest automaker Tata Motors showed how not to use AWS keys

guy found two exposed aws keys on public sites, which gave access to ~70tb of internal data - customer info, invoices, fleet tracking, you name it

they also had a decryptable aws key (encryption that did nothing), a backdoor in tableau where you could log in as anyone with no password, and an exposed api key that could mess with their test-drive fleet

cert-in tried to get tata to fix it, but it took months of back-and-forth before the keys were finally rotated

link: https://eaton-works.com/2025/10/28/tata-motors-hack/ and https://news.ycombinator.com/item?id=45741569

https://redd.it/1oo402w
@r_devops
Those of you who switched from DataDog to Google Observability - do you miss anything?

The company I work for is switching from DataDog to Google's own offering, mostly driven by cost. At surface level the offering seems to be on par, but I wonder if we will discover missing pieces after it's too late?

https://redd.it/1oo36h7
@r_devops
data democratization aka automation and management of data platforms

Hi folks, are you aware of any platforms that can help with managing a large number of users on big data lakes? What I mean is: say you have a product like Databricks and we want to manage, per user, how much access someone has. We want to streamline this with a flow like: user raises a request somewhere -> automated script grants access -> access is revoked automatically after a set time,
and also log who had what access, etc.
Of course a custom solution is possible, but I was hoping for opinions on whether anything like this already exists.
Thanks for your time, have a good one
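If you do end up rolling your own, the core of the flow (grant with an expiry, auto-revoke on a timer, audit log) is small. A sketch of just the bookkeeping piece, with the actual GRANT/REVOKE calls to your data platform stubbed out as comments; all names here are made up:

```python
import time

audit_log = []

class AccessLedger:
    """Tracks time-boxed grants; real grant/revoke calls would wrap these."""

    def __init__(self):
        self.grants = {}  # (user, dataset) -> expiry timestamp

    def grant(self, user: str, dataset: str, ttl_s: float) -> None:
        self.grants[(user, dataset)] = time.time() + ttl_s
        audit_log.append(("GRANT", user, dataset))
        # here: issue the platform's GRANT (e.g. a SQL GRANT statement)

    def sweep(self) -> None:
        """Run on a timer: revoke every grant past its expiry."""
        now = time.time()
        for key, expiry in list(self.grants.items()):
            if now >= expiry:
                del self.grants[key]
                audit_log.append(("REVOKE", *key))
                # here: issue the matching REVOKE

ledger = AccessLedger()
ledger.grant("alice", "sales_lake", ttl_s=0)  # already expired, for demo
ledger.sweep()
print(audit_log)
```

The audit log doubles as the "who had what access" record the post asks about.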

https://redd.it/1oo9x3u
@r_devops
EKS Node Resource Limits

I am currently undertaking the task of auditing EKS Node resource limits, comparing the limits to the requests and actual usage for around 40 applications. I have to pinpoint where resources are being wasted and propose changes to limits/requests for these nodes.

My question for you all is: what percentage above average usage should I set the resource limits? I know we still need some wiggle room, but say an application is using on average 531Mi of memory while the limit is at 1000Mi (~1 GiB). That limit obviously needs to come down, but to where? 600Mi I think would be too close. Is there a rule of thumb to go by here?

Likewise, the same service uses 10.1 millicores of CPU on average, but the limit is set to 1 core. I know CPU throttling won't bring down an application, but I'd like to keep wiggle room there too; I'm just not sure how close to bring the limit to the average usage. Any advice?
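One rule of thumb people commonly cite: set requests near observed average (or p50) usage, and set memory limits around p99 usage plus 20-50% headroom, which for a 531Mi average would land nearer 750Mi than 600Mi. Expressed as a spec fragment, with the caveat that these numbers are illustrative and should come from your own percentiles:

```yaml
resources:
  requests:
    cpu: 25m          # near average usage, so the scheduler packs honestly
    memory: 600Mi     # around p50-p90 of observed usage
  limits:
    # Many teams omit CPU limits entirely, since throttling only adds
    # latency; memory limits sit above p99 so spikes don't OOM-kill the pod.
    memory: 768Mi
```

The percentile framing matters more than the exact multiplier: averages hide the spikes that actually trigger OOM kills.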

https://redd.it/1oo78yq
@r_devops
GitOps role composition pattern for deployments?

Is anyone utilizing or has anyone utilized a cluster role-based composition pattern for deployments? Any other patterns?

Currently spinning up ArgoCD for my current org and looking at implementing this efficiently for scalability.

At my previous org, we wound up having things a bit scattered about with ~30 AppSets and 30 applications (separate from appsets, for individual clusters).

It was manageable as we didn't change things much but I could see running into scaling issues as far as effort/maintenance goes down the road.
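For comparison, the role-composition idea maps naturally onto one ApplicationSet per role using the cluster generator's label selector, so adding a cluster with the right labels picks up the whole role with no new Applications. A hedged sketch; the names, labels, and repo path are all made up:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: role-ingress            # one AppSet per composable role
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            role/ingress: "true"   # clusters opt in to the role via labels
  template:
    metadata:
      name: '{{name}}-ingress'
    spec:
      project: default
      source:
        repoURL: https://github.com/example-org/gitops.git  # hypothetical
        targetRevision: main
        path: roles/ingress
      destination:
        server: '{{server}}'
        namespace: ingress
```

That keeps the AppSet count equal to the role count rather than growing with clusters.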

I would appreciate getting a second set of eyes to see if this makes sense or if I'm going to run into issues I haven't thought of: https://github.com/SelfhostedPro/ArgoCD-Role-Composition

https://redd.it/1ooejsr
@r_devops
How a tiny DNS fault brought down AWS us-east-1 and what devops engineers can learn from it

When AWS us-east-1 went down due to a DynamoDB issue, it wasn’t really DynamoDB that failed: it was DNS. A small fault in AWS’s internal DNS system triggered a chain reaction that affected multiple services globally.

It was actually a race condition between multiple DNS Enactors that were trying to modify Route 53.
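For intuition, the failure mode is the classic check-then-act race: a slow writer applies a plan that was already stale by the time it landed, clobbering a newer one. A toy model of that dynamic; the names and the versioning guard are illustrative, not AWS's actual code:

```python
# Toy model of two "enactors" applying DNS plans out of order.
applied = {"plan_version": 0, "records": ["ip-old"]}

def apply_plan(version: int, records: list[str], guard: bool) -> None:
    """Write a plan to the (fake) DNS table.

    With guard=False, a delayed enactor can overwrite a newer plan,
    which is the stale-write race. With guard=True, stale writes are
    rejected by comparing plan versions first.
    """
    if guard and version <= applied["plan_version"]:
        return  # stale plan: refuse to apply
    applied["plan_version"] = version
    applied["records"] = records

# Enactor A applies plan v2, then a delayed Enactor B lands with stale v1:
apply_plan(2, ["ip-new"], guard=False)
apply_plan(1, ["ip-old"], guard=False)   # clobbers v2: wrong records go live
print(applied["records"])

# With a version check, the stale write is dropped:
applied = {"plan_version": 0, "records": ["ip-old"]}
apply_plan(2, ["ip-new"], guard=True)
apply_plan(1, ["ip-old"], guard=True)
print(applied["records"])
```

The real system is far more involved, but the unguarded write is the essence of why one delayed actor could take records offline.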

If you’re curious how AWS’s internal DNS architecture (Enactor, Planner, etc.) actually works and why this fault propagated so widely, I broke it down in detail here:

Inside the AWS DynamoDB Outage: What Really Went Wrong in us-east-1 https://youtu.be/MyS17GWM3Dk

https://redd.it/1ooi45v
@r_devops
What guardrails do you use for feature flags when the feature uses AI?

Before any flag expands, we run a preflight: a small eval set with known failure cases, observability on outputs, and thresholds that trigger rollback. Ownership is by role, not by person, and we document the path to stable.
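For anyone doing something similar, the preflight-plus-threshold part is easy to mechanize. A sketch of that gate; the eval cases, the fake model, and the 0.95 threshold are all illustrative:

```python
def preflight(eval_cases, model, pass_threshold: float = 0.95):
    """Run known failure cases through the AI path; return (ok, pass_rate).

    If pass_rate is below the threshold, the flag expansion is blocked
    (or an already-expanded flag is rolled back).
    """
    results = [model(prompt) == expected for prompt, expected in eval_cases]
    pass_rate = sum(results) / len(results)
    return pass_rate >= pass_threshold, pass_rate

# Illustrative eval set including a known-tricky case:
cases = [("2+2", "4"), ("capital of France", "Paris"), ("tricky edge", "safe")]
fake_model = {"2+2": "4", "capital of France": "Paris",
              "tricky edge": "unsafe"}.get
ok, rate = preflight(cases, fake_model)
print(ok, round(rate, 2))  # gate blocks: only 2 of 3 cases passed
```

The same function can run on a schedule post-expansion so the rollback threshold stays live, not just at launch.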

Which signals or tools made this smoother for you?

What do you watch in the first twenty-four hours?

https://redd.it/1oo5u1m
@r_devops
LeetCode style interview for DevOps role

Curious if anyone has done any LeetCode style interviews recently?

Recently interviewed for a Senior DevOps role at a FAANG adjacent company which was a 6 stage process.

I thought I was doing pretty well after going through multiple stages covering system design, architecture, reliability engineering, scenario-based troubleshooting, etc., and even got through some coding exercises in Python.

One of the interviewers was changed last minute. I was told it would purely be a cultural-fit type of interview, but it ended up being a couple of LeetCode-style problems, which completely threw me off; I kind of bombed and struggled to get through them.

I'm fairly experienced with Python but never learned DSA as I don't have a software engineering background and was frustrated to get failed on this after everything.

https://redd.it/1ookpme
@r_devops
Terraform + AWS Questions

So I'll try to keep this brief. I am an SDET learning Terraform as well as AWS. I think I mostly have "demo" stuff working, but I wanted to pose a list of questions off the top of my head:

1. Right now I think one S3 bucket per AWS account makes the most sense (for storing state). From my understanding the "key" is what determines both the Terraform state file path and the LockID. However, I'm not sure: if, for example, you define a backend in an s3.tf file, does the LockID use the key or the key plus the bucket name?
2. Sort of a follow-up to #1: any suggestions for naming conventions for the state file key? Something like environment/project/terraform.tfstate or similar?
3. When it comes to Terraform, I know there is the chicken and the egg sort of thing. What's the proper way to handle this? Some sort of bootstrap .tf file? From my understanding basically you would do that OR set up the s3 bucket manually and then import it? How does that usually go?
4. What are the main resources you think a newcomer should focus on tracking? Right now I'm just doing the backend S3 bucket, Elastic Beanstalk (application and environment), and RDS.
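On 1 and 2: with the DynamoDB-based locking, the lock ID is derived from the bucket and the key together (it takes the form `<bucket>/<key>`), so two projects sharing a bucket but using different keys lock independently. A hedged example of the usual shape, with made-up names and an env/project key convention:

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"                 # one per account
    key            = "prod/payments-api/terraform.tfstate"  # env/project/...
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"   # LockID = bucket + "/" + key
    encrypt        = true
  }
}
```

For the chicken-and-egg in question 3, a one-off bootstrap config (or manual bucket creation followed by `terraform import`) is the standard answer; either way, the bucket itself is the only thing managed outside the normal flow.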

https://redd.it/1ookd9a
@r_devops
Tofu/Terraform Modules for enterprise

So I'm looking to set up a tofu module repo, but all the examples I can find show each module having its own git path to be loaded from.

Is there a way to load an entire repo of modules? Or do I have to roll a provider to do that?

I just want to put the classic stuff in place like tag requirements and sane defaults etc.

I got the backend config sorted by putting it in the pipeline templates so each init step gets the right settings. But I'm struggling with the best way to centralize modules.

We are using tofu if that matters.
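You shouldn't need one git path per module: the go-getter source syntax that both Terraform and OpenTofu use lets one repo hold many modules, addressed with a double-slash subdirectory plus a `ref`. A sketch, with hypothetical org/repo names:

```hcl
module "vpc" {
  source = "git::https://github.com/example-org/tofu-modules.git//modules/vpc?ref=v1.4.0"
  # ...inputs, with tag requirements and sane defaults baked into the module
}

module "tags" {
  source = "git::https://github.com/example-org/tofu-modules.git//modules/tags?ref=v1.4.0"
}
```

Pinning `ref` to a tag per consumer is what keeps a single shared repo safe to evolve.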

https://redd.it/1ooph4x
@r_devops
Need advice on deployment and dev ops

Built a simple wrapper around ChatGPT for an internal audit at my company, and now they want it deployed company-wide. I've never deployed something at a company; I never even knew what a Linux box was until my IT team asked if I would be able to manage it, which I obviously said yes to.

Looking for advice on how to best host and deploy because I’m going to have to be the one to manage it.

I have a Python app wrapped in FastAPI that sends PDFs to the OpenAI API for analysis and then returns the response in a basic Streamlit UI. 2,000-4,000 PDFs of 6-10 pages each need to be run through it monthly. What's the best way to get there? I've used Render, but only on the free plan to demo it; now I'm pretty lost.

Any help would be great! My outsourced IT team says the solution is a Linux box which will take 10-14 days to set up. Company is ~90mm ARR, 300 employees.

I have no formal SWE experience; I still have to ask the AI in Cursor to run the commands to push things to GitHub. Please explain like I have basic knowledge, and I will look up anything I don't know.
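One reassuring back-of-envelope before anyone picks hardware: even the top of that range is a small average load, so a single modest box is plausible if the PDFs are queued rather than processed all at once. Toy math; the 30-seconds-per-PDF figure is a guess you should replace with a measurement:

```python
pdfs_per_month = 4000        # top of the stated range
seconds_per_pdf = 30         # assumed OpenAI round-trip per 6-10 page PDF

busy_hours = pdfs_per_month * seconds_per_pdf / 3600
avg_per_hour = pdfs_per_month / (30 * 24)
print(f"{busy_hours:.1f} machine-hours/month, ~{avg_per_hour:.1f} PDFs/hour")
```

If uploads arrive in bursts, a simple job queue in front of the OpenAI calls smooths them out without needing more machines.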

https://redd.it/1oopug3
@r_devops
I wrote zigit, a tiny C program to download GitHub repos at lightning speed using aria2c

Hey everyone!
I recently made a small C tool called zigit — it’s basically a super lightweight alternative to git clone when you only care about downloading the latest source code and not the entire commit history.

zigit just grabs the ZIP directly from GitHub’s codeload endpoint using aria2c, which supports parallel and segmented downloads.
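For the curious, the codeload URL zigit fetches is easy to construct by hand. A small Python sketch showing the endpoint shape (the aria2c flags in the comment are just the standard parallel-download options):

```python
def codeload_zip_url(owner: str, repo: str, ref: str = "main") -> str:
    """URL of the ZIP snapshot GitHub serves for a branch."""
    return f"https://codeload.github.com/{owner}/{repo}/zip/refs/heads/{ref}"

url = codeload_zip_url("STRTSNM", "zigit")
print(url)
# aria2c can then split the download into parallel segments, e.g.:
#   aria2c -x 8 -s 8 <url>
```

Because the ZIP is a single snapshot with no history, this is why it beats `git clone` when you only want the latest source.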

Check it out at : https://github.com/STRTSNM/zigit/

https://redd.it/1oownb2
@r_devops