L̶u̵m̶i̵n̷o̴u̶s̶m̶e̵n̵B̶l̵o̵g̵ – Telegram
L̶u̵m̶i̵n̷o̴u̶s̶m̶e̵n̵B̶l̵o̵g̵
(ノ◕ヮ◕)ノ*:・゚✧ ✧゚・: *ヽ(◕ヮ◕ヽ)

helping robots conquer the earth and trying not to increase entropy using Python, Data Engineering and Machine Learning

http://luminousmen.com

License: CC BY-NC-ND 4.0
Since we all are going to be unemployed soon
😁63
Not sure what to make of this, but Googling HDFS now routes me directly to Harley-Davidson financing. Either Google's confused... or this is how the internet tells you you've reached the 'motorcycle loan' demographic
🦄51👀1
Come on, this is fucking ridiculous

"hey claude, create a datasheet where our model is leading on every benchmark (btw create a benchmark)"

🔗Link: https://www.anthropic.com/news/claude-opus-4-5
🔥4💯1
Most people treat BigQuery like a magic SQL endpoint.

You write a query, hit Run, wait a few seconds... and a petabyte-sized answer pops out.

If it's slow or expensive, the default reaction is: "I need more compute".

That's backwards.

BigQuery is designed to skip work, not to muscle through it:

https://luminousmen.com/post/bigquery-explained-what-really-happens-when-you-hit-run
🔥1
Security researchers at PromptArmor have discovered a critical vulnerability in Google Antigravity - Google's new AI-powered IDE that uses Gemini-based agents. Through an indirect prompt-injection attack, an outside actor can:

- Trick Gemini into reading sensitive local files (like .env files or API keys)
- Use the built-in agent browser to quietly exfiltrate that data through crafted URLs
- Bypass safeguards such as "secret filtering" or .gitignore protections by triggering shell commands like cat

Because Antigravity's agents are granted broad capabilities (access to code, a shell, and a browser), a single injected prompt hidden in a README or a code comment can silently leak data without any user action😦

If you're experimenting with Antigravity or any similar agent-driven development tools, keep the following in mind:

- Lock down access to secrets
- Audit what capabilities your agents actually have
- Treat AI agents like remote developers - don't give them any more power than you'd hand to a junior engineer with near-root access
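
To make the "lock down access to secrets" point a bit more concrete, here is a minimal sketch of a file-read guard you could put in front of an agent's tools. Everything in it is an assumption for illustration: the `SECRET_PATTERNS` denylist and the `is_allowed` helper are hypothetical names, not part of Antigravity or any real agent framework. Note that a guard like this only helps if every capability goes through it. The attack above got around filtering by shelling out to cat, so an unrestricted shell makes it useless.

```python
from fnmatch import fnmatch
from pathlib import Path

# Hypothetical denylist of file names an agent should never read.
SECRET_PATTERNS = (".env", ".env.*", "*.pem", "*.key", "id_rsa*", "credentials*")


def is_allowed(path: str, workspace: str) -> bool:
    """Return True only for non-secret files that stay inside the workspace."""
    root = Path(workspace).resolve()
    target = (root / path).resolve()
    # Reject paths that escape the workspace (for example via "..").
    if target != root and root not in target.parents:
        return False
    # Reject anything whose file name looks like a secret.
    return not any(fnmatch(target.name, pattern) for pattern in SECRET_PATTERNS)


print(is_allowed("src/main.py", "/tmp/ws"))    # True
print(is_allowed("config/.env", "/tmp/ws"))    # False
print(is_allowed("../etc/passwd", "/tmp/ws"))  # False
```

A denylist is the weakest form of this idea. An allowlist of directories the agent may touch is stricter, at the cost of more setup.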

🔗 Link: https://promptarmor.com/resources/google-antigravity-exfiltrates-data
👍2
ONLYFANS could be the most revenue-efficient company on the planet, beating Nvidia, Meta, Tesla, and Amazon - powered by ass, not AI.
😎9
Lowering the gates to the CUDA moat.
A NotebookLM-generated infographic following Google's new TPU announcement

🔗Link: https://www.linkedin.com/posts/semianalysis_notebooklm-recently-introduced-a-new-function-activity-7400973159853780992-PsXz
👍1
Throughout my career, I keep coming back to the same optimization in data pipelines:

Filter as early as possible.

Recently I cut a 3-hour job down to 30 minutes and dropped compute cost from $600 to $9 just by doing that.

If your analytics team needs sales from just three stores, don't build the full sales mart and filter later. That's waste.

Push the store filter upstream: before joins, before aggregations, as close to storage as you can. Join only on those store IDs from the start.

On most engines this means less data scanned, less shuffling, and better use of partition pruning / predicate pushdown. In practice you get:

- Less I/O
- Less memory pressure
- Faster, cheaper queries
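
A toy sketch of the same idea in plain Python, just to make the row counts visible. The data, the `wanted` set, and both function names are made up for illustration. A real engine does this through partition pruning and predicate pushdown rather than a Python loop, but the shape of the saving is the same: the late version joins every row, the early version joins only what it needs.

```python
from collections import defaultdict

# Hypothetical toy data: 100 stores with 50 sales rows each.
sales = [(store_id, 10.0) for store_id in range(1, 101) for _ in range(50)]
stores = {store_id: f"store-{store_id}" for store_id in range(1, 101)}
wanted = {3, 17, 42}  # the three stores analytics asked for


def totals_filter_late(sales, stores, wanted):
    """Join and aggregate everything, then filter the finished result."""
    joined_rows = 0
    totals = defaultdict(float)
    for store_id, amount in sales:
        totals[stores[store_id]] += amount  # "join" every single row
        joined_rows += 1
    result = {stores[s]: totals[stores[s]] for s in wanted}
    return result, joined_rows


def totals_filter_early(sales, stores, wanted):
    """Drop unwanted rows first, then join and aggregate the rest."""
    joined_rows = 0
    totals = defaultdict(float)
    for store_id, amount in sales:
        if store_id not in wanted:  # filter before the join
            continue
        totals[stores[store_id]] += amount
        joined_rows += 1
    return dict(totals), joined_rows


late, late_rows = totals_filter_late(sales, stores, wanted)
early, early_rows = totals_filter_early(sales, stores, wanted)
assert late == early          # same answer either way
print(late_rows, early_rows)  # 5000 150
```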

But here's the nuance: don't hardcode business logic upstream. Maintainability still matters.

Instead of sprinkling store_id IN (...) across jobs, drive those filters from config, parameters, or dimension tables (like an active_stores view). Same optimization, less brittleness.
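
Here is a small sketch of the dimension-table variant, using an in-memory SQLite database so it runs anywhere. The schema, the `is_active` flag, and the exact table and view names are assumptions for illustration, not taken from a real pipeline. SQLite won't show you the pruning or pushdown gains of a warehouse engine. The point is only where the filter lives: in data, not in the query text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (store_id INTEGER, amount REAL);
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, is_active INTEGER);
    -- Hypothetical view that owns the "which stores count" business rule.
    CREATE VIEW active_stores AS
        SELECT store_id FROM stores WHERE is_active = 1;
""")
conn.executemany("INSERT INTO stores VALUES (?, ?)", [(1, 1), (2, 0), (3, 1)])
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 99.0), (3, 7.5)],
)

# The job joins on the view instead of hardcoding store IDs,
# so sales are restricted to active stores before aggregation.
rows = conn.execute("""
    SELECT s.store_id, SUM(s.amount) AS total
    FROM sales AS s
    JOIN active_stores AS a ON a.store_id = s.store_id
    GROUP BY s.store_id
    ORDER BY s.store_id
""").fetchall()
print(rows)  # [(1, 15.0), (3, 7.5)]
```

Changing which stores are in scope is then a data change (flip `is_active`), not a code change in every job that filters on stores.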

Before you run your next pipeline, ask:

Can I reduce data volume earlier without introducing fragile business logic?
💯5👍1
AWS Lambda Managed Instances allows you to run Lambda functions on EC2 instances, preserving the familiar serverless model while gaining control over the hardware and EC2-based pricing. Wow, serverful computing

This is an attempt to cover use cases where Lambda is great from a development perspective but not cost- or hardware-efficient, without fully switching to ECS/EC2. In architectures with steady-state load or specific hardware requirements, this could be a game-changer, but you'll need to carefully profile multi-concurrency and realistically calculate the cost for your workload.

🔗 Link: https://aws.amazon.com/blogs/aws/introducing-aws-lambda-managed-instances-serverless-simplicity-with-ec2-flexibility/
👍3
Now we have a solution
🔥4😢1👀1
It was a long year and you still hold on to my writing?

Thank you - genuinely.

Now, since you've made it this far, I want to give you a gift.

You know, I'm a simple man - my favorite holiday is New Year, and if you check the calendar you can guess I'm a bit happier right now.

I've been writing for a long time without giving much back to you, fellow reader - I assume a data engineer, maybe a future colleague.

What I write is usually deeply technical stuff, occasional rants, sometimes practical tips, and sentimental career advice for fellow data engineers. If you like how that sounds and want access to the paid posts too, there's a 30% off yearly discount running right now: https://luminousmen.substack.com/129bfd67

I keep some work on the paid side to make it sustainable and to go deeper instead of chasing clicks. As I said before, gated knowledge is where we're heading - I'm just trying to keep the gate cheap and honest.

ho-ho-ho-ho 🎄
4🔥4👀1