koin.js: MIT Licensed WebAssembly Gaming Engine for Retro Games
Hey open source community!
I released koin.js under the MIT license - a comprehensive WebAssembly gaming solution:
What it provides:
• Cross-platform emulation using Emscripten-compiled Libretro cores
• React component API for easy web integration
• Performance optimizations including Run-Ahead input processing
• Modular architecture - use just the engine or full UI
• Achievement system integration with RetroAchievements
• Virtual controls with haptic feedback algorithms
Architecture:
• Built on Nostalgist.js with additional performance enhancements
• WebGL rendering with SharedArrayBuffer threading
• Zero runtime dependencies for core functionality
• Comprehensive TypeScript definitions
• Browser compatibility focused (Chrome, Firefox, Safari, Edge)
Perfect for: Game preservation, educational tools, indie development, web portfolios.
Contribute today:
npm install koin.js
Documentation: https://koin.js.org
Source code: https://github.com/muditjuneja/koin
Join the open-source gaming revolution - your contributions can make web gaming better for everyone!
https://redd.it/1pmmryc
@r_opensource
For Linux software maintainers: distropack now supports .tar archives in addition to .deb, .rpm, and .pkg
https://distropack.dev/Blog/Post?slug=introducing-tar-package-support-simple-distribution-without-repository-complexity
https://redd.it/1pmq23a
@r_opensource
Help improve Img2Num’s README! (Good First Issue)🦔
https://github.com/Ryan-Millard/Img2Num/issues/106
https://redd.it/1pmrd9e
@r_opensource
OpenMeters: audio visualization & metering for Linux
https://github.com/httpsworldview/openmeters
https://redd.it/1pmt5cc
@r_opensource
🌎 Trendgetter v2.0: An API for getting trending content from various platforms
https://github.com/Zivsteve/trendgetter
https://redd.it/1pmta0q
@r_opensource
Better issues -> more contributions
If you want more pull requests, start by writing better issues.
From my own experience, on both sides, most people do not avoid contributing because they are lazy. They avoid it because the cost of entry is unclear. You do not know how much context you need or whether you will spend a weekend only to be told that is not what was meant. Clear issues remove that fear and show respect for the contributor’s time.
The same applies to the codebase itself. If I can clone the repo, run it and understand the basic flow without reverse engineering everything, I am far more likely to help. Poor documentation does not just slow people down. It quietly filters contributors out.
Granularity matters too. Smaller, well scoped issues are simply less intimidating. That first small merge often turns into a second pull request, then a third. Large and fuzzy issues rarely get that first step.
None of this is meant to be flashy or inspirational. I just realized that after I changed my maintainer habits a bit and followed these guidelines, way more new contributors entered the repo, which is a great feeling :)
https://redd.it/1pmspbu
@r_opensource
Isitreallyfoss - Website that evaluates "foss" projects to see if they're as free and open source as advertised
https://isitreallyfoss.com/
https://redd.it/1pmzi4k
@r_opensource
Kreuzberg v4.0.0-rc.8 is available
Hi Peeps,
I'm excited to announce that Kreuzberg v4.0.0 is coming very soon. We will release v4.0.0 at the beginning of next year - in just a couple of weeks' time. For now, v4.0.0-rc.8 has been released to all channels.
## What is Kreuzberg?
Kreuzberg is a document intelligence toolkit for extracting text, metadata, tables, images, and structured data from 56+ file formats. It was originally written in Python (v1-v3), where it demonstrated strong performance characteristics compared to alternatives in the ecosystem.
## What's new in V4?
### A Complete Rust Rewrite with Polyglot Bindings
The new version of Kreuzberg represents a massive architectural evolution. Kreuzberg has been completely rewritten in Rust - leveraging Rust's memory safety, zero-cost abstractions, and native performance. The new architecture consists of a high-performance Rust core with native bindings to multiple languages. That's right - it's no longer just a Python library.
Kreuzberg v4 is now available for 7 languages across 8 runtime bindings:
- Rust (native library)
- Python (PyO3 native bindings)
- TypeScript - Node.js (NAPI-RS native bindings) + Deno/Browser/Edge (WASM)
- Ruby (Magnus FFI)
- Java 25+ (Panama Foreign Function & Memory API)
- C# (P/Invoke)
- Go (cgo bindings)
Post v4.0.0 roadmap includes:
- PHP
- Elixir (via Rustler - with Erlang and Gleam interop)
Additionally, it's available as a CLI (installable via cargo or homebrew), as an HTTP REST API server, as a Model Context Protocol (MCP) server for Claude Desktop/Continue.dev, and as public Docker images.
### Why the Rust Rewrite? Performance and Architecture
The Rust rewrite wasn't just about performance - though that's a major benefit. It was an opportunity to fundamentally rethink the architecture:
Architectural improvements:
- Zero-copy operations via Rust's ownership model
- True async concurrency with Tokio runtime (no GIL limitations)
- Streaming parsers for constant memory usage on multi-GB files
- SIMD-accelerated text processing for token reduction and string operations
- Memory-safe FFI boundaries for all language bindings
- Plugin system with trait-based extensibility
### v3 vs v4: What Changed?
| Aspect | v3 (Python) | v4 (Rust Core) |
|--------|-------------|----------------|
| Core Language | Pure Python | Rust 2024 edition |
| File Formats | 30-40+ (via Pandoc) | 56+ (native parsers) |
| Language Support | Python only | 7 languages (Rust/Python/TS/Ruby/Java/Go/C#) |
| Dependencies | Requires Pandoc (system binary) | Zero system dependencies (all native) |
| Embeddings | Not supported | ✓ FastEmbed with ONNX (3 presets + custom) |
| Semantic Chunking | Via semantic-text-splitter library | ✓ Built-in (text + markdown-aware) |
| Token Reduction | Built-in (TF-IDF based) | ✓ Enhanced with 3 modes |
| Language Detection | Optional (fast-langdetect) | ✓ Built-in (68 languages) |
| Keyword Extraction | Optional (KeyBERT) | ✓ Built-in (YAKE + RAKE algorithms) |
| OCR Backends | Tesseract/EasyOCR/PaddleOCR | Same + better integration |
| Plugin System | Limited extractor registry | Full trait-based (4 plugin types) |
| Page Tracking | Character-based indices | Byte-based with O(1) lookup |
| Servers | REST API (Litestar) | HTTP (Axum) + MCP + MCP-SSE |
| Installation Size | ~100MB base | 16-31 MB complete |
| Memory Model | Python heap management | RAII with streaming |
| Concurrency | asyncio (GIL-limited) | Tokio work-stealing |
### Replacement of Pandoc - Native Performance
Kreuzberg v3 relied on Pandoc - an amazing tool, but one that had to be invoked via subprocess because of its GPL license. This had significant impacts:
v3 Pandoc limitations:
- System dependency (installation required)
- Subprocess overhead on every document
- No streaming support
- Limited metadata extraction
- ~500MB+ installation footprint
v4 native parsers:
- Zero external dependencies - everything is native Rust
- Direct parsing with full control over extraction
- Substantially more metadata extracted (e.g., DOCX document properties, section structure, style information)
- Streaming support for massive files (tested on multi-GB XML documents with stable memory)
- Example: PPTX extractor is now a fully streaming parser capable of handling gigabyte-scale presentations with constant memory usage and high throughput
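To make the streaming claim concrete, here is the idea in miniature using Python's standard library rather than Kreuzberg's Rust core (a sketch of the technique, not the actual implementation): process each element as it completes and free it immediately, so memory stays flat regardless of file size.

```python
# Concept sketch of streaming XML text extraction (stdlib only):
# iterparse yields elements as they finish parsing, and clearing each
# one keeps memory roughly constant even on multi-GB documents.
import xml.etree.ElementTree as ET

def stream_text(path: str):
    for _, elem in ET.iterparse(path, events=("end",)):
        if elem.text and elem.text.strip():
            yield elem.text.strip()
        elem.clear()  # release the subtree we just processed
```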
### New File Format Support
v4 expanded format support from ~20 to 56+ file formats, including:
Added legacy format support:
- .doc (Word 97-2003)
- .ppt (PowerPoint 97-2003)
- .xls (Excel 97-2003)
- .eml (email messages)
- .msg (Outlook messages)
Added academic/technical formats:
- LaTeX (.tex)
- BibTeX (.bib)
- Typst (.typ)
- JATS XML (scientific articles)
- DocBook XML
- FictionBook (.fb2)
- OPML (.opml)
Better Office support:
- XLSB, XLSM (Excel binary/macro formats)
- Better structured metadata extraction from DOCX/PPTX/XLSX
- Full table extraction from presentations
- Image extraction with deduplication
### New Features: Full Document Intelligence Solution
The v4 rewrite was also an opportunity to close gaps with commercial alternatives and add features specifically designed for RAG applications and LLM workflows:
#### 1. Embeddings (NEW)
- FastEmbed integration with full ONNX Runtime acceleration
- Three presets: "fast" (384d), "balanced" (512d), "quality" (768d/1024d)
- Custom model support (bring your own ONNX model)
- Local generation (no API calls, no rate limits)
- Automatic model downloading and caching
- Per-chunk embedding generation
import kreuzberg
from kreuzberg import ExtractionConfig, EmbeddingConfig, EmbeddingModelType

config = ExtractionConfig(
    embeddings=EmbeddingConfig(
        model=EmbeddingModelType.preset("balanced"),
        normalize=True
    )
)

pdf_bytes = open("document.pdf", "rb").read()  # assumed input; the original snippet left pdf_bytes undefined
result = kreuzberg.extract_bytes(pdf_bytes, config=config)
# result.embeddings contains vectors for each chunk
#### 2. Semantic Text Chunking (NOW BUILT-IN)
Now integrated directly into the core (v3 used external semantic-text-splitter library):
- Structure-aware chunking that respects document semantics
- Two strategies:
- Generic text chunker (whitespace/punctuation-aware)
- Markdown chunker (preserves headings, lists, code blocks, tables)
- Configurable chunk size and overlap
- Unicode-safe (handles CJK, emojis correctly)
- Automatic chunk-to-page mapping
- Per-chunk metadata with byte offsets
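For intuition about the generic strategy above, here is a minimal whitespace-aware chunker with size and overlap knobs. This is a toy sketch, not Kreuzberg's built-in chunker, which is also structure- and Unicode-aware.

```python
# Toy chunker: fixed max size, configurable overlap, prefers to break
# on whitespace rather than mid-word (illustration only).
def chunk_text(text: str, max_chars: int = 512, overlap: int = 64) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            ws = text.rfind(" ", start, end)
            if ws > start:
                end = ws  # break on the last whitespace in the window
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = max(end - overlap, start + 1)  # advance, keeping overlap
    return chunks
```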
#### 3. Byte-Accurate Page Tracking (BREAKING CHANGE)
This is a critical improvement for LLM applications:
- v3: Character-based indices (char_start/char_end) - incorrect for UTF-8 multi-byte characters
- v4: Byte-based indices (byte_start/byte_end) - correct for all string operations
Additional page features:
- O(1) lookup: "which page is byte offset X on?" → instant answer
- Per-page content extraction
- Page markers in combined text (e.g., --- Page 5 ---)
- Automatic chunk-to-page mapping for citations
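The byte-versus-character distinction and the page lookup are easy to demonstrate. The sketch below uses binary search over an assumed table of page start offsets; it illustrates the idea only, since Kreuzberg's own index is described as O(1).

```python
import bisect

# Why bytes, not characters: UTF-8 multi-byte characters make the two
# index spaces diverge, so char offsets are wrong for byte slicing.
text = "naïve café 日本語"
print(len(text), len(text.encode("utf-8")))  # char count != byte count

# Hypothetical byte offsets where each page starts:
page_byte_starts = [0, 4096, 9120, 15000]

def page_of(byte_offset: int) -> int:
    """Return the 1-based page number containing byte_offset."""
    return bisect.bisect_right(page_byte_starts, byte_offset)

print(page_of(10000))  # -> 3
```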
#### 4. Enhanced Token Reduction for LLM Context
Enhanced from v3 with three configurable modes to save on LLM costs:
- Light mode: ~15% reduction (preserve most detail)
- Moderate mode: ~30% reduction (balanced)
- Aggressive mode: ~50% reduction (key information only)
Uses TF-IDF sentence scoring with position-aware weighting and language-specific stopword filtering. SIMD-accelerated for improved performance over v3.
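As a rough picture of what TF-IDF sentence scoring means here, the toy reducer below keeps the highest-scoring sentences in their original order. It omits the position-aware weighting, stopword filtering, and SIMD of the real implementation.

```python
# Toy extractive token reduction via TF-IDF sentence scoring
# (illustration only, not Kreuzberg's implementation).
import math
import re
from collections import Counter

def reduce_text(text: str, keep_ratio: float = 0.7) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    docs = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    df = Counter(w for d in docs for w in d)  # sentence frequency per word
    n = len(sentences)

    def score(d: Counter) -> float:
        return sum(tf * math.log(n / df[w]) for w, tf in d.items())

    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    keep = sorted(ranked[: max(1, int(n * keep_ratio))])  # original order
    return " ".join(sentences[i] for i in keep)
```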
#### 5. Language Detection (NOW BUILT-IN)
- 68 language support with confidence scoring
- Multi-language detection (documents with mixed languages)
- ISO 639-1 and ISO 639-3 code support
- Configurable confidence thresholds
#### 6. Keyword Extraction (NOW BUILT-IN)
Now built into core (previously optional KeyBERT in v3):
- YAKE (Yet Another Keyword Extractor): Unsupervised, language-independent
- RAKE (Rapid Automatic Keyword Extraction): Fast statistical method
- Configurable n-grams (1-3 word phrases)
- Relevance scoring with language-specific stopwords
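For intuition, RAKE fits in a few lines: candidate phrases are the word runs between stopwords, each word scores degree/frequency, and a phrase scores the sum of its words. The sketch below uses a stub stopword list and illustrates the algorithm, not Kreuzberg's implementation.

```python
# Minimal RAKE-style keyword extraction (toy stopword list).
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "the", "of", "for", "is", "to", "in", "with"}

def rake(text: str, top_k: int = 5) -> list[tuple[str, float]]:
    words = re.findall(r"[a-z0-9]+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)
    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)  # word degree: co-occurrence within phrase
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = ((" ".join(p), sum(word_score[w] for w in p)) for p in phrases)
    return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]
```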
#### 7. Plugin System (NEW)
Four extensible plugin types for customization:
- DocumentExtractor - Custom file format handlers
- OcrBackend - Custom OCR engines (integrate your own Python models)
- PostProcessor - Data transformation and enrichment
- Validator - Pre-extraction validation
Plugins defined in Rust work across all language bindings. Python/TypeScript can define custom plugins with thread-safe callbacks into the Rust core.
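As an illustration of what a PostProcessor might look like, here is a hypothetical sketch: the class shape, the process method, the result.text field, and the registration hook are all assumptions made for illustration, not the published Kreuzberg plugin API.

```python
# Hypothetical PostProcessor sketch - every name below (process,
# result.text, register_post_processor) is an assumed stand-in, not
# the real Kreuzberg API.
import re

class RedactEmails:
    """Post-processing step that scrubs email addresses from text."""
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

    def process(self, result):
        result.text = self.EMAIL.sub("[redacted]", result.text)
        return result

# kreuzberg.register_post_processor(RedactEmails())  # assumed hook
```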
#### 8. Production-Ready Servers (NEW)
- HTTP REST API: Production-grade Axum server with OpenAPI docs
- MCP Server: Direct integration with Claude Desktop, Continue.dev, and other MCP clients
- MCP-SSE Transport (RC.8): Server-Sent Events for cloud deployments without WebSocket support
- All three modes support the same feature set: extraction, batch processing, caching
## Performance: Benchmarked Against the Competition
We maintain continuous benchmarks comparing Kreuzberg against the leading OSS alternatives:
### Benchmark Setup
- Platform: Ubuntu 22.04 (GitHub Actions)
- Test Suite: 30+ documents covering all formats
- Metrics: Latency (p50, p95), throughput (MB/s), memory usage, success rate
- Competitors: Apache Tika, Docling, Unstructured, MarkItDown
### How Kreuzberg Compares
Installation Size (critical for containers/serverless):
- Kreuzberg: 16-31 MB complete (CLI: 16 MB, Python wheel: 22 MB, Java JAR: 31 MB - all features included)
- MarkItDown: ~251 MB installed (58.3 KB wheel, 25 dependencies)
- Unstructured: ~146 MB minimal (open source base) - several GB with ML models
- Docling: ~1 GB base, 9.74GB Docker image (includes PyTorch CUDA)
- Apache Tika: ~55 MB (tika-app JAR) + dependencies
- GROBID: 500MB (CRF-only) to 8GB (full deep learning)
Performance Characteristics:
| Library | Speed | Accuracy | Formats | Installation | Use Case |
|---------|-------|----------|---------|--------------|----------|
| Kreuzberg | ⚡ Fast (Rust-native) | Excellent | 56+ | 16-31 MB | General-purpose, production-ready |
| Docling | ⚡ Fast (3.1s/pg x86, 1.27s/pg ARM) | Best | 7+ | 1-9.74 GB | Complex documents, when accuracy > size |
| GROBID | ⚡⚡ Very Fast (10.6 PDF/s) | Best | PDF only | 0.5-8 GB | Academic/scientific papers only |
| Unstructured | ⚡ Moderate | Good | 25-65+ | 146 MB-several GB | Python-native LLM pipelines |
| MarkItDown | ⚡ Fast (small files) | Good | 11+ | ~251 MB | Lightweight Markdown conversion |
| Apache Tika | ⚡ Moderate | Excellent | 1000+ | ~55 MB | Enterprise, broadest format support |
Kreuzberg's sweet spot:
- Smallest full-featured installation: 16-31 MB complete (vs 146 MB-9.74 GB for competitors)
- 5-15x smaller than Unstructured/MarkItDown, 30-300x smaller than Docling/GROBID
- Rust-native performance without ML model overhead
- Broad format support (56+ formats) with native parsers
- Multi-language support unique in the space (7 languages vs Python-only for most)
- Production-ready with general-purpose design (vs specialized tools like GROBID)
## Is Kreuzberg a SaaS Product?
No. Kreuzberg is and will remain MIT-licensed open source.
However, we are building Kreuzberg.cloud - a commercial SaaS and self-hosted document intelligence solution built on top of Kreuzberg. This follows the proven open-core model: the library stays free and open, while we offer a cloud service for teams that want managed infrastructure, APIs, and enterprise features.
Will Kreuzberg become commercially licensed? Absolutely not. There is no BSL (Business Source License) in Kreuzberg's future. The library was MIT-licensed and will remain MIT-licensed. We're building the commercial offering as a separate product around the core library, not by restricting the library itself.
## Target Audience
Any developer or data scientist who needs:
- Document text extraction (PDF, Office, images, email, archives, etc.)
- OCR (Tesseract, EasyOCR, PaddleOCR)
- Metadata extraction (authors, dates, properties, EXIF)
- Table and image extraction
- Document pre-processing for RAG pipelines
- Text chunking with embeddings
- Token reduction for LLM context windows
- Multi-language document intelligence in production systems
Ideal for:
- RAG application developers
- Data engineers building document pipelines
- ML engineers preprocessing training data
- Enterprise developers handling document workflows
- DevOps teams needing lightweight, performant extraction in containers/serverless
## Comparison with Alternatives
### Open Source Python Libraries
Unstructured.io
- Strengths: Established, modular, broad format support (25+ open source, 65+ enterprise), LLM-focused, good Python ecosystem integration
- Trade-offs: Python GIL performance constraints, 146 MB minimal installation (several GB with ML models)
- License: Apache-2.0
- When to choose: Python-only projects where ecosystem fit > performance
MarkItDown (Microsoft)
- Strengths: Fast for small files, Markdown-optimized, simple API
- Trade-offs: Limited format support (11 formats), less structured metadata, ~251 MB installed (despite small wheel), requires OpenAI API for images
- License: MIT
- When to choose: Markdown-only conversion, LLM consumption
Docling (IBM)
- Strengths: Excellent accuracy on complex documents (97.9% cell-level accuracy on tested sustainability report tables), state-of-the-art AI models for technical documents
- Trade-offs: Massive installation (1-9.74 GB), high memory usage, GPU-optimized (underutilized on CPU)
- License: MIT
- When to choose: Accuracy on complex documents > deployment size/speed, have GPU infrastructure
### Open Source Java/Academic Tools
Apache Tika
- Strengths: Mature, stable, broadest format support (1000+ types), proven at scale, Apache Foundation backing
- Trade-offs: Java/JVM required, slower on large files, older architecture, complex dependency management
- License: Apache-2.0
- When to choose: Enterprise environments with JVM infrastructure, need for maximum format coverage
GROBID
- Strengths: Best-in-class for academic papers (F1 0.87-0.90), extremely fast (10.6 PDF/sec sustained), proven at scale (34M+ documents at CORE)
- Trade-offs: Academic papers only, large installation (500MB-8GB), complex Java+Python setup
- License: Apache-2.0
- When to choose: Scientific/academic document processing exclusively
### Commercial APIs
There are numerous commercial options from startups (LlamaIndex, Unstructured.io paid tiers) to big cloud providers (AWS Textract, Azure Form Recognizer, Google Document AI). These are not OSS but offer managed infrastructure.
Kreuzberg's position: As an open-source library, Kreuzberg provides a self-hosted alternative with no per-document API costs, making it suitable for high-volume workloads where cost efficiency matters.
## Community & Resources
- GitHub: Star us at https://github.com/kreuzberg-dev/kreuzberg
- Discord: Join our community server at discord.gg/pXxagNK2zN
- Subreddit: Join the discussion at r/kreuzberg_dev
- Documentation: kreuzberg.dev
We'd love to hear your feedback, use cases, and contributions!
---
TL;DR: Kreuzberg v4 is a complete Rust rewrite of a document intelligence library, offering native bindings for 7 languages (8 runtime targets), 56+ file formats, Rust-native performance, embeddings, semantic chunking, and production-ready servers - all in a 16-31 MB complete package (5-15x smaller than alternatives). Releasing January 2025. MIT licensed forever.
https://redd.it/1pn2g9r
@r_opensource
Searching for open-source four-wheeled autonomous cargo bike components and resources
I want to try to develop, use, or improve a narrow, four-wheeled, self-driving, electric cargo bike with a rear transport box. The bike should have a width of about 1 meter and a maximum speed of 20 km/h. The goal is a fully open-source setup with permissive licenses like Apache or MIT (and not licenses like AGPL or GPL). I want to know if there are existing hardware components, software stacks, or even complete products that could be reused or adapted, and whether there are ways to minimize reinventing the wheel, including simulation models, control systems, and perception modules suitable for a compact autonomous delivery vehicle.
https://redd.it/1pn2a7v
@r_opensource
Open source AIs?
Best safe AIs to generate text, code or pictures?
https://redd.it/1pn5usy
@r_opensource
Could you guys recommend an open source To Do List product that can be downloaded on a cell phone?
I'm looking for a productivity app for “planning upcoming daily activities.”
Requirements: Notifications appear without delay, data is stored locally, the interface is user-friendly, and user experience is smooth.
https://redd.it/1pn7qeo
@r_opensource
Solo maintainer suddenly drowning in PRs/issues (I need advice/help😔)
I’m looking for advice from people who’ve been in this situation before.
I maintain an open-source project that’s started getting a solid amount of traction. That’s great, but it also means a steady stream of pull requests (8 in the last 2 days), issues, questions, and review work. Until recently, my brother helped co-maintain it, but he’s now working full-time and running a side hustle, so open source time is basically gone for him. That leaves me solo.
I want community contributions, but I'm struggling with reviewing PRs fast enough, keeping issues moving without burning out, and deciding who (if anyone) to trust with extra permissions - I don't want to hand repo access to a random person I barely know.
I’m especially nervous about the “just add more maintainers” advice. Once permissions are granted, it’s not trivial (socially or practically) to walk that back if things go wrong.
So I’d really appreciate hearing:
How do you triage PRs/issues when volume increases?
What permissions do you give first (triage, review, write)?
How do you evaluate someone before trusting them?
Any rules, automation, or workflows that saved your sanity?
Or did you decide to stay solo and just slow things down?
I’m not looking for a silver bullet, just real-world strategies that actually worked for you.
Thanks for reading this far, most people just ghost these.❤️
https://redd.it/1pn9qpl
@r_opensource
How to start contributing
Hello folks, I am a CS student and a security researcher in my free time. I have been working with JavaScript technologies for 5 years, but I want to upgrade my skills beyond creating simple projects, so I thought it would be nice to contribute to cool OSS projects where I can learn other people's coding patterns and pick up new technologies.
So how do I start? I do not have a lot of time, so perhaps I should look for a small project...
I read that the way to do it is to go to an OSS project, read an issue, create a fork, and solve that issue?
I also think it would be nice to add OSS projects I have collaborated on to my dev portfolio?
Cheers
https://redd.it/1pn9qdl
@r_opensource
dodo: A fast and unintrusive PDF reader
Hello everyone, just wanted to share my side-project, dodo, a PDF reader I have been working on for a couple of months now. I was an Okular user until I wanted a few features of my own, and I just thought I'd write my own reader. One feature that I really love is sessions: you can open up a bunch of PDFs, save the session, and then load those PDFs again at a later point in time.
It's using MuPDF as the PDF library with Qt6 for the GUI. I daily-drive it personally and it's been great. I would appreciate feedback if anyone decides to use it.
Github: https://www.github.com/dheerajshenoy/dodo
https://redd.it/1pngal7
@r_opensource
Deadlight: A lightweight, open-source blog framework for Cloudflare Workers – now one-command install via npm
Howdy all,
I just put together a simple blog platform called Deadlight that runs on Cloudflare Workers. It's designed for really poor internet connections: pages are under 10 KB, it works in text browsers like Lynx, and you can post new entries via email. The idea came from wanting something lightweight and resilient that doesn't rely on heavy frameworks or constant high-speed access.
Why I think it's useful: If you're in a spotty network area or just prefer minimal setups, it deploys quickly and is censorship-resistant since it's global via Cloudflare. Plus, it's fully open source and you own it—no vendor lock-in. There's an "eject" option to grab your data and run it locally on something like a Raspberry Pi if you want.
To try it out yourself: just run npx create-deadlight-blog your-blog-name in your terminal (replace with whatever name you want). It sets everything up in a couple minutes, including a D1 database and admin creds.
Repo: https://github.com/gnarzilla/blog.deadlight
More details on the install: https://deadlight.boo/post/one-click-install
Live Demos:
deadlight.boo
Meshtastic-Deadlight
thatch pad
Feedback welcome, let me know what you think or if you run into issues.
https://redd.it/1pngi7r
@r_opensource
Anybody in the Fediverse looking for an open source junior dev role?
I just happened to see an ad.
Not sure if it's fedi-related.
https://redd.it/1pnhgjj
@r_opensource
GhostStream — GPU transcoding server (HLS/ABR) now integrated with GhostHub
https://github.com/BleedingXiko/GhostStream
https://redd.it/1pnm61f
@r_opensource
Check out Quantica 0.2.0 With AI/ML Capabilities
https://github.com/Quantica-Foundation/quantica-lang
https://redd.it/1pnp9h8
@r_opensource
https://github.com/Quantica-Foundation/quantica-lang
https://redd.it/1pnp9h8
@r_opensource
GitHub
GitHub - Quantica-Foundation/quantica-lang: Quantica is a fast, modern language designed for high-performance computing, AI, and…
Quantica is a fast, modern language designed for high-performance computing, AI, and quantum-inspired algorithms. It offers clean syntax, strong typing, an efficient interpreter, optional LLVM comp...