Parsing - Introduction by Examples
We introduce llmware through self-contained examples.
🚀 Parsing Examples 🚀
Parsing is the Humble Hero of Good RAG Pipelines
LLMWare supports parsing of a wide range of unstructured content types and treats parsing, text chunking, and indexing as the first step in the pipeline. As with any pipeline, careful attention to getting “great input” is usually the key to “great output.”
In this repository, we show several key features of parsing with llmware:
Parsing PDFs like a Pro
- Configuring text chunking and extraction parameters - PDF Configuration
- PDF Table extraction - PDF Table
- Fallback to OCR - PDF-by-OCR
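To make the idea of text chunking parameters concrete, here is a minimal sketch in plain Python. It is not llmware's implementation or API: the function `chunk_text` and its parameters `target_size` and `max_size` are hypothetical names chosen for illustration, and the sentence splitting is a simple regex heuristic.

```python
import re


def chunk_text(text, target_size=400, max_size=600):
    """Split text into chunks of roughly target_size characters.

    Hypothetical helper for illustration only (not part of llmware).
    Sentences are packed into a chunk until it reaches target_size,
    and no chunk is allowed to exceed max_size characters.
    """
    if target_size <= 0 or max_size < target_size:
        raise ValueError("need 0 < target_size <= max_size")

    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # A sentence longer than max_size cannot fit in any chunk,
        # so flush what we have and hard-split it.
        while len(sentence) > max_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_size])
            sentence = sentence[max_size:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_size and len(current) < target_size:
            # Keep growing the current chunk.
            current = candidate
        else:
            # Current chunk is full enough: close it and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

A smaller `target_size` yields more, shorter chunks, which tends to suit precise retrieval; a larger one keeps more context together in each chunk. The gap between `target_size` and `max_size` is the slack available for finishing a sentence before a chunk is cut.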
Parsing Office Documents (PowerPoint, Word, Excel)
- Configuring text chunking and extraction parameters - Office Configuration
- Handling ZIPs and mixed file types - Microsoft IR Documents
- Running OCR on Images Extracted - OCR Embedded Doc Images
Parsing without a Database
- Parse in Memory - Parse in Memory
- Parse directly into a Prompt - Parse in Prompt
- Parse to JSON file - Parse to JSON
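As a rough illustration of what parsing without a database can look like, the sketch below holds parsed chunks in memory as plain dictionaries and writes them to a JSON file. It uses only the Python standard library. The helper names and the record fields (`doc`, `block_id`, `text`) are assumptions made for this example and do not reflect llmware's actual output schema.

```python
import json
import tempfile
from pathlib import Path


def write_chunks_to_json(chunks, source_name, out_path):
    """Write one record per text chunk to a JSON file.

    Hypothetical helper: the field names are illustrative only.
    Returns the in-memory records so they can be used directly.
    """
    records = [
        {"doc": source_name, "block_id": i, "text": chunk}
        for i, chunk in enumerate(chunks)
    ]
    Path(out_path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records


def read_chunks_from_json(path):
    """Load the records back from the JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        out_file = Path(tmp_dir) / "parsed.json"
        records = write_chunks_to_json(
            ["First chunk of text.", "Second chunk of text."],
            source_name="example.pdf",
            out_path=out_file,
        )
        # The file round-trips to the same in-memory records.
        print(read_chunks_from_json(out_file) == records)
```

Keeping the records as ordinary lists and dictionaries means the same data can be passed straight to a prompt, inspected in a notebook, or saved to disk, with no database setup in between.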
Other Content Types
- Custom CSV - Custom CSV files
- Custom JSON - Custom JSON files
- Images - OCR on Images
- Web/HTML - Website Extraction
- Voice (WAV) - in Use_Cases - Parsing Great Speeches
For more examples, see the [parsing examples](https://www.github.com/llmware-ai/llmware/tree/main/examples/Parsing/) in the main repo.
Check back often - we are updating these examples regularly - and many of these examples have companion videos as well.