Exploring: Retrieval-Augmented Generation (RAG) with open-source LLMs

For some time now, I’ve been experimenting with building a chatbot powered by Llama 3, LangChain, and vector databases (initially Qdrant, later Chroma).
Why RAG?
I wanted to test whether I could build a helpful assistant grounded in a specific knowledge base. In this case, content from Heni Ardiana’s beautiful travel website, Pesona Matahari 🌻
Here’s what I tried and learned:
✅ Indexing went smoothly using LangChain’s RecursiveCharacterTextSplitter combined with FastEmbedEmbeddings (sketch after this list).
📦 Data was loaded and chunked well, giving a solid starting point for semantic search.
🤖 I deployed the chatbot and integrated it into a Discord channel for real-world interaction.
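Here’s roughly what the indexing step looked like, as a minimal sketch. I’m assuming the langchain-community package layout; the loader, URL, and chunk sizes are illustrative placeholders, not the exact values I used:

```python
# Minimal indexing sketch: load pages, chunk them, embed with FastEmbed,
# and persist to a local Chroma store. URL and sizes are placeholders.
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load pages from the knowledge base (placeholder URL).
docs = WebBaseLoader("https://example.com/pesona-matahari").load()

# Chunk the documents for semantic search; sizes here are illustrative.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed the chunks and persist them in a local Chroma store.
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=FastEmbedEmbeddings(),
    persist_directory="./chroma_db",
)
```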
🧪 Infra Setup:
Hosted on Oracle Cloud (OCI) using an Ampere ARM instance (CPU-only)
Used Ollama to serve Llama 3 models locally
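Pointing LangChain at the local Ollama server is pretty minimal. A rough sketch, assuming Ollama is already running and the model has been pulled with `ollama pull llama3`:

```python
# Connect LangChain to a local Ollama server hosting Llama 3.
from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="llama3", temperature=0)

# One-off smoke test against the local model.
print(llm.invoke("Say hello in one sentence.").content)
```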
❌ What didn’t go so well:
Qdrant retrieval via the Python client would occasionally hang indefinitely, even though manual queries worked fine; debugging was inconclusive.
Switched to Chroma, which has performed much more reliably with LangChain (sketch below).
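For the retrieval side after the switch, here’s a minimal sketch of a Chroma-backed QA chain. The store path and the sample question are illustrative, and RetrievalQA is just one way to wire it up:

```python
# Retrieval + generation sketch: reopen the persisted Chroma store and
# run a classic retrieve-then-answer chain against the local model.
from langchain.chains import RetrievalQA
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_community.vectorstores import Chroma

# Reopen the store built during indexing (path from the earlier sketch).
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=FastEmbedEmbeddings(),
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

qa = RetrievalQA.from_chain_type(
    llm=ChatOllama(model="llama3"),
    retriever=retriever,
)
print(qa.invoke({"query": "What destinations does the site cover?"})["result"])
```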
📉 Evaluation:
Handles basic Q&A well
Struggles with nuanced queries: it sometimes misses key info or returns irrelevant chunks
🧭 What’s next?
Looking to explore MLflow for structured experiment tracking and improved iteration speed.
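As a rough sketch of what that tracking could look like (the parameter names and the metric here are hypothetical placeholders, not real results):

```python
# Hypothetical MLflow tracking sketch for RAG experiments: log the
# pipeline configuration per run so retrieval settings are comparable.
import mlflow

mlflow.set_experiment("rag-pesona-matahari")

with mlflow.start_run():
    mlflow.log_params({
        "chunk_size": 1000,
        "chunk_overlap": 100,
        "embedding_model": "fastembed-default",
        "llm": "llama3 (Ollama, CPU-only)",
        "vector_store": "chroma",
    })
    # Metrics would come from an eval set of question/answer pairs.
    mlflow.log_metric("retrieval_hit_rate", 0.0)  # placeholder value
```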
If you’re also building with open-source LLMs or RAG pipelines (especially on CPU-only infra!), let’s share learnings.
💬 Drop a comment or DM. Always open to connecting with fellow builders.