Exploring: Retrieval-Augmented Generation (RAG) with open-source LLMs

Some time ago, I experimented with building a chatbot powered by Llama 3, LangChain, and a vector database. I started with Qdrant and later switched to Chroma.

Why RAG?

I wanted to test whether I could build a helpful assistant on top of a specific knowledge base: in this case, content from Heni Ardiana’s beautiful travel website, Pesona Matahari 🌻

Here’s what I tried and learned:

✅ Indexing went smoothly using LangChain’s RecursiveCharacterTextSplitter combined with FastEmbedEmbeddings (see the sketch after this list).

📦 Data was loaded and chunked well, giving a solid starting point for semantic search.

🤖 I deployed the chatbot and integrated it into a Discord channel for real-world interaction.
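For reference, here’s a minimal sketch of that indexing step. It assumes the site content has already been scraped into local text files; the directory name, chunk size, and overlap are illustrative, not my exact values.

```python
# Minimal indexing sketch: load local text files, chunk them, embed on CPU,
# and persist to Chroma. Paths and chunk parameters are illustrative.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the scraped pages (hypothetical directory).
docs = DirectoryLoader(
    "data/pesona_matahari", glob="**/*.txt", loader_cls=TextLoader
).load()

# Chunk so each piece fits comfortably in the model's context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed with FastEmbed (runs fine on CPU) and persist the vectors in Chroma.
store = Chroma.from_documents(
    chunks,
    embedding=FastEmbedEmbeddings(),
    persist_directory="chroma_db",
)
```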

🧪 Infra Setup:

Hosted on Oracle Cloud Infrastructure (OCI) using an Ampere ARM instance (CPU-only)

Used Ollama to serve Llama 3 models locally
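Wiring LangChain to an Ollama-served model is pleasantly short. A minimal sketch, assuming `ollama pull llama3` has been run and the server is listening on its default port:

```python
# Query a locally served Llama 3 through Ollama's default endpoint.
from langchain_community.chat_models import ChatOllama

llm = ChatOllama(model="llama3", temperature=0)
reply = llm.invoke("Say hello in one sentence.")
print(reply.content)
```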

❌ What didn’t go so well:

Qdrant retrieval via the Python client would occasionally hang indefinitely, even though the same queries worked when run manually; debugging this was inconclusive.

Switched to Chroma, and it performed much more reliably with LangChain.
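With Chroma in place, the retrieval side reduces to the standard LangChain (LCEL) pattern. A sketch that reuses the persisted store from the indexing sketch above; the prompt wording, the sample question, and `k` are illustrative:

```python
# Retrieval-augmented answering over the persisted Chroma store.
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

store = Chroma(
    persist_directory="chroma_db",
    embedding_function=FastEmbedEmbeddings(),
)
retriever = store.as_retriever(search_kwargs={"k": 4})

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join retrieved chunks into one context string for the prompt.
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOllama(model="llama3")
    | StrOutputParser()
)

print(chain.invoke("What destinations does the site cover?"))
```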

📉 Evaluation:

Handles basic Q&A well

Struggles with nuanced queries; it sometimes misses key information or returns irrelevant chunks
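One way to debug those misses is to inspect what the store actually returns for a tricky query before blaming the model. A minimal sketch (the query is just an example; for Chroma the score is a distance, so lower means closer):

```python
# Inspect the raw retrieval results for a query.
from langchain_community.embeddings import FastEmbedEmbeddings
from langchain_community.vectorstores import Chroma

store = Chroma(
    persist_directory="chroma_db",
    embedding_function=FastEmbedEmbeddings(),
)
for doc, score in store.similarity_search_with_score("best time to visit?", k=4):
    # Lower score = closer match (Chroma returns a distance).
    print(f"{score:.3f}  {doc.page_content[:80]}")
```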

🧭 What’s next?

Looking to explore MLflow for structured experiment tracking and improved iteration speed.
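I haven’t wired this up yet, but a tracking run could look something like the sketch below; the parameter names and the metric value are hypothetical placeholders.

```python
# Hypothetical MLflow tracking for one RAG configuration.
import mlflow

with mlflow.start_run(run_name="rag-baseline"):
    # Log the knobs that most affect retrieval quality.
    mlflow.log_params({"chunk_size": 1000, "chunk_overlap": 100, "top_k": 4})
    # Placeholder value: e.g. fraction of test questions answered correctly.
    mlflow.log_metric("qa_accuracy", 0.0)
```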

If you’re also building with open-source LLMs or RAG pipelines (especially on CPU-only infra!), let’s share learnings.

💬 Drop a comment or DM. Always open to connect with fellow builders.
