I Built a RAG Chatbot for Terminal Operations — Here's Everything I Learned
TL;DR
I built a domain-specific Q&A system for container terminal operations using Ollama, LangChain, ChromaDB, and Streamlit — with zero API costs. The system retrieves answers from internal documents with exact citations, avoiding hallucinations through separate vector collections per file, dynamic prompt templates, and strict scope enforcement at upload time. Key learnings: migrate to LangChain LCEL immediately, design your collection architecture upfront, write prompt templates before retrieval code, and enforce scope at boundaries, not in prompts.
The Problem
Imagine this: an operations team has hundreds of pages of terminal procedures, HSSE guidelines, EDI specifications, and cargo rules. Someone needs to know the exact yard opening time before a vessel arrives. Or the precise EDI message code for a container discharge order. Or the emergency contact number for a hazardous cargo incident.
Right now, they open a PDF, search manually, and hope they find the right section.
I built a chatbot that answers those questions in seconds — with exact citations.
This is the story of how I built it, what went wrong, and what I'd do differently.
What is RAG and Why Does It Matter?
Before I get into the build, let me quickly explain Retrieval-Augmented Generation (RAG) — because it's the concept the entire system is built on.
A standard LLM like ChatGPT answers from its training data. It doesn't know about your internal documents. Ask it about your company's specific procedures and it will either guess, make something up, or tell you it doesn't know.
RAG changes this. Instead of asking the model to remember, you ask it to read and answer.
Here's how it works:
- Your documents are split into small chunks and converted into numeric vectors (embeddings)
- These vectors are stored in a vector database
- When a user asks a question, the system retrieves the most relevant chunks from the database
- Those chunks are given to the LLM as context — "here, read this and answer"
- The LLM generates an answer grounded entirely in the retrieved text
The result: precise, cited, domain-specific answers, with far less room for hallucination because the answer has to come from the documents you upload rather than from whatever the model half-remembers from training.
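To make that concrete, here is a minimal sketch of those five steps using the same open-source stack the rest of this post builds on. The file name and the question are placeholders, not the real app's code.

# Minimal RAG loop: load, split, embed, retrieve, generate (illustrative only)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma

docs = PyPDFLoader("manual.pdf").load()                                # 1. read the document
chunks = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=200).split_documents(docs)           # 2. split into chunks
store = Chroma.from_documents(
    chunks, OllamaEmbeddings(model="nomic-embed-text"))                # 3. embed + store vectors
hits = store.similarity_search("When does the yard open?", k=5)        # 4. retrieve relevant chunks
context = "\n\n".join(d.page_content for d in hits)
llm = ChatOllama(model="llama3.2", temperature=0.2)
answer = llm.invoke(
    f"Answer only from this context:\n{context}\n\n"
    f"Question: When does the yard open?")                             # 5. grounded answer
print(answer.content)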
The Problem I Was Solving
I was building this for vessel and container terminal operations at APM Terminals Maasvlakte II (MVII) — a major container terminal in Rotterdam.
The team works with two key documents constantly:
- The MVII Operational Manual — 33 pages covering vessel procedures, container handling, yard opening rules, HSSE, EDI codes, berth schedules, and contacts
- IATA Cargo Interchange Message Procedures — detailed EDI message format specifications
The challenge: these documents are dense, technical, and full of exact values that matter — specific deadlines, message codes, emergency contacts, regulatory procedures. Getting any of those wrong is not just inconvenient; it can be a compliance or safety issue.
A general-purpose chatbot would be dangerous here. I needed a system that:
- Answers only from the approved documents
- Cites exact sections, pages, and deadlines
- Refuses to answer from outside knowledge
- Warns on safety-critical topics
That's exactly what I built.
The Tech Stack
I kept the stack lean and fully open-source for the POC:
| Component | Technology |
|---|---|
| Language Model | Ollama + llama3.2 (local) |
| Embeddings | nomic-embed-text (local) |
| Vector Store | ChromaDB |
| Orchestration | LangChain LCEL |
| UI | Streamlit |
| Platform | Lightning.ai |
| OCR | Tesseract + Pillow |
The key design decision: everything runs locally. No OpenAI API key, no cloud inference costs during the POC phase. Ollama handles both the LLM and the embedding model on the same machine.
This is a deliberate tradeoff — local models are slower and less capable than paid APIs, but they're free to run and keep your documents private. More on the upgrade path later.
Architecture: The Pipeline
Here's the full data flow, from document upload to answer:
PDF Upload
↓
Document Loader (PyPDF · DOCX · CSV · OCR for images)
↓
Text Splitter (800 chars · 200 overlap)
↓
Embeddings (nomic-embed-text)
↓
ChromaDB (separate collection per file)
↑
MMR Retriever (fetch_k=20, k=5) ← User Question
↓
Dynamic Prompt Template (domain-specific rules per file)
↓
LLM — Ollama llama3.2 (temperature=0.2)
↓
Answer with Citations → Streamlit Chat UI
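Before walking through the design decisions, here is a hedged sketch of how the retrieval half of that pipeline can be wired up in LangChain. The collection name and persist directory mirror the layout described below; the exact helper functions in my app differ.

# Sketch of the retrieval side of the pipeline (names are illustrative)
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text")
store = Chroma(
    collection_name="col_operational_manual_vessel_and",   # one collection per file
    persist_directory="chroma_db",
    embedding_function=embeddings,
)
# MMR balances relevance and diversity: fetch 20 candidates, keep the best 5
retriever = store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20},
)
llm = ChatOllama(model="llama3.2", temperature=0.2)  # low temperature for factual answers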
The Design Decision That Mattered Most: Separate Collections
Most RAG tutorials dump everything into a single vector collection. I didn't.
Each file gets its own isolated ChromaDB collection.
chroma_db/
├── col_operational_manual_vessel_and/ ← MVII Operational Manual
└── col_IATA_Cargo_Interchange/ ← IATA Cargo Procedures
Why does this matter?
- Adding a new document never risks contaminating retrieval from existing ones
- Deleting one file only removes its own collection
- Users can toggle which documents are active — and only those are queried
- When multiple files are selected, a MergerRetriever combines results at query time (sketched below)
This pattern scales cleanly. Add 10 documents — you get 10 collections. Each one stays independent.
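Here is roughly what the multi-document case looks like. LangChain's MergerRetriever simply concatenates results from each per-file retriever at query time; the collection names and list handling below are a sketch, not the app's exact code.

# Sketch: one retriever per active file, merged at query time
from langchain.retrievers import MergerRetriever
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")
selected = ["col_operational_manual_vessel_and", "col_IATA_Cargo_Interchange"]  # user-toggled files

retrievers = [
    Chroma(collection_name=name, persist_directory="chroma_db",
           embedding_function=embeddings)
        .as_retriever(search_type="mmr", search_kwargs={"k": 5, "fetch_k": 20})
    for name in selected
]
retriever = retrievers[0] if len(retrievers) == 1 else MergerRetriever(retrievers=retrievers)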
The Feature I'm Most Proud Of: Dynamic Prompt Templates
This is the part most RAG tutorials skip entirely — and it makes a massive difference in answer quality.
Every document has its own prompt template registered in a TEMPLATES dictionary. The template defines:
- What role the assistant plays for that document
- How it should cite information (section number? page? appendix?)
- What safety warnings to append for critical topics
- What to say when the answer isn't found
- Domain-specific extras (EDI codes, emergency contacts, etc.)
For the MVII Operational Manual, the prompt tells the model:
"Always cite the chapter, section number, and page number. Quote exact deadlines from Appendix II. For contacts, provide exact email and phone from Appendix I. EDI codes must be reproduced exactly. For yard opening time, always state the 5-day rule and 24-hour closure before ETA. For HSSE, IMO 1, IMO 7, or confined space procedures, append: 'Verify with the HSSE department. Emergency: +31 (0)6 83076494.'"
When a user queries multiple documents simultaneously, the templates are merged — with per-document citation and safety rules applied separately for each source.
The result: answers that feel like they came from a domain expert who has read every page, not a generic AI that happened to find a relevant paragraph.
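For illustration, here is a heavily trimmed version of what an entry in that TEMPLATES registry can look like. The real templates are much longer, and the exact wording below is an assumption, not a copy of the production prompts.

# Illustrative shape of the per-document prompt registry (heavily trimmed)
from langchain_core.prompts import PromptTemplate

TEMPLATES = {
    "operational-manual-vessel-and-container-operators-v-42.pdf": PromptTemplate.from_template(
        "You are an assistant for MVII vessel & container terminal operations.\n"
        "Answer ONLY from the context. Always cite chapter, section and page.\n"
        "Quote exact deadlines from Appendix II and exact contacts from Appendix I.\n"
        "For HSSE, IMO 1, IMO 7 or confined space topics, append the HSSE warning.\n"
        "If the answer is not in the context, say you cannot find it in the manual.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    ),
    "IATA-Cargo-Interchange-Message-Procedure.pdf": PromptTemplate.from_template(
        "You are an assistant for IATA cargo EDI message procedures.\n"
        "Reproduce message codes and segment names exactly as written.\n\n"
        "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    ),
}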
The Lesson That Cost Me: LangChain Deprecated Half Its API
Here's the thing no one warns you about when you build with LangChain: the library moves fast, and old tutorials break.
I originally built the chain using ConversationalRetrievalChain — the approach in most 2023/2024 RAG tutorials. When I deployed on Lightning.ai with a newer LangChain version, I got three immediate errors:
ModuleNotFoundError: No module named 'langchain_core.pydantic_v1'
ModuleNotFoundError: No module named 'langchain.chains'
ModuleNotFoundError: No module named 'langchain.text_splitter'
The fix was a full migration to LCEL — LangChain Expression Language — which is the modern, supported interface in LangChain 0.3+.
Here's the import map if you hit the same issue:
| Old (broken) | New (correct) |
|---|---|
| `langchain.text_splitter` | `langchain_text_splitters` |
| `langchain_community.embeddings.OllamaEmbeddings` | `langchain_ollama.OllamaEmbeddings` |
| `langchain_community.chat_models.ChatOllama` | `langchain_ollama.ChatOllama` |
| `langchain.schema.Document` | `langchain_core.documents.Document` |
| `langchain.prompts.PromptTemplate` | `langchain_core.prompts.PromptTemplate` |
| `langchain.chains.ConversationalRetrievalChain` | Replaced with LCEL pipeline |
The LCEL chain looks like this:
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

chain = (
    # Attach the formatted context and the raw source documents to the input dict
    RunnablePassthrough.assign(
        context=RunnableLambda(lambda x: format_docs(retriever.invoke(x["question"]))),
        source_documents=RunnableLambda(lambda x: retriever.invoke(x["question"]))
    )
    # Then generate the answer, keeping the sources alongside it for citations
    | RunnablePassthrough.assign(
        answer=(prompt | llm | StrOutputParser())
    )
)
Cleaner, more explicit, and actually supported going forward.
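Invoking it is a single call. The format_docs helper is whatever joins the retrieved chunks into one context string; a minimal version is shown here for completeness.

# format_docs simply concatenates the retrieved chunks (minimal version)
def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

result = chain.invoke({"question": "What is the yard opening time before ETA?"})
print(result["answer"])
for doc in result["source_documents"]:
    print(doc.metadata.get("source"), doc.metadata.get("page"))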
The Mistake That Caused Hallucinations
During early testing, a reviewer uploaded a BMW X1 owner's manual to test the upload panel.
The system accepted it. And then, when asked about yard opening time at MVII, it answered confidently — citing engine torque specs and fuel requirements.
That's not a model failure. That's a scope failure.
The fix was two-layered:
1. Hard file validation at upload time:
ALLOWED_FILES = {
    "operational-manual-vessel-and-container-operators-v-42.pdf",
    "IATA-Cargo-Interchange-Message-Procedure.pdf",
}
rejected = [uf.name for uf in uploaded if uf.name not in ALLOWED_FILES]
Any filename not in the approved set is rejected before it's ever saved to disk. It never gets indexed.
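In the Streamlit upload handler, that check can be wired up along these lines. This is a sketch: the widget labels, messages, and paths are placeholders, not the app's exact code.

# Sketch: reject out-of-scope files before anything touches disk or the index
import streamlit as st

uploaded = st.file_uploader("Upload documents", type=["pdf"],
                            accept_multiple_files=True)
if uploaded:
    rejected = [uf.name for uf in uploaded if uf.name not in ALLOWED_FILES]
    if rejected:
        st.error(f"Out-of-scope files rejected: {', '.join(rejected)}")
    for uf in uploaded:
        if uf.name in ALLOWED_FILES:
            with open(f"documents/{uf.name}", "wb") as f:
                f.write(uf.getbuffer())   # only approved files are ever saved and indexed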
2. A clear POC warning banner in the UI:
"This application is a Proof of Concept designed exclusively for Vessel & Container Terminal Operations. Do not upload or test with out-of-scope files. Uploading unrelated documents will cause the model to retrieve irrelevant content and produce hallucinated or misleading answers — even if the response sounds confident."
The lesson: RAG systems don't fail gracefully when given out-of-scope content. They answer anyway, with whatever they can find. Scope control is not optional — it has to be enforced at the boundary, before any content reaches the index.
Current Limitations — Being Honest
I think it's important to be transparent about what this system cannot do in its current POC state:
🔒 Fixed document scope — Only two documents are supported. This is intentional during the POC, but it's a real constraint.
🧠 Limited context window — running llama3.2 through Ollama's default configuration gives roughly a 4,000-token effective context window. Only a handful of retrieved chunks fit into each prompt, so answers that depend on many scattered passages can be missed.
📡 No real-time data — The chatbot knows only what's in the uploaded documents. It cannot access live vessel ETAs, berth availability, or any external systems.
🔁 No persistent memory — Chat history resets when the browser closes. There's no database-backed session storage yet.
💻 Slow on CPU — Running llama3.2 locally without a GPU means responses take 15–60 seconds. On Lightning.ai's free tier (CPU only), this is noticeable.
These are all solvable. Which brings me to the part I'm most excited about.
The Upgrade Path: How This Evolves
The POC proves the concept works. Here's how I'd evolve it into a production system:
1. Upgrade the Language Model
This is the single highest-impact improvement. The local llama3.2 model was chosen for zero-cost POC testing. In production, you'd switch to a paid API model:
| Model | Context Window | Why It Matters |
|---|---|---|
| llama3.2 (current) | ~4K tokens (default config) | Free, local, limited |
| GPT-4o | 128K tokens | Excellent multi-step reasoning |
| Claude 3.5 Sonnet | 200K tokens | Long documents, precise citations |
| Gemini 1.5 Pro | 1M tokens | Massive document sets |
A 200K context window means the model can read far more of the document at once, dramatically reducing missed answers at chunk boundaries.
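Because the chain is plain LCEL, the model swap itself is a one-line change (assuming the relevant API key is configured and the provider package is installed; model names below are current as of writing).

# Swapping the local model for a hosted one is a one-line change in the LCEL chain
# from langchain_ollama import ChatOllama
# llm = ChatOllama(model="llama3.2", temperature=0.2)              # POC
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o", temperature=0.2)                  # production option
# or: from langchain_anthropic import ChatAnthropic
#     llm = ChatAnthropic(model="claude-3-5-sonnet-latest", temperature=0.2)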
2. Add Hybrid Search
The current system uses pure semantic (vector) search. Adding BM25 keyword search alongside it — via LangChain's EnsembleRetriever — would dramatically improve precision for exact technical terms like EDI codes, berth IDs, and regulatory deadlines. These are exact-match queries that keyword search handles much better than semantic similarity alone.
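A sketch of what that could look like. BM25Retriever needs the raw chunks (and the rank_bm25 package installed); `chunks` and `store` are the split documents and Chroma store from the ingestion step, and the weights are something to tune, not fixed values.

# Sketch: hybrid retrieval = BM25 keyword search + vector search, merged
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

bm25 = BM25Retriever.from_documents(chunks)   # exact-match friendly keyword search
bm25.k = 5
vector = store.as_retriever(search_kwargs={"k": 5})
hybrid = EnsembleRetriever(retrievers=[bm25, vector], weights=[0.4, 0.6])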
3. Add Re-Ranking
After initial retrieval, a re-ranker (such as Cohere Rerank) re-scores chunks by actual relevance before passing them to the LLM. This removes chunks that scored high on vector similarity but aren't genuinely useful for the question.
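With LangChain this typically means wrapping the base retriever in a ContextualCompressionRetriever. The sketch below assumes a Cohere API key and the langchain-cohere package, neither of which is part of the current POC.

# Sketch: re-rank retrieved chunks by relevance before they reach the LLM
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

reranker = CohereRerank(model="rerank-english-v3.0", top_n=5)
reranked_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=retriever,     # the MMR or hybrid retriever from earlier
)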
4. Production Infrastructure
For a real deployment, you'd want:
- User authentication — OAuth or Auth0, role-based access
- Persistent chat history — PostgreSQL backend per user
- GPU inference — 5–10x faster response times
- Live port data integration — ETAs and berth status via API
- Feedback loop — thumbs up/down per answer drives continuous improvement
What I'd Do Differently
If I were starting this project from scratch today, here's what I'd change:
Start with LCEL from day one. Don't touch ConversationalRetrievalChain even if older tutorials use it. It's deprecated. Write your chain with RunnablePassthrough and RunnableLambda from the beginning.
Design the collection architecture before writing code. The decision to use separate collections per file was the right one, but I only arrived at it after thinking through the failure modes of a shared collection. Think through it first.
Write prompt templates before writing retrieval code. The quality of your prompt determines the quality of the answer far more than the retrieval parameters do. I spent too long tuning chunk_size when I should have been refining the prompt.
Enforce scope at the boundary, not in the prompt. Telling the LLM "only answer from the context" in the prompt is not sufficient if you let out-of-scope documents into the index in the first place. Validate at upload time.
The Setup, In Brief
If you want to run this yourself:
# Install Ollama and pull models
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2
ollama pull nomic-embed-text
ollama serve &
# Install Python dependencies
pip install -U langchain-ollama langchain-text-splitters \
langchain-core langchain-community chromadb \
pypdf streamlit pytesseract Pillow
# Run
streamlit run app_multifile.py --server.port 8080
The full project structure is straightforward:
rag_chatbot/
├── app_multifile.py # Main Streamlit application
├── documents/ # Uploaded files stored here
│ └── .meta.json # MD5 hash index for incremental indexing
└── chroma_db/ # ChromaDB vector collections
├── col_operational_manual.../
└── col_IATA_Cargo.../
Final Thoughts
Building this system taught me something that I don't see said enough in AI content:
RAG is not primarily a model problem. It's an architecture problem.
The model is the last step. Before it ever generates a word, you've made dozens of decisions that determine whether the answer will be good: How do you chunk the document? What embedding model? How many chunks to retrieve? With what strategy? What does the prompt say? What documents are even allowed in the index?
Get those decisions right and even a modest local model like llama3.2 can produce genuinely useful, cited, domain-specific answers.
Get them wrong and GPT-4 will confidently cite engine torque specs when someone asks about a vessel's berthing window.
The POC works. The architecture is sound. The upgrade path is clear.
Now it's time to ship.
If you found this useful, consider following me for more AI engineering content. I write about building real-world AI systems — not just demos.
— Mayank Chugh
Tags: #ArtificialIntelligence #MachineLearning #RAG #LangChain #Python #NLP #Chatbot #Ollama #ChromaDB #Streamlit