
On-Prem AI Chatbot for PDF Search

Learn how to build a fully on-prem AI-powered chatbot


This is a simplified version of a real-world project I built for a job.

The goal: Create a fully on-prem AI chatbot that can search and retrieve information from a large collection of PDFs.

Unlike cloud-based solutions, everything runs locally, which means no API costs, no data privacy concerns, and full control over the system.


πŸ› οΈ What We’re Building

This guide will walk through setting up an AI-powered PDF search system using the following tools:

  • Text Extraction: PyMuPDF (fastest)
  • Keyword Search: Elasticsearch
  • Semantic Search: FAISS (or Qdrant)
  • Embedding Model: sentence-transformers
  • Chatbot LLM: Llama 2 (running on local GPUs)
  • User Interface: Streamlit

❗ A Quick Note About OCR

For this project, I didn’t need OCR because all the PDFs already contained selectable text.

However, if you’re dealing with scanned PDFs (images instead of text), you’ll need OCR (Optical Character Recognition).

For those cases, check out: How to OCR PDFs using pdfplumber and Tesseract.

But for this tutorial, we’re assuming text-based PDFs only.


πŸ“ Part 1: Extracting Text from PDFs

Before we can search or chat with our PDFs, we need to extract the text.

The best way to do this without OCR is using PyMuPDF (fitz), which is blazing fast and maintains formatting.

πŸ“¦ Step 1: Install Dependencies

First, install PyMuPDF:

pip install pymupdf

πŸš€ Step 2: Extract Text from a PDF

Here’s a simple function to extract text from any text-based PDF:

import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF using PyMuPDF."""
    doc = fitz.open(pdf_path)
    text = "\n".join([page.get_text("text") for page in doc])
    return text

# Example Usage
pdf_text = extract_text_from_pdf("example.pdf")
print(pdf_text[:500])  # Print first 500 characters

βœ… Why PyMuPDF?

  • Super fast πŸš€
  • Preserves text structure
  • Can handle large PDFs without issues

❌ When it won’t work

  • If the PDF is a scanned image, PyMuPDF won’t extract anything
  • If you get empty text, your PDF likely needs OCR

Again, if OCR is needed, check out: How to OCR PDFs using pdfplumber and Tesseract.
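
If you're not sure which PDFs in a batch are scanned, a quick heuristic is to check how much text PyMuPDF actually returns, using the extract_text_from_pdf function defined above (the 50-character threshold is an arbitrary cutoff):

def needs_ocr(pdf_path, min_chars=50):
    """Returns True if PyMuPDF extracts almost no text, i.e. the PDF is probably scanned."""
    return len(extract_text_from_pdf(pdf_path).strip()) < min_chars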


πŸ”₯ Step 3: Batch Process a Folder of PDFs

If you have hundreds or thousands of PDFs, you’ll want to process them all at once.

Here’s how to extract text from every PDF in a folder and store the results in a dictionary:

import os

def extract_text_from_folder(pdf_folder):
    """Extracts text from all PDFs in a folder."""
    extracted_texts = {}
    
    for filename in os.listdir(pdf_folder):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(pdf_folder, filename)
            text = extract_text_from_pdf(pdf_path)
            extracted_texts[filename] = text
            
    return extracted_texts

# Example Usage
pdf_texts = extract_text_from_folder("pdf_documents")
print(pdf_texts.keys())  # Print the names of processed PDFs

πŸ”Ή What This Does:

  • Loops through all PDFs in a given folder
  • Extracts text and stores it in a dictionary {filename: extracted_text}
  • Can be used later for search indexing

βœ… What We Have So Far

At this point, we can extract text from PDFs, which is the first step toward building our AI-powered search system.

🔥 Indexing PDFs in Elasticsearch (Keyword Search)

πŸ“¦ Step 1: Install Elasticsearch & Python Client

First, we need to install Elasticsearch and the Python client.

Option 1: Install Elasticsearch (7.x or 8.x) from elastic.co, then start it:

./bin/elasticsearch

Option 2: Run Elasticsearch via Docker

docker run -d --name elasticsearch -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.5.0

(Disabling X-Pack security is fine for a local test setup; Elasticsearch 8.x turns on TLS and authentication by default, which would reject the plain http:// connection we use below.)

Now, install the Python client:

pip install elasticsearch

πŸš€ Step 2: Index Extracted PDF Text

We’ll store each PDF’s extracted text as a document in Elasticsearch.

1️⃣ Connect to Elasticsearch

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # Change if running remotely

# Check connection
if es.ping():
    print("Connected to Elasticsearch!")
else:
    print("Elasticsearch connection failed.")

2️⃣ Create an Index for PDFs

INDEX_NAME = "pdf_documents"

# Define mapping (schema)
mapping = {
    "mappings": {
        "properties": {
            "filename": {"type": "keyword"},
            "text": {"type": "text"}
        }
    }
}

# Create the index
if not es.indices.exists(index=INDEX_NAME):
    es.indices.create(index=INDEX_NAME, body=mapping)
    print(f"Index '{INDEX_NAME}' created.")

3️⃣ Add PDFs to Elasticsearch

def index_pdf(filename, text):
    """Indexes a PDF document in Elasticsearch."""
    doc = {"filename": filename, "text": text}
    es.index(index=INDEX_NAME, body=doc)

# Example usage
index_pdf("example.pdf", "This is a sample PDF content.")
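
To load everything we extracted in Part 1, loop over the pdf_texts dictionary returned by extract_text_from_folder (a small sketch, assuming that dictionary is still in scope):

# Index every PDF extracted in Part 1
for filename, text in pdf_texts.items():
    index_pdf(filename, text)

# Make the new documents searchable immediately
es.indices.refresh(index=INDEX_NAME)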

πŸ” Step 3: Search PDFs in Elasticsearch

Now that PDFs are indexed, we can search for keywords.

Example: Search for “machine learning” in PDFs

def search_pdfs(query):
    """Search PDFs using Elasticsearch."""
    search_query = {
        "query": {
            "match": {
                "text": query
            }
        }
    }
    results = es.search(index=INDEX_NAME, body=search_query)
    return results["hits"]["hits"]

# Example usage
results = search_pdfs("machine learning")
for r in results:
    print(f"Found in: {r['_source']['filename']}\nText: {r['_source']['text'][:200]}...\n")

βœ… Elasticsearch now powers our keyword-based PDF search!


🧠 Indexing PDFs in FAISS (Semantic Search)

Elasticsearch works well for exact keyword matches, but it doesn’t understand meaning.

To search PDFs based on meaning, we use FAISS (Facebook AI Similarity Search) with text embeddings.


πŸ“¦ Step 1: Install FAISS & Sentence-Transformers

pip install faiss-cpu sentence-transformers

πŸš€ Step 2: Generate Embeddings for PDFs

We’ll use sentence-transformers to convert text into numerical embeddings.

1️⃣ Load the Embedding Model

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # Fast and accurate

2️⃣ Convert PDF Text into Embeddings

def embed_text(text):
    """Generates an embedding for a given text."""
    return model.encode(text)

# Example usage
embedding = embed_text("This is a sample text.")
print(embedding.shape)  # Output: (384,)
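
When embedding a whole folder of PDFs, it's much faster to hand encode() a list of texts so sentence-transformers can batch the work itself (assuming the pdf_texts dictionary from Part 1):

# Encode many documents at once (batched internally by sentence-transformers)
texts = list(pdf_texts.values())
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)
print(embeddings.shape)  # (number_of_pdfs, 384)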

πŸ”₯ Step 3: Store Embeddings in FAISS

Now, we create a FAISS index to store and search our embeddings.

1️⃣ Import FAISS & Create an Index

import faiss
import numpy as np

DIMENSIONS = 384  # Model output size
index = faiss.IndexFlatL2(DIMENSIONS)  # L2 distance index

2️⃣ Index PDF Embeddings

pdf_texts = {
    "example.pdf": "This document is about deep learning and AI.",
    "sample.pdf": "This paper discusses cloud computing concepts."
}

embeddings = np.array([embed_text(text) for text in pdf_texts.values()])
index.add(embeddings)

print("FAISS index created with", index.ntotal, "documents.")

πŸ” Step 4: Search PDFs in FAISS

Now we can search using semantic similarity.

1️⃣ Search FAISS Using a Query

def search_faiss(query, k=2):
    """Searches FAISS for the most similar PDFs."""
    query_embedding = embed_text(query).reshape(1, -1)
    D, I = index.search(query_embedding, k)  # Retrieve top-k
    return I

# Example usage
query = "AI and deep learning"
results = search_faiss(query)

for i in results[0]:
    print("Matched:", list(pdf_texts.keys())[i])

βœ… FAISS now powers our semantic PDF search!


In Part 3, we integrate Llama 2 to create an AI chatbot that answers questions about our PDF data using retrieval-augmented generation (RAG). 🚀

πŸ”₯ Running Llama 2 Locally

πŸ—οΈ Step 1: Install Llama 2

We’ll use llama-cpp-python, which allows us to run Llama 2 on CPU or GPU.

pip install llama-cpp-python

💡 llama-cpp-python loads models in the GGUF format. If you have a powerful GPU, offload layers to it with n_gpu_layers (covered in Part 5) for much better performance.


πŸš€ Step 2: Download a Llama 2 Model

Request access on Meta's Llama 2 page, then download a quantized GGUF conversion of the chat model (community GGUF builds are hosted on Hugging Face).

For fast responses, I recommend:

  • llama-2-7b-chat.Q4_K_M.gguf (Quantized 4-bit model)
  • llama-2-13b-chat.Q4_K_M.gguf (Larger but still manageable)

Place the model file in a folder called models.


πŸ”₯ Step 3: Load Llama 2 in Python

from llama_cpp import Llama

# Load the Llama 2 model
llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf")

# Example chat (max_tokens defaults to a small value, so raise it for full answers)
response = llm("What is machine learning?", max_tokens=256)
print(response["choices"][0]["text"])

βœ… Llama 2 is now running locally!


🧠 Implementing RAG (Retrieval-Augmented Generation)

By itself, Llama 2 doesn’t know about our PDFs.

To make it answer questions based on PDFs, we use retrieval-augmented generation (RAG):

  1. Search PDFs using Elasticsearch (keyword) and FAISS (semantic search)
  2. Feed search results into Llama 2 as context
  3. Ask Llama 2 a question, and it will generate an answer based on the retrieved PDFs

πŸš€ Step 1: Search PDFs Using Elasticsearch & FAISS

We combine both search methods to get the most relevant PDF chunks.

def search_pdfs_rag(query, k=3):
    """Search PDFs using Elasticsearch (keyword) and FAISS (semantic)."""
    # 1️⃣ Keyword Search (Elasticsearch)
    es_results = search_pdfs(query)[:k]

    # 2️⃣ Semantic Search (FAISS)
    faiss_results = search_faiss(query, k)[:k]

    # 3️⃣ Merge and Return Results
    combined_results = set([r["_source"]["text"][:500] for r in es_results])
    combined_results.update([list(pdf_texts.values())[i][:500] for i in faiss_results[0]])

    return "\n\n".join(combined_results)

πŸ”₯ Step 2: Feed Search Results to Llama 2

Now we pass the retrieved text as context to Llama 2.

def chat_with_pdfs(query):
    """Uses RAG to answer questions based on PDF content."""
    context = search_pdfs_rag(query)

    prompt = f"Use the following context to answer the question:\n\n{context}\n\nQuestion: {query}\nAnswer:"

    response = llm(prompt)
    return response["choices"][0]["text"]

# Example Usage
print(chat_with_pdfs("What is deep learning?"))

βœ… Now, Llama 2 can answer questions based on our PDFs!


🎨 Building a Simple Chat UI with Streamlit

To make this user-friendly, let’s build a web-based chatbot using Streamlit.


πŸ“¦ Step 1: Install Streamlit

pip install streamlit

πŸš€ Step 2: Create a Simple Chatbot UI

Create a file app.py:

import streamlit as st

# chat_with_pdfs() and the search helpers need to be importable here;
# adjust the module name to wherever you defined them (rag_pipeline is just a placeholder)
from rag_pipeline import chat_with_pdfs

st.title("📄 AI Chatbot for PDF Search")

query = st.text_input("Ask a question:")
if query:
    response = chat_with_pdfs(query)
    st.write("### 🤖 AI Response:")
    st.write(response)

🎯 Step 3: Run the App

streamlit run app.py

βœ… Now, you have a chatbot that searches PDFs and answers questions!


πŸ”Improving Search Ranking

Right now, our Elasticsearch + FAISS search returns somewhat relevant results, but we can improve ranking & filtering.

πŸš€ Step 1: Boost Keyword Matches in Elasticsearch

By default, Elasticsearch treats all matches equally. We can boost results that contain exact keyword matches.

βœ… Update Elasticsearch Search Query

def search_pdfs_improved(query, k=3):
    """Improves search ranking by boosting keyword matches."""
    search_query = {
        "query": {
            "bool": {
                "should": [
                    {"match": {"text": {"query": query, "boost": 2.0}}},  # Boost exact matches
                    {"match_phrase": {"text": {"query": query, "boost": 1.5}}}  # Boost phrase matches
                ]
            }
        }
    }
    results = es.search(index="pdf_documents", body=search_query)
    return results["hits"]["hits"][:k]

# Example usage
print(search_pdfs_improved("machine learning"))

βœ… Boosts exact and phrase matches
βœ… More relevant results appear at the top


πŸš€ Step 2: Adjust FAISS to Prefer Recent Documents

FAISS doesn’t consider document relevance, but we can re-rank results based on recency.

βœ… Re-rank FAISS Results by Document Date

def rerank_faiss_results(faiss_results, doc_metadata):
    """Re-ranks FAISS results based on recency."""
    sorted_results = sorted(faiss_results, key=lambda doc: doc_metadata[doc]["date"], reverse=True)
    return sorted_results

# Example usage
metadata = {"example.pdf": {"date": "2024-01-01"}, "old.pdf": {"date": "2019-05-10"}}
print(rerank_faiss_results(["old.pdf", "example.pdf"], metadata))  # "example.pdf" comes first

βœ… Recent documents now rank higher


🎨 Enhancing Streamlit UI

Our current chatbot UI is too basic. Let’s:
βœ… Improve layout
βœ… Add chat history
βœ… Show document sources


πŸš€ Step 1: Upgrade the Chat UI

Update app.py with a better layout:

import streamlit as st

st.set_page_config(page_title="πŸ“„ AI Chatbot for PDFs", layout="wide")

st.title("πŸ“„ AI Chatbot for PDF Search")

# Sidebar
with st.sidebar:
    st.header("Settings")
    st.text("Customize your search")

query = st.text_input("Ask a question:")

if "chat_history" not in st.session_state:
    st.session_state.chat_history = []

if query:
    response = chat_with_pdfs(query)
    st.session_state.chat_history.append((query, response))

st.write("### πŸ€– AI Response:")
for q, r in st.session_state.chat_history:
    st.write(f"**Q:** {q}")
    st.write(f"**A:** {r}")
    st.write("---")

βœ… Keeps chat history
βœ… Better layout with a sidebar


πŸš€ Step 2: Show PDF Sources in Chat

Modify chat_with_pdfs() to return sources.

def chat_with_pdfs(query):
    """Returns AI response + sources."""
    context, sources = search_pdfs_rag(query, return_sources=True)

    prompt = f"Use the following context to answer the question:\n\n{context}\n\nQuestion: {query}\nAnswer:"
    response = llm(prompt)

    return response["choices"][0]["text"], sources
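
The return_sources flag doesn't exist in the Part 2 version of search_pdfs_rag(), so that function needs a small update as well. Here's one way it could look (a sketch that reuses the helpers and globals defined earlier):

def search_pdfs_rag(query, k=3, return_sources=False):
    """Searches PDFs and optionally returns which files the context came from."""
    es_results = search_pdfs(query)[:k]
    faiss_results = search_faiss(query, k)

    context_chunks = [r["_source"]["text"][:500] for r in es_results]
    sources = [r["_source"]["filename"] for r in es_results]

    filenames = list(pdf_texts.keys())
    texts = list(pdf_texts.values())
    for i in faiss_results[0]:
        context_chunks.append(texts[i][:500])
        sources.append(filenames[i])

    context = "\n\n".join(dict.fromkeys(context_chunks))  # de-duplicate, keep order
    if return_sources:
        return context, sorted(set(sources))
    return context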

Now, update app.py to show sources:

response, sources = chat_with_pdfs(query)

st.write("### πŸ€– AI Response:")
st.write(response)

st.write("πŸ“‚ **Sources:**")
for source in sources:
    st.write(f"- {source}")

βœ… Users see which PDFs were used to generate answers


πŸ“‚ Adding PDF Upload Support

Currently, we preload PDFs, but users can’t upload new ones. Let’s fix that!


πŸš€ Step 1: Add File Upload to Streamlit

Modify app.py to allow users to upload PDFs.

uploaded_files = st.file_uploader("Upload PDFs", accept_multiple_files=True, type=["pdf"])

if uploaded_files:
    for uploaded_file in uploaded_files:
        bytes_data = uploaded_file.read()
        
        # Save file locally
        with open(f"pdf_documents/{uploaded_file.name}", "wb") as f:
            f.write(bytes_data)
        
        # Extract text and index it
        text = extract_text_from_pdf(f"pdf_documents/{uploaded_file.name}")
        index_pdf(uploaded_file.name, text)
        
    st.success("Files uploaded and indexed successfully!")

βœ… Users can now upload PDFs, and they’re instantly indexed
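
One caveat: the upload handler above only refreshes the Elasticsearch index. To make new files show up in semantic search too, embed them and add them to FAISS in the same loop (a sketch assuming the index, embed_text, and pdf_texts objects from Part 2 are importable in app.py):

import numpy as np

# Inside the upload loop, after index_pdf(...):
new_embedding = np.array([embed_text(text)])
index.add(new_embedding)
pdf_texts[uploaded_file.name] = text  # so FAISS hits can be mapped back to a filename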


In Part 5, we optimize performance: making Llama 2 run faster, speeding up FAISS search, and scaling to thousands of PDFs efficiently. 🚀


⚡ Speeding Up Llama 2

By default, Llama 2 can be slow, especially on CPUs. Here’s how to run it faster.


πŸš€ Step 1: Use a Quantized Llama 2 Model

Quantization reduces model size and speeds up inference.

βœ… Download a Quantized GGUF Model

Download a quantized GGUF conversion of the chat model (community GGUF builds are hosted on Hugging Face):

  • llama-2-7b-chat.Q4_K_M.gguf (4-bit quantized)
  • OR llama-2-13b-chat.Q4_K_M.gguf (faster than full precision)

Move it to models/.


πŸš€ Step 2: Enable GPU Acceleration (If Available)

If you have a GPU, use llama-cpp-python with CUDA.

βœ… Install CUDA & llama-cpp-python with GPU Support

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --no-cache-dir

Then, modify your Llama 2 loading code:

from llama_cpp import Llama

# Load Llama 2 model with GPU acceleration
llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=100)

βœ… Massive speed boost on GPUs!
βœ… Even CPU inference is faster with quantization
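
A few other constructor arguments are worth knowing about; the values below are just starting points to tune:

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=100,  # offload as many layers as fit on the GPU
    n_ctx=4096,        # context window; RAG prompts with long excerpts need room
    n_threads=8,       # CPU threads for whatever stays on the CPU
)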


πŸš€ Step 3: Reduce Response Time with Streaming

Right now, Llama 2 waits for the full response before returning anything.

We can stream responses as they’re generated for a faster, chat-like feel.

βœ… Modify chat_with_pdfs() to Stream Responses

def chat_with_pdfs(query):
    """Streams responses from Llama 2 for faster user experience."""
    context = search_pdfs_rag(query)
    prompt = f"Use the following context to answer:\n\n{context}\n\nQuestion: {query}\nAnswer:"

    for response in llm(prompt, stream=True):
        yield response["choices"][0]["text"]

βœ… Now responses appear instantly instead of waiting!
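
On the Streamlit side, recent versions (1.31+) can consume a generator directly with st.write_stream, so the UI prints tokens as they arrive; roughly:

# app.py: stream tokens into the page as they are generated
query = st.text_input("Ask a question:")
if query:
    st.write("### AI Response:")
    st.write_stream(chat_with_pdfs(query))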


🏎️ Optimizing FAISS for Large-Scale Search

FAISS is fast, but it can slow down as we add more PDFs.

Here’s how to speed it up for thousands of documents.


πŸš€ Step 1: Use HNSW Indexing Instead of Flat L2

By default, FAISS uses brute-force search (IndexFlatL2).
For huge datasets, we should use Hierarchical Navigable Small World (HNSW) indexing.

βœ… Modify FAISS Index to Use HNSW

import faiss

DIMENSIONS = 384  # Sentence-Transformer output size
index = faiss.IndexHNSWFlat(DIMENSIONS, 32)  # 32 is the max number of links per node

βœ… Now FAISS search is MUCH faster for large datasets


πŸš€ Step 2: Use IVF Indexing for Faster Lookups

Another trick is Inverted File Index (IVF), which clusters vectors for fast retrieval.

βœ… Modify FAISS Index to Use IVF

num_clusters = 128  # number of IVF clusters; adjust based on dataset size
quantizer = faiss.IndexFlatL2(DIMENSIONS)
index = faiss.IndexIVFFlat(quantizer, DIMENSIONS, num_clusters)
index.train(embeddings)  # IVF indexes must be trained before vectors are added
index.add(embeddings)    # add the vectors after training
index.nprobe = 8         # clusters scanned per query (speed vs. recall trade-off)

βœ… Speeds up searches by grouping similar documents


πŸš€ Scaling Elasticsearch for Massive PDF Collections

If you have millions of PDFs, Elasticsearch needs tuning.


πŸš€ Step 1: Disable Refresh for Bulk Indexing

By default, Elasticsearch refreshes after every document insert, slowing down indexing.

βœ… Disable Refresh While Indexing

es.indices.put_settings(index="pdf_documents", body={"refresh_interval": "-1"})

# Bulk index PDFs
for filename, text in pdf_texts.items():
    index_pdf(filename, text)

# Re-enable refresh
es.indices.put_settings(index="pdf_documents", body={"refresh_interval": "1s"})

βœ… Indexing PDFs is now 5-10x faster
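
For really large batches, the Python client's bulk helper also cuts per-request overhead. A sketch using elasticsearch.helpers.bulk:

from elasticsearch.helpers import bulk

def bulk_index_pdfs(pdf_texts):
    """Indexes many PDFs in a single bulk request instead of one request per document."""
    actions = [
        {"_index": "pdf_documents", "_source": {"filename": name, "text": text}}
        for name, text in pdf_texts.items()
    ]
    bulk(es, actions)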


πŸš€ Step 2: Increase Shard Count for Large Datasets

For huge collections, increase the number of shards.

βœ… Modify Index Settings

index_settings = {
    "settings": {
        "index": {
            "number_of_shards": 3,
            "number_of_replicas": 1
        }
    }
}

# Shard count is fixed at creation time, so apply this when the index is first
# created (or reindex into a new index that uses these settings)
es.indices.create(index="pdf_documents", body=index_settings)

βœ… Speeds up searches & indexing on large datasets



🧠 Adding Multi-Turn Memory

By default, Llama 2 only answers one question at a time.

To enable multi-turn conversation memory, we need to track past questions and answers.


πŸš€ Step 1: Modify chat_with_pdfs() to Include Memory

Modify the chatbot function to store past questions & responses.

def chat_with_pdfs(query):
    """Uses multi-turn memory for AI conversations."""
    
    # Retrieve relevant PDFs
    context, sources = search_pdfs_rag(query, return_sources=True)
    
    # Maintain conversation memory
    if "conversation_history" not in st.session_state:
        st.session_state.conversation_history = []
    
    # Create conversation context
    conversation_history = "\n".join(st.session_state.conversation_history)
    
    # Generate response using Llama 2
    prompt = f"""
    Previous conversation:
    {conversation_history}
    
    Use the following PDF context to answer the question:
    {context}
    
    Question: {query}
    Answer:
    """
    
    response = llm(prompt)["choices"][0]["text"]
    
    # Save conversation
    st.session_state.conversation_history.append(f"Q: {query}\nA: {response}")
    
    return response, sources

βœ… Now, the chatbot remembers previous questions!
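
One caveat: the prompt grows with every turn and will eventually exceed the model's context window. A simple fix is to cap how much history goes into the prompt (MAX_TURNS is an arbitrary number to tune):

# Only include the most recent turns when building the prompt
MAX_TURNS = 5
conversation_history = "\n".join(st.session_state.conversation_history[-MAX_TURNS:])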


πŸš€ Step 2: Display Conversation History in the UI

Modify app.py to show chat history.

import streamlit as st

st.title("πŸ“„ AI Chatbot for PDF Search")

query = st.text_input("Ask a question:")

if query:
    response, sources = chat_with_pdfs(query)

    st.write("### πŸ€– AI Response:")
    st.write(response)

    # Display conversation history
    st.write("### πŸ“ Conversation History:")
    for message in st.session_state.conversation_history[-5:]:  # Show last 5 messages
        st.write(message)

    # Show document sources
    st.write("πŸ“‚ **Sources:**")
    for source in sources:
        st.write(f"- {source}")

βœ… Now users see chat history & sources in a clean format


πŸ‹οΈβ€β™‚οΈ Fine-Tuning Llama 2 for Better Responses

Currently, Llama 2 isn’t optimized for our PDFs.
Fine-tuning makes it much smarter about our documents.


πŸš€ Step 1: Prepare Custom Training Data

Fine-tuning requires examples of questions & correct answers.
We’ll use our own PDFs to create a dataset.

βœ… Format Training Data in JSON

[
    {
        "input": "What is machine learning?",
        "output": "Machine learning is a method of data analysis that automates analytical model building."
    },
    {
        "input": "Explain deep learning.",
        "output": "Deep learning is a subset of machine learning that uses neural networks to model complex patterns in data."
    }
]

Save this as training_data.json.


πŸš€ Step 2: Fine-Tune Llama 2

We’ll use Hugging Face’s transformers to fine-tune Llama 2.

βœ… Install Dependencies

pip install transformers datasets peft

βœ… Fine-Tune Llama 2

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from datasets import Dataset
import torch
import json

# Load base model & tokenizer (the meta-llama repo is gated; request access first)
model_name = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token by default

# Load training data
with open("training_data.json", "r") as f:
    training_data = json.load(f)

# For causal-LM fine-tuning, each example is the question and answer joined into
# one string; the collator below derives the labels from the input_ids.
def tokenize(example):
    text = f"Question: {example['input']}\nAnswer: {example['output']}"
    return tokenizer(text, truncation=True, max_length=512)

dataset = Dataset.from_list(training_data).map(tokenize, remove_columns=["input", "output"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Fine-tuning settings
# (Full fine-tuning of a 7B model needs a lot of GPU memory; in practice you'd
#  wrap the model in a LoRA adapter via peft, which is why we installed it above.)
training_args = TrainingArguments(
    output_dir="./fine-tuned-llama",
    per_device_train_batch_size=2,
    num_train_epochs=3,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collator
)

trainer.train()
model.save_pretrained("./fine-tuned-llama")
tokenizer.save_pretrained("./fine-tuned-llama")

βœ… Llama 2 is now fine-tuned on our PDFs!


⚑ Deploying the Chatbot Backend with FastAPI

FastAPI is a lightweight, high-performance API framework that will serve as our chatbot’s backend.


πŸš€ Step 1: Install FastAPI & Uvicorn

pip install fastapi uvicorn gunicorn

πŸš€ Step 2: Create the FastAPI Server

Create a new file server.py:

from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

# Load Llama 2 model
llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=100)

# Define request schema
class ChatRequest(BaseModel):
    query: str

@app.post("/chat")
def chat(request: ChatRequest):
    """Handles chat requests and returns AI responses."""
    response = llm(request.query)["choices"][0]["text"]
    return {"response": response}

πŸš€ Step 3: Run FastAPI Server

Start the server using Uvicorn:

uvicorn server:app --host 0.0.0.0 --port 8000

βœ… Now, our chatbot runs as an API!

Test it with:

curl -X POST "http://localhost:8000/chat" -H "Content-Type: application/json" -d '{"query": "What is machine learning?"}'

🏎️ Scaling with Gunicorn

Uvicorn runs a single process, which isn’t ideal for multiple users.
We use Gunicorn to run multiple workers.

πŸš€ Step 1: Run FastAPI with Gunicorn

gunicorn -w 4 -k uvicorn.workers.UvicornWorker server:app --bind 0.0.0.0:8000

βœ… Now, our chatbot can handle multiple users at once!


🎨 Deploying Streamlit as a Frontend

Now that the API is live, we’ll connect it to Streamlit.


πŸš€ Step 1: Modify app.py to Call FastAPI

Update app.py to fetch chatbot responses via API.

import streamlit as st
import requests

st.title("πŸ“„ AI Chatbot for PDF Search")

query = st.text_input("Ask a question:")

if query:
    response = requests.post("http://localhost:8000/chat", json={"query": query}).json()
    st.write("### πŸ€– AI Response:")
    st.write(response["response"])

πŸš€ Step 2: Run Streamlit on a Web Server

Run Streamlit with:

streamlit run app.py --server.port 8501 --server.address 0.0.0.0

βœ… Now, the chatbot has a web interface!


πŸ—οΈ Running Everything with Supervisor

To keep FastAPI & Streamlit running in the background, use Supervisor.

πŸš€ Step 1: Install Supervisor

sudo apt install supervisor

πŸš€ Step 2: Create Supervisor Config

Create /etc/supervisor/conf.d/chatbot.conf:

[program:fastapi_server]
command=/usr/bin/gunicorn -w 4 -k uvicorn.workers.UvicornWorker server:app --bind 0.0.0.0:8000
autostart=true
autorestart=true
stderr_logfile=/var/log/fastapi_server.err.log
stdout_logfile=/var/log/fastapi_server.out.log

[program:streamlit_ui]
command=/usr/bin/streamlit run /path/to/app.py --server.port 8501 --server.address 0.0.0.0
autostart=true
autorestart=true
stderr_logfile=/var/log/streamlit_ui.err.log
stdout_logfile=/var/log/streamlit_ui.out.log

Reload Supervisor:

sudo supervisorctl reread
sudo supervisorctl update
sudo supervisorctl start fastapi_server
sudo supervisorctl start streamlit_ui

βœ… Now, FastAPI & Streamlit start automatically on reboot!



πŸ”‘ API Authentication with API Keys

Right now, anyone can access our chatbot API. We’ll restrict access using API keys.


πŸš€ Step 1: Generate API Keys for Users

Modify server.py to store API keys.

API_KEYS = {
    "user1": "abc123",
    "admin": "xyz789"
}

def verify_api_key(api_key: str):
    """Checks if the provided API key is valid."""
    return api_key in API_KEYS.values()

πŸš€ Step 2: Require API Key for Chat Requests

Modify the /chat route to require an API key.

from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

@app.post("/chat")
def chat(request: ChatRequest, api_key: str = Header(None)):
    """Requires an API key (sent in the API-Key header) for chatbot access."""

    if not api_key or not verify_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API Key")

    # request.query comes from the ChatRequest model defined earlier in server.py
    response = llm(request.query)["choices"][0]["text"]
    return {"response": response}

βœ… Now, only users with a valid API key can access the chatbot!

Test it with:

curl -X POST "http://localhost:8000/chat" -H "API-Key: abc123" -H "Content-Type: application/json" -d '{"query": "What is AI?"}'
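
The Streamlit frontend needs to send the same header, so the requests call in app.py becomes something like this (the key value is whatever you issued above):

response = requests.post(
    "http://localhost:8000/chat",
    headers={"API-Key": "abc123"},
    json={"query": query},
).json()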

⏳ Preventing API Abuse with Rate-Limiting

To prevent spam/bot abuse, we’ll limit how often users can query the API.


πŸš€ Step 1: Install Rate-Limiting Middleware

Install slowapi, a FastAPI-compatible rate limiter.

pip install slowapi

πŸš€ Step 2: Add Rate-Limiting to FastAPI

Modify server.py:

from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)

# Register the limiter with the FastAPI app
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/chat")
@limiter.limit("5/minute")
def chat(request: Request, payload: ChatRequest, api_key: str = Header(None)):
    """Limits requests to 5 per minute per client IP."""

    if not api_key or not verify_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API Key")

    response = llm(payload.query)["choices"][0]["text"]
    return {"response": response}

βœ… Now, users can only make 5 requests per minute!

Test it by sending multiple requests in a short time.


🔒 Encrypting User Queries

By default, data is sent in plaintext. Let’s encrypt user queries to protect sensitive data.


πŸš€ Step 1: Install Cryptography Library

pip install cryptography

πŸš€ Step 2: Encrypt User Queries Before Sending

Modify Streamlit UI (app.py):

import streamlit as st
import requests
from cryptography.fernet import Fernet

# Load a shared key (generate it once with Fernet.generate_key() and give the same
# key to the FastAPI server; generating a fresh key on every run would make the
# server unable to decrypt anything). secret.key is just a placeholder path.
with open("secret.key", "rb") as f:
    cipher = Fernet(f.read())

st.title("🔐 Secure AI Chatbot")

query = st.text_input("Ask a question:")

if query:
    encrypted_query = cipher.encrypt(query.encode()).decode()

    response = requests.post("http://localhost:8000/chat", json={"query": encrypted_query})

    # The server is expected to encrypt its answer with the same key (see the sketch below)
    decrypted_response = cipher.decrypt(response.json()["response"].encode()).decode()

    st.write("### 🤖 AI Response:")
    st.write(decrypted_response)

βœ… Now, queries & responses are encrypted before being sent!
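
For this to work, the FastAPI side has to hold the same key and do the reverse: decrypt the incoming query and encrypt its answer. A minimal sketch of that side (secret.key is the same placeholder file the Streamlit app reads):

# server.py: decrypt incoming queries, encrypt outgoing answers
from cryptography.fernet import Fernet

with open("secret.key", "rb") as f:
    cipher = Fernet(f.read())

@app.post("/chat")
def chat(request: ChatRequest):
    query = cipher.decrypt(request.query.encode()).decode()
    answer = llm(query)["choices"][0]["text"]
    return {"response": cipher.encrypt(answer.encode()).decode()}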


πŸ› οΈ Securing Deployment with HTTPS

To enable secure communication, use Let’s Encrypt for SSL/TLS encryption.


πŸš€ Step 1: Install Certbot

sudo apt install certbot python3-certbot-nginx

🚀 Step 2: Obtain a Certificate & Configure SSL for Nginx

First request a certificate for your domain with sudo certbot certonly --nginx -d yourdomain.com (this creates the /etc/letsencrypt/live/... files referenced below). Then modify /etc/nginx/sites-available/chatbot:

server {
    listen 80;
    server_name yourdomain.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/yourdomain.com/privkey.pem;

    location / {
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Restart Nginx:

sudo systemctl restart nginx

βœ… Now, the chatbot runs securely over HTTPS!

πŸ“œ Logging User Interactions

We’ll log every chatbot request to a database for later analysis.


🚀 Step 1: Use SQLite for Logging

We'll store logs in an SQLite database. The sqlite3 module ships with Python's standard library, so there is nothing extra to install.

πŸš€ Step 2: Modify FastAPI to Log Chats

Modify server.py:

import sqlite3
from datetime import datetime

# Connect to the SQLite database
# (check_same_thread=False because FastAPI handles requests on worker threads)
conn = sqlite3.connect("chat_logs.db", check_same_thread=False)
cursor = conn.cursor()

# Create the logs table
cursor.execute("""
CREATE TABLE IF NOT EXISTS chat_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT,
    user TEXT,
    query TEXT,
    response TEXT
)
""")
conn.commit()

def log_chat(user, query, response):
    """Logs chatbot interactions to the database."""
    timestamp = datetime.now().isoformat()
    cursor.execute("INSERT INTO chat_logs (timestamp, user, query, response) VALUES (?, ?, ?, ?)",
                   (timestamp, user, query, response))
    conn.commit()

@app.post("/chat")
def chat(request: ChatRequest, api_key: str = Header(None)):
    """Handles chatbot requests and logs them."""

    if not api_key or not verify_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API Key")

    response = llm(request.query)["choices"][0]["text"]

    # Log the interaction
    log_chat(api_key, request.query, response)

    return {"response": response}

βœ… Now, all chatbot interactions are logged!


πŸ“ˆ Tracking Most Common Queries

Now that chats are logged, let’s track the most frequently asked questions.


πŸš€ Step 1: Query Most Common Searches

Modify server.py to fetch analytics:

@app.get("/analytics/top-queries")
def top_queries():
    """Returns the top 5 most asked questions."""
    cursor.execute("SELECT query, COUNT(query) as count FROM chat_logs GROUP BY query ORDER BY count DESC LIMIT 5")
    results = cursor.fetchall()
    return {"top_queries": results}

βœ… Now we can see the top 5 queries!

Test it:

curl -X GET "http://localhost:8000/analytics/top-queries"

⏳ Monitoring Response Time

To track how fast the chatbot is responding, we’ll log execution time.


πŸš€ Step 1: Modify FastAPI to Track Response Time

Modify server.py:

import time

@app.post("/chat")
def chat(request: ChatRequest, api_key: str = Header(None)):
    """Logs chatbot response times for performance tracking."""

    if not api_key or not verify_api_key(api_key):
        raise HTTPException(status_code=401, detail="Invalid API Key")

    start_time = time.time()
    response = llm(request.query)["choices"][0]["text"]
    end_time = time.time()

    response_time = round(end_time - start_time, 2)

    # Log the response time alongside the answer
    cursor.execute("INSERT INTO chat_logs (timestamp, user, query, response) VALUES (?, ?, ?, ?)",
                   (datetime.now().isoformat(), api_key, request.query, f"{response} (Response Time: {response_time}s)"))
    conn.commit()

    return {"response": response, "response_time": response_time}

βœ… Now, chatbot response times are tracked!

Test it:

curl -X POST "http://localhost:8000/chat" -H "API-Key: abc123" -H "Content-Type: application/json" -d '{"query": "How does machine learning work?"}'

πŸ“Š Displaying Analytics in Streamlit

We’ll display usage insights in a dashboard.


πŸš€ Step 1: Modify Streamlit UI

Update app.py:

import sqlite3

import requests
import streamlit as st

st.title("📊 Chatbot Analytics")

# Fetch top queries from the API
response = requests.get("http://localhost:8000/analytics/top-queries").json()
top_queries = response["top_queries"]

# Display analytics
st.write("### 🔥 Top 5 Most Asked Questions")
for query, count in top_queries:
    st.write(f"- {query} ({count} times)")

# Fetch recent chats straight from the log database
st.write("### 📝 Recent Chat Logs")
conn = sqlite3.connect("chat_logs.db")
cursor = conn.cursor()
cursor.execute("SELECT timestamp, user, query FROM chat_logs ORDER BY timestamp DESC LIMIT 5")
recent_chats = cursor.fetchall()

for timestamp, user, query in recent_chats:
    st.write(f"📅 {timestamp} - **{user}** asked: *{query}*")

βœ… Now, we have a real-time analytics dashboard!

Run it:

streamlit run app.py --server.port 8502