How to Chat with Your PDF Using Retrieval Augmented Generation
Large language models are good at answering questions, but they have one big limitation: they don’t know what is inside your private documents.
If you upload a PDF like a company policy, research paper, or contract, the model cannot magically read it unless you give it that content.
This is where Retrieval Augmented Generation, or RAG, becomes useful.
RAG lets you combine a language model with your own data. Instead of asking the model to guess, you first retrieve the right parts of the document and then ask the model to answer using that information.
In this article, you will learn how to chat with your own PDF using RAG. You will build the backend using LangChain and create a simple React user interface to ask questions and see answers.
You should be comfortable with basic Python and JavaScript, and have a working knowledge of React and REST APIs. Familiarity with language models and a basic understanding of embeddings or vector search will be helpful but not mandatory.
What We’ll Cover

- What Problem Are We Solving?
- What Is Retrieval Augmented Generation?
- Setting Up the Backend with LangChain
- Building a Simple React Chat UI
- How the Full Flow Works
- Why This Approach Works Well
- Common Improvements You Can Add
- Final Thoughts
What Problem Are We Solving?
Imagine you have a long PDF with hundreds of pages. Searching manually is slow. Copying text into ChatGPT is not practical.
You want to ask simple questions like “What is the leave policy?” or “What does this contract say about termination?”
A normal language model cannot answer these questions correctly because it has never seen your PDF. RAG solves this by adding a retrieval step before generation.
The system first finds relevant parts of the PDF and then uses those parts as context for the answer.
What Is Retrieval Augmented Generation?
Retrieval Augmented Generation is a pattern with three main steps.
First, your document is split into small chunks. Each chunk is converted into a vector embedding. These embeddings are stored in a vector database.
Second, when a user asks a question, that question is also converted into an embedding. The system searches the vector database to find the most similar chunks.
Third, those chunks are sent to the language model along with the question. The model uses only that context to generate an answer.
This approach keeps answers grounded in your document and reduces hallucinations.
The system has four main parts:
- A PDF loader reads the document.
- A text splitter breaks it into chunks.
- An embedding model converts text into vectors and stores them in a vector store.
- A language model answers questions using retrieved chunks.
The frontend is a simple chat interface built in React. It sends the user’s question to a backend API and displays the response.
This kind of custom RAG development lets companies build internal tools that answer questions from their own private data, rather than relying only on what a general-purpose model already knows.
Setting Up the Backend with LangChain
We’ll use Python and LangChain for the backend. The backend will load the PDF, build the vector store, and expose an API to answer questions.
Installing Dependencies
Start by installing the required libraries.
pip install langchain langchain-community langchain-openai faiss-cpu pypdf fastapi uvicorn
This setup uses FAISS as a local vector store and OpenAI for embeddings and chat. You can swap these later for other models.
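Both OpenAIEmbeddings and ChatOpenAI read your API key from the OPENAI_API_KEY environment variable, so export it before running any of the backend code:

export OPENAI_API_KEY="your-api-key"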
Loading and Splitting the PDF
The first step is to load the PDF and split it into chunks that are small enough for embeddings.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = PyPDFLoader("document.pdf")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)
Chunking is important. If chunks are too large, retrieval becomes less precise and you waste context window space. If they are too small, each chunk loses the surrounding context needed to answer a question. A chunk size of around 1,000 characters with some overlap, as used here, is a reasonable starting point for most documents.
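Before indexing anything, it is worth checking what the splitter actually produced. A quick inspection like this, using the same variables as the code above:

# Inspect how many chunks were produced and what one looks like
print(f"Total chunks: {len(chunks)}")
print(f"First chunk length: {len(chunks[0].page_content)} characters")
print(chunks[0].page_content[:300])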
Creating Embeddings and Vector Store
Next, convert the chunks into embeddings and store them in FAISS.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)
This step is usually done once. In a real app, you would persist the vector store to disk.
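As a rough sketch of what persisting could look like with FAISS, you can save the index to a local folder once and reload it on later startups (the folder name here is just a placeholder):

# Save the index after building it
vectorstore.save_local("faiss_index")

# On the next startup, load it instead of re-embedding the PDF
vectorstore = FAISS.load_local(
    "faiss_index",
    embeddings,
    allow_dangerous_deserialization=True  # required in recent versions because loading uses pickle
)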
Creating the Retrieval Chain
Now create a retrieval-based question answering chain.
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
llm = ChatOpenAI(
    temperature=0,
    model="gpt-4o-mini"
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=False
)
The retriever finds the top matching chunks. The language model answers using only those chunks.
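Before wrapping this in an API, you can sanity-check the chain directly. The sample question here is just the one from earlier in the article:

sample_question = "What is the leave policy?"

# Ask the chain directly
print(qa_chain.run(sample_question))

# See which chunks the retriever pulled back for that question
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
for doc in retriever.invoke(sample_question):
    print(doc.metadata.get("page"), doc.page_content[:80])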
Exposing an API with FastAPI
Now wrap this logic in an API so the React app can use it. Since the React dev server runs on a different origin than the API, the backend also needs to allow cross-origin requests.
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

app = FastAPI()

# Allow the React dev server to call this API from another origin
app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])

class QuestionRequest(BaseModel):
    question: str

@app.post("/ask")
def ask_question(req: QuestionRequest):
    result = qa_chain.run(req.question)
    return {"answer": result}
Run the server using this command:
uvicorn main:app --reload
Your backend is now ready.
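You can test the endpoint directly from the command line before touching the frontend. The question here is again just the sample from earlier:

curl -X POST http://localhost:8000/ask -H "Content-Type: application/json" -d '{"question": "What is the leave policy?"}'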
Building a Simple React Chat UI
Next, build a simple React interface that sends questions to the backend and shows answers.
You can use any React setup. A simple Vite or Create React App project works fine.
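For example, a fresh Vite project can be created with one command (the project name is just a placeholder):

npm create vite@latest pdf-chat-ui -- --template react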
Inside your main component, manage the question input and answer state.
import { useState } from "react";

function App() {
  const [question, setQuestion] = useState("");
  const [answer, setAnswer] = useState("");
  const [loading, setLoading] = useState(false);

  const askQuestion = async () => {
    setLoading(true);
    try {
      const res = await fetch("http://localhost:8000/ask", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ question })
      });
      const data = await res.json();
      setAnswer(data.answer);
    } finally {
      // Reset the loading state even if the request fails
      setLoading(false);
    }
  };

  return (
    <div style={{ padding: "2rem", maxWidth: "600px", margin: "auto" }}>
      <h2>Chat with your PDF</h2>
      <textarea
        value={question}
        onChange={(e) => setQuestion(e.target.value)}
        rows={4}
        style={{ width: "100%" }}
        placeholder="Ask a question about the PDF"
      />
      <button onClick={askQuestion} disabled={loading}>
        {loading ? "Thinking..." : "Ask"}
      </button>
      <div style={{ marginTop: "1rem" }}>
        <strong>Answer</strong>
        <p>{answer}</p>
      </div>
    </div>
  );
}

export default App;
This UI is simple but effective. It lets users type a question, sends it to the backend, and shows the answer. Keep React and the rest of your frontend dependencies up to date so you are not exposed to known vulnerabilities.
How the Full Flow Works
When the app starts, the backend has already processed the PDF and built the vector store. When a user types a question, the React app sends it to the API.
The backend converts the question into an embedding. It searches the vector store for similar chunks. Those chunks are passed to the language model as context. The model generates an answer based only on that context.
The answer is sent back to the frontend and displayed to the user.
Why This Approach Works Well
RAG works well because it keeps answers grounded in real data. The model is not guessing – it’s reading from your document.
This approach also scales well. You can add more PDFs, reindex them, and reuse the same chat interface. You can also swap FAISS for a hosted vector database if needed.
Another benefit is control. You decide what data the model can see. This is important for private or sensitive documents.
Common Improvements You Can Add
You can improve this setup in many ways. You can persist the vector store so it doesn’t rebuild on every restart. You can also add document citations to the answer. And you can stream responses for a better chat experience.
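For example, citations can be added by having the chain return its source documents. A minimal sketch, building on the chain defined earlier:

# Rebuild the chain so it also returns the chunks it used
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the leave policy?"})
answer = result["result"]
# PyPDFLoader stores the page number in each chunk's metadata
pages = sorted({doc.metadata.get("page") for doc in result["source_documents"]})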
You can also add authentication, upload new PDFs from the UI, or support multiple documents per user.
Final Thoughts
Chatting with PDFs using Retrieval Augmented Generation is one of the most practical uses of language models today. It turns static documents into interactive knowledge sources.
With LangChain handling retrieval and a simple React UI for interaction, you can build a useful system with very little code. The same pattern can be used for HR policies, legal documents, technical manuals, or research papers.
Once you understand this flow, you can adapt it to many real world problems where answers must come from trusted documents rather than from the model’s memory alone.
