AI – Using Python for RAG (Part III) over Dataset

By | 23/07/2025

In this post, we will see how to create a specialized RAG system over a large CSV file containing details of more than 1,000 movies, including titles, genres, descriptions, and ratings.
The system will process the structured CSV data, create semantic embeddings for each record, build a searchable vector index, and provide an interactive query interface backed by a local language model, namely gemma3:4b.

Dataset Overview:
The CSV file contains movie information with the following structure:

  • Movie metadata: Title, Year, Genre, Director, Actors
  • Performance metrics: IMDB Rating, Votes, Revenue, Metascore
  • Content data: Plot descriptions, Runtime

This structured format is typical of many enterprise datasets where we need to enable natural language queries across multiple fields and data types.
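To get a feel for the structure before writing any pipeline code, here is a minimal sketch that parses a two-row sample in the same shape. The column names are assumptions based on the fields listed above, and the sample rows are made up for illustration; they are not taken from the real file:

```python
import pandas as pd
from io import StringIO

# A two-row sample mimicking the CSV structure described above.
# Column names and values are assumptions, not the real dataset.
sample_csv = StringIO(
    "Title,Year,Genre,Director,Actors,Rating,Votes,"
    "Revenue (Millions),Metascore,Description,Runtime (Minutes)\n"
    "Inception,2010,Sci-Fi,Christopher Nolan,Leonardo DiCaprio,8.8,"
    "2000000,292.58,74,A thief enters dreams.,148\n"
    "Arrival,2016,Sci-Fi,Denis Villeneuve,Amy Adams,7.9,"
    "600000,100.55,81,A linguist decodes an alien language.,116\n"
)

df = pd.read_csv(sample_csv)
print(len(df), "movies,", len(df.columns), "columns")
print(df["Title"].tolist())
```

Each row becomes one record we can later turn into a document; the real file simply has many more rows.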

Before we dive into the code, let’s install all the necessary dependencies:

# Core RAG framework
pip install langchain langchain-community langchain-ollama

# Vector storage and numerical processing  
pip install faiss-cpu numpy

# Data processing
pip install pandas

Why FAISS for Vector Storage?
I chose FAISS for its superior performance and Windows compatibility. Unlike alternatives like Chroma that can fail silently on Windows systems, FAISS provides reliable vector storage with sub-linear search times that scale efficiently to millions of vectors. As a self-contained library requiring no external dependencies, FAISS eliminates infrastructure complexity while delivering the battle-tested reliability of Meta’s production systems, making it ideal for enterprise environments where performance and stability are critical.

Now, let’s see how to define our RAG pipeline:

[STEP 1] – Data loading and preprocessing
The first step transforms raw CSV data into structured documents optimized for embedding and retrieval:

def load_movie_data(csv_path):
    """
    Load and process movie data from CSV file.
    
    This function:
    - Reads the CSV with multiple encoding attempts for compatibility
    - Converts each movie row into a comprehensive text document
    - Creates metadata for each movie (title, year, rating, etc.)
    - Returns a list of LangChain Document objects
    
    Args:
        csv_path (str): Path to the movie CSV file
        
    Returns:
        list: List of Document objects containing movie information
    """
    print(f"Loading movie data from {csv_path}...")
    
    # Try multiple encodings to handle different CSV formats
    # This ensures compatibility with various file encodings
    df = None
    for encoding in ['utf-8', 'latin-1', 'cp1252']:
        try:
            df = pd.read_csv(csv_path, encoding=encoding)
            print(f"Loaded with {encoding} encoding")
            break
        except UnicodeDecodeError:
            continue
    
    # Verify that we successfully loaded the data
    if df is None:
        raise ValueError("Could not read CSV file with any of the attempted encodings")
    
    print(f"Dataset: {len(df)} movies, {len(df.columns)} columns")
    
    # Convert each movie row into a structured document
    documents = []
    for idx, row in df.iterrows():
        # Extract movie information from CSV columns
        # Using .get() method with defaults to handle missing data gracefully
        title = str(row.get('Title', f'Movie {idx+1}')).strip()
        year = row.get('Year', 'Unknown')
        genre = row.get('Genre', 'Unknown')
        rating = row.get('Rating', 'N/A')
        director = row.get('Director', 'Unknown')
        actors = row.get('Actors', 'Unknown')
        description = row.get('Description', 'No description')
        runtime = row.get('Runtime (Minutes)', 'Unknown')
        votes = row.get('Votes', 'Unknown')
        revenue = row.get('Revenue (Millions)', 'Unknown')
        metascore = row.get('Metascore', 'Unknown')
        
        # Create a comprehensive, structured text representation of the movie
        # This format optimizes the content for embedding and retrieval
        movie_text = f"""MOVIE: {title}

BASIC INFO:
- Release Year: {year}
- Genre: {genre}
- Director: {director}
- Main Actors: {actors}
- Runtime: {runtime} minutes

RATINGS & PERFORMANCE:
- IMDB Rating: {rating}/10
- Votes: {votes}
- Metascore: {metascore}
- Box Office Revenue: ${revenue} million

PLOT SUMMARY:
{description}

KEYWORDS: {str(genre).lower()} {title.lower()} {str(director).lower()} movie film cinema"""
        
        # Create metadata for filtering and enhanced retrieval
        # Metadata allows for structured querying and result filtering
        metadata = {
            "title": title,
            "year": int(year) if str(year).isdigit() else 0,
            "rating": float(rating) if str(rating).replace('.', '').isdigit() else 0.0,
            "genre": str(genre),
            "director": str(director),
            "movie_id": idx
        }
        
        # Create LangChain Document object with content and metadata
        documents.append(Document(page_content=movie_text, metadata=metadata))
    
    print(f"Created {len(documents)} movie documents")
    return documents
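The `isdigit()`-based coercion used for the metadata above works for typical values, but a try/except version handles a few more cases (negative numbers, scientific notation) without extra string gymnastics. A sketch with hypothetical helper names, not part of the script above:

```python
def to_int(value, default=0):
    """Coerce a CSV cell to int, falling back on a default for bad values."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

def to_float(value, default=0.0):
    """Coerce a CSV cell to float, falling back on a default for bad values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

print(to_int("2014"))     # 2014
print(to_int("Unknown"))  # 0
print(to_float("8.1"))    # 8.1
print(to_float("N/A"))    # 0.0
```

With these, the metadata lines would read `"year": to_int(year)` and `"rating": to_float(rating)`.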

[STEP 2] – Embedding creation
We process documents in batches to manage memory and prevent system overload:

def create_embeddings_batch(documents, embeddings_model, batch_size=10):
    """
    Create vector embeddings for movie documents in batches.
    
    This function processes documents in small batches to prevent memory issues
    and system crashes. Each document is converted to a vector representation
    using the specified embedding model.
    
    Args:
        documents (list): List of Document objects to embed
        embeddings_model: Ollama embedding model instance
        batch_size (int): Number of documents to process in each batch
        
    Returns:
        list: List of embedding vectors for all documents
    """
    print(f"Creating embeddings for {len(documents)} documents...")
    
    # Initialize storage for all embeddings
    all_embeddings = []
    total_batches = (len(documents) + batch_size - 1) // batch_size
    
    # Process documents in batches to manage memory and prevent crashes
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        current_batch = i // batch_size + 1
        
        print(f"Batch {current_batch}/{total_batches}: Processing {len(batch)} movies...")
        
        try:
            # Extract text content from Document objects
            texts = [doc.page_content for doc in batch]
            
            # Create embeddings for the entire batch at once (more efficient)
            batch_embeddings = embeddings_model.embed_documents(texts)
            all_embeddings.extend(batch_embeddings)
            
            # Show progress to user
            progress = (current_batch / total_batches) * 100
            print(f"Batch {current_batch} complete ({progress:.1f}% total)")
            
            # Small delay to prevent overwhelming the embedding model
            time.sleep(0.5)
            
        except Exception as e:
            print(f"Batch {current_batch} failed: {e}")
            # Fallback: try processing documents individually
            print("Trying individual documents...")
            
            for j, doc in enumerate(batch):
                try:
                    # Process single document if batch processing fails
                    doc_embedding = embeddings_model.embed_documents([doc.page_content])
                    all_embeddings.extend(doc_embedding)
                    print(f"Document {j+1}/{len(batch)} embedded")
                except Exception as doc_error:
                    print(f"Document {j+1} failed: {doc_error}")
                    # Add zero embedding as placeholder to maintain document alignment
                    all_embeddings.append([0.0] * 768)  # nomic-embed-text dimension
    
    print(f"Created {len(all_embeddings)} embeddings")
    return all_embeddings
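The `total_batches` computation above is a ceiling division: the last batch may hold fewer than `batch_size` documents, but it still counts as a batch. Isolated for clarity:

```python
def n_batches(n_docs, batch_size):
    # Ceiling division: (a + b - 1) // b rounds up, so a partial
    # final batch is counted as a full batch.
    return (n_docs + batch_size - 1) // batch_size

print(n_batches(1000, 10))  # 100
print(n_batches(1001, 10))  # 101 (one extra batch holding a single document)
print(n_batches(5, 10))     # 1
```

This is why the loop steps through `range(0, len(documents), batch_size)` and slices with `documents[i:i + batch_size]`: the final slice is simply shorter.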

[STEP 3] – FAISS Vector store creation
We create a persistent FAISS index with fallback mechanisms:

def save_vector_store(documents, embeddings, save_path):
    """
    Create and save a FAISS vector store for similarity search.
    
    This function creates a FAISS (Facebook AI Similarity Search) vector database
    that enables fast similarity search across movie embeddings. It also includes
    fallback mechanisms for data persistence.
    
    Args:
        documents (list): List of Document objects
        embeddings (list): List of embedding vectors
        save_path (str): Path where to save the vector store
        
    Returns:
        FAISS: Vector store object for similarity search
    """
    print(f"Saving vector store to {save_path}...")
    
    try:
        # Create FAISS vector store from documents and pre-computed embeddings.
        # This builds an index that allows for fast similarity search.
        # FAISS still needs an embedding function for future queries, so we
        # pass the same model that produced the vectors rather than None.
        vector_store = FAISS.from_embeddings(
            text_embeddings=list(zip([doc.page_content for doc in documents], embeddings)),
            embedding=OllamaEmbeddings(model="nomic-embed-text"),
            metadatas=[doc.metadata for doc in documents]
        )
        
        # Save the vector store to disk for future use
        vector_store.save_local(save_path)
        
        print(f"Vector store saved successfully")
        return vector_store
        
    except Exception as e:
        print(f"Error saving vector store: {e}")
        
        # Fallback mechanism: save raw data if FAISS save fails
        print("Saving raw data as fallback...")
        fallback_data = {
            "documents": documents,
            "embeddings": embeddings
        }
        
        # Use pickle to serialize the data
        with open(f"{save_path}_fallback.pkl", "wb") as f:
            pickle.dump(fallback_data, f)
        
        print("Fallback data saved")
        return None
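If the fallback branch fires, we later need to read that pickle back. A hypothetical `load_fallback` helper (not part of the script above) that mirrors the save path pattern, shown here with a dummy round-trip so it is self-contained:

```python
import os
import pickle
import tempfile

def load_fallback(save_path):
    """Reload documents and embeddings written by the fallback branch."""
    with open(f"{save_path}_fallback.pkl", "rb") as f:
        data = pickle.load(f)
    return data["documents"], data["embeddings"]

# Round-trip check with dummy data in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movie_vector_store")
    with open(f"{path}_fallback.pkl", "wb") as f:
        pickle.dump({"documents": ["doc"], "embeddings": [[0.0] * 768]}, f)
    docs, embs = load_fallback(path)
    print(len(docs), len(embs[0]))
```

From the reloaded pair you could rebuild the vector store with the same `FAISS.from_embeddings` call used above.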

[STEP 4] – Main execution pipeline
The orchestration function ties all components together:

def main():
    """
    Main function that orchestrates the complete RAG system setup.
    
    This function executes all steps in sequence:
    1. Data loading and preprocessing
    2. Embedding model initialization  
    3. Vector embedding creation
    4. Vector store creation and saving
    """
    print("=" * 60)
    print(" WINDOWS COMPATIBLE MOVIE RAG SETUP")
    print("=" * 60)
    
    # Configuration parameters
    csv_path = "DATASET/IMDBMovieData.csv"  # Path to movie data
    vector_store_path = "./movie_vector_store"  # Where to save vector store
    
    try:
        # =====================================================================
        # PREREQUISITE CHECKS
        # =====================================================================
        print("\n Checking prerequisites...")
        if not os.path.exists(csv_path):
            raise Exception(f"CSV file not found: {csv_path}")
        print("CSV file found")
        
        # =====================================================================
        # STEP 1: LOAD AND PROCESS MOVIE DATA
        # =====================================================================
        print("\n STEP 1: Loading movie data...")
        documents = load_movie_data(csv_path)
        
        # =====================================================================
        # STEP 2: INITIALIZE EMBEDDING MODEL
        # =====================================================================
        print("\n STEP 2: Initializing embedding model...")
        embeddings_model = OllamaEmbeddings(model="nomic-embed-text")
        
        # Test the embedding model to ensure it's working
        test_embedding = embeddings_model.embed_query("test movie")
        print(f" Embedding model ready (dimension: {len(test_embedding)})")
        
        # =====================================================================
        # STEP 3: CREATE EMBEDDINGS FOR ALL DOCUMENTS
        # =====================================================================
        print("\n STEP 3: Creating embeddings...")
        embeddings = create_embeddings_batch(documents, embeddings_model, batch_size=5)
        
        # =====================================================================
        # STEP 4: BUILD AND SAVE VECTOR STORE
        # =====================================================================
        print("\n STEP 4: Creating and saving vector store...")
        vector_store = save_vector_store(documents, embeddings, vector_store_path)
        
        # Fallback: create in-memory vector store if saving fails
        if vector_store is None:
            print(" Creating in-memory vector store...")
            vector_store = FAISS.from_embeddings(
                text_embeddings=list(zip([doc.page_content for doc in documents], embeddings)),
                embedding=embeddings_model,
                metadatas=[doc.metadata for doc in documents]
            )
        
        # =====================================================================
        # SUCCESS CONFIRMATION
        # =====================================================================
        print("MOVIE RAG SYSTEM READY!")
        print(f"Processed {len(documents)} movies")
        print(f"Vector store: {vector_store_path}")
        print("\n System is ready for queries!")
        print("You can now ask questions about your movie database!")
        
        # Create a success indicator file for reference
        with open("setup_complete.txt", "w") as f:
            f.write(f"Movie RAG setup completed successfully\n")
            f.write(f"Movies processed: {len(documents)}\n")
            f.write(f"Vector store: {vector_store_path}\n")
        
        print("\n SETUP COMPLETE - EXITING SCRIPT")
        print("Script finished successfully!")
        
    except Exception as e:
        print(f"\n SETUP FAILED: {e}")
        print("\n Make sure:")
        print("1. Ollama is running")
        print("2. Models are installed: ollama pull nomic-embed-text && ollama pull gemma3:4b")
        print("3. CSV file exists in DATASET folder")
        print("\n Script exiting due to error.")
    
# =============================================================================
# PROGRAM ENTRY POINT
# =============================================================================

if __name__ == "__main__":
    """
    Entry point of the program.
    
    This block ensures that the main() function only runs when the script
    is executed directly (not when imported as a module).
    
    To run this script:
    1. Ensure all required dependencies are installed:
       pip install langchain langchain-community langchain-ollama pandas faiss-cpu numpy
    
    2. Make sure Ollama is running with the required models:
       ollama pull nomic-embed-text
       ollama pull gemma3:4b
    
    3. Place your movie CSV file in the DATASET folder
    
    4. Run the script
    """
    main()

Now, we run the script (its duration depends on the power of our computer!) and, once it finishes, our vector database is ready.


After building the vector database, we need a separate query script to interact with our RAG system. The setup script creates the searchable index, while the query script provides the user interface for asking natural language questions about the movie dataset.
This query script loads the pre-built FAISS vector store, initializes the embedding model, and creates an interactive interface where users can ask questions like “What are the highest-rated sci-fi movies?”.
The system returns contextual responses with specific movie details and shows which source movies were used to generate each answer.

import os
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def load_rag_system(vector_store_path="./movie_vector_store"):
    """Load the RAG system and return QA chain."""
    
    # Check if vector store exists
    if not os.path.exists(vector_store_path):
        print("Vector store not found. Run setup script first.")
        return None
    
    print("Loading Movie RAG System...")
    
    # Load embedding model and vector store
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    vector_store = FAISS.load_local(
        vector_store_path, 
        embeddings,
        allow_dangerous_deserialization=True
    )
    
    # Initialize language model
    llm = ChatOllama(model="gemma3:4b")
    
    # Create prompt template
    prompt_template = """You are a movie expert. Use the provided movie information to answer questions about films.

Movie Information: {context}
Question: {question}

Answer with specific movie titles, years, and ratings when relevant:"""
    
    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    
    # Create QA chain
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_store.as_retriever(search_kwargs={"k": 5}),
        chain_type_kwargs={"prompt": prompt}
    )
    
    print("System ready!")
    return qa_chain

def query_movies(qa_chain, question):
    """Query the movie database."""
    try:
        result = qa_chain.invoke({"query": question})
        return result["result"]
    except Exception as e:
        return f"Error: {e}"

def main():
    """Main interactive function."""
    print("Movie Database Query System")
    print("=" * 35)
    
    # Load RAG system
    qa_chain = load_rag_system()
    if not qa_chain:
        return
    
    print("\n Example questions:")
    print("• What are the highest-rated movies?")
    print("• Tell me about sci-fi movies from the 1990s")
    print("• Which movies did Christopher Nolan direct?")
    print("\nType 'quit' to exit\n")
    
    # Interactive loop
    while True:
        question = input("Your question: ").strip()
        
        if question.lower() in ['quit', 'exit', 'q']:
            print("Goodbye!")
            break
        
        if question:
            print(f"\n Answer: {query_movies(qa_chain, question)}\n")

if __name__ == "__main__":
    main()


We are done; now let's test our model by asking it for the three best sci-fi movies in the database:


In this post, we built a robust RAG system over a structured dataset, demonstrating how to transform CSV data into an intelligent query interface. By leveraging LangChain, Ollama embeddings, and FAISS vector storage, we:

  • Processed structured CSV data into comprehensive, searchable documents
  • Generated semantic embeddings with batch processing for scalability
  • Built a persistent FAISS vector store optimized for Windows environments
  • Created an interactive query system that understands natural language questions about movies



