Academic Agent Outreach

May 2025 - Jul 2025

React
FastAPI
Gemini
LangChain
SentenceTransformers
A RAG-powered email scheduling and professor matching platform combining web scraping, semantic search, and LLM personalization.
Published

May 1, 2025

GitHub | Live Demo

Project Overview

Academic Agent Outreach is an AI-driven outreach dashboard that matches students with academic faculty based on research interests. It crawls university profiles in real time, extracts publication records using LangChain and Gemini, drafts contextual cold emails that reference specific papers, and schedules delivery through the Gmail API.

Problem

  • Advisor Search Friction: Manually crawling dozens of faculty profiles to match research interests is time-consuming.
  • Generic Emails: Generic outreach emails are frequently ignored; successful emails must show a clear understanding of the professor’s recent publications.
  • Auth State in Background Tasks: Background schedulers running without active sessions fail when access tokens expire.

Features

  • Dynamic RAG Scraper: Fetches search results through SerpAPI and extracts faculty pages with BeautifulSoup, stripping script and style noise from the HTML.
  • Information Extraction (LLM RAG): Uses LangChain and gemini-2.0-flash to extract structured publication lists and ongoing projects from unstructured text.
  • Semantic Similarity Matching: An offline module that encodes student details and faculty profiles into 384-dimensional vectors using SentenceTransformers (all-MiniLM-L6-v2).
  • Asynchronous Email Scheduler: Integrates APScheduler to monitor scheduled emails and execute sends via the Gmail API.
  • Google OAuth Refresh Loop: Automatically exchanges offline refresh tokens for active access credentials before each scheduled background task runs.
  • Profile Feature Serializer: Compiles student academic data (GPA, courses, projects) into structured text context blocks to guide personalization.
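
The Profile Feature Serializer can be illustrated with a minimal sketch. The `StudentProfile` class, its field names, and the output layout are assumptions made for illustration; the project's actual schema and formatting are not described here.

```python
from dataclasses import dataclass, field


@dataclass
class StudentProfile:
    """Hypothetical student record; field names are illustrative only."""
    name: str
    gpa: float
    courses: list[str] = field(default_factory=list)
    projects: list[str] = field(default_factory=list)


def serialize_profile(profile: StudentProfile) -> str:
    """Render the profile as a labeled text block for use as prompt context."""
    lines = [
        f"Student: {profile.name}",
        f"GPA: {profile.gpa:.2f}",
    ]
    # Empty sections are omitted so the prompt carries no blank labels.
    if profile.courses:
        lines.append("Courses: " + ", ".join(profile.courses))
    if profile.projects:
        lines.append("Projects:")
        lines.extend(f"- {project}" for project in profile.projects)
    return "\n".join(lines)


if __name__ == "__main__":
    demo = StudentProfile(
        name="Ada",
        gpa=3.8,
        courses=["Machine Learning", "NLP"],
        projects=["RAG chatbot"],
    )
    print(serialize_profile(demo))
```

Keeping the block as labeled plain text, rather than raw JSON, is one reasonable choice here: it gives the model short, explicit cues to reference when personalizing a draft.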

Tech Stack

  • Frontend:
    • React
    • TypeScript
    • TailwindCSS
    • shadcn-ui
  • Backend API:
    • FastAPI (Python)
    • LangChain
    • Google Gemini API
    • SerpAPI
    • APScheduler
  • Vector Matching:
    • SentenceTransformers (all-MiniLM-L6-v2)
    • NumPy / Scikit-Learn
  • Database & Auth:
    • Firebase Firestore
    • Google OAuth 2.0 / Gmail API

Architecture

graph TD
    Query["User Query (React Dashboard)"] --> Controller["FastAPI API Controller"]
    Controller --> S1["Step 1: Query Gemini -> Predict matching professor names"]
    Controller --> S2["Step 2: SerpAPI -> Fetch profile links"]
    Controller --> S3["Step 3: BeautifulSoup -> Extract page text"]
    Controller --> S4["Step 4: LangChain/Gemini -> Parse structured publications"]
    S4 --> Draft["Personalized Email Draft"]
    Draft --> Queue["Schedule queue (APScheduler + Firestore)"]
    Queue -->|Refreshes Google OAuth token| Send["Gmail API Send"]
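
Step 3 of the graph, extracting page text while dropping script and style noise, can be sketched as follows. The project uses BeautifulSoup for this step; the sketch below swaps in the standard library's `html.parser` so that it runs without third-party dependencies. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping content inside script/style/noscript."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse all runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def clean_html(html: str) -> str:
    """Return the visible text of an HTML document as one normalized string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

With BeautifulSoup, the same effect is commonly achieved by calling `decompose()` on each script and style tag and then `get_text()` on the remaining tree.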

My Contributions

  • Built the dynamic scraping and text cleaning pipeline using BeautifulSoup and SerpAPI.
  • Developed the LangChain information extraction chain parsing unstructured profiles.
  • Engineered the SentenceTransformer matching module using weighted linear combinations of cosine similarities.
  • Created the background email queue inside FastAPI using APScheduler and Gmail API integrations.
  • Implemented the Google OAuth refresh loop, with token state kept in Firestore.
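
The weighted linear combination of cosine similarities mentioned above can be sketched as score = Σ w_k · cos(s, f_k), normalized by the sum of the weights, where s is the student embedding and f_k is the embedding of faculty field k. The field names and weight values below are assumptions for illustration, and plain Python lists stand in for the 384-dimensional SentenceTransformers vectors.

```python
import math
from typing import Mapping, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_match_score(
    student: Vector,
    faculty_fields: Mapping[str, Vector],
    weights: Mapping[str, float],
) -> float:
    """Weighted average of per-field cosine similarities."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    score = 0.0
    for field_name, weight in weights.items():
        score += weight * cosine_similarity(student, faculty_fields[field_name])
    return score / total_weight


# Hypothetical weights: publications and interests outweigh department names.
EXAMPLE_WEIGHTS = {"publications": 0.5, "interests": 0.4, "department": 0.1}
```

Dividing by the total weight keeps the score in the same range as a single cosine similarity, so results stay comparable if the weights are later retuned.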

What I Learned

  • Structuring Retrieval-Augmented Generation (RAG) pipelines over unstructured HTML.
  • Encoding and comparing semantic vectors using transformer models.
  • Configuring background cron-like tasks inside ASGI servers.
  • Operating OAuth 2.0 credential exchanges for offline API access.
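
The offline credential exchange can be sketched without any network calls. The form fields follow the standard OAuth 2.0 refresh-token grant and the URL is Google's token endpoint; the `StoredCredentials` class, the 60-second safety margin, and the helper names are assumptions for illustration, not the project's actual code.

```python
import time
from dataclasses import dataclass
from typing import Optional

GOOGLE_TOKEN_URL = "https://oauth2.googleapis.com/token"
EXPIRY_SKEW_SECONDS = 60  # assumed margin: refresh slightly before expiry


@dataclass
class StoredCredentials:
    """Hypothetical shape of the token record kept per user."""
    access_token: str
    refresh_token: str
    expires_at: float  # Unix timestamp


def needs_refresh(creds: StoredCredentials, now: Optional[float] = None) -> bool:
    """True when the access token is expired or about to expire."""
    now = time.time() if now is None else now
    return now >= creds.expires_at - EXPIRY_SKEW_SECONDS


def build_refresh_request(
    creds: StoredCredentials, client_id: str, client_secret: str
) -> dict[str, str]:
    """Form fields to POST to GOOGLE_TOKEN_URL for a refresh-token grant."""
    return {
        "grant_type": "refresh_token",
        "refresh_token": creds.refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    }


def apply_token_response(
    creds: StoredCredentials, response: dict, now: Optional[float] = None
) -> StoredCredentials:
    """Merge a token-endpoint JSON response into a new credential record."""
    now = time.time() if now is None else now
    return StoredCredentials(
        access_token=response["access_token"],
        # The response may omit refresh_token; keep the existing one then.
        refresh_token=response.get("refresh_token", creds.refresh_token),
        expires_at=now + float(response["expires_in"]),
    )
```

A scheduled job would call `needs_refresh` before each send, POST the fields from `build_refresh_request` with an HTTP client of choice, and persist the result of `apply_token_response`.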

Results

  • Average response time of 8-12 seconds for scraping Google, extracting details, and generating personalized drafts.
  • Maintained 100% token refresh reliability across scheduled cron sends.
  • Vector matching accurately aligned CS researchers with interdisciplinary projects by weighting publications and interests more heavily than department names.

Future Work

  • Pre-embed and cache thousands of faculty profiles in ChromaDB/FAISS to support sub-second query matching.
  • Add multi-page PDF parsing to automatically extract student profiles from CV uploads.
  • Build LangGraph agents that run a second validation pass on email drafts before delivery.