Academic Agent Outreach

May 2025 - Jul 2025

React

FastAPI

Gemini

LangChain

SentenceTransformers

A RAG-powered email scheduling and professor matching platform combining web scraping, semantic search, and LLM personalization.

Published

May 1, 2025

GitHub | Live Demo

Project Overview

Academic Agent Outreach is an AI-driven outreach dashboard that matches students with faculty based on research interests. It crawls university profiles in real time, extracts publication records using LangChain and Gemini, drafts contextual cold emails that reference specific papers, and schedules delivery through the Gmail API.

Problem

Advisor Search Friction: Manually crawling dozens of faculty profiles to match research interests is time-consuming.
Generic Emails: Generic outreach emails are frequently ignored; successful emails must show a clear understanding of the professor’s recent publications.
Auth State in Background Tasks: Background schedulers running without active sessions fail when access tokens expire.

Features

Dynamic RAG Scraper: Scrapes search indexes using SerpAPI and extracts faculty pages via BeautifulSoup, cleaning HTML script/style noise.
Information Extraction (LLM RAG): Leverages LangChain and gemini-2.0-flash to extract structured publication lists and ongoing projects from unstructured text.
Semantic Similarity Matching: An offline module that encodes student details and faculty profiles into 384-dimensional vectors using SentenceTransformers (all-MiniLM-L6-v2) and compares them by cosine similarity.
Asynchronous Email Scheduler: Integrates APScheduler to monitor scheduled emails and execute sends via the Gmail API.
Google OAuth Refresh Loop: Automatically exchanges offline refresh tokens for active access credentials prior to background scheduling tasks.
Profile Feature Serializer: Compiles student academic data (GPA, courses, projects) into structured text context blocks to guide personalization.
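
To make the Profile Feature Serializer concrete, here is a minimal sketch of how student data might be flattened into a text context block. The `StudentProfile` schema, its field names, and the output labels are assumptions for illustration; the project's actual schema is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class StudentProfile:
    # Hypothetical schema: field names are illustrative, not the project's own.
    name: str
    gpa: float
    courses: list[str] = field(default_factory=list)
    projects: list[str] = field(default_factory=list)


def serialize_profile(profile: StudentProfile) -> str:
    """Flatten a student profile into a labeled text block for an LLM prompt."""
    lines = [
        f"Student: {profile.name}",
        f"GPA: {profile.gpa:.2f}",
    ]
    # Skip empty sections so the prompt never carries a label with no content.
    if profile.courses:
        lines.append("Courses: " + ", ".join(profile.courses))
    if profile.projects:
        lines.append("Projects:")
        lines.extend(f"- {project}" for project in profile.projects)
    return "\n".join(lines)
```

The resulting string can be placed in the email-drafting prompt next to the extracted faculty publications, so the model sees both sides of the match in one context.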

Tech Stack

Frontend:
- React
- TypeScript
- TailwindCSS
- shadcn-ui
Backend API:
- FastAPI (Python)
- LangChain
- Google Gemini API
- SerpAPI
- APScheduler
Vector Matching:
- SentenceTransformers (all-MiniLM-L6-v2)
- NumPy / Scikit-Learn
Database & Auth:
- Firebase Firestore
- Google OAuth 2.0 / Gmail API

Architecture

graph TD
    Query["User Query (React Dashboard)"] --> Controller["FastAPI API Controller"]
    Controller --> S1["Step 1: Query Gemini -> Predict matching professor names"]
    Controller --> S2["Step 2: SerpAPI -> Fetch profile links"]
    Controller --> S3["Step 3: BeautifulSoup -> Extract page text"]
    Controller --> S4["Step 4: LangChain/Gemini -> Parse structured publications"]
    S4 --> Draft["Personalized Email Draft"]
    Draft --> Queue["Schedule queue (APScheduler + Firestore)"]
    Queue -->|Refreshes Google OAuth token| Send["Gmail API Send"]
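
Step 3 of the diagram (page text extraction) can be sketched as follows. The project uses BeautifulSoup for this; the version below swaps in Python's standard-library `html.parser` so it runs without third-party packages, and the names `_TextExtractor` and `clean_html` are hypothetical. With BeautifulSoup, the equivalent would be calling `decompose()` on each `script` and `style` tag and then `get_text()`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text while skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that sits outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def clean_html(html: str) -> str:
    """Return the visible text of a page with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.text().split())
```

Stripping script and style content before the LLM step keeps tokens for the text that actually describes the professor's work.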

My Contributions

Built the dynamic scraping and text cleaning pipeline using BeautifulSoup and SerpAPI.
Developed the LangChain information extraction chain parsing unstructured profiles.
Engineered the SentenceTransformer matching module using weighted linear combinations of cosine similarities.
Created the background email queue inside FastAPI using APScheduler and Gmail API integrations.
Implemented the Google OAuth refresh loop, backed by Firestore.
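
The weighted linear combination of cosine similarities can be sketched as below. This is an illustration under assumptions: the field names (`publications`, `interests`, `department`), the weight values, and the idea of comparing one student vector against several per-field faculty vectors are all hypothetical, and plain Python lists stand in for the 384-dimensional SentenceTransformers embeddings.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_score(
    student: list[float],
    faculty: dict[str, list[float]],
    weights: dict[str, float],
) -> float:
    """Weighted average of per-field cosine similarities.

    `faculty` maps a field name to that field's embedding; `weights` maps the
    same field names to non-negative weights (hypothetical values).
    """
    total = sum(weights.values())
    if total == 0.0:
        return 0.0
    weighted = sum(w * cosine(student, faculty[name]) for name, w in weights.items())
    return weighted / total
```

Giving `publications` and `interests` larger weights than `department` lets a strong topical overlap outrank a match that shares only a department name.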

What I Learned

Structuring Retrieval-Augmented Generation (RAG) pipelines over unstructured HTML.
Encoding and comparing semantic vectors using transformer models.
Configuring background cron-like tasks inside ASGI servers.
Operating OAuth 2.0 credential exchanges for offline API access.
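
As an illustration of the offline credential exchange, here is a minimal sketch of a guard that a background job could run before each send. The `Credentials` class, `ensure_fresh`, and the `refresh_fn` callable are hypothetical names; `refresh_fn` stands in for the real OAuth 2.0 refresh-token grant request and is assumed to return the new access token and its lifetime in seconds.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Credentials:
    access_token: str
    refresh_token: str
    expires_at: float  # Unix timestamp at which the access token expires


def ensure_fresh(
    creds: Credentials,
    refresh_fn: Callable[[str], Tuple[str, float]],
    now: Optional[float] = None,
    skew_seconds: float = 60.0,
) -> Credentials:
    """Return usable credentials, refreshing if the token is expired or close to it."""
    current = time.time() if now is None else now
    # The skew avoids sending with a token that expires mid-request.
    if creds.expires_at - skew_seconds > current:
        return creds
    access_token, expires_in = refresh_fn(creds.refresh_token)
    return Credentials(access_token, creds.refresh_token, current + expires_in)
```

Because the check depends only on the stored expiry time and refresh token, it works in a scheduler job where no user session is active.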

Results

Average end-to-end response time of 8-12 seconds for searching Google, extracting profile details, and generating a personalized draft.
Maintained 100% token refresh reliability across scheduled cron sends.
Vector matching accurately aligned CS researchers with interdisciplinary projects by weighting publications and interests above department names.

Future Work

Pre-embed and cache thousands of faculty profiles in ChromaDB/FAISS to support sub-second query matching.
Add multi-page PDF parsing to automatically extract student profiles from CV uploads.
Build LangGraph agents to run a second validation pass on email drafts before delivery.