NLPパイプライン
02

Project 02 · 2025 · Harsh Yadav

Reddit
Persona
Analytics.

ソーシャルNLP分析

End-to-end ETL pipeline scraping Reddit activity via PRAW API, processing 100+ posts per user and generating structured psychological persona reports using Groq LLM (Llama3-70B). Has garnered community interest with 1 external fork.

Year2025
RoleML Engineer
TypeNLP / ETL Pipeline
StatusPublic · 1 Fork
System Overview · システム概要 Groq · NLTK · SpaCy · PRAW · Streamlit
100+Posts processed per user
Big 5Personality traits estimated
1 ForkExternal community interest
[01] Overview

Turning Reddit signals into structured human psychology.

Applied multi-library NLP preprocessing (NLTK, TextBlob, VaderSentiment, SpaCy) for sentiment scoring, keyword extraction, and Big Five personality trait estimation with citation tracking. The pipeline is fully modular with separate scraper, processor, analyzer, and output modules.

An optional Streamlit UI enables non-technical stakeholders to explore results without code.

Python Groq API PRAW NLTK SpaCy TextBlob PostgreSQL Streamlit
Pipeline Architecture · パイプライン設計 NLPアーキテクチャ
[ DATA EXTRACTION ] └── PRAW API ──► Scrape Posts + Comments │ ▼ [ ETL LAYER ] ├── Clean & Tokenize (NLTK / SpaCy) ├── Sentiment Score (VADER + TextBlob) ├── Keyword Extraction └── Store JSON → PostgreSQL │ ▼ [ NLP ANALYSIS ] ├── Big Five Personality Estimation ├── Citation Tracking per Trait └── Behavioral Pattern Recognition │ ▼ [ LLM REPORT GENERATION ] └── Groq API (Llama3-70B) ── Prompt Engineering ── Structured JSON Persona Output │ ▼ [ PRESENTATION ] └── Streamlit UI Dashboard (Interactive) ── Persona visualization ── No-code exploration