You can use Python with libraries like:
```python
import PyPDF2
import textstat
from sklearn.feature_extraction.text import TfidfVectorizer

# Extract text from the PDF
with open("leo_babauta.pdf", "rb") as f:
    reader = PyPDF2.PdfReader(f)
    text = " ".join(page.extract_text() for page in reader.pages)

# Deep features: readability metrics
readability = textstat.flesch_kincaid_grade(text)
sentence_len = textstat.avg_sentence_length(text)

# Concept matching (simplified keyword mapping)
concepts = {
    "habit": ["habit", "routine", "daily", "practice"],
    "minimalism": ["less", "simplify", "declutter", "minimalist"],
    "mindfulness": ["mindful", "presence", "aware", "attention"],
}
```
| Feature Category | Example Deep Features for Leo Babauta PDF |
|------------------|-------------------------------------------|
| Stylistic Markers | Average sentence length (short: ~10–15 words), frequent use of "you," "simple," "less," "focus," "habit" |
| Structural Markers | Number of bullet lists / checklists, presence of "Zen" headings, numbered steps (e.g., "1. Do one task at a time") |
| Readability | Flesch-Kincaid Grade Level (typically 6th–8th grade for Babauta), high personal pronoun density |
| Lexical Themes | TF-IDF top terms: habit, distraction, mindfulness, clutter, daily routine, procrastination, gratitude |

2. Semantic & Conceptual Features (Deep)

Use a small language model or keyword mapping to extract conceptual depth. For example, score each concept by counting occurrences of its keywords in the text:
```python
# Count how often each concept's keywords appear in the extracted text
scores = {
    k: sum(text.lower().count(w) for w in words)
    for k, words in concepts.items()
}
```