A practical guide to using local or paid LLMs to understand dense technical papers at your own pace.
Why I Started Using AI to Read Research Papers
I’ll be honest: I used to dread reading research papers. The dense math, the notation-heavy proofs, the assumption that you already know three other papers to understand this one. As an ML engineer, I knew these papers held insights I could apply at work, but actually extracting that value felt like a slog.
My learning style doesn’t mesh well with traditional academic writing. I learn best through small examples and code I can run and tinker with. Show me a 3x3 matrix example before the general n×n case. Let me see the algorithm work on toy data before throwing theorems at me.
For years, this created a barrier. Research papers felt like they were written for PhDs, not practitioners. You either had the advanced math background to parse them, or you waited months for someone to write a blog post explaining it in plain English.
That’s changed. With AI assistants (both local and cloud-based LLMs), I can now have a conversation with a paper. I can ask “what does this symbol mean?” or “show me this concept with a simple Python example” and get an answer calibrated to my current understanding. The paper adapts to me, not the other way around.
This post walks through my workflow for using AI to read technical papers, so you can apply the same approach to level up your own learning.
By the end of this post, you’ll be able to:
Set up a local LLM (using Ollama) or connect to cloud APIs for paper reading
Convert PDFs to markdown that LLMs can actually process
Craft system prompts that turn an LLM into a patient tutor, not a lecture-bot
Build an interactive Q&A workflow where you control the pace and depth
Apply this approach to papers, documentation, or any dense technical content
Before we dive in, let’s get our environment ready. You’ll need two things: a local LLM running on your machine, and a few Python libraries to glue everything together.
Installing Ollama
Ollama makes running LLMs locally dead simple. Head to ollama.com and download the installer for your OS. On Mac, it’s a standard .dmg install. On Linux, a one-liner:
curl -fsSL https://ollama.com/install.sh | sh
Once installed, verify it’s working:
ollama --version
Choosing and Pulling a Model
Browse the Ollama model library to see what’s available. For reading papers with diagrams and equations, I recommend qwen3-vl:8b, a vision-language model that can handle both text and images. You can use other models based on your machine’s specifications.
Pull it with:
ollama pull qwen3-vl:8b
This downloads the model weights (around 5GB). Grab a coffee.
To start serving:
OLLAMA_CONTEXT_LENGTH=64000 ollama serve
Keep this terminal running. Ollama now exposes an OpenAI-compatible API at localhost:11434.
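If you want to sanity-check the server from Python before going further, a small request against Ollama's model-listing endpoint does the trick. This sketch assumes the default host and port:

```python
import json
import urllib.request

# Ask the local Ollama server which models it has pulled.
# /api/tags is Ollama's model-listing endpoint; adjust host/port if you changed the defaults.
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    data = json.load(resp)

print([m["name"] for m in data.get("models", [])])
```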
Note
The default context length in Ollama is 4096 tokens, which is too small for our use case, so we bump it up to 64K.
You'll also need two Python libraries: pymupdf4llm, which converts PDFs into markdown that LLMs can process, and lisette, a lightweight wrapper on litellm that simplifies multi-turn conversations and provides a unified API to talk to 100+ LLM providers (local or cloud) with one interface.
Install them:
pip install pymupdf4llm lisette
With Ollama running and these libraries installed, you’re ready to start chatting with papers.
Core Workflow: PDF to Conversation
The workflow has two parts: getting the paper into a format LLMs can work with, then setting up the conversation so the AI becomes a useful tutor rather than a generic chatbot.
Converting PDF to Markdown
The first step is to convert the PDF into text. pymupdf4llm handles this by extracting text while preserving structure. Here's an example, reading this paper:
import pymupdf4llm

# Convert PDF to txt
paper_txt = pymupdf4llm.to_markdown("./paper/1_2018_DPP_YT.pdf")

# Save the paper as Markdown
with open("./paper/1_2018_DPP_YT.md", "w") as f:
    f.write(paper_txt)
Consider using the pymupdf_layout package for a greatly improved page layout analysis.
Let’s look at the first 500 characters.
print(paper_txt[:500])
Industry and Case Study Paper CIKM’18, October 22-26, 2018, Torino, Italy
# **Practical Diversified Recommendations on YouTube with** **Determinantal Point Processes**
Mark Wilhelm, Ajith Ramanathan, Alexander Bonomo, Sagar Jain, Ed H. Chi, Jennifer Gillenwater
Google Inc.
{wilhelm,ajith,bonomo,sagarj,edchi,jengi}@google.com
**ABSTRACT**
Many recommendation systems produce result sets with large numbers of highly similar items. Diversifying these results is often
accomplished with heuris
That’s it. You now have clean markdown with headers, code blocks, and reasonable table formatting. For papers with complex layouts, the results aren’t perfect, but they’re good enough for an LLM to understand the content. If you want a more accurate PDF to markdown converter, you can also look at Datalab Marker.
Designing Your System Prompt
This is where the magic happens. A vanilla “you are a helpful assistant” prompt gives you Wikipedia-style explanations. A well-crafted system prompt gives you a patient tutor who meets you at your level.
Here’s a prompt that works well for paper-reading sessions:
system_prompt = f"""You are helping someone understand an academic paper.
Here is the paper
<paper>
{paper_txt}
</paper>

CRITICAL RULES:
1. NEVER explain everything at once. Take ONE small step, then STOP and wait.
2. ALWAYS start by asking what the learner already knows about the topic.
3. After each explanation, ask a question to check understanding OR ask what they want to explore next.
4. Keep responses SHORT (2-4 paragraphs max). End with a question.
5. Use concrete examples and analogies before math.
6. Build foundations with code - Teach unfamiliar mathematical concepts through small numpy experiments rather than pure theory. Let the learner run code and observe patterns.
7. If they ask "explain X", first ask what parts of X they already understand.
8. Use string format like this for formula display `L_ij = q_i × q_j × exp(-α × D_ij^γ)`.

TEACHING FLOW:
- Assess background → Build intuition with examples → Connect to math → Let learner guide direction

BAD (don't do this):
"Here's everything about DPPs: [wall of text with all equations]"
"""
The Teaching Prompt Deep-Dive
The prompt design isn’t arbitrary. Each rule targets a specific failure mode I’ve seen when using LLMs for learning:
The paper lives in the system prompt. This keeps the full context available for every question you ask. You don’t need to re-paste sections or remind the AI what paper you’re discussing.
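Because the whole paper rides along with every request, it's worth a quick check that it fits in the 64K context we configured earlier. A rough heuristic (an assumption, not the model's actual tokenizer) is about four characters per token:

```python
# Rough estimate: ~4 characters per token (heuristic only, not the real tokenizer)
approx_tokens = len(paper_txt) // 4
print(f"Paper is roughly {approx_tokens:,} tokens")

# Leave headroom for the rules, your questions, and the model's answers
if approx_tokens > 50_000:
    print("Consider raising OLLAMA_CONTEXT_LENGTH or trimming the paper text")
```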
“Intuition before math” matches how practitioners learn. Most of us grasp concepts faster through examples and analogies than through theorem statements. If your learning style differs, swap in your own preferences here.
“Tiny example first” counteracts the LLM’s default behavior. Without this, models tend toward abstract, general explanations. Forcing concrete examples (a 3x3 matrix, a toy dataset) makes concepts stick.
“Check if I want to go deeper” prevents walls of text. LLMs love to explain everything at once. This rule puts you in control of pacing. You decide when to zoom in on details vs. move to the next topic.
The “BAD” example shows what to avoid. Negative examples help the model understand boundaries. Without it, you’ll get comprehensive but overwhelming responses.
Putting It Together: With these two pieces, you’re ready to start a conversation. The next section shows how to wire this up with lisette for a smooth multi-turn chat experience.
Wiring Up the Conversation with Lisette
lisette handles the boilerplate of multi-turn conversations: tracking message history, formatting requests for different LLM providers, and managing context windows. Here’s the minimal setup:
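A sketch of what that looks like (I'm assuming lisette exposes Chat at the package top level; adjust the import if your installed version differs):

```python
from lisette import Chat

# Point lisette at the local Ollama server using litellm's provider/model convention
chat = Chat(model="ollama/qwen3-vl:8b")
```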
That’s it. The model string follows litellm’s provider/model convention. For Ollama, prefix with ollama/. For cloud providers, you’d use anthropic/claude-sonnet-4-20250514 or openai/gpt-4o.
To send a message and get a response:
chat("Hi! I am Aayush")
Hello Aayush! 😊 How can I assist you today? Let me know if you have any questions or need help with anything!
As you can see above, it’s that easy to start a conversation with Ollama running on your local machine. The Chat class is persistent, which means it remembers the previous turns of the conversation.
chat("What's my name?")
Your name is Aayush! 😊 I remember you mentioned it in your first message. Is there anything else you’d like to share or ask?
As you can see, the model remembers my name from the previous turn. Each call to chat() appends your message and the AI’s response to the conversation history, so the next time you call it, the model sees the full exchange so far. Let’s look at the current history:
chat.hist
[{'role': 'user', 'content': 'Hi! I am Aayush'},
Message(content='Hello Aayush! 😊 How can I assist you today? Let me know if you have any questions or need help with anything!', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None),
{'role': 'user', 'content': "What's my name?"},
Message(content="Your name is **Aayush**! 😊 I remember you mentioned it in your first message. Is there anything else you'd like to share or ask?", role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None)]
Live Example: Reading a DPP Paper
Here’s a real exchange from when I was working through a paper on Determinantal Point Processes. For ease of reading in the blog post, let’s create a small helper that renders the AI output in a terminal style.
This time we actually pass the system prompt we created in the previous section, containing the paper text and our teaching instructions.
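Here’s a minimal sketch of such a helper and the tutor chat (the printing details are illustrative; the think flag is simply forwarded to the chat call, and behavior may vary across lisette versions):

```python
from lisette import Chat

def ask(chat, q, **kwargs):
    # Print the question, forward it (plus any flags like think=True) to the chat,
    # and print the reply in a simple "You / Tutor" style.
    print(f"You:\n{q}\n")
    resp = chat(q, **kwargs)
    print(f"Tutor: {getattr(resp, 'content', resp)}")

# A fresh chat that carries the paper and the teaching rules in its system prompt
chat = Chat(model="ollama/qwen3-vl:8b", sp=system_prompt)
```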
q ="""I want to learn about this paper. Help me."""ask(chat, q, think=True)
You:
I want to learn about this paper. Help me.
Tutor: Let's start with a simple question: What do you already know about recommendation systems or diversification? Are you familiar with YouTube's recommendation algorithm? Or maybe you've heard of techniques like collaborative filtering or matrix factorization?
I'm excited about this paper because it tackles a real problem we face on YouTube: when users scroll through recommendations, they often see similar videos in a row, which can cause them to abandon the feed. The authors used a statistical model called Determinantal Point Processes (DPPs) to diversify recommendations while still respecting user preferences.
What aspects of this paper interest you most? Are you more curious about how DPPs work conceptually, how they were implemented at scale, or the specific results YouTube saw? I can tailor my explanation to what you'd like to explore first.
q ="""I know about recommendation systems and work on a large recommender system which serves video content.I am familiar with recommendation algorithms. I am interested in understanding how DPP works."""ask(chat, q, think=True)
You:
I know about recommendation systems and work on a large recommender system which serves video content.
I am familiar with recommendation algorithms. I am interested in understanding how DPP works.
Tutor: The core idea of DPPs is modeling **repulsion between items** in a recommendation list. Unlike traditional pointwise scoring (e.g., sorting by individual video quality scores), DPPs consider *how items interact* when presented together. For example, if two basketball videos are shown consecutively, the DPP model would reduce the probability of both being selected because they're redundant.
In your video recommender system, this translates to: Instead of just selecting the "highest quality" videos, DPPs automatically suppress items that would decrease overall user engagement when placed together. The magic happens in the **determinant calculation**: when you take the determinant of a submatrix from the kernel matrix `L`, it measures both the individual quality of items (via diagonal entries) and their mutual diversity (via off-diagonal entries). A large determinant means you've selected a diverse set with high overall utility.
Would you like me to walk through a simple Python example using NumPy to demonstrate how the determinant of a 2x2 kernel matrix captures both quality and diversity? Or would you prefer to first understand why this approach is better than heuristic methods like "sliding window" diversity?
q ="""Yes I don't understand what determinant are, can you help me understand the basic intuition and use code examples?I am fairly familiar with numpy."""ask(chat, q)
You:
Yes I don't understand what determinant are, can you help me understand the basic intuition and use code examples?
I am fairly familiar with numpy.
Tutor: Let's visualize this with a simple NumPy example. Determinants measure how "spread out" a matrix is - think of it as the volume of the parallelogram formed by the matrix's columns. For DPPs, this volume represents both item quality and diversity:
```python
import numpy as np
# Create two items with different qualities
q1 = 5.0 # High-quality video
q2 = 1.0 # Low-quality video
# Case 1: Similar items (D=0.5)
L_similar = np.array([
[q1**2, 0.5 * q1 * q2 * np.exp(-0.5**2 / 2)],
[0.5 * q1 * q2 * np.exp(-0.5**2 / 2), q2**2]
])
print("Determinant for similar items:", np.linalg.det(L_similar))
# Case 2: Dissimilar items (D=2.0)
L_dissimilar = np.array([
[q1**2, 0.5 * q1 * q2 * np.exp(-2.0**2 / 2)],
[0.5 * q1 * q2 * np.exp(-2.0**2 / 2), q2**2]
])
print("Determinant for dissimilar items:", np.linalg.det(L_dissimilar))
```
What you'll see:
- When items are similar (D=0.5), the determinant is small (≈18.75)
- When items are dissimilar (D=2.0), the determinant is larger (≈24.99)
Why? The determinant combines:
1. **Quality**: Diagonal terms (q1² and q2²)
2. **Diversity**: Off-diagonal terms (which decrease as items become more dissimilar)
This creates a natural trade-off: adding a high-quality but similar item reduces the determinant (bad), while adding a high-quality dissimilar item increases it (good).
Would you like me to:
1. Show the actual numbers from the paper's experiments?
2. Explain how this connects to YouTube's specific implementation?
3. Or dive deeper into why determinants work for this problem?
Note
The Chat class renders responses as markdown, which is more readable. The examples above are wrapped for easy separation between blog text and model output.
Managing Chat History: Checkpoints and Context Control
One advantage of this approach over general chat interfaces is that you have full access to the conversation history. This matters when sessions get long or when the AI goes down a wrong path and you want to rewind.
Accessing the History
The Chat object stores messages in a list you can inspect and modify:
# See the current conversation
print(chat.hist)  # Returns list of message dicts

# Check how many turns you've had
print(len(chat.hist))
[{'role': 'user', 'content': '\nI want to learn about this paper. Help me.\n'}, Message(content="Let's start with a simple question: What do you already know about recommendation systems or diversification? Are you familiar with YouTube's recommendation algorithm? Or maybe you've heard of techniques like collaborative filtering or matrix factorization?\n\nI'm excited about this paper because it tackles a real problem we face on YouTube: when users scroll through recommendations, they often see similar videos in a row, which can cause them to abandon the feed. The authors used a statistical model called Determinantal Point Processes (DPPs) to diversify recommendations while still respecting user preferences.\n\nWhat aspects of this paper interest you most? Are you more curious about how DPPs work conceptually, how they were implemented at scale, or the specific results YouTube saw? I can tailor my explanation to what you'd like to explore first.", role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None), {'role': 'user', 'content': '\nI know about recommendation systems and work on a large recommender system which serves video content.\nI am familiar with recommendation algorithms. I am interested in understanding how DPP works.\n'}, Message(content='The core idea of DPPs is modeling **repulsion between items** in a recommendation list. Unlike traditional pointwise scoring (e.g., sorting by individual video quality scores), DPPs consider *how items interact* when presented together. For example, if two basketball videos are shown consecutively, the DPP model would reduce the probability of both being selected because they\'re redundant. \n\nIn your video recommender system, this translates to: Instead of just selecting the "highest quality" videos, DPPs automatically suppress items that would decrease overall user engagement when placed together. The magic happens in the **determinant calculation**: when you take the determinant of a submatrix from the kernel matrix `L`, it measures both the individual quality of items (via diagonal entries) and their mutual diversity (via off-diagonal entries). A large determinant means you\'ve selected a diverse set with high overall utility.\n\nWould you like me to walk through a simple Python example using NumPy to demonstrate how the determinant of a 2x2 kernel matrix captures both quality and diversity? Or would you prefer to first understand why this approach is better than heuristic methods like "sliding window" diversity?', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None), {'role': 'user', 'content': "\nYes I don't understand what determinant are, can you help me understand the basic intuition and use code examples?\nI am fairly familiar with numpy.\n"}, Message(content='Let\'s visualize this with a simple NumPy example. Determinants measure how "spread out" a matrix is - think of it as the volume of the parallelogram formed by the matrix\'s columns. 
For DPPs, this volume represents both item quality and diversity:\n\n```python\nimport numpy as np\n\n# Create two items with different qualities\nq1 = 5.0 # High-quality video\nq2 = 1.0 # Low-quality video\n\n# Case 1: Similar items (D=0.5)\nL_similar = np.array([\n [q1**2, 0.5 * q1 * q2 * np.exp(-0.5**2 / 2)],\n [0.5 * q1 * q2 * np.exp(-0.5**2 / 2), q2**2]\n])\nprint("Determinant for similar items:", np.linalg.det(L_similar))\n\n# Case 2: Dissimilar items (D=2.0)\nL_dissimilar = np.array([\n [q1**2, 0.5 * q1 * q2 * np.exp(-2.0**2 / 2)],\n [0.5 * q1 * q2 * np.exp(-2.0**2 / 2), q2**2]\n])\nprint("Determinant for dissimilar items:", np.linalg.det(L_dissimilar))\n```\n\nWhat you\'ll see:\n- When items are similar (D=0.5), the determinant is small (≈18.75)\n- When items are dissimilar (D=2.0), the determinant is larger (≈24.99)\n\nWhy? The determinant combines:\n1. **Quality**: Diagonal terms (q1² and q2²)\n2. **Diversity**: Off-diagonal terms (which decrease as items become more dissimilar)\n\nThis creates a natural trade-off: adding a high-quality but similar item reduces the determinant (bad), while adding a high-quality dissimilar item increases it (good).\n\nWould you like me to:\n1. Show the actual numbers from the paper\'s experiments?\n2. Explain how this connects to YouTube\'s specific implementation?\n3. Or dive deeper into why determinants work for this problem?', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None)]
6
Creating Checkpoints
Before exploring a tangent, save your current state:
# Save checkpoint before a risky question
checkpoint = len(chat.hist)

# Ask something that might derail the conversation
ask(chat, "Actually, can you explain all the math in one go?")
You: Actually, can you explain all the math in one go?
Tutor: I understand you want a comprehensive explanation, but I need to be honest: explaining all the math at once would overwhelm you. The paper contains several key mathematical concepts that build on each other, including:
1. **Determinantal Point Processes (DPPs)** - Probabilistic models that capture repulsion between items
2. **Kernel matrices** - The `L` matrix that encodes both item quality and diversity
3. **Determinant calculation** - The core operation that measures set utility
4. **Approximate inference** - How they compute DPPs at scale
Let's take one concept at a time. Would you like me to:
1. Explain the kernel matrix `L` with a concrete example?
2. Show how the determinant calculation works for small sets?
3. Walk through the training procedure for the DPP parameters?
I can provide a focused explanation on whichever part interests you most. What would you like to explore first?
# That response was overwhelming. Rewind.
chat.hist = chat.hist[:checkpoint]
print(chat.hist[len(chat.hist)-1])
Message(content='Let\'s visualize this with a simple NumPy example. Determinants measure how "spread out" a matrix is - think of it as the volume of the parallelogram formed by the matrix\'s columns. For DPPs, this volume represents both item quality and diversity:\n\n```python\nimport numpy as np\n\n# Create two items with different qualities\nq1 = 5.0 # High-quality video\nq2 = 1.0 # Low-quality video\n\n# Case 1: Similar items (D=0.5)\nL_similar = np.array([\n [q1**2, 0.5 * q1 * q2 * np.exp(-0.5**2 / 2)],\n [0.5 * q1 * q2 * np.exp(-0.5**2 / 2), q2**2]\n])\nprint("Determinant for similar items:", np.linalg.det(L_similar))\n\n# Case 2: Dissimilar items (D=2.0)\nL_dissimilar = np.array([\n [q1**2, 0.5 * q1 * q2 * np.exp(-2.0**2 / 2)],\n [0.5 * q1 * q2 * np.exp(-2.0**2 / 2), q2**2]\n])\nprint("Determinant for dissimilar items:", np.linalg.det(L_dissimilar))\n```\n\nWhat you\'ll see:\n- When items are similar (D=0.5), the determinant is small (≈18.75)\n- When items are dissimilar (D=2.0), the determinant is larger (≈24.99)\n\nWhy? The determinant combines:\n1. **Quality**: Diagonal terms (q1² and q2²)\n2. **Diversity**: Off-diagonal terms (which decrease as items become more dissimilar)\n\nThis creates a natural trade-off: adding a high-quality but similar item reduces the determinant (bad), while adding a high-quality dissimilar item increases it (good).\n\nWould you like me to:\n1. Show the actual numbers from the paper\'s experiments?\n2. Explain how this connects to YouTube\'s specific implementation?\n3. Or dive deeper into why determinants work for this problem?', role='assistant', tool_calls=None, function_call=None, provider_specific_fields=None)
Why This Matters
Two problems this solves:
Context bloat. Long conversations eat up your context window. If you spent 10 turns on a section you now understand, trim those exchanges and reclaim tokens for the next topic (see the sketch below).
Context poisoning. If the AI misunderstands something and you correct it, the wrong explanation still sits in history, potentially confusing later responses. Removing it gives you a cleaner slate.
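For the context-bloat case, trimming is just list surgery on chat.hist. Here's a sketch (the indices are illustrative; pick the turns you actually want to drop):

```python
# Suppose messages 2-9 covered a section you've now fully absorbed.
# Keep the opening exchange and everything after that stretch, drop the middle.
chat.hist = chat.hist[:2] + chat.hist[10:]
print(f"History now has {len(chat.hist)} messages")
```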
Going Further: Paid LLM Options
Local models are great for privacy and cost, but sometimes you need more firepower. Longer papers, denser math, or multi-step reasoning tasks can benefit from larger models like Claude or GPT-4o.
The good news: lisette uses litellm under the hood, so switching providers is a one-line change.
For cloud providers, you’ll need API keys. Set them as environment variables or use a .env file:
import os
from dotenv import load_dotenv

load_dotenv()
print(f"Print first string of Anthropic key: {os.getenv('ANTHROPIC_API_KEY')[:15]}")
Print first string of Anthropic key: sk-ant-api03-JS
# Claude (Anthropic)
chat = Chat(model="anthropic/claude-haiku-4-5", sp=system_prompt)
ask(chat, "Help me understand the paper")
You: Help me understand the paper
Tutor: I'd love to help! This is a really interesting paper about improving YouTube recommendations.
Before I dive in, let me ask: **What's your background?** For example:
- Are you familiar with recommendation systems (like how Netflix suggests movies)?
- Do you know what machine learning models are?
- Have you seen probability/statistics before?
Also, **what part interests you most?** The paper covers several things:
1. The *problem* (why similar videos in a feed is bad)
2. The *solution* (using something called DPPs)
3. The *results* (how well it worked on YouTube)
What would you like to start with?
As you can see, the workflow stays identical. Your system prompt, conversation history, and teaching flow all transfer across providers.
Key Takeaways
Reading technical papers used to be a solitary grind. You either had the background to parse the notation, or you didn’t. AI changes this equation.
Here’s what I want you to walk away with:
The barrier to technical content has dropped. The same workflow that worked for this paper works for textbooks, API documentation, dense blog posts, or any content where you need concepts explained at your level. The AI adapts to you.
You control the pace and depth. Unlike courses or tutorials that assume a fixed audience, this approach lets you skip what you know and linger on what confuses you. Your learning path, not someone else’s curriculum.
Local LLMs make this nearly free. Ollama plus a capable open model means you can read papers all day without API costs. You can apply the same method to reading books, YouTube transcripts, or brainstorming sessions. One tool, many uses.
The system prompt is the lever. Generic prompts give generic results. The tutor prompt we built forces the AI into a teaching mode that actually helps. Steal it, tweak it, make it yours.
Start with a paper you actually care about. Don’t practice on something boring. Pick a paper relevant to your work or a concept you’ve always wanted to understand. Motivation carries you through the setup friction.
The tools exist. The cost is minimal. The only question is which paper you’ll read first.
I hope you found this post useful. If you have questions or want to share your own paper-reading workflows, find me on LinkedIn.