
🎓 Complete Tutorial: Your First Hour with LLM Playground

This guide walks you through your first experiments, from setup to advanced usage.


⏰ Quick Start (10 Minutes)

1. Setup (5 min)

# Ensure Ollama is running
ollama serve

# In another terminal, run setup
./setup.sh

# Or manually:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
ollama pull llama2

2. Test It (2 min)

# Run the example script
python example.py

You should see all four examples run successfully!

3. Launch the App (3 min)

streamlit run app.py

Your browser will open to http://localhost:8501


📖 Tutorial: Understanding LLM Behavior

Experiment 1: Your First Generation (5 min)

Goal: See basic text generation in action.

  1. Open the Streamlit app
  2. Click "Connect to Model" in sidebar
  3. Go to "💬 Quick Chat" tab
  4. Enter prompt: "Explain quantum computing in simple terms"
  5. Click "Generate"

Observe:

  • How long did it take?
  • How many tokens?
  • Was the response coherent?

Try again with the same prompt:

  • Do you get identical output?
  • Why or why not? (Hint: temperature)
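
You can also reproduce this outside the UI with the project's Python helpers (the same `get_model` API used in the Mini-Projects below); the temperature and token limit here are illustrative values, not app defaults:

from models import get_model

model = get_model("ollama", "llama2")
prompt = "Explain quantum computing in simple terms"

# Run the same prompt twice. With temperature > 0 the sampler can pick
# different tokens each time, so the two outputs usually differ.
for run in (1, 2):
    response = model.generate(prompt, temperature=0.7, max_tokens=200)
    print(f"--- Run {run} ---\n{response.text}\n")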

Experiment 2: Temperature Effects (10 min)

Goal: Understand creativity vs determinism.

  1. Go to "🌡️ Temperature" tab
  2. Use prompt: "Once upon a time, in a distant galaxy"
  3. Test temperatures: 0.1, 0.7, 1.5
  4. Set samples: 3

Compare outputs:

Temperature   What to expect
0.1           Very similar, focused
0.7           Balanced, varied
1.5           Very different, creative

Questions to answer:

  • Which temperature gives the most consistent results?
  • Which is best for creative writing?
  • Which would you use for factual answers?
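
The same sweep can be scripted; a minimal sketch using the `get_model` helper from the Mini-Projects section (sample count and token limit are illustrative):

from models import get_model

model = get_model("ollama", "llama2")
prompt = "Once upon a time, in a distant galaxy"

# Three samples per temperature: low values should nearly repeat
# themselves, high values should diverge noticeably.
for temp in (0.1, 0.7, 1.5):
    print(f"=== temperature={temp} ===")
    for i in range(3):
        response = model.generate(prompt, temperature=temp, max_tokens=60)
        print(f"[{i + 1}] {response.text.strip()}")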

Experiment 3: Zero-Shot vs Few-Shot (10 min)

Goal: See how examples improve performance.

  1. Go to "📚 Few-Shot" tab
  2. Select scenario: "sentiment_analysis"
  3. Test case: Pick any from dropdown
  4. Click "Run Comparison"

Analyze:

  • Was zero-shot correct?
  • Did few-shot improve accuracy?
  • How much did token count increase?

Extension: Try the other scenarios:

  • entity_extraction
  • text_classification
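
Whichever scenario you pick, a rough code equivalent of the comparison looks like this; the example texts below are made up and stand in for the built-in scenario data:

from models import get_model

model = get_model("ollama", "llama2")

zero_shot = 'Classify the sentiment of: "The battery died after a day." Sentiment:'

few_shot = """Classify the sentiment of each text.

Text: "I love this phone!" Sentiment: Positive
Text: "Worst purchase ever." Sentiment: Negative
Text: "The battery died after a day." Sentiment:"""

# Few-shot prompts cost more tokens but usually anchor the output format.
for name, prompt in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    response = model.generate(prompt, temperature=0.3, max_tokens=5)
    print(f"{name}: {response.text.strip()}")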

Experiment 4: Prompt Sensitivity (10 min)

Goal: See how wording affects output.

  1. Go to "🔍 Sensitivity" tab
  2. Select: "tone_changes"
  3. Run the test

Compare prompts:

"Explain machine learning."
"Explain machine learning simply."
"Explain machine learning technically."

Notice:

  • Does output complexity match the prompt?
  • Which prompt gives the most useful answer?
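
To test wording effects on prompts of your own, a small loop is enough (parameters are illustrative):

from models import get_model

model = get_model("ollama", "llama2")

# Identical topic, different framing: compare length, vocabulary, and depth.
prompts = [
    "Explain machine learning.",
    "Explain machine learning simply.",
    "Explain machine learning technically.",
]

for p in prompts:
    response = model.generate(p, temperature=0.7, max_tokens=120)
    print(f">>> {p}\n{response.text.strip()}\n")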

Experiment 5: Analyze Your Logs (5 min)

Goal: Learn from your experiments.

  1. Go to "📋 Logs" tab
  2. Review your interactions
  3. Look for patterns

Questions:

  • Which experiments used the most tokens?
  • What was your average latency?
  • Which prompts worked best?

🎯 Advanced Tutorial: Deep Dives

Deep Dive 1: Perfect the Prompt (15 min)

Goal: Systematically improve a prompt.

Task: Get the model to write a product description.

Iteration 1 (Baseline):

"Write a product description"
  • Vague, generic output

Iteration 2 (Add specificity):

"Write a product description for a wireless headphone"
  • Better, but still generic

Iteration 3 (Add constraints):

"Write a 50-word product description for a wireless headphone, 
highlighting comfort, battery life, and sound quality"
  • Much more focused!

Iteration 4 (Add examples):

Example description:
"The ErgoMouse 3000 redefines comfort with its..."

Now write a description for wireless headphones, 
highlighting comfort, battery life, and sound quality. (50 words)
  • Best results!

Log each iteration and compare.
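
A quick harness for running all four iterations back to back (the example-based prompt is abbreviated to the text shown above; temperature and token limit are illustrative):

from models import get_model

model = get_model("ollama", "llama2")

iterations = [
    "Write a product description",
    "Write a product description for a wireless headphone",
    "Write a 50-word product description for a wireless headphone, "
    "highlighting comfort, battery life, and sound quality",
    'Example description:\n"The ErgoMouse 3000 redefines comfort with its..."\n\n'
    "Now write a description for wireless headphones, highlighting "
    "comfort, battery life, and sound quality. (50 words)",
]

# Each refinement narrows the model's search space; compare focus and length.
for i, prompt in enumerate(iterations, 1):
    response = model.generate(prompt, temperature=0.7, max_tokens=120)
    print(f"--- Iteration {i} ---\n{response.text.strip()}\n")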


Deep Dive 2: Temperature Sweet Spots (20 min)

Goal: Find optimal temperature for different tasks.

Test each task with temperatures: 0.1, 0.3, 0.5, 0.7, 0.9, 1.2

Tasks:

  1. Math: "Calculate 15% of 250"
  2. Facts: "What is the capital of Australia?"
  3. Summary: "Summarize: [paste a paragraph]"
  4. Creative: "Write a haiku about technology"
  5. Code: "Write a Python function to reverse a string"

Create a table:

Task       Best Temp   Why
Math       0.1         Need exact answer
Facts      0.3         ...
Summary    ...         ...
Creative   ...         ...
Code       ...         ...
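
A sketch of the sweep (add your own summary text; the short math and facts tasks make correctness easy to eyeball, since 15% of 250 is 37.5 and Australia's capital is Canberra):

from models import get_model

model = get_model("ollama", "llama2")

tasks = {
    "math": "Calculate 15% of 250",                # exact answer: 37.5
    "facts": "What is the capital of Australia?",  # Canberra
    "creative": "Write a haiku about technology",
    "code": "Write a Python function to reverse a string",
}

for name, prompt in tasks.items():
    for temp in (0.1, 0.3, 0.5, 0.7, 0.9, 1.2):
        response = model.generate(prompt, temperature=temp, max_tokens=100)
        print(f"{name} @ {temp}: {response.text.strip()[:80]!r}")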

Deep Dive 3: Context Window Exploration (15 min)

Goal: Understand context limits.

  1. Create prompts of varying lengths:

    • Short: 50 words
    • Medium: 200 words
    • Long: 500 words
    • Very long: 1000 words
  2. Ask the same question at the end of each

  3. Observe:

    • Does latency scale linearly?
    • Does quality degrade?
    • At what point does it fail?

Use the CLI for this:

python cli.py generate "$(cat long_text.txt) Question: What is the main topic?"

Deep Dive 4: Model Comparison (20 min)

Goal: Understand model tradeoffs.

Test prompt: "Explain photosynthesis"

Models to test:

  • llama2
  • mistral (if installed: ollama pull mistral)
  • phi (if installed: ollama pull phi)

Compare:

Model     Speed   Quality   Token Efficiency
llama2    ...     ...       ...
mistral   ...     ...       ...
phi       ...     ...       ...
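
A timing sketch, assuming each model has been pulled and that `get_model` accepts any installed Ollama model name:

import time
from models import get_model

# Requires: ollama pull mistral / ollama pull phi (llama2 comes from setup).
for name in ("llama2", "mistral", "phi"):
    model = get_model("ollama", name)
    start = time.perf_counter()
    response = model.generate("Explain photosynthesis",
                              temperature=0.7, max_tokens=200)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.1f}s\n{response.text[:200]}\n")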

Which would you choose for:

  • Development/testing?
  • Production?
  • Mobile/edge deployment?

🔬 Mini-Projects (30-60 min each)

Project 1: Sentiment Analysis Dashboard

Goal: Build a simple sentiment analyzer.

# sentiment_analyzer.py
from models import get_model
from logger import get_logger

model = get_model("ollama", "llama2")
logger = get_logger()

def analyze_sentiment(text):
    prompt = f"""Classify the sentiment as Positive, Negative, or Neutral.
    
    Text: "{text}"
    
    Sentiment:"""
    
    response = model.generate(prompt, temperature=0.3, max_tokens=5)
    return response.text.strip()

# Test it
reviews = [
    "This product is amazing!",
    "Terrible experience, would not recommend.",
    "It's okay, nothing special.",
]

for review in reviews:
    sentiment = analyze_sentiment(review)
    print(f"{review} → {sentiment}")

Extensions:

  • Add confidence scores
  • Batch process multiple reviews
  • Create a Streamlit UI

Project 2: Smart Summarizer

Goal: Summarize text with different styles.

from models import get_model

model = get_model("ollama", "llama2")

def summarize(text, style="concise", max_words=50):
    prompts = {
        "concise": f"Summarize in {max_words} words:\n\n{text}",
        "bullet": f"Summarize in {max_words} words using bullet points:\n\n{text}",
        "eli5": f"Explain this like I'm 5, in {max_words} words:\n\n{text}",
    }
    
    response = model.generate(prompts[style], temperature=0.5, max_tokens=100)
    return response.text

# Test with different styles
article = "..." # Your text here

for style in ["concise", "bullet", "eli5"]:
    print(f"\n{style.upper()}:")
    print(summarize(article, style))

Project 3: Code Documentation Generator

Goal: Auto-generate docstrings.

from models import get_model

model = get_model("ollama", "llama2")

def generate_docstring(code):
    prompt = f"""Write a Python docstring for this function:

{code}

Docstring (use Google style):"""
    
    response = model.generate(prompt, temperature=0.3, max_tokens=200)
    return response.text

# Example
code = """
def calculate_average(numbers):
    return sum(numbers) / len(numbers)
"""

print(generate_docstring(code))

📊 Analysis Exercises

Exercise 1: Token Economics

Goal: Understand cost implications.

  1. Run 10 different prompts
  2. Check logs for token usage
  3. Calculate:
    • Average tokens per request
    • Most expensive prompt
    • Most efficient prompt (quality/token)

If using OpenAI:

  • Calculate total cost
  • Estimate monthly cost for 1000 requests
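
Back-of-the-envelope cost math looks like this; the per-token prices below are placeholders, so substitute your provider's current rates:

# Placeholder prices per 1K tokens -- check your provider's pricing page.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in dollars."""
    input_cost = (input_tokens / 1000) * PRICE_PER_1K_INPUT
    output_cost = (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return input_cost + output_cost

per_request = estimate_cost(input_tokens=300, output_tokens=150)
print(f"Per request: ${per_request:.4f}")
print(f"Per 1000 requests/month: ${per_request * 1000:.2f}")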

Exercise 2: Latency Analysis

Goal: Understand performance characteristics.

  1. Measure latency for different prompt lengths
  2. Plot: Prompt Length (tokens) vs Latency (ms)
  3. Calculate tokens per second

Expected pattern:

  • Fixed overhead (~50-100ms)
  • Linear scaling with output length
  • Local models: ~5-20 tokens/sec
  • API models: ~50-100 tokens/sec
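
You can measure throughput with a stopwatch around the call; whitespace splitting undercounts subword tokens, so treat the result as a rough lower bound:

import time
from models import get_model

model = get_model("ollama", "llama2")

start = time.perf_counter()
response = model.generate("List ten uses for a paperclip.",
                          temperature=0.7, max_tokens=200)
elapsed = time.perf_counter() - start

# Word count approximates token count (real tokenizers emit more tokens).
approx_tokens = len(response.text.split())
print(f"{elapsed:.2f}s total, ~{approx_tokens / elapsed:.1f} tokens/sec")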

Exercise 3: Quality Assessment

Goal: Quantify output quality.

Create a rubric:

def rate_response(response, task_type):
    """Rate 1-5 on:
    - Accuracy: Is it correct?
    - Relevance: Does it answer the question?
    - Completeness: Is it thorough?
    - Clarity: Is it easy to understand?
    """
    score = {"accuracy": 0, "relevance": 0, "completeness": 0, "clarity": 0}
    # Your rating logic: fill in each criterion (manually or model-assisted)
    return score

Test 20 responses and find patterns:

  • Which temperatures give best quality?
  • Does prompt length correlate with quality?
  • Are few-shot prompts always better?

🎯 Challenge Problems

Challenge 1: Chain-of-Thought Implementation

Implement a system that:

  1. Breaks complex questions into steps
  2. Solves each step
  3. Combines results

Test with: "If a store has 20% off everything, and an item costs $80 after discount, what was the original price?"
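
One possible shape for the solution: a planning step and a solving step, each a separate generation (prompt wording and parameters are illustrative):

from models import get_model

model = get_model("ollama", "llama2")

question = ("If a store has 20% off everything, and an item costs $80 "
            "after discount, what was the original price?")

# Step 1: ask for a plan, not an answer.
plan = model.generate(
    f"Break this problem into numbered steps. Do not solve it yet.\n\n{question}",
    temperature=0.3, max_tokens=150,
).text

# Step 2: solve by following the plan.
answer = model.generate(
    f"{question}\n\nSteps:\n{plan}\n\nFollow the steps and give the final answer:",
    temperature=0.3, max_tokens=200,
).text

print(answer)  # correct result: $80 / (1 - 0.20) = $100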


Challenge 2: Multi-Model Consensus

Query multiple models and:

  1. Compare answers
  2. Find consensus
  3. Flag disagreements

When would this be useful?
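
A minimal majority-vote sketch, assuming all three models are installed locally:

from collections import Counter
from models import get_model

MODEL_NAMES = ("llama2", "mistral", "phi")

def consensus(prompt: str) -> dict:
    """Query each model and report agreement."""
    answers = {}
    for name in MODEL_NAMES:
        model = get_model("ollama", name)
        answers[name] = model.generate(prompt, temperature=0.3,
                                       max_tokens=10).text.strip()
    top, votes = Counter(answers.values()).most_common(1)[0]
    if votes > 1:
        print(f"Consensus ({votes}/{len(MODEL_NAMES)}): {top}")
    else:
        print(f"No consensus, flag for review: {answers}")
    return answers

consensus("What is the capital of Australia? Answer with one word.")

Note that exact string matching only works for short, constrained answers; free-form outputs need a fuzzier comparison.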


Challenge 3: Adaptive Temperature

Build a system that:

  1. Detects task type from prompt
  2. Automatically sets appropriate temperature
  3. Logs whether it chose well
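
A crude starting point: keyword-based task detection mapped to temperatures (a stronger version could ask the model itself to classify the prompt):

from models import get_model

TASK_TEMPS = {"math": 0.1, "facts": 0.3, "summary": 0.5, "creative": 1.0}

def detect_task(prompt: str) -> str:
    """Keyword heuristic for routing a prompt to a temperature."""
    p = prompt.lower()
    if any(w in p for w in ("calculate", "solve", "%")):
        return "math"
    if any(w in p for w in ("poem", "story", "haiku", "imagine")):
        return "creative"
    if "summarize" in p or "summary" in p:
        return "summary"
    return "facts"

def adaptive_generate(prompt: str):
    task = detect_task(prompt)
    temp = TASK_TEMPS[task]
    print(f"[detected task={task}, temperature={temp}]")  # record the choice
    model = get_model("ollama", "llama2")
    return model.generate(prompt, temperature=temp, max_tokens=200)

adaptive_generate("Write a haiku about technology")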

✅ Completion Checklist

After this tutorial, you should be able to:

  • Generate text with any model
  • Explain what temperature does
  • Write effective prompts
  • Use few-shot learning
  • Analyze logs for insights
  • Choose appropriate parameters
  • Estimate token costs
  • Compare different models
  • Build simple LLM applications
  • Debug problematic outputs

🚀 Next Steps

  1. Read the theory: Go through CONCEPTS.md thoroughly
  2. Experiment freely: Use the playground daily
  3. Build something: Pick a project and implement it
  4. Share findings: Document interesting discoveries
  5. Go deeper: Read papers, try advanced techniques

Remember: The best way to learn is by doing. Run lots of experiments, observe patterns, and build intuition. Every interaction teaches you something! 🎓