
🎓 Complete Tutorial: Your First Hour with LLM Playground

This guide walks you through your first experiments, from setup to advanced usage.


⏰ Quick Start (10 Minutes)

1. Setup (5 min)

# Ensure Ollama is running
ollama serve

# In another terminal, run setup
./setup.sh

# Or manually:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
ollama pull llama2

2. Test It (2 min)

# Run the example script
python example.py

You should see all four examples run successfully!

3. Launch the App (3 min)

streamlit run app.py

Your browser will open to http://localhost:8501


📖 Tutorial: Understanding LLM Behavior

Experiment 1: Your First Generation (5 min)

Goal: See basic text generation in action.

  1. Open the Streamlit app
  2. Click "Connect to Model" in sidebar
  3. Go to "💬 Quick Chat" tab
  4. Enter prompt: "Explain quantum computing in simple terms"
  5. Click "Generate"

Observe:

  • How long did it take?
  • How many tokens?
  • Was the response coherent?

Try again with the same prompt:

  • Do you get identical output?
  • Why or why not? (Hint: temperature)
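
You can also reproduce this outside the UI with the project's Python helpers (the same `get_model` API used in the Mini-Projects below); the temperature and token limit here are illustrative values, not app defaults:

from models import get_model

model = get_model("ollama", "llama2")
prompt = "Explain quantum computing in simple terms"

# Run the same prompt twice. With temperature > 0 the sampler can pick
# different tokens each time, so the two outputs usually differ.
for run in (1, 2):
    response = model.generate(prompt, temperature=0.7, max_tokens=200)
    print(f"--- Run {run} ---\n{response.text}\n")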

Experiment 2: Temperature Effects (10 min)

Goal: Understand creativity vs determinism.

  1. Go to "🌡️ Temperature" tab
  2. Use prompt: "Once upon a time, in a distant galaxy"
  3. Test temperatures: 0.1, 0.7, 1.5
  4. Set samples: 3

Compare outputs:

Temperature   What to expect
0.1           Very similar, focused
0.7           Balanced, varied
1.5           Very different, creative

Questions to answer:

  • Which temperature gives the most consistent results?
  • Which is best for creative writing?
  • Which would you use for factual answers?
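
The same sweep can be scripted; a minimal sketch using the `get_model` helper from the Mini-Projects section (sample count and token limit are illustrative):

from models import get_model

model = get_model("ollama", "llama2")
prompt = "Once upon a time, in a distant galaxy"

# Three samples per temperature: low values should nearly repeat
# themselves, high values should diverge noticeably.
for temp in (0.1, 0.7, 1.5):
    print(f"=== temperature={temp} ===")
    for i in range(3):
        response = model.generate(prompt, temperature=temp, max_tokens=60)
        print(f"[{i + 1}] {response.text.strip()}")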

Experiment 3: Zero-Shot vs Few-Shot (10 min)

Goal: See how examples improve performance.

  1. Go to "📚 Few-Shot" tab
  2. Select scenario: "sentiment_analysis"
  3. Test case: Pick any from dropdown
  4. Click "Run Comparison"

Analyze:

  • Was zero-shot correct?
  • Did few-shot improve accuracy?
  • How much did token count increase?

Extension: Try the other scenarios:

  • entity_extraction
  • text_classification
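
Whichever scenario you pick, a rough code equivalent of the comparison looks like this; the example texts below are made up and stand in for the built-in scenario data:

from models import get_model

model = get_model("ollama", "llama2")

zero_shot = 'Classify the sentiment of: "The battery died after a day." Sentiment:'

few_shot = """Classify the sentiment of each text.

Text: "I love this phone!" Sentiment: Positive
Text: "Worst purchase ever." Sentiment: Negative
Text: "The battery died after a day." Sentiment:"""

# Few-shot prompts cost more tokens but usually anchor the output format.
for name, prompt in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    response = model.generate(prompt, temperature=0.3, max_tokens=5)
    print(f"{name}: {response.text.strip()}")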

Experiment 4: Prompt Sensitivity (10 min)

Goal: See how wording affects output.

  1. Go to "🔍 Sensitivity" tab
  2. Select: "tone_changes"
  3. Run the test

Compare prompts:

"Explain machine learning."
"Explain machine learning simply."
"Explain machine learning technically."

Notice:

  • Does output complexity match the prompt?
  • Which prompt gives the most useful answer?
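
To test wording effects on prompts of your own, a small loop is enough (parameters are illustrative):

from models import get_model

model = get_model("ollama", "llama2")

# Identical topic, different framing: compare length, vocabulary, and depth.
prompts = [
    "Explain machine learning.",
    "Explain machine learning simply.",
    "Explain machine learning technically.",
]

for p in prompts:
    response = model.generate(p, temperature=0.7, max_tokens=120)
    print(f">>> {p}\n{response.text.strip()}\n")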

Experiment 5: Analyze Your Logs (5 min)

Goal: Learn from your experiments.

  1. Go to "📋 Logs" tab
  2. Review your interactions
  3. Look for patterns

Questions:

  • Which experiments used the most tokens?
  • What was your average latency?
  • Which prompts worked best?

🎯 Advanced Tutorial: Deep Dives

Deep Dive 1: Perfect the Prompt (15 min)

Goal: Systematically improve a prompt.

Task: Get the model to write a product description.

Iteration 1 (Baseline):

"Write a product description"
  • Vague, generic output

Iteration 2 (Add specificity):

"Write a product description for a wireless headphone"
  • Better, but still generic

Iteration 3 (Add constraints):

"Write a 50-word product description for a wireless headphone, 
highlighting comfort, battery life, and sound quality"
  • Much more focused!

Iteration 4 (Add examples):

Example description:
"The ErgoMouse 3000 redefines comfort with its..."

Now write a description for wireless headphones, 
highlighting comfort, battery life, and sound quality. (50 words)
  • Best results!

Log each iteration and compare.
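
A quick harness for running all four iterations back to back (the example-based prompt is abbreviated to the text shown above; temperature and token limit are illustrative):

from models import get_model

model = get_model("ollama", "llama2")

iterations = [
    "Write a product description",
    "Write a product description for a wireless headphone",
    "Write a 50-word product description for a wireless headphone, "
    "highlighting comfort, battery life, and sound quality",
    'Example description:\n"The ErgoMouse 3000 redefines comfort with its..."\n\n'
    "Now write a description for wireless headphones, highlighting "
    "comfort, battery life, and sound quality. (50 words)",
]

# Each refinement narrows the model's search space; compare focus and length.
for i, prompt in enumerate(iterations, 1):
    response = model.generate(prompt, temperature=0.7, max_tokens=120)
    print(f"--- Iteration {i} ---\n{response.text.strip()}\n")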


Deep Dive 2: Temperature Sweet Spots (20 min)

Goal: Find optimal temperature for different tasks.

Test each task with temperatures: 0.1, 0.3, 0.5, 0.7, 0.9, 1.2

Tasks:

  1. Math: "Calculate 15% of 250"
  2. Facts: "What is the capital of Australia?"
  3. Summary: "Summarize: [paste a paragraph]"
  4. Creative: "Write a haiku about technology"
  5. Code: "Write a Python function to reverse a string"

Create a table:

Task       Best Temp   Why
Math       0.1         Need exact answer
Facts      0.3         ...
Summary    ...         ...
Creative   ...         ...
Code       ...         ...
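
A sketch of the sweep (add your own summary text; the short math and facts tasks make correctness easy to eyeball, since 15% of 250 is 37.5 and Australia's capital is Canberra):

from models import get_model

model = get_model("ollama", "llama2")

tasks = {
    "math": "Calculate 15% of 250",                # exact answer: 37.5
    "facts": "What is the capital of Australia?",  # Canberra
    "creative": "Write a haiku about technology",
    "code": "Write a Python function to reverse a string",
}

for name, prompt in tasks.items():
    for temp in (0.1, 0.3, 0.5, 0.7, 0.9, 1.2):
        response = model.generate(prompt, temperature=temp, max_tokens=100)
        print(f"{name} @ {temp}: {response.text.strip()[:80]!r}")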

Deep Dive 3: Context Window Exploration (15 min)

Goal: Understand context limits.

  1. Create prompts of varying lengths:

    • Short: 50 words
    • Medium: 200 words
    • Long: 500 words
    • Very long: 1000 words
  2. Ask the same question at the end of each

  3. Observe:

    • Does latency scale linearly?
    • Does quality degrade?
    • At what point does it fail?

Use the CLI for this:

python cli.py generate "$(cat long_text.txt) Question: What is the main topic?"

Deep Dive 4: Model Comparison (20 min)

Goal: Understand model tradeoffs.

Test prompt: "Explain photosynthesis"

Models to test:

  • llama2
  • mistral (if installed: ollama pull mistral)
  • phi (if installed: ollama pull phi)

Compare:

Model     Speed   Quality   Token Efficiency
llama2    ...     ...       ...
mistral   ...     ...       ...
phi       ...     ...       ...
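
A timing sketch, assuming each model has been pulled and that `get_model` accepts any installed Ollama model name:

import time
from models import get_model

# Requires: ollama pull mistral / ollama pull phi (llama2 comes from setup).
for name in ("llama2", "mistral", "phi"):
    model = get_model("ollama", name)
    start = time.perf_counter()
    response = model.generate("Explain photosynthesis",
                              temperature=0.7, max_tokens=200)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.1f}s\n{response.text[:200]}\n")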

Which would you choose for:

  • Development/testing?
  • Production?
  • Mobile/edge deployment?

🔬 Mini-Projects (30-60 min each)

Project 1: Sentiment Analysis Dashboard

Goal: Build a simple sentiment analyzer.

# sentiment_analyzer.py
from models import get_model
from logger import get_logger

model = get_model("ollama", "llama2")
logger = get_logger()

def analyze_sentiment(text):
    prompt = f"""Classify the sentiment as Positive, Negative, or Neutral.
    
    Text: "{text}"
    
    Sentiment:"""
    
    response = model.generate(prompt, temperature=0.3, max_tokens=5)
    return response.text.strip()

# Test it
reviews = [
    "This product is amazing!",
    "Terrible experience, would not recommend.",
    "It's okay, nothing special.",
]

for review in reviews:
    sentiment = analyze_sentiment(review)
    print(f"{review} → {sentiment}")

Extensions:

  • Add confidence scores
  • Batch process multiple reviews
  • Create a Streamlit UI

Project 2: Smart Summarizer

Goal: Summarize text with different styles.

from models import get_model

model = get_model("ollama", "llama2")

def summarize(text, style="concise", max_words=50):
    prompts = {
        "concise": f"Summarize in {max_words} words:\n\n{text}",
        "bullet": f"Summarize in {max_words} words using bullet points:\n\n{text}",
        "eli5": f"Explain this like I'm 5, in {max_words} words:\n\n{text}",
    }
    
    response = model.generate(prompts[style], temperature=0.5, max_tokens=100)
    return response.text

# Test with different styles
article = "..." # Your text here

for style in ["concise", "bullet", "eli5"]:
    print(f"\n{style.upper()}:")
    print(summarize(article, style))

Project 3: Code Documentation Generator

Goal: Auto-generate docstrings.

from models import get_model

model = get_model("ollama", "llama2")

def generate_docstring(code):
    prompt = f"""Write a Python docstring for this function:

{code}

Docstring (use Google style):"""
    
    response = model.generate(prompt, temperature=0.3, max_tokens=200)
    return response.text

# Example
code = """
def calculate_average(numbers):
    return sum(numbers) / len(numbers)
"""

print(generate_docstring(code))

📊 Analysis Exercises

Exercise 1: Token Economics

Goal: Understand cost implications.

  1. Run 10 different prompts
  2. Check logs for token usage
  3. Calculate:
    • Average tokens per request
    • Most expensive prompt
    • Most efficient prompt (quality/token)

If using OpenAI:

  • Calculate total cost
  • Estimate monthly cost for 1000 requests
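
Back-of-the-envelope cost math looks like this; the per-token prices below are placeholders, so substitute your provider's current rates:

# Placeholder prices per 1K tokens -- check your provider's pricing page.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in dollars."""
    input_cost = (input_tokens / 1000) * PRICE_PER_1K_INPUT
    output_cost = (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return input_cost + output_cost

per_request = estimate_cost(input_tokens=300, output_tokens=150)
print(f"Per request: ${per_request:.4f}")
print(f"Per 1000 requests/month: ${per_request * 1000:.2f}")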

Exercise 2: Latency Analysis

Goal: Understand performance characteristics.

  1. Measure latency for different prompt lengths
  2. Plot: Prompt Length (tokens) vs Latency (ms)
  3. Calculate tokens per second

Expected pattern:

  • Fixed overhead (~50-100ms)
  • Linear scaling with output length
  • Local models: ~5-20 tokens/sec
  • API models: ~50-100 tokens/sec
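
You can measure throughput with a stopwatch around the call; whitespace splitting undercounts subword tokens, so treat the result as a rough lower bound:

import time
from models import get_model

model = get_model("ollama", "llama2")

start = time.perf_counter()
response = model.generate("List ten uses for a paperclip.",
                          temperature=0.7, max_tokens=200)
elapsed = time.perf_counter() - start

# Word count approximates token count (real tokenizers emit more tokens).
approx_tokens = len(response.text.split())
print(f"{elapsed:.2f}s total, ~{approx_tokens / elapsed:.1f} tokens/sec")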

Exercise 3: Quality Assessment

Goal: Quantify output quality.

Create a rubric:

def rate_response(response, task_type):
    """Rate 1-5 on:
    - Accuracy: Is it correct?
    - Relevance: Does it answer the question?
    - Completeness: Is it thorough?
    - Clarity: Is it easy to understand?
    """
    score = {"accuracy": 0, "relevance": 0, "completeness": 0, "clarity": 0}
    # Your rating logic: fill in each criterion (manually or model-assisted)
    return score

Test 20 responses and find patterns:

  • Which temperatures give best quality?
  • Does prompt length correlate with quality?
  • Are few-shot prompts always better?

🎯 Challenge Problems

Challenge 1: Chain-of-Thought Implementation

Implement a system that:

  1. Breaks complex questions into steps
  2. Solves each step
  3. Combines results

Test with: "If a store has 20% off everything, and an item costs $80 after discount, what was the original price?"
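
One possible shape for the solution: a planning step and a solving step, each a separate generation (prompt wording and parameters are illustrative):

from models import get_model

model = get_model("ollama", "llama2")

question = ("If a store has 20% off everything, and an item costs $80 "
            "after discount, what was the original price?")

# Step 1: ask for a plan, not an answer.
plan = model.generate(
    f"Break this problem into numbered steps. Do not solve it yet.\n\n{question}",
    temperature=0.3, max_tokens=150,
).text

# Step 2: solve by following the plan.
answer = model.generate(
    f"{question}\n\nSteps:\n{plan}\n\nFollow the steps and give the final answer:",
    temperature=0.3, max_tokens=200,
).text

print(answer)  # correct result: $80 / (1 - 0.20) = $100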


Challenge 2: Multi-Model Consensus

Query multiple models and:

  1. Compare answers
  2. Find consensus
  3. Flag disagreements

When would this be useful?
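
A minimal majority-vote sketch, assuming all three models are installed locally:

from collections import Counter
from models import get_model

MODEL_NAMES = ("llama2", "mistral", "phi")

def consensus(prompt: str) -> dict:
    """Query each model and report agreement."""
    answers = {}
    for name in MODEL_NAMES:
        model = get_model("ollama", name)
        answers[name] = model.generate(prompt, temperature=0.3,
                                       max_tokens=10).text.strip()
    top, votes = Counter(answers.values()).most_common(1)[0]
    if votes > 1:
        print(f"Consensus ({votes}/{len(MODEL_NAMES)}): {top}")
    else:
        print(f"No consensus, flag for review: {answers}")
    return answers

consensus("What is the capital of Australia? Answer with one word.")

Note that exact string matching only works for short, constrained answers; free-form outputs need a fuzzier comparison.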


Challenge 3: Adaptive Temperature

Build a system that:

  1. Detects task type from prompt
  2. Automatically sets appropriate temperature
  3. Logs whether it chose well
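
A crude starting point: keyword-based task detection mapped to temperatures (a stronger version could ask the model itself to classify the prompt):

from models import get_model

TASK_TEMPS = {"math": 0.1, "facts": 0.3, "summary": 0.5, "creative": 1.0}

def detect_task(prompt: str) -> str:
    """Keyword heuristic for routing a prompt to a temperature."""
    p = prompt.lower()
    if any(w in p for w in ("calculate", "solve", "%")):
        return "math"
    if any(w in p for w in ("poem", "story", "haiku", "imagine")):
        return "creative"
    if "summarize" in p or "summary" in p:
        return "summary"
    return "facts"

def adaptive_generate(prompt: str):
    task = detect_task(prompt)
    temp = TASK_TEMPS[task]
    print(f"[detected task={task}, temperature={temp}]")  # record the choice
    model = get_model("ollama", "llama2")
    return model.generate(prompt, temperature=temp, max_tokens=200)

adaptive_generate("Write a haiku about technology")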

✅ Completion Checklist

After this tutorial, you should be able to:

  • Generate text with any model
  • Explain what temperature does
  • Write effective prompts
  • Use few-shot learning
  • Analyze logs for insights
  • Choose appropriate parameters
  • Estimate token costs
  • Compare different models
  • Build simple LLM applications
  • Debug problematic outputs

🚀 Next Steps

  1. Read the theory: Go through CONCEPTS.md thoroughly
  2. Experiment freely: Use the playground daily
  3. Build something: Pick a project and implement it
  4. Share findings: Document interesting discoveries
  5. Go deeper: Read papers, try advanced techniques

Remember: The best way to learn is by doing. Run lots of experiments, observe patterns, and build intuition. Every interaction teaches you something! 🎓