This guide walks you through your first experiments, from setup to advanced usage.
```bash
# Ensure Ollama is running
ollama serve

# In another terminal, run setup
./setup.sh

# Or manually:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
ollama pull llama2
```

Run the example script:

```bash
python example.py
```

You should see 4 examples execute successfully!

Launch the Streamlit app:

```bash
streamlit run app.py
```

Your browser will open to http://localhost:8501.
Goal: See basic text generation in action.
- Open the Streamlit app
- Click "Connect to Model" in the sidebar
- Go to the "Quick Chat" tab
- Enter prompt: "Explain quantum computing in simple terms"
- Click "Generate"
Observe:
- How long did it take?
- How many tokens?
- Was the response coherent?
Try again with the same prompt:
- Do you get identical output?
- Why or why not? (Hint: temperature)
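If you want to reproduce this outside the UI, here is a minimal sketch that calls the model directly (assuming the same `get_model`/`generate` interface used in the code examples later in this guide):

```python
from models import get_model

model = get_model("ollama", "llama2")
prompt = "Explain quantum computing in simple terms"

# Generate twice with identical settings and compare.
first = model.generate(prompt, temperature=0.7, max_tokens=200)
second = model.generate(prompt, temperature=0.7, max_tokens=200)

# Sampling at temperature > 0 is stochastic, so the two outputs usually differ.
print(first.text == second.text)
```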
Goal: Understand creativity vs determinism.
- Go to the "Temperature" tab
- Use prompt: "Once upon a time, in a distant galaxy"
- Test temperatures: 0.1, 0.7, 1.5
- Set samples: 3
Compare outputs:
| Temperature | What to expect |
|---|---|
| 0.1 | Very similar, focused |
| 0.7 | Balanced, varied |
| 1.5 | Very different, creative |
Questions to answer:
- Which temperature gives the most consistent results?
- Which is best for creative writing?
- Which would you use for factual answers?
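The same sweep can be scripted if you prefer the terminal (same API assumptions as the earlier sketch):

```python
from models import get_model

model = get_model("ollama", "llama2")
prompt = "Once upon a time, in a distant galaxy"

# Three samples per temperature; compare how much the continuations vary.
for temperature in (0.1, 0.7, 1.5):
    print(f"\n--- temperature={temperature} ---")
    for _ in range(3):
        response = model.generate(prompt, temperature=temperature, max_tokens=60)
        print(response.text.strip())
```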
Goal: See how examples improve performance.
- Go to the "Few-Shot" tab
- Select scenario: "sentiment_analysis"
- Test case: Pick any from the dropdown
- Click "Run Comparison"
Analyze:
- Was zero-shot correct?
- Did few-shot improve accuracy?
- How much did token count increase?
Extension: Try with different scenarios:
- entity_extraction
- text_classification
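To see the same effect in code, here is a rough zero-shot vs. few-shot comparison (the example texts below are illustrative and not the app's built-in scenario data):

```python
from models import get_model

model = get_model("ollama", "llama2")
text = "The battery died after two days."

zero_shot = f'Classify the sentiment as Positive, Negative, or Neutral.\nText: "{text}"\nSentiment:'

few_shot = (
    "Classify the sentiment as Positive, Negative, or Neutral.\n"
    'Text: "I love this phone!"\nSentiment: Positive\n'
    'Text: "It broke after a week."\nSentiment: Negative\n'
    f'Text: "{text}"\nSentiment:'
)

# Few-shot prompts cost more tokens but usually classify more reliably.
for name, prompt in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    response = model.generate(prompt, temperature=0.3, max_tokens=5)
    print(f"{name}: {response.text.strip()} ({len(prompt)} prompt characters)")
```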
Goal: See how wording affects output.
- Go to the "Sensitivity" tab
- Select: "tone_changes"
- Run the test
Compare prompts:
"Explain machine learning."
"Explain machine learning simply."
"Explain machine learning technically."
Notice:
- Does output complexity match the prompt?
- Which prompt gives the most useful answer?
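The same comparison as a quick loop, using the prompts above (same assumptions about the model helper as before):

```python
from models import get_model

model = get_model("ollama", "llama2")

prompts = [
    "Explain machine learning.",
    "Explain machine learning simply.",
    "Explain machine learning technically.",
]

# Same question, different wording; compare the length and vocabulary of each answer.
for prompt in prompts:
    response = model.generate(prompt, temperature=0.7, max_tokens=150)
    print(f"\nPROMPT: {prompt}\n{response.text.strip()}")
```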
Goal: Learn from your experiments.
- Go to the "Logs" tab
- Review your interactions
- Look for patterns
Questions:
- Which experiments used the most tokens?
- What was your average latency?
- Which prompts worked best?
Goal: Systematically improve a prompt.
Task: Get the model to write a product description.
Iteration 1 (Baseline):
"Write a product description"
- Vague, generic output
Iteration 2 (Add specificity):
"Write a product description for a wireless headphone"
- Better, but still generic
Iteration 3 (Add constraints):
"Write a 50-word product description for a wireless headphone,
highlighting comfort, battery life, and sound quality"
- Much more focused!
Iteration 4 (Add examples):
Example description:
"The ErgoMouse 3000 redefines comfort with its..."
Now write a description for wireless headphones,
highlighting comfort, battery life, and sound quality. (50 words)
- Best results!
Log each iteration and compare.
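A quick way to run and compare the first three iterations side by side (prompts copied from above; the generation settings are just reasonable defaults, not prescribed values):

```python
from models import get_model

model = get_model("ollama", "llama2")

iterations = [
    "Write a product description",
    "Write a product description for a wireless headphone",
    ("Write a 50-word product description for a wireless headphone, "
     "highlighting comfort, battery life, and sound quality"),
]

for i, prompt in enumerate(iterations, start=1):
    response = model.generate(prompt, temperature=0.7, max_tokens=120)
    print(f"\n--- Iteration {i} ---\n{response.text.strip()}")
```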
Goal: Find optimal temperature for different tasks.
Test each task with temperatures: 0.1, 0.3, 0.5, 0.7, 0.9, 1.2
Tasks:
- Math: "Calculate 15% of 250"
- Facts: "What is the capital of Australia?"
- Summary: "Summarize: [paste a paragraph]"
- Creative: "Write a haiku about technology"
- Code: "Write a Python function to reverse a string"
Create a table:
| Task | Best Temp | Why |
|---|---|---|
| Math | 0.1 | Need exact answer |
| Facts | 0.3 | ... |
| Summary | ... | ... |
| Creative | ... | ... |
| Code | ... | ... |
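A sketch for sweeping the grid so you can fill in the table (add the summary task with your own paragraph; same API assumptions as earlier):

```python
from models import get_model

model = get_model("ollama", "llama2")

tasks = {
    "math": "Calculate 15% of 250",
    "facts": "What is the capital of Australia?",
    "creative": "Write a haiku about technology",
    "code": "Write a Python function to reverse a string",
}

# Run every task at every temperature and skim the outputs.
for name, prompt in tasks.items():
    for temperature in (0.1, 0.3, 0.5, 0.7, 0.9, 1.2):
        response = model.generate(prompt, temperature=temperature, max_tokens=120)
        print(f"[{name} @ {temperature}] {response.text.strip()[:80]}")
```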
Goal: Understand context limits.
- Create prompts of varying lengths:
  - Short: 50 words
  - Medium: 200 words
  - Long: 500 words
  - Very long: 1000 words
- Ask the same question at the end of each
- Observe:
  - Does latency scale linearly?
  - Does quality degrade?
  - At what point does it fail?
Use the CLI for this:

```bash
python cli.py generate "$(cat long_text.txt) Question: What is the main topic?"
```

Goal: Understand model tradeoffs.
Test prompt: "Explain photosynthesis"
Models to test:
- llama2
- mistral (if installed: ollama pull mistral)
- phi (if installed: ollama pull phi)
Compare:
| Model | Speed | Quality | Token Efficiency |
|---|---|---|---|
| llama2 | ... | ... | ... |
| mistral | ... | ... | ... |
| phi | ... | ... | ... |
Which would you choose for:
- Development/testing?
- Production?
- Mobile/edge deployment?
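A rough timing harness for the comparison (this assumes `get_model("ollama", ...)` accepts any pulled model name the same way it accepts "llama2"):

```python
import time
from models import get_model

prompt = "Explain photosynthesis"

for model_name in ["llama2", "mistral", "phi"]:
    model = get_model("ollama", model_name)
    start = time.perf_counter()
    response = model.generate(prompt, temperature=0.7, max_tokens=200)
    elapsed = time.perf_counter() - start
    print(f"{model_name}: {elapsed:.1f}s")
    print(response.text.strip()[:120], "\n")
```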
Goal: Build a simple sentiment analyzer.
```python
# sentiment_analyzer.py
from models import get_model
from logger import get_logger

model = get_model("ollama", "llama2")
logger = get_logger()

def analyze_sentiment(text):
    prompt = f"""Classify the sentiment as Positive, Negative, or Neutral.
Text: "{text}"
Sentiment:"""
    response = model.generate(prompt, temperature=0.3, max_tokens=5)
    return response.text.strip()

# Test it
reviews = [
    "This product is amazing!",
    "Terrible experience, would not recommend.",
    "It's okay, nothing special.",
]

for review in reviews:
    sentiment = analyze_sentiment(review)
    print(f"{review} → {sentiment}")
```

Extensions:
- Add confidence scores
- Batch process multiple reviews
- Create a Streamlit UI
Goal: Summarize text with different styles.
```python
from models import get_model

model = get_model("ollama", "llama2")

def summarize(text, style="concise", max_words=50):
    prompts = {
        "concise": f"Summarize in {max_words} words:\n\n{text}",
        "bullet": f"Summarize in {max_words} words using bullet points:\n\n{text}",
        "eli5": f"Explain this like I'm 5, in {max_words} words:\n\n{text}",
    }
    response = model.generate(prompts[style], temperature=0.5, max_tokens=100)
    return response.text

# Test with different styles
article = "..."  # Your text here

for style in ["concise", "bullet", "eli5"]:
    print(f"\n{style.upper()}:")
    print(summarize(article, style))
```

Goal: Auto-generate docstrings.
```python
from models import get_model

model = get_model("ollama", "llama2")

def generate_docstring(code):
    prompt = f"""Write a Python docstring for this function:
{code}
Docstring (use Google style):"""
    response = model.generate(prompt, temperature=0.3, max_tokens=200)
    return response.text

# Example
code = """
def calculate_average(numbers):
    return sum(numbers) / len(numbers)
"""

print(generate_docstring(code))
```

Goal: Understand cost implications.
- Run 10 different prompts
- Check logs for token usage
- Calculate:
- Average tokens per request
- Most expensive prompt
- Most efficient prompt (quality/token)
If using OpenAI:
- Calculate total cost
- Estimate monthly cost for 1000 requests
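The arithmetic is simple enough to script; the token counts and price below are placeholders you should replace with numbers from your own logs and your provider's pricing page:

```python
# Token counts copied from your logs (placeholder values).
token_counts = [312, 187, 954, 420, 260, 705, 150, 530, 388, 610]

avg_tokens = sum(token_counts) / len(token_counts)
print(f"Average tokens per request: {avg_tokens:.0f}")
print(f"Most expensive request: {max(token_counts)} tokens")

# Only relevant for paid APIs: plug in the real price per 1K tokens.
price_per_1k_tokens = 0.002  # placeholder
monthly_requests = 1000
monthly_cost = avg_tokens / 1000 * price_per_1k_tokens * monthly_requests
print(f"Estimated cost for {monthly_requests} requests/month: ${monthly_cost:.2f}")
```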
Goal: Understand performance characteristics.
- Measure latency for different prompt lengths
- Plot: Prompt Length (tokens) vs Latency (ms)
- Calculate tokens per second
Expected pattern:
- Fixed overhead (~50-100ms)
- Linear scaling with output length
- Local models: ~5-20 tokens/sec
- API models: ~50-100 tokens/sec
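A minimal timing loop to collect the data points (word counts are used as a rough proxy for tokens, since the exact count depends on the tokenizer):

```python
import time
from models import get_model

model = get_model("ollama", "llama2")

for length in (50, 200, 500):
    # Pad the prompt to roughly `length` words, then ask a short question.
    prompt = "lorem ipsum " * (length // 2) + "\n\nSummarize the text above in one sentence."
    start = time.perf_counter()
    response = model.generate(prompt, temperature=0.3, max_tokens=100)
    elapsed = time.perf_counter() - start
    approx_output_tokens = len(response.text.split())
    print(f"~{length}-word prompt: {elapsed:.2f}s, ~{approx_output_tokens / elapsed:.1f} tokens/sec")
```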
Goal: Quantify output quality.
Create a rubric:
```python
def rate_response(response, task_type):
    """Rate 1-5 on:
    - Accuracy: Is it correct?
    - Relevance: Does it answer the question?
    - Completeness: Is it thorough?
    - Clarity: Is it easy to understand?
    """
    score = 0  # Your rating logic here
    return score
```

Test 20 responses and find patterns:
- Which temperatures give best quality?
- Does prompt length correlate with quality?
- Are few-shot prompts always better?
Implement a system that:
- Breaks complex questions into steps
- Solves each step
- Combines results
Test with: "If a store has 20% off everything, and an item costs $80 after discount, what was the original price?"
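One simple way to wire this up is two chained calls, one to plan and one to solve (a sketch under the same API assumptions as the earlier examples; the expected answer is $80 / 0.8 = $100):

```python
from models import get_model

model = get_model("ollama", "llama2")

question = ("If a store has 20% off everything, and an item costs $80 after discount, "
            "what was the original price?")

# Step 1: ask for a plan, not a solution.
plan = model.generate(
    f"Break this problem into short numbered steps, without solving it:\n{question}",
    temperature=0.3, max_tokens=150,
).text

# Step 2: solve by following the plan, at a low temperature for exact arithmetic.
answer = model.generate(
    f"Question: {question}\n\nPlan:\n{plan}\n\nFollow the plan and give the final answer:",
    temperature=0.1, max_tokens=200,
).text

print(plan)
print(answer)  # Expected: $80 is 80% of the original, so 80 / 0.8 = $100
```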
Query multiple models and:
- Compare answers
- Find consensus
- Flag disagreements
When would this be useful?
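A bare-bones version with exact-match voting (real answers need normalization before comparing, and the model names assume you have pulled them):

```python
from collections import Counter
from models import get_model

question = "What is the capital of Australia?"
answers = {}

for model_name in ["llama2", "mistral", "phi"]:
    model = get_model("ollama", model_name)
    answers[model_name] = model.generate(question, temperature=0.1, max_tokens=20).text.strip()

# Crude consensus: count identical answer strings.
votes = Counter(answers.values())
consensus, count = votes.most_common(1)[0]
print(answers)
print(f"Consensus: {consensus!r} ({count}/{len(answers)} models agree)")
```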
Build a system that:
- Detects task type from prompt
- Automatically sets appropriate temperature
- Logs whether it chose well
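One possible shape for this, using a crude keyword heuristic for task detection (the keyword lists and temperature mapping are illustrative, not tuned values; logging whether it chose well is left to you):

```python
from models import get_model

model = get_model("ollama", "llama2")

TEMPERATURES = {"math": 0.1, "facts": 0.3, "creative": 1.0, "default": 0.7}

def detect_task(prompt):
    lowered = prompt.lower()
    if any(word in lowered for word in ("calculate", "solve", "how many")):
        return "math"
    if lowered.startswith(("what is", "who is", "when did")):
        return "facts"
    if any(word in lowered for word in ("story", "poem", "haiku", "imagine")):
        return "creative"
    return "default"

def smart_generate(prompt):
    task = detect_task(prompt)
    temperature = TEMPERATURES[task]
    print(f"[task={task}, temperature={temperature}]")
    return model.generate(prompt, temperature=temperature, max_tokens=200).text

print(smart_generate("Write a haiku about technology"))
```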
After this tutorial, you should be able to:
- Generate text with any model
- Explain what temperature does
- Write effective prompts
- Use few-shot learning
- Analyze logs for insights
- Choose appropriate parameters
- Estimate token costs
- Compare different models
- Build simple LLM applications
- Debug problematic outputs
- Read the theory: Go through CONCEPTS.md thoroughly
- Experiment freely: Use the playground daily
- Build something: Pick a project and implement it
- Share findings: Document interesting discoveries
- Go deeper: Read papers, try advanced techniques
Remember: The best way to learn is by doing. Run lots of experiments, observe patterns, and build intuition. Every interaction teaches you something!