
Feature tabular data#768

Open
paullizer wants to merge 6 commits into Development from feature-tabular-data

Conversation

@paullizer
Contributor

  • Tabular Data Analysis — SK Mini-Agent for Normal Chat

    • Tabular files (CSV, XLSX, XLS, XLSM) detected in search results now trigger a lightweight Semantic Kernel mini-agent that pre-computes data analysis before the main LLM response, bringing to every normal chat conversation the same analytical depth previously available only in full agent mode.
    • Automatic Detection: When AI Search results include tabular files from any workspace (personal, group, or public) or chat-uploaded documents, the system automatically identifies them via the TABULAR_EXTENSIONS configuration and routes the query through the SK mini-agent pipeline.
    • Unified Workspace and Chat Handling: Tabular files are processed identically regardless of their storage location. The plugin resolves blob paths across all four container types (user-documents, group-documents, public-documents, personal-chat) with automatic fallback resolution if the primary source lookup fails. A user asking about an Excel file in their personal workspace gets the same analytical treatment as one asking about a CSV uploaded directly to a chat.
    • Six Data Analysis Functions: The TabularProcessingPlugin exposes describe_tabular_file, aggregate_column (sum, mean, count, min, max, median, std, nunique, value_counts), filter_rows (==, !=, >, <, >=, <=, contains, startswith, endswith), query_tabular_data (pandas query syntax), group_by_aggregate, and list_tabular_files — all registered as Semantic Kernel functions that the mini-agent orchestrates autonomously.
    • Pre-Computed Results Injected as Context: The mini-agent's computed analysis (exact numerical results, aggregations, filtered data) is injected into the main LLM's system context so it can present accurate, citation-backed answers without hallucinating numbers.
    • Graceful Degradation: If the mini-agent analysis fails for any reason, the system falls back to instructing the main LLM to use the tabular processing plugin functions directly, preserving full functionality.
    • Non-Streaming and Streaming Support: Both chat modes are supported. The mini-agent runs synchronously before the main LLM call in both paths.
    • Requires Enhanced Citations: The tabular processing plugin depends on the blob storage client initialized by the enhanced citations system. The enable_enhanced_citations admin setting must be enabled for tabular data analysis to activate.
    • Files Modified: route_backend_chats.py, semantic_kernel_plugins/tabular_processing_plugin.py, config.py.
    • (Ref: run_tabular_sk_analysis(), TabularProcessingPlugin, collect_tabular_sk_citations(), TABULAR_EXTENSIONS)
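The detection-and-routing step can be sketched roughly as follows. `TABULAR_EXTENSIONS` is named in the PR's `config.py`, but its exact value, the helper names, and the result shape here are illustrative assumptions rather than the actual `route_backend_chats.py` code:

```python
import os

# Named in the PR's config.py; the exact value here is an assumption.
TABULAR_EXTENSIONS = {".csv", ".xlsx", ".xls", ".xlsm"}

def is_tabular(file_name: str) -> bool:
    """Check a search-result file name against the tabular extension list."""
    return os.path.splitext(file_name)[1].lower() in TABULAR_EXTENSIONS

def split_search_results(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition AI Search hits: tabular files route to the SK mini-agent path,
    everything else continues through the normal chat pipeline."""
    tabular = [r for r in results if is_tabular(r.get("file_name", ""))]
    other = [r for r in results if not is_tabular(r.get("file_name", ""))]
    return tabular, other
```

Because the check is on the file extension alone, it works the same for workspace documents and chat uploads, matching the unified handling described above.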
  • Tabular Tool Execution Citations

    • Every tool call made by the SK mini-agent during tabular analysis is captured and surfaced as an agent citation, providing full transparency into the data analysis pipeline.
    • Automatic Capture: The existing @plugin_function_logger decorator on all TabularProcessingPlugin functions records each invocation including function name, input parameters, returned results, execution duration, and success/failure status.
    • Citation Format: Tool execution citations appear in the same "Agent Tool Execution" modal used by full agent mode, showing tool_name (e.g., TabularProcessingPlugin.aggregate_column), function_arguments (the exact parameters passed), and function_result (the computed data returned).
    • End-to-End Auditability: Users can verify exactly which aggregations, filters, or queries were run against their data, what parameters were used, and what raw results were returned — before the LLM summarized them into the final response.
    • Files Modified: route_backend_chats.py.
    • (Ref: collect_tabular_sk_citations(), plugin_invocation_logger.py)
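A minimal sketch of how a decorator like `@plugin_function_logger` can capture each invocation. The decorator name comes from the PR; the log structure, field names shown here, and the toy `aggregate_column` body are assumptions for illustration only:

```python
import functools
import time

INVOCATION_LOG = []  # stand-in for the real plugin invocation logger

def plugin_function_logger(func):
    """Record name, arguments, result, duration, and status of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result, status = None, "failure"
        try:
            result = func(*args, **kwargs)
            status = "success"
            return result
        finally:
            INVOCATION_LOG.append({
                "tool_name": f"TabularProcessingPlugin.{func.__name__}",
                "function_arguments": {"args": args, "kwargs": kwargs},
                "function_result": result,
                "duration_s": time.perf_counter() - start,
                "status": status,
            })
    return wrapper

@plugin_function_logger
def aggregate_column(column: str, op: str):
    # Toy stand-in for the real pandas-backed aggregation.
    data = {"amount": [10, 20, 30]}
    return sum(data[column]) if op == "sum" else None
```

Each entry in the log can then be surfaced unchanged as an "Agent Tool Execution" citation, which is what gives the end-to-end auditability described above.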
  • SK Mini-Agent Performance Optimization

    • Reduced typical tabular analysis time from ~74 seconds to an estimated ~30-33 seconds (55-60% reduction) through three complementary optimizations.
    • DataFrame Caching: Per-request in-memory cache eliminates redundant blob downloads. Previously, each of the ~8 tool calls in a typical analysis downloaded and parsed the same file independently. Now the file is downloaded once and subsequent calls read from cache. Cache is automatically scoped to the request (new plugin instance per analysis) and garbage-collected afterward.
    • Pre-Dispatch Schema Injection: File schemas (columns, data types, row counts, and a 3-row preview) are pre-loaded and injected into the SK mini-agent's system prompt before execution begins. This eliminates 2 LLM round-trips that were previously spent on file discovery (list_tabular_files) and schema inspection (describe_tabular_file), allowing the model to jump directly to analysis tool calls.
    • Async Plugin Functions: All six @kernel_function methods were converted to async def wrappers around asyncio.to_thread(). This lets Semantic Kernel's built-in asyncio.gather() truly parallelize batched tool calls (e.g., three simultaneous aggregate_column calls) instead of executing them serially on the event loop.
    • Batching Instructions: The system prompt now instructs the model to batch multiple independent function calls in a single response, reducing LLM round-trips further.
    • Files Modified: tabular_processing_plugin.py, route_backend_chats.py, config.py.
    • (Ref: _df_cache, asyncio.to_thread, pre-dispatch schema injection in run_tabular_sk_analysis())
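The caching and async patterns combine roughly as below. `_df_cache` and `asyncio.to_thread` are referenced in the PR; the class shape, method body, and the dict standing in for a parsed DataFrame are illustrative assumptions:

```python
import asyncio

class TabularProcessingPlugin:
    """One instance per request, so the cache is scoped to a single analysis
    and garbage-collected when the request ends."""

    def __init__(self):
        self._df_cache = {}   # blob path -> parsed file (dict stands in for a DataFrame)
        self.downloads = 0    # counts simulated blob downloads, for demonstration

    def _load_sync(self, path: str) -> dict:
        """Blocking download + parse; served from cache after the first call."""
        if path not in self._df_cache:
            self.downloads += 1  # would be a blob download + pandas parse in the real plugin
            self._df_cache[path] = {"rows": 3, "columns": ["a", "b"]}
        return self._df_cache[path]

    async def describe_tabular_file(self, path: str) -> dict:
        # async wrapper: lets the mini-agent's batched tool calls run in
        # parallel via asyncio.gather() instead of serially on the event loop
        return await asyncio.to_thread(self._load_sync, path)

async def analyze():
    plugin = TabularProcessingPlugin()
    await plugin.describe_tabular_file("sales.csv")          # first call downloads the file
    await asyncio.gather(*(                                  # batched calls hit the cache
        plugin.describe_tabular_file("sales.csv") for _ in range(3)
    ))
    return plugin.downloads
```

Running `asyncio.run(analyze())` performs a single download despite four tool calls, which is the effect the ~8-call analysis relies on for its speedup.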

