Warehouse Assistant
Goal:
Many companies rely on data warehouses to store vast amounts of valuable information. However, accessing this data often requires technical expertise, such as SQL knowledge and a deep understanding of the data schema. As a result, business users must rely on IT teams to fulfill their data requests, which slows down decision-making and delays insights generation. The goal of this Proof of Concept (PoC) is to develop a framework that empowers business users to access data independently. The Warehouse Assistant will understand user requests in natural language, generate accurate SQL queries based on schema knowledge, execute them, and return results seamlessly eliminating the need for technical expertise.
Features:
- Free-Form Query Input:
Users can type or speak their data requests in natural language (e.g., "Show me sales data for Q1 2023"). The system interprets these requests and translates them into actionable queries, making data access intuitive and user-friendly. - Automated SQL Generation:
The assistant automatically generates SQL queries based on user requests, eliminating the need for manual query writing. It also detects and corrects common errors in query generation, such as ambiguous column names or incorrect joins, ensuring accurate and efficient data retrieval. - Handling Large Datasets:
The system is optimized to handle massive amounts of data, including numerous tables and columns, without compromising performance. It leverages advanced query optimization techniques to deliver fast and reliable results, even for complex queries. - Error Detection and Correction:
The system includes robust error-handling mechanisms to diagnose and resolve SQL errors automatically, such as ambiguous column names or syntax issues, improving query success rates.
Challenges:
- Schema Understanding:
The system must accurately interpret and adapt to complex database schemas, including relationships between tables, column naming conventions, and data types. To improve accuracy, additional metadata or schema descriptions may be required.
Solution: Enhance schema understanding by incorporating metadata, data dictionaries, and AI-driven schema mapping techniques. - Ambiguity in User Queries:
Users often provide vague or incomplete queries (e.g., "Show sales data"), which lack the necessary details to construct precise SQL. Disambiguating these queries requires interactive refinement and contextual understanding.
Solution: Implement NLP techniques to identify missing parameters and suggest possible options based on database content. Use follow-up questions to clarify user intent dynamically. - Error Handling in Query Generation and Execution:
SQL errors can arise from issues like ambiguous column names, incorrect joins, or syntax problems. Detecting and resolving these errors automatically is challenging, especially in complex schemas.
Solution: Build robust error detection and correction mechanisms. Use AI to diagnose failed queries, suggest fixes (e.g., resolving ambiguous column names), and retry with alternative approaches. - Security and Access Control:
Ensuring that users only access data they are authorized to view, while maintaining compliance with data privacy regulations, adds another layer of complexity.
Solution: Implement role-based access control (RBAC) to protect sensitive information.
Demo Stand Description
The demo stand enables you to pose questions to the data warehouse in natural language, which are then automatically translated into SQL, executed, and displayed with the results. The schema is based on Classic Models, a retailer specializing in scale models of classic cars. The database includes common business data such as customers, orders, order line items, products, and more. Example questions might include:
- Which territory generates the largest sales volume?
- Which months do sales peak in a year?
- What is the sales growth for 2004 comparing to 2003?
- How many customers do not have sales representative assigned?
- How many products are not included into sales orders?
Database Schema
The database consists of eight tables:
- Offices: Classic Models Inc. has 7 offices worldwide and is headquartered in San Francisco, CA. Based on geography each office is assigned to a sales territory (APAC, NA, EMEA or JAPAN).
- Employees: The company has 23 employees: 6 Execs and 17 Sales Reps, all assigned to one of the company’s seven offices.
- Customers: 122 customers across the world.
- Orders: There are a total of 326 orders, which span the period from 1/1/2003 to 6/1/2005. Orders can be in one of these states:
- In Process (the initial state for all orders),
- Shipped,
- Cancelled (used to indicate that the customer called to cancel the order right after the order was placed and typically before it was shipped),
- Disputed (used to indicate that the customer received the order but doesn’t like it),
- Resolved (used to indicate that the order was disputed, but successfully resolved) or
- On Hold (used to indicate that the order will not ship until a payment is received because the customer’s credit limit has been exceeded).
- Order Details: Each order contains a list of unique products (order line items). Each order line item reflects the negotiated price per product as well as the quantity per product.
- Payments: Payments made by customers against their account
- Products: Classic Models Inc. sells 110 unique models which they purchase from 13 vendors. The models are classified as 7 distinct product lines: Classic Cars, Vintage Cars, Motorcycles, Trucks and Buses, Planes, Ships, Trains. Additionally models are classified based on their scale (e.g. 1:18, 1:72 etc.). For each product the price at which the product was purchased from the vendor (buyPrice) as well as the product’s MSRP are provided.
- Product Lines: Products are classified as 7 distinct product lines. Each product line is associated with a text description, html description as well as with an image.
