Imagine a security administrator at a large enterprise, tasked with finding critical vulnerabilities affecting their Windows servers deployed in the last week. In Vulcan Cyber's ExposureOS, this traditionally required navigating through multiple filtering menus, selecting precise parameters, and constructing complex rule-based queries. Now imagine if they could simply type,
"show critical vulns on windows servers from last week"
and instantly get the data they need. This is the future we've built in partnership with Vulcan Cyber.
Enterprise cybersecurity teams face a critical challenge: managing an overwhelming volume of vulnerability data streaming in from multiple scanners across their organizations. This fragmentation often leads to inefficient prioritization and misdirected remediation efforts. To address this, Vulcan Cyber developed ExposureOS - a unified platform that consolidates scan data, prioritizes risks, and orchestrates remediation at scale.
The platform's powerful rule-based filtering system has long enabled users to create precise queries across multiple parameters:
While this rule-based system provides powerful flexibility, it requires users to:
"Our customers love ExposureOS's comprehensive filtering capabilities, but we recognized an opportunity to make this power more accessible,"
"By partnering with Neradot to develop this natural language interface, we're transforming how security teams interact with their vulnerability data, making sophisticated queries as simple as having a conversation."
This partnership aimed to preserve all the power of Vulcan's existing filtering system while making it dramatically more accessible. The core challenge was bridging the gap between natural human expressions and ExposureOS's structured query format. Consider these example transformations:
The technical journey to enable this transformation presented several unique challenges:
In the following sections, we'll explore how we built this system, from creating the initial training dataset to implementing hybrid approaches that combine the flexibility of LLMs with the reliability of traditional algorithmic solutions...
Want to see the ExposureOS Assistant in action? Check out this short product video:
The incredible “spotlight”-like interface the product team at Vulcan came up with!
Key Takeaways:
1. Start with stakeholder input to build initial dataset examples.
2. Structure training data in three difficulty tiers for systematic learning.
3. Balance formal and informal query styles to match real usage.
Every AI solution is only as good as the data available to train and refine it. This fundamental principle guided our approach to building the dataset. One of our challenges was the absence of clear, consistent usage patterns in how users described their queries. This required us to create a dataset that could accommodate a wide range of natural language expressions, from formal to highly abbreviated.
Our dataset construction process began by thoroughly reviewing the existing rule-based filters and database schemas. We broke down the various filter options available to users, cataloging the different data types, operators, and logical combinations they could employ. This allowed us to identify key "building blocks" that would form the foundation of our prompts.
Next, we gathered input from key stakeholders - product managers, customer success managers, and experienced users. These conversations surfaced common user questions, workflows, and pain points they encountered when interacting with the system. We carefully documented these examples, categorizing them by intent and level of complexity.
A key consideration in building our dataset was the inherent flexibility of natural language. Unlike database queries, which follow a precise syntax, human users can express the same underlying intent in a wide variety of ways - from short, abbreviated phrases to long, detailed sentences. Each valid JSON filter in our database could potentially map to multiple natural language queries, spanning different levels of formality, verbosity, and ambiguity.
For example, this single rule filter:
[
  {
    "category_name": "Vulnerability",
    "parameter_name": "Published Date",
    "operator": "between",
    "value": ["2024-12-01", "2025-01-01"]
  }
]
could be expressed in any of the following ways:
- "vulnerabilities from December 2024"
- "all the vulnerabilities that happened last month"
- "get me the database table for vulnerabilities in the last month of the year"
- "vulnerabilities where the published date is between the first of December and the end of December"
- "vuls Dec 24"
Capturing this variability in our dataset was crucial. We needed to ensure our model could handle not just the formal, technical phrasings, but also the casual, shortened, and ambiguous ways users might describe their needs.
With this foundation, we constructed an initial seed of queries spanning a range of difficulty levels: easy, medium, hard.
- Easy, for example: "Assets from type name Host"
- Medium, for example: "Assets with OS 'AWS Linux' and external facing IP or assets with OS 'Windows'"
- Hard, for example: "Vuln risk high/medium & CVE-2023-22 or risk low & CVE-2024-1223"
For each query, we carefully crafted the expected output - a valid JSON filter. This meticulous process ensured our dataset would provide comprehensive coverage of the types of interactions our model would need to handle.
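To make this concrete, a single item in the seed set might look like the following minimal sketch, pairing a query with its expected filter. The record layout, the "difficulty" tag, and the specific parameter names here are illustrative assumptions, not the production format.

# Hypothetical seed dataset item: a natural language query paired with the
# expected rule filter. Field values are illustrative only.
seed_item = {
    "difficulty": "easy",
    "query": "Assets from type name Host",
    "expected_filter": [
        {
            "category_name": "Asset",
            "parameter_name": "Type Name",
            "operator": "equals",
            "value": ["Host"],
        }
    ],
}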
With our initial seed of "easy," "medium," and "hard" prompts in hand, we leveraged this foundation to systematically expand the dataset. We used the defined difficulty criteria and our detailed understanding of the database schema to programmatically generate hundreds of additional dataset items, covering a wide range of variations. By leveraging programmatic generation and iterative refinement, we systematically addressed the challenge of unclear patterns, ensuring the dataset captured edge cases and diverse user intents.
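As a rough sketch of how such expansion can work, assuming a small catalog of schema building blocks and a handful of phrasing templates (the catalog contents, template wording, and function name are all hypothetical):

import random

# Hypothetical building blocks: categories, their parameters, and an allowed operator.
BUILDING_BLOCKS = {
    "Vulnerability": {"Risk": "equals"},
    "Asset": {"OS": "equals", "Type Name": "equals"},
}

# Phrasing templates ranging from verbose to aggressively abbreviated.
TEMPLATES = [
    "show all {category} where {parameter} is {value}",
    "{category} with {parameter} {value}",
    "{parameter} {value}",
]

def expand_seed(values_by_parameter, per_combo=2):
    """Generate query/filter pairs by combining building blocks with templates."""
    items = []
    for category, params in BUILDING_BLOCKS.items():
        for parameter, operator in params.items():
            for value in values_by_parameter.get(parameter, []):
                for template in random.sample(TEMPLATES, per_combo):
                    items.append({
                        "query": template.format(
                            category=category.lower(),
                            parameter=parameter.lower(),
                            value=value,
                        ),
                        "expected_filter": [{
                            "category_name": category,
                            "parameter_name": parameter,
                            "operator": operator,
                            "value": [value],
                        }],
                    })
    return items

The generated items were then reviewed and corrected by hand, as described in the next step.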
The final step in our dataset construction process involved reviewing both the queries and the expected outputs. We scrutinized each example, validating the accuracy of the JSON filters and how realistic the natural language queries were. Where issues were identified, we made targeted corrections, ensuring the dataset maintained a high standard of quality and consistency.
This multi-faceted approach yielded a comprehensive, diverse, and thoroughly vetted prompt set - laying the foundation for a natural language interface that could handle the full complexity of our users' needs.
Key Takeaways:
1. Complex NoSQL schemas require multi-level validation
2. Pydantic is great for the validation layer
3. Provide clear, actionable feedback for schema violations
4. Handle domain-specific terminology and relationships
A database schema that follows a hierarchical, context-dependent structure—where parent objects determine the allowable values, parameters, and operations of their child objects—is a flexible yet organized model often seen in NoSQL systems. This schema allows for dynamic and nested relationships, such as categories dictating valid parameters, which in turn constrain operators and values. While this structure supports complex data models and domain-specific rules, it introduces challenges when using LLMs for text-to-query tasks. LLMs, trained on generalized language patterns, may struggle to generate queries that strictly adhere to the schema’s hierarchical and context-specific constraints. This can result in invalid queries, such as selecting parameters or operators that are incompatible with a given category. Furthermore, the nested nature of the schema complicates query generation, as LLMs must account for multi-level dependencies, enforce Enum validation dynamically, and ensure syntactic correctness across deeply embedded fields.
While many modern LLMs support "structured outputs," these outputs are typically limited to flat, non-nested, non-conditional rules. It is possible to define keys, simple value types, and even Enums for specific fields, but no LLM or inference engine we are aware of currently supports nested objects as values or enforces context-specific constraints, such as conditioning an Enum on the value of another key. These limitations necessitate a robust post-processing layer to validate and adjust the generated queries.
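To make the hierarchy concrete, here is a simplified, hypothetical fragment of such a schema: each category constrains which parameters are valid, and each parameter constrains its operators and value type. The real ExposureOS schema is far larger and is not reproduced here.

# Hypothetical hierarchical schema fragment. Categories constrain parameters;
# parameters constrain operators and value types.
SCHEMA = {
    "Vulnerability": {
        "Risk": {"operators": ["equals", "in"],
                 "values": ["low", "medium", "high", "critical"]},
        "Published Date": {"operators": ["between", "greater_than"],
                           "values": "date"},
    },
    "Asset": {
        "OS": {"operators": ["equals", "contains"], "values": "string"},
        "Type Name": {"operators": ["equals"], "values": "string"},
    },
}

A rule like the "Published Date" example shown earlier is valid only if its parameter appears under its category and its operator appears in that parameter's allowed list.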
To enforce the hierarchical schema and handle the complexities of nested, context-dependent rules, we utilize Pydantic for validation and error correction. Pydantic provides a structured approach to validate the schema's keys, values, and nested relationships while accommodating conditional logic. This approach was critical for handling the complexities of multi-level NoSQL schemas, ensuring the model's outputs respected both the hierarchical structure and the contextual rules.
The validation process starts with ensuring that the overall schema structure adheres to expected rules, including nested objects and their relationships. Pydantic checks value types rigorously, ensuring that every field matches its defined type, whether it be integers, strings, arrays, or nested objects. In cases where certain fields depend on the values of others—such as Enums conditioned on the parent field's value—Pydantic enforces these constraints programmatically.
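A minimal sketch of such a model, assuming Pydantic v2 and the hypothetical SCHEMA fragment above (the production models, field names, and error handling are more elaborate):

from typing import List, Union
from pydantic import BaseModel, model_validator

class Rule(BaseModel):
    category_name: str
    parameter_name: str
    operator: str
    value: List[Union[str, int]]

    @model_validator(mode="after")
    def check_hierarchy(self):
        # Context-dependent checks: the parameter must belong to the category,
        # and the operator must be allowed for that parameter.
        params = SCHEMA.get(self.category_name)
        if params is None:
            raise ValueError(f"unknown category '{self.category_name}'")
        spec = params.get(self.parameter_name)
        if spec is None:
            raise ValueError(
                f"parameter '{self.parameter_name}' is not valid for "
                f"category '{self.category_name}'"
            )
        if self.operator not in spec["operators"]:
            raise ValueError(
                f"operator '{self.operator}' is not allowed for "
                f"parameter '{self.parameter_name}'"
            )
        return self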
A critical part of the process involves detecting and correcting common schema mistakes generated by the model. For instance, if a rule specifies that selecting a specific date requires an array with a start and end date that are equal, but the model outputs a single value, Pydantic reformats the output into the correct structure. Similarly, the system identifies and resolves "redundant nesting," where unnecessary levels of hierarchy are collapsed into simpler, equivalent flat rules.
After each correction, the schema undergoes re-validation in an iterative process that continues until a valid schema is achieved or an unresolvable issue is encountered. If the schema is valid, it is returned to the user as the final output. If validation fails, the system analyzes the issue - whether it's an invalid user query or a failure by the model to adhere to the schema - and provides the user with detailed feedback: it identifies the specific problem (e.g., a data type mismatch or an invalid Enum value) and offers actionable suggestions or query refinements. By returning a well-defined follow-up instead of a bare error, the system helps users understand the cause of the issue, guides them toward a fix, and lets them quickly refine their query to achieve the desired outcome. This makes the overall process more intuitive and builds the user's confidence in the model. For example: searching for a vulnerability named "qualys," when in fact "qualys" is a vulnerability source (integration).
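Continuing the sketch above, the correction and re-validation flow described in this section can be outlined roughly as follows; the specific corrections, the single-pass simplification, and the feedback wording are illustrative (the real system iterates correction and re-validation until it converges or gives up):

from pydantic import ValidationError

def correct_common_mistakes(rule: dict) -> dict:
    # Example correction from the text: an exact-date rule should be an array
    # with equal start and end dates, but the model emitted a single value.
    if rule.get("operator") == "between" and not isinstance(rule.get("value"), list):
        rule["value"] = [rule["value"], rule["value"]]
    # Further corrections (e.g. collapsing redundant nesting) would go here.
    return rule

def validate_with_feedback(raw_rules: list):
    """Correct, validate, and return either Rule objects or actionable feedback."""
    corrected = [correct_common_mistakes(dict(r)) for r in raw_rules]
    try:
        return [Rule(**r) for r in corrected], None
    except (ValidationError, ValueError) as err:
        feedback = (
            f"Could not build a valid filter: {err}. "
            "Check the category, parameter, and operator, or rephrase the query."
        )
        return None, feedback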
Key Takeaways:
1. Real security teams prefer shorter, value-focused queries
2. Users naturally abbreviate common security terms
3. Adaptation to user patterns improves accuracy significantly
During the beta phase, we released the solution to a select group of users and closely monitored its performance. The results highlighted several key insights:
Our original development process was well-equipped to address these challenges. We systematically collected user queries and model responses, analyzed the patterns and characteristics of failure cases, and quickly adapted. Using our prompt seed expansion framework, we developed a new dataset to test and address the value queries and aggressively abbreviated queries. This process enabled us to refine the model instructions, adjust the schema structure to better handle varied query formats, and provide additional context and examples. These iterative improvements ensured the model's ability to handle a wider range of real-world queries effectively while maintaining a clear structure for further optimizations.
Key Takeaways:
1. Measure both technical accuracy and user satisfaction
2. Test with multiple real-world security scenarios
3. Monitor performance across different usage patterns
4. Continuously gather and incorporate user feedback
To evaluate the system's performance and ensure continuous improvement, we utilized three distinct datasets, each serving a specific purpose:
For each dataset, we evaluate the system against three critical metrics:
This layered evaluation framework, combining diverse datasets and hierarchical metrics, allows us to identify issues with precision. The hierarchical nature of the metrics (JSON → schema → accuracy) ensures that basic issues are addressed before more complex ones, streamlining debugging and improvement efforts.
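A rough illustration of this layered scoring, reusing the validate_with_feedback helper sketched earlier and assuming each evaluation item stores the raw model output next to the expected filter (the field names are hypothetical):

import json

def evaluate(items):
    # Layered metrics: JSON validity -> schema validity -> exact-match accuracy.
    # Each level is only scored if the previous level passed.
    json_valid = schema_valid = exact = 0
    for item in items:
        try:
            parsed = json.loads(item["model_output"])
        except json.JSONDecodeError:
            continue
        json_valid += 1
        rules, _feedback = validate_with_feedback(parsed)
        if rules is None:
            continue
        schema_valid += 1
        if parsed == item["expected_filter"]:
            exact += 1
    n = max(len(items), 1)
    return {
        "json_validity": json_valid / n,
        "schema_validity": schema_valid / n,
        "exact_accuracy": exact / n,
    }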
We periodically test the model against all datasets to monitor for drift and ensure it continues to perform reliably. Before implementing any changes—whether prompt modifications or functional updates—we validate the updates against all datasets to ensure improvements do not inadvertently harm existing use cases. Additionally, the ongoing production dataset provides a mechanism for long-term monitoring and refinement, allowing us to adapt the model to evolving user behaviors and maintain high-quality performance over time.
In addition to these accuracy-focused evaluations, we also measure performance metrics, such as latency and total generation time. These are tested across various regions, input sizes, and times of day to maintain a consistent user experience and optimize the system’s responsiveness.
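One simple way to collect such latency figures, sketched here with a hypothetical run_query entry point standing in for the full NL-to-filter pipeline:

import statistics
import time

def measure_latency(run_query, queries, repeats=5):
    # Time each query end-to-end and report median and p95 latency in seconds.
    # run_query is a hypothetical entry point into the NL-to-filter pipeline.
    samples = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(query)
            samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50": statistics.median(samples),
        "p95": samples[int(0.95 * (len(samples) - 1))],
    }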
Through this comprehensive evaluation approach, we ensure the system remains accurate, adaptable, and high-performing over time.
Quality Report
Across 203 dataset items, the system achieved 100% JSON validity, 99% Pydantic schema validity, and 94% exact-match accuracy.
Performance Report
By combining robust prompt engineering, comprehensive datasets, and iterative evaluation, we built a system that bridges the gap between natural language and complex database queries. Our solution has demonstrated significant improvements in handling a variety of real-world challenges, from ambiguous inputs to dynamically adapting hierarchical schemas. Continuous user feedback and monitoring ensure the system evolves with real-world needs, delivering a reliable and intuitive experience.
This journey has shown us that the future of enterprise software lies not in adding more features or documentation, but in making existing capabilities more accessible. We're now exploring:
Collaboration remains central to our process. Working closely with product owners and end-users, we can push the boundaries of natural language interfaces for databases, paving the way for even more seamless and intuitive user experiences.
Iftach Arbel
A data scientist and AI enthusiast who spends his days making machines smarter. I specialize in custom generative AI and NLP projects. When not fine-tuning language models or diving into predictive analytics, I like to keep up to date with ML research and tackle complex optimization problems.