Imagine a security administrator at a large enterprise, tasked with finding critical vulnerabilities affecting their Windows servers deployed in the last week. In Vulcan Cyber's ExposureOS, this traditionally required navigating through multiple filtering menus, selecting precise parameters, and constructing complex rule-based queries. Now imagine if they could simply type,
"show critical vulns on windows servers from last week"
and instantly get the data they need. This is the future we've built in partnership with Vulcan Cyber.
Enterprise cybersecurity teams face a critical challenge: managing an overwhelming volume of vulnerability data streaming in from multiple scanners across their organizations. This fragmentation often leads to inefficient prioritization and misdirected remediation efforts. To address this, Vulcan Cyber developed ExposureOS - a unified platform that consolidates scan data, prioritizes risks, and orchestrates remediation at scale.
The platform's powerful rule-based filtering system has long enabled users to create precise queries across multiple parameters:
While this rule-based system provides powerful flexibility, it requires users to:
"Our customers love ExposureOS's comprehensive filtering capabilities, but we recognized an opportunity to make this power more accessible,"
"By partnering with Neradot to develop this natural language interface, we're transforming how security teams interact with their vulnerability data, making sophisticated queries as simple as having a conversation."
This partnership aimed to preserve all the power of Vulcan's existing filtering system while making it dramatically more accessible. The core challenge was bridging the gap between natural human expressions and ExposureOS's structured query format. Consider these example transformations:
The technical journey to enable this transformation presented several unique challenges:
In the following sections, we'll explore how we built this system, from creating the initial training dataset to implementing hybrid approaches that combine the flexibility of LLMs with the reliability of traditional algorithmic solutions...
Want to see the ExposureOS Assistant in action? Check out this short product video:
The incredible “spotlight”-like interface the product team at Vulcan came up with!
Key Takeaways:
1. Start with stakeholder input to build initial dataset examples.
2. Structure training data in three difficulty tiers for systematic learning.
3. Balance formal and informal query styles to match real usage.
Every AI solution is only as good as the data available to train and refine it. This fundamental principle guided our approach to building the dataset. One of our challenges was the absence of clear, consistent usage patterns in how users described their queries. This required us to create a dataset that could accommodate a wide range of natural language expressions, from formal to highly abbreviated.
Our dataset construction process began by thoroughly reviewing the existing rule-based filters and database schemas. We broke down the various filter options available to users, cataloging the different data types, operators, and logical combinations they could employ. This allowed us to identify key "building blocks" that would form the foundation of our prompts.
Next, we gathered input from key stakeholders - product managers, customer success managers, and experienced users. These conversations surfaced common user questions, workflows, and pain points they encountered when interacting with the system. We carefully documented these examples, categorizing them by intent and level of complexity.
A key consideration in building our dataset was the inherent flexibility of natural language. Unlike database queries, which follow a precise syntax, human users can express the same underlying intent in a wide variety of ways - from short, abbreviated phrases to long, detailed sentences. Each valid JSON filter in our database could potentially map to multiple natural language queries, spanning different levels of formality, verbosity, and ambiguity.
For example, this single rule filter:
[
  {
    "category_name": "Vulnerability",
    "parameter_name": "Published Date",
    "operator": "between",
    "value": ["2024-12-01", "2025-01-01"]
  }
]
could be expressed in any of the following ways:
- "vulnerabilities from December 2024"
- "all the vulnerabilities that happened last month"
- "get me the database table for vulnerabilities in the last month of the year"
- "vulnerabilities where the published date is between the first of December and the end of December"
- "vuls Dec 24"
Capturing this variability in our dataset was crucial. We needed to ensure our model could handle not just the formal, technical phrasings, but also the casual, shortened, and ambiguous ways users might describe their needs.
With this foundation, we constructed an initial seed of queries spanning a range of difficulty levels: easy, medium, hard.
- Easy, for example: "Assets from type name Host"
- Medium, for example: "Assets with OS 'AWS Linux' and external facing IP or assets with OS 'Windows'"
- Hard, for example: "Vuln risk high/medium & CVE-2023-22 or risk low & CVE-2024-1223"
For each query, we carefully crafted the expected output - a valid JSON filter. This meticulous process ensured our dataset would provide comprehensive coverage of the types of interactions our model would need to handle.
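To make this concrete, a single item in the seed set might look like the following minimal sketch, pairing a query with its expected filter. The record layout, the "difficulty" tag, and the specific parameter names here are illustrative assumptions, not the production format.

# Hypothetical seed dataset item: a natural language query paired with the
# expected rule filter. Field values are illustrative only.
seed_item = {
    "difficulty": "easy",
    "query": "Assets from type name Host",
    "expected_filter": [
        {
            "category_name": "Asset",
            "parameter_name": "Type Name",
            "operator": "equals",
            "value": ["Host"],
        }
    ],
}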
With our initial seed of "easy," "medium," and "hard" prompts in hand, we leveraged this foundation to systematically expand the dataset. We used the defined difficulty criteria and our detailed understanding of the database schema to programmatically generate hundreds of additional dataset items, covering a wide range of variations. By leveraging programmatic generation and iterative refinement, we systematically addressed the challenge of unclear patterns, ensuring the dataset captured edge cases and diverse user intents.
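As a rough sketch of how such expansion can work, assuming a small catalog of schema building blocks and a handful of phrasing templates (the catalog contents, template wording, and function name are all hypothetical):

import random

# Hypothetical building blocks: categories, their parameters, and an allowed operator.
BUILDING_BLOCKS = {
    "Vulnerability": {"Risk": "equals"},
    "Asset": {"OS": "equals", "Type Name": "equals"},
}

# Phrasing templates ranging from verbose to aggressively abbreviated.
TEMPLATES = [
    "show all {category} where {parameter} is {value}",
    "{category} with {parameter} {value}",
    "{parameter} {value}",
]

def expand_seed(values_by_parameter, per_combo=2):
    """Generate query/filter pairs by combining building blocks with templates."""
    items = []
    for category, params in BUILDING_BLOCKS.items():
        for parameter, operator in params.items():
            for value in values_by_parameter.get(parameter, []):
                for template in random.sample(TEMPLATES, per_combo):
                    items.append({
                        "query": template.format(
                            category=category.lower(),
                            parameter=parameter.lower(),
                            value=value,
                        ),
                        "expected_filter": [{
                            "category_name": category,
                            "parameter_name": parameter,
                            "operator": operator,
                            "value": [value],
                        }],
                    })
    return items

The generated items were then reviewed and corrected by hand, as described in the next step.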
The final step in our dataset construction process involved reviewing both the queries and the expected outputs. We scrutinized each example, validating the accuracy of the JSON filters and how realistic the natural language queries were. Where issues were identified, we made targeted corrections, ensuring the dataset maintained a high standard of quality and consistency.
This multi-faceted approach yielded a comprehensive, diverse, and thoroughly vetted prompt set - laying the foundation for a natural language interface that could handle the full complexity of our users' needs.
Key Takeaways:
1. Complex NoSQL schemas require multi-level validation
2. Pydantic is great for the validation layer
3. Provide clear, actionable feedback for schema violations
4. Handle domain-specific terminology and relationships
A database schema that follows a hierarchical, context-dependent structure—where parent objects determine the allowable values, parameters, and operations of their child objects—is a flexible yet organized model often seen in NoSQL systems. This schema allows for dynamic and nested relationships, such as categories dictating valid parameters, which in turn constrain operators and values. While this structure supports complex data models and domain-specific rules, it introduces challenges when using LLMs for text-to-query tasks. LLMs, trained on generalized language patterns, may struggle to generate queries that strictly adhere to the schema’s hierarchical and context-specific constraints. This can result in invalid queries, such as selecting parameters or operators that are incompatible with a given category. Furthermore, the nested nature of the schema complicates query generation, as LLMs must account for multi-level dependencies, enforce Enum validation dynamically, and ensure syntactic correctness across deeply embedded fields.
While many modern LLMs support "structured outputs," these outputs are typically limited to flat, non-nested, non-conditional rules. It is possible to define keys, simple value types, and even Enums for specific fields, but no LLM or inference engine we are aware of currently supports nested objects as values or enforces context-specific constraints, such as conditioning an Enum on the value of another key. These limitations necessitate a robust post-processing layer to validate and adjust the generated queries.
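To make the hierarchy concrete, here is a simplified, hypothetical fragment of such a schema: each category constrains which parameters are valid, and each parameter constrains its operators and value type. The real ExposureOS schema is far larger and is not reproduced here.

# Hypothetical hierarchical schema fragment. Categories constrain parameters;
# parameters constrain operators and value types.
SCHEMA = {
    "Vulnerability": {
        "Risk": {"operators": ["equals", "in"],
                 "values": ["low", "medium", "high", "critical"]},
        "Published Date": {"operators": ["between", "greater_than"],
                           "values": "date"},
    },
    "Asset": {
        "OS": {"operators": ["equals", "contains"], "values": "string"},
        "Type Name": {"operators": ["equals"], "values": "string"},
    },
}

A rule like the "Published Date" example shown earlier is valid only if its parameter appears under its category and its operator appears in that parameter's allowed list.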
To enforce the hierarchical schema and handle the complexities of nested, context-dependent rules, we utilize Pydantic for validation and error correction. Pydantic provides a structured approach to validate the schema's keys, values, and nested relationships while accommodating conditional logic. This approach was critical for handling the complexities of multi-level NoSQL schemas, ensuring the model's outputs respected both the hierarchical structure and the contextual rules.
The validation process starts with ensuring that the overall schema structure adheres to expected rules, including nested objects and their relationships. Pydantic checks value types rigorously, ensuring that every field matches its defined type, whether it be integers, strings, arrays, or nested objects. In cases where certain fields depend on the values of others—such as Enums conditioned on the parent field's value—Pydantic enforces these constraints programmatically.
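A minimal sketch of such a model, assuming Pydantic v2 and the hypothetical SCHEMA fragment above (the production models, field names, and error handling are more elaborate):

from typing import List, Union
from pydantic import BaseModel, model_validator

class Rule(BaseModel):
    category_name: str
    parameter_name: str
    operator: str
    value: List[Union[str, int]]

    @model_validator(mode="after")
    def check_hierarchy(self):
        # Context-dependent checks: the parameter must belong to the category,
        # and the operator must be allowed for that parameter.
        params = SCHEMA.get(self.category_name)
        if params is None:
            raise ValueError(f"unknown category '{self.category_name}'")
        spec = params.get(self.parameter_name)
        if spec is None:
            raise ValueError(
                f"parameter '{self.parameter_name}' is not valid for "
                f"category '{self.category_name}'"
            )
        if self.operator not in spec["operators"]:
            raise ValueError(
                f"operator '{self.operator}' is not allowed for "
                f"parameter '{self.parameter_name}'"
            )
        return self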
A critical part of the process involves detecting and correcting common schema mistakes generated by the model. For instance, if a rule specifies that selecting a specific date requires an array with a start and end date that are equal, but the model outputs a single value, Pydantic reformats the output into the correct structure. Similarly, the system identifies and resolves "redundant nesting," where unnecessary levels of hierarchy are collapsed into simpler, equivalent flat rules.
After each correction, the schema undergoes re-validation in an iterative process that continues until a valid schema is achieved or an unresolvable issue is encountered. If the schema is valid, it is returned to the user as the final output. If validation fails, the system analyzes the issue - whether it's an invalid user query or a failure by the model to adhere to the schema - and provides the user with detailed feedback: it identifies the specific problem (e.g., a data type mismatch or an invalid Enum value) and offers actionable suggestions or query refinements. By returning a well-defined follow-up instead of a bare error, the system helps users understand the cause of the issue, guides them toward a fix, and lets them quickly refine their query to achieve the desired outcome. This makes the overall process more intuitive and builds the user's confidence in the model. For example: searching for a vulnerability named "qualys," when in fact "qualys" is a vulnerability source (integration).
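Continuing the sketch above, the correction and re-validation flow described in this section can be outlined roughly as follows; the specific corrections, the single-pass simplification, and the feedback wording are illustrative (the real system iterates correction and re-validation until it converges or gives up):

from pydantic import ValidationError

def correct_common_mistakes(rule: dict) -> dict:
    # Example correction from the text: an exact-date rule should be an array
    # with equal start and end dates, but the model emitted a single value.
    if rule.get("operator") == "between" and not isinstance(rule.get("value"), list):
        rule["value"] = [rule["value"], rule["value"]]
    # Further corrections (e.g. collapsing redundant nesting) would go here.
    return rule

def validate_with_feedback(raw_rules: list):
    """Correct, validate, and return either Rule objects or actionable feedback."""
    corrected = [correct_common_mistakes(dict(r)) for r in raw_rules]
    try:
        return [Rule(**r) for r in corrected], None
    except (ValidationError, ValueError) as err:
        feedback = (
            f"Could not build a valid filter: {err}. "
            "Check the category, parameter, and operator, or rephrase the query."
        )
        return None, feedback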
Key Takeaways:
1. Real security teams prefer shorter, value-focused queries
2. Users naturally abbreviate common security terms
3. Adaptation to user patterns improves accuracy significantly
During the beta phase, we released the solution to a select group of users and closely monitored its performance. The results highlighted several key insights:
Our original development process was well-equipped to address these challenges. We systematically collected user queries and model responses, analyzed the patterns and characteristics of failure cases, and quickly adapted. Using our prompt seed expansion framework, we developed a new dataset to test and address the value queries and aggressively abbreviated queries. This process enabled us to refine the model instructions, adjust the schema structure to better handle varied query formats, and provide additional context and examples. These iterative improvements ensured the model's ability to handle a wider range of real-world queries effectively while maintaining a clear structure for further optimizations.
Key Takeaways:
1. Measure both technical accuracy and user satisfaction
2. Test with multiple real-world security scenarios
3. Monitor performance across different usage patterns
4. Continuously gather and incorporate user feedback
To evaluate the system's performance and ensure continuous improvement, we utilized three distinct datasets, each serving a specific purpose:
For each dataset, we evaluate the system against three critical metrics:
This layered evaluation framework, combining diverse datasets and hierarchical metrics, allows us to identify issues with precision. The hierarchical nature of the metrics (JSON → schema → accuracy) ensures that basic issues are addressed before more complex ones, streamlining debugging and improvement efforts.
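A rough illustration of this layered scoring, reusing the validate_with_feedback helper sketched earlier and assuming each evaluation item stores the raw model output next to the expected filter (the field names are hypothetical):

import json

def evaluate(items):
    # Layered metrics: JSON validity -> schema validity -> exact-match accuracy.
    # Each level is only scored if the previous level passed.
    json_valid = schema_valid = exact = 0
    for item in items:
        try:
            parsed = json.loads(item["model_output"])
        except json.JSONDecodeError:
            continue
        json_valid += 1
        rules, _feedback = validate_with_feedback(parsed)
        if rules is None:
            continue
        schema_valid += 1
        if parsed == item["expected_filter"]:
            exact += 1
    n = max(len(items), 1)
    return {
        "json_validity": json_valid / n,
        "schema_validity": schema_valid / n,
        "exact_accuracy": exact / n,
    }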
We periodically test the model against all datasets to monitor for drift and ensure it continues to perform reliably. Before implementing any changes—whether prompt modifications or functional updates—we validate the updates against all datasets to ensure improvements do not inadvertently harm existing use cases. Additionally, the ongoing production dataset provides a mechanism for long-term monitoring and refinement, allowing us to adapt the model to evolving user behaviors and maintain high-quality performance over time.
In addition to these accuracy-focused evaluations, we also measure performance metrics, such as latency and total generation time. These are tested across various regions, input sizes, and times of day to maintain a consistent user experience and optimize the system’s responsiveness.
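One simple way to collect such latency figures, sketched here with a hypothetical run_query entry point standing in for the full NL-to-filter pipeline:

import statistics
import time

def measure_latency(run_query, queries, repeats=5):
    # Time each query end-to-end and report median and p95 latency in seconds.
    # run_query is a hypothetical entry point into the NL-to-filter pipeline.
    samples = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(query)
            samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50": statistics.median(samples),
        "p95": samples[int(0.95 * (len(samples) - 1))],
    }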
Through this comprehensive evaluation approach, we ensure the system remains accurate, adaptable, and high-performing over time.
Quality Report
Across 203 dataset items, the system achieved 100% JSON validity, 99% Pydantic schema validity, and 94% exact-match accuracy.
Performance Report
By combining robust prompt engineering, comprehensive datasets, and iterative evaluation, we built a system that bridges the gap between natural language and complex database queries. Our solution has demonstrated significant improvements in handling a variety of real-world challenges, from ambiguous inputs to dynamically adapting hierarchical schemas. Continuous user feedback and monitoring ensure the system evolves with real-world needs, delivering a reliable and intuitive experience.
This journey has shown us that the future of enterprise software lies not in adding more features or documentation, but in making existing capabilities more accessible. We're now exploring:
Collaboration remains central to our process. Working closely with product owners and end-users, we can push the boundaries of natural language interfaces for databases, paving the way for even more seamless and intuitive user experiences.
Iftach Arbel
A data scientist and AI enthusiast who spends his days making machines smarter. I specialize in custom generative AI and NLP projects. When not fine-tuning language models or diving into predictive analytics, I like to keep up to date with ML research and tackle complex optimization problems.