KRITI BEHL
AI Safety, Multimodal Reasoning, Evaluation Systems
Projects

Systems for responsible, multimodal AI.

I design and build evaluation frameworks, jailbreak defenses, and multimodal reasoning systems. Every project below is implemented end to end: data, models, infrastructure, dashboards, and experiments.

AI safety and evaluation Multimodal reasoning DevOps and MLOps

Flagship AI Safety and Evaluation

Core systems that evaluate, constrain, and improve large language models in realistic settings. These projects combine benchmarks, pipelines, and visual observability.

Evaluation Framework

FairEval-Suite

Human-aligned evaluation for generative models

Most LLM evaluation relies on opaque aggregate scores. FairEval-Suite scores models on helpfulness, relevance, clarity, and toxicity, with real-time dashboards and reproducible scripts.

  • Reduces hallucination error by approximately 17–26 percent over prompt-only baselines.
  • Improves clarity scores by 18.9 percent on a controlled benchmark across multiple models.
  • Unifies scoring across GPT-4o, Claude, R1, and other models with a consistent rubric and pipeline.
Python FastAPI MongoDB Chart.js pytest Prometheus
Safety Pipeline

JailBreakDefense

Jailbreak detection and intent-preserving repair

Simple refusal policies often block legitimate users. This pipeline detects jailbreak prompts, extracts intent, repairs them into safe form, and returns aligned responses instead of generic refusals.

  • Decreases false refusals by around 40 percent compared to naive safety filters.
  • Reduces malicious compliance by more than 90 percent on curated adversarial prompts.
  • Preserves underlying user intent with a dedicated repair and re-evaluation stage.
Python Transformers Custom RepairEngine Safety RAG
Benchmark and Dataset

SpeechIntentEval

Misinterpretation and ambiguity evaluation for speech-driven systems

Speech interfaces often fail on indirect, emotionally loaded, or ambiguous phrasing. SpeechIntentEval captures these failure modes and scores models on intent understanding.

  • Curates examples of indirect requests, hedged language, and emotionally coded speech.
  • Evaluates intent detection and response appropriateness across several model baselines.
  • Designed to plug directly into evaluation and regression-testing pipelines.
Python Dataset design Evaluation scripts

Multimodal Reasoning

Systems that reason jointly over speech, image, and context to reduce hallucination and improve grounded answers.

Multimodal System

VoiceVisionReasoner

Joint reasoning over speech, image, and user context

Many models treat voice commands and visual context independently. VoiceVisionReasoner fuses them to handle indirect queries and ambiguous references more reliably.

  • Reduces misinterpretation of indirect speech by about 35 percent on a targeted test set.
  • Improves grounded-response F1 score by roughly 22 percent versus single-modality baselines.
  • Supports uncertainty-aware outputs and basic reasoning traces.
PyTorch HuggingFace OpenAI API TorchVision
Research Direction

Future Multimodal Experiments

Extending safety and evaluation to richer inputs

The next step is combining multimodal perception with explicit safety and uncertainty modeling, and validating models on real-world, noisy user interactions.

  • Planned benchmarks for conversational agents that use both camera and microphone context.
  • Exploratory work on error taxonomies for multimodal misalignment and failure modes.
  • Integration with existing evaluation pipelines for regression testing and continuous monitoring.
Experiment design Benchmarking Safety integration

Systems and Infrastructure

DevOps and MLOps projects that support observability, reliability, and effective deployment of complex systems.

DevOps Analytics

AutoOps-Insight

CI/CD health and failure analyzer

Large projects often lack clear visibility into CI/CD instability. AutoOps-Insight aggregates logs and metrics to highlight flaky tests, slow stages, and recurring failure patterns.

  • Ingests logs and metrics from Jenkins and GitHub Actions.
  • Surfaces recurring failures, high-latency stages, and regression-prone components.
  • Designed to integrate with dashboards used by engineering teams.
Python FastAPI MongoDB Prometheus Grafana
Chaos and Observability

KubePulse

Kubernetes chaos engineering and monitoring

KubePulse injects controlled failures into Kubernetes clusters and tracks system response, providing an explicit feedback loop for resilience and performance tuning.

  • Injects stressors such as pod disruptions and resource pressure.
  • Uses Prometheus and Grafana to visualize system response and recovery times.
  • Supports scenario-based experiments for services and microservices.
FastAPI Docker Kubernetes Prometheus Grafana

Developer Tools and Applications

Smaller tools and products that make developers and users more effective, built on top of modern LLMs and web stacks.

Developer Tool

Chrome Copilot

Log and code debugging assistant in the browser

Chrome Copilot augments developer workflows by interpreting logs, stack traces, and snippets directly in the browser, using LLMs to propose explanations and next steps.

  • Reads console output, errors, and selected text from web tools.
  • Generates hypotheses, summaries, and possible fixes for failures.
  • Designed as a lightweight companion rather than a full IDE.
JavaScript Chrome extension APIs OpenAI API
Application

ResuMate

Resume and job posting analysis assistant

ResuMate parses job descriptions and resumes to highlight gaps, relevant experience, and suggested edits, helping candidates align more quickly with specific roles.

  • Extracts key requirements from job postings and maps them to user experience.
  • Provides structured feedback on skills coverage and examples.
  • Built as a service with a clear API boundary and front-end integration options.
Python FastAPI LLM APIs