Projects
Systems for responsible, multimodal AI.
I design and build evaluation frameworks, jailbreak defenses, and multimodal reasoning systems.
Every project below is implemented end to end: data, models, infrastructure, dashboards, and experiments.
AI safety and evaluation
Multimodal reasoning
DevOps and MLOps
Evaluation Framework
FairEval-Suite
Human-aligned evaluation for generative models
Most LLM evaluation relies on opaque aggregate scores. FairEval-Suite instead scores models separately on
helpfulness, relevance, clarity, and toxicity, with real-time dashboards and reproducible scripts.
- Reduces hallucination error by approximately 17–26 percent over prompt-only baselines.
- Improves clarity scores by 18.9 percent on a controlled benchmark across multiple models.
- Unifies scoring across GPT-4o, Claude, R1, and other models with a consistent rubric and pipeline.
Python
FastAPI
MongoDB
Chart.js
pytest
Prometheus
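As a rough illustration of per-dimension scoring, the sketch below averages each rubric dimension per model instead of collapsing everything into one number. The `RubricScore` class, the `summarize` function, and the numeric scale are assumptions made for this example, not the actual FairEval-Suite code.

```python
from dataclasses import dataclass
from statistics import mean

# The four dimensions named in the project description.
DIMENSIONS = ("helpfulness", "relevance", "clarity", "toxicity")


@dataclass(frozen=True)
class RubricScore:
    """One scored response (hypothetical schema; the scale is an assumption)."""
    model: str
    helpfulness: float
    relevance: float
    clarity: float
    toxicity: float  # assumed convention: higher means more toxic


def summarize(scores):
    """Average every dimension per model, keeping the dimensions separate."""
    by_model = {}
    for score in scores:
        by_model.setdefault(score.model, []).append(score)
    return {
        model: {dim: mean(getattr(row, dim) for row in rows) for dim in DIMENSIONS}
        for model, rows in by_model.items()
    }
```

Keeping the dimensions separate means a model that gains clarity while losing relevance shows up as two distinct movements instead of a flat aggregate.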
Safety Pipeline
JailBreakDefense
Jailbreak detection and intent-preserving repair
Simple refusal policies often block legitimate users. This pipeline detects jailbreak prompts, extracts the user's intent,
rewrites the prompt into a safe form, and returns an aligned response instead of a generic refusal.
- Decreases false refusals by around 40 percent compared to naive safety filters.
- Reduces malicious compliance by more than 90 percent on curated adversarial prompts.
- Preserves underlying user intent with a dedicated repair and re-evaluation stage.
Python
Transformers
Custom RepairEngine
Safety RAG
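The detect, extract, repair, and re-evaluate flow described for JailBreakDefense can be sketched as a small control loop. Everything here is hypothetical: `run_pipeline`, `PipelineResult`, and the three callables stand in for the real detector, intent extractor, and RepairEngine, whose interfaces are not given on this page.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PipelineResult:
    action: str  # "answer", "repaired", or "refuse"
    prompt: str  # prompt to forward to the model; empty when refusing


def run_pipeline(
    prompt: str,
    is_jailbreak: Callable[[str], bool],
    extract_intent: Callable[[str], str],
    repair: Callable[[str], str],
) -> PipelineResult:
    """Route a prompt through detection, repair, and re-evaluation."""
    # Benign prompts pass through untouched.
    if not is_jailbreak(prompt):
        return PipelineResult("answer", prompt)

    # Recover what the user wants, then rewrite it into a safe form.
    intent = extract_intent(prompt)
    repaired = repair(intent)

    # Re-evaluation stage: the repaired prompt must itself pass detection.
    if repaired and not is_jailbreak(repaired):
        return PipelineResult("repaired", repaired)

    # Fall back to refusal only when no safe rewrite exists.
    return PipelineResult("refuse", "")
```

Passing the stages in as callables keeps the control flow testable with toy stand-ins, independent of whichever classifier or model backs each stage.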
Benchmark and Dataset
SpeechIntentEval
Misinterpretation and ambiguity evaluation for speech-driven systems
Speech interfaces often fail on indirect, emotionally loaded, or ambiguous phrasing.
SpeechIntentEval captures these failure modes and scores models on intent understanding.
- Curates examples of indirect requests, hedged language, and emotionally coded speech.
- Evaluates intent detection and response appropriateness across several model baselines.
- Plugs directly into evaluation and regression-testing pipelines.
Python
Dataset design
Evaluation scripts
Multimodal System
VoiceVisionReasoner
Joint reasoning over speech, image, and user context
Many models treat voice commands and visual context independently. VoiceVisionReasoner fuses them to handle
indirect queries and ambiguous references more reliably.
- Reduces misinterpretation of indirect speech by about 35 percent on a targeted test set.
- Improves grounded-response F1 score by roughly 22 percent versus single-modality baselines.
- Supports uncertainty-aware outputs and basic reasoning traces.
PyTorch
HuggingFace
OpenAI API
TorchVision
Research Direction
Future Multimodal Experiments
Extending safety and evaluation to richer inputs
The next step is combining multimodal perception with explicit safety and uncertainty modeling,
and validating models on real-world, noisy user interactions.
- Planned benchmarks for conversational agents that use both camera and microphone context.
- Exploratory work on error taxonomies for multimodal misalignment and failure modes.
- Integration with existing evaluation pipelines for regression testing and continuous monitoring.
Experiment design
Benchmarking
Safety integration
DevOps Analytics
AutoOps-Insight
CI/CD health and failure analyzer
Large projects often lack clear visibility into CI/CD instability. AutoOps-Insight aggregates logs and metrics
to highlight flaky tests, slow stages, and recurring failure patterns.
- Ingests logs and metrics from Jenkins and GitHub Actions.
- Surfaces recurring failures, high-latency stages, and regression-prone components.
- Designed to integrate with dashboards used by engineering teams.
Python
FastAPI
MongoDB
Prometheus
Grafana
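One common way to flag a flaky test is to look for a test that both passed and failed on the same commit. The sketch below applies that heuristic; the `(commit, test_name, passed)` record shape and the `find_flaky_tests` name are assumptions for illustration, not the AutoOps-Insight schema.

```python
from collections import defaultdict


def find_flaky_tests(runs):
    """Return sorted names of tests with mixed outcomes on a single commit.

    `runs` is an iterable of (commit, test_name, passed) tuples, a
    hypothetical shape for parsed Jenkins or GitHub Actions results.
    """
    outcomes = defaultdict(set)
    for commit, test_name, passed in runs:
        outcomes[(commit, test_name)].add(bool(passed))

    # Same code, different results: the test itself is unstable.
    flaky = {
        test_name
        for (_, test_name), seen in outcomes.items()
        if seen == {True, False}
    }
    return sorted(flaky)
```

A test that fails consistently on one commit is not reported here, since that pattern points to a real regression and not to flakiness.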
Chaos and Observability
KubePulse
Kubernetes chaos engineering and monitoring
KubePulse injects controlled failures into Kubernetes clusters and tracks system response, providing an explicit
feedback loop for resilience and performance tuning.
- Injects stressors such as pod disruptions and resource pressure.
- Uses Prometheus and Grafana to visualize system response and recovery times.
- Supports scenario-based experiments for services and microservices.
FastAPI
Docker
Kubernetes
Prometheus
Grafana
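Recovery time after an injected failure can be estimated from a series of health samples. This is a minimal sketch assuming samples arrive as `(timestamp_seconds, healthy)` pairs sorted by time; the `recovery_time` function is hypothetical and does not reflect how KubePulse queries Prometheus.

```python
def recovery_time(samples, injected_at):
    """Seconds from the first unhealthy sample to the next healthy one.

    Only samples at or after `injected_at` are considered. Returns None
    if the service never degraded or never recovered in the window.
    """
    went_down_at = None
    for timestamp, healthy in samples:
        if timestamp < injected_at:
            continue
        if went_down_at is None:
            if not healthy:
                went_down_at = timestamp  # first sign of degradation
        elif healthy:
            return timestamp - went_down_at  # first healthy sample afterwards
    return None
```

The resolution of the result is bounded by the sampling interval, so a short scrape interval matters when comparing recovery times across scenarios.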