Shanghai, China

Engineer Sr, Tech SW | NVIDIA DevTech

Xiangyi Zhang builds faster, leaner paths for accelerated AI.

I am an Engineer Sr, Tech SW in the NVIDIA DevTech group. My work builds on a background in LLM inference acceleration, KV cache compression, heterogeneous inference, long-context serving, and production-ready model systems. My earlier research spans few-shot segmentation, biological image segmentation, and semi-supervised learning.

Email LinkedIn GitHub Resume

LLM Inference

KV cache compression / long-context optimization / layer sharing / attention-forward separation

Frameworks

vLLM / HuggingFace Transformers / OpenVINO / PyTorch

Engineering

distributed inference / multi-GPU deployment / Docker / Linux

Embodied AI

VLA frameworks / ROS / Pi-0 / VLM systems

Current Work

Accelerated AI systems and efficient LLM serving.

NVIDIA DevTech | Apr. 13, 2026 - Present

Engineer Sr, Tech SW

Senior software engineering role in NVIDIA's DevTech group, focused on developer-facing technical software work at the intersection of GPU acceleration, AI systems, and high-performance model deployment.

Joined NVIDIA DevTech on Apr. 13, 2026.
Current focus remains on efficient AI software, inference systems, and accelerated computing.

NVIDIA / DevTech / GPU software / AI systems

Intel Corporation | Aug. 2024 - Apr. 2026

KV Cache Compressed Multi-Expert Collaborative Inference System

A training-free inference stack for high-concurrency and long-context serving, combining adaptive KV cache budgeting, early-exit acceleration, clustered layer sharing, and model-specific expert routing.

Reduced memory usage by 40%+ on MMLU tasks and 30% on 128K retrieval tasks with less than 1% accuracy loss.
Cut inference tokens by 30-50% for selected reasoning workloads while preserving full reasoning on harder cases.
Improved multi-expert collaborative inference accuracy by 15% with routing latency under 10 ms.

vLLM / PagedAttention / PyTorch / distributed inference

Intel Corporation | Oct. 2025 - Apr. 2026

Heterogeneous Attention-Forward Separation Inference System

A heterogeneous architecture that maps memory-intensive attention and KV cache work to Intel CPUs/accelerators while keeping FFN compute on NVIDIA GPUs.

Implemented cross-device pipeline overlap to hide data-transfer latency.
Built a unified inference interface for flexible hardware configurations and dynamic load balancing.

DeepSeek-V3 / Qwen2.5-72B / pipeline parallelism

Intel Corporation | Jul. 2021 - Mar. 2023

High-Performance Inference Deployment and Optimization

Production-focused model deployment and performance tuning across Intel Habana Gaudi, Xeon CPU, and Arc GPU platforms.

Optimized LLaMA, Qwen, Mixtral, and related large models for target hardware.
Designed conversion, containerization, and elastic scaling pipelines for inference services.

Habana Gaudi / Xeon / Arc GPU / ONNX

Research

A computer vision foundation that still shapes the systems work.

ECCV 2020 | ShanghaiTech University

Part-aware Prototype Network for Few-shot Semantic Segmentation

A few-shot semantic segmentation framework that preserves detailed local information with part-aware prototypes and extends to leverage unlabeled data.

Paper Code

PLOS ONE | ShanghaiTech University

Soft X-Ray beta-cell Auto-segmentation

A deep-learning pipeline for organelle segmentation and time-dependent analysis of pancreatic beta-cell structure under glucose and Ex-4 stimulation.

Paper Code

Publications

Selected papers, projects, and talks.

Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs

Kai Yuan*, Christoph Bauinger*, Xiangyi Zhang*, et al.

arXiv, 2024

Paper Code

Part-aware Prototype Network for Few-shot Semantic Segmentation

Xiangyi Zhang*, Yongfei Liu*, Songyang Zhang, Xuming He

European Conference on Computer Vision, 2020

Paper Code

Auto-segmentation and time-dependent systematic analysis of mesoscale cellular structure in beta-cells during insulin secretion

Angdi Li*, Xiangyi Zhang*, Jitin Singla, Kate White, Valentina Loconte, et al.

PLOS ONE, 2022

Paper Code

Graph transformer-based semi-supervised few-shot semantic segmentation

Xiangyi Zhang, Bruce Zhu

Intel AI Everywhere Conference, 2023

Internal talk

Neural Rendering Platform

Xiangyi Zhang, et al.

Intel AI Everywhere Conference, 2024

Internal talk

Timeline

From vision research to accelerated AI infrastructure.

2026 - Present

Engineer Sr, Tech SW, NVIDIA DevTech

Joined NVIDIA DevTech on Apr. 13, 2026, as a senior technical software engineer working on accelerated AI systems.

2021 - 2026

LLM Inference Optimization Engineer, Intel Corporation

LLM inference acceleration, heterogeneous deployment, KV cache compression, and high-throughput serving.

2020

Computer Vision Intern, Microsoft Research Asia

Worked with the Intelligent Multimedia Group on self-supervised video instance segmentation.

2018 - 2021

M.Eng. in Computer Science, ShanghaiTech University

Advised by Prof. Xuming He, with research in computer vision and few-shot semantic segmentation.

2014 - 2018

B.Eng. in Information Security, Yunnan University

Studied software, systems, and security fundamentals.

Contact

Interested in accelerated AI systems, efficient inference, or applied vision research?

Email LinkedIn GitHub Resume Google Scholar ORCID