Research

2026

SkipPar: A Hybrid CPU-GPU Framework for Accelerating LLM Training via Efficient Scheduling of Parameter Updates

D. Joshi, K. Namboori, K. M. Kuriakose, S. Vadhiyar, S. Banerjee, A. Singh

IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 2026 (accepted, first author)

2024

DRL Based Service Migration and Resource Allocation in Vehicular Edge Networks

G. Bolar, D. Joshi, S. P. Chennamsetti, V. K. Tumuluru

3rd International Conference on AI for IoT (AIIoT)

[paper]

2023

A Two-Layer Connected Component Algorithm for Target Extraction Using K-means and Morphology

D. Joshi, A. A. Gangotri, S. P. Chennamsetti, G. Bolar, G. Thiagarajan, S. Gurugopinath

IFIP/IEEE 31st International Conference on Very Large Scale Integration (VLSI-SoC)

[paper]

2022

Performance Comparison of Learning Methods for Soil Parameter Estimation using Hyperspectral Data

G. Bolar, D. Joshi, S. P. Chennamsetti, S. Gurugopinath

8th International Conference on Signal Processing and Communication (ICSC)

[paper]

SkipPar — Hybrid CPU-GPU LLM Training Framework

Designed a co-execution paradigm that overlaps GPU forward/backward passes with concurrent CPU parameter updates. Implemented a 4-thread producer-consumer pipeline with PyTorch DDP hooks. Achieved up to a 17% reduction in end-to-end training time on A100/H100 GPUs with LLaMA-2 (10B) and GPT-2 (9B).

CUDA, PyTorch DDP, LLM Training

PAC-IPV — Prefetch-Aware Cache Replacement

Extended an RRIP-based LLC replacement policy in ZSim with prefetch-aware RRPV assignment. Derived a Markov-chain analytical model of the policy and validated it with Monte Carlo analysis (<1.5% error).

C++, ZSim, Computer Architecture

VAJRA — Heterogeneous GPU/FPGA Edge Cluster

Architected a heterogeneous edge cluster (Raspberry Pi 5, Intel DE10 SoC FPGAs, NVIDIA Jetson Orin) for model-parallel DNN inference. Demonstrated inference on a 400M-parameter model using 4 GB of collective cluster memory.

C, CUDA, FPGA, Embedded Linux

[code]

Joint Resource Allocation in Vehicular Edge Networks

Formulated a joint MDP for resource allocation and service migration across MEC servers. Trained a modified DDPG agent that achieved a 26.67% reduction in service violations. Published at AIIoT 2024.

Python, PyTorch, DDPG

[code]