Current Research Projects


Machine learning for compiler optimizations: Generating high-performance code for increasingly heterogeneous hardware calls for more flexible and adaptive ways to model program behavior under different optimization decisions than traditional heuristic-based cost models provide. We take data-driven approaches in which the code-performance relationship is accurately learned from code representations and profiling results. CogR (PACT ’19) guides the OpenMP runtime scheduler by predicting whether an OpenMP target region will execute faster on the CPU or the GPU using a deep-learning-based predictor model, while MetaTune extends the auto-tuning framework of the TVM deep-learning compiler to reduce autotuning overheads and generate better-optimized code for tensor operations. Most recently, One-Shot Tuner (CC ’22) showed how online autotuning overheads can be practically eliminated with a NAS-inspired performance predictor model trained on a small set of samples (open-sourced).
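The device-placement idea behind CogR can be framed as binary classification over kernel features. The sketch below is purely illustrative (it is not CogR's model): the feature names and the linear scorer standing in for the deep predictor are assumptions for demonstration.

```python
# Illustrative sketch (not CogR's actual model): device placement as binary
# classification over hypothetical features of an OpenMP target region.

def extract_features(region):
    # Hypothetical features: iteration count, FLOPs per byte, bytes moved.
    return [region["trip_count"], region["arith_intensity"], region["bytes_moved"]]

def predict_device(features, weights, bias):
    # A linear scorer stands in for the deep predictor: positive score -> GPU.
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return "GPU" if score > 0 else "CPU"

# Toy weights: large, compute-heavy regions favor the GPU.
weights, bias = [1e-6, 0.5, -1e-9], -2.0
region = {"trip_count": 1_000_000, "arith_intensity": 8.0, "bytes_moved": 4_000_000}
print(predict_device(extract_features(region), weights, bias))  # GPU
```

In the actual systems, the learned model replaces the hand-set weights, and the prediction is consulted by the runtime scheduler before launching the region.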


Compiler technologies for emerging architectures: With the end of Dennard scaling, we are witnessing a major shift in the computer system and microarchitecture design towards exploiting more specialized and lightweight “accelerators” of different types instead of relying on general-purpose processors. Such heterogeneous systems pose an unprecedented challenge for the entire software stack to provide programmability and portability while delivering performance. We work on rethinking compiler and runtime technologies for heterogeneous systems with emerging architectures such as neural processing units (NPU) and compute-augmented memory (NDP/PIM).

PIMFlow (CAL ’22, CGO ’23) proposes software layers specifically designed to accelerate compute-intensive convolutional layers on PIM-DRAM. Integrated with the TVM compiler, PIMFlow provides graph-level optimizations that create more inter-node parallelism so that layers can execute in parallel on the GPU and PIM. While PIMFlow targets inference-time performance, XLA-NDP (CAL ’23) focuses on enabling GPU-PIM parallel execution during model training (open-sourced). ATiM (to appear at ISCA '25) extends an autotuning deep-learning compiler to integrate host and device code generation with search-based optimization for modern PIM hardware.
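One way to picture a graph-level transform that exposes GPU-PIM parallelism is splitting a single conv node into two branches that can run concurrently on different devices. The sketch below is a toy with assumed semantics (it is not PIMFlow's actual transform): the node dictionary, the channel-wise split, and the device labels are invented for illustration.

```python
# Toy sketch (assumed semantics, not PIMFlow's transform): split one conv node
# by output channels into two branches -- one mapped to the GPU, one to PIM --
# whose results would be concatenated back together in the graph.

def split_conv(node, ratio):
    gpu_ch = int(node["out_channels"] * ratio)
    gpu_branch = {**node, "out_channels": gpu_ch, "device": "GPU"}
    pim_branch = {**node, "out_channels": node["out_channels"] - gpu_ch,
                  "device": "PIM"}
    return gpu_branch, pim_branch

conv = {"op": "conv2d", "out_channels": 64, "kernel": 3}
g, p = split_conv(conv, ratio=0.75)
print(g["out_channels"], p["out_channels"])  # 48 16
```

The split ratio itself is a tuning knob: balancing it against the relative throughput of the two devices is exactly the kind of decision the compiler's search has to make.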

Autotuning framework for high-performance computing: The increasing complexity and diversity of modern high-performance computing (HPC) systems demand automated performance-tuning strategies that generalize across architectures and workloads. Our group is developing a multi-level, multi-objective autotuning framework tailored specifically for HPC applications. As a first step toward this effort, HYPERF (to appear at HPDC '25) introduces an autotuning framework for HPC applications that combines TVM's tensor-based autotuning capabilities with an OpenMP C/C++ front end.
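At its core, search-based autotuning repeatedly samples candidate configurations, measures them, and keeps the best. The minimal sketch below assumes a mock cost model in place of a real compile-and-run measurement; it illustrates the loop structure, not HYPERF's implementation.

```python
import random

# Minimal autotuning loop (illustrative, not HYPERF's implementation):
# sample candidate configurations, "measure" each, and keep the best.

def mock_measure(tile, unroll):
    # Stand-in for compiling and running the kernel; lower cost is better.
    # The (hypothetical) optimum here is tile=64, unroll=4.
    return abs(tile - 64) + abs(unroll - 4) * 2

def random_search(trials=200, seed=0):
    rng = random.Random(seed)
    best_cfg, best_cost = None, float("inf")
    for _ in range(trials):
        cfg = (rng.choice([8, 16, 32, 64, 128]), rng.choice([1, 2, 4, 8]))
        cost = mock_measure(*cfg)
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost

print(random_search())
```

Real autotuners replace random sampling with cost-model-guided search so that far fewer expensive measurements are needed, which is precisely the overhead that learned predictors such as One-Shot Tuner attack.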

Accelerator design space exploration: Design space exploration (DSE) is a key research topic for application-specific accelerators, but it has not been extensively explored in the context of processing-in-memory hardware design. Our ICCAD '23 paper introduces a heterogeneous analog computing-in-memory (ACiM) architecture that supports multiple tile and subarray sizes (“big-tile, little-tile”), together with an end-to-end DSE tool that optimizes latency, power, and area simultaneously (open-sourced). Building on this work, NavCim (PACT '24) presents a comprehensive DSE mechanism for heterogeneous ACiM architectures, offering improved search efficiency, an expanded search scope, and accuracy-aware design optimization (open-sourced).
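Optimizing latency, power, and area "at the same time" typically means computing a Pareto front rather than a single winner. The sketch below shows that multi-objective core in its simplest form; it is illustrative only and not the search algorithm used in the papers above.

```python
# Illustrative multi-objective DSE core (not NavCim's algorithm): keep only
# Pareto-optimal design points over (latency, power, area) -- lower is better.

def dominates(a, b):
    # a dominates b if it is no worse in every objective and better in one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(designs):
    return [d for d in designs
            if not any(dominates(other, d) for other in designs if other is not d)]

designs = [
    (10.0, 5.0, 2.0),   # fast but power-hungry
    (20.0, 3.0, 1.5),   # slower, low power and small
    (25.0, 6.0, 3.0),   # dominated: worse than both in every objective
]
print(pareto_front(designs))  # the dominated point is filtered out
```

A real DSE tool layers a search strategy (and, in NavCim's case, accuracy-awareness) on top of this filtering, since exhaustively enumerating tile and subarray combinations quickly becomes intractable.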


OpenCL compiler and runtime support for next-gen supercomputers: Modern supercomputers harness the massive parallelism provided by host CPUs and specialized computing elements such as GPUs and NPUs. In collaboration with ETRI and KISTI, we work on building Korea’s own next-generation supercomputers and on providing OpenCL programming support with optimizing compilers and runtimes.
