Data Scientist · Data Engineer

Hi, I'm William Chen

I build statistical models and data systems that turn messy data into decisions, from Bayesian A/B tests to end-to-end cloud pipelines.

MDS @ UC Irvine · Graduating Dec 2026 · Open to opportunities

Get to Know Me

About Me

I'm William Chen, an Irvine-based data scientist focused on statistical modeling, machine learning, and cloud data engineering. I turn messy data into clear decisions through rigorous analysis and end-to-end systems.

What I Do

I specialize in statistical modeling, A/B testing, and causal inference, building rigorous analyses in R and Python. Recent work spans Bayesian inference, GLMs, and end-to-end ML pipelines deployed on AWS.

Credentials & Experience

I'm a Master of Data Science candidate at UC Irvine with industry experience at SHINSOFT in applied computer vision, where I lifted classification accuracy by 15% on 200K+ images through embedding analysis and model fine-tuning.

Outside the Code

Outside of work, I stay active through weightlifting and enjoy giving back through community service and volunteering.

What I'm Looking For

I'm actively seeking full-time roles in data science or data engineering where I can apply statistical rigor to real product decisions. I thrive in environments that value experimentation, clean pipelines, and cross-functional collaboration.

“Trust the process. Learn from failure. Stay humble.”

Arsenal

My Tech Stack

Languages & Databases

Python · R · SQL

Statistical Methods

Causal Inference · A/B Testing · Bayesian Inference · GLM · Hypothesis Testing · Regularized Regression

ML & Modeling

Random Forest · Logistic Regression · scikit-learn · statsmodels · PyTorch · RAG · Two-Tower Models

Cloud & Tools

AWS S3 · AWS Athena · AWS EC2 · Docker · FastAPI · Streamlit · Git · pandas · NumPy · TypeScript

Career

Work Experience

SHINSOFT CO., LTD.

Taipei, Taiwan

Project Engineer — Data Science Focus

AUG 2024 – FEB 2025

Lifted classification accuracy by 15% on 200K+ camera-captured images by fine-tuning a dedicated model on the underperforming scene type. Diagnosed the indoor vs. outdoor distributional gap through EfficientNet embedding analysis (PCA, t-SNE), which showed that a single general model failed across scene types.
Identified data scarcity, not model capacity, as the root cause of false positives; designed a GAN-based augmentation strategy to expand the minority-scene training set, improving precision by 5%.

Portfolio

Featured Projects

Bayesian Prior Sensitivity

Prior sensitivity in Bayesian logistic regression on birthwt (n=189), showing that rare predictors, not small n, determine when the prior stops mattering.

Statistics · AWS · R

Citi Bike Data Pipeline (AWS)

AWS pipeline integrating NYC Citi Bike trips with Open-Meteo weather on S3 and Athena, structured as raw / processed / analytics layers.

Data Engineering · AWS · Python

Marketing A/B Testing

Bayesian and frequentist A/B testing on 588K users, exposing the gap between statistical significance and practical effect.

Statistics · A/B Testing · R

Cookie Cats A/B Testing

Mobile game retention experiment on 90K players, showing why 1-day and 7-day metrics tell different stories about gate placement.

Statistics · A/B Testing · R

Bike Sharing Demand Forecasting (Poisson)

Count regression diagnosing severe overdispersion (variance/mean = 833) and resolving it with a negative binomial GLM.

Statistics · R

Bike Sharing Demand Forecasting (OLS)

Linear baseline with OLS, Ridge, and Lasso under rolling-origin CV, including full residual diagnostics.

Statistics · R

Customer Churn Prediction

Telecom churn classifier on 500K+ records, where diagnosing a train/test distributional inconsistency lifted accuracy from 57% to 94%.

Statistics · Machine Learning · R

Multi-Class Skill Classification in StarCraft II

Reformulated a published pairwise task into 6-class multinomial classification on 3,340 players, outperforming the baseline in 3 of 4 league pairs.

Statistics · Machine Learning · R

Two-Tower Retrieval for Recommendation

Two-tower neural retrieval on 100K implicit feedback interactions, indexed in ChromaDB and served via FastAPI.

Machine Learning · Deep Learning · Python

View all projects on GitHub →

Academic Background

Education

Master of Data Science

University of California, Irvine

Irvine, California

2025 – 2026

Bachelor of Computer Science & Information Engineering

National Ilan University

Ilan, Taiwan

2020 – 2024

Let's Connect

Let's Work Together

I'm passionate about turning data into actionable insights and building solutions that make an impact. Feel free to reach out by email or on your preferred platform.

Get In Touch

Location

Irvine, California, USA

Available for on-site and remote roles globally, including internships.