ML-Assignment-DNAC1

Assignment — Traditional & Deep Learning pipelines for Tabular, Text, and Image datasets • CO3117

Overview

Project Summary

Status: In Progress

This repository contains an assignment for the Machine Learning course. The goal is to build end-to-end pipelines for tabular, text, and image datasets using both traditional ML and deep learning approaches.

Learning Objectives

  • Design reproducible data preprocessing and feature engineering pipelines.
  • Train, evaluate, and compare traditional and deep-learning models.
  • Work with tabular, text, and image datasets.

Pipelines

Architecture & Steps

Common Components

  1. Exploratory Data Analysis (EDA)
  2. Preprocessing & feature engineering
  3. Model training & hyperparameter tuning
  4. Evaluation & visualizations
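
The training, tuning, and evaluation steps above can be sketched as follows. This is only an illustration on synthetic data from `make_classification`; the model (`LogisticRegression`), the parameter grid, and the F1 scoring choice are assumptions made for the example and are not taken from the notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real pipelines would load a dataset here.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hyperparameter tuning: 3-fold cross-validated grid search over C.
# The grid values are illustrative assumptions.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
    scoring="f1",
)
search.fit(X_train, y_train)
best_params = search.best_params_

# Evaluation on the held-out split with the refit best estimator.
report = classification_report(y_test, search.predict(X_test))
print(best_params)
print(report)
```

Tuning on cross-validation folds of the training split keeps the test split untouched until the final evaluation.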

Per-Modality Pipelines

Tabular

Steps: EDA → imputation → scaling → categorical encoding → feature selection → train/test split → modeling → training → testing
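
A minimal sketch of these tabular steps with scikit-learn, assuming a small synthetic DataFrame with hypothetical columns (`age`, `hours`, `workclass`). `SelectKBest` and `LogisticRegression` are stand-ins for whatever feature selection method and model the notebook uses. Note one deliberate difference from the order listed above: the sketch splits the data first and fits the imputer, scaler, and encoder on the training rows only, so that no test-set statistics leak into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with hypothetical column names.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "hours": rng.integers(10, 60, size=n).astype(float),
    "workclass": rng.choice(["Private", "Gov", "Self"], size=n).astype(object),
})
y = (df["hours"] > 35).astype(int)

# Inject a few missing values so the imputation step has work to do.
df.loc[df.index % 25 == 0, "age"] = np.nan
df.loc[df.index % 40 == 0, "workclass"] = np.nan

# Numeric columns: imputation -> scaling.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: imputation -> one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "hours"]),
    ("cat", categorical, ["workclass"]),
])

# Preprocessing -> feature selection -> model, as one estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectKBest(f_classif, k=3)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Split first, then fit every step on the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=0
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Wrapping the steps in a single `Pipeline` means the same fitted transformations are applied at training and testing time, which helps with reproducibility.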

Text

To Be Added.

Image

To Be Added.

Datasets

Tabular: Adult Census Income — Predict whether income exceeds $50K/yr based on census data.
[UCI Dataset Link]
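
A possible way to load the data with pandas, assuming the raw `adult.data` file from the UCI repository (no header row, comma-plus-space separators, `?` for missing values). The helper name `load_adult`, the label column name `income`, and the 0/1 target encoding are choices made for this sketch; the two sample rows are illustrative values in the raw file's format, not a claim about the notebook's loading code.

```python
import io

import pandas as pd

# Column names as documented for the UCI Adult dataset; "income" is the
# name this sketch assigns to the class label column.
ADULT_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]


def load_adult(source):
    """Read a headerless Adult file (path or file-like) into a DataFrame."""
    df = pd.read_csv(
        source,
        header=None,
        names=ADULT_COLUMNS,
        na_values="?",          # the raw file marks missing values with "?"
        skipinitialspace=True,  # fields are separated by ", "
    )
    # Binary target: 1 if income exceeds $50K/yr, else 0.
    df["income"] = (df["income"] == ">50K").astype(int)
    return df


# Two illustrative rows in the raw file's format.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, White, Male, 0, 0, 60, United-States, >50K\n"
)
adult = load_adult(sample)
print(adult[["age", "workclass", "income"]])
```

For the real dataset, `load_adult` would be called with the path to the downloaded file instead of the in-memory sample.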

Text: To Be Added.

Image: To Be Added.

Run & Reproduce

How to run

Requirements

  • Python 3.12+
  • Jupyter Notebook or Google Colab

Run the notebook in Google Colab:

  1. Open the notebook in Google Colab.
  2. Select Runtime → Run all.
  3. The entire process (installing libraries, loading data, training, and evaluation) runs automatically.

Run the notebook locally:

  1. Create a virtual environment (recommended):
    python -m venv venv
    source venv/bin/activate   # Linux/Mac
    venv\Scripts\activate      # Windows
    
  2. Install dependencies:
    pip install -r requirements.txt
    
  3. Run the notebook with Jupyter:
    cd notebooks
    jupyter notebook notebook_name.ipynb
    
Experiment Results

Model Comparison

The following figure compares the performance of the models used in our pipeline:

[Figure: Model Comparison]

Figure 1: SVM with 90% PCA performs best on the Adult Census Income dataset.
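
For readers unfamiliar with the SVM + PCA combination, here is a rough sketch of how such a model could be assembled in scikit-learn. It assumes that "90% PCA" means keeping enough principal components to explain 90% of the variance, which is how a fractional `n_components` behaves in `PCA`. The RBF kernel, default hyperparameters, and synthetic data are assumptions of the example, so it does not reproduce the result in Figure 1.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed feature matrix.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale -> PCA keeping 90% of the variance -> SVM classifier.
svm_pca = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90, svd_solver="full"),
    SVC(kernel="rbf"),
)
svm_pca.fit(X_train, y_train)

pca = svm_pca.named_steps["pca"]
n_kept = int(pca.n_components_)
explained = float(pca.explained_variance_ratio_.sum())
test_accuracy = svm_pca.score(X_test, y_test)

print(f"components kept: {n_kept}, variance explained: {explained:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

Scaling before PCA matters here because PCA is sensitive to feature scale, and the SVM is then trained on the reduced components rather than on the original features.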