| """ | |
| Title: Structured data classification from scratch | |
| Author: [fchollet](https://twitter.com/fchollet) | |
| Date created: 2020/06/09 | |
| Last modified: 2020/06/09 | |
| Description: Binary classification of structured data including numerical and categorical features. | |
| Accelerator: GPU | |
| Made backend-agnostic by: [Humbulani Ndou](https://github.com/Humbulani1234) | |
| """ | |
| """ | |
| ## Introduction | |
| This example demonstrates how to do structured data classification, starting from a raw | |
| CSV file. Our data includes both numerical and categorical features. We will use Keras | |
| preprocessing layers to normalize the numerical features and vectorize the categorical | |
| ones. | |
| Note that this example should be run with TensorFlow 2.5 or higher. | |
| ### The dataset | |
| [Our dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) is provided by the | |
| Cleveland Clinic Foundation for Heart Disease. | |
| It's a CSV file with 303 rows. Each row contains information about a patient (a | |
| **sample**), and each column describes an attribute of the patient (a **feature**). We | |
| use the features to predict whether a patient has a heart disease (**binary | |
| classification**). | |
| Here's the description of each feature: | |
| Column| Description| Feature Type | |
| ------------|--------------------|---------------------- | |
| Age | Age in years | Numerical | |
| Sex | (1 = male; 0 = female) | Categorical | |
| CP | Chest pain type (0, 1, 2, 3, 4) | Categorical | |
| Trestbpd | Resting blood pressure (in mm Hg on admission) | Numerical | |
| Chol | Serum cholesterol in mg/dl | Numerical | |
| FBS | fasting blood sugar in 120 mg/dl (1 = true; 0 = false) | Categorical | |
| RestECG | Resting electrocardiogram results (0, 1, 2) | Categorical | |
| Thalach | Maximum heart rate achieved | Numerical | |
| Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical | |
| Oldpeak | ST depression induced by exercise relative to rest | Numerical | |
| Slope | Slope of the peak exercise ST segment | Numerical | |
| CA | Number of major vessels (0-3) colored by fluoroscopy | Both numerical & categorical | |
| Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical | |
| Target | Diagnosis of heart disease (1 = true; 0 = false) | Target | |
| """ | |
| """ | |
| ## Setup | |
| """ | |
| import os | |
| os.environ["KERAS_BACKEND"] = "torch" # or torch, or tensorflow | |
| import pandas as pd | |
| import keras | |
| from keras import layers | |
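
"""
As a quick sanity check, we can confirm which backend Keras picked up (note that
`KERAS_BACKEND` must be set before `keras` is first imported):
"""

print(keras.backend.backend())  # e.g. "torch"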
| """ | |
| ## Preparing the data | |
| Let's download the data and load it into a Pandas dataframe: | |
| """ | |
| file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv" | |
| dataframe = pd.read_csv(file_url) | |
| """ | |
| The dataset includes 303 samples with 14 columns per sample (13 features, plus the target | |
| label): | |
| """ | |
| dataframe.shape | |
| """ | |
| Here's a preview of a few samples: | |
| """ | |
| dataframe.head() | |
| """ | |
| The last column, "target", indicates whether the patient has a heart disease (1) or not | |
| (0). | |
| Let's split the data into a training and validation set: | |
| """ | |
| val_dataframe = dataframe.sample(frac=0.2, random_state=1337) | |
| train_dataframe = dataframe.drop(val_dataframe.index) | |
| print( | |
| f"Using {len(train_dataframe)} samples for training " | |
| f"and {len(val_dataframe)} for validation" | |
| ) | |
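
"""
Since the dataset is small, it is worth an optional quick check that the split keeps a
reasonable class balance; we look at the fraction of positive samples in each split:
"""

print(train_dataframe["target"].mean(), val_dataframe["target"].mean())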
| """ | |
| ## Define dataset metadata | |
| Here, we define the metadata of the dataset that will be useful for reading and | |
| parsing the data into input features, and encoding the input features with respect | |
| to their types. | |
| """ | |
| COLUMN_NAMES = [ | |
| "age", | |
| "sex", | |
| "cp", | |
| "trestbps", | |
| "chol", | |
| "fbs", | |
| "restecg", | |
| "thalach", | |
| "exang", | |
| "oldpeak", | |
| "slope", | |
| "ca", | |
| "thal", | |
| "target", | |
| ] | |
| # Target feature name. | |
| TARGET_FEATURE_NAME = "target" | |
| # Numeric feature names. | |
| NUMERIC_FEATURE_NAMES = ["age", "trestbps", "thalach", "oldpeak", "slope", "chol"] | |
| # Categorical features and their vocabulary lists. | |
| # Note that we add 'v=' as a prefix to all categorical feature values to make | |
| # sure that they are treated as strings. | |
| CATEGORICAL_FEATURES_WITH_VOCABULARY = { | |
| feature_name: sorted( | |
| [ | |
| # Integer categorcal must be int and string must be str | |
| value if dataframe[feature_name].dtype == "int64" else str(value) | |
| for value in list(dataframe[feature_name].unique()) | |
| ] | |
| ) | |
| for feature_name in COLUMN_NAMES | |
| if feature_name not in list(NUMERIC_FEATURE_NAMES + [TARGET_FEATURE_NAME]) | |
| } | |
| # All features names. | |
| FEATURE_NAMES = NUMERIC_FEATURE_NAMES + list( | |
| CATEGORICAL_FEATURES_WITH_VOCABULARY.keys() | |
| ) | |
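
"""
As a sanity check, we can peek at a couple of the inferred vocabularies (the exact
values depend on the CSV contents):
"""

for name in ["sex", "thal"]:
    print(name, "->", CATEGORICAL_FEATURES_WITH_VOCABULARY[name])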
| """ | |
| ## Feature preprocessing with Keras layers | |
| The following features are categorical features encoded as integers: | |
| - `sex` | |
| - `cp` | |
| - `fbs` | |
| - `restecg` | |
| - `exang` | |
| - `ca` | |
| We will encode these features using **one-hot encoding**. We have two options | |
| here: | |
| - Use `CategoryEncoding()`, which requires knowing the range of input values | |
| and will error on input outside the range. | |
| - Use `IntegerLookup()` which will build a lookup table for inputs and reserve | |
| an output index for unkown input values. | |
| For this example, we want a simple solution that will handle out of range inputs | |
| at inference, so we will use `IntegerLookup()`. | |
| We also have a categorical feature encoded as a string: `thal`. We will create an | |
| index of all possible features and encode output using the `StringLookup()` layer. | |
| Finally, the following feature are continuous numerical features: | |
| - `age` | |
| - `trestbps` | |
| - `chol` | |
| - `thalach` | |
| - `oldpeak` | |
| - `slope` | |
| For each of these features, we will use a `Normalization()` layer to make sure the mean | |
| of each feature is 0 and its standard deviation is 1. | |
| Below, we define 2 utility functions to do the operations: | |
| - `encode_numerical_feature` to apply featurewise normalization to numerical features. | |
| - `process` to one-hot encode string or integer categorical features. | |
| """ | |
# TensorFlow is still required for tf.data.Dataset.
import tensorflow as tf


# We preprocess the categorical features here, converting them to one-hot vectors up
# front, so that this step does not have to happen during model training (only the
# TensorFlow backend supports string tensors).
def encode_categorical(features, target):
    for feature_name in features:
        if feature_name in CATEGORICAL_FEATURES_WITH_VOCABULARY:
            lookup_class = (
                layers.StringLookup
                if features[feature_name].dtype == "string"
                else layers.IntegerLookup
            )
            vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]
            # Create a lookup layer that maps raw values to one-hot vectors.
            # Since we are not using a mask token nor expecting any out of vocabulary
            # (oov) token, we set mask_token to None and num_oov_indices to 0.
            index = lookup_class(
                vocabulary=vocabulary,
                mask_token=None,
                num_oov_indices=0,
                output_mode="binary",
            )
            # One-hot encode the raw categorical value.
            features[feature_name] = index(features[feature_name])

    # Change features from OrderedDict to Dict to match Inputs as they are Dict.
    return dict(features), target
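
"""
To make the lookup behavior concrete, here is a minimal sketch on a toy vocabulary,
mirroring the arguments used above: with `output_mode="binary"`, a value is mapped to a
one-hot vector over the vocabulary.
"""

demo_lookup = layers.IntegerLookup(
    vocabulary=[0, 1, 2, 3],
    mask_token=None,
    num_oov_indices=0,
    output_mode="binary",
)
# The value 2 maps to the one-hot vector [0, 0, 1, 0].
print(demo_lookup(tf.constant([2])))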
def encode_numerical_feature(feature, name, dataset):
    # Create a Normalization layer for our feature.
    normalizer = layers.Normalization()

    # Prepare a Dataset that only yields our feature.
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the statistics of the data.
    normalizer.adapt(feature_ds)

    # Normalize the input feature.
    encoded_feature = normalizer(feature)
    return encoded_feature
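
"""
The `Normalization` layer learns the feature's mean and variance when `adapt()` is
called. A minimal sketch with made-up values:
"""

demo_norm = layers.Normalization()
demo_norm.adapt(tf.constant([[120.0], [140.0], [160.0]]))
# A value equal to the adapted mean is standardized to ~0.
print(demo_norm(tf.constant([[140.0]])))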
| """ | |
| Let's generate `tf.data.Dataset` objects for each dataframe: | |
| """ | |
| def dataframe_to_dataset(dataframe): | |
| dataframe = dataframe.copy() | |
| labels = dataframe.pop("target") | |
| ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels)).map( | |
| encode_categorical | |
| ) | |
| ds = ds.shuffle(buffer_size=len(dataframe)) | |
| return ds | |
| train_ds = dataframe_to_dataset(train_dataframe) | |
| val_ds = dataframe_to_dataset(val_dataframe) | |
| """ | |
| Each `Dataset` yields a tuple `(input, target)` where `input` is a dictionary of features | |
| and `target` is the value `0` or `1`: | |
| """ | |
| for x, y in train_ds.take(1): | |
| print("Input:", x) | |
| print("Target:", y) | |
| """ | |
| Let's batch the datasets: | |
| """ | |
| train_ds = train_ds.batch(32) | |
| val_ds = val_ds.batch(32) | |
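
"""
Optionally, we can also prefetch batches so that data preparation overlaps with model
execution; this is a standard `tf.data` idiom and does not change the data itself:
"""

train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.prefetch(tf.data.AUTOTUNE)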
| """ | |
| ## Build a model | |
| With this done, we can create our end-to-end model: | |
| """ | |
| # Categorical features have different shapes after the encoding, dependent on the | |
| # vocabulary or unique values of each feature. We create them accordinly to match the | |
| # input data elements generated by tf.data.Dataset after pre-processing them | |
| def create_model_inputs(): | |
| inputs = {} | |
| # This a helper function for creating categorical features | |
| def create_input_helper(feature_name): | |
| num_categories = len(CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]) | |
| inputs[feature_name] = layers.Input( | |
| name=feature_name, shape=(num_categories,), dtype="int64" | |
| ) | |
| return inputs | |
| for feature_name in FEATURE_NAMES: | |
| if feature_name in CATEGORICAL_FEATURES_WITH_VOCABULARY: | |
| # Categorical features | |
| create_input_helper(feature_name) | |
| else: | |
| # Make them float32, they are Real numbers | |
| feature_input = layers.Input(name=feature_name, shape=(1,), dtype="float32") | |
| # Process the Inputs here | |
| inputs[feature_name] = encode_numerical_feature( | |
| feature_input, feature_name, train_ds | |
| ) | |
| return inputs | |
# This layer defines the classification logic of the model.
class Classifier(keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.dense_1 = layers.Dense(32, activation="relu")
        self.dropout = layers.Dropout(0.5)
        self.dense_2 = layers.Dense(1, activation="sigmoid")

    def call(self, inputs):
        all_features = layers.concatenate(list(inputs.values()))
        x = self.dense_1(all_features)
        x = self.dropout(x)
        output = self.dense_2(x)
        return output

    # Suppress build warnings.
    def build(self, input_shape):
        self.built = True
# Create the Classifier model.
def create_model():
    all_inputs = create_model_inputs()
    output = Classifier()(all_inputs)
    model = keras.Model(all_inputs, output)
    return model


model = create_model()
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
| """ | |
| Let's visualize our connectivity graph: | |
| """ | |
| # `rankdir='LR'` is to make the graph horizontal. | |
| keras.utils.plot_model(model, show_shapes=True, rankdir="LR") | |
| """ | |
| ## Train the model | |
| """ | |
| model.fit(train_ds, epochs=50, validation_data=val_ds) | |
| """ | |
| We quickly get to 80% validation accuracy. | |
| """ | |
| """ | |
| ## Inference on new data | |
| To get a prediction for a new sample, you can simply call `model.predict()`. There are | |
| just two things you need to do: | |
| 1. wrap scalars into a list so as to have a batch dimension (models only process batches | |
| of data, not single samples) | |
| 2. Call `convert_to_tensor` on each feature | |
| """ | |
sample = {
    "age": 60,
    "sex": 1,
    "cp": 1,
    "trestbps": 145,
    "chol": 233,
    "fbs": 1,
    "restecg": 2,
    "thalach": 150,
    "exang": 0,
    "oldpeak": 2.3,
    "slope": 3,
    "ca": 0,
    "thal": "fixed",
}


# Given a categorical feature's vocabulary (`cat`) and a raw value (`cat_value`),
# return the value's one-hot encoding.
def get_cat_encoding(cat, cat_value):
    # Create a list of zeros with the same length as the vocabulary.
    encoding = [0] * len(cat)
    # Find the index of cat_value in the vocabulary and set that position to 1.
    if cat_value in cat:
        encoding[cat.index(cat_value)] = 1
    return encoding
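
"""
For example, with an illustrative vocabulary of `[0, 1, 2]`:
"""

print(get_cat_encoding([0, 1, 2], 1))  # -> [0, 1, 0]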
# One-hot encode the categorical feature values in the sample.
for name, value in sample.items():
    if name in CATEGORICAL_FEATURES_WITH_VOCABULARY:
        sample[name] = get_cat_encoding(
            CATEGORICAL_FEATURES_WITH_VOCABULARY[name], value
        )

# Convert inputs to tensors.
input_dict = {name: tf.convert_to_tensor([value]) for name, value in sample.items()}
predictions = model.predict(input_dict)

print(
    f"This particular patient had a {100 * predictions[0][0]:.1f} "
    "percent probability of having heart disease, "
    "as evaluated by our model."
)
| """ | |
| ## Conclusions | |
| - The orignal model (the one that runs only on tensorflow) converges quickly to around 80% and remains | |
| there for extended periods and at times hits 85% | |
| - The updated model (the backed-agnostic) model may fluctuate between 78% and 83% and at times hitting 86% | |
| validation accuracy and converges around 80% also. | |
| """ | |