Data Preprocessing
Clustering groups records according to the distance between them. Different data types call for different distance metrics: Euclidean, Manhattan, and Minkowski distance for numeric data, and Hamming and Jaccard distance for categorical data.
Depending on the problem, one chooses one of the above distance metrics to measure the distance between two records. Before applying any clustering model, we have to standardize the data to bring all the attributes to a common scale, so that attributes with large ranges do not dominate the distance. I used z-score standardization on the data.
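As a rough illustration of the metrics named above, here is a minimal sketch in plain Python. The function names and the sample records are my own, chosen for illustration, and are not taken from the project code.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two numeric records."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    # Minkowski distance with p = 2
    return minkowski(a, b, 2)

def manhattan(a, b):
    # Minkowski distance with p = 1
    return minkowski(a, b, 1)

def hamming(a, b):
    """Number of positions at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Jaccard distance between two sets of categories."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1 - len(a & b) / len(union)

# Hypothetical records, for illustration only
print(euclidean([0, 0], [3, 4]))                      # 5.0
print(manhattan([0, 0], [3, 4]))                      # 7.0
print(hamming(["a", "b", "c"], ["a", "x", "c"]))      # 1
```

Euclidean and Manhattan distance are both special cases of Minkowski distance, which is why they share one helper here.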
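A minimal sketch of z-score standardization follows, assuming the population standard deviation is used (the write-up does not say which variant was applied). The `income` and `age` columns are hypothetical values used only to show the effect of scale.

```python
from statistics import mean, pstdev

def zscore(values):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant column carries no information for distances
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical columns measured in very different units
income = [20000, 40000, 60000]
age = [20, 40, 60]

# Unscaled, the income gap (20000) swamps the age gap (20) in any distance.
# After standardization both columns contribute on the same scale.
print(zscore(income))
print(zscore(age))
```

Both columns map to the same standardized values here because they differ only by a constant factor, which is exactly the unit effect that standardization removes.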
Initially, I converted all the categorical variables to dummy variables so that distances could be calculated, but this increased the number of variables from 15 to 255.
This dimensionality is very high for clustering (the curse of dimensionality), and running a clustering algorithm on this data is unlikely to produce good clusters.
I therefore dropped the variable Activities, which had 99 levels, reducing the number of dimensions to 156.
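The dimension counts follow from how dummy coding works: each categorical variable contributes one 0/1 column per level, so dropping a 99-level variable removes 99 columns (255 - 99 = 156). The sketch below shows the same effect on a tiny made-up data set. The `one_hot` helper, the `Gender` column, and the sample values are hypothetical; only the variable name Activities comes from the data described above.

```python
def one_hot(records, columns):
    """Return (header, rows) with one 0/1 column per (variable, level) pair."""
    levels = {c: sorted({r[c] for r in records}) for c in columns}
    header = [f"{c}={lv}" for c in columns for lv in levels[c]]
    rows = [
        [int(r[c] == lv) for c in columns for lv in levels[c]]
        for r in records
    ]
    return header, rows

# Hypothetical records, for illustration only
records = [
    {"Gender": "F", "Activities": "Hiking"},
    {"Gender": "M", "Activities": "Chess"},
    {"Gender": "F", "Activities": "Swimming"},
]

header_all, rows_all = one_hot(records, ["Gender", "Activities"])
header_small, rows_small = one_hot(records, ["Gender"])

# 2 Gender levels + 3 Activities levels = 5 columns; without Activities, 2
print(len(header_all), len(header_small))
```

The difference in column counts equals the number of levels of the dropped variable, which is why high-cardinality variables are the first candidates to remove or recode.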