This document summarizes the approach used to improve prediction accuracy for 3-D domain swapping using machine learning methods.

Data Cleaning and Feature Engineering

The data for the challenge was extracted from here: Positive dataset and Negative dataset. After extraction, redundant sequences were removed from the positive FASTA dataset using CD-HIT \citep{fu2012cd} at a 95% sequence-identity cut-off. No redundancy removal was applied to the negative FASTA dataset, but sequences containing non-natural amino acid codes such as 'X' were removed; these constituted 5 of the 462 negative FASTA sequences.
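The filtering of sequences containing non-natural amino acid codes can be sketched in Python. This is a minimal illustration, not the original cleaning script; the simple FASTA parser and function names are assumptions:

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def drop_nonnatural(records):
    """Keep only sequences composed of the 20 standard amino acids,
    discarding any record with codes such as 'X', 'B', or 'Z'."""
    standard = set("ACDEFGHIKLMNPQRSTVWY")
    return [(h, s) for h, s in records if set(s) <= standard]
```

For example, a two-record FASTA file in which one sequence contains an 'X' would be reduced to a single record by `drop_nonnatural`.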
Once data cleaning was complete, features were extracted from the positive and negative FASTA sequences using the modlAMP package \citep{muller2017modlamp}, from which features not dependent on amino acid scales, such as molecular weight, charge density, and sequence length, were calculated. In addition, features based on various amino acid descriptor scales, such as the Z3 scale \citep{hellberg1987peptide}, the Eisenberg scale \citep{eisenberg1982hydrophobic}, and the GRAVY scale \citep{kyte1982simple}, were extracted from both the positive and the negative FASTA sequences. Finally, amino acid and dipeptide frequencies for the FASTA sequences were calculated with a custom Perl script and stored as CSV data.
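The frequency features were originally computed with a custom Perl script; an equivalent Python sketch, assuming each sequence is given as a plain string of one-letter codes, could look like this:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequency(seq):
    """20 amino acid frequencies: each residue count divided by sequence length."""
    n = len(seq)
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}


def dipeptide_frequency(seq):
    """400 dipeptide frequencies over the len(seq) - 1 overlapping pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    n = len(pairs)
    freq = {a + b: 0.0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for p in pairs:
        freq[p] += 1.0 / n
    return freq
```

Together these yield the 420 frequency features (20 amino acid + 400 dipeptide) per sequence referred to below.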

Obtaining Top Features

Once the features were extracted from the FASTA sequence datasets, the top features were selected using scikit-learn's \citep{pedregosa2011scikit} mutual information score. Mutual information measures the dependency between a feature and the target label: it is zero when the two are independent and positive when there is a dependency. A typical usage is:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
selection = SelectKBest(score_func=mutual_info_classif, k=25).fit(X, y)
where X is the feature matrix, y is the label vector, and k is the number of features to keep (25 in the final model described below; the default is 10).
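The independence property can be checked directly from the definition \(I(X;Y) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\). The following small sketch for discrete variables is illustrative only and is not part of the original pipeline:

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """I(X;Y) in nats, estimated from paired samples of two discrete variables."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))          # joint counts
    px, py = Counter(xs), Counter(ys)   # marginal counts
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

A feature identical to a balanced binary label gives \(I = \log 2\) nats, while a constant (hence independent) feature gives exactly zero.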
For the machine learning model, candidate features were drawn from the amino acid frequencies, the dipeptide frequencies, and physicochemical properties such as charge density and molecular weight. Validation showed that the dipeptide and amino acid frequencies were the best feature sets for the model, so the 25 best features were extracted from the combined set of 420 amino acid and dipeptide features and employed in the final model.
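The final selection step can be sketched end to end; the data here is synthetic (random matrices standing in for the real frequency tables), and only the shapes, 420 candidate features reduced to 25, follow the text:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((200, 420))       # stand-in for 20 amino acid + 400 dipeptide frequencies
y = rng.integers(0, 2, 200)      # stand-in binary labels: swapped vs. not swapped

# Score all 420 features against the label and keep the 25 highest-scoring ones.
selector = SelectKBest(score_func=mutual_info_classif, k=25).fit(X, y)
X_top = selector.transform(X)    # reduced feature matrix, shape (200, 25)
mask = selector.get_support()    # boolean mask marking the 25 retained columns
```

`get_support()` makes it straightforward to map the retained columns back to the original feature names when reporting which amino acid or dipeptide frequencies the model uses.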