Written by Rahul Rustagi
If you are unfamiliar with the concept of Random Forest, refer to this LINK.
It is time to put our knowledge to the test and build a project using a random dataset from Kaggle. The accuracy might be low and the recall might be high, but what good is knowing all the libraries without ever using them in a project? With that said, let's get started.
The very first step of a project is pre-planning: deciding how the project will go, what the various phases will be and how you plan to achieve them. In this project we will divide our programme into these phases:
Library importing
Dataset importing and cleaning
Data EDA and insight EDA
Feature selection and feature engineering
Model training & testing
Model evaluation
You can find the dataset HERE
1. Library importing
The libraries used in a project vary from person to person, as approaches vary too, but some libraries are a staple in almost all machine learning programmes:
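A minimal sketch of those staple imports (the aliases `np`, `pd` and `plt` are community conventions, not requirements; the exact list in the original code may differ):

```python
# Staple imports found in almost every machine learning programme
import numpy as np                # numerical computing
import pandas as pd               # tabular data handling
import matplotlib.pyplot as plt   # plotting
from sklearn.ensemble import RandomForestClassifier  # the model family this article builds on
```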
Along with these libraries, an approach I have found useful is to try one new library in every project. In this ever-growing field it's important not to rely only on your course material or instructor but to explore new libraries; more on that in the later sections.
2. Dataset Importing and cleaning
We all know how to read a CSV or Excel file using pandas (if not, don't worry, we will cover that too), but after reading the data it's time to take a look at it and its features.
Looking at the data, we can't help but notice that the columns sl_no and gender would not be useful to a machine learning model, as they play no role in predicting the salary, marks or even placement status. That's why our next step is to drop these columns. After dropping them, our data would look something like this:
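A sketch of the read-and-drop step. The tiny inline DataFrame is a stand-in for the Kaggle file (in practice you would use `pd.read_csv(...)` with your downloaded file name); only the column names `sl_no` and `gender` come from the article:

```python
import pandas as pd

# Stand-in for reading the Kaggle CSV, e.g. df = pd.read_csv("placement.csv")
df = pd.DataFrame({
    "sl_no":  [1, 2, 3],
    "gender": ["M", "F", "M"],
    "ssc_p":  [67.0, 79.3, 65.0],
    "status": ["Placed", "Placed", "Not Placed"],
})

# sl_no and gender carry no predictive signal for this model, so drop them
df = df.drop(columns=["sl_no", "gender"])
print(list(df.columns))
```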
3. Data EDA
Now we will perform some EDA on our data, but there is a twist: as previously mentioned, we are going to use a new library in this project, and that library is Pandas_profiling. Pandas profiling is a relatively new open-source library that makes detecting missing values, correlations, data types and extreme values super easy. You can read more about the library here.
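A minimal sketch of generating a profile report. The `ProfileReport` class is pandas_profiling's standard entry point; the DataFrame here is a placeholder, and the import is guarded because the library is an optional install:

```python
import pandas as pd

# Placeholder data; in the project this would be the cleaned placement DataFrame
df = pd.DataFrame({
    "ssc_p":  [67.0, 79.3, 65.0, 85.0],
    "status": ["Placed", "Placed", "Not Placed", "Placed"],
})

try:
    from pandas_profiling import ProfileReport  # pip install pandas-profiling
    # One call builds a full HTML report: missing values, correlations,
    # data types, extreme values and more
    ProfileReport(df, title="Placement Data Profile").to_file("report.html")
except ImportError:
    print("pandas_profiling is not installed")
```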
P.S - By this time it should be clear which feature you are going to use as the target variable, as the further steps depend on that.
4. Insight EDA
I know you might be wondering why we should bother with more EDA when we just did some, but so far we have only learned about the data, not from it. Contrary to popular belief, some insights can be gained from the data without any ML algorithm, using just one line of code.
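One-line insight of the kind the article means, sketched with placeholder data (the `groupby` aggregation is my illustrative choice, not necessarily the line the author used):

```python
import pandas as pd

df = pd.DataFrame({
    "status": ["Placed", "Placed", "Not Placed", "Placed"],
    "ssc_p":  [67.0, 79.3, 52.0, 85.0],
})

# One line: average secondary-school percentage by placement outcome
means = df.groupby("status")["ssc_p"].mean()
print(means)
```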
5. Feature Selection
Of the many features a dataset has, not all add value to our model; some might even reduce it and make the model inaccurate. Feature selection is the process of identifying, based on various factors, which features will yield an optimal model.
There are various ways to determine what role each feature plays in our dataset, but as we have already seen the correlation matrix and gained insight into how the features correlate with each other, our next step is RFE (Recursive Feature Elimination). RFE can be used to signify how much importance a feature holds for the output of the model. More about that here.
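A sketch of RFE with a random forest as the underlying estimator, on synthetic data (the dataset and parameter values here are placeholders, not the article's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in: 8 features, only 3 of them informative
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

# RFE repeatedly fits the estimator and prunes the weakest feature
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=3)
selector.fit(X, y)

print(selector.support_)   # boolean mask: which features were kept
print(selector.ranking_)   # 1 = selected; higher = eliminated earlier
```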
6. Feature Engineering
Now that we know which features we are going to use, we need to modify them so that the model can consume them. The steps to take depend on the dataset; some common ones are:
Normalize continuous features
Encode categorical features (tip - use drop_first to drop the first column of each encoded variable)
If the dimensions of the dataset are too big, use a dimensionality reduction algorithm like PCA.
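The three steps above can be sketched as follows; the column names are placeholders borrowed from typical placement data, and PCA is shown only to illustrate the API (with this few columns you would not actually need it):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "ssc_p":  [67.0, 79.3, 65.0, 85.0],
    "hsc_p":  [91.0, 78.3, 68.0, 73.6],
    "workex": ["No", "Yes", "No", "Yes"],
})

# 1. Encode categoricals; drop_first avoids a redundant dummy column
df = pd.get_dummies(df, columns=["workex"], drop_first=True)

# 2. Normalise continuous features to zero mean / unit variance
df[["ssc_p", "hsc_p"]] = StandardScaler().fit_transform(df[["ssc_p", "hsc_p"]])

# 3. (If dimensionality were high) project onto fewer components with PCA
reduced = PCA(n_components=2).fit_transform(df)
print(df.columns.tolist(), reduced.shape)
```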
7. Model Training & Testing
Model selection and training vary from person to person. The model you choose depends directly on your target variable: a classification model suits a categorical target, while a regression model suits a continuous one. Within those classes you can still use cross-validation to assess which model gives the best result for your project; more about that here. Remember: before training the chosen model, it is crucial to divide your dataset into training and testing sets. After training, we use the test set to check accuracy, efficiency and other metrics.
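A sketch of that workflow on synthetic data: split first, cross-validate the candidate model on the training portion only, then fit and score on the held-out test set (all sizes and parameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Split BEFORE training: the test set must stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross-validate on the training data to sanity-check the model choice
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy:", round(scores.mean(), 3))

model.fit(X_train, y_train)
print("Test accuracy:", round(model.score(X_test, y_test), 3))
```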
8. Model Evaluation
After we have built the model, it's time to put it to the test. Although accuracy score and RMSE work most of the time, there are other metrics that can be used to evaluate how good a model is, such as Precision and Recall. The priority of these metrics depends on your target value, for example -
If the cost of false negatives is too high, your aim would be to have a higher Recall
If the cost of false positives is too high, your aim would be to have a higher Precision
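The trade-off above can be seen on a tiny hand-made example (the labels are invented to make the arithmetic visible: 3 true positives, 1 false positive, 1 false negative, 3 true negatives):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy alone can mislead; precision and recall expose the trade-off
print("accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/total = 6/8
print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4
print("recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4
```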
This article was not a step-by-step guide to building this project, or even to working with this specific dataset; rather, it aims to help you build up your intuition and problem-solving technique so that the next time you come across a random dataset you know where to get started.
Full Code link is HERE
P.S - My accuracy score was around 87%, so yeah it works.
About the person behind the keyboard: Rahul is pursuing a B.Tech and, in parallel, is a data science intern. He is passionate about Machine Learning and NLP and is on his way to becoming a great blogger. If you want to contact him, just click on his name.