Welcome back, readers! For those who have not read the previous post on how machine learning works, please read it before continuing, so this one will be easier to follow. You can find it here: How Machine Learning Works ? - Mathematics Every Where.
For the others who are ready to join me on this journey to understand machine learning, thank you for reading on. So far we have been discussing the question "How does machine learning work?". In the first post we saw that there exist patterns or trends in the data, that we try to capture them in a mathematical equation called a model, and that we then fit this model to new data we have not come across in order to predict values. It may seem that we have the entire process covered with this, but there are a few things that need care while training or creating the model, and while processing the data for model creation.
We will be talking here about two steps that we need to take care of during model creation:
- Data Cleaning
- Data Transformation
Data Rules the Kingdom
Humans, right from the advent of the computer and the internet, have been producing data that in the current period accounts for terabytes, petabytes, and even zettabytes. We store all this data in servers and data centers, in various databases, data warehouses, or Hadoop-based big data systems. But why don't we just let it go? Why does the data need to be retained? What does data provide us?
Many big companies like Google, Facebook, and Microsoft collect data from their users. What do they do with that huge amount of data? We stay at home, we go shopping, we may exercise, but how does Google or any other company come to know about that? Each user in the world is tracked, once you visit a particular website, through cookies, or through GPS on Android phones and other mobiles. Ahhhh!! So are we being spied on? The answer is NO, they don't collect the data without our knowledge; we are the people who agree to the terms and conditions allowing the data collection, without reading the agreement :-P.
Fine, let's come to the point: why do we need data? Why do they need data?
"Data is precious " - Data helps the business people to improve the business, they also help the common man to improve himself and many boons and the same time data can also be used for destruction.
Imagine a businessman collects data about the daily activities of people. He can now target only the people who jog every day and offer them shoes at a discount, rather than sending the offer to everyone; both sides benefit. This is what we actually do with data: we try to extract information from it and apply that to problems that exist in the world. We do this through fields like Data Science, Data Analytics, and Machine Learning.
"One who have the data - He is the master of the future" - this proves that data rules the kingdom.
Data that we collect from the environment does not arrive in a pure, structured format. It will have unwanted data mixed in, and we call that unwanted data NOISE. Is noise the only problem in the data? No; the data we collect might also be missing values that we are interested in, and the collecting devices or the methodology might be error prone. There are many such problems, and cleaning the data of all of them is the process that we call Data Cleaning.
We need to remove the data that is not of interest from the given set, and also fill in the missing values through various methods, such as replacing the blanks with the mean, median, or mode.
Some theory here :-(
The mean and the mode are measures of central tendency: for a given set of data, they give a representative value that lies at the center of the distributed values. The mean is the average (add all the values and divide by the number of values). The mode is the most frequently occurring value in the data. Variance, in contrast, measures spread: it represents how much each data point in the set varies from the mean.
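As a quick sketch of these statistics and of mean imputation, here is Python's standard `statistics` module applied to a made-up height attribute with one missing value:

```python
import statistics

# Toy attribute with a missing value (None) to be imputed.
heights = [150, 160, 160, 170, None, 180]

# Compute the statistics on the observed values only.
observed = [h for h in heights if h is not None]
mean = statistics.mean(observed)           # average of the values
mode = statistics.mode(observed)           # most frequent value
variance = statistics.pvariance(observed)  # spread around the mean

# Fill the blank with the mean (one common imputation choice).
imputed = [mean if h is None else h for h in heights]

print(mean, mode, variance)
print(imputed)
```

Which imputation value to use depends on the attribute: the mean suits roughly symmetric numeric data, the median is more robust to outliers, and the mode is the natural choice for categorical attributes.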
Well, the boring theory is done! The steps mentioned above are a few commonly followed methods for handling noise and missing data; there are many more algorithms for the same purpose, such as binning. Almost 70 to 80 percent of the work of a data scientist or machine learning professional goes into cleaning the data and making it ready for the next step. Being in this profession, it is also mandatory that we understand the data we are dealing with. Imagine we are dealing with data from a car manufacturing unit: we need to know the various business terms used in that domain. Domain knowledge of the business is very important for understanding the data and judging what matters for solving the problem, because not all the data we collect is of interest for the given problem.
Data transformation is the next step after data cleaning. OMG, is it still not over? Yes, even though we spent 70 to 80 percent of our time cleaning the data, we now need to transform it into other forms if needed.
I can understand that there are a lot of questions about why we need transformations; we will come to know about that shortly.
Numbers - Primary weapon
We know by now that models are represented as mathematical equations that capture the trends or patterns in the data. So training the model means finding the parameters of the equation: for example, if you take y = mx + c, we need to find m and c from the data that we have. These parameters exist as numbers (decimals, natural numbers, integers, etc.), so the attributes of interest that exist as other data types need to be converted to numbers while retaining the semantics of the data.
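As a minimal sketch of what finding m and c from data looks like, here is NumPy's least-squares `polyfit` applied to made-up points that follow y = 2x + 1 (just one of many ways to fit a line):

```python
import numpy as np

# Toy data that follows y = 2x + 1 exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# Fit a degree-1 polynomial y = mx + c; polyfit returns [m, c].
m, c = np.polyfit(x, y, deg=1)
print(m, c)  # the slope and intercept recovered from the data
```

With noisy real-world data the recovered m and c would not match any single point exactly; the fit minimizes the overall squared error instead.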
For example, take the attribute "Gender" in some employee data. It is a categorical attribute (meaning it is not a continuous or discrete numeric value but represents a category). Imagine we try to apply the equation y = mx + c, the linear regression model equation: m and c are numbers, so how do we pass in the "Male" or "Female" values of the data?
In these situations we represent Male with 0 and Female with 1 and then apply the linear equation. This is not the only case that needs transformation; there are many cases we encounter while solving a problem.
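A minimal sketch of this encoding, on hypothetical employee records (the names and the 0/1 mapping here are made up for illustration):

```python
# Hypothetical employee records with a categorical "gender" attribute.
employees = [
    {"name": "Arun", "gender": "Male"},
    {"name": "Priya", "gender": "Female"},
    {"name": "Kumar", "gender": "Male"},
]

# Map each category to a number while retaining its meaning.
gender_codes = {"Male": 0, "Female": 1}

encoded = [gender_codes[e["gender"]] for e in employees]
print(encoded)  # numeric values a linear model can consume
```

For categorical attributes with more than two values and no natural order, a one-hot encoding (one 0/1 column per category) is usually preferred over a single numeric code, so the model does not read a false ordering into the numbers.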
If you take a full-text attribute, like the comments in reviews written by people, you may need to find the emotions expressed in the text. We can see the cleaning and transformation steps involved for such data below.
Data Cleaning:
- We need to remove words like is, the, or, a, etc., which are not of interest. These words are called stop words. The NLTK library in Python helps in doing this.
- Next, the machine doesn't need to know words like Loved or Loving, only the root word love. This process is called stemming: we transform every word to its stem. NLTK helps with this too.
All cleaning is done.
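The two cleaning steps above can be sketched in plain Python. In practice NLTK's `stopwords` corpus and `PorterStemmer` do this properly; the tiny stop-word list and naive suffix stripper below are stand-ins for illustration only:

```python
# Stand-ins for NLTK's resources (use nltk.corpus.stopwords and
# nltk.stem.PorterStemmer in real work).
STOP_WORDS = {"i", "the", "and", "am", "it", "is", "a", "or"}

def naive_stem(word):
    # Strip a few common suffixes to roughly approximate a stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

comment = "I loved the product and I am loving it"
words = [w.lower() for w in comment.split()]

# Drop stop words, then reduce each remaining word to its stem.
cleaned = [naive_stem(w) for w in words if w not in STOP_WORDS]
print(cleaned)
```

Note how "loved" and "loving" collapse to the same stem, which is exactly what we want: the model sees one feature for the concept instead of several surface forms.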
Data Transformation:
- We now have a bag full of words obtained from the comments; the collection of comments from which it is built is called the corpus.
- The words need to be converted to numbers, so we represent each word with a 0 or a 1 depending on whether it appears in a given comment. A big matrix is formed, with one row per comment and one column per word, filled with 1s and 0s. Since most entries are 0, this is called a sparse matrix.
Now this sparse matrix can be fed to any of the existing classification models to identify sad or happy reactions.
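Building such a word-presence matrix can be sketched in plain Python (in practice a tool like scikit-learn's `CountVectorizer` does this job and stores the result efficiently as a sparse matrix; the comments below are made up):

```python
comments = ["love the product", "hate the product", "love it"]

# Vocabulary: every distinct word across the corpus, in sorted order.
vocabulary = sorted({w for c in comments for w in c.split()})

# One row per comment, one column per word: 1 if present, else 0.
matrix = [[1 if word in c.split() else 0 for word in vocabulary]
          for c in comments]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is now a fixed-length numeric vector, which is exactly the input shape that classification models expect.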
I hope we have by now understood the importance of data and the transformation steps needed to make it ready for the next stages.
Machine learning finds applications in almost every industry, including but not limited to health care, logistics, trading, and the stock market.
By now we should have a high-level idea of how machine learning works.
Once again, thanks for joining my journey to explore machine learning. In my next blog I am planning to write about the various types of algorithms that exist in machine learning.
Stay tuned, with the thirst for machine learning!