Data Science Full Course – Learn Data Science in 10 Hours | Data Science For Beginners | Edureka


Undoubtedly, data science is the most revolutionary technology of the era. It’s all about deriving useful insights from data in order to solve complex real-world problems. Hi all, I welcome you to this session on the Data Science full course, which contains everything you need to know in order to master data science. Now, before we get started, let’s take a look at the agenda. The first module is an introduction to data science that covers all the basic fundamentals of data science. Following this, we have the statistics and probability module, where you’ll understand the statistics and math behind data science and machine learning algorithms. The next module is the basics of machine learning, where we’ll understand what exactly machine learning is, the different types of machine learning, the different machine learning algorithms, and so on. The next module is the supervised learning algorithms module, where we’ll start by understanding the most basic algorithm, which is linear regression. The next module is the logistic regression module, where we will see how logistic regression can be used to solve classification problems. After this, we’ll discuss decision trees and see how decision trees can be used to solve complex data-driven problems. The next module is random forest. Here we’ll understand how random forest can be used to solve classification problems and regression problems with the help of use cases and examples. The next module we’ll be discussing is the k-nearest neighbor module. We will understand how KNN can be used to solve complex classification problems. Following this, we’ll look at the Naive Bayes module, which is one of the most important algorithms in Gmail spam detection. The next algorithm is the support vector machine, where we will understand how SVMs can be used to draw a hyperplane between different classes of data. Finally,
We move on to the unsupervised learning module, where we will understand how k-means can be used for clustering and how you can perform market basket analysis by using association rule mining. The next module is reinforcement learning, where we will understand the different concepts of reinforcement learning along with a couple of demonstrations. Following this, we’ll look at the deep learning module, where we will understand what exactly deep learning is, what neural networks are, the different types of neural networks, and so on. The last module is the data science interview questions module, where we will understand the important concepts of data science along with a few tips in order to ace the interview. Now, before we get started, make sure you subscribe to the Edureka YouTube channel in order to stay updated about the most trending technologies. Data science is one of the most in-demand technologies right now. This is probably because we’re generating data at an unstoppable pace, and obviously we need to process and make sense of this much data. This is exactly where data science comes in. In today’s session, we’ll be talking about data science in depth. So let’s move ahead and take a look at today’s agenda. We’re going to begin by discussing the various sources of data and how the evolution of technology and the introduction of IoT and social media have led to the need for data science. Next, we’ll discuss how Walmart is using insightful patterns from their database to increase the potential of their business. After that, we will see what exactly data science is. Then we’ll move on and discuss who a data scientist is, where we will also discuss the various skill sets
needed to become a data scientist. Next, we’ll move on to see the various data science job roles, such as data analyst, data architect, data engineer, and so on. After this, we will cover the data life cycle, where we will discuss how data is extracted, processed, and finally used as a solution. Once we’re done with that, we’ll cover the basics of machine learning, where we’ll see what exactly machine learning is and the different types of machine learning. Next, we will move on to the k-means algorithm and discuss a use case of k-means clustering, after which we’ll discuss the various steps involved in the k-means algorithm. Then we will finally move on to the hands-on part, where we use the k-means algorithm to cluster movies based on their popularity on social media platforms like Facebook. At the end of today’s session, we’ll also discuss what a data science certification is and why you should take it up. So guys, there’s a lot to cover in today’s session. Let’s jump into the first topic. Do you guys remember the times when we had telephones and had to go to PCO booths in order to make a phone call? Those times were very simple because we didn’t generate a lot of data. We didn’t even store the contacts in our phones or our telephones. We used to memorize phone numbers back then, or, you know, we used to have a diary of all our contacts. But these days we have smartphones, which store a lot of data. There’s everything about us in our mobile phones. We have images, we have contacts, we have various apps, we have games. Everything is stored on our mobile phones these days. Similarly, the PCs that we used in earlier times used to process very little data. There wasn’t a lot of data processing needed because technology hadn’t evolved that much. So if you guys remember, we used floppy disks back then.
Floppy disks were used to store small amounts of data, but later on hard disks were created, and those could store gigabytes of data. But now if you look around, there’s data everywhere around us. We have data stored in the cloud. We have data in each and every appliance in our houses. Similarly, if you look at smart cars these days, they’re connected to the internet, they’re connected to our mobile phones, and this also generates a lot of data. What we don’t realize is that the evolution of technology has generated a lot of data. Initially there was very little data, and most of it was structured; only a small part of the data was unstructured or semi-structured. In those days you could use simple BI tools to process all of this data and make sense of it. But now we have way too much data, and in order to process this much data, we need more complex algorithms. We need a better process. This is where data science comes in. Now guys, I’m not going to get into the depth of data science yet. I’m sure all of you have heard of IoT, or the Internet of Things. Did you guys know that we produce 2.5 quintillion bytes of data each day? And this is only accelerating with the growth of IoT. IoT, or the Internet of Things, is just a fancy term that we use for a network of tools or devices that communicate and transfer data through the internet. So various devices are connected to each other through the internet and they communicate with each other. The communication happens by the exchange of data or by the generation of data. These devices include the vehicles we drive, our TVs, coffee machines, refrigerators, washing machines, and almost everything else that we use on a daily basis. These interconnected devices produce an unimaginable amount of data. Guys, IoT data is measured in zettabytes, and one zettabyte is equal to a trillion gigabytes. Here is what a recent survey by Cisco found.
It’s estimated that by the end of 2019, which is almost here, the IoT will generate more than five hundred zettabytes of data per year. And this number will only increase through time. It’s hard to imagine data in that much volume. Imagine processing, analyzing, and managing this much data. It’s only going to cause us a migraine. So guys, having to deal with this much data is not something that traditional BI tools can do. We can no longer rely on traditional data processing methods. That’s exactly why we need data science. It’s our only hope right now. Let’s not get into the details here yet. Moving on, let’s see how social media is adding to the generation of data. The fact that we are all in love with social media is actually generating a lot of data. It’s certainly one of the fuels for data creation. All these numbers that you see on the screen are generated every minute of the day, and these numbers are just going to increase. For Instagram, it says that approximately 1.7 million pictures are uploaded in a minute, and similarly on Twitter, approximately a hundred and forty-eight thousand tweets are published every minute of the day. So guys, imagine how much that would be in one hour, and then imagine 24 hours. This is the amount of data that is generated through social media. It’s unimaginable. Imagine processing this much data, analyzing it, and then trying to figure out the important insights from it. Analyzing this much data is going to be very hard with traditional tools or traditional methods. That’s why data science was introduced. Data science is a simple process that will just extract the useful information from data.
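To make the "imagine one hour, then 24 hours" point concrete, here is a small Python sketch that scales the per-minute figures quoted in the session up to hourly and daily totals, and checks the zettabyte-to-gigabyte conversion. The constant and function names are made up for this illustration, and the unit sizes assume decimal (SI) prefixes.

```python
# Per-minute figures as quoted in the session (not independently verified).
PICTURES_PER_MINUTE = 1_700_000   # Instagram uploads per minute
TWEETS_PER_MINUTE = 148_000       # tweets published per minute

MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24


def per_hour(rate_per_minute):
    """Scale a per-minute rate to one hour."""
    return rate_per_minute * MINUTES_PER_HOUR


def per_day(rate_per_minute):
    """Scale a per-minute rate to 24 hours."""
    return per_hour(rate_per_minute) * HOURS_PER_DAY


# Unit sizes, assuming decimal (SI) prefixes.
GIGABYTE = 10**9    # bytes
ZETTABYTE = 10**21  # bytes
GIGABYTES_PER_ZETTABYTE = ZETTABYTE // GIGABYTE  # 10**12, i.e. a trillion

for name, rate in [("pictures", PICTURES_PER_MINUTE), ("tweets", TWEETS_PER_MINUTE)]:
    print(f"{name}: {per_hour(rate):,} per hour, {per_day(rate):,} per day")

print(f"1 zettabyte = {GIGABYTES_PER_ZETTABYTE:,} gigabytes")
```

So the quoted upload rate alone works out to over a hundred million pictures per hour, which is the scale the speaker is asking you to imagine.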
All right, it’s just going to process and analyze the entire data and then extract what is needed. Now guys, apart from social media and IoT, there are other factors as well which contribute to data generation. These days all our transactions are done online, right? We pay bills online. We shop online. We even buy homes online. These days you can even sell your pets on OLX. Not only that, when we stream music and watch videos on YouTube, all of this is generating a lot of data. Not to forget, we’ve also brought health care into the internet world. There are various watches, like Fitbit, which basically track our heart rate and generate data about our health conditions. Education is also an online thing right now. That’s exactly what you are doing right now. So with the emergence of the internet, we now perform all our activities online. Obviously, this is helping us, but we are unaware of how much data we are generating and what can be done with all of this data. What if we could use the data that we generate to our benefit? Well, that’s exactly what data science does. Data science is all about extracting useful insights from data and using them to grow your business. Now, before we get into the details of data science, let’s see how Walmart uses data science to grow its business. So guys, Walmart is the world’s biggest retailer, with over 20,000 stores in just 28 countries. It’s currently building the world’s biggest private cloud, which will be able to process 2.5 petabytes of data every hour. The reason behind Walmart’s success is how they use customer data to get useful insights about customers’ shopping patterns. The data analysts and data scientists at Walmart know every detail about their customers. They know that if a customer buys Pop-Tarts, they might also buy cookies. How do they know all of this?
How do they generate information like this? They use the data that they get from their customers, and they analyze it to see what a particular customer is looking for. Now, let’s look at a few cases where Walmart actually analyzed the data and figured out the customers’ needs. Let’s consider the Halloween cookie sales example. During Halloween, a sales analyst at Walmart took a look at the data and found out that a specific cookie was popular across all Walmart stores. So every Walmart store was selling these cookies very well, but he found out that there were two stores which were not selling them at all. The situation was immediately investigated, and it was found that there was a simple stocking oversight, because of which the cookies had not been put on the shelves for sale. Because this issue was immediately identified, they prevented any further loss of sales. Another such example is that, through association rule mining, Walmart found out that strawberry Pop-Tart sales increased by seven times before a hurricane. So a data analyst at Walmart identified the association between hurricanes and strawberry Pop-Tarts through data mining. Now guys, don’t ask me the relationship between Pop-Tarts and hurricanes, but for some reason, whenever there was a hurricane approaching, people really wanted to eat strawberry Pop-Tarts. So what Walmart did was place all the strawberry Pop-Tarts at the checkout before a hurricane would occur. This way they increased sales of their Pop-Tarts. Now guys, this is an actual thing. I’m not making it up. You can look it up on the internet. Not only that, Walmart is analyzing the data generated by social media to find out all the trending products. Through social media, you can find out the likes and dislikes of a person, right?
So what Walmart did is quite smart. They used the data generated by social media to find out what products are trending or what products are liked by customers. An example of this is that Walmart analyzed social media data and found out that Facebook users were crazy about cake pops. So Walmart immediately took a decision and introduced cake pops into Walmart stores. So guys, the only reason Walmart is so successful is that they don’t see the huge amount of data they get as a burden. Instead, they process this data, analyze it, and try to draw useful insights from it. They invest a lot of money, effort, and time in data analysis, and they spend a lot of time analyzing data in order to find any hidden patterns. As soon as they find a hidden pattern or an association between any two products, they start giving out offers or discounts or something along those lines. So basically Walmart uses data in a very effective manner. They analyze it very well, they process it very well, and they find the useful insights that they need in order to get more customers or improve their business. So guys, this was all about how Walmart uses data science. Now let’s move ahead and look at what data science is. Data science is all about uncovering findings from data. It’s all about surfacing the hidden insights that can help companies make smart business decisions. All these hidden insights or hidden patterns can be used to make better decisions in a business. Another example of this is Netflix. Netflix basically analyzes the movie viewing patterns of users to understand what drives user interest and to see what users want to watch, and once they find out, they give people what they want. So guys, data actually has a lot of power. You just need to know how to process this data and how to extract the useful information from data.
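Going back to the Pop-Tarts and cookies association mentioned in the Walmart example, here is a minimal Python sketch of the three standard measures used in association rule mining: support, confidence, and lift. The baskets are invented for illustration and are not Walmart data; the function names are likewise just this sketch's own.

```python
# Toy shopping baskets, invented purely for illustration.
baskets = [
    {"pop_tarts", "cookies", "milk"},
    {"pop_tarts", "cookies"},
    {"pop_tarts", "bread"},
    {"cookies", "milk"},
    {"bread", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`.

    Assumes the antecedent appears in at least one basket.
    """
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence relative to how common `consequent` is overall.

    A lift above 1 suggests the items are bought together more often than chance.
    """
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


rule = ({"pop_tarts"}, {"cookies"})
print("support:", support(rule[0] | rule[1], baskets))
print("confidence:", confidence(*rule, baskets))
print("lift:", lift(*rule, baskets))
```

A rule like "Pop-Tarts implies cookies" becomes interesting to a retailer when its lift is well above 1 and its support is high enough to matter; real market basket analysis searches over many candidate item sets rather than checking one rule by hand.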
That’s what data science is all about. So guys, a big question over here is how data scientists get useful insights from data. It all starts with data exploration. Whenever data scientists come across any challenging question or any sort of challenging situation, they become detectives. They investigate leads and try to understand the different patterns or the different characteristics of the data. They try to get all the information that they can from the data, and then they use it for the betterment of the organization or the business. Now, let’s look at who a data scientist is. A data scientist has to be able to view data through a quantitative lens, so knowing math is one of the very important skills of a data scientist. Mathematics is important because in order to find a solution you’re going to build a lot of predictive models, and these predictive models are going to be based on hard math. You have to be able to understand all the underlying mechanics of these models. Most of the predictive models and most of the algorithms require mathematics. Now, there’s a major misconception that data science is all about statistics. I’m not saying that statistics isn’t important. It is very important, but it’s not the only type of math that is utilized in data science. There are actually many machine learning algorithms which are based on linear algebra. So guys, overall you need to have a good understanding of math. Apart from that, data scientists utilize technology, so they have to be really good with technology. Their main work is to utilize technology so that they can analyze these enormous data sets and work with complex algorithms.
All of this requires tools which are much more sophisticated than Excel, so data scientists need to be very efficient with coding languages. A few of the core languages associated with data science include SQL, Python, R, and SAS. It is also important for a data scientist to be a tactical business consultant. So guys, business problems can be answered by data scientists, since data scientists work so closely with data that they know everything about the business. If you have a business and you give the entire data set of your business to a data scientist, he’ll know each and every aspect of your business. That’s how data scientists work. They get the entire data set, they study it, they analyze it, and then they see where things are going wrong, what needs to be done more, or what needs to be excluded. So guys, having this business acumen is just as important as having skills in algorithms or being good with math and technology. Now that you know who a data scientist is, let’s look at the skill sets that a data scientist needs. It always starts with statistics. Statistics will give you the numbers from the data, so a good understanding of statistics is very important for becoming a data scientist. You have to be familiar with statistical tests, distributions, maximum likelihood estimators, and all of that. Apart from that, you should also have a good understanding of probability theory and descriptive statistics. These concepts will help you make better business decisions. So no matter what type of company or role you’re interviewing for, you’re going to be expected to know how to use the tools of the trade. This means that you have to know a statistical programming language like R or Python, and you’ll also need to know a database
querying language like SQL. The main reason why people prefer R and Python is the number of packages that these languages have, and these predefined packages have most of the algorithms in them. So you don’t have to actually sit down and code the algorithms. Instead, you can just load one of these packages from their libraries and run it. So a programming language is a must. At the minimum, you should know R or Python and a database query language. Now, let’s move on to data extraction and processing. So guys, say that you have multiple data sources, like a MySQL database or a Mongo database. What you have to do is extract data from such sources, and then, in order to analyze and query it, you have to store it in a proper format or a proper structure. Finally, you can load the data into the data warehouse and analyze it there. This entire process is called extraction and processing. So guys, extraction and processing is all about getting data from these different data sources and then putting it in a format so that you can analyze it. Next is data wrangling and exploration. Data wrangling is one of the most difficult tasks in data science. It is the most time-consuming task, because data wrangling is all about cleaning the data. There are a lot of instances where the data sets have missing values or null values, or they have inconsistent formats or inconsistent values, and you need to understand what to do with such values. This is where data wrangling or data cleaning comes into the picture. After you’re done with that, you are going to analyze the data. So guys, after data wrangling and cleaning is done, you’re going to start exploring. This is where you try to make sense of the data. You can do this by looking at the different patterns in the data, the different trends, outliers, various unexpected results, and all of that.
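As a rough illustration of the wrangling step described above (missing values, null values, inconsistent formats), here is a small Python sketch using only the standard library. The records, field names, and the choice to fill missing numbers with the median are assumptions made for this example, not something prescribed in the session; in practice a library such as pandas is commonly used for this kind of work.

```python
from statistics import median

# Hypothetical raw records with the kinds of problems described above:
# stray whitespace, inconsistent capitalisation, empty and null values.
raw_rows = [
    {"customer": " Alice ", "city": "new york", "spend": "120.50"},
    {"customer": "BOB", "city": "New York", "spend": ""},
    {"customer": "carol", "city": None, "spend": "80"},
    {"customer": "Dave", "city": "NEW YORK ", "spend": "n/a"},
]


def parse_spend(value):
    """Return the spend as a float, or None if it is missing or unparseable."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean(rows):
    """Normalise text fields and fill missing spend values with the median.

    Assumes at least one row has a usable spend value.
    """
    parsed = [parse_spend(row["spend"]) for row in rows]
    fill_value = median(v for v in parsed if v is not None)
    cleaned = []
    for row, spend in zip(rows, parsed):
        cleaned.append({
            "customer": row["customer"].strip().title(),
            "city": (row["city"] or "Unknown").strip().title(),
            "spend": spend if spend is not None else fill_value,
        })
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Filling with the median is only one possible policy; dropping incomplete rows or flagging them for review are equally valid choices, and which one is right depends on the data and the question being asked.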
Next, we have machine learning. So guys, if you’re at a large company with huge amounts of data, or if you’re working at a company where the product is data driven, like Netflix or Google Maps, then you have to be familiar with machine learning methods. You cannot process large amounts of data with traditional methods, so that’s why you need machine learning algorithms. There are a few algorithms, like k-nearest neighbors, random forest, k-means, and support vector machines. You have to be aware of all of these algorithms, and let me tell you that most of them can be implemented using R or Python libraries. You need to have an understanding of machine learning if you have a large amount of data in front of you, which is going to be the case for most people right now, because data is being generated at an unstoppable pace. Earlier in the session we discussed how much data is generated. So knowing machine learning algorithms and machine learning concepts is a very much required skill if you want to become a data scientist. If you’re sitting for an interview as a data scientist, you will be asked about machine learning. You will be asked how good you are with these algorithms and how well you can implement them. Next, we have big data processing frameworks. So guys, we know that we’ve been generating a lot of data, and most of this data can be structured or unstructured. On such data, you cannot use a traditional data processing system. That’s why you need to know frameworks like Hadoop and Spark. These frameworks can be used to handle big data. Lastly, we have data visualization. Data visualization is one of the most important parts of data analysis. It is always very important to present the data in an understandable and visually appealing format.
So data visualization is one of the skills that data scientists have to master. If you want to communicate the data to the end users in a better way, then data visualization is a must. So guys, there are a lot of tools which can be used for data visualization. Tools like Tableau and Power BI are a few of the most popular visualization tools. With this, we sum up the entire skill set that is needed to become a data scientist. Apart from this, you should also have a data-driven problem-solving approach, and you should be very creative with data. So now that we know the skills that are needed to become a data scientist, let’s look at the different job roles. Since data science is a very vast field, there are many job roles under data science. So let’s take a look at each role. Let’s start off with the data scientist. Data scientists have to understand the challenges of a business, and they have to offer the best solution using data analysis and data processing. For instance, if they are expected to perform predictive analysis, they should also be able to identify trends and patterns that can help companies make better decisions. To become a data scientist, you have to be an expert in R, MATLAB, SQL, Python, and other complementary technologies. It can also help if you have a higher degree in mathematics or computer engineering. Next, we have the data analyst. A data analyst is responsible for a variety of tasks, including visualization and processing of massive amounts of data. They also have to perform queries on databases, so they should be aware of the different query languages. And guys, one of the most important skills of a data analyst is optimization. This is because they have to create and modify algorithms that can be used to pull information from some of the biggest databases without corrupting the data. To become a data analyst, you must know technologies such as SQL, R, SAS, and Python.
So certification in any of these technologies can boost your job application. You should also have good problem-solving qualities. Next, we have the data architect. A data architect creates the blueprints for data management so that the databases can be easily integrated, centralized, and protected with the best security measures. They also ensure that the data engineers have the best tools and systems to work with. To become a data architect, you have to have expertise in data warehousing, data modeling, and extraction, transformation, and loading (ETL). You should also be well versed in Hive, Pig, and Spark. Apart from this, there are data engineers. So guys, the main responsibility of a data engineer is to build and test scalable big data ecosystems. They are also needed to update the existing systems with newer or upgraded versions, and they are responsible for improving the efficiency of the database. If you are interested in a career as a data engineer, then technologies that require hands-on experience include Hive, NoSQL, R, Ruby, Java, C++, and MATLAB. It would also help if you can work with popular data APIs and ETL tools. Next, we have the statistician. As the name suggests, you have to have a sound understanding of statistical theories and data organization. Not only do statisticians extract and offer valuable insights, they also create new methodologies for engineers to apply. If you want to become a statistician, then you have to have a passion for logic. Statisticians are also good with a variety of database systems such as SQL, with data mining, and with various machine learning technologies. By that I mean you should be good with math, and you should also have good knowledge of the various database systems such as SQL. Knowing the various machine learning concepts and algorithms is also a must. Next, we have the database administrator.
So guys, the job profile of a database administrator is pretty much self-explanatory. They are basically responsible for the proper functioning of all the databases, and they are also responsible for granting or revoking database services for the employees of the company. They also have to take care of database backups and recoveries. Some of the skills that are needed to become a database administrator include database backup and recovery, data security, and data modeling and design. Next, we have the business analyst. The role of a business analyst is a little different from all of the other data science jobs. Now, don’t get me wrong. They have a very good understanding of the data-oriented technologies. They know how to handle a lot of data and process it, but they are also very focused on how this data can be linked to actionable business insights. So they mainly focus on business growth. A business analyst acts like a link between the data engineers and the management executives. In order to become a business analyst, you have to have an understanding of business finances, business intelligence, and also IT technologies like data modeling, data visualization tools, etc. At last, we have the data and analytics manager. A data and analytics manager is responsible for the data science operations. The main responsibility of a data and analytics manager is to oversee the data science operations and to assign duties to the team according to their skills and expertise. Their strengths should include technologies like SAS, R, and SQL, and of course they should have good management skills. Apart from that, they must have excellent social skills, leadership qualities, and an out-of-the-box thinking attitude. And like I said earlier, you need to have a good understanding of technologies like Python, SAS, R, Java, etc. So guys, these were the different job roles in data science.
I hope you all found this informative. Now, let’s move ahead and look at the data life cycle. So guys, there are basically six steps in the data life cycle. It starts with a business requirement. Next is data acquisition. After that, you process the data, which is called data processing. Then there is data exploration, modeling, and finally deployment. So guys, before you even start on a data science project, it is important that you understand the problem you’re trying to solve. In this stage, you’re just going to focus on identifying the central objectives of the project, and you will do this by identifying the variables that need to be predicted. Next up, we have data acquisition. Now that you have your objectives defined, it’s time for you to start gathering the data. Data mining is the process of gathering your data from different sources. At this stage, some of the questions you can ask yourself are: What data do I need for my project? Where does it live? How can I obtain it? And what is the most efficient way to store and access all of it? Next up, there is data processing. Usually all the data that you collect is a huge mess. It’s not formatted, it’s not structured, it’s not cleaned. So if you find any data set that is cleaned and packaged well for you, then you’ve actually won the lottery, because finding the right data takes a lot of time and effort. One of the major time-consuming tasks in the data science process is data cleaning. This requires a lot of time and effort, because you have to go through the entire data set to find any missing values, inconsistent values, or corrupted data, and you also find the unnecessary data and remove it. So this was all about data processing. Next, we have data exploration.
So now that you have a sparkling clean set of data, you are finally ready to get started with your analysis. The data exploration stage is basically the brainstorming of data analysis. In order to understand the patterns in your data, you can use a histogram. You can just pull up a random subset of data and plot a histogram. You can even create interactive visualizations. This is the point where you dive deep into the data and try to explore the different models that can be applied to your data. Next up, we have data modeling. After processing the data, you’re going to carry out model training. Model training is basically about finding the model that answers the questions most accurately. The process of model training involves a lot of steps. Firstly, you’ll start by splitting the input data into the training data set and the testing data set. You’re going to take the entire data set and separate it into two parts: one is the training data and one is the testing data. After that, you’ll build a model by using the training data set, and once you’re done with that, you’ll evaluate it on the training and the test data sets. To evaluate on the training and testing data, you’ll be using a series of machine learning algorithms. After that, you’ll find out which model is the most suitable for your business requirement. So this was mainly data modeling. This is where you build a model out of your training data set and then evaluate this model by using the testing data set. Next, you have deployment. So guys, the goal of this stage is to deploy the model into a production or maybe a production-like environment. This is basically done for final user acceptance, and the users have to validate the performance of the models. If there are any issues with the model or with the algorithm, then they have to be fixed in this stage.
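The modeling step described above (split the data, build a model on the training set, evaluate on both sets) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the 80/20 split ratio is a common convention rather than something given in the session, and a simple least-squares line stands in for whatever model a real project would use. All function names are this sketch's own.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(rows, slope, intercept):
    """Average squared difference between predictions and actual values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in rows) / len(rows)


# Synthetic data: y is roughly 2x + 1 plus some random noise.
noise = random.Random(0)
data = [(x, 2 * x + 1 + noise.gauss(0, 1)) for x in range(50)]

train, test = train_test_split(data)
slope, intercept = fit_line(train)  # the model only ever sees the training set

print("train size:", len(train), "test size:", len(test))
print("train MSE:", mean_squared_error(train, slope, intercept))
print("test MSE:", mean_squared_error(test, slope, intercept))
```

Comparing the two error values is the point of the split: a model that does well on the training data but much worse on the held-out test data has not really learned the underlying pattern.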
So guys, with this we come to the end of the data life cycle. I hope this was clear. Statistics and probability are essential because these disciplines form the basic foundation of all machine learning algorithms, deep learning, artificial intelligence, and data science. In fact, mathematics and probability are behind everything around us. From shapes, patterns, and colors to the count of petals in a flower, mathematics is embedded in each and every aspect of our lives. With this in mind, I welcome you all to today’s session. So I’m going to go ahead and discuss the agenda for today with you all. We’re going to begin the session by understanding what data is. After that, we’ll move on and look at the different categories of data, like quantitative and qualitative data. Then we’ll discuss what exactly statistics is, the basic terminologies in statistics, and a couple of sampling techniques. Once we’re done with that, we’ll discuss the different types of statistics, which are descriptive and inferential statistics. Then in the next section we’ll mainly be focusing on descriptive statistics. Here we’ll understand the different measures of center, measures of spread, information gain, and entropy. We’ll also understand all of these measures with the help of a use case, and finally we’ll discuss what exactly a confusion matrix is. Once we’ve covered the entire descriptive statistics module, we’ll discuss the probability module. Here we’ll understand what exactly probability is and the different terminologies in probability. We’ll also study the different probability distributions. Then we’ll discuss the types of probability, which include marginal, joint, and conditional probability. Then we’ll move on and discuss a use case where we’ll see examples that show us how the different types of probability work, and to better understand Bayes’ theorem, we’ll look at a small example.
Also, I forgot to mention that at the end of the descriptive statistics module we’ll be running a small demo in the R language. For those of you who don’t know much about R, I’ll be explaining every line in depth, but if you want a more in-depth understanding of R, I’ll leave a couple of blogs and a couple of videos in the description box. You all can definitely check out that content. Now, after we’ve completed the probability module, we’ll discuss the inferential statistics module. We’ll start this module by understanding what point estimation is. We will discuss what a confidence interval is and how you can estimate the confidence interval. We will also discuss margin of error, and we’ll understand all of these concepts by looking at a small use case. We’ll finally end the inferential statistics module by looking at what hypothesis testing is. Hypothesis testing is a very important part of inferential statistics. So we’ll end the session by looking at a use case that discusses how hypothesis testing works, and to sum everything up, we’ll look at a demo that explains how inferential statistics works. Alright, so guys, there’s a lot to cover today. So let’s move ahead and take a look at our first topic, which is: what is data? Now, this is quite a simple question. If I ask any of you what data is, you’ll say that it’s a set of numbers or some sort of documents stored on your computer. Now, data is actually everything. All right, look around you: there is data everywhere. Each click on your phone generates more data than you know. This generated data provides insights for analysis and helps us make better business decisions. This is why data is so important. To give you a formal definition, data refers to facts and statistics collected together for reference or analysis. All right. This is the definition of data in terms of statistics and probability.
So as we know, data can be collected, it can be measured and analyzed, and it can be visualized by using statistical models and graphs. Now, data is divided into two major subcategories. We have qualitative data and quantitative data. These are the two different types of data. Under qualitative data, we have nominal and ordinal data, and under quantitative data, we have discrete and continuous data. Now, let’s focus on qualitative data. This type of data deals with characteristics and descriptors that can’t be easily measured but can be observed subjectively. Qualitative data is further divided into nominal and ordinal data. Nominal data is any sort of data that doesn’t have any order or ranking. An example of nominal data is gender. There is no ranking in gender. There’s only male, female, or other, right? There is no one, two, three, four or any sort of ordering in gender. Race is another example of nominal data. Now, ordinal data is basically an ordered series of information. Let’s say that you went to a restaurant. Your information is stored in the form of a customer ID, so basically you are represented by a customer ID. Now, you would have rated their service as either good or average. That’s what ordinal data is, and similarly they’ll have a record of other customers who visit the restaurant along with their ratings. So any data which has some sort of sequence or some sort of order to it is known as ordinal data. All right, so guys, this is pretty simple to understand. Now, let’s move on and look at quantitative data. Quantitative data basically deals with numbers and things. You can understand that from the word quantitative itself: quantitative is basically quantity. Right, so it deals with numbers; it deals with anything that you can measure objectively.
All right, so there are two types of quantitative data: discrete and continuous data. Now, discrete data is also known as categorical data, and it can hold a finite number of possible values. The number of students in a class is a finite number. You can’t have an infinite number of students in a class. Let’s say in your fifth grade there were a hundred students in your class. There wasn’t an infinite number, but there was a definite, finite number of students in your class. Okay, that’s discrete data. Next, we have continuous data. Now, this type of data can hold an infinite number of possible values. So when I say the weight of a person is an example of continuous data, what I mean to say is my weight can be 50 kg, or it can be 50.1 kg, or 50.001 kg, or 50.0001, or 50.023, and so on. There are an infinite number of possible values, right? So this is what I mean by continuous data. This is the difference between discrete and continuous data. I’d also like to mention a few other things over here. There are a couple of types of variables as well. We have a discrete variable and a continuous variable. A discrete variable is also known as a categorical variable, and it can hold values of different categories. Let’s say that you have a variable called message, and there are two types of values that this variable can hold: your message can either be a spam message or a non-spam message. That’s when you call a variable a discrete or categorical variable, because it can hold values that represent different categories of data. Now, continuous variables are basically variables that can store an infinite number of values. So the weight of a person can be denoted as a continuous variable. Let’s say there is a variable called weight, and it can store an infinite number of possible values.
That’s why we will call it a continuous variable. So guys, basically a variable is anything that can store a value, right? If you associate any sort of data with a variable, then it will become either a discrete variable or a continuous variable. There are also dependent and independent types of variables. Now, we won’t discuss all of that in depth because it’s pretty understandable. I’m sure all of you know what an independent variable and a dependent variable are, right? A dependent variable is any variable whose value depends on some other independent variable. So guys, that much knowledge I expect all of you to have. All right, so now let’s move on and look at our next topic, which is: what is statistics? Coming to the formal definition, statistics is an area of applied mathematics which is concerned with data collection, analysis, interpretation, and presentation. Now usually, when I speak about statistics, people think statistics is all about analysis, but statistics has other parts to it. Data collection is also a part of statistics, and so are data interpretation and presentation. All of this comes under statistics. You’re going to use statistical methods to visualize data, to collect data, and to interpret data. Alright, so this area of mathematics deals with understanding how data can be used to solve complex problems. Now I’ll give you a couple of examples of problems that can be solved by using statistics. Let’s say that your company has created a new drug that may cure cancer. How would you conduct a test to confirm its effectiveness? Now, even though this sounds like a biology problem, it can be solved with statistics. You’ll have to create a test which can confirm the effectiveness of the drug. All right, this is a common problem that can be solved using statistics. Let me give you another example. You and a friend are at a baseball game.
Out of the blue, he offers you a bet that neither team will hit a home run in that game. Should you take the bet? All right, here you just assess the probability of whether you’ll win or lose. This is another problem that comes under statistics. Let’s look at another example. The latest sales data has just come in, and your boss wants you to prepare a report for management on places where the company could improve its business. What should you look for, and what should you not look for? Now, this problem involves a lot of data analysis. You’ll have to look at the different variables that are causing your business to go down, or you’ll have to look at the variables that are increasing the performance of your models and thus growing your business. Alright, so this involves a lot of data analysis, and the basic idea behind data analysis is to use statistical techniques in order to figure out the relationship between different variables or different components in your business. So now let’s move on and look at our next topic, which is basic terminologies in statistics. Before you dive deep into statistics, it is important that you understand the basic terminologies used in statistics. The two most important terms in statistics are population and sample. Throughout the statistics course, or throughout any problem that you’re trying to solve with statistics, you will come across these two words. Now, a population is a collection or a set of individuals, objects, or events whose properties are to be analyzed. So basically you can refer to the population as the subject that you’re trying to analyze. Now a sample, just like the word suggests, is a subset of the population. You have to make sure that you choose the sample in such a way that it represents the entire population. It shouldn’t focus on just one part of the population.
It should represent the entire population. That’s how your sample should be chosen. A well-chosen sample will contain most of the information about a particular population parameter. Now, you must be wondering how one can choose a sample that best represents the entire population. Sampling is a statistical method that deals with the selection of individual observations within a population. So sampling is performed in order to infer statistical knowledge about a population. If you want to understand the different statistics of a population, like the mean, the median, the mode, the standard deviation, or the variance, then you’re going to perform sampling, because it’s not reasonable for you to study a large population and find out the mean, median, and everything else. So why is sampling performed, you might ask? What is the point of sampling? Can’t we just study the entire population? Now guys, think of a scenario wherein you’re asked to perform a survey about the eating habits of teenagers in the US. At present there are over 42 million teens in the US, and this number is growing as we speak, correct? Is it possible to survey each of these 42 million individuals about their health? Well, it might be possible, but this will take forever to do. Obviously, it’s not reasonable to go around knocking on each door and asking what your teenage son eats and all of that, right? That’s why sampling is used. It’s a method wherein a sample of the population is studied in order to draw inferences about the entire population. So it’s basically a shortcut: you avoid studying the entire population and working everything out on it.
You’re just going to take a part of the population that represents the entire population, and you’re going to perform all your statistical analysis, your inferential statistics, on that small sample. All right, and that sample basically represents the entire population. So I’m sure I’ve made it clear to y’all what a sample is and what a population is. Now, there are two main types of sampling techniques to mention today: probability sampling and non-probability sampling. In this video we’ll only be focusing on probability sampling techniques, because non-probability sampling is not within the scope of this video. We’ll only discuss the probability part because we’re focusing on statistics and probability, correct? Now again, under probability sampling we have three different types: random sampling, systematic sampling, and stratified sampling. And just to mention the different types of non-probability sampling, we have snowball, quota, judgment, and convenience sampling. All right, now guys, in this session I’ll only be focusing on probability. So let’s move on and look at the different types of probability sampling. So what is probability sampling? It is a sampling technique in which samples from a large population are chosen by using the theory of probability. There are three types of probability sampling. First we have random sampling. In this method each member of the population has an equal chance of being selected in the sample. So each and every individual or object in the population has an equal chance of being a part of the sample. That’s what random sampling is all about. You are randomly going to select any individual or any object. This way each individual has an equal chance of being selected. Correct? Next.
We have systematic sampling. Now, in systematic sampling every nth record is chosen from the population to be a part of the sample. Refer to this image that I’ve shown over here: out of these six groups, every second group is chosen as a sample. So every second record is chosen here, and this is how systematic sampling works. You’re selecting every nth record, and you’re going to add that to your sample. Next, we have stratified sampling. In this type of technique a stratum is used to form samples from a large population. So what is a stratum? A stratum is basically a subset of the population that shares at least one common characteristic. Let’s say that your population has a mix of both male and female, so you can create two strata out of this: one will have only the male subset and the other will have the female subset. This is what a stratum is. It is basically a subset of the population that shares at least one common characteristic. In our example, it is gender. So after you’ve created the strata, you’re going to use random sampling on them and choose a final sample. Random sampling means that all of the individuals in each stratum will have an equal chance of being selected in the sample. Correct? So guys, these were the three different types of sampling techniques. Now, let’s move on and look at our next topic, which is the different types of statistics. After this, we’ll be looking at the more advanced concepts of statistics. So far we’ve discussed the basics of statistics, which is basically what statistics is, the different sampling techniques, and the terminologies in statistics. Now we’ll look at the different types of statistics. There are two major types: descriptive statistics and inferential statistics. In today’s session, we will be discussing both of these types of statistics in depth.
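The three probability sampling techniques described above (random, systematic, and stratified) can be sketched with Python’s standard library. This is an illustrative sketch only: the twelve-person population, the function names, and the sample sizes are hypothetical and were chosen just to make the mechanics visible.

```python
import random

def simple_random_sample(population, k, rng):
    """Random sampling: every member has an equal chance of selection."""
    return rng.sample(population, k)

def systematic_sample(population, step, start=0):
    """Systematic sampling: take every step-th record, beginning at index start."""
    return population[start::step]

def stratified_sample(population, key, k_per_stratum, rng):
    """Stratified sampling: group members into strata by a shared
    characteristic, then draw a random sample inside each stratum."""
    strata = {}
    for member in population:
        strata.setdefault(key(member), []).append(member)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return sample

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 12 people with a gender attribute.
population = [
    {"id": i, "gender": "male" if i % 2 else "female"} for i in range(1, 13)
]

random_sample = simple_random_sample(population, 4, rng)
every_second = systematic_sample(population, step=2, start=1)
stratified = stratified_sample(population, lambda p: p["gender"], 2, rng)
```

Here `every_second` picks the 2nd, 4th, 6th, and so on, while `stratified` always contains two males and two females regardless of the seed.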
All right, we’ll also be looking at a demo which I’ll be running in the R language in order to make you understand what exactly descriptive and inferential statistics are. So guys, we’re just going to look at the basics, so don’t worry if you don’t have much knowledge; I’m explaining everything from the basic level. All right, so guys, descriptive statistics is a method which is used to describe and understand the features of a specific data set by giving a short summary of the data. It is mainly focused on the characteristics of the data. It also provides a graphical summary of the data. Now, in order to make you understand what descriptive statistics is, let’s suppose that you want to gift all your classmates a t-shirt, so you need to study the average shirt size of a student in the classroom. If you were to use descriptive statistics to study the average shirt size of students in your classroom, then what you would do is record the shirt size of all students in the class, and then find out the maximum, minimum, and average shirt size of the class. Okay. Coming to inferential statistics: inferential statistics makes inferences and predictions about a population based on a sample of data taken from the population. In simple words, it generalizes from a sample to a large data set and applies probability to draw a conclusion. It allows you to infer data parameters based on a statistical model by using sample data. So if we consider the same example of finding the average shirt size of students in a class, in inferential statistics we’ll take a sample set of the class, which is basically a few people from the entire class. You’ll already have grouped the class into large, medium, and small. In this method you basically build a statistical model and expand it to the entire population in the class. So guys, that was a brief understanding of descriptive and inferential statistics.
So that’s the difference between descriptive and inferential statistics. Now in the next section, we will go in depth into descriptive statistics. So let’s discuss more about descriptive statistics. Like I mentioned earlier, descriptive statistics is a method that is used to describe and understand the features of a specific data set by giving short summaries about the sample and measures of the data. There are two important kinds of measures in descriptive statistics. We have measures of central tendency, which are also known as measures of center, and we have measures of variability, which are also known as measures of spread. Measures of center include mean, median, and mode. Now, what are measures of center? Measures of center are statistical measures that represent the summary of a data set. The three main measures of center are mean, median, and mode. Coming to measures of variability or measures of spread, we have range, interquartile range, variance, and standard deviation. So now let’s discuss each of these measures in a little more depth, starting with the measures of center. Now, I’m sure all of you know what the mean is. The mean is basically the average of all the values in a sample. How do you measure the mean? I hope all of you know how the mean is measured. If there are 10 numbers and you want to find the mean of these 10 numbers, all you have to do is add up all the 10 numbers and divide the sum by 10. Ten here represents the number of samples in your data set. Since we have 10 numbers, we’re going to divide by 10. This will give us the average, or the mean. So to better understand the measures of central tendency, let’s look at an example. Now, the data set over here is basically the cars data set, and it contains a few variables. All right, it has something known as cars.
It has mileage per gallon, cylinder type, displacement, horsepower, and rear axle ratio. All of these measures are related to cars. So what you’re going to do is use descriptive analysis, and you’re going to analyze each of the variables in the sample data set for the mean, standard deviation, median, mode, and so on. Let’s say that you want to find out the mean or the average horsepower of the cars among the population of cars. Like I mentioned earlier, what you’ll do is check the average of all the values. So in this case we will take the sum of the horsepower of each car and divide that by the total number of cars. That’s exactly what I’ve done here in the calculation part. This 110 basically represents the horsepower of the first car. Similarly, I’ve just added up all the values of horsepower for each of the cars and divided the sum by 8. Now, 8 is basically the number of cars in our data set. So 103.625 is what our mean, or the average horsepower, is. Now, let’s understand what the median is with an example. To define it, the median is basically a measure of the central value of the sample set. You can see that it is the middle value. So if we want to find out the center value of the mileage per gallon among the population of cars, first we’ll arrange the MPG values in ascending or descending order and choose the middle value. In this case we have eight values, which is an even number. Whenever you have an even number of data points or samples in your data set, you’re going to take the average of the two middle values. If we had nine values over here, we could easily figure out the middle value and choose that as the median.
But since there’s an even number of values, we are going to take the average of the two middle values. So 22.8 and 23 are my two middle values, and I’m taking the mean of those two, and hence I get 22.9, which is my median. All right, lastly, let’s look at how the mode is calculated. So what is the mode? The value that is most recurrent in the sample set is known as the mode, or basically the value that occurs most often. Let’s say that we want to find out the most common type of cylinder among the population of cars. What we have to do is check the value which is repeated the most number of times. Here we can see that the cylinders come in two types: we have cylinders of type 4 and cylinders of type 6. So take a look at the data set. You can see that the most recurring value is 6, right? We have one, two, three, four, and five: we have five 6s. And we have one, two, three: yeah, we have three 4s. So basically we have three 4-type cylinders and five 6-type cylinders. So our mode is going to be 6, since 6 is more recurrent than 4. So guys, those were the measures of center or the measures of central tendency. Now, let’s move on and look at the measures of spread. What is a measure of spread? A measure of spread, sometimes also called a measure of dispersion, is used to describe the variability in a sample or population. You can think of it as some sort of deviation in the sample. You measure this with the help of the different measures of spread. We have range, interquartile range, variance, and standard deviation. Now, range is pretty self-explanatory, right? It is a measure of how spread apart the values in a data set are. The range can be calculated as shown in this formula.
You’re basically going to subtract the minimum value in your data set from the maximum value in your data set. That’s how you calculate the range of the data. Alright, next we have interquartile range. Before we discuss interquartile range, let’s understand what a quartile is, right? Quartiles basically tell us about the spread of a data set by breaking the data set into quarters. Just like how the median breaks the data into two parts, the quartiles will break it into quarters. So to better understand how quartiles and the interquartile range are calculated, let’s look at a small example. Now, this data set basically represents the marks of a hundred students ordered from the lowest to the highest scores, right? The quartiles lie in the following ranges. The first quartile, which is also known as Q1, lies between the 25th and 26th observations. If you look at this, I’ve highlighted the 25th and the 26th observations. You can calculate Q1, or the first quartile, by taking the average of these two values. Since both the values are 45, when you add them up and divide by two you’ll still get 45. Now the second quartile, or Q2, is between the 50th and the 51st observations. So you’re going to take the average of 58 and 59, and you will get a value of 58.5. This is my second quartile. The third quartile, Q3, is between the 75th and the 76th observations. Here again, we’ll take the average of the two values, which are the 75th value and the 76th value, and you’ll get a value of 71. All right, so guys, this is exactly how you calculate the different quartiles. Now, let’s look at what the interquartile range is. The IQR, or interquartile range, is a measure of variability based on dividing a data set into quartiles. The interquartile range is calculated by subtracting Q1 from Q3. So basically, your IQR is your Q3 minus Q1. All right.
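The measures covered so far (mean, median, mode, range, quartiles, and IQR) can be checked with a short Python sketch. Everything numeric here is an assumption made for illustration: the eight-car values and the marks are made up and are not the figures shown on screen, except that the two middle MPG values (22.8 and 23.0) and the cylinder counts (three 4s, five 6s) mirror the walkthrough. The helper `quartiles_by_position` is a hypothetical name that follows the "average the two observations at the quarter boundary" method described above.

```python
import statistics

# Hypothetical values for eight cars (not the numbers shown on screen).
horsepower = [110, 110, 93, 110, 175, 105, 62, 95]
mpg = [21.0, 22.8, 18.1, 23.0, 24.4, 21.4, 30.4, 26.0]
cylinders = [6, 6, 4, 6, 6, 4, 6, 4]  # three 4s and five 6s

mean_hp = statistics.mean(horsepower)   # sum of values / number of values
median_mpg = statistics.median(mpg)     # even count: average of the two middle values
mode_cyl = statistics.mode(cylinders)   # the most frequent value

def value_range(values):
    """Range: maximum value minus minimum value."""
    return max(values) - min(values)

def quartiles_by_position(sorted_values):
    """Q1, Q2, Q3 by averaging the observations on either side of each
    quarter boundary. Assumes the length is a multiple of 4."""
    n = len(sorted_values)
    if n % 4 != 0:
        raise ValueError("this sketch only handles lengths divisible by 4")

    def boundary_average(k):
        position = k * n // 4  # 1-indexed position of the lower observation
        return (sorted_values[position - 1] + sorted_values[position]) / 2

    return boundary_average(1), boundary_average(2), boundary_average(3)

# Hypothetical marks of 100 students, already sorted from lowest to highest.
marks = list(range(1, 101))
q1, q2, q3 = quartiles_by_position(marks)
iqr = q3 - q1  # interquartile range: Q3 minus Q1
```

Note that library routines such as `statistics.quantiles` interpolate differently, so they can give slightly different quartiles than the boundary-average method used here.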
Now, each quartile represents a quarter, which is 25%. All right. So guys, I hope all of you are clear on the interquartile range and what quartiles are. Now, let’s look at variance. Variance is basically a measure that shows how much a random variable differs from its expected value. It’s basically the variation in any variable. Now, variance can be calculated by using this formula right here: x basically represents any data point in your data set, n is the total number of data points in your data set, and x bar is basically the mean of the data points. This is how you calculate variance. Variance is basically computing the squares of deviations. That’s why it says s squared there. Now let’s look at what deviation is. Deviation is just the difference between each element and the mean. It can be calculated by using this simple formula, where x i basically represents a data point and mu is the mean of the population. All right, this is exactly how you calculate the deviation. Now, population variance and sample variance are specific to whether you’re calculating the variance in your population data set or in your sample data set. That’s the difference between population and sample variance. The formula for population variance is pretty self-explanatory: x is basically each data point, mu is the mean of the population, and N is the number of data points in your population. Now, let’s look at sample variance. Sample variance is the average of squared differences from the mean. Here x i is any data point or any sample in your data set, and x bar is the mean of your sample. It’s not the mean of your population; it’s the mean of your sample. And if you notice, n here is a lowercase n: it is the number of data points in your sample. This is basically the difference between sample and population variance.
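To make the population versus sample distinction concrete, here is a small Python sketch. The eight data values are hypothetical. One assumption is worth flagging: the sample variance below divides by n - 1, which is the usual textbook convention (and what Python’s `statistics.variance` does); the formula on the slide is not visible in the transcript, so treat that divisor as an assumption about what was shown.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data points

mean = sum(data) / len(data)

# Deviation: difference between each data point and the mean.
deviations = [x - mean for x in data]
squared_deviations = [d ** 2 for d in deviations]

# Population variance: average of the squared deviations (divide by N).
population_variance = sum(squared_deviations) / len(data)

# Sample variance: same numerator, divided by n - 1 (assumed convention).
sample_variance = sum(squared_deviations) / (len(data) - 1)

# Standard deviation is the square root of the variance.
population_std = math.sqrt(population_variance)
```

For this data the mean is 5, the squared deviations sum to 32, the population variance is 4, and the population standard deviation is 2.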
I hope that is clear. Coming to standard deviation, it is the measure of dispersion of a set of data from its mean. So it’s basically the deviation from your mean. That’s what standard deviation is. Now, to better understand how the measures of spread are calculated, let’s look at a small use case. Let’s say Daenerys has 20 dragons. They have the numbers 9, 2, 5, 4, and so on, as shown on the screen. What you have to do is work out the standard deviation. In order to calculate the standard deviation, you need to know the mean, right? So first you’re going to find out the mean of your sample set. How do you calculate the mean? You add all the numbers in your data set and divide by the total number of samples in your data set, so you get a value of 7 here. Then you calculate the right-hand side of your standard deviation formula. From each data point you’re going to subtract the mean, and you’re going to square that. When you do that, you will get the following result: 4, 25, 4, 9, 25, and so on. Finally, you will just find the mean of the squared differences. Your standard deviation will come up to 2.983 once you take the square root. So guys, it’s pretty simple. It’s a simple mathematical technique; all you have to do is substitute the values in the formula. I hope this was clear to all of you. Now let’s move on and discuss the next topic, which is information gain and entropy. This is one of my favorite topics in statistics. It’s very interesting, and this topic is mainly involved in machine learning algorithms like decision trees and random forest. It’s very important for you to know how information gain and entropy really work and why they are so essential in building machine learning models.
We’ll focus on the statistics part of information gain and entropy, and after that we’ll discuss a use case and see how information gain and entropy are used in decision trees. For those of you who don’t know what a decision tree is, it is basically a machine learning algorithm. You don’t have to know anything about this; I’ll explain everything in depth, so don’t worry. Now, let’s look at what exactly entropy and information gain are. Guys, entropy is basically the measure of any sort of uncertainty that is present in the data. It can be measured by using this formula. Here S is the set of all instances, or all the data items, in the data set, N is the number of different classes in your data set, and p i is the event probability. Now this might seem a little confusing to y’all, but when we go through the use case, you’ll understand all of these terms even better. All right, coming to information gain. As the name suggests, information gain indicates how much information a particular feature or a particular variable gives us about the final outcome. It can be measured by using this formula. Here H(S) is the entropy of the whole data set S, |S j| is the number of instances with the j-th value of an attribute A, |S| is the total number of instances in the data set, V is the set of distinct values of attribute A, H(S j) is the entropy of a subset of instances, and H(A, S) is the entropy of attribute A. Even though this seems confusing, I’ll clear out the confusion. All right, let’s discuss a small problem statement where we will understand how information gain and entropy are used to study the significance of a model. Like I said, information gain and entropy are very important statistical measures that let us understand the significance of a predictive model. To get a clearer understanding, let’s look at a use case.
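Since the on-screen formulas are not captured in the transcript, here is one way to write them that is consistent with the symbols just described. The base-2 logarithm is an assumption (it is the usual choice, giving entropy in bits), and the index ranges are reconstructed from the spoken definitions rather than copied from the slide.

```latex
% Entropy of the data set S with N classes, where p_i is the
% probability (relative frequency) of class i in S:
H(S) = -\sum_{i=1}^{N} p_i \log_2 p_i

% Information gain of attribute A, where V is the set of distinct
% values of A and S_j is the subset of S taking the j-th value:
\mathrm{Gain}(A, S) = H(S) - \sum_{j \in V} \frac{|S_j|}{|S|}\, H(S_j)
                    = H(S) - H(A, S)
```

Read this as: start from the uncertainty of the whole data set, then subtract the weighted average uncertainty that remains after splitting on attribute A.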
All right, now suppose we are given a problem statement. The statement is that you have to predict whether a match can be played or not by studying the weather conditions. The predictor variables here are outlook, humidity and wind; day is also a predictor variable. The target variable is play. The target variable is the variable that you’re trying to predict. Now the value of the target variable will decide whether or not a game can be played. That’s why play has two values, no and yes: no, meaning that the weather conditions are not good and therefore you cannot play the game; yes, meaning that the weather conditions are good and suitable for you to play the game. So that was our problem statement. I hope the problem statement is clear to all of you.
Now to solve such a problem, we make use of something known as decision trees. So guys, think of an inverted tree where each branch of the tree denotes some decision. Each branching point is known as a branch node, and at each branch node you’re going to take a decision in such a manner that you will get an outcome at the end of the branch.
Now this figure here basically shows that out of 14 observations, 9 observations result in a yes, meaning that out of 14 days the match can be played on only nine days. Here, if you see, on day 1, day 2, day 8, day 9 and day 11 the outlook has been sunny. So basically we try to cluster the data set depending on the outlook. When the outlook is sunny, this is our data set; when the outlook is overcast, this is what we have; and when the outlook is rain, this is what we have. When it is sunny we have two yeses and three nos. When the outlook is overcast, we have all four as yes, meaning that on the four days when the outlook was overcast we can play the game.
Now when it comes to rain, we have three yeses and two nos. So if you notice here, the decision is being made by choosing the outlook variable as the root node. The root node is basically the topmost node in a decision tree. What we’ve done here is create a decision tree that starts with the outlook node, and then you split the decision tree further depending on its values: sunny, overcast and rain.
Let me explain this in a more in-depth manner. The outlook node has three branches coming out from it, which are sunny, overcast and rain, because outlook can have three values: it can be sunny, it can be overcast, or it can be rainy. These three values are assigned to the immediate branch nodes, and for each of these values the probability of play being equal to yes is calculated. The sunny and the rain branches will give you an impure output, meaning that there is a mix of yes and no. There are two yeses and three nos here, and three yeses and two nos over here. But when it comes to the overcast value, it results in a hundred percent pure subset. This shows that the overcast value will result in a definite and certain output.
This is exactly what entropy is used to measure. It calculates the impurity or the uncertainty. So the lesser the uncertainty or the entropy of a variable, the more significant that variable is. When it comes to overcast there’s literally no impurity in the data set. It is a hundred percent pure subset, right? So we want variables like these in order to build a model.
All right, now, we don’t always get lucky and we don’t always find variables that will result in pure subsets. That’s why we have the measure entropy. The lesser the entropy of a particular variable, the more significant that variable will be. So in a decision tree, the root node is assigned the best attribute so that the decision tree can predict the most precise outcome, meaning that on the root node you should have the most significant variable. That’s why we’ve chosen outlook. Now some of you might ask me, why haven’t you chosen overcast? Well, overcast is not a variable. It is a value of the outlook variable. That’s why we’ve chosen outlook, because it has a hundred percent pure subset, which is overcast.
Now the question in your head is, how do I decide which variable or attribute best splits the data? Right now I looked at the data and told you that here we have a hundred percent pure subset, but what if it’s a more complex problem and you’re not able to tell which variable will best split the data? So guys, when it comes to decision trees, information gain and entropy will help you understand which variable will best split the data set, or which variable you have to assign to the root node, because whichever variable is assigned to the root node has to best split the data set and has to be the most significant variable.
So how do we do this? We need to use information gain and entropy. From the total of the 14 instances that we saw, nine of them said yes and five of them said no, you cannot play on that particular day. So how do you calculate the entropy? This is the formula; you just substitute the values in the formula. When you substitute the values, you will get a value of 0.940.
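
As a quick check of that substitution, here is a minimal Python sketch of the entropy calculation for 9 yes and 5 no instances. The function name `entropy` is just an illustrative choice, and the formula assumed is the standard base-2 entropy.

```python
import math

def entropy(counts):
    """Base-2 entropy of a list of class counts, e.g. [yes_count, no_count]."""
    total = sum(counts)
    # Classes with zero instances contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 14 days in total: play = yes on 9 days, play = no on 5 days.
h_play = entropy([9, 5])
print(f"entropy of the whole data set: {h_play:.3f}")

# A pure subset (overcast: 4 yes, 0 no) has no uncertainty at all.
h_overcast = entropy([4, 0])
print(f"entropy of the overcast subset: {h_overcast:.3f}")
```

The first value comes out at roughly 0.940 and the second at 0, which matches the idea that a pure subset carries no uncertainty.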
This is the entropy, or the uncertainty, of the data present in the sample. Now in order to ensure that we choose the best variable for the root node, let us look at all the possible combinations that you can use on the root node. These are all the possible combinations: you can have outlook, windy, humidity or temperature. These are four variables, and you can have any one of them as your root node. But how do you select which variable best fits the root node? That’s what we are going to see by using information gain and entropy.
So guys, the task at hand is to find the information gain for each of these attributes: for outlook, for windy, for humidity and for temperature. A point to remember is that the variable that results in the highest information gain must be chosen, because it will give us the most precise output information.
We’ll calculate the information gain for the attribute windy first. Here we have six instances of true and eight instances of false. When you substitute all the values in the formula, you will get a value of 0.048. This is a very low value for information gain, so the information that you’re going to get from the windy attribute is pretty low.
So let’s calculate the information gain of the attribute outlook. From the total of 14 instances, we have five instances which say sunny, four instances which are overcast and five instances which are rainy. For sunny we have two yeses and three nos, for overcast we have all four as yes, and for rainy we have three yeses and two nos. When you calculate the information gain of the outlook variable, you get a value of 0.247. Now compare this to the information gain of the windy attribute.
This value is actually pretty good. We have 0.247, which is a pretty good value for information gain.
Now let’s look at the information gain of the attribute humidity. Over here we have seven instances which say high and seven instances which say normal. Under the high branch node we have three instances which say yes and the remaining four instances say no. Similarly, under the normal branch we have six instances which say yes and one instance which says no. When you calculate the information gain for the humidity variable, you’re going to get a value of 0.151. This is also a pretty decent value, but when you compare it to the information gain of the attribute outlook, it is less.
Now let’s look at the information gain of the attribute temperature. The temperature attribute can hold hot, mild and cool. Under hot we have two instances of yes and two instances of no, under mild we have four instances of yes and two instances of no, and under cool we have three instances of yes and one instance of no. When you calculate the information gain for this attribute, you will get a value of 0.029, which is again very low.
So what you can summarize from here is, if we look at the information gain for each of these variables, we’ll see that outlook has the maximum gain. We have 0.247, which is the highest information gain value, and you must always choose the variable with the highest information gain to split the data at the root node. That’s why we assign the outlook variable at the root node. All right, so guys, I hope this use case was clear. If any of you have doubts,
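
The four information gain values can be reproduced with a short Python sketch. The yes/no counts for outlook, humidity and temperature are the ones quoted in the walkthrough, with the normal-humidity branch taken as 6 yes and 1 no so that it adds up to 7 instances. The yes/no breakdown for windy is not stated in the transcript, so the counts used for it here are an assumption taken from the usual version of this weather data set. The helper names `entropy` and `information_gain` are illustrative.

```python
import math

def entropy(counts):
    """Base-2 entropy of a list of class counts, e.g. [yes_count, no_count]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, subsets):
    """Entropy of the parent minus the weighted entropy of its subsets."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted

parent = [9, 5]  # 9 yes, 5 no over 14 days

# Each subset is [yes_count, no_count] for one value of the attribute.
splits = {
    "outlook": [[2, 3], [4, 0], [3, 2]],      # sunny, overcast, rainy
    "humidity": [[3, 4], [6, 1]],             # high, normal
    "temperature": [[2, 2], [4, 2], [3, 1]],  # hot, mild, cool
    "windy": [[3, 3], [6, 2]],                # true, false (ASSUMED breakdown)
}

gains = {name: information_gain(parent, subsets) for name, subsets in splits.items()}
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} {gain:.3f}")

best_attribute = max(gains, key=gains.get)
print("attribute for the root node:", best_attribute)
```

Small differences in the third decimal place, for example for humidity, come from rounding the intermediate entropies when the calculation is done by hand.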
please keep commenting those doubts. Now let’s move on and look at what exactly a confusion matrix is. The confusion matrix is the last topic for descriptive statistics. After this, I’ll be running a short demo where I’ll be showing you how you can calculate mean, median, mode, standard deviation, variance and all of those values by using R.
So let’s talk about the confusion matrix. Now guys, what is a confusion matrix? Don’t get confused, this is not a complex topic. A confusion matrix is a matrix that is often used to describe the performance of a model. It is specifically used for classification models, or classifiers, and what it does is calculate the accuracy or the performance of your classifier by comparing your actual results and your predicted results. This is what it looks like: true positive, true negative and all of that. Now this is a little confusing. I’ll get back to what exactly true positive, true negative and all of this stand for. For now, let’s look at an example and try to understand what exactly a confusion matrix is.
So guys, I made sure that I put examples after each and every topic, because it’s important that you understand the practical part of statistics. Statistics is not only about theory; you need to understand how calculations are done in statistics. So let’s look at a small use case. Let’s consider that you’re given data about 165 patients, out of which 105 patients have a disease and the remaining 60 patients don’t have the disease. What you’re going to do is build a classifier using these 165 observations. You feed all of these 165 observations to your classifier, and it will predict the output every time a new patient’s detail is fed to the classifier. Now, out of these 165 cases,
let’s say that the classifier predicted yes 110 times and no 55 times. Yes basically stands for yes, the person has the disease, and no stands for no, the person does not have the disease. That’s pretty self-explanatory. So it predicted 110 times that the patient has the disease and 55 times that the patient doesn’t have the disease. However, in reality only 105 patients in the sample have the disease and 60 patients do not have the disease, right?
So how do you calculate the accuracy of your model? You basically build the confusion matrix. This is how the matrix looks. Here n basically denotes the total number of observations that you have, which is 165 in our case; actual denotes the actual values in the data set; and predicted denotes the values predicted by the classifier. The actual value is no here and the predicted value is no here, so your classifier was correctly able to classify 50 cases as no, since both of these are no. But 10 of these cases it classified incorrectly, meaning that the actual value here is no but your classifier predicted it as yes. That’s why this 10 is over here. Similarly, it wrongly predicted that five patients do not have the disease whereas they actually did have it, and it correctly predicted 100 patients who have the disease.
I know this is a little bit confusing. But if you look at these values: no, no, 50 means that it correctly predicted 50 values as no; no, yes, 10 means that it wrongly predicted yes for values where it was supposed to predict no.
Now what exactly are true positive, true negative and all of that? I’ll tell you what exactly they are. True positives are the cases in which we predicted a yes and the patients actually do have the disease.
All right, so it is basically this value: we predicted a yes here and they did have the disease. So we have 100 true positives. Similarly, true negative means we predicted no and they don’t have the disease, meaning that this is correct. False positive means we predicted yes, but they do not actually have the disease. This is also known as a type 1 error. False negative means we predicted no, but they actually do have the disease. So guys, basically true positives and true negatives are the correct classifications. So this was the confusion matrix, and I hope this concept is clear. Again guys, if you have doubts, please comment your doubt in the comment section.
So guys, that was descriptive statistics. Now, before we go to probability, I promised you all that we’ll run a small demo in R. We’ll try and understand how mean, median and mode work in R, so let’s do that first. Again, what we just discussed so far was descriptive statistics. Next we’re going to discuss probability, and then we’ll move on to inferential statistics. Inferential statistics is basically the second type of statistics.
Now to make things clearer for you, let me just zoom in. So guys, it’s always best to perform practical implementations in order to understand the concepts in a better way. So here we’ll be executing a small demo that will show you how to calculate the mean, median, mode, variance and standard deviation, and how to study the variables by plotting a histogram. Don’t worry if you don’t know what a histogram is. It’s basically a frequency plot. There’s no big science behind it. This is a very simple demo, but it also forms a foundation that every machine learning algorithm is built upon. You can say that most of the machine learning algorithms, actually all the machine learning algorithms and deep learning algorithms, have this basic concept behind them.
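
Before the demo, the patient example can be recapped in a small Python sketch. The four cell counts are the ones read off the confusion matrix above; the variable names are illustrative, and accuracy is computed as the share of correct classifications.

```python
# Cells of the confusion matrix from the 165-patient example.
true_negatives = 50    # actual no,  predicted no
false_positives = 10   # actual no,  predicted yes (type 1 error)
false_negatives = 5    # actual yes, predicted no
true_positives = 100   # actual yes, predicted yes

total = true_negatives + false_positives + false_negatives + true_positives

# Row and column totals should match the numbers quoted in the example.
actual_yes = true_positives + false_negatives       # patients who have the disease
actual_no = true_negatives + false_positives        # patients who do not
predicted_yes = true_positives + false_positives    # classifier said yes
predicted_no = true_negatives + false_negatives     # classifier said no

# Accuracy: correct classifications divided by all observations.
accuracy = (true_positives + true_negatives) / total

print("n =", total)
print("actual yes/no:", actual_yes, actual_no)
print("predicted yes/no:", predicted_yes, predicted_no)
print(f"accuracy: {accuracy:.3f}")
```

The totals work out to 105 actual yes, 60 actual no, 110 predicted yes and 55 predicted no, so the matrix is consistent with the problem statement.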
Okay, you need to know how mean, median, mode and all of that are calculated. So guys, I’m using the R language to perform this, and I’m running this in RStudio. For those of you who don’t know the R language, I will leave a couple of links in the description box. You can go through those videos.
So what we’re doing is randomly generating numbers and storing them in a variable called data. If you want to see the generated numbers, just run the line data; this variable basically stores all our numbers. Now what we’re going to do is calculate the mean. All you have to do in R is specify the word mean along with the data that you’re calculating the mean of, and I’ve assigned this whole thing to a variable called mean, which will just hold the mean value of this data. Now let’s look at the mean; for that I’ll use a function called print and pass it mean. So our mean is around 5.99.
Next is calculating the median. It’s very simple, guys. All you have to do is use the function median and pass the data as a parameter to this function. That’s all you have to do. R provides functions for each and everything. Statistics is very easy when it comes to R, because R is basically a statistical language. All you have to do is name the function, and that function is already built into R. So your median is around 6.4.
Similarly, we will calculate the mode. Let’s run this function. I basically created a small function for calculating the mode. So guys, this is our mode, meaning that this is the most recurrent value.
Now we’re going to calculate the variance and the standard deviation. For that, again, we have a function in R called var. All you have to do is pass the data to that function.
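
The demo runs in R, and the script itself is not part of the transcript. As a rough Python analogue of the steps so far, and not the code used in the video, the same summary values can be computed with the standard `statistics` module. The data below is hypothetical.

```python
import statistics

# Hypothetical data (the randomly generated numbers from the demo are not shown).
data = [4, 6, 6, 7, 8, 6, 5, 9]

mean_value = statistics.mean(data)          # arithmetic mean
median_value = statistics.median(data)      # middle value of the sorted data
mode_value = statistics.mode(data)          # most frequent value

# Sample variance and standard deviation (divide by n - 1),
# which is also what R's var() and sd() compute.
variance_value = statistics.variance(data)
std_dev_value = statistics.stdev(data)

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
print("variance:", variance_value)
print("standard deviation:", std_dev_value)
```

The standard deviation is the square root of the variance, so the last two values always move together.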
Okay, similarly we’ll calculate the standard deviation, which is basically the square root of your variance. Now we’ll print the standard deviation. This is our standard deviation value.
Finally, we will just plot a small histogram. A histogram is nothing but a frequency plot; it will show you how frequently a data point occurs. So this is the histogram that we’ve just created. It’s quite simple in R, because R has a lot of packages and a lot of inbuilt functions that support statistics. It is a statistical language that is mainly used by data scientists, data analysts and machine learning engineers, because they don’t have to sit and code these functions. All they have to do is mention the name of the function and pass the corresponding parameters. So guys, that was the entire descriptive statistics module, and now we will discuss probability.
Before we understand what exactly probability is, let me clear up a very common misconception. People often tend to ask me this question: what is the relationship between statistics and probability? Probability and statistics are related fields. Probability is a mathematical method used for statistical analysis. Therefore we can say that probability and statistics are interconnected branches of mathematics that deal with analyzing the relative frequency of events. Probability makes use of statistics and statistics makes use of probability; they’re very interconnected fields. So that is the relationship between statistics and probability.
Now let’s understand what exactly probability is. Probability is the measure of how likely an event is to occur. To be more precise, it is the ratio of desired outcomes to the total outcomes.
Now, the probabilities of all outcomes always sum up to 1. Probability cannot go beyond 1. So your probability can be 0, or it can be 1, or it can be in the form of decimals like 0.52 or 0.55, or 0.5, 0.7, 0.9, but its value will always stay in the range 0 to 1.
Another famous example of probability is the rolling a dice example. When you roll a dice you get six possible outcomes, right? You get the faces one, two, three, four, five and six of the die, and each face appears exactly once. So what is the probability that on rolling a dice you will get 3? The probability is 1 by 6, because out of six faces there’s only one face which has the number 3 on it. Similarly, if you want to find the probability of getting the number 5, again the probability is going to be 1 by 6. And all of these will sum up to 1. So guys, this is exactly what probability is. It’s a very simple concept. We all learnt it from the eighth standard onwards, right?
Now let’s understand the different terminologies that are related to probability. There are three terminologies that you often come across when we talk about probability. We have something known as the random experiment. It’s basically an experiment or a process for which the outcomes cannot be predicted with certainty. That’s why you use probability: you’re going to use probability in order to predict the outcome with some sort of certainty. The sample space is the entire possible set of outcomes of a random experiment, and an event is one or more outcomes of an experiment. So if you consider the example of rolling a dice, let’s say that you want to find out the probability of getting a 2 when you roll the dice.
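
The dice example can be written out as a minimal Python sketch, treating the sample space and the event as sets and the probability as the ratio of desired outcomes to total outcomes. The helper name `probability` is illustrative, and the sketch assumes a fair die where every face is equally likely.

```python
from fractions import Fraction

# Sample space: every possible outcome of rolling one fair die.
sample_space = {1, 2, 3, 4, 5, 6}

def probability(event):
    """Ratio of desired outcomes to total outcomes, for equally likely outcomes."""
    favourable = event & sample_space
    return Fraction(len(favourable), len(sample_space))

p_two = probability({2})      # event: the die shows a 2
p_three = probability({3})    # event: the die shows a 3

# The probabilities of all six single-face events add up to 1.
total = sum(probability({face}) for face in sample_space)

print("P(2) =", p_two)
print("P(3) =", p_three)
print("sum over all faces =", total)
```

Using `Fraction` keeps the results exact (1/6 rather than 0.1666...), which makes it easy to see that the six probabilities sum to exactly 1.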
So rolling the dice to find this probability is the random experiment. The sample space is basically your entire set of possibilities: one, two, three, four, five and six are there, and out of that you need to find the probability of getting a 2. So all the possible outcomes, 1 to 6, represent your sample space. Now an event is one or more outcomes of an experiment. In this case my event is to get a 2 when I roll the dice. So guys, this is basically what random experiment, sample space and event really mean.
Now let’s discuss the different types of events. There are two types of events that you should know about: disjoint and non-disjoint events. Disjoint events are events that do not have any common outcome. For example, if you draw a single card from a deck of cards, it cannot be a king and a queen, correct? It can either be a king or it can be a queen. Non-disjoint events are events that have common outcomes. For example, a student can get a hundred marks in statistics and a hundred marks in probability. Also, the outcome of a ball delivered can be a no-ball and it can be a six. So this is what non-disjoint events are. These are very simple to understand.
Now let’s move on and look at the different types of probability distribution. I’ll be discussing three main topics: the probability density function, the normal distribution and the central limit theorem. The probability density function, also known as the PDF, is concerned with the relative likelihood for a continuous random variable to take on a given value. So the PDF gives the probability of a variable that lies between the range A and B.
So basically what you’re trying to do is find the probability of a continuous random variable over a specified range. Now this graph denotes the PDF of a continuous variable. This graph is also known as the bell curve; it’s famously called the bell curve because of its shape.
There are three important properties that you need to know about a probability density function. The first is that the graph of a PDF will be continuous over a range. This is because you’re finding the probability that a continuous variable lies between the values A and B. The second property is that the area bounded by the curve of a density function and the x-axis is equal to 1. Basically, the area below the curve is equal to 1, because it denotes probability, and again, probability cannot range beyond 1; it has to be between 0 and 1. Property number three is that the probability that our random variable assumes a value between A and B is equal to the area under the PDF bounded by A and B. What this means is that the probability is denoted by the area under the graph. So whatever value you get here is the probability that the random variable will lie between A and B. I hope all of you have understood the probability density function. It’s basically about finding the probability that the value of a continuous random variable lies between A and B.
Now let’s look at our next distribution, which is the normal distribution. The normal distribution, which is also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean. The idea behind this function is that the data near the mean occurs more frequently than the data away from the mean. So what it means to say is that the data around the mean represents the entire data set.
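
The second and third properties of a PDF can be checked numerically. The sketch below is only an illustration: it uses a standard normal distribution (mean 0, standard deviation 1) as the example density, takes A = -1 and B = 1 as arbitrary bounds, and approximates the area under the curve with the trapezoidal rule. `NormalDist` comes from Python’s standard `statistics` module; the helper name `area_under_pdf` is made up for this example.

```python
from statistics import NormalDist

# Example density: a standard normal distribution (an assumption for illustration).
dist = NormalDist(mu=0.0, sigma=1.0)

def area_under_pdf(distribution, a, b, steps=10_000):
    """Approximate the area under the PDF between a and b (trapezoidal rule)."""
    width = (b - a) / steps
    heights = [distribution.pdf(a + i * width) for i in range(steps + 1)]
    return width * (sum(heights) - 0.5 * (heights[0] + heights[-1]))

# Property 2: the total area under the curve is 1
# (the range -8..8 covers practically all of the standard normal curve).
total_area = area_under_pdf(dist, -8.0, 8.0)

# Property 3: P(A < X < B) is the area under the PDF between A and B.
p_between = area_under_pdf(dist, -1.0, 1.0)

# Cross-check with the cumulative distribution function.
p_exact = dist.cdf(1.0) - dist.cdf(-1.0)

print(f"total area: {total_area:.6f}")
print(f"P(-1 < X < 1) by area: {p_between:.6f}")
print(f"P(-1 < X < 1) by CDF:  {p_exact:.6f}")
```

For a normal distribution, the area within one standard deviation of the mean is about 0.68, which is why data near the mean is said to occur most frequently.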
So if you just take a sample of data around the mean, it can represent the entire data set. Now, similar to the probability density function, the normal distribution appears as a bell curve. When it comes to the normal distribution, there are two important factors: the mean of the population and the standard deviation. The mean determines the location of the center of the graph, and the standard deviation determines the height and width of the graph. If the standard deviation is large, the curve is going to look something like this: it’ll be short and wide. And if the standard deviation is small, the curve is tall and narrow. So this was it about the normal distribution.
Now let’s look at the central limit theorem. The central limit theorem states that the sampling distribution of the mean of any independent random variable will be normal or nearly normal if the sample size is large enough. Now, that’s a little confusing, so let me break it down for you. In simple terms, if we had a large population and we divided it into many samples, then the mean of all the samples from the population will be almost equal to the mean of the entire population, and the means of the samples will be approximately normally distributed. So if you compare the mean of each of the samples, it will almost be equal to the mean of the population. This graph gives a clearer understanding of the central limit theorem. You can see each sample here, and the mean of each sample is almost along the same line, right? So this is exactly what the central limit theorem states.
Now the accuracy, or the resemblance to the normal distribution, depends on two main factors. The first is the number of sample points that you consider, and the second is the shape of the underlying population.
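
The central limit theorem is easy to see in a small simulation. The sketch below is only an illustration under assumed settings: the population is an exponential distribution with rate 1 (clearly not bell-shaped, with mean 1 and standard deviation 1), and the sample size of 50, the 2000 samples and the seed are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Assumed population: exponential distribution with rate 1.
# It is strongly skewed, with mean 1 and standard deviation 1.
population_mean = 1.0
population_std = 1.0

sample_size = 50      # number of points in each sample
num_samples = 2000    # number of samples drawn from the population

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
spread_of_means = statistics.stdev(sample_means)

# The theorem predicts that the sample means cluster around the population
# mean with a spread of about population_std / sqrt(sample_size).
expected_spread = population_std / sample_size ** 0.5

print(f"mean of the sample means: {mean_of_means:.3f}")
print(f"spread of the sample means: {spread_of_means:.3f}")
print(f"spread predicted by the theorem: {expected_spread:.3f}")
```

Note that it is the sample means that line up around the population mean and take on a bell shape; the individual samples stay as skewed as the population they came from.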
Now the shape obviously depends on the standard deviation and the mean of a sample, correct? So guys, the central limit theorem basically states that the means of the samples will be approximately normally distributed, in such a way that the mean of each sample will nearly coincide with the mean of the actual population. In short, that’s what the central limit theorem states. And this holds true mostly for a large data set. For a small data set there are more deviations when compared to a large data set, because of the scaling factor: the smallest deviation in a small data set will change the value very drastically, but in a large data set a small deviation will not matter at all.
Now let’s move on and look at our next topic, which is the different types of probability. This is an important topic, because most of your problems can be solved by understanding which type of probability you should use to solve the problem. We have three important types of probability: marginal, joint and conditional probability. So let’s discuss each of these.
The probability of an event occurring, unconditioned on any other event, is known as marginal or unconditional probability. Let’s say that you want to find the probability that a card drawn is a heart. The probability will be 13 by 52, since there are 13 hearts in a deck of cards and there are 52 cards in total. So your marginal probability will be 13 by 52. That’s about marginal probability.
Now let’s understand what joint probability is. Joint probability is a measure of two events happening at the same time. Let’s say that the two events are A and B. The probability of events A and B both occurring is the probability of the intersection of A and B.
So for example, if you want to find the probability that a card is a four and red, that would be a joint probability, because you’re finding a card that is a 4 and the card has to be red in color. The answer to this would be 2 by 52, because we have one 4 in hearts and one 4 in diamonds, correct? Both of these are red in color, therefore our probability is 2 by 52, and if you reduce it further, it is 1 by 26. So this is what joint probability is all about.
Moving on, let’s look at what exactly conditional probability is. If the probability of an event or an outcome is based on the occurrence of a previous event or outcome, then you call it a conditional probability. So the conditional probability of an event B is the probability that the event will occur given that an event A has already occurred. If A and B are dependent events, then the expression for conditional probability is given by this. The first term on the left-hand side, which is P(B|A), is basically the probability of event B occurring given that event A has already occurred. So like I said, if A and B are dependent events then this is the expression, but if A and B are independent events, then the expression for conditional probability is like this. P(A) and P(B) are obviously the probability of A and the probability of B.
Now let’s move on. In order to understand conditional probability, joint probability and marginal probability, let’s look at a small use case. We’re going to take a data set which examines the salary package and the training undergone by candidates. In this there are 60 candidates without training and 45 candidates who have enrolled for Edureka’s training. The task here is that you have to assess the training against the salary package. Let’s look at this in a little more depth.
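
The card examples above can be checked by counting cards in a deck. The two conditional probability expressions are only shown on screen; the sketch assumes they are the standard ones, P(B|A) = P(A and B) / P(A) for dependent events and P(B|A) = P(B) for independent events. The helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

def is_heart(card):
    return card[1] == "hearts"

def is_red(card):
    return card[1] in ("hearts", "diamonds")

def is_four(card):
    return card[0] == "4"

def probability(predicate):
    """Share of cards in the deck that satisfy the predicate."""
    return Fraction(sum(1 for card in deck if predicate(card)), len(deck))

# Marginal probability: the card is a heart.
p_heart = probability(is_heart)

# Joint probability: the card is a four AND red.
p_four_and_red = probability(lambda card: is_four(card) and is_red(card))

# Conditional probability (assumed standard formula):
# P(four | red) = P(four and red) / P(red)
p_four_given_red = p_four_and_red / probability(is_red)

print("P(heart) =", p_heart)
print("P(four and red) =", p_four_and_red)
print("P(four | red) =", p_four_given_red)
print("P(four) =", probability(is_four))
```

Here P(four | red) comes out equal to P(four), which is the independent case: knowing that the card is red tells you nothing about its rank.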
So in total we have 105 candidates, out of which 60 have not enrolled for Edureka’s training and 45 have enrolled for Edureka’s training. This is a small survey that was conducted, and this is the rating of the package or the salary that they got. If you read through the data, you can see that there were five candidates without Edureka training who got a very poor salary package. Similarly, there are 30 candidates with Edureka training who got a good package. So guys, basically you’re comparing the salary package of a person depending on whether or not they’ve enrolled for Edureka training. This is our data set.
Now let’s look at our problem statement: find the probability that a candidate has undergone Edureka’s training. Quite simple. Which type of probability is this? This is marginal probability. The probability that a candidate has undergone Edureka’s training is obviously 45 divided by 105, since 45 is the number of candidates with Edureka training and 105 is the total number of candidates, so you get a value of approximately 0.43. That’s the probability that a candidate has undergone Edureka’s training.
Next question: find the probability that a candidate has attended Edureka’s training and also has a good package. This is obviously a joint probability problem. So how do you calculate this? Since our table is quite well formatted, we can directly find that the people who have gotten a good package along with Edureka training are 30. So out of 105 people, 30 people have Edureka training and a good package. They are specifically asking for people with Edureka training, remember that. The question is to find the probability that a candidate has attended Edureka’s training and also has a good package.
All right, so we need to consider two factors: a candidate who has attended Edureka's training and who has a good package. Clearly that number is 30, so the answer is 30 divided by the total number of candidates, which is 105. So here you get the answer clearly.
Next we have: find the probability that a candidate has a good package given that he has not undergone training. This is clearly conditional probability, because here you're defining a condition. You're saying that you want the probability of a candidate having a good package given that he has not undergone any training. The condition is that he has not undergone any training. The number of people who have not undergone training is 60, and out of those, five have got a good package. That's why this is 5/60 and not 5/105, because here they have clearly mentioned "has a good package given that he has not undergone training". You have to consider only the people who have not undergone training. So 5 divided by 60 gives you a probability of around 0.08, which is pretty low.
So this was all about the different types of probability. Now let's move on and look at our last topic in probability, which is Bayes' theorem. Bayes' theorem is a very important concept when it comes to statistics and probability. It is majorly used in the naive Bayes algorithm. For those of you who aren't aware, naive Bayes is a supervised learning classification algorithm, and it is best known for its use in email spam filtering. A lot of you might have noticed that if you open up Gmail, you have a folder called Spam. All of that is carried out through machine learning, and naive Bayes is the classic algorithm for this kind of filtering.
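Before going further, the three survey questions above (marginal, joint and conditional) can be checked with a few lines of Python. This is only a sketch: the counts are the ones quoted in the survey, while the variable names are made up for illustration.

```python
from fractions import Fraction

# Counts quoted in the survey above.
TOTAL = 105               # all candidates
TRAINED = 45              # enrolled for the training
UNTRAINED = 60            # did not enroll
TRAINED_AND_GOOD = 30     # trained and got a good package
UNTRAINED_AND_GOOD = 5    # untrained and got a good package

# Marginal probability: P(trained)
p_trained = Fraction(TRAINED, TOTAL)

# Joint probability: P(trained and good package)
p_trained_and_good = Fraction(TRAINED_AND_GOOD, TOTAL)

# Conditional probability: P(good package | not trained).
# The denominator is the 60 untrained candidates, not all 105.
p_good_given_untrained = Fraction(UNTRAINED_AND_GOOD, UNTRAINED)

print(f"P(trained)             = {float(p_trained):.3f}")
print(f"P(trained and good)    = {float(p_trained_and_good):.3f}")
print(f"P(good | not trained)  = {float(p_good_given_untrained):.3f}")
```

Using `Fraction` keeps the results exact, so the only rounding happens when the values are printed.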
So now let's discuss what exactly Bayes' theorem is and what it denotes. Bayes' theorem is used to show the relation between one conditional probability and its inverse. Basically, it is nothing but the probability of an event occurring based on prior knowledge of conditions that might be related to that event. Mathematically, Bayes' theorem is represented like this: P(A|B) = P(B|A) × P(A) / P(B). The term on the left-hand side, P(A|B), is known as the posterior, which is the probability of occurrence of A given an event B. The second term, P(B|A), is referred to as the likelihood; it measures the probability of occurrence of B given an event A. P(A) is known as the prior, which refers to the actual probability distribution of A, and P(B) is again the probability of B. This is Bayes' theorem.
In order to better understand Bayes' theorem, let's look at a small example. Let's say that we have three bowls: bowl A, bowl B and bowl C. Bowl A contains two blue balls and four red balls, bowl B contains eight blue balls and four red balls, and bowl C contains one blue ball and three red balls. Now, if we draw one ball from each bowl, what is the probability that we drew a blue ball from bowl A if we know that we drew exactly two blue balls in total? If you didn't understand the question, please read it again. I shall pause for a second or two. So I hope all of you have understood the question. Now what I'm going to do is draw a blueprint for you and tell you how exactly to solve the problem, but I want you all to give me the solution to this problem. I'll draw a blueprint.
I'll tell you what exactly the steps are, but I want you to come up with the solution on your own. The formula is also given to you. Everything is given to you. All you have to do is come up with the final answer. Let's look at how you can solve this problem. First of all, let A be the event of picking a blue ball from bowl A, and let X be the event of picking exactly two blue balls, because these are the two events whose probabilities we need. What we want is the probability of occurrence of event A given X, which means: given that we're picking exactly two blue balls, what is the probability that we are picking a blue ball from bowl A? By the definition of conditional probability, this is exactly what our equation looks like: P(A|X) = P(A and X) / P(X). This is the probability of occurrence of event A given event X, this is the probability of A and X, and this is the probability of X alone. What we need to do is find these two probabilities: the probability of A and X occurring together, and the probability of X. This is the entire solution.
So how do you find the probability of X? X represents the event of picking exactly two blue balls, and this can happen in three ways, with the ball from the remaining bowl being red each time. In the first case, you pick a blue ball from bowl A and a blue ball from bowl B. In the second case, you pick a blue ball from bowl A and a blue ball from bowl C. In the third case, you pick a blue ball from bowl B and a blue ball from bowl C. Right?
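For readers who want to check their own answer afterwards, the blueprint can be sketched in Python. The code below is an illustration, not the presenter's solution: it enumerates every blue/red outcome of the three draws, adds up the cases with exactly two blue balls to get P(X), keeps the subset where the ball from bowl A is blue to get P(A and X), and divides the two. The helper name `p_blue` is invented for this sketch.

```python
from fractions import Fraction
from itertools import product

# Bowl contents as stated in the problem: (blue, red) counts.
bowls = {"A": (2, 4), "B": (8, 4), "C": (1, 3)}

def p_blue(name):
    """Probability of drawing a blue ball from the named bowl."""
    blue, red = bowls[name]
    return Fraction(blue, blue + red)

p_x = Fraction(0)        # P(X): exactly two blue balls in total
p_a_and_x = Fraction(0)  # P(A and X): blue from bowl A and exactly two blue

# Enumerate every blue/red outcome for the three independent draws.
for outcome in product([True, False], repeat=3):  # True means "blue"
    prob = Fraction(1)
    for name, is_blue in zip("ABC", outcome):
        prob *= p_blue(name) if is_blue else 1 - p_blue(name)
    if sum(outcome) == 2:          # exactly two blue balls
        p_x += prob
        if outcome[0]:             # the ball from bowl A is blue
            p_a_and_x += prob

# Definition of conditional probability: P(A|X) = P(A and X) / P(X)
p_a_given_x = p_a_and_x / p_x
print(p_x, p_a_and_x, p_a_given_x)
```

Enumerating all eight outcomes is more than the problem strictly needs, but it avoids having to list the three qualifying cases by hand.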
These are the three ways in which it is possible, so you need to find the probability of each of them. Step two is to find the probability of A and X occurring together. This is the sum of terms 1 and 2, because in both of these cases we are picking a blue ball from bowl A. So find out this probability and let me know your answer in the comment section. We'll see if you get the answer right. I gave you the entire solution to this; all you have to do is substitute the values. If you want a second or two, I'm going to pause on the screen so that you can go through this more clearly. Remember that you need to calculate two probabilities. The first is the probability of picking a blue ball from bowl A together with picking exactly two blue balls, P(A and X). The second is the probability of picking exactly two blue balls, P(X). These are the two probabilities you need to calculate, so remember that, and this is the solution. So guys, make sure you mention your answers in the comment section. For now, let's move on and look at our next topic, which is inferential statistics.
So guys, we just completed the probability module. Now we will discuss inferential statistics, which is the second type of statistics; we discussed descriptive statistics earlier. Like I mentioned earlier, inferential statistics, also known as statistical inference, is a branch of statistics that deals with forming inferences and predictions about a population based on a sample of data taken from that population. The question you should ask is: how does one form inferences or predictions from a sample? The answer is that you use point estimation.
Now you must be wondering what point estimation is. Point estimation is concerned with the use of the sample data to measure a single value which serves as an approximate value, or the best estimate, of an unknown population parameter. That's a little confusing, so let me break it down for you. For example, in order to calculate the mean of a huge population, we first draw a sample from the population and then find the sample mean. The sample mean is then used to estimate the population mean. This is basically a point estimate: you're estimating the value of one of the parameters of the population, in this case the mean. This is what point estimation is.
There are two main terms in point estimation: the estimator and the estimate. The estimator is a function of the sample that is used to find the estimate. In this example it's the sample mean: a function that calculates the sample mean is the estimator, and the realized value of the estimator is the estimate. So I hope point estimation is clear.
Now, how do you find the estimates? There are four common ways in which you can do this. The first one is the method of moments. Here you form an equation in the sample data set and then analyze the similar equation in the population data set as well, like the population mean, population variance and so on. In simple terms, you're taking some known facts about the population and extending those ideas to the sample. Once you do that, you can analyze the sample and estimate more essential or more complex values. Next, we have maximum likelihood. This method basically uses a model to estimate a value. Maximum likelihood is majorly based on probability.
So there's a lot of probability involved in this method. Next, we have the Bayes estimator. This works by minimizing the errors, or the average risk. The Bayes estimator has a lot to do with Bayes' theorem. Let's not get into the depth of these estimation methods. Finally, we have the best unbiased estimators. In this method there are several unbiased estimators that can be used to approximate a parameter.
So guys, these were a couple of methods that are used to find the estimate, but the most well-known method is interval estimation. This is one of the most important estimation methods, and this is where the confidence interval also comes into the picture. Apart from interval estimation, we also have something known as margin of error. I'll be discussing all of this in the upcoming slides.
So first let's understand what an interval estimate is. An interval, or range of values, used to estimate a population parameter is known as an interval estimate. That's very understandable. Basically, you're going to estimate the value of a parameter. Let's say you're trying to find the mean of a population. What you're going to do is build a range, and your value will lie in that range or interval. This way your output is more likely to capture the true value, because you have not predicted a single point; instead, you have estimated an interval within which your value might occur. Now, this image clearly shows how a point estimate and an interval estimate are different. The interval estimate is the safer of the two, because you're not focusing on one particular value or point in order to predict the parameter; instead you're saying that the value might be within this range, between the lower confidence limit and the upper confidence limit.
All right, this denotes the range or the interval. If you're still confused about interval estimation, let me give you a small example. If I state that I will take 30 minutes to reach the theater, this is point estimation. But if I state that I will take between 45 minutes and an hour to reach the theater, this is an example of interval estimation. I hope it's clear now.
Now, interval estimation gives rise to two important statistical terms: one is the confidence interval and the other is the margin of error. It's important that you pay attention to both of these terms. The confidence interval is one of the most significant measures used to check how reliable a machine learning model is. So what is a confidence interval? A confidence interval is the measure of your confidence that the estimated interval contains the population parameter, whether that is the population mean or any other parameter. Statisticians use confidence intervals to describe the amount of uncertainty associated with a sample estimate of a population parameter. Now guys, this is a lot of definition, so let me help you understand confidence intervals with a small example. Let's say that you survey a group of cat owners to see how many cans of cat food they purchase in one year. You test your statistics at the 99 percent confidence level and you get a confidence interval of (100, 200). This means that you think the cat owners buy between 100 and 200 cans in a year, and since the confidence level is 99%, it shows that you're very confident that the results are correct. I hope all of you are clear with that. So your confidence interval here will be 100 to 200, and your confidence level will be 99%.
That's the difference between the confidence interval and the confidence level. Your value is going to lie within your confidence interval, and your confidence level shows how confident you are about your estimate. I hope that was clear.
Let's look at margin of error. The margin of error for a given level of confidence is the greatest possible distance between the point estimate and the value of the parameter that it is estimating. You can say that it is a deviation from the actual point estimate. The margin of error can be calculated using this formula: E = z_c × (standard deviation / √n). Here z_c denotes the critical value for the chosen confidence level, and it is multiplied by the standard deviation divided by the square root of the sample size. n is basically the sample size.
Now let's understand how you can estimate the confidence intervals. The level of confidence, which is denoted by c, is the probability that the interval estimate contains the population parameter. Let's say that you're trying to estimate the mean. So this interval between −z and z, or the area beneath this curve, is nothing but the probability that the interval estimate contains the population parameter. It should basically contain the value that you are predicting. Now, −z and z are known as critical values; they are basically your lower limit and your upper limit. There's also something known as the z-score. This score can be found using the standard normal table. If you look it up anywhere on Google, you'll find the z-score table, or the standard normal table. To understand how this is done, let's look at a small example. Let's say that the level of confidence is 90%. This means that you are 90% confident that the interval contains the population mean.
Okay, so the remaining 10% out of 100% is equally distributed over these tail regions. So you have 0.05 here and 0.05 over here. On either side of c you distribute the leftover percentage. Now, these z-scores are taken from the table, as I mentioned before: 1.645 is read from the standard normal table. So guys, that is how you work with the level of confidence.
To sum it up, let me tell you the steps that are involved in constructing a confidence interval. First, you start by identifying a sample statistic. This is the statistic that you will use to estimate a population parameter; it can be anything, like the mean of the sample. Next, you select a confidence level. The confidence level describes the uncertainty of a sampling method. After that, you find the margin of error, which we discussed earlier, based on the equation that I explained in the previous slide. Then you finally specify the confidence interval.
Now let's look at a problem statement to better understand this concept. A random sample of 32 textbook prices is taken from a local college bookstore. The mean of the sample is so-and-so and the sample standard deviation is as given. Use a 95% confidence level and find the margin of error for the mean price of all textbooks in the bookstore. This is a very straightforward question. If you want, you can read the question again. All you have to do is substitute the values into the equation. So guys, we know the formula for margin of error. You take the z-score from the table (1.96 for a 95% confidence level). After that we have the standard deviation, which is 23.44, and n stands for the number of samples. Here the number of samples is 32, basically 32 textbooks.
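The substitution can be sketched in Python. Instead of a printed z-table, this sketch uses `statistics.NormalDist().inv_cdf` from the standard library to look up the critical value; the function names `critical_value` and `margin_of_error` are made up for illustration. The sample mean is not stated in the problem, so only the margin of error is computed, not the interval itself.

```python
from math import sqrt
from statistics import NormalDist

def critical_value(confidence):
    """Two-sided critical value z_c of the standard normal distribution."""
    tail = (1 - confidence) / 2           # area left over in each tail
    return NormalDist().inv_cdf(1 - tail)

def margin_of_error(confidence, std_dev, n):
    """E = z_c * std_dev / sqrt(n)."""
    return critical_value(confidence) * std_dev / sqrt(n)

z_90 = critical_value(0.90)   # the 90% example: 0.05 in each tail
z_95 = critical_value(0.95)   # the level used in the textbook problem

# Textbook problem: standard deviation 23.44, sample size 32, 95% confidence.
e = margin_of_error(0.95, std_dev=23.44, n=32)

print(f"z_90 = {z_90:.3f}, z_95 = {z_95:.3f}, E = {e:.2f}")
```

Once a sample mean is known, the confidence interval would simply be the mean minus `e` to the mean plus `e`.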
So approximately, your margin of error is going to be around 8.12. This is a pretty simple question. I hope all of you understood this.
Now that you know the idea behind confidence intervals, let's move ahead to one of the most important topics in statistical inference, which is hypothesis testing. Basically, statisticians use hypothesis testing to formally check whether a hypothesis is accepted or rejected. Hypothesis testing is an inferential statistical technique used to determine whether there is enough evidence in a data sample to infer that a certain condition holds true for an entire population. To understand the characteristics of a general population, we take a random sample and analyze the properties of the sample. We test whether or not the identified conclusion represents the population accurately, and finally we interpret the results. Whether or not to accept the hypothesis depends upon the percentage value that we get from the test. To better understand this, let's look at a small example. Before that, there are a few steps that are followed in hypothesis testing. You begin by stating the null and the alternative hypotheses; I'll tell you what exactly these terms are. Then you formulate an analysis plan. After that you analyze the sample data, and finally you interpret the results.
Now, to understand the entire hypothesis testing process, we'll look at a good example. Consider four boys: Nick, John, Bob and Harry. These boys were caught bunking a class, and they were asked to stay back at school and clean the classroom as a punishment. So what John did is he decided that the four of them would take turns to clean the classroom. He came up with a plan of writing each of their names on chits and putting them in a bowl. Every day they had to pick a name from the bowl, and that person had to clean the class.
That sounds fair enough. Now it has been three days and everybody's name has come up except John's. Assuming that this event is completely random and free of bias, what is the probability that John is not cheating? This can be solved by using hypothesis testing. We'll begin by calculating the probability of John not being picked on a given day, assuming that the event is free of bias. We get 3 out of 4, which is 75%. 75% is fairly high. If John is not picked for three days in a row, the probability drops down to approximately 42%. Now let's consider a situation where John is not picked for 12 days in a row: the probability drops down to 3.2%. So the probability of John cheating becomes fairly high. In order for statisticians to come to a conclusion, they define what is known as the threshold value. Considering the above situation, if the threshold value is set to 5%, it would indicate that if the probability lies below 5%, then John is cheating his way out of detention. But if the probability is above the threshold value, then John is just lucky and his name isn't getting picked. So probability and hypothesis testing give rise to two important components of hypothesis testing, which are the null hypothesis and the alternative hypothesis. The null hypothesis is basically the assumption being tested, here that the draw is fair; the alternative hypothesis is what you conclude when your result contradicts that assumption. Therefore, in our example, if the probability of the event occurring is less than 5%, which it is, then the event is biased; hence it supports the alternative hypothesis.
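The numbers in John's example are easy to reproduce. The following is a small sketch in Python, assuming, as the example does, that each day's draw is fair and independent of the others; the names `p_never_picked` and `THRESHOLD` are invented for this illustration.

```python
P_NOT_PICKED = 3 / 4   # one fair draw among four names
THRESHOLD = 0.05       # the 5% threshold value from the example

def p_never_picked(days):
    """Probability that John is not picked on any of `days` independent fair draws."""
    return P_NOT_PICKED ** days

for days in (1, 3, 12):
    p = p_never_picked(days)
    if p < THRESHOLD:
        verdict = "below the threshold: evidence against a fair draw"
    else:
        verdict = "above the threshold: consistent with a fair draw"
    print(f"{days:2d} day(s): {p:.3f} -> {verdict}")
```

The probability only falls below the 5% threshold in the 12-day case, which is why three days of luck are not enough to accuse John of cheating.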
Undoubtedly, machine learning is the most in-demand technology in today's market. Its applications range from self-driving cars to predicting deadly diseases such as ALS. The high demand for machine learning skills is the motivation behind today's session. So let me discuss the agenda with you first. We're going to begin the session by understanding the need for machine learning and why it is important. After that, we'll look at what exactly machine learning is, and then we'll discuss a couple of machine learning definitions. Once we're done with that, we'll look at the machine learning process and how you can solve a problem by using it. Next, we will discuss the types of machine learning, which include supervised, unsupervised and reinforcement learning. Once we're done with that, we'll discuss the different types of problems that can be solved by using machine learning. Finally, we will end this session by looking at a demo where we'll see how you can perform weather forecasting by using machine learning.
All right, so guys, let's get started with our first topic. What is the importance of, or the need for, machine learning? Since the technical revolution, we've been generating an immeasurable amount of data. As per research, we generate around 2.5 quintillion bytes of data every single day, and it is estimated that by 2020, 1.7 MB of data will be created every second for every person on Earth. That is a lot of data. This data comes from sources such as the cloud, IoT devices, social media and all of that. Since all of us are so active on the internet right now, we're generating a lot of data. You have no idea how much data we generate through social media: all the chatting that we do, all the images that we post on Instagram, the videos that we watch, all of this generates a lot of data.
Now, how does machine learning fit into all of this? Since we're producing this much data, we need a method that can analyze, process and interpret it. We need a method that can make sense out of data, and that method is machine learning. A lot of top-tier and data-driven companies, such as Netflix and Amazon, build machine learning models by using tons of data in order to identify profitable opportunities, and if they want to avoid any unwanted risk, they make use of machine learning as well. So through machine learning you can predict risk, you can predict profits, and you can identify opportunities which will help you grow your business.
So now I'll show you a couple of examples of where machine learning is used. I'm sure all of you have binge-watched on Netflix. The most important thing about Netflix is its recommendation engine. Much of Netflix's revenue is attributed to its recommendation engine. The recommendation engine basically studies the movie-viewing patterns of its users and then recommends relevant movies to them. It recommends movies depending on the user's interests, the type of movies the user watches and all of that. So that is how Netflix uses machine learning.
Next, we have Facebook's auto-tagging feature. The logic behind Facebook's auto-tagging feature is machine learning and neural networks. I'm not sure how many of you know this, but Facebook makes use of the DeepFace face verification system, which is based on machine learning and neural networks. DeepFace basically studies the facial features in an image and tags your friends and family.
Another such example is Amazon's Alexa. Alexa is basically an advanced virtual assistant that is based on natural language processing and machine learning.
Now, it can do more than just play music for you. It can book your Uber, it can connect with other IoT devices at your house, it can track your health, it can order food online and all of that. So data and machine learning are basically the main factors behind Alexa's power.
Another such example is the Google spam filter. Gmail basically makes use of machine learning to filter out spam messages. If any of you open your Gmail inbox, you'll see that there are separate sections: one for primary, one for social, one for spam, and the general mail. Gmail makes use of machine learning algorithms and natural language processing to analyze emails in real time and then classify them as either spam or non-spam. This is another famous application of machine learning.
So to sum this up, let's look at a few reasons why machine learning is so important. The first reason is obviously the increase in data generation. Because of the excessive production of data, we need a method that can be used to structure, analyze and draw useful insights from data. This is where machine learning comes in: it uses data to solve problems and find solutions to the most complex tasks faced by organizations. Another important reason is that it improves decision-making. By making use of various algorithms, machine learning can be used to make better business decisions. For example, machine learning is used to forecast sales, to predict downfalls in the stock market, to identify risks and anomalies, and so on. The next reason is that it uncovers patterns and trends in data. Finding hidden patterns and extracting key insights from data is the most essential part of machine learning.
So by building predictive models and using statistical techniques, machine learning allows you to dig beneath the surface and explore the data at a minute scale. Understanding data and extracting patterns manually would take many days, but if you do this through machine learning algorithms, you can perform such computations in less than a second. Another reason is that it solves complex problems. From detecting genes that are linked to the deadly ALS disease, to building self-driving cars and face detection systems, machine learning can be used to solve the most complex problems.
So guys, now that you know why machine learning is so important, let's look at what exactly machine learning is. The term machine learning was first coined by Arthur Samuel in the year 1959. Looking back, that year was probably the most significant in terms of technological advancements. If you browse the net for "what is machine learning", you'll get at least a hundred different definitions. The first and very formal definition was given by Tom M. Mitchell. The definition says that a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. Now, I know this is a little confusing, so let's break it down into simple words. In simple terms, machine learning is a subset of artificial intelligence which provides machines the ability to learn automatically and improve from experience without being explicitly programmed to do so. In a sense, it is the practice of getting machines to solve problems by gaining the ability to think. But wait, how can a machine think or make decisions? Well, if you feed a machine a good amount of data, it will learn how to interpret, process and analyze this data by using machine learning algorithms.
Now guys, look at this figure on top. This figure basically shows how the machine learning process really works. Machine learning begins by feeding the machine lots and lots of data. By using this data, the machine is trained to detect hidden insights and trends. These insights are then used to build a machine learning model, by using an algorithm, in order to solve a problem. So basically you're going to feed a lot of data to the machine. The machine is going to get trained by using this data, it's going to draw useful insights and patterns from it, and then it's going to build a model by using machine learning algorithms. This model will help you predict the outcome, or help you solve any complex problem or business problem. So that's a simple explanation of how machine learning works.
Now, let's move on and look at some of the most commonly used machine learning terms. First of all, we have the algorithm. This is quite self-explanatory. Basically, an algorithm is a set of rules or statistical techniques used to learn patterns from data. An algorithm is the logic behind a machine learning model. An example of a machine learning algorithm is linear regression. I'm not sure how many of you have heard of linear regression; it's the most simple and basic machine learning algorithm.
Next, we have the model. The model is the main component of machine learning. The model will basically map the input to your output by using the machine learning algorithm and the data that you're feeding the machine. So basically the model is a representation of the entire machine learning process. The model is fed input, which has a lot of data, and then it outputs a particular result or outcome by using machine learning algorithms.
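The difference between an algorithm and a model can be made concrete with the linear regression example just mentioned. The sketch below is only an illustration: `fit_line` is a hypothetical helper that implements ordinary least squares for a single input, and the data points are made up so that they lie exactly on the line y = 2x + 1.

```python
def fit_line(xs, ys):
    """The algorithm: ordinary least squares for one input variable.

    Returns the (slope, intercept) of the best-fitting line.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up toy data, chosen to lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# Training: the algorithm learns the pattern from the data.
slope, intercept = fit_line(xs, ys)

def model(x):
    """The model: maps a new input to a predicted output."""
    return slope * x + intercept

print(model(5.0))
```

Here `fit_line` plays the role of the algorithm, while `model`, built from the learned slope and intercept, is what actually maps an input to an output.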
Next, we have something known as the predictor variable. A predictor variable is a feature of the data that can be used to predict the output. For example, let's say that you're trying to predict the weight of a person depending on the person's height and age. Over here the predictor variables are height and age, because you're using the height and age of a person to predict the person's weight. Weight, on the other hand, is the response or target variable. So the response variable is the feature, or the output variable, that needs to be predicted by using the predictor variables.
After that, we have something known as training data. The data that is fed to a machine learning model is always split into two parts: first we have the training data and then we have the testing data. Training data is used to build the machine learning model. Usually the training data is much larger than the testing data, because if you're trying to train the machine, you're going to feed it a lot more data. Testing data is just used to validate and evaluate the efficiency of the model. So that was training data and testing data. These were a few terms that I thought you should know before we move any further.
Now, let's move on and discuss the machine learning process. This is going to get very interesting, because I'm going to give you an example to help you understand how the machine learning process works. First of all, let's define the different stages, or steps, involved in the machine learning process. A machine learning process always begins with defining the objective, or the problem that you're trying to solve. Next is data gathering, or data collection. The data that you need to solve this problem is collected at this stage.
This is followed by data preparation, or data processing. After that you have data exploration and analysis, and the next stage is building a machine learning model. This is followed by model evaluation, and finally you have prediction, or your output.

Now, let's try to understand this entire process with an example. Our problem statement here is to predict the possibility of rain by studying the weather conditions. Let's say you're given this problem statement and you're asked to use the machine learning process to solve it. So let's get started.

The first step is to define the objective of the problem statement. In the first stage of a machine learning process, you must understand what exactly needs to be predicted; in our case, the objective is to predict the possibility of rain by studying weather conditions. At this stage it is also essential to take mental notes on what kind of data can be used to solve this problem, and on the type of approach you can follow to get to the solution. A few questions worth asking during this stage are: What are we trying to predict? What are the target features, and what are the predictor variables? What kind of input data do we need? And what kind of problem are we facing: is it a binary classification problem, or is it a clustering problem? Now, don't worry if you don't know what classification and clustering are; I'll be explaining them in the upcoming slides. So guys, this was the first step of a machine learning process, which is to define the problem.

Now, let's move on and look at step number two, which is data collection, or data gathering.
At this stage you must be asking questions such as: What kind of data is needed to solve the problem? Is the data available? And if it is available, how can I get it? Once you know the type of data that is required, you must understand how you can obtain it. Data collection can be done manually or by web scraping, but if you're a beginner and you're just looking to learn machine learning, you don't have to worry about getting the data. There are thousands of data resources on the web; you can just go ahead and download datasets from websites such as Kaggle. Now, coming back to the problem at hand, the data needed for weather forecasting includes measures such as humidity level, temperature, pressure, locality, whether or not you live in a hill station, and so on. Such data must be collected and stored for analysis.

The next stage in machine learning is preparing your data. The data you collect is almost never in the right format, so you'll encounter a lot of inconsistencies in the data set: missing values, redundant variables, duplicate values, and so on. Removing such inconsistencies is very important, because they might lead to wrongful computations and predictions. That's why at this stage you must scan the entire data set for any inconsistencies and fix them.

The next step is exploratory data analysis. Data analysis is all about diving deep into the data and finding all the hidden data mysteries. This is where you become a detective. EDA, or exploratory data analysis, is like the brainstorming stage of machine learning. Data exploration involves understanding the patterns and trends in your data, so at this stage all the useful insights are drawn and the correlations between the variables are understood. You might ask, what sort of correlations are you talking about? For example, take the case of predicting rainfall.
We know that there is a strong possibility of rain if the temperature has fallen low. Such correlations have to be understood and mapped at this stage.

This stage is followed by stage number five, which is building a machine learning model. All the insights and patterns that you derive during data exploration are used to build the machine learning model. This stage always begins by splitting the data set into two parts, the training data and the testing data, which I described earlier in the session. The training data will be used to build and analyze the model, and the logic of the model will be based on the machine learning algorithm that is being implemented. In the case of predicting rainfall, since the output will be in the form of true or false, we can use a classification algorithm like logistic regression. Choosing the right algorithm depends on the type of problem you're trying to solve, the data set you have, and the level of complexity of the problem. In the upcoming sections we'll be discussing the different types of problems that can be solved by using machine learning, so don't worry if you don't know what a classification algorithm is or what logistic regression is. All you need to know is that at this stage you'll be building a machine learning model by using a machine learning algorithm and the training data set.

The next step in the machine learning process is model evaluation and optimization. After building a model by using the training data set, it is finally time to put the model to the test. The testing data set is used to check the efficiency of the model and how accurately it can predict the outcome. Once you calculate the accuracy, any improvements to the model have to be implemented at this stage.
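The split, build and evaluate stages described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the temperature readings are made up, and instead of logistic regression the "model" is a much simpler one-feature rule (predict rain when the temperature is below a learned cut-off), chosen so the example needs no libraries. The helper names split, fit_threshold and accuracy are hypothetical.

```python
import random

# Hypothetical observations: (temperature in deg C, did it rain?). Made-up values.
observations = [
    (12, True), (14, True), (15, True), (16, True), (17, True), (18, True),
    (24, False), (25, False), (27, False), (28, False), (30, False), (31, False),
]


def split(rows, test_fraction=0.25, seed=7):
    """Shuffle a copy of rows and return (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(cut, rows):
    """Fraction of rows where 'temperature below cut' matches the rain label."""
    return sum((temp < cut) == rained for temp, rained in rows) / len(rows)


def fit_threshold(train):
    """Try the midpoints between neighbouring training temperatures and
    keep the cut-off that classifies the training rows best."""
    temps = sorted({temp for temp, _ in train})
    candidates = [(a + b) / 2 for a, b in zip(temps, temps[1:])]
    return max(candidates, key=lambda cut: accuracy(cut, train))


train, test = split(observations)   # stage: split the data
cutoff = fit_threshold(train)       # stage: build the model on training data
print(f"cut-off: {cutoff} deg C, test accuracy: {accuracy(cutoff, test):.2f}")
```

The important design point is that the cut-off is learned from the training rows only, and the accuracy is measured on rows the model has never seen.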
Methods like parameter tuning and cross-validation can be used to improve the performance of the model. This is followed by the last stage, which is predictions. Once the model is evaluated and improved, it is finally used to make predictions. The final output can be a categorical variable or a continuous quantity. In our case, for predicting the occurrence of rainfall, the output will be a categorical variable, in the sense that it will be in the form of true or false, yes or no. Yes represents that it is going to rain, and no represents that it won't rain. As simple as that. So guys, that was the entire machine learning process.

Linear regression is one of the easiest algorithms in machine learning. It is a statistical model that attempts to show the relationship between two variables with a linear equation. But before we drill down into the linear regression algorithm in depth, I'll give you a quick overview of today's agenda. We'll start the session with a quick overview of what regression is, as linear regression is one type of regression algorithm. Once we learn about regression, its use cases and its various types, we'll learn about the algorithm from scratch, where I'll take you through its mathematical implementation first, and then we'll drill down to the coding part and implement linear regression using Python. In today's session we'll deal with the linear regression algorithm using the least squares method, check its goodness of fit, or how close the data is to the fitted regression line, using the R-squared method, and then finally optimize it using the gradient descent method. In the last part, the coding session, I'll teach you to implement linear regression using Python.
That coding session will be divided into two parts. The first part consists of linear regression using Python from scratch, where you will use the mathematical algorithm that you have learned in this session. In the next part of the coding session, we'll use scikit-learn for a direct implementation of linear regression. I hope the agenda is clear to you guys.

So let's begin our session with what regression is. Regression analysis is a form of predictive modeling technique which investigates the relationship between a dependent and an independent variable. A regression analysis involves graphing a line over a set of data points that most closely fits the overall shape of the data. A regression shows how a dependent variable on the y-axis changes with the changes in the explanatory variable on the x-axis.

Now you would ask, what are the uses of regression? Well, there are three major uses of regression analysis. The first is determining the strength of predictors: the regression might be used to identify the strength of the effect that the independent variables have on the dependent variable. For example, you can ask questions like, what is the strength of the relationship between sales and marketing spending, or what is the relationship between age and income? The second is forecasting an effect: the regression can be used to forecast the effects or impact of changes. That is, regression analysis helps us understand how much the dependent variable changes with a change in one or more independent variables. For example, you can ask a question like, how much additional sales income will I get for each thousand dollars spent on marketing? The third is trend forecasting: here the regression analysis predicts trends and future values, and can be used to get point estimates. You can ask questions like, what will be the price of Bitcoin in the next six months?
The next topic is linear versus logistic regression. By now, I hope you know what regression is, so let's move on and understand its types. There are various kinds of regression, like linear regression, logistic regression, polynomial regression and others, but for this session we'll be focusing on linear and logistic regression. Let me tell you what linear regression is and what logistic regression is, and then we'll compare the two.

Starting with linear regression: in simple linear regression we are interested in things like y = mx + c. What we are trying to find is the correlation between the x and y variables. This means that every value of x has a corresponding value of y, and y is continuous. However, in logistic regression we are not fitting our data to a straight line like in linear regression; instead, we are mapping y versus x to a sigmoid function. In logistic regression, what we find out is whether y is 1 or 0 for a particular value of x, so we are essentially deciding a true or false value for a given value of x.

So as a core concept, you can say that in linear regression the data is modeled using a straight line, whereas in logistic regression the data is modeled using a sigmoid function. Linear regression is used with continuous variables; logistic regression, on the other hand, is used with categorical variables. The output or prediction of a linear regression is the value of the variable, whereas the output or prediction of a logistic regression is the probability of occurrence of the event. Now, how will you check the accuracy and goodness of fit? In the case of linear regression, there are various methods.
You can measure it by loss, R squared, adjusted R squared and so on, while in the case of logistic regression you have accuracy, precision, recall, the F1 score, which is nothing but the harmonic mean of precision and recall, the ROC curve for determining the probability threshold for classification, the confusion matrix, and so on. There are many.

So, summarizing the difference between linear and logistic regression, you can say that the type of function you are mapping to is the main point of difference. A linear regression maps a continuous x to a continuous y; a logistic regression, on the other hand, maps a continuous x to a binary y. So we can use logistic regression to make category or true/false decisions from the data.

Let's move on ahead. Next is the linear regression selection criteria, or you can say, when will you use linear regression? The first criterion is classification and regression capabilities. Regression models predict a continuous variable, such as the sales made on a day or the temperature of a city. Their reliance on a polynomial, like a straight line, to fit a data set poses a real challenge when it comes to building a classification capability. Imagine that you fit a line to the training points that you have, and then you add some more data points. In order to fit them, you have to change your existing model; maybe you have to change the threshold itself. This will happen with each new data point you add to the model; hence, linear regression is not good for classification models.

Next is data quality. Each missing value removes one data point that could optimize the regression, and in simple linear regression the outliers can significantly disrupt the outcome. For now, just know that if you remove the outliers, your model will usually fit much better. So this is about data quality.
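To see how much a single outlier can disrupt a simple linear regression, here is a small Python sketch. The data points are made up for illustration (five points lying exactly on y = x, plus one hypothetical outlier), and least_squares_slope is just a helper name used here.

```python
def least_squares_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


# Five made-up points that lie exactly on the line y = x.
clean_x, clean_y = [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]
slope_clean = least_squares_slope(clean_x, clean_y)

# The same points plus one made-up outlier at (6, 20).
slope_outlier = least_squares_slope(clean_x + [6], clean_y + [20])

print(f"slope without outlier: {slope_clean:.1f}")   # 1.0
print(f"slope with outlier:    {slope_outlier:.1f}")  # 3.0
```

One extra point is enough to triple the slope here, which is why scanning for outliers belongs in the data preparation stage.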
Next is computational complexity. Linear regression is often not computationally expensive compared to a decision tree or a clustering algorithm. The order of complexity for n training examples and x features usually falls in either O(x²) or O(xn).

Next is comprehensible and transparent. Linear regression models are easily comprehensible and transparent in nature. They can be represented by simple mathematical notation and can be understood very easily by anyone. So these are some of the criteria based on which you will select the linear regression algorithm.

Next is, where is linear regression used? First is evaluating trends and sales estimates. Linear regression can be used in business to evaluate trends and make estimates or forecasts. For example, if a company's sales have increased steadily every month for the past few years, then conducting a linear analysis on the sales data, with monthly sales on the y-axis and time on the x-axis, will give you a line that depicts the upward trend in sales. After creating the trend line, the company could use the slope of the line to forecast sales in future months.

Next is analyzing the impact of price changes. Linear regression can be used to analyze the effect of pricing on consumer behavior. For instance, if a company changes the price of a certain product several times, it can record the quantity sold at each price level and then perform a linear regression with quantity sold as the dependent variable and price as the independent variable. This would result in a line that depicts the extent to which customers reduce their consumption of the product as the price increases, and this result would help in future pricing decisions.

Next is assessment of risk in the financial services and insurance domain.
Linear regression can be used to analyze risk. For example, a health insurance company might conduct a linear regression analysis by plotting the number of claims per customer against the customer's age, and it might discover that older customers tend to make more health insurance claims. The result of such an analysis might guide important business decisions.

So by now you have a rough idea of what the linear regression algorithm is like: what it does, where it is used, and when you should use it. Now, let's move on and understand the algorithm in depth. Suppose you have the independent variable on the x-axis and the dependent variable on the y-axis, and suppose that as the independent variable increases along the x-axis, so does the dependent variable on the y-axis. What kind of linear regression line would you get? You would get a positive linear regression line, as the slope would be positive. Next, suppose you have an independent variable on the x-axis which is increasing, while the dependent variable on the y-axis is decreasing. What kind of line will you get in that case? You will get a negative regression line, as the slope of the line is negative. This particular line, the line y = mx + c, shows the relationship between the independent variable and the dependent variable, and it is known as the line of linear regression.

So let's add some data points to our graph. These are some observations, or data points, on our graph. Let's plot some more. Now all our data points are plotted, and our task is to create a regression line, or the best-fit line. Once our regression line is drawn, it's time for prediction. Suppose this is our estimated or predicted value, and this is our actual value.
Our main goal is to reduce this error, that is, to reduce the distance between the estimated or predicted value and the actual value. The best-fit line is the one with the least error, or the least difference between the estimated values and the actual values. In other words, we have to minimize the error. This was a brief understanding of the linear regression algorithm; soon we'll jump to the mathematical implementation.

But before that, let me tell you this. Suppose you draw a graph with speed on the x-axis and distance covered on the y-axis, with the time remaining constant. If you plot the speed of the vehicle against the distance traveled in a fixed unit of time, you will get a positive relationship. Suppose the equation of the line is y = mx + c. In this case y is the distance traveled in a fixed duration of time, x is the speed of the vehicle, m is the positive slope of the line, and c is the y-intercept of the line. Now suppose the distance remains constant, and you have to plot a graph between the speed of the vehicle and the time taken to travel a fixed distance. In that case you will get a negative relationship: the slope is negative, and the equation of the line changes to y = -mx + c, where y is the time taken to travel a fixed distance, x is the speed of the vehicle, -m is the negative slope of the line, and c is the y-intercept. (Strictly speaking, time equals distance divided by speed, which is a curve rather than a straight line, so the line here is only an approximation over a limited range of speeds.)

Now, let's get back to our independent and dependent variables. In these terms, y is our dependent variable and x is our independent variable. Let's move on and see the mathematical implementation. We have x = 1, 2, 3, 4, 5; let's plot them on the x-axis, so 0, 1, 2, 3, 4, 5, 6. And we have y as 3, 4, 2, 4, 5.
So let's plot 1, 2, 3, 4, 5 on the y-axis. Now let's plot our coordinates one by one. We have x = 1 and y = 3, so this is the point (1, 3). Similarly, we have (1, 3), (2, 4), (3, 2), (4, 4) and (5, 5).

Moving on ahead, let's calculate the mean of x and y and plot it on the graph. The mean of x is (1 + 2 + 3 + 4 + 5) divided by 5, that is 3. Similarly, the mean of y is (3 + 4 + 2 + 4 + 5), that is 18, divided by 5, which is nothing but 3.6. Next we'll plot our mean, that is (3, 3.6), on the graph. Our goal is to find the best-fit line using the least squares method. In order to do that, we first need to find the equation of our regression line. Let's suppose our regression line is y = mx + c. Now we have the equation of the line, so all we need to do is find the values of m and c, where

m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²

Don't get confused; let me work through it for you. As part of the formula, we first calculate x - x̄. We have x as 1 and x̄ as 3, so 1 - 3 is -2. Next we have x = 2 minus its mean 3, that is -1. Similarly, 3 - 3 is 0, 4 - 3 is 1, and 5 - 3 is 2. So x - x̄ is nothing but the distance of each point from the line x = 3, and y - ȳ is the distance of each point from the line y = 3.6. So let's calculate the values of y - ȳ, starting with y = 3 and ȳ = 3.6.
So it is 3 - 3.6, which is -0.6. Next is 4 - 3.6, that is 0.4; next 2 - 3.6, that is -1.6; next 4 - 3.6, that is 0.4 again; and 5 - 3.6, that is 1.4. So now we are done with y - ȳ. Next we will calculate (x - x̄)². So (-2)² is 4, (-1)² is 1, 0² is 0, 1² is 1, and 2² is 4. So now in our table we have x - x̄, y - ȳ and (x - x̄)². What we need now is the product (x - x̄)(y - ȳ). So: -2 × -0.6 is 1.2; -1 × 0.4 is -0.4; 0 × -1.6 is 0; 1 × 0.4 is 0.4; and 2 × 1.4 is 2.8. Now almost all the parts of our formula are done, so what we need to do is get the summation of the last two columns. The summation of (x - x̄)² is 10, and the summation of (x - x̄)(y - ȳ) is 4, so the value of m will be 4/10, that is 0.4.

Let's put this value of m = 0.4 into our line y = mx + c, and find the value of c. The least-squares line always passes through the point of means, so we substitute y as 3.6, the mean of y, m as 0.4, which we calculated just now, and x as the mean of x, that is 3. We get 3.6 = 0.4 × 3 + c, that is 3.6 = 1.2 + c. So the value of c is 3.6 - 1.2, that is 2.4. So we had m = 0.4 and c = 2.4, and finally the equation of the regression line is y = 0.4x + 2.4. So that is the regression line, and that's how it sits against your plotted points.
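The hand calculation above can be checked with a few lines of plain Python. This is a minimal sketch using the same five points from the session; the variable names (x_bar, sum_xy and so on) are just illustrative choices.

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 2, 4, 5]

x_bar = sum(xs) / len(xs)   # mean of x, 3.0
y_bar = sum(ys) / len(ys)   # mean of y, 3.6

# The two columns that get summed in the table.
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # about 4
sum_xx = sum((x - x_bar) ** 2 for x in xs)                       # 10

m = sum_xy / sum_xx      # slope, about 0.4
c = y_bar - m * x_bar    # intercept, from the line passing through the means

print(f"y = {m:.1f}x + {c:.1f}")  # y = 0.4x + 2.4
```

The intercept line uses the same substitution as the walkthrough: plug the two means into y = mx + c and solve for c.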
These are your actual points. Now, for the given m = 0.4 and c = 2.4, let's predict the value of y for x = 1, 2, 3, 4 and 5. When x = 1, the predicted value of y will be 0.4 × 1 + 2.4, that is 2.8. Similarly, when x = 2, the predicted value of y will be 0.4 × 2 + 2.4, which equals 3.2. Similarly, for x = 3, y will be 3.6; for x = 4, y will be 4.0; and for x = 5, y will be 4.4. Let's plot them on the graph. The line passing through all these predicted points and cutting the y-axis at 2.4 is the line of regression.

Now your task is to calculate the distance between the actual and the predicted values, and your job is to reduce that distance. In other words, you have to reduce the error between the actual and the predicted values. The line with the least error will be the line of linear regression, or regression line, and it will also be the best-fit line.

So this is how things work in the computer. It performs a number of iterations for different values of m. For each value of m, it calculates the equation of the line y = mx + c, so as the value of m changes, the line changes. The iterations start from one, and after every iteration it calculates the predicted values according to the line and compares the distance of the actual values to the predicted values. The value of m for which the distance between the actual and the predicted values is minimum will be selected for the best-fit line.

Now that we have calculated the best-fit line, it's time to check the goodness of fit, or how well the model is performing. In order to do that, we have a method called the R-squared method. So what is this R squared?
Well, the R-squared value is a statistical measure of how close the data are to the fitted regression line. In general, a model with a high R-squared value is considered a good model, but you can also have a low R-squared value for a good model, or a high R-squared value for a model that does not fit at all. It is also known as the coefficient of determination, or the coefficient of multiple determination.

Let's move on and see how R squared is calculated. These are our actual values plotted on the graph. We had calculated the predicted values of y as 2.8, 3.2, 3.6, 4.0 and 4.4. Remember, we calculated them from the equation y predicted = 0.4x + 2.4 for every x = 1, 2, 3, 4 and 5; from there we got the predicted values of y. So let's plot them on the graph. These are the points, and the line passing through them is nothing but the regression line.

Now, what you need to do is compare the distance of actual minus mean versus the distance of predicted minus mean. So basically you are comparing the distance of the actual values from the mean with the distance of the predicted values from the mean. That is nothing but R squared. Mathematically, you can represent R squared as

R² = Σ(yp - ȳ)² / Σ(y - ȳ)²

where y is the actual value, yp is the predicted value, and ȳ is the mean value of y, which is nothing but 3.6. Remember, this is our formula. So next we'll calculate y - ȳ. We have y as 3 and ȳ as 3.6, so we calculate 3 - 3.6, which is nothing but -0.6. Similarly, for y = 4 and ȳ = 3.6, y - ȳ is 0.4; then 2 - 3.6 is -1.6; 4 - 3.6 is again 0.4; and 5 - 3.6 is 1.4. So we have the values of y - ȳ.
Now we have to take their squares. We have (-0.6)² as 0.36, 0.4² as 0.16, (-1.6)² as 2.56, 0.4² as 0.16, and 1.4² as 1.96. Now, as part of the formula, we need our yp - ȳ values. These are the yp values, and we have to subtract the mean from them. So 2.8 - 3.6 is -0.8. Similarly, 3.2 - 3.6 is -0.4, 3.6 - 3.6 is 0, 4.0 - 3.6 is 0.4, and 4.4 - 3.6 is 0.8. So we have calculated the values of yp - ȳ; now it's our turn to calculate (yp - ȳ)². We have (-0.8)² as 0.64, (-0.4)² as 0.16, 0² as 0, 0.4² as again 0.16, and 0.8² as 0.64.

Now the formula asks me to take the summation of (yp - ȳ)² and the summation of (y - ȳ)². On summing (y - ȳ)², what you get is 5.2, and on summing (yp - ȳ)² you get 1.6. So the value of R squared can be calculated as 1.6/5.2, which is approximately 0.3. Well, this is not a good fit. It suggests that the data points are far away from the regression line.

So this is how your graph will look when R squared is 0.3. When you increase the value of R squared to 0.7, you'll see that the actual values lie closer to the regression line; when it reaches 0.9 they come closer still; and when the value is approximately equal to 1, the actual values lie on the regression line itself. On the other hand, if you get a very low value of R squared, suppose 0.02, what you'll see is that the actual values are very far away from the regression line, or you can say that there are too many outliers in your data.
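The R-squared arithmetic above, 1.6 divided by 5.2, can be reproduced with a short Python sketch. It uses the same actual values and the same fitted line y = 0.4x + 2.4 from the session; the names explained and total are just labels chosen here for the two sums.

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 4, 2, 4, 5]                      # actual values
predicted = [0.4 * x + 2.4 for x in xs]   # 2.8, 3.2, 3.6, 4.0, 4.4

y_bar = sum(ys) / len(ys)                 # 3.6

# Sum of squared distances of the predicted values from the mean.
explained = sum((yp - y_bar) ** 2 for yp in predicted)   # about 1.6
# Sum of squared distances of the actual values from the mean.
total = sum((y - y_bar) ** 2 for y in ys)                # about 5.2

r_squared = explained / total
print(round(r_squared, 2))  # 0.31
```

Rounded to one decimal place this is the 0.3 quoted in the walkthrough.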
You cannot forecast anything from such data. So this was all about the calculation of R squared. Now, you might have a question like, are low values of R squared always bad? Well, in some fields it is entirely expected that R-squared values will be low. For example, any field that attempts to predict human behavior, such as psychology, typically has R-squared values lower than around 50%, from which you can conclude that humans are simply harder to predict than physical processes. Furthermore, if your R-squared value is low but you have statistically significant predictors, then you can still draw important conclusions about how changes in the predictor values are associated with changes in the response value. Regardless of the R squared, the significant coefficients still represent the mean change in the response for one unit of change in the predictor while holding the other predictors in the model constant. Obviously, this type of information can be extremely valuable.

So this was all about the theoretical concept. Now, let's move on to the coding part and understand the code in depth. For implementing linear regression using Python, I will be using Anaconda with Jupyter Notebook installed on it. So this is the Jupyter notebook, and we are using Python 3 in it. We are going to use a data set consisting of the head size and brain weight of different people. So let's import our libraries: %matplotlib inline, then we import numpy as np, pandas as pd, and from matplotlib we import pyplot as plt. Next we will import our data, headbrain.csv, and store it in the data variable. Let's hit the Run button and see the output. This asterisk symbol means that it is still executing. So there's the output: our data set consists of 237 rows and four columns.
We have the columns gender, age range, head size in cubic centimeters, and brain weight in grams. So this is our sample data set; this is how it looks. Now that we have imported our data, as you can see, there are 237 values in the training set, so we can find a linear relationship between the head size and the brain weight. What we'll do now is collect X and Y: X will consist of the head size values, and Y will consist of the brain weight values. So, collecting X and Y, let's hit Run. Done.

Next we need to find the values of b1 and b0, or you can say m and c, so first of all we'll need the means of the X and Y values. So mean_x = np.mean(X), where mean is a predefined function of NumPy; similarly, mean_y = np.mean(Y), which returns the mean value of Y. Next we'll get the total number of values, so m = len(X). Then we'll use the formula to calculate the values of b1 and b0, or m and c. Let's hit the Run button and see the result. As you can see here on the screen, we have got b1 as 0.263 and b0 as 325.57. So now that we have the coefficients, comparing with the equation y = mx + c, you can say that brain weight = 0.263 × head size + 325.57. So the value of m here is 0.263, and the value of c here is 325.57.

So that's our linear model. Now let's plot it and see it graphically. Let's execute it. This is how our plot looks. This model is not so bad, but we need to find out how good our model is. In order to find that out, there are many methods, like the root mean squared error, or the coefficient of determination, that is, the R-squared method.
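Both scoring methods just mentioned can be sketched in plain Python. Since the headbrain.csv file is not reproduced here, this sketch reuses the five-point example and the fitted line y = 0.4x + 2.4 from the mathematical walkthrough as stand-in data; the function names rmse and r_squared are illustrative, and R squared is written in the "1 minus residual sum of squares over total sum of squares" form.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared residual."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination as 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total


# Stand-in data: the five-point example, not the head size data set.
actual = [3, 4, 2, 4, 5]
predicted = [2.8, 3.2, 3.6, 4.0, 4.4]

print(round(rmse(actual, predicted), 3))       # 0.849
print(round(r_squared(actual, predicted), 3))  # 0.308
```

For a least-squares line with an intercept, this form of R squared gives the same number as the ratio 1.6/5.2 computed by hand earlier.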
In this tutorial I have told you about the R-squared method, so let’s focus on that and see how good our model is. Let’s calculate the R-squared value. Here ss_t is the total sum of squares, ss_r is the sum of squares of the residuals, and R-squared, as per the formula, is 1 minus the sum of squares of the residuals upon the total sum of squares. When you execute it, you will get the value of R-squared as 0.63, which is pretty good. Now that you have implemented a simple linear regression model using the least squares method, let’s move on and see how you would implement the model using the machine learning library called scikit-learn. Scikit-learn is a simple machine learning library in Python, and building machine learning models is very easy with it. So here’s the Python code: using the scikit-learn library, your code shortens to this length. Let’s hit Run, and you will get the same R2 score as well. This was all for today’s discussion. Most of the entities in this world are related in one way or another, and at times finding relationships between entities can help you take valuable business decisions. Today I’m going to talk about logistic regression, which is one such approach towards predicting relationships. Now let us see what all we are going to cover in today’s training. We’ll start off the session with a quick introduction to what regression is. Then we’ll see the different types of regression, and we’ll discuss the what and why of logistic regression: what exactly it is, where it is used, why it is used, and all those things. Moving ahead, we’ll compare linear regression versus logistic regression along with the various real-life use cases, and finally, towards the end, I will be practically implementing the logistic regression algorithm.
So let’s quickly start off with the very first topic: what is regression? Regression analysis is a predictive modeling technique, so it always involves predictions. In this session we’ll just talk about predictive analysis and not prescriptive analysis. Why? Because for prescriptive analysis you need to have a good base and a strong hold on the predictive part first. Now, regression estimates the relationship between a dependent variable and an independent variable. For those of you who are not aware of these terminologies, let me give you a quick summary. A dependent variable is nothing but the variable which you want to predict. Let’s say I want to know what the sales will be on the 26th of this month; then sales becomes the dependent variable, or you can say the target variable. Now this dependent or target variable is going to depend on a lot of factors: the number of products you sold till date, what the season is, whether the product is available, how the product quality is, and so on. These are the never-ending factors, which are nothing but the different features that lead to sales. These variables are called independent variables, or you can say predictors. Now if you look at the graph over here, we have some values of x and some values of y, and as you can see, if x increases, the value of y also increases. Let me explain this with an example. Let’s say we have values of x up to 6.75, and somebody asks you what the value of y is when x is 7. The way you can do it, or how regression comes into the picture, is by fitting a straight line through all these points and getting the values of m and c. So this is a straight line, guys, and the formula for the straight line is y = mx + c.
Using this we can try to predict the value of y. If you notice, the x variable can increase as much as it can, but the y variable will increase according to x, so y is basically dependent on your x variable. For any arbitrary value of x you can predict the value of y, and this is always done through regression. So that is how regression is useful. Now regression is basically classified into three types: linear regression, logistic regression, and polynomial regression. Today we will be discussing logistic regression, so let’s move forward and understand the what and why of logistic regression. This algorithm is most widely used when the dependent variable, or you can say the output, is in a binary format. Here you need to predict the outcome of a categorical dependent variable, so the outcome should always be discrete or categorical in nature. By discrete I mean the value should be binary, that is, you have just two values: it can be either 0 or 1, yes or no, true or false, high or low. Only these can be the outcomes, so the value which you need to predict should be discrete, or you can say categorical, in nature. Whereas in linear regression, the value of y, the value you need to predict, lies in a range. That is the difference between linear regression and logistic regression. You must be having a question: why not linear regression? Now guys, in linear regression the value of y, the value which you need to predict, is in a range, but in our case, in logistic regression, we have just two values: it can be either 0 or 1. It should not entertain values which are below zero or above one.
But in linear regression we have the value of y in a range, so in order to implement logistic regression we need to clip this part: we don’t need the values that are below 0 or above 1. Since the value of y must lie only between 0 and 1, which is the main rule of logistic regression, the linear line has to be clipped at 0 and 1. Once we clip this graph, it looks somewhat like this. Here you’re getting a curve which is nothing but three different straight lines, and such a clipped curve cannot be formulated as a single equation. So we need a new way to solve this problem: it has to be formulated into an equation, and hence we come up with logistic regression. Here the outcome is either 0 or 1, which is the main rule of logistic regression, so our main aim of bringing the values to 0 and 1 is fulfilled. That is how we came up with logistic regression. Now, once it gets formulated into an equation, it looks somewhat like this. Guys, this is nothing but an S curve, or you can say the sigmoid curve, the curve of the sigmoid function. The sigmoid function converts any value from minus infinity to infinity into a value between 0 and 1, which can then be turned into the discrete values that logistic regression wants, that is, values in binary format, either 0 or 1. So if you see here, the values are either 0 or 1, and this part is nothing but the transition between them. But guys, there’s a catch over here. Let’s say I have a data point that is 0.8. How can you decide whether your value is 0 or 1? Here you have the concept of a threshold, which basically divides your line. The threshold value indicates the probability of either winning or losing; by winning I mean the value is equal to 1, and by losing I mean the value is equal to 0. But how does it do that? Let’s take a data point over here; let’s say my cursor is at 0.8.
So here I check whether this value is less than the threshold value or not. If it is more than the threshold value, it should give me the result 1; if it is less than that, it should give me the result 0. Here my threshold value is 0.5. So if my value is, let’s say, 0.8, it is more than 0.5, and the value shall be rounded off to 1. And if it is less than 0.5, let’s say I have a value of 0.2, then it should be reduced to 0. So you can use the concept of a threshold value to find the output, which should be discrete: it should be either 0 or 1. I hope you caught this curve of logistic regression. So guys, this is the sigmoid S curve, and to make this curve we need an equation, so let me address that part as well. Let’s see how an equation is formed to imitate this functionality. Over here we have the equation of a straight line, which is y = mx + c. In this case I have only one independent variable, but if we have many independent variables, the equation becomes m1x1 + m2x2 + m3x3 and so on till mnxn. Now let us put in b in place of m, so the equation becomes y = b1x1 + b2x2 + b3x3 and so on till bnxn, plus c. Guys, the equation of the straight line has a range from minus infinity to infinity, but in our case, in the logistic equation, the value which we need to predict, the y value, can range only from 0 to 1. So we need to transform this equation. To do that, we divide y by 1 minus y. If y equals 0, we get 0 over 1 minus 0, which is 0 over 1, and that is again 0. If we take y equal to 1, we get 1 over 1 minus 1, which is 1 over 0, and that is infinity. So my range is now between 0 and infinity, but we want the range from minus infinity to infinity.
So for that, we take the log of this expression. Let’s go ahead and take the logarithm, which transforms it further to get the range between minus infinity and infinity. So over here we have log of y over 1 minus y, and this is your final logistic regression equation. Guys, don’t worry: you don’t have to write or memorize this formula. In Python you just need to call the function, which is logistic regression, and everything will be done automatically for you. I don’t want to scare you with the maths and the formulas behind it, but it is always good to know how this formula was generated. So I hope you guys are clear on how logistic regression comes into the picture. Next, let us see the major differences between linear regression and logistic regression. First of all, in linear regression the value of y is a continuous variable, that is, the variable which you need to predict is continuous in nature, whereas in logistic regression we have a categorical variable, so the value which you need to predict should be discrete in nature. It should be either 0 or 1, or just have two values. For example, whether it is raining or not raining, whether it is humid outside or not, whether it is going to snow or not. These are a few examples where the values you need to predict are discrete, that is, you just predict whether something is happening or not. Next, linear regression solves your regression problems. Here you have the concept of an independent variable and a dependent variable, and you can calculate the value of y, which you need to predict, using the value of x. Your y variable, the value that you need to predict, lies in a range, whereas in logistic regression you have discrete values.
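As a small aside, the sigmoid curve, the 0.5 threshold, and the log of y over 1 minus y from the derivation can be sketched in a few lines. The function names sigmoid, logit and classify are assumptions made for this illustration, and treating a value of exactly 0.5 as class 1 is also an assumption, since the walkthrough only covers values above or below the threshold.

```python
import math


def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(y):
    """Return log(y / (1 - y)) for 0 < y < 1; this undoes the sigmoid."""
    return math.log(y / (1.0 - y))


def classify(y, threshold=0.5):
    """Turn a value between 0 and 1 into a discrete 0 or 1 label."""
    # Assumption: a value exactly at the threshold is labelled 1.
    return 1 if y >= threshold else 0


for value in (0.8, 0.2):
    print(value, "->", classify(value))

# Applying sigmoid to the log-odds gives back the original value.
print(sigmoid(logit(0.8)))
```

The last line shows why the log transform is the right one: the straight-line equation lives on the log-odds scale, and the sigmoid brings it back to the 0 to 1 scale.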
Logistic regression basically solves classification problems: it can classify and give you the result of whether an event is happening or not. I hope it is pretty much clear till now. Next, in linear regression the graph that you have seen is a straight line, so you can calculate the value of y with respect to the value of x, whereas in logistic regression the graph that we got was an S curve, or you can say the sigmoid curve, so you predict your y values using the sigmoid function. I hope you guys are clear on the differences between linear regression and logistic regression. Moving ahead, let’s see the various use cases where logistic regression is implemented in real life. The very first is weather prediction. Logistic regression helps you predict the weather; for example, it is used to predict whether it is raining or not, whether it is sunny, whether it is cloudy or not. All these things can be predicted using logistic regression. You need to keep in mind that both linear regression and logistic regression can be used in predicting the weather. Linear regression helps you predict what the temperature will be tomorrow or the day after tomorrow, whereas logistic regression will only tell you whether it’s going to rain or not, whether it’s cloudy or not, whether it’s going to snow or not, so these values are discrete. These are the slight differences between linear regression and logistic regression. Moving ahead, we have classification problems. Logistic regression can also perform multi-class classification, so it can help you tell whether it’s a bird or not a bird. Then you can classify different kinds of mammals.
Let’s say whether it’s a dog or not a dog; similarly, you can check whether it’s a reptile or not a reptile. So logistic regression can perform multi-class classification. This point I’ve already discussed: it is used in classification problems. Next, it also helps you determine illness. Let me take an example. Let’s say a patient goes for a routine check-up in a hospital. The doctor will perform various tests on the patient and check whether the patient is actually ill or not. What will the features be? The doctor can check the sugar level and the blood pressure, then the age of the patient, whether very young or old, then the previous medical history of the patient. All of these features will be recorded by the doctor, and finally the doctor checks the patient data and determines the outcome of the illness and its severity. So using all the data, a doctor can identify whether a patient is ill or not. These are the various use cases in which you can use logistic regression. Now I guess that’s enough of the theory part, so let’s move ahead and see some practical implementations of logistic regression. Over here I’ll be implementing two projects. In the first I have the data set of the Titanic, and we’ll predict what factors made people more likely to survive the sinking of the Titanic. In my second project we’ll do data analysis on SUV cars: we have data on who can purchase an SUV, and we’ll see what factors made people more interested in buying one. These will be the major questions as to why you should implement logistic regression and what output you will get from it. So let’s start with the very first project, that is, Titanic data analysis.
Some of you might know that there was a ship called the Titanic, which hit an iceberg and sank to the bottom of the ocean. It was a big disaster at that time because it was the first voyage of the ship, and it was supposed to be really strongly built and one of the best ships of its time. And of course there is a movie about this as well, so many of you might have watched it. What we have is data on the passengers, those who survived and those who did not survive this tragedy. What you have to do is look at this data and analyze which factors contributed the most to a person’s chances of survival on the ship. Using logistic regression, we can predict whether a person survived or died. Apart from this, we’ll also have a look at the various features. So first let us explore the data set. Over here we have the index value, then the first column is the passenger ID, and the next column is Survived. Here we have two values, 0 and 1: 0 stands for did not survive and 1 stands for survived. So this column is categorical, where the values are discrete. Next we have the passenger class, with three values, 1, 2 and 3, which tells you whether a passenger is travelling in first class, second class or third class. Then we have the name of the passenger. We have the sex, or you can say the gender, of the passenger, male or female. Then we have the age. We have SibSp, which means the number of siblings or spouses aboard the Titanic, with values such as 1, 0 and so on. Then we have Parch, which is the number of parents or children aboard the Titanic, and here also we have some values. Then we have the ticket number. We have the fare.
We have the cabin number, and we have the Embarked column. In my Embarked column we have three values, S, C and Q: S stands for Southampton, C stands for Cherbourg and Q stands for Queenstown. These are the features that we’ll be applying our model on. Here we’ll perform various steps and then implement logistic regression. These are the steps required to implement any algorithm; in our case we are implementing logistic regression. The very first step is to collect your data, or to import the libraries that are used for collecting your data and taking it forward. My second step is to analyze your data. Over here I can go through the various fields and analyze the data. I can check: did the females or children survive better than the males? Did the rich passengers survive more than the poor passengers? Did money matter, as in, were those who paid more to get onto the ship evacuated first? And what about the workers: did they survive, and what was the survival rate if you were a worker on the ship and not just a travelling passenger? All of these are very interesting questions, and you would be going through them one by one. So in this stage you need to analyze and explore your data as much as you can. The third step is to wrangle your data. Data wrangling basically means cleaning your data: you can remove the unnecessary items, or if you have null values in the data set, you can clear that data and then take it forward. In the fourth step you build your model using the train data and then test it using the test data, so here you will be performing a split, which divides your data set into training and testing sets. And finally you check the accuracy, so as to see how accurate your predictions are.
I hope you guys got these five steps that you’re going to implement in logistic regression. Now let’s go into all these steps in detail. Number one: we have to collect the data, or you can say import the libraries. Let me show you the implementation part as well; I’ll open my Jupyter notebook and implement all of these steps side by side. So guys, this is my Jupyter notebook. First let me rename it to, let’s say, Titanic Data Analysis. Our first step was to import all the libraries and collect the data, so let me import the libraries first. First of all I’ll import pandas, which is used for data analysis, so I’ll say import pandas as pd. Then I’ll import numpy, so I’ll say import numpy as np. NumPy is a library in Python which stands for Numerical Python, and it is widely used for scientific computation. Next we will import seaborn, which is a library for statistical plotting, so I’ll say import seaborn as sns. I’ll also import matplotlib, which is again for plotting, so I’ll say import matplotlib.pyplot as plt. To show the plots in Jupyter Notebook, all I have to write is %matplotlib inline. Next I’ll import one more module, for basic mathematical functions, so I’ll say import math. These are the libraries that I’ll be needing in this Titanic data analysis. Now let me import my data set. I’ll take a variable, let’s say titanic_data, and using pandas I’ll read my CSV, that is, the data set; the name of my data set is Titanic.csv. I have already shown you the data set, so let me just print the top 10 rows. For that I’ll take the variable and say titanic_data.head, and I’ll ask for the top ten rows.
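Here is a minimal sketch of the loading step just described. So that it runs without the actual file, it reads a few made-up rows from an in-memory string instead of Titanic.csv; the rows are invented for illustration, the exact column spellings are an assumption based on the columns named in the walkthrough, and the Name and Ticket columns are left out to keep it short.

```python
import io

import pandas as pd

# Made-up illustrative rows, NOT real records from Titanic.csv.
# Column spellings are assumed; Name and Ticket are omitted for brevity.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
1,0,3,male,30,1,0,7.5,,S
2,1,1,female,41,1,0,70.0,A10,C
3,1,3,female,19,0,0,8.0,,S
4,0,3,male,,0,0,8.0,,Q
5,0,1,male,60,0,0,50.0,B20,S
6,1,2,female,12,1,0,30.0,,C
"""

# With the real file this would be pd.read_csv("Titanic.csv").
titanic_data = pd.read_csv(io.StringIO(csv_text))

# head(10) returns at most the first ten rows; this toy sample has only six.
print(titanic_data.head(10))
```

Empty fields in the CSV text come back as NaN, which is what the later wrangling step looks for.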
Now I’ll just run this. To run a cell you have to press Shift + Enter, or else you can directly click on the cell and run it. Over here I have the index. We have the passenger ID, which is again nothing but an index, starting from 1. Then we have the Survived column, which has categorical values, or you can say discrete values, in the form of 0 or 1. Then we have the passenger class, the name of the passenger, sex, age and so on. So this is the data set that I’ll be going forward with. Next, let us print the number of passengers in this original data set. For that I’ll simply type print, I’ll say number of passengers, and using the len function I can calculate the total length. So I’ll say len, and inside it I’ll pass the variable titanic_data, which I’ll copy from here and paste, followed by .index. Now let me print this. So the number of passengers in the original data set is 891; around this many people were travelling on the Titanic. With this, my first step is done: we have collected the data, imported all the libraries, and found the total number of passengers on the Titanic. Now let me go back to the presentation and see what my next step is. We’re done with collecting the data; the next step is to analyze it. Here we will create different plots to check the relationship between variables, as in how one variable affects another. You can explore your data set by making use of the various columns and plotting graphs between them: you can plot a correlation graph or a distribution curve, it’s up to you guys. So let me go back to my Jupyter notebook and analyze some of the data. My second part is to analyze the data.
So I’ll put this in Header 2. To do that I just have to go to Code, click on Markdown, and run this. First let us plot a count plot to compare the passengers who survived and who did not survive. For that I’ll be using the seaborn library. I have imported seaborn as sns, so I don’t have to write the whole name; I’ll simply say sns.countplot, I’ll set x to Survived, and the data that I’ll be using is titanic_data, or you can say the name of the variable in which you have stored your data set. Now let me run this. As you can see, I have the Survived column on my x-axis and the count on the y-axis. 0 stands for did not survive and 1 stands for the passengers who did survive. You can see that around 550 passengers did not survive and only around 350 passengers survived, so you can conclude that there are far fewer survivors than non-survivors. That was the very first plot. Now let us draw another plot to compare by sex: out of all the passengers who survived and who did not survive, how many were male and how many were female? To do that I’ll simply say sns.countplot and add the hue as Sex, since I want to know how many females and how many males survived, and then I’ll specify the data, so I’m using titanic_data. Let me run this. I have made a mistake over here. So over here you can see I have the Survived column on the x-axis and the count on the y-axis. Here the blue color stands for your male passengers and orange stands for your female passengers. As you can see, among the passengers who did not survive, that is, with the value 0, the majority were male, and if we look at the people who survived, the majority were female. So this basically summarizes the survival rate by gender.
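If you want the numbers behind those two count plots rather than the bars, a minimal pandas sketch could look like this. The eight rows are a made-up toy sample, not the Titanic data, and the column names Survived and Sex are assumed spellings; the seaborn calls are shown only as comments.

```python
import pandas as pd

# Made-up toy sample (not real Titanic records); column names are assumed.
titanic_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1, 0, 1],
    "Sex": ["male", "female", "female", "male",
            "male", "female", "female", "male"],
})

# The bar heights that sns.countplot(x="Survived", data=titanic_data) would draw.
survived_counts = titanic_data["Survived"].value_counts().sort_index()
print(survived_counts)

# The bars split by sex, as with
# sns.countplot(x="Survived", hue="Sex", data=titanic_data).
by_sex = pd.crosstab(titanic_data["Survived"], titanic_data["Sex"])
print(by_sex)
```

A table like by_sex is handy when two bars look close in the plot and you want the exact counts.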
So it appears that on average women were more than three times more likely to survive than men. Next, let us draw another plot where the hue is the passenger class, so we can see which class a passenger was travelling in, whether class one, two or three. For that I’ll write the same command: I’ll say sns.countplot, keep my x as Survived, and change my hue to the passenger class, whose variable is named Pclass. And the data set that I’ll be using is titanic_data. So this is my result. Over here you can see I have blue for first class, orange for second class and green for third class. The passengers who did not survive were mainly from third class, or you can say the lowest or cheapest class to get onto the Titanic, and the people who did survive mainly belonged to the higher classes: classes 1 and 2 fared better than the passengers travelling in third class. So we have concluded that the passengers who did not survive were mainly from third class, and the passengers travelling in first and second class tended to survive more. Next, let me plot a graph for the age distribution. Over here I can simply use my data, so we’ll be using the pandas library for this. I’ll pass in the column, that is Age, and since I want a histogram, I’ll say .plot.hist(). You can notice over here that we have more young passengers, or you can say children between the ages of 0 and 10, then we have the average-aged people, and as you go further, the population gets smaller. So this is the analysis of the Age column: we saw that we have more young passengers and more middle-aged passengers travelling on the Titanic. Next, let me plot a graph of the fare as well. So I’ll say titanic_data, I’ll say Fare, and again I want a histogram, so I’ll say hist.
Here you can see that the fare mostly lies between 0 and 100. Now let me add the bin size so as to make it clearer: I’ll say bins equals, let’s say, 20, and I’ll increase the figure size as well, so I’ll say figsize and give the dimensions as 10 by 5. So this is clearer now. Next, let us analyze the other columns as well. I’ll type titanic_data and ask for the info, to see what columns are left. Here we have the passenger ID, which I guess is of no use. Then we saw how many passengers survived and how many did not. We also saw the analysis on a gender basis, whether females or males tended to survive more. Then we saw the passenger class, whether the passenger was travelling in first, second or third class. Then we have the name; on the name we cannot do any analysis. We saw the sex, and we saw the age as well. Then we have SibSp, which stands for the number of siblings or spouses aboard the Titanic, so let us do this as well. I’ll say sns.countplot, I’ll mention x as SibSp, and I’ll be using titanic_data. You can see the plot over here. It has the maximum count at zero, so you can conclude that most passengers had neither siblings nor a spouse on board the Titanic. The second highest value is 1, and then we have smaller counts for 2, 3, 4 and so on. Similarly you can do it for Parch, the number of parents or children aboard the Titanic. Then we have the ticket number; I don’t think any analysis is required for the ticket. Then we have the fare, which we have already discussed: the people travelling in first class would pay the highest fare. Then we have the cabin number, and we have Embarked.
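Going back to the fare histogram for a moment, here is a small sketch of what choosing 20 bins does to the data. It uses NumPy to compute the bin counts directly instead of drawing the plot, and the fare values are made up for illustration, not taken from the Titanic data.

```python
import numpy as np

# Made-up fare values for illustration only (not from the Titanic data).
fares = np.array([7.5, 8.0, 8.0, 10.5, 13.0, 26.0, 30.0, 50.0, 70.0, 83.0])

# Split the range from the smallest to the largest fare into 20 equal-width bins.
counts, edges = np.histogram(fares, bins=20)

print("bin width:", edges[1] - edges[0])
print("counts per bin:", counts)
```

More bins give narrower bars, which is why the plot with bins set to 20 shows more detail in the low fares where most of the values sit.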
So these are the columns that we’ll be doing data wrangling on. We have analyzed the data and seen quite a few graphs from which we can conclude which variable matters more than another, or what the relationship is. Now, the third step is data wrangling. Data wrangling basically means cleaning your data. If you have a large data set, you might have some null values, or you can say NaN values, and it’s very important that you remove all the unnecessary items present in your data set, because this directly affects your accuracy. So I’ll go ahead and clean my data by removing the NaN values and the unnecessary columns which have null values. So next I’m performing data wrangling. First of all I’ll check whether my data set has nulls or not. I’ll say titanic_data, which is the name of my data set, and I’ll say isnull. This checks for missing data and returns a Boolean result, True or False: False means the value is not null and True means it is null. Let me just run this. Over here you can see the values as False or True; False is where the value is not null and True is where the value is null. You can see that in the Cabin column the very first value is null, so we have to do something about this. Since we have a large data set, reading this output value by value is not practical, so we can take the sum of it instead: we can print the number of passengers who have a NaN value in each column. So I’ll say titanic_data.isnull() and I want the sum of it, so I’ll say .sum(). This prints the number of passengers who have NaN values in each column, and we can see that we have missing values in the Age column, that is 177.
Then we have the maximum number of missing values in the Cabin column, and very few in the Embarked column, that is 2. If you don’t want to read these numbers, you can also plot a heat map and analyze it visually, so let me do that as well. I’ll say sns.heatmap, pass in the null check, and say yticklabels equals False. Let’s just run this. As we have already seen, there were three columns in which missing values were present. This one is Age, where almost 20% of the column is missing. Then we have the Cabin column, where the missing share is quite large, and then we have two missing values in the Embarked column as well. Let me add a cmap for color coding, so I’ll say cmap. If I do this, the graph becomes more attractive. Over here yellow stands for True, or you can say the values that are null. So we have seen that we have missing values in Age, a lot of missing values in the Cabin column, and very few, not even visible, in the Embarked column. To remove these missing values, you can either replace them by putting in some dummy values, or you can simply drop the column. Let us pick the Age column. First let me plot a box plot to analyze Age. I’ll say sns.boxplot, I’ll say x equals the passenger class, so it’s Pclass, y equals Age, and the data set that I’ll be using is the Titanic set, so I’ll say data equals titanic_data. You can see that the age in first class and second class tends to be higher than in third class. Well, that depends on experience, how much you earn, or there might be any number of reasons. So here we concluded that passengers travelling in class one and class two tend to be older than those in class three. And we have found that we have some missing values in Age.
Now one way is to just drop the column, or you can simply fill in some values; this method is called imputation. To perform data wrangling, or cleaning, let us first print the head of the data set. So I'll say titanic.head; sorry, it's titanic_data.head, and let's say I just want five rows. So here we have Survived, which is again categorical, so on this particular column I can apply logistic regression. This can be my y value, or the value that you need to predict. Then we have the passenger class, the name, the ticket number, the fare, and the cabin. We have seen that Cabin has a lot of null values, or you can say NaN values, which is quite visible as well. So first of all, we'll just drop this column. For dropping it, I'll say titanic_data and simply type in drop and the column which I need to drop, that is Cabin. I'll mention axis equal to 1 and inplace equal to True. Now I'll print the head again and see whether this column has been removed from the data set or not. So I'll say titanic_data.head. As you can see, we don't have the Cabin column anymore. Now, you can also drop the NA values. So I'll say titanic_data.dropna, which drops all the NA values, or you can say NaN, which is not a number, and I'll say inplace is equal to True. Let me plot the heat map again and see whether the values which were showing up as null before have been removed or not. So I'll say sns.heatmap, pass in the data set with isnull, and say yticklabels is equal to False. And I don't want the color coding, so for that I again say False. This will help me check whether the null values have been removed from the data set or not. As you can see, I don't have any null values, so the plot is entirely black now. You can check the sum as well.
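The two cleaning steps above, dropping the Cabin column and then dropping rows that still contain NaN, can be sketched as follows. The small table is invented for illustration; the mean-fill at the end is only a sketch of the imputation alternative mentioned in the session, not something the session actually runs.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for titanic_data (values are invented).
raw = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

titanic_data = raw.copy()

# Step 1: drop the mostly-empty Cabin column (axis=1 means "columns").
titanic_data.drop("Cabin", axis=1, inplace=True)

# Step 2: drop every row that still has a missing value.
titanic_data.dropna(inplace=True)

# After cleaning, the total number of nulls should be zero.
remaining_nulls = int(titanic_data.isnull().sum().sum())
print(titanic_data)
print(remaining_nulls)

# Alternative (imputation): keep all rows and fill missing ages with the mean.
age_imputed = raw["Age"].fillna(raw["Age"].mean())
```

Dropping rows is the simplest option but discards data; imputation keeps every row at the cost of introducing estimated values.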
So I'll just go above, copy this part, and use the sum function to calculate the sum. This tells me that the data set is clean, as in it does not contain any null or NaN value. So now we have wrangled our data, or you can say cleaned the data. Here we have done just one step in data wrangling, that is removing one column. You can do a lot more: you can fill in the missing values with some other values, or you can calculate the mean and fill in the nulls with it. But now if I look at my data set, so I'll say titanic_data.head, I have a lot of string values. These have to be converted to categorical variables in order to implement logistic regression. So what we will do is convert them into dummy variables, and this can be done using pandas, because logistic regression predicts just two values. Whenever you apply machine learning, you need to make sure that there are no string values present, because the model won't take these as input variables. You cannot predict anything from raw strings, but in my case the value to predict is the Survived column, which is 0 or 1, that is, how many people survived and how many did not: 0 stands for did not survive and 1 stands for survived. So now let me convert these variables into dummy variables. I'll use pandas and say pd.get_dummies (you can simply press Tab to autocomplete), and I'll say titanic_data and pass the Sex column. You can press Shift + Tab to get more information on this. So here we have the type DataFrame, and we have the passenger ID, Survived and passenger class.
So if I run this, you'll see that in the female column 0 stands for not female and 1 stands for female; similarly, in the male column 0 stands for not male and 1 stands for male. Now, we don't require both these columns, because one column by itself is enough to tell us whether the passenger is male or female. Let's say I want to keep only male: if the value of male is 1, it is definitely a male and not a female. So you don't need both of these columns. For that I'll just remove the first column, which is female, so I'll say drop_first is equal to True. It has given me just one column, male, which has the values 0 and 1. Let me set this as a variable, say sex, so I can say sex.head to see the first five rows. So this is how my data looks now. Here we have done it for Sex. Then we have the numerical values in Age and in SibSp, then the ticket number, the fare, and Embarked as well. In Embarked the values are S, C and Q, so here also we can apply the get_dummies function. Let's say I take a variable embark, use the pandas library, and enter the column name, that is Embarked. Let me print the head of it, so I'll say embark.head. Over here we have C, Q and S. Here also we can drop the first column, because two values are enough to tell where the passenger boarded: Q, that is Queenstown, or S for Southampton, and if both the values are 0 then the passenger definitely boarded at Cherbourg, which is the third value. So you can again drop the first column; I'll say drop_first is equal to True and run this. This is how my output looks now. Similarly, you can do it for the passenger class as well. Here also we have three classes, 1, 2 and 3, so I'll just copy the whole statement. Let's say the variable name is pcl.
I'll pass in the column name, that is Pclass, and I'll drop the first column. Here the values will be 1, 2 or 3, and after removing the first column we are left with just 2 and 3, so if both the values are 0 then the passenger is definitely traveling in first class. Now we have made the values categorical. My next step is to concatenate all these new columns into the data set, titanic_data. Using pandas, we'll concatenate all these columns, so I'll say pd.concat, and we have to concatenate sex, embark and pcl, and then I'll mention the axis as 1. I'll run this and then print the head. Over here you can see that these columns have been added. We have the male column, which tells whether the person is male or female. Then we have the embarkation columns, Q and S: if the passenger boarded at Queenstown the value of Q is 1, else it is 0, and if both of these values are 0 the passenger definitely boarded at Cherbourg. Then we have the passenger class columns 2 and 3; if the value of both is 0, the passenger is traveling in class 1. I hope you got this so far. Now, these are the irrelevant columns that we still have over here, so we can drop them: the Pclass column, the Embarked column and the Sex column. So I'll type in titanic_data.drop and mention the columns that I want to drop. I'll even delete PassengerId, because it's nothing but the index value starting from 1. I don't want Name either, so I'll delete Name as well. What else can we drop? We can drop Ticket as well. Then I'll mention the axis and say inplace is equal to True. Okay, my column names start with uppercase. So these have been dropped; now let me print my data set again.
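The dummy-variable steps above can be sketched end to end on a small invented table. The variable names sex, embark and pcl follow the session; the four rows are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for titanic_data (values are invented).
titanic_data = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Pclass": [3, 1, 2, 3],
})

# drop_first=True removes the first category, which is implied
# when all remaining dummy columns are 0.
sex = pd.get_dummies(titanic_data["Sex"], drop_first=True)          # keeps 'male'
embark = pd.get_dummies(titanic_data["Embarked"], drop_first=True)  # keeps 'Q', 'S'
pcl = pd.get_dummies(titanic_data["Pclass"], drop_first=True)       # keeps 2, 3

# Attach the new columns side by side (axis=1) ...
titanic_data = pd.concat([titanic_data, sex, embark, pcl], axis=1)

# ... and remove the original string/category columns they replace.
titanic_data.drop(["Sex", "Embarked", "Pclass"], axis=1, inplace=True)

print(titanic_data.head())
```

In the second row both Q and S are 0, which is how a passenger from Cherbourg is encoded after the first column is dropped.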
So this is my final data set, guys. We have the Survived column, which has the values 0 and 1, then we have the passenger class. Oh, we forgot to drop this as well. No worries, I'll drop it now. Let me run this. So over here we have Survived, Age, SibSp, Parch, Fare, male, and the columns we have just converted. So here we have performed data wrangling, or you can say cleaned the data, and then we have converted the values of gender to male, Embarked to Q and S, and the passenger class to 2 and 3. So this was all about data wrangling, or just cleaning the data.

My next step is training and testing the data. Here we will split the data set into a train subset and a test subset, then build a model on the train data and predict the output on the test data. So let me go back to Jupyter and implement this as well. Over here I need to train my model, so I'll put this in as heading 3. You need to define your dependent variable and your independent variables. Here my y is the output, or you can say the value that you need to predict, so I'll write titanic_data and take the column Survived. Basically I have to predict this column, whether the passenger survived or not, and as you can see we have a discrete outcome in the form of 0 and 1. All the rest we can take as features, or you can say independent variables. So I'll say titanic_data.drop, where we simply drop Survived, and all the other columns will be my independent variables, that is, the features that influence survival. Once we have defined the independent and dependent variables, the next step is to split the data into training and testing subsets. For that we will be using sklearn. I'll type in from sklearn.cross_validation import train_test_split. Now if you press Shift + Tab here, you can go to the documentation and see the examples. I'll expand it, go to the examples, and see how you can split your data. Over here you have X_train, X_test, y_train, y_test, and using train_test_split you pass in your independent and dependent variables and define a test size and a random state. So let me copy this and paste it over here. We will have X train and test, then the dependent variable train and test, and using the split function we'll pass in the independent and dependent variables and then set a split size. Let's say I'll put it at 0.3, which means that the data set is divided in a 70/30 ratio. Then I can add a random state to it; let's say I'm applying 1. This is not necessary, but if you want the same result as mine you can add the random state, because it takes exactly the same sample every time.

Next I have to train and predict by creating a model. Logistic regression comes from the linear model module, so I'll type in from sklearn.linear_model import LogisticRegression. Next I'll create an instance of this logistic regression model, so I'll say logmodel is equal to LogisticRegression(). Now I need to fit my model, so I'll say logmodel.fit and pass in my X_train and y_train. Here it gives me all the details of the logistic regression: class_weight, dual, fit_intercept and all those things. Then I need to make predictions. I'll take a variable called predictions and say logmodel.predict and pass in X_test. So here we have created a model, fit that model, and made predictions.
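The split, fit and predict sequence above can be sketched on synthetic data. Two things here are assumptions rather than part of the session: the feature matrix is randomly generated instead of being the Titanic table, and train_test_split is imported from sklearn.model_selection, which is where it lives in current scikit-learn releases (the sklearn.cross_validation module named in the session was removed in later versions).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 3 numeric features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# 70/30 split; random_state fixes the shuffle so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Create the model, fit it on the training subset, predict on the test subset.
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
predictions = logmodel.predict(X_test)

print(X_train.shape, X_test.shape, predictions[:10])
```

With the real data, X would be titanic_data.drop("Survived", axis=1) and y would be titanic_data["Survived"], as described in the session.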
So now, to evaluate how my model has been performing, you can simply calculate the accuracy or you can also generate a classification report. Don't worry, guys, I'll be showing both of these methods. So I'll say from sklearn.metrics import classification_report. Here I'll use classification_report, and inside it I'll pass in y_test and the predictions. So guys, this is my classification report. Over here I have the precision, the recall, the f1-score and then the support. We have precision values of 75, 72 and 73, which is not that bad. Now, in order to calculate the accuracy, you can also use the concept of a confusion matrix. If you want to print the confusion matrix, I'll first say from sklearn.metrics import confusion_matrix, and then we just print it. My function has been imported successfully, so I'll say confusion_matrix and again pass in the same variables, y_test and predictions. I hope you guys already know the concept of a confusion matrix, so I'll just tell you in brief what it is all about. A confusion matrix is nothing but a 2 by 2 matrix which has four outcomes, and it tells us how accurate your predictions are. Here we have the columns as predicted no and predicted yes, and the rows as actual no and actual yes. So this is the concept of a confusion matrix. Let me feed in the values which we have just calculated: we have 105, 21, 25 and 63. As you can see, we have got four outcomes. 105 is the value where the model predicted no and in reality it was also a no, so that is predicted no and actual no. Similarly, we have 63 where the model predicted yes and it actually was a yes.
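The four counts read off the confusion matrix can be turned into an accuracy by hand: correct predictions divided by all predictions. A minimal sketch of that arithmetic, using only the numbers quoted in the session (which of 21 and 25 is the false-positive count is not stated, and the calculation does not depend on it):

```python
# Counts quoted in the session.
correct_no = 105    # predicted no, actually no
correct_yes = 63    # predicted yes, actually yes
wrong_a = 21        # one kind of misclassification
wrong_b = 25        # the other kind of misclassification

correct = correct_no + correct_yes                    # 168
total = correct_no + correct_yes + wrong_a + wrong_b  # 214

accuracy = correct / total
print(correct, total, accuracy)
```

The ratio comes to roughly 0.785, which is the figure the session reports as 78 percent.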
So in order to calculate the accuracy, you just need to add these two values and divide the whole by the sum of all four. These two values tell me where the model has actually predicted the correct output. This value is called true negative, this is called false positive, this is called true positive and this is called false negative. Now, in order to calculate the accuracy you don't have to do it manually. In Python you can just import the accuracy_score function and get the result from that. So I'll do that as well: I'll say from sklearn.metrics import accuracy_score, and I'll simply print the accuracy, passing in the same variables, that is y_test and predictions. Over here it tells me the accuracy is 78%, which is quite good. If you want to do it manually, you have to add these two numbers, 105 and 63, which comes out to 168, and then divide by the sum of all four numbers, 105 plus 63 plus 21 plus 25, which gives 214. If you divide these two numbers, you'll get the same accuracy, that is 78 percent, or you can say 0.78. So that is how you can calculate the accuracy. Now let me go back to my presentation and see what all we have covered till now. Here we first split our data into train and test subsets, then we built a model on the train data and predicted the output on the test data, and then my fifth step was to check the accuracy. Here we have calculated the accuracy to be almost 78 percent, which is quite good; you cannot say that the accuracy is bad. The accuracy score tells you how accurate your results are, and here we got a good accuracy.

So now, moving ahead, let us see the second project, that is SUV data analysis. In this, a car company has released a new SUV in the market, and it has the previous data about the sales of its SUV.
They want to predict the category of people who might be interested in buying it. So using logistic regression, you need to find what factors made people more interested in buying this SUV. For this, let us look at the data set, where I have the user ID, the gender as male and female, the age, the estimated salary, and then the Purchased column. This is my discrete column, or you can say the categorical column; it just has the values 0 and 1, and this is the column we need to predict, whether a person will actually purchase the SUV or not. We know the salary of a person and we know the age, and using these we can predict whether the person will purchase the SUV or not. So let me go to my Jupyter notebook and implement logistic regression. Guys, I will not be going through all the details of the data cleaning and analyzing part; I'll leave that to you, so just go ahead and practice as much as you can.

Alright, so the second project is SUV prediction. First of all, I have to import all the libraries, so I'll say import numpy as np and similarly do the rest. Now let me print the head of this data set. We have already seen that we have the columns user ID, gender, age and salary, and then we have to predict whether the person will purchase the SUV or not. So let us go straight to the algorithm part and start with how you can train a logistic regression model. For that, we first need to define the independent variables and the dependent variable. In this case, I want my X, that is the independent variables, to be dataset.iloc. Here I specify a colon, which stands for all the rows, and for the columns I want only 2 and 3, and then .values. This fetches me all the rows and only the columns at index 2 and 3, which are age and estimated salary. These are the factors which will be used to predict the dependent variable, that is Purchased. So my dependent variable is Purchased and my independent variables are age and salary. For y, I'll say dataset.iloc, take all the rows and just the column at index 4, that is my Purchased column, and then .values. All right, I just forgot one square bracket over here. So over here I have defined my independent and dependent variables. Now, you must be wondering what this iloc is. iloc is an indexer for a pandas DataFrame, and it is used for integer-based indexing, or you can say selection by position. Now let me print the independent variables and the dependent variable. If I print the independent variables, I have the age as well as the salary. Next, let me print the dependent variable. Over here you can see I just have the values 0 and 1, where 0 stands for did not purchase. Next, let me divide my data set into training and test subsets. So I'll write from sklearn.cross_validation import train_test_split. Next I'll press Shift + Tab, go to the examples, and copy the same line. Now I want the test size to be, let's say, 0.25, so I have divided the train and test sets in a 75/25 ratio. Let's say I'll take the random state as 0. The random state basically ensures the same result, or you can say the same samples are taken whenever you run the code. So let me run this now.
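The iloc selection described above can be sketched on a few invented rows. The column order (user ID, gender, age, estimated salary, purchased) follows the session; the exact column labels and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the SUV data set (values are invented).
dataset = pd.DataFrame({
    "User ID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 47],
    "EstimatedSalary": [19000, 20000, 43000, 25000],
    "Purchased": [0, 0, 0, 1],
})

# iloc selects by integer position: ":" means all rows,
# [2, 3] picks the columns at positions 2 and 3 (age and salary).
X = dataset.iloc[:, [2, 3]].values

# Position 4 is the Purchased column, the value to predict.
y = dataset.iloc[:, 4].values

print(X)
print(y)
```

Positions are zero-based, so position 2 is the third column in the table and position 4 is the fifth.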
You can also scale your input values for better performance, and this can be done using StandardScaler. Let me do that as well. So I'll say from sklearn.preprocessing import StandardScaler. Now, why do we scale? If you look at the data set, we are dealing with large numbers. Although we are using a very small data set here, whenever you're working in a production environment you'll be working with large data sets with thousands and hundreds of thousands of tuples, so scaling down will definitely affect the performance to a large extent. So let me show you how we can scale down these input values. The preprocessing module contains all the methods and functionality required to transform your data. Now let us scale down the test as well as the training data set. I'll first make an instance of it, so I'll say sc is equal to StandardScaler(). Then I have X_train is equal to sc.fit_transform, and I'll pass in my X_train. Similarly I can do it for the test set, where I'll pass in X_test. All right. Now my next step is to import logistic regression, so I'll say from sklearn.linear_model import LogisticRegression. Over here I'll be using a classifier, so I'll say classifier is equal to LogisticRegression; I just make an instance of it and pass in the random state, which is 0. Now I simply fit the model, passing in X_train and y_train. Here it tells me all the details of the logistic regression. Then I have to predict the values, so I'll say y_pred is equal to classifier.predict and pass in X_test. So now we have created the model: we have scaled down our input values, applied logistic regression and predicted the values, and now we want to know the accuracy.
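The scaling step above can be sketched on a few invented age and salary rows. The session does not spell out which method is called on X_test; this sketch assumes the common pattern of fitting the scaler on the training rows only and reusing those statistics for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical age / salary rows (values are invented).
X_train = np.array(
    [[19, 19000], [35, 20000], [26, 43000], [47, 25000]], dtype=float
)
X_test = np.array([[30, 30000]], dtype=float)

sc = StandardScaler()

# fit_transform learns each column's mean and standard deviation from the
# training data and rescales it to zero mean and unit variance.
X_train_scaled = sc.fit_transform(X_train)

# transform reuses the training statistics, so the test rows are put on the
# same scale without being used to compute those statistics.
X_test_scaled = sc.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```

After scaling, age and salary are on comparable ranges, so the large salary numbers no longer dominate the smaller age numbers.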
For the accuracy, we first need to import accuracy_score, so I'll say from sklearn.metrics import accuracy_score. Using this function we can calculate the accuracy, or you can do it manually by creating a confusion matrix. So I'll pass in my y_test and my y_pred. All right, over here I get the accuracy as 0.89. We want to know the accuracy as a percentage, so I just have to multiply it by 100, and if I run this it gives me 89%. I hope you guys are clear with whatever I have taught you today. Here I took age and salary as my independent variables, predicted how many people would purchase the SUV, and then evaluated our model by checking the accuracy. Over here we get an accuracy of 89%, which is great.

Alright guys, that is it for today. I'll just recap what we have covered in today's training. First of all, we had a quick introduction to what regression is and where it is used. Then we understood the types of regression and got into the what and why of logistic regression, and we compared linear versus logistic regression. We have also seen the various use cases where you can implement logistic regression in real life, and then we picked up two projects, Titanic data analysis and SUV prediction. Over here we have seen how you can collect your data, analyze it, perform modeling on it, train and test, and then finally calculate the accuracy. In the SUV prediction you can also analyze and clean your data, and you can do a lot of other things, so just go ahead, pick up any data set and explore it as much as you can.

Open your eyes and look around, and you will find dozens of applications of machine learning which you are using and interacting with in your daily life, be it face detection on Facebook or recommendations for similar products from Amazon. Machine learning is applied almost everywhere.

So hello and welcome all to this YouTube session, where we will learn how to build a decision tree. This session is designed in a way that you get the most out of it. Alright. A decision tree is a type of classification algorithm which comes under the supervised learning technique. So before learning about decision trees, I'll give you a short introduction to classification, where we'll learn what classification is, what its various types are, where it is used, and what its use cases are. Once you get your fundamentals clear, we'll jump to the decision tree part. Under this, first of all I will teach you to mathematically create a decision tree from scratch, and then, once your concepts are clear, we'll see how you can write a decision tree classifier from scratch in Python using the CART algorithm. All right, I hope the agenda is clear to you guys.

What is classification? I hope every one of you has used Gmail. So how do you think mail gets classified as spam or not spam? Well, that is nothing but classification. Classification is the process of dividing the data set into different categories or groups by adding labels. In other words, you can say that it is a technique of categorizing observations into different categories. Basically, you are taking the data, analyzing it, and on the basis of some condition finally dividing it into various categories. Now, why do we classify? Well, we classify to perform predictive analysis. When you get a mail, the machine predicts it to be spam or not spam, and on the basis of that prediction it moves the irrelevant or spam mail to the respective folder. In general, classification algorithms handle questions like: does this data belong to category A or category B?
Like, is this a male or is this a female, something like that. Now the question arises: where will you use it? Well, you can use it for fraud detection, to check whether a transaction is genuine or not. Suppose I am using a credit card here in India, and for some reason I have to fly to Dubai. If I use the credit card over there, I will get a notification alert regarding my transaction, asking me to confirm it. This is also a kind of predictive analysis, as the machine predicts that something fishy is going on: a while ago I made a transaction using the same credit card in India, and 24 hours later the same card is being used for a payment in Dubai. So the machine predicts that something fishy is going on in the transaction, and in order to confirm it, it sends you a notification alert. All right, this is one of the use cases of classification. You can even use it to classify different items, like fruits, on the basis of taste, color, size or weight. A machine well trained using a classification algorithm can easily predict the class or the type of fruit whenever new data is given to it. And not just fruit: it can be any item, a car, a house, or anything. Have you noticed that when you visit some sites or try to log in, you get a picture CAPTCHA, where you have to identify whether the given image is of a car or of a pole or not? For example, there are 10 images and you select three of them. So in a way you are training the machine, right? You are telling it that these three are pictures of a car and the rest are not. So who knows, you are training it for something big, right? So moving on ahead, let's discuss the types of classification algorithms.
Well, there are several different ways to perform the same task. For example, in order to predict whether a given person is male or female, the machine has to be trained first, but there are multiple ways to train the machine and you can choose any one of them. For predictive analytics there are many different techniques, but the most common of them all is the decision tree, which we'll cover in depth in today's session. As part of the classification algorithms we have decision tree, random forest, naive Bayes, k-nearest neighbor, logistic regression, linear regression, support vector machines and so on; there are many. So let me give you an idea about a few of them, starting with the decision tree. A decision tree is a graphical representation of all the possible solutions to a decision, and the decisions that are made can be explained very easily. For example, here is a task which says: should I go to a restaurant or should I buy a hamburger? You are confused about that, so you create a decision tree for it. Starting with the root node, first of all you check whether you are hungry or not. If you're not hungry, then just go back to sleep. If you are hungry and you have $25, then you decide to go to a restaurant. And if you're hungry and you don't have $25, then you just go and buy a hamburger. That's it. So that's about the decision tree. Now, moving on ahead, let's see what a random forest is. A random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. Most of the time a random forest is trained with the bagging method. The bagging method is based on the idea that a combination of learning models increases the overall result: if you combine the learning from different models and club it together, it improves the overall result.
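Going back to the restaurant-or-hamburger tree from a moment ago, a decision tree is just a chain of nested conditions. A minimal sketch follows; the function name decide is hypothetical, and reading "you have $25" as "at least 25 dollars" is an assumption.

```python
def decide(hungry: bool, dollars: float) -> str:
    """Walk the small decision tree from the example, root node first."""
    # Root node: are you hungry at all?
    if not hungry:
        return "go back to sleep"
    # Second node: do you have $25 (assumed to mean at least 25)?
    if dollars >= 25:
        return "go to a restaurant"
    # Hungry but short on money.
    return "buy a hamburger"


print(decide(hungry=False, dollars=100))
print(decide(hungry=True, dollars=40))
print(decide(hungry=True, dollars=10))
```

Each path from the root to a returned string is one branch of the tree, which is why the decisions of such a model are easy to explain.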
Just one more thing. If the size of your data set is huge, then one single decision tree would lead to an overfit model, the same way a single person might have their own perspective on a complete population, as the population is very large. However, if we implement a voting system and ask different individuals to interpret the data, then we are able to cover the patterns in a much more meticulous way. From the diagram, you can see that in section A we have a large training data set. We first divide the training data set into n sub-samples and create a decision tree for each sub-sample. In part B, we take the vote from the decision made by every decision tree, and finally we combine the votes to get the random forest decision. Fine, let's move on ahead. Next, we have naive Bayes. Naive Bayes is a classification technique which is based on Bayes' theorem. It assumes that the presence of any particular feature in a class is completely unrelated to the presence of any other feature. Naive Bayes is a simple and easy-to-implement algorithm, and due to its simplicity it might outperform more complex models when the size of the data set is not large enough. A classical use case of naive Bayes is document classification, where you determine whether a given text corresponds to one or more categories. In the text case, the features used might be the presence or absence of a keyword. So this was about naive Bayes. From the diagram, you can see that using naive Bayes we have to decide whether we have a disease or not. First we check the probability of having the disease and of not having it: the probability of having the disease is 0.1, while the probability of not having it is 0.9. Okay, first let's see what happens when we have the disease and we go to the doctor.
All right, so when we visit the doctor, the probability of a positive test when you have the disease is 0.80, and the probability of a negative test when you have the disease is 0.20. The latter is a false negative, as the test reports negative but you still have the disease. Now let's move ahead to the case when you don't have the disease at all. The probability of not having the disease is 0.9. When you visit the doctor and the doctor says, yes, you have the disease, but you actually don't, that's a false positive. So the probability of a positive test when there is actually no disease is 0.1, and the probability of a negative test when there is actually no disease is around 0.90. In that case the test agrees with reality, so it is a true negative. All right, so let's move on ahead and discuss the KNN algorithm. The KNN algorithm, or k-nearest neighbor, stores all the available cases and classifies new cases based on a similarity measure. The k in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. For example, if k equals 1, then the object is simply assigned to the class of that single nearest neighbor. From the diagram, you can see the difference in the image when k equals 1, 3 and 5. Systems are now able to use k-nearest neighbor for visual pattern recognition to scan and detect hidden packages in the bottom bin of a shopping cart at checkout, by matching a detected object against the objects listed in the database.
When there is such a match, the price of the spotted product could even be added to the customer's bill automatically. This automated billing practice is not used extensively at this time, but the technology has been developed and is available for use if you want it. And yes, one more thing: k-nearest neighbor is also used in retail to detect patterns in credit card usage. Many new transaction-scrutinizing software applications use KNN algorithms to analyze register data and spot unusual patterns that indicate suspicious activity. For example, if register data indicates that a lot of customer information is being entered manually rather than through automated scanning and swiping, this could indicate that the employees using the register are in fact stealing customers' personal information. Or if register data indicates that a particular good is being returned or exchanged multiple times, this could indicate that employees are misusing the return policy or trying to make money from fake returns. So that was the KNN algorithm.

Since our main focus for this session is the decision tree, let's start with what a decision tree is. But first, let me tell you why we chose the decision tree to start with. Decision trees are really easy to read and understand. They belong to the few models that are interpretable, where you can understand exactly why the classifier made a particular decision. Let me also tell you a fact: for a given data set, you cannot say in advance that one algorithm performs better than another. You cannot say that a decision tree is better than naive Bayes, or that naive Bayes performs better than a decision tree. It depends on the data set, right?
You have to use trial and error: apply the algorithms one by one and compare the results. The model that gives the best result is the one you can use for better accuracy on your data set. All right, so let's start with what a decision tree is. A decision tree is a graphical representation of all the possible solutions to a decision based on certain conditions. You might be wondering why it is called a decision tree. It is called that because it starts with a root and then branches off into a number of solutions, just like a tree. A real tree also starts from a root and grows more branches as it gets bigger, and similarly a decision tree has a root that keeps growing with an increasing number of decisions and conditions.

Now let me give you a real-life scenario. I won't say all of you, but most of you must have used this. Remember when you dial the toll-free number of your credit card company: it redirects you to an intelligent computerized assistant that asks you questions like press 1 for English, press 2 for Hindi, press 3 for this, press 4 for that. Once you select one, it redirects you to another set of questions, press 1 for this or that, and so on. This keeps repeating until you finally get to the right person. You might think you are caught in voicemail hell, but what the company was actually doing was using a decision tree to get you to the right person. All right. I'd like you to focus on this particular slide for a moment. You can see an image where the task is: should I accept a new job offer or not? To decide that, you create a decision tree starting with the base condition, or root node: the basic, or minimum, salary should be $50,000, and an offer below that is rejected.
So below $50,000 you are not accepting the offer at all. If the salary meets the $50,000 minimum, you then check whether the commute is more than one hour. If it is more than one hour, you just decline the offer. If it is less than one hour, you are getting closer to accepting the job offer. Further, you check whether the company offers free coffee. If it does not, you decline the offer, and if it does, you happily accept. That is just one example of a decision tree.

Now let's move ahead and understand a decision tree in detail. Here is a sample data set that I will use to explain decision trees. In this data set each row is an example. The first two columns provide features, or attributes, that describe the data, and the last column gives the label, or class, we want to predict. If you like, you can modify this data by adding more features and more examples, and our program will work in exactly the same way. This data set is pretty straightforward except for one thing: I hope you have noticed that it is not perfectly separable. The second and fifth examples have the same features but different labels. Both have yellow as the color and 3 as the diameter, but the labels are mango and lemon. Let's see how our decision tree handles this case. To build the tree we'll use a decision tree algorithm called CART, which stands for classification and regression trees. Let's see a preview of how it works. To begin with, we add a root node for the tree. All nodes receive a list of rows as input, and the root will receive the entire
training data set. Now each node asks a true-or-false question about one of the features, and in response to that question we split, or partition, the data set into two subsets. These subsets then become the input to two child nodes that we add to the tree. The goal of each question is to unmix the labels as we proceed down, or in other words to produce the purest possible distribution of labels at each node. For example, the input to this node contains only a single type of label, so we say it is perfectly unmixed. There is no uncertainty about the type of label, as it consists only of grapes. On the other hand, the labels in this other node are still mixed up, so we ask another question to drill down further. But before that, we need to understand which question to ask and when, and to do that we need to quantify how much a question helps to unmix the labels. We can quantify the amount of uncertainty at a single node using a metric called Gini impurity, and we can quantify how much a question reduces that uncertainty using a concept called information gain. We'll use these to select the best question to ask at each point. Then we iterate: we recursively build the tree on each of the new nodes and continue dividing the data until there are no further questions to ask, and finally we reach a leaf. All right, so that was the overview of the decision tree.

To create a decision tree, first you have to identify the different questions you can ask, like "Is this color green?" These questions are decided by your data set: is the color green, is the diameter greater than or equal to 3, is the color yellow. The questions follow from your data set, remember that. So if the question is whether the color is green, it divides the data into two parts.
First, the green mango goes to the true side, while on the false side we have the lemon and the mango. The same kind of split happens whether the question is about the color being green, the diameter being greater than or equal to 3, or the color being yellow.

Now, decision tree terminology. Starting with the root node: the root node is the base node of a tree, and the entire tree starts from it. In other words, it is the first node of the tree. It represents the entire population or sample, which is further segregated into two or more homogeneous sets. Next is the leaf node. A leaf node is the one you reach at the end of the tree, that is, a node you cannot segregate any further. Next is splitting: dividing the root node, or any node, into sub-parts on the basis of some condition. Then comes the branch, or subtree, which is formed when you split the tree. When you split a root node, for instance, it is divided into two branches, or two subtrees. Next is the concept of pruning. You can say that pruning is the opposite of splitting: here we remove sub-nodes of a decision tree. We'll see more about pruning later in this session. Next is parent and child node. First of all, the root node is always a parent node, and all the nodes derived from it are known as child nodes. In general, a node that is derived from another node is a child node, and the node that produces it is its parent node. Simple concept, right?

Let's use the CART algorithm and design a tree manually. First you decide which question to ask and when. How will you do that? Let's first visualize the decision tree.
So this is the decision tree that we'll be creating manually. First of all, let's have a look at the data set. You have outlook, temperature, humidity, and windy as your attributes, and on the basis of those you have to predict whether you can play or not. So which one should you pick first? The answer: determine the attribute that best classifies the training data. So how will you choose the best attribute, or how does a tree decide where to split and which node is its root? Before we split a tree, there are some terms you should know.

First is the Gini index. The Gini index is the measure of impurity (or purity) used in building a decision tree with the CART algorithm. Next is information gain. Information gain is the decrease in entropy after a data set is split on the basis of an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain, so you select the node that gives you the highest information gain. Next is reduction in variance. Reduction in variance is a method used for continuous target variables, that is, regression problems. The split with the lower variance is selected as the criterion to split the population. In general terms, variance is how much your data varies. If your data is more pure, the variation is smaller, since all the data points are almost the same. Next is chi-square, a method used to find the statistical significance of the differences between the sub-nodes and the parent node. Fine, let's move ahead.
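To make the reduction-in-variance criterion a little more concrete, here is a minimal Python sketch. The target values and the two candidate splits are made up purely for illustration, and `weighted_variance` is a hypothetical helper name rather than anything from the session's code.

```python
from statistics import pvariance


def weighted_variance(groups):
    """Size-weighted average of the population variance of each group."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * pvariance(group) for group in groups)


# Made-up continuous target values (illustrative only).
parent = [10, 12, 11, 30, 32, 31]

# Two hypothetical ways of splitting the parent into two child nodes.
split_a = [[10, 12, 11], [30, 32, 31]]  # similar values end up together
split_b = [[10, 30, 11], [12, 32, 31]]  # values stay mixed

for name, split in (("split_a", split_a), ("split_b", split_b)):
    reduction = pvariance(parent) - weighted_variance(split)
    print(name, "weighted variance:", round(weighted_variance(split), 2),
          "reduction:", round(reduction, 2))
```

Under this criterion the split whose children have the lower weighted variance would be preferred, which here is the one that groups similar values together.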
The main question is how you decide the best attribute. For now, just understand that you need to calculate something known as information gain. The attribute with the highest information gain is considered the best. I know your next question might be, what is this information gain? But before we see what exactly information gain is, let me first introduce a term called entropy, because it is used in calculating information gain. Entropy is just a metric that measures the impurity of something, and understanding it is the first step before you solve a decision tree problem. Since I mentioned impurity, let's understand what impurity is. Suppose you have a basket full of apples and a bowl full of labels that all say "apple". If you are asked to pick one item each from the basket and the bowl, the probability of getting an apple together with its correct label is 1, so in this case the impurity is zero. Now what if there are four different fruits in the basket and four different labels in the bowl? Then the probability of matching the fruit to a label is obviously not 1; it's something less than that. I could pick a banana from the basket and then randomly pick a label that says cherry. Any random combination is possible. So in this case the impurity is nonzero. I hope the concept of impurity is clear.

Coming back to entropy: as I said, entropy is the measure of impurity. From the graph on your left you can see that when the probability is 0 or 1, that is, when the sample is completely pure one way or the other, the value of entropy is zero. And when the probability is 0.5, the value of entropy is maximum.
So what is impurity? Impurity is the degree of randomness, how random the data is. If the data is completely pure, whether it is all yes or all no, the randomness, and so the entropy, is zero. A question like "why is the value of entropy maximum at 0.5?" might arise in your mind, so let me derive it mathematically. Suppose S is our total sample space, divided into two parts, yes and no. In our data set, the result for playing is divided into yes and no, which is what we have to predict: either we play or we don't. For that case you can define entropy as follows: the entropy of the total sample space S equals minus the probability of yes times log base 2 of the probability of yes, minus the probability of no times log base 2 of the probability of no. Here S is your total sample space, P(yes) is the probability of yes, and P(no) is the probability of no. If the number of yes equals the number of no, the probability of yes is 0.5, and in that case the entropy is 1; just put the values into the formula. Let me move to the next slide to show you this. Next, if the sample contains all yes or all no, that is, the probability of yes is either 1 or 0, then the entropy equals 0. Let's see these mathematically, one by one. Start with the first condition, where the probability is 0.5. This is our formula for entropy, and this is the first case we discussed: the probability of yes equals the probability of no, that is, our data set has an equal number of yes and no. All right.
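Written out, the two-class entropy formula described in words above looks like the following. The symbols H(S), p_yes and p_no are notation chosen here for readability; the slide itself is not reproduced in the transcript.

```latex
\[
H(S) = -\,p_{\text{yes}} \log_2 p_{\text{yes}} \;-\; p_{\text{no}} \log_2 p_{\text{no}},
\qquad p_{\text{yes}} + p_{\text{no}} = 1
\]
\[
p_{\text{yes}} = p_{\text{no}} = 0.5
\;\Rightarrow\;
H(S) = -0.5\,(-1) - 0.5\,(-1) = 1
\]
```

The second line is the equal-split case, using the fact that the base-2 logarithm of 0.5 is -1.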
So the probability of yes equals the probability of no, and both equal 0.5, where yes plus no together make up the total sample space. Since the probability is 0.5, when you put the values into the formula and calculate, you get the entropy of the total sample space as 1. Now the next case: either you have all yes or all no. Let's take all yes, so every instance is yes and there are zero no. The probability of yes equals 1, and yes is the total sample space. Putting that into the formula, you get the entropy of the sample space as minus 1 times log base 2 of 1, and since log of 1 equals 0, the whole thing comes to 0. The same holds for all no: in that case too the entropy of the total sample space is 0. So that was entropy.

Next, what is information gain? Information gain measures the reduction in entropy, and it decides which attribute should be selected as the decision node. If S is our total collection, then information gain equals the entropy of S, which we calculated just now, minus the weighted average of the entropy of each value of the feature. Don't worry, we'll see how to calculate it with an example. Let's manually build a decision tree for our data set. This data set consists of 14 instances, of which nine are yes and five are no. We have the formula for entropy, so just put the values in: the probability of yes is 9/14 and the probability of no is 5/14. When you calculate the result, you get an entropy of 0.94. All right.
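The three entropy cases just worked through (an even split, an all-yes sample, and the 9 yes / 5 no data set) can be checked with a few lines of Python. This is only a sketch: the function name `entropy` and its count-based signature are choices made here, not the session's code.

```python
from math import log2


def entropy(yes, no):
    """Two-class entropy (base 2) computed from the yes/no counts."""
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:  # a class with zero members contributes nothing
            p = count / total
            result -= p * log2(p)
    return result


print(entropy(7, 7))             # equal numbers of yes and no
print(entropy(14, 0))            # all yes
print(round(entropy(9, 5), 3))   # the 14-row data set from the talk
```

Skipping empty classes is how the sketch handles the all-yes and all-no cases, where the formula would otherwise ask for the logarithm of zero.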
So that was your first step: compute the entropy for the entire data set. Now you have to select which of outlook, temperature, humidity, and windy should be the root node. Big question, right? How do I decide that a particular node should be chosen as the base node, on the basis of which I create the entire tree? Let's see. You have to go one by one and calculate the entropy and information gain for each attribute. Starting with outlook: outlook has three values, sunny, overcast, and rainy. First count how many yes and how many no there are for each. When it is sunny we have two yes and three no. When it is overcast we have all yes, so if it is overcast we will surely go and play. And when it is rainy, the number of yes is 3 and the number of no is 2. Next we calculate the entropy for each value. Here we are calculating the entropy when outlook equals sunny. We are assuming that outlook is our root node and calculating the information gain for it. Remember the formula: the entropy of the total sample space minus the weighted average of the entropy of each value of the feature. So we calculate the entropy of outlook when it is sunny. The number of yes when it is sunny is 2 and the number of no is 3, so the probability of yes is 2/5 and the probability of no is 3/5. Putting these into the formula, you get the entropy of sunny as 0.971.
Next we calculate the entropy for overcast. Remember, when it was overcast it was all yes, so the probability of yes equals 1, and putting that in gives an entropy of 0. When it is rainy, there are three yes and two no, so the probability of yes in the rainy case is 3/5 and the probability of no is 2/5. Putting those probabilities into the formula, you get the entropy of rainy as 0.971 as well. Now you have to calculate how much information you are getting from outlook, which is the weighted average. What is the weight? It is the number of yes and no for that value divided by the total. So where does the 5 in 5/14 come from? It is the number of samples for that particular outlook value: when it is sunny there are two yes and three no, so the weight for sunny is 5/14. As calculated, the entropy for sunny is 0.971, so we multiply 5/14 by 0.971. That was the calculation for outlook equal to sunny, but outlook can also be overcast or rainy, so we do the same for those. For overcast, the weight is 4/14 times its entropy, which is 0. For rainy it is again 5/14, from three yes and two no, times its entropy of 0.971. Finally we take the sum of all of them, which equals 0.693. What we just computed is the information taken from outlook. Next we calculate the information we are gaining from outlook.
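The weighted-average step for outlook can be sketched in Python as below. The counts come from the talk (the four overcast rows are implied by the 4/14 weight); the dictionary layout and the names `entropy`, `outlook` and `info_outlook` are assumptions made for this illustration.

```python
from math import log2


def entropy(yes, no):
    """Two-class entropy (base 2) computed from the yes/no counts."""
    total = yes + no
    return -sum((c / total) * log2(c / total) for c in (yes, no) if c)


# (yes, no) counts for each outlook value, as read out in the talk.
outlook = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}

total_rows = sum(yes + no for yes, no in outlook.values())

# Weight each value's entropy by its share of the rows, then add them up.
info_outlook = sum(
    (yes + no) / total_rows * entropy(yes, no) for yes, no in outlook.values()
)

for value, (yes, no) in outlook.items():
    print(value, "weight:", f"{yes + no}/{total_rows}",
          "entropy:", round(entropy(yes, no), 3))
print("information from outlook:", round(info_outlook, 4))
```

The sum comes out at roughly 0.69, in line with the figure quoted in the talk; the last digit depends on whether intermediate values are rounded.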
This information gain equals the total entropy minus the information taken from outlook. The total entropy was 0.94 and the information from outlook was 0.693, so the information gained from outlook comes to 0.247. Next, let's assume that windy is our root node. Windy has two values, false and true. Let's see how many yes and no there are in each case. When windy is false, there are six yes and two no, and when it is true, there are three yes and three no. So let's similarly calculate the information taken from windy and finally the information gained from windy. First we calculate the entropy of each value, starting with windy equal to true. In the true case we have an equal number of yes and no. Remember the graph: when the probability is 0.5, with the number of yes equal to the number of no, the entropy equals 1. So we can directly write the entropy of true for windy as 1, as we already showed that at probability 0.5 the entropy is at its maximum of 1. Next is the entropy of false for windy. Similarly, put the probabilities of yes and no into the formula and calculate. Since you have six yes and two no, the probability of yes is 6/8 and the probability of no is 2/8, and you get the entropy of false as 0.811. Now let's calculate the information from windy. The total information from windy equals the information taken when windy is true plus the information taken when windy is false.
So we calculate the weighted entropy for each value and sum them to get the total information taken from windy. In this case it equals 8/14 multiplied by 0.811, plus 6/14 multiplied by 1. What is this 8? It is the total number of yes and no when windy equals false: six yes and two no, which sum to 8. That is why the weight is 8/14. Similarly, for windy equal to true we have three yes and three no, so 6 divided by the total of 14, multiplied by 1, the entropy of true. So it is 8/14 times 0.811 plus 6/14 times 1, which comes to 0.892. This is the information taken from windy. How much information are you gaining from windy? The information gained equals the total entropy minus the information taken from windy, that is 0.94 minus 0.892, which equals 0.048. So 0.048 is the information gained from windy. We calculate the rest similarly. For outlook, as you can see, the information was 0.693 and the information gain was 0.247. For temperature, the information was around 0.911 and the information gain was 0.029. For humidity, the information gain was 0.152, and for windy it was 0.048. So we select the attribute with the maximum information gain, which is outlook. We have now selected outlook as our root node, and it is further subdivided into three parts: sunny, overcast, and rainy. For overcast we have seen that it consists of all yes, so we can treat it as a leaf node, but sunny and rainy are doubtful, since each contains both yes and no.
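The information-gain comparison for outlook and windy can be reproduced with a short Python sketch. Only the two attributes whose counts are read out in the talk are included; temperature and humidity are left out because their counts are not given. The helper names `entropy` and `information_gain` are hypothetical.

```python
from math import log2


def entropy(yes, no):
    """Two-class entropy (base 2) computed from the yes/no counts."""
    total = yes + no
    return -sum((c / total) * log2(c / total) for c in (yes, no) if c)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = sum(parent)
    weighted = sum((y + n) / total * entropy(y, n) for y, n in children)
    return entropy(*parent) - weighted


parent = (9, 5)  # 9 yes and 5 no in the full data set

# (yes, no) counts per attribute value, as stated in the talk.
splits = {
    "outlook": [(2, 3), (4, 0), (3, 2)],  # sunny, overcast, rainy
    "windy": [(6, 2), (3, 3)],            # false, true
}

gains = {name: information_gain(parent, kids) for name, kids in splits.items()}
for name, gain in gains.items():
    print(name, round(gain, 3))

best = max(gains, key=gains.get)
print("attribute with the highest gain:", best)
```

With the two attributes included here, outlook has the larger gain, which matches the choice of root node made in the talk.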
Because those branches contain both yes and no, you need to recalculate things for each of those nodes: you again select the attribute with the maximum information gain. This is how your complete tree will look. Let's see when you can play. You can play when the outlook is overcast; in that case you can always play. If the outlook is sunny, you drill down further to check the humidity. If the humidity is normal, you play, and if the humidity is high, you don't. When the outlook is rainy, you further check whether it is windy or not. If the wind is weak, you go and play, but if the wind is strong, you don't. So this is how your entire decision tree looks at the end.

Now comes the concept of pruning. Say the question is simply, what should I do to play? For that you do pruning. Pruning is nothing but cutting down nodes in order to get the optimal solution, so what pruning does is reduce complexity. As you can see on the screen, it now shows only the branches that result in yes, that is, the cases in which you can play.

Before we go to the practical session, a common question might come to mind: are tree-based models better than linear models? You might think, if I can use logistic regression for classification problems and linear regression for regression problems, why is there a need for trees? Many of us have this question, and it's a valid one. As I said earlier, you can use any algorithm. It depends on the type of problem
you're solving. Let's look at some key factors that will help you decide which algorithm to use and when. First, if the relationship between the dependent and independent variables is well approximated by a linear model, then linear regression will outperform a tree-based model. Second, if there is high non-linearity and a complex relationship between the dependent and independent variables, a tree model will outperform a classical regression method. Third, if you need to build a model that is easy to explain to people, a decision tree will usually do better than a linear model, as decision tree models are simpler to interpret than linear regression.

Now let's move on and see how you can write a decision tree classifier from scratch in Python using the CART algorithm. For this I will be using Jupyter Notebook with Python 3 installed. So let's open Anaconda and the Jupyter Notebook. This is our Anaconda Navigator, and I will jump directly to Jupyter Notebook and hit the launch button. I guess everyone knows that Jupyter Notebook is a web-based interactive computing notebook environment where you can run your Python code. My Jupyter notebook opens on my localhost, and I will use it to write my decision tree classifier in Python. I have already written the code for this classifier, so let me explain it piece by piece. We start by initializing our training data set. This is our sample data set, in which each row is an example. The last column is the label and the first two columns are the features. If you want, you can add more features and examples for practice. The interesting fact is that this data set is designed so that the second and fifth examples have exactly the same features but different labels.
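A training set of the kind described above might look like the following in Python. Only the second and fifth rows (yellow, diameter 3, mango and lemon) are spelled out in the talk; the remaining values are assumptions chosen to be consistent with the green mango and the grapes mentioned elsewhere in the session.

```python
# Each row is [color, diameter, label]; the last column is what we predict.
# Rows 2 and 5 follow the talk; the other values are assumed for illustration.
training_data = [
    ["Green", 3, "Mango"],
    ["Yellow", 3, "Mango"],
    ["Red", 1, "Grape"],
    ["Red", 1, "Grape"],
    ["Yellow", 3, "Lemon"],
]

second, fifth = training_data[1], training_data[4]
print("same features:", second[:2] == fifth[:2])
print("same label:", second[-1] == fifth[-1])
```

Because the second and fifth rows share their features but not their label, no set of questions about color and diameter can separate them, which is the point the example is designed to make.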
All right, so let's move on and see how the tree handles this case. As you can see here, the second and the fifth rows have the same features; what differs is just their label. So this is our training data set. Next we add some column labels, which are used only to print the tree. We add a header for the columns: the first column is color, the second is diameter, and the third is the label column. Next we define a function called unique values, to which we pass the rows and a column. This function finds the unique values for a column in the data set. Here is an example: we pass the training data as the rows and 0 as the column number, so we are finding the unique values of color. And here the rows are the training data and the column is 1, so we are finding the unique values of diameter. Next we define a function called class_count and pass the rows to it. It counts the number of each type of example in the data set, or in other words, it counts how often each label value appears. As a sample, you can see that we can pass the entire training data set to class_count, and it finds all the different labels in the training data. Here the unique labels are mango, grape, and lemon. Next we define a function called is_numeric and pass a value to it. It just tests whether the value is numeric, returning true if the value is an integer or a float.
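The three helper functions described above could be sketched as follows. The exact names (`unique_vals`, `class_counts`, `is_numeric`) and the data values other than the second and fifth rows are assumptions; the session's own code is only referenced, not shown, in the transcript.

```python
# Assumed training data: [color, diameter, label] per row.
training_data = [
    ["Green", 3, "Mango"],
    ["Yellow", 3, "Mango"],
    ["Red", 1, "Grape"],
    ["Red", 1, "Grape"],
    ["Yellow", 3, "Lemon"],
]

# Column labels, used only when printing questions or the tree.
header = ["color", "diameter", "label"]


def unique_vals(rows, col):
    """Return the set of distinct values found in one column."""
    return {row[col] for row in rows}


def class_counts(rows):
    """Count how many rows carry each label (the last column)."""
    counts = {}
    for row in rows:
        label = row[-1]
        counts[label] = counts.get(label, 0) + 1
    return counts


def is_numeric(value):
    """True when the value is an int or a float."""
    return isinstance(value, (int, float))


print(unique_vals(training_data, 0))
print(unique_vals(training_data, 1))
print(class_counts(training_data))
print(is_numeric(7), is_numeric("Red"))
```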
For example, if we pass 7 to is_numeric, it is an integer, so it returns true, and if we pass "Red", which is not a numeric value, it returns false. Moving on, we define a class named Question. A question is used to partition the data set. This class just records a column number, for example 0 for color, and a column value, for example green. It also has a match method, which compares the feature value in an example to the feature value stored in the question. Let's see how. First we define an init function, to which we pass self, the column, and the value as parameters. Next we define the match function, which compares the feature value in an example to the feature value in this question. Then we define __repr__, which is just a helper method to print the question in a readable format. Next we define a function called partition. This function is used to partition the data set. For each row in the data set, it checks whether the row matches the question. If it does, it adds the row to the true rows; if not, it adds it to the false rows. For example, as you can see, here we partition the training data based on whether the rows are red or not. We create a question, passing 0 and "Red" to it, so all the red rows are assigned to true_rows and everything else is assigned to false_rows. Next we define a gini impurity function and pass it a list of rows. It just calculates the Gini impurity for that list of rows. Next we define a function for information gain.
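A possible shape for the Question class, the partition function, and the Gini impurity function is sketched below. The talk does not spell out the matching rule or the impurity formula, so two assumptions are made here: numeric features are compared with "greater than or equal" and other features with equality, and Gini impurity is taken as one minus the sum of squared label proportions, which is the standard definition. The data values are likewise assumed.

```python
training_data = [
    ["Green", 3, "Mango"],
    ["Yellow", 3, "Mango"],
    ["Red", 1, "Grape"],
    ["Red", 1, "Grape"],
    ["Yellow", 3, "Lemon"],
]
header = ["color", "diameter", "label"]


def is_numeric(value):
    return isinstance(value, (int, float))


class Question:
    """Records a column number and a value to compare that column against."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, example):
        # Assumed rule: >= for numeric features, == for everything else.
        val = example[self.column]
        if is_numeric(val):
            return val >= self.value
        return val == self.value

    def __repr__(self):
        condition = ">=" if is_numeric(self.value) else "=="
        return f"Is {header[self.column]} {condition} {self.value}?"


def partition(rows, question):
    """Split rows into those that match the question and those that do not."""
    true_rows, false_rows = [], []
    for row in rows:
        (true_rows if question.match(row) else false_rows).append(row)
    return true_rows, false_rows


def gini(rows):
    """Gini impurity: 1 minus the sum of squared label proportions."""
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return 1.0 - sum((n / len(rows)) ** 2 for n in counts.values())


true_rows, false_rows = partition(training_data, Question(0, "Red"))
print(Question(0, "Red"), true_rows, false_rows)
print(round(gini(training_data), 2))
```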
What this information gain function does is calculate the information gain as the uncertainty of the starting node minus the weighted impurity of the child nodes. The next function is find best split. This function finds the best question to ask by iterating over every feature and value and calculating the information gain. For a detailed explanation of the code, you can find the code in the description given below. Next, we define a class, Leaf, for classifying the data. It holds a dictionary that maps each class (like mango) to how many times it appears in the rows from the training data that reach this leaf. Next is the decision node. A decision node asks a question. It holds a reference to the question and to the two child nodes, and on the basis of the question you decide which branch each row goes to. Next, we define a function, build tree, and pass our rows into it. This is the function that builds the tree. Everything we defined so far is what we use to build it. We start by partitioning the data set on each unique attribute, then we calculate the information gain, and then we return the question that produces the highest gain, and on that basis we split the tree. So here we are partitioning the data set, calculating the information gain, and getting back the question that produces the highest gain. Now, if gain equals 0, return Leaf(rows). If the gain is 0, no further question can usefully be asked, so it returns a leaf. Then true_rows, false_rows equals partition with rows and the question.
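To make the information gain step concrete, here is a small self-contained sketch: gain is the impurity of the starting node minus the size-weighted impurity of the two child nodes, and the best split is the feature and value with the highest gain. It is a simplification, not the video’s code: questions are plain (column, value) tuples instead of Question objects, and the rows are the assumed toy data.

```python
from collections import Counter

# ASSUMPTION: toy rows reconstructed from the narration.
rows = [
    ["Green", 3, "Mango"],
    ["Yellow", 3, "Mango"],
    ["Red", 1, "Grape"],
    ["Red", 1, "Grape"],
    ["Yellow", 3, "Lemon"],
]


def gini(rows):
    counts = Counter(row[-1] for row in rows)
    return 1 - sum((n / len(rows)) ** 2 for n in counts.values())


def passes(row, col, val):
    """Numeric values use a >= test, everything else an equality test."""
    if isinstance(val, (int, float)):
        return row[col] >= val
    return row[col] == val


def split(rows, col, val):
    true_rows = [r for r in rows if passes(r, col, val)]
    false_rows = [r for r in rows if not passes(r, col, val)]
    return true_rows, false_rows


def info_gain(true_rows, false_rows, current_uncertainty):
    """Uncertainty of the parent minus the weighted impurity of the children."""
    p = len(true_rows) / (len(true_rows) + len(false_rows))
    return current_uncertainty - p * gini(true_rows) - (1 - p) * gini(false_rows)


def find_best_split(rows):
    """Try every (feature, value) pair and keep the one with the highest gain."""
    best_gain, best_question = 0.0, None
    current = gini(rows)
    for col in range(len(rows[0]) - 1):           # last column is the label
        for val in sorted({r[col] for r in rows}):
            true_rows, false_rows = split(rows, col, val)
            if not true_rows or not false_rows:   # the split separates nothing
                continue
            gain = info_gain(true_rows, false_rows, current)
            if gain > best_gain:
                best_gain, best_question = gain, (col, val)
    return best_gain, best_question


best_gain, best_question = find_best_split(rows)
print(best_gain, best_question)
```

On this toy data two questions separate the grapes equally well (color equal to red, and diameter at least 3), so which of the two is reported depends only on the tie-breaking rule.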
If we reach this position, we have already found a feature and value with which to partition the data set. Then we recursively build the true branch and, similarly, recursively build the false branch. Then we return Decision_Node, passing in the question, the true branch and the false branch. So it returns a question node, which records the best feature and value to ask at this point. Now that we have built our tree, next we define a print_tree function, which is used to print the tree. Next is the classify function, which decides whether to follow the true branch or the false branch by comparing the feature value stored in the node to the example we are considering. And last, we print the prediction at the leaf. So let’s execute it and see. We printed the leaf as well. Now that we have trained our algorithm with our training data set, it’s time to test it. So here’s our testing data set. Let’s finally execute it and see the result. This is the result you will get. The first question asked by the algorithm is: is diameter greater than or equal to 3? If it is true, then it further asks whether the color is yellow. If that is true again, it predicts mango with a count of 1 and lemon with a count of 1. If it is false, it just predicts mango. That was the true part. Now, if the diameter is not greater than or equal to 3, then it’s false, and it just predicts grape. So this was all about the coding part. Now let’s conclude this session. But before concluding, let me just show you one more thing.
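The printed tree from the coding part can be written out by hand to show how classification walks it. This is only a sketch: the tree is typed in from the result described in the session rather than learned, and the tuple and dict representation and the print_leaf helper are assumptions made for brevity.

```python
# A leaf is a dict of label counts; a decision node is a tuple
# (column, value, true_branch, false_branch).
# ASSUMPTION: hand-written from the narrated output, not built by code here.
tree = (
    1, 3,                              # Is diameter >= 3?
    (0, "Yellow",                      # Is color == Yellow?
        {"Mango": 1, "Lemon": 1},      # true branch
        {"Mango": 1}),                 # false branch
    {"Grape": 2},                      # diameter below 3
)


def classify(row, node):
    """Follow the true or false branch at each node until a leaf is reached."""
    if isinstance(node, dict):
        return node
    col, val, true_branch, false_branch = node
    if isinstance(val, (int, float)):
        go_true = row[col] >= val
    else:
        go_true = row[col] == val
    return classify(row, true_branch if go_true else false_branch)


def print_leaf(counts):
    """Turn the label counts at a leaf into whole-number percentages."""
    total = sum(counts.values())
    return {label: f"{100 * n // total}%" for label, n in counts.items()}


for example in (["Yellow", 3], ["Green", 3], ["Red", 1]):
    print(example, print_leaf(classify(example, tree)))
```

A yellow fruit of diameter 3 ends at the leaf that holds one mango and one lemon, which is why the prediction there is split between the two labels.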
Now, there’s a scikit-learn algorithm cheat sheet, which explains which algorithm you should use and when. It’s built in a decision tree format, so let’s see how it is built. The first condition it checks is whether you have more than 50 samples. If you have more than 50 samples, you move ahead; if you have fewer than 50, you need to collect more data. If you have more than 50 samples, you then decide whether you want to predict a category or not. If you want to predict a category, you next check whether you have labeled data. If you have labeled data, it is a classification problem. If you don’t have labeled data, it is a clustering problem. Now, if you don’t want to predict a category, do you want to predict a quantity? If you want to predict a quantity, it is a regression problem. If you don’t want to predict a quantity and you are just looking, then you should go for dimensionality reduction. And if you are not just looking and predicting structure is not working either, then it’s tough luck. I hope this decision tree session clarifies all your doubts about the decision tree algorithm. Let’s begin this tutorial by looking at the topics that we’ll be covering today. First of all, we’ll start by getting a brief introduction to random forest, and then we’ll go on to see why we actually need random forest. Why random forest and not anything else? Once we understand the need for it, we’ll go on to learn more about what random forest is, and we’ll also look at various examples of random forest so that we get a very clear understanding of it.
Further, we’ll also delve inside to understand how exactly random forest works. We’ll also walk through the random forest algorithm step by step, so that you are able to write any piece of code, any domain-specific algorithm, on your own. Now, I personally believe that any learning is incomplete if it’s not put into application, so to complete it we’ll also implement random forest in R with a very simple use case, which is diabetes prevention. So let’s get started with the introduction. Now, random forest is one of the classifiers used for solving classification problems. Since some of you might not be aware of what classification is, let’s quickly understand classification first, and then we’ll try to relate it to random forest. Basically, classification is a machine learning technique in which you already have predefined categories under which you can classify your data. It’s nothing but a supervised learning model, where you already have data on which you can train your machine. Your machine learns from this data, so all that predefined data you already have works as fuel for your machine. Let’s take an example. Ever wondered how Gmail gets to know about spam emails and filters them out from the rest of the genuine emails? Any guesses? All right, I’ll give you a hint. Try to think along these lines: what would it actually look for? What can be the possible parameters based on which you can decide, all right, this is a genuine email, or this is a spam email?
There are certain parameters that your classifier will look for, like the subject line, the text, the HTML tags, and also the IP address of the source from which the mail is coming. It analyzes all these variables and then classifies the mail into the spam folder or the genuine folder. For example, if your subject line contains words like mad or cute or pretty and some other absurd keywords, your classifier is smart enough, and trained in such a manner, that it gets to know, all right, this is a spam email, and it automatically filters it out from your genuine emails. That is basically how a classifier works. So that’s pretty much it about classification. Now let’s move forward and see what all ways there are to perform classification. We have three classifiers, namely decision tree, random forest and Naive Bayes. Speaking briefly about the decision tree first: a decision tree splits your entire data set into the structure of a tree, and it makes a decision at every node, hence the name decision tree. So no big bang theory, right? You have a certain data set, and there are certain nodes. At each node it further splits into child nodes, and at each node it makes a decision. The final decision will be in the form of positive or negative. For example, say you want to purchase a car. What will the parameters be? Let’s say I want to purchase a car, and I keep certain parameters in mind: what exactly is my income, what is my budget, what is the particular brand that I want to go for, what is the mileage of the car, what is the cylinder capacity of the car, and so on and so forth. I’ll make my decision based on all these parameters, and that is how you make decisions.
If you really want to know more about how exactly a decision tree works, you can also check out our decision tree tutorial. Now let’s move on to random forest. Random forest is an ensemble classifier. Let’s understand what this word ensemble means. Ensemble methods use multiple machine learning algorithms to obtain better predictive performance. Talking about random forest in particular, random forest uses multiple decision trees for prediction. So you are ensembling a lot of decision trees to come up with your final outcome. As you can see in the image, your entire data set is further split into three subsets, and each subset leads to a particular decision tree. So here you have three decision trees, and each decision tree leads to a certain outcome. What random forest does is compile the results from all the decision trees, and that leads to a final outcome. So it’s a compilation of all the multiple decision trees. That’s all about random forest. Now let’s see what lies in Naive Bayes. Naive Bayes is a very famous classifier, which is built on a very famous rule called Bayes’ theorem. You might have studied Bayes’ theorem in your tenth standard as well. Let’s just see what Bayes’ theorem describes. Bayes’ theorem describes the probability of an event based on certain prior knowledge of conditions that might be related to the event. For example, if cancer is related to age, then a person’s age can be used to assess the probability of having cancer more accurately than without knowledge of the age. So if you know the age, it becomes handy in predicting the occurrence of cancer for a particular person. So the outcome of the first event here is affecting your final outcome, isn’t it? Yeah.
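Written as a formula, Bayes’ theorem for the age example reads as follows. The event names are just labels chosen here for illustration.

```latex
P(\text{cancer} \mid \text{age}) =
  \frac{P(\text{age} \mid \text{cancer}) \, P(\text{cancer})}{P(\text{age})}
```

Here P(cancer) is what you believe before knowing the age, and P(cancer | age) is the updated probability once the age is known.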
So this is how the Naive Bayes classifier works. That was an overview of the Naive Bayes classifier, and that was pretty much it about the types of classifiers. Now we’ll try to find the answer to this particular question: why do we need random forest? Human beings learn from past experiences. Unlike human beings, a computer does not have experiences, so how does a machine take decisions? Where does it learn from? Well, a computer system learns from data, which represents some past experiences of an application domain. So now let’s see how random forest helps in building a learning model, with a very simple use case of credit risk detection. Needless to say, credit card companies have a vested interest in identifying financial transactions that are illegitimate and criminal in nature. I would also like to mention that, according to the Federal Reserve Payments Study, Americans used credit cards to pay for 26.2 billion purchases in 2012, and the estimated loss due to unauthorized transactions that year was US$6.1 billion. In the banking industry, measuring risk is critical because the stakes are so high. So the overall goal is to figure out who may be fraudulent before too much financial damage has been done. For this, a credit card company receives thousands of applications for new cards, and each application contains information about an applicant. As you can see here, from all those applications we can figure out the predictor variables: what is the marital status of the person, what is the gender of the person, the age of the person, and the status, which is whether the person is a default payer or a non-default payer. Default payments are basically when payments are not made in time and according to the agreement signed by the cardholder.
That account is then said to be in default. So you can easily figure out the history of the particular cardholder from this. Then we can also look at the time of payment, whether he has been a regular payer or a non-regular one, what the source of income is for that particular person, and so on and so forth. So, to minimize loss, the bank needs a certain decision rule to predict whether or not to approve the loan of that particular person. Now, here is where random forest comes into the picture. Let’s see how random forest can help us in this particular scenario. We have randomly taken two predictor variables out of all the predictor variables that we saw previously. The first one is the income and the second one is the age, and in parallel two decision trees have been implemented on those predictor variables. Let’s first take the case of the income variable. Here we have divided income into three categories: the first being a person earning over 35,000 dollars, the second from 15,000 to 35,000 dollars, and the third earning in the range of 0 to 15,000 dollars. Now, if a person is earning over $35,000, which is a pretty decent income, we check the credit history. The probability here is that, if a person is earning a good amount, there is very low risk that he won’t be able to pay back, since he is already earning well. So the probability is that his loan application will get approved. There is low risk or moderate risk, but no real issue of high risk as such, so we can approve the applicant’s request here. Now let’s move on to the second category, where the person is earning from 15,000 to 35,000 dollars. Here the person may or may not pay back. In such scenarios we look at the credit history.
What has his previous history been? If his previous history has been bad, say he has been a defaulter in his previous transactions, we will definitely not consider approving his request, and he will be at high risk, which is not good for the bank. If the previous history of that particular applicant is really good, then, just to clarify our doubt, we will consider another parameter as well, which is debt. If he is already in really high debt, then the risk again increases and there are chances that he might not repay in the future. So here we will not accept the request of a person having high debt. If the person is in low debt and has been a good payer in his past history, then there are chances that he will pay back, and we can consider approving the request of this particular applicant. Now let’s look at the third category, which is a person earning from 0 to 15,000 dollars. This is something that actually raises an eyebrow, and this person will lie in the category of high risk. So the probability is that his loan application will get rejected. So we get one final outcome from this income parameter. Now let us look at our second variable, age, which leads to the second decision tree. Let us say the person is young. Now we look at whether he is a student. If he is a student, the chances are high that he won’t be able to repay, because he has no earning source. So here the risk is too high, and the probability is that his loan application will get rejected. Now, if the person is young and not a student, then we go on and look at another variable, which is bank balance. Let’s see if the bank balance is less than 5 lakhs. Then again the risk arises, and the probability is that his loan application will get rejected.
Now, if the person is young, is not a student, and has a bank balance greater than 5 lakhs, he has a pretty good and stable bank balance, so the probability is that his loan application will get approved. Now let us take another scenario: he’s a senior. If he is a senior, we go and check his credit history. How well has he done in his previous transactions? What kind of a person is he, a defaulter or a non-defaulter? If his record in previous transactions has been only fair, then again the risk arises and the probability of his application getting rejected increases. If he has had an excellent record in his previous transactions, then here there is the least risk, and the probability is that his loan application will get approved. So here these two variables, income and age, have led to two different decision trees, and these two different decision trees have led to two different results. What random forest does is compile these two different results from these two different decision trees, and that finally leads to a final outcome. That is how random forest actually works, and that is the motive of random forest. Now let us move forward and see what random forest is. You can get an idea of the mechanism from the name itself: random forests. A collection of trees is a forest, which is probably why it is called a forest, and here the trees are being trained on subsets that are selected at random; therefore they are called random forests. So a random forest is a collection, or an ensemble, of decision trees.
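The two credit-risk trees from the loan example, one starting from income and one starting from age, can be written as two small functions. This is a sketch under several assumptions that the narration leaves open: which side the exact boundary values (15,000 dollars, 5 lakhs) fall on, the argument names, and above all the tie rule in combine, which treats a one-against-one vote as a rejection.

```python
def income_tree(income, credit_history, debt):
    """Tree 1: split on income first (amounts in dollars)."""
    if income > 35_000:
        return "approve"                    # low or moderate risk
    if income >= 15_000:
        if credit_history == "bad":
            return "reject"                 # past defaulter: high risk
        return "approve" if debt == "low" else "reject"
    return "reject"                         # 0 to 15,000: high risk


def age_tree(age_group, is_student, bank_balance_lakhs, credit_history):
    """Tree 2: split on age first."""
    if age_group == "young":
        if is_student:
            return "reject"                 # no source of income
        return "approve" if bank_balance_lakhs > 5 else "reject"
    # Senior applicant: only an excellent history is approved.
    return "approve" if credit_history == "excellent" else "reject"


def combine(votes):
    """ASSUMPTION: approve only on a strict majority; ties count as reject."""
    approvals = votes.count("approve")
    return "approve" if approvals * 2 > len(votes) else "reject"


votes = [
    income_tree(income=28_000, credit_history="good", debt="low"),
    age_tree(age_group="young", is_student=False,
             bank_balance_lakhs=7, credit_history="good"),
]
print(votes, combine(votes))
```

With only two trees a tie is possible, which is one reason a real forest usually grows many more trees than this.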
A single decision tree is normally built using the whole data set, considering all features, but in random forest only a fraction of the rows is selected, and that too at random, and a particular number of features, also selected at random, are trained upon; that is how the decision trees are built. Similarly, a number of decision trees are grown, each decision tree results in a certain final outcome, and random forest does nothing but compile the results of all those decision trees to bring up the final result. As you can see in this particular figure, a particular instance has been run through three different decision trees. Here tree one results in a final outcome called class A, and tree two results in class B. Similarly, tree three results in class B. Random forest compiles the results of all these decision trees and goes by the rule of majority voting. Since here two decision trees, tree two and tree three, have voted in favor of class B, the final outcome will be in favor of class B. And that is how random forest works. Now, one really beautiful thing about this algorithm is that it is one of the versatile algorithms, capable of performing both regression and classification. Let’s try to understand random forest further with a very beautiful example; this is my favorite one. Let’s say you want to decide whether you want to watch Edge of Tomorrow or not. In this scenario you have two different courses of action. Either you can go straight to your best friend and ask him: all right, should I go for Edge of Tomorrow? Will I like this movie? Or you can ask a bunch of your friends.
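The majority voting in the three-tree figure (class A, class B, class B) comes down to counting votes. A minimal sketch, with the function name chosen here for illustration:

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


tree_outputs = ["A", "B", "B"]   # tree one, tree two, tree three
final_class = majority_vote(tree_outputs)
print(final_class)
```

Two of the three trees vote for class B, so class B is the forest’s answer.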
With a bunch of friends, you take their opinions into consideration and then, based on the final results, you can go out and watch Edge of Tomorrow. Let’s take the first scenario, where you go to your best friend and ask whether you should go out to watch Edge of Tomorrow or not. Your friend will probably ask you certain questions, the first one being about the genre. Let’s say your friend asks whether you really like adventure movies or not. You say, yes, definitely, I would love to watch an adventure movie. So the probability is that you will like Edge of Tomorrow as well, since Edge of Tomorrow is also a movie of the adventure and sci-fi genre. Now let’s say you do not like adventure movies. Then the probability reduces, and you might not like Edge of Tomorrow. So from here you can come to a certain conclusion. Let’s say your best friend puts you into another situation, where he asks, do you like Emily Blunt? You say, definitely, I like Emily Blunt. Then he puts another question to you: do you like Emily Blunt in the main lead? You say yes, so again the probability rises that you will like Edge of Tomorrow as well, because Edge of Tomorrow has Emily Blunt in the main lead cast. And if you say, oh, I do not like Emily Blunt, then again the probability reduces that you would like Edge of Tomorrow. So this is one way, where you have one decision tree, and your final decision is based on that one decision tree; or, you can say, your final outcome is based on just one friend. Now, you are definitely not really convinced. You want to consider the opinions of your other friends also, so that you can make a very precise and crisp decision. So you go out and approach a bunch of your other friends.
So now let’s say you go to three of your friends and ask them the same question: would I like to watch Edge of Tomorrow or not? You approach three friends: friend one, friend two and friend three. You consider each of their opinions, and your decision now depends on the compiled results of all three friends. Let’s say you go to your first friend and ask him whether you would like to watch Edge of Tomorrow or not, and your first friend puts one question to you: did you like Top Gun? You say, yes, definitely, I did like the movie Top Gun. Then the probability is that you would like Edge of Tomorrow as well, because Top Gun is a military action drama that also stars Tom Cruise. So again the probability rises that, yes, you will like Edge of Tomorrow as well. If you say, no, I didn’t like Top Gun, then the chances are that you wouldn’t like Edge of Tomorrow. Then another question he puts to you is: do you really like to watch action movies? You say, yes, I would love to watch them. Then again, the chances are that you would like to watch Edge of Tomorrow. So from friend one you can come to one conclusion. Here, since the ratio of like to don’t like is 2 is to 1, the final result is that you would like Edge of Tomorrow. Now you go to your second friend and ask the same question. Your second friend asks, did you like Far and Away, when we went out and watched it last time? You say, no, I really didn’t like Far and Away. Then he would say, then you are definitely going to like Edge of Tomorrow. Why so? Since most of you might not know it, Far and Away is of the romance genre, and it revolves around a girl and a guy falling in love with each other, and so on.
So, since you didn’t like it, the probability is that you would like Edge of Tomorrow. Then he asks you another question: did you like Oblivion, and do you really like to watch Tom Cruise? You say yes. Again, the probability is that you would like to watch Edge of Tomorrow. Why? Because Oblivion is again a science fiction film starring Tom Cruise, full of strange experiences, where Tom Cruise is the savior of mankind. Well, that is the same kind of plot as in Edge of Tomorrow. So here it is a pure yes that you would like to watch Edge of Tomorrow. So you get a second decision from your second friend. Now you go to your third friend and ask him. Your third friend is probably not really interested in having any sort of conversation with you, so he just simply asks, did you like Godzilla? You say, no, I didn’t like Godzilla. So he says, definitely you wouldn’t like Edge of Tomorrow. Why so? Because Godzilla is also a fiction movie from the adventure genre. So now you have got three results from three different decision trees, from three different friends. You compile the results of all those friends and then make a final call on whether you would like to watch Edge of Tomorrow or not. So this is a very real-life and very interesting example of how you can put random forest into practice. Now let us look at various domains where random forest is used. Because of its versatility, random forest is used in various diverse domains: be it banking, be it medicine, be it land use, be it marketing, you name it, and random forest is there. In banking in particular, random forest is used to make out whether an applicant will be a default payer or a non-default one, so that the bank can accordingly approve or reject loan applications. That is how random forest is used in banking. Talking about medicine, random
forest is widely used in the medical field to predict beforehand the probability that a person will have a particular disease. So it’s used to look at various disease trends. Let’s say you want to figure out the probability that a person will have diabetes. What would you do? You would probably look at the medical history of the patient and see: all right, what has the glucose concentration been? What was the BMI? What were the insulin levels in the patient over the previous three months? What is the age of this particular person? You make different decision trees based on each one of these predictor variables, and then you finally compile the results of all those trees and make a final decision as to whether the person will have diabetes in the near future or not. That is how random forest is used in the medical sector. Moving on, random forest is also used to decide land use. For example, I want to set up a particular industry in a certain area. So what would I probably look for? What is the vegetation over there? What is the urban population over there? How far is it from the nearest modes of transport, like the bus station or the railway station? Accordingly, I split on my parameters and make a decision on each one of them, and finally I compile my decisions on all these parameters, and that is my final outcome. That is how I finally predict whether I should put my industry at this particular location or not. So these three examples have mainly been classification problems, because we are trying to answer a whether-or-not question. Now let’s move forward and look at how marketing makes use of random forest.
In marketing in particular, we try to identify customer churn. This is a regression kind of problem. Let’s see how. Customer churn is nothing but the number of customers you are losing, the ones who are going out of your market. You want to identify what your customer churn will be in the near future. Most e-commerce companies, like Amazon, Flipkart and so on, are using this. They look at each of your behaviors: what your past history has been, what your purchasing history has been, what you like, based on your activity around certain things, around certain ads, around certain discounts and around certain kinds of products. If you like a particular top, your activity will be more around that particular top. That is how they track each and every move of yours, and then they try to predict whether you will be moving out or not. That is how they identify customer churn. So these are all various domains where random forest is used, and this is not the complete list; there are numerous other examples using random forest, which is what makes it so special. Now let’s move forward and see how random forest actually works. Let us start with the random forest algorithm and see, step by step, how it works. The first step is to select m features from T, where m is less than T. Here T is the total number of predictor variables that you have in your data set, and out of those total predictor variables you randomly select a few features. Now, why are we selecting only a few features?
The reason is that, if you select all the predictor variables, each of your decision trees will be the same, so the model is not learning anything new. It is learning the same thing again, because all those decision trees will be similar. If instead you randomly select a few predictor variables, say there are 14 variables in total and out of those you randomly pick just three, then every time you get a new decision tree, so there is variety. The classification model will be much more intelligent than the previous one, because it now has varied experiences. It makes different decisions each time, and when you compile all those different decisions, you get a more accurate and efficient result. So the first important step is to select a certain number of features out of all the features. Now let’s move on to the second step. For any node d, the step is to calculate the best split at that point. You know how a decision tree is implemented: you pick the most significant variable and then split that particular node further into child nodes. That is how the split takes place. You do this using the m variables that you’ve selected. Let’s say you have selected three; then you consider those three variables for the splits in one particular decision tree. The third step is to split the node into two daughter nodes. You can split your root node into as many nodes as you want, but here we’ll split our node into two daughter nodes, this or that.
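The first step, picking m features at random out of the T available ones, can be sketched in a few lines. The feature names x1 to x14 are hypothetical placeholders, and the fixed seed is there only so that the run is repeatable.

```python
import random


def pick_features(all_features, m, rng):
    """Randomly choose m distinct features out of T, with m < T."""
    if not 0 < m < len(all_features):
        raise ValueError("m must satisfy 0 < m < T")
    return rng.sample(all_features, m)


rng = random.Random(0)                        # seeded for repeatability
features = [f"x{i}" for i in range(1, 15)]    # T = 14 hypothetical names

# Each tree gets its own random subset of 3 features.
chosen_per_tree = [pick_features(features, 3, rng) for _ in range(5)]
for tree_id, chosen in enumerate(chosen_per_tree, start=1):
    print(tree_id, chosen)
```

Because each tree sees a different handful of features, the trees end up asking different questions, which is where the variety in the forest comes from.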
The answer is in terms of this or that. The fourth step is to repeat these three steps that we've done previously, and we'll repeat all this splitting until we have reached the n number of nodes. So we need to repeat until we have reached the leaf nodes of a decision tree. After these four steps, we will have our one decision tree. But random forest is about multiple decision trees. So here our fifth step comes into the picture, which is to repeat all these previous steps D number of times. Here D is the number of decision trees. Let's say I want to implement five decision trees, so my step here will be to implement all the previous steps five times. So the iteration is for five times. Once I have created these five decision trees, my task is still not complete. My final task is to compile the results of all these five different decision trees and make a call on the majority voting. As you can see in this picture, I had n different instances, then I created n different decision trees, and finally I compile the results of all these n different decision trees and take my call on the majority voting. Whatever my majority vote says will be my final result. So this is basically an overview of how the random forest algorithm works. Let's have a look at an example to get a much better understanding of what we have learned. Let's say I have this data set, which consists of fourteen different instances. It consists of the weather information of the previous 14 days, from D1 till D14. Outlook, humidity and wind give me the weather condition of those 14 days. And finally I have play, which is my target variable: whether the match did take place on that particular day or not.
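Before the worked example, the steps described above (random feature subsets, one tree per random subset of the data, repeat D times, then a majority vote) can be sketched in plain Python. This is only a rough illustration under stated assumptions, not the code used in the session: each "tree" is cut down to a single split, the data subsets are drawn with replacement, and the names `train_stump`, `train_forest` and `predict` as well as the toy rows are all made up for this sketch.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature_ids):
    """Train a one-split 'tree': pick the feature (among feature_ids)
    whose values best separate the labels on this sample."""
    default = Counter(labels).most_common(1)[0][0]
    best_correct, best_feature, best_rule = -1, None, None
    for f in feature_ids:
        # For every value of feature f, remember the most common label.
        counts = {}
        for row, label in zip(rows, labels):
            counts.setdefault(row[f], Counter())[label] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(rule[row[f]] == label for row, label in zip(rows, labels))
        if correct > best_correct:
            best_correct, best_feature, best_rule = correct, f, rule
    # Unseen feature values fall back to the sample's most common label.
    return lambda row: best_rule.get(row[best_feature], default)

def train_forest(rows, labels, n_trees, m, seed=0):
    """Build n_trees stumps, each on a random data subset and m random features."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):  # step 5: repeat D times
        # Random, possibly overlapping subset of the instances (assumption: with replacement).
        sample = [rng.randrange(len(rows)) for _ in rows]
        # Step 1: select m of the T features at random.
        features = rng.sample(range(n_features), m)
        forest.append(train_stump([rows[i] for i in sample],
                                  [labels[i] for i in sample], features))
    return forest

def predict(forest, row):
    """Final step: every tree votes and the majority wins."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]

# Made-up toy rows in the order (outlook, humidity, wind).
rows = [("overcast", "high", "weak"), ("overcast", "normal", "strong"),
        ("rainy", "high", "strong"), ("sunny", "high", "strong"),
        ("overcast", "high", "strong"), ("rainy", "normal", "strong")]
labels = ["play", "play", "no play", "no play", "play", "no play"]

forest = train_forest(rows, labels, n_trees=25, m=2)
print(predict(forest, ("overcast", "normal", "weak")))
```

A real random forest grows full decision trees rather than single splits; the stump is used here only to keep the voting logic visible.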
My main goal is to find out whether the match will take place if I have the following weather conditions on any particular day. Let's say the outlook is rainy that day, the humidity is high and the wind is very weak. I need to predict whether I will be able to play the match that day or not. So this is the problem statement. Now, let's see how random forest is used to sort it out. Here the first step is to split my entire data set into subsets. I have split my 14 instances into smaller subsets. These subsets may or may not overlap; for instance, there is an overlap between D1 till D3 and D3 till D6, namely D3. So you need not worry about the overlapping, but you have to make sure that all those subsets are different. Here I have taken three different subsets: my first subset consists of D1 till D3, my second subset consists of D3 till D6, and my third subset consists of D7 till D9. I will first focus on my first subset. Let's say that on a particular day the outlook was overcast. If yes, it was overcast, then the match will take place. Overcast is basically when the weather is cloudy, and under that condition the match definitely takes place. If it wasn't overcast, then you consider the second most probable option, which is the wind, and make a decision based on whether the wind was weak or strong. If the wind was weak, you will go out and play the match, else you would not. The final outcome of this decision tree is play, because here the ratio between play and no play is 2 is to 1. So we get a certain decision from the first decision tree.
Now, let us look at the second subset. Since the second subset has a different set of instances, this decision tree is absolutely different from what we saw for our first subset. Let's say if it was overcast, then you will play the match. If it isn't overcast, then you go and look at the humidity, which gets split further into two: whether it was high or normal. We'll take the first case. If the humidity was high and the wind was weak, then you will play the match; else, if the humidity was high but the wind was too strong, you would not go out and play the match. Now let us look at the second daughter node of humidity. If the humidity was normal and the wind was weak, then you will definitely go out and play the match, else you won't go out and play the match. Here, if you look at the final result, the ratio of play to no play is 3 is to 2, so again the final outcome is play. So from the second subset we get the final decision of play. Now let us look at our third subset, which consists of D7 till D9. Here again, if overcast is yes, then you will play the match, else you will go and check the humidity. If the humidity is really high, then you won't play the match, else you will play the match. Again the outcome is play, because the ratio of play to no play is 2 is to 1. So: three different subsets, three different decision trees, three different outcomes, and one final outcome after compiling all the results from these three decision trees. I hope this gives a better perspective, a better understanding, of how random forest really works. All right. Now let's have a look at various features of random forest. The first and foremost feature is that it is one of the most accurate learning algorithms.
So why is it so? Because single decision trees are prone to having high variance or high bias, and on the contrary, random forest averages the variance across the decision trees. Let's say the variance is X for one decision tree, but for random forest we have implemented n decision trees in parallel. My entire variance gets averaged over n, and my final variance becomes X upon n (strictly, this holds when the trees are independent of each other). That is how the overall variance goes down as compared to other algorithms. The second most important feature is that it works well for both classification and regression problems. By far, of what I have come across, this is the one algorithm which works equally well for both of them, be it a classification kind of problem or a regression kind of problem. Third, it runs efficiently on large databases. So basically it's really scalable, whether you work with a small amount of data or with a really huge volume of data. That's a very good part about it. The fourth most important point is that it requires almost no input preparation. Why am I saying this? Because it has got certain implicit methods which take care of the outliers and the missing data, and you really don't have to take care of all that while you are in the stage of input preparation. Random forest is here to take care of everything else. Next, it performs implicit feature selection. While we are implementing multiple decision trees, it has got an implicit method which will automatically pick up some random features out of all your parameters, and then it will go on implementing different decision trees.
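The variance argument made above (X for a single tree, X upon n once n trees are averaged) can be checked with a small simulation. This is only a sketch under a strong assumption: each "tree" is modelled as an independent noisy estimate with variance 1, which real trees trained on overlapping data are not. The function name `variance_of_average` is made up for this illustration.

```python
import random
import statistics

def variance_of_average(n_trees, trials=5000, seed=0):
    """Variance of the mean of n_trees independent noisy estimates,
    each with variance 1 (standing in for a single tree's variance X)."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        estimates = [rng.gauss(0.0, 1.0) for _ in range(n_trees)]
        averages.append(sum(estimates) / n_trees)
    return statistics.pvariance(averages)

single = variance_of_average(1)     # close to X = 1
averaged = variance_of_average(10)  # close to X / n = 0.1
print(f"one tree: {single:.3f}, ten trees averaged: {averaged:.3f}")
```

When the trees are correlated, the reduction is smaller than X upon n, which is one reason each tree is given a different random subset of features.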
So for example, if you just give one simple command that, all right, I want to implement 500 decision trees, no matter how, random forest will automatically take care of it and implement all those 500 decision trees. All those 500 decision trees will be different from each other, because it has got implicit methods which automatically collect different parameters by itself out of all the variables that you have. Then, it can easily be grown in parallel. Why is it so? Because we are implementing multiple decision trees, and all those decision trees are getting implemented in parallel. If you say, I want a thousand trees to be implemented, all those thousand trees are getting implemented in parallel. That is how the computation time reduces. And the last point is that it has got methods for balancing error in unbalanced data sets. Now what exactly are unbalanced data sets? Let me give you an example. Let's say you're working on a data set, you create a random forest model and get 90% accuracy immediately. Fantastic, you think. Then you start diving a little deeper and you discover that ninety percent of that data belongs to just one class. Then your entire decision is biased towards just one particular class. Random forest takes care of this, and it is not biased towards any particular decision tree, any particular variable or any class. It has got methods which look after it and balance the errors in your data sets. So that's pretty much about the features of random forests.

K-nearest neighbor is a simple algorithm which uses the entire data set in its training phase whenever a prediction is required for unseen data.
What it does is search through the entire training data set for the k most similar instances, and the data with the most similar instance is finally returned as the prediction. So hello and welcome all to this YouTube session. In today's session we'll be dealing with the KNN algorithm. Without delaying any further, let's move on and discuss the agenda for today's session. We'll start our session with what is KNN, where I'll brief you about the topic, and we'll move ahead to see its popular use cases, or how the industry is using KNN for its benefit. Once we are done with that, we will drill down to the working of the algorithm, and while learning the algorithm you will also understand the significance of K, or what this K stands for in the k-nearest neighbor algorithm. Then we'll see how a prediction is made using the KNN algorithm manually, or mathematically. Once we are done with the theoretical concepts, we'll start the practical or demo session, where we'll learn how to implement the KNN algorithm using Python. So let's start our session. Starting with what is the KNN algorithm: well, k-nearest neighbor is a simple algorithm that stores all the available cases and classifies new data or cases based on a similarity measure. It suggests that if you are similar to your neighbors, then you are one of them. For example, if an apple looks more similar to a banana, an orange or a melon than to a monkey, a rat or a cat, then most likely the apple belongs to the group of fruits. In general, KNN is used in search applications where you are looking for similar items, that is, when your task is some form of "find items similar to this one". You call this search a KNN search. But what is this K in KNN? Well, this K denotes the number of nearest neighbors which are voting on the class of the new data, or the testing data.
For example, if k equals 1, then the testing data is given the same label as the closest example in the training set. Similarly, if k equals 3, the labels of the three closest examples are checked and the most common label is assigned to the testing data. So this is what the K in the KNN algorithm means. Moving on ahead, let's see some example scenarios where KNN is used in the industry. Let's see the industrial applications of the KNN algorithm, starting with recommender systems. The biggest use case of KNN search is a recommender system. A recommender system is like an automated form of a shop counter guy: when you ask him for a product, he not only shows you the product but also suggests or displays a relevant set of products which are related to the item you're already interested in buying. The KNN algorithm applies to recommending products, like on Amazon, or recommending media, like in the case of Netflix, or even recommending advertisements to display to a user. If I'm not wrong, almost all of you must have used Amazon for shopping, right? Just to tell you, more than 35% of Amazon's revenue is generated by its recommendation engine. So what's the strategy? Amazon uses recommendation as a targeted marketing tool in both its email campaigns and on most of its website pages. Amazon will recommend many products from different categories based on what you have browsed, and it will pull those products in front of you which you are likely to buy, like the "frequently bought together" option that comes at the bottom of the product page to tempt you into buying the combo. This recommendation has just one main goal, that is, to increase average order value, or to upsell and cross-sell customers by providing product suggestions based on the items in the shopping cart or on the product they're currently looking at on the site.
So the next industrial application of the KNN algorithm is concept search, or searching semantically similar documents and classifying documents containing similar topics. As you know, the data on the internet is increasing exponentially every single second. There are billions and billions of documents on the internet, and each document contains multiple concepts. Now, this is a situation where the main problem is to extract concepts from a set of documents. As each page could have thousands of combinations that could be potential concepts, an average document could have millions of concepts; combine that with the vast amount of data on the web, and we are talking about an enormous amount of data sets and samples. So what we need is to find a concept from the enormous amount of data sets and samples. For this purpose we will be using the KNN algorithm. More advanced examples could include handwriting detection, like in OCR, or image recognition, or even video recognition. All right. So now that you know various use cases of the KNN algorithm, let's proceed and see how it works. So how does the KNN algorithm work? Let's start by plotting these blue and orange points on our graph. The blue points belong to class A and the orange ones belong to class B. Now you get a star as a new point, and your task is to predict whether this new point belongs to class A or to class B. To start the prediction, the very first thing that you have to do is select the value of K. Just as I told you, the K in the KNN algorithm refers to the number of nearest neighbors that you want to select. For example, in this case k equals 3. What does it mean? It means that I am selecting the three points which are the least distance from the new point, or you can say I am selecting the three points which are closest to the star.
Well, at this point you can ask: how will you calculate the least distance? Once you calculate the distance, you will get one blue and two orange points which are closest to this star. Since in this case we have a majority of orange points, you can say that for k equals 3 the star belongs to class B, or that the star is more similar to the orange points. Moving on ahead, what if k equals 6? For this case, you have to look for the six points which are closest to the star. After calculating the distance, we find that we have four blue points and two orange points closest to the star. As you can see, the blue points are in the majority, so you can say that for k equals 6 the star belongs to class A, or that the star is more similar to the blue points. So by now, I guess you know how the KNN algorithm works and what the significance of K in the KNN algorithm is. So how will you choose the value of K? Keep in mind that K is the most important parameter in the KNN algorithm. When you build a k-nearest neighbor classifier, how will you choose a value of K? Well, you might have a specific value of K in mind, or you could divide up your data and use something like a cross-validation technique to test several values of K in order to determine which works best for your data. For example, if n equals 2,000 cases, then the optimal value of K lies somewhere between 1 and 19. But yes, unless you try it you cannot be sure of it. So you know how the algorithm works at a high level. Let's move on and see how things are predicted using the KNN algorithm. Remember I told you that the KNN algorithm uses the least distance measure in order to find its nearest neighbors. So let's see how this distance is calculated. Well, there are several distance measures which can be used.
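The k equals 3 versus k equals 6 comparison above can be reproduced with a few lines of Python. The coordinates below are hypothetical, chosen only so that the three nearest points are one blue and two orange and the six nearest are four blue and two orange; the function name `knn_predict` is also an assumption of this sketch, which uses the straight-line distance from `math.dist` (Python 3.8 or later).

```python
import math
from collections import Counter

# Hypothetical coordinates with the star at the origin.
# "A" stands for the blue class, "B" for the orange class.
points = [
    ((1.0, 0.0), "B"),
    ((0.0, 2.0), "B"),
    ((1.5, 0.0), "A"),
    ((0.0, 3.0), "A"),
    ((3.5, 0.0), "A"),
    ((0.0, 4.0), "A"),
    ((6.0, 0.0), "B"),  # a far-away orange point that never gets a vote here
]
star = (0.0, 0.0)

def knn_predict(points, query, k):
    """Label the query with the majority class of its k nearest points."""
    nearest = sorted(points, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(points, star, k=3))  # B: two orange votes against one blue
print(knn_predict(points, star, k=6))  # A: four blue votes against two orange
```

The same data gives a different answer for a different K, which is why the choice of K matters so much.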
So to start with, we'll mainly focus on Euclidean distance and Manhattan distance in this session. So what is this Euclidean distance? Well, the Euclidean distance is defined as the square root of the sum of the squared differences between a new point x and an existing point y. For example, here we have points P1 and P2: point P1 is (1, 1) and point P2 is (5, 4). So what is the Euclidean distance between them? You can see that the Euclidean distance is the direct distance between two points. We can calculate it as 5 minus 1 whole square plus 4 minus 1 whole square, and take the square root of that, which results in 5. Next is the Manhattan distance. The Manhattan distance is used to calculate the distance between real vectors using the sum of their absolute differences. In this case, the Manhattan distance between the points P1 and P2 is mod of 5 minus 1 plus mod of 4 minus 1, which results in 4 plus 3, that is 7. This slide shows the difference between the Euclidean and Manhattan distance from point A to point B. The Euclidean distance is nothing but the direct, or the least possible, distance between A and B, whereas the Manhattan distance is the distance between A and B measured along the axes at right angles. Let's take an example and see how things are predicted using the KNN algorithm, or how the KNN algorithm works. Suppose we have a data set which consists of the height, weight and T-shirt size of some customers. Now when a new customer comes, we only have his height and weight as information, and our task is to predict the T-shirt size of that particular customer. For this we'll be using the KNN algorithm. Now say we have new data with a height of 161 centimeters and a weight of 61 kg. The very first thing that we'll do is calculate the Euclidean distance.
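The two distance measures defined above, applied to the points P1 = (1, 1) and P2 = (5, 4), can be written out as a short sketch. The function names `euclidean` and `manhattan` are chosen for this illustration and are not taken from the session.

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

p1, p2 = (1, 1), (5, 4)
print(euclidean(p1, p2))  # 5.0, from sqrt(16 + 9)
print(manhattan(p1, p2))  # 7, from 4 + 3
```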
For an existing customer with height 158 centimeters and weight 58 kg, that distance is nothing but the square root of 161 minus 158 whole square plus 61 minus 58 whole square, which is 4.24. Let's drag and drop it. So these are the Euclidean distances to the other points. Now, let's suppose k equals 5. What the algorithm does is search for the five customers closest to the new customer, that is, most similar to the new data in terms of its attributes. For k equals 5, let's find the five smallest Euclidean distances. These are the distances which we are going to use: one, two, three, four and five. Let's rank them in order: this is first, this is second, this is third, then this one is fourth, and this one is fifth. So that is our order. For k equals 5 we have four T-shirts which come under size M and one T-shirt which comes under size L. So obviously the best guess, or the best prediction, for the T-shirt size of height 161 centimeters and weight 61 kg is M. Or you can say that the new customer fits into size M. Well, this was all about the theoretical session, but before we drill down to the coding part, let me tell you why people call KNN a lazy learner. KNN for classification is a very simple algorithm, but that's not why it is called lazy. KNN is a lazy learner because it doesn't learn a discriminative function from the training data. Instead, it memorizes the training data. There is no learning phase of the model, and all of the work happens at the time a prediction is requested. That is the reason why KNN is often referred to as a lazy learning algorithm. So this was all about our theoretical session. Now, let's move on to the coding part. For the practical implementation, the hands-on part, I'll be using the Iris data set. This data set consists of 150 observations.
We have four features and one class label. The four features include the sepal length, sepal width, petal length and petal width, whereas the class label decides which flower belongs to which category. So this was the description of the data set which we are using. Now, let's move on and see the step-by-step solution to perform the KNN algorithm. First, we'll start by handling the data: we have to open the data set from the CSV format and split it into train and test parts. Next, we'll take the similarity, where we have to calculate the distance between two data instances. Once we calculate the distance, we'll look for the neighbors and select the K neighbors which have the least distance from a new point. Once we get our neighbors, we'll generate a response from that set of data instances, which will decide whether the new point belongs to class A or class B. Finally, we'll create the accuracy function, and in the end we'll tie it all together in the main function. So let's start with our code for implementing the KNN algorithm using Python. I'll be using Jupyter Notebook with Python 3.0 installed on it. Now, let's move on and see how the KNN algorithm can be implemented using Python. So there's my Jupyter Notebook, which is a web-based interactive computing notebook environment, with Python 3.0 installed on it. It's launching. So there's our Jupyter Notebook, and we'll be writing our Python code in it. The first thing that we need to do is load our file. Our data is in CSV format without a header line or any quotes. We can open the file with the open function and read the data lines using the reader function in the csv module. So let's write the code to load our data file. Let's execute the Run button. Once you execute the Run button, you can see the entire training data set as the output.
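The loading step just described (open the CSV, read it line by line with the csv module's reader) might look roughly like the sketch below. The on-screen code is not part of this transcript, so this is a reconstruction under assumptions: the helper name `read_rows` is made up, a few inline rows in the Iris layout stand in for the real file so the example runs on its own, and the file name in the final comment is a guess.

```python
import csv
import io

# A few rows in the Iris layout (four measurements, then the class label),
# standing in for the real data file so the sketch is self-contained.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

def read_rows(handle):
    """Read every non-empty CSV line from an open file-like object."""
    return [row for row in csv.reader(handle) if row]

rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(", ".join(row))

# With a real file on disk (the name "iris.data" is an assumption):
# with open("iris.data", newline="") as handle:
#     rows = read_rows(handle)
```

Note that at this point every value, including the measurements, is still a string.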
We need to split the data into a training data set that KNN can use to make predictions and a test data set that we can use to evaluate the accuracy of the model. We first need to convert the flower measures that were loaded as strings into numbers that we can work with. Next, we need to split the data set randomly into train and test, with a ratio of 67 is to 33 for train is to test, which is a standard ratio used for this purpose. So let's define a function, load data set, that loads a CSV with the provided file name and splits it randomly into training and test data sets using the provided split ratio. So this is our function load data set, which takes the file name, the split ratio, the training data set and the testing data set as its input. All right. Let's execute the Run button and check for any errors. It executed with zero errors. Let's test this function. So there's our training set, our testing set, and load data set. Inside this function we are passing our file, Iris data, with a split ratio of 0.66, along with the training data set and the test data set. Let's see what it is dividing our training data set and test data set into. It's giving a count of the training data set and the testing data set: the total number of training instances it split into is 97, and the total number of test instances is 53. All right. So our function load data set is performing well. Let's move on to step two, which is similarity. In order to make predictions, we need to calculate the similarity between any two given data instances. This is needed so that we can locate the k most similar data instances in the training data set and in turn make a prediction. Given that all four flower measurements are numeric and have the same unit, we can directly use the Euclidean distance measure.
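The load-and-split step tested above (convert the four measurements to numbers, then send each row randomly to the training or the test set according to the split ratio) can be sketched as follows. This is a hedged reconstruction, not the session's code: here `load_dataset` takes an open file-like object and a random generator and returns two lists, whereas the function described in the session takes a file name and fills two lists passed in as arguments. The inline rows stand in for the real file.

```python
import csv
import io
import random

SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

def load_dataset(handle, split, rng):
    """Parse CSV rows and assign each one to train with probability `split`."""
    training_set, test_set = [], []
    for row in csv.reader(handle):
        if not row:
            continue
        # The first four fields are measurements; the fifth is the class label.
        instance = [float(value) for value in row[:4]] + [row[4]]
        if rng.random() < split:
            training_set.append(instance)
        else:
            test_set.append(instance)
    return training_set, test_set

training_set, test_set = load_dataset(io.StringIO(SAMPLE), 0.67, random.Random(42))
print("Train:", len(training_set))
print("Test:", len(test_set))
```

Because each row is assigned independently at random, the counts only approximate the requested ratio, which is consistent with the session getting 97 and 53 rather than an exact 66 percent split.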
This is nothing but the square root of the sum of squared differences between two arrays of numbers. Additionally, we want to control which fields to include in the distance calculation. Specifically, we only want to include the first four attributes. So our approach will be to limit the Euclidean distance to a fixed length. All right. So let's define our Euclidean function. This is our Euclidean distance function, which takes instance one, instance two and length as parameters. Instance one and instance two are the two points between which you want to calculate the Euclidean distance, whereas length denotes how many attributes you want to include. Okay. So there's our Euclidean function. Let's execute it. It's executing fine without any errors. Let's test the function. Suppose data one, the first instance, consists of the data point 2, 2, 2 and belongs to class a, and data two consists of 4, 4, 4 and belongs to class b. When we calculate the Euclidean distance from data one to data two, we have to consider only the first three features of them. All right. So let's print the distance. As you can see here, the distance comes out to be 3.464. This is nothing but the Euclidean distance, and it is calculated as the square root of 4 minus 2 whole square plus 4 minus 2 whole square plus 4 minus 2 whole square, that is, 3 times 4 minus 2 whole square, which is 12, and the square root of 12 is nothing but 3.464. All right. So now that we have calculated the distance, we need to look for the K nearest neighbors.
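The distance function described above, limited to the first `length` attributes so that the class label is left out, can be sketched like this. The name `euclidean_distance` and its three parameters follow the spoken description; the exact on-screen code is not in the transcript, so treat the body as an assumption.

```python
import math

def euclidean_distance(instance1, instance2, length):
    """Euclidean distance over the first `length` attributes only."""
    return math.sqrt(sum((instance1[i] - instance2[i]) ** 2 for i in range(length)))

data1 = [2, 2, 2, "a"]
data2 = [4, 4, 4, "b"]
distance = euclidean_distance(data1, data2, 3)
print(round(distance, 3))  # 3.464, the square root of 12
```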
Now that we have a similarity measure, we can use it to collect the k most similar instances for a given unseen instance. This is a straightforward process of calculating the distance for all the instances and selecting a subset with the smallest distance values. For that, we will be defining a function, get neighbors, which returns the K most similar neighbors from the training set for a given test instance. All right. So this is how our get neighbors function looks: it takes the training data set, the test instance and K as its input. Here K is nothing but the number of nearest neighbors you want to check for. So basically, what you'll be getting from this get neighbors function is K different points having the least Euclidean distance from the test instance. All right, let's execute it. The function executed without any errors, so let's test it. Suppose the training data set includes the data 2, 2, 2, which belongs to class a, and the data 4, 4, 4, which belongs to class b, and our test instance is 5, 5, 5. Now we have to predict whether this test instance belongs to class a or class b. For k equals 1, we have to find its nearest neighbor and predict whether this test instance will belong to class a or class b. All right. So let's execute the Run button. On executing the Run button, you can see that the output is 4, 4, 4 and b. The new instance 5, 5, 5 is closest to the point 4, 4, 4, which belongs to class b. All right. Now once you have located the most similar neighbors for a test instance, the next task is to predict a response based on those neighbors. So how can we do that?
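The neighbor search tested above can be sketched as a sort by distance followed by taking the first K rows. This is a reconstruction rather than the session's code: it assumes every training row ends with its class label, so the number of attributes compared is the training row length minus one, and the distance helper is repeated so the block runs on its own.

```python
import math

def euclidean_distance(instance1, instance2, length):
    """Euclidean distance over the first `length` attributes only."""
    return math.sqrt(sum((instance1[i] - instance2[i]) ** 2 for i in range(length)))

def get_neighbors(training_set, test_instance, k):
    """Return the k training rows closest to test_instance."""
    # Assumption: each training row is [features..., label].
    length = len(training_set[0]) - 1
    ranked = sorted(
        training_set,
        key=lambda row: euclidean_distance(test_instance, row, length),
    )
    return ranked[:k]

train_set = [[2, 2, 2, "a"], [4, 4, 4, "b"]]
test_instance = [5, 5, 5]
neighbors = get_neighbors(train_set, test_instance, 1)
print(neighbors)  # [[4, 4, 4, 'b']]
```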
Well, we can do this by allowing each neighbor to vote for its class attribute and taking the majority vote as the prediction. Let's see how we can do that. We have a function, get response, which takes neighbors as the input. These neighbors are nothing but the output of the get neighbors function; the output of get neighbors will be fed to get response. All right, let's execute the Run button. It's executed. Let's move ahead and test our function get response. We have the neighbors 1, 1, 1, which belongs to class a, 2, 2, 2, which belongs to class a, and 3, 3, 3, which belongs to class b. This response variable will store the value of get response when we pass it these neighbors. What we want to check is whether the test instance 5, 5, 5 belongs to class a or class b when the neighbors are 1, 1, 1 a, 2, 2, 2 a and 3, 3, 3 b. So let's check our response. Now that we have created all the different functions which are required for the KNN algorithm, an important concern is: how do you evaluate the accuracy of the predictions? An easy way to evaluate the accuracy of the model is to calculate the ratio of the total correct predictions to all the predictions made. For this, I will be defining a function, get accuracy, and inside that I'll be passing my test data set and the predictions. Get accuracy function, check. It executed without any error. Let's check it on a sample data set. We have our test data set as 1, 1, 1, which belongs to class a, 2, 2, 2, which again belongs to class a, and 3, 3, 3, which belongs to class b. And my predictions are: for the first test data, it predicted that it belongs to class a, which is true; for the next, it predicted that it belongs to class a, which is again true; and for the next, it again predicted that it belongs to class a, which is false in this case, because the test data belongs to class b. All right. So in total we have two correct predictions out of three.
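The voting and accuracy steps just walked through can be sketched together. The names `get_response` and `get_accuracy` follow the spoken description, but the bodies are a reconstruction under the assumption that the class label is the last element of every row.

```python
from collections import Counter

def get_response(neighbors):
    """Majority vote over the class labels (last element) of the neighbors."""
    votes = Counter(neighbor[-1] for neighbor in neighbors)
    return votes.most_common(1)[0][0]

def get_accuracy(test_set, predictions):
    """Percentage of test rows whose label matches the prediction."""
    correct = sum(row[-1] == predicted for row, predicted in zip(test_set, predictions))
    return correct / len(test_set) * 100.0

neighbors = [[1, 1, 1, "a"], [2, 2, 2, "a"], [3, 3, 3, "b"]]
response = get_response(neighbors)
print(response)  # a: two votes for a against one for b

test_set = [[1, 1, 1, "a"], [2, 2, 2, "a"], [3, 3, 3, "b"]]
predictions = ["a", "a", "a"]
accuracy = get_accuracy(test_set, predictions)
print(round(accuracy, 1))  # 66.7: two correct out of three
```

The session quotes the same ratio as 66.6, which is the value cut off after one decimal rather than rounded.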
So the ratio will be 2 by 3, which is nothing but 66.6 percent. So our accuracy rate is 66.6 percent. Now that you have created all the functions that are required for the KNN algorithm, let's compile them into one single main function. All right, so this is our main function. We are using the Iris data set with a split of 0.67, and the value of K is 3. Let's see what the accuracy score is and check how accurate our model is. In the training data set we have 113 values, and in the test data set we have 37 values. These are the predicted and the actual values of the output. Okay. So in total we got an accuracy of 97.29 percent, which is really very good. All right, so I hope the concept of the KNN algorithm is clear to you.

In a world full of machine learning and artificial intelligence surrounding almost everything around us, classification and prediction is one of the most important aspects of machine learning. So before moving forward, let's have a quick look at the agenda. I'll start off this video by explaining what exactly Naive Bayes is. Then we'll understand what Bayes' theorem is, which serves as the logic behind the Naive Bayes algorithm. Moving forward, I'll explain the steps involved in the Naive Bayes algorithm one by one, and finally I'll finish off this video with a demo on Naive Bayes using the sklearn package. Now, Naive Bayes is a simple but surprisingly powerful algorithm for predictive analysis. It is a classification technique based on Bayes' theorem, with an assumption of independence among predictors. The name comprises two parts, naive and Bayes. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability of whether a fruit is an apple, an orange or a banana. That is why it is known as naive. A Naive Bayes model is easy to build and particularly useful for very large data sets. In probability theory and statistics, Bayes' theorem, which is alternatively known as Bayes' law or Bayes' rule, describes the probability of an event based on prior knowledge of the conditions that might be related to the event. Bayes' theorem is a way to figure out conditional probability. The conditional probability is the probability of an event happening given that it has some relationship to one or more other events. For example, your probability of getting a parking space is connected to the time of day you park, where you park, and what conventions are going on at that time. Bayes' theorem is slightly more nuanced. In a nutshell, it gives you the actual probability of an event given information about tests. Now, if you look at the definition of Bayes' theorem, we can see that, given a hypothesis H and the evidence E, Bayes' theorem states the relationship between the probability of the hypothesis before getting the evidence, which is P of H, and the probability of the hypothesis after getting the evidence, which is P of H given E: P of H given E is defined as the probability of E given H into the probability of H, divided by the probability of E. It's rather confusing, right? So let's take an example to understand this theorem.
So suppose I have a deck of cards. If a single card is drawn from the deck, the probability that the card is a king is 4 by 52, since there are four kings in a standard deck of 52 cards. If King is the event "this card is a king", the probability of King is 4/52, which is equal to 1/13. Now suppose evidence is provided: for instance, someone looks at the card and tells us that it is a face card. The probability of King given Face can be calculated using Bayes' theorem with this formula: P(King|Face) = P(Face|King) × P(King) / P(Face). Since every king is also a face card, the probability of Face given King is equal to 1. And since there are three face cards in each suit, the jack, king, and queen, the probability of a face card is 12/52, that is 3/13. Using Bayes' theorem we can now find the probability of King given Face: (1 × 1/13) / (3/13), so our final answer comes to 1/3. This makes sense: if you have a deck that contains only face cards, there are three types of faces, the jack, king, and queen, so the probability that a card is a king is 1/3. This is a simple example of how Bayes' theorem works. Now let's look at the proof, that is, how Bayes' theorem is derived.
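Before moving to the derivation, the card example can be checked with a few lines of Python. This is just a sketch using exact fractions; the variable names are my own.

```python
from fractions import Fraction

p_king = Fraction(4, 52)           # four kings in a 52-card deck
p_face = Fraction(12, 52)          # jack, queen, king in each of four suits
p_face_given_king = Fraction(1)    # every king is a face card

# Bayes' theorem: P(King | Face) = P(Face | King) * P(King) / P(Face)
p_king_given_face = p_face_given_king * p_king / p_face

print(p_king, p_face, p_king_given_face)  # 1/13 3/13 1/3
```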
So here we have the probability of A given B and the probability of B given A. For a joint probability distribution over the events A and B, with P(A ∩ B) being the probability of A intersection B, the conditional probability of A given B is defined as P(A|B) = P(A ∩ B) / P(B), and similarly the probability of B given A is defined as P(B|A) = P(B ∩ A) / P(A). Now we equate P(A ∩ B) and P(B ∩ A), as both are the same thing, which gives P(A|B) × P(B) = P(B|A) × P(A). From this, as you can see, we get our final Bayes' theorem: P(A|B) = P(B|A) × P(A) / P(B). While this equation applies to any probability distribution over the events A and B, it has a particularly nice interpretation in the case where A represents the hypothesis H and B represents some observed evidence E. In that case the formula is P(H|E) = P(E|H) × P(H) / P(E). This relates the probability of the hypothesis before getting the evidence, which is P(H), to the probability of the hypothesis after getting the evidence, which is P(H|E). For this reason P(H) is known as the prior probability, while P(H|E) is known as the posterior probability, and the factor that relates the two, P(E|H) / P(E), is known as the likelihood ratio. Using these terms, Bayes' theorem can be rephrased as: the posterior probability equals the prior probability times the likelihood ratio. So now that we know the maths behind Bayes' theorem, let's see how we can implement this in a real-life scenario. Suppose we have a data set in which we have the outlook, the humidity, and the wind, and we need to find out whether we should play or not on that day.
So the outlook can be sunny, overcast, or rain; the humidity is high or normal; and the wind is categorized into two kinds, weak and strong. First of all, we create a frequency table for each attribute of the data set. The frequency table for the outlook looks like this: we have sunny, overcast, and rainy. The frequency table for humidity looks like this, and the frequency table for wind looks like this: we have strong and weak for wind, and high and normal for humidity. For each frequency table we then generate a likelihood table. The likelihood table contains the probabilities for a particular value. Suppose we take sunny, with play as yes and no. The probability of sunny given that we play (yes) is 3 by 10, which is 0.3. The probability of X, which is the probability of sunny, is 5 by 14. These terms are all generated from the data we have here. And finally, the probability of yes is 10 out of 14. So if we look at the likelihood of yes given that it's sunny, using Bayes' theorem it is P(Sunny|Yes) × P(Yes) / P(Sunny). We have all these values calculated, so if we put them into the Bayes' theorem equation we get a likelihood of yes of 0.59. Similarly, the likelihood of no can be calculated, and here it is 0.40. In the same way we create the likelihood tables for humidity and wind. For humidity, the likelihood of yes given that the humidity is high is 0.42, and the probability of no given that the humidity is high is 0.58. Similarly, for the wind table, the probability of yes given that the wind is weak is 0.75, and the probability of no given that the wind is weak is 0.25. Now suppose we have a day with rain, high humidity, and weak wind. Should we play or not?
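Before working through that day, the single-attribute calculation for a sunny outlook can be reproduced from the three quoted fractions. This is a small sketch with variable names of my own. With exact fractions the result is 0.6, so the 0.59 quoted above presumably comes from rounding the intermediate values.

```python
from fractions import Fraction

p_sunny_given_yes = Fraction(3, 10)   # P(Sunny | Yes), from the likelihood table
p_sunny = Fraction(5, 14)             # P(Sunny)
p_yes = Fraction(10, 14)              # P(Yes)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny

# Yes and No are the only two outcomes, so P(No | Sunny) is the complement.
p_no_given_sunny = 1 - p_yes_given_sunny

print(float(p_yes_given_sunny), float(p_no_given_sunny))  # 0.6 0.4
```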
To decide whether to play on such a day, we use Bayes' theorem again. The likelihood of yes on that day is equal to P(Outlook = Rain | Yes) × P(Humidity = High | Yes) × P(Wind = Weak | Yes) × P(Yes), which comes to 0.019. Similarly, the likelihood of no on that day is equal to 0.016. To get the probability of yes for that day, we just need to divide the likelihood of yes by the sum of the likelihoods of yes and no. So the probability of playing, which is yes, is about 0.55, whereas the probability of not playing is about 0.45. This is based on the data we already have with us. So now that you have an idea of what exactly Naive Bayes is and how it works, and we have seen how it can be applied to a particular data set, let's see where it is used in the industry. Let's start with our first industrial use case, which is news categorization, or we can use the term text classification to broaden the spectrum of this algorithm. News on the web is growing rapidly in the information age, where each news site has its own layout and categorization for grouping news. This heterogeneity of layout and categorization cannot always satisfy an individual user's needs, and removing this heterogeneity and classifying the news articles according to user preference is a formidable task. Companies use web crawlers to extract useful text from the HTML pages of the news articles, and each of these news articles is then tokenized. These tokens are nothing but the categories of the news. In order to achieve better classification results, we remove the less significant words, which are the stop words, from the documents or articles, and then we apply the Naive Bayes classifier to classify the news content.
Now this is by far one of the best examples of the Naive Bayes classifier: spam filtering. Naive Bayes classifiers are a popular statistical technique for email filtering. They typically use bag-of-words features to identify spam email, an approach commonly used in text classification as well. It works by correlating the use of tokens with spam and non-spam emails, and then Bayes' theorem, which I explained earlier, is used to calculate the probability that an email is or is not spam. Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of an individual user and give low false-positive spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with its roots in the 1990s. Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the words lottery or lucky draw in spam email, but will seldom see them in other email. The filter doesn't know these probabilities in advance and must first be trained so that it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all the words in each training email, the filter adjusts the probabilities in its database that each word will appear in spam or in legitimate email. After training, the word probabilities, also known as likelihood functions, are used to compute the probability that an email with a particular set of words belongs to either category. Each word in the email contributes to the email's spam probability. This contribution is called the posterior probability and is computed, again, using Bayes' theorem. Then the email's spam probability is computed over all the words in the email, and if the total exceeds a certain threshold, say 95%, the filter marks the email as spam.
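The training and thresholding steps described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not a production filter: the six training emails are made up, tokenization is a simple whitespace split, add-one (Laplace) smoothing is my own choice for handling unseen words, and the function names are hypothetical.

```python
import math
from collections import Counter

def train(emails):
    """Count words per label from (text, label) pairs, where label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in emails:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def spam_probability(text, word_counts, label_counts):
    """P(spam | words) from Bayes' theorem with the naive independence assumption."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    log_scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Start from the log prior, then add one log likelihood per word.
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out the product.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        log_scores[label] = score
    # Normalize the two scores into a probability.
    top = max(log_scores.values())
    weights = {label: math.exp(score - top) for label, score in log_scores.items()}
    return weights["spam"] / sum(weights.values())

def is_spam(text, word_counts, label_counts, threshold=0.95):
    """Mark the email as spam only when the probability exceeds the threshold."""
    return spam_probability(text, word_counts, label_counts) > threshold

# Made-up training emails, labeled by hand as the user would do.
training_emails = [
    ("win the lottery now", "spam"),
    ("lucky draw winner claim your prize", "spam"),
    ("claim your lottery prize now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
    ("lunch on monday", "ham"),
]
word_counts, label_counts = train(training_emails)

print(is_spam("claim your lottery prize", word_counts, label_counts))   # True
print(is_spam("project meeting on monday", word_counts, label_counts))  # False
```

Working in log space avoids multiplying many small probabilities together, which matters once emails contain more than a handful of words.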
Now, object detection is the process of finding instances of real-world objects such as faces, bicycles, and buildings in images or video. Object detection algorithms typically use extracted features and learning algorithms to recognize instances of an object category. Here again, Naive Bayes plays an important role in the categorization and classification of objects. Now, the medical area: there is an increasingly voluminous amount of electronic data, which is becoming more and more complicated. The medical data produced has certain characteristics that make its analysis very challenging, and attractive as well. Among all the different approaches, Naive Bayes is used. It is one of the most effective and efficient classification algorithms and has been successfully applied to many medical problems. An empirical comparison of Naive Bayes against five popular classifiers on medical data sets shows that Naive Bayes is well suited for medical applications and has high performance on most of the examined medical problems. In the past, various statistical methods have been used for modeling in the area of disease diagnosis. These methods require prior assumptions and are less capable of dealing with massive, complicated, nonlinear, and dependent data. One of the main advantages of the Naive Bayes approach, which is appealing to physicians, is that all the available information is used to explain the decision. This explanation seems natural for medical diagnosis and prognosis; that is, it is very close to the way physicians diagnose patients. Now, weather is one of the most influential factors in our daily life, to the extent that it may affect the economy of a country that depends on occupations like agriculture.
Therefore, as a countermeasure to reduce the damage caused by uncertainty in weather behavior, there should be an efficient way to predict the weather. Weather forecasting has been a challenging problem for meteorological departments for years. Even after technological and scientific advancement, the accuracy of weather prediction has never been sufficient, and even today this domain remains a research topic in which scientists and mathematicians are working to produce a model or an algorithm that will accurately predict the weather. Here a Bayesian approach-based model is created in which posterior probabilities are used to calculate the likelihood of each class label for an input data instance, and the one with the maximum likelihood is taken as the resulting output. Earlier we saw a small implementation of this algorithm as well, where we predicted whether we should play or not based on the data we had collected. Now, there is a Python library known as scikit-learn which helps to build a Naive Bayes model in Python. There are three types of Naive Bayes model in the scikit-learn library. The first one is Gaussian. It is used in classification and it assumes that the features follow a normal distribution. Next we have multinomial. It is used for discrete counts. For example, let's say we have a text classification problem. Here we take Bernoulli trials one step further: instead of recording whether a word occurs in the document, we count how often the word occurs in the document. You can think of it as the number of times an outcome is observed in a given number of trials. And finally we have the Bernoulli type of Naive Bayes. This binomial model is useful if your feature vectors are binary, as in a bag-of-words model where the ones and the zeros stand for words that occur in the document and words that do not occur in the document, respectively.
Based on your data set, you can choose any of the models discussed here: Gaussian, multinomial, or Bernoulli. So let's understand how this algorithm works and what steps one can take to create a Bayesian model and use Naive Bayes to predict the output. To understand it better, we are going to predict the onset of diabetes. This problem comprises 768 observations of medical details for Pima Indian patients. The records describe instantaneous measurements taken from the patient, such as age, the number of times pregnant, and blood workup. All the patients are women aged 21 and older, all the attributes are numeric, and the units vary from attribute to attribute. Each record has a class value that indicates whether the patient suffered an onset of diabetes within five years of the measurements. This is coded as 1 for an onset and 0 otherwise. Now, I've broken the whole process down into the following steps. The first step is handling the data, in which we load the data from the CSV file and split it into training and test data sets. The second step is summarizing the data, in which we summarize the properties of the training data set so that we can calculate probabilities and make predictions. The third step is making a prediction: we use the summaries of the data set to generate a single prediction. After that we generate predictions given a test data set and a summarized training data set. Then we evaluate the accuracy of the predictions made for the test data set as the percentage correct out of all the predictions made. And finally we tie it all together and form our own Naive Bayes classifier model. Now, the first thing we need to do is load our data. The data is in CSV format without a header line or any quotes. We can open the file with the open function and read the data lines using the reader function in the csv module.
Now, we also need to convert the attributes that were loaded as strings into numbers so that we can work with them. So let me show you how this can be implemented. For that you need to install Python on your system and use the Jupyter Notebook or the Python shell. Here I'm using Anaconda Navigator, which has all the things required to do programming in Python. We have JupyterLab, we have the Notebook, we have the Qt console, and we even have RStudio as well. So all you need to do is install Anaconda Navigator; it comes with Python preinstalled. The moment you click Launch on the Jupyter Notebook, it takes you to the Jupyter home page on your local system, and there you can program in Python. Let me just rename the notebook to Pima Indian diabetes. First, we need to load the data set, so I'm creating a load CSV function here. Before that, we need to import the csv, math, and random modules. As you can see, I've created a load CSV function which reads the Pima Indian diabetes data CSV file using the csv.reader method and then converts every element of the data set to float. Originally all the elements are strings, but we need to convert them to floats for our calculations. Next we need to split the data into a training data set that Naive Bayes can use to make predictions and a test data set that we can use to evaluate the accuracy of the model. We split the data set randomly into training and test sets, usually in a ratio of 70 to 30, but for this example I am going to use 67 and 33. 70 to 30 is just a common ratio for testing algorithms, so you can play around with this number. So this is our split data set function. Now, the Naive Bayes model is comprised of a summary of the data in the training data set. This summary is then used while making predictions.
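The loading and splitting steps described above can be sketched as follows. This is a minimal sketch rather than the notebook code from the video: the names load_csv and split_dataset are my own spelling of the spoken function names, the seed parameter is an addition for reproducibility, and the file name in the comment is an assumption. Shuffling a copy and slicing it is one of several ways to get a random split.

```python
import csv
import random

def load_csv(filename):
    """Read a headerless CSV file and convert every value from string to float."""
    with open(filename, newline="") as handle:
        return [[float(value) for value in row] for row in csv.reader(handle) if row]

def split_dataset(dataset, split_ratio, seed=None):
    """Randomly split rows into (training, test) with split_ratio of them in training."""
    shuffled = list(dataset)
    random.Random(seed).shuffle(shuffled)
    train_size = int(len(shuffled) * split_ratio)
    return shuffled[:train_size], shuffled[train_size:]

# dataset = load_csv("pima-indians-diabetes.data.csv")  # file name is an assumption

# Stand-in rows with the same count as the diabetes data set (768 observations).
rows = [[float(i), float(i % 2)] for i in range(768)]
train_set, test_set = split_dataset(rows, 0.67, seed=1)
print(len(train_set), len(test_set))  # 514 254
```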
Now, the summary of the training data involves the mean and the standard deviation of each attribute, by class value. For example, if there are two class values and seven numerical attributes, then we need a mean and a standard deviation for each combination of the seven attributes and the two class values, which makes 14 attribute summaries. We can break the preparation of this summary down into the following subtasks: separating data by class, calculating the mean, calculating the standard deviation, summarizing the data set, and summarizing attributes by class. The first task is to separate the training data set instances by class value so that we can calculate statistics for each class. We can do that by creating a map from each class value to a list of instances that belong to that class, and sorting the entire data set of instances into the appropriate lists. The separate by class function does just that. As you can see, the function assumes that the last attribute is the class value, and it returns a map of class value to list of data instances. Next, we need to calculate the mean of each attribute for a class value. The mean is the central tendency of the data, and we use it as the middle of our Gaussian distribution when calculating probabilities. This is our function for the mean. We also need to calculate the standard deviation of each attribute for a class value. The standard deviation is calculated as the square root of the variance, and the variance is calculated as the average of the squared differences of each attribute value from the mean. One thing to note here is that we are using the n minus 1 method, which subtracts one from the number of attribute values when calculating the variance. Now that we have the tools to summarize the data, for a given list of instances we can calculate the mean and standard deviation for each attribute.
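The first three subtasks can be sketched like this. It is a minimal sketch with function names of my own choosing, assuming, as in the walkthrough, that the class value is the last element of each row and that the variance uses the n minus 1 divisor.

```python
import math

def separate_by_class(dataset):
    """Map each class value (last element of a row) to the list of rows in that class."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return separated

def mean(numbers):
    """Arithmetic mean of a list of numbers."""
    return sum(numbers) / len(numbers)

def stdev(numbers):
    """Sample standard deviation: square root of the variance with an n - 1 divisor."""
    average = mean(numbers)
    variance = sum((x - average) ** 2 for x in numbers) / (len(numbers) - 1)
    return math.sqrt(variance)

# Tiny made-up data set: two attributes followed by the class value.
dataset = [[1, 20, 1], [2, 21, 0], [3, 22, 1]]
print(separate_by_class(dataset))  # {1: [[1, 20, 1], [3, 22, 1]], 0: [[2, 21, 0]]}

numbers = [1, 2, 3, 4, 5]
print(mean(numbers), round(stdev(numbers), 4))  # 3.0 1.5811
```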
Now, the zip function groups the values for each attribute across our data instances into their own lists so that we can compute the mean and standard deviation for each attribute. Next comes summarizing attributes by class. We can pull it all together by first separating our training data set into instances grouped by class and then calculating the summaries for each attribute. Now we are ready to make predictions using the summaries prepared from our training data. Making predictions involves calculating the probability that a given data instance belongs to each class and then selecting the class with the largest probability as the prediction. We can divide this whole process into four tasks: calculating the Gaussian probability density function, calculating class probabilities, making a prediction, and then estimating the accuracy. To calculate the Gaussian probability density function, we use the Gaussian function to estimate the probability of a given attribute value, given the known mean and standard deviation of the attribute estimated from the training data. As you can see, the parameters are x, the mean, and the standard deviation. In the calculate probability function, we calculate the exponent first and then the main division; this lets us fit the equation nicely on two lines. The next task is calculating the class probabilities. Now that we can calculate the probability of an attribute belonging to a class, we can combine the probabilities of all the attribute values for a data instance and come up with a probability of the entire data instance belonging to the class. So now we have calculated the class probabilities.
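The summarizing and probability steps described above can be sketched as below. This is a sketch under a few assumptions: the function names are my own, the standard library's statistics.stdev (which uses the n minus 1 divisor) stands in for the hand-written helpers, and the class probabilities are a plain product of per-attribute densities without a class prior term, which is how the walkthrough describes the combination.

```python
import math
from statistics import mean, stdev  # stdev is the sample (n - 1) standard deviation

def summarize(rows):
    """(mean, stdev) for each attribute column; zip groups values by attribute."""
    columns = list(zip(*rows))[:-1]  # drop the last column, which is the class value
    return [(mean(column), stdev(column)) for column in columns]

def summarize_by_class(dataset):
    """Map each class value to the attribute summaries of the rows in that class."""
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row)
    return {label: summarize(rows) for label, rows in separated.items()}

def calculate_probability(x, mu, sigma):
    """Gaussian probability density of x: the exponent first, then the main division."""
    exponent = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    return exponent / (math.sqrt(2 * math.pi) * sigma)

def calculate_class_probabilities(summaries, input_vector):
    """Multiply the per-attribute densities together for each class."""
    probabilities = {}
    for label, class_summaries in summaries.items():
        probabilities[label] = 1.0
        for (mu, sigma), x in zip(class_summaries, input_vector):
            probabilities[label] *= calculate_probability(x, mu, sigma)
    return probabilities

# Tiny made-up data set: two attributes followed by the class value.
dataset = [[1, 20, 1], [2, 21, 0], [3, 22, 1], [4, 22, 0]]
summaries = summarize_by_class(dataset)
probabilities = calculate_class_probabilities(summaries, [1, 20])
print(max(probabilities, key=probabilities.get))  # 1
```

For the input 1, 20 the product of densities is larger for class 1 than for class 0, so class 1 is the more probable label.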
It's time to finally make our first prediction. We can calculate the probability of the data instance belonging to each class value, look for the largest probability, and return the associated class. For that we use the predict function, which takes the summaries and the input vector, which is basically the set of attribute values for which we want a label. Finally, we can estimate the accuracy of the model by making predictions for each data instance in our test data. For that we use the get predictions method, which calculates the predictions based on the test data set and the summary of the training data set. The predictions can then be compared to the class values in our test data set, and the classification accuracy can be calculated as a ratio between 0 and 100 percent. The get accuracy method calculates this accuracy ratio. Now finally, to sum it all up, we define our main function, in which we call all the methods we defined earlier, one by one, to get the accuracy of the model we have created. As you can see, this is our main function, in which we have the file name, the split ratio, the data set, and the training and test data sets. We use the split data set method, then the summarize by class function, and then the get predictions and get accuracy methods as well. So guys, as you can see from the output, we are splitting the 768 rows into 514 training rows and 254 test rows, and the accuracy of this model is 68 percent. We can play with the amount of training and test data to be used, changing the split to 70/30 or 80/20 to get a different accuracy. So suppose I change the split ratio from 0.67 to 0.8. As you can see, we get an accuracy of 62 percent.
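The prediction and accuracy steps just described can be sketched as follows. This is a self-contained sketch, not the notebook code: the function names are my own, and the summaries dictionary is written by hand with made-up values (one attribute per class, each stored as a mean and standard deviation) so that the example runs without the diabetes file.

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian probability density of x for the given mean and standard deviation."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def predict(summaries, input_vector):
    """Return the class whose product of attribute densities is largest."""
    best_label, best_probability = None, -1.0
    for label, class_summaries in summaries.items():
        probability = 1.0
        for (mu, sigma), x in zip(class_summaries, input_vector):
            probability *= gaussian(x, mu, sigma)
        if probability > best_probability:
            best_label, best_probability = label, probability
    return best_label

def get_predictions(summaries, test_set):
    """One prediction per test row (the trailing class value is ignored by predict)."""
    return [predict(summaries, row) for row in test_set]

def get_accuracy(test_set, predictions):
    """Percentage of rows whose class value (last element) matches the prediction."""
    correct = sum(1 for row, pred in zip(test_set, predictions) if row[-1] == pred)
    return correct / len(test_set) * 100.0

# Hand-written summaries: class -> [(mean, stdev)] for a single attribute.
summaries = {"A": [(1, 0.5)], "B": [(20, 5.0)]}
test_set = [[1.1, "A"], [19.1, "B"], [2.0, "B"]]

predictions = get_predictions(summaries, test_set)
print(predictions)                                    # ['A', 'B', 'A']
print(round(get_accuracy(test_set, predictions), 2))  # 66.67
```

The third row is deliberately mislabeled by the model: its value 2.0 sits much closer to class A's mean, so the prediction is A although the true class is B.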
So splitting at 0.67 gave us the better result, 68 percent, compared with 62 percent at 0.8. This is how you can implement a Gaussian Naive Bayes classifier. These are the step-by-step methods you need in the case of the Naive Bayes classifier. But don't worry, we do not need to write this many lines of code to make a model. This is where scikit-learn really comes into the picture. The scikit-learn library has a predefined method, or say a predefined function, for Naive Bayes, which reduces all of these lines of code to merely two or three lines. So let me just open another Jupyter notebook and name it sklearn Naive Bayes. Here we are going to use the most famous data set, which is the Iris data set. The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher, and based on Fisher's linear discriminant model this data set became a typical test case for many statistical classification techniques in machine learning. As I mentioned earlier, there are three types of Naive Bayes: Gaussian, multinomial, and Bernoulli. Here we are going to use the GaussianNB model, which is already present in the sklearn library, that is, the scikit-learn library. First of all, we need to import datasets and metrics from sklearn, and we also need to import GaussianNB. Once all these libraries are loaded, we need to load the data set, which is the Iris data set. Next we fit a Naive Bayes model to this data set. As you can see, we have defined the model so easily: it is GaussianNB, which contains all the programming I just showed you earlier, all the methods that take the input, calculate the mean and the standard deviation, separate it by class, and finally make predictions.
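The scikit-learn version described here can be sketched as below. It follows the spoken steps (load Iris, fit GaussianNB, predict, then report) and, as in the walkthrough, evaluates on the same data the model was fitted on, so the scores are optimistic compared with a held-out test set.

```python
from sklearn import datasets, metrics
from sklearn.naive_bayes import GaussianNB

# Load the Iris data set bundled with scikit-learn.
dataset = datasets.load_iris()

# Fit a Gaussian Naive Bayes model to the data.
model = GaussianNB()
model.fit(dataset.data, dataset.target)

# Expected labels versus the labels predicted by the fitted model.
expected = dataset.target
predicted = model.predict(dataset.data)

# Summarize the fit: per-class precision, recall, F1 score, support, then the confusion matrix.
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
```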
Calculating the prediction accuracy is covered as well. All of this comes with GaussianNB, which is already present in the sklearn library. We just need to fit it to the data set we have. Next, if we print the model, we see that it is the GaussianNB model. The next thing we need to do is make the predictions. The expected output is dataset.target, and the predicted output comes from the model's predict method, the model being GaussianNB here. Now, to summarize the model we created, we calculate the confusion matrix and the classification report. So guys, as you can see in the classification report, we have a precision of 0.96, a recall of 0.96, the F1 score, and the support. And finally, if we print our confusion matrix, as you can see, it gives us this output. So as you can see, using the GaussianNB method, fitting the model to a particular data set, and getting the desired output is so easy with the scikit-learn library. So guys, this is it. I hope you understood a lot about the Naive Bayes classifier: how it is used, where it is used, what steps are involved in the classification technique, and how scikit-learn makes all of those techniques very easy to apply to any data set we have. SVM, or support vector machine, is one of the most effective machine learning classifiers, and it has been used in various fields such as face recognition, cancer classification, and so on. Today's session is dedicated to how SVM works, the various features of SVM, and how it is used in the real world. So without any further ado, let's take a look at the agenda for today. We're going to begin the session with an introduction to machine learning and the different types of machine learning.
Next we'll discuss what exactly support vector machines are, and then we'll move on and see how SVM works and how it can be used to classify linearly separable data. We'll also briefly discuss how nonlinear SVMs work, then we'll move on and look at the use case of SVM in colon cancer classification, and finally we'll end the session by running a demo where we'll use SVM to predict whether a patient is suffering from heart disease or not. Okay, so that was the agenda. Let's get started with our first topic. So what is machine learning? Machine learning is a science of getting computers to act by feeding them data and letting them learn a few tricks on their own. Okay, we're not going to explicitly program the machine; instead, we're going to feed it data and let it learn. The key to machine learning is the data. Machines learn just like us humans. We humans need to collect information and data to learn; similarly, machines must also be fed data in order to learn and make decisions. Let's say that you want a machine to predict the value of a stock. In such situations, you just feed the machine relevant data, after which you develop a model which is used to predict the value of the stock. Now, one thing to keep in mind is that the more data you feed the machine, the better it will learn and the more accurate its predictions will be. Obviously machine learning is not so simple. In order for a machine to analyze and get useful insights from data, it must process and study the data by running different algorithms on it. All right. And today we'll be discussing one of the most widely used algorithms, called the support vector machine. Okay. Now that you have a brief idea about what machine learning is, let's look at the different ways in which machines learn. First, we have supervised learning. In this type of learning the machine learns under guidance. All right, that's why it's called supervised learning. Think back to school.
Our teachers guided us and taught us. Similarly, in supervised learning, machines learn by being fed labeled data, explicitly telling them: hey, this is the input and this is how the output must look. Okay. So guys, the teacher in this case is the training data. Next we have unsupervised learning. Here the data is not labeled and there is no guide of any sort. Okay, the machine must figure out the data set given and must find hidden patterns in order to make predictions about the output. An example of unsupervised learning is an adult like you and me. We don't need a guide to help us with our daily activities. We figure things out on our own without any supervision. All right, that's exactly how unsupervised learning works. Finally, we have reinforcement learning. Let's say you were dropped off on an isolated island. What would you do? Initially you would panic, and you'd be unsure of what to do, where to get food from, how to live, and all of that. But after a while you will have to adapt. You must learn how to live on the island, adapt to the changing climate, and learn what to eat and what not to eat. You're basically following the hit and trial method, because you're new to the surroundings and the only way to learn is to experience and then learn from your experience. This is exactly what reinforcement learning is. It is a learning method wherein an agent interacts with its environment by producing actions and discovers errors or rewards. Alright, and once it gets trained, it is ready to predict the new data presented to it. Now in our case the agent was you, basically stuck on the island, and the environment was the island. All right? Okay, now let's move on and see what the SVM algorithm is all about.
So guys, SVM, or support vector machine, is a supervised learning algorithm which is mainly used to classify data into different classes. Unlike most algorithms, SVM makes use of a hyperplane, which acts like a decision boundary between the various classes. In general, SVM can be used to generate multiple separating hyperplanes so that the data is divided into segments, and each of these segments will contain only one kind of data. It's mainly used for classification purposes, wherein you want to classify your data into two different segments depending on the features of the data. Now, before moving any further, let's discuss a few features of SVM. Like I mentioned earlier, SVM is a supervised learning algorithm. This means that SVM trains on a set of labeled data. SVM studies the labeled training data and then classifies any new input data depending on what it learned in the training phase. A main advantage of the support vector machine is that it can be used for both classification and regression problems. All right. Even though SVM is mainly known for classification, the SVR, which is the support vector regressor, is used for regression problems. All right, so SVM can be used both for classification and for regression. This is one of the reasons why a lot of people prefer SVM: it's a very good classifier, and along with that, it is also used for regression. Another feature is the SVM kernel functions. SVM can be used for classifying nonlinear data by using the kernel trick. The kernel trick basically means transforming your data into another dimension so that you can easily draw a hyperplane between the different classes of the data. Alright, nonlinear data is basically data which cannot be separated with a straight line. So SVM can even be used on nonlinear data sets; you just have to use kernel functions to do this. All right, so guys, I hope you all are clear on the basic concepts of SVM.
Let’s move on and look at how SVM works. So guys, in order to understand how SVM works, let’s consider a small scenario. For a second, pretend that you own a farm, and let’s say that you have a problem: you want to set up a fence to protect your rabbits from a pack of wolves. But where do you build your fence? One way to get around the problem is to build a classifier based on the position of the rabbits and wolves in your pasture. So what I’m telling you is you can classify the rabbits as one group and the wolves as another, and draw a decision boundary between the rabbits and the wolves. If I try to draw a decision boundary between the rabbits and the wolves, it looks something like this. Now you can clearly build a fence along this line. In simple terms, this is exactly how SVM works: it draws a decision boundary, which is a hyperplane, between any two classes in order to separate them or classify them. Now, I know you’re thinking: how do you know where to draw the hyperplane? The basic principle behind SVM is to draw a hyperplane that best separates the two classes, in our case the two classes of rabbits and wolves. So you start off by drawing a random hyperplane, and then you check the distance between the hyperplane and the closest data points from each class. These closest, or nearest, data points to the hyperplane are known as support vectors, and that’s where the name comes from: support vector machine. So basically the hyperplane is drawn based on these support vectors. An optimum hyperplane will have the maximum distance from each of these support vectors. In other words, the hyperplane which has the maximum distance from the support vectors is the most optimal hyperplane, and this distance between the hyperplane and the support vectors is known as the margin. All right.
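To make the margin idea concrete, here is a minimal sketch in plain Python. The rabbit and wolf positions and the two candidate fences are made up for illustration, and the helper names (distance_to_line, separates, margin) are hypothetical. It only compares two hand-picked lines by their margin; a real SVM searches for the best line through optimization rather than comparing candidates.

```python
import math

def distance_to_line(point, w, b):
    """Perpendicular distance from a 2-D point to the line w[0]*x + w[1]*y + b = 0."""
    x, y = point
    return abs(w[0] * x + w[1] * y + b) / math.hypot(w[0], w[1])

def separates(w, b, group_a, group_b):
    """True if group_a lies on the negative side of the line and group_b on the positive side."""
    side = lambda p: w[0] * p[0] + w[1] * p[1] + b
    return all(side(p) < 0 for p in group_a) and all(side(p) > 0 for p in group_b)

def margin(w, b, points):
    """Distance from the line to its closest points (the support vectors)."""
    return min(distance_to_line(p, w, b) for p in points)

# Made-up positions, for illustration only.
rabbits = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.5)]
wolves = [(5.0, 5.0), (6.0, 4.5), (5.5, 6.0)]

# Two candidate fences, each written as (w, b).
fence_a = ((1.0, 0.0), -2.5)  # the vertical line x = 2.5
fence_b = ((1.0, 1.0), -7.0)  # the diagonal line x + y = 7

for name, (w, b) in [("fence_a", fence_a), ("fence_b", fence_b)]:
    print(name, separates(w, b, rabbits, wolves), round(margin(w, b, rabbits + wolves), 3))

margin_a = margin(*fence_a, rabbits + wolves)
margin_b = margin(*fence_b, rabbits + wolves)
best = "fence_b" if margin_b > margin_a else "fence_a"
print("wider margin:", best)
```

Both fences separate the two groups, but the diagonal one keeps a larger distance from its nearest animals, so it is the one you would prefer under the maximum-margin principle.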
So to sum it up, SVM is used to classify data by using a hyperplane such that the distance between the hyperplane and the support vectors is maximum. So basically your margin has to be maximum. That way you know that you’re actually separating your classes well, because the distance between the two classes is maximum. Now, let’s try to solve a problem. Let’s say that I input a new data point, and now I want to draw a hyperplane such that it best separates the two classes. So I start off by drawing a hyperplane like this, and then I check the distance between the hyperplane and the support vectors. I’m trying to check if the margin is maximum for this hyperplane. But what if I draw a hyperplane which is like this? Now I’m going to check the support vectors over here, and then the distance from the support vectors. With this hyperplane it’s clear that the margin is larger when you compare it to the margin of the previous one. So the reason why I’m choosing this hyperplane is that the distance between the support vectors and the hyperplane is maximum in this scenario. So guys, this is how you choose a hyperplane: you basically have to make sure that the hyperplane has the maximum margin, so that it best separates the two classes. Okay, so far it was quite easy. Our data was linearly separable, which means that you could draw a straight line to separate the two classes. But what will you do if the data set is like this? You can’t possibly draw a hyperplane like this; it doesn’t separate the two classes at all. So what do you do in such situations? Earlier in the session I mentioned how a kernel can be used to transform data into another dimension that has a clear dividing margin between the classes of data.
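As a rough illustration of transforming data into another dimension, the sketch below lifts 2-D points into 3-D by adding a third coordinate. The choice z = x*x + y*y and the sample points are assumptions made for this example; the session does not say which transformation its slides use. Note also that this is an explicit feature map, whereas a real kernel SVM gets the same effect implicitly, without ever computing the new coordinates.

```python
def lift_to_3d(point):
    """Map (x, y) to (x, y, z) with z = x^2 + y^2 (an assumed, illustrative feature map)."""
    x, y = point
    return (x, y, x * x + y * y)

# Made-up data: one class sits near the origin, the other surrounds it in a ring,
# so no straight line in the x-y plane can separate them.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.4, 0.3), (0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, 2.0), (-1.5, 1.5), (1.4, -1.6)]

inner_z = [lift_to_3d(p)[2] for p in inner]
outer_z = [lift_to_3d(p)[2] for p in outer]

# In 3-D the flat plane z = threshold now separates the two classes.
threshold = (max(inner_z) + min(outer_z)) / 2
print("plane z =", threshold)
print("inner below plane:", all(z < threshold for z in inner_z))
print("outer above plane:", all(z > threshold for z in outer_z))
```

After the lift, every inner point has a small z and every outer point has a large z, so a flat plane does the job that no straight line could do in the original two dimensions.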
All right, so kernel functions offer the user this option of transforming nonlinear spaces into linear ones. A nonlinear data set is one that you can’t separate using a straight line. In order to deal with such data sets, you’re going to transform them into linear data sets and then use SVM on them. So guys, so far we were plotting our data in two-dimensional space, correct? We were only using the x and the y axes, so we had only those two variables, X and Y. Now, in order to deal with this kind of data, a simple trick would be to transform the two variables X and Y into a new feature space involving a new variable called Z. So we’re basically visualizing the data in three-dimensional space. When you transform the 2D space into a 3D space, you can clearly see a dividing margin between the two classes of data. Now you can go ahead and separate the two classes by drawing the best hyperplane between them. That’s exactly what we discussed in the previous slides. So guys, why don’t you try this yourself: try drawing the hyperplane which is the most optimum for these two classes. All right, I hope you have a good understanding of nonlinear SVMs now. Let’s look at a real-world use case of support vector machines. SVM as a classifier has been used in cancer classification since the early 2000s. There was an experiment held by a group of professionals who applied SVM to colon cancer tissue classification. The data set consisted of about 2,000 transmembrane protein samples, and only about 50 to 200 gene samples were input into the SVM classifier. The sample which was input into the SVM classifier had both colon cancer tissue samples and normal colon tissue samples.
The main objective of this study was to classify gene samples based on whether they are cancerous or not. So SVM was trained using the 50 to 200 samples in order to discriminate non-tumor from tumor specimens. The performance of the SVM classifier was very accurate even for a small data set. We had only 50 to 200 samples, and even for this small data set SVM was pretty accurate with its results. Not only that, its performance was compared to other classification algorithms like naive Bayes, and in each case SVM outperformed naive Bayes. So after this experiment it was clear that SVM classified the data more effectively and worked exceptionally well with small data sets. Let’s go ahead and understand what exactly unsupervised learning is. Sometimes the given data is unstructured and unlabeled, so it becomes difficult to classify the data into different categories. Unsupervised learning helps to solve this problem. This learning is used to cluster the input data into classes on the basis of their statistical properties. For example, we can cluster different bikes based upon their speed limit, their acceleration, or the average that they give. So unsupervised learning is a type of machine learning algorithm used to draw inferences from data sets consisting of input data without labeled responses. If you have a look at the workflow, or the process flow, of unsupervised learning: the training data is a collection of information without any labels, then we have the machine learning algorithm, and then we have the clustering models. What the model does is distribute the data into different clusters, and if you then provide any new data, it will make a prediction and find out which cluster that particular data point belongs to. So one of the most important algorithms in unsupervised learning is clustering.
So let’s understand exactly what clustering is. Clustering is basically the process of dividing a data set into groups consisting of similar data points. It means grouping objects based on the information found in the data describing the objects or their relationships. So clustering models focus on identifying groups of similar records and labeling records according to the group to which they belong. This is done without the benefit of prior knowledge about the groups and their characteristics. In fact, we may not even know exactly how many groups there are to look for. These models are often referred to as unsupervised learning models, since there’s no external standard by which to judge the model’s classification performance; there are no right or wrong answers for these models. And if we talk about why clustering is used: the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. Sometimes the partitioning itself is the goal. The purpose of a clustering algorithm is to make sense of, and extract value from, large sets of structured and unstructured data. That is why clustering is used in the industry. And if you have a look at the various use cases of clustering in industry: first of all, it’s being used in marketing, for discovering distinct groups in customer databases, such as customers who make a lot of long-distance calls or customers who use the internet more than calls. It’s also used by insurance companies, for identifying groups of crop insurance policyholders (farmers) with a high average claim rate. It’s used in seismic studies to define probable areas of oil or gas exploration based on seismic data. It’s also used in the recommendation of movies, in Flickr photos, and by Amazon for recommending products according to which category they lie in.
So basically, if we talk about clustering, there are three types of clustering. First of all, we have exclusive clustering, which is hard clustering. Here an item, or data point, belongs exclusively to one cluster, not several clusters. An example of this is k-means clustering; k-means does this exclusive kind of clustering. Secondly, we have overlapping clustering, also known as soft clustering. In this, an item can belong to multiple clusters, and its degree of association with each cluster is shown. For example, we have fuzzy c-means clustering, which is used for overlapping clustering. And finally we have hierarchical clustering. When two clusters have a parent-child relationship, or a tree-like structure, it is known as hierarchical clustering. As you can see from the example here, we have a parent-child kind of relationship in the clusters given. So let’s understand what exactly k-means clustering is. K-means clustering is an algorithm whose main goal is to group similar elements or data points into a cluster. It is a process by which objects are classified into a predefined number of groups so that they are as dissimilar as possible from one group to another, but as similar as possible within each group. Now if you have a look at how the algorithm works here: first of all, it starts with identifying the number of clusters, which is K. Then we find the centroids, and we find the distance of each object to the centroids. Then we find the grouping based on the minimum distance. Then we check whether the centroids have converged: if true, we have our clusters; if false, we find the centroids again and repeat all of the steps again and again. So let me show you how exactly clustering works with an example here.
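The loop just described (pick K, compute distances to the centroids, group by minimum distance, recompute the centroids, repeat until they stop moving) can be sketched in plain Python as follows. This is a minimal illustration, not the code used in the demo; the function name kmeans, the sample points and the starting centroids are all assumptions made for the example, and math.dist needs Python 3.8 or later.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: K is the number of starting centroids passed in."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Grouping step: assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [math.dist(p, c) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        # Convergence check: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Made-up data with two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]

# A deliberately poor start: both initial centroids sit in the same group.
centroids, clusters = kmeans(points, [(1.0, 1.0), (1.0, 2.0)])
print(centroids)
print(clusters)
```

Even with both starting centroids in the lower-left group, the assignments shift over a couple of iterations until each centroid settles in the middle of one group.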
So first we need to decide the number of clusters to be made. Another important task here is how to decide the right number of clusters; we’ll get into that later. So first, let’s assume that the number of clusters we have decided on is three. After that we provide the centroids for all the clusters, which is guessing, and the algorithm calculates the Euclidean distance of each point from each centroid and assigns the data point to the closest cluster. Now the Euclidean distance, as all of you know, is the square root of the sum of the squared differences between the coordinates. Next, the centroids are calculated again, and we have our new clusters for each data point. Then again the distances from the points to the new centroids are calculated, and again the points are assigned to the closest cluster. And then again we have the new centroids calculated. These steps are repeated until we have a repetition, that is, until the new centroids are very close to the previous ones. So unless our output gets repeated, or the outputs are very close, we do not stop this process: we keep on calculating the Euclidean distance of all the points to the centroids, and then we calculate the new centroids. That is how k-means clustering works, basically. An important part here is to understand how to decide the value of K, or the number of clusters, because it does not make any sense if you do not know how many clusters you are going to make. To decide the number of clusters, we have the elbow method. First of all, compute the sum of squared errors, which is the SSE, for some values of K, for example two, four, six and eight.
Now the SSE is defined as the sum of the squared distances between each member of a cluster and its centroid; mathematically, it is given by the equation provided here. And if you plot K against the SSE, you will see that the error decreases as K gets larger. This is because as the number of clusters increases, the clusters become smaller, so the distortion is also smaller. The idea of the elbow method is to choose the K at which the decrease in SSE changes abruptly, that is, where the curve bends. For example, if we have a look at the figure given here, we see that the best number of clusters is at the elbow: the graph changes abruptly after the number four. So for this particular example, we’re going to use four as the number of clusters. While working with k-means clustering there are two key points to know. First of all, be careful about where you start: choose the first center at random, choose the second center so that it is far away from the first, and similarly choose the i-th center as far away as possible from the closest of the other centers. The second idea is to do many runs of k-means, each with different random starting points, so that you get an idea of how many clusters you need to make, where exactly the centroids lie, and how the data is converging. Now k-means is not exactly a very good method, so let’s understand the pros and cons of k-means clustering. We know that k-means is simple and understandable; everyone learns it in the first go. The items are automatically assigned to the clusters. Now if we have a look at the cons: first of all, one needs to define the number of clusters, and that’s a very heavy task. Whether we have three, four, or ten categories, if you do not know what the number of clusters is going to be, it’s very difficult for anyone to guess it.
Next, all the items are forced into clusters whether or not they actually belong to any cluster or category; they are forced to lie in the category to which they are closest. This again happens because of not defining, or not being able to guess, the correct number of clusters. And most of all, it’s unable to handle noisy data and outliers. Machine learning engineers and data scientists have to clean the data anyway, but then again it comes down to the analysis they’re doing and the method they’re using. Typically people do not clean the data for k-means clustering, or even if they clean it, there is sometimes noisy data and outliers which affect the whole model. So that was all for k-means clustering. What we’re going to do now is use k-means clustering on the movie data set, so we have to find out the number of clusters and divide the data accordingly. The use case is that we have a data set of five thousand movies, and what we want to do is group the movies into clusters based on their Facebook likes. So guys, let’s have a look at the demo here. First of all, what we’re going to do is import deepcopy, NumPy, pandas, Seaborn, the various libraries which we’re going to use, and from matplotlib we’re going to use pyplot, with the ggplot style. Next, we’re going to import the data set and look at its shape. If you have a look at the shape of the data set, we can see that it has 5,043 rows with 28 columns.
And if you have a look at the head of the data set, we can see it has 5,043 data points. So what we’re going to do is place the data points in the plot. We take the director Facebook likes, and we have a look at the data columns: face number in poster, cast total Facebook likes, director Facebook likes. What we have done here is take the director Facebook likes and the actor 3 Facebook likes, right? So we have 5,043 rows and two columns. Now we’re going to use k-means from sklearn, so first we import KMeans from sklearn.cluster. Remember guys, sklearn is a very important library in Python for machine learning. For the number of clusters, what we’re going to do is provide five. Again, the number of clusters depends upon the SSE, which is the sum of squared errors, or we can use the elbow method; I’m not going to go into the details of that again. So we fit the data with the k-means fit method, and then we find the cluster centers for the k-means model and print them. What we find is an array of five cluster centers, and we also print the labels of the k-means clusters. Next, what we’re going to do is plot the data with the new clusters which we have found, and for this we’re going to use Seaborn. As you can see here, we have plotted the data into the grid, and you can see we have five clusters. What I would say is that cluster 3 and cluster 0 are very, very close. See, that’s exactly what I was going to say: the main challenge in k-means clustering is to define the number of centers, which is K. As you can see here, the third cluster and the zeroth cluster are very, very close to each other.
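The demo fixes the number of clusters at five and leaves the elbow method aside, so here is a small stand-alone sketch of how the SSE can be compared across several values of K. It is not the demo code: it uses a tiny one-dimensional k-means written in plain Python, made-up data with three obvious groups, and an assumed initialization (evenly spaced picks from the sorted data). The names kmeans_1d and sse are hypothetical.

```python
def kmeans_1d(values, k, max_iter=100):
    """Tiny 1-D k-means; starting centroids are evenly spaced picks from the sorted data."""
    data = sorted(values)
    centroids = [float(data[i * (len(data) - 1) // max(k - 1, 1)]) for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # An empty cluster keeps its old centroid.
        updated = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

def sse(centroids, clusters):
    """Sum of squared distances between each cluster member and its centroid."""
    return sum((v - c) ** 2 for c, cluster in zip(centroids, clusters) for v in cluster)

# Made-up data with three tight groups.
values = [1, 2, 3, 10, 11, 12, 20, 21, 22]

sse_by_k = {k: sse(*kmeans_1d(values, k)) for k in range(1, 6)}
for k, error in sse_by_k.items():
    print(k, round(error, 2))
```

On this toy data the SSE falls steeply up to K = 3 and barely changes afterwards, so the bend in the curve, the elbow, sits at three clusters.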
So guys, they probably could have been in one cluster. Another disadvantage was that we do not exactly know how the points are to be arranged, so it’s very difficult to force the data into any other cluster, which makes our analysis a little difficult. It works fine, but sometimes it might be difficult to code k-means clustering. Now, let’s understand what exactly c-means clustering is. Fuzzy c-means is an extension of k-means, the popular simple clustering technique. Fuzzy clustering, also referred to as soft clustering, is a form of clustering in which each data point can belong to more than one cluster. K-means tries to find hard clusters, where each point belongs to one cluster, whereas fuzzy c-means discovers soft clusters. In a soft cluster, any point can belong to more than one cluster at a time, with a certain affinity value towards each. Fuzzy c-means assigns a degree of membership, which ranges from 0 to 1, to an object for a given cluster. There is a stipulation that the sum of the memberships of an object over all the clusters it belongs to must be equal to 1. So if the degrees of membership of this particular point to two of these clusters are 0.6 and 0.4, and you add them up, we get 1. That is one of the ideas behind fuzzy c-means. This affinity depends on the distance from the point to the center of the cluster: the closer the center, the higher the membership. Now again, we have the pros and cons of fuzzy c-means. First of all, it allows a data point to be in multiple clusters; that’s a pro. It’s a more natural representation of the behavior of genes, since genes usually are involved in multiple functions, so it is a very good type of clustering when we’re talking about genes. And if we talk about the cons: first, we have to define c, which is the number of clusters, same as K.
We also need to determine the membership cutoff value, which takes a lot of time. And the clusters are sensitive to the initial assignment of centroids, so a slight change or deviation in the centers is going to result in a very different kind of output from fuzzy c-means. One of the major disadvantages of c-means clustering is that it’s a non-deterministic algorithm, so it does not give you one particular output as such. So that’s that. Now let’s have a look at the third type of clustering, which is hierarchical clustering. Hierarchical clustering is an alternative approach which builds a hierarchy from the bottom up or from the top down, and does not require us to specify the number of clusters beforehand. The algorithm works like this: first of all, we put each data point in its own cluster, then identify the closest two clusters and combine them into one cluster, and repeat this step till all the data points are in a single cluster. There are two types of hierarchical clustering: one is agglomerative clustering and the other is divisive clustering. Agglomerative clustering builds the dendrogram from the bottom level, while divisive clustering starts with all the data points in one cluster, the root cluster. Now again, hierarchical clustering also has some pros and cons. In the pros, no assumption of a particular number of clusters is required, and it may correspond to meaningful taxonomies. Whereas if we talk about the cons, once a decision is made to combine two clusters, it cannot be undone, and one of the major disadvantages of hierarchical clustering is that it becomes very slow on very large data sets. And nowadays I think every industry is using large data sets and collecting large amounts of data.
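The bottom-up procedure described above (every point starts in its own cluster, the two closest clusters are merged, and this repeats until one cluster is left) can be sketched like this. The session does not say how the distance between two clusters is measured, so this sketch assumes single linkage, meaning the distance between their two closest members. The function name and the sample points are made up for illustration, and math.dist needs Python 3.8 or later.

```python
import math

def agglomerative(points):
    """Bottom-up clustering with single linkage; returns the merges in the order they happen."""
    clusters = [[p] for p in points]  # every point starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest to each other.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Made-up points: two close pairs and one point far from both.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]

merges = agglomerative(points)
for left, right, distance in merges:
    print(left, "+", right, "at distance", round(distance, 2))
```

The list of merges is the information a dendrogram draws. The nested loops over every pair of clusters also hint at why this approach becomes slow on very large data sets.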
So hierarchical clustering is not the apt or the best method someone might need to go for. So there’s that. Hello everyone, and welcome to this interesting session on the Apriori algorithm. Many of us have visited retail shops such as Walmart or Target for our household needs. Let’s say that we are planning to buy a new iPhone from Target. What we would typically do is search for the model by visiting the mobile section of the store, then select the product and head towards the billing counter. But in today’s world, the goal of the organization is to increase revenue. Can this be done by just pitching one product at a time to the customer? The answer to this is clearly no. Hence organizations began mining data relating to frequently bought items. Market basket analysis is one of the key techniques used by large retailers to uncover associations between items. Examples could be: customers who purchase bread have a 60 percent likelihood of also purchasing jam; customers who purchase laptops are more likely to purchase laptop bags as well. Retailers try to find associations between different items and products that can be sold together, which assists in the right product placement. Typically, it figures out what products are being bought together, and organizations can place products accordingly. For example, people who buy bread also tend to buy butter, right? The marketing team at a retail store should target customers who buy bread and butter and provide an offer to them so that they buy a third item, suppose eggs. So if a customer buys bread and butter and sees a discount offer on eggs, he will be encouraged to spend more and buy the eggs. This is what market basket analysis is all about.
This is what we are going to talk about in this session: association rule mining and the Apriori algorithm. An association rule can be thought of as an if-then relationship. Just to elaborate on that, we come up with a rule: suppose an item A is being bought by the customer, then the chance of item B being picked by the customer under the same transaction ID is found out. You need to understand here that it’s not a causality; rather, it’s a co-occurrence pattern that comes to the fore. Now, there are two elements to this rule: first the if, and second the then. The if part is also known as the antecedent; this is an item or a group of items that is typically found in the item set. The latter one is called the consequent; this comes along as an item with an antecedent, or a group of antecedents, in a purchase. If we look at the image here, A arrow B means that if a person buys item A, then he will also buy item B, or rather he will most probably buy item B. The example that I gave you about the bread, butter and eggs is just a small one. But what if you have thousands and thousands of items? If you go to any professional data scientist with that data, you can just imagine how much profit you can make if the data scientist provides you with the right examples and the right placement of the items, and you can get a lot of insights. That is why association rule mining is a very good technique which helps the business make profit. So let’s see how this algorithm works.
So association rule mining is all about building rules, and we have just seen one rule: if you buy A, then there is a chance that you might buy B also. This type of relationship, between two items, is known as single cardinality. But what if the customer who bought A and B also wants to buy C, or a customer who bought A, B and C also wants to buy D? In these cases the cardinality increases, and we can have a lot of combinations around this data. If you have 10,000 or more items, just imagine how many rules you’re going to create for each product. That is why association rule mining has measures, so that we do not end up creating tens of thousands of rules. That is where the Apriori algorithm comes in. But before we get into the Apriori algorithm, let’s understand the maths behind it. There are three metrics which help to measure the association: support, confidence and lift. Support is the frequency of item A, or of the combination of items A and B. It’s basically how frequently the items, or the combination of items, have been bought. With this, what we can do is filter out the items which have been bought less frequently. That is one of the measures, support. Now what does confidence tell us? Confidence gives us how often the items A and B occur together, given the number of times A occurs. This also helps us solve a lot of other problems, because if somebody is buying A and B together and not buying C, we can just rule out C at that point. So this solves another problem: we obviously do not need to analyze the products which people buy rarely.
So what we can do is, according to the sales, define our minimum support and confidence. When you have set these values, we can put them into the algorithm, filter the data, and create different rules. But suppose even after filtering you have, like, five thousand rules, and for every item we create these 5,000 rules; that’s practically impossible to work with. For that we need the third calculation, which is the lift. Lift is basically the strength of a rule: it is the support of A and B together divided by the product of the support of A and the support of B. Let’s have a look at the denominator of the formula given here. We have the independent support values of A and B, which gives us the independent occurrence probability of A and B. Obviously there’s a lot of difference between random co-occurrence and association. If the denominator of the lift is larger, it means that the co-occurrence is more due to randomness than to any association. So lift is the final verdict, where we know whether or not we have to spend time on this particular rule. Now, let’s have a look at a simple example of association rule mining. Suppose we have a set of items A, B, C, D and E, and a set of transactions T1, T2, T3, T4 and T5. As you can see here, T1 has A, B, C; T2 has A, C, D; T3 has B, C, D; T4 has A, D, E; and T5 has B, C, E. What we generally do is create some association rules, such as A gives D, C gives A, A gives C, and B and C give A. What this basically means is that if a person buys A, then he’s most likely to buy D, and if a person buys C, then he’s most likely to buy A. And if you have a look at the last one, if a person buys B and C, he is most likely to buy item A as well. Now if we calculate the support, confidence and lift using these rules, as you can see here in the table, we have each rule and its support, confidence and lift values. Let’s discuss Apriori.
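The table of values for this example is only shown on the slide, so here is a small sketch that computes support, confidence and lift for the four rules over the five transactions just listed. It uses the standard definitions: support is the fraction of transactions containing the items, confidence is the support of A and B together divided by the support of A, and lift is that confidence divided by the support of B. The function names are made up for illustration.

```python
# The five transactions from the example above.
transactions = {
    "T1": {"A", "B", "C"},
    "T2": {"A", "C", "D"},
    "T3": {"B", "C", "D"},
    "T4": {"A", "D", "E"},
    "T5": {"B", "C", "E"},
}

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions.values()) / len(transactions)

def confidence(antecedent, consequent):
    """How often the consequent appears in transactions that contain the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is on its own."""
    return confidence(antecedent, consequent) / support(consequent)

# The four rules mentioned in the example, as (antecedent, consequent) pairs.
rules = [({"A"}, {"D"}), ({"C"}, {"A"}), ({"A"}, {"C"}), ({"B", "C"}, {"A"})]

for antecedent, consequent in rules:
    print(
        sorted(antecedent), "->", sorted(consequent),
        "support", round(support(antecedent | consequent), 2),
        "confidence", round(confidence(antecedent, consequent), 2),
        "lift", round(lift(antecedent, consequent), 2),
    )
```

A lift above 1 suggests the items occur together more often than chance alone would explain, and a lift below 1 suggests the opposite. On this toy data only the rule A gives D comes out above 1.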
So the Apriori algorithm uses frequent item sets to generate association rules, and it is based on the concept that a subset of a frequent item set must also be a frequent item set. Now this raises the question: what exactly is a frequent item set? A frequent item set is an item set whose support value is at least the threshold value. Just now we discussed that the marketing team, according to the sales, has a minimum threshold value for the confidence as well as the support. So a frequent item set is an item set whose support value meets the threshold value already specified. For example, if {A, B} is a frequent item set, then A and B should also be frequent item sets individually. Now let’s consider the following transactions to make things easier. Suppose we have transactions T1 to T5 over the items 1 to 5: T1 has 1, 3 and 4; T2 has 2, 3 and 5; T3 has 1, 2, 3 and 5; T4 has 2 and 5; and T5 has 1, 3 and 5. The first step is to build a list of item sets of size 1 by using this transactional data. One thing to note here is the minimum support count; let’s suppose it’s 2. So the first step is to create item sets of size 1 and calculate their support values. As you can see here, we have the table C1, in which we have the item sets 1, 2, 3, 4 and 5 and their support values. If you remember the formula of support, it was the frequency divided by the total number of transactions; here we work with the frequency, the support count, directly. For item set {1} the support is 3, because item 1 appears in T1, T3 and T5, so its frequency is 3. Item set {4} has a support of 1, as it occurs only once, in transaction T1. But the minimum support value is 2, so it’s going to be eliminated.
So we have the final table, F1, in which we have the item sets 1, 2, 3 and 5 with the support values 3, 3, 4 and 4. The next step is to create item sets of size 2 and calculate their support values. All the combinations of the item sets in F1, the final table from which item 4 has been discarded, are going to be used for this iteration. So we get the table C2: {1, 2}, {1, 3}, {1, 5}, {2, 3}, {2, 5} and {3, 5}. If you calculate the support here again, we can see that the item set {1, 2} has a support of 1, which is again less than the specified threshold, so we’re going to discard it. If we have a look at the table F2, we have {1, 3}, {1, 5}, {2, 3}, {2, 5} and {3, 5}. Again we’re going to move forward and create the item sets of size 3 and calculate their support values. All the combinations from F2 are going to be used for this iteration. Before calculating support values, let’s perform pruning on the data set. Now what is pruning? After the combinations are made, we check each C3 item set to see if it has a subset whose support is less than the minimum support value; that follows from what a frequent item set means. The item sets we have are {1, 2, 3}, {1, 2, 5}, {1, 3, 5} and {2, 3, 5}. We discard the first one because, if we have a look at the subsets of {1, 2, 3}, we have {1, 2} as well, so we are going to discard this whole item set. The same goes for the second one, {1, 2, 5}: it contains {1, 2}, which was discarded in the previous step, so we’re going to discard that also. That leaves us with only two item sets, {1, 3, 5} and {2, 3, 5}, and the support for these is 2 and 2.
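The level-by-level procedure walked through above (count the candidates, drop those below the minimum support count, combine the survivors into larger candidates, and prune any candidate that has an infrequent subset) can be sketched in plain Python on the same five transactions. This is an illustrative sketch, not the mlxtend implementation, and the function names are made up. One small difference from the walkthrough: this version applies the pruning step at every level.

```python
from itertools import combinations

# The five transactions from the walkthrough, with a minimum support count of 2.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3, 5}]
MIN_SUPPORT_COUNT = 2

def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions)

def apriori_frequent_itemsets():
    """Return one dict per level (F1, F2, ...) mapping frequent item sets to support counts."""
    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset([i]) for i in items
                if support_count(frozenset([i])) >= MIN_SUPPORT_COUNT]
    levels = []
    k = 1
    while frequent:
        levels.append({s: support_count(s) for s in frequent})
        k += 1
        previous = set(frequent)
        # Candidate generation: join frequent sets of size k-1 into sets of size k.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in previous for sub in combinations(c, k - 1))}
        frequent = [c for c in candidates if support_count(c) >= MIN_SUPPORT_COUNT]
    return levels

levels = apriori_frequent_itemsets()
for size, level in enumerate(levels, start=1):
    print("F%d:" % size, {tuple(sorted(s)): n for s, n in level.items()})
```

The three levels it returns correspond to the tables F1, F2 and F3 from the walkthrough, with {1, 3, 5} and {2, 3, 5} as the only frequent item sets of size three.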
Now if we create the table C4 using four elements, we have only one item set, {1, 2, 3, 5}, and if you look at the transaction table, {1, 2, 3, 5} appears only once. So its support is 1, and since that is less than 2, we stop here and return to the previous table, F3. So the frequent item sets are {1, 3, 5} and {2, 3, 5}. Now let’s assume our minimum confidence value is 60 percent. For that, we generate all the non-empty subsets of each frequent item set. For I = {1, 3, 5} we get the subsets {1, 3}, {1, 5}, {3, 5}, {1}, {3}, and {5}; similarly, for {2, 3, 5} we get {2, 3}, {2, 5}, {3, 5}, {2}, {3}, and {5}. Now the rule states that for every subset S of I, we output the rule S → (I − S), meaning S recommends I − S, and only if the support of I divided by the support of S is greater than or equal to the minimum confidence value. Applying this to the item sets of F3, we get rule 1, {1, 3} → ({1, 3, 5} − {1, 3}), which means 1 and 3 give 5. The confidence is the support of {1, 3, 5} divided by the support of {1, 3}, which equals 2/3, or about 66%, which is greater than 60 percent, so rule 1 is selected. Rule 2 is {1, 5} → ({1, 3, 5} − {1, 5}), which means that if we have 1 and 5, we are also going to have 3. Its confidence is the support of {1, 3, 5} divided by the support of {1, 5}, which is 2/2, or a hundred percent, so rule 2 is selected as well. But if you look at rules 5 and 6, it’s different. Rule 5 is {3} → ({1, 3, 5} − {3}), which means that if you have 3, you also get 1 and 5. The confidence for this comes to 50%, which is less than the given 60 percent target.
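The confidence arithmetic for these rules can be checked with a short sketch in plain Python. It recomputes support counts from the five example transactions and lists every rule S → (I − S) for I = {1, 3, 5}. The helper names `support_count` and `rules_from` are illustrative assumptions, and the order in which the rules are printed is not meant to match the rule numbers on the slide.

```python
from itertools import combinations

# Transactions T1..T5 from the walkthrough.
transactions = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
    {1, 3, 5},
]
MIN_CONFIDENCE = 0.60  # 60 percent, as assumed in the example


def support_count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def rules_from(itemset):
    """Yield (antecedent, consequent, confidence) for each non-empty proper subset."""
    itemset = frozenset(itemset)
    for size in range(1, len(itemset)):
        for subset in combinations(sorted(itemset), size):
            s = frozenset(subset)
            # confidence(S -> I - S) = support(I) / support(S)
            confidence = support_count(itemset) / support_count(s)
            yield s, itemset - s, confidence


rules = list(rules_from({1, 3, 5}))
for s, rest, confidence in rules:
    verdict = "selected" if confidence >= MIN_CONFIDENCE else "rejected"
    print(f"{sorted(s)} -> {sorted(rest)}: confidence {confidence:.0%} ({verdict})")
```

The two rules with a single item 3 or 5 on the left-hand side come out at 50% and are rejected, while {1, 3} → {5} (about 67%) and {1, 5} → {3} (100%) pass the 60 percent threshold.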
So we reject rule 5, and the same goes for rule 6. One thing to keep in mind is that although rule 1 and rule 5 look a lot alike, they are not the same; it really depends on what’s on the left-hand side of the arrow and what’s on the right-hand side. It’s an if-then relationship. I’m sure you can understand what exactly these rules are and how to proceed with them. So let’s see how we can implement the same in Python. For that I’m going to create a new Python notebook; I’m going to use the Jupyter notebook, but you’re free to use any sort of IDE. I’m going to name it Apriori. We will be using the online transactional data of a retail store for generating association rules. So firstly we need to import the pandas and mlxtend libraries and read the file. As you can see here, we are using the Online Retail file in .xlsx format, and from mlxtend we import apriori and association_rules; it all comes under mlxtend. As you can see, we have the invoice number, the stock code, the description, the quantity, the invoice date, the unit price, the customer ID, and the country. In the next step we do some data cleanup, which includes removing the spaces from some of the descriptions, dropping the rows that do not have invoice numbers, and removing the credit transactions, because those are of no use to us. As you can see in the output, we have about five hundred and thirty-two thousand rows with eight columns. After the cleanup, we need to consolidate the items into one transaction per row, with each product in its own column. For the sake of keeping the data set small, we are only looking at the sales for France. So as you can see here, we have excluded all the other sales. Now, there are a lot of zeros in the data.
But we also need to make sure that any positive values are converted to 1 and anything less than zero is set to 0. As you can see here, we still have 392 rows. We encode the data and check again. Now that the data is structured properly, in this step we generate frequent item sets that have a support of at least seven percent; this number is chosen so that we get enough examples to work with, and then we generate the rules with their corresponding support, confidence, and lift. As you can see here, the minimum support is 0.07. Now what if we add another constraint on the rules, such as the lift being greater than 6 and the confidence being greater than 0.8? As you can see here, we have the left-hand side and the right-hand side of the association rule, which are the antecedents and the consequents. We have the support, the confidence, the lift, the leverage, and the conviction. So guys, that’s it for this session. That is how you create association rules using the Apriori algorithm, which helps a lot in the marketing business. It runs on the principle of market basket analysis, which is exactly what big companies like Walmart, Reliance, Target, and even Ikea do. I hope you got to know what exactly association rule mining is, what lift, confidence, and support are, and how to create association rules.

So guys, reinforcement learning is a part of machine learning where an agent is put in an environment and learns to behave in this environment by performing certain actions. It performs actions and either gets a reward or a punishment for them, and it observes the reward that it gets from those actions. Reinforcement learning is all about taking an appropriate action in order to maximize the reward in a particular situation.
So guys, in supervised learning the training data comprises the input and the expected output, so the model is trained with the expected output itself. But when it comes to reinforcement learning, there is no expected output. The reinforcement agent decides what actions to take in order to perform a given task, and in the absence of a training data set it is bound to learn from its own experience. So reinforcement learning is all about an agent who’s put in an unknown environment and uses a trial-and-error method to figure out the environment and then come up with an outcome. Now, let’s look at reinforcement learning with an analogy. Consider a scenario where a baby is learning how to walk. The scenario can go about in two ways. In the first case the baby starts walking and makes it to the candy. The candy is basically the reward it’s going to get, and since the candy is the end goal, the baby is happy. It’s positive: the baby is happy and gets rewarded with a set of candies. Another way this could go is that the baby starts walking but falls due to some hurdle in between. The baby gets hurt and doesn’t get any candy, and obviously the baby is sad. So this is a negative reward, or you can say this is a setback. Just like how we humans learn from our mistakes by trial and error, reinforcement learning is similar. We have an agent, which is basically the baby, and a reward, which is the candy. With many hurdles in between, the agent is supposed to find the best possible path to reach the reward. So guys, I hope you all are clear with reinforcement learning. Now, let’s look at the reinforcement learning process. Generally a reinforcement learning system has two main components: the first is an agent and the second one is an environment.
Now in the previous case, we saw that the agent was the baby and the environment was the living room in which the baby was crawling. The environment is the setting that the agent is acting on, and the agent represents the reinforcement learning algorithm. So guys, the reinforcement learning process starts when the environment sends a state to the agent. The agent then takes some action based on what it observes, and in turn the environment sends the next state and the respective reward back to the agent. The agent updates its knowledge with the reward returned by the environment and uses that to evaluate its previous action. This loop keeps continuing until the environment sends a terminal state, which means that the agent has accomplished all its tasks and finally gets the reward. This is exactly what was depicted in this scenario: the agent keeps climbing up ladders until he reaches his reward. To understand this better, let’s suppose that our agent is learning to play Counter-Strike. Let’s break it down. Initially the RL agent, which is basically the player, let’s say player 1, is trying to learn how to play the game. He collects some state from the environment; this could be the first state of Counter-Strike. Based on that state the agent will take some action, and this action can be anything that causes a result. So if the player moves left or right, that’s also considered an action. Initially the action is going to be random, because obviously the first time you pick up Counter-Strike you’re not going to be a master at it. So you try different actions and just pick a random action in the beginning. Now the environment gives a new state. After the player clears that, the environment gives a new state to the agent, or to the player. So maybe he’s cleared stage 1 now.
He’s in stage 2. So now the player gets a reward R1 from the environment because he cleared stage 1. This reward can be anything: additional points, coins, or anything like that. This loop keeps going on until the player is dead or reaches the destination, and it continuously outputs a sequence of states, actions, and rewards. So guys, this was a small example to show you how the reinforcement learning process works. You start with an initial state, and once the player clears that state he gets a reward. After that the environment gives another state to the player, and after he clears that state he gets another reward, and this keeps happening until the player reaches his destination. All right, so guys, I hope this is clear. Now, let’s move on and look at the reinforcement learning definitions. There are a few concepts that you should be aware of while studying reinforcement learning. First we have the agent. An agent is basically the reinforcement learning algorithm that learns from trial and error. An agent takes actions; for example, a soldier in Counter-Strike navigating through the game is taking an action, and if he moves left or right or shoots at somebody, that’s also an action. So the agent is responsible for taking actions in the environment. Now the environment is the whole Counter-Strike game. It’s basically the world through which the agent moves. The environment takes the agent’s current state and action as input, and it returns the agent’s reward and its next state as output. Next we have action: all the possible steps that an agent can take are called actions. Like I said, it can be moving right or left, shooting, or any of that. Then we have state. State is basically the current condition returned by the environment.
So whichever state you are in, whether you are in state 1 or in state 2, that represents your current condition. Next we have reward. A reward is basically an instant return from the environment to appraise your last action. It can be anything like coins or additional points. So basically a reward is given to an agent after it clears specific stages. Next we have policy. Policy is basically the strategy that the agent uses to find out his next action based on his current state; it is just the strategy with which you approach the game. Then we have value. Value is the expected long-term return with discount. Value and action value can be a little bit confusing for you right now, but as we move further, you’ll understand what I’m talking about. So value is basically the long-term return that you get with discount. Discount I’ll explain in the further slides. Then we have action value. Action value is also known as Q value. It’s very similar to value, except that it takes an extra parameter, which is the current action. So basically here you find the Q value depending on the particular action that you took. So guys, don’t get confused between value and action value. We’ll look at examples in the further slides and you will understand this better. Make sure that you’re familiar with these terms, because you’ll be seeing a lot of them in the further slides. Now before we move any further, I’d like to discuss a few more concepts. First we will discuss reward maximization. If you haven’t already realized it, the basic aim of the RL agent is to maximize the reward. Now, how does that happen? Let’s try to understand this in depth.
So the agent must be trained in such a way that he takes the best action so that the reward is maximum, because the end goal of reinforcement learning is to maximize your reward based on a set of actions. Let me explain this with a small game. In the figure you can see there is a fox, there’s some meat, and there’s a tiger. Our agent is basically the fox, and his end goal is to eat the maximum amount of meat before being eaten by the tiger. Since the fox is a clever fellow, he eats the meat that is closer to him rather than the meat which is closer to the tiger. This is because the closer he is to the tiger, the higher are his chances of getting killed. Because of this, the rewards which are near the tiger, even if they are bigger meat chunks, will be discounted. This is exactly what discounting means: our agent is not going to eat the meat chunks which are closer to the tiger because of the risk. Even though the meat chunks might be larger, he does not want to take the chance of getting killed. This is called discounting: you improvise and just eat the meat that is closer to you instead of taking risks and eating the meat that is closer to your opponent. Now the discounting of reward works based on a value called gamma. We’ll be discussing gamma in our further slides, but in short the value of gamma is between 0 and 1, and the smaller the gamma, the larger the discount. So if the gamma value is smaller, it means that the agent is not going to explore and is not going to try to eat the meat chunks which are closer to the tiger. But if the gamma value is closer to 1, it means that our agent is actually going to explore, and it’s going to try to eat the meat chunks which are closer to the tiger. All right, I’ll be explaining this in depth in the further slides.
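One common way to write discounting down, which is the standard textbook definition rather than anything shown in this session, is to weight a reward that arrives k steps in the future by gamma raised to the power k. The sketch below uses made-up reward numbers (a small chunk of meat right away versus a larger chunk three steps later); the function name `discounted_return` and the numbers are assumptions chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    return sum(r * gamma**k for k, r in enumerate(rewards))


# Hypothetical reward sequences, not taken from the session:
near = [1, 0, 0, 0]  # a small chunk of meat eaten immediately
far = [0, 0, 0, 5]   # a bigger chunk that takes three more steps to reach

for gamma in (0.1, 0.9):
    print(
        f"gamma={gamma}: near={discounted_return(near, gamma):.3f}, "
        f"far={discounted_return(far, gamma):.3f}"
    )
```

With a small gamma the distant reward is discounted so heavily that the nearby chunk is worth more, and with a gamma close to 1 the bigger, more distant chunk wins, which is the behaviour described above.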
So don’t worry if you haven’t got a clear concept yet, but just understand that reward maximization is a very important step when it comes to reinforcement learning, because the agent has to collect the maximum rewards by the end of the game. Now, let’s look at another concept, which is called exploration and exploitation. Exploration, like the name suggests, is about exploring and capturing more information about an environment. Exploitation, on the other hand, is about using the already known information to heighten the rewards. So guys, consider the fox and tiger example that we discussed. Here the fox eats only the meat chunks which are close to him, but he does not eat the meat chunks which are closer to the tiger, even though they might give him more rewards. If the fox only focuses on the closest rewards, he will never reach the big chunks of meat. This is what exploitation is about: you just use the currently known information and try to get rewards based on that information. But if the fox decides to explore a bit, it can find the bigger reward, which is the big chunks of meat. This is exactly what exploration is. The agent is not going to stick to one corner; instead, he’s going to explore the entire environment and try to collect bigger rewards. All right, so guys, I hope you all are clear with exploration and exploitation. Now, let’s look at the Markov decision process. This is basically a mathematical approach for mapping a solution in reinforcement learning. In a way, the purpose of reinforcement learning is to solve a Markov decision process. There are a few parameters that are used to get to the solution: the set of actions, the set of states, the rewards, the policy that you’re taking to approach the problem, and the value that you get.
Okay, so to sum it up, the agent must take an action A to transition from a start state to the end state S. While doing so, the agent will receive a reward R for each action that he takes. So guys, the series of actions taken by the agent defines the policy, or the approach, and the rewards that are collected define the value. The main goal here is to maximize the rewards by choosing the optimum policy. Now, let’s try to understand this with the help of the shortest path problem. I’m sure a lot of you might have gone through this problem when you were in college. So guys, look at the graph over here. Our aim is to find the shortest path between A and D with the minimum possible cost. The value that you see on each of these edges denotes the cost, so if I want to go from A to C, it’s going to cost me 15 points. Let’s look at how this is done. In this problem the set of states is denoted by the nodes, which are A, B, C, and D, and the action is to traverse from one node to the other. So if I’m going from A to B, that’s an action; similarly, A to C is an action. The reward is basically the cost, which is represented by each edge over here. Now the policy is basically the path that I choose to reach the destination. So let’s say I choose A, B, D: that’s one policy in order to get to D, and choosing A, C, D is also a policy. It’s basically how I’m approaching the problem. So guys, here you can start off at node A and take baby steps to your destination. Initially you’re clueless, so you can just take the next possible node which is visible to you. If you’re smart enough, you’re going to choose A to C instead of A, B, C, D or A, B, D. So now, if you are at node C and you want to traverse to node D,
you must again choose a wise path: you just have to calculate which path has the highest cost, or which path will give you the maximum reward. So guys, this is a simple problem. We just tried to calculate the shortest path between A and D by traversing through these nodes. If I traverse A, C, D, it gives me the maximum reward: it gives me 65, which is more than any other policy would give me. If I go A, B, D, it would be 40, so compared to that, A, C, D gives me more reward, and obviously I’m going to go with A, C, D. So guys, this was a simple problem to understand how a Markov decision process works. Now I want to ask you a question. What do you think I did here: did I perform exploration or did I perform exploitation? The policy for the above example is one of exploitation, because we didn’t explore the other nodes. We just selected three nodes and traversed through them; that’s why this is called exploitation. We must always explore the different nodes so that we can find a more optimal policy. In this case A, C, D obviously has the highest reward and we’re going with A, C, D, but generally it’s not so simple. There are a lot of nodes, hundreds of nodes to traverse, and there are like 50 or 60 different policies. So make sure you explore all the policies and then decide on an optimum policy which will give you the maximum reward. So guys, before we perform the hands-on part, let’s try to understand the math behind our demo. In our demo we’ll be using the Q-learning algorithm, which is a type of reinforcement learning algorithm. It’s simple: it just means that you take the best possible actions to reach your goal or to get the most rewards. Let’s try to understand this with an example. So guys, this is exactly what we’ll be running in our demo, so make sure you understand this properly.
So our goal here is to place an agent in any one of the rooms. These squares you see here are rooms: 0 is a room, 4 is a room, 3 is a room, 1 is a room, and so is 2, and 5 is also a room; it’s basically the way outside the building. So what we’re going to do is place an agent in any one of these rooms, and the goal is to reach outside the building, which is room number 5. These spaces are basically doors, which means that you can go from 0 to 4, from 4 to 3, from 3 to 1, from 1 to 5, and similarly from 3 to 2, but you can’t go from 5 to 2 directly. So there are certain rooms that are not connected directly. Like I’ve mentioned here, each room is numbered from 0 to 4, and the outside of the building is numbered 5. One thing to note here is that room 1 and room 4 lead directly to room number 5. So basically our goal over here is to get to room number 5. To set this room as the goal, we’ll associate a reward value with each door. Don’t worry, I’ll explain what I’m saying. If you represent these rooms in a graph, this is how the graph is going to look. For example, from 2 you can go to 3, then 3 to 1, and 1 to 5, which leads us to our goal. These arrows represent the links between the doors. This is quite understandable. Our next step is to associate a reward value with each of these doors. The rooms that are directly connected to our end room, which is room number 5, get a reward of a hundred. So basically room number 1 will have a reward of a hundred, obviously because it’s directly connected to 5. Similarly, 4 will also be associated with a reward of a hundred, because it’s directly connected to 5.
So if you go out from 4, it will lead to 5. Now the other nodes are not directly connected to 5, so you can’t go directly from 0 to 5. For these we’ll be assigning a reward of zero. So basically the other doors, not directly connected to the target room, have a zero reward. Now because the doors are two-way, two arrows are assigned to each room. You can see two arrows assigned to each room: zero leads to four and four leads back to zero. We have assigned 0 over here because 0 does not lead directly to 5, but 1 leads directly to 5, and that’s why you can see a hundred over here. Similarly, 4 leads directly to our goal state, and that’s why we’ve assigned a hundred over here, and obviously 5 to 5 is a hundred as well. So here all the direct connections to room number 5 are rewarded a hundred and all the indirect connections are awarded zero. So guys, in Q-learning the end goal is to reach the state with the highest reward, so that the agent arrives at the goal. Let me explain this graph to you in detail. These rooms over here, labeled 0 to 5, represent the state an agent is in. So if I say state 1, it means that the agent is in room number 1. Similarly, the agent’s movement from one room to another represents the action. So if I say 1 to 3, it represents an action. Basically the state is represented as a node and the action is represented by these arrows. So this is what this graph is about: these nodes represent the rooms and these arrows represent the actions. Let’s look at a small example. Let’s set the initial state to 2. So my agent is placed in room number 2, and he has to travel all the way to room number 5. If I set the initial state to 2, he can travel to state 3.
Okay, from 3 he can either go to 1, or he can go back to 2, or he can go to 4. If he chooses to go to 4, it will take him directly to room number 5, which is our end goal, and even if he goes from room number 3 to 1, it will take him to room number 5. So this is how our algorithm works: it’s going to traverse different rooms in order to reach the goal room, which is room number 5. Now, let’s try to depict these rewards in the form of a matrix, because we’ll be using this R matrix, or the reward matrix, to calculate the Q value, or the Q matrix. We’ll see what the Q value is in the next step, but for now, let’s see how this reward matrix is calculated. The -1s that you see in the table represent the null values: wherever there is no link between nodes, it’s represented as -1. So 0 to 0 is -1. From 0 to 1 there is no direct link, so it’s represented as -1. Similarly, from 0 to 2 or 3 there is no link; you can see there’s no line over here, so this is also -1. But when it comes to 0 to 4, there is a connection, and we have put 0 because the reward for a state which is not directly connected to the goal is zero. But if you look at (1, 5), which is basically traversing from node 1 to node 5, you can see the reward is a hundred. That’s because 1 and 5 are directly connected, and 5 is our end goal. So any node which is directly connected to our goal state gets a reward of a hundred. That’s why I’ve put a hundred over here. Similarly, if you look at the row for room 4, I’ve assigned a hundred over here, because from 4 to 5 there is a direct connection, which gives a reward of a hundred. You can see that from room number 4 to room number 5 you can go directly.
That’s why there’s a reward of a hundred over here. So guys, this is how the reward matrix is made. I hope this is clear to you all. Now that we have the reward matrix, we need to create another matrix called the Q matrix. Here you’ll store all the Q values that we calculate. This Q matrix basically represents the memory of what the agent has learned through experience. So once he traverses from one room to the final room, whatever he’s learned is stored in this Q matrix, so that he remembers it the next time he travels. It’s basically like a memory. So guys, the rows of the Q matrix represent the current state of the agent, the columns represent the possible actions, and to calculate the Q value we use this formula. I’ll show you what the Q matrix looks like, but first, let’s understand this formula. We’re calculating this Q value because we want to fill in the Q matrix. So this is basically a matrix that is initially all 0, but as the agent traverses from different nodes to the destination node, this matrix gets filled up. So basically it will be like a memory for the agent. He’ll know that when he traversed using a particular path, he found that his value, or his reward, was maximum over there, so next time he’ll choose that path. This is exactly what the Q matrix is. Now guys, don’t worry about this formula for now, because we’ll be implementing it in an example in the next slide. Just remember that this Q represents the Q matrix, the R represents the reward matrix, and the gamma is the gamma value, which I’ll talk about shortly, and here you’re just finding the maximum from the Q matrix.
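As a rough sketch of the two matrices just described, the reward matrix can be built from the list of doors and the Q matrix started as all zeros. The names `DOORS`, `build_reward_matrix`, and `q_value` are illustrative assumptions; the update inside `q_value` follows the formula as it is described here, Q(state, action) = R(state, action) + gamma × max Q(next state, all actions), with plain nested lists standing in for whatever the demo itself uses.

```python
GOAL = 5
N_ROOMS = 6
# Doors named in the walkthrough; every door can be used in both directions.
DOORS = [(0, 4), (4, 3), (3, 1), (3, 2), (1, 5), (4, 5)]


def build_reward_matrix():
    """-1 where there is no door, 100 for moves into the goal room, 0 otherwise."""
    R = [[-1] * N_ROOMS for _ in range(N_ROOMS)]
    for a, b in DOORS:
        for src, dst in ((a, b), (b, a)):
            R[src][dst] = 100 if dst == GOAL else 0
    R[GOAL][GOAL] = 100  # staying in the goal room is rewarded as well
    return R


def q_value(R, Q, state, action, gamma):
    """Q(state, action) = R(state, action) + gamma * max over valid actions of Q(next state, a)."""
    next_state = action  # taking action `a` means moving into room `a`
    best_next = max(
        Q[next_state][a] for a in range(N_ROOMS) if R[next_state][a] >= 0
    )
    return R[state][action] + gamma * best_next


R = build_reward_matrix()
Q = [[0.0] * N_ROOMS for _ in range(N_ROOMS)]  # the agent's memory, empty at first

for row in R:
    print(row)
```

Each row of `R` is a current room and each column is the room being moved into, so the row for room 1 shows 0 under column 3 and 100 under column 5, and -1 everywhere else.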
So basically the gamma parameter has a range from 0 to 1, so you can have a value of 0.1, 0.3, 0.5, 0.8, and so on. If the gamma is closer to zero, it means that the agent will consider only the immediate rewards, which means that the agent will not explore the surroundings. Basically, it won’t explore different rooms; it will just choose a particular room and then try sticking to it. But if the value of gamma is high, meaning that it’s closer to one, the agent will consider future rewards with greater weight. This means that the agent will explore all the possible approaches, or all the possible policies, in order to get to the end goal. So guys, this is what I was talking about when I mentioned exploitation and exploration. If the gamma value is closer to 1, it basically means that you’re actually exploring the entire environment and then choosing an optimum policy. But if your gamma value is closer to zero, it means that the agent will only stick to a certain set of policies and will calculate the maximum reward based on those policies. Now next, we have the Q-learning algorithm that we’re going to use to solve this problem. So guys, this is going to look very confusing to you all, so let me just explain it with an example. We’ll see what we’re actually going to run in our demo, we’ll do the math behind it, and then I’ll tell you what this Q-learning algorithm is. You’ll understand it as I’m showing you the example. So guys, in the Q-learning algorithm the agent learns from his experience. Each episode, which is basically when the agent traverses from an initial room to the end goal, is equivalent to one training session, and in every training session the agent will explore the environment and receive some reward until it reaches the goal state, which is 5. So the purpose of training is to enhance the brain of our agent.
Only if he knows the environment very well will he know which action to take, and this is why we calculate the Q matrix. In the Q matrix we calculate the value of traversing from every state to the end state, from every initial room to the end room. When we calculate all the values, or how much reward we’re getting from each policy, we know the optimum policy that will give us the maximum reward. That’s why we have the Q matrix. This is very important, because the more you train the agent, the more optimal your output will be. So basically here the agent will not perform exploitation; instead, he’ll explore around, go back and forth through the different rooms, and find the fastest route to the goal. Now, let’s look at an example and see how the algorithm works. Let’s go back to the previous slide. Here it says that the first step is to set the gamma parameter, so let’s do that. The first step is to set the value of the learning parameter, which is gamma, and we have arbitrarily set it to 0.8. The next step is to initialize the matrix Q to zero, so we’ve set the matrix Q to 0 over here. The third step is to select a random initial state, and here we’ve selected the initial state as room number 1. So after you initialize the matrix Q as a zero matrix, from room number 1 you can either go to room number 3 or room number 5. If you look at the reward matrix, you can see that from room number 1 you can only go to room number 3 or room number 5. The other values are -1, which means that there is no link from 1 to 0, 1 to 1, 1 to 2, or 1 to 4. So the only possible actions from room number 1 are to go to room number 3 and to go to room number 5. So let’s select room number 5.
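The setup steps above can be sketched in a few lines of plain Python: set gamma to 0.8, start Q as a zero matrix, start in room 1, list the moves the reward matrix allows, and apply the update Q(state, action) = R(state, action) + gamma × max Q(next state, all actions) to the selected move into room 5. The reward matrix is written out by hand from the description of the rooms, and the helper name `valid_actions` is an illustrative assumption rather than something taken from the demo.

```python
GAMMA = 0.8  # the learning parameter chosen in the walkthrough

# Reward matrix: rows are the current room, columns are the room moved into.
R = [
    [-1, -1, -1, -1, 0, -1],
    [-1, -1, -1, 0, -1, 100],
    [-1, -1, -1, 0, -1, -1],
    [-1, 0, 0, -1, 0, -1],
    [0, -1, -1, 0, -1, 100],
    [-1, 0, -1, -1, 0, 100],
]
Q = [[0.0] * 6 for _ in range(6)]  # initialise matrix Q to zero


def valid_actions(state):
    """Rooms that can be entered from `state` (every entry that is not -1)."""
    return [a for a, reward in enumerate(R[state]) if reward >= 0]


state = 1                        # the initial state picked in the walkthrough
actions = valid_actions(state)   # moves available from room 1
action = 5                       # the walkthrough selects room 5
next_state = action

# Largest Q value over all moves available from the next state.
best_next = max(Q[next_state][a] for a in valid_actions(next_state))
Q[state][action] = R[state][action] + GAMMA * best_next

print("actions from room 1:", actions)
print("actions from room 5:", valid_actions(next_state))
print("Q(1, 5) =", Q[state][action])
```

Because Q is still all zeros at this point, the maximum over the next state contributes nothing and the first update simply stores the immediate reward of 100 for the move from room 1 to room 5.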
So from room number 1 you can go to 3 or 5, and we have randomly selected 5. You could also select 3, but for this example let’s go with 5. Now, for room 5, you calculate the maximum Q value for the next state based on all possible actions. From room number 5, the next state can be room number 1, 4, or 5, so you calculate the Q values for traversing 5 to 1, 5 to 4, and 5 to 5, and you find out which is the maximum. That’s how you compute the Q value. So let’s apply our formula. This is the Q-learning formula: $Q(\text{state}, \text{action}) = R(\text{state}, \text{action}) + \gamma \cdot \max[Q(\text{next state}, \text{all actions})]$. Right now we’re traversing from room number 1 to room number 5, so here I’ve written $Q(1, 5)$. The 1 represents our current state, which is room number 1, our initial state, and we are traversing to room number 5, as shown in this figure. Next in our formula comes the reward matrix for this state and action. In the reward matrix, the entry for (1, 5) is 100, so $R(1, 5) = 100$. Then you add the gamma value, which we set to 0.8, multiplied by the maximum Q value for the next state over all possible actions. From 5, the next state can be 1, 4, or 5, so $Q(5, 1)$, $Q(5, 4)$, and $Q(5, 5)$ are the next possible actions that you can take from state 5. So $R(1, 5)$ is 100, because that’s the entry in the reward matrix, and 0.8 is the value of gamma.
After that, we calculate $Q(5, 1)$, $Q(5, 4)$, and $Q(5, 5)$. Like I mentioned earlier, we initialize the matrix Q as a zero matrix, because initially the agent obviously doesn’t have any memory of what is happening; it’s starting from scratch. That’s why all these values are 0: $Q(5, 1)$ is 0, $Q(5, 4)$ is 0, and $Q(5, 5)$ is also 0, so the maximum of these is 0. When you compute the equation, you get $Q(1, 5) = 100 + 0.8 \cdot 0 = 100$. So if the agent goes from room number 1 to room number 5, it gets a maximum reward, or Q value, of 100. In the next slide you can see that I’ve updated the value of $Q(1, 5)$; it’s set to 100. Now let’s look at another example so that you understand this better. This is exactly what we’re going to do in our demo, only there it will be coded; right now I’m just explaining the math behind the code. This time we’ll again start with a randomly chosen initial state; let’s say we’ve chosen state 3. From room 3 you can go to room number 1, 2, or 4. We’ll randomly select room number 1, and for room number 1 you calculate the maximum Q value for the next state based on all possible actions. The possible actions from 1 are to go to 3 and to go to 5. Let me explain the formula once again. $Q(3, 1)$ represents that we’re in room number 3 and we are going to room number 1. So 3 is our current state, and going from 3 to 1 is our action. Next we look at the reward of going from 3 to 1. If you go to the reward matrix, the entry for (3, 1) is 0.
This is because room 1 is not the goal and there’s no direct link between 3 and 5, so the step from 3 to 1 earns no immediate reward. That’s why the reward here is 0. After that we have the gamma value, which is 0.8, and then we calculate the maximum of $Q(1, 3)$ and $Q(1, 5)$; whichever has the maximum value is the one we use. $Q(1, 3)$ is 0, as you can see here, and $Q(1, 5)$, if you remember, we just calculated in the previous slide: it’s 100. So the maximum here is 100, and $Q(3, 1) = 0 + 0.8 \cdot 100 = 80$. That’s the Q value you get if you traverse from 3 to 1. I hope that was clear. So now we have traversed from room number 3 to room number 1 with a Q value of 80, but we still haven’t reached the end goal, which is room number 5. So for our next step, the state will be room number 1. Like I said, we repeat this in a loop, because room number 1 is not our end goal; our end goal is room number 5. So now we need to figure out how to get from room number 1 to room number 5. From room number 1 you can go to either 3 or 5; that’s what I’ve drawn over here. If we select 5, we know that it’s our end goal. For room number 5, you then have to calculate the maximum Q value over the next possible actions, which are to go to room number 1, room number 4, or room number 5. So you calculate the Q values of 5 to 1, 5 to 4, and 5 to 5, find the maximum, and use that value. Let’s look at the formula now. Again, we’re in room number 1 and want to go to room number 5, so that’s exactly what I’ve written here: $Q(1, 5)$. Next is the reward matrix: the reward $R(1, 5)$ is 100.
Then we add the gamma value, which is 0.8, multiplied by the maximum Q value among 5 to 1, 5 to 4, and 5 to 5. That’s what we’re computing over here. $Q(5, 1)$, $Q(5, 4)$, and $Q(5, 5)$ are all 0, because we initially set all the values of the Q matrix to 0, so you get 100 over here. The matrix remains the same, because we had already calculated $Q(1, 5)$; that value is already known to the agent. So when it comes back here, it knows it has already done this before, and it’s going to try another route, or another policy. It’s going to try going through different rooms and finally land up in room number 5. So guys, this is exactly how our code runs. We traverse through each and every node because we want an optimum policy, and an optimum policy is attained only when you traverse through all possible actions. Only if you go through all the actions you can perform will you understand which is the best action, the one that leads to the reward. I hope this is clear. Now let’s move on and look at our code. This code is written in Python, and I’m assuming that all of you have a good background in Python. If you don’t understand Python very well, I’m going to leave a link in the description; you can check out that video on Python and then come back to this later. I’ll be explaining the code to you anyway, but I’m not going to spend a lot of time on each and every line, because I’m assuming that you know Python. So let’s look at the first line of code. What we’re going to do is import NumPy. NumPy is a Python library that adds support for large multi-dimensional arrays and matrices, along with mathematical functions to operate on them.
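The two hand calculations above can be checked with a few lines of Python. This is only a sketch: the reward matrix `R` below is an assumption reconstructed from the links mentioned in the walkthrough (-1 for no link, 0 for a link, 100 for a link into goal room 5), and `q_update` is a hypothetical helper name, not the function from the demo.

```python
import numpy as np

# Assumed reward matrix (rows = current room, columns = next room):
# -1 means no link, 0 means a link, 100 means a link into goal room 5.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
GAMMA = 0.8
Q = np.zeros((6, 6))  # the agent starts with no memory


def q_update(Q, R, state, action, gamma=GAMMA):
    """Apply Q(s, a) = R(s, a) + gamma * max Q(next state, valid actions)."""
    next_state = action                      # taking action `a` moves to room `a`
    valid = np.where(R[next_state] >= 0)[0]  # rooms reachable from next_state
    Q[state, action] = R[state, action] + gamma * Q[next_state, valid].max()
    return Q[state, action]


first = q_update(Q, R, state=1, action=5)   # room 1 -> room 5
second = q_update(Q, R, state=3, action=1)  # room 3 -> room 1
print(first, second)
```

Under these assumptions the first update should reproduce the value 100 for $Q(1, 5)$ and the second the value 80 for $Q(3, 1)$, matching the slides.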
So first we import that. After that we create the R matrix. Next we create the Q matrix; it’s a 6 by 6 matrix, because we have six states, numbered 0 to 5, and we initialize all its values to zero. After that we set the gamma parameter to 0.8. You can play with this parameter: move it up to 0.9 or lower it, and see what happens. Then we set an initial state; the initial state is set to 1. After that we define a function called available_actions. Since our initial state is 1, we check row number 1 of the R matrix. (This is row number 0, this is row number 1, and so on.) In row number 1 we find the values that are greater than or equal to 0, because these correspond to the nodes that we can travel to. As I explained earlier, -1 marks the nodes that we cannot travel to, while values of 0 or more mark the nodes that we can. So over here we are checking for all the values that are greater than or equal to 0; these are our available actions. If our initial state is 1, we can travel to the states whose value in that row is greater than or equal to 0, and the result is stored in a variable called available_act. So this function gets the available actions in the current state, and we store those possible actions in the available_act variable. Since our initial state is 1, available_act holds the next possible states we can go to.
Next is a function that chooses at random which action to perform from the available ones. If you remember, we are initially in room number 1, and our available actions are to go to room number 3 or room number 5. We need to randomly choose one room, and that’s what this line of code does: it randomly picks one of the actions from available_act, which, like I said earlier, stores all our possible actions from the initial state. Once it chooses an action, it stores it in next_action, so next_action represents the next action to take. Next is our Q matrix. Remember the formula that we used? That formula is what we compute in the next few lines of code. This block of code computes the value of Q: $Q(\text{current state}, \text{action}) = R(\text{current state}, \text{action}) + \gamma \cdot \text{max value}$. Here we first calculate the maximum index, meaning that we check which of the possible actions gives us the maximum Q value. If you remember our explanation, for $\max[Q(5, 1), Q(5, 4), Q(5, 5)]$ we had to choose the maximum Q value among these three. That’s exactly what we’re doing in this line of code: calculating the index that gives us the maximum value. After we finish computing the value of Q, we just have to update our matrix. So this is the update function that is defined over here, and I’ve called the function over here. After updating the Q value, we choose a new initial state. This whole set of code just calculates the Q value, which is exactly what we did in our examples.
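The action-selection step described above might look roughly like this. It is a sketch, not the code from the video: the reward matrix is the same assumed layout used for the hand calculations, the names `available_actions`, `available_act`, and `next_action` follow the narration, and `sample_next_action` and the seeded generator `rng` are illustrative choices.

```python
import numpy as np

# Assumed reward matrix: -1 = no link, 0 = link, 100 = link into goal room 5.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])


def available_actions(state):
    """Return the rooms reachable from `state`: entries of row `state` that are >= 0."""
    return np.where(R[state] >= 0)[0]


def sample_next_action(available_act, rng):
    """Pick one of the available actions uniformly at random."""
    return int(rng.choice(available_act))


rng = np.random.default_rng(0)        # seeded only to make the sketch repeatable
available_act = available_actions(1)  # from room 1: rooms 3 and 5
next_action = sample_next_action(available_act, rng)
print(available_act, next_action)
```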
After that, we have the training phase. Remember, the more you train an algorithm, the better it’s going to learn. Over here I have provided 10,000 iterations. So my range is 10,000 iterations, meaning that my agent will go through 10,000 scenarios to find the best policy. What I’m doing here is choosing the current state randomly, then getting the available actions from the current state (for example, from state 1 I can go to either state 3 or state 5), then choosing the next action, and finally updating the value in the Q matrix. Next, we normalize the Q matrix. Sometimes the values in the Q matrix grow large; let’s say they shoot up to 500 or 600. At that point you want to normalize the matrix and bring the values down, because larger numbers are harder to interpret. That’s why we perform normalization: you take each calculated value, divide it by the maximum Q value, and multiply by 100. Then comes the testing phase. Here you just set a current state, and you don’t give any other data, because you’ve already trained the model. You give a current state and tell your agent: listen, you’re in room number 1, and now you need to go to room number 5. It has to figure out how to get to room number 5 on its own, because we have trained it. So here we have set the current state to 1, and we keep looping as long as the current state is not equal to 5, because 5 is the end goal. This is the same kind of loop that we executed earlier, so we go through similar iterations again. Now, if I run this entire code, let’s look at the result. Our current state here is chosen as 1.
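Putting the training, normalization, and testing phases together, a compact sketch could look like the following. It assumes the same reconstructed reward matrix, a fixed random seed, and hypothetical helper names (`valid_actions`, `greedy_path`); the actual demo code may be organized differently.

```python
import numpy as np

# Assumed reward matrix: -1 = no link, 0 = link, 100 = link into goal room 5.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
GAMMA = 0.8
GOAL = 5
rng = np.random.default_rng(0)       # seeded only for repeatability
Q = np.zeros_like(R, dtype=float)


def valid_actions(state):
    """Rooms directly reachable from `state`."""
    return np.flatnonzero(R[state] >= 0)


# Training phase: 10,000 random (state, action) updates.
for _ in range(10_000):
    state = int(rng.integers(0, 6))
    action = int(rng.choice(valid_actions(state)))
    best_next = Q[action, valid_actions(action)].max()
    Q[state, action] = R[state, action] + GAMMA * best_next

# Normalization: divide by the largest Q value and scale to 100.
Q_norm = Q / Q.max() * 100


def greedy_path(start, max_steps=20):
    """Testing phase: follow the highest-Q action until the goal is reached."""
    path = [start]
    state = start
    while state != GOAL and len(path) <= max_steps:
        actions = valid_actions(state)
        state = int(actions[np.argmax(Q_norm[state, actions])])
        path.append(state)
    return path


path_from_1 = greedy_path(1)
path_from_2 = greedy_path(2)
print(path_from_1, path_from_2)
```

Starting from room 2, the walkthrough reports the route 2, 3, 4, 5; in this sketch the step out of room 3 may go through room 1 or room 4, since under the assumed matrix both lead to the goal in the same number of moves.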
If we go back to our matrix, you can see that there is a direct link from 1 to 5, which means that the route the agent should take is 1 to 5. It should go directly from 1 to 5, because that gives it the maximum reward. Let’s see if that’s happening. If I run this, it should give me a direct path from 1 to 5. And that’s exactly what happened: this is the selected path, directly from 1 to 5, and it calculated the entire Q matrix for me. So guys, this is exactly how it works. Now let’s set the initial state to, let’s say, 2. If I set the initial state to 2 and run the code, let’s see the path that it gives. The selected path is 2, 3, 4, 5. It chose this path because it gives the maximum reward. This is the Q matrix that was calculated, and this is the selected path. So guys, with this we come to the end of this demo. What we did was place an agent in a random room and ask it to traverse through and reach the end room, which is room number 5. We trained our agent and made sure that it went through all the possible paths to calculate the best path. Now consider another setting: the agent is a robot, and the environment is the place where it has been put to use, for example an automobile factory where a robot is used to move materials from one place to another. Remember, this robot is itself the agent. Tasks like these have a property in common: they involve an environment and expect the agent to learn from that environment. This is where traditional machine learning fails, and hence the need for reinforcement learning. It is good to have an established overview of the problem that is to be solved using Q-learning, or reinforcement learning, because it helps to define the main components of a reinforcement learning solution.
These are the agent, the environment, the actions, the rewards, and the states. Let’s suppose we are to build a few autonomous robots for an automobile factory. These robots will help the factory personnel by conveying to them the parts they need in order to build the car. The different parts are located at nine different positions within the factory warehouse. The car parts include the chassis, wheels, dashboard, engine, and so on. The factory workers have prioritized the location that contains the body, or chassis, as the topmost, but they have provided priorities for the other locations as well, which we’ll look into in a moment. The locations within the factory look somewhat like this: as you can see here, we have stations L1, L2, L3, and so on. One thing you might notice is that there are little obstacles present in between the locations. L6 is the top-priority location; it contains the chassis for preparing the car bodies. The task is to enable the robots to find, on their own, the shortest route from any given location to another location. The agents in this case are the robots, and the environment is the automobile factory warehouse. Now let’s talk about the states. The state is the location in which a particular robot is present at a particular instant of time. Machines understand numbers rather than letters, so let’s map the location codes to numbers. As you can see here, we have mapped location L1 to state 0, L2 to state 1, and so on, up to L8 as state 7 and L9 as state 8. Next, we’re going to talk about the actions. In our example, an action is a move to a location that the robot can go to directly from its current location. Consider a robot that is at location L2: the locations to which it can move directly are L5, L1, and L3.
The figure here may come in handy to visualize this. As you might have already guessed, the set of actions here is nothing but the set of all possible states of the robot, and for each location the set of actions that a robot can take will be different. For example, the set of actions changes if the robot is in L1 rather than L2: from L1, it can only go to L4 and L2 directly. Now that we are done with the states and the actions, let’s talk about the rewards. The states are 0, 1, 2, 3, 4, and so on up to 8, and the actions are also 0, 1, 2, 3, 4, up to 8. A reward is given to a robot if a location, which is a state, is directly reachable from a particular location. Let’s take an example: suppose L9 is directly reachable from L8. If a robot goes from L8 to L9, or vice versa, it is rewarded with 1. If a location is not directly reachable from a particular location, we do not give any reward, that is, a reward of 0. The reward is just a number and nothing else; it enables the robots to make sense of their movements, helping them decide which locations are directly reachable and which are not. With this cue, we can construct a reward table that contains all the reward values, mapping between all possible states. As you can see in the table, the positions marked green have a positive reward, and the table holds all the possible rewards that a robot can get by moving between the different states. Now comes an interesting decision. Remember that the factory administrator prioritized L6 as the topmost location. How do we incorporate this fact into the above table? We do it by associating the top-priority location with a reward much higher than the usual ones. So let’s put 999 in the cell (L6, L6). The table of rewards, with a higher reward for the topmost location, looks something like this.
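As a rough illustration of how such a reward table can be built, the sketch below fills a 9 by 9 matrix from a list of directly reachable pairs and then marks the priority location. The `links` list is deliberately partial: it contains only the L2 and L8/L9 connections named in the narration, treated as two-way (an assumption), because the full warehouse layout is on the slide and is not reproduced here.

```python
import numpy as np

# Map location codes L1..L9 to states 0..8, as described above.
location_to_state = {f"L{i + 1}": i for i in range(9)}

# Partial, illustrative list of direct links (assumed to be two-way).
links = [("L2", "L1"), ("L2", "L3"), ("L2", "L5"), ("L8", "L9")]

rewards = np.zeros((9, 9), dtype=int)  # 0 = not directly reachable
for a, b in links:
    i, j = location_to_state[a], location_to_state[b]
    rewards[i, j] = 1                  # reward 1 for a directly reachable move
    rewards[j, i] = 1

# Give the top-priority location L6 a much higher reward in cell (L6, L6).
priority = location_to_state["L6"]
rewards[priority, priority] = 999

print(rewards)
```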
We have now formally defined all the vital components for the solution we are aiming for. Now we’ll shift gears a bit and study some of the fundamental concepts that prevail in the world of reinforcement learning and Q-learning. First of all, we’ll start with the Bellman equation. Consider the following grid of square rooms, which is analogous to the actual environment from our original problem, but without the barriers. Suppose a robot needs to go to the room marked in green from its current position A, following the specified directions. How can we enable the robot to do this programmatically? One idea would be to introduce some kind of footprint that the robot will be able to follow. Here, a constant value is specified in each of the rooms that come along the robot’s way if it follows the directions specified above. In this way, if it starts at location A, it will be able to scan through these constant values and move accordingly. But this will only work if the directions are fixed in advance and the robot always starts at location A. Now consider that the robot starts at this other location rather than its previous one. The robot now sees footprints in two different directions, and is therefore unable to decide which way to go in order to get to the destination, which is the green room. This happens primarily because the robot does not have a way to remember the directions to proceed. So our job now is to equip the robot with a memory, and this is where the Bellman equation comes into play. As you can see here, the main purpose of the Bellman equation is to give the robot a memory. That’s the thing we’re going to use.
So the equation goes something like this: $V(s) = \max_a \big( R(s, a) + \gamma V(s') \big)$, where $s$ is a particular state, which is a room; $a$ is the action of moving between the rooms; $s'$ is the state to which the robot goes from $s$; and $\gamma$ is the discount factor, which we’ll get into in a moment. $R(s, a)$ is the reward function, which takes a state $s$ and an action $a$ and outputs the reward. $V(s)$ is the value of being in a particular state, which is the footprint. We consider all the possible actions and take the one that yields the maximum value. There is one constraint, however, regarding the value footprint: the room marked in yellow, just below the green room, will always have the value 1, to denote that it is one of the rooms adjacent to the green room. This is also to ensure that a robot gets a reward when it goes from the yellow room to the green room. Let’s see how to make sense of the equation. Let’s assume a discount factor of 0.9; remember, gamma is the discount factor. Now take the room just below the yellow room, the one with the asterisk mark. What will $V(s)$, the value of being in that state, be for this room? It would be $V(s) = \max_a (0 + 0.9 \cdot 1) = 0.9$. Here the robot does not get any reward for going to the state marked in yellow, hence $R(s, a)$ is 0; but the robot knows the value of being in the yellow room, hence $V(s')$ is 1. Following this for the other states, we get 0.9, then again, putting 0.9 into the equation, we get 0.81, then 0.729, and then we reach the starting point. So this is how the table looks with some value footprints computed
from the Bellman equation. A couple of things to notice here: the max function makes the robot always choose the state that gives it the maximum value, and the discount factor gamma tells the robot how far it is from the destination. Gamma is typically specified by the developer of the algorithm that is installed in the robot. The other states can also be given their respective values in a similar way. As you can see here, the boxes adjacent to the green one have the value 1, and as we move away from them we get 0.9, 0.81, 0.729, and finally 0.66. The robot can now make its way to the green room using these value footprints, even if it’s dropped in any arbitrary room in the given environment. If the robot lands in the highlighted sky-blue area, it will still find two options to choose from, but either of the paths will be good enough for the robot to take, because of the way the value footprints are now laid out. One thing to note is that the Bellman equation is one of the key equations in the world of reinforcement learning and Q-learning. Now, if we think realistically, our surroundings do not always work in the way we expect; there is always a bit of stochasticity involved. This applies to a robot as well. Sometimes the robot’s machinery may get corrupted; sometimes the robot may come across some hindrance on its way that it did not know about beforehand; and sometimes, even if the robot knows that it needs to take the right turn, it will not. So how do we introduce this stochasticity into our case? Here comes the Markov decision process. Consider that the robot is currently in the red room and needs to go to the green room. Let’s say the robot has a slight chance of malfunctioning, so that it might take the left, the right, or the bottom turn
instead of taking the upper turn that gets it to the green room from where it is now, which is the red room. The question is, how do we enable the robot to handle this when it is out in the given environment? This is a situation where the decision about which turn to take is partly random and partly under the control of the robot: partly random because we are not sure when exactly the robot might malfunction, and partly under the control of the robot because it is still making the decision to take a turn on its own, with the help of the program embedded in it. A Markov decision process is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision-making in situations where the outcomes are partly random and partly under the control of the decision maker. Now we need to give this concept a mathematical shape, most likely an equation, which can then be taken further. You might be surprised that we can do this with the help of the Bellman equation with a few minor tweaks. Let’s have a look at the original Bellman equation, $V(s) = \max_a \big( R(s, a) + \gamma V(s') \big)$. What needs to be changed in this equation so that we can introduce some amount of randomness? As long as we are not sure when the robot might fail to take the expected turn, we are also not sure which room it might end up in, that is, the room it moves to from its current room. At this point, according to the equation, we are not sure of $s'$, the next state or room, but we do know all the probable turns the robot might take. We now want to incorporate each of these probabilities into the above equation.
To do that, we need to associate a probability with each of the turns, to quantify the chance the robot has of taking that turn. If we do so, we get $V(s) = \max_a \big( R(s, a) + \gamma \sum_{s'} P(s, a, s') V(s') \big)$. Here $P(s, a, s')$ is the probability of moving from room $s$ to room $s'$ with the action $a$, and the summation is the expectation over the situations the robot may run into, which captures the randomness. Let’s take a look at this example. When we associate probabilities with each of these turns, we essentially mean that there is, for instance, an 80% chance that the robot will take the upper turn. If we put all the required values into our equation, we get $V(s) = \max_a \big( R(s, a) + \gamma \, ( 0.8 \, V(\text{room}_{\text{up}}) + 0.1 \, V(\text{room}_{\text{down}}) + 0.03 \, V(\text{room}_{\text{left}}) + 0.03 \, V(\text{room}_{\text{right}}) ) \big)$. Note that the value footprints will not change just because we are incorporating stochasticity here. But this time we will not calculate those value footprints ourselves; instead, we will let the robot figure them out. Up until this point, we have not considered rewarding the robot for its action of going into a particular room; we are only rewarding the robot when it gets to the destination. Ideally, there should be a reward for each action the robot takes, to help it better assess the quality of its actions. The rewards need not always be the same, but having some amount of reward for each action is much better than having no rewards at all. This idea is known as the living penalty. In reality, the reward system can be very complex, and modeling sparse rewards in particular is an active area of research in reinforcement learning. So by now we have the equation we need, and what we’re going to do next is transition to Q-learning.
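To make the expectation term concrete, here is a small sketch that evaluates $R(s, a) + \gamma \sum_{s'} P(s, a, s') V(s')$ for a single action; $V(s)$ would then be the maximum of this quantity over all actions. The turn probabilities are the ones quoted in the example (as quoted they add up to 0.96), while the neighbouring room values, the zero reward, and the function name `expected_backup` are made up purely for illustration.

```python
GAMMA = 0.9  # discount factor used in the footprint example


def expected_backup(reward, transition_probs, next_values, gamma=GAMMA):
    """Return R(s, a) + gamma * sum over s' of P(s, a, s') * V(s') for one action."""
    expected = sum(p * next_values[room] for room, p in transition_probs.items())
    return reward + gamma * expected


# Probabilities of ending up in each neighbouring room when trying to go up,
# as quoted in the walkthrough.
probs = {"up": 0.8, "down": 0.1, "left": 0.03, "right": 0.03}

# Hypothetical value footprints of the neighbouring rooms.
values = {"up": 1.0, "down": 0.81, "left": 0.9, "right": 0.9}

v = expected_backup(reward=0.0, transition_probs=probs, next_values=values)
print(v)
```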
This equation gives us the value of going to a particular state, taking the stochasticity of the environment into account. We have also learned, very briefly, about the idea of the living penalty, which deals with associating each move of the robot with a reward. Q-learning poses the idea of assessing the quality of an action that is taken to move to a state, rather than determining the possible value of the state being moved to. Earlier we had $0.8 \, V(s_1)$, $0.03 \, V(s_2)$, $0.1 \, V(s_3)$, and so on. If we incorporate the idea of assessing the quality of the action for moving to a certain state, the environment with the agent and the quality of the actions will look something like this: instead of $V(s_1)$ we will have $Q(s_1, a_1)$, and similarly $Q(s_2, a_2)$, $Q(s_3, a_3)$, and so on. The robot now has four different states to choose from, and along with that there are four different actions for the state it is currently in. So how do we calculate $Q(s, a)$, the cumulative quality of the possible actions the robot might take? Let’s break it down. Start from the equation $V(s) = \max_a \big( R(s, a) + \gamma \sum_{s'} P(s, a, s') V(s') \big)$. If we discard the max function, we are left with $R(s, a) + \gamma \sum_{s'} P(s, a, s') V(s')$. Essentially, in the equation that produces $V(s)$, we consider all possible actions and all possible states reachable from the state the robot is currently in, and then take the maximum value obtained by taking a certain action. Without the max, the equation produces a value footprint for just one possible action, and we can think of that as the quality of the action: $Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s') V(s')$. We now have an equation to quantify the quality of a particular action.
We are going to make a little adjustment to the equation. We can now say that V(s) is the maximum of all the possible values of Q(s, a), right? So let's utilize this fact and replace V(s') as a function of Q. So Q(s, a) becomes $R(s, a) + \gamma \sum_{s'} P(s, a, s') \max_{a'} Q(s', a')$. The equation of V has now turned into an equation of Q, which is the quality. But why would we do that? This is done to ease our calculations, because now we have only one function, Q, to calculate, which is also at the core of the dynamic programming approach, and R(s, a) is a quantified metric which produces the reward of moving to a certain state. The qualities of the actions are called the Q-values, and from now on we will refer to the value footprints as the Q-values.

An important piece of the puzzle is the temporal difference. The temporal difference is the component that helps the robot calculate the Q-values with respect to the changes in the environment over time. So consider that our robot is currently in the marked state and wants to move to the upper state. One thing to note here is that the robot already knows the Q-value of making that action, that is, moving to the upper state. We also know that the environment is stochastic in nature, so the reward the robot gets after moving to the upper state might be different from an earlier observation. So how do we capture this change, the temporal difference? We calculate the new Q(s, a) with the same formula and subtract the previously known Q(s, a) from it. The equation we just derived gives the temporal difference in the Q-values, which further helps to capture the random changes the environment may impose.

The new Q(s, a) is updated as follows: $Q_t(s, a) = Q_{t-1}(s, a) + \alpha \, TD_t(a, s)$. Here alpha is the learning rate, which controls how quickly the robot adapts to the random changes imposed by the environment, $Q_t(s, a)$ is the current Q-value and $Q_{t-1}(s, a)$ is the previously recorded Q-value. If we replace $TD_t(a, s)$ with its full-form equation, we get $Q_t(s, a) = Q_{t-1}(s, a) + \alpha \big( R(s, a) + \gamma \max_{a'} Q(s', a') - Q_{t-1}(s, a) \big)$. Now we have all the little pieces of Q-learning together, and this is the final equation of Q-learning, so let's move forward to its implementation part.

Let's see how we can implement this and obtain the best path for any robot to take. To implement the algorithm, we need to understand the warehouse locations and how they can be mapped to different states. So let's start by recollecting the sample environment. As you can see here, we have L1, L2, L3 up to L9, and we have certain borders also. First of all, let's map each of these locations in the warehouse to numbers, or states, so that it will ease our calculations. So what I'm going to do is create a new Python 3 file in the Jupyter notebook, and I'll name it Q-learning NumPy.

Okay, so let's define the states. But before that, we need to import NumPy, because we're going to use NumPy for this purpose, and let's initialize the parameters, that is, gamma and alpha. Gamma is 0.75, which is the discount factor, whereas alpha is 0.9, which is the learning rate. Next, we define the states and map them to numbers. As I mentioned earlier, L1 is 0, and so on up to L9, so we have defined the states in numerical form. The next step is to define the actions, each of which, as mentioned above, represents the transition to the next state. As you can see here, we have an array of actions from 0 to 8.
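
The setup and the update rule described above can be sketched as follows. The values of `gamma` and `alpha` and the L1 to L9 mapping come from the session; the names `location_to_state`, `actions` and `q_update` are assumptions chosen for illustration, since the on-screen code is not part of the transcript.

```python
import numpy as np

gamma = 0.75  # discount factor
alpha = 0.9   # learning rate

# Map the warehouse locations L1..L9 to the states 0..8.
location_to_state = {f"L{i + 1}": i for i in range(9)}

# An action is identified by the index of the state the robot moves to.
actions = list(range(9))


def q_update(Q, reward, state, action, next_state):
    """Apply one Q-learning update in place and return the temporal difference.

    TD = R(s, a) + gamma * max over a' of Q(s', a') - Q(s, a)
    Q(s, a) <- Q(s, a) + alpha * TD
    """
    td = reward + gamma * np.max(Q[next_state]) - Q[state, action]
    Q[state, action] += alpha * td
    return td


# Usage: a single update on an all-zero Q-table, moving from L1 (0) to L2 (1)
# with a hypothetical reward of 1.
Q = np.zeros((9, 9))
td = q_update(Q, reward=1.0, state=0, action=1, next_state=1)
print(td, Q[0, 1])
```

Because the table starts at zero, the first temporal difference is just the reward, and the learning rate of 0.9 moves the stored Q-value most of the way toward it in one step.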
Now what we're going to do is define the reward table. As you can see here, it is the same matrix that I showed you just now. If you understood it correctly, there isn't any real barrier limitation as depicted in the image. For example, the transition from L4 to L1 is allowed, but the reward will be 0 to discourage that path. Or, in a tough situation, what we do is add a minus 1 there so that it gets a negative reward. So in the above code snippet, as you can see here, we took each of the states and put ones in the respective states that are directly reachable from that state. If you refer once again to the reward table we created above, this construction will be easy to understand. One thing to note here is that we did not consider the top-priority location L6 yet.

We would also need an inverse mapping from the states back to their original locations, and it will be cleaner when we reach the deeper parts of the algorithm. So for that, what we're going to do is have the inverse map, state to location: we take each distinct state and location and convert it back.

Now what we'll do is define a function, get_optimal_route, which has a start location and an end location. Don't worry, the code is big, but I'll explain each and every bit of it. The get_optimal_route function takes two arguments, the starting location in the warehouse and the end location in the warehouse respectively, and it returns the optimal route for reaching the end location from the starting location, in the form of an ordered list containing the letters. We'll start defining the function by initializing the Q-values to be all zeros. But before that, what we need to do is copy the reward matrix to a new one.
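
The reward table and the inverse mapping described above might look like the sketch below. The transcript does not reproduce the matrix itself, so the adjacency used here is an assumption: it was chosen only to be consistent with the route L9, L8, L5, L2, L1 reported later in the session. The names `rewards`, `location_to_state` and `state_to_location` are likewise illustrative.

```python
import numpy as np

# Locations L1..L9 mapped to the states 0..8.
location_to_state = {f"L{i + 1}": i for i in range(9)}

# ASSUMED reward table: rewards[i, j] is 1 when state j is directly
# reachable from state i, and 0 otherwise (allowed, but discouraged).
rewards = np.array([
    [0, 1, 0, 0, 0, 0, 0, 0, 0],  # L1 -> L2
    [1, 0, 1, 0, 1, 0, 0, 0, 0],  # L2 -> L1, L3, L5
    [0, 1, 0, 0, 0, 1, 0, 0, 0],  # L3 -> L2, L6
    [0, 0, 0, 0, 0, 0, 1, 0, 0],  # L4 -> L7
    [0, 1, 0, 0, 0, 0, 0, 1, 0],  # L5 -> L2, L8
    [0, 0, 1, 0, 0, 0, 0, 0, 0],  # L6 -> L3
    [0, 0, 0, 1, 0, 0, 0, 1, 0],  # L7 -> L4, L8
    [0, 0, 0, 0, 1, 0, 1, 0, 1],  # L8 -> L5, L7, L9
    [0, 0, 0, 0, 0, 0, 0, 1, 0],  # L9 -> L8
])

# Inverse mapping from a state index back to its location label.
state_to_location = {state: location
                     for location, state in location_to_state.items()}

print(state_to_location[8])  # the label for state 8
```

In this layout every passage works in both directions, so the matrix is symmetric. A blocked passage could be marked with -1 instead of 0, as mentioned in the session, if a negative reward is wanted.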
So this is rewards_new. Next, what we need to do is get the ending state corresponding to the ending location. With this information, we automatically set the priority of the given ending state to the highest one. We are not defining it by hand; the function automatically sets the priority of the given ending state to 999. Then we initialize the Q-values to be 0.

In the Q-learning process, as you can see here, we take i in range(1000) and pick up a state randomly, so we use np.random.randint. For traversing through the neighboring locations in the same maze, we iterate through the new reward matrix and get the actions whose reward is greater than 0. After that, we pick an action randomly from the list of playable actions, which leads us to the next state. We then compute the temporal difference, TD, which is the reward plus gamma into the Q of the next state, where we take np.argmax of Q of the next state, minus the Q of the current state. We then update the Q-values using the Bellman equation, as you can see here.

After that, we initialize the optimal route with the starting location. Here we do not know the next location yet, so we initialize it with the value of the starting location, which can be any location in the warehouse. We also do not know the exact number of iterations needed to reach the final location, hence a while loop is a good choice for the iteration. Inside the loop, we fetch the starting state and then fetch the highest Q-value pertaining to the starting state. That gives us the index of the next state, but we need the corresponding letter, so we use the state_to_location mapping we mentioned earlier.
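
Putting the steps of this walkthrough together, a minimal sketch of the whole routine could look like this. It follows the description in the session (999 for the ending state, 1000 iterations, random state and action selection, np.argmax for the next state), but the reward matrix is the same assumed adjacency as before and the variable names are illustrative, not the presenter's verified code.

```python
import numpy as np

gamma = 0.75  # discount factor
alpha = 0.9   # learning rate

location_to_state = {f"L{i + 1}": i for i in range(9)}
state_to_location = {s: loc for loc, s in location_to_state.items()}

# ASSUMED adjacency, consistent with the route reported in the session.
rewards = np.array([
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 0],
])


def get_optimal_route(start_location, end_location):
    # Work on a copy so the original reward table stays untouched.
    rewards_new = np.copy(rewards)

    # Give the ending state the highest priority.
    ending_state = location_to_state[end_location]
    rewards_new[ending_state, ending_state] = 999

    # Q-learning: start from an all-zero Q-table.
    Q = np.zeros((9, 9))
    for _ in range(1000):
        # Pick a state at random and collect its playable actions.
        current_state = np.random.randint(0, 9)
        playable_actions = [j for j in range(9)
                            if rewards_new[current_state, j] > 0]
        # A random playable action leads to the next state.
        next_state = np.random.choice(playable_actions)

        # Temporal difference, then the Q-value update.
        best_next = Q[next_state, np.argmax(Q[next_state])]
        td = (rewards_new[current_state, next_state]
              + gamma * best_next
              - Q[current_state, next_state])
        Q[current_state, next_state] += alpha * td

    # Follow the highest Q-value from the start until the end is reached.
    route = [start_location]
    next_location = start_location
    while next_location != end_location:
        starting_state = location_to_state[start_location]
        next_state = np.argmax(Q[starting_state])
        next_location = state_to_location[next_state]
        route.append(next_location)
        start_location = next_location
    return route


np.random.seed(0)  # fixed seed only so the sketch is repeatable
print(get_optimal_route("L9", "L1"))
```

With the assumed layout there is only one loop-free path between L9 and L1, so once the Q-values have propagated the greedy walk should reproduce the route reported in the session. The while loop relies on the table being trained well enough; with far fewer iterations it could wander, so a step limit would be a sensible safeguard in real use.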
After fetching the next location, we update the starting location for the next iteration, and finally we return the route. So let's take the starting location as L9 and the end location as L1 and see what path we actually get. As you can see here, we get L9, L8, L5, L2 and L1. And if you have a look at the image, starting from L9 and going to L1, the route through L8, L5, L2 and L1 is the one that yields the maximum reward for the robot.

So now we have come to the end of this Q-learning session. I hope you got to know what exactly Q-learning is, with the analogy all the way starting from the number of rooms, and I hope the example and the analogy I took were good enough for you to understand Q-learning: the Bellman equation, how to make quick changes to the Bellman equation, how to create the reward table and the Q-table, how to update the Q-values using the Bellman equation, and what alpha and gamma do.

