Learn Data Science Tutorial – Full Course for Beginners


Welcome to Data Science: An Introduction. I'm Barton Poulson, and what we're going to do in this course is have a brief, accessible, and non-technical overview of the field of data science.

Now, some people, when they hear "data science," think about piles of equations and numbers, and then, when you throw "science" on top of that, they think about people working in their labs, and they start to say, "That's not for me. I'm not really a technical person, and that just seems much too techy." Well, here's the important thing to know: while a lot of people get really fired up about the technical aspects of data science, data science is not so much a technical discipline as a creative one. And really, that's true. The reason I say that is because in data science you use tools that come from coding, from statistics, and from math, but you use those to work creatively with data.

The idea is that there's always more than one way to solve a problem or answer a question, and most importantly, to get insight, because the goal, no matter how you go about it, is to get insight from your data. What makes data science unique compared to so many other fields is that you try to listen to all of your data, even when it doesn't fit easily into your standard approaches and paradigms. You're trying to be much more inclusive in your analysis, and the reason you want to do that is because everything signifies: everything carries meaning, and everything can give you additional understanding and insight into what's going on around you. So in this course, what we're trying to do is give you a map to the field of data science and how you can use it. Now you have the map in your hands, and you can get ready to get going with data science.

Welcome back to Data Science: An Introduction. We're going to begin this course by defining data science. That makes sense, but we're going to do it in kind of a
funny way. The first thing I'm going to talk about is the demand for data science, so let's take a quick look.

Data science can be defined in a few ways, and I'm going to give you some short definitions. Take one on my definition: data science is coding, math, and statistics in applied settings. That's a reasonable working definition. But if you want to be a little more concise, I've got take two on a definition: data science is the analysis of diverse data, or data that you didn't think would fit into standard analytic approaches. A third way to think about it is that data science is inclusive analysis: it includes all of the data, all of the information that you have, in order to get the most insightful and compelling answer to your research questions.

Now, you may say to yourself, "Wait, that's it?" Well, if you're not impressed, let me show you a few things. First off, let's take a look at this article. It says "Data Scientist: The Sexiest Job of the 21st Century," and please note that this is coming from Harvard Business Review, so this is an authoritative source, and it's the original source of the saying that data science is sexy. Again, you may be saying to yourself, "Sexy? I hardly think so." Oh yes, it's sexy, and the reason data science is sexy is because, first, it involves rare qualities, and second, it is in high demand. Let me say a little more about those. The rare qualities are that data science takes unstructured data and then finds order, meaning, and value in that data. Those things are important, but they're not easy to come by. Second, high demand: data science is in high demand because it provides insight into what's going on around you, and, critically, it provides competitive advantage, which is a huge thing in business settings.

Now, let me go back and say a little more about demand, with a look at a few other sources. For instance, the McKinsey Global Institute published a very well-known paper, and you can get
it with this URL. If you go to that webpage, this is what's going to come up, and we're going to take a quick look at one piece of it: the executive summary, a PDF that you can download. If you open that up, you'll find this page, and let's look at two numbers in the bottom right corner; I'm going to zoom in on those. The first is that they are projecting a need in the next few years for somewhere between 140,000 and 190,000 deep analytical talent positions. That means actual practicing data scientists, and it's a huge number. But the second number is almost ten times as high: 1.5 million more data-savvy managers will be needed to take full advantage of big data in the United States. Those are people who aren't necessarily doing the analysis themselves but who have to understand it, who have to speak data. And that's one of the main purposes of this particular course: to help people who may or may not be practicing data scientists learn to understand what they can get out of data and some of the methods used to get there.

Let's take a look at another article, this one from LinkedIn. Here's a shortcut URL for it, and it will bring you to this webpage: "The 25 Hottest Job Skills That Got People Hired in 2014." Take a look at number one: statistical analysis and data mining, which is very closely related to data science. And just to be clear, this was number one in Australia, in Brazil, in Canada, in France, in India, in the Netherlands, in South Africa, in the United Arab Emirates, in the United Kingdom. Everywhere.

And if you need a little more, let's look at Glassdoor, which published an article this year, 2016, about the 25 best jobs in America. Look at number one right here: it's data scientist. We can zoom in on this information. It says there are going to be 1,700 job openings, with a median base salary of over $116,000, and excellent career-opportunity and job scores. So if you take all of this together, the conclusion you can reach is that data
science pays. And I can show you a little more about that. For instance, here's a list of the top ten highest-paying salaries that I got from U.S. News: physicians (or doctors), dentists, lawyers, and so on. Now, if we add data scientists to this list, using data from O'Reilly.com, we have to push things around a bit, and data scientist goes in third, with an average total salary of about $144,000 a year. That's total compensation, not the base salary we had in the other list, and it's extraordinary.

So, in sum, what do we get from all this? First, we learned that there is very high demand for data science. Second, we learned that there is a critical need both for specialists, the practicing data scientists, and for generalists, people who speak the language and know what can be done. And of course there's excellent pay. All together, this makes data science a compelling career alternative and a way of making you better at whatever you're doing.

Back here in data science, we're going to continue our attempt to define data science by looking at something that's really well known in the field: the data science Venn diagram. If you want, you can think of this in terms of the ingredients of data science. First, we'll say thanks to Drew Conway, the person who came up with this; if you want to see the original article, you can go to this address. What Drew said is that data science is made of three things, and we can draw them as overlapping circles, because it's the intersection that's important here. On the top left is coding, or computer programming, or as he calls it, hacking. On the top right is stats, or stats and mathematics, or quantitative abilities in general. And on the bottom is domain expertise: intimate familiarity with a particular field of practice, whether business, health, education, or something like that. And the intersection here in the middle, that is data
science. So it's the combination of coding, statistics and math, and domain knowledge.

Now, let's say a little more about coding. Coding is important because it helps you gather and prepare the data: a lot of the data comes from novel sources, it's not necessarily ready for you to gather, and it can be in very unusual formats, so it can require some real creativity to get the data from those sources and into your analysis.

There are a few kinds of coding that matter. First, there's statistical coding; a couple of the major languages here are R and Python, two open-source, free programming languages. R is made specifically for data; Python is general-purpose but well adapted to data. The ability to work with databases is important too; the most common language there is SQL, usually pronounced "sequel," which stands for structured query language, and it matters because that's where the data is. There's also the command-line interface, or, if you're on a Mac, what people just call the terminal; the most common language there is bash, which stands for Bourne-again shell. And then searching is important, through regex, or regular expressions. There's not a huge amount to learn there, it's a small little field, but it's sort of like super-powered wildcard searching that makes it possible for you to both find the data and reformat it in ways that are going to be helpful for your analysis.

Now, let's say a few things about the math. You're going to need things like a little bit of probability, some algebra, and of course regression, a very common statistical procedure. The reason you need the math is that it helps you choose the appropriate procedures to answer the question with the data that you have. Probably even more importantly, it helps you diagnose problems when things don't go as expected, and given that you're trying to do new things with new
data in new ways, you're probably going to come across problems, so the ability to understand the mechanics of what's going on is going to give you a big advantage.

The third element of the data science Venn diagram is some sort of domain expertise. Think of it as expertise in the field that you're in; business settings are common. You need to know about the goals of that field, the methods that are used, and the constraints that people come across. It's important because, whatever your results are, you need to be able to implement them well. Data science is very practical and designed to accomplish something, and your familiarity with a particular field of practice is going to make it that much easier, and more impactful, when you implement the results of your analysis.

Now, let's go back to our Venn diagram for a moment, because since this is a Venn diagram, we also have the intersections of two circles at a time. At the top is machine learning, at the bottom right is traditional research, and at the bottom left is what Drew Conway called the danger zone. Let me talk about each of these.

First, machine learning, or ML. The idea here is that it represents coding (or statistical programming) and mathematics without any real domain expertise. These are sometimes referred to as black-box models: you throw data in, and you don't even necessarily have to know what it means or what language it's in; the model will crunch through it all and give you some regularities. That can be very helpful, but machine learning is considered slightly different from data science because it doesn't involve the particular applications in a specific domain.

Next, there's traditional research. This is where you have math or statistics and domain knowledge, often very intensive domain knowledge, but without the coding or programming. You can get away with that because the data that you
use in traditional research is highly structured: it comes in rows and columns, is typically complete, and is typically ready for analysis. That doesn't mean your life is easy, because now you have to expend an enormous amount of effort on the method, on designing the project, and on the interpretation of the data. So it's still very heavy intellectual, cognitive work; it just comes in a different place.

And then, finally, there's what Conway called the danger zone: the intersection of coding and domain knowledge, but without math or statistics. Now, he says it's unlikely to happen, and that's probably true. On the other hand, I can think of some common examples. One is what are called word counts, where you take a large document, or a series of documents, and you count how often each word appears; that can actually tell you some important things. Another is drawing maps and showing how things change across place, and maybe across time. You don't necessarily have to have the math, but it can be very insightful and helpful.

So, let's think about a couple of backgrounds that people come from. First is coding. You can have coders who can also do math, stats, and business, so you get all three things; this is probably the most common path, as most people come to data science from a programming background. There's also stats: you can get statisticians who can code and who can also do business. That's less common, but it does happen. And finally, there are people who come into data science from a particular domain. These are, for instance, business people who can code and do numbers; they're the least common, but all of these backgrounds are important to data science.

In sum, here's what we can take away. First, several fields make up data science. Second, diverse skills and backgrounds are important, and they're needed in data science. And third, there are many roles involved, because there are a lot of different things that need to happen.
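As a concrete sketch of the word-count example just mentioned: a few lines of Python can count how often each word appears in a document, with a regular expression doing the "super-powered wildcard searching" that pulls the words out of the raw text. The function name and the sample sentence here are my own illustrations, not something from the course.

```python
import re
from collections import Counter

def word_counts(text, top_n=5):
    """Count how often each word appears in a document.

    A regular expression tokenizes the text: it lowercases everything,
    then matches runs of letters (and apostrophes), ignoring punctuation.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)

sample = "The data tell a story, and the story is in the data."
print(word_counts(sample, top_n=3))
# → [('the', 3), ('data', 2), ('story', 2)]
```

Notice that no statistics are involved at all, which is exactly why word counts sit in the coding-plus-domain "danger zone" rather than in data science proper.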
We'll see more about that in our next movie.

The next step in our data science introduction, and our definition of data science, is to talk about the data science pathway. I like to think of it this way: when you're working on a major project, you've got to take one step at a time to get from here to there. In data science, you can take the various steps and put them into a few general categories. First, there are the steps that involve planning; second, there's the data prep; third, there's the actual modeling of the data; and fourth, there's the follow-up. There are several steps within each of these, and I'll explain each of them briefly.

First, planning. The first thing you need to do is define the goals of your project, so you know how to use your resources well, and also so you know when you're done. Second, you need to organize your resources: you might have data from several different sources, different software packages, and different people, which gets us to the third step. You need to coordinate the people so they can work together productively; if you're doing a handoff, it needs to be clear who's going to do what and how their work is going to fit together. And then, really to state the obvious, you need to schedule the project so things can move along smoothly and you can finish in a reasonable amount of time.

Next is the data prep. Think of it like food prep: getting the raw ingredients ready. First, of course, you need to get the data, and it can come from many different sources and be in many different formats. You need to clean the data, and the sad thing is that this tends to be a very large part of any data science project, because you're bringing in unusual data from a lot of different places. You also want to explore the data, that is, really see what it looks like: how many people are in each group, what the shapes of the distributions are like, what's
associated with what. And you may need to refine the data, which means choosing variables to include, choosing cases to include or exclude, and making any transformations to the data that you need to make. Of course, these steps can bounce back and forth from one to another.

The third group is modeling, or statistical modeling. This is where you actually create the statistical model; for instance, you might do a regression analysis, or you might build a neural network. But whatever you do, once you create your model, you have to validate it. You might do that with holdout validation, or with a very small replication if you can. You also need to evaluate the model: once you know that the model is accurate, what does it actually mean, and how much does it tell you? And then, finally, you need to refine the model. There may be variables you want to throw out, there may be additional ones you want to include, you may want to transform some of the data again, or you may want to get it so it's easier to interpret and apply.

That gets us to the last part of the data science pathway: the follow-up. Once you've created your model, you need to present it, because this is usually work that's being done for a client, whether in-house or a third party; you need to take the insights that you got and share them in a meaningful way with other people. You also need to deploy the model, because it's usually being built in order to accomplish something. For instance, if you're working with an e-commerce site, you may be developing a recommendation engine that says people who bought this and this might also buy that, and you need to actually put it on the website and see if it works the way you expected. Then you need to revisit the model, because a lot of times the data that you worked on is not necessarily all of the data, and things can change when you get out into the real world, or things
just change over time, so you have to see how well your model is still working. And then, just to be thorough, you need to archive the assets: document what you have, and make it possible for you or for others to repeat the analysis, or develop off of it, in the future.

So those are the general steps of what I consider the data science pathway, and in sum, we get three things from this. First, data science isn't just a technical field; it's not just coding. Things like planning, presenting, and implementing are just as important. Also, contextual skills matter: knowing how things work in a particular field, and knowing how the results will be implemented. And, as you saw, there are a lot of things to do; if you go one step at a time, there will be less backtracking, and you'll ultimately be more productive in your data science projects.

We'll continue our definition of data science by looking at the roles that are involved in data science, the ways that different people can contribute to it. That's because data science tends to be a collaborative endeavor, and it's nice to be able to say that we're all working together towards a single goal. So let's talk about some of the roles involved in data science and how they contribute to projects.

First off, let's take a look at engineers. These are people who focus on the back-end hardware, for instance the servers, and the software that runs them. This is what makes data science possible, and it includes people like software developers and database administrators; they provide the foundation for the rest of the work. Next, you can also have big data specialists. These are people who focus on computer science and mathematics, and they may work on machine learning algorithms as a way of processing very large amounts of data. They often create what are called data products: a thing that tells you what restaurant to go to, or that says you might
know these friends, or that provides ways of linking up photos. Those are data products, and they often involve a huge amount of very technical work behind them.

There are also researchers. These are people who focus on domain-specific research, for instance physics or genetics, and they tend to have very strong statistics. They can use some of the procedures and some of the data that come from the other people, like the big data specialists, but they focus on their specific questions.

Also, in the data science realm you'll find analysts. These are people who focus on the day-to-day tasks of running a business. For instance, they might do web analytics, like Google Analytics, or they might pull data from a SQL database. This information is very important and good for business, so analysts are key to the day-to-day functioning of a business, but they may not exactly be doing data science proper, because most of the data they're working with is going to be pretty structured. Nevertheless, they play a critical role in business in general.

And speaking of business, you have the actual business people, the men and women who organize and run businesses. These people need to be able to frame business-relevant questions that can be answered with the data. The business person also manages the project and the efforts and resources of others, and while they may not actually be doing the coding, they must speak data: they must know how the data works, what it can answer, and how to implement it.

You can also have entrepreneurs. You might have, for instance, a data startup: people starting their own little social network or their own little web search platform. An entrepreneur needs data skills and business skills, and, truthfully, they have to be creative at every step along the way, usually because they're doing it all themselves, at a smaller scale.

Then we have, in data science, something known as the full-stack
unicorn, and this is a person who can do everything at an expert level. They're called a unicorn because, truthfully, they may not actually exist; I'll have more to say about that later.

For right now, we can sum up what we got out of this video with one big idea: data science is diverse. There are a lot of different people who go into it, and they have different goals for their work, and they bring different skills, different experiences, and different approaches. They also tend to work in very different contexts: an entrepreneur works in a very different place from a business manager, who works in a very different place from an academic researcher. But all of them are connected in some way to data science and make it a richer field.

The last thing I want to say in Data Science: An Introduction, where I'm trying to define data science, is about teams in data science. The idea here is that data science involves many different tools, and different people are going to be experts in each one of them. You have, for instance, coding, and you have statistics; you also have fields like design, or business and management, that are involved. And the question, of course, is: who can do all of it? Who's able to do all of these things at the level that we need? Well, that's where we get the saying I've mentioned before: the unicorn. Just as in ancient legend the unicorn is a mythical creature with magical abilities, in data science the unicorn is a mythical data scientist with universal abilities. The trouble is, just as there are no real unicorns in the animal world, there are not many unicorns in data science either. Really, there are just people, so we have to figure out how to do our projects even though we don't have one person who can do everything for everybody.

So let's take a hypothetical case for a moment; I'm going to give you some fictional people. Here is my
fictional person, Otto, who has strong visualization skills and good coding, but limited analytic or statistical ability. If we graph out his abilities, we have five things that need to happen, and for the project to work, they all have to happen at a level of at least eight on a zero-to-ten scale. If we take his coding ability, well, he's almost there. Statistics, not quite halfway. Graphics, yes, he can do that. Business, not so much, and project management, pretty good. So what you can see here is that in only one of these five areas is Otto sufficient on his own.

On the other hand, let's pair him up with somebody else. Let's take a look at Lucy. Lucy has strong business training and good tech skills, but limited graphics. If we put her profile on the same chart: coding, pretty good; statistics, pretty good; graphics, not so much; business, good; and projects, okay.

Now, the important thing here is that we can make a team. Let's take our two fictional people, Otto and Lucy, and put together their abilities. I actually have to change the scale here a little bit to accommodate both of them, but our criterion is still eight: we need a level of eight in each area in order to do the project competently. And if we combine them: look, coding is now past eight, statistics is past eight, graphics is way past, business is way past, and projects are there too. So when we combine their skills, we are able to get the level that we need for everything; or, to put it another way, we have now created a unicorn by team, and that makes it possible to do the data science project.

So, in sum: you usually can't do data science on your own; the person who can is a very rare individual. Or, more specifically, people need people, and in data science you have the opportunity to take several people and make collective unicorns, so you can get the insight that you need in your project and get the things done that you want. In
order to get a better understanding of data science, it can be helpful to look at contrasts between data science and other fields. Probably the most informative contrast is with big data, because these two terms are often confused. It makes me think of situations where you have two things that are very similar but not the same, like we have here in the Piazza San Carlo in Turin, Italy.

Part of the problem stems from the fact that data science and big data both have Venn diagrams associated with them. Venn diagram number one, for data science, is something we've seen already: three circles, with coding, math, and domain expertise, which put together give you data science. Venn diagram number two is for big data. It also has three circles: the high volume of data, the rapid velocity of data, and the extreme variety of data. Take those three V's together and you get big data. Now, we can also combine these two, if we want, in a third Venn diagram, for big data and data science. This time it's just two circles, with big data on the left and data science on the right, and the intersection there in the middle is big data science, which actually is a real term.

If you want to do a compare and contrast, it helps to look at how you can have one without the other. Let's start by looking at big data without data science. These are situations where you may have the volume, velocity, or variety of data, but don't need all the tools of data science; we're just looking at the left side of the diagram right now. Truthfully, this only works if you can have big data without all three V's. Some say you have to have the volume, velocity, and variety for it to count as big data; I basically say anything that doesn't fit into a standard machine is probably big data. I can think of a couple of examples of things that might count as big data but maybe don't count as data science.
Machine learning, where you can have very large and probably very complex data sets, doesn't require much domain expertise, so it may not be data science. Word counts, where you have an enormous amount of data but a pretty simple analysis, again don't require much sophistication in terms of quantitative skills or even domain expertise. So: maybe data science, maybe not. On the other hand, to do any of these you're going to need at least two skills: the coding, and probably some sort of quantitative skills as well.

So how about data science without big data? That's the right side of the diagram. To make that happen, you're probably talking about data with just one of the three V's from big data: either volume, or velocity, or variety, but singly. For instance, with genetics data you have a huge amount of data, it comes in a very set structure, and it tends to come in all at once. So you've got a lot of volume, and it's a very challenging thing to work with; you have to use data science, but it may or may not count as big data. Similarly, with streaming sensor data you have data coming in very quickly, but you're not necessarily saving it; you're just looking at windows into it. That's a lot of velocity, and it's difficult to deal with. It takes data science, the full skill set, but it may not require big data per se. Or facial recognition, where you have enormous variety in the data, because you're getting photos or videos that are coming in. Again, this is very difficult to deal with and requires a lot of ingenuity and creativity, but it may or may not count as big data, depending on how much of a stickler you are about definitions.

Now, if you want to combine the two, we can talk about big data science. In that case, we're looking right here at the middle. This is a situation where you have volume and velocity and variety in your data, and, truthfully, if you have all three of those, you
are going to need the full data science skill set: coding, statistics and math, and domain expertise, primarily because of the variety you're dealing with. Taken all together, you do have to have all of it. So, in sum, here's what we get: big data is not equal to, not identical to, data science. There's common ground, and a lot of people who are good at big data are good at data science and vice versa, but they are conceptually distinct. On the other hand, there is the shared middle ground of big data science that unifies the two separate fields.

Another important contrast you can make in trying to understand data science is with coding, or computer programming. Coding is where you're trying to work with the machine, trying to talk to that machine to get it to do things. In one sense, you can think of coding as just giving task instructions for how to do something, a lot like a recipe when you're cooking: you get some sort of user input or other input, then maybe you have if-then logic, and you get output from it. To take an extremely simple example: if you're programming in Python version 2, you write print, and then, in quotes, "hello world", and that puts the words "hello world" on the screen. You gave it some instructions, and it gave you some output. Very simple programming.

Now, coding with data gets a little more complicated. For instance, there are word counts, where you take a book, or a whole collection of books, and you count how many times each word appears. That's a conceptually simple task, and domain expertise, and really math and statistics, are not vital to it. But to make valid inferences and generalizations in the face of variability and uncertainty in the data, you need statistics, and, by extension, you need data science.

It might help to compare the two by looking at the tools of the respective trades. So, for
For instance, there are tools for coding, or generic computer programming, and there are tools that are specific to data science. What I have right here is a list from the IEEE of the top ten programming languages of 2015. It starts at Java and C and goes down to shell, and some of these are also used for data science; Python and R and SQL, for instance. But the others aren't major ones in data science. So let's take a look at a different list, of the most popular tools for data science, and you'll see that things move around a little: now R is at the top, SQL is there, Python is there. For me, the most interesting thing on this list is that Excel is number five. Excel would never be considered programming per se, but it is in fact a very important tool for data science, and that's one of the ways we can compare and contrast computer programming with data science. In sum, we can say this: data science is not equal to coding; they're different things. On the other hand, they share some tools and some practices, specifically when coding for data. And there is one very big difference: statistical ability is one of the major separators between general-purpose programming and data science programming.

When we talk about data science and contrast it with other fields, another field that a lot of people confuse with data science is statistics. Now, I'll tell you, there's a lot in common, but we can talk a little about the different focus of each. We also get into the issue of definitionalism: the claim that data science is different simply because we define it differently, even when there's an awful lot in common between the two. It helps to look at some of the things that go on in each field. So let's start by putting statistics in a little circle here, and data science in another, and, to borrow a term from Stephen Jay Gould, call these non-overlapping magisteria (NOMA): separate fields, sovereign unto themselves, with nothing to do with each other. But that doesn't seem right, and part of the reason is that if we go back to the data science Venn diagram, statistics is one part of it; there it is in the top corner. So what do we do? What's the relationship? It doesn't make sense to say these are totally separate areas. Maybe, because they share procedures, data science is a subset or specialty of statistics, more like this. But if data science were just a subset of statistics, it would follow that all data scientists would first be statisticians, and interestingly, that's just not so.

Say, for instance, we take a look at the data science stars, the superstars in the field. We go to a rather intimidating article called "The World's 7 Most Powerful Data Scientists," from Forbes.com; you can see the article if you go to this URL. There are actually more than seven people on the list, because sometimes the author brings them up in pairs, but let's check their degrees and see what their academic training is in. If we take all the people on this list, we have five degrees in computer science, three in math, two in engineering, one each in biology, economics, law, and speech pathology, and one in statistics. That tells us, of course, that these major people in data science are not trained as statisticians; only one of them has formal training in the field.

So that gets us to the next question: where do these two fields, statistics and data science, diverge? They seem like they should have a lot in common, but they don't have a lot in common in training. Specifically, most data scientists are not formally trained as statisticians. Also, in practice, things like machine learning and big data, which are central to data science, are not shared
generally with most of statistics, and so the fields have separate domains there. And then there's the really important issue of context: data scientists tend to work in different settings than statisticians. Specifically, data scientists very often work in commercial settings, where they're trying to build recommendation engines or develop a product that will make money. So maybe, instead of treating data science as a subset of statistics, we can think of the two fields as occupying different niches: they both analyze data, but they do different things in different ways. Maybe it's fair to say they overlap; they both have the analysis of data in common, but otherwise they are ecologically distinct. So, in sum: data science and statistics both use and analyze data, but the people in each tend to come from different backgrounds and to function with different goals and contexts, and in that way they are conceptually distinct fields despite the apparent overlap.

As we work to get a grasp on data science, there's one more contrast I want to make explicitly, and that's between data science and business intelligence, or BI. The idea here is that business intelligence is data in real life; it's very, very applied stuff. The purpose of BI is to get data on internal operations, on market competitors, and so on, and to make justifiable decisions, as opposed to just sitting in the bar and doing whatever comes to mind. Now, data science is involved with this, except that, really, there's no coding in BI; it's about using apps that already exist. And the statistics in business intelligence tend to be very simple: counts and percentages and ratios. It's simple the way a light bulb is simple: it just does its one job, and there's nothing super sophisticated there. Instead, the focus in business intelligence is on domain expertise and on really useful, direct utility. It's simple, it's effective, and it provides insight.

Now, one of the main associations with business intelligence is what are called dashboards, or data dashboards. A dashboard is a collection of charts and tables that go together to give you a very quick overview of what's going on in your business. And while a lot of data scientists may, let's say, look down their noses at dashboards, I'll say this: most of them are very well designed, and you can learn a huge amount about user interaction and the accessibility of information from dashboards.

So where does data science come into this? What's the connection between data science and business intelligence? Well, data science can be useful to BI in setting it up: identifying data sources and creating the framework for something like a dashboard or a business intelligence system. Data science can also extend BI: it can help get past the easy questions and the easy data to the questions that are actually most useful to you, even if they sometimes require data that's hard to wrangle and work with. And there's an interesting interaction that goes the other way: data science practitioners can learn a lot about design from good business intelligence applications, so I strongly encourage anybody in data science to look at them carefully and see what they can learn. In sum: business intelligence is very goal-oriented; data science prepares the data and sets up the form for business intelligence; but data science can also learn a lot about usability and accessibility from business intelligence, so it's always worth taking a close look.

Data science has a lot of really wonderful things about it, but it is important to consider some ethical issues. I'll specifically call this "do no harm" in your data science projects, and for that we can say thanks to Hippocrates, the man who gave us the Hippocratic
oath of "do no harm." Let's talk briefly about some of the important ethical issues that come up in data science.

Number one is privacy. Data tells you a lot about people, and you need to be concerned about confidentiality. If you have private information about people, their names, Social Security numbers, addresses, credit scores, or health, that's private, that's confidential, and you shouldn't share that information unless they specifically gave you permission. One of the reasons this presents a special challenge in data science is that, as we'll see later, a lot of the sources used in data science were not intended for sharing. If you scrape data from a website or from PDFs, you need to make sure it's okay to do that, because the data was originally created without any intention of sharing. So it really falls on the analyst to make sure privacy is handled properly.

Next is anonymity. One of the interesting things we find is that it's really not hard to identify people in data. If you have a little bit of GPS data and you know where a person was at four different points in time, you have about a 95% chance of knowing exactly who they are. Or look at something like HIPAA, the Health Insurance Portability and Accountability Act: before HIPAA, it was really easy to identify people from medical records; since then, it has become much more difficult to identify people uniquely, and that's important for people's well-being. And then there's proprietary data: if you're working for a client or a company and they give you their own data, that data may have identifiers. You may know who the people are, and they're not anonymous anymore. So while there may or may not be major efforts to make the data anonymous, the primary obligation is that, even if you do know who people are, you still maintain the privacy and confidentiality of the data.

Next, there's an issue about copyright, where people try to lock down information. Just because something is on the web doesn't mean you're allowed to use it. Scraping data from websites is a very common and useful way of getting data for projects; you can get data from web pages, PDFs, images, audio, really a huge number of things. But again, the assumption that because it's on the web it's okay to use is not true. You always need to check copyright and make sure it's acceptable for you to access that particular data.

Next, in our very ominous picture, is data security. The idea here is that when you go through all the effort of gathering data and cleaning and preparing it for analysis, you've created something that's very valuable to a lot of people, and you have to be concerned about hackers trying to come in and steal it, especially if the data is not anonymous and has identifiers in it. So there is an additional burden on the analyst to ensure, to the best of their ability, that the data is safe and cannot be broken into and stolen. That can include very simple scenarios, like a person who was on the project but left and took the data with them on a flash drive; you have to find ways to make sure that can't happen either. There are a lot of possibilities, and it's tricky, but it's something you have to consider thoroughly.

Now, two other things come up in terms of ethics but don't usually get addressed in these conversations. Number one is potential bias. The idea here is that the algorithms and formulas used in data science are only as neutral and bias-free as the rules and the data they are given. So if you have rules that touch on something associated with, for instance, gender or age or race or economic standing, factors which, say for Title IX, you're not supposed to use, you might unintentionally be building
those into the system without being aware of it. An algorithm has this sheen of objectivity, and people can place confidence in it without realizing that it's replicating some of the prejudices that happen in real life.

Another issue is overconfidence. The idea here is that analyses are limited simplifications; they have to be, that's just what they are. And because of this, you still need humans in the loop to help interpret and apply the results. The problem comes when people run an algorithm, get a number out to ten decimal places, and say "this must be true," treating it as absolute, unshakable truth written in stone. In fact, if the data were biased going in, if the algorithms were incomplete, or if the sampling was not representative, you can have enormous problems and go down the wrong path with too much confidence in your own analyses. So, once again, humility is in order when doing data science work.

In sum: data science has enormous potential, but it also has significant risks. Part of the problem is that analyses can't be neutral; you have to look at how the algorithms are associated with the preferences, prejudices, and biases of the people who made them. And what that means is that, no matter what, good judgment is always vital to the quality and success of a data science project.

Data science is a field that is strongly associated with its methods, or procedures, and in this section of videos we're going to provide a brief overview of the methods used in data science. Just as a quick warning: in this section things can get kind of technical, and that can cause some people to freak out a bit. But this course is a non-technical overview; the technical, hands-on stuff is in the other courses. It's really important to remember that tech is simply the means of doing data science. Insight, the ability to find meaning in your data, is the goal; tech only helps you get there. So we want to focus primarily on insight, with the tools and the tech serving to further that goal.

Now, there are a few general categories we're going to talk about, with an overview of each. The first is sourcing, or data sourcing: how to get the data that goes into data science, the raw materials you need. The second is coding: computer programming that can be used to obtain, manipulate, and analyze the data. After that, a tiny bit of math: the mathematics behind data science methods, which really forms the foundation of the procedures. Then stats: the statistical methods frequently used to summarize and analyze data, especially as applied to data science. And then there's machine learning, or ML: a collection of methods for finding clusters in the data and for predicting categories or scores on interesting outcomes. Even across these five topics, the presentations aren't too techy-crunchy; they're basically still friendly, and really, that's the way it is. So that's the overview of the overviews. In sum, we need to remember that data science includes tech, but data science is greater than tech; it's more than those procedures. And above all, tech, while important to data science, is still simply a means to insight in data.

The first step in discussing data science methods is to look at the methods of sourcing, or getting, the data that's used in data science. You can think of this as getting the raw materials that go into your analyses. You've got a few different choices here: you can use existing data, you can use something called data APIs, you can scrape web data, or you can make data. We'll talk about each of those very briefly, in a non-technical manner. But right now, let me say something about existing data. This is data that is already at hand. It might be in-house data, so if you
work for a company, it might be your company records. Or you might have open data; for instance, many governments and many scientific organizations make their data available to the public. And then there's third-party data, usually data that you buy from a vendor. Either way, it already exists, and it's very easy to plug it in and go.

You can also use APIs. That stands for application programming interface, and it's something that allows various computer applications to communicate directly with each other; it's like phones for your computer programs. It's the most common way of getting web data, and the beautiful thing about it is that it lets you import the data directly into whatever program or application you're using for analysis.

Next is scraping data. This is for when you want to use data that's on the web but there's no existing API for it. That usually means data in HTML web tables and pages, or maybe PDFs. You can do this either with specialized scraping applications or in a programming language like R or Python, writing the code to do the scraping yourself.

Another option is to make data, which lets you get exactly what you need; you can be very specific. You can do interviews, surveys, or experiments; there are a lot of approaches. Most of them require some specialized training in how to gather quality data, and that's important to remember, because no matter what method you use for getting or making data, you need to keep in mind one little aphorism from computer science. It goes by the name GIGO, which stands for "garbage in, garbage out": if you feed bad data into your system, you're not going to get anything worthwhile, any real insights, out of it. Consequently, it's important to pay attention to metrics, that is, methods for measuring, so you know exactly what your data tell you.
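As a minimal sketch of the scraping idea, here's how you might pull the cells out of an HTML table using only Python's standard library. The HTML snippet is made up for illustration; a real project would more likely use a dedicated library such as Beautiful Soup, and, as discussed in the ethics section, would check the site's terms of use first.

```python
from html.parser import HTMLParser

# A toy HTML table, standing in for a page you might scrape.
PAGE = """
<table>
  <tr><td>Alice</td><td>34</td></tr>
  <tr><td>Bob</td><td>28</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects the text of each <td> cell, one list per <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])   # start a new row
        elif tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1].append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)  # → [['Alice', '34'], ['Bob', '28']]
```

The point is not the parsing details but the workflow: web pages hold data in layout markup, and scraping is the step that turns that markup back into rows and columns you can analyze.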
There are a few ways to think about measurement. You can talk about business metrics; you can talk about KPIs, which means key performance indicators, also used in business settings; or SMART goals, a way of describing goals that are actionable and timely, and so on. In a measurement sense you can also talk about classification accuracy, and I'll discuss each of those in a little more detail in a later movie. For right now, in sum, we can say this: data sourcing is important because you need the raw materials for your analysis. The nice thing is that there are many possible methods for getting data for data science. But no matter what you do, it's important to check the quality and the meaning of the data, so you can get the most insight possible out of your project.

The next step we need to talk about in data science methods is coding, and I'm going to give you a very brief, non-technical overview of coding in data science. The idea here is that you're going to get in there, be king of the jungle, master of your domain, and make the data jump when you need it to. If you remember the data science Venn diagram from the beginning, coding is up here on the top left. And while we often picture people typing lines of code, which is very frequent, it's more important to remember that when we talk about coding, or about computers in general, what we're really talking about is any technology that lets you manipulate the data in the ways you need to perform the procedures you need to get the insight you want out of your data.

Now, there are three very general categories we'll be discussing here at datalab. The first is apps: specialized applications or programs for working with data. The second is data, or specifically data formats: there are special formats for web
data, which I'll mention in a moment. And the third is code: programming languages that give you full control over what the computer does and how you interact with the data.

Let's take a look at each one very briefly. In terms of apps, there are spreadsheets like Excel or Google Sheets; these are the fundamental data tools of probably the majority of the world. There are specialized applications like Tableau for data visualization, or SPSS, a very common statistical package in the social sciences and in business. And there's one of my favorites, JASP, a free, open-source analog of SPSS, which I actually think is a lot easier to use and to replicate research with. And there are tons of other choices.

In terms of web data, it's helpful to be familiar with things like HTML, XML, and JSON, and the other formats used to encapsulate data on the web, because those are the things you're going to have to program against when you get your data.

And then there are the actual coding languages. R is probably the most common, along with Python, a general-purpose language that has been well adapted for data use. There's SQL, the structured query language for databases, and more basic languages like C, C++, and Java, which are used more in the back end of data science. And then there's Bash, the most common command-line interface, and regular expressions. We'll talk about all of these in other courses here at datalab. But remember this: tools are just tools. They're only one part of the entire data science process, a means to the end, and the end, the goal, is insight. You need to know where you're trying to go, and then simply choose the tools that help you reach that particular goal. That's the most important thing.

So, in sum, here are a few things. Number one, use your tools wisely; remember, your questions need to drive the process, not the tools themselves. Also, I'll just mention that a few tools is usually enough.
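Since regular expressions get a mention here, a tiny illustration of the kind of task they handle: pulling structured values (here, prices) out of messy text. The sample string is invented for the example.

```python
import re

# Messy text like you might scrape from a page or a log file.
text = "Widget A costs $19.99, Widget B costs $5, shipping is $3.50."

# A pattern for dollar amounts: a literal $, digits,
# and an optional .cents part captured as one group.
prices = re.findall(r"\$(\d+(?:\.\d{2})?)", text)
print(prices)  # → ['19.99', '5', '3.50']
```

One line of pattern replaces what would otherwise be a fiddly loop of string handling, which is exactly why regular expressions earn a place in the data science toolbox.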
You can do an awful lot with just Excel and R. And then, most importantly, focus on your goal and choose your tools, and even your data, to match that goal, so you can get the most useful insights from your data.

The next step in our discussion of data science methods is mathematics, and I'm going to give a very brief overview of the math involved in data science. The important thing to remember is that math really forms the foundation of what we're going to do. If you go back to the data science Venn diagram, we've got stats up here in the right corner, but really it's math and stats, quantitative ability in general; we'll focus on the math part right here. Probably the most important question is: how much math is enough to do what you need to do? Or, to put it another way, why do you need math at all when you've got a computer to do it? Well, I can think of three reasons you don't want to rely on just the computer, and why it's helpful to have some sound mathematical understanding.

Number one: you need to know which procedures to use, and why. You have your question, you have your data, and you need enough understanding to make an informed choice. That's not terribly difficult.

Two: you need to know what to do when things don't work right. Sometimes you get impossible results. I know that in statistics you can get a negative adjusted R-squared, which isn't supposed to happen; it's good to know the mathematics that goes into calculating that statistic, so you can understand how something apparently impossible can occur. Or you're trying to do a factor analysis or principal components and you get a rotation that won't converge; it helps to understand what the algorithm is doing and why it won't work in that situation.

And number three, interestingly, some procedures, some math, are easier and quicker to do by hand than by firing up the computer, and I'll show you a couple of examples in later videos where that can be the case.
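As one concrete look under the hood at that "impossible result": the usual formula is adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), and with a weak model (low R²), a small sample n, and several predictors k, the adjustment can push the value below zero. The numbers here are made up purely to show the effect.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for a model with k predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A weak model: R^2 = 0.10, only 12 observations, 4 predictors.
value = adjusted_r_squared(r2=0.10, n=12, k=4)
print(round(value, 3))  # → -0.414
```

Seen this way, the "impossible" negative value is no mystery: the penalty for fitting four predictors to twelve cases simply outweighs the tiny amount of variance the model explains.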
Fundamentally, there's a nice analogy here: math is to data science as, for instance, chemistry is to cooking, kinesiology is to dancing, and grammar is to writing. The idea is that you can be a wonderful cook without knowing any chemistry, but if you know some chemistry, it's going to help. You can be a wonderful dancer without knowing kinesiology, but it's going to help. And you can probably be a good writer without an explicit knowledge of grammar, but it's going to make a big difference. The same is true of data science: you will do it better if you have some of the foundational information.

So the next question is: what kinds of math do you need for data science? There are a few answers. Number one is algebra; you need some elementary algebra, the basic simple stuff. You may have to do some linear or matrix algebra, because that's the foundation of a lot of the calculations. And you can also have systems of linear equations, where you're trying to solve several equations all at once. It sounds like a tricky thing to do, but this is one of the things that's actually sometimes easier to do by hand.

Now, there's more math. You can get into some calculus; some big O, which has to do with the order of a function, roughly how fast it works; probability theory can be important; and then Bayes' theorem, which is a way of getting what's called a posterior probability, can be a really helpful tool for answering some fundamental questions in data science.

So, in sum: a little bit of math can help you make informed choices when planning your analyses. Very significantly, it can help you find problems and fix them when things aren't going right; it's the ability to look under the hood that makes the difference. And, truthfully, some mathematical procedures, like systems of linear equations, can even be done by hand, sometimes faster than you can do with the computer.
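To make the Bayes' theorem mention concrete, here's a small example of computing a posterior probability, P(A|B) = P(B|A)·P(A) / P(B), using the classic medical-screening setup. The rates are invented round numbers for illustration, not real clinical figures.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # Total probability of a positive test: true positives + false positives.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

# 1% of people have the condition; the test catches 90% of true cases
# but also flags 5% of healthy people.
p = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(round(p, 3))  # → 0.154
```

Even with a "90% accurate" test, a positive result here means only about a 15% chance of actually having the condition, which is exactly the kind of counterintuitive answer that makes the posterior probability so useful.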
That way you can save yourself some time and effort and move ahead more quickly toward your goal of insight.

Now, data science wouldn't be data science without a little bit of statistics, so I'm going to give you a brief overview of how statistics works in data science. You can think of statistics as an attempt to find order in chaos, to find patterns in an overwhelming mess, sort of like trying to see both the forest and the trees. Let's go back to our little Venn diagram: we recently had math and stats here in the top corner, and now we'll talk about stats in particular.

One thing you're trying to do here is explore your data. You can have exploratory graphics, because we're visual people and it's usually easiest to see things; you can have exploratory statistics, a numerical exploration of the data; and you can have descriptive statistics, which are the things most people would have covered if they took a statistics class in college.

Next there's inference. I show smoke here because you can infer things about the wind and air movement by looking at patterns in smoke. The idea is that you're trying to take information from samples and infer something about a population; you're trying to go from one source to another. One common version of this is hypothesis testing; another is estimation, sometimes called confidence intervals. There are other approaches, but all of these let you go beyond the data at hand to make larger conclusions.

Now, one interesting thing about statistics is that you're going to have to be concerned with some of the details, with arranging things just so. For instance, you get to do something like feature selection: picking the variables, or combinations of variables, that should be included. There are problems that can come up, frequent problems, and I'll address some of those in later videos.
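Python's standard library can illustrate the descriptive and inferential sides in a few lines. This is a rough sketch with made-up numbers, using the common normal-approximation 95% interval for the mean (mean ± 1.96 × standard error) rather than anything more rigorous.

```python
import statistics
import math

# A small made-up sample of measurements.
sample = [4.1, 5.0, 4.8, 5.3, 4.6, 5.1, 4.9, 5.2]

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample)
sd = statistics.stdev(sample)        # sample standard deviation

# Inferential statistics: estimate the population mean with a
# rough 95% confidence interval (normal approximation).
se = sd / math.sqrt(len(sample))
low, high = mean - 1.96 * se, mean + 1.96 * se

print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

The first half describes the data at hand; the second half makes a claim about the population the sample came from, which is the exact leap from description to inference discussed above.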
There's also the matter of validation: when you create a statistical model, you have to see whether it's actually accurate. Hopefully you have enough data that you can use a holdout sample to do that, or you can replicate the study. Then there's the choice of estimators, how you actually get the coefficients or combinations in your model, and there are ways of assessing how well your model fits the data. All of these are issues I'll address briefly when we talk about statistical analysis at greater length.

Now, I do want to mention one thing in particular, which I'll just call "beware the trolls." There are people out there who will tell you that if you don't do things exactly the way they say, your analysis is meaningless, your data is junk, and you've wasted your time. You know what? They're trolls. So don't listen to that. You can make an informed enough decision on your own to go ahead and do an analysis that is still useful. Probably one of the most important things to keep in mind here is a wonderful quote from the very famous statistician George Box: "All models are wrong, but some are useful." The question isn't whether you're technically right, or have some level of intellectual purity, but whether you've done something useful. So wave your do-it-yourself flag, and take pride in what you're able to accomplish, even when there are people who may criticize it. You're doing something; go do it.

And so, in sum: statistics allows you to explore and describe your data, and it allows you to infer things about the population. There are a lot of choices available, a lot of procedures, but no matter what you do, the goal is useful insight. Keep your eyes on that goal and you will find something meaningful and useful in your data, to help you in your own research and projects.
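The holdout idea mentioned above can be sketched in a few lines of plain Python: fit the simplest possible "model" (predict the training mean) on one part of the data, then check its error on data it never saw. The numbers are invented, and real work would use a proper library and a shuffled split.

```python
# Made-up outcome values; in practice these would come from your data set.
data = [3.0, 4.0, 5.0, 4.0, 6.0, 5.0, 7.0, 8.0]

# Holdout split: train on the first 6 points, hold out the last 2.
train, holdout = data[:6], data[6:]

# The simplest possible "model": always predict the training mean.
prediction = sum(train) / len(train)

# Validation: mean absolute error on the holdout sample only.
mae = sum(abs(y - prediction) for y in holdout) / len(holdout)
print(prediction, mae)  # → 4.5 3.0
```

The large holdout error relative to the training data is the whole point of validation: it tells you how the model behaves on data it wasn't fit to, which is the only performance that matters in practice.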
Let's finish our data science methods overview with a brief look at machine learning. Now, I've got to admit, when you say the term machine learning, people start thinking the robot overlords are about to take over the world. That's not what it is. Instead, let's go back to our Venn diagram one more time: in the intersection at the top, between coding and stats, is machine learning, commonly called just ML.

The goal of machine learning is to work in a data space. For instance, you can take a whole lot of data (we've got tons of books here) and reduce the dimensionality, that is, take a very large, scattered data set and try to find its most essential parts. You can use these methods to find clusters within the data, where like goes with like, using methods such as k-means. You can also look for anomalies, unusual cases that show up in the data space. Or, coming back to categories and like-with-like, you can use things like logistic regression, k-nearest neighbors (k-NN), naive Bayes for classification, decision trees, SVM (support vector machines), or artificial neural nets. Any of those will help you find the patterns and clumping in your data, so you can get similar cases next to each other and get the cohesion you need to draw conclusions about the groups.

Also, a major element of machine learning is prediction: you're pointing your way down the road. The most common, most basic approach here is linear regression and multiple regression. There's also Poisson regression, which is used for modeling count or frequency data, and then there's the idea of ensemble models, where you create several models, take the predictions from each, and put them together to get an overall more reliable prediction. Now, I'll talk about each of these in more detail in later courses.
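As a taste of one method named above, here's a deliberately tiny k-nearest-neighbors classifier in plain Python: to label a new point, look at the k closest labeled points and take a majority vote. The two-feature data set is made up, and a real project would reach for something like scikit-learn instead.

```python
from collections import Counter
import math

# Made-up training data: (feature1, feature2) -> class label.
points = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
          ((6.0, 6.5), "B"), ((7.0, 6.0), "B"), ((6.5, 7.0), "B")]

def knn_predict(x, k=3):
    """Classify point x by majority vote among its k nearest neighbors."""
    # Sort training points by Euclidean distance to x.
    by_distance = sorted(points, key=lambda p: math.dist(x, p[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.8)))  # → A (near the first cluster)
print(knn_predict((6.8, 6.2)))  # → B (near the second cluster)
```

This is the "like goes with like" idea in its purest form: no equations are fit at all; the data itself does the classifying.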
in a  little more detail in later courses but  for right now i mostly just want to know  that these things exist and that’s what  we mean when we refer to machine  learning so  in sum machine learning can be used to  categorize cases and to predict scores  on outcomes  and there’s a lot of choices many  choices and procedures available but  again as i said with statistics and i’ll  just say again many times after this  no matter what the goal  is not that i’m going to do an  artificial neural network or an svm the  goal is to get useful insight into your  data machine learning is a tool and use  it  to the extent that it helps you get that  insight that you need  in the last several videos i’ve talked  about the role and data science of  technical things  on the other hand communicating is also  central to the practice and the first  thing i want to talk about there is  interpretability  the idea here is that you want to be  able to lead people through a path on  your data you want to tell a data driven  story and that’s the entire goal of what  we’re doing with data science now  another way to think about this is when  you’re doing your analysis what you’re  trying to do is solve for value you’re  making an equation you take the data  you’re trying to solve for value the  trouble is this a lot of people get hung  up on analysis but they need to remember  that analysis is not the same thing as  value instead i like to think of it this  way  that analysis  times story  is equal to value  now please note  that’s multiplicative not additive  and so one consequence of that is when  you go back to analysis times story  equals value well if you have zero story  you’re going to have zero value because  as you recall anything times zero is  zero  so instead of that let’s go back this  and say what we really want to do is we  want to maximize the story so that we  can maximize the value that results from  our analysis  again maximum value is the overall goal  here the 
analysis, the tools, the tech are simply methods for getting to that goal.

So let's talk about goals. Analysis is goal-driven: you're trying to accomplish something specific, and so the story, or the narrative, or the explanation you give about your project should match those goals. If you're working for a client, and they had specific questions that they wanted you to answer, then you have a professional responsibility to answer those questions clearly and unambiguously, so they know whether you said yes or no, and they know why you said yes or no.

Now, part of the problem here is the fact that the client isn't you. They don't see what you do, and, as I show here, simply covering your face doesn't make things disappear. You have to worry about a few psychological obstacles. You have to worry about egocentrism, and I'm not talking about being vain; I'm talking about the idea that you think other people see and know and understand what you know. That's not true; otherwise they wouldn't have hired you in the first place. So you have to put things in terms that the client works with and understands, and you're going to have to get out of your own center in order to do that.

Also, there's the idea of false consensus, the idea that "well, everybody knows that." Again, that's not true; otherwise they wouldn't have hired you. You need to understand that they're going to come from a different background, with a different range of experience and interpretation, and you're going to have to compensate for that.

A funny little thing is the idea of anchoring: when you give somebody an initial impression, they use it as an anchor and then adjust away from it. So if you're going to try to flip things over on their heads, watch out for giving a false impression at the beginning, unless you absolutely need to.

But most importantly, in order to bridge the gap between the client and you, you need to have clarity and explain yourself
at each step.

You can also think about the answers. When you're explaining the project to the client, you might want to follow a very simple procedure: state the question that you're answering, give your answer to that question, qualify it as needed, and then go in order, top to bottom. You're trying to make it as clear as possible what you're saying and what the answer is, and make it really easy to follow.

Now, in terms of discussing your process, how you did all this: most of the time, it's probably the case that they don't care. They just want to know what the answer is, and that you used a good method to get it. So discuss process, or the technical details, only when absolutely necessary. That's something to keep in mind.

The principle here is to remember what analysis means: breaking something apart. (This, by the way, is a mechanical typewriter broken into its individual components.) An analysis of data is an exercise in simplification: you're taking the overall complexity, sort of the overwhelmingness of the data, and boiling it down, finding the patterns that make sense and serve the needs of your client.

Now, let's go to a wonderful quote from our friend Albert Einstein, who said, "Everything should be made as simple as possible, but not simpler." That's true in presenting your analysis. Or go to the architect and designer Ludwig Mies van der Rohe, who said, "Less is more." (It was actually Robert Browning who originally said that, but Mies van der Rohe popularized it.) Or, if you want another way of putting it, a principle from my own field (I'm actually a psychological researcher): they talk about being "minimally sufficient," just enough to adequately answer the question. If you're in commerce, you know about a minimum viable product; it's sort of the same idea with an analysis, the minimum viable analysis.

So here are a few tips. When you're giving a
presentation: more charts, less text. Great. Then simplify the charts: remove everything that doesn't need to be in there. Generally, you want to avoid tables of data, because those are hard to read. And then, one more time, because I want to emphasize it: less text. Charts and tables can usually carry the message.

So let me give you an example, using a very famous data set: Berkeley admissions. (Now, these are not the stairs to Berkeley, but the picture gives the idea of trying to get into something far off and distant.) Here's the data. This is graduate school admissions in 1973, so it's over 40 years ago, but the idea is that men and women were both applying for graduate school at the University of California at Berkeley. What we find is that 44 percent of the men who applied were admitted (they're the part in green), while of the women who applied, only 35 percent were admitted. So, at first glance, this is bias, and it actually led to a lawsuit; it was a major issue.

So what Berkeley then tried to do was find out which programs were responsible for this bias, and then you get a very curious set of results. If you break the applications down by program (here we're just calling them A through F, six different programs), what you actually find is this, with male applicants on the left and female applicants on the right of each pair: if you look at program A, women actually got accepted at a higher rate. And the same is true for B, and the same is true for D, and the same is true for F. So this is a very curious set of results, and something that requires explanation.

Now, in statistics this is known as Simpson's paradox, and here's the paradox: bias may be negligible at the department level, and in fact, as we saw, in four of the departments there was a possible bias in favor of women. The problem is that women applied to more selective programs, programs with lower acceptance rates.
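The reversal just described is easy to verify yourself. As an editor's illustration (not from the course), the snippet below uses the six-department counts that ship with R as the `UCBAdmissions` data set, from Bickel and colleagues' published analysis; note that the aggregate rates for this six-department subset come out near 45% and 30%, slightly different from the 44% and 35% figures quoted above for the full applicant pool.

```python
# Berkeley 1973 graduate admissions, six largest departments
# (the UCBAdmissions data; each entry is (admitted, applied)).
admissions = {
    "A": {"men": (512, 825), "women": (89, 108)},
    "B": {"men": (353, 560), "women": (17, 25)},
    "C": {"men": (120, 325), "women": (202, 593)},
    "D": {"men": (138, 417), "women": (131, 375)},
    "E": {"men": (53, 191),  "women": (94, 393)},
    "F": {"men": (22, 373),  "women": (24, 341)},
}

def rate(admitted, applied):
    return admitted / applied

# Aggregated over all departments, men look clearly favored...
for sex in ("men", "women"):
    admitted = sum(d[sex][0] for d in admissions.values())
    applied = sum(d[sex][1] for d in admissions.values())
    print(sex, round(100 * rate(admitted, applied)))  # men 45, women 30

# ...but department by department, women are admitted at a
# higher rate in four of the six programs (A, B, D, F).
for dept, d in admissions.items():
    print(dept, "women ahead:", rate(*d["women"]) > rate(*d["men"]))
```

The paradox lives entirely in the unequal application patterns: women applied heavily to departments C through F, which admit far fewer of everyone.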
Now, some people stop right here and say, "Therefore, nothing's going on; nothing to complain about." But you know, that's ending the story a little bit early. There are other questions that you can ask, and if you're producing a data-driven story, this is the stuff you would want to do. So, for instance, you may want to ask: why do the programs vary in overall class size? Why do the acceptance rates differ from one program to another? Why do men and women apply to different programs? You might want to look at things like the admissions criteria for each of the programs, their promotional strategies (how they advertise themselves to students), and the kinds of prior education students have in each of the programs, and you really want to look at funding levels for each of the programs. So, really, you get one answer, and it leads to more questions, and maybe some more answers and more questions, and you need to address enough of this to provide a comprehensive overview and solution for your client.

In sum, let's say this: stories give value to data analyses, and when you tell the story, you need to make sure that you are addressing your client's goals in a clear, unambiguous way. The overall principle here is: be minimally sufficient. Get to the point, make it clear, say what you need to, but otherwise be concise and make your message clear.

The next step in discussing data science and communicating is to talk about actionable insights, or information that can be used productively to accomplish something. Now, to give sort of a bizarre segue here: look at a game controller. It may be a pretty thing, it may be a nice object, but remember, game controllers exist to do something. They exist to help you play the game, and to do it as effectively as possible. They have a function; they have a purpose. Same way: data is for doing. Now, that's a paraphrase of one of my favorite historical figures, William James, the father of American
psychology and of pragmatism in philosophy, and he has this wonderful quote: "My thinking is first and last and always for the sake of my doing." The idea applies to analysis: your analysis and your data are for the sake of your doing, and so you're trying to get some specific insight into how you should proceed.

What you want to avoid is the opposite of this, from one of my other favorite cultural heroes, the famous Yankees catcher Yogi Berra, who said, "We're lost, but we're making good time." The idea here is that frantic activity does not make up for a lack of direction. You need to understand what you're doing so you can reach a particular goal, and your analysis is supposed to do that.

So when you're giving your analysis, you're going to try to point the way. Remember why the project was conducted: the goal is usually to direct some kind of action, to reach some kind of goal for your client, and the analysis should be able to guide that action in an informed way. One thing you want to do is give the next steps to your client: tell them what they need to do now. You want to be able to justify each of those recommendations with the data and your analysis. As much as possible, be specific: tell them exactly what they need to do, make sure it's doable by the client, that it's within their range of capability, and that each step builds on the previous step.

Now, that being said, there is one really fundamental, sort of philosophical, problem here, and that's the difference between correlation and causation. Basically, it goes this way: your data gives you correlation; you know that this is associated with that. But your client doesn't simply want to know what's associated; they want to know what causes something, because if they're going to do something, that's an intervention, and it's designed to produce a particular result. So, really, how do you get from the correlation, which is what
you have in the data, to the causation, which is what your client wants?

Well, there are a few ways to do that. One is experimental studies: randomized controlled trials. That's theoretically the simplest path to causality, but it can be really tricky in the real world. There are quasi-experiments, a whole collection of methods that use non-randomized data, usually observational data, adjusted in particular ways to get an estimate of causal inference. Or there's theory and experience: research-based theory and domain-specific experience. This is where you actually get to rely on your client's information; they can help you interpret the data, especially if they have greater domain expertise than you do.

Another thing to think about is the social factors that affect your data. Now, you remember the data science Venn diagram; we've looked at it lots of times, and it's got these three elements. Some people have proposed adding a fourth circle to this Venn diagram, so we'll kind of put that in there and say that social understanding is also important, critical really, to valid data science. Now, I love that idea, and I do think that it's important to understand how things are going to play out.

There are a few kinds of social understanding you want to be aware of. Your client's mission: you want to make sure that your recommendations are consistent with your client's mission, and also with your client's identity; not just "this is what we do," but "this is really who we are." You need to be aware of the business context, sort of the competitive environment and the regulatory environment that they're working in, as well as the social context. That can be outside of the organization, but even more often it's within it: your recommendations will affect relationships within the client's organization, and you're going to try to be aware of those as much as you
can, to make it so that your recommendations can be realized the way they need to be.

So, in sum: data science is goal-focused, and when you're focusing on that goal for your client, you need to give specific next steps that are based on your analysis and justifiable from the data. And in doing so, be aware of the social, political, and economic context; that gives you the best opportunity of getting something really useful out of your analysis.

When you're working in data science and trying to communicate your results, presentation graphics can be an enormously helpful tool. Think of it this way: you are trying to paint a picture for the benefit of your client. Now, when you're working with graphics, there can be a couple of different goals; it depends on what kind of graphics you're working with.

There's the general category of exploratory graphics. These are the ones that you use as the analyst, and for exploratory graphics you need speed and responsiveness, so you get very simple graphics. This is a base histogram in R. They can get a little more sophisticated (this one is done in ggplot), and then you can break it down into a couple of histograms, or present them a different way, or make them see-through, or split them apart into small multiples. But in each case, this is done for your benefit as the analyst, to understand the data. These graphics are quick and effective; they're not very well labeled, and they're usually for your own insight, and then you do other things as a result.

On the other hand, presentation graphics, which are for the benefit of your client, need clarity and narrative flow. Let me talk about each of those characteristics very briefly. Clarity versus distraction: there are things that can go wrong in graphics. Number one is colors; colors can actually be a problem. Also, three-dimensional effects, or false third dimensions, are nearly always a distraction.
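To make the exploratory side concrete: the course's examples use R's base `hist()` and ggplot, but the same quick-and-dirty spirit works in any language. Here is an editor's sketch in Python, with invented scores: a throwaway text histogram that is fast and unpolished, for the analyst's eyes only.

```python
# Throwaway exploratory "graphic": a text histogram of made-up scores.
scores = [12, 15, 15, 16, 18, 21, 22, 22, 23, 23, 24, 27, 31, 33, 41]

def text_hist(values, bin_width=10):
    """One line per bin, one '#' per observation; no labels, no polish."""
    counts = {}
    for v in values:
        b = v // bin_width * bin_width      # floor each value into its bin
        counts[b] = counts.get(b, 0) + 1
    lo = min(values) // bin_width * bin_width
    return [f"{b}-{b + bin_width - 1}: {'#' * counts.get(b, 0)}"
            for b in range(lo, max(values) + 1, bin_width)]

print("\n".join(text_hist(scores)))
# 10-19: #####
# 20-29: #######
# 30-39: ##
# 40-49: #
```

That is the whole point of an exploratory graphic: seconds to produce, good enough to show you the shape of the data, and never shown to a client.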
One thing that gets a little touchy for some people is interaction. We think of interactive graphics as really cool, great things to have, but you run the risk of people getting distracted by the interaction and starting to play around with it: "Oh, I press here, it does that." That distracts from the message, so it may actually be important not to have interaction. The same thing is true of animation: flat, static graphics can often be more informative, because they have fewer distractions in them.

Let me give you a quick example of how not to do things. This is a chart that I made in Excel, based on some of the mistakes I've seen in graphics submitted to me when I teach, and I guarantee you, everything in here I have seen in real life, just not necessarily combined all at once. Let's zoom in a little so we can see the full badness of this graphic. We've got a scale that starts at 8 and goes to 28; it's tiny and doesn't even cover the range of the data. We've got this bizarre picture on the wall. We have no axis lines on the walls. We come down here, and the labels for educational levels are in alphabetical order, instead of the more logical order of increasing education. Then we've got the data represented as cones, which are difficult to read and compare, and it's only made worse by the colors and the textures. To take an extreme: this one, for grad degrees, doesn't even make it to the floor value of 8 percent, and this one, for high school grads, is cut off at the top at 28 percent. And this, by the way, is a picture of a sheep. People do this kind of stuff, and it drives me crazy.

If you want to see a better chart of the exact same data, this is it, right here. It's a straight bar chart. It's flat, it's as simple and as clean as possible, and it's better in many ways. Most effective here is that it communicates clearly, there are no distractions, and it has a logical flow. This is
going to get the point across so much faster.

And I can give you another example. Here's a chart I showed previously about salaries; I have a list here, and I've got data scientists in it. If I want to draw attention to that entry, I have the option of putting a circle around it, and I can put a number next to it to explain it. That's one way to make it easy to see what's going on. But you don't even have to get fancy: I just got out a pen and a Post-it note and drew a bar chart of some real data about life expectancy. It tells the story just as well: there is something terribly amiss in Sierra Leone.

But now let's talk about creating narrative flow in your presentation graphics. To do this, I'm going to pull some charts from my most cited academic paper, which is called "A Third Voice: A Review of Empirical Research on the Psychological Outcomes of Restorative Justice." Think of that as mediation for crimes, mostly juvenile. This paper is interesting because, really, it's about 14 bar charts with just enough text to hold them together, and you can see there's a flow; the charts are very simple.

This one shows judgments about whether the criminal justice system was fair. The two bars on the left are victims; the two bars on the right are offenders. Within each group, on the left are people who participated in restorative justice, or victim-offender mediation (mediation for crimes), and on the right are people who went through standard criminal procedures; it says "court," but it usually means plea bargaining. Anyhow, it's really easy to see that in both cases the restorative justice bar is higher: people were more likely to say it was fair. They also felt that they had an opportunity to tell their story; that's one reason they might think it's fair. They also felt the offender was held accountable more often. In fact, if you go to court, for the offenders that line is below 50 percent, and that's the offenders
themselves making the judgment.

Then you can go to forgiveness and apologies, and again, this is actually a simple thing to code, and you can see there's an enormous difference. In fact, one of the reasons there's such a big difference is that in standard court proceedings the offender very rarely meets the victim. Now, it also turns out that I need to qualify this a little bit, because a bunch of the studies included drunk driving with no injuries or accidents; when we take those out, we see a huge change. Then we can go to whether a person is satisfied with the outcome (again, an advantage for restorative justice), whether the victim is still upset about the crime (the bars are a little different here), and whether they're afraid of re-victimization (that's over a two-to-one difference). And then, finally, recidivism for offenders, or re-offending, and you see a big difference there.

So what I have here is a bunch of charts that are very, very simple to read, and they flow: giving the overall impression and then detailing it a little bit more. There's nothing fancy here, nothing interactive, nothing animated, nothing going in 17 different directions. It's easy, but it follows a story; it tells a narrative about the data, and that should be your major goal with presentation graphics.

In sum: the graphics you use for presenting are not the same as the graphics you use for exploring; they have different needs and different goals. But no matter what you're doing, be clear in your graphics and be focused in what you're trying to tell, and above all, create a strong narrative that gives different levels of perspective and answers questions as you go, to anticipate a client's questions and to give them the most reliable, solid information and the greatest confidence in your analysis.

The final element of data science and communicating that I want to talk about is
reproducible research, and you can think of it as this idea: you want to be able to play that song again. The reason is that data science projects are rarely one-and-done; rather, they tend to be incremental, cumulative, and adaptive to the circumstances they're working in. So one of the important things here, if you want to summarize it very briefly, is this: show your work.

There are a few reasons for this. You may have to revise your own analyses at a later date. You may be doing another project and want to borrow something from previous studies. More likely, you'll have to hand the work off to somebody else at some future point, and they're going to have to be able to understand what you did. And then there's a very significant issue, in both scientific and economic research, of accountability: you have to be able to show that you did things in a responsible way and that your conclusions are justified, for clients, funding agencies, regulators, academic reviewers, any number of people.

Now, you may be familiar with the concept of open data, but you may be less familiar with the concept of open data science, and that's more than open data. For instance, I'll just let you know that there is something called the Open Data Science Conference, at odsc.com, which meets three times a year in different places, and it is, of course, entirely devoted to open data science: using open data, but also making the methods transparent to the people around them.

One thing that can make this really simple is something called the Open Science Framework, at osf.io. It's a way of sharing your data and your research with other people, along with an annotation of how you got through the whole thing. It makes the research transparent, which is what we need. One of my professional organizations, the Association for Psychological Science, has a major initiative on this called Open Practices, where they are strongly
encouraging people to share their data as much as is ethically permissible, and to absolutely share their methods, even before they conduct the study, as a way of getting rigorous intellectual honesty and accountability.

Now, another step in all this is to archive your data: make that information available, put it on the shelf. What you want to do here is archive all of your data sets, both the totally raw data, before you did anything with it, and every step in the process, up to your final clean data set. Along with that, you want to archive all of the code that you used to process and analyze the data. If you used a programming language like R or Python, that's really simple; if you used a program like SPSS, you need to save the syntax files, and then it can be done that way. And again, no matter what, make sure to comment liberally and explain yourself.

Part of that is that you need to explain your process, because you're not just a lone person sitting on the sofa working by yourself; you're working with other people, and you need to explain why you did it the way that you did: the choices you made, the consequences of those choices, and the times you had to backtrack and try again.

This also works into the principle of future-proofing your work. You want to do a few things here. Number one, the data: store the data in non-proprietary formats, like a CSV (comma-separated values) file, because almost anything can read CSV files. If you stored it in SPSS's proprietary .sav format, you might be in a lot of trouble when somebody tries to use it later and can't open it. Then there's storage: you want to place all of your files in a secure, accessible location; GitHub is probably one of the best choices. And then the code: you may want to use a dependency-management tool, like Packrat for R or a virtual environment for Python, as a way of making sure the package versions you used keep working.
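As a small, hedged sketch of what archiving in a non-proprietary format can look like (the file names and data are invented, and the checksum step is an editor's addition, not something the course prescribes):

```python
# Future-proofing sketch: save data as CSV (non-proprietary) and
# record a checksum so the archived file can be verified later.
import csv
import hashlib

rows = [["id", "score"], [1, 12], [2, 15], [3, 21]]  # made-up data

# Write the data set in a format almost anything can read
with open("study_data.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# A SHA-256 digest acts as a fingerprint of the archived file
digest = hashlib.sha256(open("study_data.csv", "rb").read()).hexdigest()
with open("study_data.csv.sha256", "w") as f:
    f.write(digest + "\n")

# Later (or on another machine), anyone can confirm the file is intact
assert hashlib.sha256(
    open("study_data.csv", "rb").read()).hexdigest() == digest
print("archived and verified")
```

The same pattern (plain-text data plus a recorded fingerprint, kept under version control) is a cheap way to make "show your work" concrete.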
Sometimes things get updated and code gets broken; pinning your dependencies is a way of making sure that the system you built will always work.

Overall, you can think of it this way: you want to explain yourself, and a neat way to do that is to put your narrative in a notebook. Now, you can keep a physical lab book, but you can also use digital notebooks. A really common one, especially if you're using Python, is Jupyter (with a "y" there in the middle). Jupyter notebooks are interactive notebooks; here's a screenshot of a very simple one I made in Python, with titles, text, and graphics. If you're working in R, you can do this through something called R Markdown, which works the same way: you do it in RStudio, use Markdown, and you can annotate the whole thing. You can get more information about that at rmarkdown.rstudio.com. So, for instance, here's an R analysis I did: you see the code on the left and the rendered Markdown version on the right. What's neat about this is that this little bit of code here, this title and this text and this little bit of R code, is displayed as this formatted heading, this formatted text, and the entire R output, right there. It's a great way to do things. And if you use R Markdown, you also have the option of uploading the document to something called RPubs, an online document that can be made accessible to anybody. Here's the same document, and if you want to go see it, you can go to this address; it's kind of long, so I'm going to let you write that one down yourself.

But in sum, here's what we have: you want to do your work and archive the information in a way that supports collaboration. Explain your choices: say what you did, show how you did it. This allows you to future-proof your work so it will work in other situations for other people. And as much as possible, no matter how you do it, make sure to share your narrative
so people understand your process and can see that your conclusions are justifiable, strong, and reliable.

Now, something I've mentioned several times when talking about data science, and I'll do it again in this conclusion, is that it's important to give people next steps. So I'm going to do that for you right now. If you're wondering what to do after having watched this very general overview course, I can give you a few ideas. Number one, maybe you want to start trying to do some coding in R or Python; we have courses for those. You might want to try doing some data visualization, one of the most important things that you can do. You may want to brush up on statistics, and maybe some of the math that goes along with it, and you may want to try your hand at machine learning. All of these will get you up and rolling in the practice of data science. You can also try looking at data sourcing: finding the information you're going to work with.

But no matter what happens, try to keep it in context. Data science is going to be applied to marketing and sports and health and education and the arts, and really a huge number of other things, and we will have courses here at datalab.cc that talk about all of those.

You may also want to start getting involved in the community of data science. One of the best conferences you can go to is O'Reilly Strata, which meets several times a year around the globe. There's also Predictive Analytics World, again several times a year around the world. Then there are much smaller conferences: I love Tapestry (tapestryconference.com), which is about storytelling in data science, and Extract, a one-day conference about data stories that's put on by import.io, one of the great data-sourcing applications available for scraping web data.

If you want to start working with actual data, a great choice is to go to kaggle.com. They sponsor data science competitions, which actually have cash rewards, but there are
also wonderful data sets you can work with there, to find out how things work and to compare your results with those of other people. And once you're feeling comfortable with that, you may actually try turning around and doing some service. DataKind.org is the premier organization for data science as humanitarian service; they do major projects around the world, and I love their examples. There are other things you can do: there's an annual event called Do Good Data, and datalab.cc will be sponsoring, twice a year, DataLab charrettes, which are opportunities for people in the Utah area to work with local nonprofits on their data.

But above all of this, I want you to remember one thing: data science is fundamentally democratic. It's something that everybody needs to learn to do in some way, shape, or form. The ability to work with data is a fundamental ability, and everybody would be better off by learning to work with data intelligently and sensitively. Or, to put it another way: data science needs you. Thanks so much for joining me for this introductory course. I hope it's been good, and I look forward to seeing you in the other courses here at datalab.cc.

Welcome to Data Sourcing. I'm Barton Poulson, and in this course we're going to talk about "data opus," which is Latin for "data needed." The idea here is: no data, no data science, and that is a sad thing. So instead of leaving it at that, we're going to use this course to talk about methods for measuring and evaluating data, methods for accessing existing data, and even methods for creating new custom data. Take those together, and it's a happy situation. At the same time, we'll do all of this at an accessible, conceptual, and non-technical level, because the technical, hands-on stuff will happen in other, later courses. But for now, let's talk data.

For data sourcing, the first thing we want to talk about is measurement, and within that category, we're going to talk about metrics. The idea
here is that you actually need to know what your target is if you want to have a chance of hitting it.

There are a few particular reasons for this. First off, data science is action-oriented: the goal is to do something, as opposed to simply understanding something (which I say as an academic practitioner). Also, your goal needs to be explicit, and that's important because goals can guide your effort; you want to say exactly what you're trying to accomplish, so you know when you get there. Also, goals exist for the benefit of the client, because they can prevent frustration: the client knows what you're working on and what you have to do to get there. And finally, the goals and the metrics exist for the benefit of the analyst, because they help you use your time well. You know when you're done, you know when you can move ahead with something, and that makes everything a little more efficient and a little more productive.

Now, the first thing you want to do is try to define success in your particular project or domain. Depending on where you are: in commerce, that can include things like sales, click-through rates, or new customers. In education, it can include scores on tests, graduation rates, or retention. In government, it can include things like housing and jobs. In research, it can include the ability to serve the people you're trying to better understand. Whatever domain you're in, there will be different standards for success, and you're going to need to know what applies in your domain.

Next are specific metrics, or ways of measuring. Again, there are a few different categories here: there are business metrics, there are key performance indicators (KPIs), there are SMART goals (that's an acronym), and there's also the issue of having multiple goals. I'll talk about each of those for just a second.

First off, let's talk about business metrics. If you're in the commercial world, there are some common
ways of measuring success. A very obvious one is sales revenue: are you making more money, are you moving the merchandise, are you getting sales? Also, there's the issue of leads generated: new customers, or new potential customers, because that in turn is associated with future sales. There's also the issue of customer value, or lifetime customer value: you may have a small number of customers, but if they each bring in a lot of revenue, you can use that to predict the overall profitability of your current system. And then there's churn rate, which has to do with losing and gaining customers and having a lot of turnover. Any of these are potential ways of defining success and measuring it; these are potential metrics. There are others, but these are some really common ones.

Now, I mentioned earlier something called a key performance indicator, or KPI. KPIs come from David Parmenter, and he's got a few ways of describing them. He says a key performance indicator for business should be, number one, non-financial: not just the bottom line, but something else that might be associated with it, or that measures the overall productivity of the organization. They should be timely: for instance, weekly, daily, or even constantly gathered information. They should have a CEO focus, since the senior management team are the ones who generally make the decisions that affect how the organization acts on the KPIs. They should be simple, so everybody in the organization knows what they are and knows what to do about them. They should be team-based, so teams can take joint responsibility for meeting each of the KPIs. They should have significant impact, which really means they should affect more than one important outcome: profitability and market reach, say, or improved manufacturing time and fewer defects. And finally, an ideal KPI has a limited dark side: few possibilities for reinforcing the wrong behaviors or rewarding people for exploiting the system.
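To make a couple of these business metrics concrete, here is an editor's back-of-the-envelope sketch with invented numbers; the lifetime-value formula shown is one common simplification, not the only definition in use.

```python
# Two common business metrics with made-up numbers:
# monthly churn rate and a simple lifetime customer value.
customers_at_start = 200
customers_lost = 10            # cancelled during the month

churn_rate = customers_lost / customers_at_start   # 0.05, i.e. 5%

# One rough lifetime-value formula: average revenue per customer
# per month, times the expected lifetime in months (1 / churn rate).
avg_monthly_revenue = 30.0
lifetime_value = avg_monthly_revenue / churn_rate  # about $600

print(f"churn: {churn_rate:.0%}, lifetime value: ${lifetime_value:.2f}")
```

The point is not the arithmetic; it's that each metric is explicit, so everyone knows exactly what counts as success and when it has been reached.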
behaviors and rewarding people for sort of exploiting the system next there are smart goals where smart stands for specific measurable assignable to a particular person realistic meaning you can actually do it with the resources you have at hand and time bound so you know when it can get done so whenever you form a goal you should try to assess it on each of these criteria and that's a way of saying that this is a good goal to be used as a metric for the success of our organization now the trick however is when you have multiple goals multiple possible endpoints and the reason that's difficult is because well it's easy to focus on one goal if you're just trying to maximize revenue or if you're just trying to maximize graduation rate there's a lot of things you can do it becomes more difficult when you have to focus on many things simultaneously especially because some of these goals may conflict the things that you do to maximize one may impair the other and so when that happens you actually need to start engaging in a deliberate process of optimization you need to optimize and there are ways you can do this if you have enough data you can do mathematical optimization to find the ideal balance of efforts to pursue one goal and the other goal at the same time now this is a very general summary and let me finish with this in sum metrics or methods for measuring can help raise awareness of how well your organization is functioning and how well you're reaching your goals there are many different methods available for defining success and measuring progress towards those things the trick however comes when you have to balance efforts to reach multiple goals simultaneously which can bring in the need for things like optimization when talking about data sourcing and measurement one very important issue has to do with the accuracy of your measurements the idea here is that you don't want to have to throw away all
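The optimization idea just described can be sketched in a few lines of Python. The two payoff curves and every number here are invented purely for illustration; in practice the payoffs would be estimated from your data and you'd likely use a real optimizer:

```python
# toy optimization: split a fixed budget between two conflicting goals
# (the curves and numbers are made up purely for illustration)
def total_payoff(x):
    # x = share of budget spent on goal a, from 0.0 to 1.0
    revenue = 10 * x ** 0.5          # diminishing returns on goal a
    retention = 8 * (1 - x) ** 0.5   # diminishing returns on goal b
    return revenue + retention

# brute-force search over candidate splits of the budget
best = max((i / 100 for i in range(101)), key=total_payoff)
print(best, round(total_payoff(best), 3))
```

The point of the sketch is the shape of the problem: spending more on one goal helps it less and less while starving the other, so there is a best balance point somewhere in the middle rather than at either extreme.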
your ideas you don't want to waste effort one way of doing this in a very quantitative fashion is to make a classification table so what that looks like is this you talk about for instance positive results negative results and in fact let's start by looking at the top here the middle two columns here talk about whether an event is present whether your house is on fire whether a sale occurs whether you've got a tax evader whatever so that's whether a particular thing is actually happening or not on the left here is whether the test or the indicator suggests that a thing is or is not happening and then you have these combinations of true positives where the test says it's happening and it really is and false positives where the test says it's happening but it's not and then below that true negatives where the test says it isn't happening and that's correct and then false negatives where the test says there's nothing going on but there is in fact the event occurring and then you start to get the column totals the total number of events present or absent and the row totals that talk about the test results now from this table what you get is four kinds of accuracy or really four different ways of quantifying accuracy using different standards and they go by these names sensitivity specificity positive predictive value and negative predictive value i'll show you very briefly how each of them works sensitivity can be expressed this way if there's a fire does the alarm ring you want that to happen and so that's a matter of looking at the true positives and dividing that by the total number of fires that is the total number of events present the test positive means there's an alarm and the event present means there's a fire and you always want to have an alarm when there's a fire specificity on the other hand is sort of the flip side of this if there isn't a fire does the alarm stay quiet this is where you're looking at the ratio of true negatives to total
absent events where there's no fire and the alarms aren't ringing and that's what you want now those are looking at columns you can also go sideways across rows so the first one there is positive predictive value often just abbreviated as ppv and we flip around the order a little bit this one says if the alarm rings was there a fire so now you're looking at the true positives and dividing it by the total number of positives total number of positives is any time the alarm rings true positives is because there was a fire and negative predictive value or npv says if the alarm doesn't ring does that in fact mean that there is no fire well here you're looking at true negatives and dividing it by total negatives the times that it doesn't ring and again you want to maximize that so the true negatives account for all of the negatives the same way you want the true positives to account for all of the positives and so on now you can put numbers on all of these going from zero percent to 100 percent and the idea is to maximize each one as much as you can so in sum from these tables we get four kinds of accuracy and there's a different focus for each one but the same overall goal you want to identify the true positives and true negatives and avoid the false positives and false negatives and this is one way of putting numbers on an index really on the accuracy of your measurement now data sourcing may seem like a very quantitative topic especially when we're talking about measurement but i want to mention one important thing here and that is the social context of measurement the idea here really is that people are people and they all have their own goals and they're going their own ways and we all have our own thoughts and feelings that don't always coincide with each other and this can affect measurement and so for instance when you're trying to define your goals and you're trying to maximize them you want to look at things like for
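The four accuracy measures described above are easy to compute once you have the four cell counts from the classification table. Here is a minimal sketch in Python; the counts are made-up numbers purely for illustration:

```python
def accuracy_measures(tp, fp, tn, fn):
    """the four accuracy measures from a 2x2 classification table

    tp: test positive, event present (alarm rings, there's a fire)
    fp: test positive, event absent  (alarm rings, no fire)
    tn: test negative, event absent  (no alarm, no fire)
    fn: test negative, event present (no alarm, there's a fire)
    """
    return {
        # if there's a fire, does the alarm ring?
        "sensitivity": tp / (tp + fn),
        # if there isn't a fire, does the alarm stay quiet?
        "specificity": tn / (tn + fp),
        # if the alarm rings, was there a fire?
        "ppv": tp / (tp + fp),
        # if the alarm doesn't ring, is there really no fire?
        "npv": tn / (tn + fn),
    }

# made-up example counts
m = accuracy_measures(tp=40, fp=10, tn=90, fn=5)
print({k: round(v, 3) for k, v in m.items()})
```

Each value runs from 0 to 1, or 0 to 100 percent, and the goal is to push all four as high as you can.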
instance the business model an organization's business model the way it conducts its business the way it makes its money is tied to its identity and its reason to be and if you make a recommendation that's contrary to that business model it can actually be perceived as a threat to the organization's core identity and people tend to get freaked out in that situation also restrictions so for instance there may be laws policies and common practices both organizationally and culturally that may limit the ways that goals can be met now most of these make a lot of sense so the idea is you can't just do anything you want you need to have these constraints and when you make your recommendations maybe you'll work creatively within them as long as you're still behaving legally and ethically but you do need to be aware of these constraints next is the environment and the idea here is that competition occurs both between organizations company a here is trying to reach a goal but they're competing with company b over there but probably even more significantly there is competition within the organization this is really a recognition of office politics and that when you as a consultant make a recommendation based on your analysis you need to understand you're kind of dropping a little football into the office and things are going to further one person's career maybe to the detriment of another and in order for your recommendations to have the maximum effectiveness they need to play out well in the office that's something that you need to be aware of as you're making your recommendations finally there's the issue of manipulation and a sad truism about people is that any reward system any reward system at all will be exploited and people will generally game the system this happens especially when you have a strong cutoff you need to get at least eighty percent or you get fired and people will do anything to make their numbers
appear to be eighty percent this happens an awful lot when you look at executive compensation systems it happens a lot when you have very high-stakes school testing it happens in an enormous number of situations and so you need to be aware of the risk of exploitation and gaming now don't think then that all is lost don't give up you can still do really wonderful assessment you can get good metrics just be aware of these particular issues and be sensitive to them as you both conduct your research and as you make your recommendations so in sum social factors affect goals and they affect the way that you meet those goals there are limits and consequences both on how you reach the goals and really on what the goals should be and when you're giving advice on how to reach those goals please be sensitive to how things play out with metrics and how people will adapt their behavior to meet the goals that way you can make something that's more likely to be implemented the way you meant and more likely to predict accurately what can happen with your goals when it comes to data sourcing obviously the most important thing is to get data but the easiest way to do that at least in theory is to use existing data think of it as going to the bookshelf and getting the data that you have right there at hand now there's a few different ways to do this you can get in-house data you can get open data and you can get third-party data another nice way to think of that is proprietary public and purchased data the three p's i've heard it called let's talk about each of these a little bit more so in-house data that's stuff that's already in your organization what's nice about that is it can be really fast and easy it's right there and the format may be appropriate for the kind of software and the computer that you're using if you're fortunate there's good documentation although sometimes when it's in-house people just kind of throw
it together so you have to watch out for that and there's the issue of quality control now this is true with any kind of data but you need to pay attention with in-house because you don't know the circumstances necessarily under which people gathered the data and how much attention they were paying to something there's also an issue of restrictions there may be some data that while it's in the house you may not be allowed to use or you may not be able to publish the results or share the results with other people so these are things that you need to think about when you're going to use in-house data in terms of how you can use it to facilitate your data science projects specifically there are a few pros and cons in-house data is potentially quick easy free hopefully it's standardized maybe even the original team that conducted the study is still there and you might have identifiers in the data which make it easier for you to do an individual level analysis on the con side however the in-house data simply may not exist maybe it's just not there or the documentation may be inadequate and of course the quality may be uncertain that's always true but maybe something you have to pay more attention to when you're using in-house data now another choice is open data like going to the library and getting something this is prepared data that's freely available it consists of things like government data and corporate data and scientific data from a number of sources let me show you some of my favorite open data sources just so you know where they are and that they exist probably the best one is data.gov here in the u.s that is as it says right there the home of the us government's open data or you may have a state level one for instance i'm in utah and we have data.utah.gov also a great source of more regional information if you're in europe you have open-data.europa.eu the european union open data portal and then there are
major non-profit organizations so the un has unicef.org/statistics for their statistical and monitoring data the world health organization has the global health observatory at who.int/gho and then there are private organizations that work in the public interest such as the pew research center which shares a lot of its data sets and the new york times which makes it possible to use apis to access a huge amount of the data of things they've published over a huge time span and then two of the mother lodes there's google which has public data at google.com and then amazon at aws.amazon.com/datasets has gargantuan datasets so if you need a data set that's like five terabytes in size this is the place you would go to get it now there's some pros and cons to using this kind of open data first is that you can get very valuable data sets that maybe cost millions of dollars to gather and process and you can get a very wide range of topics and times and groups of people and so on and often the data is very well formatted and well documented there are however a few cons sometimes there's bias in the sample say for instance you only get people who have internet access and that can mean not everybody sometimes the meaning of the data is not clear or it may not mean exactly what you want it to a potential problem is that sometimes you may need to share your analyses and if you're doing proprietary research well it's going to have to be open research instead and so that can create a crimp with some of your clients and then finally there are issues with privacy and confidentiality and in public data that usually means that the identifiers are not there and you're gonna have to work at a larger aggregate level of measurement another option is to use data from a third party these go by the name data as a service or daas you can also call them data brokers and the thing about data brokers is
they can give you an enormous amount of data on many different topics plus they can save you some time and effort by actually doing some of the processing for you and that can include things like consumer behaviors and preferences they can get contact information they can do marketing identity and finances there's a lot of things there's a number of data brokers around here's a few of them acxiom is probably the biggest one in terms of marketing data there's also nielsen which provides data primarily for media consumption and there's another organization datasift that's a smaller newer one and there's a pretty wide range of choices but these are some of the big ones now the thing about using data brokers there's some pros and there's some cons the pros are first that it can save you a lot of time and effort it can also give you individual level data which can be hard to get from open data open data is usually at the community level they can give you information about specific consumers they can even give you summaries and inferences about things like credit scores and marital status possibly even whether a person gambles or smokes now the con is this number one it can be really expensive i mean this is a huge service it provides a lot of benefit and is priced accordingly also you still need to validate it you still need to double check that it means what you think it means and that it fits in with what you want and probably a real sticking point here is that the use of third-party data is distasteful to many people and so you have to be aware of that as you're making your choices so in sum as far as data sourcing existing data goes obviously data science needs data and there's the three p's of data sources proprietary public and purchased but no matter what source you use you need to pay attention to quality and to the meaning and the usability of the data to help you along in your own projects when it
comes to data sourcing a really good way of getting data is to use what are called apis now i like to think of these as the digital version of prufrock's mermaids if you're familiar with the love song of j alfred prufrock by t.s eliot he says i have heard the mermaids singing each to each and i like to adapt that to say apis have heard apps singing each to each and that's by me now more specifically when we talk about an api what we're talking about is something called an application programming interface and this is something that allows programs to talk to each other its most important use in terms of data science is it allows you to get web data it allows your program to directly go to the web on its own grab the data and bring it back in almost as though it were local data and that's a really wonderful thing now the most common version of apis for data science are called rest apis that stands for representational state transfer that's the software architectural style of the world wide web and it allows you to access data on web pages via http that's the hypertext transfer protocol that runs the web as we know it and when you download the data you usually get it in json format that stands for javascript object notation the nice thing about that is it's human readable but it's even better for machines then you can take that information and you can send it directly to other programs and the nice thing about rest apis is that they're what's called language agnostic meaning any programming language can call a rest api can get data from the web and can do whatever it needs to with it now there are a few kinds of apis that are really common the first is what are called social apis these are ways of interfacing with social networks so for instance the most common is facebook there's also twitter google talk has been a big one and foursquare as well and then soundcloud these are on lists of
the most popular ones and then there are also what are called visual apis which are for getting visual data so for instance google maps is the most common but youtube something that accesses youtube on a particular website or accuweather which is for getting weather information pinterest for photos and flickr for photos as well so these are some really common apis and you can program your computer to pull in data from any of these services and sites and integrate it into your own website or here into your own data analysis now there's a few different ways you can do this you can program it in r the statistical programming language you can do it in python also you can even do it in the very basic bash command line interface and there's a ton of other applications basically anything can access an api one way or another now i'd like to show you how this works in r so i'm going to open up a script in rstudio and then i'm going to use it to get some very basic information from a web page let me go to rstudio and show you how this works i've opened up a script in rstudio that allows me to do some data sourcing here now i'm just going to use a package called jsonlite i'm going to load that one up and then i'm going to go to a couple of websites i'm going to be getting historical data from formula 1 car races and i'm going to be getting it from ergast.com now if we go to this page right here i can just go straight to my browser right now and this is what it looks like it gives you the api documentation so what you're doing for an api is you're just entering a web address and in that web address it includes the information that you want i'll go back to r here for a second and if i want to get information about 1957 races in json format i go to this address and i can skip over to that for a second and what you see is it's kind of a big long mess here but it is all labeled and it's clear to the computer what's going
on here i'll go back to r and so what i'm going to do is i'm going to save that url into an object here in r and then i'm going to use the command fromJSON to read that url and save it into r which it has now done and i'm going to zoom in on that so you can see what's happened i've got this sort of mess of text this is actually a list object in r and then i'm going to get just the structure of that object so i'm going to do this one right here and you can see that it's a list and it gives you the names of all the variables within each one of the lists and what i'm going to do is i'm going to convert that list to a data frame i went through the list and found exactly where the information i wanted was located so you have to use this big long statement here that'll give me the names of the drivers let me zoom in on that again there they are and then i'm going to get just the column names for that bit of the data frame and so what i have here is six different variables and then what i'm going to do is i'm going to pick just the first five cases and i'm going to select some variables and put them in a different order and when i do that this is what i get i'll zoom in on that again and the first five people listed in this data set that i pulled in from 1957 are juan fangio which makes sense one of the greatest drivers ever and other people who competed in that year and so what i've done is by using this api call in r a very simple thing to do i was able to pull data off that web page in a structured format and do a very simple analysis with it and let's sum up what we've learned from all this first off apis make it really easy to work with web data they structure the call for you and then they feed the data straight into the programs for you to analyze and they're one of the best ways of getting data and getting started in data science when you're looking for data another great way of getting data is through
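For comparison with the R demo, the same kind of API call can be sketched in Python. The commented-out URL follows the Ergast documentation pattern shown above; the response used here is a trimmed-down, made-up sample shaped like the real one, so the sketch runs without a network connection:

```python
import json
# for a live call, the same pattern as the r demo would be:
# from urllib.request import urlopen
# data = json.load(urlopen("http://ergast.com/api/f1/1957.json"))

# a trimmed-down, made-up sample shaped like the race response
sample = json.loads("""
{"MRData": {"RaceTable": {"season": "1957",
  "Races": [{"raceName": "Argentine Grand Prix", "round": "1"},
            {"raceName": "Monaco Grand Prix", "round": "2"}]}}}
""")

# drill down through the nested structure, like the big long statement in r
races = sample["MRData"]["RaceTable"]["Races"]
names = [r["raceName"] for r in races]
print(names)
```

The point is the same as in the R version: the API hands back labeled, nested data, and once you know where the piece you want lives you can pull it out with one line.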
scraping and what that means is pulling information from web pages i like to think of it as when data is hiding in the open it's there you can see it but there's not an easy immediate way to get that data now when you're dealing with scraping you can get data in several different formats you can get html text from web pages you can get html tables the rows and columns that appear on web pages you can scrape data from pdfs and you can scrape data from all sorts of media like images and video and audio now we'll make one very important qualification before we say anything else pay attention to copyright and privacy just because something is on the web doesn't mean you're allowed to pull it out information gets copyrighted and so when i use examples here i make sure that this is stuff that's publicly available and you should do the same when you're doing your own analyses now if you want to scrape data there's a couple of ways to do it number one is to use apps that are developed for this so for instance import.io is one of my favorites it's both a web page that's its address and a downloadable app there's also scraperwiki there's an application called tabula and you can even do scraping in google sheets which i'll demonstrate in a second and excel or if you don't want to use an app or if you want to do something the apps don't really let you do you can code your scraper you can do it directly in r or python or bash or even java or php now what you're going to do is you're going to be looking for information on the web page if you're looking for html text what you're going to do is you're going to pull structured text from web pages similar to how a reader view works in a browser it uses html tags on the web page to identify what's the important information so that's things like body and h1 for header 1 and p for paragraph in the angle brackets you can also get information from html tables although this is a
physical table of rows and columns i'm showing you this also uses html table tags that's like table and tr for table row and td for table data that's a cell the trick is when you're doing this you need the table number and sometimes you just have to find that through trial and error let me give you an example of how this works let's take a look at this wikipedia page on the iron chef america competition i'm going to go to the web right now and show you that one so here we are on wikipedia iron chef america and if you scroll down a little bit see we got a whole bunch of text here and we've got our table of contents and then we come down here we have a table that lists the winners the statistics for the winners and let's say we want to pull that from this webpage into another program for us to analyze well there's an extremely easy way to do this with google sheets all we need to do is open up a google sheet and in cell a1 of that google sheet we paste in this formula it's importhtml then you give the web page then you say that you're importing a table you have to put that stuff in quotes and the index number for the table i had to poke around a little bit to figure out that this one was table number two so let me go to google sheets and show you how this works here i have a google sheet and right now it's got nothing in it but watch this if i come here to this cell and i simply paste in that information all this stuff just sort of magically propagates into the sheet makes it extremely easy to deal with and now i can for instance save this as a csv file put it in another program lots of options and so this is one way that i'm scraping the data from a web page because i didn't use an api but i just used a very simple one-line command in google sheets to get the information now that was an html table you can also scrape data from pdfs you have to be aware of whether it's a native pdf i call that a text pdf or a
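If you would rather code the table scraper yourself than use Google Sheets, the same idea of walking the table, tr, and td tags can be sketched with nothing but Python's standard library. The little table here is invented for illustration:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """collects the text of every <td> cell, row by row"""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":        # tr starts a new table row
            self.row = []
        elif tag == "td":      # td is one cell of table data
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self.row:
            self.rows.append(self.row)
        elif tag == "td":
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell:       # only keep text that sits inside a cell
            self.row.append(data.strip())

# a made-up table standing in for a scraped web page
html = ("<table><tr><td>chef</td><td>wins</td></tr>"
        "<tr><td>flay</td><td>23</td></tr></table>")
scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)
```

Real pages are messier, with th header cells, nested tables, and stray whitespace, which is exactly why the dedicated scraping apps mentioned above exist, but the underlying idea is this simple.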
scanned or image pdf and what it does with native pdfs is it looks for text elements again those are like code that indicates this is text and you can deal with raster images that's pixel images or vector graphics which draw the lines and that's what makes them infinitely scalable in many situations and then in pdfs you can deal with tabular data but you probably have to use a specialized program like scraperwiki or tabula in order to get that and then finally media like images and video and audio getting images is easy you can download them in a lot of different ways and then if you want to read data from them say for instance you have a heat map of a country you can go through it but you'll probably have to write a program that loops through the image pixel by pixel to read the data and then encode it numerically into your statistical program now that's my very brief summary and let's summarize that first off if the data you're trying to get at doesn't have an existing api you can try scraping and you can use specialized apps for scraping or you can write code in a language like r or python but no matter what you do be sensitive to issues of copyright and privacy so you don't get yourself in hot water but instead you make an analysis that can be of great use to you or to your client the next step in data sourcing is making data and specifically we're talking about getting new data i like to think of this as getting hands-on and getting data de novo new data so if you can't find the data that you need for your analysis well one simple solution is do it yourself and we're going to talk about a few general strategies used for doing that now these strategies vary in a few dimensions first off is the role are you passive and simply observing stuff that's happening already or are you active where you play a role in creating the situation to get the data and then there's the q and q question and that is are you
going to get quantitative or numerical data or are you going to get qualitative data which usually means text paragraphs sentences as well as things like photos and videos and audio and also how are you going to get the data do you want to get it online or do you want to get it in person now there are other choices besides these but these are some of the big delineators of the different methods when you look at those you get a few possible options number one is interviews and i'll say more about those another one is surveys a third one is card sorting and the fourth one is experiments although i actually want to split experiments into two kinds of categories the first one is laboratory experiments and that's in-person projects where you shape the information or an experience for the participants as a way of seeing how that involvement changes their reactions it doesn't necessarily mean that you're a participant but you create the situation and then there's also a b testing this is automated online testing of two or more variations on a web page it's a very very simple kind of experimentation and it's actually very useful for optimizing websites so in sum from this very short introduction make sure you can get exactly what you need get the data you need to answer your question and if you can't find it somewhere then make it and as always you have many possible methods each of which has its own strengths and its own compromises and we'll talk about each of those in the following sections the first method of data sourcing where you're making new data that i want to talk about is interviews and that's not because it's the most common but because it's the one you would do for the most basic problem now basically an interview is nothing more than a conversation with another person or a group of people and the fundamental question is why do interviews as opposed to doing a survey or something else well there's a few good
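The A/B testing idea mentioned above usually comes down to comparing conversion rates on two page variations. Here is a minimal sketch with made-up counts, using a standard two-proportion z test:

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic and two-sided p value comparing two conversion rates"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # pooled rate under the assumption that the variations don't differ
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p value from the normal distribution
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# made-up counts: variation a converts 90 of 1000 visitors, b 120 of 1000
z, p = two_proportion_z(90, 1000, 120, 1000)
print(round(z, 2), round(p, 4))
```

A small p value suggests the difference between the two variations is unlikely to be chance, which is the signal an automated A/B testing system looks for before declaring a winner.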
reasons to do that number one you're working with a new topic and you don't know what people's responses will be how they'll react and so you need something very open-ended number two you're working with a new audience you don't know how they will react in particular to what it is you're trying to do and number three something's going on with the current situation it's not working anymore and you need to find out what's going on and you need to find ways to improve and open-ended information that gets past your existing categories and boundaries can be one of the most useful methods for getting that data if you want to put it another way you want to do interviews when you don't want to constrain responses now when it comes to interviews you have one very basic choice and that's whether you do a structured interview and with a structured interview you have a predetermined set of questions and everyone gets the same questions in the same order it gives a lot of consistency even though the responses are open-ended and then you can also have what's called an unstructured interview and this is a whole lot more like a conversation where you as the interviewer and the person you're talking to your questions arise in response to their answers consequently an unstructured interview can be different for each person that you talk to also interviews are usually done in person but not surprisingly they can be done over the phone or often online now a couple of things to keep in mind about interviews number one is time interviews can range from just a few minutes to several hours per person second is training interviewing is a special skill that usually requires specific training now asking the questions is not necessarily the hard part the really tricky part is the analysis the hardest part of interviews by far is analyzing the answers for themes as a way of extracting the new categories and the dimensions that you need for
your further research  the beautiful thing about interviews is  they allow you to learn things that you  never expected  so in sum  interviews are best for new situations  or new audiences  on the other hand they can be time  consuming and they also require special  training both to conduct the interview  but even more to analyze the highly  qualitative data that you get from them  an interesting topic in data sourcing  when you’re making data is card sorting  now this isn’t something that comes up  very often in academic research but in  web research this can be a really  important method  think of it as what you’re trying to do  is like building a model of a molecule  here you’re trying to build a mental  model or a model of people’s mental  structures  but more specifically how do people  organize information intuitively and  also how does that relate to the things  that you’re doing online  now the basic procedure goes like this  you take a bunch of little topics and  you write each one on a separate card  and you can do this physically with like  three by five cards or there’s a lot of  programs that allow you to do a digital  version of it  then what you do is you give this  information to a group of respondents  and the people sort those cards so they  put  similar topics with each other different  topics over here and so on  and then you take that information and  from that you’re able to calculate  what’s called dissimilarity data think  of it as like the distance or the  difference between various topics  and that gives you the raw data to  analyze how things are structured  now there are two very general kinds of  card sorting tasks  they are generative and there’s  evaluative  a generative card sorting task is one in  which respondents create their own sets  their own piles of cards using any  number of groupings they like and this  might be used for instance to design a  website  if people are going to be looking for  one kind of information next to 
another one then you want to put that together on the website so they know where to expect it on the other hand if you’ve already created a website then you can do an evaluative card sorting this is where you have a fixed number or fixed names of categories like for instance the way you’ve set up your menus already and then what you do is you see if people naturally put the cards into these various categories that you’ve created that’s a way of verifying that your hierarchical structure makes sense to people now whichever method you do generative or evaluative what you end up with when you do a card sort is an interesting kind of visualization it’s called a dendrogram that actually means branches and what we have here is actually 150 data points if you’re familiar with fisher’s iris data that’s what’s going on here and it groups it from one giant group on the left and then splits it into pieces and pieces and pieces until you end up with individual-level observations at the end but you can cut things off into two or three groups or wherever it’s most useful for you here as a way of visualizing the entire collection of similarity or dissimilarity between the individual pieces of information that you had people sort now i’ll just mention very quickly if you want to do digital card sorting which makes your life infinitely easier because keeping track of physical cards is really hard you can use something like optimal workshop or userzoom or ux suite these are some of the most common choices now let’s just sum up what we’ve learned about card sorting in this extremely brief overview number one card sorting allows you to see the intuitive organization of information in a hierarchical format you can do it with physical cards or you also have digital choices for doing the same thing and when you’re done you actually get this hierarchical or branched visualization of how
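As an aside on the dissimilarity data mentioned above: once you have each respondent's piles, it can be computed directly. Here is a minimal Python sketch (the topic names and pile labels are invented for illustration, not from the course): for each pair of cards, the dissimilarity is the fraction of respondents who placed the two cards in different piles.

```python
from itertools import combinations

# Hypothetical card-sort results: each respondent's grouping of six topics,
# expressed as topic -> pile label.
sorts = [
    {"pricing": 0, "billing": 0, "refunds": 0, "login": 1, "password": 1, "profile": 1},
    {"pricing": 0, "billing": 0, "refunds": 1, "login": 1, "password": 1, "profile": 1},
    {"pricing": 0, "billing": 0, "refunds": 0, "login": 1, "password": 1, "profile": 2},
]

def dissimilarity(sorts):
    """Fraction of respondents who placed each pair of cards in DIFFERENT piles."""
    topics = sorted(sorts[0])
    dist = {}
    for a, b in combinations(topics, 2):
        apart = sum(1 for s in sorts if s[a] != s[b])
        dist[(a, b)] = apart / len(sorts)
    return dist

dist = dissimilarity(sorts)
print(dist[("billing", "pricing")])  # 0.0 -> always grouped together
print(dist[("login", "pricing")])    # 1.0 -> never grouped together
```

A matrix of these pairwise distances is exactly the raw material that hierarchical clustering uses to draw the dendrogram.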
the information is structured and related to each other when you’re doing your data sourcing you’re making data sometimes you can’t get what you want through the easy ways and you’ve got to take the hard way and you can do what i’m calling laboratory experiments now of course when i mention laboratory experiments people start to think of stuff like you know dr frankenstein in his lab but lab experiments are less like this and in fact they’re a little more like this nearly every experiment i have done in my career has been a paper and pencil one with people in a well-lighted room and it’s not been the threatening kind now the reason you do a lab experiment is because you want to determine cause and effect and this is the single most theoretically viable way of getting that information now what makes an experiment an experiment is the fact that researchers play active roles in experiments with manipulations now people get a little freaked out when they hear manipulations thinking that you’re coercing people and messing with their minds all that means is you are manipulating the situation you’re causing something to be different for one group of people or one situation than another it’s a benign thing but it allows you to see how people react to those different variations now if you’re going to do an experiment you’re going to want to have focused research it’s usually done to test one thing or one variation at a time and it’s usually hypothesis driven usually you don’t do an experiment until you’ve done enough background research to say i expect people to react this way to this situation and that way to the other a key component of all of this is that experiments almost always have random assignment so regardless of how you got your sample when they’re in your study you randomly assign them to one condition or another and what that does is it balances out the pre-existing differences between groups and that’s a
great way of taking care of confounds and artifacts the things that are unintentionally associated with differences between groups that provide alternate explanations for your data if you’ve done good random assignment and you have a large enough sample then those confounds and artifacts are basically minimized now some places where you’re likely to see laboratory experiments of this kind are for instance eye tracking and web design that’s where you have to bring people in front of a computer and you stick a thing there that sees where they’re looking that’s how we know for instance that people don’t really look at ads on the side of web pages another very common place is research in medicine and education and in my field psychology and in all of these what you find is that experimental research is considered the gold standard for reliable valid information about cause and effect on the other hand while it’s a wonderful thing to have it does come at a cost here’s how that works number one experimentation requires extensive specialized training it’s not a simple thing to pick up two experiments are often very time consuming and labor intensive i’ve known some that take hours per person and number three experiments can be very expensive so what that all means is you want to make sure that you’ve done enough background research and you need to have a situation where it’s sufficiently important to get really reliable cause and effect information to justify these costs of experimentation in sum laboratory experimentation is generally considered the best method for assessing causality that’s because it allows you to control for confounds through randomization on the other hand it can be difficult to do so be careful and thoughtful when considering whether you need to do an experiment and how to actually go about doing it there’s one final procedure i want to talk about in terms of data sourcing
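Before moving on, the random-assignment step described above is simple enough to sketch in a few lines of Python. The participant IDs and condition names here are invented for illustration:

```python
import random

def random_assign(participants, conditions, seed=None):
    """Shuffle participants, then deal them out to conditions in round-robin
    order so group sizes stay as equal as possible."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    groups = {c: [] for c in conditions}
    for i, person in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(person)
    return groups

# 100 hypothetical participants split between two conditions.
groups = random_assign(range(100), ["control", "treatment"], seed=42)
print(len(groups["control"]), len(groups["treatment"]))  # 50 50
```

Because the shuffle is random, any pre-existing difference between people is, on average, spread evenly across the two groups, which is exactly what balances out confounds.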
and making new data it’s a form of experimentation it’s simply called a/b testing and it’s extremely common in the web world so for instance i just barely grabbed a screenshot of amazon.com’s home page and you’ve got these various elements on the home page and i just noticed by the way when i did this that this woman is actually an animated gif so she moves around that was kind of weird never seen that before but the thing about this is this entire layout how things are organized and what’s on there will have been determined by variations on a/b testing by amazon here’s how it works for your web page you pick one element like what’s the headline or what are the colors or what’s the organization or how do you word something and you create multiple versions maybe just two a version a and a version b which is why it’s called a/b testing and then when people visit your web page you randomly assign those visitors to one version or another you have software that does that for you automatically and then you compare the response rates on some response i’ll show you those in a second and then once you have enough data you implement the best version you sort of set that one solid and then you go on to something else now in terms of response rates there’s a lot of different outcomes you can look at you can look at how long a person’s on a page you can actually do mouse tracking if you want to you can look at click-throughs you can also look at shopping cart value or abandonment a lot of possible outcomes all of these contribute through a/b testing to the general concept of website optimization to make your website as effective as it can possibly be now the idea also is that this is something you’re going to do a lot you can perform a/b tests continually in fact i’ve seen one person say that what a/b testing really stands for is always be testing kind of cute but it does give you the idea that improvement is a constant
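Comparing response rates between version A and version B typically comes down to a two-proportion z-test, which A/B testing software runs for you behind the scenes. As a rough illustration of what happens under the hood (the conversion counts are invented, and this uses the normal approximation rather than any particular vendor's method):

```python
import math

def two_proportion_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))          # two-sided p, normal approx
    return z, p_value

# Hypothetical data: version A converts 120 of 1000 visitors, version B 165 of 1000.
z, p = two_proportion_test(120, 1000, 165, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Note how the sample sizes sit inside the standard-error term: with fewer visitors the standard error grows, which is why cutting a test off too early gives you a less reliable answer.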
process now if you want some software to do a/b testing two of the most common choices are optimizely and vwo which stands for visual website optimizer now many others are available but these are especially common and when you get the data you’re going to use statistical hypothesis testing to compare the differences or really the software does it for you automatically but you may want to adjust the parameters because most software packages cut off testing a little too soon and the information is not quite as reliable as it should be but in sum here’s what we can say about a/b testing it’s a version of website experimentation it’s done online which makes it really easy to get a lot of data very quickly it allows you to optimize the design of your website for whatever outcome is important to you and it can be done as a series of continual assessments testing and development to make sure that you’re accomplishing what you want to as effectively as possible for as many people as possible the next logical step in data sourcing and making data is surveys now think of this if you want to know something just ask that’s the easy way and a survey works well under certain situations the real question is do you know your topic and your audience well enough to anticipate their answers to know the range of their answers and the dimensions and the categories that are going to be important if you do then a survey might be a good approach now just as there were a few dimensions for interviews there’s a few dimensions for surveys you can do what’s called a closed-ended survey that’s also called a forced choice it’s where you give people just particular options like a multiple choice you can have an open-ended survey where you have the same questions for everybody but you allow them to write in a free-form response you can do surveys in person and you can also do them online or over the mail or phone or however and now
it’s very common to use software when doing surveys some really common applications for online surveys are surveymonkey and qualtrics or at the very simple end there’s google forms and at the simple and pretty end there’s typeform there’s a lot more choices but these are some of the major players in how you can get data from online participants in survey format now the nice thing about surveys is you know they’re really easy to do they’re very easy to set up and they’re really easy to send out to large groups of people you can get tons of data really fast on the other hand the same way that they’re easy to do they’re also really easy to do badly the problem is that the questions you ask can be ambiguous they can be double-barreled they can be loaded and the response scales can be confusing so if you say i never think this particular way and a person would strongly disagree they may not know exactly what you’re trying to get at so you have to take special effort to make sure that the meaning is clear and unambiguous and that the rating scale the way that people respond is very clear and they know where their answer falls which gets us into one of the things about people behaving badly and that is beware the push poll now especially during election time like we’re in right now a push poll is something that sounds like a survey but really what it is is a very biased attempt to get data just fodder for social media campaigns or i’m going to make a chart that says 98 percent of people agree with me the questions are so biased there’s really only one way to answer them this is considered extremely irresponsible and unethical from a research point of view just hang up on them now aside from that egregious violation of research ethics you do need to do other things like watch out for bias in the question wording in the response options and also in the sample selection because any one of those can push your
responses off one way or another without you really being aware that it’s happening so in sum let’s say this about surveys you can get lots of data quickly on the other hand it requires familiarity with the possible answers and with your audience so you know sort of what to expect and no matter what you do you need to watch for bias to make sure that your answers are going to be representative of the group that you’re really concerned about understanding the very last thing i want to talk about in terms of data sourcing is to talk about the next steps and probably the most important thing is you know don’t just sit there i want you to go and see what you already have try to explore some open data sources and if it helps check with a few data vendors and if those don’t give you what you need to do your project then consider making new data again the idea here is get what you need and get going thanks for joining me and good luck on your own projects welcome to coding in data science i’m bart poulson and what we’re going to do in this series of videos is we’re going to take a little look at the tools of data science so i’m inviting you to know your tools but probably even more important than that is to know their proper place now i mention that because a lot of the time when people talk about data tools they talk about it as though that were the same thing as data science as though they were the same set but i think if you look at it for just a second that’s not really the case data tools are simply one element of data science because data science is made up of a lot more than the tools that you use it includes things like business knowledge it includes meaning-making and interpretation it includes social factors and so there’s much more than just the tools involved that being said you will need at least a few tools and so we’re going to talk about some of the things that you can use in data science if it works well
for you in terms of getting started the basic things number one is spreadsheets it’s the universal data tool and i’ll talk about how they play an important role in data science number two is a visualization program called tableau there’s tableau public which is free and there’s tableau desktop and there’s also something called tableau server but tableau is a fabulous program for data visualization and i’m convinced for most people provides the great majority of what they need and though while it’s not a tool i do need to talk about the formats used in web data because you have to be able to navigate that when doing a lot of data science work then we can talk about some of the essential tools for data science those include the programming language r which is specifically for data there’s the general-purpose programming language python which has been well adapted to data and there’s the database language sql or sequel which stands for structured query language then if you want to go beyond that there are some other things that you can do there are the general-purpose programming languages c c++ and java which are very frequently used to form the foundation of data science and sort of high-level production code is going to rely on those as well there’s the command-line interface language bash which is very common as a very quick tool for manipulating data and then there’s the sort of wild-card supercharged regular expressions or regex we’ll talk about all of these in separate courses but as you consider all the tools that you can use don’t forget the 80/20 rule also known as the pareto principle and the idea here is that you’re going to get a lot of bang for your buck out of a small number of things and i’m going to show you a little sample graph here imagine that you have 10 different tools and we’ll call them a through j a does a lot for you b does a little bit less and it kind of tapers down to you’ve got a
bunch of tools that do just a little bit of stuff that you need now instead of looking at the individual effectiveness look at the cumulative effectiveness how much are you able to accomplish with the combination of tools well the first one’s right here at 60 percent where tool a started then you add on the 20 from b and it goes up and then you add on c and d and you add up smaller and smaller pieces and by the time you get to the end you’ve got 100 percent of effectiveness from your 10 tools combined the important thing about this is you only have to go to the second tool that’s 2 out of 10 so that’s b that’s 20 percent of your tools and in this made-up example you’ve got 80 percent of your output so 80 percent of the output from 20 percent of the tools that’s a fictional example of the pareto principle but i find in real life it tends to work something approximately like that and so you don’t necessarily have to learn everything and you don’t have to learn how to do everything in everything instead you want to focus on the tools that will be most productive and specifically most productive for you so in sum let’s say these three things number one coding or simply the ability to manipulate data with programs and computers is important number two data science is much greater than the collection of tools that’s used in it and then finally as you’re trying to decide what tools to use and what you need to learn and how to work remember the 80/20 rule you’re going to get a lot of bang from a small set of tools so focus on the things that are going to be most useful for you in conducting your own data science projects as we begin our discussion of coding and data science i actually want to begin with something that’s not coding i want to talk about applications or programs that are already created that allow you to manipulate data and we’re going to begin with the most basic of these spreadsheets we’re going to do the rows and columns
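Stepping back to the 80/20 example for a moment, the cumulative-effectiveness idea is easy to make concrete in code. The first two contribution numbers (60 and 20) come from the example in the text; the remaining small values are invented padding so the ten tools sum to 100:

```python
def tools_needed(contributions, target=80):
    """Smallest number of tools (taken in order of payoff) whose combined
    contribution reaches the target percentage."""
    total = 0
    for count, c in enumerate(sorted(contributions, reverse=True), start=1):
        total += c
        if total >= target:
            return count, total
    return len(contributions), total

# Tool A gives 60%, B gives 20%, and the rest taper off.
contrib = [60, 20, 7, 5, 3, 2, 1, 1, 0.5, 0.5]
n, covered = tools_needed(contrib)
print(n, covered)  # 2 tools already cover 80 percent
```

So in this made-up example two of the ten tools, 20 percent of them, deliver 80 percent of the output, which is the Pareto principle in miniature.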
and cells of excel and the reason for this is you need spreadsheets now you may be saying to yourself no no no not me because you know what i’m fancy i’m working in my big set of servers i’ve got fancy things going on but you know what you too fancy people you need spreadsheets as well there’s a few reasons for this most importantly spreadsheets can be the right tool for data science in a lot of circumstances there are a few reasons for that number one spreadsheets they’re everywhere they’re ubiquitous they’re installed on a billion machines around the world and everybody uses them there are probably more data sets in spreadsheets than anything else and so it’s a very common format importantly it’s probably your client’s format a lot of your clients are going to be using spreadsheets for their own data i’ve worked with billion-dollar companies that keep all of their data in spreadsheets and so when you’re working with them you need to know how to manipulate that and how to work with it also regardless of what you’re doing spreadsheets or specifically csv comma-separated value files are sort of the lingua franca the universal interchange format for data transfer to allow you to take it from one program to another and then truthfully in a lot of situations they’re really easy to use and if you want a second opinion on this let’s take a look at this ranking there’s a survey of data mining experts it’s the kdnuggets data mining poll and these are the tools they most use in their own work and look at this lowly excel is fifth on the list and in fact what’s interesting about it it’s above hadoop and spark two of the major big data fancy tools and so excel really does have a place of pride in a toolkit for a data analyst now since we’re going to go into sort of the low-tech end of things let’s talk about some of the things that you can do with a spreadsheet number one they’re really good for data browsing you actually
get to see all the data in front of you which isn’t true if you’re doing something like r or python they’re really good for sorting data sort by this column then this column then this column they’re really good for rearranging columns and cells and moving things around they’re good for finding and replacing and seeing what happens so you know that it worked right some more uses they’re really good for formatting especially conditional formatting they’re good for transposing data switching the rows and the columns they make that really easy they’re good for tracking changes now it’s true if you’re a big fancy data scientist you’re probably using github but for everybody else in the world spreadsheets and their tracked changes are a wonderful way to do it you can make pivot tables that allow you to explore the data in a very hands-on way in a very intuitive way and they’re also really good for arranging the output for consumption now when you’re working with spreadsheets however there’s one thing you need to be aware of they’re really flexible but that flexibility can be a problem in that when you’re working in data science you specifically want to be concerned about something called tidy data that’s a term i borrowed from hadley wickham a very well known developer in the r world tidy data is for transferring data and making it work well there’s a few rules here that undo some of the flexibility inherent in spreadsheets number one what you want to do is have a column be equivalent to the same thing as a variable columns and variables are the same thing and then rows are exactly the same thing as cases and then you have one sheet per file and you have one level of measurement say individual then organization then state per file again this is undoing some of the flexibility that’s inherent in spreadsheets but it makes it really easy to move the data from one program to another let me show you how all this
works you can try this in excel if you’ve downloaded the files for this course we simply want to open up this spreadsheet let me go to excel and show you how it works so when you open up this spreadsheet what you get is totally fictional data here that i made up but it’s showing sales over time of several products at two locations like if you’re selling stuff at a baseball field and this is the way spreadsheets often appear we’ve got blank rows and columns we’ve got stuff arranged in a way that makes it easy for a person to process it and we’ve got totals here with formulas putting them all together and that’s fine that works well for the person who made it and then that’s for one month and then we have another month right here and then we have another month right here and then we combine them all for the first quarter of 2014. we’ve got some headers here we’ve got some conditional formatting and changes and if we come to the bottom we’ve got a very busy line graphic that eventually loads it’s not a good graphic by the way but similar to what you will often find so this is the stuff that while it may be useful for the client’s own personal use you know you can’t feed this into r or python it’ll just choke and it won’t know what to do with it and so you need to go through a process of tidying up the data and what this involves is undoing some of this stuff so for instance here’s data that is almost tidy here we have a single column for the date a single column for the day a column for the site so we have two locations a and b and then we have six columns for the six different things that are sold and how many were sold on each day now in certain situations you would want the data laid out exactly like this if you’re doing for instance a time series you’ll do something vaguely similar to this but for true tidy data we’re going to collapse it even further let me come here to the tidy data and now
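This collapse of the separate item columns into one long table is what R and pandas call melting, or reshaping from wide to long. A pure-Python sketch of the idea, with invented column names standing in for the ballpark data:

```python
def melt(wide_rows, id_vars, var_name="item", value_name="sales"):
    """Reshape wide rows (one column per item) into tidy long rows
    (one row per date x site x item observation)."""
    long_rows = []
    for row in wide_rows:
        for key, value in row.items():
            if key in id_vars:
                continue
            record = {v: row[v] for v in id_vars}
            record[var_name] = key
            record[value_name] = value
            long_rows.append(record)
    return long_rows

# Hypothetical slice of the wide sales sheet: two id columns, two item columns.
wide = [
    {"date": "2014-01-01", "site": "A", "Hotdogs": 10, "Soda": 25},
    {"date": "2014-01-01", "site": "B", "Hotdogs": 7, "Soda": 30},
]
tidy = melt(wide, id_vars=["date", "site"])
print(len(tidy))  # 2 wide rows x 2 items = 4 tidy rows
print(tidy[0])
```

Every tidy row is one observation with its variables as columns, which is why the result imports cleanly into any analysis program.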
what i’ve done is i’ve created a new column that says what is the item being sold and so by the way what this means is that we’ve got a really long data set now it’s got over a thousand rows come back up to the top here but what that shows you is that now it’s in a format that’s really easy to import from one program to another that makes it tidy and you can re-manipulate it however you want once you get to each of those programs so let’s sum up our little presentation here in a few lines number one no matter who you are no matter what you’re doing in data science you need spreadsheets and the reason for that is that spreadsheets are often the right tool for data science keep one thing in mind though and that is as you’re moving back and forth from one language to another tidy data or well-formatted data is going to be important for exporting data into your analytical program or language of choice as we move through coding and data science and specifically the applications that can be used there’s one that stands out for me more than almost anything else and that’s tableau and tableau public now if you’re not familiar with these these are visualization programs the idea here is that when you have data the most important thing you can do is to first look and see what you have and work with it from there and in fact i’m convinced that for many organizations tableau might be all that they really need it will give them the level of insight that they need to work constructively with data so let’s take a quick look by going to tableau.com now there are a few different versions of tableau right here we have tableau desktop and tableau server and these are the paid versions of tableau and they actually cost a lot of money unless you work for a non-profit organization in which case you can get them for free which is a beautiful thing what we’re usually looking for however is not this paid version but we’re looking for
something called tableau public and if you come in here and go to products we’ve got these three paid ones come over here to tableau public when we click on that it brings us to this page it’s public.tableau.com and this is the one that has what we want it’s a free version of tableau with one major caveat you don’t save files locally to your computer which is why i didn’t give you a file to open instead it saves them to the web in a public form so if you’re willing to trade privacy you can get an immensely powerful application for data visualization that’s a catch for a lot of people which is why people are willing to pay a lot of money for the desktop version and again if you work for a non-profit you can get the desktop version for free but i’m going to show you how things work in tableau public so that’s something that you can work with personally the first thing you want to do is you want to download it and so you put in your email address and you download it it’s going to know what you’re on it’s a pretty big download and once it’s downloaded you can install and open up the application and here i am in tableau public right here this is the blank version by the way you also need to create an account with tableau in order to save your stuff online and to see it we’ll show you what that looks like but you’re presented with a blank thing right here and the first thing you need to do is you need to bring in some data i’m going to bring in an excel file now if you’ve downloaded the files for the course you’ll see that there’s this one right here dso322 tableau public dot xlsx it’s an excel file and in fact it’s the one that i used in talking about spreadsheets in the first video in this course i’m going to select that one and i’m going to open it and a lot of programs don’t like bringing in excel because it’s got all the worksheets and all the weirdness in it this one works better with it but what i’m going to
do is i’m going to take the tidy data by the way you see they’re put in alphabetical order here and i’m going to take tidy data and i’m just going to drag it over to let it know that it’s the one that i want and now what it does is it shows me a version of the data set along with things that you can do here you can rename things you can create bins and groups there’s a lot of things that you can do here i’m going to do something very very quick with this particular one now i’ve got the data set right here what i’m going to do now is i’m going to go to a worksheet that’s where you actually create the visualizations i’ll cancel that and go to worksheet one okay this is a drag-and-drop interface and so what we’re going to do is we’re going to pull the bits and pieces of information we want to make graphics there’s immense flexibility here i’m going to show you two very basic ones i’m going to look at the sales of my fictional ballpark items so i’m going to grab sales right here and i’m going to put that as the field that we’re going to measure okay and you see it put it down right here and this is our total sales we’re going to break it down by item and by time so let me take item right here and you can drag it over here or i can put it right up here into rows those will be my rows and that’s how many we’ve sold total of each of the items fine that’s really easy and then let’s take date and we’ll put that here in columns to spread it across now by default it’s doing it by year i don’t want to do that i only have three months of data and so what i can do is i can click right here and i can choose a different time frame i can go to quarter but that’s not going to help because i only have one quarter’s worth of data that’s three months i’m going to come down to week actually let me go to day if i do day you see it gets enormously complicated so that’s no good so i’m going to back up to week and i’ve got a lot of numbers
there but what i want is a graph and so to get that i’m going to come over here and click on this and tell it that i want a graph and so we’re seeing the information except it lost items so i’m going to bring item back and i’m going to put it back up into this graph to say this is a row for the data and now i’ve got rows for sales by week for each of my items that’s great i want to break it down one more time by putting in the site the place that it sold so i’m going to grab that and i’m going to put it right over here and now you see i’ve got it broken down by the item that is sold and the different sites and i’m going to color the sites and all i’ve got to do is grab site and drag it onto color now i’ve got two different colors for my sites and this makes it a lot easier to tell what’s going on and in fact there’s some other cool stuff you can do one of the things i’m going to do is i can come over here to analytics and i can tell it for instance to put an average line through everything so i’ll just drag this over here and say now we have the average for each line that’s good and i can even do forecasting let me get a little bit of a forecast right here i’ll drag this on and if you go over here i can get this out of the way for a second now i have a forecast for the next few weeks and that’s a really convenient quick and easy thing and again for some organizations that might be all that they really need and so what i’m showing you here is the absolute basic operation of tableau which allows you to do an incredible range of visualizations and manipulate the data and create interactive dashboards there’s so much to it and we’ll show that in another course but for right now i want to show you one last thing about tableau public and that is saving the files so now when i come here and save it it’s going to ask me to sign in to tableau public now i sign in and it asks me how i want to save this same
name as the video  there we go and i’m going to hit save  and then that opens up a web browser and  since i’m already logged into my account  see here’s my account my profile  here’s the page that i created  and it’s got everything i need there  i’m going to edit just a few details  i’m going to say for instance i’m going  to leave its name like that i could put  more of a description in there if i  wanted  i can  allow people to download the workbook  and its data i’m going to leave that  there so you can download it if you need  to if i had more than one tab i would do  this thing that says show the different  sheets as tabs  hit save  and there’s my data set and also it’s  published online and people can now find  it  and so what you have here is an  incredible tool for creating interactive  visualizations you can create them with  drop-down menus and you can rearrange  things and you can make an entire  dashboard it’s a fabulous way of  presenting information and as i said  before i think that for some  organizations this may be  as much as they need to get really good  useful information out of their data and  so i strongly recommend that you take  some time to explore with tableau either  the paid desktop version or the public  version and see what you can do to get  some really compelling and insightful  visualizations out of your work and data  science  for many people their first experience  of coding in data science is with the  application spss  now  i think of spss and the first thing that  comes to my mind is sort of life in the  ivory tower though this looks more like  you know harry potter  but if you think about it  the package name spss comes from  statistical package for the social  sciences although if you ask ibm about  it now they’ll act like it doesn’t stand  for anything  but it has its background in social  science research which is generally  academic and truthfully i’m a social  psychologist and that’s where i first  learned how to use 
spss but let's take a quick look at their webpage spss if you type that in that'll just be an alias that'll take you to ibm's main webpage now ibm didn't create spss but they bought it around version 16 and it was very briefly known as pasw predictive analytics software that only lasted briefly and now it's back to spss which is where it's been for a long time spss is a desktop program it's pretty big it does a lot of things it's very powerful it's used in a lot of academic research it's also used in a lot of business consulting management and even some medical research and the thing about spss is it looks like a spreadsheet but has drop down menus to make your life a little bit easier compared to some of the programming languages that you can use now you can get a free temporary version if you're a student you can get a cheap version otherwise spss costs a lot of money but if you have it one way or the other when you open it up this is what it's going to look like i'm showing spss version 22.
now it's currently on 24 and the thing about spss versioning is in any other software package these would be point updates so i sort of feel like we should be on 17.3 as opposed to 22 or 24 because the variations are so small that anything you learn from the earlier ones is going to work on the later ones and there's a lot of backwards and forwards compatibility so i'd almost say that the version you have practically doesn't matter you get this little welcome splash screen and if you don't want to see it anymore you can get rid of it i'm just going to hit cancel here and this is our main interface looks a lot like a spreadsheet the difference is you have a separate pane for looking at variable information and then you have separate windows for output and then an optional one for something called syntax but let me show you how this works by first opening up a data set spss has a lot of sample data sets in it but they're not easy to get to and they're really well hidden on my mac for instance let me go to where they are in my mac i go to the finder i have to go mac to applications to the folder ibm to spss to statistics to 22 the version number to samples then i have to say i want the ones that are in english and then it brings them up the dot sav files are the actual data files there are different kinds in here so dot sps is a different kind of file and then we have a different one about planning analyses so there are versions of it i'm going to open up a file here called marketvalues.sav it's a small dataset in spss format and if you don't have that you can open up something else it really doesn't matter for now by the way in case you haven't noticed spss tends to be really really slow when it opens it also despite being at version 24 tends to be kind of buggy and crashes and so when you work with spss you want to get in the habit of saving your work constantly and also being patient when it's
time to open the program so here's a data set that just shows addresses and house values and square feet i don't even know if this is real information it looks artificial to me but spss lets you do point and click analyses which is unusual for a lot of things so i'm going to come up here and i'm going to say for instance make a graph i'm actually going to use what's called a legacy dialog to get a histogram of house prices so i simply click values put that right there and i'll put a normal curve on top of it and hit ok and then it's going to open up a new window and it opened up a microscopic version of it here so i'm going to make that bigger this is the output window and so this is a separate window and it has a navigation pane here on the side it tells me where the data came from and it saves the command here and then you know there's my default histogram and so we see most of the houses were right around 125 000 and then they went up to at least 400 000.
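by the way the same kind of descriptive summary and histogram that spss is producing here can be sketched in a few lines of code too — here's a minimal python sketch using only the standard library, with made-up house prices standing in for the marketvalues data (the numbers below are illustrative, not the actual data set):

```python
import statistics

# illustrative house prices (not the actual marketvalues.sav data)
values = [118000, 125000, 132000, 127000, 254000,
          310000, 298000, 405000, 122000, 260000]

mean = statistics.mean(values)      # arithmetic mean
sd = statistics.stdev(values)       # sample standard deviation
median = statistics.median(values)  # middle value

# a crude text histogram: bin width of 100 000
bins = {}
for v in values:
    b = (v // 100_000) * 100_000    # floor each price to its bin
    bins[b] = bins.get(b, 0) + 1

for b in sorted(bins):
    print(f"{b:>7}: {'#' * bins[b]}")
print(f"n={len(values)} mean={mean:.0f} sd={sd:.0f} median={median:.0f}")
```

the point is only that what the drop down menus are doing for you is a handful of very ordinary computations underneath.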
i have a mean of 256 000 the standard deviation of about 80 000 and there's 94 houses in the data set fine that's great the other thing i can do is if i want to do some analyses let me go back to the data just for a moment for instance i can come here to analyze and i can do descriptives i'm going to do one here called explore and i'll take the purchase price and i'll put it right here and i'm going to get a whole bunch of stuff just by default i'm going to hit ok and it goes back to the output window once again made it tiny and so now you see beneath my chart i now have a table and i've got a bunch of information a stem and leaf plot and i've got a box plot too great way of checking for outliers and so this is a really convenient way to save things you can export this information as images you can export the entire file as an html you can do it as a pdf or a powerpoint there's a lot of options here and you can customize everything that's on here now i just want to show you one more thing that makes your life so much easier in spss you see right here that it's putting down these commands it's actually saying graph and then histogram normal equals value and then down here we've got this little command right here most people don't know how to save their work in spss and they kind of just have to do it over again every time but there's a very simple way to do this what i'm going to do is i'm going to open up something called a syntax file i'm going to go to new syntax and this is just a blank window that's a programming window it's for saving code and let me go back to my analysis i did a moment ago i'll go back to analyze i can still get it right here and descriptives and explore and my information is still there and what happens here is even though i set it up with drop down menus and point and click if i do this thing paste then what it does is it takes the code that
creates that command and it saves it to the syntax window and this is just a text file it saves it as dot sps but it's a text file that can be opened in anything and what's beautiful about this is it's really easy to copy and paste and you can even take this into like word and do find and replace on it and it's really easy to replicate the analyses and so for me spss is a good program but until you use syntax you don't know the true power of it and it makes your life so much easier as a way of operating it anyhow this is my extremely brief introduction to spss all i want to say is that it's a very common program kind of looks like a spreadsheet but it gives you a lot more power and options and you can use both drop down menus and text based syntax commands as well to automate your work and make it easier to replicate it in the future i want to take a look at one more application for coding and data science that's called jasp this is a new application not very familiar to a lot of people and still in beta but with amazing promise you can basically think of it as a free version of spss and you know what we love free but jasp is not just free it's also open source and it's intuitive and it makes analyses replicable and it even includes bayesian approaches and so take that all together you know we're pretty happy and we're jumping for joy so before we move on you just may be asking yourself you know jasp what is that well the creators emphatically deny that it stands for just another statistics program but be that as it may we'll just go ahead and call it jasp and use it very happily you can get to it by going to jasp-stats.org and let's take a look at that right now jasp is a new program they say a low-fat alternative to spss but it is a really wonderful great way of doing statistics you're going to want to download it specifying your platform it even comes in linux format which is beautiful and again it's
beta so stay posted things are updating regularly and if you're on mac you're going to need to use xquartz but that's an easy thing to install it makes a lot of things work better and it's a wonderful way to do analyses when you open up jasp it's going to look like this it's a pretty blank interface but it's really easy to get going with it so for instance you can come over here to file and you can even choose some example data sets so for instance here's one called big5 that's personality factors and you've got data here that's really easy to work with let me scroll this over here for a moment so there's our five variables and let's do some quick analyses with these say for instance we want to get descriptives we can pick a few variables now if you're familiar with spss the layout feels very much the same and the output looks a lot the same you know all i have to do is select what i want and it immediately pops up over here and then i can choose additional statistics i can get quartiles i can get the median and you can choose plots let's get some plots all i do is click on it and they show up and that's a really beautiful thing and you can modify these things a little bit so for instance i can take the plots and let's see if i can drag that down and if i make it small enough i can see the five plots well i went a little too far on that one anyhow you can do a lot of things here and i can hide this i can collapse that and i can go on and do other analyses now what's really neat though is when you navigate away from it so i just clicked in the blank area of the results pane we're back to the data here but if i click on one of these tables like this one right here it immediately brings up the commands that produced it and i can just modify it some more if i want say i want skewness and kurtosis boom they're in there it's an amazing thing and then i can come back out here i can click away from that and i can
come down to the plots expand those and if i click on that it brings up the commands that made them it's an amazingly easy and intuitive way to do things now there's another really nice thing about jasp and that is that you can share the information online really well through a program called osf.io that stands for the open science framework and that's its web address osf.io so let's take a quick look at what that's like here's the open science framework website and it's a wonderful service it's free and it's designed to support open transparent accessible accountable collaborative research and i really can't say enough nice things about it what's neat about this is once you sign up for osf you can create your own area and i've got one of my own i'll go to that right now so for instance here's the data lab page in open science framework and what i've done is i created a version of this jasp analysis and i've saved it here in fact let's open up my jasp analysis in jasp and then i'll show you what it looks like in osf so let's first go back to jasp and when we're here we can come over to file and click computer and i just saved this file to the desktop click on desktop and you should have been able to download this with all the other files ds0324 jasp i'm going to double click on that to open it and now it's going to open up a new window and you see i was working with the same data set but i did a lot more analyses i've got these graphs i have correlation scatter plots come down here i did a linear regression and we just click on that and you can see the commands that produced it as well as the options didn't do anything special for that but i did do some confidence intervals and specified that and it's really a great way to work with all this i'll click back in an empty area and you can see the commands go away and so i've got my output here in jasp when i saved it though i had the option of saving it to
osf in fact if you go to this web page osf.io/3t2jg you'll actually be able to go to a page where you can see and download the analysis that i conducted let's take a look this is that page there's the address i just barely gave you and what you see here is the same analysis that i conducted it's all right here so if you're collaborating with people or if you want to show things to people this is a wonderful way to do it everything's right there now this is a static image but up at the top people have the option of downloading the original file and working with it on their own so in case you can't tell i'm really enthusiastic about jasp and about its potential still in beta still growing rapidly i see it really as an open source free and collaborative replacement for spss and i think it's going to make data science work so much easier for so many people i strongly recommend you give jasp a close look let's finish up our discussion of coding and data science the applications part of it by just briefly looking at some other software choices and i'll have to admit it gets kind of overwhelming because there are just so many choices now this is in addition to the spreadsheets and tableau and spss and jasp that we've already talked about i mean there's so much more than that i'm going to give you a range of things that i'm aware of and i'm sure i've left out some important ones or things that other people like really well but these are some common choices and some less common but interesting ones number one in terms of things that i did not mention is sas sas is an extremely common analytical program very powerful used for a lot of things it's actually the first program that i learned and on the other hand it can be kind of hard to use and it can be expensive but there's a couple of interesting alternatives sas also has something called the sas university edition if you're a student this is free and it's slightly you
know reduced in what it does but the fact that it's free and also it runs in a virtual machine which makes it an enormous download but it's a good way to learn sas if it's something that you want to do sas also makes a program that i would really love were it not so extraordinarily expensive and that is called jmp and it's a visualization software think a little bit of tableau how we saw you work with it visually and this one you can drag things around it's a really wonderful program i personally find it prohibitively expensive another very common choice among working analysts is stata and some people use minitab now for mathematical people there's matlab and then of course there's mathematica itself but that's really more a language than a program on the other hand wolfram who makes mathematica is also the people who give us wolfram alpha most people don't think of this as a stats application because you can run it on your iphone but wolfram alpha is in fact incredibly capable and especially if you pay for the pro account you can do amazing things in this including analyses regression models visualizations and so it's worth taking a little closer look at that also because it actually provides a lot of the data that you need so wolfram alpha is an interesting one now several applications that are more specifically geared towards data mining so you don't want to do your regular you know little t-tests and stuff on these but there's rapidminer and there's knime and orange and those are all really nice to use because they are visual programming interfaces where you drag nodes onto a screen and you connect them with lines and you can see how things run through all three of them are free or have free versions and all three of them work in pretty similar manners there's also bigml which is for machine learning and this is unusual because it's browser-based it runs on their servers there's a free version although you can't
download a whole lot it doesn't cost a lot to use bigml and it actually is a very friendly very accessible program then in terms of programs you can actually install for free on your own computer there's one called sofa statistics that means statistics open for all kind of a cheesy title but it's a good program and then one with a webpage straight out of 1990 is past 3 this is paleontological software that on the other hand does do very general stuff it runs on many platforms and it's a really powerful thing and it's free but it is relatively unknown and then speaking of relatively unknown one that's near and dear to my heart is a web application called statcrunch it costs but it costs like six or twelve bucks a year it's really cheap and it's very good especially for basic statistics and for learning i used it in some of the classes that i was teaching and then if you're deeply wedded to excel and you just can't stand to leave that environment you can purchase add-ins like xlstat which give you a lot of statistical functions within the excel environment itself that's a lot of choices and the most important thing here is don't get overwhelmed there's a lot of choices but you don't even have to try all of them really the important question is what works best for you and the projects that you're working on there's a few things you might want to consider in that regard first off is functionality does it actually do what you want or does it even run on your machine you don't need everything that a program can do i mean think about all the stuff that excel can do people probably use maybe five percent of what is available then there's also ease of use some of these programs are a lot easier to use than the others and i personally find that the ones that are easy to use i like them and so you might say no i need to program because i need to do custom stuff but i'm willing to bet that 95 percent of what people do does not
require anything custom also the existence of a community when you're working you constantly come across problems you don't know how to solve and being able to simply get online and do a search for an answer and have enough of a community that there are people there who have put answers up and discuss these things those are wonderful some of these programs have very substantial communities some of them it's practically non-existent and you get to decide how important that is to you and then finally of course there's the issue of cost many of these programs i mentioned are free some of them are very cheap some of them run on sort of a freemium model and some of them are outrageously expensive so you don't buy them unless somebody else is paying for it so these are some of the things that you want to keep in mind when you're trying to look at various programs also let's mention this don't forget the 80 20 rule you're going to be able to do most of the stuff that you need to do with only a small number of tools one or two maybe three will probably be all that you ever need so you don't need to explore the range of every possible tool find something that does what you need find something you're comfortable with and really try to extract as much value as you can out of that so in sum in our discussion of available applications for coding and data science first remember applications are tools they don't drive you you use them and your goals are what drive the choice of your applications and the way that you do it and the single most important thing is to remember that what works well for somebody else may not work for you if you're not comfortable with it or it doesn't address your questions then it's more important to think about what works for you and the projects that you're working on as you make your own choices for tools for working in data science when you're coding in data science one of the most important things
you can do is be able to work with web data and if you work with web data you're going to be working with html now in case you're not familiar with it html is what makes the world wide web go round what it stands for is hypertext markup language and if you've never dealt with web pages before here's a little secret web pages are just text it's just a text document but it uses tags to define the structure of the document and a web browser knows what those tags are and it displays them in the right way so for instance some of the tags they look like this they're in angle brackets and you have angle brackets and then a beginning tag so body then you have the body the main part of your text and then you have angle brackets and a slash body to let the computer know that you're done with that part you also have p and slash p for paragraphs h1 is for header 1 and you put the text in between those td is for table data or the cell in the table and you mark it off that way if you want to see what it looks like just go to this document dso331 html dot text i'm going to go to that one right now now depending on what text editor you open this up in it may actually give you the web preview i've opened it up in textmate and so it actually is showing the text the way i typed it i typed this manually just typed it all in there and i have an html tag to say what the document is i have an empty header but that sort of needs to be there then i say what the body is and then i have some text li is for list items i have headers this is for a link to a web page then i have a small table and if you want to see what this looks like when it's actually displayed as a web page we'll just go up here to window and show web preview this is the same document but now it's in a browser and that's how you make a web page now i know this is very fundamental stuff but the reason this is important is because if you're going to be extracting
data from the web you have to understand how that information is encoded in the web and it's going to be in html most of the time for a regular web page now i will mention something that there's another thing called css and web pages use css to define the appearance of a document html is theoretically there to give the content and css gives the appearance and that stands for cascading style sheets i'm not going to worry about that right now because we're really interested in the content and now you have the key to being able to read web pages and pull data from web pages for your data science projects so in sum first the web runs on html that's what makes the web pages that are there html defines the page structure and the content that's in the page and you need to learn how to navigate the tags in the structure in order to get data from the web pages for your data science projects the next step in coding in data science when you're working with web data is to understand a little bit about xml i like to think of this as the part of web data that follows the imperative data define thyself xml stands for extensible markup language and what it is is semi-structured data what that means is that tags define data so the computer knows what a particular piece of information is but unlike html the tags are free to be defined any way you want and so you have this enormous flexibility in there but you're still able to specify so the computer can read it now there's a couple of places where you're going to see xml files number one is in web data html defines the structure of a web page but if they're feeding data into it then that will often come in the form of an xml file interestingly with microsoft office files if you have docx or xlsx the x part at the end stands for a version of xml that's used to create these documents if you use itunes the library information that has all of your artists and your genres and your ratings and
stuff that's all stored in an xml file and then finally data files that often go with particular programs can be saved as xml as a way of representing the structure of the data to the program and for xml tags use opening and closing angle brackets just like html did again the major difference is that you're free to define the tags however you want so for instance thinking about itunes you can define a tag that's genre and you have the angle brackets and genre to begin that information and then you have the angle brackets with the slash to let it know you're done with that piece of information or you can do it for composer or you can do it for rating or you can do it for comments and you can create any tags you want and you put the information in between those two things now let's take an example of how this works i'm going to show you a quick data set that comes from the web it's at ergast.com/api this is a website that stores information about formula one automobile racing let's go to this webpage and take a quick look at what it's like so here we are at ergast.com and it's the api for formula one and what i'm bringing up is the results of the 1957 season in formula one racing and here you can see who the competitors were in each race and how they finished and so on so this is a data set that's being displayed in a web page if you want to see what it looks like in xml all you have to do is type xml onto the end of this dot xml i've done that already so i'm just going to go to that one you see it's only this little bit that i've added dot xml now it looks exactly the same because the web page is structuring xml data by default but if you want to see what it looks like in its raw format just do an option click on the web page and go to view page source at least that's how it works in chrome and this is the structured xml page and you can see we've got tags here it says race name circuit name location
and obviously these are not standard html tags they're defined for the purposes of this particular data set but we begin with one we have circuit name right there and then we close it using the slash right there and so this is structured data the computer knows how to read it which is exactly why it displays it this way by default and so it's a really good way of displaying data and it's a good way to know how to pull data from the web you can actually use what's called an api an application programming interface to access this xml data and it pulls it in along with its structure which makes working with it really easy what's even more interesting is how easy it is to take xml data and convert it between different formats because it's structured and the computer knows what you're dealing with so for example one it's really easy to convert xml to csv or comma separated value files that's the spreadsheet format because it knows exactly what the headings are and what piece of information goes in each column example two it's really easy to convert html documents to xml because you can think of html with its restricted set of tags as sort of a subset of the much freer xml and three you can convert csv or your spreadsheet comma separated value to xml and vice versa you can bounce them all back and forth because the structure is made clear to the programs that you're working with so in sum here's what we can say number one xml is semi-structured data what that means is it has tags to tell the computer what the piece of information is but you can make the tags whatever you want them to be and xml is very common for web data and it's really easy to translate the formats xml html csv so on and so forth it's really easy to translate them back and forth which gives you a lot of flexibility in manipulating data so you can get it into the format you need for your own analysis the last thing i want to mention about coding and
data science and web data is something called json and i like to think of it as a version of smaller is better now what json stands for is javascript object notation although javascript's supposed to be one word and what it is is that like xml json is semi-structured data that is you have tags that define the data so the computer knows what each piece of information is and like xml the tags can vary freely and so there's a lot in common between xml and json so xml is a markup language that's what the ml stands for and that gives meaning to the text that lets the computer know what each piece of information is also xml allows you to make comments in the document and it allows you to put metadata in the tags so you can actually put some information there in the angle brackets to provide additional context json on the other hand is specifically designed for data interchange and so it's got that special focus and the structure of json corresponds with data structures you know it directly represents objects and arrays and numbers and strings and booleans and that works really well with the programs that you use to analyze data also json is typically shorter than xml because it does not require the closing tags now there are ways to do that with xml but that's not typically how it's done as a result of these differences json is basically taking xml's place in web data xml still exists still used for a lot of things but json is slowly replacing it and we'll take a look at the comparison between the three by going back to the example we used in xml this is data about formula one car races in 1957 from ergast.com and so you can just go to the first web page here and then we'll navigate to the others from that so this is the general page this is if you just type in without the dot xml or json or anything so it's a table of information about races in 1957.
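and a quick way to see how json maps directly onto program data structures is with python's built-in json module — here's a minimal sketch using a tiny hand-made fragment shaped loosely like this race data (the field names here are illustrative, not the api's exact schema):

```python
import json

# a compact json string, shaped loosely like the race data (illustrative fields)
compact = ('{"MRData":{"series":"f1","season":"1957",'
           '"races":[{"raceName":"Argentine Grand Prix","round":"1"}]}}')

data = json.loads(compact)           # parse into python dicts and lists
pretty = json.dumps(data, indent=2)  # re-serialize with indentation ("pretty print")
print(pretty)

# the parsed structure is just nested python objects you can index into
print(data["MRData"]["races"][0]["raceName"])
```

that one-to-one mapping between the json text and the dicts and lists in your program is exactly why json works so well as a data carrier.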
and we saw earlier that if you add just .xml to the end of this it looks exactly the same that's because this browser is displaying xml properly by default but if you were to right click on it and go to view page source you would get this instead and you can see the structure this is still xml and so everything has an opening tag and a closing tag and some extra information in there but if you type in json what you really get is this jumbled mess now that's unfortunate because there is a lot of structure to this so what i'm going to do is i'm actually going to copy all of this data then i'm going to go to a little web page there are a lot of these you can use this one's called json pretty print and that is going to make it look structured so it's easier to read i just paste that in there and hit pretty print json and now you can see the hierarchical structure of the data the interesting thing is that json only has tags at the beginning so it says series in quotes and then a colon and then it gives the piece of information in quotes and a comma and it moves on to the next one and this is a lot more similar to the way that data would be represented in something like r or python and so it's also more compact again there's things you can do with xml but this is one of the reasons that json is generally becoming preferred as a data carrier for websites and as you might have guessed it's really easy to convert between the formats it's easy to convert between xml json csv etc and so you can get a web page where you just paste one version in and you get the other version out there are some differences but for the vast majority of situations they're just kind of interchangeable so in sum what do we get from this like xml json is semi-structured data where there are tags that say what the information is but you can define the tags however you want and json is specifically designed for data
interchange, and because it reflects the structure of the data in the programs that use it, that makes things really easy. and then also, because it's relatively compact, json is gradually replacing xml on the web as a container for data on web pages. if we're going to talk about coding in data science and the languages that are used, then first and foremost is r. the reason for that is that, according to many standards, r is the language of data and data science. for example, take a look at this chart. this is a ranking, based on a survey of data mining experts, of the software that they use in doing their work, and r is right there at the top. r is first, and in fact that's important, because there's python, which is usually taken hand in hand with r for data science, but r sees 50% more use than python does, at least in this particular list. now, there are a few reasons for that popularity. number one, r is free and it's open source, both of which make things very easy. second, r is especially developed for vector operations; that means it's able to go through an entire list of data without having to write for loops. if you've ever had to write for loops, you know that it would be kind of disastrous having to do that with data analysis. next, r has a fabulous community behind it. it's very easy to get help on things with r: you google it, and you're going to end up in a place where you're going to be able to find good examples of what you need. and probably most importantly, r is very capable on its own, but there are 7,000 packages (actually many more than that) that add capabilities to r; essentially, it can do anything. now, when you're working with r, you actually have a choice of interfaces, that is, how you actually do the coding and how you get your results. r comes with its own ide, or integrated development environment. you can do that, or, if you're on a mac or linux, you can actually run r through the terminal, through the command line; if you've
installed r, you just type r and it starts up. there's also a very popular development environment called rstudio, and that's actually the one that i use and will be using for all my examples. but another new competitor is jupyter, which is very commonly used for python, and it works in the browser window even though it's locally installed. between rstudio and jupyter, there are pluses and minuses to each one, and i'll mention them as we get to them. but no matter what interface you use, r is command line: you are typing lines of code in order to give the commands. some people get really scared about that, but really there are some advantages to it, in terms of the replicability and really the accessibility and transparency of your commands. so, for instance, here's a short example of some commands in r. you can enter them into what's called the console, just one line at a time; that's called an interactive way. or you can save scripts and run bits and pieces of them selectively, which makes your life a lot easier. no matter how you do it, if you're familiar with programming in other languages, then you're going to find that r is a little weird; it has an idiosyncratic model. it makes sense once you get used to it, but it is a different approach, and so it does take some adaptation if you're accustomed to programming in other languages. now, once you do your programming, to get your output, what you're going to get is graphs in a separate window, and text and numerical output in the console. and no matter what you get, you can save the output to files, which makes it portable; you can use it in other environments. but most importantly, and i like to think of this as our box of chocolates, where you never know what you're going to get, the beauty of r is in the packages that are available to expand its capabilities. now, there are two sources of packages for r. one goes by the name of
cran, and that stands for the comprehensive r archive network; that's at cran.rstudio.com. what that does is take these seven thousand or so different packages that are available and organize them into topics that they call task views, and for each one, if the authors have done their homework, you have data sets that come along with the package, you have a manual in pdf format, and you can even have vignettes, where they run through examples of how to do it. another interface is called crantastic! (the exclamation point is part of the title); that's at crantastic.org. this is an alternative interface that links to cran, so if you find something in crantastic and you click on the link, it's going to open in cran. but the nice thing about crantastic is that it shows the popularity of packages, and it also shows how recently they were updated, and that can be a nice way of knowing that you're getting sort of the latest and greatest. now, from this very abstract presentation, we can say a few things about r. number one, according to many, r is the language of data science. and it's a command line interface, you're typing lines of code, so that makes it both a strength and a challenge for some people. but the beautiful thing is the thousands and thousands of packages of additional code and capability that are available for r, which make it possible to do nearly anything in this statistical programming language. when talking about coding in data science and its languages, along with r we need to talk about python. now, python (the snake) is a general purpose language that can do it all, and that's its beauty. if we go back to this survey of the software used by data mining experts, you see that python is there, it's number three on the list, and what's significant is that, on this list, python is the only general purpose programming language; it's the only one that can theoretically be used to develop any kind of application that
you want. that gives it some special power compared to all these others, most of which are very specific to data science work. so, the nice things about python are, number one, it's general purpose. it's also really easy to use, and, if you have a macintosh or a linux computer, python is built into it. also, python has a fabulous community around it, with hundreds of thousands of people involved, and python has thousands of packages. now, it actually has something like 70 or 80 thousand packages, but in terms of ones that are specific to data, there are still thousands available, and they give it some incredible capabilities. now, a couple of things to know about python. first is about versions: there are two versions of python in wide circulation. there's 2.x, so that means like 2.5 or 2.6, and there's 3.x, like 3.1 or 3.2. version 2 and version 3 are similar, but they're not identical, and in fact the problem is this: there are some compatibility issues, where code that runs in one does not run in the other, and consequently most people have to choose between one or the other. what this leads to is that many people still use 2.x; i have to admit that in the examples i use, i'm using 2.x, because so many of the data science packages are developed with that in mind. now, let me say a few things about interfaces for python. first, python does come with its own integrated development and learning environment; they call it idle. you can also run it from the terminal or command line interface, or any ide that you have. now, a very common and very good choice is jupyter. jupyter is a browser-based framework for programming, and it was originally called ipython, which served as its initial version; so a lot of times, when people talk about ipython, what they're really talking about is python in jupyter, and the two are sometimes used interchangeably. one of the neat things is that there are two companies, continuum and
enthought, both of which have made special distributions of python, with hundreds and hundreds of packages pre-configured, to make it very easy to work with data. i personally prefer continuum's anaconda; it's the one that i use, and a lot of other people use it, but either one is going to work and is going to get you up and running. and, like i said with r, no matter what interface you use, all of them are command line: you're typing lines of code. again, there are some tremendous strengths to that, but it can be intimidating to some people at first. in terms of the actual commands of python, you have some examples here on the side; the important thing to remember is that it's a text interface. on the other hand, python is familiar to millions of coders, because it's very often the first programming language that people learn for general purpose programming, and there are a lot of very simple adaptations for data that make it very powerful for data science work. so let me say something else: data science loves jupyter. jupyter is the browser-based framework; it's a local installation, but you access it through a web browser, and that makes it possible to really do some excellent work in data science. there are a few reasons for this. when you're working in jupyter, you get text output, and you can use what's called markdown as a way of formatting documents. you can get inline graphics, where the graphics show up directly beneath the code that made them. also, it's really easy to organize, present, and share analyses that are done in jupyter, which makes it a strong contender for your choices in how you do data science programming. another one of the beautiful things about python, like r, is that there are thousands of packages available. in python there's one main repository; it goes by the name pypi, which is short for the python package index. right here, it says there are over eighty thousand packages, and seven or eight thousand of those are for data
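even before you add any contributed packages, the base language handles simple data work cleanly; a small sketch using only python's standard library (the salary numbers are made up for illustration):

```python
import statistics

# a handful of made-up salary values
salaries = [52000, 61000, 75000, 58000]

# the standard library already covers basic descriptive statistics
print(statistics.mean(salaries))    # 61500
print(statistics.median(salaries))  # 59500.0
```

the contributed packages discussed below (numpy, pandas, and friends) take over where these built-ins stop, once data gets large or the analysis gets serious.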
specific purposes. now, some of the packages that you'll get to be very familiar with are numpy and scipy, which are for scientific computing in general; matplotlib, and a development of it called seaborn, which are for data visualization and graphics; pandas, which is the main package for doing statistical analysis; and, for machine learning, almost nothing beats scikit-learn. when i go through hands-on examples in python, i will be using all of these as a way of demonstrating the power of the program for working with data. so, in sum, we can say a few things. number one, python is a very popular program, familiar to millions of people, and that makes it a good choice. second, of all the languages we use for data science on a frequent basis, this is the only one that's general purpose, which means it can be used for a lot of things other than processing data. and it gets its power, like r does, from having thousands of contributed packages, which greatly expand its capabilities, especially in terms of doing data science work. a choice for coding in data science, one of the languages that may not come immediately to mind when people think data science, is sql, or "sequel." sql is the language of databases, and we might think, why do we want to work in sql? well, to paraphrase the famous bank robber willie sutton, who apparently explained why he robbed banks by saying "because that's where the money is": the reason we work with sql in data science is because that's where the data is. so let's take another look at our ranking of software among data mining professionals, and there's sql, third on the list, and also, of this list, it's the first database tool. other tools in there are much fancier, and they're much newer and shinier, but sql has been around for a while and is very, very capable. now, there are a few things to know about sql. by the way, you'll notice i'm saying "sequel," even though it stands for structured query language. sql
is a language; it's not an application. there's not a program called sql; it's a language that can be used in different applications. primarily, sql is designed for what are called relational databases, and those are special ways of storing structured data, where you can pull things in, put things together, join them in special ways, and get some summary statistics, and then what you usually do is export that data into your analytical application of choice. so, the big word here is rdbms, and that stands for relational database management system; that's where you will usually see sql being used as a query language. in terms of relational database management systems, there are a few very common choices. in the industrial world, where people have some money to spend, oracle database is a very common one, as is microsoft sql server. in the open source world, two very common choices are mysql (even though we generally say "sequel," here you generally say my-s-q-l) and postgresql. these are both open source, free implementations; they're sort of dialects of each other that make it possible for you to work with databases and get your data out. the neat thing about them, no matter which you choose, is that databases minimize data redundancy by using connected tables. each table has rows and columns, and different tables store different levels of abstraction or measurement, which means you only have to put a piece of information in one place, and then lots of other tables can refer to it; that makes it very easy to keep things organized and up to date. when you're looking for a way of working with a relational database management system, you get to choose, in part, between using a graphical user interface, or gui, and some of those include sql developer and sql server management studio, two very common choices, and there are a lot of other ones, like toad and some other choices, that are graphical interfaces for
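a relational query can be sketched end to end with python's built-in sqlite3 module; the table and the values below are made up for illustration:

```python
import sqlite3

# an in-memory relational database with one made-up table
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE salaries (name TEXT, years INTEGER, salary REAL)")
con.executemany(
    "INSERT INTO salaries VALUES (?, ?, ?)",
    [("ana", 2, 60000), ("ben", 7, 95000), ("chi", 5, 80000)],
)

# select / from / where / order by, then hand the rows to the host language
rows = con.execute(
    "SELECT name, salary FROM salaries WHERE years > 3 ORDER BY salary DESC"
).fetchall()
print(rows)  # [('ben', 95000.0), ('chi', 80000.0)]
con.close()
```

this mirrors the usual workflow described here: the database does the selecting and organizing, and the rows come back into your analysis program.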
working with these databases. and there are also text-based interfaces, so really any command line interface and any integrated development environment or programming tool is going to be able to do that. now, you can think of yourself as being on the command deck of your ship and learn a few basic commands that are very important for working with sql; there's just a handful of commands that can get you most of where you need to go. there is the select command, where you're choosing the cases that you want to include; from says which tables you are going to be extracting them from; where is a way of specifying conditions; and then order by is just a way of putting it all in order. this works because usually, when you're in a sql database, you're just pulling out the information: you want to select it, you want to organize it, and then you're going to send the data to your program of choice, like r or python or whatever, for further analysis. so, in sum, here's what we can say about sql. number one, as a language it's generally associated with relational databases, which are very efficient and well-structured ways of storing data. just a handful of basic commands can be extremely useful when working with databases; you don't have to be a super ninja expert, really a handful of five or ten commands will probably get you everything you need out of a sql database. and then, once you get the data organized, it's typically exported to some other program for analysis. when you talk about coding in any field, one of the languages, or one of the groups of languages, that comes up most often is c, c++, and java. now, these are extremely powerful languages, very frequently used for professional, production-level coding. in data science, the place where you're going to see these languages most often is in the bedrock, the absolute fundamental layer that makes the rest of data science possible. so, for instance,
c and c++: c is from the early 70s, c++ is from the 80s, and they have extraordinarily wide usage. their major advantage is that they're really, really fast; in fact, c is usually used as the benchmark for how fast a language is. they're also very, very stable, which makes them really well suited to production-level code and, for instance, server use. what's really neat is that, in certain situations, if time is really important, if speed is important, then you can actually use c code in r or other statistical languages. next is java. java is based on c++; its major contribution was wora, or "write once, run anywhere," the idea that you're going to be able to develop code that is portable to different machines and different environments. and because of that, java is actually the most popular computer programming language overall, across all tech situations. the place where you would use these in data science is, like i said, when time is of the essence: when something has to be fast, has to get the job accomplished quickly, and has to not break, then these are the ones that you're probably going to use. and the people who are going to use them are primarily going to be engineers: the engineers and the software developers who deal with the inner workings of the algorithms in data science, or the back end of data science, the servers and the mainframes and the entire structure that makes analysis possible. in terms of analysts, people who are actually analyzing the data typically don't do hands-on work with the foundational elements; they don't usually touch c or c++. more of the work is on the front end, or closer to the high-level languages like r or python. in sum, c, c++, and java form the foundational bedrock and the back end of data and data science, and they do this because they're very fast and they are very reliable. on the other hand, given their nature, that work is typically reserved for
the engineers who are working with the equipment that runs in the back, that makes the rest of the analysis possible. i want to finish our extremely brief discussion of coding in data science, and the languages that can be used, by mentioning one other, called bash. and bash really is a great example of old tools that have survived and are still being used actively and productively with new data. you can think of it this way: it's almost like typing on your typewriter. you're working at the command line, typing out code through a command line interface, or cli. now, this method of interacting with computers practically goes back to the typewriter phase, because it predates monitors: before you even had a monitor, you would type in the code and it would print out on a piece of paper. the important thing to know about the command line is that it's simply a method of interacting; it's not a language, because lots of different languages can run at the command line. so, for instance, it's important to talk about the concept of a shell. in computer science, a shell is something that wraps around the computer: it's a shell around the operating system that serves as the interaction level for the user, to get things done at the lower levels that aren't really human-friendly. on mac computers and linux, the most common shell is bash, which is short for "bourne-again shell." on windows computers, the most common version is powershell. but whatever you do, there actually are a lot of choices: there's the bourne shell, there's the c shell (which is why i have a seashell right here), the z shell, there's fish, for "friendly interactive shell," and a whole bunch of other choices. but bash is the most common on mac and linux, and powershell is the most common on windows, as a method of interacting with the computer at the command line level. now, there are a few things you need to know about this. first, you have a prompt of some kind; in bash it's a dollar
sign, and that just means "type your command here." then, the other thing is, you type one line at a time. it's actually amazing how much you can get done with what's called a one-liner program, by piping things together, so one feeds into the other. you can run more complex commands if you use a script, where you call a text document that has a bunch of commands in it, and you can get much more elaborate analyses done. now, we have our tools here. in bash, we talk about utilities, and what these are are specific programs that accomplish specific tasks. bash really thrives on "do one thing and do it very well." there are two general categories of utilities for bash. number one is the built-ins; these are the ones that come installed with it, and so you're able to use them at any time by simply calling their name. some of the most common ones are cat, which is short for concatenate, and that's for putting information together. there's awk, which is its own interpreted programming language, but it's often used for text processing from the command line (by the way, the name comes from the initials of the people who created it). then there's grep, which is short for "global regular expression print"; it's a way of searching for information. and then there's sed, which stands for "stream editor," and its main use is to transform text. you can do an enormous amount with just these four utilities. a few more are head and tail, which display the first or last 10 lines of a document; sort and uniq, which sort and count the number of unique lines in a document; wc, which is for word count; and printf, which formats the output that you get in your console. and while you can get a huge amount of work done with just this small number of built-in utilities, there is also a wide range of installables, or other command line utilities, that you can add to bash or to whatever shell you're using. so, for instance, some really good ones that have been recently
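the sort-and-uniq counting idiom translates directly into other tools, too; a rough python analogue (a sketch for illustration, not a replacement for the shell utilities themselves):

```python
from collections import Counter

# counting repeated lines, roughly the way `sort file | uniq -c` would
lines = ["love", "lived", "love", "lovely", "love"]
counts = Counter(lines)

print(counts["love"])         # 3
print(counts.most_common(1))  # [('love', 3)]
```

the "do one thing and do it very well" philosophy is the same either way: a tiny, single-purpose counting step that you compose with whatever comes before and after it.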
developed are jq, which is for pulling in json, or javascript object notation, data from the web; and then there's json2csv, which is a way of converting json to csv format, which is what a lot of statistical programs are going to be happier with. there's rio, which allows you to run a wide range of commands from the statistical programming language r in the command line, as part of bash. and then there's bigmler, a command line tool that allows you to access bigml's machine learning servers through the command line; normally you access their service remotely through a web browser, so to be able to just pull it up when you're in the command line is an enormous benefit. what's interesting is that, even with all these different utilities already available, there still is active development of utilities for the command line. so, let's say this in sum: despite being, in one sense, as old as the dinosaurs, the command line survives because it is extremely well evolved and well suited to its purpose of working with data. the utilities, both the built-ins and the installables, are fast and easy, and generally they do one thing and they do it very, very well. and then, surprisingly, there is an enormous amount of very active development of command line utilities for these purposes, especially with data science. one critical task when you're coding in data science is to be able to find the things that you're looking for, and regex, which is short for "regular expressions," is a wonderful way to do that. you can think of it as the supercharged method for finding needles in haystacks. now, regex tends to look a little cryptic. so, for instance, here's an example of something that's designed to determine whether something is a valid email address: it specifies what can go at the beginning, you have the at sign in the middle, then you've
got a certain number of numbers and letters, and then you have to have a dot-something at the end. so this is a special kind of code for indicating what can go where. now, regular expressions, or regex, are really a form of pattern matching in text: a way of specifying exactly what needs to be where, what can vary, and how much it can vary. you can write patterns that are very specific (say, i only want a one-letter variation here) or very general, like the email validator that i showed you. and the idea is that you can write this search pattern, your little wildcard thing, find the data, and then, once you identify those cases, export them into another program for analysis. so, here's a short example of how it can work. what i've done is taken some text documents; they're actually the text of emma and of pygmalion, two books i got off of project gutenberg. and this is the command: grep, then caret, l, dot, v, e, then a space and asterisk-dot-txt; that is, grep "^l.ve" *.txt. so what i'm looking for are lines in either of these books that start with l, then have one character that can be whatever, followed by v and e; and the *.txt means search all of the text files in that particular folder. what it found is lines that began with "love" and "lived" and "lovely" and so on. now, in terms of the actual nuts and bolts of regular expressions, there are certain elements. there are literals, and those are things that mean exactly what they are: you type the letter l, you're looking for the letter l. there are also metacharacters; these are characters, but they're really code, representations that specify, for instance, what needs to go here. there are also escape sequences, which you use to say: normally this character is used as a wildcard, but i actually want to look for a literal period, as opposed to a placeholder. then you have the entire search expression that you create, and then you have the target string, the
thing that it's searching through. so, let me give a few very short examples. this is the caret; it's sometimes called a hat, or in french a circumflex. what it means is that you're looking for something at the beginning of the text that you're searching. so, for example, you can have caret and capital m; that means you need something that begins with a capital m. so, for instance, the word "Mac": true, it will find that. but if you have "iMac," there's a capital m, but it's not the first thing, so that'll be false; it won't find that. the dollar sign means you're looking for something at the end of the string. so, for example, "ing" and then a dollar sign will find the word "fling," because it ends with ing, but it won't find the word "flings," because that actually ends with an s. and then the dot, the period, simply means we're looking for one character, and it can be anything. so, for example, you can write a, t, period, and that will find "data," because it has an a, a t, and then one letter after it; but it won't find "flat," because flat doesn't have anything after the "at." these are extremely simple examples of how it can work; obviously it gets more complicated, and the real power is when you start combining these bits and elements. now, one interesting thing about this is that you can actually treat it as a game. i love this website, it's called regex golf, and it's at regex.alf.nu. what it does is bring up two columns of words, and your job is to write a regular expression at the top that matches all the words in the left column, and none of the words in the right, using the fewest characters possible. you get a score, and it's a great way of learning how to do regular expressions, and how to search in a way that's going to get you the data that you need for your projects. so, in sum: regex, or regular expressions, help you find the right data for your project. they're very powerful and they're very flexible. now, on the
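the caret, dollar-sign, and dot examples above can be checked directly with python's re module:

```python
import re

# ^M : the match must sit at the start of the string
print(bool(re.search("^M", "Mac")))       # True
print(bool(re.search("^M", "iMac")))      # False

# ing$ : the match must sit at the end of the string
print(bool(re.search("ing$", "fling")))   # True
print(bool(re.search("ing$", "flings")))  # False

# at. : "at" followed by exactly one more character, which can be anything
print(bool(re.search("at.", "data")))     # True
print(bool(re.search("at.", "flat")))     # False
```

the same patterns work nearly unchanged in grep, r, and most other tools, which is part of what makes regex such a transferable skill.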
other hand, they are cryptic, at least when you first look at them; but at the same time it's like a puzzle, and it can be a lot of fun if you practice it and see how you can find what you need. i want to thank you for joining me in coding in data science, and we'll wrap up this course by talking about some of the specific next steps that you can take for working in data science. the idea here is that you want to get some tools, and you want to start working with those tools. now, please keep in mind something that i've said at another time: data tools and data science are related, and they're important, but don't make the mistake of thinking that knowing the tools is the same thing as actually having conducted data science. that's not true; people sometimes get a little enthusiastic and get a little carried away. what you need to remember is that the relationship really is this: data tools are an important part of data science, but data science itself is much bigger than just the tools. now, speaking of tools, remember there are a few kinds that you can use and that you might want to get some experience with. number one, in terms of apps, or specific built applications, excel and tableau are really fundamental, both for getting data from clients and for doing some basic data browsing, and tableau is really wonderful for interactive data visualization; i strongly recommend that you get very comfortable with both of those. in terms of code, it's a good idea to learn either r or python, or ideally to learn both, because you can use them hand in hand. in terms of utilities, it's a great idea to learn how to work with bash, the command line shell, and to use regular expressions, or regex; you can actually use regular expressions in lots and lots of programs, so they have a very wide application. and then, finally, data science requires some kind of domain expertise: you're going to need some sort of field experience,
or an intimate understanding of a particular domain, and of the challenges that come up, what constitutes workable answers, and the kind of data that's available. now, as you go through all of this, you don't need to build a monstrous list of things. remember, you don't need everything: you don't need every tool, you don't need every function, you don't need every approach. instead, get what's best for your needs and for your style. but no matter what you do, remember: tools are tools, a means to an end. you want to focus on the goal of your data science project, whatever it is, and i can tell you, really, the goal is meaning: extracting meaning out of your data to make informed choices. in fact, i'll say a little more: the goal is always meaning. and so, with that, i strongly encourage you: get some tools, get started in data science, and start finding meaning in the data that's around you. welcome to mathematics in data science. i'm barton poulson, and we're going to talk about how mathematics matters for data science. now, you may be saying to yourself: why math? computers can do it, i don't need to do it; really, fundamentally, i don't need math, i'm just here to do my work. well, i'm here to tell you: no, you need math, that is, if you want to be a data scientist, and i assume that you do. so, we're going to talk about some of the basic elements of mathematics, really at a conceptual level, and how they apply to data science. there are a few ways that math really matters to data science. number one, it allows you to know which procedures to use, and why, so you can answer your questions in a way that's the most informative and most useful. two, if you have a good understanding of math, then you know what to do when things don't work right, when you get impossible values or things won't compute, and that makes a huge difference. and then three, an interesting thing is that some mathematical procedures are easier and quicker to do by
hand than by actually firing up the computer. and so, for all three of these reasons, it's helpful to have at least a grounding in mathematics if you're going to do work in data science. now, probably the most important thing to start with is algebra, and there are three kinds of algebra that we want to mention. the first is elementary algebra; that's the regular x-plus-y kind. then there's linear, or matrix, algebra, which looks more complex but is conceptually simpler, and is used by computers to actually do the calculations. and then, finally, i'm going to mention systems of linear equations, where you have multiple equations that you're trying to solve simultaneously. now, there's more math than just algebra. a few other things that i'm going to cover in this course: a little bit of calculus; a little bit of big o, or order, which has to do with the speed or complexity of operations; a little bit of probability theory; and a little bit of bayes, or bayes' theorem, which is used for getting posterior probabilities and changes the way that you interpret the results of an analysis. and, for the purposes of this course, i'm going to demonstrate the procedures by hand; of course, you would use software to do this in the real world, but we're dealing with simple problems at a conceptual level, and really the most important thing to remember is that, even though a lot of people get put off by math, really, you can do it. and so, in sum, let's say these three things about math. first off, you do need some math to do good data science. it helps you diagnose problems, and it helps you choose the right procedures. and, interestingly, you can do a lot of it by hand, or you can use software and computers to do the calculations as well. as we begin our discussion of the role of mathematics in data science, we will of course begin with the foundational elements, and in data science nothing is more foundational than elementary algebra. now, i'd like to begin with really just a little
bit of history. in case you're not aware, the first book on algebra was written around 820 by muhammad ibn musa al-khwarizmi, and it was called the compendious book on calculation by completion and balancing. the arabic title contains the word al-jabr, which is where the word algebra comes from, and it means restoration. in any case, that's where it comes from, and for our concerns there are several kinds of algebra that we're going to talk about: there's elementary algebra, there's linear algebra, and there are systems of linear equations, and we'll talk about each of those in different videos. but to put it into context, let's take an example here of salaries. now this is actually based on real data from a survey of the salary of people employed in data science, and to give a simple version of it, the salary was equal to a constant, that's sort of an average value that everybody started with, and to that you added years, and then you added some measure of bargaining skills and how many hours they worked per week, and that gave you a prediction. but because it wasn't exact, there's also some error to throw into it to get to the precise value that each person has. now if you want to abbreviate this you can write it kind of like this: s equals c plus y plus b plus h plus e, although it's more common to write it symbolically, so let's go through the symbolic equation very quickly. the first thing we have is the outcome, the variable y for person i, where i stands for each case in our observations. so here's outcome y for person i. the next letter is a greek beta, and it represents the intercept or the average; it has a zero subscript because we don't multiply it times anything. but right next to it we have the coefficient for variable one, so beta sub one, our coefficient for the first variable, and then we have variable one, and then x one means
variable 1 and then the i  means it’s the score on that variable  for person i whoever we’re talking about  then we do the same thing for variables  2 and 3 and then at the end we have  little epsilon here with an i for the  error term for person i which says how  far off the prediction was from their  actual score  now i’m going to run through some of  these procedures and we’ll see how they  can be applied to data science but for  right now let’s just say this in sum  first off algebra is vital to data  science  it allows you to combine multiple scores  get a single outcome do a lot of other  manipulations and really the  calculations are easy for one case at a  time especially when you’re doing it by  hand  the next step in mathematics for data  science foundations is to look at linear  algebra or an extension of elementary  algebra  and depending on your background you may  know this by another name and i like to  think welcome to the matrix because it’s  also known as matrix algebra because  we’re dealing with matrices  now let’s go back to an example i gave  in the last video about salary where  salary is equal to a constant plus years  plus bargaining plus hours plus error  okay that’s a way to write it out in  words and if you want to put it in  symbolic form it’s going to look like  this  now before we get started with matrix  algebra we need to talk about a few new  words maybe you’re familiar with them  already  the first is  scalar and this means a single number  and then a vector  is a single row or a single column of  numbers that can be treated as like a  collection that usually means a  variable and then finally a matrix  consists of many rows and columns sort  of a big rectangle of numbers the plural  of that by the way is matrices and the  thing to remember is that machines love  matrices  now let’s take a look at a very simple  example of this here is a  very basic representation  of matrix algebra or linear algebra  where we’re showing data on 
two people  on four variables  so over here on the left we have the  outcomes for cases one and two are  people one and two and you put them in  the square brackets to indicate that  it’s a vector or a matrix  here on the far left it’s a vector  because it’s a single column of values  next to that is a matrix that has here  on the top the scores for case 1 which  i’ve written as x’s  x1 is for variable one x2 is for  variable two and the second subscript is  to indicate that it’s for person one  below that are the scores for case two  the second person and then over here in  another vertical column are the  regression coefficients that’s a beta  there that we’re using  and then finally we’ve got a tiny little  vector here at the end which contains  the error terms for cases 1 and 2. now  even though you would not do this by  hand it’s kind of helpful to run through  the procedure so i’m going to show it to  you by hand and we’re going to take two  fictional people this will be fictional  person number one we’ll call her sophie  we’ll say that she’s 28 years old and  will say that she has good bargaining  skills a four on a scale of five and  that she works 50 hours a week  and that her salary is a hundred and  eighteen thousand  our second fictional person we’ll call  him lars and we’ll say that he’s 34  years old and he has moderate bargaining  skills three out of five  works 35 hours per week and has a salary  of 84 000  and so  if we’re trying to look at salaries we  can go back to our matrix representation  that we had here with our variables  indicated with their latin and sometimes  greek symbols  and we’re going to replace those  variables with actual numbers so we can  get the salary for sophie our first  person  so let me plug in the numbers here  and let’s start with the result here  sophie’s salary is 118 000 and here’s  how these numbers all add up to get that  the first thing here is the intercept  and we just multiply that times one so  that’s sort of 
the starting point. and then we get this number 10, which has to do with years over 18: she's 28, so that's 10 years over 18, and we multiply each year by 13.95. next is bargaining skills: she's got a four out of five, and for each step up you get fifty nine hundred dollars. by the way, these are real coefficients from a survey of the salaries of data scientists. and then finally hours per week: for each hour you get 382 dollars. now we can add those up and get a predicted value for her, but it's a little low, it's 30 000 low, which you may say, well, that's really messed up. well, that's because there are something like 40 variables in the full equation, including whether she might be the owner, and if she's the owner, yeah, she's going to make a lot more. and then we do a similar thing for the second case. but what's neat about matrix algebra or linear algebra is that you can use matrix notation, which means the same stuff: y equals x times beta plus epsilon. what we have here are these bolded variables that stand in for entire vectors or matrices. so for instance, this bold y stands for the vector of outcome scores, this bolded x is the entire matrix of values that each person has on each variable, this bolded beta is all of the regression coefficients, and then this bolded epsilon is the entire vector of error terms, and so it's a really super compact way of representing the entire collection of data and coefficients that you use in predicting values. so in sum, let's say this: first off, computers use matrices, they like to do linear algebra to solve problems, and it's conceptually simpler because you can put it all there in this tight formation. in fact it's a very compact notation and allows you to manipulate entire collections of numbers pretty easily, and that's the major benefit of learning a little bit about linear or matrix algebra. our next step in mathematics for data science foundations is systems of linear equations, and maybe you're familiar with this but maybe you're
not and the idea here is  there are times when you actually have  many unknowns and you’re trying to solve  for all of them simultaneously and what  makes this really tricky is a lot of  these are interlocked  specifically that means  x depends on y but at the same time y  depends on x  what’s funny about this is it’s actually  pretty easy to solve these by hand and  you can also use linear matrix algebra  to do it  so let’s take a little example here  of sales let’s imagine that you’ve got a  company and you’ve sold 1 000 iphone  cases so they’re not running around  naked like in this picture  and that some of the cases sold for 20  and others sold for five dollars  you made a total of fifty nine hundred  dollars and so the question is how many  were sold at each price  now hopefully you were keeping your  records  but you can also calculate it from this  little bit of information  and to show you i’m going to do it by  hand  now we’re going to start with this we  know that sales the two price points x  and y add up to 1000 total cases sold  and for revenue  we know that if you multiply a certain  number times twenty dollars and another  number times five dollars that it all  adds up to fifty nine hundred  between the two of those we can figure  out the rest  let’s start with sales  now what i’m going to do is i’m going to  try to isolate the values and i’m going  to do that  by putting in this minus y on both sides  and  then i can take that and i can subtract  it so i’m left with x  is equal to 1000 minus y normally you  solve for y but i solve for x you’ll see  why in just a second  then we go to revenue and we know from  earlier that our sales of these two  price points add up to 5900 total well  what we’re going to do is we’re going to  take this x that’s right here and we’re  going to replace it with the equation we  just got  which is one thousand minus y  then we multiply that through and we get  twenty thousand minus twenty y plus five  y equals fifty 
nine hundred. well, we can combine the minus twenty y and the plus five y because they're like terms, and that gives us minus fifteen y. and then we subtract 20 000 from both sides, so it disappears on the left and shows up on the right, and i do the math there and get minus fourteen thousand one hundred. well, then i divide both sides by negative fifteen, and when we do that we get y is equal to 940. okay, so that's one of our values for sales. so let's go back to sales: we have x plus y equals 1000. we take the value that we just got, 940, we stick that into the equation, and then we can solve for x, just subtract 940 from each side. there we go, we get x is equal to 60. so let's put it all together, just to recap what happened. what this tells us is that 60 cases were sold at twenty dollars each, and that nine hundred and forty cases were sold at five dollars each. now what's interesting about this is you can also do it graphically, so we're going to draw it. i'm going to graph the two equations here, the original ones we had, one for total cases sold and one for revenue. the problem is these really aren't in the canonical form for creating graphs, which needs to be y equals something else, so we're going to solve both of these for y. for the cases equation we subtract x from both sides, and then we have y is equal to minus x plus 1 000. that's something that we can graph. then we do the same thing for the revenue equation: let's divide by 5 all the way through, which gives us 4x plus y equals 1180, and then let's subtract 4x from each side, and what we're left with is y equals minus 4x plus 1180.
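the substitution we just did by hand can also be sketched in a few lines of python; this is just a check of the arithmetic, and the variable names are mine, not from the course:

```python
# two equations from the example:
#   x + y = 1000       (cases sold at $20 and at $5 add up to 1000)
#   20x + 5y = 5900    (total revenue in dollars)
total_cases = 1000
total_revenue = 5900
price_x, price_y = 20, 5

# substitute x = 1000 - y into the revenue equation:
#   20*(1000 - y) + 5*y = 5900  ->  (5 - 20)*y = 5900 - 20*1000
y = (total_revenue - price_x * total_cases) / (price_y - price_x)
x = total_cases - y

print(x, y)  # 60.0 940.0
```

with more unknowns you would hand the matrix form of the same system to software rather than substituting by hand, which is exactly the point made earlier about linear algebra.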
that's also something that we can graph. so here's the first line: this indicates cases sold. it originally said x plus y equals one thousand, but we rearranged it to y is equal to minus x plus one thousand, and so that's the line we have here. and then we have another line, which indicates earnings, and this one was originally written as twenty dollars times x plus five dollars times y equals fifty nine hundred dollars total. we rearrange that to y equals minus four x plus 1180, and that's the equation for the line. and then the solution is right at the intersection, and it's at 60 on the number of cases sold at twenty dollars and 940 on the number of cases sold at five dollars, and that also represents the solution of these joint equations, so it's a graphical way of solving a system of linear equations. so in sum, systems of linear equations allow us to balance several unknowns and find the unique solution; in many cases it's easy to solve by hand, and it's really easy with linear algebra when you use software to do it at the same time. as we continue our discussion of mathematics in data science and the foundational principles, the next thing we want to talk about is calculus, and i'm going to give a little more history right here. the reason i'm showing you pictures of stones is because the word calculus is latin for a small stone, as in a stone used for counting, where people would actually have a little bag of stones and they would move them around and use them to count sheep or whatever. and the system of calculus was formalized in the 1600s, simultaneously and independently, by isaac newton and gottfried wilhelm leibniz. and there are three reasons why calculus is important for data science. number one, it's the basis of most of the procedures that we do; things like least squares regression and probability distributions use calculus in getting those answers. the second one is that if you're studying anything that changes
over time, so if you're measuring quantities or rates that change over time, then you have to use calculus. and third, calculus is used in finding the maxima and minima of functions, especially when you're optimizing, which is something i'll show you separately. also, it's important to keep in mind there are two kinds of calculus. the first is differential calculus, which talks about rates of change at a specific time; it's also known as the calculus of change. the second kind is integral calculus, and this is where you're trying to calculate the quantity of something at a specific time given the rate of change; it's also known as the calculus of accumulation. so let's take a look at how this works, and we're going to focus on differential calculus. i'm going to graph an equation here, y is equal to x squared, a very simple one, but it's a curve, which makes it harder to calculate things like the slope. so let's take a point here at x is equal to -2, that's my little red dot. and because y is equal to x squared, if we want to get the y-value all we've got to do is take that negative 2 and square it, and that gives us 4.
so that's pretty easy, so the coordinates for that red point are minus 2 on x and plus 4 on y. here's a harder question: what is the slope of the curve at that exact point? well, it's actually a little tricky, because the curve is always curvy and there's no flat part on it, but we can get the answer by getting the derivative of the function. now there are several different ways of writing this; i'm using the one that's easiest to type. and here's how it works, using the power rule: the derivative of x to the n is n times x to the n minus 1. in our case the n is the squared part, so we had x squared, we bring that 2 down in front as a multiplier, and then we subtract one from the exponent. 2 minus 1 is 1, and truthfully you can just ignore an exponent of 1, so you get 2x. that is the derivative, so what we have here is that the derivative of x squared is 2x. that means the slope at any given point of the curve is 2x. so let's go back to what we had a moment ago: here's our curve, here's our point at x minus 2, and so the slope is equal to 2x. well, we put in the minus 2, we multiply it, and we get minus 4. so that is the slope at this exact point on the curve. okay, what if we choose a different point? let's say we come over here to x is equal to 3. well, the slope is equal to 2x, so that's 2 times 3 is equal to 6.
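here's a small sketch of that idea in python, checking the power-rule answer 2x against a numerical slope estimate; the function names are my own, not from the course:

```python
def f(x):
    # the curve from the example, y = x**2
    return x ** 2

def numeric_slope(f, x, h=1e-6):
    # central-difference approximation of the derivative at x
    return (f(x + h) - f(x - h)) / (2 * h)

def analytic_slope(x):
    # power rule: the derivative of x**2 is 2*x
    return 2 * x

print(analytic_slope(-2), numeric_slope(f, -2))  # both about -4
print(analytic_slope(3), numeric_slope(f, 3))    # both about 6
```

the numerical version just nudges x a tiny amount in each direction and measures rise over run, which is exactly the slope idea the derivative formalizes.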
great. on the other hand, you might be saying to yourself, why do i care about this? there's a reason this is important, and it's that you can use these procedures to optimize decisions, and if that seems a little too abstract to you, it means you can use them to make more money. i'm going to demonstrate that in the next video, but for right now, in sum, let's say this: calculus is vital to practical data science, it's the foundation of statistics, and it forms the core that's needed for doing optimization. in our discussion of mathematics and data science foundations, the last thing i want to talk about right here is calculus and how it relates to optimization. i'd like to think of this, in other words, as the place where math meets reality, or it meets manhattan or something. now if you remember this graph i made in the last video, y is equal to x squared, that shows this curve here, and we have the derivative, so the slope can be given by 2x, and so when x is equal to 3 the slope is equal to 6.
fine, and this is where this comes into play: calculus makes it possible to find values that maximize or minimize outcomes. and if you want to make something a little more concrete out of this, let's think of an example (by the way, that's cupid and psyche): let's talk about pricing for online dating. let's assume you've created a dating service and you want to figure out how much you can charge for it that will maximize your revenue. so let's get a few hypothetical parameters involved. first off, let's say that annual subscriptions currently cost 500 a year, and you can charge that for a dating service, and let's say you sell 180 new subscriptions every week. on the other hand, based on your previous experience moving prices around, you have some data that suggests that for each five dollars you discount from the price of five hundred dollars, you will get three more sales. also, because it's an online service, let's just make our lives a little simpler right now and assume that there is no increase in overhead. that's not really how it works, but we'll do it for now. and i'm actually going to show you how to do all this by hand. now let's go back to price. first we have this: 500 is the current annual subscription price, and you're going to subtract five dollars for each unit of discount, which is what i'm calling d. so one discount is five dollars, two discounts is ten dollars, and so on. and then we have a little bit of data about sales: you're currently selling 180 new subscriptions per week, and you will add three more for each unit of discount that you give. so what we're going to do here is find sales as a function of price. now to do that, the first thing we have to do is get the y-intercept. so we have price here, 500 dollars, the current annual subscription price, minus five dollars times d, and we're going to get the y-intercept by solving when does this
equal zero. okay, well, we take the 500 and subtract it from both sides, and then we end up with minus five d is equal to minus 500. divide both sides by -5 and we're left with d is equal to 100. that is, when d is equal to 100, the price is 0, and that tells us how we can get the y-intercept. but to get that we have to substitute this value into sales. so we take d is equal to 100, and the intercept is equal to 180 plus 3 times d: 180 is the number of new subscriptions per week, and then we take the 3 and multiply it times our 100 discounts, so 3 times 100 is equal to 300. add 180 and 300 together and you get 480, and that is the y-intercept in our equation. so when we've discounted the price down to zero, when price is zero, then the expected sales is 480. of course that's not going to happen in reality, but it's necessary for finding the slope of the line. and so now let's get the slope. the slope is equal to the change in y, on the y-axis, divided by the change in x. one way we can get this is by looking at sales, where we get our 180 new subscriptions per week plus three for each unit of discount, and by taking our information on price, 500 per year minus five dollars for each unit of discount. then we take the 3d and the 5d, and those will give us the slope: it's plus 3 divided by minus 5, and that's just minus 0.6, and so that is the slope of the line. slope is equal to minus 0.6. and so what we have from this is sales as a function of price, where sales is equal to 480, because that's the y-intercept when x is equal to zero, when price is zero, minus 0.6 times price. so this isn't the final thing; now what we have to do is turn this into revenue, so there's another stage to this. now revenue is equal to sales times the price, you know, how many things did you sell and how much did each cost. well, we can substitute in some information here. if we take sales and we put it in as a function of price, because we just calculated that a moment ago, we
get this, and then we do a little bit of multiplication, and we get that revenue is equal to 480 times the price minus 0.6 times the square of the price. okay, that's a lot of stuff going on there. what we're going to do now is get the derivative, that's the calculus that we talked about. well, the derivative of 480 times the price, where price is playing the role of x, is simply 480. and the minus 0.6 times the square of the price, well, that's very similar to the thing we did with the curve: 0.6 times 2 is equal to 1.2, so that term's derivative is minus 1.2 times the price. so the derivative of the original equation is 480 minus 1.2 times the price, and we can solve that for 0 now. and just in case you're wondering why we solve it for 0: because that is going to give us the place where y is at a maximum. the squared term has a minus on it, so the parabola is inverted, opening downward, and we're trying to look for the value right at the very tippy top of the curve, because that will indicate maximum revenue. okay, so what we're going to do is solve for zero. let's go back to our equation; we want to find out when it is equal to zero. well, we subtract 480 from each side, and we divide by minus 1.2 on each side, and we get 400, and this is our price for maximum revenue. so we've been charging 500 a year, but this says we'll have more total income if we charge 400 instead. and if you want to know what the sales volume is going to be for that, well, you take the 480, which is the hypothetical y-intercept when the price is zero, and then we put in our actual price of 400, multiply it by the 0.6 and we get 240, do the subtraction and we get 240 total. so that would be 240 new subscriptions per week. so let's compare this. the current revenue is a hundred and eighty new subscriptions per week at five hundred dollars per year, and that means that our current revenue is ninety thousand dollars per
year. i know, it sounds really good, but we can do better than that, because the formula for maximum revenue is 240 times 400, and when you multiply those you get 96 000. and so the improvement is just the ratio of those two: 96 000 divided by 90 000 is about 1.07, and what that means is roughly a seven percent increase, and anybody would be thrilled to get a seven percent increase in their business simply by changing the price and increasing the overall revenue. so let's summarize what we found here. if you lower the cost by 20 percent, going from 500 per year to 400 per year, and assuming all of our other information is correct, then you can increase sales by 33 percent, that's more than the 20 percent that you gave up, and that increases total revenue by 7 percent. and so we can optimize the price to get the maximum total revenue, and it has to do with this little bit of calculus and the derivative of a function. so in sum, calculus can be used to find the minima and the maxima of functions, including prices; it allows for optimization, and that in turn allows you to make better business decisions. our next topic in mathematics and data science principles is something called big o, and if you're wondering what big o is all about, well, it is about time, or you can think of it as how long does it take to do a particular operation, the speed of the operation. if you want to be really precise, the growth rate of a function, how much more it requires as you add elements, is called its order; that's why it's big o, the o is for order. and big o gives the rate of how things grow as the number of elements grows, and what's funny is there can be really surprising differences. let me show you how it works with a few different kinds of growth rates or big o. first off, there's the ones that i say are sort of just on the spot, you can get stuff done right away. the simplest one is o of 1, and that is constant order, something that takes the same amount of time no matter what. you can send out an email
to ten  thousand people just hit one button it’s  done the number of elements number of  people the number of operations it just  takes the same amount of time  up from that is logarithmic where you  take the number of operations you get  the logarithm of that and you see it’s  increased but it’s really only a small  increase and it tapers off really  quickly so an example is finding an item  in a sorted array not a big deal  next one up from that now this looks  like a big  change but in the grand schemes it’s not  a big change this is a linear function  where each operation takes the same unit  of time and so if you have 50 operations  it takes 50 units of time if you’re  storing 50 things it takes 50 units of  space  so find an item in an unsorted list it’s  usually going to be linear time  then we have the functions where i say  you know you better just pack a lunch  because it’s going to take a little  while  the best example of this is what’s  called log linear that’s where you take  the number of items and you multiply  that number times the log of the items  and an example of this is something  called a fast fourier transform which is  used for dealing for instance with sound  or anything that’s over time  you can see it takes a lot longer if  you’ve got 30 elements you’re way up  there at the top of this particular  chart at 100 units of time or 100 units  of space wherever you want to put it  and it looks like a lot  but really that’s nothing compared to  the next set where i say you know you’re  just going to be camping out you might  as well go home  that includes something like the  quadratic you square the number of  elements and see how that just kind of  shoots straight up that’s quadratic  growth  and so multiplying two n digit numbers  so if you’re multiplying two numbers  that each have 10 digits it’s going to  take you that long it’s going to take a  long time  even more extreme is this one this is  the exponential two raised to the power  of 
the number of items you have. you'll see, by the way, the red line here doesn't even go to the top; that's because the graphing software that i'm using doesn't draw when it goes above my upper limit, so it kind of cuts it off. but this is a really demanding kind of thing; for instance, finding an exact solution to what's called the traveling salesman problem using dynamic programming is an example of an exponential rate of growth. and then one more i want to mention, which is sort of catastrophic, is factorial: you take the number of elements n and compute n factorial, written with an exclamation point as n!. and you see that one cuts off really soon, because it basically goes straight up; with any number of elements of any size it's going to be hugely demanding. for instance, if you're familiar with the traveling salesman problem, trying to find a solution through brute force search just takes an extraordinary amount of time, and so, you know, before something like that's done you're probably just going to turn to stone and wish you never even started. the other thing to know about this is that not only do some things take longer than others, some of these methods, some functions, are more variable than others. so for instance, if you're working with data that you want to sort, there are different kinds of sorts or sorting methods. there's something called an insertion sort, and what you find is that on its best day it's linear, it's o of n, that's not bad; on the other hand, the average is quadratic, and that's a huge difference between the two. selection sorts, on the other hand, the best is quadratic and the average is quadratic, it's always consistent. so it's kind of funny: it takes a long time, but at least you know how long it's going to take, versus the variability of something like an insertion sort. so in sum, let me say a few things about big o. number one, you need to know that certain functions
or procedures vary in speed, and the same thing applies to making demands on the computer's memory or storage space; they vary in their demands. also, some of them are inconsistent, really efficient sometimes and really slow or difficult at other times. probably the most important thing here is to be aware of the demands of what you're doing, because you can't, for instance, just run through every single possible solution, or, you know, your company will be dead before you get an answer. so be mindful of that, so you can use your time well and get the insight you need in the time that you need it. a really important element of mathematics in data science, and one of its foundational principles, is probability. now one of the places where probability comes in intuitively for a lot of people is something like rolling dice or looking at sports outcomes, and really the fundamental question, what are the odds of something, gets at the heart of probability. now let's take a look at some of the basic principles; we've got our friend albert einstein here to explain things. the principles of probability work this way: probabilities range from zero to one, that's like a zero percent to a hundred percent chance. when you put p, that stands for probability, and then something in parentheses, that means the probability of whatever is in the parentheses, so p of a means the probability of a, and p of b is the probability of b. when you take all of the probabilities together you get what's called the probability space, and that's why we have s, and it all adds up to one, because you've now covered 100 percent of the possibilities. also, you can talk about the complement: the tilde here is used to say the probability of not a is equal to 1 minus the probability of a, because those have to add up. so let's take a look at something also about conditional probabilities, which is really important in statistics. a conditional probability is the probability of
something if something  else is true you write it this way the  probability of and that vertical line is  called a pipe and it’s read as  assuming that or given that so you can  read this as probability of a given b  is the probability of a occurring if b  is true  and so you can say for instance what’s  the probability if something’s orange  what’s the probability that’s a carrot  given this picture  now the place where this comes in  really important for a lot of people is  the probabilities of type 1 and type 2  errors in hypothesis testing which we’ll  mention at some other point  but i do want to say a few things about  arithmetic with probabilities because it  doesn’t always work the way that people  think it will  let’s start by talking about adding  probabilities let’s say you have two  events a  and b  and let’s say you want to find the  probability of either one of those  events so that’s like adding the  probabilities of the two events  well it’s kind of easy you take the  probability of event a and you add the  probability of event b however  you may have to subtract something you  may have to subtract this little piece  because maybe there’s some overlap  between the two of them  on the other hand if a and b are  disjoint which means they never occur  together then that’s equal to zero and  then you can  you know  subtract zero which is you get back to  the original probabilities but let’s  take a really easy example of this i’ve  created my super simple sample space i  have ten shapes i got five squares on  the top five circles on the bottom i’ve  got a couple of red shapes on the right  side  let’s say we want to find the  probability of a square or a red shape  so  we are adding the probabilities but we  have to adjust for the overlap between  the two  well here’s our squares on top five out  of the ten are squares and over here on  the right we have two red shapes two out  of ten  so let’s go back to our formula here  and let’s change a little bit 
change the a and the b to s and r for square and red now we can start this way let’s get the probability that something is a square well we go back to our probability space you see we have five squares out of 10 shapes total so we do 5 over 10 that reduces to 0.5 okay next up the probability of something red in our sample space well we have 10 shapes total two of them on the far right are red so that’s 2 over 10 and you do the division you get 0.2 now the trick is the overlap between these two categories do we have anything that is both square and red because we don’t want to count that twice so we have to subtract it so let’s go back to our sample space and we’re looking for something that is square there’s the squares on top and there’s the things that are red on the side and you see they overlap and this is our little overlapping red square so there’s one shape that meets both of those 1 out of 10 so we come back here we do 1 out of 10 that reduces to 0.1 and then we just do the addition and subtraction here 0.5 plus 0.2 minus 0.1 gets us 0.6 and so what that means is there’s a 60 percent chance of an object being square or red and you can look at it right here we’ve got six shapes outlined now and so that’s the visual interpretation that lines up with the mathematical one we just did now let’s talk about multiplication for probabilities now the idea here is you want to get what are called joint probabilities so the probability of two things occurring together simultaneously and what you need to do here is multiply the probabilities and we can say probability of a and b because we’re asking about a and b occurring together a joint occurrence and it’s equal to the probability of a times the probability of b that’s easy but you do have to expand it just a little bit because you can have the problem of things overlapping a little bit and so you actually need to expand it to a conditional probability the
probability of b given a again that’s the vertical pipe there on the other hand if a and b are independent that is if b is no more likely to occur when a happens then it just reduces to the probability of b and you get your slightly simpler equation but let’s go and take a look at our sample space right here so we’ve got our 10 shapes five of each kind and then two that are red and where originally we looked at the probability of something being square or red now we’re going to look at the probability of being square and red now i know we can eyeball this one really easily but let’s run through the math the first thing we need to do is get the ones that are square there’s those five on the top and the ones that are red and there’s those two on the right in terms of the ones that are both square and red yeah obviously there’s just this one red square at the top right but let’s do the numbers here we change our formula to be s and r for square and red we get the probability of square again that’s those five out of ten so we do five out of ten reduce this to 0.5 and then we need the probability of red given that it’s a square so we only need to look at the squares here there’s the squares five of them and one of them is red so that’s one over five that reduces to 0.2 you multiply those two numbers 0.5 times 0.2 and what you get is 0.1 or a 10 percent chance or 10 percent of our total sample space is red squares and you come back and you look at it you say yeah there’s one out of 10.
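as a quick check both rules can be sketched in a few lines of python here the shapes list is just an illustrative stand-in for the ten-shape sample space described above

```python
# sample space from the lecture: ten shapes, tagged (shape, color),
# five squares on top, five circles on the bottom, two red shapes,
# one of which is the red square (this list is illustrative)
shapes = (
    [("square", "red")]
    + [("square", "gray")] * 4
    + [("circle", "red")]
    + [("circle", "gray")] * 4
)

n = len(shapes)
p_square = sum(s == "square" for s, _ in shapes) / n  # 5/10 = 0.5
p_red = sum(c == "red" for _, c in shapes) / n        # 2/10 = 0.2
p_square_and_red = sum(
    s == "square" and c == "red" for s, c in shapes
) / n                                                 # 1/10 = 0.1

# addition rule: P(S or R) = P(S) + P(R) - P(S and R)
p_square_or_red = p_square + p_red - p_square_and_red

# multiplication rule: P(S and R) = P(S) * P(R | S)
p_red_given_square = p_square_and_red / p_square      # 1/5 = 0.2

print(round(p_square_or_red, 3), round(p_square * p_red_given_square, 3))
# -> 0.6 0.1
```

notice that the conditional probability of red given square is just the joint probability divided by the marginal which is the definition of conditional probability rearranged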
so that just confirms what we were able to do intuitively so that’s our short presentation on probabilities and in sum what do we get out of that number one probability it’s not always intuitive and also the idea that conditional probabilities can help in a lot of situations but they may not work the way you expect them to and really the arithmetic of probability can surprise people so pay attention when you’re working with it so you can get a more accurate conclusion in your own calculations welcome to statistics and data science i’m barton polson and what we’re going to be doing in this course is talking about some of the ways that you can use statistics to see the unseen to infer what’s there even when most of it’s hidden now this shouldn’t be a surprise if you remember the data science venn diagram that we talked about a while ago we have math up here in the top right corner but if you were to go to the original description of this venn diagram its full name was math and stats and let me just mention something in case it’s not completely obvious about why statistics matters to data science and the idea is this counting is easy it’s easy to say how many times a word appears in a document it’s easy to say how many people voted for a particular candidate in one part of the country counting is easy but summarizing and generalizing those things are hard and part of the problem is there’s no such thing as a definitive analysis all analyses really depend on the purposes that you’re dealing with so as an example let me give you a couple of pairs of words and try to summarize the difference between them in just two or three words i mean in a word or two how is a souffle different from a quiche or how is an aspen different from a pine tree or how is baseball different from cricket and how are musicals different from opera it really depends on who you’re talking to it depends on your goals and it depends on sort of
the shared knowledge and so there’s not a single definitive answer and then there’s the matter of generalization think again about music listen to three concerti by antonio vivaldi and do you think you can safely and accurately describe all of his music no i actually chose vivaldi on purpose because igor stravinsky said you could he said he didn’t write 500 concertos he wrote the same concerto 500 times but take something more real world like politics if you talk to 400 registered voters in the u.s can you then accurately predict the behavior of all of the voters there’s about 100 million voters in the u.s and that’s a matter of generalization and that’s the sort of thing that we try to take care of with inferential statistics now there are different methods that you can use in statistics and all of them are designed to give you sort of a map a description of the data you’re working with there are descriptive statistics there are inferential statistics there’s the inferential procedure hypothesis testing and there’s also estimation and i’ll talk about each of those in more depth there are a lot of choices that have to be made and some of the things i’m going to discuss in detail are for instance the choice of estimators that’s different from estimation different measures of fit feature selection for knowing which variables are the most important in predicting your outcome also common problems that arise when trying to model data and the principles of model validation but through this all the most important thing to remember is that analysis is functional it’s designed to serve a particular purpose and there’s a very wonderful quote within the statistics world that says all models are wrong all statistical descriptions of reality are wrong because they’re not exact depictions they’re summaries but some are useful and that’s from george box and so really the question is you’re not trying to be totally
completely accurate because in that case you just wouldn’t do an analysis the real question is are you better off doing your analysis than not doing it and truthfully i bet you are so in sum we can say three things number one you want to use statistics to both summarize your data and to generalize from one group to another if you can on the other hand there’s no one true answer with data you’ve got to be flexible in terms of what your goals are and the shared knowledge and no matter what you’re doing the utility of your analysis should guide you in your decisions the first thing we want to cover in statistics in data science is the principle of exploring data and this video is just designed to give an exploration overview so think of this like the intrepid explorers they’re out there exploring and seeing what’s in the world and you can see what’s in your data more specifically you want to see what your data set is like you want to see if your assumptions are met so you can do a valid analysis with your chosen procedure and really something that may seem very weird is you want to listen to your data if something’s not working out if it’s not going the way you want then you need to pay a little more attention and exploratory data analysis is going to help you do that now there are two general approaches to this first off there’s graphical exploration so you use graphs and pictures and visualizations to explore your data the reason you want to do this is that graphics are very dense in information they’re also really good in fact the best way to get the overall impression of your data second to that there’s numerical exploration i make it very clear this is the second step do the visualization first then do the numerical part now you want to do this because it can give greater precision and this is also an opportunity to try variations on the data you can actually do some transformations move things around a
little bit and try different methods and see how that affects the results and see how it looks so let’s go first to the graphical part there are very quick and simple plots that you can do those include things like bar charts and histograms and scatter plots very easy to make and a very quick way of getting an understanding of the variables in your data set in terms of numerical analysis again after the graphical methods you can do things like transform the data that is take say the logarithm of your numbers you can do empirical estimates of population parameters and you can use robust methods and i’ll talk about all of those at more length in later videos but for right now i can sum it up this way the purpose of exploration is to help you get to know your data and also you want to explore your data thoroughly before you start modeling it before you build statistical models and all the way through you want to make sure you listen carefully so that you can find hidden or unexpected details and leads in your data as we move on in our discussion of statistics and exploring data the single most important thing we can do is exploratory graphics in the words of the late great yankees catcher yogi berra you can see a lot by just looking and that applies to data as much as it applies to baseball now there are a few reasons you want to start with graphics number one is to actually get a feel for the data i mean what’s it distributed like what’s the shape are there strange things going on also it allows you to check the assumptions and see how well your data match the requirements of the analytical procedures you hope to use you can check for anomalies like outliers and unusual distributions and errors and also you can get suggestions if something unusual is happening in the data that might be a clue that you need to pursue a different angle or do a deeper analysis now we want to do graphics first for a couple of reasons number
one is they’re very information dense and fundamentally humans are visual it’s our single highest bandwidth way of getting information it’s also the best way to check for shape and gaps and outliers there are a few ways you can do this if you want to the first is with programs that rely on code so you can use the statistical programming language r the general purpose programming language python you can actually do a huge amount in javascript especially in d3.js or you can use apps that are specifically designed for exploratory analysis that includes tableau both the desktop and the public versions qlik and even excel is a good way to do this and then finally if you really want to you can do this by hand john tukey who’s the father of exploratory data analysis wrote his seminal book a wonderful book where it’s all hand-drawn graphs and actually it’s a wonderful way to do it but let’s start the process for doing these graphics we start with one variable that is univariate distributions and so you’re going to get something like this the fundamental chart is the bar chart this is when you’re dealing with categories and you’re simply counting how many cases there are in each category the nice thing about bar charts is they’re really easy to read put them in descending order and maybe have them vertical maybe have them horizontal horizontal can be nice to make the labels a little easier to read this is about psychological profiles of the united states this is real data and we have the most states in the friendly and conventional category a smaller number in temperamental and uninhibited and the least common in the united states is relaxed and creative next you can do a box plot or sometimes called a box and whiskers plot this is when you have a quantitative variable something that’s measured and you can say how far apart scores are a box plot shows quartile values it also shows outliers so for instance this is google
searches for modern dance and that’s utah at five standard deviations above the national average that’s where i’m from and i’m glad to see that there also it’s a nice way to show many variables side by side if they’re on approximately similar scales next if you have quantitative variables you’re going to want to do a histogram again quantitative so interval or ratio level or measured variables and these let you see the shape of a distribution and potentially compare many so here are three histograms for google searches on data science and entrepreneur and modern dance and you can see they’re for the most part normally distributed with a couple of outliers once you’ve done one variable or the univariate analysis you’re going to want to do two variables at a time that is bivariate distributions or joint distributions now one easy way to do this is with grouped plots so you can do grouped bar charts and box plots what i have right here is grouped box plots i have my three psychological regions of the united states and i’m showing how they rank on openness that’s a psychological characteristic and what you can see is that the relaxed and creative are highest and the friendly and conventional tend to be the lowest and that’s kind of how that works it’s also a good way of seeing the association between a categorical variable like psychological region of the united states and a quantitative outcome which is what we have here with openness next you can also do a scatter plot that’s where you have two quantitative variables and what you’re looking for here is is it a straight line that is is it linear do we have outliers and also the strength of association how closely do the dots all come to the regression line that we have here in the middle and this is an interesting one for me because we have openness across the bottom so more open as you go to the right and agreeableness up the side and what we see is there’s a strong
downhill association the states in the united states that are the most open apparently are also the least agreeable so we’re going to have to do something about that and then finally you want to go to many variables that is multivariate distributions now one big question here is 3d or not 3d let me actually make an argument for not 3d so what i have here is a 3d scatter plot of three variables about google searches up the left i have fifa which is for professional soccer down there on the bottom left i have searches for nfl and on the right i have searches for nba now i did this in r and what’s neat about this is you can click and drag and move it around and you know that’s kind of fun you kind of spin it around although it gets kind of nauseating as you look at it and this particular version i’m using plotly in r allows you to actually click on a point and see where it ranks on each of these characteristics you can see however this thing is hard to control and once it stops moving it’s not much fun and truthfully most 3d plots i’ve worked with are just kind of nightmares they seem like they’re a good idea but not really so here’s the deal 3d graphics like the one i just showed you because they’re actually being shown in 2d they have to be in motion for you to tell what’s going on at all and fundamentally they’re hard to read and confusing now it’s true they might be useful for finding clusters in three dimensions we didn’t see that in the data we had but generally i just avoid them like the plague what you want to do however is see the connection between several variables and for that you might want to use a matrix of plots this is where you have for instance many quantitative variables you can use markers for group membership if you want and i find it to be much clearer than 3d so here i have the relationship between four search
terms nba nfl mlb for major league baseball and fifa you can see the individual distributions you can see the scatter plots you can get the correlation truthfully this for me is a much easier kind of chart to read and get the richness that we need from a multi-dimensional display so the questions you’re trying to answer overall are number one do you have what you need do you have the variables you need do you have the variability that you need are there clumps or gaps in the distributions are there exceptional cases anomalies that are really far out from everybody else or spikes in the scores and of course are there errors in the data were there mistakes in coding did people forget to answer questions are there impossible combinations and these kinds of things are easiest to see with a visualization that’ll really just kind of put it right there in front of you and so in sum i can say this about graphical exploration of data it’s a critical first step this is basically where you always want to start and you want to use the quick and easy methods again bar charts and scatter plots are really easy to make and they’re very easy to understand and once you’re done with the graphical exploration then you can go to the second step which is exploring the data through numbers the next step in statistics and exploring data is exploratory statistics or numerical exploration of data i like to think of this as going in order first you do visualization then you do the numerical part and a couple of things to remember here number one is you’re still exploring the data you’re not modeling yet but you are doing a quantitative exploration this might be an opportunity to get empirical estimates that is of population parameters as opposed to theoretically based ones it’s a good time to manipulate the data and explore the effects of manipulating the data looking at subgroups and transforming variables also it’s an opportunity to
check the sensitivity of your results do you get the same general results if you test under different circumstances so we’re going to talk about things like robust statistics and resampling data and transforming data so we’ll start with robust statistics this by the way is hercules a robust mythical character and the idea with a robust statistic is that it is stable so that even when the data varies in sort of unpredictable ways you still get the same general impression this is a class of statistics it’s an entire category that’s less affected by outliers and by skewness and kurtosis and other abnormalities in the data so let’s take a quick look this is a very skewed distribution i created the median which is the dark line there in the box is right around one and i’m going to look at two different kinds of robust statistics the trimmed mean and the winsorized mean with the trimmed mean you take a certain percentage of the data from the top and the bottom you just throw it away and you compute the mean for the rest with the winsorized mean you take those scores and you move them in to the highest non-outlier score now the zero percent version is exactly the same as the regular mean and here it’s 1.24 but as we trim off five percent or move in five percent you can see that the mean shifts a little bit then ten percent it comes in a little bit more at 25 percent now we’re throwing away 50 percent of the data 25 percent on the top 25 percent on the bottom and we get a trimmed mean here of 1.03 and a winsorized mean of 1.07 when we trim 50 percent that actually means that we’re leaving just the median only the middle score is left and then we get 1.01 what’s interesting is how close we get to that even when we have 50 percent of the data left and so that’s an interesting example of how you can use robust statistics to explore data even when you have things like strong skewness next is the principle of resampling and that’s like
pulling marbles repeatedly out of a jar counting the colors putting them back in and trying again that gives an empirical estimate of sampling variability so sometimes you get 20 red marbles sometimes you get 30 sometimes you get 22 and so on there are several versions of this that go by the names of the jackknife and the bootstrap and the permutation test and the basic principle of resampling is also key to the process of cross-validation i’ll have more to say about validation later and then finally there’s transforming variables here are our caterpillars in the process of transforming into butterflies but the idea here is you take a sort of difficult data set and you apply what’s called a smooth function something with no jumps in it something that preserves the order and allows you to work on the full data set so you can fix skewed data and in a scatter plot where you might have a curved line you can fix that and probably the best way to look at this is with something called tukey’s ladder of powers i mentioned before john tukey the father of exploratory data analysis and he talked a lot about transformations this is his ladder starting at the bottom with minus 1 over x squared up to the top with x cubed and here’s how it works this distribution over here is a symmetrical normally distributed variable and as you start to move in one direction and you apply the transformation say you take the square root you see how it moves the distribution over to one end then the logarithm and when you get to the end you get this minus one over the square of the score and that pushes it way way way over if you go the other direction for instance you square the scores and it pushes it down in the other direction you cube it and then you see how it can move it around in ways that allow you to undo the skewness and get back to a more symmetric distribution and so these are some of the approaches that you can use in the
numerical exploration of data in sum let’s say this statistical or numerical exploration allows you to get multiple perspectives on your data it also allows you to check the stability see how it works with outliers and skewness and mixed distributions and so on and perhaps most importantly it sets the stage for the statistical modeling of your data as a final step of statistics and exploring data i’m going to talk about something that’s not usually considered exploring but is basic descriptive statistics i like to think of it this way you’ve got some data and you are trying to tell a story more specifically you’re trying to tell your data’s story and with descriptive statistics you can think of it as trying to use a little data to stand in for a lot of data using a few numbers that stand in for a large collection of numbers and this is consistent with the advice we get from good old henry david thoreau who told us simplify simplify if you can tell your story with fewer more carefully chosen and more informative data points go for it so there are a few different procedures for doing this number one you want to describe the center of your distribution of data that is if you’re going to pick a single number use that two if you can give a second number give something about the spread or the dispersion or the variability and three it’s also nice to be able to describe the shape of the distribution let me say more about each of these in turn first let’s talk about center we have the center of our rings here now there are a few very common measures of center or location or central tendency of a distribution there’s the mode and there’s the median and there’s the mean now there are many many others but those are the ones that are going to get you most of the way let’s talk about the mode first now i’m going to create a little data set here on a scale from 1 to 11 and i’m going to put individual scores there’s a one and another one
and another one and another one then we have a 2 and another 2 then we have a score way over at 9 and another score over at 11 so we have eight scores and this is the distribution this is actually a histogram of the data set the mode is the most commonly occurring score or the most frequent score well if you look at how tall each of these bars goes we’ve got more ones than anything else and so one is the mode because it occurs four times and nothing else comes close to that the median’s a little different the median is looking for the score that is at the center if you split the data into two equal groups we have eight scores so we want to get one group of four that’s down here and then the other group of four is this really big one because it ranges way out and the median is going to be the place on the number line that splits those into two groups that’s going to be right here at one and a half now the mean’s a little more complicated even though people understand means in general it’s the first one we have here that actually has a formula where m for the mean is equal to the sum of x that’s our scores on the variable divided by n the number of scores you can also write it out with greek notation if you want like this where that capital sigma is the summation sign so it’s the sum of x divided by n and with our little data set that works out to this 1 plus 1 plus 1 plus 1 plus 2 plus 2 plus 9 plus 11.
add those all up and divide by 8 because that’s how many scores there are well that reduces to 28 divided by 8 which is equal to 3.5 if you go back to our little chart here 3.5 is right over here and you’ll notice there aren’t any scores really exactly right there that’s because the mean tends to get very distorted by outliers it follows the extreme scores but a really nice analogy and i’d say it’s more than just a visual analogy is that if this number line were a seesaw then the mean is exactly where the balance point or the fulcrum would be for it to balance people understand that if somebody weighs more they’ve got to sit in closer to balance somebody who weighs less who has to sit further out and that’s how the mean works now let me give a little bit of the pros and cons of each of these for the mode the mode’s really easy to do you just count how common each score is on the other hand it may not be close to what appears to be the center of the data the median splits the data into two same size groups the same number of scores in each and that’s pretty easy to deal with but unfortunately it’s hard to use that information in many statistics after that and then finally the mean of these three is the least intuitive and it’s the most affected by outliers and skewness and that may really strike against it but it is however the most useful statistically and so it’s the one that gets used most often next there’s the issue of spread spread your tail feathers and we have a few measures here that are very common also there’s the range there are percentiles and the interquartile range and there’s the variance and the standard deviation i’ll talk about each of those first the range the range is simply the maximum score minus the minimum score and in our case that’s just 11 minus 1 which is equal to 10 so we have a range of 10 now i can show you that here on our chart it’s just that line there at the bottom from the 11 down to the 1.
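as a quick check the numbers we’ve computed so far for this little data set can be reproduced with python’s built-in statistics module this is just an illustrative sketch not part of the original course materials

```python
import statistics

# the eight scores from the little data set above
scores = [1, 1, 1, 1, 2, 2, 9, 11]

print(statistics.mode(scores))    # 1    the most frequent score
print(statistics.median(scores))  # 1.5  splits the eight scores into two groups of four
print(statistics.mean(scores))    # 3.5  that is 28 / 8
print(max(scores) - min(scores))  # 10   the range
```

note that quartile values like the 3.75 that a spreadsheet gives for this data set depend on the interpolation method so different tools can report slightly different interquartile ranges for the same scores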
that’s a range of 10. the interquartile range which is usually referred to simply as the iqr is the distance between q3 which is the third quartile score and q1 which is the first quartile score if you’re not familiar with quartiles those are the same as the 75th percentile score and the 25th percentile score really what you do is throw away some of the data so let’s go to our distribution here the first thing we’re going to do is throw away the two highest scores there they are they’re grayed out now and then we’re going to throw away two of the lowest scores they’re out there and then we’re going to get the range for the remaining ones now this is complicated by the fact that i’ve got this big gap in between 2 and 9 and different methods of calculating quartiles do something with that gap so if you use a spreadsheet it’s actually going to do an interpolation process and it’ll give you a value of 3.75 i believe for the third quartile and then down to one for the first quartile so not so intuitive with this graph but that is how it usually works if you want to write it out you can do it like this the interquartile range is equal to q3 minus q1 and in our particular case that’s 3.75 minus 1 and that of course is equal to just 2.75 and there you have it now our final measure of spread or variability or dispersion is two related measures the variance and the standard deviation these are a little harder to explain and a little harder to show but the variance which has at least the easiest formula is this the variance is equal to the sum that capital sigma is the summation sign of x minus m squared that is you take how far each individual score is from the mean you square that deviation you add up all the squared deviations and then you divide by the number of scores so the variance is the average squared deviation from the mean i’ll try to show you that graphically so here’s our data set and there’s our mean right there
at three and a half let’s go to one of these twos we’ve got a deviation there of one and a half and if we make a square that’s one and a half points on each side well there it is we can do a similar square for the other score at two and if we’re going down to one then it’s going to be two and a half squared and it’s going to be that much bigger and we can draw one of these squares for each of our eight points the squares for the scores at 9 and 11 are going to be huge and go off the page so i’m not going to show them but once you have all those squares you add up the area and divide by the number of scores and you get the variance so that is the formula for the variance but now let me show you the standard deviation which is also a very common measure and is closely related to this specifically it’s just the square root of the variance now there’s a catch here the formulas for the variance and the standard deviation are slightly different for populations and samples in that they use different denominators but they give similar answers not identical but similar and if the sample is reasonably large say over 30 or 50 then it’s going to be really just a negligible difference so let’s do a little pro and con of these three things first the range it’s very easy to do it only uses two numbers the high and the low but it’s determined entirely by those two numbers and if they’re outliers you’ve got really a bad situation the interquartile range or iqr is really good for skewed data and that’s because it ignores extremes on either end so that’s nice and the variance and the standard deviation while they are the least intuitive and they are the most affected by outliers they are also generally the most useful because they feed into so many other procedures that are used in data science finally let’s talk a little bit about the shape of a distribution you can have symmetrical or skewed distributions unimodal uniform or u-shaped you can have outliers there’s a lot of
variations let me show you a few of them first off is a symmetrical distribution pretty easy they’re the same on the left and on the right and this little pyramid shape is an example of a symmetrical distribution there are also skewed distributions where most of the scores are on one end and then they taper off this right here is a positively skewed distribution where most of the scores are at the low end and the outliers are on the high end this is unimodal it’s our same pyramid shape unimodal means it has one mode or really kind of one hump in the data that’s contrasted for instance with bimodal where you have two modes and that usually happens when you have two distributions that got mixed together there’s also uniform distributions where every response is equally common there’s u-shaped distributions where people tend to pile up at one end or the other with a big dip in the middle and so there’s a lot of different variations and you want to use the shape of the distribution to help you understand your data and to put the numerical summaries like the mean and the standard deviation into context in sum we can say this using descriptive statistics allows you to be concise with your data to tell the story and tell it succinctly you want to focus on things like the center of the data the spread of the data the shape of the data and above all watch out for anomalies because they can exert really undue influence on your interpretations but this will help you better understand your data and prepare you for the steps that follow the next step in our discussion of statistics and inference is hypothesis testing a very common procedure in some fields of research i like to think of it as put your money where your mouth is and test your theory here’s the wright brothers out testing their plane now the basic idea behind hypothesis testing is this you start with a question and it’s something like what
is the probability of x occurring by chance if randomness or meaningless sampling variation is the only explanation well the response is this if the probability of that data arising by chance when nothing’s happening is low then you reject randomness as a likely explanation okay there’s a few things i can say about this number one it’s really common in scientific research say for instance in the social sciences it’s used all the time number two this kind of approach can be really helpful in medical diagnostics where you’re trying to make a yes no decision does a person have a particular disease and three really anytime you’re trying to make a go no go decision which might be made for instance with a purchasing decision for a school district or implementing a particular law you base it on the data and you have to make a yes no decision hypothesis testing might be helpful in those situations now you have to have hypotheses to do hypothesis testing you start with h sub zero which is the shorthand version for the null hypothesis and what that is in lengthier terms is that there is no systematic effect between groups there’s no association between variables and random sampling error is the only explanation for any observed differences that you see and then contrast that with h sub a which is the alternative hypothesis and this really just says that there’s a systematic effect that there is in fact a correlation between variables that there is in fact a difference between two groups that this variable does in fact predict the other one let’s take a look at the simplest version of this statistically speaking now what i have here is a null distribution this is a bell curve it’s actually the standard normal distribution which shows z-scores and relative frequency and what you do with this is you mark off what are called regions of rejection and so i’ve actually shaded off the highest two
and a half percent of the distribution and the lowest two and a half percent what’s funny about this is even though i draw it to plus and minus three where it looks like it hits zero it’s actually infinite and asymptotic but that’s the highest and lowest two and a half percent collectively that leaves 95 percent in the middle now the idea is then you gather your data you calculate a score for your data and you see where it falls in this distribution and i like to think of that as you have to go down one path or the other you have to make a decision and you have to decide whether to retain your null hypothesis maybe it is random or reject it and decide no i don’t think it’s random the trick is things can go wrong you can get a false positive this is when the sample shows some kind of statistical effect but it’s really randomness and so for instance this scatter plot i have right here you can see a little downhill association here but this is in fact drawn from data that has a true correlation of zero and i just kind of randomly sampled from it until i got this it took about 20 rounds but it looks negative even though there’s really nothing happening the trick about false positives is that’s conditional on rejecting the null the only way you can get a false positive is if you actually conclude that there’s a positive result it goes by the highly descriptive name of a type 1 error but you get to pick a value for it and 0.05 or a 5 percent risk if you reject the null hypothesis is the most common value then there’s a false negative this is when the data looks random but in fact it is systematic or there’s a relationship so for instance this scatter plot it looks like it’s pretty much a zero relationship but in fact this came from two variables that were correlated at 0.25 that’s a pretty strong association again i randomly sampled from the data until i got a set that happened to look pretty flat and a false negative is conditional on not
rejecting the null you can only get a false negative if you get a negative you say there’s nothing there it’s also called a type 2 error and this is a value that you have to calculate based on several elements of your testing framework so it’s something to be thoughtful of now i do have to mention one thing but wait there are problems with hypothesis testing there’s a few number one it’s really easy to misinterpret it a lot of people say well if you get a statistically significant result it means that it’s something big and meaningful and that’s not true because it’s confounded with sample size and a lot of other things that just don’t really matter also a lot of people take exception to the assumption of a null effect or even a nil effect that there’s zero difference at all and in certain situations that could be an absurd claim so you’ve got to watch out for that there’s also bias from the use of a cut-off anytime you have a cut-off you’re going to have problems where cases that would have been just slightly higher or slightly lower would have switched the dichotomous outcome so that is a problem and then a lot of people say that it just answers the wrong question because what it’s telling you is what’s the probability of getting this data at random that’s not what most people care about they want it the other way around which is why i mentioned bayes theorem previously and i’ll say more about that later that being said hypothesis testing is still very deeply ingrained still very useful for a lot of questions and has gotten us really far in a lot of domains so in sum let me say this hypothesis testing is very common for yes no outcomes and it’s the default in many fields and i’d argue that it is still useful and informative despite many of the well-substantiated critiques we continue in statistics and inference by discussing estimation now as opposed to hypothesis testing estimation is
designed to actually give you a number give you a value not just a yes no or go no go but give you an estimate for a parameter that you’re trying to get i like to think of it as sort of a new angle looking at something from a different way and the most common approach to this is confidence intervals now the important thing to remember is this is still an inferential procedure you’re still using sample data and trying to make conclusions about a larger group or population the difference here is instead of coming up with a yes or no you instead focus on likely values for the population value most versions of estimation are closely related to hypothesis testing sometimes seen as the flip side of the coin and we’ll see how that works in later videos now i like to think of this as an ability to estimate any sample statistic and there’s a few different versions we have parametric versions of estimation and bootstrap versions so i’ve got the boots here and that’s where you just kind of randomly sample from the data in an effort to get an idea of the variability you can also have what are called central versus non-central confidence intervals in estimation but we’re not going to deal with those now there are three general steps to this first you need to choose a confidence level anywhere from say well you can’t have zero it has to be more than zero and it can’t be a hundred percent so choose something in between ninety-five percent is the most common and what it does is it gives you a range of numbers a high and a low and the higher your level of confidence the more confident you want to be the wider the range is going to be between your high and your low estimates now there’s a fundamental trade-off in what’s happening here and it’s the trade-off between accuracy which means you’re on target or more specifically that your interval contains the true population value and the idea is that leads you to the correct inference
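The confidence-interval mechanics just described can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not code from the course: the data values are invented, and the z-based formula assumes a reasonably large sample (a t critical value would be more precise for small ones).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(data, level=0.95):
    """Normal-approximation confidence interval for a mean.

    Uses a z critical value, which is reasonable for samples of
    roughly 30 or more observations.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. ~1.96 for 95%
    m = mean(data)
    se = stdev(data) / sqrt(len(data))         # standard error of the mean
    return m - z * se, m + z * se

# invented sample data, purely for illustration
sample = [5.2, 6.1, 6.8, 5.9, 7.0, 6.4, 5.7, 6.6, 6.2, 6.0,
          5.5, 6.9, 6.3, 5.8, 6.7, 6.1, 6.5, 5.6, 6.2, 6.4]
low, high = confidence_interval(sample, 0.95)
```

Raising the level from 0.95 to 0.99 widens the interval for the same data, which is exactly the confidence-versus-width relationship described above.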
there’s a trade-off between accuracy and what’s called precision in this context and precision means a narrow interval a small range of likely values and what’s important to emphasize is this is independent of accuracy you can have one without the other or neither or both in fact let me show you how this works what i have here is a little hypothetical situation i’ve got a variable that goes from maybe you know 10 to 90 and i’ve drawn a thick black line at 50. if you think of this in terms of percentages and political polls it makes a very big difference whether you’re on the left or the right of 50 and then i’ve drawn a dotted vertical line at 55 to say that that’s our theoretical true population value then what i have here is a distribution that shows possible values based on our sample data and what you get here is it’s not accurate because it’s centered on the wrong thing it’s actually centered on 45 as opposed to 55 and it’s not precise because it’s spread way out from maybe 10 up to almost 80.
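These combinations of accuracy (centered on the true value or not) and precision (tight or spread out) can be imitated with a quick simulation. This is only an illustrative sketch: the true value of 55 matches the figure, while the standard deviation of 15 and the bias of minus 10 are assumptions chosen for the sake of the example.

```python
import random
from statistics import mean, stdev

random.seed(42)  # reproducible illustration
TRUE_VALUE = 55  # the dotted-line population value in the figure

def sampling_distribution(bias=0.0, n=100, reps=500):
    """Simulate many sample means: a bias shifts the center of the
    distribution (hurting accuracy), while a small n widens its
    spread (hurting precision)."""
    return [mean(random.gauss(TRUE_VALUE + bias, 15) for _ in range(n))
            for _ in range(reps)]

centered_tight  = sampling_distribution(bias=0,   n=100)  # accurate and precise
biased_tight    = sampling_distribution(bias=-10, n=100)  # precise but centered on 45
centered_spread = sampling_distribution(bias=0,   n=4)    # accurate but spread out
```

Plotted as histograms, these three sets would reproduce the panels being described: the biased run sits tightly around the wrong value, and the small-n run straddles the true value but with a much wider spread.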
so in this situation the data is really no help at all now here’s another one this is accurate because it’s centered on the true value that’s nice but it’s still really spread out and you see that about 40 percent of the values are going to be on the other side of 50 so it might lead you to reach the wrong conclusion and that’s a problem now here’s the nightmare situation this is when you have a very very precise estimate but it’s not accurate it’s wrong and this leads you to a very false sense of security and understanding of what’s going on and you’re going to totally blow it all the time the ideal situation is this you have an accurate estimate where the distribution of sample values is really close to the true population value and it is precise it’s really tightly knit and you can see that about 95 percent of it is on the correct side of 50 and that’s good if you want to see all four of them here at once we have the precise two on the bottom the imprecise ones on the top the accurate ones on the right the inaccurate ones on the left and so that’s a way of comparing it but no matter what you do you have to interpret a confidence interval now the statistically accurate way that has very little interpretation is this you would say the 95 percent confidence interval for the mean is 5.8 to 7.2 okay so that’s just kind of taking the output from your computer and sticking it into sentence form the colloquial interpretation of this goes like this there’s a 95 percent chance that the population mean is between 5.8 and 7.2 well in most statistical procedures specifically frequentist as opposed to bayesian you can’t do that that implies that the population mean shifts and that’s not usually how people see it instead a better interpretation is this 95 percent of confidence intervals for randomly selected samples will contain the population mean now i can show you this really easily with a little demonstration this is where i randomly generated
data from a population with a mean of 55 and i got 20 different samples and i got the confidence interval for each sample and i’ve charted the high and the low and the question is did it include the true population value and you can see that of these 20 19 of them included it some of them barely made it if you look at sample number one on the far left it barely made it sample number eight it doesn’t look like it made it but it just barely did sample 20 on the far right barely made it on the other end only one of them missed it completely that’s sample number two which is shown in red on the left now it’s not always just one out of twenty i actually had to run the simulation about eight times because it gave me either zero or 3 or 1 or 2 and i had to run it until i got exactly what i was looking for here but this is what you would expect on average so let’s say a few things about this there are some things that affect the width of a confidence interval the first is the confidence level or cl higher confidence levels create wider intervals the more certain you have to be the bigger the range you’re going to give to cover your bases second the standard deviation where larger standard deviations create wider intervals if the thing that you’re studying is inherently really variable then of course your estimate of the range is going to be more variable as well and then finally there’s the n or the sample size this one goes the other way larger sample sizes create narrower intervals the more observations you have the more precise and the more reliable things tend to be i can show you each of these things graphically here we have a bunch of confidence intervals where i’m simply changing the confidence level from 0.50 at the left side up through 0.999 and you can see the interval gets much bigger as we increase it the next one is standard deviations as the sample standard deviation increases from 1 to 16 you can see that the interval gets a lot bigger and then we have
sample size going from just 2 up to 512 i’m doubling it at each point and you can see how the interval gets more and more and more precise as we go through and so let’s say this to sum up our discussion of estimation confidence intervals which are the most common version of estimation focus on the population parameter and the variation in the data is explicitly included in that estimation also you can argue that they’re more informative because not only do they tell you what values are likely for the population but they give you a sense of the variability of the data itself and that’s one reason that people argue that confidence intervals should nearly always be included in any statistical analysis as we continue our discussion on statistics and data science we need to talk about some of the choices that you have to make some of the trade-offs and some of the effects that these things have we’ll begin by talking about estimators and that is different methods for estimating parameters i like to think of this as what kind of measuring stick or standard are you going to be using now we’ll begin with the most common this is actually called ols which is short for ordinary least squares this is a very common approach it’s used in a lot of statistics and it’s based on what’s called the sum of squared errors and it’s characterized by an acronym called blue which stands for best linear unbiased estimator let me show you how that works let’s take a scatter plot here of an association between two variables this is actually the speed of a car and the distance to stop from about the 1920s i think we have a scatter plot here and we can draw a straight regression line through it now the line that i’ve used is in fact the best linear unbiased estimate but the way that you can tell that is by getting what are called the residuals if you take each data point and draw a perfectly vertical line
because the regression line predicts what the value would be for that value on the x-axis those are the residuals each of those individual vertical lines is a residual you square those and you add them up and this regression line the gray angled line here will have the smallest sum of squared residuals of any possible straight line that you can run through it now another approach is ml which stands for maximum likelihood and this is when you choose parameters that make the observed data most likely it sounds kind of weird but i can demonstrate it and it’s based on a kind of local search which doesn’t always find the best i like to think of it like a person here with binoculars looking around them trying hard to find something but who could theoretically miss something let me give a very simple example of how this works let’s assume that we’re trying to find parameters that maximize the likelihood of this dotted vertical line here at 55 and i’ve got three possibilities i’ve got my red distribution which is off to the left the blue which is a little more centered and the green which is far to the right and these are all identical except they have different means and by changing the means you see that the one that is highest where the dotted line is is the blue one and so if the only thing we’re doing is changing the mean and we’re looking at these three distributions then the blue one is the one that has the maximum likelihood for this particular parameter on the other hand we could give them all the same mean right around 50 and vary their standard deviations instead so they spread out different amounts in this case the red distribution is highest at the dotted vertical line and so it has the maximum value or if you want to you can vary both the mean and the standard deviation simultaneously and here the green gets a slight advantage now this is really a caricature of the process because obviously you would just
want to center it right there on the 55 and be done with it the question is when you have many variables in your data set then it’s a very complex process of choosing values that can maximize the association between all of them but you get a feel for how it works with this the third approach that’s pretty common is something called map which stands for maximum a posteriori this is a bayesian approach to parameter estimation and what it does is it adds the prior distribution and then it goes through sort of an anchoring and adjusting process what happens by the way is that stronger prior estimates exert more influence on the estimate that might mean for instance a larger sample or more extreme values and those have a greater influence on the posterior estimate of the parameters now what’s interesting is that these three methods all connect with each other let me show you exactly how they connect ordinary least squares or ols is equivalent to maximum likelihood when it has normally distributed error terms and maximum likelihood or ml is equivalent to maximum a posteriori or map with a uniform prior distribution or if you want to put it another way ordinary least squares or ols is a special case of maximum likelihood and then maximum likelihood or ml is a special case of maximum a posteriori and just in case you like it we can put it in set notation ols is a subset of ml which is a subset of map and so there are connections between these three methods of estimating population parameters let me just sum it up briefly this way the standards that you use ols ml map affect your choices and the ways that you determine what parameters best estimate what’s happening in your data several methods exist and there’s obviously more than what i showed you right here but many are closely related and under certain circumstances they’re all identical and so it comes down to exactly what are your purposes and what do you think
is going to best work with the data that you have to give you the insight that you need in your own project the next step we want to consider in statistics and data science and the choices that we have to make has to do with measures of fit or the correspondence between the data that you have and the model that you create now it turns out there’s a lot of different ways to measure this and one big question is how close is close enough or how can you see the difference between the model and reality well there’s a few really common approaches to this the first one is what’s called r squared it’s got a longer name that’s the coefficient of determination there’s a variation adjusted r squared which takes into consideration the number of variables then there’s -2 ll which is based on the likelihood ratio and a couple of variations the akaike information criterion or aic and the bayesian information criterion or bic and then there’s also chi squared now that’s actually a greek letter there it looks like an x but it’s chi and that’s chi squared and so let’s talk about each of these in turn first off is r squared this is the squared multiple correlation or the coefficient of determination and what it does is it compares the variance of y so if you have an outcome variable it looks at the total variance of that and compares it to the residuals on y after you’ve made your prediction the scores on r squared range from 0 to 1 and higher is better the next is minus 2 log likelihood that’s the likelihood ratio or as i just said the minus 2 log likelihood and what this does is compare the fit of nested models we have a subset then a larger set then the larger set overall this approach is used a lot in logistic regression when you have a binary outcome and in general smaller values are considered better fit now as i mentioned there are some variations of this i like to think of them as variations of chocolate for the -2 log likelihood
there’s the akaike information criterion the aic and the bayesian information criterion bic and what both of these do is adjust for the number of predictors because obviously if you have a huge number of predictors you’re going to get a really good fit but you’re probably going to have what’s called overfitting where your model is tailored too specifically to the data you currently have and doesn’t generalize well these both attempt to reduce the effect of overfitting and then there’s chi squared again it’s actually a lowercase greek letter chi that looks like an x and chi-squared is used for examining the deviations between two data sets specifically between the observed data set and the expected values from a model you created where we expect this many frequencies in each category now i’ll just mention like going to the store there’s a lot of other choices but these are some of the most common standards particularly the r squared and i just want to say in sum there are many different ways to assess the fit the correspondence between a model and your data and the choices affect the model you know especially are you going to penalize for throwing in too many variables relative to your number of cases are you dealing with a quantitative or a binary outcome those things all matter and so the most important thing as always my standing advice is keep your goals in mind and choose a method that seems to fit best with your analytical strategy and the insight you’re trying to get from your data now statistics in data science offers a lot of different choices and one of the most important is going to be feature selection or the choice of variables to include in your model it’s sort of like confronting this enormous range of information and trying to choose what matters most trying to get the needle out of the haystack the goal of feature selection is to select the best features or variables and get rid of uninformative and noisy variables
and simplify the statistical model that you’re creating because that helps avoid overfitting or getting a model that works too well with the current data and works less well with other data the major problem here is multicollinearity a very long word that has to do with the relationships between the predictors in the model i can show it to you graphically here imagine for instance we’ve got a big circle here to represent the variability in our outcome variable we’re trying to predict it and we’ve got a few predictors so we’ve got predictor number one over here and you see it’s got a lot of overlap with the outcome that’s nice then we’ve got predictor number two here it also has some overlap with the outcome but it also overlaps with predictor one and then finally down here we’ve got predictor three which overlaps with both of them and the problem arises from the overlap between the predictors themselves not their overlap with the outcome variable now there’s a few ways of dealing with this some of these are pretty common so for instance there’s the practice of looking at probability values in regression equations there are standardized coefficients and there are variations on sequential regression there are also newer procedures for dealing with the disentanglement of the association between the predictors there’s something called commonality analysis there’s dominance analysis and there are relative importance weights of course there are many other choices beyond both the common and the newer but these are a few that are worth taking a special look at first is p values or probability values this is the simplest method because most statistical packages will calculate probability values for each predictor and they’ll put little asterisks next to it and so what you’re doing is you’re looking at the p-values the probabilities for each predictor or more often the asterisks next to them which sometimes gives this the name of star search just kind of cruising through a large output
of data and just looking for the stars or asterisks this is fundamentally a problematic approach for a lot of reasons the problem here is you’re looking at each predictor individually and that inflates false positives say you have 20 variables and each is entered and tested with an alpha or false positive risk of 5 percent you end up with nearly a 65 percent chance of at least one false positive in there it’s also distorted by sample size because with a large enough sample anything becomes statistically significant and so relying on p values can be a seriously problematic approach a slightly better approach is to use betas or standardized regression coefficients and this is where you put all the variables on the same scale usually standardized to a mean of zero and then either to a range of minus one to plus one or with a standard deviation of one the trick is though they’re still in the same context as each other and you can’t really separate them because those coefficients are only valid when you take that group of predictors as a whole so one way to try to get around that is to do what they call stepwise procedures where you look at the variables in sequence there are several versions of sequential regression that allow you to do that you can put the variables into groups or blocks and enter them in blocks and look at how the equation changes overall you can examine the change in fit at each step the problem with a stepwise procedure like this is it dramatically increases the risk of overfitting which again is a bad thing if you want to generalize your data and so to deal with this there’s a whole collection of newer methods a few of them include commonality analysis which provides separate estimates for the unique and shared contributions of each variable well that’s a neat statistical trick but the problem is it just moves the problem of disentanglement to the analyst so you’re really not better off than you were as far as i can tell there’s dominance
analysis which compares every possible subset of predictors again it sounds really good but you have the problem known as the combinatorial explosion if you have 50 variables that you could use and there are some data sets that have millions of variables but with just 50 variables you have over one quadrillion possible combinations you’re not going to finish that in your lifetime and it’s also really hard to get things like standard errors and perform inferential statistics with this kind of model then there’s also something that’s even more recent than these others called relative importance weights and what this does is it creates a set of predictors that are orthogonal or uncorrelated with each other basing them off of the originals and then it uses those to predict the outcome and it can predict the outcome without the multicollinearity because these new predictors are uncorrelated it then rescales the coefficients back to the original variables that’s the back transform and then from that it assigns relative importance or a percentage of explanatory power to each predictor variable now despite this very different approach it tends to have results that resemble dominance analysis it’s actually really easy to do there’s a website where you just plug in your information and it does it for you and so that’s yet another way of dealing with the problem of multicollinearity and trying to disentangle the contribution of different variables in sum let’s say this what you’re trying to do here is choose the most useful variables to include in your model make it simpler be parsimonious and also reduce the noise and distractions in your data and in doing so you’re always going to have to confront the ever present problem of multicollinearity or the association between the predictors in your model with several different ways of dealing with that as we continue our discussion of statistics and the choices that are made one important consideration is model
validation and the idea here is as you’re doing your analysis are you on target more specifically the model that you create through regression or whatever you do may fit the sample data beautifully you’ve optimized it there but will it work well with other data fundamentally this is the question of generalizability also sometimes called scalability because you’re trying to apply it in other situations and you don’t want to get too specific or it won’t work in those other situations now there are a few general ways of dealing with this and trying to get some sort of generalizability number one is bayes or a bayesian approach then there’s replication then there’s something called holdout validation and then there’s cross-validation i’ll discuss each of these very briefly in conceptual terms the first one is bayes and the idea here is you want to get what are called posterior probabilities most analyses give you a probability value for the data given the hypothesis so you have to start with an assumption about the hypothesis but instead it’s possible to flip that around by combining with special kinds of data to get the probability of the hypothesis given the data and that is the purpose of bayes theorem which i’ve talked about elsewhere another way of finding out how well things are going to work is through replication that is do the study again it’s considered the gold standard in many different fields the question is whether you need an exact replication or a conceptual one that is similar in certain respects you can argue it both ways but one thing you want to do when you do a replication is actually combine the results and what’s interesting is the first study can serve as the bayesian prior probability for the second study so you can actually use meta-analysis or bayesian methods for combining the data from the two of them then there’s holdout validation this is where you build your
statistical  model on one part of the data and you  test it on another i like to think of it  as the eggs in separate baskets the  trick is that you need a large sample in  order to have enough to do these two  steps separately  on the other hand it’s also used very  often in data science competitions as a  way of having a sort of gold standard  for assessing the validity of a model  finally i’ll just mention one more  that’s cross validation this is when you  use the same data for both training and  for testing or validating there are  several different versions of it and the  idea is you’re not using all the data at  once but you’re kind of cycling through  and weaving the results together  there’s leave one out where you leave  out one case at a time also called lu  llo there’s leave p out where you leave  out a certain number at each point  there’s k-fold where you split the data  into say for instance 10 groups and you  leave out one and you develop it on the  other nine and then you cycle through  and there’s repeated random sub-sampling  where you use a random process at each  point any of those can be used to  develop the model on one part of the  data and test it on another and then  cycle through to see how well it holds  up under different circumstances and so  in sum i can say this about validation  you want to make your analysis count by  testing how well your model holds up  from the data you developed it on to  other situations because that’s really  what you’re trying to accomplish this  allows you to check the validity of your  analysis and your reasoning and it  allows you to build confidence  in the utility of your results  to finish up our discussion of  statistics and data science and the  choices that are involved i want to  mention something that really isn’t a  choice but more an attitude hence diy  for do it yourself  the idea here is you know really  you just need to get started remember  data is democratic it’s there for  everyone 
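As a brief aside before moving on: the k-fold cross-validation cycle described above can be sketched in a few lines of plain Python. This is a toy illustration, not from the course; the data values and the mean-only "model" are invented just to show the develop-on-some-folds, test-on-the-rest rotation.

```python
import random

def k_fold_splits(n, k, seed=1):
    """Shuffle indices 0..n-1 and split them into k folds.

    Yields (train, test) index lists: each fold takes one turn as the
    test set while the other k-1 folds together form the training set.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# Toy "model": predict every held-out value with the training-set mean,
# then average the squared errors across all k folds.
data = [3.1, 4.7, 5.0, 6.2, 7.4, 8.1, 9.9, 10.3, 11.8, 12.5]
squared_errors = []
for train, test in k_fold_splits(len(data), k=5):
    mean = sum(data[i] for i in train) / len(train)
    squared_errors += [(data[i] - mean) ** 2 for i in test]
cv_mse = sum(squared_errors) / len(squared_errors)
print(f"5-fold cross-validated MSE: {cv_mse:.2f}")
```

Setting k equal to the number of cases turns this into leave-one-out (LOO), and re-running with different seeds gives you something like the repeated random sub-sampling idea.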
Everybody has data; everybody works with data, either explicitly or implicitly. So data is democratic, and so is data science, and really my overall message is: you can do it.

A lot of people think you have to be at this totally cutting-edge, virtual-reality sort of thing, and it's true there's a lot of active development going on in data science; there's always new stuff. The trick, however, is that the software you can use to implement those things often lags. New methods show up first in programs like R and Python, but it can be years before they show up in a point-and-click program. What's funny, though, is that these cutting-edge developments often don't make much of a difference in the results or the interpretation. They may in certain edge cases, but usually it's not a huge difference. So I'm just going to say: analyst beware. You don't necessarily have to use the newest methods, and it's pretty easy to use them wrong, so you don't have to wait for the cutting edge.

That being said, I do want you to pay attention to what you're doing. A couple of things I've said repeatedly: know your goal. Why are you doing this study, why are you analyzing the data, and what are you hoping to get out of it? Try to match your methods to your goal; be goal-directed. Focus on usability: will you get something out of this that people can actually do something with? And, as I've mentioned several times with the Bayesian approach, don't get confused with probabilities; remember that priors and posteriors are different things, just so you can interpret things accurately.

Now I want to mention something that's really important to me personally, and that is: beware the trolls. You will encounter critics, people who are very vocal, who can be harsh and grumpy and really just intimidating, and they can make you feel like you shouldn't do stuff because you're going to do it wrong. But the important thing to remember is that the critics can be wrong. Yes, you'll make mistakes; everybody does. I can't tell you how many times I have to write my code more than once to get it to do what I want it to do. But in analysis, nothing is completely wasted if you pay close attention. I've mentioned this before: everything signifies, or in other words, everything has meaning. The trick is that the meaning might not be what you expected it to be, so you're going to have to listen carefully. And I just want to re-emphasize: all data has value, so make sure you're listening carefully.

In sum, let's say this: no analysis is perfect. The real question is not whether your analysis is perfect but whether you can add value, and I'm sure that you can. Fundamentally, data is democratic, so I'm going to finish with one more picture here, and that is: just jump right in and get started. You'll be glad you did.

To wrap up our course on statistics in data science, I want to give you a short conclusion and some next steps. Mostly I want to pass along a little piece of advice I learned from a professional saxophonist, Kirk Whalum, who says there's always something to work on, always something you can do to try things differently and get better. It works when practicing music, and it also works when you're dealing with data.

There are additional courses here at datalab.cc that you might want to look at. There are conceptual courses, additional high-level overviews on things like machine learning, data visualization, and other topics, and I encourage you to take a look at those as well to round out your general understanding of the field. There are also, however, many practical courses: hands-on tutorials on the statistical procedures I've covered, where you learn how to do them in R and Python and SPSS and other programs.

But whatever you're doing, keep this other little piece of advice from writers in mind: write what you know. I'm going to say it this way: explore, analyze, and delve into what you know. Remember when we talked about data science and the Venn diagram? We've talked about the coding and the stats, but don't forget the part on the bottom: domain expertise. It's just as important to good data science as the ability to work with computer code and the ability to work with the numbers and quantitative skills.

But through all of it, remember this: you don't have to know everything, and your work doesn't have to be perfect. The most important thing is to just get started. You'll be glad you did. Thanks for joining me, and good luck.

