Psychology Textbook Unit 3 Descriptive Statistics for Psychological Research

Unit 3. Descriptive Statistics for Psychological Research

TOBY AND CASTRO

Summary. This unit briefly reviews the distinction between descriptive and inferential statistics and then discusses the ways in which both numerical and categorical data are usually summarized for psychological research. Different measures of center and spread, and when to use them, are explained. The shape of the data is also discussed.

Prerequisite Units

Introduction to Statistics for Psychological Science
Managing Data

Introduction

Assume that you are interested in some attribute or characteristic of a very large number of people, such as the average hours of sleep per night for all undergraduates at all universities. Clearly, you are not going to do this by measuring the hours of sleep for every student, as that would be difficult to impossible. So, instead, you will probably take a relatively small sample of students (e.g., 100 people), ask each of them how many hours of sleep they usually get, and then use these data to estimate the average for all undergraduates.

The process outlined above can be thought of as having three phases or steps: (1) collect a sample, (2) summarize the data in the sample, and (3) use the summarized data to make the estimate of the entire population. The issues related to collecting the sample, such as how one ensures that the sample is representative of the entire population, will not be discussed here. Likewise, the way that one uses the summary of a sample to calculate an estimate of the population will not be explained here. This unit will focus on the second step: the way in which psychologists summarize data.

The general label for procedures that summarize data is descriptive statistics. This can be contrasted with procedures that make estimates of population values, which are known as inferential statistics. Thus, descriptive and inferential statistics each give different insights into the nature of the data gathered. Descriptive statistics describe the data so that the big picture can be seen. How?

By organizing and summarizing properties of a data set. Calculating descriptive statistics takes unordered observations and logically organizes them in some way. This allows us to describe the data obtained, but it does not make conclusions beyond the sample. This is important, because part of conducting (good) research is being able to communicate your findings to other people, and descriptive statistics will allow you to do this quickly, clearly, and precisely.

To prepare you for what follows, please note two things in advance. First, there are several different ways that we can summarize a large set of data. Most of all, we can use numbers or we can use graphical representations. Furthermore, when the data are numerical, we will have options for several of the summary values that we need to calculate. This may seem confusing at first; hopefully, it soon will make sense. Second, but related to the first, the available options for summarizing data often depend on the type of data that we have collected. For example, numerical data, such as hours of sleep per night, are summarized differently from categorical data, such as favorite flavors of ice cream.

The key to preventing this from becoming confusing is to keep the function of descriptive statistics in mind: we are trying to summarize a large amount of data in a way that can be communicated quickly, clearly, and precisely. In some cases, a few numbers will do the trick; in other cases, you will need to create a plot of the data.

This unit will only discuss the ways in which a single set of values is summarized. When you collect more than one piece of information from every participant in the sample (e.g., you not only ask them how many hours of sleep they usually get, but also ask them for their favorite flavor of ice cream), then you can do three things using descriptive statistics: summarize the first set of values (on their own), summarize the second set of values (on their own), and summarize the relationship between the two sets of values. This unit only covers the first two of these three. Different ways to summarize the relationship between two sets of values will be covered in later units.

Summarizing Numerical Data

The way to summarize a set of numerical data (e.g., hours of sleep per night) is in terms of two or three aspects. One always includes values for the center of the data and the spread of the data; in some cases, the shape of the data is also described.

A measure of center is a single value that attempts to describe an entire set of data by identifying the central position within that set. The full, formal label for this descriptive statistic is measure of central tendency, but most people simply say "center." Another label for this is the "average." A measure of spread is also a single number, but this one indicates how widely the data are distributed around their center. Another way of saying this is to talk about the "variability" of the data. If all of the individual pieces of data are close to the center, then the value of spread will be low; if the data are widely distributed, then the value of spread will be high.

What makes this a little bit complicated is that there are multiple ways to mathematically define the center and spread of a set of data. For example, both the mean and the median (discussed in detail below) are valid measures of central tendency. Similarly, both the variance (or standard deviation) and the inter-quartile range (also discussed below) are valid measures of spread. This might suggest that there are at least four combinations of center and spread (i.e., two versions of center crossed with two versions of spread), but that is not true. The standard measures of center and spread actually come in pairs, such that your choice with regard to one forces you to use a particular option for the other. If you define the center as the mean, for example, then you have to use variance (or standard deviation) for spread; if you define the center as the median, then you have to use the inter-quartile range for spread. Because of this dependency, in what follows we shall discuss the standard measures of center and spread in pairs. When this is finished, we shall mention some of the less popular alternatives and then, finally, turn to the issue of shape.

Measures of Spread Based on Moments

The mean and variance of a set of numerical values are (technically) the first and second moments of the set of data. Although it is not used very often in psychology, the term "moment" is quite popular in physics, where the first moment is the center of mass and the second moment is rotational inertia (these are very useful concepts when describing how hard it is to throw or spin something). The fact that the mean and variance of a set of numbers are the first and second moments is not all that important; the key is that they are based on the same approach to the data, which is why they are one of the standard pairs of measures for describing a set of numerical data.

Mean

The mean is the most popular and well-known measure of central tendency. It is what most people intend when they use the word "average." The mean can be calculated for any set of numerical data, discrete or continuous, regardless of units or details. The mean is equal to the sum of all values divided by the number of values. So, if we have $N$ values in a data set and they have values $X_1, X_2, \ldots, X_N$, the mean is calculated using the following formula:

$$\text{Mean} = \frac{\sum X_i}{N}$$

where $\sum X_i$ is the technical way of writing "add up all of the $X$ values" (i.e., the Greek letter sigma, $\Sigma$, tells you to calculate the sum of what follows) and $N$ is the number of pieces of data (which is usually referred to as sample size).
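As a minimal sketch of the formula above, the following Python snippet adds up the values and divides by their number. The sleep-hours values and the function name `mean_by_formula` are hypothetical and used only for illustration; they do not come from this unit.

```python
from statistics import mean

def mean_by_formula(values):
    """Sum of all values divided by the number of values (N)."""
    total = sum(values)   # the "sigma" step: add up every value
    n = len(values)       # N, the sample size
    return total / n

# Hypothetical hours-of-sleep sample (illustrative only).
sleep_hours = [7, 6, 8, 5, 9]

print(mean_by_formula(sleep_hours))  # 7.0
print(mean(sleep_hours))             # the standard library gives the same value
```

The standard-library `statistics.mean` is shown only as a cross-check on the hand-written formula.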

The notation for writing the mean is $\bar{X}$ (i.e., you put a bar over the symbol, $X$ in this case; it is pronounced "ex bar"). As a simple example, for a short list of values of $X$, you first add them up to get $\sum X_i$ and then divide by the number of values to get $\bar{X}$ (rounding to two decimal places if needed).

Before moving forward, note two things about using the mean as the measure of center. First, the mean is rarely one of the actual values from the original set of data. As an extreme example, when the data are discrete (e.g., whole numbers, like the number of siblings), the mean will almost never match any of the specific values, because the mean will almost never be a whole number as well. Second, an important property of the mean is that it includes and depends on every value in your set of data. If any value in the data set is changed, then the mean will change. In other words, the mean is sensitive to all of the data.

Variance and Standard Deviation

When the center is defined as the mean, the measure of spread to use is the variance (or the square root of this value, which is the standard deviation). Variance is defined as the average of the squared deviations from the mean. The formula for variance is

$$\text{Variance of } X = \frac{\sum (X_i - \bar{X})^2}{N - 1}$$

where $\bar{X}$ is the mean (see above). In words, you take each piece of data, subtract the mean, and square the result; do this for each of the $N$ pieces of data and add up the results, then divide by one less than the number of pieces of data. More technically, to determine the variance of a set of scores, you have to (1) find the mean of the scores, (2) compute the deviation scores (the difference between each individual score and the mean), (3) square each of the deviation scores, (4) add up all of the squared deviation scores, and (5) divide by one less than the number of scores. Thus, for example, the variance of the values used in the mean example above is found by summing their squared deviations from their mean and then dividing by one less than the number of values.

Note that, because each term of the summation involves a value that has been squared, the value of variance cannot be a negative number. Note, also, that when all of the individual pieces of data are the same, they will all be equal to the mean, so you will be adding up numbers that are all zero, so variance will also be zero. These both make sense, because here we are calculating a measure of how spread out the data are, which will be zero when all of the data are the same and cannot be less than this.

Technically, the value being calculated here is the sample variance, which is different from something known as the population variance. The former is used when we have taken a sample; the latter is used when we have measured every possible value in the entire population. Since we never measure every possible value when doing psychological research, we do not need the formula for population variance and can simply refer to the sample variance as variance.

As mentioned above, some people prefer to express this measure of spread in terms of the square root of the variance, which is the standard deviation. The main reason for doing this is that the units of variance are the square of the units of the original data, whereas the units of standard deviation are the same as the units of the original data. Thus, for example, if you have response times measured in seconds, then the variance is in seconds squared (which is difficult to conceptualize), but the standard deviation is in seconds (which is easy to think about). Conceptually, you can think of the standard deviation as the typical distance of any score from the mean. In other words, the standard deviation represents the standard amount by which individual scores deviate from the mean. The standard deviation uses the mean of the data as a baseline or reference point, and measures variability by considering the distance between each score and the mean.

Note that, similar to the mean, both the variance and the standard deviation are sensitive to every value in the set of data; if any one piece of data is changed, then not only will the mean change, but the variance and standard deviation will also be changed.

Practice

Let's now calculate the mean and the standard deviation of the two variables in the following table, which contains the number of study hours before an exam (Hours) and the grade obtained in that exam (Grade) for 15 participants.

To calculate the mean of Hours, we sum all of the values for Hours and divide by the total number of values, 15. To calculate the mean of Grade, we sum all of the values for Grade and divide by the total number of values. Doing so, we obtain the mean for Hours and the mean for Grade. Table 3.1 shows each of the scores, and the deviation scores for each Hours and Grade score. The deviation scores, as explained above, are calculated by subtracting the mean from each of the individual scores.

Table 3.1. Number of study hours before an exam (Hours) and the grade obtained in that exam (Grade) for 15 participants. The two rightmost columns show the deviation scores for each Hours and Grade value.

[Table body only partly recoverable. Surviving Hours values: 11, 16, 14, 12, 15, 18, 20, 10, 16, 17, 12. Surviving Grade values: 78, 80, 89, 85, 84, 86, 95, 96, 83, 81, 93, 92, 84, 83.]

Once we have the deviation scores for each participant, we square each of the deviation scores and sum them. For Hours, we then divide that sum by one less than the number of scores (15 - 1 = 14 in this case); the result is the variance for the number of hours in our sample of participants. In order to obtain the standard deviation, we calculate the square root of the variance.

We follow the same steps to calculate the standard deviation of our participants' grades. First, we square each of the deviation scores (rightmost column in Table 3.1) and sum them. Next, we divide that sum by one less than the number of scores. So, 30.55 is the variance for the grade in our sample of participants.
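The steps used in this practice (deviation scores, squared deviations, their sum, division by one less than the number of scores, and finally the square root) can be sketched in Python as follows. The five study-hours values below are hypothetical and are not the values from the practice table, which did not survive intact; the function name `sample_variance` is likewise only illustrative.

```python
import math
from statistics import stdev, variance

def sample_variance(scores):
    """Sample variance, following the five steps described in the text."""
    n = len(scores)
    mean = sum(scores) / n                    # 1. find the mean
    deviations = [x - mean for x in scores]   # 2. deviation scores
    squared = [d ** 2 for d in deviations]    # 3. square each deviation
    total = sum(squared)                      # 4. add up the squared deviations
    return total / (n - 1)                    # 5. divide by N - 1

# Hypothetical study-hours values (NOT the data from the practice table).
hours = [10, 12, 14, 16, 18]

var_hours = sample_variance(hours)   # 10.0
sd_hours = math.sqrt(var_hours)      # square root of the variance

print(var_hours, round(sd_hours, 2))
```

The standard-library functions `statistics.variance` and `statistics.stdev` also divide by N - 1, so they should agree with the hand-written version.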

In order to obtain the standard deviation, we calculate the square root of the variance. Thus, you can summarize the data in our sample by reporting the mean hours of study time together with its standard deviation, and the mean grade together with its standard deviation.

Measures of Spread Based on Percentiles

The second pair of measures for center and spread are based on percentile ranks and percentile values, instead of moments. In general, the percentile rank for a given value is the percent of the data that is smaller (i.e., lower in value). As a simple example, if the data consist of three different values, then the percentile rank for the largest of them is 67, because two of the three values are smaller than it. Percentile ranks are usually easy to calculate. In contrast, a percentile value (which is kind of the opposite of a percentile rank) is much more complicated. For example, with the same three values, the percentile value for 67 is something between the middle value and the largest value, because any value in that interval would be larger than 67% of the data; a specific convention is needed to pick one value from that interval. Fortunately, we won't need to worry about the details when calculating the standard measures of center and spread when using the percentile-based method.
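The definition of a percentile rank given above (the percent of the data that is smaller than a given value) can be sketched directly. The three data values and the function name `percentile_rank` are hypothetical, chosen only so that the result matches the 67 mentioned in the text.

```python
def percentile_rank(data, value):
    """Percent of the data that is strictly smaller than `value`."""
    smaller = sum(1 for x in data if x < value)
    return 100 * smaller / len(data)

# Hypothetical data set with three different values.
scores = [3, 8, 12]

print(round(percentile_rank(scores, 12)))  # 67: two of the three values are smaller
print(round(percentile_rank(scores, 3)))   # 0: no value is smaller than the lowest
```

Going in the other direction (from a percent to a percentile value) needs an interpolation rule, which is why the text calls it more complicated.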

Median

The median is how the percentile-based method defines center; it is best thought of as the middle score when the data have been arranged in order of magnitude. To see how this can be done by hand, assume that we start with the data below:

65 54 79 57 35 14 56 55 77 45 92

We first rearrange these data from smallest to largest:

14 35 45 54 55 56 57 65 77 79 92

The median is the middle of this new set of scores; in this case, the value is 56. This is the middle value because there are 5 scores lower than it and 5 scores higher than it. Finding the median is very easy when you have an odd number of scores. What happens when you have an even number of scores? What if you had only 10 scores, instead?

In this case, you take the middle two scores and calculate the mean of them. So, if we start with the following data (which are the same as above, with the last one omitted):

65 54 79 57 35 14 56 55 77 45

We again rearrange the data from smallest to largest:

14 35 45 54 55 56 57 65 77 79

And then calculate the mean of the 5th and 6th values (55 and 56, tied for the middle) to get a median of 55.5.

In general, the median is the value that splits the entire set of data into two equal halves. Because of this, the other name for the median is the 50th percentile: 50% of the data are below this value and 50% of the data are above this value. This makes the median a reasonable alternative definition of center.

Inter-Quartile Range

The inter-quartile range (typically named using its initials, IQR) is the measure of spread that is paired with the median as the measure of center. As the name suggests, the IQR divides the data into four subsets, instead of just two: the bottom quarter, the next higher quarter, the next higher quarter, and the top quarter (the same as for the median, you must start by rearranging the data from smallest to largest). As described above, the median is the dividing line between the middle two quarters. The IQR is the distance between the dividing line between the bottom two quarters and the dividing line between the top two quarters.

Technically, the IQR is the distance between the 25th percentile and the 75th percentile. You calculate the value for which 25% of the data is below this point, then you calculate the value for which 25% of the data is above this point, and then you subtract the first from the second. Because the 75th percentile cannot be lower than the 25th percentile (and is almost always much higher), the value for IQR cannot be a negative number.

Returning to our example set of 11 values, for which the median was 56, the way that you can calculate the IQR by hand is as follows. First, focus only on those values that are to the left of (i.e., lower than) the middle value:

14 35 45 54 55

Then calculate the median of these values. In this case, the answer is 45, because the third value is the middle of these five values. Therefore, the 25th percentile is 45. Next, focus on the values that are to the right of (higher than) the original median:

57 65 77 79 92

The middle of these values, which is 77, is the 75th percentile. Therefore, the IQR for these data is 32, because 77 - 45 = 32.

Note how, when the original set of data has an odd number of values (which made it easy to find the median), the middle value in the data set was ignored when finding the 25th and 75th percentiles. In the above example, the number of values to be examined in each subsequent step was also odd (i.e., 5 each), so we selected the middle value of each subset to get the 25th and 75th percentiles.

If the number of values to be examined in each subsequent step had been even (e.g., if we had started with 9 values, so that 4 values would be used to get the 25th percentile), then the same averaging rule as we use for the median would be used: use the average of the two values that tie for being in the middle. For example, if these are the data (which are the first nine values from the original example after being sorted):

14 35 45 54 55 56 57 65 77

The median is 55, the 25th percentile (the average of 35 and 45) is 40, and the 75th percentile (the average of 57 and 65) is 61. Therefore, the IQR for these data is 61 - 40 = 21.

A similar procedure is used when you start with an even number of values, but with a few extra complications (these complications are caused by the particular method of calculating percentiles that is typically used in psychology). The first change to the procedure for calculating IQR is that now every value is included in one of the two subsets for getting the 25th and 75th percentile; none are omitted. For example, if we use the same set of 10 values from above (the original 11 values with the highest omitted), for which the median was 55.5, then here is what we would use in the first step:

14 35 45 54 55

In this case, the 25th percentile will be calculated from an odd number (5) of values. We start in the same way as before, with the middle of these values, which is 45. Then we adjust it by moving the score 25% of the distance towards the next lower value, which is 35. The distance between these two values is 10 (i.e., 45 - 35), so the final value for the 25th percentile is 42.5 (i.e., 45 - 2.5). The same thing is done for the 75th percentile. This time we would start with:

56 57 65 77 79

The starting value of 65 would then be moved 25% of the distance towards the next higher value, which is 77, producing a 75th percentile of 68 (i.e., 65 + 0.25 × (77 - 65) = 68). Note how we moved the value away from the median in both cases. If we don't do this (i.e., if we used the same simple method as we used when the original set of data had an odd number of values), then we would slightly under-estimate the value of IQR.

Finally, if we start with an even number of pieces of data and also have an even number for each of the subsets (e.g., we started with 8 values), then we again have to apply the correction. Whether you have to shift the 25th and 75th percentiles depends on the original number of pieces of data, not the number that are used for the subsequent steps. To demonstrate this, here are the first eight values from the original set:

14 35 45 54 55 56 57 65

The first step to calculating the 25th percentile is to average the two values (35 and 45) that tied for being in the middle of the lower half of the data (14 35 45 54); the answer is 40. Then, as above, move this value 25% of the distance away from the median, i.e., move it down by 2.50, because 0.25 × (45 - 35) = 2.50. The final value is 37.50. Then do the same for the upper half of the data:

55 56 57 65

Start with the average of the two values (56 and 57) that tied for being in the middle and then shift this value 25% of their difference away from the center. The mean of the two values is 56.5 and, after shifting, the 75th percentile is 56.75. Thus, the IQR for these eight pieces of data is 19.25.

Note the following about the median and IQR: because these are both based on percentiles, they are not always sensitive to every value in the set of data. Look again at the original set of 11 values used in the examples. Now imagine that the first (lowest) value was much lower, instead of 14. Would either the median or the IQR change?

The answer is no; neither would change. Now imagine that the last (highest) value was 420, instead of 92. Would either the median or the IQR change?
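The hand calculations for the median and the 25th and 75th percentiles, and the two thought experiments about the extreme values, can be checked with a short sketch. It relies on Python's standard-library `statistics.quantiles`, whose default "exclusive" method appears to give the same values as the hand procedure for the example data; that correspondence is an observation about these examples, not something the unit states. The replacement value 4 for the lowest score is a hypothetical choice.

```python
from statistics import median, quantiles

original = [65, 54, 79, 57, 35, 14, 56, 55, 77, 45, 92]

def median_and_iqr(data):
    """Return (median, 25th percentile, 75th percentile, IQR)."""
    q1, _, q3 = quantiles(data, n=4)   # default method is "exclusive"
    return median(data), q1, q3, q3 - q1

ten = sorted(original)[:10]    # highest value omitted
nine = sorted(original)[:9]    # first nine sorted values
eight = sorted(original)[:8]   # first eight sorted values

print(median_and_iqr(original))  # (56, 45.0, 77.0, 32.0)
print(median_and_iqr(ten))       # (55.5, 42.5, 68.0, 25.5)
print(median_and_iqr(nine))      # (55, 40.0, 61.0, 21.0)
print(median_and_iqr(eight))     # (54.5, 37.5, 56.75, 19.25)

# Thought experiments: change only the most extreme values.
lower_min = [4 if x == 14 else x for x in original]     # 4 is a hypothetical value
higher_max = [420 if x == 92 else x for x in original]

print(median_and_iqr(lower_min) == median_and_iqr(original))   # True
print(median_and_iqr(higher_max) == median_and_iqr(original))  # True
```

Other software may use a different interpolation rule for percentiles, so quartiles computed elsewhere can differ slightly from these values.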

Again, the answer is no. Some of the other values can also change without altering the median and/or IQR, but not all of them. If you changed the 56 in the original set to being 50, instead, for example, then the median would drop from 56 to 55, but the IQR would remain 32. In contrast, if you only changed the 45 to being a 50, then the IQR would drop from 32 to 27, but the median would remain 56.

The one thing that is highly consistent is how you can decrease the lowest value and/or increase the highest value without changing either the median or IQR (as long as the data set is not too small). This is an important property of percentile-based methods: they are relatively insensitive to the most extreme values. This is quite different from moment-based methods: the mean and variance of a set of data are both sensitive to every value.

Other Measures of Center and Spread

Although a vast majority of psychologists use either the mean and variance (as a pair) or the median and IQR (as a pair) as their measures of center and spread, occasionally you might come across a few other options.

Mode

The mode is a (rarely used) way of defining the center of a set of data. The mode is simply the value that appears the most often in a set of data. For example, if one value appears twice in your data and no other value appears more than once, then that value is the mode. When you think about other sets of example data, you will probably see why the mode is not very popular. First, many sets of data do not have a meaningful mode. For a set of three different values, each value appears once, so no value is more frequent than any other value. When the data are continuous and measured precisely (e.g., response time in milliseconds), then this problem will happen quite often. Now consider a set in which two different values each appear twice; these data have two modes. This also happens quite often, especially when the data are discrete, such as when they must all be whole numbers.

But the greatest problem with using the mode as the measure of center is that it is often at one of the extremes, instead of being anywhere near the middle. Here is a favorite example (even if it is not from psychology): the amount of federal income tax paid. The most frequent value for this, i.e., the modal income tax, is zero. This also happens to be the same as the lowest value. In contrast, in 2021, for example, the mean amount of federal income tax paid was well above zero.

Range

Another descriptive statistic that you might come across is the range of the data. Sometimes this is given as the lowest and highest values (e.g., "the participant ages ranged from 18 to 24 years"), which provides some information about center and spread simultaneously. Other times the range is more specifically intended as only a measure of spread, so the difference between the highest and lowest values is given (e.g., the average age is reported together with a range of 6 years). There is nothing inherently wrong with providing the range, but it is probably best used as a supplement to one of the pairs of measures for center and spread. This is true because range (in either format) often fails to provide sufficient detail. For example, the set of 18, 18, 18, 18, and 24 and the set of 18, 24, 24, 24, and 24 both range from 18 to 24 (or have a range of 6), even though the data sets are clearly quite different.

Choosing the Measures of Center and Spread

When it comes to deciding which measures to use for center and spread when describing a set of numerical data (which is almost always a choice between mean and variance (or standard deviation) or median and IQR), the first thing to keep in mind is that this is not a question of "which is better?"; it is a question of which is more appropriate for the situation. That is, the mean and the median are not just alternative ways of calculating a value for the center of a set of data; they use different definitions of the meaning of center. So how should you make this decision?

One factor that you should consider focuses on a key difference between moments and percentiles that was mentioned above: how the mean and variance of a set of data both depend on every value, whereas the median and IQR are often unaffected by the specific values at the upper and lower extremes. Therefore, if you believe that every value in the set of data is equally important and equally representative of whatever is being studied, then you should probably use the mean and variance for your descriptive statistics. In contrast, if you believe that some extreme values might be outliers (e.g., the participant was not taking the study very seriously or was making random fast guesses), then you might want to use the median and IQR instead.

Another related factor to consider is the shape of the distribution of values in the set. If the values are spread around the center in a roughly symmetrical manner, then the mean and the median will be very similar, but if there are more extreme values in one tail of the distribution (e.g., there are more extreme values above the middle than below), this will pull the mean away from the median, and the latter might better match what you think of as the center.

Finally, if you are calculating descriptive statistics as part of a process that will later involve making inferences about the population from which the sample was taken, you might want to consider the type of statistics that you will be using later. Many inferential statistics (including the standard form of the correlation coefficient) are based on moments, so, if you plan to use these later, it would probably be more appropriate to summarize the data in terms of mean and variance (or standard deviation). Other statistics (including sign tests and alternative forms of the correlation coefficient) are based on percentiles, so if you plan to use these instead, then the median and IQR might be more appropriate for the descriptive statistics.

Hybrid Methods

Although relatively rare, there is one alternative to making a firm decision between moments (mean and variance) and percentiles (median and IQR): hybrid methods. One example of this is as follows. First, sort the data from smallest to largest (in the same manner as when using percentiles). Then remove a certain number of values from the beginning and end of the list. The most popular version of this is to remove the lowest 2.5% and the highest 2.5% of the data; for example, if you started with 200 pieces of data, remove the first 5 and the last 5, keeping the middle 190. Then switch methods and calculate the mean and variance of the retained data. This method is trying to have the best of both worlds: it is avoiding outliers by removing the extreme values, but it is remaining sensitive to all the data that are being retained. When this method is used, the correct labels for the final two values are the trimmed mean and trimmed variance.
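The hybrid procedure described above (sort, drop a fixed proportion from each end, then compute the mean and variance of what is left) can be sketched as follows. The 200 data values, including the two extreme scores, are hypothetical, and the function name and its `proportion` parameter are illustrative choices rather than anything defined in this unit.

```python
from statistics import mean, variance

def trimmed_mean_and_variance(data, proportion=0.025):
    """Drop `proportion` of the values from each end, then use mean and variance."""
    ordered = sorted(data)                  # smallest to largest
    k = round(len(ordered) * proportion)    # number of values to drop from each end
    kept = ordered[k:len(ordered) - k]      # the retained middle of the data
    return mean(kept), variance(kept), len(kept)

# Hypothetical data: 198 ordinary scores plus two extreme values.
data = list(range(1, 199)) + [5000, 9000]

t_mean, t_var, n_kept = trimmed_mean_and_variance(data)

print(n_kept)       # 190 values retained out of 200
print(t_mean)       # 100.5, unaffected by the two extreme scores
print(mean(data))   # the untrimmed mean is pulled upward by the extreme scores
```

With 200 values and a proportion of 0.025, five values are removed from each end, which matches the example in the text.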

Measures of Shape for Numerical Data

As the name suggests, the shape of a set of data is best thought about in terms of how the data would look if you made some sort of figure or plot of the values. The most popular way to make a plot of a single set of numerical values starts by putting all of the data into something that is called a frequency table. In brief, a frequency table is a list of all possible values, along with how many times each value occurs in the set of data. This is easy to create when there are not very many different values (e.g., number of siblings); it becomes more complicated when almost every value in the set is unique (e.g., response time in milliseconds).

The key to resolving the problem of having too many unique values is to "bin" the data. To bin a set of data, you choose a set of cut-offs, which will determine the borders of adjacent bins. For example, if you are working with response times which happen to range from about 300 to 600 milliseconds (with every specific value being unique), you might decide to use bins that are 50 milliseconds wide, such that all values from 301 to 350 go in the first bin, all values from 351 to 400 go in the second bin, etc. Most software packages (e.g., Excel) have procedures to do this for you.

As an illustration of this process, let's go back to the set of 11 values we have used in previous examples:

65 54 79 57 35 14 56 55 77 45 92

Based on the total number of values and their range, we decide to use bins that are 20 units wide. Here are the same data in a frequency table:

Once you have a list of values or bins and the number of pieces of data in each, you can make a frequency histogram of the data, as shown in the figure below.

Figure. Example histogram in which the data are grouped into five bins. The numbers inside the bars represent the frequency count, that is, how many data points we have within each bin.

Based on this histogram, we can start to make descriptive statements about the shape of the data. In general, these will concern two aspects, known as skewness and kurtosis, as we shall see next.

Skewness

Skewness refers to the lack of symmetry. If the left and right sides of the plot are mirror images of each other, then the distribution has no skew, because it is symmetrical; this is the case for the normal distribution (see the illustration of skewness below). This clearly is not true for the example histogram. If the distribution has a longer tail on the left side, as is true here, then the data are said to have negative skew. If the distribution has a longer tail on the right, then the distribution is said to have positive skew. Note that you need to focus on the skinny part at each end of the plot. The example histogram might appear to be heavier on the right, but skew is determined by the length of the skinny tails, and the tail is clearly much longer on the left.

As a reference, the illustration below shows a normal distribution, which is perfectly symmetrical, so its skewness is zero; to its left and to its right, you can see two skewed distributions, one positive and one negative. Most of the data points in the distribution with positive skew have low values, and the distribution has a long tail on its right side. The opposite is true for the distribution with negative skew: most of its data points have high values, and it has a long tail on its left side.

[Figure panels, left to right: Positive Skew, Normal, Negative Skew]

Figure. An illustration of skewness. A normal distribution (in the middle) is symmetrical, so it has no skew. The distributions with positive and negative skew show a clear lack of symmetry.

Kurtosis

The other aspect, kurtosis, is a bit more complicated. In general, kurtosis refers to how sharply the data are peaked, and is established in reference to a baseline or standard shape, the normal distribution, which has a kurtosis of zero. When we have a nearly flat distribution, for example when every value occurs equally often, the kurtosis is negative. When the distribution is very pointy, the kurtosis is positive. If the shape of your data looks like a bell curve, then it is said to be mesokurtic (meso means middle or intermediate in Greek). If the shape of your data is flatter than this, then it is said to be platykurtic (platy means flat in Greek). If your shape is more pointed than this, then your data are leptokurtic (lepto means thin, narrow, or pointed in Greek). Examples of these shapes can be seen in the illustration below.

[Figure panels, left to right: Platykurtic, Mesokurtic (normal), Leptokurtic]

Figure. An illustration of kurtosis. A normal distribution (in the middle) is mesokurtic, and its kurtosis value is zero. The platykurtic distribution, on the left, is flatter than the normal distribution (negative kurtosis), whereas the leptokurtic distribution, on the right, is more pointed than the normal distribution (positive kurtosis).

Both skew and kurtosis can vary a lot, but these two attributes of shape are not completely independent. That is, it is impossible for a perfectly flat distribution to have any skew; it is also impossible for a strongly skewed distribution to have zero kurtosis.

A large proportion of the data that is collected by psychologists is approximately normal, but with a long right tail. In this situation, a good verbal label for the overall shape could be positively skewed normal, even if that seems a bit contradictory, because the true normal distribution is actually symmetrical (see the two illustrations above). The goal is to summarize the shape in a way that is easy to understand while being as accurate as possible. You can always show a picture of your distribution to your audience. A simple summary of the shape of the example histogram could be: roughly normal, but with a lot of negative skew. This tells your audience that the data have a peak in the middle, but the lower tail is a lot longer than the upper tail.

Numerical Values for Skew and Kurtosis

In some rare situations, you might want to be even more precise about the shape of a set of data. In these cases, assuming that you used the mean and variance as your measures of center and spread, you can use some (complicated) formulae to calculate specific numerical values for skew and kurtosis. These are based on the third and fourth moments of the distribution (which is why they can only be used with the mean and variance, because those are the first and second moments of the data). The details of these measures are beyond this course, but to give you an idea, as indicated above, values that depart from zero tell you that the shape is different from the normal distribution. A value of skew that is less than [...] or greater than [...] implies that the shape is notably skewed, whereas a value of kurtosis that is more than one unit away from zero implies that the data are not mesokurtic.
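
To make the idea of third and fourth moments a little more concrete, here is a minimal Python sketch of one common way to compute numerical skew and kurtosis. It is an illustration under stated assumptions, not the formulae the unit has in mind: it uses the simple "population" moment formulas without small-sample corrections (statistical packages often apply corrections, so their results can differ slightly), and it reports excess kurtosis (kurtosis minus 3) so that a normal distribution scores zero, matching the convention used above. The function names and the sample data are hypothetical.

```python
from statistics import fmean


def central_moment(data, k):
    """Mean of the k-th power of the deviations from the mean."""
    m = fmean(data)
    return fmean([(x - m) ** k for x in data])


def skewness(data):
    """Third central moment, standardized by the variance (second moment).

    Assumes the data are not all identical (variance must be non-zero).
    """
    m2 = central_moment(data, 2)
    m3 = central_moment(data, 3)
    return m3 / m2 ** 1.5


def excess_kurtosis(data):
    """Fourth central moment, standardized, minus 3 (normal distribution = 0)."""
    m2 = central_moment(data, 2)
    m4 = central_moment(data, 4)
    return m4 / m2 ** 2 - 3.0


# Hypothetical scores with a long tail on the left (low) side.
left_tailed = [1, 5, 6, 7, 7, 8, 8, 8, 9, 9]
# Hypothetical flat, symmetrical data: every value occurs equally often.
flat = [1, 2, 3, 4, 5]

print("left-tailed skew:", skewness(left_tailed))
print("flat skew:", skewness(flat))
print("flat excess kurtosis:", excess_kurtosis(flat))
```

For the flat, symmetrical list the skew comes out as zero and the excess kurtosis is about -1.3 (platykurtic), while the left-tailed list gives a negative skew, in line with the verbal descriptions above.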

Summarizing Categorical Data

By definition, you cannot summarize a set of categorical data (e.g., favorite colors) in terms of a numerical mean or a numerical spread. It also does not make much sense to talk about shape, because this would depend on the order in which you placed the options on the x-axis of the plot. Therefore, in this situation, we usually make a frequency table (with the options in any order that we wish). You can also make a frequency histogram, but be careful not to read anything important into the apparent shape, because changing the order of the options would completely alter the shape.

An issue worth mentioning here is something that is similar to the process of binning. Assume, for example, that you have taken a sample of 100 undergraduates, asking each for their favorite genre of music. Assume that a majority of respondents chose either pop (24), [...] (27), rock (25), or classical (16), but a few chose techno ([...]), trance ([...]), or country ([...]). In this situation, you might want to combine all of the rare responses into one category with the label Other. The reason for doing this is that it is difficult to come to any clear conclusions when something is rare. As a general rule, if a category contains fewer than [...] of the observations, then it should probably be combined with one or more other options. An example frequency table for such data is this:

Choice      Pop   [...]   Rock   Classical   Other
Frequency   24    27      25     16          8

Finally, to be technically accurate, it should be mentioned that there are some ways to quantify whether each of the options is being selected the same percent of the time, including the chi-square (pronounced "kai square") test and relative entropy (which comes from physics), but these are not commonly used. In general, most researchers just make a table, and maybe a histogram, to show the distribution of the categorical values.
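
The idea of folding rare responses into an Other category can be sketched in a few lines of Python. This is only an illustration: the function name `combine_rare`, the 5% cutoff (`min_share=0.05`), the placeholder label "genre B" for the genre whose name is missing above, and the individual counts for techno, trance, and country are all assumptions made for the example. Only the counts 24, 27, 25, and 16 and the sample size of 100 come from the text.

```python
from collections import Counter


def combine_rare(counts, min_share=0.05, other_label="Other"):
    """Merge every category below min_share of the total into one bucket.

    min_share is a hypothetical cutoff chosen for this example.
    """
    total = sum(counts.values())
    combined = Counter()
    for category, n in counts.items():
        key = category if n / total >= min_share else other_label
        combined[key] += n
    return combined


# Counts for the four common genres follow the text; "genre B" is a
# placeholder name, and the three rare counts are made up so that the
# sample adds up to 100 respondents.
favorite_genre = Counter({
    "Pop": 24,
    "genre B": 27,
    "Rock": 25,
    "Classical": 16,
    "Techno": 3,
    "Trance": 2,
    "Country": 3,
})

table = combine_rare(favorite_genre)

# Print a simple two-column frequency table.
print(f"{'Choice':<12}{'Frequency':>10}")
for choice, frequency in table.items():
    print(f"{choice:<12}{frequency:>10}")
```

With these numbers, the three rare genres each fall below the assumed cutoff and are merged, so the Other row holds the remaining 8 responses. Because the categories have no natural order, the rows could be printed in any order without changing the meaning of the table.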