Essentials of Geographic Information Systems Chapter 6 Data Characteristics and Visualization

Chapter 6 Data Characteristics and Visualization

In previous chapters, we learned how geographic information system (GIS) software packages use databases to store extensive attribute information for features within a map. The true usefulness of this information, however, is not realized until similarly powerful analytical tools are employed to access, process, and simplify the data. To accomplish this, GIS typically provides extensive tools for searching, querying, describing, summarizing, and classifying datasets. With these data exploration tools, even the most expansive datasets can be mined to provide users the ability to make meaningful insights into and statements about that information.

Descriptions and Summaries

LEARNING OBJECTIVE

The objective of this section is to review the most frequently used measures of distribution, central tendency, and dispersion.

No discussion of analysis would be complete without a brief overview of basic statistical concepts. The basic statistics outlined here represent a starting point for any attempt to describe, summarize, and analyze a dataset. An example of a common statistical endeavor is the analysis of point data obtained by a series of rainfall gauges patterned throughout a particular region. Given these rain gauges, one could determine the typical amount and variability of rainfall at each station, as well as typical rainfall throughout the region as a whole. In addition, you could interpolate the amount of rainfall that falls between each station or the location where the most (or least) rainfall occurs. Furthermore, you could predict the expected amount of rainfall into the future at each station, between each station, or within the region as a whole.

The increase of computational power over the past few decades has given rise to vast datasets that cannot be summarized easily. Descriptive statistics provide simple numeric descriptions of these large datasets. Descriptive statistics tend to be univariate analyses, meaning they examine one variable at a time. There are three families of descriptive statistics that we will discuss here: measures of distribution, measures of central tendency, and measures of dispersion. However, before we delve too deeply into various statistical techniques, we must define a few terms.
Variable: a symbol used to represent any given value or set of values
Value: an individual observation of a variable (in a geographic information system [GIS] this is also called a record)
Population: the universe of all possible values for a variable
Sample: a subset of the population
n: the number of observations for a variable
Array: a sequence of observed measures (in a GIS this is also called a field and is represented in an attribute table as a column)
Sorted Array: an ordered, quantitative array

Measures of Distribution

The measure of distribution of a variable is merely a summary of the frequency of values over the range of the dataset (hence, this is often called a frequency distribution). Typically, the values for the given variable will be grouped into a predetermined series of classes (also called intervals, bins, or categories), and the number of data values that fall into each class will be summarized. A graph showing the number of data values within each class range is called a histogram. For example, the percentage grades received by a class on an exam may result in the following array (n = 30):

Array of Exam Scores: 87, 76, 89, 90, 64, 57, 59, 79, 88, 74, 72, 99, 81, 77, 75, 86, 94, 66, 75, 74, 83, 70, 50, 57

When placing this array into a frequency distribution, the following general guidelines should be observed. First, five or more different classes should be employed, although the exact number of classes depends on the number of observations. Second, each observation goes into one and only one class. Third, when possible, use classes that cover an equal range of values (Freund and Perles 2006). With these guidelines in mind, the exam score array shown earlier can be visualized with a histogram.

Figure: Histogram Showing the Frequency Distribution of Exam Scores
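To make the binning guidelines concrete, the following is a minimal Python sketch of a frequency distribution. The scores, the class edges, and the function name frequency_distribution are hypothetical choices made for illustration; they are not the chapter's exam data.

```python
from collections import Counter


def frequency_distribution(values, class_edges):
    """Count how many values fall into each half-open class [lower, upper)."""
    classes = list(zip(class_edges, class_edges[1:]))
    counts = Counter()
    for v in values:
        for lower, upper in classes:
            if lower <= v < upper:
                counts[(lower, upper)] += 1
                break  # each observation goes into one and only one class
    return [(lower, upper, counts[(lower, upper)]) for lower, upper in classes]


# Hypothetical scores and equal-width classes (assumptions for illustration)
scores = [55, 62, 68, 71, 74, 75, 75, 79, 83, 88, 91, 97]
edges = [50, 60, 70, 80, 90, 100]

table = frequency_distribution(scores, edges)
for lower, upper, count in table:
    # A text histogram: one '#' per observation in the class
    print(f"{lower}-{upper - 1}: {'#' * count} ({count})")
```

Because the classes are half-open, a score that sits exactly on a class edge is counted once, in the class that starts at that edge.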

As you can see from the histogram, certain descriptive observations can be readily made: the most common grade is immediately apparent, two students failed the exam, and five students received an A. Note that this histogram does violate the third basic rule that each class cover an equal range, because the failing grade spans a wider range of scores than the other grades, which have ranges of equal size. Regardless, in this case we are most concerned with describing the distribution of grades received during the exam. Therefore, it makes perfect sense to create class ranges that best suit our individual needs.

Measures of Central Tendency

We can further explore the exam score array by applying measures of central tendency. There are three primary measures of central tendency: the mean, mode, and median.

The mean, more commonly referred to as the average, is the most often used measure of central tendency. To calculate the mean, simply add all the values in the array and divide that sum by the number of observations. To return to the exam score example from earlier, the sum of that array is 2,340, and there are 30 observations (n = 30). So, the mean is 2,340 / 30 = 78.

The mode is the measure of central tendency that represents the most frequently occurring value in the array. In the case of the exam scores, the mode of the array is 75, as this score was received by the greatest number of students (three, in total).

Finally, the median is the observation that, when the array is ordered from lowest to highest, falls exactly in the center of the sorted array. More specifically, the median is the value in the middle of the sorted array when there is an odd number of observations. Alternatively, when there is an even number of observations, the median is calculated as the mean of the two central values.

If the array of exam scores were reordered into a sorted array, the scores would be listed thusly:

Sorted Array of Exam Scores: 64, 66, 67, 70, 72, 73, 74, 74, 75, 76, 77, 79, 83, 85, 88, 89, 90, 92, 93, 94, 99

Since n = 30 in this example, there is an even number of observations. Therefore, the mean of the two central values (76 and 77) is used to calculate the median as described earlier, resulting in (76 + 77) / 2 = 76.5. Taken together, the mean, mode, and median represent the most basic ways to examine trends in a dataset.
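The three measures of central tendency can be sketched with Python's standard statistics module. The six scores below are hypothetical and were chosen so that the number of observations is even; they are not the chapter's exam array.

```python
import statistics

# Hypothetical scores with an even number of observations (n = 6)
scores = [62, 70, 74, 76, 76, 92]

mean = statistics.mean(scores)      # sum of the values divided by n
mode = statistics.mode(scores)      # most frequently occurring value
median = statistics.median(scores)  # mean of the two central values when n is even

print("mean:", mean)
print("mode:", mode)
print("median:", median)
```

With an even number of observations the median (75 here) need not be a value that actually occurs in the array, whereas the mode (76) always is.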

Measures of Dispersion

The third type of descriptive statistics is measures of dispersion (also referred to as measures of variability). These measures describe the spread of data around the mean. The simplest measure of dispersion is the range. The range equals the largest value in the dataset minus the smallest. In our case, the range is 99 - 57 = 42.

The interquartile range represents a slightly more sophisticated measure of dispersion. This method divides the data into quartiles. To accomplish this, the median is used to divide the sorted array into two halves. These halves are again divided into halves by their own median. The first quartile (Q1) is the median of the lower half of the sorted array and is also referred to as the lower quartile. Q2 represents the median. Q3 is the median of the upper half of the sorted array and is referred to as the upper quartile. The difference between the upper and lower quartile is the interquartile range; in the exam score example, this is Q3 - Q1.

A third measure of dispersion is the variance. To calculate the variance, subtract the raw value of each exam score from the mean of the exam scores. As you may guess, some of the differences will be positive, and some will be negative, resulting in the sum of differences equaling zero. As we are more interested in the magnitude of differences (or deviations) from the mean, one method to overcome this zeroing property is to square each deviation, thus removing the negative values from the output. The squared deviations are then added together to give the sum of squares.

Figure: Deviation Squared From Mean

We then divide the sum of squares by either n - 1 (in the case of working with a sample) or n (in the case of working with a population). As the exam scores given here represent the entire population of the class, we will divide by n to obtain the variance.

Figure: Variance

If we wanted to use these exam scores to extrapolate information about the larger student body, we would be working with a sample of the population. In that case, we would divide the sum of squares by n - 1.

Standard deviation, the final measure of dispersion discussed here, is the most commonly used measure of dispersion. To compensate for the squaring of each difference from the mean performed during the variance calculation, standard deviation takes the square root of the variance.

Figure: Standard Deviation

Calculating the standard deviation allows us to make some notable inferences about the dispersion of our dataset. A small standard deviation suggests the values in the dataset are clustered around the mean, while a large standard deviation suggests the values are scattered widely around the mean. Additional inferences may be made about the standard deviation if the dataset conforms to a normal distribution. A normal distribution implies that the data, when placed into a frequency distribution (histogram), look symmetrical or bell-shaped. When not normal, the frequency distribution of a dataset is said to be positively or negatively skewed (see the figure of normally curved, positively skewed, and negatively skewed distributions). Skewed data are those whose values are not symmetrical around the mean.

Regardless, normally distributed data maintain the property of having approximately 68 percent of the data values fall within 1 standard deviation of the mean, and 95 percent of the data values fall within 2 standard deviations of the mean. In our example, the mean is 78. It can therefore be stated that 68 percent of the scores fall within one standard deviation of 78, while 95 percent of the scores fall within two standard deviations of 78. For datasets that do not conform to the normal curve, it can be assumed that at least 75 percent of the data values fall within 2 standard deviations of the mean.

Figure: Normal Curve, Positive Skew, and Negative Skew

KEY TAKEAWAYS

The measure of distribution for a given variable is a summary of the frequency of values over the range of the dataset and is commonly shown using a histogram.
Measures of central tendency attempt to provide insights into the typical value for a dataset.
Measures of dispersion (or variability) describe the spread of data around the mean or median.

EXERCISES

1. Create a table containing at least thirty data values.
2. For the table you created, calculate the mean, mode, median, range, interquartile range, variance, and standard deviation.

Freund, J., and B. Perles. 2006. Modern Elementary Statistics. Englewood Cliffs, NJ: Prentice Hall.
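The measures of dispersion described in this section can be sketched as follows. This is an illustrative example only: the scores are hypothetical, and the quartiles are computed as medians of the lower and upper halves of the sorted array, as described in the text (other software may use slightly different quartile conventions).

```python
import math
import statistics

# Hypothetical scores, sorted so the halves can be sliced directly
scores = sorted([62, 70, 74, 76, 76, 92])
n = len(scores)

# Range: largest value minus smallest value
value_range = scores[-1] - scores[0]

# Quartiles as medians of the lower and upper halves
# (the middle value is left out of both halves when n is odd)
lower_half = scores[: n // 2]
upper_half = scores[(n + 1) // 2:]
q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
interquartile_range = q3 - q1

# Variance: sum of squared deviations from the mean,
# divided by n (population) or n - 1 (sample)
mean = sum(scores) / n
sum_of_squares = sum((x - mean) ** 2 for x in scores)
population_variance = sum_of_squares / n
sample_variance = sum_of_squares / (n - 1)

# Standard deviation: square root of the variance
population_sd = math.sqrt(population_variance)

print(value_range, interquartile_range)
print(population_variance, sample_variance, population_sd)
```

Dividing by n - 1 instead of n always gives the larger of the two variances, which matters most when the number of observations is small.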

Searches and Queries

LEARNING OBJECTIVE

The objective of this section is to outline the basics of the SQL language and to understand the various query techniques available in a GIS.

Access to robust search and query tools is essential to examine the general trends of a dataset. Queries are essentially questions posed to a database. The selective display and retrieval of information based on these queries are essential components of any geographic information system (GIS). There are three basic methods for searching and querying attribute data: (1) selection, (2) query by attribute, and (3) query by geography.

Selection

Selection represents the easiest way to search and query spatial data in a GIS. Selecting features highlights those attributes of interest, both on-screen and in the attribute table, for subsequent display or analysis. To accomplish this, one selects points, lines, and polygons simply by using the cursor to click the feature of interest or by using the cursor to drag a box around those features. Alternatively, one can select features by using a graphic object, such as a circle, line, or polygon, to highlight all of those features that fall within the object. Advanced options for selecting subsets of data from the larger dataset include creating a new selection, selecting from the currently selected features, adding to the current selection, and removing from the current selection.

Query by Attribute

Map features and their associated data can be retrieved via the query of attribute information within the data tables. For example, search and query tools allow a user to show all the census tracts that have a population density of 500 or greater, to show all counties that are less than or equal to 100 square kilometers, or to show all convenience stores within 1 mile of an interstate highway.

Specifically, SQL (Structured Query Language) is a commonly used computer language developed to query attribute data within a relational database management system. Created by IBM in the 1970s, SQL allows for the retrieval of a subset of attribute information based on specific criteria via the implementation of particular language elements. More recently, the use of SQL has been extended for use in a GIS (Shekhar and Chawla 2003). One important note related to the use of SQL is that the exact expression used to query a dataset depends on the GIS file format being examined. For example, ANSI SQL is a particular version used to query ArcSDE geodatabases, while Jet SQL is used to access personal geodatabases. Similarly, shapefiles, coverages, and dBASE tables use a restricted version of SQL that doesn't support all the features of ANSI SQL or Jet SQL.

As discussed in Chapter Data Management, Section Database Management, all attribute tables in a relational database management system (RDBMS) used for an SQL query must contain primary and/or foreign keys for proper use. In addition to these keys, SQL implements clauses to structure database queries. A clause is a language element that includes the SELECT, FROM, WHERE, ORDER BY, and HAVING query statements.

SELECT denotes which attribute table fields you wish to view.
FROM denotes the attribute table in which the information resides.
WHERE denotes the criteria for the attribute information that must be met in order for it to be included in the output set.
ORDER BY denotes the sequence in which the output set will be displayed.
HAVING denotes the predicate used to filter grouped (aggregated) output.

While the SELECT and FROM clauses are both mandatory statements in an SQL query, the WHERE is an optional clause used to limit the output set. The ORDER BY and HAVING are optional clauses used to present the information in an interpretable manner.
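The roles of the SELECT, FROM, WHERE, and ORDER BY clauses can be tried out with a small sketch that uses Python's built-in sqlite3 module. The table name addresses, its column names, and its three records are all hypothetical, and SQLite is only one SQL dialect, so the exact syntax accepted by a given GIS package may differ.

```python
import sqlite3

# In-memory database with a hypothetical attribute table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE addresses (last_name TEXT, first_name TEXT, city TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO addresses VALUES (?, ?, ?, ?)",
    [
        ("Alvarez", "Rosa", "Upland", "CA"),
        ("Baker", "Lee", "Eugene", "OR"),
        ("Chen", "Ada", "Topanga", "CA"),
    ],
)

# SELECT: which fields to view; FROM: which table;
# WHERE: criteria records must meet; ORDER BY: display sequence
rows = conn.execute(
    "SELECT last_name FROM addresses WHERE state = 'CA' ORDER BY first_name"
).fetchall()

print(rows)
conn.close()
```

Note that the output is ordered by first_name even though that field is not part of the output set, because ORDER BY can reference any field in the table named by FROM.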

Figure: Personal Addresses in "ExampleTable" Attribute Table

The following is a series of SQL expressions and results when applied to Figure Personal Addresses in "ExampleTable" Attribute Table. The title of the attribute table is "ExampleTable". Note that the asterisk (*) denotes a special case of SELECT whereby all columns for a given record are selected:

    SELECT * FROM ExampleTable WHERE City = 'Upland'

This statement returns the records for Upland residents, such as the Squires record (4589, Upland, 91657).

Consider the following statement:

    SELECT LastName FROM ExampleTable WHERE State = 'CA' ORDER BY FirstName

This statement results in the following table sorted in ascending order by the FirstName column (not included in the output table as directed by the SELECT clause):

Tanner, MacDonald, Brody, Ramirez

In addition to clauses, SQL allows for the inclusion of specific operators to further delimit the result of a query. These operators can be relational, arithmetic, or Boolean and will typically appear inside of conditional statements in the WHERE clause. A relational operator employs the statements equal to (=), less than (<), less than or equal to (<=), greater than (>), or greater than or equal to (>=). Arithmetic operators are those mathematical functions that include addition (+), subtraction (-), multiplication (*), and division (/). Boolean operators (also called Boolean connectors) include the statements AND, OR, XOR, and NOT. The AND connector is used to select records from the attribute table that satisfy both expressions. The OR connector selects records that satisfy either one or both expressions. The XOR connector selects records that satisfy one and only one of the expressions (the functional opposite of the AND connector). Lastly, the NOT connector is used to negate (or unselect) an expression that would otherwise be true. Put into the language of probability, the AND connector is used to represent an intersection, OR represents a union, and NOT represents a complement. Figure Diagram of SQL Operators illustrates the logic of these connectors, where circles A and B represent two sets of intersecting data. Keep in mind that SQL is a very exacting language, and minor inconsistencies in the statement, such as additional spaces, can result in a failed query.

Figure: Diagram of SQL Operators (AND, OR, NOT)
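The set logic behind the Boolean connectors can be sketched with Python sets. The record IDs below are hypothetical: set a stands for the records that satisfy one WHERE expression, and set b for the records that satisfy another.

```python
# Hypothetical record IDs in an attribute table
all_records = {1, 2, 3, 4, 5, 6, 7, 8}
a = {1, 2, 3, 4}  # records satisfying expression A
b = {3, 4, 5, 6}  # records satisfying expression B

a_and_b = a & b          # AND: intersection, both expressions satisfied
a_or_b = a | b           # OR: union, either one or both satisfied
a_xor_b = a ^ b          # XOR: one and only one expression satisfied
not_a = all_records - a  # NOT: complement of A within the table

print(sorted(a_and_b), sorted(a_or_b), sorted(a_xor_b), sorted(not_a))
```

The XOR result is exactly the OR result with the AND result removed, which is one way to remember how the connectors relate to each other.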

Used together, these operators combine to provide the GIS user with powerful search and query options. With this in mind, can you determine the output set of the following SQL query as it is applied to Figure Personal Addresses in "ExampleTable" Attribute Table?

    SELECT LastName, FirstName FROM ExampleTable WHERE ... 10000 AND ... 100 ORDER BY LastName

The following are the results:

Buckley Chris, Gibson David, Hess Douglas, Bob Tony, Ramirez Ruben, Smith Dan, Squires Edward, Tanner Dave, Justin

Query by Geography

Query by geography, also known as a spatial query, allows one to highlight particular features by examining their position relative to other features. For example, a GIS provides robust tools that allow for the determination of the number of schools within 10 miles of a home. Several spatial query options are available, as outlined here. Throughout this discussion, the target layer refers to the feature dataset whose attributes are selected, while the source layer refers to the feature dataset on which the spatial query is applied. For example, if we were to use a state boundary polygon feature dataset to select highways from a line feature dataset (e.g., select all the highways that run through the state of Arkansas), the state layer is the source, while the highway layer is the target.

INTERSECT. This spatial query technique selects all features in the target layer that share a common locale with the source layer. The intersect query allows points, lines, or polygon layers to be used as both the source and target layers.

Figure: INTERSECT, shown for features that intersect with point, line, and polygon features. The highlighted blue and yellow features are selected because they intersect the red features.

ARE WITHIN A DISTANCE OF. This technique requires the user to specify some distance value, which is then used to buffer (Chapter Analysis Vector Operations, Section Multiple Layer Analysis) the source layer. All features that intersect this buffer are highlighted in the target layer. The are within a distance of query allows points, lines, or polygon layers to be used for both the source and target layers.

Figure: ARE WITHIN A DISTANCE OF, shown for features that are within a distance of point, line, and polygon features.
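For point layers, the logic of the are within a distance of query can be sketched in a few lines of Python. This is a simplified illustration, not how a GIS package implements the query: it assumes planar coordinates in consistent units, handles points only, and uses hypothetical names (within_distance, schools, home) and coordinates.

```python
import math


def within_distance(target_points, source_points, distance):
    """Select target points lying within `distance` of any source point.

    Equivalent to buffering each source point by `distance` and keeping
    the target points that fall inside at least one buffer.
    """
    return [
        t
        for t in target_points
        if any(math.dist(t, s) <= distance for s in source_points)
    ]


# Hypothetical planar coordinates, in miles
schools = [(1.0, 1.0), (4.0, 5.0), (12.0, 0.0)]  # target layer
home = [(0.0, 0.0)]                               # source layer

selected = within_distance(schools, home, 10.0)
print(selected)
```

Lines and polygons require distance calculations to segments and boundaries rather than to single coordinates, which is why GIS software typically builds an explicit buffer geometry first.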

The highlighted blue and yellow features are selected because they are within the selected distance of the red features; tan areas represent buffers around the various features.

COMPLETELY CONTAIN. This spatial query technique returns those features that are entirely within the source layer. Features with coincident boundaries are not selected by this query type. The completely contain query allows for points, lines, or polygons as the source layer, but only polygons can be used as a target layer.

Figure: COMPLETELY CONTAIN, shown for point, line, and polygon features completely contained by polygon features. The highlighted blue and yellow features are selected because they completely contain the red features.

ARE COMPLETELY WITHIN. This query selects those features in the target layer whose entire spatial extent occurs within the geometry of the source layer. The are completely within query allows for points, lines, or polygons as the target layer, but only polygons can be used as a source layer.

Figure: ARE COMPLETELY WITHIN, shown for features that are completely within polygon features. The highlighted blue and yellow features are selected because they are completely within the red features.

HAVE THEIR CENTER IN. This technique selects target features whose center, or centroid, is located within the boundary of the source feature dataset. The have their center in query allows points, lines, or polygon layers to be used as both the source and target layers.

Figure: HAVE THEIR CENTER IN, shown for features that have their centers within point, line, and polygon features. The highlighted blue and yellow features are selected because they have their centers in the red features.

SHARE A LINE SEGMENT. This spatial query selects target features whose boundaries share a minimum of two adjacent vertices with the source layer. The share a line segment query allows for line or polygon layers to be used for either of the source and target layers.

Figure: SHARE A LINE SEGMENT, shown for features that share a line segment with line and polygon features.
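The "two adjacent vertices" rule behind the share a line segment query can be sketched for polygons as follows. This is a minimal illustration under stated assumptions: polygons are given as hypothetical lists of vertex tuples, shared vertices must match exactly, and partially overlapping edges (which real GIS software handles with tolerances) are not detected.

```python
def edges(ring):
    """Return the undirected edges between adjacent vertices of a closed ring."""
    return {
        frozenset((ring[i], ring[(i + 1) % len(ring)]))
        for i in range(len(ring))
    }


def share_line_segment(target, source):
    """True if the two polygons have at least one edge (two adjacent vertices) in common."""
    return bool(edges(target) & edges(source))


# Hypothetical parcels
parcel_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
parcel_b = [(2, 0), (4, 0), (4, 2), (2, 2)]  # shares the edge (2, 0)-(2, 2) with parcel_a
parcel_c = [(2, 2), (4, 2), (4, 4), (2, 4)]  # meets parcel_a only at the vertex (2, 2)

print(share_line_segment(parcel_a, parcel_b))
print(share_line_segment(parcel_a, parcel_c))
```

Storing each edge as a frozenset makes the comparison independent of the direction in which the two polygons were digitized.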

The highlighted blue and yellow features are selected because they share a line segment with the red features.

TOUCH THE BOUNDARY OF. This methodology is similar to the INTERSECT spatial query; however, it selects line and polygon features that share a common boundary with the source layer. The touch the boundary of query allows for line or polygon layers to be used as both the source and target layers.

Figure: TOUCH THE BOUNDARY OF, shown for features that touch the boundary of line and polygon features.

The highlighted blue and yellow features are selected because they touch the boundary of the red features.

ARE IDENTICAL TO. This spatial query returns features that have the exact same geographic location. The are identical to query can be used on points, lines, or polygons, but the target layer type must be the same as the source layer type.

Figure: ARE IDENTICAL TO. The highlighted blue and yellow features are selected because they are identical to the red features.

ARE CROSSED BY THE OUTLINE OF. This selection criterion returns features that share a single vertex but not an entire line segment. The are crossed by the outline of query allows for line or polygon layers to be used as both source and target layers.

Figure: ARE CROSSED BY THE OUTLINE OF, shown for features that are crossed by the outline of line and polygon features. The highlighted blue and yellow features are selected because they are crossed by the outline of the red features.

CONTAIN. This method is similar to the COMPLETELY CONTAIN spatial query; however, features in the target layer will be selected even if the boundaries overlap. The contain query allows for point, line, or polygon features in the target layer when points are used as a source; line and polygon target layers with a line source; and only polygon target layers with a polygon source.

Figure: CONTAIN, shown for features that contain point, line, and polygon features. The highlighted blue and yellow features are selected because they contain the red features.

ARE CONTAINED BY. This method is similar to the ARE COMPLETELY WITHIN spatial query; however, features in the target layer will be selected even if the boundaries overlap. The are contained by query allows for point, line, or polygon features in the target layer when polygons are used as a source; point and line target layers with a line source; and only point target layers with a point source.

Figure: ARE CONTAINED BY, shown for features that are contained by point, line, and polygon features. The highlighted blue and yellow features are selected because they are contained by the red features.

KEY TAKEAWAYS

The three basic methods for searching and querying attribute data are selection, query by attribute, and query by geography.
SQL is a commonly used computer language developed to query attribute data within a relational database management system.
Queries by geography allow a user to highlight desired features by examining their position relative to other features. The eleven different query-by-geography options listed here are available in most GIS software packages.

EXERCISES

1. Using Figure Personal Addresses in "ExampleTable" Attribute Table, develop the SQL statement that results in the output of all the street names of people living in Los Angeles, sorted by street number.
2. When querying by geography, what is the difference between a source layer and a target layer?
3. What is the difference between the CONTAIN, COMPLETELY CONTAIN, and ARE CONTAINED BY queries?

Shekhar, S., and S. Chawla. 2003. Spatial Databases: A Tour. Upper Saddle River, NJ: Prentice Hall.

Data Classification

LEARNING OBJECTIVE

The objective of this section is to describe the methodologies available to parse data into various classes for visual representation in a map.

The process of data classification combines raw data into predefined classes, or bins. These classes may be represented in a map by some unique symbols or, in the case of choropleth maps, by a unique color or hue (for more on color and hue, see Chapter Analysis II Raster Data, Section Basic with). Choropleth maps are thematic maps shaded with graduated colors to represent some statistical variable of interest. Although seemingly straightforward, there are several different classification methodologies available to a cartographer. These methodologies break the attribute values down along various interval patterns. Monmonier (1991) noted that different classification methodologies can have a major impact on the interpretability of a given map, as the visual pattern presented is easily distorted by manipulating the interval breaks of the classification.

In addition to the methodology employed, the number of classes chosen to represent the feature of interest will also affect the ability of the viewer to interpret the mapped information. Including too many classes can make a map look overly complex and confusing. Too few classes can oversimplify the map and hide important data trends. Most effective classification attempts utilize approximately four to six distinct classes.

While problems potentially exist with any classification technique, a well-constructed choropleth map increases the interpretability of any given map. The following discussion outlines the classification methods commonly available in geographic information system (GIS) software packages. In these examples, we will use the US Census Bureau population statistic for US counties in 1997. These data are freely available at the US Census website.

The equal interval (or equal step) classification method divides the range of attribute values into equally sized classes. The number of classes is determined by the user. The equal interval classification method is best used for continuous datasets such as precipitation or temperature. In the case of the 1997 Census Bureau data, county population values across the United States range from a low of 40 (Yellowstone National Park County, MT) to a high in Los Angeles County, CA; the total range is that maximum minus 40. If we decide to classify this data into five equal interval classes, the range of each class would cover one-fifth of that population spread (Figure Equal Interval Classification for 1997 US County Population Data). The advantage of the equal interval classification method is that it creates a legend that is easy to interpret and present to a nontechnical audience. The primary disadvantage is that certain datasets will end up with most of the data values falling into only one or two classes, while few to no values will occupy the other classes. As you can see in Figure Equal Interval Classification for 1997 US County Population Data, almost all the counties are assigned to the first (yellow) bin.

Figure: Equal Interval Classification for 1997 US County Population Data
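The equal interval method can be sketched as below. The population values and the function names are hypothetical and serve only to show how a skewed dataset ends up concentrated in the first class; they are not the Census figures used in the chapter.

```python
def equal_interval_breaks(values, n_classes):
    """Return n_classes + 1 class boundaries spaced evenly over the data range."""
    low, high = min(values), max(values)
    width = (high - low) / n_classes
    return [low + i * width for i in range(n_classes + 1)]


def equal_interval_counts(values, n_classes):
    """Count the values falling into each equal interval class."""
    low, high = min(values), max(values)
    width = (high - low) / n_classes
    counts = [0] * n_classes
    for v in values:
        # The maximum value belongs to the last class, not a class beyond it
        index = min(int((v - low) / width), n_classes - 1)
        counts[index] += 1
    return counts


# Hypothetical, strongly skewed county populations
populations = [40, 1200, 5300, 18000, 52000, 250000, 1000040]

breaks = equal_interval_breaks(populations, 5)
counts = equal_interval_counts(populations, 5)
print(breaks)
print(counts)
```

Five of the seven hypothetical values land in the first class and two classes stay empty, which mirrors the disadvantage described above.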

The quantile classification method places equal numbers of observations into each class. This method is best for data that is evenly distributed across its range. Figure Quantiles shows the quantile classification method with five total classes. As there are 3,140 counties in the United States, each class in the quantile classification methodology will contain 628 different counties. The advantage to this method is that it often excels at emphasizing the relative position of the data values (i.e., which counties contain the top 20 percent of the US population). The primary disadvantage of the quantile classification methodology is that features placed within the same class can have wildly differing values, particularly if the data are not evenly distributed across its range. In addition, the opposite can also happen, whereby values with small range differences can be placed into different classes, suggesting a wider difference in the dataset than actually exists.

Figure: Quantiles
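A minimal sketch of the quantile method follows. The function name and the ten population values are hypothetical; when the number of observations does not divide evenly, this sketch simply gives the earliest classes one extra observation each, which is one possible convention among several.

```python
def quantile_classes(values, n_classes):
    """Split the sorted values into n_classes groups of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_classes)
    classes = []
    start = 0
    for i in range(n_classes):
        # The first `extra` classes absorb one leftover observation each
        end = start + size + (1 if i < extra else 0)
        classes.append(ordered[start:end])
        start = end
    return classes


# Hypothetical county populations
populations = [40, 1200, 5300, 18000, 52000, 250000, 1000040, 90, 700, 3000]

classes = quantile_classes(populations, 5)
for group in classes:
    print(len(group), group)
```

The last class groups 250,000 with 1,000,040 while the first separates 90 from 700, illustrating how equal counts can hide large differences within a class and exaggerate small ones between classes.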

The natural breaks (or Jenks) classification method utilizes an algorithm to group values in classes that are separated by distinct break points. This method is best used with data that is unevenly distributed but not skewed toward either end of the distribution. Figure Natural Breaks shows the natural breaks classification for the 1997 US county population density data. One potential disadvantage is that this method can create classes that contain widely varying number ranges. Accordingly, class 1 is characterized by a range of just over 150,000, while class 5 is characterized by a range of over 6,000,000. In cases like this, it is often useful to either tweak the classes following the classification effort or to change the labels to some ordinal scale such as small, medium, or large. The latter example, in particular, can result in a map that is more comprehensible to the viewer. A second disadvantage is the fact that it can be difficult to compare two or more maps created with the natural breaks classification method because the class ranges are so very specific to each dataset. In these cases, datasets that may not be overly disparate may appear so in the output graphic.

Figure: Natural Breaks
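The idea behind natural breaks can be illustrated with a brute-force sketch: try every way of cutting the sorted values into classes and keep the grouping with the smallest total squared deviation from the class means. This is an assumption-laden illustration of the objective only. It is practical for a handful of values, and it is not the optimized algorithm that GIS packages use; the function names and data are hypothetical.

```python
from itertools import combinations


def sum_squared_deviations(group):
    """Sum of squared deviations of a group's values from the group mean."""
    mean = sum(group) / len(group)
    return sum((v - mean) ** 2 for v in group)


def natural_breaks(values, n_classes):
    """Exhaustively search for the grouping with minimal within-class variation."""
    ordered = sorted(values)
    n = len(ordered)
    best_cost, best_groups = None, None
    # Choose n_classes - 1 cut positions between consecutive sorted values
    for cuts in combinations(range(1, n), n_classes - 1):
        bounds = (0, *cuts, n)
        groups = [ordered[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(sum_squared_deviations(g) for g in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups


# Hypothetical values with three visible clusters
values = [1, 2, 3, 10, 11, 12, 50, 52]
groups = natural_breaks(values, 3)
print(groups)
```

The resulting classes have very different widths (2, 2, and 2 units wide, but separated by gaps of 7 and 38), which is the property that makes natural breaks legends harder to compare across maps.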

Finally, the standard deviation classification method forms each class by adding and subtracting the standard deviation from the mean of the dataset. The method is best suited to be used with data that conforms to a normal distribution. In the county population example, the class breaks are derived from the mean and standard deviation of the county populations. Therefore, as can be seen in the legend of Figure Standard Deviation, the central class contains values within a standard deviation of the mean, while the upper and lower classes contain values that are progressively more standard deviations above or below the mean, respectively.

Figure: Standard Deviation
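One way to sketch the standard deviation method is to place class breaks at whole multiples of the standard deviation on either side of the mean. The choice of breaks at 1 and 2 standard deviations, the use of the population standard deviation, and the sample values are all assumptions made for illustration; GIS packages offer other break spacings.

```python
import bisect
import statistics


def std_dev_breaks(values, max_k=2):
    """Class breaks at mean +/- 1..max_k standard deviations (no break at the mean)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    multiples = [k for k in range(-max_k, max_k + 1) if k != 0]
    return [mean + k * sd for k in multiples]


# Hypothetical values with mean 75 and population standard deviation 9
values = [62, 70, 74, 76, 76, 92]

breaks = std_dev_breaks(values)
# Class 0 lies below the lowest break; class 2 is the central class
# (within one standard deviation of the mean); class 4 lies above the highest break
classes = [bisect.bisect_right(breaks, v) for v in values]

print(breaks)
print(classes)
```

Because there is no break at the mean itself, the central class spans one standard deviation on each side of the mean, matching the description of the central class above.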

In conclusion, there are several viable data classification methodologies that can be applied to choropleth maps. Although other methods are available (e.g., equal area, optimal), those outlined here represent the most commonly used and widely available. Each of these methods presents the data in a different fashion and highlights different aspects of the trends in the dataset. Indeed, the classification methodology, as well as the number of classes utilized, can result in very widely varying interpretations of the dataset. It is incumbent upon you, the cartographer, to select the method that best suits the needs of the study and presents the data in as meaningful and transparent a way as possible.

KEY TAKEAWAYS

Choropleth maps are thematic maps shaded with graduated colors to represent some statistical variable of interest.
Four methods for classifying data presented here include equal intervals, quantile, natural breaks, and standard deviation. These methods convey certain advantages and disadvantages when visualizing a variable of interest.

EXERCISES

1. Given the choropleth maps presented in this chapter, which do you feel best represents the dataset? Why?
2. Go online and describe two other data classification methods available to GIS users.
3. For the table of thirty data values created in Section Descriptions and Summaries, Exercise 1, determine the data ranges for each class as if you were creating both equal interval and quantile classification schemes.

Monmonier, M. 1991. How to Lie with Maps. Chicago: University of Chicago Press.