Download Data Processing for GIS: A Spreadsheet Tutorial and more Study Guides, Projects, Research MS Microsoft Excel skills in PDF only on Docsity!
Basic Spreadsheet Data Processing
S4 Summer 2021 GIS Institute
Frank Donnelly, GIS & Data Librarian
Introduction
This tutorial introduces a selection of basic data processing tasks with spreadsheets, geared towards GIS users. Data tables must follow strict formatting rules in order to be used in GIS, and a substantive amount of time is often needed for processing attribute data in preparation for mapping and analy- sis. Spreadsheets are powerful and accessible tools that are practical for processing data for small projects. These examples were written using MS Excel 2019, but can be replicated in other pack- ages. Basic spreadsheet knowledge (selecting columns and rows, copying and pasting, saving files) is presumed. Before we begin, download the tutorial data file data_processing_tutorial.zip from the Canvas GIS Institute page, save it on your computer, and unzip the file.
Conventions used in this tutorial:
- Names of windows, tabs, and tools appear in italic face.
- Names of folders, files, and columns appear in typewriter face.
- Button-clicking steps are chained together as: Tab on Ribbon - Button - Menu Option
In this tutorial we will type formulas (also called functions) directly into cells. For help with formulas, you can insert them using the fx function button located above the cell block.
Notes on Sample Data
The sample data is a global spreadsheet of copper smelters created by USGS researchers in 2003 (metadata is included with the spreadsheet). This dataset was selected because it presents a number of common data processing challenges; like many spreadsheets it is formatted for display rather than data analysis. We will reformat this data with the intention of plotting the smelters in GIS using their coordinates, and summarizing it by country to join it to a country shapefile for thematic mapping. Additional data includes a country boundary file from Natural Earth https://www.naturalearthdata.com/, and a CSV file of ISO 3166 country codes.
1 Copy - Paste Special - Values
Never work directly in the original file, in case you need to refer back to it. Make a copy; you could copy the entire file, but in this case we will combine this with an additional step of removing
formatting. The Copy - Paste Special - Values process is one you will use over and over again. Use it to eliminate formatting and to replace formulas with the actual values output from the formula.
- Navigate into the data_processing_tutorial - smelter folders, and open the CSTable.xls file. Notice how the Copper Smelters worksheet is formatted for display purposes.
- Select all the columns in this sheet, right click, and choose Copy. Then go to File - New and open a new, blank workbook. Click in cell A1, right click, and under Paste hit the Values button (little clipboard with 123). Alternatively, you can click on the Paste Special text, and choose the values radio button and hit OK.
- Go to File - Save As and save the file in the data_processing_tutorial - smelter as copper_smelters.xlsx in the contemporary Excel format .xlsx.
2 Remove Headers and Footers
Data tables can only have a single header row, and cannot have footnotes or subtotals.
- Delete Row 2 in-between the column header row and the first value row. Then scroll to the bottom of the worksheet and delete the two rows of footnotes. Save the workbook.
3 Making Values Explicit
The names of the countries in the sheet appear once on their own row as subheadings, and the rows that follow represent smelters in that country. In a data table, each row must represent a unit or record of data. We need to explicitly associate country names with each individual row, and eliminate the country subheader rows.
- Select Column A. On Home - Find and Select choose Go to Special, and click the radio button for Blanks, and hit OK. Cell A3 will be highlighted - do not click on the cell, just type: =A2 without hitting enter. Then, hold down the CTRL key and hit enter. The blank cells will autofill with the country name above it.
Columns. Choose Delimited, and in the next window check Comma and nothing else. Click Finish. This leaves Locations (with and without feature Names) in Column B and states (with and without feature names) in Column C, split apart based on the location of the comma.
- Insert a new column to the right of Column B. Select Column B - Locations, Data - Text to Columns, Delimited, uncheck Comma and check Other and in the Other box type an open parentheses (. then click Finish. Now we have just the Locations in Column B, and if there was a feature name it’s now in Column C.
- Insert a new column to the right of Column D. Select Column D - States, Data - Text to Columns, Delimited, Other with an open parentheses ( then Finish. Now we have just the States in Column D, and if there was a feature name it’s now in Column E.
- Insert a new column to the right of Column E. In this column we will combine the two separate feature name columns. In Cell F2 type =C2&E2. Copy and paste this formula all the way down. Then do a Copy - Paste Special - Values on Column F. Delete Column E, then delete Column C. Save your work.
- Let’s clean up the feature column. Select Column D, go to Home - Find and Select - Replace. Type a closing parentheses ) in the Find box and hit Replace All. This removes the parentheses and replaces it with nothing.
- Many of our name values have trailing or leading spaces. Insert three empty columns to the right of Column D. In Cell E2 type =TRIM(B2). Copy this formula to the two adjacent cells in Columns E and F. Then copy the formula for these three columns all the way down the sheet. Do a Copy - Paste Special - Values on all three columns. Delete Columns B,C,D. Save your work.
5 Naming Columns
Tables that are being imported into GIS or a database cannot have column names that are: too long, contain spaces, contain punctuation (except underscores), or begin with numbers. Some like to
use camel case (PctTotal) to distinguish different parts of the name, while others prefer snake case (pct_total). Use common sense abbreviations for values (pct, avg, tot) and units (km, vol, sqmi).
- Name your columns from left to right as follows: country, location, state, smelt_name, latitude, longitude, facility, commodity, company, status, capacity_mt, process.
6 Splitting Text with Formulas
Formulas can also be used for splitting strings of text into distinct values. For example, the process column contains footnotes in parentheses. In GIS or a database, this would prevent us from summarizing data into categories using this field, as a single category would be split into two (one with and without a footnote). The following example illustrates how to use substring formulas for this type of work. We will strip the process name out of the cell without the footnote.
- In Cell M6 type =LEFT(L6,13). This reads from the left of the cell and returns the first 13 characters, so the string Reverberatory is extracted to exclude the footnote (S).
- Replace the formula you just typed with this one: =FIND("(",L6). This finds the first occurrence of the open parentheses and returns its position in the string.
- Replace this formula with: =LEFT(L6,FIND("(",L6)-2). This reads from the left of the cell, and for the character position we use the location of the first open parentheses, minus two positions (to avoid returning the parentheses and the leading space in the result).
- Take this formula and copy / paste it into Cell M5 directly above it. The formula returns a value error, because the string in Cell L5 does not contain a parentheses.
- Replace the formula in Cell M5 with this one: =IFERROR(LEFT(L5,FIND("(",L5)-2),L5). This says, if there is no error return the string before the parentheses. Otherwise, if an error is returned (because there is no parentheses), then simply return the full value that’s in that cell.
- Delete the existing formulas you created in the last steps. Enter this formula into cell Cell M2: =IFERROR(LEFT(L2,FIND("(",L2)-2),L2). Copy and paste it down the column. Select the column and do a Copy - Paste Special - Values. Delete Column L which is the process column with the footnotes. Rename the new process column without the footnotes in Column L: process. Save your work.
- Tab back to Sheet1. Insert a new column to the left of Column C Location. In Cell B2 type this formula: =VLOOKUP(C2,Sheet2!A:C,3,FALSE). This takes the Cell C country value and attempts to find that value in the first column (implied) in Sheet2. The 2nd parameter is the location of the full range of cells that the formula will look. If it finds the same value in the first column, it moves over to column 3 (specified in the 3rd parameter) and returns whatever is in that row. FALSE indicates that the match must be an exact match, not an approximate one. Copy and paste this formula down the column, then select the column and do a Copy - Paste Special - Values. Name the column ccode.
- Some values did not return a match because of differences in the way country names were spelled in each table. Manually fix records that didn’t match. Iran - IRN, Korea, North
- PRK, Russia - RUS, Serbia and Montenegro - SRB, United States
- USA. Change the name for Serbia and Montenegro to Serbia (since this dataset was created these countries have split into two separate ones, and the smelter is in Serbia). Save your work.
9 Pivot Tables
The smelter data is now in a well formed state, and the coordinates in this file can be used to plot smelter locations. What if we wanted a summary table by country, so we could potentially join this table to a shapefile and map it? A pivot table makes this possible.
- While in Sheet1 go to Insert - Pivot Table. Click OK on the pivot window. In the Pivot Table Fields window on the right, choose ccode to add it as a row. Drag location and then capacity_mt into the Values section. It will assume to count the former and sum the latter (keep these defaults, but you could change them by hitting the drop down for each). Drag
status under the Filters section. In the table on the left, change the dropdown for status to Operating to remove smelters that are not open. Save your work.
- Copy Columns A to C in the pivot table. Hit the plus symbol at the bottom of the sheet to add a new sheet, Sheet4. Move into this sheet, click in Cell A1 and do a Paste Special - Values. Delete the first two rows, then delete the last Grand total row. Rename the column headings: ccode, cu_smelters,capacity_mt. Save your work.
10 Final Steps
- Click on each sheet name and rename them: Sheet1: smelters, Sheet2: ccodes, Sheet3: pivot, Sheet4: country_smelt. Save your work.
- Excel files can be added directly to GIS packages. When you add a file it will prompt you for an individual worksheet. The country_smelt sheet can be added as a table and joined to a shapefile of countries using the ccode. Try doing with this with the country layer from Natural Earth, saved in the nat_earth_countries folder.
- The smelters data can be plotted as XY data using its longitude and latitude coordinates to create a point file. Most GIS packages require XY data to be in a CSV format. Go into the smelters sheet and choose File - Save As. Save it in the project folder as a CSV UTF-8 file (this will preserve diacritics in country names). It will save just the active sheet as a CSV. In a GIS package, when plotting specify longitude as the X coordinate and latitude as the Y coordinate. After plotting, transform it to a spatial data file by exporting it as a shapefile or feature class (otherwise your ability to manipulate the file will be limited).