Section 6 Methods
The following section details how to use the data and R code provided as well as an explanation of how the code works.
6.1 Reviewing the Parameters File
This file allows for alterations in the data analysis that one may want to customize for their study, such as geography of interest, year, agencies of interest, and so forth. If any changes to the code are necessary in order to customize the output results, the only place users will need to make changes is in this parameters file. The default entries in the parameters file are based on the 2021 California study to provide an example of how to enter the desired variables.
6.1.1 Sections of the Parameters File
Each section of the parameters file defines variables used in a specific portion of the code. This section details the purpose of these variables and their intended use. The README.md file contains correct syntax to use when making alterations to avoid breaking the code’s functionality. Please refer to both sections as needed while altering the parameters file.
6.1.1.1 General Global Variables
This section contains general variables pertaining to the year the study takes place and the state the study focuses on.
- f_year: This variable represents the fiscal year, or the timeframe during which data was generated. Often, the year that the study data comes from is older than the year the study is performed. This value is important to specify to ensure that 1) the data exists, and 2) that you are pulling the correct data.
- year: This variable represents the year that the study is performed in. Its main purpose is to ensure a naming convention is enforced for generated output files.
- state: This variable represents the state being analyzed. It is used to filter some data and to aid in enforcing naming conventions for generated output files.
6.1.1.2 User Generated Employment Files
This section contains the variables needed in order to generate user specific employment data pertaining to the study region.
- res_mult: This variable represents the multiplier required to calculate the number of equivalent full-time positions generated by members of the military reserves. In anticipation that this value may change over time, it is included in the parameters file.
- national_sus_dhs: This variable represents the number of Homeland Security employees that are “suppressed.” This means that the precise location of their work is intentionally redacted for matters of national security. This number is updated quarterly on FedScope - for more details see Section 3.1.
- sus_dhs_mult: This variable represents the percentage of suppressed employees in each state. This multiplier comes from 2016 data. End users should make their own approximations of this value if they are focused on a study region smaller than the state level.
- acs: This variable represents the file name used for downloaded American Community Survey (ACS) data used in calculating how military personnel are distributed throughout the study region (by county and congressional district).
- dod_shares: This variable represents the file name used for the DoD County shares file. In the event the user needs to use a different file than the one provided; the name of the new file can be altered here.
- fed_prop: This variable represents the file name of the document that stores information on federally owned buildings. The square footage of these buildings is used to estimate the proportion of federal employees in each agency per county and congressional district in the study area.
6.1.1.3 USAspending.gov API Variables
This section contains the variables needed to run the code to automate obtaining federal spending data from the USAspending.gov website API. USAspending.gov provides additional documentation on how the USAspending API runs and its variable requirements.
- agency_type: This variable represents whether the agency or agencies in question awarded the contract money or funded it. USAspending.gov provides more information on agency type definitions.
- agency_tier: This variable represents which tier the agency resides in. An example of a top tier agency is the Department of Energy, and subtier of a top tier agency is the Missile Defense Agency (a subtier of the Department of Energy).
- agency_name: This variable represents the name of the agency or agencies targeted in the code.
- tier_name: This is an optional variable, representing the name of the agency only if it is a subtier agency.
- date_type: This variable represents the date type of the spending data. Whether the date range will focus on when the money was spent (action date) or the last time information was entered about a contract (last modified date).
- date_range_start: This variable represents the start date of the spending data being analyzed. It is usually the beginning of a fiscal year or quarter.
- date_range_end: This variable represents the end date of the spending data being analyzed. It is usually the end of a fiscal year or quarter.
- awards: This variable represents the award type of the spending data. It includes categories such as contract spending, grant award spending, Veterans Affairs disbursements and direct payments to individuals or businesses.
- recipient_locations_country: This variable represents the country to limit spending data to. USA needs to be specified to omit spending and resulting economic activity that occurs overseas.
- recipient_locations_state: This variable represents the state (if applicable) to limit the spending data to. Multiple states can be specified if put into a list format.
- recipient_locations_county: This optional variable represents the county(ies) to limit the spending data to, if further refinement within a state is desired. This variable can contain multiple counties for multiple states. However, this variable is mutually exclusive with the district variable. One or the other may be filled out, but NOT BOTH.
- recipient_locations_district: This variable represents the congressional district(s) to limit the spending data to. This variable can contain multiple districts for multiple states (although you may grab more data than intended if the same district name is used in multiple states). Data is usually more accurate at the county level, and district data is often left blank where there are multiple recipients in a county that overlaps multiple districts. As a result, county-based analysis may be preferable.
6.1.1.4 Filter USAspending Variables
This section contains the variables required to filter the downloaded USAspending data.
- doe: This variable represents the syntax used in USAspending data for naming the Department of Energy (DOE). It is used to help filter and analyze DOE data.
- contract_columns: This variable represents the column headers of contract spending data to filter.
- grant_columns: This variable represents the column headers of grant data to filter.
- c_label: This variable represents a section of the name generated when the USAspending contracts data is automatically downloaded. It allows users to specify which unique portion of the file name the code searches for in order to simplify loading in the correct file for further processing.
- g_label: This variable represents a section of the name generated when the USAspending grants data is automatically downloaded. It allows users to specify which unique portion of the file name the code searches for in order to simplify loading in the correct file for further processing.
- c_out_name: This variable represents the desired output name of the filtered contracts data.
- g_out_name: This variable represents the desired output name of the filtered grants data.
6.1.1.5 Concatenate USAspending Variables
This section contains the variable required to run the code to merge the separate contracts and grants data into one file. This is to make subsequent portions of the code easier to run.
- u_out_name: This variable represents the desired output name of the concatenated USAspending data.
6.1.1.6 National Security DOE Spending Variables
This section contains the variable required to calculate which portion of federal DOE spending applies to national security.
- doe_offices: This variable represents the list of DOE sub tier agencies that receive national security funding.
6.1.1.7 Aggregate USAspending Variables
This section contains the variables required to aggregate the different spending data to prepare it for entry into Excel sheets and upload into IMPLAN.
- u_state_outname: This variable represents the desired output name of the file containing USAspending data that has been aggregated by IMPLAN code.
- doe_state_outname: This variable represents the desired output name of the file containing DOE data that has been aggregated by IMPLAN code.
6.2 Reviewing User Specific Files
After modifying the parameters file and before running any code, make sure you have obtained, modified and renamed the following data as per instructions in Section 4.2. If these files are not properly named and located, the CODE WILL NOT RUN:
- TEMPLATE_emp.csv: Should be renamed based on the “state” parameter and saved to the data/raw/ folder.
- Federal Real Public Property Data: Should be renamed based on naming conventions and saved to the data/temp/ folder.
- American Community Survey Data: Should be renamed based on naming conventions and saved to the data/temp/ folder.
6.3 Processing the Data
After set-up is complete, code can be run to process data for loading into IMPLAN. Refer to the deprecated_run_analysis_master.R script to run through the data analysis process for this project.
6.3.1 Clearing Environment, and Loading in Packages and Parameters
Lines 1 through 17 in the deprecated master code serve to set up RStudio to process the federal data. This involves performing some housekeeping by clearing any variables already stored in the global environment, removing previously loaded packages, ensuring needed library packages are present and loading in the variables stored in the parameters file.
6.3.2 Loading in Functions
Lines 20-23 load in functions that standardize and simplify working through the data. Details about what purposes these functions provide and how they alter data will be described below in the relevant sections.
6.3.3 Obtaining the USAspending Data
Line 26 loads in a R script that performs a call to USAspending.gov’s API to grab relevant contract, grant and direct payment spending data. The success of this script depends on correctly filling out the desired filter variables in the parameters file. For our study, we required data for the following awarding, top-tier agencies: Department of Defense, Department of Homeland Security, Veterans Affairs and Department of Energy, for the action date date-type from the 2020 fiscal year (October 1st, 2019 – September 30th, 2020) in the state of California in the United States. The obtain USAspending script downloads and unzips this data to the data/temp/ folder in the repository. The end output is in the form of two .csv files, one for contracts data and one for grant/direct payment data.
6.3.4 Filtering the USAspending Data
Lines 29-34 load in the data from USAspending.gov and use the filter_usaspending function to filter it. This serves to reduce the file size in order to speed up subsequent processing. Only data that was needed for this analysis was kept. All data was paired down to federal national security spending that occurred within the State of California using the primary_place_of_performance column.
For contracts data the following columns were kept: federal_action_obligation, awarding_agency_name, awarding_sub_agency_name, award_description, funding_office_name, recipient_name, recipient_county_name, recipient_congressional_district, recipient_zip_4_code and naics_code.
For grant/direct payment data the following columns were kept: federal_action_obligation, awarding_agency_name, awarding_sub_agency_name, award_description, funding_office_name, recipient_name, recipient_county_name, recipient_congressional_district, recipient_zip_code, recipient_zip_last_4_code, assistance_type_code and business_types_description.
6.3.5 Error checking the USAspending Data
Lines 37 and 38 begin checking for errors in the filtered USAspending data and modifying them to ensure they have an IMPLAN code for every entry. The only major difference between contracts and grants data is the variable referenced in fixing errors. Contracts data uses NAICS codes. Grants data uses business types.
6.3.5.1 Error Checking Contracts
Line 37 loads in a R script file that goes through the filtered USAspending contracts and remediates errors in certain contract entries. The first lines in the script load two files into dataframes: 1) the filtered USAspending contracts file (contracts); and 2) a NAICS to IMPLAN crosswalk provided by IMPLAN (naics_to_implan). There are several code fixes and removal of duplicate codes that our team determined, to the best of our knowledge, to be the most accurate way to relate a NAICS code to an IMPLAN code. After this, the crosswalk is merged into the contracts dataframe by NAICS code in order to get an IMPLAN code for each contract entry.
However, this merge does not give all contract entries an IMPLAN code. This can be for a variety of reasons, including: 1) a contract entry from USAspending not having a NAICS code; and 2) an incorrect and/or outdated NAICS code that was not part of IMPLAN’s provided crosswalk. The lines of code following the merge work to fix the contract entry errors. First, an IMPLAN code was hardcoded to the contracts with no NAICS code based on the contract recipient’s name and industry sector. Next, the contracts which had a mistyped or older NAICS code were pulled out to a new dataframe (contracts_missing_implan) and hardcoded to an IMPLAN code based on our team’s best research into which IMPLAN code each entry would best fit into. This presents a degree of error when parsing out the contract spending, but also represents the best available workaround given time and information constraints. A new way of handling the error checks for contracts is detailed in Section 8, “What’s Next?”
The final steps of this script involve dropping the contract entries with no IMPLAN code from the original contracts dataframe. Then, the dataframe which fixed the contracts missing IMPLAN codes (contracts_missing_implan) is merged back into the original contracts dataframe. The contracts dataframe now has all contract entries cleaned, with an IMPLAN code for each entry. Last, only the columns necessary for the analysis moving forward are kept, and then the cleaned contracts dataframe is written into a new CSV file in the data/temp/ folder.
6.3.5.2 Error Checking Grants/Direct Payments
Line 38 loads in a R script file that goes through the filtered USAspending grants and direct payments data to remediate issues in certain entries. First, the script loads in the filtered USAspending grants and direct payments file into a dataframe (grants). Next, Veterans Affairs (VA) direct payments are separated from the grants dataframe and placed into their own dataframe (va_benefits). This is done because VA direct payments are calculated separately from contracts and grants in the IMPLAN activity sheets. The grants dataframe filters out the direct payments based on the assistance_code column - entries with an assistance code of 10 are the VA direct payment entries.
After separating direct payments from grants, the next step is to load a business type to IMPLAN crosswalk into a dataframe (business_to_implan). This file was created internally as the best method for relating grants data to IMPLAN codes, as USAspending grants data did not have NAICS codes. The crosswalk is then merged with the grants dataframe in order to get an IMPLAN code for each grant entry.
However, this merge does not give all grant entries an IMPLAN code. In the case of the grants data, they may be missing a business type value, or their business type may be one that is not captured in the crosswalk. Thus, the lines after the merge hardcode IMPLAN codes into the grants dataframe based on the grant recipient’s name and the industry sector. After these hardcodes, a second dataframe (grants_missing_implan) is defined to capture any other grant entries missing an IMPLAN code. If a grant entry is still missing an IMPLAN code, it will need to be manually fixed in the grants_missing_implan dataframe and then merged back into the original grants dataframe.
Now that the grants data has been cleaned, the final steps in this script involve selecting only the necessary columns of data for the va_benefits and grants dataframes and writing each one into their own CSV files in the data/temp/ folder.
6.3.6 Manually Fixing Congressional District Errors
Lines 40 and 42 provide an important notice to manually fix some errors in the three files produced in the error checking step before running additional lines of code. This step is essential to ensuring a reduction of errors further along in the analysis. This step is needed because there are certain entries in all three CSV files where the recipient_congressional_district column has a district value of “NA” (contracts) or “90” (grants and direct payments). There are multiple reasons for this error, with the most common being that these data entries are in counties that span across multiple congressional districts.
In order to properly remediate these issues, our team utilized the dod_county_shares Excel file in the data/raw/ folder. Looking at this file, the “Districts” tab in that Excel sheet provides a breakdown from the California Redistricting Commission of how much a county overlaps one or more congressional districts based on land area. Then the contract, grant, or direct payment CSV files’ entries are filtered by county. These entries are randomly assigned to a district based on the percentage of their county that resides in each district. For example, if 20 contract entries without a district assignment were in Alameda County, 47% of those entries (~9) would be assigned to CA-13, 42% (~8) assigned to CA-15, and 12% (~3) assigned to CA-17.
This method inherently presents a degree of error with allocating spending across congressional districts. Given time constraints and data limitations, this workaround presented itself as the best solution at the time. Our team is working on a new and improved way to address these manual fixes - for more details, refer to Section 8, “What’s Next?”
6.3.7 Concatenating the USAspending Data
Lines 45-46 concatenate the cleaned USAspending contracts and grants data. The concatenate function works to bring together the cleaned USAspending contracts and grants files, while omitting the VA benefits file from that grouping. This concatenation step is necessary to eventually aggregate the contracts and grants spending by IMPLAN code for the activity sheets. Line 46 takes this concatenated USAspending contracts and grants dataframe (concat_files) and writes it into a CSV in the data/temp/ folder.
6.3.8 Splitting DOE from USAspending Data
Lines 49-52 separate the DOE spending from the main USAspending data. This is done because, to this point, our report has kept DOE data separate from the main analysis with the Departments of Defense, Homeland Security, and Veterans Affairs. Line 49 reads in the concatenated USAspending file that was generated from the previous step as a variable. Lines 51 and 52 perform the split_usaspending function on the concatenated data and define two new dataframes: usaspending and doespending. The usaspending dataframe will be used for the main analysis of this project, while the doespending dataframe will need some alterations prior to becoming a separate addition to the analysis of this project.
6.3.9 Filtering DOE to National Security-Related Data
Line 55 filters the doespending dataframe to only include national security-related activity. This step is important because, unlike the Departments of Defense, Homeland Security, and Veterans Affairs, only a subset of DOE activity is related to national security. A list of these sub-tier agencies can be found under the “doe_offices” variable in the parameters file. This new dataframe provides the DOE national security spending which will be used in IMPLAN. Additionally, a doe_ns_adjustment is calculated by dividing the sum of federal_action_obligation in doe_ns_spending by the sum of federal_action_obligation in the doe_spending dataframe. This adjustment calculation provides a rough proportion for how much DOE activity in the state is national security related. This value is used to assign statewide DOE employees and SmartPay accordingly.
6.3.10 Aggregating the Spending Data
Lines 59 and 60 perform the statewide_aggregate function for USAspending and DOE spending, pulling together all the spending data based on IMPLAN code. The CSV files which are written as a result of running the function provide the ‘statewide spending data by IMPLAN sector’ for our main analysis and DOE analysis.
Lines 62 through 64 do the simple calculation of aggregating the VA benefits data for the state (line 62), each county (line 63), and congressional district (line 64). The VA benefits figure from line 62 will go in the IMPLAN activity sheet with the statewide main analysis (i.e., the data and CSV from line 59), while the VA benefits data from lines 63 and 64 will figure into later code.
6.3.11 Compiling Employment Data
Line 67 loads in a R script that calculates the necessary employment data. All prior steps focused on preparing spending data for IMPLAN – this is the only step necessary for preparing employment data. The first couple of lines in this script read the state employment CSV (detailed in Section 4.2) into a dataframe, state_emp, and define values for statewide military, civilian, Department of Defense, Department of Homeland Security, Veterans Affairs, and Department of Energy employment. From that point on, the code works to apportion the statewide employment numbers for each county and congressional district. Certain employment types have different methods and files needed in order to do this local apportionment.
ACS data is used for military employment calculations. We assume that the distribution of residents employed in the armed forces across counties and districts is a good proxy for estimating the location of their employment. The ACS data file is read into a dataframe (acs_data) and two columns are added. The first column takes the number of employed armed forces in a county/district from the ACS and divides it by the statewide employed armed forces value to obtain a percentage, which we name armed_forces_percent. The second column, military_personnel, takes armed_forces_percent and multiplies it by the statewide military employment value (obtained from state_emp). Then, the geography and military_personnel columns are retained, and the county and district military employment numbers are saved into separate dataframes.
Next, the dod_county_shares.xlsx file (detailed in Section 5.2.1) is used to apportion Department of Defense (DoD) civilian employees. ACS data does not account for the high concentration of DoD employees in Lassen County due to the presence of the Sierra Army Depot. Our self-created file, which is based on apportionment across counties from 2016 DoD employment numbers, does account for this, and is therefore used in place of the ACS armed forces apportionment. The dod_county_shares.xlsx file has two sheets, one for counties and one for districts. Each sheet is read into a dataframe to apportion the counties and districts, respectively. The counties dataframe (dod_county) apportions the statewide DoD civilian employees by the percent share that each county had for DoD employees in 2016. The districts dataframe (dod_district) is based off the counties dataframe but uses the percentage of county land area that lies within a district to distribute the DoD employees per county for each district. This distribution from county to district is then aggregated by district in order to get the DoD employees by district. Then, the two DoD dataframes (dod_county and dod_district) retain their respective geography column and DoD employment totals to complete the DoD apportionment.
GSA federal property data (detailed in Section 3.3.2) is used to apportion Department of Homeland Security (DHS) and Veterans Affairs (VA) employment. The square footage of DHS and VA buildings across counties and districts in the state can provide a rough estimate for apportioning the statewide employment for those agencies. First, federal property data is read into a dataframe (fedprop), necessary columns are selected, and then appropriate filters are applied for property type and departments of interest. After this initial read in, a new dataframe is created just for VA properties (va_fedprop). A new column for employees is calculated by multiplying the statewide VA employment by the percentage of that region’s building square footage divided by the state total. Then, two new dataframes are made to aggregate employees by county (va_county) and district (va_district). This entire process is then recreated for DHS, with three respective dataframes being made for the DHS calculations (dhs_fedprop, dhs_county, and dhs_district).
Next, the county_emp and district_emp dataframes are created. These dataframes will be used in the final step to help allocate the county and district employment for the IMPLAN activity sheets. Each dataframe merges together the four employment dataframes developed for each respective geography from the above steps (DHS, VA, DoD, and military) and ensures that all the numbers in the dataframes are read as numeric rather than character strings. The final step for these two dataframes is to add in columns of implan_545 (which is just military employment) and implan_546 (which is the sum of DoD, DHS, and VA employment). For the county_emp dataframe, two additional columns are needed: inverse_545 (the statewide military employees minus a county’s military employees) and inverse_546 (the statewide DoD, DHS, and VA employees minus a county’s DoD, DHS, and VA employees). These new columns in these dataframes will be read in the next section and fill in the employment data for IMPLAN.
6.3.12 Running For Loops to Generate IMPLAN Activity Sheets
Line 70 loads in a R script that holds the entire process for generating IMPLAN activity sheets. The first lines of code in this script read in the individual Excel sheets from IMPLAN’s activity sheet template and assigns them to a variable. These variables will be combined later in the process to generate one Excel file (the activity sheet for a given local geography), with each variable becoming its own sheet. After that, two variables are generated (countynames and congressid) by calling for a unique list of all county names and all congressional district numbers in the usaspending dataframe. From there, the localized geographies IMPLAN activity sheets are generated.
The first for loop code fills out and generates the county and inverse county IMPLAN activity sheets by iterating through the usaspending, va_benefits_countiesagg, and county_emp dataframes. Within the first portion of this loop, the usaspending dataframe is combined based on each county, and then aggregated by IMPLAN code. This data is combined with the employment data from county_emp, which is also combined based on each county. All of this data is stored into a variable, temp, which, along with some lines of code that format temp, constitute the first Excel sheet tab of the multi-tab IMPLAN activity sheet. Additionally, the HouseholdSpendingChange4 variable is altered to input the respective county’s va_benefits_countiesagg value. Then, a list (templist) is created, composed of the temp variable and all the Excel sheets that were read in. An output path and folder are defined, and the county activity sheet is written into this folder. The second portion of this loop mimics this entire process above but grabs the inverse of each county’s data. In other words, the tempooc variable and everything that corresponds to it is based on all the spending and employment activity done outside of the county of interest. This data also gets written into an Excel file, which is named by the county followed by “in” to denote that it is a county’s inverse IMPLAN activity sheet. For our analysis, this for loop code generates 116 total county IMPLAN activity sheets: 58 normal and 58 inverse.
The second for loop fills out and generates the district IMPLAN activity sheets by iterating through usaspending, as well as the va_benefits_districtsagg and district_emp dataframes. Due to limitations with IMPLAN’s software, the district inverse models are not run; hence, the district inverse activity sheets are not needed. This for loop follows the exact same process as the first portion of the county for loop – the only difference being that the data is aggregated by district, and that different variable names are utilized (i.e., temp2). This for loop does denote a separate output folder to store all the district IMPLAN activity sheets, and names each Excel file based on “CA-” followed by the district number. For our analysis, this for loop code generates 53 total district IMPLAN activity sheets.
6.4 Using IMPLAN
At this point, the activity sheets for the desired local geographies have been generated and are ready to be run through IMPLAN. Note that the data for the statewide main analysis and statewide DOE analysis needs to be written to a file, and copied over into IMPLAN’s activity sheet template, in order to run the statewide analysis. For more details on how to run these activity sheets through IMPLAN, refer to their guide on how to use the activity sheets in their software.