Where Does Data Come From

CSI-MTH-190

Schwab

Data is collected by

Polling Organizations

example: YouGov and The Economist collect data on the most important issues facing the US

This data, or a summary of it, is often made public.

Public Researchers

example: Nihart et al. find plastics in brain tissue

This data, or a summary of it, is often made public.

And collected by

Companies

example: Meta collects data on its users

This data, or a summary of it, is not usually made public.

Governments

example: The Bureau of Labor Statistics collects and publishes employment figures

This data, or a summary of it, is often made public.

We will not be collecting Data

That task is for a methods course (Psychology has such a course). When you collect data you need to do it in a thoughtful way, which requires a fair amount of training.

We will be using Data

We want to use data to glean information.

Import Data

We first have to import data into whatever software we are using to analyze it.

We’ll import using the first three methods below in this lecture, and we’ll see the others as the semester progresses.

  1. read_csv()
  2. read_sheet()
  3. packages
  4. APIs
  5. Scraping from websites

Importing a .csv file.

A .csv file is a special type of text document. csv is short for “comma separated values.” Essentially, it’s a text file with commas separating the data within the file.

There are other ways of separating data (such as with tabs).

The tidyverse has a function called read_csv() which will take a .csv file and read it into memory so you can use it.
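To make the idea of "text with separators" concrete, here is a minimal sketch. The file contents and the names csv_path, tsv_path, toy_csv, and toy_tsv are invented for illustration and are not part of the course data. It writes a tiny file, shows the raw text, and then reads it with read_csv(); the tab-separated version is read with read_tsv(), which also comes with the tidyverse.

```r
library(tidyverse)

# Write a tiny comma-separated file to a temporary location.
# These values are made up for illustration only.
csv_path = tempfile(fileext = ".csv")
writeLines(c("name,score", "Ann,90", "Bob,85"), csv_path)

# Look at the raw text: one header line, then one line per row.
readLines(csv_path)

# read_csv() turns that text into a tibble with two columns.
toy_csv = read_csv(csv_path)
toy_csv

# The same data separated by tabs instead of commas.
tsv_path = tempfile(fileext = ".tsv")
writeLines(c("name\tscore", "Ann\t90", "Bob\t85"), tsv_path)

# read_tsv() is the tab-separated counterpart of read_csv().
toy_tsv = read_tsv(tsv_path)
toy_tsv
```

Both calls should give the same two-row table; only the separator in the underlying text file differs.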

Importing a google sheets file

Google Sheets are also a very common way of storing data. There is a package called googlesheets4 that has the function read_sheet() for reading in data from a Google Sheet.

Importing from a package

Packages like the tidyverse already have data built in. These data are often intended for educational purposes and are sometimes very out of date.

We’ll import from a package first, as it is the easiest.

Example 1.

Let’s import txhousing with the tidyverse. This data is loaded automatically when we load the tidyverse.

# first load the tidyverse
library(tidyverse)
# next call the data with txhousing to view it.
txhousing
# A tibble: 8,602 × 9
   city     year month sales   volume median listings inventory  date
   <chr>   <int> <int> <dbl>    <dbl>  <dbl>    <dbl>     <dbl> <dbl>
 1 Abilene  2000     1    72  5380000  71400      701       6.3 2000 
 2 Abilene  2000     2    98  6505000  58700      746       6.6 2000.
 3 Abilene  2000     3   130  9285000  58100      784       6.8 2000.
 4 Abilene  2000     4    98  9730000  68600      785       6.9 2000.
 5 Abilene  2000     5   141 10590000  67300      794       6.8 2000.
 6 Abilene  2000     6   156 13910000  66900      780       6.6 2000.
 7 Abilene  2000     7   152 12635000  73500      742       6.2 2000.
 8 Abilene  2000     8   131 10710000  75000      765       6.4 2001.
 9 Abilene  2000     9   104  7615000  64500      771       6.5 2001.
10 Abilene  2000    10   101  7040000  59300      764       6.6 2001.
# ℹ 8,592 more rows

Benefits: The data is easy to import. Documentation is often provided with the package. We can usually trust that the data is good.

Cons: The data is often outdated. Mostly just for educational purposes.
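As a small sketch of how to explore data that ships with a package: txhousing is provided by ggplot2, one of the packages loaded by the tidyverse. The variable name ggplot2_data is just a label chosen for this example.

```r
library(tidyverse)

# List the names of the data sets that ship with ggplot2,
# the tidyverse package that provides txhousing.
ggplot2_data = data(package = "ggplot2")$results[, "Item"]
ggplot2_data

# In the console, run ?txhousing to open its documentation page.

# Compact overview: one line per column, with its type.
glimpse(txhousing)
```

The documentation page is where to check what each column means and how old the data is before relying on it.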

Example 2

Let’s read in the data from the plastics in brains study.

First we have to download the data from here.

Next, we have to move it to our current working directory. In most cases this is the folder that this .qmd file is in. If not, we can set it manually in RStudio, which I’ll do in the video of this lecture.

Finally, we have to load the tidyverse, read in the data, and save it to memory.

library(tidyverse)
plastic_in_brains = read_csv("path/to/name_of_csv.csv")
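If read_csv() reports that it cannot find the file, the working directory is the usual suspect. The sketch below uses only base R functions to check where R is looking. The path is the same placeholder used above, not the real file name, and csv_path and file_found are names chosen for this example.

```r
# Placeholder path from the lecture; replace it with the real
# location of the downloaded file.
csv_path = "path/to/name_of_csv.csv"

# Show the current working directory.
getwd()

# List the .csv files that R can see in that directory.
list.files(pattern = "\\.csv$")

# TRUE means read_csv(csv_path) should be able to find the file.
file_found = file.exists(csv_path)
file_found
```

If file_found is FALSE, either move the file or change the working directory, for example with setwd() or through RStudio's Session > Set Working Directory menu.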

Pros and Cons

Benefits: The data is current and exactly what we are interested in.

Cons: The documentation is elsewhere. We have to be sure we trust the supplier of the data.

Read from a google sheet.

Finally, I’ll show how we can read in data from a Google Sheet.

We collected class data in a Google Form. That data is now accessible to us as a sheet.

To read in the data:

1. Load the googlesheets4 package.

2. Run read_sheet() with the URL from our sheet.

library(googlesheets4)
class_data = read_sheet("https://docs.google.com/spreadsheets/d/1aNrNX5xWkrtuJHc-bsCC7e07KbnRbVIxaWdbblOCyxA/edit?usp=sharing")
  1. You will be prompted in the console to cache OAuth tokens. Select 1 for Yes.
  2. You might be asked to install httpuv. You should do that. You may then be taken to a webpage to authorize tidyverse access. Do that too.
  3. You should only have to do the steps above once.

Pros and Cons

Same as those from importing .csv files.

APIs and Scraping

We’ll discuss API calls and scraping when they come up in the labs.

You are ready to go on to the next lecture.

You will need to read in our class data using google sheets to go on to the next lecture and homework assignment.