Datasets and Examples
Data science tutorials will most likely work with a dataset. But for programming topics or other technical tutorials, the goal may not be to work with data. In either case, a concrete example (a story or task) gives your exercises and explanations structure, makes the exercises more interesting, and helps learners see how the topic might be relevant to their work.
Examples:
Submitting a Python script that takes command line arguments as a job on an HPC cluster
Pretend you're a teacher automating tasks related to grading or course administration
Even if you don't have a need to maintain a single example throughout your tutorial, it's worth giving examples in your introductory materials concerning use cases for the skills people will be learning. Why should they learn what you're teaching and when can they expect to use it?
Dataset Considerations
Thinking about your 3-6 exercises, what characteristics do you need a dataset to have?
Both continuous and categorical variables?
Multiple groups?
Time series?
Do you need two datasets (one to demonstrate with and one for exercises), or does your single data set have enough variety to accommodate both teaching examples and exercises with different variables?
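One way to check a candidate dataset against the questions above is to profile its columns before committing to it. The following is a minimal sketch using only the Python standard library; the sample rows, the column names, and the `profile_columns` helper are all made up for illustration and are not part of any particular dataset.

```python
import csv
import io

# Made-up sample rows, used only to keep the sketch self-contained.
SAMPLE_CSV = """group,year,measurement
a,2020,1.5
a,2021,2.0
b,2020,3.25
b,2021,
"""


def is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def profile_columns(rows):
    """Summarize each column: kind, number of distinct values, missing count."""
    summary = {}
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        present = [v for v in values if v != ""]
        all_numeric = bool(present) and all(is_number(v) for v in present)
        summary[name] = {
            "kind": "numeric" if all_numeric else "categorical",
            "distinct": len(set(present)),
            "missing": len(values) - len(present),
        }
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
profile = profile_columns(rows)
for name, info in profile.items():
    print(name, info)
```

A column with only a handful of distinct values is a candidate grouping variable, and a year or date column suggests the data could support time series exercises. Note that this simple check treats a year column as numeric, so review the result by eye.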
Dataset Topics
Choosing a data set can be a small way to introduce topics of social justice. Consider datasets (and exercises) that expose people to data on social inequities.
Unless you are teaching a domain-specific tutorial, keep any example or data drawn from a single research domain at a level of generality and background knowledge that suits a mixed audience. For example, people outside biology know that DNA and protein sequences are made up of strings of letters, but they don’t know about transcription factor binding sites.
Preparing the Data
Make a plan to subset and simplify the data. Hundreds of extra columns, messy file names, and similar clutter distract learners from the skill you are teaching. Simplify the data to include only the things you will reference in the workshop.
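Subsetting can be as simple as keeping a short list of columns and one slice of the rows. Here is a minimal sketch with the standard library `csv` module; the raw data, the column names, and the `simplify` helper are hypothetical stand-ins for your own file.

```python
import csv
import io

# Made-up raw data standing in for a wide file; all column names are hypothetical.
RAW_CSV = """id,site,year,value,internal_code,notes
1,north,2019,4.2,X9,ok
2,south,2020,5.1,X7,recheck
3,north,2020,3.8,X9,ok
4,south,2021,6.0,X2,ok
"""

KEEP_COLUMNS = ["site", "year", "value"]  # only what the exercises reference


def simplify(rows, keep_columns, year):
    """Keep only the chosen columns and the rows for one year."""
    return [
        {name: row[name] for name in keep_columns}
        for row in rows
        if row["year"] == year
    ]


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
small = simplify(rows, KEEP_COLUMNS, year="2020")

# Write the simplified data as CSV text (an open file handle works the same way).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=KEEP_COLUMNS, lineterminator="\n")
writer.writeheader()
writer.writerows(small)
simplified_csv = out.getvalue()
print(simplified_csv)
```

Saving the simplified file under a short, descriptive name gives participants one small file to download instead of the full original.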
If your workshop topic is about cleaning data, clean the dataset except for the specific problems you plan to work through during the workshop. Or choose a clean dataset and deliberately “reverse clean” it by introducing the problems you will address.
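A "reverse clean" step could be scripted so that the same known problems appear every time the data is rebuilt. The sketch below is only one possible approach; the records, the field names, the three problems, and the `reverse_clean` helper are all invented for illustration.

```python
import random

# Made-up clean records; the field names are hypothetical.
CLEAN_ROWS = [
    {"group": "control", "score": "12"},
    {"group": "treatment", "score": "15"},
    {"group": "control", "score": "11"},
    {"group": "treatment", "score": "17"},
]


def reverse_clean(rows, seed=0):
    """Return a copy of rows with a few deliberate, known problems."""
    rng = random.Random(seed)  # fixed seed so every run produces the same mess
    messy = [dict(row) for row in rows]  # copy so the clean data is untouched

    # Problem 1: inconsistent capitalization and stray whitespace in one label.
    target = rng.randrange(len(messy))
    messy[target]["group"] = " " + messy[target]["group"].upper() + " "

    # Problem 2: a missing value coded as a sentinel string.
    target = rng.randrange(len(messy))
    messy[target]["score"] = "N/A"

    # Problem 3: an exact duplicate of the first row appended at the end.
    messy.append(dict(messy[0]))
    return messy


messy_rows = reverse_clean(CLEAN_ROWS)
for row in messy_rows:
    print(row)
```

Because each problem is introduced on purpose, you know exactly what participants should find, and the clean original doubles as the answer key.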
Plan to store your dataset somewhere online where participants can download or import the file via URL in your notebook or script. This will help avoid file path issues for participants. Data files can be stored in GitHub or other cloud storage services.
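Reading the file straight from a URL might look like the following sketch, which uses only the Python standard library. The `read_csv_from_url` helper and the example address in the comment are hypothetical; substitute the location of your own hosted file.

```python
import csv
import io
import urllib.request


def read_csv_from_url(url, encoding="utf-8"):
    """Download a CSV file and return its rows as a list of dicts."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode(encoding)
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical usage; replace the placeholder with the address of your hosted file:
# rows = read_csv_from_url("https://example.com/workshop/data.csv")
```

Every participant runs the same line and gets the same data, regardless of their working directory or operating system. If you host the file on GitHub, point to the raw file address rather than the page that displays the file.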
Some Starting Datasets
Here are some datasets with useful characteristics for teaching data skills. Although they are widely used, they work well for a first draft of a tutorial. You can substitute a more distinctive dataset later if you choose to.
Palmer Penguins: Each observation is a penguin. A useful mix of numeric (physical measurements of the penguins) and categorical variables (species, island, sex). Just a few missing values, which can be eliminated if you don't want missing values included. Available as an R package, or grab the CSV file from the GitHub repository. Data can be used to illustrate Simpson's Paradox.
Stanford Open Policing Project: Each observation is an individual traffic stop by the police. The data is almost all categorical and indicator (binary) variables describing the vehicle, the person stopped, and what happened during the stop, which makes it useful for practicing filtering or aggregating (counting) data. The full dataset is large, so consider using just a single year and/or location. It occasionally contains messy values, which you can filter out or incorporate into the tutorial if you want relatively simple recoding or cleaning tasks. It also works for merge or join exercises when combined with demographic statistics for a locality.
Ames Housing Data: Each observation is a property for sale in Ames, Iowa. Multiple types of categorical variables (ordered and unordered), along with a few numeric variables (square footage, lot size). Useful for statistical and machine learning models. Available as an R package, or grab a copy of the data set you want from the GitHub repo.
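The filter, aggregate, and join exercises mentioned for the policing data follow a common pattern, sketched here in plain Python. The records and population figures below are made up purely for illustration; they are not real policing or census data, and the field names are hypothetical.

```python
from collections import Counter

# Made-up records for illustration only; not real policing or census data.
stops = [
    {"county": "A", "outcome": "warning"},
    {"county": "A", "outcome": "citation"},
    {"county": "B", "outcome": "warning"},
    {"county": "A", "outcome": "warning"},
]
population = {"A": 1000, "B": 500}  # hypothetical county populations

# Filter: keep only stops that ended in a warning.
warnings = [s for s in stops if s["outcome"] == "warning"]

# Aggregate: count stops per county.
stops_per_county = Counter(s["county"] for s in stops)

# Join: attach the population to each county's count and compute a rate.
joined = [
    {
        "county": county,
        "stops": count,
        "population": population[county],
        "stops_per_1000": 1000 * count / population[county],
    }
    for county, count in sorted(stops_per_county.items())
]

for row in joined:
    print(row)
```

Each of the three steps can become its own exercise, with the join step showing learners why raw counts need a denominator before localities can be compared.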
Sources for Datasets
This is a weekly email newsletter that shares data sets of all types. Spreadsheet archive of shared data sets.
Please check and follow any terms of use on the data sets.
Open APIs: list of APIs that don't require authentication
Looking for a certain type of data set? Feel free to ask others in the workshop session - people from different fields may know of resources you don't.