Abbreviated Topic Lists: A Searchable Index
Outline of Unit 1

Topic Area  1.1: Data science and me




1.1 
Role of data in decisionmaking ― at home, in government, business, industry, sport, ... 


1.2 
Introduction to Data Science and the Data Science learning cycle 
1 
What is Data Science? 
1.3 
Data Science success stories 


1.4 
Data Science disasters 


1.5 
Elaboration of steps in the datascience cycle 


1.6T 
Diverse uses of the term " Data Science " 


1.7T 
Sources of information about Data Science and its activities 






2.1 
Data I keep 


2.2 
Data about me 
2 
What does Data Science 
2.3 
Data on friends and family 

have to do with me? 
2.4 
What are data? 


2.5 
Privacy, security and openness/accessibility ― issues and tradeoffs 


2.6T 
Pedagogical issues relating to leading discussions and extracting issues 






3.1 
Examples of data 


3.2 
What are data? 


3.3 
How do we get useful data? Primary versus secondary data 
3 
Sources of data 
3.4 
Privacy, security and openness/accessibility  issues and tradeoffs 


3.5 
Thinking critically about data: introduction 


3.6 
Thinking critically about data: data quality and GIGO 


3.7 
Thinking critically about data: ways in which data need critical appraisal 


3.8T 
Pedagogical issues relating to these topics 






4.1 
Ideas (with examples) about using data 


4.2 
Examples of social and personal consequences 
4 
Examples of Data Science 
4.3 
How can the prediction process go wrong? 

problems 
4.4 
Examples of causal explanation and use for control 


4.5 
How can the process of finding causes go wrong? 


4.6T 
Pedagogical issues relating to these topics 




5T 
Extracting pertinent lessons 
5.1T 
Leading discussions and extracting key issues from student contributions 

from student discussions 
5.2T 
Story Telling 





Topic Area 1.2 Basic tools for exploration and analysis. Part 1: Tools for a single variable




1.1 
Observations and features/variables 


1.2 
Numerical versus categorical variables 
1 
Rectangular data sets 
1.3 
Importing data files in simple formats 


1.4 
The need for data cleaning 


1.5T 
Knowledge of Topic Area 1.6 


1.6T 
Pedagogical issues relating to these topics 






2.1 
Frequency tables and bar charts (counts and proportions) 


2.3 
Good ways of ordering groups for displays and tables 
2 
Graphics and summaries 
2.4T 
Proportions versus counts: what works best for what? 

for a single categorical 
2.5T 
Weaknesses of pie charts and stacked bar charts 

feature/variable 
2.6T 
Pedagogical issues relating to these topics 






3.1 
Graphics: Dot plots and histograms as largedataset alternative. 


3.2 
Characteristics to look out for and their implications 


3.3 
Summaries 
3 
Graphics and summaries 
3.4 
Box plot as plot of summaries 

for a single numeric 
3.6 
Converting numerical features /variables to categorical 

feature/variable 
3.6T 
How the summary measures are obtained 


3.7T 
Pedagogical issues relating to these topics 



Topic Area 1.3 Basic techniques for exploration and analysis. Part 2: Pairs of features/variables




1.1 
Making comparisons 


1.2 
Interpreting group comparisons 
1 
Comparing groups 
1.3 
Extension to panel plots 


1.4T 
Pedagogical issues relating to these topics 


1.5T 
Comparing and evaluating different presentations 






2.1 
Scatter plots 


2.2 
Outcome/Response features/variables versus Predictor/Explanatory features/variables 
2 
Relationships between two 
2.3 
Construction 

numerical features/variables 
2.4 
Structure in scatter plots 
2.5 
Basic ideas of prediction 



2.6 
Vertical strips as a guide for sketching trend curves by eye 


2.7 
How predictions can fail 


2.8 
Minimizing average prediction errors 


2.9 
Obtaining trend lines and slidercontrolled smooths from software 


2.10 
(Straight) lines and interpreting the intercept and slope coefficients of a trend line 


2.11 
Positive and negative associations 


2.12 
Modifications to scatter plots to overcome perceptual problems 


2.13T 
Working with algebraic expressions 


2.14T 
Pedagogical issues relating to these topics 






3.1 
Twoway tables of counts and proportions 
3 
Relationships between 
3.2 
Sidebyside and separate bar charts or dot charts of proportions 

categorical features/variables 
3.3T 
Pedagogical issues relating to these topics 




4 
Filtering data 
4.1 
Filtering data by levels of a categorical feature/variable 


4.2T 
Pedagogical issues relating to these topics 



Topic Area 1.4 Basic tools for exploration and analysis. Part 3: Three or more variables


1 
Pairs plots 
1.1 
Pairs plots that will cope with categorical and numerical features/variables 


1.2T 
Pedagogical issues relating to these topics 






2.1 
Panel plots/faceting and 3dimensional summary tables 
2 
Subsetting by a third 
2.2 
Playing or stepping through the sequence of plots in panel display 

feature/variable 
2.3 
Highlighting subgroups in a scatter plot or dot plot 
2.4T 
Pedagogical issues relating to these topics. 







3.1 
Coloring points in dot plots and scatter plots 
3 
Other ways of adding 
3.2 
Sizing points in scatter plots 

information on additional 
3.3 
Labeling points 

features/variables to 1 and 2 
3.4 
Strengths and weaknesses of methods of adding information 

feature/variable plots 
3.5T 
Pedagogical issues relating to these topics 






4.1 
Plots that allow querying of elements 
4 
Interactive plots 
4.2 
Linked plots, linked plotsandtables 


4.3T 
Pedagogical issues relating to these topics 



Topic Area 1.5 Graphs and Tables: how to construct them and when to use them


1 
When to use a graphical 
1.1 
Exploration and discovery versus presentation 

display and when to use a table 
1.2 
Infographics and infotables 






2.1 
The graphical process: a chain from graph creator to graph interpreter 


2.2 
Visualization Principle 1. Use Position along a common scale 


2.3 
Visualization Principle 2. Choose an appropriate Aspect Ratio 
2 
What makes a graphical 
2.4 
Visualization Principle 3. Encoding features/variables 

display good or bad? 
2.5 
Visualization Principle 4. Supply an informative caption 
2.6 
Visualization Principle 5. May need more than one graph 

2.7 
Other factors that good software generally gets right by default 

2.8T 
Pedagogical issues relating to comparing different types of graphs 







3.1 
Plotting samples of numerical data to explore relationships 
3 
What sort of graphical 
3.2 
Plotting a numerical and a categorical feature/variable 

display should I use? 
3.3 
Plotting features/variables that change over time 


3.3 
Plotting two numerical features/variables to explore relationships 


3.5T 
Pedagogical issues relating to comparing different types of graphs 






4.1 
Role of tables 
4 
Tables: their purpose, and 
4.2 
Principles for making patterns in tabular information 

how to create good tables 
4.3 
When it is often better to use a table rather than a graph 


4.4T 
Pedagogical issues relating to using tables 





Topic Area 1.6 The datahandling pipeline




1.1 
Automating the data science process 


1.2 
Data management principles 
1 
Introduction to tool 
1.3 
Case studies of real datascience projects using various toolsets 

support for the datahandling pipeline 
1.4T 
Characteristics and evaluation of some widespread toolsets 


1.5T 
Pedagogical issues relating to tool use/mastery 










2.1 
Data sources 


2.2 
Logical data formats 
2 
Getting and storing data 
2.3 
Physical file formats 


2.4T 
Case studies of good data sources 


2.5T 
Comparison and evaluation of storage approaches 


2.6T 
Pedagogy issues relating to data sources and storage 






3.1 
Automating an analysis 


3.2 
Data cleaning 
3 
Tool support for exploring 
3.3 
Data transformations 

and analyzing data 
3.4T 
Learning to use more aspects of the programming language 


3.5T 
Pedagogical issues for coding 






4.1 
Principles of communication 


4.2 
Customizing graphs and tables 
4 
Generating presentations 
4.3 
Combining explanation with graphs/tables 

Of the data 
4.4T 
Comparison and evaluation of tools that allow generating presentations 


4.5T 
Pedagogy issues relating to generating presentations 

Topic Area 1.7 Avoiding being misled by data




1.1 
What do we mean by GIGO? 

1 
GIGO  Garbage In, 
1.2 
Examples of garbage. How can we avoid collecting or using garbage? 


Garbage Out 
1.3T 
Pedagogical issues relating to these topics 








2.1 
Biases due to measurement issues 

2 
Bias and what we can do 
2.2 
Biases due to selection or filtering in data streams 


about it 
2.3 
(Discussion Topic) Extrapolating from the data we have to a larger setting 



2.4T 
Pedagogical issues relating to these topics 








3.1 
Allowing for an important third feature/variable 

3 
Problems and solutions in 
3.2 
Difference between an observational study and a randomized experiment 


reaching causal conclusions 
3.3 
Extrapolating from the data at hand to a larger setting 



3.4T 
Pedagogical issues relating to these topics 








4.1 
Learning to ask questions that can be answered from data 

4 
Questions that can and 
4.2 
Learning to spot questions that cannot be answered from the available data 


cannot be answered by 
4.3T 
Pedagogical issues relating to these topics 


data 










5.1 
Random sampling is not perfect 

5 
Sampling errors and 
5.2 
Unpacking the likely extent of sampling error 


confidence intervals 
5.3 
Experiencing how confidence intervals can be constructed 



5.4T 
Pedagogical issues relating to these topics 








6.1 
Randomized assignment is not perfect 

6 
Randomization variation 
6.2 
When to conclude that observed group differences are real? 


in experiments 
6.3 
Experiencing a twogroup randomization test 



6.4T 
Pedagogical issues relating to these topics 


Topic Area 2.1 Time Series data


1 
Problem elicitation and 
1.1 
The nature of time series data 

formulation: Time Series 
1.2 
Reasons why people are often interested in time series data 

data 








2.1 
Obtaining timeseries datasets 
2 
Getting the data 
2.2 
Some common dateandtime feature/variable formats 


2.3 
Transforming and reshaping data sets 






3.1 
Basic timeseries plots and recognizing features 
3 
Exploring the data 
3.2 
Identifying the trend + seasonal oscillation components 


3.3 
Decomposition into trend + season + residual 


3.4 
Comparing related series 






4.1 
Forecasting as projecting patterns from the past 
4 
Analyzing the data: 
4.2 
Making an informal forecast 

Modelling and Forecasting 
4.3 
Experience with using a formal forecasting method 


4.4T 
Pedagogical issues relating to these topics 






5.1 
Selecting features to be communicated 
5T 
Communicating the 
5.2 
Choosing a communication method 

results; next question? 
5.3 
Telling the story 

Topic Area 2.2 Map data




1.1 
Ubiquitous nature of maps 


1.2 
Constituent components of maps 
1 
What are the purposes of 
1.3 
Separation of data from display 

maps? 
1.4 
Multiple dimensions 


1.5 
Interactive example as a tool for exploratory learning 






2.1 
Location and Region Data as common archetypes 


2.2 
Plotting points on downloaded map tiles, relationship to scatterplots 
2 
How do we build and 
2.3 
Coding addedvariableinformation at location points; interpretation 

work with location maps? 
2.4 
Subsetting/Faceting; ways of showing changes over time 


2.5 
Interactivity with location mapplots 






3.1 
Shape files and choropleth maps; region labels 


3.2 
Matching regions in a dataset to regions in a shape file 


3.3 
Representing two or more features/variables; issues of scales 
3 
How do we build and 
3.4 
Perceptual problems with choropleth maps; alternative representations 

work with regional maps? 
3.5 
Subsetting/Faceting as a tool; ways of showing changes over time 


3.6 
Interactivity with regional maps 


3.7T 
Subtleties of color and scale choice  communication enhancement 


3.8T 
Distortions and bias: avoiding misleading figures projections 






3.3T 
Maps as visualization versus maps as data 
4 
(TEACHERonly TOPIC) 
3.4T 
Maps as data 

What is a Map, and how is 
3.5T 
Finding patterns in data through maps 

this Data? 
3.6T 
Overlays on maps 

Topic Area 2.3 Text data




1.1 
Examples of questions to be addressed using natural language 


1.2 
Important characteristics of text data 
1 
Problem elicitation and 
1.3 
Extracting tokens 

formulation: Text data 
1.4 
Removing stop words and performing stemming 


1.5T 
Characteristics of text data 






2.1 
Constructing frequency tables of tokens 


2.2 
Generating bar charts and word clouds of token frequencies 


2.3 
Limitations of unigrams; extracting bigrams 
2 
Bag of words analysis of 
2.4 
Summarizing bigrams; comparing unigrams and bigrams 

text data 
2.5 
Exploring differences between documents 


2.6 
Distinguishing the content of documents 


2.7 
Limitations of bag of words analysis 


2.8T 
Pedagogical issues relating to text data analysis 






3.1 
What is sentiment and why do people want to use it? 
3 
Sentiment Analysis 
3.2 
Merging tokens with sentiment data tables 


3.3 
Summarizing sentiment in a document; differences between sources 


3.4 
Limitations of sentiment analysis 


3.5T 
Issues relating to sentiment analysis 



Topic Area 2.4 Supervised learning


1 
Problem elicitation and 
1.1 
What is classification? 

formulation: Supervised 
1.2 
Classification models and rules 

Classification 
1.3 
Measuring how well the classification model works 






2.1 
Components of a classification tree and how it works 
2 
Introduction to 
2.2 
Misclassification rate 

Classification Trees 
2.3 
Node/leave "purity" 
2.4 
Generating Classification and Regression Trees 

2.5 
Consequences of misclassification 







3.1 
Introduction to R/Python commands to grow and visualize a CART 
3 
Growing Classification 
3.2 
When to stop growing the tree 

Trees 
3.3 
What is overfitting? Validation data 


3.4 
Pruning 






4.1 
What does the tree tell you about how classifications are made? 
4 
Communicating the 
4.2 
Using the tree to make decisions 

Results; next question? 
4.3T 
Pedagogical issues relating to these topics 






5.1 
Measuring quality of prediction 


5.2 
Interpreting regression trees 
5 
Introduction to 
5.3 
Building regression trees with one predictor feature/variable 

Regression Trees 
5.4 
Building regression trees with more than one predictor feature/variable 


5.5 
Comparing trees with a validation set 



Topic Area 2.5 Unsupervised Learning


1 
Problem elicitation and 
1.1 
What is unsupervised learning? 

formulation: Unsupervised 
1.2 
Creating clusters (groups) of data points based on attributes of the data 

Learning 
1.3 
Contrast with classification or supervised learning 






2.1 
Obtain appropriate datasets 
2 
Getting and exploring data 
2.2 
Interpreting the structure of the data 


2.3 
Contrast with Supervised Learning 
2.4 
Motivation for identifying clusters automatically 







3.1 
Kmeans is a clustering algorithm which is iterative in nature 


3.2 
Use of distance metric to assign each data point to a cluster automatically 
3 
Example of Unsupervised 
3.3 
Explanation of iterative procedure 

learning algorithm: 
3.4 
The need to repeat with different initial guesses for cluster centers 

Kmeans clustering 
3.5 
Use a small set of data points to explain the Kmeans algorithm manually 


3.6 
What is an outlier in this context? 


3.7 
Discuss distance of an object from the center of its assigned cluster 






4.1 
Introduction to format of the data set 


4.2 
Introduction to the programming environment to use for clustering 
4 
Implementing Kmeans 
4.3 
Clean and transform the data set to observations versus features 

clustering on a large data 
4.4 
Choose K and run the algorithm using the code snippet provided 

set 
4.5 
Interpret the results 


4.6 
Change the value for K, and repeat; compare how selections have changed 


4.7T 
Pedagogical issues relating to these topics 






5.1 
When is unsupervised learning as best approach for problem at hand? 


5.2 
Exploratory analysis and graphs to summarize the input data. 
5 
Use in Problem solving 
5.3 
Visualizations to show differences for different values of K 


5.4 
Making an optimal choice of K 


5.5 
Descriptive statistics for the features in each cluster 


5.6 
Interpreting and communicating the results 


5.7 
Effect of human factors 




6 
Other unsupervised 
6.1 
Examples when distancebased methods may not be appropriate 

learning methods 
6.2 
Motivate need for other clustering methods 

Alternatives to Kmeans 
6.3 
Visualizations of different cluster shapes unsuited to Kmeans 

clustering 
6.4 
Examples of visualizations 

Topic Area 2.6 Recommender Systems




1.1 
Examples of some recommender systems 


1.2 
Desirable features of recommender systems 
1 
Problem elicitation and 
1.3 
Ethical issues for recommender and personalized systems 

formulation: 
1.4 
Additional complexities 

Recommender Systems 
1.5 
Communicating the recommenders, 


1.6T 
Characteristics of a wide variety of recommender systems 


1.7T 
Pedagogical issues relating to recommender systems 






2.1 
Ratings (on useritem pairs): sparsity of the data. 


2.2 
Data quality issues 
2 
The data used by 
2.3 
Feature data on items, demographic data on users 

Recommender Systems 
2.4 
Ethics with data 


2.5 
Storing the data 


2.6T 
Tools to collect and manipulate ratings data 


2.7T 
Pedagogical issues relating to data for recommender system 






3.1 
Concept: Recommendation based on singleuser data, from item similarity 


3.2 
Measures of item similarity 
3 
Contentbased 
3.3 
Analysis and recommendation based on calculating nearest neighbors 

recommendation 
3.4 
Analysis and recommendation based on forming clusters of items 


3.5T 
Tools for calculating similarity, clusters etc 






4.1 
Concept: recommend items that are liked by similar users 


4.2 
Define similarity of users 
4 
Collaborativefiltering 
4.3 
Use regression to predict unseen rating from known ones 


4.4 
Ethics issues 


4.5T 
Tools that calculate predicted ratings 


4.6T 
Pedagogical issues relating to collaborative filter 






5.1 
User satisfaction; proxies to measure this 


5.2 
Measures from information retrieval 
5 
Evaluation of a 
5.3 
Ethics rules that apply to user studies 

Recommender System 
5.4T 
Tools to calculate IR measures 


5.5T 
Pedagogical issues relating to recommendation evaluation 

Topic Area 2.7 Interactive Visualization


1 
Why visualization? The 
1.1 
The power of visualization 


role of Visualization 
1.2 
Visual Exploratory Data Analysis 


in the Data Science 
1.3 
Visual Communication and Presentation 


Learning cycle 
1.4 
Considerations and challenges of visualization design 













2.1 
Brief introduction to perceptual and cognitive capacity. 

2 
Data Types and Visual 
2.2 
Hierarchy of visual variables 


Variables 
2.3 
Color 



2.4 
Motion 








3.1 
The role of interaction 

3 
Interaction 
3.2 
Types of interaction 1. Basic 



3.3 
Types of interaction 2. Manipulating the layout 



3.4 
Types of interaction 3. Manipulating the data 








4.1 
Understanding the audience and the task 

4 
Critique 
4.2 
Perceptual biases that affect visualization efficacy 



4.3 
Inappropriate encodings of data 



4.4 
Scales, legends, decorations 


Topic Area 2.8 Confidence intervals and the bootstrap


1 
Parameters versus 
1.1 
Motivation and examples 


estimates 










2.1 
Behavior of sampling errors in various contexts 



2.2 
What is a standard error? 

2 
Sampling error 
2.3 
Relation between sample size and standard error 



2.4 
Inverserootn relationship between sample size and sampling error 



2.5 
Effect of population size on sampling error 



2.6T 
Simulation and simulation tools 








3.1 
The concept of a confidence interval 

3 
Confidence intervals and 
3.2 
Interpretation of confidence intervals for a single parameter 


their implementation using 
3.3 
Experiencing bootstrap resampling 


bootstrapping 
3.4 
Comparing bootstrap standard error with usual approach 



3.5T 
Ways to facilitate learning from simulation 








4.1 
Coverage properties 

4 
Investigating performance 
4.2 
(Optional) More advanced situations 



4.3T 
Ways to facilitate learning from simulation 






5 
Differences and ratios 
5.1 
Constructing and interpreting bootstrap confidence intervals for differences 



5.2T 
Ways to facilitate learning from simulation 






6 
Further exploration 
6.1 
(Optional) Explore sampling from a set of theoretical distributions 



6.2T 
Ways to facilitate learning from simulation 






Topic Area 2.9 Randomization tests and Significance testing


1 
Randomized Experiments 
1.1 
The issue of assessing a possible treatment effect 

and Randomization variation 




2.1 
Generation and display of randomization distributions 


2.2 
Discussion: Concluding that you had evidence of a real difference 
2 
Towards the randomization 
2.3 
Motivate concept of rerandomization 

test 
2.4 
Generating the randomization distribution 


2.5 
Scope of inference justified by the experiment 


2.6T 
Simulation and simulation tools 






3.1 
A general procedure for conducting randomization tests 
3 
Randomization test 
3.2 
Analyzing data from several experiments 


3.3 
Language and idea of Pvalue and sidedness of a test 


3.4 
Generalize to 3 or more groups using simple distance measure 


3.5T 
Simulation and simulation tools 




4 
Investigating the 



performance of 
4.1 
Explore using large data set 

randomization tests in a 
4.2T 
Simulation and simulation tools 

sampling context 






5 
Relationships between 
5.1 
Assessing treatment differences using randomization 

categorical features/variables 
5.2T 
Simulation and simulation tools 













Topic Area 2.10 Image data


1 
Examples of image data 
1.1 
Sources and types 


and their uses 
1.2 
Images as data and methods for representing them 













2.1 
Sources of images 

2 
Image data origins and 
2.2 
Ways of encoding images in digital form 


encodings 
2.3T 
Comparing encoding methods 



2.4T 
Encodings for video and audio 













3.1 
Representing images using a numerical matrix 

3 
Image data representation 
3.2 
Relating individual pixels to color 


and basic mathematical 
3.3T 
Visual effects of mathematical operations on numerical versions of images 


operations on images 
3.4T 
Higherdimensional representations of 3D spatial data 



3.5T 
Using image processing operations to augment image data 


















4.1 
Examples of lowlevel features 

4 
Feature detection in s 
4.2 
Overview of some ways features can be identified in an image 


image 
4.3 
A feature vector as an imperfect representation of the image. 



4.4T 
Alternative choices for the features to consider 



4.5T 
Impact of highdimensionality of typical feature vector representations 













5.1 
Review supervisedlearning approach to a classification task 

5 
Image classification 
5.2 
Obtaining highquality training data for image classification 



5.3 
Applying an automated classifier to learn a classification 



5.4T 
Advantages and disadvantages of running a classifier directly on image data 



5.5T 
Unsupervised learning approaches to image classification 





