A Scaffolded Project Implementation for Teaching an Undergraduate Data Analytics Course

Mihir Mehta
Pennsylvania State University

Mihir is a doctoral student in the Industrial and Manufacturing Engineering Department at Penn State University. His research topic is trustworthy artificial intelligence in the public health context. He is an editorial staff writer for the INFORMS OR/MS Tomorrow magazine. He will be a technical mentor for the Data Science for Social Good - UK Chapter Summer Program 2022. He will discuss this article in the “Inclusive Pedagogy” section at the upcoming INFORMS Annual Meeting 2022.

Undergraduate students with little programming and data analytics background often face challenges in learning problem-solving skills. To help all students become better problem-solvers, the author developed a scaffolded project that leverages publicly available datasets. This article describes this scaffolded project implementation. As part of this project, the author created numerous active learning exercises to help students apply data engineering and data visualization-based concepts to provided datasets. Through this engaging and positive learning-focused project, students developed essential data analytics skills useful for areas such as manufacturing, retail, and finance.

This article is organized as follows. First, the project-related key terminologies are introduced. Then, an overview of the underlying project is provided. Subsequently, the instructional process is discussed in detail with an illustrative example, including the prior assessment strategies, the specifics of the scaffolded module, and post-completion activities. Afterward, a list of additional strategies implemented to nurture a positive and engaging classroom experience is provided. The article concludes with a summarized description of the project.

Key Terminologies

  • Instructional Process: It consists of three stages: Prior Knowledge Assessment, 3-section Scaffolded Module, and Post Completion Evaluation & Subsequent Development. This three-stage process develops the specifics that assist with project completion. It also shapes subsequent course content development. This process is continuously supported through Complementary Strategies. These strategies nurture engaging and positive learning experiences. The entire process is illustrated in Figure 1.

Figure 1: Instructional Process

  • Scaffolded Module: The 3-section scaffolded module, one of the three stages in the instructional process, helps students to complete this large and complex project through a series of incremental active learning exercises. This module consists of three sections: Meta Data, Data Structure, and Exploratory Analysis. These three sections are motivated by the key learning steps discussed in the project description.
  • Illustrated Example: It exemplifies the three stages of the Instructional Process and the three sections of the 3-section Scaffolded Module.

Project Description

This section provides a general overview of the implemented exploratory data analytics project. It describes the publicly available datasets used, the format of the problem statements, and the key learning steps in completing the project. The next section, Instructional Process, provides further details illustrated with an example.

Datasets: This project uses three publicly available datasets for the state of Pennsylvania:

  1. County Level Population Statistics: Dataset (Source Documentation)
  2. County Level COVID-19 Cases: Dataset (Source Documentation)
  3. County Level COVID-19 Deaths: Dataset (Source Documentation)

Problem Statement: Based on problem descriptions and datasets, students are expected to complete “R” code skeletons to reproduce a given set of visualizations.

Key Learning Steps: To reproduce these visualizations, students need to learn incrementally in the following steps:

  1. Become familiar with the provided datasets (Meta Data Section)
  2. Understand the corresponding data structures (Data Structure Section)
  3. Formulate data engineering algorithms, that is, a path from the provided business problem description to the desired visualization (Exploratory Analysis Section)

Instructional Process


Figure 2: Illustrative Project Example

This section discusses the proposed instructional process using the project example shown in Figure 2, which illustrates the process of going from the project description and dataset to the desired visualization by following the 3-section Scaffolded Module.

Using the “County Level COVID-19 Cases” dataset, this example expects the students to (1) perform data distribution analyses for Luzerne County and (2) examine differences in distribution skewness between two variables: new case numbers and 7-day average new cases. The project requires reproducing two visualizations, a box plot and a histogram, as shown in Figure 2. The 3-section Scaffolded Module helps the students achieve these goals as follows. The Meta Data section helps the students become familiar with the variables in the “County Level COVID-19 Cases” dataset. The Data Structure section helps them understand the underlying data structure of this dataset. The Exploratory Analysis section guides them in applying that understanding to reproduce the desired visualizations.

Prior Knowledge Assessment

In the first stage of the instructional process, the students’ prior knowledge levels are assessed through a set of strategies. These strategies help in designing specifics of the 3-section Scaffolded Module.

  • Assignments and Classroom Activities: These activities focus on identifying gaps in programming concepts. Example: For a custom-developed R function, the effect of “print” and “return” usage on the R output was one of the identified gaps.
  • Multiple-Choice Quiz: Using source documentation and sample data, this quiz measures students’ familiarity with the provided datasets. Example: True or False Question “Luzerne County has a FIPS code value of 42078.”
  • Course Interactions: Multi-channel engagement, ranging from classroom discussions and office hours to queries submitted through anonymous Canvas surveys, informs the specific hints and references provided for the Exploratory Analysis section. Example: Classroom interactions prompted the inclusion of guidelines for referencing a variable name inside backticks (` `) in “dplyr” syntax.
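The “print” versus “return” gap mentioned above can be illustrated with a minimal sketch. The two functions below are hypothetical and are not taken from the course material; they only assume base R behavior, namely that print() displays a value and lets execution continue, while return() ends the function and hands a value back to the caller.

```r
# Hypothetical functions, written only to contrast print() and return().
with_print <- function(x) {
  print(x)     # displays x on the console; execution continues
  x * 2        # the last evaluated expression becomes the function's value
}

with_return <- function(x) {
  return(x)    # hands x back to the caller and exits immediately
  x * 2        # never reached
}

doubled   <- with_print(5)    # console shows [1] 5; doubled holds 10
unchanged <- with_return(5)   # nothing is displayed; unchanged holds 5
```

Assigning the results to variables, as done here, separates what a function displays from what it returns, which is often where the confusion arises.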

3-section Scaffolded Module

This module, comprising three sections, helps the students advance towards the completion of their project. Insights from the prior knowledge assessments are used to design the details of this module.

Meta Data

The learning objectives of this section are as follows.

  1. Familiarize students with underlying datasets’ dimensions and columns
  2. Reinforce conceptual gaps identified in the Assignments and Classroom Activities
  3. Create a smaller dataset to study different data engineering operations through “dplyr" verbs

Scaffolded Discussion with Example:

  • By defining a simplified data structure involving dataset and information types, a custom R function was developed to create the metadata table.

    Example: Part “b” of Figure 3 is a partial output snapshot of the metadata table created for the “County Level COVID-19 Cases” dataset (denoted as ‘dt_type=2’).

  • This development applied previously covered programming concepts.

    Example: Part “a” of Figure 3 shows some of the steps developed in the custom R function. These incorporate applications of “dim” (finding data dimensions) and “==” (comparison operation), both previously covered concepts. Specifically, “#Step 9” is developed to reinforce the conceptual gap example discussed in the Assignments and Classroom Activities.

  • The custom function included multiple print statements so that students could follow the step-wise development of the final output.

    Example: Part “a” of Figure 3 shows a “print” statement with a text description placed before each of the steps.


Figure 3: Meta Data Section
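A metadata function of the kind described above might look like the following sketch. It is not the author's function from Figure 3: the name build_metadata, the toy data, the column names, and the label used for other dt_type values are all assumptions made for illustration. Only the use of “dim”, “==”, narrating “print” statements, and the ‘dt_type=2’ convention for the cases dataset come from the article.

```r
# Hypothetical sketch of a metadata helper; not the function shown in Figure 3.
build_metadata <- function(dataset, dt_type) {
  print("Step 1: find the data dimensions with dim()")
  dims <- dim(dataset)               # c(number of rows, number of columns)

  print("Step 2: identify the dataset with a comparison (==)")
  if (dt_type == 2) {
    dt_name <- "County Level COVID-19 Cases"
  } else {
    dt_name <- "Other dataset"       # assumed label for any other dt_type
  }

  print("Step 3: assemble one metadata row per column")
  metadata <- data.frame(
    dataset     = dt_name,
    n_rows      = dims[1],
    n_cols      = dims[2],
    column_name = names(dataset),
    column_type = vapply(dataset, function(col) class(col)[1], character(1)),
    row.names   = NULL,
    stringsAsFactors = FALSE
  )

  # The print() calls above only narrate the steps; return() hands back the table.
  return(metadata)
}

# Toy data with made-up values and assumed column names.
toy_cases <- data.frame(
  fips      = c(42079, 42079, 42001),
  date      = as.Date(c("2020-04-13", "2020-04-14", "2020-04-13")),
  new_cases = c(10, 12, 3)
)

meta <- build_metadata(toy_cases, dt_type = 2)
```

Placing a descriptive print statement before each step lets students match the console narration to the code that produced it.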

Data Structure

The learning objectives of this section are as follows.

  1. Develop a strong understanding of all three datasets’ data structure
  2. Interpret underlying context captured in a row or a set of rows of the dataset
  3. Build self-confidence in R-programming through the practice of low-stake active learning coding exercises

Scaffolded Discussion with Example:

  • This section provided a heuristic algorithmic outline for completion. By applying a series of basic data engineering operations based on “dplyr” verbs, the outline helps students understand the data structure of a dataset irrespective of source documentation availability. This section also included sample code and desired outputs, creating low-stake active learning opportunities.
    Example: Figure 4 shows a partial output of three sub-parts, named “The number of FIPS Value by Date,” “The number of Date Values by FIPS,” and “Dataset for Luzerne 04/13/2020.” These results are produced for the “County Level COVID-19 Cases” dataset.
  • One sub-part in the outline asks students to understand summary statistics, such as the number of unique values in a specific column and the total number of observations in a group. In the other sub-part, data structure comprehension is strengthened through subsetting the provided dataset for specific values in multiple columns.
    Example: Part “a” of Figure 4 lists summary statistics produced at the “FIPS Code” and “Date” levels for the “County Level COVID-19 Cases” dataset. These data structure insights are strengthened by subsetting this dataset for the “Luzerne” county value and the “04/13/2020” date. The corresponding partial output is shown in part “b” of Figure 4.

Figure 4: Data Structure Section
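The three sub-parts described above can be sketched with basic “dplyr” verbs. The data frame below is a toy stand-in with made-up case counts, and its column names (county, fips, date, new_cases) are assumptions; the actual dataset's columns may differ.

```r
library(dplyr)

# Toy stand-in for the cases dataset: made-up values, assumed column names.
toy_cases <- data.frame(
  county    = c("Luzerne", "Luzerne", "Adams", "Adams"),
  fips      = c(42079, 42079, 42001, 42001),
  date      = as.Date(c("2020-04-13", "2020-04-14", "2020-04-13", "2020-04-14")),
  new_cases = c(10, 12, 3, 4)
)

# Number of distinct FIPS values (and observations) on each date
fips_by_date <- toy_cases %>%
  group_by(date) %>%
  summarise(n_fips = n_distinct(fips), n_obs = n())

# Number of distinct date values (and observations) for each FIPS code
dates_by_fips <- toy_cases %>%
  group_by(fips) %>%
  summarise(n_dates = n_distinct(date), n_obs = n())

# Subset for one county on one date
luzerne_apr13 <- toy_cases %>%
  filter(county == "Luzerne", date == as.Date("2020-04-13"))
```

When every date shows the same number of FIPS values, every FIPS code shows the same number of dates, and the subset contains a single row, the counts suggest that each row represents one county on one day, even without consulting the source documentation.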

Exploratory Analysis

The learning objectives of this section are as follows.

  1. Apply data engineering and data visualization concepts to the provided datasets to answer a specific data analytics question using a visualization
  2. Provide exposure to the thought process of formulating multiple analytics questions using the same dataset
  3. Develop an understanding of the various components of a visualization
  4. Become familiar with practices useful in developing easy-to-interpret visualizations
  5. Infer insights about the dataset from a visualization

Scaffolded Discussion with Example:

  • This section provided an R-code skeleton to generate desired visualizations.

    Example: Part “c” of Figure 5 shows snippets of the provided R-code skeleton for distribution studies.

  • A set of hints was included to sketch the required algorithm, including the mapping of visualization components to “ggplot2”-specific keywords.

    Example: Part “b” of Figure 5 shows some of the guidelines provided to develop an algorithm for completing distribution studies. The box outline highlights the provided “ggplot2” mapping.

  • A reference HTML document with multiple R code examples and their resultant outputs was arranged to illustrate varied applications involving a set of data engineering operations.

    Example: Part “a” of Figure 5 contains an R code example and a partial resultant output. This code produces all the Q1-2021 observations for the “Luzerne” county in the “County Level COVID-19 Deaths” dataset.

  • Another reference HTML document was supplied to reinforce some specific concepts identified in the “Course Interactions” assessment.

    Example: Part “d” of Figure 5 demonstrates how to reference variables containing spaces in their names. This development is inspired by an interaction discussing “Including variable names inside backticks (` `).”
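The two reference examples described above, extracting the Q1-2021 observations for one county and referencing variables whose names contain spaces, can be combined in one short sketch. The column names `County Name` and `New Deaths` and all values are made up for illustration and are not taken from the actual deaths dataset.

```r
library(dplyr)

# Toy stand-in for the deaths dataset: made-up values, hypothetical column names.
# check.names = FALSE keeps the spaces in the column names.
toy_deaths <- data.frame(
  `County Name` = c("Luzerne", "Luzerne", "Luzerne", "Adams"),
  Date          = as.Date(c("2020-12-31", "2021-01-15", "2021-04-01", "2021-02-01")),
  `New Deaths`  = c(1, 2, 0, 1),
  check.names   = FALSE
)

# Backticks let dplyr refer to a variable whose name contains a space.
luzerne_q1_2021 <- toy_deaths %>%
  filter(
    `County Name` == "Luzerne",
    Date >= as.Date("2021-01-01"),   # first day of Q1-2021
    Date <= as.Date("2021-03-31")    # last day of Q1-2021
  )
```

Without the backticks, R would try to parse County and Name as two separate symbols and stop with a syntax error.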


Figure 5: Exploratory Data Analysis Section
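As a rough illustration of how visualization components map to “ggplot2” keywords, the sketch below draws a box plot and a histogram of one variable. It is not the skeleton provided to students: the data are randomly generated, right-skewed toy values, and the column name new_cases is an assumption.

```r
library(ggplot2)

set.seed(1)  # made-up, right-skewed toy values; not the real Luzerne County data
toy_luzerne <- data.frame(new_cases = rexp(60, rate = 1 / 20))

# Box plot: data -> ggplot(), variable -> aes(), plot type -> geom_boxplot(),
# titles and axis labels -> labs()
box_plot <- ggplot(toy_luzerne, aes(x = "", y = new_cases)) +
  geom_boxplot() +
  labs(title = "Distribution of new cases (toy data)", x = NULL, y = "New cases")

# Histogram: same data and variable, a different geom
hist_plot <- ggplot(toy_luzerne, aes(x = new_cases)) +
  geom_histogram(bins = 15) +
  labs(title = "Distribution of new cases (toy data)", x = "New cases", y = "Count")

# print(box_plot); print(hist_plot)   # uncomment to draw the plots
```

The same pattern would apply to a 7-day average column, so that the skewness of the two variables can be compared side by side.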

Post Completion Evaluation & Subsequent Development

This stage of the instructional process evaluates students’ post-completion comprehension levels. Project evaluations are discussed individually in a scheduled 15-minute oral evaluation. Besides providing project feedback, these discussions recommend strategies to students for enhancing comprehension and learning in the remainder of the course. Students are strongly encouraged to take an active part in the discussions and are awarded 30% of the project grade for participating.

Project evaluations help identify key areas for concept reinforcement. Oral discussions help in deciding how to represent content in a way that creates an enhanced learning experience. Combined insights from both of these activities shape the design of the subsequent course modules.

Subsequent Content Development with Example:

  • Evaluation: Parts “a” and “b” of Figure 6 show snippets from the content developed for reinforcing data structure and data engineering concepts, respectively. These areas were identified while evaluating project submissions.
  • Discussion: Part “a” of Figure 6 shows a structural thinking approach designed for learning the intricacies of data structure concepts. Part “b” of Figure 6 demonstrates a structural format developed to observe and list numerous specifics during a data engineering operation. Oral evaluations shaped both of these structural frameworks.
  • Content Example: Oral evaluations found that students considered the Data Structure section the most interesting. These interest levels were also reflected throughout the project evaluation process, as students showed higher comprehension levels for these concepts. Hence, the data governance module and the content related to the generalized analytics process were designed from a data structure perspective. Part “c” of Figure 6 shows partial outputs of the simulated finance and retail datasets. Data structural similarities between these datasets were highlighted to teach generalized analytics framework and data governance concepts.

Figure 6: Post Completion Content Development

Complementary Strategies

The following set of strategies nurtured continuous positive and engaging learning experiences.

  • Communication: The written communication incorporated a few empathetic and light-hearted lines from time to time. Example: “Just like in real life, group dynamics may not always work in your favor. You can still convert those lemons into lemonade. Hence, when you get an Alphonso mango, that smoothie and/or milkshake would be tastier!”
  • Creative Components: Multiple anonymous creative quizzes were developed on Canvas encouraging divergent thinking.

    Example: Students were asked to describe their fondness for Penn State football using ice-cream flavors.

  • Flexibility: Students were given a few flexible options for submission documents. Example: The one-page poster submission allowed flexibility in format and content. Students were encouraged to use this poster submission to express their learning, experiences, and feelings while completing the project.
  • Peer Support: Students were encouraged to brainstorm in groups. Example: The course included a similar exploratory analysis group assignment. A major portion of this group assignment was discussed and solved during course lectures and homework help sessions.


Data engineering and data visualization are building blocks for solving data analytics problems. This article proposes an exploratory analytics project leveraging publicly available datasets. For that, a three-stage instructional process was developed. After assessing students’ prior knowledge levels in the first stage, a 3-section Scaffolded Module follows as the second stage. This module contains active learning exercises designed to be inclusive of all students. In the third stage of the instructional process, there are post-completion assessments and individual discussions with students. Using insights derived from the third stage, a structural framework is incorporated to reinforce some of the key concepts. These insights are also used to decide on content representation for subsequent course modules. Finally, the proposed process nurtures a positive and engaging learning environment by executing the complementary strategies.


The author acknowledges the Industrial and Manufacturing Engineering Department of Penn State University, Sofia Perez-Guzman for edits on this article, Larkin Hood from Penn State’s Schreyer Institute for Teaching Excellence for her input on the abstract, and Paul Griffin for encouraging the pursuit of the concept.