Educational use only. The dataset used throughout this course is entirely fictitious and created for instructional purposes. All figures, district names, and indicator values are simulated and do not represent actual programme data. They should not be cited, referenced, or used to inform any programme or policy decisions.

PH Data Academy — Core Course

Data Analysis for
Public Health Practitioners

From a messy DHIS2 export to a donor-ready dashboard. Built around simulated South African STI and HIV district data — Similar to the files you work with every day.

Focused on Excel, Power Query, Power BI, and R — practical tools for M & E practitioners and programme managers, not data scientists.

Start Learning  
8Parts
28Lessons
52Districts of data
9Provinces
Excel + RTools covered

Course dataset — South Africa STI/HIV district indicators

Every lesson uses simulated DHIS2 exports covering 52 districts across all 9 provinces, April 2023 – September 2025. You will work with messy raw exports, inconsistent province prefixes, data quality flags, and cross-period comparisons — exactly what you encounter in practice.

Syphilis screening rate1st ANC visit (target ≥95%)
BPG treatment coverage1st dose after positive test
ANC syphilis positivityRate per 100 ANC clients
HIV positivity 15–24 yrsExcl. ANC clients
HIV positivity 25–49 yrsExcl. ANC clients
MUS incidence rateMale Urethritis Syndrome
Teen delivery rateFacility births, 10–19 years
Male STI screening15–49 years, new episodes
Course map
1
Part One

The Whole Game

Before learning any single tool, you run the complete pipeline once. Messy DHIS2 export → clean data → insight → action. This section exists so every later lesson has a destination.

Lesson 1.1
From export to answer in 30 minutes
A guided walkthrough of the full 8-stage analysis cycle using a single DHIS2 file. No theory — just execution. You will produce one output: a ranked list of districts with syphilis screening below 80%.
OrientationExcel
Programme scenario

Your provincial STI coordinator has asked for a list of districts not hitting the 95% syphilis screening target at ANC. She needs it by 3pm. You have a DHIS2 export for April 2024 — 52 districts, 12 columns. What do you do first?

  • 01
    Opening the file: what are you actually looking at?
    Reading column headers. Identifying the unit of observation (one row = one district-month). Spotting numeric vs rate columns. Checking for blanks.
    ExcelData literacy
  • 02
    Filter, sort, and flag — finding the underperformers
    Using AutoFilter on Syphilis screening at 1st ANC visit rate. Sorting descending. Adding a conditional format column: Below target / On target.
    Excel
  • 03
    Turning the list into a decision — the one-sentence finding
    Writing a programme-facing finding: "In April 2024, 9 of 52 districts reported syphilis screening below 80%, concentrated in Gauteng and the Northern Cape."
    Communication
Practice task 1.1

Open the April 2025 dataset. Produce the same ranked list. How many districts have improved since 2024? Write one finding sentence suitable for a district review meeting.

Lesson 1.2
The 8-stage analysis cycle — your mental map
Every analysis task you will ever do fits the same eight stages. This lesson maps each stage to the course and to the cheat sheet.
FrameworkConcept
  • 01
    Stage 1–2: Define → Collect
    The most skipped steps. What question are we actually answering? Which DHIS2 indicator is the right one? Covered fully in Part 2.
  • 02
    Stage 3–4: Clean → Explore
    Why 60% of analysis time is data preparation. Why EDA before analysis saves rework.
  • 03
    Stage 5–8: Analyse → Communicate → Act → Monitor
    The outputs programme managers need: a finding, a chart, a dashboard, a decision. Covered in Parts 5–7.
2
Part Two

Define & Collect

Most analysis errors happen before a single cell is touched. Translate a programme question into a data request — and understand what DHIS2 actually delivers.

Lesson 2.1
From programme problem to measurable question
The difference between "our syphilis numbers are bad" and "which districts have ANC screening below 80% for two consecutive quarters?" Using the Problem–Objective–Success Criteria framework.
DefineConcept
Programme scenario

Donor review in six weeks. The programme manager says: "We need to show progress on syphilis." Before pulling any data — Problem → Positivity has not declined in EC; Objective → Identify EC districts with declining vs rising positivity 2023–2025; Success → Ranked district table with trend direction.

  • 01
    The Problem–Objective–Success framework
    Using the cheat sheet Stage 1 template. Worked example: syphilis in the Eastern Cape. Why "we need data" is not a question.
  • 02
    Choosing the right indicator
    When to use rate vs count. Why Syphilis screening at 1st ANC visit rate and BPG dose 1st treatment coverage measure different things. Denominator traps.
Practice task 2.1

A colleague says: "KZN teen pregnancy is a problem." Complete the Problem–Objective–Success table using Delivery in 10-19 years in facility rate. Identify three KZN districts where the rate exceeds 20% in 2024.

Lesson 2.2
Reading a DHIS2 export — what the columns actually mean
Anatomy of the dataset: periodname, organisationunitname, and each indicator. What "rate" means in DHIS2 — it's a calculation, not a raw count.
CollectExcel
  • 01
    Anatomy of the DHIS2 export
    periodname and organisationunitname are your time and place variables. The two-letter province prefix in district names (ec, kz, gp…) — and why it matters for grouping.
    ExcelData literacy
  • 02
    What values above 100% mean — and when to flag them
    Our dataset includes screening rates >100% (e.g. Thabo Mofutsanyana 116.5%, Capricorn 294.7%). These are DHIS2 data quality artefacts. How to identify and note them without deleting them.
    Data quality
Lesson 2.3
Combining multiple monthly exports
You have six separate files — one per time point. Power Query's folder import appends them automatically. First step toward a longitudinal dataset.
CollectPower Query
Programme scenario

You have six DHIS2 exports — April 2023 through September 2025. The donor wants a trend table showing national screening rates across all periods. You need one combined table, not six files.

  • 01
    Get Data → Folder: importing all exports at once
    Setting up the folder import. Expanding the Content column. Why the header row appears in every file and how Power Query skips it automatically.
    Power Query
  • 02
    Result: 1,612 rows — verifying the append was correct
    Checking row count (52 districts × ~31 periods). Confirming periodname has the right unique values. Adding a refresh step comment for the next analyst.
    Power QueryReproducibility
Power Query M — folder import
// Step 1: Connect to folder
Source = Folder.Files("C:\Data\STI_Exports"),

// Step 2: Keep only .xlsx files
Filtered = Table.SelectRows(Source, each
  [Extension] = ".xlsx"),

// Step 3: Expand and combine
Combined = Table.Combine(
  List.Transform(Filtered[Content],
    each Excel.Workbook(_){0}[Data])),

// Step 4: Promote headers
WithHeaders = Table.PromoteHeaders(Combined)
3
Part Three

Clean & Prepare

The province prefixes are lowercase. District names have trailing spaces. Rates above 100% exist. This part teaches Power Query techniques using exactly the problems in this dataset.

Lesson 3.1
Extracting the province from the district name
The organisationunitname column encodes the province as a two-letter lowercase prefix (ec, kz, gp, lp…). Creating a clean province column is the most important cleaning step for provincial analysis.
CleanPower Query
Programme scenario

You need a province-level summary for the quarterly review. But the data has no province column — only long district names like kz eThekwini Metropolitan Municipality. You need to extract KZ from those two leading characters.

  • 01
    Extract First Characters — Power Query's built-in tool
    Add Column → Extract → First Characters → 2. Rename to prov_code. Transform → Uppercase. Result: EC, KZ, GP, LP, MP, FS, WC, NC, NW.
    Power Query
  • 02
    Mapping codes to full province names with a lookup table
    Creating a small EC→Eastern Cape lookup table. Merging it into the main query. Why this is better than Replace Values — it's reusable and auditable.
    Power QueryData modelling
Power Query M — extract province
AddProvCode = Table.AddColumn(
  Source, "prov_code",
  each Text.Upper(Text.Start([organisationunitname], 2)),
  type text
),

CleanDistrict = Table.AddColumn(
  AddProvCode, "district",
  each Text.Trim(Text.Middle([organisationunitname], 3)),
  type text
)
Lesson 3.2
Fixing the period column — text dates to sortable dates
periodname is stored as text ("April 2023"). Excel sorts "April" before "August" alphabetically, not chronologically. Converting to a real date type enables correct trend charts.
CleanPower Query
  • 01
    Converting "April 2023" to a proper Date column
    Appending " 01" to make it parseable. Using Date.FromText. Why sortability matters for every chart you will ever make from this data.
    Power Query
  • 02
    Adding Year and Quarter columns for grouping
    Extracting Year and Quarter from the date column. These become your row/column headers in every pivot table going forward.
    Power Query
Lesson 3.3
Flagging data quality issues — values above 100% and blanks
The syphilis screening column contains rates above 100% in several districts (e.g. June 2025: Thabo Mofutsanyana 116.5%, Capricorn 294.7%). The right response is to flag, not delete.
Data qualityExcel
Programme scenario

During review you notice Capricorn District (Limpopo) reported BPG treatment coverage = 294.7% in June 2025. This cannot be real. Before including it in any trend chart, you need a flag column — deleting it would hide the problem from programme leadership.

  • 01
    Adding a data quality flag column in Power Query
    Conditional column: IF screening rate >100 THEN "Flag" ELSE "OK". Keeping the original value — never overwriting raw data.
    Power QueryData quality
  • 02
    Documenting data quality for programme managers
    Writing a one-line data quality note in the analysis header: "3 district-months flagged with screening >100% and excluded from national averages." Why transparency matters for donor reporting.
    Communication
Lesson 3.4
Automating the clean — refresh with one click
Every transformation you built in Power Query is recorded as a step. When next month's DHIS2 export arrives, you drop it in the folder and click Refresh. The entire pipeline re-runs in seconds.
AutomationPower Query
  • 01
    Reviewing Applied Steps — your audit trail
    Reading the steps pane from top to bottom. Naming each step descriptively (not "Removed Columns1"). Why named steps matter when a colleague picks up your file in six months.
    Power QueryReproducibility
  • 02
    Test: add new data and verify the pipeline
    Drop a new file in the folder. Refresh All. Confirm row count increases. Confirm province column still populates. Confirm data quality flags run. This is your regression test.
    Power Query
Practice task 3.4 — mini project

Build the complete cleaning pipeline for the 6-file dataset. Your cleaned output must have: (1) a province column, (2) a district column without prefix, (3) a period_date as a real Date type, (4) a dq_flag column for values >100%. Verify it refreshes correctly when you add a new file.

4
Part Four

Explore

Before you analyse, you look. Pivot tables, summary statistics, and distribution checks — done on your combined 1,600-row dataset. This is where surprises live.

Lesson 4.1
Summary statistics — national picture at a glance
Using AVERAGE, MEDIAN, MIN, MAX, and STDEV to characterise each indicator nationally. The national syphilis screening average improved from 54% (April 2023) to 94% (September 2025) — what does that distribution look like?
ExploreExcel
Key finding from the data

National average syphilis screening at 1st ANC: April 2023 = 54.2%April 2024 = 86.7%April 2025 = 92.7%. A 38-percentage-point improvement in two years. But the average hides KZN districts consistently above 95% while some Gauteng metros were at 5–18% in April 2023.

  • 01
    AVERAGEIFS and COUNTIFS — conditional summaries by province
    Calculating the provincial average screening rate for each period using =AVERAGEIFS(rate_col, prov_col, "KZ", period_col, "April 2024"). Building a 9-province summary table.
    Excel
  • 02
    What does the distribution look like? Min, max, spread
    In April 2023, GP districts ranged from 5.8% to 89.4%. In September 2025, the range narrowed to 82.9%–98.6%. Distribution shape tells you whether the problem is systemic or concentrated.
    ExcelEDA
Lesson 4.2
Pivot tables for health data — the workhorse of EDA
Pivot tables answer "what is the value of X, broken down by Y?" in seconds. Three essential configurations every analyst should know.
ExploreExcel
  • 01
    Configuration 1: Province × Period cross-tab
    Rows: province, Columns: periodname (April 2023, April 2024, April 2025), Values: AVERAGE of syphilis screening rate. Immediately shows Gauteng's dramatic catch-up.
    Excel
  • 02
    Configuration 2: Top/bottom district ranking
    Rows: district, Values: AVERAGE screening rate, Sort descending, Show Top 10 / Bottom 10. Identifies chronic underperformers for targeted support visits.
    Excel
  • 03
    Configuration 3: % below target by province
    Rows: province, Values: COUNTIF districts below 80%, COUNTA districts. Calculated field: % below target. Conditional formatting to flag provinces needing support visits.
    Excel
Practice task 4.2

Build all three pivot configurations. Answer: (1) Which province had the largest improvement in screening rate between April 2023 and September 2025? (2) Which three districts have the highest Antenatal syphilis test positive rate in the most recent period?

Lesson 4.3
Exploring relationships between indicators
Does high syphilis positivity correlate with low BPG treatment coverage? Scatter plots and CORREL before formal analysis.
ExploreExcel
  • 01
    Scatter plot: syphilis positivity vs BPG coverage
    Visualising districts where positivity is high but treatment coverage is low — the worst-case quadrant. How to add quadrant lines at national average.
    Excel
  • 02
    =CORREL() — quantifying relationships
    Calculating the correlation between HIV positive 15-24 and HIV positive 25-49 rates across districts. Interpreting r in a programme context.
    ExcelStatistics
5
Part Five

Analyse

Moving from "what does the data look like?" to "what is actually happening?" Trend analysis, comparative ranking, performance gap measurement, and outlier detection.

Lesson 5.1
Trend analysis — is this indicator improving?
Using SLOPE() and trend lines to answer the most common programme question: "Are we getting better?" Using 30 time points of national screening data from the combined dataset.
AnalyseExcel
Key finding from the data

National syphilis screening improved from 54% (Apr 2023) to 94% (Sep 2025). But national BPG treatment coverage only improved from 63% to 72% over the same period. Screening improved far faster than treatment completion — what does that mean for programme design?

  • 01
    Building a trend table: monthly averages across all 30 periods
    Calculating national monthly averages for both screening rate and BPG coverage. Ordering by date (not alphabetically). The single most important table in a programme review.
    Excel
  • 02
    =SLOPE() — is the trend positive, negative, or flat?
    Calculating =SLOPE(y_values, x_values) where x = period number 1–30. Interpreting slope: +1.2 pp/month means reaching target in X months. Calculating that projection.
    ExcelStatistics
  • 03
    Adding a target reference line to your trend chart
    Adding a 95% target line as a separate data series. The visual distance between the line and the trend is your programme gap — the most actionable information in the chart.
    Excel
Lesson 5.2
Ranking districts — who needs support visits?
RANK() and PERCENTRANK() to produce performance league tables with traffic-light formatting. The KZN district table from April 2024 as the worked example.
AnalyseExcel
Programme scenario

The KZN STI coordinator wants a district scorecard for the quarterly review. In April 2024, KZN districts ranged from iLembe (100.3% screening, 98.4% BPG) to King Cetshwayo (85.6% screening, 85.7% BPG). She wants a ranked table with traffic-light flags — green ≥95%, amber 80–94%, red <80%.

  • 01
    =RANK() for performance league tables
    =RANK(C2,$C$2:$C$12,0) for descending rank. Why simple RANK beats complex composite scores for programme use.
    Excel
  • 02
    Traffic light formatting with conditional format rules
    Three-colour conditional format: ≥95 = green, 80–94.9 = amber, <80 = red. Using icon sets vs colour fills — when each is appropriate for a programme audience.
    Excel
Lesson 5.3
Outlier detection — which districts are anomalies?
Using AVERAGE ± 2×STDEV to flag statistical outliers. In the BPG coverage data, Sedibeng (August 2025: 43.1% MUS incidence) and Chris Hani (53% in September 2025) are outliers worth investigating.
AnalyseExcel
  • 01
    Statistical outliers: AVERAGE + 2*STDEV threshold
    =IF(D2 > AVERAGE+2*STDEV, "High outlier", IF(D2 < AVERAGE-2*STDEV, "Low outlier", "Normal")). Why this threshold is a starting point, not a final answer.
    ExcelStatistics
  • 02
    Data quality outliers vs clinical outliers
    Not every outlier is a data error. iLembe's MUS incidence of 37–40/100,000 across multiple periods is probably real. How to distinguish systematic programme difference from data problems.
    Data qualityProgramme context
Lesson 5.4
Performance gap analysis — how far from target?
Measuring the gap between actual and target — not just above/below but by how much, for how long. The BPG treatment coverage gap (target 95%, national average 70%) across all periods.
AnalyseExcel
  • 01
    =(Actual − Target) / Target × 100 — performance gap formula
    Building a gap table for BPG treatment coverage against the 95% target. Which districts have a gap >25 percentage points? Which have been below target for >3 consecutive periods?
    Excel
  • 02
    Persistent underperformance: counting consecutive periods below target
    Using COUNTIFS with a period range to count how many months a district has been below 80% BPG coverage. Translating this into a programme escalation list.
    ExcelProgramme context
Practice task 5.4 — mini project

Produce a "BPG Alert List" for September 2025: all districts with (1) syphilis screening ≥90% AND (2) BPG treatment coverage <60%. These are districts finding positive cases but not completing treatment — the most actionable gap in the data.

6
Part Six

Visualise & Communicate

Charts that tell the right story to a programme audience. Power BI dashboards for provincial coordinators and district managers. Every visual is built directly from the STI/HIV dataset.

Lesson 6.1
Choosing the right chart type for health data
The cheat sheet chart selection guide applied to our specific indicators. When a line chart beats a bar chart. Why a sorted horizontal bar is the right choice for district rankings.
VisualiseExcel · Power BI
  • 01
    Line chart: national syphilis screening trend (30 months)
    Building the trend line with a 95% target reference. Writing the chart title as a finding: "National screening reached 94% — but BPG coverage lags at 72%."
    Excel
  • 02
    Sorted bar chart: district ranking for a single period
    52 districts sorted by syphilis screening rate for September 2025. Adding a vertical target line. Labelling only the top 5 and bottom 5.
    Excel
  • 03
    Scatter plot: screening coverage vs BPG treatment — the quadrant chart
    X: syphilis screening rate. Y: BPG treatment coverage. Quadrant lines at 95% and 90%. Labels: High screening/High BPG (ideal) vs High screening/Low BPG (priority).
    Excel
Lesson 6.2
Power BI fundamentals — connecting to the clean dataset
Importing the cleaned Excel model into Power BI Desktop. Building the data model with a period dimension table for correct time-intelligence.
Power BIDashboard
  • 01
    Import, Model, Report: the three Power BI views
    Data view — checking your 1,600 rows loaded correctly. Model view — confirming the period date column is recognised as Date. Report view — where the dashboard lives.
    Power BI
  • 02
    DAX measures: average screening rate and BPG coverage
    Avg Screening = AVERAGE([Syphilis screening at 1st ANC visit rate]). % Districts Above Target = DIVIDE(COUNTROWS(FILTER(data,[rate]>=95)), COUNTROWS(data)).
    Power BIDAX
Lesson 6.3
Building the STI surveillance dashboard
A four-visual dashboard page: KPI cards, trend line, district bar chart, and province map. Slicers for period and province. The deliverable for a quarterly review meeting.
Power BIDashboard
Programme scenario

The national STI programme manager needs an interactive dashboard for a 30-minute district review meeting. She needs: (1) national KPI numbers at the top, (2) a trend line filterable by province, (3) a ranked district list for the selected period, (4) a map showing provincial averages. Everything must update when she selects a different province or period.

  • 01
    Page 1: National overview — KPI cards + trend
    Four KPI cards: Avg Screening Rate, % Districts ≥95%, Avg BPG Coverage, National Syphilis Positivity Rate. Dual-axis line chart across 30 months.
    Power BI
  • 02
    Page 2: District drilldown — sorted bar + map
    Horizontal bar chart sorted by screening rate. Province slicer updates bar chart and map simultaneously. Conditional colour: red <80%, amber 80–94%, green ≥95%.
    Power BI
Lesson 6.4
Storytelling — turning visuals into decisions
A chart without a narrative is just a picture. Framing findings for a programme director, a district manager, and a donor — three different audiences, same data.
CommunicationConcept
  • 01
    The finding sentence: Situation → Data → Implication
    Formula: "[What the data shows] in [place/time], which means [programme implication]." Example: "Syphilis screening in EC improved from 68% to 92%, but BPG treatment coverage remains at 71%, suggesting a treatment pipeline gap requiring a stock or training intervention."
    Communication
  • 02
    Dashboard design principles for health sector audiences
    Sort bar charts by value; use target lines; limit charts to 8 per page; never use 3D charts; use consistent colour coding. Applying all to the STI dashboard.
    Design
Practice task 6.4 — mini project

Publish the STI surveillance dashboard to Power BI Service. Write three finding sentences — one for a district manager, one for a provincial coordinator, one for a donor report. Each must use a different indicator and follow the Situation → Data → Implication format.

7
Part Seven

Act & Monitor

Analysis that doesn't inform a decision is wasted. District scorecards, donor progress reports, and automated alerts — the operational end of the analysis pipeline.

Lesson 7.1
District scorecard — one-page performance summary
A single-page Excel scorecard for a district manager: current values, trend direction (↑ ↓ →), traffic lights, and one action-oriented finding. Designed to print on one A4 page.
DeliverableExcel
Programme scenario

The Chris Hani District Manager needs a one-page scorecard for his quarterly self-assessment. Chris Hani has consistently high syphilis positivity (3.6–5.4% across periods) and variable BPG coverage (62–74%). The scorecard must show where he stands, whether he's improving, and what to do next.

  • 01
    XLOOKUP: pulling the latest district values automatically
    The scorecard updates automatically when you change the district dropdown. Why XLOOKUP beats VLOOKUP for this use case.
    Excel
  • 02
    Trend arrows with conditional formatting
    Using ▲ ▼ → symbols based on whether the current value is higher, lower, or within 1pp of the previous period. How to fit 8 indicators on one readable A4 page.
    Excel
Lesson 7.2
Donor progress report — linking Excel data to a Word narrative
How to produce a quarterly donor report where numbers update automatically when new data arrives. Excel → Word linked tables → PDF. The workflow that saves four hours per reporting cycle.
DeliverableExcel · Word
  • 01
    Linking Excel tables to Word — the paste-link technique
    Paste Special → Paste Link in Word. When the Excel source updates, the Word table updates with one click. Breaking the copy-paste-reformat cycle that causes donor report errors.
    Excel
  • 02
    Writing the standard indicator table for donor reporting
    Structure: Indicator | Target | Baseline (Apr 2023) | Latest (Sep 2025) | % Change | Status. The narrative sentence accompanying each row.
    Communication
Lesson 7.3
Automated alerts — monitoring without manual checking
Power BI data alerts send an email when a KPI crosses a threshold. Setting up both Power BI and Excel alerts so you are notified when a district drops below target — not after you happen to look.
MonitorPower BI
  • 01
    Power BI alert: notify when national BPG coverage drops below 65%
    Setting up a data alert on a KPI card. Alert conditions: BPG coverage <65% (warning), <55% (critical). Connecting to email or Teams notification.
    Power BI
  • 02
    Excel monitoring: the refreshable exception report
    A Power Query-driven exception table: districts with >3 consecutive periods below 80% screening rate. This table re-runs on every Refresh All — your standing monitoring output.
    ExcelMonitoring
8
Part Eight · Stretch Track

Scale with R

Everything you did in Excel, reproduced in R with the tidyverse. Not because R is better for every task — but because it handles 1,600 rows the same way it handles 160,000, and every step is documented in code.

Who this part is for: Analysts who have completed Parts 1–7 and want to reproduce the same analysis in R — for reproducibility, scale, or integration with DHIS2 APIs. No prior R knowledge required.
Lesson 8.1
Reading DHIS2 exports into R with readxl
Installing the tidyverse. Reading all six Excel files with map_dfr(). The R equivalent of Power Query's folder import — in six lines of code.
RImport
R — load all DHIS2 exports
library(tidyverse)
library(readxl)

# List all xlsx files in the data folder
files <- list.files(
  "data/", pattern = "\\.xlsx$", full.names = TRUE
)

# Read and combine — one function, six files
raw <- map_dfr(files, ~read_excel(.x, skip = 1))

glimpse(raw)  # Rows: 1,612  Cols: 13
Lesson 8.2
Cleaning with dplyr and stringr
Extracting province codes, parsing period dates, renaming columns, and flagging data quality issues — all in a single tidy pipeline using |>.
RClean
R — cleaning pipeline
clean <- raw |>
  rename(
    period          = periodname,
    district        = organisationunitname,
    syphilis_screen = `Syphilis screening at 1st ANC visit rate`,
    bpg_coverage    = `BPG dose 1st treatment coverage`,
    syph_positivity = `Antenatal syphilis test positive rate`
  ) |>
  mutate(
    prov_code   = str_to_upper(str_sub(district, 1, 2)),
    dist_clean  = str_trim(str_sub(district, 4)),
    period_date = as.Date(paste0("01 ", period), format = "%d %B %Y"),
    year        = year(period_date),
    dq_flag     = if_else(
      syphilis_screen > 100 | bpg_coverage > 150,
      "Flag", "OK"
    )
  )
Lesson 8.3
Exploratory analysis with dplyr
group_by() + summarise() replaces the entire pivot table interface. Producing the provincial cross-tab, district rankings, and trend table in code.
RExplore
R — national trend + district rankings
# National trend by period
national_trend <- clean |>
  filter(dq_flag == "OK") |>
  group_by(period_date, period) |>
  summarise(
    avg_screening = mean(syphilis_screen, na.rm = TRUE),
    avg_bpg       = mean(bpg_coverage,    na.rm = TRUE),
    n_districts   = n(),
    .groups = "drop"
  ) |>
  arrange(period_date)

# District rankings for September 2025
sep_ranking <- clean |>
  filter(period == "September 2025", dq_flag == "OK") |>
  select(dist_clean, prov_code, syphilis_screen, bpg_coverage) |>
  arrange(desc(syphilis_screen))
Lesson 8.4
Visualising with ggplot2
Reproducing the three key charts from Part 6 in ggplot2. Publication-quality output in 15 lines of code.
Rggplot2
R — ggplot2 trend chart
library(ggplot2)

ggplot(national_trend, aes(x = period_date, y = avg_screening)) +
  geom_line(colour = "#1d4ed8", linewidth = 1) +
  geom_line(aes(y = avg_bpg),
    colour = "#d97706", linewidth = 1, linetype = "dashed") +
  geom_hline(yintercept = 95,
    colour = "#dc2626", linetype = "dotted") +
  annotate("text", x = min(period_date),
    y = 96.5, label = "95% target",
    colour = "#dc2626", size = 3, hjust = 0) +
  labs(
    title = "Screening improved sharply; BPG coverage lags",
    x = NULL, y = "Rate (%)"
  ) +
  theme_minimal()
Lesson 8.5
Reproducible reports with R Markdown
Combining code and narrative in one document. When new exports arrive, the report re-renders with updated figures. The gold standard for donor reporting.
RReproducibility
  • 01
    Structure of an R Markdown quarterly report
    YAML header, setup chunk, inline code for numbers (`r round(latest_screening, 1)`%), figure chunks, and knit-to-Word output for donor submission.
    R Markdown
  • 02
    Parameterised reports — one template, nine provinces
    Adding a params: province parameter. Running rmarkdown::render() in a loop across all 9 provinces. Nine separate district reports, automatically generated.
    R MarkdownAutomation
Practice task 8.5 — final project

Build a parameterised R Markdown report for the KwaZulu-Natal provincial coordinator. The report must: (1) read the full 6-file dataset, (2) filter to KZN districts only, (3) include the trend chart, district ranking table, and the BPG alert list, (4) render to Word with one click, (5) update automatically when a new data file is added to the folder.

Course outcomes

By the end of this course, you will be able to:

Import and combine DHIS2 exports automatically
Clean district names, dates, and data quality flags
Build province and district pivot summaries
Produce ranked district performance tables
Detect trends, outliers, and performance gaps
Build interactive Power BI surveillance dashboards
Write finding sentences for programme audiences
Link Excel data to Word donor report templates
Set up automated monitoring alerts in Power BI
Reproduce the full pipeline in R for scale and rigour