Data Science Lifecycle and Visualisation

Unit 1

Welcome! 👋

Topics

  • Data Science Lifecycle
  • Visualising data
  • Tidy data: cleaning and transforming data
  • Communicating with Quarto

Getting to know each other

  • Department / role
  • What do you want to learn?
  • What do you enjoy doing outside of work?

Goals for today

  1. List the six elements of the data science lifecycle
  2. Identify aesthetic mappings for data visualisation in the {ggplot2} R package
  3. Identify four components of a Quarto file

Data Science Lifecycle

Reproducible data analysis

  • Can tables and figures be reproduced from the data and code?
  • Can the code be reused in other scripts?
  • Does my environment match that of my colleagues?
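
One way to approach the last question is to record the R version and package versions alongside the analysis. A minimal sketch using base R only; the helper name `describe_environment` is hypothetical and not part of the course material:

```r
# Hypothetical helper: collect the details of the current R environment
# that most often differ between colleagues' machines.
describe_environment <- function(packages = c("stats", "utils")) {
  list(
    r_version = R.version.string,    # full R version string
    platform  = R.version$platform,  # operating system / architecture
    packages  = vapply(
      packages,
      function(pkg) as.character(packageVersion(pkg)),
      character(1)
    )
  )
}

env <- describe_environment()
str(env)

# sessionInfo() prints a fuller report, including all attached packages
sessionInfo()
```

Sharing this output with a colleague makes version differences visible before they cause diverging results.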

Course tools

Programming:

Version control & collaboration:

Reproducible data analysis

Hello R! 👋


You need the language

and the IDE

R packages


You use R through packages

install.packages("package")
library(package)


…which contain functions


…which are often just commands

do_this(to_this)
do_that(to_this, to_that, with_those)
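
As a concrete instance of the pattern above, the following sketch uses only packages that ship with R, so nothing has to be installed. `requireNamespace()` checks whether a package is available before it is loaded; the choice of `stats` and the example values are assumptions for illustration.

```r
# Check that a package is installed before loading it.
# "stats" ships with every R installation, so this check succeeds.
if (requireNamespace("stats", quietly = TRUE)) {
  library(stats)
}

# do_this(to_this): one function, one argument
root <- sqrt(16)

# do_that(to_this, to_that, with_those): one function, several arguments
steps  <- seq(from = 1, to = 10, by = 3)
middle <- median(c(3, 1, 2))   # median() comes from the stats package

root    # 4
steps   # 1 4 7 10
middle  # 2
```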

RStudio

RStudio and R essentials

Break 🍵 🍜

10:00

Tidyverse

Data Science Lifecycle

  • Text → Markdown
  • Code → code chunk
```{r}
sqrt(1/5)
```
[1] 0.4472136
  • Code and text → inline code

    `​r sqrt(1/5)` → 0.4472136

Let's dive in!

```{r}
#| code-line-numbers: "|5-7|8|16|22"
#| eval: false

un_votes |>
  inner_join(un_roll_calls, by = "rcid") |>
  inner_join(un_roll_call_issues, by = "rcid") |>
  filter(country %in% c("Algeria", "Switzerland", "United Kingdom")) |>
  mutate(
    year = year(date),
    issue = fct_relevel(issue, "Arms control and disarmament"),
    issue = fct_relevel(issue, "Palestinian conflict", after = Inf)
  ) |>
  group_by(country, year, issue) |>
  summarise(percent_yes = mean(vote == "yes")) |>
  ggplot(mapping = aes(x = year, y = percent_yes, colour = country)) +
  geom_point(alpha = 0.4, size = 1) +
  geom_smooth(method = "loess", se = FALSE) +
  facet_wrap(~issue) +
  scale_y_continuous(labels = label_percent()) +
  labs(
    title = "Percentage of 'Yes' votes in the UN General Assembly",
    subtitle = paste(un_roll_calls |> summarise(min(year(date))) |> pull(), "to", un_roll_calls |> summarise(max(year(date))) |> pull()),
    colour = "Country",
    x = "Year",
    y = "% Yes"
  ) +
  theme_minimal() +
  theme(
    text = element_text(size = 8)
  )
```

Practical 01a: UN Votes

prak-01a-unvotes.qmd

  • Click Render or press Ctrl + Shift + K
  • Select other countries
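
Selecting other countries means editing the vector passed to `%in%` in the `filter()` call. The sketch below shows the same idea in base R on a small made-up data frame; the data and the names `votes` and `selected` are hypothetical and only illustrate how `%in%` keeps matching rows.

```r
# Made-up example data (not the real UN votes data)
votes <- data.frame(
  country = c("Algeria", "France", "Switzerland", "Kenya", "France"),
  vote    = c("yes", "no", "yes", "yes", "yes")
)

# Countries to keep; edit this vector to compare other countries
selected <- c("France", "Kenya")

# %in% returns TRUE for every row whose country appears in `selected`
kept <- votes[votes$country %in% selected, ]
kept
```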

20:00

Practical 01b: Quarto – Bechdel Test

prak-01b-bechdel.qmd

20:00

Break 🍵 🍜

10:00

Visualisation

R package ggplot2

Grammar of Graphics

Scatterplot

Visualising distributions

Visualising relationships

ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], 
                     y = [y-variable])) +
   geom_xxx() +
   other options
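
To make the template concrete, here is a minimal sketch that fills in the placeholders with the `mtcars` dataset that ships with R. The choice of dataset, variables, and labels is an assumption for illustration; the course examples use the penguins data instead.

```r
library(ggplot2)

# [dataset] -> mtcars, [x-variable] -> wt, [y-variable] -> mpg
p <- ggplot(
  data = mtcars,
  mapping = aes(x = wt, y = mpg)
) +
  geom_point() +                  # geom_xxx(): draw one point per car
  labs(
    x = "Weight (1000 lbs)",      # other options: axis labels
    y = "Miles per gallon"
  )

p
```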

First steps: scatterplot

Prerequisites

```{r}
# install.packages("tidyverse")
library(tidyverse)
```


```{r}
library(palmerpenguins) # data
library(ggthemes) # colourblind safe colour palette
```

```{r}
glimpse(penguins)
```
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
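
The `NA` entries in the `glimpse()` output are missing values. A quick way to count them per column is sketched below; this check is an addition for illustration, and the object names are hypothetical.

```r
library(palmerpenguins)

# Number of missing values in each column of the penguins data
missing_per_column <- colSums(is.na(penguins))
missing_per_column

# Rows that lack either of the two variables used in the scatterplot;
# ggplot2 drops such rows with a warning when drawing points
incomplete <- is.na(penguins$flipper_length_mm) | is.na(penguins$body_mass_g)
sum(incomplete)
```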

Our goal

ggplot(
  data = penguins,
  mapping = aes(
    x = flipper_length_mm,
    y = body_mass_g
  )
) +
  geom_point(mapping = aes(colour = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    colour = "Species",
    shape = "Species"
  ) +
  scale_colour_colorblind()

Creating a plot

ggplot(
  data = penguins
)

Creating a plot

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm)
)

Creating a plot

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)

Creating a plot

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

Aesthetics and layers

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, colour = species)
) +
  geom_point()

Aesthetics and layers

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, colour = species)
) +
  geom_point() +
  geom_smooth(method = "lm")

Aesthetics and layers

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(colour = species)) +
  geom_smooth(method = "lm")
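
The difference between the last two plots comes from where `colour` is mapped. Mapped in `ggplot()`, it applies to every layer, so `geom_smooth()` fits one line per species; mapped inside `geom_point()`, it applies to the points only and a single line is fitted. The sketch below checks this by counting the groups in the smoothing layer with `layer_data()`; the object names are hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

# colour mapped globally: inherited by both layers
p_global <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, colour = species)
) +
  geom_point() +
  geom_smooth(method = "lm")

# colour mapped locally: used by the point layer only
p_local <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(colour = species)) +
  geom_smooth(method = "lm")

# Layer 2 is geom_smooth(); count how many separate lines it draws
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 2)$group))

n_lines_global  # one line per species
n_lines_local   # a single line for all penguins
```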

Aesthetics and layers

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(colour = species, shape = species)) +
  geom_smooth(method = "lm")

Aesthetics and layers

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(aes(colour = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper length (mm)", 
    y = "Body mass (g)",
    colour = "Species",
    shape = "Species"
  ) +
  scale_colour_colorblind()

Practical 01c

prak-01c-ggplot-scatter.qmd

20:00

Break 🍵 🍜

10:00

Aesthetics options

  • colour ✅
  • shape ✅
  • size
  • alpha (transparency)

ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g,
    colour = species,
    shape = species
  )
) +
  geom_point() +
  scale_colour_colorblind()

ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g,
    colour = species,
    shape = island
  )
) +
  geom_point() +
  scale_colour_colorblind()

ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g,
    colour = species,
    shape = island,
    size = bill_length_mm
  )
) +
  geom_point() +
  scale_colour_colorblind()

ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g,
    colour = species,
    shape = island,
    size = bill_length_mm,
    alpha = bill_depth_mm
  )
) +
  geom_point() +
  scale_colour_colorblind()

Faceting

ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g,
    colour = species
  )
) +
  geom_point() +
  facet_wrap(~island) +
  scale_colour_colorblind()

Faceting

ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g,
    colour = species
  )
) +
  geom_point() +
  facet_grid(island ~ sex) +
  scale_colour_colorblind()

Mapping vs. Setting

ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g,
    size = bill_depth_mm,
    alpha = bill_length_mm
  )
) +
  geom_point()

ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g
  )
) +
  geom_point(size = 4, alpha = 0.2)

Mapping vs. Setting

ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = bill_length_mm,
    size = body_mass_g,
    alpha = flipper_length_mm
  )
) +
  geom_point()

Mapping

ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = bill_length_mm
  )
) +
  geom_point(size = 4, alpha = 0.2)

Setting
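
A common slip is to put a fixed value inside `aes()`. ggplot2 then treats it as data, a single category called "blue", and picks a colour from its default palette instead of drawing blue points. A minimal sketch of the contrast, with hypothetical object names:

```r
library(ggplot2)
library(palmerpenguins)

# Mapping: "blue" is treated as a data value, so a legend appears
# and the points get a colour from the default palette
p_mapped <- ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point(aes(colour = "blue"))

# Setting: every point is drawn in blue, no legend
p_set <- ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point(colour = "blue")

# Colours actually used for drawing each layer
colours_mapped <- unique(layer_data(p_mapped)$colour)
colours_set    <- unique(layer_data(p_set)$colour)

colours_mapped
colours_set
```

Fixed values belong outside `aes()`, as arguments of the geom; only variables from the data belong inside it.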

prak-01c-ggplot-scatter.qmd

Workflow: Spelling

Names

# Good
i_use_snake_case

# Acceptable
otherPeopleUseCamelCase
some.people.use.periods

# Bad
And_aFew.People_Are.FREEspirits

⚠️ Case matters…

welcome_to_r <- "Welcome to R"
welcome_to_R
#> Error: object 'welcome_to_R' not found

… and so does punctuation!
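
Both points can be checked without stopping the session. The sketch below uses `exists()` to look up a name and `tryCatch()` with `parse()` to catch a punctuation mistake; the helper `syntax_ok` and the unbalanced `sqrt(16` are made-up examples.

```r
welcome_to_r <- "Welcome to R"

# Case matters: only the exact spelling refers to the object
lower_found <- exists("welcome_to_r")   # TRUE
upper_found <- exists("welcome_to_R")   # FALSE, unless defined elsewhere

# Punctuation matters: a missing closing parenthesis is a syntax error.
# parse() checks the code without running it.
syntax_ok <- function(code) {
  tryCatch(
    {
      parse(text = code)
      TRUE
    },
    error = function(e) FALSE
  )
}

complete_call   <- syntax_ok("sqrt(16)")  # TRUE
incomplete_call <- syntax_ok("sqrt(16")   # FALSE
```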

Homework

prak-01d-style.qmd

R for Data Science

  • The book for this course
  • Free online
  • Tidyverse philosophy

Thank you! 🌑

Slides created via revealjs and Quarto.

Access slides as PDF.

All material is licensed under Creative Commons Attribution Share Alike 4.0 International.