Recoding data

Unit 5

Goals for today

Identify data types and classes in a dataset and explain what they mean
Recode columns based on conditions
Recode and reorder categories in data with {forcats}

Data types

Why should we care about data types?

Example

library(tidyverse)
cat_lovers <- read_csv("data/cat-lovers.csv")

name	number_of_cats	handedness
Bernice Warren	0	left
Woodrow Stone	0	left
Willie Bass	1	left
Tyrone Estrada	3	left
Alex Daniels	3	left
Jane Bates	2	left
Latoya Simpson	1	left
Darin Woods	1	left
Agnes Cobb	0	left
Tabitha Grant	0	left
Perry Cross	0	left
Wanda Silva	0	left
Alicia Sims	1	left
Emily Logan	3	right
Woodrow Elliott	3	right
Brent Copeland	2	right
Pedro Carlson	1	right
Patsy Luna	1	right
Brett Robbins	0	right
Oliver George	0	right
Calvin Perry	1	right
Lora Gutierrez	1	right
Charlotte Sparks	0	right
Earl Mack	0	right
Leslie Wade	4	right
Santiago Barker	0	right
Jose Bell	0	right
Lynda Smith	0	right
Bradford Marshall	0	right
Irving Miller	0	right
Caroline Simpson	0	right
Frances Welch	0	right
Melba Jenkins	0	right
Veronica Morales	0	right
Juanita Cunningham	0	right
Maurice Howard	0	right
Teri Pierce	0	right
Phil Franklin	0	right
Jan Zimmerman	0	right
Leslie Price	0	right
Bessie Patterson	0	right
Ethel Wolfe	0	right
Naomi Wright	1	right
Sadie Frank	3	right
Lonnie Cannon	3	right
Tony Garcia	2	right
Darla Newton	1	right
Ginger Clark	1.5 - honestly I think one of my cats is half human	right
Lionel Campbell	0	right
Florence Klein	0	right
Harriet Leonard	1	right
Terrence Harrington	0	right
Travis Garner	1	right
Doug Bass	three	right
Pat Norris	1	right
Dawn Young	1	ambidextrous
Shari Alvarez	1	ambidextrous
Tamara Robinson	0	ambidextrous
Megan Morgan	0	ambidextrous
Kara Obrien	2	ambidextrous

Average number of cats

cat_lovers |>
  summarise(mean_cats = mean(number_of_cats))

# A tibble: 1 × 1
  mean_cats
      <dbl>
1        NA

Why does it not work?!

cat_lovers |>
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))

# A tibble: 1 × 1
  mean_cats
      <dbl>
1        NA

Why does it still not work?!

Take a breath… and look at the data

What type is the variable number_of_cats?

cat_lovers |>
  glimpse()

Rows: 60
Columns: 3
$ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass", "Tyro…
$ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", "0", "0", "0", …
$ handedness     <chr> "left", "left", "left", "left", "left", "left", "left",…

Take another look

cat_lovers |> count(number_of_cats)

# A tibble: 7 × 2
  number_of_cats                                          n
  <chr>                                               <int>
1 0                                                      32
2 1                                                      15
3 1.5 - honestly I think one of my cats is half human     1
4 2                                                       4
5 3                                                       6
6 4                                                       1
7 three                                                   1
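A minimal sketch of how the offending answers could be pulled out directly instead of scanning the `count()` table. The tibble `survey` is made up for illustration and only stands in for `cat_lovers`. Values that `as.numeric()` cannot parse become `NA`, so filtering on that flags them.

```r
library(dplyr)

# Hypothetical mini survey, standing in for cat_lovers
survey <- tibble(
  name = c("A", "B", "C", "D"),
  number_of_cats = c("0", "three", "2", "1.5 - half human")
)

# Entries that cannot be parsed as numbers turn into NA;
# suppressWarnings() hides the expected coercion warning
not_numeric <- survey |>
  filter(is.na(suppressWarnings(as.numeric(number_of_cats))))

not_numeric
```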

Sometimes you need to babysit your respondents

cat_lovers |>
  mutate(number_of_cats = case_when(
    number_of_cats == "1.5 - honestly I think one of my cats is half human" ~ 2,
    number_of_cats == "three" ~ 3,
    .default = as.numeric(number_of_cats)
  )) |>
  summarise(mean_cats = mean(number_of_cats))

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `number_of_cats = case_when(...)`.
Caused by warning in `vec_case_when()`:
! NAs introduced by coercion

# A tibble: 1 × 1
  mean_cats
      <dbl>
1     0.833
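Where the warning comes from: `case_when()` evaluates every right-hand side for the whole column before it picks values, so `as.numeric()` also sees the two text answers. A base R sketch with a made-up vector `answers`:

```r
answers <- c("0", "three", "2")

# as.numeric() is applied to every element, including "three",
# which cannot be parsed and becomes NA (with a warning)
converted <- suppressWarnings(as.numeric(answers))
converted

# The NA does no harm only if another rule overwrites that position
is.na(converted)
```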

Always respect data types!

cat_lovers |>
  mutate(
    number_of_cats = case_when(
      number_of_cats == "1.5 - honestly I think one of my cats is half human" ~ "2",
      number_of_cats == "three" ~ "3",
      .default = number_of_cats
    ),
    number_of_cats = as.numeric(number_of_cats)
  ) |>
  summarise(mean_cats = mean(number_of_cats))

# A tibble: 1 × 1
  mean_cats
      <dbl>
1     0.833

Now that we know what we are doing…

cat_lovers <- cat_lovers |>
  mutate(
    number_of_cats = case_when(
      number_of_cats == "1.5 - honestly I think one of my cats is half human" ~ "2",
      number_of_cats == "three" ~ "3",
      .default = number_of_cats
    ),
    number_of_cats = as.numeric(number_of_cats)
  )

Moral of the story

If your data does not behave the way you expect, type coercion while reading in the data might be the cause.
Go in, inspect your data, apply the fix, save your data, and live happily ever after.
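A small sketch of how this can happen on import, assuming readr 2.0 or later and a made-up inline CSV (`csv_text`) instead of the course file. A single text answer is enough for `read_csv()` to guess the whole column as character.

```r
library(readr)

# Made-up CSV content; I() marks the string as literal data, not a file path
csv_text <- I("name,number_of_cats\nA,0\nB,three\nC,2\n")

cats <- read_csv(csv_text, show_col_types = FALSE)

# The column type was guessed from the values it contains
class(cats$number_of_cats)
```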

Data types in R

Atomic vectors

logical: TRUE, FALSE

character: “Hallo”, “a”, “TRUE”

integer: 2L, 34L, 0L

double: 1, 2.4, pi

Data types in R

typeof() → how R stores the object

logical (TRUE/FALSE)

typeof(TRUE)

[1] "logical"

typeof(c(TRUE, FALSE))

[1] "logical"

character (text)

typeof("Hallo")

[1] "character"

typeof(c("a", "aa"))

[1] "character"

double (floating point)

typeof(3.56)

[1] "double"

typeof(c(4, 3))

[1] "double"

integer (whole number)

typeof(4L)

[1] "integer"

typeof(1:4)

[1] "integer"
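A short base R sketch of the difference between the two numeric types. The variable names are chosen for illustration only.

```r
x <- 4    # no L suffix: stored as double
y <- 4L   # L suffix: stored as integer

typeof(x)
typeof(y)

# Both count as numeric, and they compare as equal
is.numeric(x) && is.numeric(y)
x == y

# identical() also compares the storage type, so it is FALSE here
identical(x, y)
```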

Explicit vs. implicit coercion

Explicit coercion: as.logical(), as.numeric(), as.integer(), as.double(), as.character().
```
x <- c(TRUE, FALSE)
as.character(x)
```
```
[1] "TRUE"  "FALSE"
```
Implicit coercion: e.g. R converts values of mixed types into a single type.
```
c(15, "Danke")
```
```
[1] "15"    "Danke"
```
```
c(3L, pi)
```
```
[1] 3.000000 3.141593
```

… and that is not always a good thing!
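A base R sketch of why this can bite. When types are mixed, R picks the most flexible one (logical → integer → double → character), so one stray text value turns a whole vector into text. The vector `values` is made up.

```r
# Logical values become numbers when combined with numbers
c(TRUE, 2L)
typeof(c(TRUE, 2L))

# One text value turns every number into text
values <- c(1, 2, "3")
typeof(values)

# Arithmetic works again only after explicit conversion
sum(as.numeric(values))
```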

Practical: type coercion

prak-05a-type-coercion.qmd

What type are the given vectors?

Recoding data

if_else(), case_when()

TRUE/FALSE: `if_else()`

Categorize bill length as “überdurchschnittlich” (above average) or “unterdurchschnittlich” (below average), using the median as the cut-off

penguins |>
  summarise(median_bill_length = median(bill_length_mm, na.rm = TRUE))

# A tibble: 1 × 1
  median_bill_length
               <dbl>
1               44.4

TRUE/FALSE: `if_else()`

if_else(is_this_true, then_do_this, otherwise_do_this)

penguins |>
  mutate(
    bl_cat = if_else(bill_length_mm < 44.45, "unterdurchschnittlich", "überdurchschnittlich")
  ) |>
  count(bl_cat)

# A tibble: 3 × 2
  bl_cat                    n
  <chr>                 <int>
1 unterdurchschnittlich   171
2 überdurchschnittlich    171
3 <NA>                      2

TRUE/FALSE: `if_else()`

if_else(is_this_true, then_do_this, otherwise_do_this, treat_NA_like_this)

penguins |>
  mutate(
    bl_cat =
      if_else(
        bill_length_mm < 44.45, "unterdurchschnittlich", "überdurchschnittlich", missing = "unbekannt"
      )
  ) |>
  count(bl_cat)

# A tibble: 3 × 2
  bl_cat                    n
  <chr>                 <int>
1 unbekannt                 2
2 unterdurchschnittlich   171
3 überdurchschnittlich    171
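A possible variation, sketched on a made-up tibble `birds` rather than `penguins`: the cut-off is computed from the data instead of being typed in as 44.45. The column name `cutoff` and the English labels are assumptions for this example.

```r
library(dplyr)

# Hypothetical measurements with one missing value
birds <- tibble(bill_length_mm = c(36.0, 40.5, 44.0, 50.2, NA))

birds <- birds |>
  mutate(
    # Compute the cut-off from the data instead of hard-coding it
    cutoff = median(bill_length_mm, na.rm = TRUE),
    bl_cat = if_else(
      bill_length_mm < cutoff, "below median", "at or above median",
      missing = "unknown"
    )
  )

birds
```

This way the categories stay correct if the data changes.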

Mehrere Bedingungen: `case_when()`

Categorize bill length: short, medium, long

penguins |> 
  select(bill_length_mm) |> 
  summary()

 bill_length_mm 
 Min.   :32.10  
 1st Qu.:39.23  
 Median :44.45  
 Mean   :43.92  
 3rd Qu.:48.50  
 Max.   :59.60  
 NA's   :2

Mehrere Bedingungen: `case_when()`

penguins |>
  mutate(
    bl_cat = case_when(
      is.na(bill_length_mm) ~ NA,
      bill_length_mm < 39.2 ~ "short",
      between(bill_length_mm, 39.2, 48.5) ~ "medium",
      .default = "long"
    )
  ) |>
  count(bl_cat)

# A tibble: 4 × 2
  bl_cat     n
  <chr>  <int>
1 long      84
2 medium   175
3 short     83
4 <NA>       2
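A sketch of how the rule order matters, using a made-up tibble `bills`. `case_when()` uses the first condition that is TRUE for each row, and rows where no condition is TRUE (including those where the comparison yields `NA`) fall through to `.default`. That is why the `is.na()` rule comes first, and why an upper bound alone is enough for "medium" here. It assumes dplyr 1.1.0 or later, as the `.default` argument on the slide does.

```r
library(dplyr)

bills <- tibble(bill_length_mm = c(35.0, 39.2, 45.0, 48.5, 52.1, NA))

bills <- bills |>
  mutate(
    bl_cat = case_when(
      is.na(bill_length_mm) ~ NA,
      bill_length_mm < 39.2 ~ "short",
      # Rows below 39.2 were already matched above, so an upper bound suffices
      bill_length_mm <= 48.5 ~ "medium",
      .default = "long"
    )
  )

bills$bl_cat
```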

Practical: recoding data

prak-05b-cond-mutate.qmd

Break ☕ 🍵 🍜 (10 minutes)

Data structures

class() → how the object behaves

Factors

→ Categorical variables: character labels stored as integer codes

(x <- c("BS", "MS", "PhD", "MS"))

[1] "BS"  "MS"  "PhD" "MS"

typeof(x)

[1] "character"

class(x)

[1] "character"

(y <- factor(x))

[1] BS  MS  PhD MS 
Levels: BS MS PhD

typeof(y)

[1] "integer"

class(y)

[1] "factor"

as.integer(y)

[1] 1 2 3 2
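A short base R sketch of how the level order can be set by hand. The order used here is chosen for illustration only.

```r
x <- c("BS", "MS", "PhD", "MS")

# Default: levels in alphabetical order
levels(factor(x))

# Explicit order, here from highest to lowest degree
y <- factor(x, levels = c("PhD", "MS", "BS"))
levels(y)

# The integer codes follow the level order
as.integer(y)
```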

Dates

A number = number of days since the origin (1970-01-01), stored as a double

(y <- as.Date("1990-01-01"))

[1] "1990-01-01"

typeof(y)

[1] "double"

class(y)

[1] "Date"

as.integer(y)

[1] 7305

as.integer(y) / 365

[1] 20.0137

~ 20 years after 1970-01-01
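The same idea as a base R sketch with date arithmetic. The name `origin` is introduced here for illustration.

```r
y <- as.Date("1990-01-01")
origin <- as.Date("1970-01-01")

# Subtracting two dates gives a difference in days
y - origin

# Stripping the class reveals the stored number
unclass(y)

# Dividing by 365.25 roughly accounts for leap years
as.numeric(y - origin) / 365.25
```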

Lists

Generic vector containers: elements can be vectors of any type and length

l <- list(
  x = 1:4,
  y = c("Hallo", "hello", "salut"),
  z = c(TRUE, FALSE)
)
l

$x
[1] 1 2 3 4

$y
[1] "Hallo" "hello" "salut"

$z
[1]  TRUE FALSE
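A base R sketch of the usual ways to get elements out of a list, using the same `l` as above.

```r
l <- list(
  x = 1:4,
  y = c("Hallo", "hello", "salut"),
  z = c(TRUE, FALSE)
)

# $ and [[ ]] extract the element itself
l$x
l[["y"]]

# Single brackets keep the list wrapper
class(l["x"])

# Each element keeps its own type and length
sapply(l, typeof)
lengths(l)
```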

Data Frames

A special list whose elements are vectors of equal length

(df <- data.frame(x = 1:2, y = 3:4))

  x y
1 1 3
2 2 4

class(df)

[1] "data.frame"

(df <- tibble(x = 1:2, y = 3:4))

# A tibble: 2 × 2
      x     y
  <int> <int>
1     1     3
2     2     4

class(df)

[1] "tbl_df"     "tbl"        "data.frame"

df |>
  pull(y)

[1] 3 4

df$y

[1] 3 4
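A base R sketch that shows the list nature of a data frame. The variable `result` is introduced here only to capture the error.

```r
df <- data.frame(x = 1:2, y = 3:4)

# Under the hood a data frame is stored as a list of columns
typeof(df)
is.list(df)

# So list-style extraction works on columns
df[["y"]]

# Columns of unequal length are rejected with an error
result <- try(data.frame(x = 1:2, y = 1:3), silent = TRUE)
inherits(result, "try-error")
```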

Working with factors: `{forcats}`

Data

penguins |>
  glimpse()

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Code

penguins |>
  ggplot(aes(x = species, fill = year)) +
  geom_bar()

Warning: The following aesthetics were dropped during statistical transformation: fill.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

penguins |>
  ggplot(aes(x = species, fill = factor(year))) +
  geom_bar()

penguins |>
  mutate(year_factor = factor(year)) |>
  ggplot(aes(x = species, fill = year_factor)) +
  geom_bar()

penguins |>
  mutate(
    year_factor = factor(year),
    species = fct_infreq(species)
  ) |>
  ggplot(aes(x = species, fill = year_factor)) +
  geom_bar()

penguins |>
  mutate(
    year_factor = factor(year),
    species = fct_infreq(species),
    species = fct_rev(species)
  ) |>
  ggplot(aes(x = species, fill = year_factor)) +
  geom_bar()
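What `fct_infreq()` and `fct_rev()` do to the level order, sketched on a small made-up factor `f`. The order of the levels is what ggplot2 uses for the order of the bars.

```r
library(forcats)

# Made-up categories: "b" is most frequent, "c" least
f <- factor(c("b", "a", "b", "c", "b", "a"))

levels(f)                       # alphabetical order
levels(fct_infreq(f))           # by frequency, most frequent first
levels(fct_rev(fct_infreq(f)))  # reversed, least frequent first
```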

penguins |>
  ggplot(aes(x = species, y = bill_depth_mm, fill = species)) +
  geom_boxplot()

penguins |>
  mutate(
    species = fct_reorder(species, bill_depth_mm)
  ) |>
  ggplot(aes(x = species, y = bill_depth_mm, fill = species)) +
  geom_boxplot()
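A sketch of what `fct_reorder()` does, with made-up vectors `group` and `value`. By default the levels are ordered by the median of the second variable within each level. `.fun` and `.desc` change the summary function and the direction.

```r
library(forcats)

group <- factor(c("a", "a", "b", "b", "c", "c"))
value <- c(10, 12, 1, 3, 5, 7)

# Default summary is the median: b (2) < c (6) < a (11)
levels(fct_reorder(group, value))

# Another summary function and descending order, requested explicitly
levels(fct_reorder(group, value, .fun = max, .desc = TRUE))
```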

starwars |> count(species, sort = TRUE)

# A tibble: 38 × 2
   species      n
   <chr>    <int>
 1 Human       35
 2 Droid        6
 3 <NA>         4
 4 Gungan       3
 5 Kaminoan     2
 6 Mirialan     2
 7 Twi'lek      2
 8 Wookiee      2
 9 Zabrak       2
10 Aleena       1
# ℹ 28 more rows

starwars |>
  mutate(species = fct_lump(species, n = 2)) |>
  count(species)

# A tibble: 4 × 2
  species     n
  <fct>   <int>
1 Droid       6
2 Human      35
3 Other      42
4 <NA>        4

starwars |>
  mutate(
    species = fct_lump(species, n = 2),
    species = fct_relevel(species, "Human")
  ) |>
  count(species)

# A tibble: 4 × 2
  species     n
  <fct>   <int>
1 Human      35
2 Droid       6
3 Other      42
4 <NA>        4

starwars |>
  mutate(
    species = fct_lump(species, n = 2),
    species = fct_relevel(species, "Human"),
    species = fct_recode(species, "Mensch" = "Human", "Anders" = "Other")
  ) |>
  count(species)

# A tibble: 4 × 2
  species     n
  <fct>   <int>
1 Mensch     35
2 Droid       6
3 Anders     42
4 <NA>        4
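forcats also offers `fct_lump_n()`, a more specific variant of `fct_lump()`. A sketch on a made-up vector `x`, assuming forcats 0.5.0 or later. The `other_level` argument names the lumped category directly, which can save a separate `fct_recode()` step.

```r
library(forcats)

x <- c("a", "a", "a", "b", "b", "c", "d")

# Keep the 2 most frequent levels, lump the rest into "Other"
lumped <- fct_lump_n(x, n = 2)
table(lumped)

# The label for the lumped level can be set directly
levels(fct_lump_n(x, n = 2, other_level = "Anders"))
```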

Practical: `{forcats}`

prak-05c-forcats-firmen.qmd

Time: 20 minutes

Break ☕ 🍵 🍜 (10 minutes)

Practical: `if_else()`, `case_when()`, `{forcats}`

prak-05d-cond-mutate-forcats.qmd

Time: 30 minutes

Thank you! 🌔

Slides created via revealjs and Quarto.

All material is licensed under Creative Commons Attribution Share Alike 4.0 International.
