Day 1 Answer key

Assessing sample similarity and identifying potential outliers

1. Read in the file data/multiBAMsummary.tab and save it to a variable. This matrix contains counts for input samples as well.

counts_input <- read.delim("data/multiBamSummary/multiBAMsummary.tab", sep = "\t")

2. Transform the data using vst().

Hint: You will need to follow a similar process as above to first clean the count matrix and create a dds object.
Hint: You may find it useful to include input/ChIPseq as a metadata factor.

# Remove genomic coordinate info
plot_counts_input <- data.frame(counts_input[, 4:ncol(counts_input)])

# Change column names
colnames(plot_counts_input) <- colnames(plot_counts_input) %>% 
  str_replace( "X.", "") %>% 
  str_remove( "\\.$")

# Create meta
meta_input <- data.frame(row.names = colnames(plot_counts_input), 
                         genotype = colnames(plot_counts_input) %>% str_remove("\\_.*"),
                         sample = ifelse(grepl("i", colnames(plot_counts_input)), "input", "ChIPseq"))

# Create DESeq2 object
dds_input <- DESeqDataSetFromMatrix(plot_counts_input, meta_input, design = ~sample + genotype)

# Run vst and extract transformed counts
vst_input <- vst(dds_input)
vst_input_counts <- assay(vst_input)

3. Use the vst data to draw a PCA plot. How do samples separate on the plot? How much variance is explained by PC1 and PC2?

# Compute principal components
pc_input <- prcomp(t(vst_input_counts))
plot_pca_input <- data.frame(pc_input$x, meta_input)
summary(pc_input) # will tell you how much variance is explained by each PC

# Plot with sample names used as data points
ggplot(plot_pca_input, aes(PC1, PC2, color = genotype, shape = sample,
                           label = rownames(plot_pca_input)), size = 3) + 
  theme_bw() +
  geom_point() +
  geom_text_repel() +
  xlab('PC1 (84% of variance)') +
  ylab('PC2 (3% of variance)') +
  scale_x_continuous(expand = c(0.3, 0.3)) +
  theme(plot.title = element_text(size = rel(1.5)),
        axis.title = element_text(size = rel(1.5)),
        axis.text = element_text(size = rel(1.25)))

Samples separate by input/ChIPseq on PC1 (84% of variance). PC2 explains only 3% of variance.

4. Use the vst data to draw a correlation heatmap. How does this result compare to the PCA plot? Do we see defined block structure in the heatmap?

# Set annotation and colors
annotation <- meta_input
heat.colors <- brewer.pal(6, "YlOrRd")

# Plot ICA heatmap
pheatmap(cor(vst_input_counts), color = heat.colors, annotation = annotation)

As seen in the PCA, samples separate cleanly by input/ChIPseq. Within each group, there is no clear separation by genotype.

Concordance across replicates using peak overlaps

1. Find the overlapping peaks across the cKO samples. Store the result to a variable called olaps_cko.

olaps_cko <- findOverlapsOfPeaks(cKO_H3K27ac_ChIPseq_REP1,
                                 cKO_H3K27ac_ChIPseq_REP2,
                                 cKO_H3K27ac_ChIPseq_REP3, connectedPeaks = "merge")

2. Use the results from olaps_cko to draw a Venn Diagram. What number of peaks occur in all three replicates?

venstats <- makeVennDiagram(olaps_cko, connectedPeaks = "merge",
                            fill    = c("#CC79A7", "#56B4E9", "#F0E442"), # circle fill color
                            col     = c("#D55E00", "#0072B2", "#E69F00"), # circle border color
                            cat.col = c("#D55E00", "#0072B2", "#E69F00")) # category name color

There are ~53k peaks in all 3 replicates.

3. Extract the required data from olaps_cko to create an UpSet plot. Draw the Upset plot. How many peaks are unique to each replicate?

# Prepare data for UpSetR
set_counts <- olaps_cko$venn_cnt[, colnames(olaps_cko$venn_cnt)] %>% 
  as.data.frame() %>% 
  mutate(group_number = row_number()) %>%
  pivot_longer(!Counts & !group_number, names_to = 'sample', values_to = 'member') %>%
  filter(member > 0) %>%
  group_by(Counts, group_number) %>% 
  summarize(group = paste(sample, collapse = '&'))

# Set required variables 
set_counts_upset <- set_counts$Counts
names(set_counts_upset) <- set_counts$group

# Plot the UpSet plot
upset(fromExpression(set_counts_upset), order.by = "freq", text.scale = 1.5)

Replicate 1 has 12,200 unique peaks; replicate 2 has 10,815, and replicate 3 has 13,046.