Finish deconvolution notebook draft#1011
Conversation
|
Hey look it's PR 1011! We (the royal we specifically) love to see it. |
jashapiro
left a comment
There was a problem hiding this comment.
When I think of co-occurance, What I usually want to see is how often two events co-occur given the frequency of each individual event, so the normalization you are doing here is not what I would expect.
The the co-occurance measure we are interested in is how the count you have compare to the expected count.
The expected count for each pair of cell types will be n_a * n_b / n_total (p(a) * p(b) * n_total, with some cancelling), so you should calculate that for each row, then you can calculate the ratio of the observed co-occurance to that expected by chance and then plot that.
I don't think we need to get into the full statistics of co-occurance analysis here, unless you really like saying "hypergeomtric distribution," but I do think we should show what the expectation is and characterize the results based on that, not just on the raw counts or constant scaling of that).
(I mostly resisted until now talking about how you can calculate these from the binary occurence matrix and a few matrix multiplications instead of joins, because I doubt we should go that route here, but that is how I would really implement it.)
| cooccur_df <- present_df |> | ||
| dplyr::inner_join(present_df, by = "barcode", relationship = "many-to-many") |> | ||
| # rename after self-joining | ||
| dplyr::rename(cell_type_a = cell_type.x, cell_type_b = cell_type.y) |> |
There was a problem hiding this comment.
When starting a join, I prefer to not pipe into the join statement. But the real reason for the comment was to skip the renaming and just set the suffixes in the join statement.
| cooccur_df <- present_df |> | |
| dplyr::inner_join(present_df, by = "barcode", relationship = "many-to-many") |> | |
| # rename after self-joining | |
| dplyr::rename(cell_type_a = cell_type.x, cell_type_b = cell_type.y) |> | |
| cooccur_df <- dplyr::inner_join( | |
| present_df, | |
| present_df, # join table to itself | |
| by = "barcode", | |
| relationship = "many-to-many", | |
| suffix = c("_a", "_b") | |
| ) |> |
Thanks, and definitely greed with all of this - I was trying to show something hinting at co-occurrence but I did not want to do it "properly" (please never read this sentence out of context 😂 ) aka with full statistics, and a more involved but more accurate implementation seemed like getting too into the weeds. So, we should really just be sure to emphasize that this is highly exploratory and there's a whole world of stats for more complicated analyses. Either way I'll amend the calculations as suggested to get closer to a "real" analysis. |
…oduce some caveats, updated heatmap accordingly
|
Updated here: 03-deconvolution.nb.html I ended up calculating the log ratio to center it at 0 for easier plot interpretation, and bonus it comes with a real life stats name! I also added some more contextualizing text. Notably, there are several chunks now to break up the calculation, and none of them are live which is good because the notebook is already long and many typos would occur otherwise. |
Probably a lot less if you had used matrices! But then you have to convince people that matrix multiplication actually works for co-occurance. |
jashapiro
left a comment
There was a problem hiding this comment.
Thanks for this update! Much better for interpretation. I had just a few comments about color and theme, but I don't think I need to see it again.
| # Calculate ratios | ||
| cooccur_df <- observed_df |> | ||
| # combine data frames so we have a column for each of: | ||
| # expected count |
There was a problem hiding this comment.
did you mean to have observed here instead? (You have a separate comment about expected)
| # expected count | |
| # observed count |
| theme_classic() + | ||
| theme( | ||
| axis.title = element_blank(), | ||
| axis.text.x = element_text(angle = 30, hjust = 1) |
There was a problem hiding this comment.
Can I get picky about the theme here? I don't like having only two axes on a heatmap. I'd rather have the full box or no axes at all. Personally I'd probably go for theme_bw and then remove the gridlines.
| ) + | ||
| # add scale that diverges from 0 in the middle | ||
| scale_fill_gradient2( | ||
| low = "steelblue", mid = "white", high = "firebrick", |
There was a problem hiding this comment.
We have the opportunity here to use something close to the Data Lab color scheme, and I say we take it. (feel free to tweak for optimal appearance)
| low = "steelblue", mid = "white", high = "firebrick", | |
| low = "blue3", mid = "white", high = "gold1", |
There was a problem hiding this comment.
cc @allyhawkins if you'd like to take this chance to suggest different blues and yellows ;)
There was a problem hiding this comment.
No, not U-M colors, Data Lab colors!
There was a problem hiding this comment.
No, not U-M colors, Data Lab colors!
😢
There was a problem hiding this comment.
I said she could suggest, not that I would use them!
Closes #980
This PR wraps up the deconvolution notebook with a heatmap of co-occurring pairs and TSV export, so a pretty short PR (just wanted a fresh slate for reviewing this section). For this, I again use a 0.1 threshold but want to know specifically how results would change if we changed the threshold.
I think just this heatmap wraps the notebook up nicely, but just for posterity there are two things that I thought of but did not do because I think it's just too much.
As part of review let me know if you'd suggest any other additional or expanded content, but I think this is pretty solid with just the heatmap. This is the least confusing way I thought to write the code, but it manages of course to still be a bit confusing - any ideas for simplifying?
03-deconvolution.nb.html