Computed features vs PC with new AnnData format #342
Soorya19Pradeep wants to merge 2 commits into main
Conversation
srivarra
left a comment
My main question is: do we want to make full use of AnnData, or is it necessary to save the intermediate features as a .csv?
We can work directly with AnnData.obs and use AnnData.to_df to convert X to a DataFrame.
https://anndata.readthedocs.io/en/latest/generated/anndata.AnnData.to_df.html
The part I am still struggling with is matching the annotations and embedding features based on the combination of 'track_id', 'fov_name', and 't'. I have to work with a dataframe for this step, as I sort and remove rows based on the match.
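That matching step can be expressed as a single inner merge on the three keys; the sketch below uses hypothetical stand-in frames (only the key names 'fov_name', 'track_id', 't' come from the discussion):

```python
import pandas as pd

# Hypothetical annotation table: one label per (fov_name, track_id, t).
annotations = pd.DataFrame({
    "fov_name": ["/A/1", "/A/1", "/B/2"],
    "track_id": [0, 0, 1],
    "t": [0, 1, 0],
    "label": ["infected", "infected", "uninfected"],
})

# Hypothetical embedding table with one extra timepoint not in the annotations.
embeddings = pd.DataFrame({
    "fov_name": ["/A/1", "/A/1", "/B/2", "/B/2"],
    "track_id": [0, 0, 1, 1],
    "t": [0, 1, 0, 1],
    "feature_0": [0.1, 0.2, 0.3, 0.4],
})

# Inner merge keeps only rows present in both frames,
# replacing the manual sort-and-drop bookkeeping.
matched = embeddings.merge(annotations, on=["fov_name", "track_id", "t"], how="inner")
print(len(matched))  # 3
```

The same pattern works on `adata.obs` joined against `adata.to_df()`, so the merge could still be done without persisting a .csv.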
```python
output_file = '/hpc/projects/intracellular_dashboard/organelle_dynamics/2025_07_22_A549_SEC61_TOMM20_G3BP1_ZIKV/4-phenotyping/predictions/quantify_remodeling/G3BP1/feature_list_G3BP1_2025_07_22_192patch.csv'

# Write to CSV - append if file exists, create new if it doesn't
position_df.to_csv(output_file, mode='a', header=not os.path.exists(output_file), index=False)
```
```python
feature_values = features.filter(like="feature_")

# compute the PCA features
pca = PCA(n_components=10)
```
Are we recomputing the PCA features here? Should we instead put these AnnData-oriented ones over in dimensionality_reduction.py?
Also, is there any functionality from feature.py that we could reuse?
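One way the PCA step could be factored out for reuse, as the comment suggests. This is a sketch only: the helper name, its signature, and its proposed home in dimensionality_reduction.py are assumptions, not the repo's existing API.

```python
import numpy as np
from sklearn.decomposition import PCA


def compute_pca_features(features: np.ndarray, n_components: int = 10) -> np.ndarray:
    """Project feature rows onto their top principal components.

    A stand-in for a shared helper; clamps n_components so small
    feature tables do not raise in sklearn's PCA.
    """
    n_components = min(n_components, *features.shape)
    return PCA(n_components=n_components).fit_transform(features)


# Illustrative input: 20 cells x 8 features of random data.
rng = np.random.default_rng(0)
pca_features = compute_pca_features(rng.normal(size=(20, 8)), n_components=10)
print(pca_features.shape)  # (20, 8)
```

Centralizing this would let both the .csv-based and AnnData-based paths call the same function instead of re-instantiating `PCA` in each script.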
```python
correlation_df = pd.DataFrame(pca_features, columns=[f"PCA{i+1}" for i in range(pca_features.shape[1])], index=features.index)

# get the computed features like 'contrast', 'homogeneity', 'energy', 'correlation', 'edge_density', 'organelle_volume', 'organelle_intensity', 'no_organelles', 'size_organelles'
image_features_df = features.filter(regex="contrast|homogeneity|energy|correlation|edge_density|organelle_volume|organelle_intensity|no_organelles|size_organelles").copy()
```
```python
# Rename columns to avoid conflicts during merge
# Rename 't' in features_df_filtered to 'time_point' to match computed_features
features_df_filtered = features_df_filtered.rename(columns={"t": "time_point"})
```
This was an issue from the older computations, when we created headers of our choice. We have converged on using 't' from now on. I can redo the computed features to have a 't' column to solve this.
Instead of recomputing the computed features, could you just rename the column?
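A sketch of that suggestion: rename the legacy header in place rather than regenerating the features. The example frame is illustrative; only the column names come from the thread.

```python
import pandas as pd

# Hypothetical stand-in for the older computed-features table,
# which used 'time_point' instead of the converged-on 't'.
computed_features = pd.DataFrame({
    "time_point": [0, 1, 2],
    "contrast": [0.5, 0.6, 0.7],
})

# One-line fix: rename the legacy column so it matches the new schema.
computed_features = computed_features.rename(columns={"time_point": "t"})
print(list(computed_features.columns))  # ['t', 'contrast']
```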
```python
cell_features = {
    'fov_name': '/' + well_id + '/' + pos_id,
    'track_id': row['track_id'],
    'time_point': timepoint,
```
```python
# Select columns (features) in the desired order
feature_order = ["edge_density", "correlation", "energy", "homogeneity", "contrast", "no_organelles", "organelle_volume", "organelle_intensity"]
# Filter to only include features that actually exist in the dataframe
feature_order_filtered = [f for f in feature_order if f in correlation_selected.columns]
```
Do we need to check their existence if they are already fixed features from earlier in the script?
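If the feature set is guaranteed upstream, the existence check can be dropped in favor of direct column selection; `reindex` is an alternative that inserts NaN columns instead of raising when a name is missing. A sketch with an illustrative frame (only the feature names come from the snippet above):

```python
import pandas as pd

# Stand-in for the selected-features frame, columns in arbitrary order.
correlation_selected = pd.DataFrame(
    {name: [0.0] for name in
     ["contrast", "homogeneity", "energy", "correlation", "edge_density",
      "no_organelles", "organelle_volume", "organelle_intensity"]}
)

feature_order = ["edge_density", "correlation", "energy", "homogeneity",
                 "contrast", "no_organelles", "organelle_volume", "organelle_intensity"]

# Direct selection: raises KeyError if a column is missing,
# which surfaces schema drift instead of silently filtering it out.
ordered = correlation_selected[feature_order]
print(list(ordered.columns) == feature_order)  # True
```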
@srivarra, the computed feature set depends on the organelle data. At this point I keep changing the set as I work with different organelles. I think a tool like CellProfiler can compute a 1000-feature list, which can then be filtered for the most significant features. I haven't implemented anything like that yet. Once the feature list is more stable, it will be ready to be added to AnnData.
Ah gotcha, so keep the
The code is modified to read from the zarrs with the new anndata format. @srivarra, can you let me know if there is a better way to do this?
@edyoshikun, I have computed the image features outside the library, as I used the segmentations of G3BP1 for the image feature computation.