# STPath: A Generative Foundation Model for Integrating Spatial Transcriptomics and Whole Slide Images

This is the Hugging Face repository for the paper:

> Tinglin Huang, Tianyu Liu, Mehrtash Babadi, Rex Ying, and Wengong Jin (2025). STPath: A Generative Foundation Model for Integrating Spatial Transcriptomics and Whole Slide Images. Paper in [bioRxiv](https://www.biorxiv.org/content/10.1101/2025.04.19.649665v2.abstract). Code in [GitHub](https://github.com/Graph-and-Geometric-Learning/STPath).


## Usage

We provide an easy-to-use interface for performing inference with the pre-trained model, implemented in `app/pipeline/inference.py`. The following code snippet shows how to use it:

```python
from stpath.app.pipeline.inference import STPathInference

agent = STPathInference(
    gene_voc_path='STPath_dir/utils_data/symbol2ensembl.json',
    model_weight_path='your_dir/stpath.pkl',
    device=0
)

pred_adata = agent.inference(
    coords=coords,  # [number_of_spots, 2]
    img_features=embeddings,  # [number_of_spots, 1536], image features extracted with GigaPath
    organ_type="Kidney",  # default is None
    tech_type="Visium",  # default is None
    save_gene_names=hvg_list  # gene names to keep in the output adata, e.g., ['GATA3', 'UBE2C', ...]; None saves all genes in the model
)

# save the predicted AnnData
pred_adata.write_h5ad(f"your_dir/pred_{sample_id}.h5ad")
```
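The `coords` and `embeddings` inputs above are assumed to be precomputed. As a rough sketch of how the per-spot image features might be obtained (this is not part of STPath; it assumes the Prov-GigaPath tile encoder loaded via `timm`, and `tile_paths` is a hypothetical list of per-spot image crops in the same order as `coords`):

```python
import timm
import torch
from PIL import Image
from torchvision import transforms

# Load the Prov-GigaPath tile encoder (a gated model; requires Hugging Face access).
tile_encoder = timm.create_model("hf_hub:prov-gigapath/prov-gigapath", pretrained=True)
tile_encoder.eval()

# Standard ImageNet-style preprocessing used by the GigaPath tile encoder.
transform = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

# tile_paths: hypothetical list of per-spot tile image paths, one per spot.
with torch.no_grad():
    batch = torch.stack([transform(Image.open(p).convert("RGB")) for p in tile_paths])
    embeddings = tile_encoder(batch).cpu().numpy()  # [number_of_spots, 1536]
```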

The vocabularies for organs and technologies can be found in the following locations:
* [organ vocabulary](https://github.com/Graph-and-Geometric-Learning/STPath/blob/main/stpath/utils/constants.py#L98)
* [tech vocabulary](https://github.com/Graph-and-Geometric-Learning/STPath/blob/main/stpath/utils/constants.py#L20) 

If the organ type or the tech type is unknown, set the corresponding argument to `None` in the inference function. Note that the predicted gene expression values are log1p-transformed (`log(1 + x)`), consistent with the transformation applied during the training of STPath.
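Since the outputs are on the log1p scale, values on the original expression scale can be recovered with `np.expm1` if needed (a minimal sketch, where `pred_adata` is the object returned above):

```python
import numpy as np

# Invert the log1p transform to get values back on the original expression scale.
raw_scale_expression = np.expm1(pred_adata.X)
```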


### Example of Inference

Here, we provide an example of how to perform inference on a [sample](https://github.com/Graph-and-Geometric-Learning/STPath/tree/main/example_data) from the HEST dataset:

```python
import os
import json

import numpy as np
import scanpy as sc
from scipy.stats import pearsonr

from stpath.hest_utils.file_utils import read_assets_from_h5

sample_id = "INT2"
source_dataroot = "STPath_dir"  # the root directory of the STPath repository
with open(os.path.join(source_dataroot, "example_data/var_50genes.json")) as f:
    hvg_list = json.load(f)['genes']

# load the spot coordinates, image features, and barcodes from the h5 file
data_dict, _ = read_assets_from_h5(os.path.join(source_dataroot, f"{sample_id}.h5"))
coords = data_dict["coords"]
embeddings = data_dict["embeddings"]
barcodes = data_dict["barcodes"].flatten().astype(str).tolist()

# ground-truth expression, aligned to the same barcode order
adata = sc.read_h5ad(os.path.join(source_dataroot, f"{sample_id}.h5ad"))[barcodes, :]

# The returned pred_adata contains the predicted expression of the genes in hvg_list (the highly variable genes).
pred_adata = agent.inference(
    coords=coords, 
    img_features=embeddings, 
    organ_type="Kidney", 
    tech_type="Visium",
    save_gene_names=hvg_list  # we only need the highly variable genes for evaluation
)

# calculate the Pearson correlation coefficient between the predicted and ground truth gene expression
all_pearson_list = []
gt = np.log1p(adata[:, hvg_list].X.toarray())  # sparse -> dense
# go through each gene in the highly variable genes list
for i in range(len(hvg_list)):
    pearson_corr, _ = pearsonr(gt[:, i], pred_adata.X[:, i])
    all_pearson_list.append(pearson_corr.item())
print(f"Pearson correlation for {sample_id}: {np.mean(all_pearson_list)}")  # 0.1562
```

### In-context Learning

STPath also supports in-context learning: users can provide the expression of a few spots as context to guide the model's predictions for the remaining spots:

```python
from stpath.data.sampling_utils import PatchSampler

rightmost_idx = np.where(coords[:, 0] == coords[:, 0].max())[0][0]  # index of the rightmost spot
masked_ids = PatchSampler.sample_nearest_patch(coords, int(len(coords) * 0.95), rightmost_idx)  # mask 95% of the spots for prediction
context_ids = np.setdiff1d(np.arange(len(coords)), masked_ids)  # the remaining spots are used as context
context_gene_exps = adata.X.toarray()[context_ids]
context_gene_names = adata.var_names.tolist()

pred_adata = agent.inference(
    coords=coords, 
    img_features=embeddings, 
    context_ids=context_ids,  # the index of the context spots
    context_gene_exps=context_gene_exps,   # the expression of the context spots
    context_gene_names=context_gene_names,   # the gene names of the context spots
    organ_type="Kidney", 
    tech_type="Visium", 
    save_gene_names=hvg_list,
)

all_pearson_list = []
gt = np.log1p(adata[:, hvg_list].X.toarray())[masked_ids, :]  # ground-truth expression of the masked spots
pred = pred_adata.X[masked_ids, :]  # predicted expression of the masked spots
for i in range(len(hvg_list)):
    pearson_corr, _ = pearsonr(gt[:, i], pred[:, i])
    all_pearson_list.append(pearson_corr.item())
print(f"Pearson correlation for {sample_id}: {np.mean(all_pearson_list)}")  # 0.2449
```
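With only 5% of the spots provided as context, the mean Pearson correlation on this sample improves from 0.1562 (no context) to 0.2449.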


## Reference

If you find our work useful in your research, please consider citing our paper:

```
@inproceedings{huang2025stflow,
  title={Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching},
  author={Huang, Tinglin and Liu, Tianyu and Babadi, Mehrtash and Jin, Wengong and Ying, Rex},
  booktitle={International Conference on Machine Learning},
  year={2025}
}

@article{huang2025stpath,
  title={STPath: A Generative Foundation Model for Integrating Spatial Transcriptomics and Whole Slide Images},
  author={Huang, Tinglin and Liu, Tianyu and Babadi, Mehrtash and Ying, Rex and Jin, Wengong},
  journal={bioRxiv},
  pages={2025--04},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}
```