Explicit Title Format Performs Worse Than Simple Format (Contradicts Docs)

#35
by Farhadtehrani - opened

The documentation states:
"providing a title, if available, will improve model performance"
But my testing shows the opposite: the simple format scores up to 18.7% higher cosine similarity.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("google/embeddinggemma-300m")

title = "Marketing Manager"
content = "Lead marketing team and drive brand growth."
query = "Marketing Manager"

# Compare formats

doc_official = f"title: {title} | text: {content}" # Per docs
doc_simple = f"{title} | {content}" # Simple format

query_prompt = f"task: search result | query: {query}"
q_emb = model.encode(query_prompt)

sim_official = util.cos_sim(q_emb, model.encode(doc_official)).item()
sim_simple = util.cos_sim(q_emb, model.encode(doc_simple)).item()

print(f"Official format: {sim_official:.4f}") # 0.5867
print(f"Simple format: {sim_simple:.4f}") # 0.6961 (+18.7%!)

Official format: 0.5867
Simple format: 0.6961 (+18.7% better!)
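
To make clear how the percentage is derived: it is the relative change in cosine similarity, with the official format as the baseline. A minimal sketch using the rounded scores printed above (the helper name `relative_gain` is just for illustration); because the inputs are rounded to four decimals, the result may differ slightly in the last digit from the figure computed on the raw scores.

```python
# Rounded scores copied from the output above.
sim_official = 0.5867
sim_simple = 0.6961

def relative_gain(baseline: float, candidate: float) -> float:
    """Relative change of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0

gain = relative_gain(sim_official, sim_simple)
print(f"Absolute difference: {sim_simple - sim_official:.4f}")
print(f"Relative gain: {gain:.2f}%")
```

Note that this is a single query/document pair, so it measures a similarity score, not a ranking.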

Additional Findings
Large-scale eval (500 queries, 80k docs): the official format scores 9.3% lower on NDCG@100.
Title-only documents: the official format loses 17.5% performance when the text: field is empty.
Longer titles suffer more: 60-character titles show bigger performance gaps.
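
For anyone who wants to check the NDCG comparison on their own data, here is a rough sketch of how NDCG@k can be computed per query and averaged for each format. This is not my exact evaluation script: the function names and the toy relevance lists are made up for illustration, and the ideal ranking is taken from the labels in the ranked list itself, which assumes that list covers all relevant documents for the query.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k):
    """NDCG@k for one query.

    ranked_relevances: relevance labels in the order the model ranked the docs.
    Assumption: the list contains every relevant doc, so sorting it gives the ideal order.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

def mean_ndcg(runs, k):
    """Average NDCG@k over a list of per-query ranked relevance lists."""
    return sum(ndcg_at_k(r, k) for r in runs) / len(runs)

# Hypothetical toy rankings for two queries under each document format.
runs_simple = [[1, 0, 0], [0, 1, 0]]
runs_official = [[0, 1, 0], [0, 0, 1]]

ndcg_simple = mean_ndcg(runs_simple, k=100)
ndcg_official = mean_ndcg(runs_official, k=100)
print(f"simple: {ndcg_simple:.4f}  official: {ndcg_official:.4f}")
```

In a real run, each inner list would come from ranking the full corpus by cosine similarity for one query and looking up the relevance label of each document in that order.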

Question
Should I use simple concatenation ({title} | {content}) instead of the documented format (title: {title} | text: {content}) for retrieval tasks?
The simple format consistently outperforms the documented one. Is this expected behavior, or should the documentation be updated?
