An embedding is a dense vector of numbers — often 768 or 1536 dimensions — that encodes the semantic content of a piece of text. Modern language models (such as OpenAI's text-embedding models or Sentence-BERT) are trained to place semantically similar texts near each other in this high-dimensional space. The cosine similarity between two embeddings is a measure of semantic relatedness: two pitch decks in the same domain will have a high cosine similarity; a deck about consumer fintech and one about industrial robotics will have a low similarity.
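The similarity computation itself is short. The sketch below uses toy 4-dimensional vectors (real models emit hundreds of dimensions) invented purely for illustration, to show how same-domain decks score high and cross-domain pairs score low:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: the dot product of two vectors divided by
    the product of their magnitudes; ranges from -1 to 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" standing in for model output (hypothetical values).
fintech_a = np.array([0.9, 0.1, 0.2, 0.0])   # consumer fintech deck
fintech_b = np.array([0.8, 0.2, 0.1, 0.1])   # another fintech deck
robotics  = np.array([0.0, 0.1, 0.9, 0.8])   # industrial robotics deck

print(cosine_similarity(fintech_a, fintech_b))  # high: same domain
print(cosine_similarity(fintech_a, robotics))   # low: different domains
```

The same-domain pair scores near 1.0 here, while the cross-domain pair lands close to 0; real embedding models separate domains less cleanly, but the ordering is what matters.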
In venture capital applications, embeddings are used for several critical functions. Investor-startup matching: the embedding of an investor's thesis and portfolio is compared to the embedding of a startup's description to compute alignment. Similarity search: given a new pitch deck, the system can retrieve the most similar historical decks (both successful and unsuccessful investments) as a benchmark reference. Clustering: a portfolio of 200 pitch decks can be clustered by embedding similarity to identify concentrations and gaps in deal flow.
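The similarity-search function reduces to one matrix-vector product once embeddings are stacked into a matrix. The sketch below assumes deck embeddings arrive as NumPy rows; the vectors and deck labels are invented for illustration:

```python
import numpy as np

def top_k_similar(query: np.ndarray, corpus: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k corpus rows most similar to the query,
    ranked by cosine similarity, highest first."""
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    sims = corpus_norm @ query_norm          # one dot product per deck
    return np.argsort(-sims)[:k].tolist()

# Hypothetical historical deck embeddings (4 dims for readability).
decks = np.array([
    [0.9, 0.1, 0.0, 0.1],   # 0: consumer fintech
    [0.1, 0.9, 0.1, 0.0],   # 1: industrial robotics
    [0.7, 0.3, 0.1, 0.0],   # 2: payments infrastructure
    [0.0, 0.8, 0.2, 0.1],   # 3: warehouse automation
])
new_deck = np.array([0.85, 0.15, 0.05, 0.05])  # incoming fintech pitch

print(top_k_similar(new_deck, decks, k=2))  # nearest historical decks first
```

Because the rows are normalized once up front, scoring a new deck against the whole corpus is a single `@` product; the same normalized matrix also feeds clustering directly (e.g. k-means over the rows) to surface concentrations and gaps.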
The quality of an embedding-based system depends on three factors: the underlying model (general-purpose embedding models outperform older bag-of-words approaches but may miss domain-specific nuance), the text being embedded (a well-structured deck produces more informative embeddings than a sparse one), and the similarity computation itself (cosine similarity with appropriate thresholds rather than raw Euclidean distance, which conflates vector magnitude with semantic direction and becomes less discriminative in high dimensions).
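One nuance worth checking numerically: on unit-normalized vectors, squared Euclidean distance and cosine similarity are monotonically related (||a − b||² = 2(1 − cos)), so the choice only matters for unnormalized embeddings. The sketch below verifies the identity on random unit vectors; the 0.75 threshold is an illustrative cutoff, not a recommended value, since thresholds must be tuned per model and task:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random unit vectors in a 768-dimensional space.
a = rng.normal(size=768); a /= np.linalg.norm(a)
b = rng.normal(size=768); b /= np.linalg.norm(b)

cos = float(a @ b)
euclid_sq = float(np.sum((a - b) ** 2))

# On unit vectors: ||a - b||^2 == 2 * (1 - cos), so rankings agree.
print(abs(euclid_sq - 2 * (1 - cos)) < 1e-9)   # True

# A threshold turns the raw score into a match/no-match decision.
THRESHOLD = 0.75  # hypothetical cutoff; tune per model and task
print(cos >= THRESHOLD)  # False: random vectors are nearly orthogonal
```

Random high-dimensional vectors land near-orthogonal (cosine close to 0), which is why a meaningful threshold sits well above the background similarity of unrelated texts.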