Using Qdrant for Embeddings Search with C#

Published on Sunday, 31 December 2023 by Russ Cam

I recently wrote a .NET client for the Qdrant vector database and ended up contributing it to the official Qdrant .NET SDK. A user asked how to adapt an OpenAI cookbook article, written in Python using the Qdrant Python client, to C# using the .NET SDK. That is the topic of today's post.

Setup

First, create a .NET 8 console project

dotnet new console -f net8.0

and add the NuGet packages needed for the application

dotnet add package Qdrant.Client -v 1.7.0
dotnet add package CsvHelper -v 30.0.1
dotnet add package Azure.AI.OpenAI -v 1.0.0-beta.12

Load data

Open Program.cs and add code to download the Wikipedia articles CSV and read it

var embeddingsUrl =
	new Uri("https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip");

// set to directory in which to unzip download
var name = @"C:\vector_database_wikipedia_articles_embedded";

if (!Directory.Exists(name))
{
	using var httpClient = new HttpClient();
	await using (var stream = await httpClient.GetStreamAsync(embeddingsUrl))
	await using (var destination = new FileStream($"{name}.zip", FileMode.CreateNew))
		await stream.CopyToAsync(destination);

	ZipFile.ExtractToDirectory($"{name}.zip", name);
}

var records = ReadRecords(Path.Combine(name, "vector_database_wikipedia_articles_embedded.csv"));

static IEnumerable<CsvRecord> ReadRecords(string name)
{
	using var reader = new StreamReader(name);
	var config = new CsvConfiguration(CultureInfo.InvariantCulture)
	{
		PrepareHeaderForMatch = args => string.Concat(args.Header.Select((x, i) =>
			i > 0 && char.IsUpper(x) ? "_" + x : x.ToString())).ToLowerInvariant()
	};
	using var csv = new CsvReader(reader, config);
	foreach (var record in csv.GetRecords<CsvRecord>())
		yield return record;
}

public record CsvRecord(
	int Id,
	string Url,
	string Title,
	string Text,
	[TypeConverter(typeof(StringToEmbeddingConverter))]
	float[] TitleVector,
	[TypeConverter(typeof(StringToEmbeddingConverter))]
	float[] ContentVector,
	int VectorId);

public class StringToEmbeddingConverter : DefaultTypeConverter
{
	public override object ConvertFromString(string? text, IReaderRow row, MemberMapData memberMapData) =>
		JsonSerializer.Deserialize<float[]>(text ?? throw new ArgumentNullException(nameof(text)))!;
}

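This downloads the Wikipedia zip file from OpenAI if it has not already been downloaded, unzips it, and lazily enumerates the CSV records. The PrepareHeaderForMatch delegate converts names such as TitleVector to snake case so that the record's members match the CSV headers, and the StringToEmbeddingConverter deserializes the string representation of an embedding into a float array.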
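To see what the header matching and the converter do in isolation, here is a minimal sketch that needs neither CsvHelper nor the download. ToSnakeCase and ParseEmbedding are names chosen for this sketch; the bodies mirror the PrepareHeaderForMatch delegate and the converter shown earlier, and the sample input strings are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Same header normalisation as the PrepareHeaderForMatch delegate,
// pulled out into a standalone function (hypothetical name).
static string ToSnakeCase(string header) =>
	string.Concat(header.Select((x, i) =>
		i > 0 && char.IsUpper(x) ? "_" + x : x.ToString())).ToLowerInvariant();

// Same deserialisation call that StringToEmbeddingConverter makes.
static float[] ParseEmbedding(string text) =>
	JsonSerializer.Deserialize<float[]>(text)
		?? throw new FormatException("Expected a JSON array of numbers");

Console.WriteLine(ToSnakeCase("TitleVector"));   // title_vector
Console.WriteLine(ToSnakeCase("title_vector"));  // title_vector (already snake case)

var embedding = ParseEmbedding("[0.25, -0.5, 1.0]");
Console.WriteLine(embedding.Length);             // 3
```

Because the same function is applied to both the CSV header and the member name, a header that is already in snake case passes through unchanged and still matches.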

Index data

The data is now ready to be indexed, so we'll need an instance of Qdrant to work with. Using the Qdrant Docker image is the easiest way to get a single instance up and running

docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 \
    -v $(pwd)/qdrant_storage:/qdrant/storage:z \
    qdrant/qdrant

With Qdrant running, a collection can be created to store the data, and records can be indexed into it using the official .NET SDK. The .NET client communicates over gRPC, which is why port 6334 is published alongside the REST port 6333

var firstRecord = records.First();
var size = (ulong)firstRecord.ContentVector.Length;
var collectionName = "Articles";
var client = new QdrantClient("localhost");

try
{
	await client.DeleteCollectionAsync(collectionName);
}
catch (QdrantException)
{
	// ignore: the collection may not exist yet
}

await client.CreateCollectionAsync(collectionName,
	new VectorParamsMap
	{
		Map =
		{
			["title"] = new VectorParams { Distance = Distance.Cosine, Size = size },
			["content"] = new VectorParams { Distance = Distance.Cosine, Size = size },
		}
	});

// records is enumerated again from the start below, so the first record
// is added by the loop and must not be added here as well
var points = new List<PointStruct>(1000);

foreach (var record in records)
{
	points.Add(RecordToPointStruct(record));
	if (points.Count == 1000)
	{
		await client.UpsertAsync(collectionName, points);
		points.Clear();
	}
}

if (points.Any())
	await client.UpsertAsync(collectionName, points);

var count = await client.CountAsync(collectionName);
Console.WriteLine($"Count of points: {count}");

static PointStruct RecordToPointStruct(CsvRecord record)
{
	return new PointStruct
	{
		Id = (ulong)record.Id,
		Vectors = new Dictionary<string, float[]>
		{
			["title"] = record.TitleVector,
			["content"] = record.ContentVector
		},
		Payload =
		{
			["url"] = record.Url,
			["title"] = record.Title,
			["text"] = record.Text
		}
	};
}

The separate delete and create collection calls are needed until a fix for a small bug in RecreateCollectionAsync is merged and released.
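The manual batching loop can also be written with Enumerable.Chunk, assuming .NET 6 or later. The sketch below only shows the batching pattern: UpsertBatchAsync is a hypothetical stand-in for the call to client.UpsertAsync, and a range of integers stands in for the CSV records, so it runs without a Qdrant instance.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for client.UpsertAsync(collectionName, batch):
// it records the size of each batch instead of calling Qdrant.
var batchSizes = new List<int>();
Task UpsertBatchAsync(IReadOnlyList<int> batch)
{
	batchSizes.Add(batch.Count);
	return Task.CompletedTask;
}

// 2,500 integers stand in for the lazily read CSV records.
var items = Enumerable.Range(1, 2500);

// Chunk yields arrays of at most 1000 items; the last one holds the remainder,
// so no separate "flush what is left" step is needed after the loop.
foreach (var batch in items.Chunk(1000))
	await UpsertBatchAsync(batch);

Console.WriteLine(string.Join(", ", batchSizes)); // 1000, 1000, 500
```

In the real program the same shape would presumably be records.Select(RecordToPointStruct).Chunk(1000), with each array passed to UpsertAsync. Chunk stays lazy, so only one batch of records is held in memory at a time.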

Search data

Once the data is in Qdrant, the collection can be queried for the closest vectors. The name of the vector to search can be passed in, to query either the title or the content vector.

var openAiApiKey = "<insert your OpenAI API key>";
var openAIClient = new OpenAIClient(openAiApiKey);

var results = await Query(client, openAIClient, "modern art in Europe", collectionName);

foreach (var (point, i) in results.Select((point, i) => (point, i)))
	Console.WriteLine($"{i + 1}. {point.Payload["title"].StringValue} (Score: {Math.Round(point.Score, 3)})");

Console.WriteLine();

results = await Query(client, openAIClient, "Famous battles in Scottish history", collectionName, "content");

foreach (var (point, i) in results.Select((point, i) => (point, i)))
	Console.WriteLine($"{i + 1}. {point.Payload["title"].StringValue} (Score: {Math.Round(point.Score, 3)})");

return;

static async Task<IReadOnlyList<ScoredPoint>> Query(
	QdrantClient client,
	OpenAIClient openAIClient,
	string query,
	string collectionName,
	string vectorName = "title",
	ulong topK = 20)
{
	var response = await openAIClient.GetEmbeddingsAsync(new EmbeddingsOptions
		{
			Input = { query },
			DeploymentName = "text-embedding-ada-002"
		});
	return await client.SearchAsync(collectionName, response.Value.Data[0].Embedding, vectorName: vectorName, limit: topK);
}

which yields

1. Museum of Modern Art (Score: 0.875)
2. Western Europe (Score: 0.867)
3. Renaissance art (Score: 0.864)
4. Pop art (Score: 0.86)
5. Northern Europe (Score: 0.855)
6. Hellenistic art (Score: 0.853)
7. Modernist literature (Score: 0.847)
8. Art film (Score: 0.843)
9. Central Europe (Score: 0.843)
10. European (Score: 0.841)
11. Art (Score: 0.841)
12. Byzantine art (Score: 0.841)
13. Postmodernism (Score: 0.84)
14. Eastern Europe (Score: 0.839)
15. Europe (Score: 0.839)
16. Cubism (Score: 0.839)
17. Impressionism (Score: 0.838)
18. Bauhaus (Score: 0.838)
19. Surrealism (Score: 0.837)
20. Expressionism (Score: 0.837)

1. Battle of Bannockburn (Score: 0.869)
2. Wars of Scottish Independence (Score: 0.862)
3. 1651 (Score: 0.853)
4. First War of Scottish Independence (Score: 0.85)
5. Robert I of Scotland (Score: 0.846)
6. 841 (Score: 0.844)
7. 1716 (Score: 0.844)
8. 1314 (Score: 0.837)
9. 1263 (Score: 0.836)
10. William Wallace (Score: 0.835)
11. Stirling (Score: 0.831)
12. 1306 (Score: 0.831)
13. 1746 (Score: 0.83)
14. 1040s (Score: 0.828)
15. 1106 (Score: 0.827)
16. 1304 (Score: 0.827)
17. David II of Scotland (Score: 0.825)
18. Braveheart (Score: 0.825)
19. 1124 (Score: 0.824)
20. July 27 (Score: 0.823)
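The scores are cosine similarities, since both named vectors were created with Distance.Cosine; values closer to 1 mean the query embedding and the stored vector point in nearly the same direction. As a rough illustration of the measure itself (not of how Qdrant computes it internally), here is a small sketch. CosineSimilarity is a helper written for this example and the input vectors are made up.

```csharp
using System;

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Hypothetical helper for illustration; not part of the Qdrant SDK.
static double CosineSimilarity(float[] a, float[] b)
{
	if (a.Length != b.Length)
		throw new ArgumentException("Vectors must have the same length");

	double dot = 0, normA = 0, normB = 0;
	for (var i = 0; i < a.Length; i++)
	{
		dot += (double)a[i] * b[i];
		normA += (double)a[i] * a[i];
		normB += (double)b[i] * b[i];
	}

	return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Same direction, different length: similarity is 1.
Console.WriteLine(CosineSimilarity(new[] { 1f, 0f }, new[] { 2f, 0f }));  // 1

// Perpendicular vectors: similarity is 0.
Console.WriteLine(CosineSimilarity(new[] { 1f, 0f }, new[] { 0f, 3f }));  // 0
```

Only the direction of the vectors matters, not their length, which is why the first pair scores 1 even though one vector is twice as long as the other.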

Complete solution

Here's the complete solution (also available as a gist).

using System.Globalization;
using System.IO.Compression;
using System.Text.Json;
using Azure.AI.OpenAI;
using CsvHelper;
using CsvHelper.Configuration;
using CsvHelper.Configuration.Attributes;
using CsvHelper.TypeConversion;
using Qdrant.Client;
using Qdrant.Client.Grpc;

var embeddingsUrl =
	new Uri("https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip");

var name = @"C:\vector_database_wikipedia_articles_embedded";

if (!Directory.Exists(name))
{
	using var httpClient = new HttpClient();
	await using (var stream = await httpClient.GetStreamAsync(embeddingsUrl))
	await using (var destination = new FileStream($"{name}.zip", FileMode.CreateNew))
		await stream.CopyToAsync(destination);

	ZipFile.ExtractToDirectory($"{name}.zip", name);
}

var records = ReadRecords(Path.Combine(name, "vector_database_wikipedia_articles_embedded.csv"));

var firstRecord = records.First();
var size = (ulong)firstRecord.ContentVector.Length;
var collectionName = "Articles";
var client = new QdrantClient("localhost");

try
{
	await client.DeleteCollectionAsync(collectionName);
}
catch (QdrantException)
{
	// ignore: the collection may not exist yet
}

await client.CreateCollectionAsync(collectionName,
	new VectorParamsMap
	{
		Map =
		{
			["title"] = new VectorParams { Distance = Distance.Cosine, Size = size },
			["content"] = new VectorParams { Distance = Distance.Cosine, Size = size },
		}
	});

// records is enumerated again from the start below, so the first record
// is added by the loop and must not be added here as well
var points = new List<PointStruct>(1000);

foreach (var record in records)
{
	points.Add(RecordToPointStruct(record));
	if (points.Count == 1000)
	{
		await client.UpsertAsync(collectionName, points);
		points.Clear();
	}
}

if (points.Any())
	await client.UpsertAsync(collectionName, points);

var count = await client.CountAsync(collectionName);
Console.WriteLine($"Count of points: {count}");

var openAiApiKey = "<insert your OpenAI API key>";
var openAIClient = new OpenAIClient(openAiApiKey);

var results = await Query(client, openAIClient, "modern art in Europe", collectionName);

foreach (var (point, i) in results.Select((point, i) => (point, i)))
	Console.WriteLine($"{i + 1}. {point.Payload["title"].StringValue} (Score: {Math.Round(point.Score, 3)})");

Console.WriteLine();

results = await Query(client, openAIClient, "Famous battles in Scottish history", collectionName, "content");

foreach (var (point, i) in results.Select((point, i) => (point, i)))
	Console.WriteLine($"{i + 1}. {point.Payload["title"].StringValue} (Score: {Math.Round(point.Score, 3)})");

return;

static async Task<IReadOnlyList<ScoredPoint>> Query(
	QdrantClient client,
	OpenAIClient openAIClient,
	string query,
	string collectionName,
	string vectorName = "title",
	ulong topK = 20)
{
	var response = await openAIClient.GetEmbeddingsAsync(new EmbeddingsOptions
		{
			Input = { query },
			DeploymentName = "text-embedding-ada-002"
		});
	return await client.SearchAsync(collectionName, response.Value.Data[0].Embedding, vectorName: vectorName, limit: topK);
}

static PointStruct RecordToPointStruct(CsvRecord record)
{
	return new PointStruct
	{
		Id = (ulong)record.Id,
		Vectors = new Dictionary<string, float[]>
		{
			["title"] = record.TitleVector,
			["content"] = record.ContentVector
		},
		Payload =
		{
			["url"] = record.Url,
			["title"] = record.Title,
			["text"] = record.Text
		}
	};
}

static IEnumerable<CsvRecord> ReadRecords(string name)
{
	using var reader = new StreamReader(name);
	var config = new CsvConfiguration(CultureInfo.InvariantCulture)
	{
		PrepareHeaderForMatch = args => string.Concat(args.Header.Select((x, i) =>
			i > 0 && char.IsUpper(x) ? "_" + x : x.ToString())).ToLowerInvariant()
	};
	using var csv = new CsvReader(reader, config);
	foreach (var record in csv.GetRecords<CsvRecord>())
		yield return record;
}

public record CsvRecord(
	int Id,
	string Url,
	string Title,
	string Text,
	[TypeConverter(typeof(StringToEmbeddingConverter))]
	float[] TitleVector,
	[TypeConverter(typeof(StringToEmbeddingConverter))]
	float[] ContentVector,
	int VectorId);

public class StringToEmbeddingConverter : DefaultTypeConverter
{
	public override object ConvertFromString(string? text, IReaderRow row, MemberMapData memberMapData) =>
		JsonSerializer.Deserialize<float[]>(text ?? throw new ArgumentNullException(nameof(text)))!;
}
