AI-First .NET backend

We’ll build a high-throughput, AI-first .NET 8 backend that generates embeddings locally with ONNX Runtime, stores them in a pluggable vector store (in-memory for development, Qdrant for production), and exposes semantic search through minimal APIs.

This solves a real production problem: how to serve semantic capabilities (search, RAG, personalization, anomaly detection) from your existing .NET services without routing every request through a cloud LLM provider. You get low latency, predictable cost, and full control over your data, because inference runs inside your own process.

Create a new .NET 8 Web API (minimal APIs) project:
dotnet new webapi -n SemanticBackend
cd SemanticBackend
Edit SemanticBackend.csproj to target .NET 8 and add packages:
<Project Sdk="Microsoft.NET.Sdk.Web">
<PropertyGroup>
<TargetFramework>net8.0</TargetFramework>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="Microsoft.ML.OnnxRuntime" Version="1.20.0" />
<PackageReference Include="Microsoft.ML.OnnxRuntime.Managed" Version="1.20.0" />
<PackageReference Include="Dapper" Version="2.1.35" />
<PackageReference Include="Qdrant.Client" Version="3.5.0" />
</ItemGroup>
</Project>
Place your ONNX model file under ./Models/embeddings.onnx and mark it as Copy if newer in the .csproj:
<ItemGroup>
<None Include="Models\embeddings.onnx" CopyToOutputDirectory="PreserveNewest" />
</ItemGroup>
We’ll focus on a simple domain: documents with semantic search.
namespace SemanticBackend.Documents;
public sealed record Document(
Guid Id,
string ExternalId,
string Title,
string Content,
float[] Embedding,
DateTimeOffset CreatedAt);
For API DTOs:
namespace SemanticBackend.Api;
public sealed record IndexDocumentRequest(
string ExternalId,
string Title,
string Content);
public sealed record SearchRequest(
string Query,
int TopK = 5);
public sealed record SearchResult(
Guid Id,
string ExternalId,
string Title,
string Content,
double Score);
This service will take raw text and produce a fixed-size, normalized embedding vector that the rest of the system can index and compare. Basic abstraction:
namespace SemanticBackend.Embeddings;
public interface IEmbeddingGenerator
{
ValueTask<float[]> GenerateAsync(string text, CancellationToken ct = default);
}
ONNX-based implementation (simplified – assumes the model takes a single input tensor already preprocessed; you can extend this to include tokenization or use a model exported with pre/post processing baked in):
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
namespace SemanticBackend.Embeddings;
public sealed class OnnxEmbeddingGenerator : IEmbeddingGenerator, IAsyncDisposable
{
private readonly InferenceSession _session;
private readonly string _inputName;
private readonly string _outputName;
public OnnxEmbeddingGenerator(string modelPath)
{
// Configure session options (CPU, threads, graph optimizations)
var options = new SessionOptions
{
GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL
};
options.EnableMemoryPattern = true;
_session = new InferenceSession(modelPath, options);
// Inspect model metadata for input/output names if needed.
_inputName = _session.InputMetadata.Keys.First();
_outputName = _session.OutputMetadata.Keys.First();
}
public ValueTask<float[]> GenerateAsync(string text, CancellationToken ct = default)
{
// You would normally do proper tokenization here or call a model
// that encapsulates tokenization in the ONNX graph.
// For demo, we assume an external process provides us with a fixed-size input vector.
// Replace this with real tokenization for a production system.
// Example: fake tokenization into a fixed-length float vector
const int inputLength = 128;
var inputTensor = new DenseTensor<float>(new[] { 1, inputLength });
var span = inputTensor.Buffer.Span;
span.Clear();
// SUPER simplified: map chars to floats
var length = Math.Min(text.Length, inputLength);
for (var i = 0; i < length; i++)
{
span[i] = text[i] % 128; // safe demo mapping
}
var inputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor(_inputName, inputTensor)
};
using var results = _session.Run(inputs);
var outputTensor = results.First(v => v.Name == _outputName).AsTensor<float>();
var embedding = outputTensor.ToArray();
NormalizeInPlace(embedding);
return ValueTask.FromResult(embedding);
}
private static void NormalizeInPlace(Span<float> vector)
{
var length = vector.Length;
if (length == 0) return;
// Use double accumulator to minimize rounding
double sumSquares = 0;
for (var i = 0; i < length; i++)
{
var v = vector[i];
sumSquares += (double)v * v;
}
var norm = Math.Sqrt(sumSquares);
if (norm < 1e-12) return;
var inv = (float)(1.0 / norm);
for (var i = 0; i < length; i++)
{
vector[i] *= inv;
}
}
public ValueTask DisposeAsync()
{
_session.Dispose();
return ValueTask.CompletedTask;
}
}
Note: In production, you should plug in a real tokenizer and model-specific pre/post-processing. The overall pattern remains the same.
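As one concrete option, the Microsoft.ML.Tokenizers NuGet package ships tokenizers that pair well with BERT-style embedding models. A hedged sketch, not wired into the generator above; the vocab path is a placeholder, and your model’s actual inputs (typically input_ids plus an attention_mask, as int64 tensors) should be checked against its metadata:

using Microsoft.ML.Tokenizers;

// Sketch: real tokenization for a BERT-style embedding model.
// "Models/vocab.txt" is assumed to be the vocabulary the model was trained with.
var tokenizer = BertTokenizer.Create("Models/vocab.txt");

// Produces token ids that would feed an "input_ids" tensor;
// a matching attention mask is usually required as well.
IReadOnlyList<int> ids = tokenizer.EncodeToIds("hello semantic world");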
We want the rest of the code to be independent of the specific database implementation.
namespace SemanticBackend.VectorStore;
using SemanticBackend.Documents;
public interface IVectorStore
{
Task IndexAsync(Document document, CancellationToken ct = default);
Task<IReadOnlyList<(Document Document, double Score)>> SearchAsync(
float[] queryEmbedding,
int topK,
CancellationToken ct = default);
}
We’ll start with an in-memory store implementing cosine similarity. This is great for local development and testing.
using System.Collections.Concurrent;
using SemanticBackend.Documents;
namespace SemanticBackend.VectorStore;
public sealed class InMemoryVectorStore : IVectorStore
{
private readonly ConcurrentDictionary<Guid, Document> _documents = new();
public Task IndexAsync(Document document, CancellationToken ct = default)
{
_documents[document.Id] = document;
return Task.CompletedTask;
}
public Task<IReadOnlyList<(Document Document, double Score)>> SearchAsync(
float[] queryEmbedding,
int topK,
CancellationToken ct = default)
{
if (_documents.Count == 0)
{
return Task.FromResult<IReadOnlyList<(Document, double)>>
(Array.Empty<(Document, double)>());
}
// Cosine similarity: dot(a, b) / (|a| * |b|), but since vectors
// are normalized, this is just dot(a, b).
var results = new List<(Document, double)>(_documents.Count);
foreach (var doc in _documents.Values)
{
var score = Dot(queryEmbedding, doc.Embedding);
results.Add((doc, score));
}
var top = results
.OrderByDescending(r => r.Item2)
.Take(topK)
.ToArray();
return Task.FromResult<IReadOnlyList<(Document, double)>>(top);
}
private static double Dot(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
if (a.Length != b.Length)
{
throw new InvalidOperationException(
$"Vector dimension mismatch: {a.Length} vs {b.Length}.");
}
var sum = 0.0;
for (var i = 0; i < a.Length; i++)
{
sum += a[i] * b[i];
}
return sum;
}
}
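If the in-memory corpus grows, the dot product becomes the hot loop. The System.Numerics.Tensors NuGet package (a separate reference, not in the csproj above) offers a SIMD-accelerated primitive; a possible drop-in replacement for the scalar Dot method, assuming that package is added:

using System.Numerics.Tensors;

// SIMD-accelerated alternative to the scalar Dot loop above.
// TensorPrimitives.Dot throws if the spans differ in length.
private static double Dot(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    => TensorPrimitives.Dot(a, b);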
Let’s add a Qdrant-backed store to illustrate real vector DB usage. We assume a collection with vector_size equal to your embedding dimension and appropriate distance metric (cosine).
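If the collection doesn’t exist yet, you can create it once at startup. A minimal sketch using the official client; the collection name and size are placeholders that must match your embedding dimension, and method names follow recent versions of Qdrant.Client:

using Qdrant.Client;
using Qdrant.Client.Grpc;

// One-time setup, e.g., at application startup.
var client = new QdrantClient("localhost", 6334);

if (!await client.CollectionExistsAsync("documents"))
{
    await client.CreateCollectionAsync(
        "documents",
        new VectorParams { Size = 384, Distance = Distance.Cosine });
}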
using Qdrant.Client;
using Qdrant.Client.Grpc;
using SemanticBackend.Documents;
namespace SemanticBackend.VectorStore;
public sealed class QdrantVectorStore : IVectorStore
{
    private readonly QdrantClient _client;
    private readonly string _collectionName;
    private readonly int _dimension;
    public QdrantVectorStore(QdrantClient client, string collectionName, int dimension)
    {
        _client = client;
        _collectionName = collectionName;
        _dimension = dimension;
    }
    public async Task IndexAsync(Document document, CancellationToken ct = default)
    {
        if (document.Embedding.Length != _dimension)
        {
            throw new InvalidOperationException(
                $"Vector dimension mismatch: expected {_dimension}, got {document.Embedding.Length}.");
        }
        // The official client provides implicit conversions:
        // Guid -> PointId, float[] -> Vectors, string -> Value.
        var point = new PointStruct
        {
            Id = document.Id,
            Vectors = document.Embedding,
            Payload =
            {
                ["externalId"] = document.ExternalId,
                ["title"] = document.Title,
                ["content"] = document.Content,
                // Protobuf payload values have no date type; store ISO-8601 text.
                ["createdAt"] = document.CreatedAt.ToString("O")
            }
        };
        await _client.UpsertAsync(
            _collectionName,
            new[] { point },
            cancellationToken: ct);
    }
    public async Task<IReadOnlyList<(Document Document, double Score)>> SearchAsync(
        float[] queryEmbedding,
        int topK,
        CancellationToken ct = default)
    {
        var searchPoints = await _client.SearchAsync(
            _collectionName,
            queryEmbedding,
            limit: (ulong)topK,
            payloadSelector: true, // include the stored payload in results
            cancellationToken: ct);
        var results = new List<(Document, double)>(searchPoints.Count);
        foreach (var point in searchPoints)
        {
            var payload = point.Payload;
            var externalId = payload.TryGetValue("externalId", out var extVal)
                ? extVal.StringValue
                : string.Empty;
            var title = payload.TryGetValue("title", out var titleVal)
                ? titleVal.StringValue
                : string.Empty;
            var content = payload.TryGetValue("content", out var contentVal)
                ? contentVal.StringValue
                : string.Empty;
            var createdAt = payload.TryGetValue("createdAt", out var createdVal)
                && DateTimeOffset.TryParse(createdVal.StringValue, out var parsed)
                ? parsed
                : DateTimeOffset.UtcNow;
            // Qdrant doesn't return the stored vector unless asked to; we don't
            // need it for read scenarios, so reuse the query embedding as a placeholder.
            var doc = new Document(
                Guid.Parse(point.Id.Uuid),
                externalId,
                title,
                content,
                queryEmbedding,
                createdAt);
            results.Add((doc, point.Score));
        }
        return results;
    }
}
Note: the official Qdrant.Client exposes implicit conversions for point IDs, vectors, and payload values, so no manual protobuf Struct construction is needed. Timestamps are stored as ISO-8601 strings because payload values have no native date type. Exact method signatures can vary between client versions, so check the version you reference.
Now we compose the embedding generator with the vector store into a use-case–centric service.
using SemanticBackend.Api;
using SemanticBackend.Documents;
using SemanticBackend.Embeddings;
using SemanticBackend.VectorStore;
namespace SemanticBackend.Application;
public interface IDocumentService
{
Task<Guid> IndexAsync(IndexDocumentRequest request, CancellationToken ct = default);
Task<IReadOnlyList<SearchResult>> SearchAsync(SearchRequest request, CancellationToken ct = default);
}
public sealed class DocumentService(IEmbeddingGenerator embeddings, IVectorStore store)
: IDocumentService
{
public async Task<Guid> IndexAsync(IndexDocumentRequest request, CancellationToken ct = default)
{
var embedding = await embeddings.GenerateAsync(request.Content, ct);
var document = new Document(
Id: Guid.NewGuid(),
ExternalId: request.ExternalId,
Title: request.Title,
Content: request.Content,
Embedding: embedding,
CreatedAt: DateTimeOffset.UtcNow);
await store.IndexAsync(document, ct);
return document.Id;
}
public async Task<IReadOnlyList<SearchResult>> SearchAsync(SearchRequest request, CancellationToken ct = default)
{
var queryEmbedding = await embeddings.GenerateAsync(request.Query, ct);
var matches = await store.SearchAsync(queryEmbedding, request.TopK, ct);
return matches
.Select(m => new SearchResult(
m.Document.Id,
m.Document.ExternalId,
m.Document.Title,
m.Document.Content,
m.Score))
.ToArray();
}
}
Now we expose REST endpoints using minimal APIs.
using Microsoft.AspNetCore.Http.HttpResults;
using SemanticBackend.Api;
using SemanticBackend.Application;
using SemanticBackend.Embeddings;
using SemanticBackend.VectorStore;
using Qdrant.Client;
var builder = WebApplication.CreateBuilder(args);
// Configuration
var configuration = builder.Configuration;
var modelPath = Path.Combine(AppContext.BaseDirectory, "Models", "embeddings.onnx");
const int embeddingDimension = 384; // Adjust to your model
// DI registrations
builder.Services.AddSingleton<IEmbeddingGenerator>(_ => new OnnxEmbeddingGenerator(modelPath));
// Choose one vector store implementation.
// For local/dev:
builder.Services.AddSingleton<IVectorStore, InMemoryVectorStore>();
// For Qdrant (comment the above and uncomment these):
// var qdrantHost = configuration.GetValue<string>("Qdrant:Host") ?? "localhost";
// var qdrantPort = configuration.GetValue<int?>("Qdrant:Port") ?? 6334;
// var qdrantCollection = configuration.GetValue<string>("Qdrant:Collection") ?? "documents";
// builder.Services.AddSingleton(new QdrantClient(qdrantHost, qdrantPort));
// builder.Services.AddSingleton<IVectorStore>(sp =>
// {
//     var client = sp.GetRequiredService<QdrantClient>();
//     return new QdrantVectorStore(client, qdrantCollection, embeddingDimension);
// });
builder.Services.AddScoped<IDocumentService, DocumentService>();
builder.Services.ConfigureHttpJsonOptions(options =>
{
options.SerializerOptions.PropertyNamingPolicy = null;
options.SerializerOptions.WriteIndented = false;
});
var app = builder.Build();
app.MapPost("/documents/index", async Task<Results<Ok<Guid>, BadRequest<string>>> (
IndexDocumentRequest request,
IDocumentService service,
CancellationToken ct) =>
{
if (string.IsNullOrWhiteSpace(request.Content))
{
return TypedResults.BadRequest("Content must not be empty.");
}
var id = await service.IndexAsync(request, ct);
return TypedResults.Ok(id);
});
app.MapPost("/documents/search", async Task<Ok<IReadOnlyList<SearchResult>>> (
SearchRequest request,
IDocumentService service,
CancellationToken ct) =>
{
if (string.IsNullOrWhiteSpace(request.Query))
{
return TypedResults.Ok(Array.Empty<SearchResult>());
}
var results = await service.SearchAsync(request, ct);
return TypedResults.Ok(results);
});
app.Run();
You now have a working pipeline: POST /documents/index embeds a document and stores it in the vector store, and POST /documents/search embeds the query and returns the top-K most similar documents with their scores.
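To smoke-test the two endpoints, a small client-side sketch (the base address is a placeholder for wherever Kestrel listens; SearchResult is the record defined earlier):

using System.Net.Http.Json;

var http = new HttpClient { BaseAddress = new Uri("http://localhost:5000") };

// Index a document; the endpoint returns the generated Guid.
var indexResponse = await http.PostAsJsonAsync(
    "/documents/index",
    new { ExternalId = "doc-1", Title = "Hello", Content = "Semantic search with ONNX." });
var id = await indexResponse.Content.ReadFromJsonAsync<Guid>();

// Search semantically for the document we just indexed.
var searchResponse = await http.PostAsJsonAsync(
    "/documents/search",
    new { Query = "onnx semantic search", TopK = 3 });
var results = await searchResponse.Content.ReadFromJsonAsync<List<SearchResult>>();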
Once you have semantic search, layering retrieval-augmented generation (RAG) becomes straightforward. Instead of returning the documents, you can compose them into a prompt for an LLM (local ONNX LLM or remote provider).
Example service method (pseudo-LLM call):
public sealed class RagService(IDocumentService documents, IChatModel chatModel)
{
public async Task<string> AskAsync(string question, CancellationToken ct = default)
{
var searchResults = await documents.SearchAsync(
new SearchRequest(question, TopK: 5), ct);
var context = string.Join("\n\n", searchResults.Select(r =>
$"Title: {r.Title}\nContent: {r.Content}"));
var prompt = $"""
You are a helpful assistant. Answer the question based only on the context.
Context:
{context}
Question: {question}
""";
var answer = await chatModel.CompleteAsync(prompt, ct);
return answer;
}
}
Where IChatModel could be implemented using another ONNX model (e.g., Phi-3) or a cloud provider.
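IChatModel isn’t defined in this article; a minimal shape consistent with the call site above might look like this (a sketch, name and signature are assumptions):

// Hypothetical abstraction matching RagService's usage.
public interface IChatModel
{
    Task<string> CompleteAsync(string prompt, CancellationToken ct = default);
}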
For high throughput, you want to batch embeddings whenever possible.
public interface IBatchEmbeddingGenerator
{
ValueTask<float[][]> GenerateBatchAsync(
IReadOnlyList<string> texts,
CancellationToken ct = default);
}
Inside your ONNX implementation, you can create a tensor of shape [batchSize, sequenceLength] and run a single _session.Run() call, then split the output tensor into separate vectors per item. This significantly improves throughput when handling many small requests (e.g., indexing jobs).
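A hedged sketch of that batch path, written as an extra method inside OnnxEmbeddingGenerator and reusing the same toy char-to-float preprocessing as the single-item version (a real implementation would tokenize properly and respect the model’s actual input layout):

public ValueTask<float[][]> GenerateBatchAsync(
    IReadOnlyList<string> texts,
    CancellationToken ct = default)
{
    const int inputLength = 128;
    var batchSize = texts.Count;

    // One [batchSize, inputLength] tensor for the whole batch.
    var inputTensor = new DenseTensor<float>(new[] { batchSize, inputLength });
    for (var row = 0; row < batchSize; row++)
    {
        var text = texts[row];
        var length = Math.Min(text.Length, inputLength);
        for (var i = 0; i < length; i++)
        {
            inputTensor[row, i] = text[i] % 128; // same toy mapping as before
        }
    }

    var inputs = new List<NamedOnnxValue>
    {
        NamedOnnxValue.CreateFromTensor(_inputName, inputTensor)
    };

    // A single inference call covers every item in the batch.
    using var results = _session.Run(inputs);
    var output = results.First(v => v.Name == _outputName).AsTensor<float>();

    // Split the [batchSize, dim] output into one normalized vector per input.
    var dim = (int)(output.Length / batchSize);
    var embeddings = new float[batchSize][];
    for (var row = 0; row < batchSize; row++)
    {
        var vector = new float[dim];
        for (var i = 0; i < dim; i++)
        {
            vector[i] = output[row, i];
        }
        NormalizeInPlace(vector);
        embeddings[row] = vector;
    }

    return ValueTask.FromResult(embeddings);
}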
Use a background queue for indexing to reduce latency on the write path:
using System.Threading.Channels;

public sealed class IndexingBackgroundService(
    Channel<IndexDocumentRequest> channel,
    IServiceScopeFactory scopeFactory,
    ILogger<IndexingBackgroundService> logger) : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        await foreach (var request in channel.Reader.ReadAllAsync(stoppingToken))
        {
            try
            {
                // DocumentService is registered as scoped, so resolve it from a
                // fresh scope per message instead of injecting it into this
                // singleton hosted service.
                using var scope = scopeFactory.CreateScope();
                var documentService = scope.ServiceProvider
                    .GetRequiredService<IDocumentService>();

                await documentService.IndexAsync(request, stoppingToken);
            }
            catch (Exception ex)
            {
                logger.LogError(ex, "Error indexing document {ExternalId}", request.ExternalId);
            }
        }
    }
}
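Wiring this up takes a few lines in Program.cs (a sketch; the /documents/index-async route is an example name): register a shared channel and the hosted service, then expose an enqueue-only endpoint so writes return immediately:

using System.Threading.Channels;

builder.Services.AddSingleton(Channel.CreateUnbounded<IndexDocumentRequest>());
builder.Services.AddHostedService<IndexingBackgroundService>();

// ...

app.MapPost("/documents/index-async", async (
    IndexDocumentRequest request,
    Channel<IndexDocumentRequest> channel,
    CancellationToken ct) =>
{
    await channel.Writer.WriteAsync(request, ct);
    return TypedResults.Accepted("/documents/index-async");
});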
Troubleshooting

Symptom: Errors like “vector dimension mismatch” or “expected dim X, got Y”.
Cause: Your model outputs a vector of dimension N, but your vector DB or code assumes a different size.
Fix: Inspect the model’s real output dimension (e.g., via _session.OutputMetadata) and keep it in sync with the dimension configured in your vector store (the value passed to QdrantVectorStore).

Symptom: App fails to start with missing DLL or shared library errors.
Cause: Native ONNX Runtime binaries missing for your platform.
Fix: Make sure the Microsoft.ML.OnnxRuntime package, which carries the native binaries, is referenced for your target platform, and that the runtime identifier you publish for matches the machine you deploy to.
Symptom: First inference is slow (model load, JIT, etc.).
Fix: Warm the model up with a dummy inference at startup, either in the OnnxEmbeddingGenerator constructor or via an IHostedService, as sketched below.
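A minimal warm-up sketch (the service name and the "warmup" text are arbitrary):

// Runs one dummy inference at startup so the first real request
// doesn't pay the model-load / first-run cost.
public sealed class ModelWarmupService(IEmbeddingGenerator embeddings) : IHostedService
{
    public async Task StartAsync(CancellationToken ct)
        => await embeddings.GenerateAsync("warmup", ct);

    public Task StopAsync(CancellationToken ct) => Task.CompletedTask;
}

// In Program.cs:
// builder.Services.AddHostedService<ModelWarmupService>();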
Symptom: Memory grows with concurrent requests.
Causes: Large allocations per request, no reuse of buffers, unbounded caching.
Fix: Reuse buffers where possible (e.g., ArrayPool<float>), put bounds on any caches, and batch many small requests into fewer inference calls.
Symptom: Semantic search results look random or irrelevant.
Causes: Wrong model type, missing normalization, bad pre-processing.
Fix: Use a model explicitly trained for sentence embeddings, normalize vectors the same way at indexing and query time, and make sure pre-processing (tokenization, casing, truncation) matches what the model expects.
Best practices:
- InferenceSession is safe for concurrent use in many scenarios; use a singleton per model.
- Use domain-specific abstractions like SemanticSearchResult, Embedding value objects, and dedicated services. This makes it easier to evolve the underlying implementation without leaking details.
- Mock IEmbeddingGenerator and IVectorStore to test application logic in isolation.
- Drive model paths, collection names, and dimensions from appsettings or environment variables.

We’ve built a modern, AI-first .NET 8 backend that generates embeddings locally with ONNX Runtime, persists and searches them through a pluggable IVectorStore, and exposes the whole pipeline via minimal APIs.

From here, you can swap the in-memory store for Qdrant or pgvector, batch embeddings for throughput, move indexing to a background queue, and layer RAG on top with a local or hosted LLM.
Frequently asked questions

Which embedding model should I pick?
Pick a model that is explicitly designed for sentence embeddings or similarity search and has a good balance between embedding size and speed. Smaller dimensions (e.g., 384–768) are usually enough for many enterprise scenarios while being faster and more memory-efficient than very large embeddings.
Can I use .NET 6 or .NET 7 instead of .NET 8?
Yes, the concepts are the same. Minimal APIs exist in .NET 6+, and ONNX Runtime works across these versions. You might need minor adjustments to the project file and language features depending on the C# version.
How should I run the LLM side for RAG?
You have two main options: run a small model locally through ONNX Runtime (as with the embeddings), or call a hosted provider over HTTP behind the IChatModel abstraction. The retrieval pipeline stays the same either way.
Should I normalize embeddings in C# or inside the ONNX graph?
Either works. Normalizing in C# (as shown) is flexible and easy to reason about; normalizing in the ONNX graph simplifies your C# code and guarantees consistent behavior across languages. The key is to normalize consistently for both indexing and querying.
How do I secure these endpoints?
Treat them like any other internal microservice: put them behind your usual authentication and authorization, validate inputs, and apply rate limiting if they are exposed beyond trusted callers.
Do I need pgvector, or can plain PostgreSQL work?
You can store embeddings as arrays or JSON and compute similarity in your application or via custom functions, but performance will be limited. pgvector gives you efficient vector types and index structures (IVFFlat, HNSW) suitable for high-throughput APIs.
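For illustration, a hedged Dapper sketch against a hypothetical pgvector schema (assumes a documents table with an embedding vector(384) column, the pgvector extension installed, and the Npgsql package added; none of these appear in the project above):

using System.Globalization;
using Dapper;
using Npgsql;

await using var conn = new NpgsqlConnection(connectionString);

// pgvector accepts a '[x,y,z]' literal cast to the vector type;
// <=> is its cosine-distance operator.
var vectorLiteral = "[" + string.Join(",",
    queryEmbedding.Select(v => v.ToString(CultureInfo.InvariantCulture))) + "]";

var rows = await conn.QueryAsync<(Guid Id, string Title, double Score)>(
    """
    SELECT id, title, 1 - (embedding <=> @q::vector) AS score
    FROM documents
    ORDER BY embedding <=> @q::vector
    LIMIT @k
    """,
    new { q = vectorLiteral, k = topK });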
When do I need a dedicated vector database?
In-memory or naive approaches work for thousands to tens of thousands of vectors. Once you reach hundreds of thousands or millions of vectors, specialized vector DBs (Qdrant, Milvus, pgvector) become important for both latency and resource usage.
How do I test semantic search quality?
Create a small labeled dataset of query-document pairs with ground-truth relevance labels. Run your semantic search pipeline against it and compute metrics such as MRR, nDCG, or precision@K. Integrate those tests into your CI/CD pipeline to catch regressions when changing models or preprocessing logic.
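For example, mean reciprocal rank is a few lines once the labeled set exists (a sketch; the search delegate stands in for your pipeline):

// MRR over an evaluation set where each query has one known-relevant document.
static double ComputeMrr(
    IReadOnlyList<(string Query, Guid RelevantId)> cases,
    Func<string, IReadOnlyList<Guid>> search)
{
    double sum = 0;
    foreach (var (query, relevantId) in cases)
    {
        var ranked = search(query);
        var rank = ranked.ToList().IndexOf(relevantId);
        sum += rank >= 0 ? 1.0 / (rank + 1) : 0; // 0 when the document is missing
    }
    return sum / cases.Count;
}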
Can I add semantic search to an existing API?
Yes. You can expose a query parameter on existing endpoints (e.g., ?query= for semantic search) or add a dedicated /semantic route next to your current ones, reusing the same IDocumentService underneath.

How do I handle multi-tenancy?
Include a tenant identifier in your payload (Qdrant) or as a column (pgvector) and add it as a hard filter to all queries. You may also decide to use separate collections/tables per tenant if isolation requirements are strict or if you need different models per tenant.