Documentation ¶
Overview ¶
Package utils provides shared, reusable algorithms. This file implements a generic BM25 search engine.
Usage:
    type MyDoc struct {
        ID   string
        Body string
    }

    corpus := []MyDoc{...}
    engine := utils.NewBM25Engine(corpus, func(d MyDoc) string {
        return d.ID + " " + d.Body
    })
    results := engine.Search("my query", 5)
Index ¶
- Constants
- func CreateHTTPClient(proxyURL string, timeout time.Duration) (*http.Client, error)
- func DerefStr(s *string, fallback string) string
- func DoRequestWithRetry(client *http.Client, req *http.Request) (*http.Response, error)
- func DownloadFile(ctx context.Context, urlStr, filename string, opts DownloadOptions) string
- func DownloadFileSimple(ctx context.Context, url, filename string) string
- func DownloadToFile(ctx context.Context, client *http.Client, req *http.Request, maxBytes int64) (string, error)
- func ExtractZipFile(zipPath string, targetDir string) error
- func IsAudioFile(filename, contentType string) bool
- func IsBlockedTargetIP(ip net.IP) bool
- func SanitizeFilename(filename string) string
- func SanitizeMessageContent(input string) string
- func SetDisableTruncation(enabled bool)
- func Truncate(s string, maxLen int) string
- func ValidateSkillIdentifier(identifier string) error
- func ValidateURLForRequest(ctx context.Context, raw string) (*url.URL, error)
- type BM25Engine
- type BM25Option
- type BM25Result
- type DownloadOptions
Constants ¶
    const (
        // DefaultBM25K1 is the term-frequency saturation factor (typical range 1.2–2.0).
        // Higher values give more weight to repeated terms.
        DefaultBM25K1 = 1.2

        // DefaultBM25B is the document-length normalization factor (0 = none, 1 = full).
        DefaultBM25B = 0.75
    )
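For context, the two constants above play the following roles in the textbook Okapi BM25 scoring function. This is the standard form of the formula; the exact IDF variant and tokenization used by this package are not stated in this documentation, so treat the expression as a reference sketch rather than a description of the implementation.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(q, D) is the frequency of term q in document D, |D| is the document length in tokens, and avgdl is the average document length in the corpus. With b = 0 the length term vanishes (no normalization); with b = 1 the document length is fully normalized against avgdl.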
Variables ¶
This section is empty.
Functions ¶
func CreateHTTPClient ¶
CreateHTTPClient creates an HTTP client with optional proxy support. If proxyURL is empty, it uses the system environment proxy settings. Supported proxy schemes: http, https, socks5, socks5h.
func DerefStr ¶
DerefStr dereferences a pointer to a string and returns the value or a fallback if the pointer is nil.
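A minimal sketch of the nil-safe dereference described above; `derefStr` is an illustrative local name, written from the documented behaviour rather than copied from the package.

```go
package main

import "fmt"

// derefStr returns *s, or fallback when s is nil.
func derefStr(s *string, fallback string) string {
	if s == nil {
		return fallback
	}
	return *s
}

func main() {
	name := "alice"
	fmt.Println(derefStr(&name, "anonymous")) // alice
	fmt.Println(derefStr(nil, "anonymous"))   // anonymous
}
```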
func DoRequestWithRetry ¶
func DownloadFile ¶
func DownloadFile(ctx context.Context, urlStr, filename string, opts DownloadOptions) string
DownloadFile downloads a file from urlStr to a local temp directory. It returns the local file path, or an empty string on error.
func DownloadFileSimple ¶
DownloadFileSimple is a simplified version of DownloadFile that takes no options.
func DownloadToFile ¶
func DownloadToFile(ctx context.Context, client *http.Client, req *http.Request, maxBytes int64) (string, error)
DownloadToFile streams an HTTP response body to a temporary file in small chunks (~32KB), keeping peak memory usage constant regardless of file size.
Parameters:
- ctx: context for cancellation/timeout
- client: HTTP client to use (caller controls timeouts, transport, etc.)
- req: fully prepared *http.Request (method, URL, headers, etc.)
- maxBytes: maximum bytes to download; 0 means no limit
Returns the path to the temporary file. The caller is responsible for removing it when done (defer os.Remove(path)).
On any error the temp file is cleaned up automatically.
func ExtractZipFile ¶
ExtractZipFile extracts a ZIP archive from disk to targetDir. It reads entries one at a time from disk, keeping memory usage minimal.
Security: rejects path traversal attempts and symlinks.
func IsAudioFile ¶
IsAudioFile checks if a file is an audio file based on its filename extension and content type.
func IsBlockedTargetIP ¶
IsBlockedTargetIP returns true if the IP falls into a non-public range: loopback, private (RFC1918), link-local unicast, multicast, unspecified, or CGNAT (RFC 6598).
func SanitizeFilename ¶
SanitizeFilename removes potentially dangerous characters from a filename and returns a safe version for local filesystem storage.
func SanitizeMessageContent ¶
SanitizeMessageContent removes Unicode control characters, format characters (RTL overrides, zero-width characters), and other non-graphic characters that could confuse an LLM or cause display issues in the agent UI.
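One plausible way to implement the filtering described above is to drop every rune that `unicode.IsGraphic` rejects, which excludes control (Cc) and format (Cf) characters. The sketch below makes that assumption; `sanitize` is a hypothetical name, and keeping newlines and tabs is also an assumption, since the documentation does not say how whitespace controls are treated.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sanitize drops non-graphic runes such as NUL, zero-width spaces and
// bidirectional overrides.
func sanitize(input string) string {
	return strings.Map(func(r rune) rune {
		if r == '\n' || r == '\t' {
			return r // assumption: ordinary line breaks and tabs are preserved
		}
		if !unicode.IsGraphic(r) {
			return -1 // a negative return value removes the rune
		}
		return r
	}, input)
}

func main() {
	dirty := "pay\u200bload \u202eevil\x00 text\n"
	fmt.Printf("%q\n", sanitize(dirty))
}
```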
func SetDisableTruncation ¶
func SetDisableTruncation(enabled bool)
SetDisableTruncation globally enables or disables string truncation.
func Truncate ¶
Truncate returns a truncated version of s with at most maxLen runes. Handles multi-byte Unicode characters properly. If the string is truncated, "..." is appended to indicate truncation.
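Rune-aware truncation can be sketched as below. `truncate` is an illustrative name, and the sketch assumes the "..." suffix counts toward maxLen so that the result never exceeds maxLen runes; the documentation does not spell out that detail, nor what happens for very small maxLen values.

```go
package main

import "fmt"

// truncate returns s limited to maxLen runes, ending in "..." when shortened.
func truncate(s string, maxLen int) string {
	if maxLen <= 0 {
		return ""
	}
	runes := []rune(s) // index by rune, not by byte
	if len(runes) <= maxLen {
		return s
	}
	if maxLen <= 3 {
		// Assumption: no room for an ellipsis, so cut hard.
		return string(runes[:maxLen])
	}
	return string(runes[:maxLen-3]) + "..."
}

func main() {
	fmt.Println(truncate("héllo wörld", 8)) // héllo...
	fmt.Println(truncate("short", 10))      // short
}
```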
func ValidateSkillIdentifier ¶
ValidateSkillIdentifier validates that the given skill identifier (slug or registry name) is non-empty and does not contain path separators ("/", "\\") or ".." for security.
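The rules above translate into a few string checks. The sketch below is written from the description only; `validateIdentifier` and its error messages are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// validateIdentifier rejects empty identifiers and anything that could
// escape a directory when used as a path component.
func validateIdentifier(id string) error {
	if id == "" {
		return errors.New("identifier is empty")
	}
	if strings.ContainsAny(id, `/\`) {
		return fmt.Errorf("identifier %q contains a path separator", id)
	}
	if strings.Contains(id, "..") {
		return fmt.Errorf("identifier %q contains %q", id, "..")
	}
	return nil
}

func main() {
	for _, id := range []string{"weather-skill", "", "../secrets", `a\b`} {
		fmt.Printf("%q -> %v\n", id, validateIdentifier(id))
	}
}
```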
func ValidateURLForRequest ¶
ValidateURLForRequest validates a URL for safe use in outgoing HTTP requests. It rejects non-http(s) schemes, URLs with credentials, localhost targets, and resolves the hostname to block private/loopback/CGNAT/link-local/multicast IPs. Returns a sanitized *url.URL on success (safe to use with http.NewRequest).
Types ¶
type BM25Engine ¶
    type BM25Engine[T any] struct {
        // contains filtered or unexported fields
    }
BM25Engine is a query-time BM25 search engine over a generic corpus. T is the document type; the caller supplies a TextFunc that extracts the searchable text from each document.
The engine is stateless between queries: no caching, no invalidation logic. All indexing work is performed inside Search() on every call, making it safe to use on corpora that change frequently.
func NewBM25Engine ¶
func NewBM25Engine[T any](corpus []T, textFunc func(T) string, opts ...BM25Option) *BM25Engine[T]
NewBM25Engine creates a BM25Engine for the given corpus.
- corpus: slice of documents of any type T.
- textFunc: function that returns the searchable text for a document.
- opts: optional tuning (WithK1, WithB).
The corpus slice is referenced, not copied. Callers must not mutate it concurrently with Search().
func (*BM25Engine[T]) Search ¶
func (e *BM25Engine[T]) Search(query string, topK int) []BM25Result[T]
Search ranks the corpus against query and returns the top-k results. Returns an empty slice (not nil) when there are no matches.
Complexity: O(N×L) for indexing + O(|Q|×avgPostingLen) for scoring, where N = corpus size, L = average document length, Q = query terms. Top-k extraction uses a fixed-size min-heap: O(candidates × log k).
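The top-k step mentioned above can be illustrated with `container/heap`: a min-heap of size k keeps the k best candidates seen so far, and its root is the weakest of them, so each new candidate costs at most O(log k). The types `scored` and `minHeap` and the function `topK` are hypothetical names for this sketch and are not part of the package API.

```go
package main

import (
	"container/heap"
	"fmt"
	"sort"
)

type scored struct {
	doc   int // index into the corpus
	score float64
}

// minHeap orders candidates by ascending score, so h[0] is the weakest.
type minHeap []scored

func (h minHeap) Len() int           { return len(h) }
func (h minHeap) Less(i, j int) bool { return h[i].score < h[j].score }
func (h minHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x any)        { *h = append(*h, x.(scored)) }
func (h *minHeap) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// topK returns the k highest-scoring candidates, best first.
func topK(cands []scored, k int) []scored {
	if k <= 0 {
		return []scored{}
	}
	h := make(minHeap, 0, k)
	for _, c := range cands {
		if len(h) < k {
			heap.Push(&h, c)
		} else if c.score > h[0].score {
			h[0] = c // replace the current weakest entry
			heap.Fix(&h, 0)
		}
	}
	out := []scored(h)
	sort.Slice(out, func(i, j int) bool { return out[i].score > out[j].score })
	return out
}

func main() {
	cands := []scored{{0, 1.5}, {1, 0.2}, {2, 3.1}, {3, 2.4}, {4, 0.9}}
	fmt.Println(topK(cands, 3))
}
```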
type BM25Option ¶
type BM25Option func(*bm25Config)
BM25Option is a functional option to configure a BM25Engine.
func WithB ¶
func WithB(b float64) BM25Option
WithB overrides the document-length normalization factor (default 0.75).
func WithK1 ¶
func WithK1(k1 float64) BM25Option
WithK1 overrides the term-frequency saturation constant (default 1.2).
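WithK1 and WithB follow the functional options pattern: each returns a closure that mutates a private config struct, and the constructor applies them over the defaults. The sketch below reproduces the pattern in isolation; the lowercase names and the `k1` and `b` fields are assumptions, since bm25Config's fields are unexported and not shown in this documentation.

```go
package main

import "fmt"

// config is a hypothetical stand-in for the unexported bm25Config.
type config struct {
	k1 float64
	b  float64
}

type option func(*config)

func withK1(k1 float64) option { return func(c *config) { c.k1 = k1 } }
func withB(b float64) option   { return func(c *config) { c.b = b } }

// newConfig starts from the documented defaults and applies opts in order.
func newConfig(opts ...option) config {
	c := config{k1: 1.2, b: 0.75}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", newConfig())
	fmt.Printf("%+v\n", newConfig(withK1(1.6), withB(0.5)))
}
```

Because options are applied in order, a later option overrides an earlier one that sets the same field.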
type BM25Result ¶
BM25Result is a single ranked result from a Search call.