CubeWiz

May 19

An AI-Assisted ETF Analysis, Classification, and Evaluation System

Executive Summary

ETF selection is more demanding than it appears. Since KM Cube operates within the EU, identifying the right product for a specific client mandate within the relatively standardized UCITS universe requires more than quantitative screening. The information that matters most lives in factsheet documents: how a fund is actually constructed, what benchmark it tracks, how it handles currency, and what its thematic intent truly is. These documents are inconsistent in language, variable in structure, and impossible to process at scale through conventional means. The result is a persistent gap between the intelligence embedded in fund documentation and what investment teams can practically access when building client recommendations.

CubeWiz is KM Cube's internal AI-assisted ETF analysis and evaluation system. The system is built around two connected capabilities.

The first is intelligent fund classification. CubeWiz deploys a Large Language Model to read ETF factsheets the way an experienced analyst would — interpreting investment language in context rather than matching surface-level keywords. From each factsheet, it extracts a normalized metadata record across a set of structured dimensions, covering asset class, geographic focus, sector and thematic exposure and more. This converts an inconsistent universe of fund documents into a single structured and queryable investment database.
The second is mandate-driven querying. Once the metadata exists, CubeWiz allows KM Cube's investment team to interrogate it using the language of investment mandates rather than the logic of dropdown filters. The system reasons over the structured metadata to return candidate sets that are both mandate-relevant and analytically grounded. CubeWiz can answer queries such as "find ETFs that overweight firm-power utilities, independent power producers, and grid infrastructure" — distinguishing between utility sub-sectors that a standard screener would group together — or "find ETFs that invest in AI infrastructure, specifically chips, networking, and power equipment, but exclude hyperscalers" — simultaneously filtering for a theme and excluding a segment of it based on portfolio construction logic embedded in the factsheet.

Within the resulting filtered universe, CubeWiz applies a quantitative ranking framework based on risk-adjusted return measures to prioritize the strongest candidates. Ranking is the final step — applied to a set that has already been validated for mandate fit.

The system operates on a monthly cadence. Factsheets are retrieved and reprocessed each month, metadata is refreshed, and rankings reflect the latest available market data. This ensures that every output is current, traceable to its source documentation, and reproducible.

We publish this methodology openly for two reasons. First, we believe our clients deserve to understand how fund selection decisions are made on their behalf. Second, we want to demonstrate how LLM technology can be applied in an asset management context — not as a complex black box, but as a practical and well-defined tool that addresses a specific, real problem.

Problem Definition: Why ETF Selection Is Operationally Difficult

Selecting suitable ETFs for portfolio construction is often presented as a straightforward screening exercise. In practice, it is not. Even within the relatively standardized UCITS ETF universe (the relevant one for KM Cube as an EU-regulated entity), identifying the right product for a specific mandate requires a combination of quantitative screening, qualitative interpretation, and judgment. Achieving this consistently across a large and growing fund universe remains a significant operational challenge.

The first challenge is that data providers and product issuers do not represent the investment universe in a fully consistent way. Platforms may capture returns, volatility, and other market metrics accurately, but they rarely express a fund's true investment exposure with enough depth. Two ETFs can appear similar in name or high-level category while differing materially in benchmark construction, geographic mix, sector concentration, duration profile, replication method, dividend treatment, currency hedging, or domicile.

The second challenge is that the information needed to resolve these distinctions is qualitative and document-bound. ETF factsheets contain the most relevant evidence for understanding what a fund actually does, but they are written in inconsistent language, structured differently across issuers, and impossible to process manually at scale. The consequence is that nuanced mandate requirements — distinguishing between regulated utilities and independent power producers within an energy ETF, or identifying AI infrastructure funds that exclude hyperscaler exposure — cannot be answered by any combination of filters a standard screener offers. They require reading and interpreting fund documentation in context, the way an analyst would. When that step is done manually, it is slow, difficult to scale, and prone to inconsistency across team members.

CubeWiz was developed to address this exact gap.

CubeWiz: System Overview

CubeWiz employs a sequential, three-stage pipeline to transform unstructured ETF documents and quantitative market data into a unified, queryable intelligence platform. This workflow evaluates each ETF holistically, bridging the gap between a fund's stated qualitative mandate and its historical quantitative performance. The system operates across three distinct stages:

Stage	Operational Action	Primary Deliverable
1. Preparation	Ingests the full UCITS ETF universe and validates the availability of current fund documentation	Verified Universe: A curated dataset restricted to products with corroborated source documentation.
2. Analysis & Integration	Deploys LLMs to interpret nuanced fund attributes across structured metadata dimensions while concurrently embedding pre-computed quantitative performance rankings.	A unified metadata record where each fund is characterised by both qualitative investment intent and quantitative performance merit.
3. Intelligent Retrieval	Employs a dual-mode reasoning framework (Deterministic or Semantic) to navigate the enriched metadata and identify optimal fund candidates.	Validated Shortlist: High-precision fund selections that adhere to mandate constraints and performance benchmarks.

Stage 1: Data Foundation and Preparation

Stage 1 constructs the baseline dataset for the CubeWiz system by aggregating two distinct data streams: quantitative market metrics and qualitative issuer documentation. This phase establishes a normalized universe of European UCITS ETFs, ensuring all subsequent Large Language Model (LLM) classifications and ranking models operate on verified, current inputs.

Defining the Investable Universe

The CubeWiz universe is restricted to the European UCITS ETF market, aligning with KM Cube’s operational footprint and client mandates.

Current Scope: As of the May 2026 data snapshot, the active database comprises approximately 1,000 distinct UCITS ETFs.
Inclusion Criteria: An ETF is included only if a current, machine-readable Monthly Factsheet (PDF) is accessible via institutional feeds or direct issuer portals. Funds without up-to-date documentation are excluded to maintain the integrity of the classification process.
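
The inclusion rule can be pictured as a simple freshness check. The sketch below is a minimal illustration and not KM Cube's implementation: the `FundDocument` record, the fund identifiers, and the 45-day staleness threshold are all assumptions introduced for the example, since the text above only requires that a factsheet be "current" and machine-readable.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class FundDocument:
    fund_id: str                      # hypothetical identifier
    factsheet_date: Optional[date]    # None when no factsheet could be retrieved
    machine_readable: bool            # False for scanned or image-only PDFs


def verified_universe(funds, snapshot, max_age_days=45):
    """Keep only funds with a current, machine-readable factsheet.

    max_age_days is an assumed threshold; the source only says 'current'.
    """
    kept = []
    for fund in funds:
        if fund.factsheet_date is None or not fund.machine_readable:
            continue
        if (snapshot - fund.factsheet_date).days <= max_age_days:
            kept.append(fund)
    return kept


funds = [
    FundDocument("ETF-A", date(2026, 4, 30), True),
    FundDocument("ETF-B", date(2025, 11, 30), True),   # stale factsheet
    FundDocument("ETF-C", None, False),                # no factsheet found
    FundDocument("ETF-D", date(2026, 4, 30), False),   # not machine-readable
]
universe = verified_universe(funds, snapshot=date(2026, 5, 15))
print([f.fund_id for f in universe])
```

Only the first fund survives in this toy universe; the other three fail on staleness, absence, or readability respectively.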

Integrating Two Data Sources

Data is synchronized monthly across two distinct streams:

Quantitative Stream (Market Data): Captures standardized financial metrics, including risk-adjusted performance (Sharpe, Sortino, Treynor), returns over 1Y/3Y/5Y horizons, yield characteristics (YTM, Dividend Yield), and operational/risk metrics (TER, 1Y Tracking Error, AUM, Beta).
Qualitative Stream (Issuer Documentation): Ingests the full unstructured text of official Monthly Factsheets. This corpus serves as the raw input for Stage 2, enabling the LLM to extract the system's structured metadata fields (e.g., thematic focus, index tracked, and swap replication).
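
As a rough sketch of how the two streams might be brought together, the example below joins them on a shared fund identifier. The identifier, the dictionary layout, and the decision to keep only funds present in both streams are assumptions made for illustration; the factsheet strings are placeholders, not real issuer text.

```python
def integrate_streams(market_data, factsheet_text):
    """Join the two monthly streams on a shared fund identifier.

    Only funds present in BOTH streams are kept, mirroring the rule that a
    fund without a current factsheet is excluded from the universe.
    """
    shared = sorted(market_data.keys() & factsheet_text.keys())
    return {
        fund_id: {
            "metrics": market_data[fund_id],
            "factsheet": factsheet_text[fund_id],
        }
        for fund_id in shared
    }


# Hypothetical inputs keyed by an assumed fund identifier.
market_data = {
    "ETF-A": {"sharpe": 0.9, "ter": 0.20},
    "ETF-B": {"sharpe": 0.4, "ter": 0.07},
}
factsheet_text = {
    "ETF-A": "The fund tracks ...",    # placeholder text, not a real factsheet
    "ETF-C": "The fund aims to ...",
}

integrated = integrate_streams(market_data, factsheet_text)
print(list(integrated))
```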

Universe Composition

The Stage 1 ingestion process yields a structured dataset prepared for algorithmic processing. The table below breaks down the approximately 1,000 constituents (as of May 2026) by asset class:

Asset Class	Total Funds	% of Universe	Dominant Implementation	Explicit Thematic/ESG Mandate
Equity	626	63.7%	Passive (85.3%)	314 funds (50.2%)
Fixed Income	342	34.8%	Passive (90.4%)	117 funds (34.2%)
Mixed Allocation	14	1.4%	Active (57.1%)	4 funds (28.6%)

Distribution Metrics:

Equity Coverage: 63.7% of the total universe consists of Equity ETFs, 85.3% of which use passive tracking.
Thematic and ESG Footprint: 50.2% of Equity ETFs and 34.2% of Fixed Income ETFs incorporate explicit thematic or ESG-screened mandates within their issuer documentation.
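Active Management Sub-Segment: Active strategies are the dominant implementation only in Mixed Allocation (57.1% of funds), which itself represents a minor segment of the overall universe (1.4%). Across the universe, active strategies are predominantly concentrated (over 60%) in specialized thematic or ESG allocations.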
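
The percentages in the composition table follow directly from the fund counts. The short check below recomputes them; the counts are copied from the table, and the only assumption is that the three asset classes shown make up the whole universe.

```python
# Counts copied from the Universe Composition table.
counts = {"Equity": 626, "Fixed Income": 342, "Mixed Allocation": 14}
thematic = {"Equity": 314, "Fixed Income": 117, "Mixed Allocation": 4}

total = sum(counts.values())  # 982 funds, i.e. "approximately 1,000"

for asset_class, n in counts.items():
    share_of_universe = round(100 * n / total, 1)
    thematic_share = round(100 * thematic[asset_class] / n, 1)
    print(f"{asset_class}: {share_of_universe}% of universe, "
          f"{thematic_share}% thematic/ESG")
```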

With this integrated data foundation established, the validated universe advances to Stage 2: automated classification and metadata extraction via the LLM engine.

Stage 2: Analysis & Integration

Stage 2 serves as the processing engine of the CubeWiz framework. It operates across two parallel tracks: an "Integration" layer that deploys a Large Language Model (LLM) to extract structured qualitative metadata, and an "Analysis" layer that calculates a purely quantitative composite score for every constituent. These dual tracks merge to convert variable inputs into a standardized, fully evaluated dataset.

The Integration Layer: Metadata Extraction

Each month, the LLM reads the full text of every factsheet in the universe and extracts a fixed set of structured fields. These fields fall into four categories:

Strategic & Asset Class Mandates: Captures core investment focus (e.g., thematic intent, geographic mix, factor exposures).
Fixed Income Specifics: Isolates critical debt parameters (e.g., bond type, duration focus, credit rating constraints).
Structural & Regulatory Mechanics: Identifies the fund's operational architecture (e.g., UCITS status, index tracked, swap replication).
Trading & Operational Characteristics: Normalizes mechanical details (e.g., distribution type, currency hedging).

(Note: For the exhaustive list of all eighteen extracted metadata fields, please refer to Appendix A).

To ensure outputs reflect only what issuers have actually published, the model is prohibited from inferring, estimating, or interpolating missing parameters. If a specific field is absent from the official documentation, the system records it as blank. The qualitative layer is strictly an empirical reflection of published issuer intent — nothing more.
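
One way to enforce the "blank rather than inferred" rule is to normalize whatever the model returns against a fixed schema. The sketch below assumes the extraction step yields a plain dictionary; the field subset, the list of blank markers, and the `normalize_record` helper are hypothetical and shown only to make the rule concrete.

```python
# A few of the Appendix A fields, used here for illustration only.
EXPECTED_FIELDS = ("asset_class", "geographic_focus", "thematic_focus",
                   "index_tracked", "currency_hedging")

# Strings treated as "not stated"; this list is an assumption.
BLANK_MARKERS = {"", "n/a", "unknown", "not stated", "none"}


def normalize_record(raw):
    """Return a record with every expected field, blank (None) when unstated.

    Fields outside the fixed schema are dropped, and no default value is
    ever substituted for a missing one.
    """
    record = {}
    for name in EXPECTED_FIELDS:
        value = raw.get(name)
        if value is None or str(value).strip().lower() in BLANK_MARKERS:
            record[name] = None
        else:
            record[name] = str(value).strip()
    return record


# Hypothetical raw output from an extraction call.
raw_output = {
    "asset_class": "Equity",
    "geographic_focus": " Europe ",
    "thematic_focus": "N/A",
    "expected_return": "8%",   # not in the schema, so it is discarded
}
normalized = normalize_record(raw_output)
print(normalized)
```

Keeping blanks explicit means a downstream filter can distinguish "the factsheet says no" from "the factsheet does not say".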

The Analysis Layer: Quantitative Ranking

In parallel with the qualitative extraction, CubeWiz scores every ETF on purely quantitative grounds. The ranking draws exclusively on market data, so no qualitative judgments enter the scoring model. It covers two dimensions:

Risk-Adjusted Performance: Evaluating the return generated per unit of risk, assessing investment efficiency and downside mitigation.
Performance Returns: Capturing the strength and consistency of absolute returns across relevant standardized time horizons.

This global ranking is computed once, at the universe level, prior to any filtering or retrieval. Because Stage 3 retrieval groups funds by mandate-relevant criteria before surfacing results, comparisons always occur within structurally similar cohorts — making a universe-level composite both sufficient and appropriate. The relative order among funds is preserved from this stage; no recalculation is required at retrieval time.
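
The exact scoring formula is not specified in this document. The sketch below therefore uses an assumed scheme, an equal-weight average of per-metric percentile ranks, purely to illustrate the property described above: a score computed once at universe level yields an order that survives any later filtering. The metric names and values are hypothetical, and ties are broken arbitrarily.

```python
def percentile_ranks(values):
    """Map each value to its rank in [0, 1]; the highest value gets 1.0."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position / (len(values) - 1) if len(values) > 1 else 1.0
    return ranks


def composite_scores(funds, metrics):
    """Equal-weight average of per-metric percentile ranks (assumed scheme)."""
    per_metric = [percentile_ranks([f[m] for f in funds]) for m in metrics]
    return {
        f["id"]: sum(ranks[i] for ranks in per_metric) / len(metrics)
        for i, f in enumerate(funds)
    }


# Hypothetical market data; higher is better for every metric shown.
funds = [
    {"id": "ETF-A", "sharpe": 1.2, "sortino": 1.6, "ret_3y": 0.30},
    {"id": "ETF-B", "sharpe": 0.5, "sortino": 0.7, "ret_3y": 0.10},
    {"id": "ETF-C", "sharpe": 0.9, "sortino": 1.1, "ret_3y": 0.35},
]
scores = composite_scores(funds, ["sharpe", "sortino", "ret_3y"])
global_order = sorted(scores, key=scores.get, reverse=True)

# Filtering to a cohort keeps the universe-level order; nothing is rescored.
cohort = {"ETF-B", "ETF-C"}
cohort_order = [fund_id for fund_id in global_order if fund_id in cohort]
print(global_order, cohort_order)
```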

Stage 3: Intelligent Retrieval

Stage 3 is where the enriched dataset produced in Stage 2 gets put to work. Retrieval follows a two-step sequence: broad structural filters first, thematic reasoning second — progressively narrowing the universe toward a precise, mandate-compliant shortlist.

Step 1: Deterministic Filtering

Every query begins with deterministic filtering. This step applies hard constraints against structured metadata dimensions — asset class, currency share class, domicile, replication method, distribution type, and others — to reduce the full universe to a structurally compliant subset. Beyond enforcing mandate alignment, this step serves a practical function: by limiting the input to the semantic stage, it reduces noise and improves the precision of LLM reasoning in the subsequent step.
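
Deterministic filtering amounts to exact matching on structured fields. The following is a minimal sketch under stated assumptions: the field names and values are illustrative, and the choice to exclude funds whose relevant field is blank (rather than treat them as compliant) is an assumption, not something this document specifies.

```python
def deterministic_filter(records, constraints):
    """Keep records whose metadata equals every hard constraint.

    A blank (None) field never satisfies a constraint: the fund is excluded
    rather than assumed compliant. That treatment of blanks is an assumption.
    """
    return [
        record for record in records
        if all(record.get(name) == wanted for name, wanted in constraints.items())
    ]


# Hypothetical Stage 2 records with illustrative field names and values.
records = [
    {"id": "ETF-A", "asset_class": "Fixed Income",
     "distribution_type": "Accumulating", "trading_currency": "EUR"},
    {"id": "ETF-B", "asset_class": "Fixed Income",
     "distribution_type": "Distributing", "trading_currency": "EUR"},
    {"id": "ETF-C", "asset_class": "Equity",
     "distribution_type": "Accumulating", "trading_currency": "EUR"},
    {"id": "ETF-D", "asset_class": "Fixed Income",
     "distribution_type": None, "trading_currency": "EUR"},
]
constraints = {
    "asset_class": "Fixed Income",
    "distribution_type": "Accumulating",
    "trading_currency": "EUR",
}

subset = deterministic_filter(records, constraints)
print([r["id"] for r in subset])
```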

Step 2: Semantic Retrieval

For mandates that involve thematic or conceptual nuance beyond what metadata filters can resolve, a second step is applied to the filtered subset. The system deploys a Large Language Model to interpret the natural language intent of the query and evaluate it against the qualitative profiles constructed during Stage 2. This step is optional — it is invoked only when the complexity of the query warrants it.

A query such as "find ETFs exposed to the clean energy transition that actively minimize fossil fuel exposure" cannot be resolved through metadata filters alone. Semantic retrieval identifies funds whose stated investment objectives contextually align with the query, even where exact phrasing differs across issuer documents.

The output of this two-step process is a mandate-compliant shortlist, ranked by the composite score assigned globally in Stage 2. No recalculation is required at retrieval time — the relative order among funds is preserved from the universe-level ranking.
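
Put together, the two steps and the final ordering can be sketched as a single function. This is an illustration under assumptions, not the production pipeline: the semantic step is represented by a pluggable callable, and the `keyword_stand_in` used here is a deliberately crude placeholder for the LLM judgment. The scores are hypothetical precomputed Stage 2 values.

```python
from typing import Callable, Optional


def retrieve(records, scores, constraints,
             semantic_judge: Optional[Callable] = None):
    """Two-step retrieval: hard filters, optional semantic check, then ranking."""
    # Step 1: deterministic filtering on structured metadata.
    subset = [
        r for r in records
        if all(r.get(k) == v for k, v in constraints.items())
    ]
    # Step 2 (optional): keep only funds the semantic judge accepts.
    if semantic_judge is not None:
        subset = [r for r in subset if semantic_judge(r)]
    # Final ordering reuses the precomputed universe-level scores.
    return sorted(subset, key=lambda r: scores[r["id"]], reverse=True)


def keyword_stand_in(record):
    """Placeholder for the LLM judgment; a real system would not match keywords."""
    text = (record.get("thematic_focus") or "").lower()
    return "clean energy" in text


records = [
    {"id": "ETF-A", "asset_class": "Equity", "thematic_focus": "Clean energy transition"},
    {"id": "ETF-B", "asset_class": "Equity", "thematic_focus": "Global energy majors"},
    {"id": "ETF-C", "asset_class": "Equity", "thematic_focus": "Clean energy and grid"},
    {"id": "ETF-D", "asset_class": "Fixed Income", "thematic_focus": "Clean energy bonds"},
]
scores = {"ETF-A": 0.41, "ETF-B": 0.88, "ETF-C": 0.67, "ETF-D": 0.95}

shortlist = retrieve(records, scores, {"asset_class": "Equity"}, keyword_stand_in)
print([r["id"] for r in shortlist])
```

Note that ETF-D has the highest score in this toy data but never reaches the shortlist, because ranking is applied only after mandate fit has been established.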

Delivery Modes

CubeWiz produces outputs through two distinct channels:

Monthly Report. On the same monthly cadence as the data refresh, CubeWiz generates a structured PDF report containing a predefined set of queries executed through the deterministic-then-semantic pipeline. Each query produces a ranked shortlist traceable to its source documentation. This report is the primary output for the investment team's monthly ETF review process.

Interactive Exploration. In addition to the monthly report, the full metadata file is made available through a dedicated interactive interface configured for internal use. This allows investment team members to submit free-form natural language queries directly, without predefined filters. In this mode, retrieval is fully semantic — the underlying model reasons over the complete metadata to identify mandate-relevant candidates. This channel is intended for ad hoc research and exploratory queries that fall outside the scope of the monthly report.

Illustrative Queries

The following examples illustrate the range of questions CubeWiz is designed to answer:

Client portfolio construction

A conservative EUR-based client requires a fixed income allocation with capital preservation priority: investment-grade only, EUR share class, duration under 3 years, accumulation structure. Which funds best fit this mandate?
Find dividend-distributing equity ETFs focused on European markets, suitable for a client with an income requirement — limited to funds that explicitly state a quarterly or semi-annual distribution policy in their documentation.
A client is reducing duration risk in their fixed income sleeve. Find ETFs that explicitly state a floating-rate or inflation-linked mandate in their factsheet, available in EUR share class with accumulation structure.
Find EUR-denominated equity ETFs focused on global developed markets that use physical replication and explicitly state a minimum number of holdings or diversification constraint in their investment methodology.
A client requires a defensive equity allocation. Find ETFs that explicitly target low-volatility or minimum-variance factor exposure in their stated investment objective, available in EUR share class.

Thematic and satellite allocations

A client wants a tactical allocation to the energy transition. Find ETFs that provide exposure to renewable energy infrastructure and grid modernisation — but whose factsheets explicitly exclude fossil fuel utilities, not merely screen on ESG scores.
Find ETFs providing exposure to European defence and aerospace that are available in EUR share class with physical replication — suitable as a satellite allocation following NATO spending commitments.
Find ETFs that invest in AI infrastructure — chips, networking, and power equipment — but explicitly exclude hyperscalers as stated in the fund's investment mandate.
A client wants thematic exposure to water infrastructure and resource scarcity. Find ETFs whose investment objective explicitly names water treatment, utilities, or infrastructure — excluding funds that use "water" as a broad ESG screen without sector specificity.
Find ETFs that explicitly mention Paris-Aligned Benchmark (PAB) or Climate Transition Benchmark (CTB) classification in their factsheet documentation — as distinct from funds with generic ESG screens.
Find ETFs that provide targeted exposure to healthcare innovation — specifically those whose factsheets distinguish between biotechnology, genomics, or medical devices rather than tracking a broad healthcare index.

Looking Ahead

CubeWiz is designed to evolve alongside KM Cube's investment operations. Two integration priorities are currently under development: the inclusion of KM Cube-managed funds — specifically the AMCs and UCITS strategies the firm manages — within the evaluated universe, subject to the same ranking methodology as all third-party funds; and tighter connection between CubeWiz outputs and KM Cube's internal client proposal workflow, with the goal of reducing the steps between fund selection and MiFID-compliant client documentation.

Conclusion

CubeWiz was built to solve a practical problem: ETF selection is more complex than standard screening tools imply, and manual review does not scale when teams need consistent, mandate-driven, and document-supported recommendations.

By combining market data, factsheet analysis, LLM-based metadata interpretation, and a structured global ranking framework, CubeWiz converts a large and inconsistent ETF universe into an actionable decision-support system. It allows KM Cube's investment team to evaluate funds not only by performance characteristics, but by what those products actually represent in investment terms — their construction logic, thematic intent, and structural mechanics, as expressed in issuer documentation.

Its value is clear in three areas. It improves classification quality by reading fund documentation rather than relying on simplified labels. It improves operational efficiency by producing ranked, mandate-aligned shortlists that reduce the burden of repeated manual screening. And it improves consistency by creating a repeatable monthly process with full traceability from source document to output.

CubeWiz reflects a broader conviction at KM Cube: that technology should make investment decision-making more rigorous and more transparent, not more opaque. Publishing this methodology openly is part of that commitment.

Appendix A: Metadata Fields Extracted by the LLM

The CubeWiz system extracts structured metadata fields from official Monthly Factsheets to form the qualitative layer of the database, organized into four primary categories:

Strategic & Asset Class Mandates: asset class, geographic focus, geographic mix, sector focus, thematic focus, market capitalization focus, factor exposure.
Fixed Income Specifics: bond type, duration focus, credit rating.
Structural & Regulatory Mechanics: UCITS status, index tracked, leverage or inverse status, swap replication.
Trading & Operational Characteristics: base and trading currency, currency hedging, domicile country, distribution type.
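
For readers who prefer to see the schema as a data structure, the four categories can be written out as below. The snake_case names are an assumed spelling of the fields listed above, and "base and trading currency" is treated as a single field, which is the reading under which the list totals eighteen.

```python
# Field names transcribed from Appendix A; snake_case spelling is an assumption.
METADATA_FIELDS = {
    "strategic_and_asset_class": [
        "asset_class", "geographic_focus", "geographic_mix", "sector_focus",
        "thematic_focus", "market_cap_focus", "factor_exposure",
    ],
    "fixed_income": ["bond_type", "duration_focus", "credit_rating"],
    "structural_and_regulatory": [
        "ucits_status", "index_tracked", "leverage_or_inverse", "swap_replication",
    ],
    "trading_and_operational": [
        # "base and trading currency" is counted as one field here.
        "base_and_trading_currency", "currency_hedging", "domicile_country",
        "distribution_type",
    ],
}

all_fields = [name for group in METADATA_FIELDS.values() for name in group]
print(len(all_fields))


def empty_record():
    """A record with every field blank, ready to be filled by extraction."""
    return {name: None for name in all_fields}
```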

Appendix B: Technology Stack

KM Cube publishes this information in the interest of transparency, consistent with our broader commitment to making our methodology legible to clients and counterparties.

Component	Technology
LLM — Metadata Extraction & Semantic Retrieval	Google Gemini
Interactive Exploration Interface	Google Gemini GEM
Quantitative Data Feed	EODHD
Document Ingestion	PDF parsing via institutional feeds and issuer portals
Output Format	Structured PDF report (monthly cadence)

This stack is subject to change as the system evolves. The methodology described in this document is designed to be implementation-agnostic — the underlying tools may be substituted without altering the logic of the pipeline.

Disclaimer

This document, "CubeWiz: An AI-Assisted ETF Analysis, Classification, and Evaluation System," is prepared by KM Cube Asset Management for informational purposes only and does not constitute a recommendation, solicitation, offer, or acceptance of any proposal for transactions with KM Cube Investment S.A. or its affiliates. The content is based on internal systems and analysis and should not be construed as investment advice. Past performance is not a reliable indicator of future results, and no representation or warranty is made regarding the future performance of any investment. Investments discussed in this paper may be subject to significant volatility and risks, including the potential loss of the original investment. While all information has been obtained from sources believed to be reliable, KM Cube Investment S.A. provides no guarantee as to its accuracy or completeness. For more information about the policies of KM Cube Investment S.A., please visit https://www.km3am.com/terms-of-use.

Authors: Kostas Metaxas and Konstantinos Ilias

KM Cube Asset Management is regulated by the Hellenic Capital Market Commission (HCMC).