From Weeks to Hours: Automating REIT Portfolio Extraction

By Ian Ronk


One of the core products at KR&A is the REIT portal, which covers over a hundred real asset funds, listed and private, based on asset-by-asset analyses. The assets of the covered funds are enriched with micro-market, city and regional data. This creates normalised asset and fund scores, allowing a unique comparison of asset attribute averages across REITs and funds.

Our portal offers fund and REIT investors an ultra-precise yet quick and comprehensive view of real estate holdings across the continent.

Apart from fund-of-fund managers and asset allocators, REIT management teams, executive boards and supervisory boards find our analyses useful. They deepen insight into the risks and attributes of individual assets, which is necessary to create repeatable outperformance.

The Challenge: 16,000 Assets, Three Weeks of Work

The main bottleneck in processing around 16,000 assets is the time it takes to collect, standardise and quality-assure data across dozens of different sources and three distinct asset collection methods. To understand this problem in more depth, we need to look at the original process.

The Traditional Manual Approach

Previously, extracting REIT portfolio data involved:

1.     Manual PDF review: opening each annual report and manually copying data into spreadsheets

2.     Excel standardisation: reformatting data from various Excel formats into a consistent structure

3.     Web scraping: implementing scrapers for funds that do not share their portfolio in the annual report

4.     Quality assurance: validating addresses, square meters and other attributes

This retrieval, cleaning, standardisation and quality-control loop for all funds took two to three weeks of full-time junior analyst work: highly repetitive time taken away from higher-value tasks and projects.

In 2025, extracting data from PDFs, Excel files and web pages with Large Language Models such as Claude and Gemini has become increasingly viable. PDFs are notoriously unstructured and messy: reliable, structured data extraction used to be nearly impossible because of the non-standardised tables used in annual reports. Below are two examples of tables taken from the annual reports of Catena and Aedifica.

Figure 1. Example of an Annual Report Table of Catena

Figure 2. Example of an Annual Report Table of Aedifica

Even traditional OCR approaches could not retrieve this data consistently because of the differing structures, and they required significant manual oversight to extract the right data.

Our Solution: A Validated Extraction Pipeline

Current LLM capabilities enable extraction of this data with proper prompt engineering and pipeline logic. However, the main risk with LLMs for data extraction is hallucination: data being extracted incorrectly, assets being overlooked, or assets being fabricated entirely.

Our new REIT extraction pipeline addresses this through multiple validation layers, resulting in an implementation that requires minimal manual intervention. The pipeline reduces the number of assets requiring manual validation by 90-99%, as all other information has already been extracted and verified automatically.

The result: extraction time reduced from two to three weeks to one or two days.

How It Works: A Multi-Stage Approach

The core principle is that real estate portfolios generally do not change dramatically year-over-year. Even with active portfolio management, most assets do not change hands in a year. We leverage this stability, combined with historical data, to validate extraction results.

Figure 3. The complete KR&A REIT Pipeline

Stage 1: Intelligent Extraction

The pipeline first identifies the document type (PDF, Excel or a scraping extract) and applies the appropriate extraction method. For PDFs, the most challenging format, we use a multi-step approach:

1.     Header Detection: the LLM analyses the first pages to identify the PDF table structure, accounting for multi-row headers and merged cells

2.     Complete Table Extraction: every row is extracted, including subtotals and other elements, to ensure nothing is missed

3.     Data Cleaning: a separate pass removes header rows, subtotals and irrelevant entries, structuring the data into a consistent format

This separated approach allows the LLM to focus on one task at a time, yielding significantly better results than doing all tasks in one LLM call.
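
To make the separation concrete, below is a minimal sketch of how such a three-pass extraction could look, assuming the Anthropic Python SDK. The prompts, model name and JSON conventions are illustrative assumptions, not our production configuration.

```python
# Hedged sketch of the three-pass PDF extraction (header detection, full
# extraction, cleaning); prompts and model name are illustrative only.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-20250514"  # assumed model choice

def ask(prompt: str, pdf_bytes: bytes) -> str:
    """Send the PDF plus one focused instruction and return the text reply."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "document",
                 "source": {"type": "base64",
                            "media_type": "application/pdf",
                            "data": base64.b64encode(pdf_bytes).decode()}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return response.content[0].text

def extract_portfolio(pdf_bytes: bytes) -> str:
    # Pass 1: header detection only, accounting for multi-row headers and merged cells.
    headers = ask("Identify the column headers of the property portfolio table. "
                  "Return them as a JSON list of strings.", pdf_bytes)
    # Pass 2: exhaustive row extraction, subtotals included, so nothing is skipped.
    rows = ask(f"Extract every row of the portfolio table with columns {headers}. "
               "Include subtotal and section rows. Return one JSON object per line.", pdf_bytes)
    # Pass 3: cleaning, still grounded in the source document.
    return ask("Remove header, subtotal and irrelevant rows from these records and "
               "return one clean JSON object per asset:\n" + rows, pdf_bytes)
```

Keeping each call single-purpose also makes failures easier to diagnose, since each pass can be inspected and retried in isolation.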

Stage 2: Multi-Layer Validation

This part of the pipeline validates the retrieved results, using previous manual retrievals as ground truth. This is done in four steps:

1.     Geocoding: all extracted addresses are geocoded to obtain precise coordinates

2.     Spatial matching: current year assets are compared against previous year data using geographic proximity

3.     Fuzzy address matching: text matching algorithms compare addresses, accounting for spelling variations, abbreviations and formatting differences (see the sketch after this list)

4.     Source document verification: the LLM re-examines the original document to confirm matches and to identify any missed or disposed assets
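
As a rough illustration of steps 2 and 3, the sketch below matches current-year assets against the previous year using geographic proximity and fuzzy address similarity. It assumes both years have already been geocoded; the Asset class, thresholds and standard-library string matching are simplifying assumptions rather than the exact production logic.

```python
# Hedged sketch of spatial + fuzzy address matching against last year's data.
from dataclasses import dataclass
from difflib import SequenceMatcher
from math import radians, sin, cos, asin, sqrt

@dataclass
class Asset:
    address: str
    lat: float
    lon: float

def distance_m(a: Asset, b: Asset) -> float:
    """Haversine distance between two assets in metres."""
    dlat, dlon = radians(b.lat - a.lat), radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(h))

def address_similarity(a: Asset, b: Asset) -> float:
    """Fuzzy address score in [0, 1], tolerant of abbreviations and typos."""
    return SequenceMatcher(None, a.address.lower(), b.address.lower()).ratio()

def match_against_previous_year(current: list[Asset], previous: list[Asset],
                                max_dist_m: float = 150.0,
                                min_sim: float = 0.85):
    """Return (matched pairs, unmatched current-year assets)."""
    matched, uncertain = [], []
    for cur in current:
        best = min(previous, key=lambda prev: distance_m(cur, prev), default=None)
        if best and (distance_m(cur, best) <= max_dist_m
                     or address_similarity(cur, best) >= min_sim):
            matched.append((cur, best))
        else:
            uncertain.append(cur)
    return matched, uncertain
```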

Stage 3: Confidence-Based Output and Manual Amendment

The pipeline produces three output files:

Output                    Description                                        Action Required
Certain Assets            High-confidence matches verified against source    Ready for database
Previous Year Not Found   Assets from last year not matched                  Review for disposals
Current Year Uncertain    New or unmatched assets                            Manual verification

Only the uncertain 1-5% requires analyst review, a dramatic reduction from manually validating all 16,000 assets.
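
Continuing the sketch above, the split into three files could look roughly like this. The file names and columns are assumptions for illustration, not the portal's actual output format; the matched, previous-year-unmatched and current-year-uncertain lists come from the previous stage.

```python
# Illustrative split of the matching results into the three output files.
import csv

def write_outputs(matched, previous_unmatched, current_uncertain, year: int) -> None:
    """Write one CSV per confidence bucket so analysts only open the last two."""
    buckets = {
        f"certain_assets_{year}.csv":
            [(c.address, c.lat, c.lon) for c, _ in matched],
        f"previous_year_not_found_{year}.csv":
            [(p.address, p.lat, p.lon) for p in previous_unmatched],
        f"current_year_uncertain_{year}.csv":
            [(c.address, c.lat, c.lon) for c in current_uncertain],
    }
    for filename, rows in buckets.items():
        with open(filename, "w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            writer.writerow(["address", "lat", "lon"])
            writer.writerows(rows)
```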

Why This Approach Works

The multi-stage validation pipeline is superior to conventional LLM approaches because:

1.     Each stage handles a single task, narrowing the model's focus and increasing accuracy

2.     Historical validation compares results against known data, catching extraction errors

3.     Spatial matching confirms addresses are real and correctly located

4.     Assets are categorised by certainty, incorporating human oversight where it matters

This feedback loop ensures hallucinations are caught before data is entered into the database.

Pipeline, Agentic AI or Something Else?

Is this an agentic pipeline? Yes and no. It is not a purely agentic pipeline, as it is primed to follow a set sequence of steps. For this task it is better not to follow the hype and to use this structured approach instead.

Conclusion

By combining modern LLM capabilities with traditional data validation techniques, we have transformed a three-week manual process into an automated pipeline requiring minimal human oversight. The key insight is that LLMs alone are not sufficient; combined with proper validation, historical comparison and confidence scoring, however, they deliver consistently good results.

One possible addition to this approach would be including traditional OCR for extraction, to decrease processing time and costs while maintaining accuracy.

 

This allows us to offer exceptional value for money for investor asset analyses and datasets inside and outside Europe. For a sample of our REIT and fund reports, please see our RED Reports. The portal analyses can be checked out here.
