NLP4Dev

The Problem


Most data catalogs rely exclusively on lexical (full-text) search engines, which match keywords rather than meaning and often return unsatisfactory results. Commercial search engines such as Google or Bing implement more advanced methods and tools, but they also perform inadequately when tasked with discovering data.

Improving data discoverability and accessibility requires (i) better metadata (structured, augmented), and (ii) improved search engines with semantic search capability and recommender systems.

Our Approach


To address these issues, our work includes:

  • The promotion and development of metadata standards and related tools and guidelines,

  • Exploratory work on the use of NLP to enable semantic searchability and build recommender systems, and

  • The development of the NADA cataloguing application.

Explore our corpus of 446,962 documents
Last updated on Thu Apr 11 2024

To train NLP models, we compiled and maintain a corpus of documents related to social and economic issues. We apply topic models, word embedding models, and other methods to provide information discovery solutions, including lexical and semantic search and filtering by topic composition. We also create a meta-database that describes every document in detail and can serve as input to analyses of knowledge on development.
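As a rough illustration of how lexical search, semantic search, and topic filtering differ, here is a minimal Python sketch. It is not the project's implementation: the word vectors, document texts, topic shares, and all function names below are hypothetical toy values chosen for illustration, and real models learn much larger vectors from the corpus.

```python
import math

# Hypothetical 3-dimensional word vectors, made up for illustration only.
WORD_VECTORS = {
    "poverty": [0.9, 0.1, 0.0],
    "deprivation": [0.8, 0.2, 0.1],
    "income": [0.7, 0.3, 0.0],
    "household": [0.5, 0.5, 0.1],
    "rainfall": [0.0, 0.1, 0.9],
    "climate": [0.1, 0.0, 0.8],
}

# Hypothetical documents and topic shares (each document's shares sum to 1).
DOCUMENTS = {
    "doc_a": "household income deprivation",
    "doc_b": "climate rainfall",
}
DOC_TOPICS = {
    "doc_a": {"welfare": 0.8, "environment": 0.2},
    "doc_b": {"welfare": 0.1, "environment": 0.9},
}


def embed(text):
    """Average the vectors of known words; return None if no word is known."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        return None
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lexical_search(query):
    """Return ids of documents sharing at least one exact term with the query."""
    terms = set(query.lower().split())
    return [d for d, text in DOCUMENTS.items() if terms & set(text.lower().split())]


def semantic_search(query):
    """Rank documents by cosine similarity between query and document vectors."""
    q = embed(query)
    if q is None:
        return []
    scored = []
    for doc_id, text in DOCUMENTS.items():
        d = embed(text)
        if d is not None:
            scored.append((doc_id, cosine(q, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def filter_by_topic(topic, min_share):
    """Keep documents whose share of the given topic is at least min_share."""
    return [d for d, shares in DOC_TOPICS.items() if shares.get(topic, 0.0) >= min_share]


if __name__ == "__main__":
    print(lexical_search("poverty"))    # no document contains the exact word
    print(semantic_search("poverty"))   # doc_a ranks first by vector similarity
    print(filter_by_topic("welfare", 0.5))
```

In this toy setup, the query "poverty" matches no document lexically, yet the semantic ranking still places the document about household income and deprivation first, which is the kind of gap semantic search is meant to close.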

All solutions implemented in this project rely on publicly available documents and open-source tools. Our solutions are openly accessible in our GitHub repository.

Explore our word embeddings
For more information, visit the word embeddings page.
Analyze your document

First, select the topic model and the word embedding model to use. Then upload a PDF or text file, or provide its URL. Finally, click Process document to start generating results.




Submit your own document to our text processing and modeling API.
For more information, visit the Analyze your Document page.