
Project Genesis Backend

This repository contains the backend service for Project Genesis, a powerful application that combines a Retrieval-Augmented Generation (RAG) system with image generation capabilities.

Description

Project Genesis Backend is a Node.js service built with Express.js. It leverages a Qdrant vector database and LlamaIndex to provide advanced question-answering capabilities based on a custom knowledge base. The system can process and understand a variety of documents, allowing users to ask complex questions and receive accurate, context-aware answers.

Additionally, the backend integrates with Google's Vertex AI to offer powerful image generation features, allowing for the creation of visuals based on textual descriptions.
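
At a high level, a request flows through the service as sketched below (a simplified view of the components described above; the actual wiring in the code may differ):

  Question answering: client request -> Express route -> LlamaIndex retrieval -> Qdrant vector search -> LLM -> context-aware answer
  Image generation:   client request -> Express route -> Google Vertex AI -> generated image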

Core Technologies

  • Backend Framework: Node.js, Express.js
  • Vector Database: Qdrant
  • LLM Orchestration: LlamaIndex
  • Image Generation: Google Vertex AI

Configuration (.env)

Before running the application, you need to set up your environment variables. Create a .env file in the root of the project by copying the .env.example file (if one is provided) or by creating a new one.

cp .env.example .env

Key Environment Variables:

  • LLM Configuration: The demo was built with OpenAI credentials (OPENAI_API_KEY, OPENAI_MODEL), but LlamaIndex is flexible and can be configured to use any open-source or self-hosted Large Language Model (LLM) instead. A sample .env is shown after this list.

  • Image Generation (Google Vertex AI): To enable image generation, you need to:

    1. Set up a Google Cloud project with the Vertex AI API enabled.
    2. Create a service account with the necessary permissions for Vertex AI.
    3. Download the JSON key file for the service account.
    4. Provide the path to this JSON key file in your .env file.
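
For reference, a minimal .env might look like the following. OPENAI_API_KEY and OPENAI_MODEL come from the demo setup described above; the remaining variable names (port, Qdrant URL, Google credentials path) are assumptions, so check the code or .env.example for the names the project actually reads:

OPENAI_API_KEY=your-openai-api-key
OPENAI_MODEL=your-preferred-model
PORT=3000
QDRANT_URL=http://localhost:6333
GOOGLE_APPLICATION_CREDENTIALS=./path/to/your-service-account-key.json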

Getting Started

Follow these steps to set up and run the backend service on your local machine.

Prerequisites

You will need Node.js (with npm) and Docker (used to run Qdrant) installed on your machine.

1. Clone the Repository

git clone https://github.com/GVodyanov/plant-desc-parser.git
cd plant-desc-parser

2. Install Dependencies

Install the required Node.js packages using npm:

npm install

3. Set Up Qdrant Vector Database

Qdrant is used to store the document embeddings for the RAG system. The easiest way to get it running is with Docker.

  • Download the Qdrant image:

    docker pull qdrant/qdrant
    
  • Run the Qdrant container: This command starts a Qdrant container, maps port 6333 to your local machine, and mounts a local directory (./storage/qdrant) so the vector data persists when the container is stopped or removed.

    docker run -p 6333:6333 -v $(pwd)/storage/qdrant:/qdrant/storage qdrant/qdrant
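
  • Verify the container is reachable (optional): the sketch below uses the official @qdrant/js-client-rest package, which may or may not be the client this project depends on, and a hypothetical file name checkQdrant.mjs.

    // checkQdrant.mjs (hypothetical helper, not part of the repository)
    import { QdrantClient } from "@qdrant/js-client-rest";

    const client = new QdrantClient({ url: "http://localhost:6333" });

    // Lists the collections Qdrant knows about; an empty list is fine before embeddings are created
    console.log(await client.getCollections());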
    

4. Create Embeddings

The knowledge base, consisting of markdown files located in the /storage directory, needs to be processed and stored in the Qdrant vector database.

Run the following script to create the embeddings:

node createEmbeddings.js
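
This script reads the markdown chunks, creates embeddings for them, and upserts the vectors into Qdrant. The repository's implementation may differ, but a minimal version of that flow using the llamaindex npm package looks roughly like the sketch below (the collection name is hypothetical, and depending on your llamaindex version some of these classes may live in separate packages such as @llamaindex/qdrant):

// Sketch of an embedding script (illustrative only; see createEmbeddings.js for the real logic)
import {
  SimpleDirectoryReader,
  VectorStoreIndex,
  QdrantVectorStore,
  storageContextFromDefaults,
} from "llamaindex";

// Load every document under ./storage (the markdown chunks)
const documents = await new SimpleDirectoryReader().loadData({ directoryPath: "./storage" });

// Point LlamaIndex at the local Qdrant instance; "plant_docs" is a made-up collection name
const vectorStore = new QdrantVectorStore({ url: "http://localhost:6333", collectionName: "plant_docs" });
const storageContext = await storageContextFromDefaults({ vectorStore });

// Embed the documents (with the model configured via your OpenAI variables) and store them in Qdrant
await VectorStoreIndex.fromDocuments(documents, { storageContext });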

5. Run the Server

Once the setup is complete, you can start the Express server:

npm start

or

node index.js

The server will be running on the port specified in your .env file (defaults to 3000).
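
If you want to confirm which port is used, the pattern in index.js is typically the standard Express fallback shown below (a sketch, not necessarily the exact code in this repository):

// Loads .env into process.env (assumes the dotenv package)
import "dotenv/config";
import express from "express";

const app = express();
// Use the PORT from .env when set, otherwise fall back to 3000
const port = process.env.PORT || 3000;

app.listen(port, () => console.log(`Project Genesis backend listening on port ${port}`));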

Customizing the RAG Data

You can easily customize the knowledge base of the RAG system by adding your own data.

Adding New Documents

Place your own sliced markdown files in the /storage directory. The createEmbeddings.js script will automatically process all .md files in this folder and its subdirectories.
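
For example, a customized /storage directory might look like this (file and folder names are purely illustrative):

storage/
  qdrant/              <- created by the Qdrant container (vector data, not documents)
  rosa-canina/
    01-description.md
    02-habitat.md
  quercus-robur/
    01-description.md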

Converting Scientific PDFs to Markdown

For converting complex documents like scientific PDFs into clean markdown, we recommend using Marker. It is a powerful tool that can accurately extract text, tables, and other elements from PDFs.

Slicing Markdown Files

After converting your documents to markdown, you need to slice them into smaller, more manageable chunks for the RAG system. This helps improve the accuracy of the retrieval process.

We recommend using LangChain's UnstructuredMarkdownLoader with the mode="elements" option for the best results. This splits the markdown file by its headers, titles, and other structural elements.
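
For reference, the loader call looks roughly like this in Python (assuming the langchain-community and unstructured packages; depending on your LangChain version the import may come from langchain.document_loaders instead):

from langchain_community.document_loaders import UnstructuredMarkdownLoader

# mode="elements" splits the file into titles, headers, and text blocks
loader = UnstructuredMarkdownLoader("my-document.md", mode="elements")
elements = loader.load()

# Each element can then be saved as its own small markdown chunk under /storage
for i, element in enumerate(elements):
    print(i, element.metadata.get("category"), element.page_content[:80])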

For a detailed guide on how to implement this, you can refer to the following example Colab notebook:

LangChain Unstructured Markdown Loader Example