From 9c1af7071738298d0885d399dfc3ba06826c314e Mon Sep 17 00:00:00 2001
From: Tikhon Vodyanov
Date: Sat, 2 Aug 2025 13:36:15 +0200
Subject: [PATCH] README

---
 .DS_Store | Bin 6148 -> 6148 bytes
 README.md | 122 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/.DS_Store b/.DS_Store
index 14a15a24afedb9e3f39906eb9ebcaa9ad3a319ba..665975a27634a21fe48dd4dea719bf39e7da15b0 100644
Binary files a/.DS_Store and b/.DS_Store differ
diff --git a/README.md b/README.md
index dbc662c..4e835dd 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,122 @@
-# team-6
+# Project Genesis Backend
+This repository contains the backend service for Project Genesis, an application that combines a Retrieval-Augmented Generation (RAG) system with image generation capabilities.
+
+## Description
+
+Project Genesis Backend is a Node.js service built with Express.js. It leverages a Qdrant vector database and LlamaIndex to provide advanced question-answering capabilities based on a custom knowledge base. The system can process and understand a variety of documents, allowing users to ask complex questions and receive accurate, context-aware answers.
+
+Additionally, the backend integrates with Google's Vertex AI to offer image generation features, allowing visuals to be created from textual descriptions.
+
+## Core Technologies
+
+- **Backend Framework:** Node.js, Express.js
+- **Vector Database:** Qdrant
+- **LLM Orchestration:** LlamaIndex
+- **Image Generation:** Google Vertex AI
+
+## Configuration (.env)
+
+Before running the application, you need to set up your environment variables. Create a `.env` file in the root of the project by copying the `.env.example` file (if one is provided) or by creating a new one.
+
+```bash
+cp .env.example .env
+```
+
+### Key Environment Variables
+
+- **LLM Configuration:** While the demo was built using OpenAI keys (`OPENAI_API_KEY`, `OPENAI_MODEL`), LlamaIndex is highly flexible: you can configure it to use any open-source or self-hosted Large Language Model (LLM) of your choice.
+
+- **Image Generation (Google Vertex AI):** To enable image generation, you need to:
+  1. Set up a Google Cloud project with the Vertex AI API enabled.
+  2. Create a service account with the necessary permissions for Vertex AI.
+  3. Download the JSON key file for the service account.
+  4. Provide the path to this JSON key file in your `.env` file.
+
+## Getting Started
+
+Follow these steps to set up and run the backend service on your local machine.
+
+### Prerequisites
+
+- [Node.js and npm](https://nodejs.org/en/)
+- [Docker](https://www.docker.com/get-started)
+
+### 1. Clone the Repository
+
+```bash
+git clone https://github.com/GVodyanov/plant-desc-parser.git
+cd plant-desc-parser
+```
+
+### 2. Install Dependencies
+
+Install the required Node.js packages using npm:
+
+```bash
+npm install
+```
+
+### 3. Set Up Qdrant Vector Database
+
+Qdrant is used to store the document embeddings for the RAG system. The easiest way to get it running is with Docker.
+
+- **Download the Qdrant image:**
+  ```bash
+  docker pull qdrant/qdrant
+  ```
+
+- **Run the Qdrant container:**
+  This command starts a Qdrant container and maps port `6333` to your local machine.
+  It also mounts a local directory (`./storage/qdrant`) to persist the vector data, ensuring your data is not lost when the container is stopped or removed.
+
+  ```bash
+  docker run -p 6333:6333 -v $(pwd)/storage/qdrant:/qdrant/storage qdrant/qdrant
+  ```
+
+### 4. Create Embeddings
+
+The knowledge base, consisting of markdown files located in the `/storage` directory, needs to be processed and stored in the Qdrant vector database.
+
+Run the following script to create the embeddings:
+
+```bash
+node createEmbeddings.js
+```
+
+### 5. Run the Server
+
+Once the setup is complete, you can start the Express server:
+
+```bash
+npm start
+```
+
+or
+
+```bash
+node index.js
+```
+
+The server will run on the port specified in your `.env` file (defaults to 3000).
+
+## Customizing the RAG Data
+
+You can easily customize the knowledge base of the RAG system by adding your own data.
+
+### Adding New Documents
+
+Place your own sliced markdown files in the `/storage` directory. The `createEmbeddings.js` script automatically processes all `.md` files in this folder and its subdirectories.
+
+### Converting Scientific PDFs to Markdown
+
+For converting complex documents such as scientific PDFs into clean markdown, we recommend [Marker](https://github.com/datalab-to/marker), a powerful tool that accurately extracts text, tables, and other elements from PDFs.
+
+### Slicing Markdown Files
+
+After converting your documents to markdown, you need to slice them into smaller, more manageable chunks for the RAG system. This helps improve the accuracy of the retrieval process.
+
+We recommend using the `UnstructuredMarkdownLoader` with the `mode="elements"` option for the best results. This splits the markdown file by its headers, titles, and other structural elements.
+
+For a detailed guide on how to implement this, refer to the following example Colab notebook:
+
+[LangChain Unstructured Markdown Loader Example](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/integrations/document_loaders/unstructured_markdown.ipynb)
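If you would rather not pull in the LangChain stack for this step, a much cruder stdlib-only approximation is to split on second-level headings and write each chunk into `/storage` yourself. This is only a sketch: the `storage/sliced` output directory, the `##` heading convention, and the sample text below are all assumptions, and it recovers far less structure than element-based splitting:

```python
# Rough stdlib-only sketch: split markdown at "## " headings into chunk files.
import re
from pathlib import Path

def slice_markdown(text: str) -> list[str]:
    """Split markdown into chunks, keeping each heading with its section."""
    # Lookahead split: break immediately before each line starting with "## ".
    parts = re.split(r"(?m)^(?=## )", text)
    return [p.strip() for p in parts if p.strip()]

# Hypothetical sample document; in practice, read your converted .md files.
source = "# Ficus\n\n## Watering\n\nWeekly in summer.\n\n## Light\n\nBright, indirect.\n"

out_dir = Path("storage/sliced")
out_dir.mkdir(parents=True, exist_ok=True)
for i, chunk in enumerate(slice_markdown(source)):
    # Writes chunk_000.md, chunk_001.md, ... for createEmbeddings.js to pick up.
    (out_dir / f"chunk_{i:03d}.md").write_text(chunk + "\n")
```

For the sample above this produces three files: the document title, the Watering section, and the Light section.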