Skip to main content

RAG

·979 words·5 mins
Author
Marcus Rasmussen

Automated RAG Workflow for Keeping My Dify Chatbot in Sync with My Portfolio
#

Overview
#

In this post, I’ll walk through how I built an automated workflow that ensures my RAG (Retrieval-Augmented Generation) chatbot in Dify always reflects the latest content from my portfolio.

The goal was simple:

Whenever I update my portfolio, my chatbot should automatically “know” about it.

To achieve this, I created a Java-based pipeline that:

  1. Reads my sitemap
  2. Extracts all portfolio URLs
  3. Converts each page into clean Markdown
  4. Uploads the content into a Dify dataset

Architecture
#


⚙️ Technologies Used
#

  • Java
  • OkHttp (HTTP client)
  • Jsoup (XML parsing)
  • Jina AI Reader API (HTML → Markdown)
  • Dify API (dataset ingestion)

🔄 Step 1: Fetch URLs from Sitemap
#

I start by fetching my sitemap and extracting all <loc> elements using Jsoup.

public static List<String> fetchSitemapUrls(String sitemapUrl) throws IOException {
    Request request = new Request.Builder().url(sitemapUrl).build();
    Response response = client.newCall(request).execute();

    String xml = response.body().string();

    Document doc = Jsoup.parse(xml, "", org.jsoup.parser.Parser.xmlParser());
    Elements locElements = doc.select("loc");

    List<String> urls = new ArrayList<>();
    locElements.forEach(e -> urls.add(e.text()));

    return urls;
}

public static String fetchMarkdownFromJina(String url) throws IOException {
    String jinaUrl = JINA_API + url;

    Request request = new Request.Builder()
            .url(jinaUrl)
            .header("Accept", "text/plain")
            .build();

    Response response = client.newCall(request).execute();

    if (!response.isSuccessful()) return null;

    return response.body().string();
}
# Automated RAG Bot with Dify  Always in Sync with My Portfolio

> **TL;DR:** I built a Java program that automatically reads all pages from my portfolio sitemap, converts them to Markdown via Jina AI, and uploads them to a Dify knowledge base  so my RAG chatbot always reflects the latest content on my site.

---

## Background

I have a portfolio hosted on GitHub Pages and an AI chatbot built with [Dify](https://dify.ai) that visitors can use to ask questions about my work. The problem was: **every time I updated my portfolio, the chatbot was still answering based on outdated content.**

The solution? An automated workflow that syncs content from my portfolio directly into Dify's RAG knowledge base  with no manual intervention.

---

## Architecture & Flow

The workflow consists of three steps:

Portfolio (sitemap.xml) │ ▼ [1] Fetch all page URLs from sitemap │ ▼ [2] Convert each page to Markdown via Jina AI │ ▼ [3] Upload Markdown files to Dify dataset via API


Everything is handled by a single Java program: `SitemapToDify.java`.

---

## Implementation

### Step 1 – Fetch URLs from the Sitemap

I use `OkHttp` to fetch `sitemap.xml` and `Jsoup` to parse the XML and extract all `<loc>` tags:

```java
public static List<String> fetchSitemapUrls(String sitemapUrl) throws IOException {
    Request request = new Request.Builder().url(sitemapUrl).build();
    Response response = client.newCall(request).execute();

    String xml = response.body().string();
    Document doc = Jsoup.parse(xml, "", org.jsoup.parser.Parser.xmlParser());
    Elements locElements = doc.select("loc");

    List<String> urls = new ArrayList<>();
    locElements.forEach(e -> urls.add(e.text()));
    return urls;
}

The result is a complete list of every page URL on my portfolio — automatically, with no hardcoding.


Step 2 – Convert Pages to Markdown via Jina AI
#

Jina AI’s Reader API takes a URL and returns the page content as clean Markdown — perfect for RAG, since it strips away all the noise: HTML tags, navigation, footers, and ads:

public static String fetchMarkdownFromJina(String url) throws IOException {
    String jinaUrl = "https://r.jina.ai/" + url;

    Request request = new Request.Builder()
            .url(jinaUrl)
            .header("Accept", "text/plain")
            .build();

    Response response = client.newCall(request).execute();
    if (!response.isSuccessful()) return null;

    return response.body().string();
}

Just prefix any URL with https://r.jina.ai/ and Jina returns clean Markdown. Simple and effective.


Step 3 – Upload to Dify Dataset
#

Each Markdown file is uploaded as a document to my Dify knowledge base via their REST API. I use multipart/form-data with two parts: the file itself and a JSON configuration string:

public static void uploadToDify(String url, String markdown) throws IOException {
    String fileName = url.replaceAll("[^a-zA-Z0-9]", "_") + ".md";

    String dataJson = new JSONObject()
            .put("indexing_technique", "high_quality")
            .put("doc_form", "text_model")
            .put("doc_language", "English")
            .put("process_rule", new JSONObject().put("mode", "automatic"))
            .toString();

    RequestBody requestBody = new MultipartBody.Builder()
            .setType(MultipartBody.FORM)
            .addFormDataPart("file", fileName,
                    RequestBody.create(markdown, MediaType.parse("text/markdown")))
            .addFormDataPart("data", dataJson)
            .build();

    Request request = new Request.Builder()
            .url(DIFY_API)
            .post(requestBody)
            .addHeader("Authorization", "Bearer " + DIFY_API_KEY)
            .build();

    client.newCall(request).execute();
}

One important detail: the data field must be a serialized JSON string, not a nested object — this caught me off guard and took a while to figure out.


Main Loop
#

Everything is tied together in main():

public static void main(String[] args) throws Exception {
    List<String> urls = fetchSitemapUrls(SITEMAP_URL);

    for (String url : urls) {
        System.out.println("Processing: " + url);
        String markdown = fetchMarkdownFromJina(url);
        if (markdown != null && !markdown.isEmpty()) {
            uploadToDify(url, markdown);
        }
        Thread.sleep(1000); // avoid rate limits
    }
}

I add a one-second delay between each page to avoid hitting rate limits on Jina’s API.


Technologies Used
#

TechnologyPurpose
JavaProgramming language for the workflow
OkHttpHTTP client for API calls
JsoupXML/HTML parsing of the sitemap
Jina AI ReaderConverting web pages to Markdown
DifyRAG platform and chatbot engine
GitHub PagesPortfolio hosting (Hugo)

What I Learned
#

Jina AI Reader is underrated. Converting an entire web page to clean Markdown with a single API call is extremely useful for RAG pipelines — no scraping logic, no HTML parsing, just content.

Dify’s API is well-documented, but detail-oriented. The data field in the multipart request must be a serialized JSON string, not an object. Easy to overlook, but critical for the upload to succeed.

Automation pays off quickly. I can now run the script after every portfolio update — or set it up as a GitHub Action that triggers on every push — and my chatbot is always up to date.


Next Steps
#

  • Set up the workflow as a GitHub Action that runs automatically on every push to the portfolio repo
  • Add deletion of old documents in Dify before re-uploading, to prevent duplicate entries from accumulating
  • Add language detection instead of hardcoding "English"

The code is written as a standalone Java program. If you have questions about the implementation, feel free to reach out.