A deep dive into Microsoft's GraphRAG paper found questionable metrics with vaguely defined "lift", so I analyzed knowledge graphs in RAG more broadly, comparing Neo4j vs. FAISS
create_ground_truth2 is creating a "model-generated ground truth" rather than a true ground truth based on source documents. I mean, the questions are grounded in the source material, but the answers come from the model's general knowledge, so it's not truly a "ground truth" in the traditional sense. Right?
I’d looked into that before it was an issue - thx for asking. The doc I was evaluating RAG on wasn’t in the model's source material (the concern you’d mentioned), and the ground truth was also human reviewed (since N is small - I went through an extensive sample-size calc ahead of time). Ground truth in this case worked perfectly.
Oh, ok, thanks for the explanation - it also seems the questions don't involve such specialized knowledge that we'd need to worry. Another thing: to be fairer, I feel we could use FAISS with HNSW (instead of the default flat index), because that's the architecture Neo4j uses for its vector stores (HNSW). Can you tell I'm a fan of FAISS? Hahahahah
That's interesting, thx - I hadn't known about HNSW (looks like HNSWSQ would be most comparable). I was just looking to compare against a vanilla vector DB at first, and started this analysis with Chroma (very similar results, but it broke due to an incompatibility with new RAGAS dependencies before publishing), so I switched to FAISS and enjoyed the speed/perf (awesome even at its defaults). Became a fan also, but have since learned FAISS isn't maintained frequently (not a priority for FB/Meta, so it will be slower to update compatibility with future Python versions, etc.)
Do you know how to save open-source embeddings on Neo4j directly on their entities and do search? (Like inserting my list of vectors as easily as setting the name property of a node.) I don't like OpenAI much.
Not necessary for my use case.
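For anyone who does need this: a sketch of how it can work with Neo4j's native vector indexes (available in recent 5.x releases; the exact syntax is version-dependent, so treat it as an assumption and check your server's docs). The `Chunk` label, `embedding` property, index name, and 384 dimensions are all hypothetical; any open-source embedding model that returns a list of floats should slot in.

```python
# Cypher for creating a native vector index (Neo4j 5.x syntax; all names
# here are hypothetical placeholders).
CREATE_INDEX = """
CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
OPTIONS {indexConfig: {
    `vector.dimensions`: 384,
    `vector.similarity_function`: 'cosine'
}}
"""

# Setting the embedding is close to setting any other property;
# db.create.setNodeVectorProperty stores it in a compact typed form.
SET_EMBEDDING = """
MATCH (c:Chunk {id: $id})
CALL db.create.setNodeVectorProperty(c, 'embedding', $vector)
"""

# k-nearest-neighbor search over the index.
SEARCH = """
CALL db.index.vector.queryNodes('chunk_embeddings', $k, $query_vector)
YIELD node, score
RETURN node.id AS id, score
"""

def search_chunks(driver, query_vector, k=5):
    """`driver` is a neo4j.Driver (pip install neo4j); returns
    (id, score) pairs for the k nearest stored embeddings."""
    records, _, _ = driver.execute_query(
        SEARCH, k=k, query_vector=query_vector)
    return [(r["id"], r["score"]) for r in records]
```

This keeps the graph and the vectors in one store, so a vector hit can be expanded into its neighborhood with ordinary Cypher in the same query.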
Core of the GraphRAG Project:
1. Entity Knowledge Graph Generation: Initially, a large language model (LLM) is used to extract entities and their interrelations from source documents, creating an entity knowledge graph.
2. Community Summarization: Related entities are further grouped into communities, and summaries are generated for each community. These summaries serve as partial answers during queries.
3. Final Answer Generation: For user questions, partial answers are extracted from the community summaries and then re-summarized to form the final answer.
This approach not only enhances the comprehensiveness and diversity of answers but also demonstrates higher efficiency and scalability when handling large-scale textual data.
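The three steps above can be sketched end to end. Everything here is a toy stand-in: the `llm` function replaces real model calls, entity extraction is deliberately naive, and community detection uses connected components where GraphRAG actually uses the Leiden algorithm over a weighted graph.

```python
from collections import defaultdict

def llm(prompt: str) -> str:
    """Stand-in for a real LLM call."""
    return f"[summary of: {prompt[:40]}]"

def extract_entities(docs):
    """Step 1 (toy): treat capitalized words as entities and
    co-occurrence within a document as a relation."""
    graph = defaultdict(set)
    for doc in docs:
        ents = {w.strip(".,") for w in doc.split() if w[0].isupper()}
        for a in ents:
            graph[a] |= ents - {a}
    return dict(graph)

def detect_communities(graph):
    """Step 2 (toy): connected components; GraphRAG uses Leiden."""
    seen, communities = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(graph.get(n, set()) - comp)
        seen |= comp
        communities.append(comp)
    return communities

def answer(question, docs):
    graph = extract_entities(docs)                        # step 1
    summaries = [llm(" ".join(sorted(c)))                 # step 2
                 for c in detect_communities(graph)]
    partials = [llm(f"{question} given {s}") for s in summaries]
    return llm("combine: " + " | ".join(partials))        # step 3
```

The map-reduce shape of `answer` is the part that matters: per-community partial answers are generated independently, then combined, which is what lets the approach scale to corpora no single context window could hold.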
Thank you for this insightful article. I'm not a developer by trade but have been diving into the power of RAG (regular RAG) recently and successfully built my first pipeline for a SlackBot using Astra DB and the OpenAI API (chat and embeddings). As I was learning simple RAG techniques, I kept reading the GraphRAG hype and wondered if it was worth diving into those techniques next. If regular RAG is a 5 out of 10 in complexity for me, what would you say learning GraphRAG would be comparatively? Thanks again!