1. Chunking
We can select the chunk size through simple experiments.
Reference: Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex (www.llamaindex.ai)
import nest_asyncio
nest_asyncio.apply()
from llama_index import (
SimpleDirectoryReader,
VectorStoreIndex,
ServiceContext,
)
from llama_index.evaluation import (
DatasetGenerator,
FaithfulnessEvaluator,
RelevancyEvaluator
)
from llama_index.llms import OpenAI
import openai
import time
openai.api_key = 'OPENAI-API-KEY'
# Download Data
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
# Load Data
reader = SimpleDirectoryReader("./data/10k/")
documents = reader.load_data()
# To evaluate each chunk size, we first generate a set of 20 evaluation questions from the first 20 pages.
eval_documents = documents[:20]
data_generator = DatasetGenerator.from_documents(eval_documents)
eval_questions = data_generator.generate_questions_from_nodes(num=20)
# We will use GPT-4 for evaluating the responses
gpt4 = OpenAI(temperature=0, model="gpt-4")
# Define service context for GPT-4 for evaluation
service_context_gpt4 = ServiceContext.from_defaults(llm=gpt4)
# Define Faithfulness and Relevancy Evaluators which are based on GPT-4
faithfulness_gpt4 = FaithfulnessEvaluator(service_context=service_context_gpt4)
relevancy_gpt4 = RelevancyEvaluator(service_context=service_context_gpt4)
# Define a function to calculate average response time, average faithfulness, and average relevancy for a given chunk size
def evaluate_response_time_and_accuracy(chunk_size):
    total_response_time = 0
    total_faithfulness = 0
    total_relevancy = 0
    # Create a vector index with the given chunk size
    llm = OpenAI(model="gpt-3.5-turbo")
    # https://docs.llamaindex.ai/en/v0.9.48/module_guides/supporting_modules/service_context.html
    # ServiceContext lets us set the llm, embed_model, and text_splitter
    service_context = ServiceContext.from_defaults(llm=llm, chunk_size=chunk_size)
    vector_index = VectorStoreIndex.from_documents(
        eval_documents, service_context=service_context
    )
    query_engine = vector_index.as_query_engine()
    num_questions = len(eval_questions)
    for question in eval_questions:
        start_time = time.time()
        response_vector = query_engine.query(question)
        elapsed_time = time.time() - start_time
        # .passing is a boolean, so summing it counts how many responses passed each check
        faithfulness_result = faithfulness_gpt4.evaluate_response(
            response=response_vector
        ).passing
        relevancy_result = relevancy_gpt4.evaluate_response(
            query=question, response=response_vector
        ).passing
        total_response_time += elapsed_time
        total_faithfulness += faithfulness_result
        total_relevancy += relevancy_result
    average_response_time = total_response_time / num_questions
    average_faithfulness = total_faithfulness / num_questions
    average_relevancy = total_relevancy / num_questions
    return average_response_time, average_faithfulness, average_relevancy
# Iterate over different chunk sizes and compare the metrics to help choose a chunk size
for chunk_size in [128, 256, 512, 1024, 2048]:
    avg_time, avg_faithfulness, avg_relevancy = evaluate_response_time_and_accuracy(chunk_size)
    print(f"Chunk size {chunk_size} - Average Response time: {avg_time:.2f}s, Average Faithfulness: {avg_faithfulness:.2f}, Average Relevancy: {avg_relevancy:.2f}")

1.1 Chunking Methods
a. Fixed-size Chunking:
- Computationally cheap, saves processing power, and is easy to use.
- Typically combined with overlap so context is not lost at chunk boundaries (see the sketch after this list).
b. "Context-aware" Chunking
- Sentence Splitting
- Many models are optimized for embedding sentence-level context.
- Libraries: NLTK, spaCy
- Recursive Chunking
- Recursive splitting: It tries to split the text using a hierarchy of separators (e.g., paragraph breaks, sentences, words) in a recursive way.
- Intelligent chunking: The goal is to make chunks that are under a certain size (in characters or tokens) while preserving as much semantic meaning as possible.
- Overlap support: It can create overlapping chunks to preserve context between segments.
- Specialized Chunking: for structured and formatted content
- from langchain.text_splitter import MarkdownTextSplitter
- from langchain.text_splitter import LatexTextSplitter
c. Multi-Modal Chunking
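Below is a minimal sketch of the fixed-size, sentence-level, recursive, and Markdown-aware splitting methods listed above. It assumes LangChain's langchain.text_splitter module (as in the imports above) and NLTK with the punkt tokenizer downloaded; the chunk sizes and sample text are illustrative only.
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
    MarkdownTextSplitter,
)
import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")  # one-time download for the sentence tokenizer

# Reuse the first page loaded earlier with SimpleDirectoryReader
text = documents[0].text

# a. Fixed-size chunking with overlap
fixed_splitter = CharacterTextSplitter(separator="\n\n", chunk_size=512, chunk_overlap=64)
fixed_chunks = fixed_splitter.split_text(text)

# b. Sentence splitting: embed each sentence (or regrouped sentences) individually
sentences = sent_tokenize(text)

# b. Recursive chunking: tries "\n\n", "\n", " ", "" in turn until chunks fit the size limit
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
recursive_chunks = recursive_splitter.split_text(text)

# Specialized chunking for Markdown-formatted content
md_splitter = MarkdownTextSplitter(chunk_size=512, chunk_overlap=0)
md_chunks = md_splitter.split_text("# Title\n\nSome Markdown body text...")

print(len(fixed_chunks), len(sentences), len(recursive_chunks), len(md_chunks))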
2. Namespaces
Reference: Understanding indexes - Pinecone Docs (docs.pinecone.io)
Within an index, records are partitioned into namespaces, and all upserts, queries, and other data operations always target one namespace.
This has two main benefits:
- Multitenancy: When you need to isolate data between customers, you can use one namespace per customer and target each customer’s writes and queries to their dedicated namespace. See Implement multitenancy for end-to-end guidance.
- Faster queries: When you divide records into namespaces in a logical way, you speed up queries by ensuring only relevant records are scanned. The same applies to fetching records, listing record IDs, and other data operations.
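A minimal sketch of namespace-scoped writes and reads, assuming the Pinecone Python client (v3+), an existing 1536-dimensional index named "my-index", and placeholder IDs and vector values; all names here are illustrative.
from pinecone import Pinecone

pc = Pinecone(api_key="PINECONE-API-KEY")
index = pc.Index("my-index")  # hypothetical existing index

# Upserts target one namespace (e.g., one namespace per customer)
index.upsert(
    vectors=[{"id": "doc-1", "values": [0.1] * 1536, "metadata": {"source": "uber_2021.pdf"}}],
    namespace="customer-a",
)

# Queries scan only the records in the targeted namespace
results = index.query(
    vector=[0.1] * 1536,
    top_k=5,
    include_metadata=True,
    namespace="customer-a",
)
print(results)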

Practical Advantages of Namespaces
- Efficiency
- Faster vector search because you can narrow the search space just by specifying namespace.
- Simplicity
- You manage only one index.
- No need to create, monitor, backup, or version many different indexes.
- Cost Saving
- Especially in Pinecone, each index requires baseline server resources (e.g., each index consumes a minimum amount of RAM/storage, even if empty). Using namespaces avoids this unnecessary overhead.
- Faster Development
- Easier to batch upserts, queries, deletions because you're always talking to the same index.
- Multi-Tenancy
- If you have a multi-user application (e.g., a chatbot serving multiple companies), you can isolate each company’s data in different namespaces without spinning up new indexes.
When NOT to use namespaces:
- If you truly need different dimensions (e.g., some embeddings are 512-dim, others are 768-dim).
- If you need full isolation for compliance (e.g., legally mandated separation).
- If indexes are so big that even one namespace becomes huge (then sharding across indexes may be better).
Implementation
Reference: Chroma Docs - Getting Started (https://docs.trychroma.com/docs/overview/getting-started)
Reference: 01. Chroma - Creating a VectorStore (https://wikidocs.net/234094)
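Following the getting-started guide linked above, here is a minimal sketch of creating a Chroma collection, adding documents, and querying it; the collection name and sample documents are placeholders, and the default in-memory client and embedding function are assumed.
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path="./chroma_db") to persist
collection = client.create_collection(name="uber_10k")  # hypothetical collection name

# Chroma embeds the documents with its default embedding function unless one is supplied
collection.add(
    documents=["Namespaces partition records within a single index.", "Chunk size affects retrieval quality and latency."],
    ids=["doc-1", "doc-2"],
)

# Query returns the nearest documents to the query text
results = collection.query(query_texts=["What do namespaces do?"], n_results=1)
print(results["documents"])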