LangChain Study

Building an LLM application involves many steps: trying different prompts, integrating different LLMs, implementing conversation history, and, in the end, writing a lot of glue code.

LangChain is an open-source framework for developing applications powered by large language models, connecting them to external data sources, and managing conversations with humans.

Value propositions

Develop apps that are context-aware and can reason using LLMs. It includes Python and TypeScript packages, and a Java one under construction.

It focuses on composition and modularity. The components defined by the framework can be combined to address specific use cases, and developers can add new components.

  • LangChain: Python and JavaScript libraries
  • LangServe: a library for deploying LangChain chains as a REST API
  • LangSmith: a platform that lets developers debug, test, evaluate, and monitor chains
  • Predefined prompt templates from the LangChain Hub

They are adding new products to their portfolio quickly, like LangSmith (visibility into LLM execution) and LangServe (a server API for LangChain apps).

Sources

The content comes from different sources:

LangChain libraries

The core building block of LangChain applications is the LLMChain, which combines:

  • An LLM
  • Prompt templates
  • Output parsers

PromptTemplate helps structure the prompt and facilitates reuse by creating model-agnostic templates. The library includes output parsers to extract content from the keywords defined in the prompt, for example the chain-of-thought keywords Thought, Action and Observation.
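
As a minimal sketch, a model-agnostic template can be defined once and formatted for any LLM; the naming-consultant wording and product value below are just an illustration:

from langchain_core.prompts import PromptTemplate

# The same template can be reused with different models and different inputs
template = PromptTemplate.from_template(
    "You are a naming consultant. Suggest one name for a company that makes {product}."
)
print(template.format(product="eco-friendly water bottles"))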

Getting started

The LangChain documentation is excellent, so there is no need to repeat it here. All my study code with LangChain and LLMs is in different folders of this repo:

Backend        Type of chains
openAI         The implementation of the quickstart examples, RAG, chatbot, agent
Ollama         Run a simple query to Ollama (running Llama 3.2) locally
Anthropic      Claude
Mistral        Mistral LLM
IBM            WatsonX
AWS Bedrock    zero_shot generation

Each code sample imports only the LangChain modules it needs, to keep the executable size low.

Main Concepts

Model I/O

Model I/O provides the building blocks to interface with any language model. It handles the model input (prompts) sent to the LLM and the model output it produces.

  • LangChain supports two types of language models: LLM (for pure text-completion models) and ChatModel (conversation on top of an LLM, using the AIMessage and HumanMessage constructs).
  • LangChain uses prompt templates to control LLM behavior.

    from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer the user's questions based on the below context:\n\n{context}"),
        MessagesPlaceholder(variable_name="chat_history"),
        ("user", "{input}"),
    ])
    
  • We can build custom prompts by extending existing default templates. An example is adding few-shot examples to a chat prompt using FewShotChatMessagePromptTemplate.

  • LangChain offers a prompt hub with predefined prompts that are easy to load:

    from langchain import hub
    prompt = hub.pull("hwchase17/openai-functions-agent")
    
  • Chains allow developers to combine multiple components together (or to combine other chains) to create a single, coherent application.

  • OutputParsers convert the raw output of a language model into a format that can be used downstream, as in the sketch below.
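
As a minimal sketch (assuming an OpenAI API key is configured), an output parser can be piped after the model to turn the raw AIMessage into a plain string, reusing the chat prompt defined above:

from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

# StrOutputParser extracts the text content from the model's AIMessage
chain = prompt | ChatOpenAI() | StrOutputParser()
answer = chain.invoke({
    "context": "LangChain is a framework for building LLM applications.",
    "chat_history": [],
    "input": "What is LangChain?",
})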

Feature stores, like Feast, can be a great way to keep information about the user conversation or query, and LangChain provides an easy way to combine data from Feast with LLMs.

Chain

Chains are runnable, observable and composable. The LangChain framework uses the Runnable class to encapsulate operations that can be run synchronously or asynchronously.

  • LLMChain class is the basic chain to integrate with any LLM.

    # Basic chain
    from langchain.chains import LLMChain

    chain = LLMChain(llm=model, prompt=prompt)
    chain.invoke("a query")
    
  • A sequential chain combines chains in sequence, with a single input and output (SimpleSequentialChain)

    overall_simple_chain = SimpleSequentialChain(chains=[chain_one, chain_two],
                                                verbose=True
                                                )
    

    or multiple inputs and outputs, with prompts using the different named input variables (see this code).

  • LLMRouterChain is a chain that outputs the name of a destination chain and the inputs to it.

  • LangChain Expression Language (LCEL) is a declarative way to define chains. It looks similar to a Unix shell pipe: the input of one runnable comes from the output of its predecessor (this is why the prompt below is a runnable).

    # a chain definition using Langchain expression language
    chain = prompt | model | output_parser
    
  • Chains can be executed asynchronously, on the asyncio event loop, using the ainvoke method, as in the sketch below.
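
A minimal sketch of asynchronous execution with the LCEL chain defined above (the input keys depend on the prompt's variables):

import asyncio

# ainvoke schedules the call on the asyncio event loop instead of blocking the caller
async def main():
    result = await chain.ainvoke({"input": "a query"})
    print(result)

asyncio.run(main())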

Runnable

The Runnable interface is a protocol for defining custom chains and invoking them. Each Runnable exposes methods to get its input, output and config schemas, and implements synchronous and asynchronous invoke and batch methods. Runnables can run in parallel or in sequence.

To pass data through a Runnable sequence there is the RunnablePassthrough class. It is used in conjunction with RunnableParallel to assign data to a key in a map.

from langchain.schema.runnable import RunnableParallel, RunnablePassthrough

runnable = RunnableParallel(
    passed=RunnablePassthrough(),
    extra=RunnablePassthrough.assign(mult=lambda x: x["num"] * 3),
    modified=lambda x: x["num"] + 1,
)

print(runnable.invoke({"num": 6}))
# {'passed': {'num': 6}, 'extra': {'num': 6, 'mult': 18}, 'modified': 7}
  • RunnableLambda is a type of Runnable that wraps a callable function.

from langchain.schema.runnable import RunnableLambda

sequence = RunnableLambda(lambda x: x + 1) | {
    'mul_2': RunnableLambda(lambda x: x * 2),
    'mul_5': RunnableLambda(lambda x: x * 5),
}
sequence.invoke(1)

The RunnablePassthrough.assign method is used to create a Runnable that passes the input through while adding some keys to the output.

We can use Runnable.bind() to pass arguments as constants accessible within a runnable sequence (a chain), where the argument is not part of the output of the preceding runnables, as in the sketch below.
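
A sketch, assuming an OpenAI chat model: bind() attaches a constant stop sequence to the model call within the chain; the equation-solving prompt is illustrative.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Solve {problem} step by step, labeling each step STEP 1, STEP 2, ..."
)
model = ChatOpenAI(temperature=0)

# The stop sequence is a constant bound to the model call; it does not come
# from the output of the preceding runnables in the sequence
chain = prompt | model.bind(stop=["STEP 3"]) | StrOutputParser()
print(chain.invoke({"problem": "2x + 3 = 11"}))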

See some code RunnableExamples

Memory

Large Language Models are stateless and do not remember anything. A chatbot only seems to have memory because the conversation is kept in the context.

With a simple conversation like the following code, the conversation is added as a string to the context:

from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)
memory = ConversationBufferMemory()
conversation = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True   # trace the chain
)

The memory is just a container in which we can save {"input": "..."} and {"output": "..."} content, as in the sketch below.
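
A small sketch of saving and reloading the buffer with the memory object created above:

memory.save_context({"input": "Hi, my name is Jerome"},
                    {"output": "Hello Jerome, how can I help you?"})
print(memory.load_memory_variables({}))
# -> {'history': "Human: Hi, my name is Jerome\nAI: Hello Jerome, how can I help you?"}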

But as the conversation goes on, the size of the context grows, and so does the cost of operating the chatbot, since APIs are charged by the number of tokens. Using ConversationBufferWindowMemory(k=1), with a k large enough to keep sufficient context, we can limit the cost. The same applies to ConversationTokenBufferMemory, which limits the number of tokens kept in memory.

ConversationChain is a predefined chain to have a conversation and load context from memory.

As part of the memory components there is ConversationSummaryMemory, which keeps a summary of the conversation so far.

Other important memory types are vector-data memory, entity memory and knowledge-graph memory.

See related code conversation_with_memory.py

Retrieval Augmented Generation

The goal of Retrieval Augmented Generation (RAG) is to add a custom dataset that is not already part of the trained model and to use it as input sent to the LLM. RAG is illustrated in the figure below:

An embedding is the vector representation of a chunk of text. Different embedding models can be used.

Embeddings

The classical embedding is OpenAIEmbeddings, but Hugging Face offers an open-source alternative: SentenceTransformers (https://huggingface.co/sentence-transformers), a Python framework for state-of-the-art sentence, text and image embeddings.

# With OpenAI embeddings persisted in a Chroma vector store
from langchain_openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings(),
                                    persist_directory=DOMAIN_VS_PATH)

# With Hugging Face sentence-transformers
from sentence_transformers import SentenceTransformer

def build_embedding(docs):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode(docs)

# With AWS Bedrock embeddings
from langchain.embeddings import BedrockEmbeddings

Different code samples that implement RAG:

Code                          Notes
build_agent_domain_rag.py     Read Lilian Weng's blog and create a ChromaDB vector store with OpenAIEmbeddings
query_agent_domain_store.py   Query the persisted vector store for similarity search
prepareVectorStore.py         Use AWS Bedrock embeddings
embeddings_hf.py              Use Hugging Face embeddings, splitting a markdown file, with a FAISS vector store
rag_HyDE.py                   Hypothetical Document Embedding (HyDE): the first prompt creates a hypothetical document

Creating chunks is necessary because language models generally have a limit on the number of tokens they can process. Chunking also improves the vector-based similarity search.

Split docs and save in vector store

RecursiveCharacterTextSplitter splits text by recursively looking at characters. text_splitter.split_documents(documents) returns a list of Document objects, each a wrapper around page content plus some metadata about the indexes in the source document.

# ...
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

loader = PyPDFDirectoryLoader("./data/")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)

# 'embeddings' is the embedding model defined earlier (e.g. OpenAIEmbeddings or BedrockEmbeddings)
vectorstore_faiss = FAISS.from_documents(
    docs,
    embeddings,
)
vectorstore_faiss.save_local("faiss_index")

Similarity search in the vector DB

Using OpenAIEmbeddings:

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)
query = """Is it possible that ...?"""
query_embedding = embeddings.embed_query(query)
relevant_documents = vectorstore_faiss.similarity_search_by_vector(query_embedding)

During the interaction with the end user, the system (a chain in LangChain) retrieves the data most relevant to the question asked and passes it to the LLM in the generation step.

  • Embeddings capture the semantic meaning of the text to help with similarity search.
  • Persist the embeddings in a vector store. FAISS and ChromaDB are common vector stores, but OpenSearch or PostgreSQL (pgvector) can also be used.
  • A retriever combines semantic search with efficient algorithms to prepare the prompt. To improve on plain vector similarity search we can generate variants of the input question, as in the sketch below.
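
A sketch of generating question variants with MultiQueryRetriever, assuming the FAISS store built above and an OpenAI model:

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

# The LLM rewrites the user question into several variants; retrieved documents are merged
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore_faiss.as_retriever(),
    llm=ChatOpenAI(temperature=0),
)
docs = retriever.get_relevant_documents("Is it possible that I get sentenced to jail due to failure in filings?")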

See Q&A with FAISS store qa-faiss-store.py.

Getting started with Feast

Use pip install feast, then the Feast CLI with feast init my_feature_repo to create a feature store, then feast apply to create the entities, feature views and services. Finally, run feast ui and open http://localhost:8888 to act on the store.

LLM and FeatureForm

See FeatureForm as another open-source feature store solution and the LangChain sample with Claude LLM

Q&A app

For Q&A the pipeline will most likely integrate with existing documents as illustrated in the figure below:

Embeddings capture the semantic meaning of the text, which helps with similarity search. The vector store supports storing and searching these embeddings. Retrievers use different semantic-search algorithms to load the relevant vectors.

Use RAG with Q&A

chains.RetrievalQA

from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """Human: Use the following pieces of context to provide a concise answer to the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Assistant:"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 3}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
query = "Is it possible that I get sentenced to jail due to failure in filings?"
result = qa({"query": query})
print_ww(result['result'])

ChatBot

Chatbots are the most common app for LLMs: aside from basic prompting and LLM calls, chatbots need memory and retrievers:

Text Generation Examples

Summarization chain

Always assess the size of the content to send, as the approach differs: for a big document, we need to split it into chunks.

Using the LangChain summarize chain:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms.bedrock import Bedrock
from langchain.chains.summarize import load_summarize_chain

llm = Bedrock(
    model_id=modelId,
    model_kwargs={
        "max_tokens_to_sample": 1000,
    },
    client=boto3_bedrock,
) 

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"], chunk_size=4000, chunk_overlap=100
)
# 'letter' holds the raw text of the document to summarize
docs = text_splitter.create_documents([letter])

summary_chain = load_summarize_chain(llm=llm, chain_type="map_reduce", verbose=True)
output = summary_chain.run(docs)

Evaluating results

To evaluate a chain, we need to define the data points to be measured. Building questions and accepted answers is a classical approach.

We can use an LLM and a special chain (QAGenerateChain) to build Q&A pairs from a document, as in the sketch below.
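
A short sketch, assuming an OpenAI model and the document chunks (docs) created earlier:

from langchain.evaluation.qa import QAGenerateChain
from langchain_openai import ChatOpenAI

# Generate question / answer pairs from documents, to be used later as evaluation data points
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(temperature=0))
examples = example_gen_chain.apply_and_parse(
    [{"doc": d.page_content} for d in docs[:5]]
)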

Agent

An agent is an orchestrator pattern where the LLM decides what actions to take given the current query and context. With a chain, the developer codes the sequence of tasks; with an agent, the LLM decides. LangGraph is an extension of LangChain specifically aimed at creating highly controllable and customizable agents.

Chains create a predefined sequence of tool calls, while agents let the model use tools in a loop, so it can decide how many times to use its defined tools.

AgentExecutor is deprecated

Use LangGraph to implement agents. The content below is from LangChain v0.1.

Agent types differ along several dimensions: intended model, chat support, multi-input tool support, parallel function calling support, and required model params.

LangChain uses specific schema classes to define agent outcomes: AgentAction (with tool and tool_input) and AgentFinish.

from langchain.agents import create_tool_calling_agent
from langchain.agents import AgentExecutor
from langchain.tools.retriever import create_retriever_tool
from langchain_community.tools.tavily_search import TavilySearchResults

...
tools = [retriever_tool, search, llm_math, wikipedia]

agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

  • To create an agent, use one of the constructor methods such as create_react_agent, create_json_agent, create_structured_chat_agent, create_tool_calling_agent, etc. These methods return a Runnable.
  • The agent loops on the user input until it returns an AgentFinish action. If the agent returns an AgentAction, the executor uses it to call a tool and get an Observation. An agent has inputs, outputs and intermediate steps. AgentAction is a response that consists of an action and an action_input.
  • See the existing predefined agent types.
  • AgentExecutor is the runtime for an agent.

  • Tools are functions that an agent can invoke. A tool definition includes the input schema for the tool and the function to run. The parameters of the tool should be sensibly named and described.

Tool Calling

With tool calling we can define a function or tool to be referenced as part of the LLM response, and the LLM will prepare the arguments for the function. Tool calling is used to generate tool invocations, not to execute them.

Tool calling allows a model to detect when one or more tools should be called and to respond with the inputs that should be passed to those tools. The inputs match a defined schema. Below is an example of a structured answer from an OpenAI LLM: "tool_calls" is the key holding the list of function names and arguments the orchestrator needs to call.

    "tool_calls": [
            {
            "name": "tavily_search_results_json",
            "args": {
                "query": "weather in San Francisco"
            },
            "id": "call_Vg6JRaaz8d06OXbG5Gv7Ea5J"
            }

The prompt defines placeholders to receive tool parameters. A LangChain prompt for OpenAI typically uses an agent_scratchpad variable, which is a MessagesPlaceholder: intermediate agent actions and tool output messages are passed in there.
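
A sketch of such a prompt, close to the hub's hwchase17/openai-functions-agent prompt (the system message here is illustrative):

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# agent_scratchpad receives the intermediate (AgentAction, tool output) messages
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder(variable_name="chat_history", optional=True),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])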

LangChain has a lot of predefined tool definitions to be reused.

We can use tool calling in chain (to use tools in sequence) or in agent (to use tools in loop).

LangChain offers a model API called bind_tools to pass the tool definitions as part of each call to the model, so that the application can invoke the right tool when appropriate, as in the sketch below.
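
A minimal sketch, assuming langchain_openai is installed; the multiply tool and the gpt-4o-mini model name are illustrative:

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def multiply(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b

llm = ChatOpenAI(model="gpt-4o-mini")
llm_with_tools = llm.bind_tools([multiply])

# The model does not execute the tool; it returns the tool name and arguments to call
msg = llm_with_tools.invoke("What is 6 times 7?")
print(msg.tool_calls)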

See also the load_tools API with its list of predefined tools.

Below is the classical application flow using tool calling. The exposed function wraps a remote microservice.

When developing an agent-based solution, consider the tools and services the agent needs to access. See the code example openAI_agent.py.

Many LLM providers support tool calling, including Anthropic, Cohere, Google, Mistral and OpenAI; see the existing LangChain tools.

Interesting tools

Search recent news

A common tool integrated with agents is the Tavily search API, used to get recent, trusted news, i.e. information created after the LLM's training cutoff date.

retriever_tool = create_retriever_tool(
    retriever,
    "langsmith_search",
    "Search for information about LangSmith. For any questions about LangSmith, you must use this tool!",
)
search = TavilySearchResults()
tools = [retriever_tool, search]
Tavily

Tavily is a search engine optimized for LLMs, providing factual, explicit and objective answers. It powers GPT Researcher, which queries, filters and aggregates 20+ web sources per research task. It focuses on optimizing search for AI developers and autonomous AI agents. See this git repo.

Python REPL tool

PythonREPLTool is a tool for running Python code in a REPL (it behaves like a Jupyter notebook cell), as in the sketch below.
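
A tiny sketch, assuming the langchain_experimental package is installed:

from langchain_experimental.tools import PythonREPLTool

python_repl = PythonREPLTool()
# The tool executes the given Python code and returns whatever is printed
print(python_repl.invoke("print(2 ** 10)"))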

A base model

It is possible to bind a Pydantic BaseModel class as a tool, as below, where an LLM is used to create a prompt: the PromptInstructions entity becomes the JSON schema used as the tool. (Tool definitions are structured text in the system prompt, since LLMs only understand text.)

from typing import List

from langchain_core.pydantic_v1 import BaseModel

class PromptInstructions(BaseModel):
    """Instructions on how to prompt the LLM."""
    objective: str
    variables: List[str]
    constraints: List[str]
    requirements: List[str]

llm_with_tool = llm.bind_tools([PromptInstructions])

See the LangGraph sample: prompt_builder_graph.py.

Our own tools

Define a custom tool using the @tool annotation on a function to expose it as a tool. It uses the function name as the tool name and the function's docstring as the tool's description.

A second approach is to subclass the BaseTool class, optionally with an args_schema defined as a langchain.pydantic_v1.BaseModel.

Finally, the last possible approach is to use the StructuredTool class.

When building agents we need to manage exceptions and implement handle_tool_error, as in the sketch below.
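
A sketch of a tool that reports its errors back to the agent instead of raising; the search_inventory function is a made-up example:

from langchain_core.tools import StructuredTool, ToolException

def search_inventory(item: str) -> str:
    # Simulate a failing backend call
    raise ToolException(f"No inventory record found for: {item}")

inventory_tool = StructuredTool.from_function(
    func=search_inventory,
    name="search_inventory",
    description="Look up an item in the inventory service.",
    handle_tool_error=True,  # return the ToolException message as the tool output
)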

To map tools to OpenAI function calls, there is a helper: from langchain_core.utils.function_calling import convert_to_openai_function.

It may be interesting to use embeddings to do tool selection before calling the LLM; see the code agent_wt_tool_retrieval.py. The approach is to dynamically select the N tools we want at run time, without having to pass all the tool definitions in the context window. It uses a vector store to create embeddings for each tool description.

How Tos

How to trace the agent execution?

import langchain
langchain.debug = True
Or use LangSmith

Defining an agent with tool calling, and the concept of scratchpad

Define an agent with: 1/ the user input, 2/ a component for formatting intermediate steps (agent action, tool output pairs), where format_to_openai_tool_messages converts (AgentAction, tool output) tuples into FunctionMessages, and 3/ a component for converting the output message into an agent action or agent finish:

# x is the response from the LLM
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_to_openai_tool_messages(
            x["intermediate_steps"]
        ),
        "chat_history": lambda x: x["chat_history"],
    }
    | prompt
    | llm_with_tools
    | OpenAIToolsAgentOutputParser()
)

OpenAIToolsAgentOutputParser is used with OpenAI models, as it relies on OpenAI's specific tool_calls field to convey which tools to use.

How to support streaming the LLM's output?

LangChain streaming is needed to make the app more responsive for end users. All Runnable objects implement a sync method called stream and an async variant called astream. They cut the output into chunks and yield them; recall that yield makes the function a generator, producing data incrementally instead of returning once. The main demo code is web_server_wt_streaming with client_stream.py. A self-contained sketch is shown below.
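
A self-contained sketch, assuming an OpenAI API key is configured (the model name and joke prompt are illustrative):

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Tell me a short joke about {topic}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

# stream() yields the output chunk by chunk as the model generates it
for chunk in chain.stream({"topic": "parrots"}):
    print(chunk, end="", flush=True)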

Example of Intended Model

to be done

Example of Supports Multi-Input Tools

to be done

Use a vector store to keep the list of agent and description

As we cannot put the descriptions of all the tools in the prompt (because of context length issues), we instead dynamically select the N tools we want to consider at run time. See the code in agent_wt_tool_retrieval.py.

LangChain Expression Language (LCEL)

LCEL supports streaming LLM results, async communication, parallel execution, retries and fallbacks, and access to intermediate results, as in the sketch below.
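
A small sketch of retries and fallbacks, assuming two OpenAI models (the model names are illustrative):

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template("Summarize in one sentence: {text}")
primary = ChatOpenAI(model="gpt-4o-mini").with_retry(stop_after_attempt=2)
backup = ChatOpenAI(model="gpt-3.5-turbo")

# If the primary model still fails after its retries, the chain falls back to the backup model
chain = prompt | primary.with_fallbacks([backup]) | StrOutputParser()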