
Generative AI for DataWeave: Generate Code from Unit Tests


Learn how the MuleSoft ML/AI team developed the DataWeave Codegen project, a generative AI tool aiming to simplify using DataWeave.

The DataWeave Codegen project by the MuleSoft ML/AI team is a generative AI tool aimed at simplifying the use of MuleSoft’s powerful data transformation programming language, DataWeave, and making it more accessible for low-code/no-code users. 

The DataWeave language can have a steep learning curve, but DataWeave Codegen simplifies the process by letting users generate DataWeave scripts simply by providing sample input and output data that correspond to the desired transformation. By streamlining the data transformation process, the DataWeave Codegen project enables more users to build Mule applications while accelerating development and time-to-value.

We will describe our exploration of techniques and models that are used to generate DataWeave code given the sample input and output. These results also serve as an analysis of the generalizability of current state-of-the-art models and techniques to a low-resource programming language like DataWeave and the ability of large language models (LLMs) to learn about and adapt to a new language with limited training data. 

Broadly, we explore two types of approaches:

  1. Using private large language models like GPT 3.5 and GPT 4 (from OpenAI) and Claude 1.3 and Claude 2 (from Anthropic) out-of-the-box via their APIs 
  2. Creating training data and using it to fine-tune pretrained open-source 7B parameter LLMs 

We will also describe the methodology we used to evaluate the models and the approaches we tried.


Evaluation methodology for the DataWeave Codegen project

The pass@k metric is an unbiased metric commonly used to evaluate code generation models. To calculate pass@k, n code samples (where n is larger than k) are generated per task. Among these n samples, if c samples are correct (i.e., they pass all the unit tests), then an unbiased estimate of pass@k is computed as follows:

$$\text{pass@}k = \mathbb{E}_{\text{tasks}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

Since our main goal is to provide DataWeave Codegen as a tool inside MuleSoft IDEs like Anypoint Code Builder, we only suggest one or two generated scripts for the user to choose from. As such, we evaluate the models for k=1 and k=2.
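
For reference, here is a minimal sketch in Python of this standard unbiased estimator (in the numerically stable form popularized by the Codex paper), computed per task and then averaged over the benchmark:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n generated samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every possible k-subset contains at least one correct sample
    # 1 - C(n-c, k) / C(n, k), computed as a stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples for a task, 5 of which pass the unit tests
print(pass_at_k(n=20, c=5, k=1))  # 0.25
print(pass_at_k(n=20, c=5, k=2))  # ~0.447
```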

Compilation percentage

One of the downsides of current AI-based code generation tools is that the suggested (generated) code often doesn't compile, let alone produce correct results, which impacts developers' productivity.

In the DataWeave Codegen feature inside the IDE, we plan to check whether the generated code compiles and whether the output it produces on the given sample input matches the given output before suggesting it to the user. Apart from pass@k on non-filtered generations (i.e., without considering whether the generated code compiles), we also report the compilation percentage (the percentage of generated code samples that compile and produce an output) together with pass@k on the generations that compile.

We run the generated code using the DataWeave command-line tool, and the exit code of the process running the code is used to determine whether or not the code compiled successfully.
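
As a rough illustration (not our exact harness; the CLI arguments shown here are an assumption and may differ across DataWeave CLI versions), this check can be driven from Python like so:

```python
import os
import subprocess
import tempfile

def compiles(script: str, input_json: str, timeout: int = 30) -> tuple[bool, str]:
    """Run a generated DataWeave script with the DataWeave CLI and report
    whether it executed successfully, based on the process exit code."""
    with tempfile.TemporaryDirectory() as tmp:
        script_path = os.path.join(tmp, "transform.dwl")
        input_path = os.path.join(tmp, "payload.json")
        with open(script_path, "w") as f:
            f.write(script)
        with open(input_path, "w") as f:
            f.write(input_json)
        # Hypothetical invocation of the DataWeave command-line tool; adjust
        # the arguments to match the installed CLI version.
        proc = subprocess.run(
            ["dw", "-i", "payload", input_path, "-f", script_path],
            capture_output=True, text=True, timeout=timeout,
        )
    return proc.returncode == 0, proc.stdout
```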

Using models out-of-the-box 

The first approach we tried for DataWeave code generation was to prompt pre-trained large language models in zero-shot, one-shot and two-shot settings. 

We evaluated four models out-of-the-box: GPT 3.5 and GPT 4 from OpenAI, and Claude 1.3 and Claude 2 from Anthropic.

For one-shot evaluation, the example in the prompt was chosen such that the input and output formats (JSON, XML, CSV, etc.) of the example matched the input and output formats of the test instance. 
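
A simple sketch of that selection step (the example pool and its fields are hypothetical):

```python
# Hypothetical curated pool of prompt examples, each tagged with its data formats.
EXAMPLE_POOL = [
    {"input_format": "json", "output_format": "json", "prompt": "..."},
    {"input_format": "xml",  "output_format": "json", "prompt": "..."},
    {"input_format": "csv",  "output_format": "json", "prompt": "..."},
]

def pick_one_shot_example(input_format: str, output_format: str) -> dict:
    """Return a prompt example whose formats match those of the test instance."""
    for example in EXAMPLE_POOL:
        if (example["input_format"], example["output_format"]) == (
            input_format.lower(), output_format.lower()
        ):
            return example
    raise ValueError(f"No example for {input_format} -> {output_format}")
```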

For each model and each setting, we tried different prompt styles, including OpenAI's suggestions of using triple backticks, XML-like tags, and more. We'll discuss the best results we obtained for each model below.

The results for the four models at two temperature values (T=0.2 and T=0.8) are shown in the tables below. For all these experiments, we use N=20. These results were obtained from the models in early July 2023 on a benchmark dataset with 66 input-output examples.

GPT 3.5 

Results from GPT 3.5:

| Style     | Temperature | Pass@1 | Compile % | Pass@1 for compiled | Pass@2 |
|-----------|-------------|--------|-----------|---------------------|--------|
| Zero-shot | 0.2         | 0.162  | 0.262     | 0.463               | 0.183  |
| Zero-shot | 0.8         | 0.143  | 0.263     | 0.484               | 0.204  |
| One-shot  | 0.2         | 0.350  | 0.617     | 0.522               | 0.387  |
| One-shot  | 0.8         | 0.327  | 0.577     | 0.467               | 0.392  |
| Two-shot  | 0.2         | 0.308  | 0.588     | 0.466               | 0.333  |
| Two-shot  | 0.8         | 0.299  | 0.569     | 0.430               | 0.347  |

GPT 4 

Results from GPT 4:

| Style     | Temperature | Pass@1 | Compile % | Pass@1 for compiled | Pass@2 |
|-----------|-------------|--------|-----------|---------------------|--------|
| Zero-shot | 0.2         | 0.393  | 0.625     | 0.607               | 0.439  |
| Zero-shot | 0.8         | 0.356  | 0.581     | 0.554               | 0.436  |
| One-shot  | 0.2         | 0.410  | 0.640     | 0.598               | 0.446  |
| One-shot  | 0.8         | 0.354  | 0.607     | 0.522               | 0.445  |
| Two-shot  | 0.2         | 0.434  | 0.657     | 0.582               | 0.473  |
| Two-shot  | 0.8         | 0.369  | 0.642     | 0.487               | 0.461  |

Claude 1.3

Results from Claude 1.3:

| Style     | Temperature | Pass@1 | Compile % | Pass@1 for compiled | Pass@2 |
|-----------|-------------|--------|-----------|---------------------|--------|
| Zero-shot | 0.2         | 0.151  | 0.412     | 0.310               | 0.170  |
| Zero-shot | 0.8         | 0.145  | 0.330     | 0.323               | 0.209  |
| One-shot  | 0.2         | 0.213  | 0.451     | 0.449               | 0.244  |
| One-shot  | 0.8         | 0.237  | 0.396     | 0.516               | 0.324  |
| Two-shot  | 0.2         | 0.234  | 0.472     | 0.450               | 0.262  |
| Two-shot  | 0.8         | 0.230  | 0.398     | 0.525               | 0.314  |

Claude 2 

Results from Claude 2: 

| Style     | Temperature | Pass@1 | Compile % | Pass@1 for compiled | Pass@2 |
|-----------|-------------|--------|-----------|---------------------|--------|
| Zero-shot | 0.2         | 0.298  | 0.475     | 0.521               | 0.307  |
| Zero-shot | 0.8         | 0.244  | 0.414     | 0.494               | 0.314  |
| One-shot  | 0.2         | 0.289  | 0.527     | 0.482               | 0.314  |
| One-shot  | 0.8         | 0.271  | 0.521     | 0.418               | 0.323  |
| Two-shot  | 0.2         | 0.290  | 0.575     | 0.437               | 0.321  |
| Two-shot  | 0.8         | 0.270  | 0.534     | 0.416               | 0.333  |

Analyzing the results

Overall, we see that GPT 4 produces the most reliable code, with a Pass@1 of 0.434 in the two-shot setting (T=0.2). The models also perform better in a one-shot setting than in a zero-shot setting, but there is no significant improvement in the two-shot setting over the one-shot setting.

We also see that GPT 3.5 and GPT 4 outperform Claude 1.3 and Claude 2 for this task. The difference in performance is most evident in the compilation percentages, suggesting that GPT 3.5 and GPT 4 achieve higher performance by virtue of generating more compilable code. GPT 3.5 and Claude 2 achieve similar Pass@1 on the compiled code, suggesting that they have similar abilities to generate the right code for the implicitly specified task, while GPT 4 exceeds both GPT 3.5 and Claude 2 in this regard.

Contrary to other reports, we found improved performance from the March 2023 version of GPT 4 to the June 2023 version; the March 2023 version produced results similar to those of GPT 3.5. Intuitively, this shows a difference in behavior between a popular language like Python and a low-resource language like DataWeave.

Those reports state that GPT 4's generations become more verbose and contain more comments. A plausible explanation is the variety of tasks (like code summarization, code-comment generation, etc.) and data (comments present in the Python code seen by the model) that GPT 4 might be trained on and used for in the case of Python. Such variety is uncommon in a language like DataWeave, possibly explaining the absence of such effects.

Fine-tuning open-source models

We also fine-tuned a few 7B parameter models, specifically Bloomz 7b1, Salesforce XGen 4K, and Salesforce XGen Instruct 8K, on a dataset we prepared. All three models have a <1% pass@k before fine-tuning. These decoder-only models were fine-tuned for the causal language modeling task of predicting the token that follows a given sequence of tokens. The prompt below shows what a typical example in the dataset looked like:

Generate DataWeave code to transform the input specified in triple backticks to the output specified in triple backticks. 

Input: 
```
[{"data": "12-jan-22"}]
```

Output: 
```
[
  {
    "Date": "2022-01-12"
  }
]
```

<code>
%dw 2.0
output application/json
---
payload map ((item, index) -> {
    Date: (item.data as Date {format: "dd-MMM-yy"}) as String {format: "yyyy-MM-dd"}
})
</code> 

Each example contains the instruction prefix at the top, followed by the input and the expected output payloads enclosed in triple backticks, followed by the code (which is used to generate the output from the input) in <code></code> tags.

<code></code> tags were used to enclose the code because they help extract just the DataWeave code from the generated output. Also, asking the model to produce a closing </code> tag at the end helped it terminate the generated DataWeave code without extraneously repeating parts of the code, leading to higher compilation percentages.

Generated code without <code> tags:

%dw 2.0
output application/json

(payload.JMH filter ($.Code == "B"))[0]
filter ($.Code == "B"))[0]
filter ($.Code == "B"))[0]
filter

Generated code with <code> tags:

%dw 2.0
output application/json

(payload.JMH filter ($.Code == "B"))[0]
</code>
</code>
</code>
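
A minimal sketch of the corresponding post-processing (our exact implementation may differ): keep only the text up to the first closing </code> tag, which drops the repeated trailing tags.

```python
def extract_dataweave(generation: str) -> str:
    """Keep only the DataWeave script: everything before the first closing
    </code> tag, with any opening <code> tag stripped."""
    code = generation.split("</code>")[0]
    return code.replace("<code>", "").strip()
```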

Data preparation

Using the aforementioned format to prepare data primarily involves curating realistic input-output-code tuples. The training data we curated was sourced from community discussions on help.mulesoft.com and from the DataWeave documentation on docs.mulesoft.com.

A data extraction pipeline was built to extract data from the community forums, which contain questions and answers to those questions. Forum posts tend to contain the input and expected output in the question and the right code for the transformation in the answer.

We use a combination of rule-based approaches and GPT 3.5 to extract the input and expected output from the question and the code from the answer. The previously mentioned command-line tool is then used to verify whether the extracted code compiles correctly and whether it generates the right output given the input, by matching the generated output to the expected output. Successful (input, output, code) tuples are added to the training dataset used to fine-tune a large language model.

Tuples for which the generated output matches the expected output are added to a set we call the matched set. The other tuples, for which some output is generated upon successful compilation but does not match the expected output, form the unmatched set. These two sets are used during different phases of fine-tuning the models.
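
A sketch of how a candidate tuple might be routed into these sets (the compiles helper is the hypothetical one sketched earlier, assumed to return the exit status and the produced output):

```python
import json

def classify_tuple(code: str, input_payload: str, expected_output: str,
                   matched: list, unmatched: list) -> None:
    """Compile and run an extracted candidate, then sort it into the
    matched / unmatched training sets."""
    ok, produced = compiles(code, input_payload)  # helper sketched earlier
    if not ok:
        return  # discard tuples that do not compile
    try:
        # Compare JSON outputs structurally so whitespace differences don't matter.
        same = json.loads(produced) == json.loads(expected_output)
    except json.JSONDecodeError:
        same = produced.strip() == expected_output.strip()
    (matched if same else unmatched).append(
        {"input": input_payload, "output": expected_output, "code": code}
    )
```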

Inputs and outputs for the data extraction pipeline

Fine-tuning technique

For efficient training, we use the method introduced in the paper Low-Rank Adaptation of Large Language Models (LoRA), from which the image below is taken. This method freezes the original weights of the pre-trained model being fine-tuned, and pairs of rank-decomposition weight matrices (A and B in the image) are added alongside the existing weights.

Only A and B are trained

Only these newly added weights are updated during training, significantly reducing the total number of trainable parameters. The Hugging Face PEFT library was used to implement LoRA for our training. Using LoRA allowed all models to be trained on an AWS SageMaker ml.g5.4xlarge instance, which contains one NVIDIA A10G Tensor Core GPU.
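
For illustration, attaching LoRA adapters to one of these models with PEFT looks roughly like the following (the model identifier, rank, and other hyperparameters are placeholders rather than our exact settings):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Placeholder base model; Bloomz 7b1 or XGen Instruct would be loaded the same way.
model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/xgen-7b-4k-base", trust_remote_code=True
)

# Freeze the pretrained weights and attach low-rank A/B adapter matrices.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # rank of the decomposition (placeholder value)
    lora_alpha=32,     # scaling factor (placeholder value)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # module names depend on the architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```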

Reports from the literature, as well as our own experiments, show that these parameter-efficient techniques perform as well as fine-tuning the full model.

Fine-tuning details

From the data preparation strategy mentioned above, we obtained two sets of data: matched (1,287 samples) and unmatched (3,725 samples). We found the best results when training the models on the unmatched set for one epoch and then on the matched set for two epochs in the case of the XGen models and three epochs for the Bloomz 7b1 model.

On average, 1 epoch of stage-1 training took about 46 minutes while 1 epoch of stage-2 training took about 14 minutes on an ml.g5.4xlarge instance.

Results 

The table below shows the results obtained from these models on the same benchmark dataset used for GPT 3.5, GPT 4, and Claude, with T=0.2 (which showed higher Pass@1 and compile percentages than T=0.8), N=10, and a maximum of 512 new tokens generated.

| Model            | Setting   | Pass@1 | Compile % | Pass@1 for compiled | Pass@2 |
|------------------|-----------|--------|-----------|---------------------|--------|
| XGen Instruct 8K | Zero-shot | 0.189  | 0.618     | 0.291               | 0.218  |
| XGen Instruct 8K | One-shot  | 0.256  | 0.650     | 0.342               | 0.255  |
| XGen Instruct 8K | Two-shot  | 0.238  | 0.658     | 0.318               | 0.251  |
| XGen 4K          | Zero-shot | 0.171  | 0.574     | 0.261               | 0.193  |
| XGen 4K          | One-shot  | 0.312  | 0.667     | 0.413               | 0.364  |
| XGen 4K          | Two-shot  | 0.316  | 0.647     | 0.442               | 0.376  |
| Bloomz 7b1       | Zero-shot | 0.175  | 0.604     | 0.237               | 0.198  |
| Bloomz 7b1       | One-shot  | 0.151  | 0.560     | 0.270               | 0.172  |
| Bloomz 7b1       | Two-shot  | 0.121  | 0.545     | 0.222               | 0.130  |

From the results above, the three fine-tuned models perform similarly in the zero-shot setting. In the one-shot and two-shot settings, the Salesforce XGen 4K model outperforms the Bloomz 7b1 model; it also outperforms Claude 1.3 and Claude 2 from above, but not GPT 3.5 or GPT 4.

The fine-tuned models also tend to have marginally higher compilation percentages but lower pass@k, suggesting that fine-tuning has taught them the syntax of the language to a good extent, while they still struggle to generate the right code for the implicitly specified task. The effectiveness of fine-tuning is further underscored by the fact that the models had <1% pass@1 before the fine-tuning process.

Future scope of the DataWeave Codegen project

For the fine-tuned models, an immediate avenue for future work is to improve the pass@k of the code that compiles. This may be achieved by adding more high-quality data (like the matched set that was used).

Additionally, we are investing in reinforcement learning methods like RLHF that provide signals to the LLM based on whether the generated code compiles and produces the right output.

Based on the observed trajectory, we believe we will soon outperform private third-party models for DataWeave code generation by fine-tuning the Salesforce XGen model. By doing so, we would achieve the best pass@k metrics compared to private models such as GPT 3.5/4 or Claude, and ongoing inference latency would be much lower given the smaller size of the XGen model.

This content was created in collaboration between the author and the following individuals: Yazdan Jamshidi and Hadi Minooei.

Shruthan Radhakrishna
