User Study

Group  User  Programming Skill (1–5)  Python Familiarity (1–5)  Education Level  Accuracy (%)  Time (s)
1      P1    2                        2                         Bachelor         65            2592
       P11   3                        4                         Bachelor         45            3336
2      P2    3                        3                         Bachelor         70            2063
       P12   3                        2                         Bachelor         70            3833
3      P3    4                        4                         PhD              90            1104
       P13   3                        3                         PhD              70            1077
4      P4    5                        5                         Research Staff   75            1510
       P14   5                        5                         Research Staff   75            4780
5      P5    4                        4                         Bachelor         80            1916
       P15   4                        4                         Master           80            2241
6      P6    1                        2                         Master           85            1880
       P16   2                        2                         Master           85            2026
7      P7    4                        4                         Bachelor         80            1883
       P17   4                        4                         Bachelor         45            1407
8      P8    3                        2                         Master           85            2114
       P18   3                        2                         Master           80            3760
9      P9    3                        4                         PhD              85            1300
       P19   3                        2                         Master           80            2557
10     P10   4                        4                         PhD              85            1390
       P20   3                        3                         PhD              65            3617
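For reference, the per-group means can be recomputed directly from the table (values transcribed from the rows above):

```python
# Accuracy (%) and completion time (s) transcribed from the study table.
eg_accuracy = [65, 70, 90, 75, 80, 85, 80, 85, 85, 85]   # P1-P10 (experimental group)
cg_accuracy = [45, 70, 70, 75, 80, 85, 45, 80, 80, 65]   # P11-P20 (control group)
eg_time = [2592, 2063, 1104, 1510, 1916, 1880, 1883, 2114, 1300, 1390]
cg_time = [3336, 3833, 1077, 4780, 2241, 2026, 1407, 3760, 2557, 3617]

def mean(xs):
    return sum(xs) / len(xs)

print(f"EG mean accuracy: {mean(eg_accuracy):.1f}%  |  CG mean accuracy: {mean(cg_accuracy):.1f}%")
print(f"EG mean time: {mean(eg_time):.1f}s  |  CG mean time: {mean(cg_time):.1f}s")
```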

To ensure the reliability and fairness of our user study, we carefully designed a participant pairing strategy based on self-reported metrics of programming background. Specifically, each experimental group (EG) participant (P1–P10) was paired with a control group (CG) counterpart (P11–P20) who exhibited comparable programming skill, Python familiarity, and education level.

This matched-pairing design minimizes individual-level variance and allows us to isolate the effect of explanations on semantic code search performance.
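The matched-pairing idea can be sketched as a greedy nearest-match over the self-reported metrics. This is a minimal illustration, not the study's actual procedure; the field names and distance function are assumptions:

```python
# Minimal sketch of matched pairing on self-reported metrics.
# Field names ("skill", "python", "education") and the distance
# function are illustrative assumptions.

def pair_participants(eg, cg):
    """Greedily pair each EG participant with the closest unmatched CG participant."""
    pairs = []
    remaining = list(cg)
    for e in eg:
        # Absolute difference over skill and Python familiarity,
        # plus a penalty when education levels differ.
        best = min(
            remaining,
            key=lambda c: (
                abs(e["skill"] - c["skill"])
                + abs(e["python"] - c["python"])
                + (0 if e["education"] == c["education"] else 1)
            ),
        )
        remaining.remove(best)
        pairs.append((e["id"], best["id"]))
    return pairs

# Toy example with two participants per group (profiles taken from the table):
eg = [{"id": "P1", "skill": 2, "python": 2, "education": "Bachelor"},
      {"id": "P3", "skill": 4, "python": 4, "education": "PhD"}]
cg = [{"id": "P11", "skill": 3, "python": 4, "education": "Bachelor"},
      {"id": "P13", "skill": 3, "python": 3, "education": "PhD"}]
print(pair_participants(eg, cg))  # → [('P1', 'P11'), ('P3', 'P13')]
```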

Experimental Group (EG) Interface

In this interface, participants are shown additional explainability information. For each recommended code snippet, the interface highlights which concepts in the query align with specific parts of the code, helping users understand why the code was retrieved.

Control Group (CG) Interface

This interface simulates a traditional code search experience. It presents the same queries and retrieved code snippets as the experimental group, but without any explanation or concept-level alignment information.

We categorize the 20 study queries into five coarse-grained everyday programming task types. For each type, we show one representative example using two screenshots: the query and its ground-truth code.

Data I/O
Read/write data from files/directories/databases; persist results into CSV/JSON/DB.
Query (Representative Example)
Ground-Truth Code (Correct Answer)
Data Conversion
Convert values to target types/formats to meet downstream interface expectations.
Query (Representative Example)
Ground-Truth Code (Correct Answer)
Parsing & Extraction
Parse or extract information from streams, attributes, timestamps, or binary headers.
Query (Representative Example)
Ground-Truth Code (Correct Answer)
Data Validation
Validate inputs/data against constraints, rules, and allowed sets.
Query (Representative Example)
Ground-Truth Code (Correct Answer)
System and Configuration Management
Manage runtime states/configurations, render templates, refresh routes, and present outputs.
Query (Representative Example)
Ground-Truth Code (Correct Answer)
Limitations
  1. Sample Size and Diversity
    While we recruited 20 participants from reputable institutions and ensured diversity in education levels and programming backgrounds, the relatively small sample size may limit the generalizability of the findings. Future studies with larger and more varied populations would strengthen external validity.
  2. Self-Reported Proficiency Matching
    Participant pairing was based on self-assessed programming skill and Python familiarity. Such subjective assessments may not perfectly reflect actual competence, potentially introducing pairing imbalances that affect internal validity.
  3. Lack of Long-Term Evaluation
    The study measured short-term task performance under time-limited, artificial conditions. It remains unclear whether the advantages of the explanation interface would persist in longer-term, real-world development scenarios.