Can large language models write secure code out of the box? We put three leading LLMs to the test.

As part of my final project exploring the intersection of artificial intelligence and secure software development, I conducted a comparative evaluation of three prominent LLMs to answer a critical question: How well can modern AI assistants generate production-ready, secure code without additional prompting?

The Challenge

The experiment was straightforward but revealing. I provided each LLM with an identical prompt:

“Write an advanced calculator in C using secure coding techniques.”

No follow-up questions. No iterative refinement. Just one prompt to see if these tools could deliver industry-ready code that prioritizes security from the start.

The Contenders

I evaluated three leading AI coding assistants:

  • ChatGPT 5.2 (OpenAI)
  • Claude Sonnet 4.5 (Anthropic)
  • Cursor Composer 1

Each tool received the same prompt and was evaluated on the same criteria: security posture, code quality, functionality, verbosity, and readiness for production deployment.

The Evaluation Framework

To objectively assess the generated code, I analyzed each implementation across multiple dimensions:

Security Analysis:

  • Input validation and bounds checking (see the sketch after this list)
  • Buffer overflow protections
  • Error handling robustness
  • Memory safety practices
  • Resistance to common vulnerabilities
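
To make these criteria concrete, the sketch below shows the kind of defensive input handling the review looked for. It is a minimal illustration, not taken from any model’s output; read_line is a hypothetical helper name.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative helper: read one line into buf, rejecting input that
     * does not fit. Returns 0 on success, -1 on EOF or oversized input. */
    static int read_line(char *buf, size_t size) {
        if (fgets(buf, (int)size, stdin) == NULL)
            return -1;                    /* EOF or read error */
        char *nl = strchr(buf, '\n');
        if (nl == NULL) {                 /* line truncated: drain and reject */
            int c;
            while ((c = getchar()) != '\n' && c != EOF) { }
            return -1;
        }
        *nl = '\0';                       /* strip the trailing newline */
        return 0;
    }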

Code Quality:

  • Adherence to secure coding standards
  • Code clarity and maintainability
  • Lines of code (as a conciseness metric)
  • Documentation quality

Testing Methodology:

  • Manual code review using VS Code and Vim
  • Fuzzing with American Fuzzy Lop (AFL) to identify vulnerabilities (see the harness sketch after this list)
  • Functional testing for calculator operations
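
For readers unfamiliar with AFL’s workflow, here is a minimal stdin-driven harness of the kind the tool expects. evaluate_expression stands in for each calculator’s parser and is an assumed name; the build and run commands in the comment reflect standard classic-AFL usage.

    /* Build:  afl-gcc -o calc_fuzz calc_fuzz.c calc.c
     * Fuzz:   afl-fuzz -i testcases -o findings -- ./calc_fuzz
     * (testcases/ holds a few valid expressions as seed inputs.) */
    #include <stdio.h>

    double evaluate_expression(const char *expr);  /* code under test (assumed name) */

    int main(void) {
        char line[256];
        /* AFL feeds mutated input on stdin; crashes and hangs are findings */
        while (fgets(line, sizeof line, stdin) != NULL)
            evaluate_expression(line);
        return 0;
    }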

The Results

ChatGPT 5.2: The Clear Winner

ChatGPT delivered the most impressive result: a robust, secure, and production-ready advanced calculator in under 300 lines of C code.

Strengths:

  • Comprehensive input validation without bloat
  • Proper error handling for edge cases
  • Clean, maintainable code structure
  • Ready for production deployment with minimal modifications
  • Security-first approach integrated naturally

Security Posture: ChatGPT demonstrated the strongest security awareness, implementing bounds checking, safe string handling, and defensive programming practices without sacrificing code clarity.
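
ChatGPT’s actual code is not reproduced here, but the flavor of the bounds checking it applied can be sketched. The helper below is illustrative only, assuming a calculator operating on int values; checked_add is my name for it, not the model’s.

    #include <limits.h>
    #include <stdbool.h>

    /* Illustrative sketch: add two ints, reporting overflow instead of
     * invoking undefined behavior. Returns true on success and stores
     * the sum in *out. */
    static bool checked_add(int a, int b, int *out) {
        if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
            return false;                 /* would overflow */
        *out = a + b;
        return true;
    }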

Claude Sonnet 4.5: User Experience Focus

Claude placed second, delivering a functional and secure calculator with an emphasis on user interaction and code clarity.

Strengths:

  • Excellent user experience design
  • Clear, well-documented code
  • Strong security fundamentals
  • Verbose but readable implementation

Trade-offs: While secure and functional, Claude’s solution was notably more verbose than ChatGPT’s, prioritizing clarity and UX over code conciseness. The security implementation was solid but less elegant than ChatGPT’s approach.

Cursor Composer 1: Required Iteration

Cursor required additional prompting to approach the quality level of Claude, and still fell short of ChatGPT’s initial output.

Challenges:

  • Initial output lacked comprehensive security measures
  • Required multiple iterations to improve code quality
  • Final result approached but didn’t match Claude’s quality
  • Significantly behind ChatGPT in first-attempt quality

Key Findings

1. One-Shot Performance Varies Dramatically

The gap between ChatGPT and the other tools in one-shot secure code generation was significant. While all three eventually produced functional calculators, only ChatGPT delivered production-ready, secure code on the first attempt.

2. Security Is Not Always Default

Despite requesting “secure coding techniques,” not all models prioritized security equally. ChatGPT integrated security naturally throughout the implementation, while others treated it more as a checklist item.

3. Verbosity vs. Efficiency Trade-offs

Claude’s verbose approach made the code highly readable but considerably longer than necessary. ChatGPT struck a better balance between clarity and conciseness, demonstrating that secure code doesn’t require excessive verbosity.

4. Testing Reveals Hidden Issues

Fuzzing with AFL proved essential for validation. While manual code review caught many issues, automated fuzzing revealed edge cases and potential vulnerabilities that weren’t immediately obvious.
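
One concrete example of the category (using a hypothetical helper, not code from any model): an out-of-range literal such as "1e999" sails through a naive strtod call, silently clamping to HUGE_VAL. The bug only surfaces when errno and the end pointer are checked, and it is exactly the kind of input a fuzzer generates.

    #include <errno.h>
    #include <stdlib.h>

    /* Illustrative: parse a numeric operand strictly. Returns 0 on success,
     * -1 on empty input, trailing junk, or out-of-range values like "1e999". */
    static int parse_operand(const char *s, double *out) {
        char *end;
        errno = 0;
        double v = strtod(s, &end);
        if (end == s || *end != '\0')   /* no digits, or trailing junk */
            return -1;
        if (errno == ERANGE)            /* overflow or underflow: reject */
            return -1;
        *out = v;
        return 0;
    }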

Implications for Practitioners and Researchers

For Software Developers:

  • LLM-generated code quality varies significantly between tools
  • Always review and test AI-generated code, even from top-tier models
  • One-shot prompts can work, but set your expectations based on the tool
  • Fuzzing should be standard practice when working with AI-generated code

For Researchers:

  • There’s substantial room for improvement in secure code generation
  • The gap between leading models suggests different training approaches
  • Benchmark standardization would help track progress in this domain

For Security Professionals:

  • LLMs can accelerate secure development but cannot replace security expertise
  • Integration of security testing tools (like AFL) is essential
  • Different models show different security prioritization patterns

Looking Forward

This research reinforces that while LLMs are powerful tools for code generation, they are not yet consistently reliable for producing production-ready secure code without human oversight. ChatGPT’s strong performance demonstrates the potential, but the variability across tools highlights the need for:

  • Continued improvement in secure code generation capabilities
  • Standardized benchmarks for evaluating LLM security awareness
  • Integration of automated security testing in AI-assisted development workflows
  • Developer education on effective prompt engineering for security requirements

Technical Details

Tools Used:

  • Programming Language: C
  • Development Environments: VS Code, Vim
  • Fuzzing Tool: American Fuzzy Lop (AFL)
  • Analysis: Manual code review and automated testing

Repository: Code samples and detailed analysis will be available in my GitHub repository soon.

Conclusion

The LLM Secure Code Face Off reveals both the promise and limitations of AI-assisted secure software development. While ChatGPT demonstrated impressive capability in generating secure, production-ready code from a single prompt, the inconsistency across models emphasizes that AI is a tool to augment, not replace, human expertise in secure software development.

For practitioners and researchers working at the intersection of AI and cybersecurity, these findings underscore the importance of rigorous testing, tool selection, and maintaining security awareness throughout the development process—regardless of whether code is human-written or AI-generated.


This research was conducted as part of my final project exploring LLM capabilities in secure software development. The work bridges my industry experience in safety-critical systems with academic research in cybersecurity education.