Peking University launches "strongest programming assistant": 7-billion-parameter code model CodeShell open-sourced, tops benchmarks

2025-01-18 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 11/24 report --

CTOnews.com, October 19 news: the Knowledge Computing Laboratory of Peking University's National Engineering Research Center for Software Engineering, together with Sichuan Tianfu Bank's AI Laboratory, today officially open-sourced CodeShell, a 7-billion-parameter code model billed as "the strongest code foundation model of its size."

The team has released the model, its supporting tooling, and IDE plug-ins on GitHub, with commercial use permitted. Interested readers can find the project there.

▲ Source: official GitHub project

CTOnews.com learned from the project page that CodeShell-7B was trained from scratch on 500 billion tokens, has a context window of 8,192 tokens, and combines core architectural features of both StarCoder and Llama.

The team says CodeShell's pre-training corpus was built from GitHub data it crawled itself, the Stack and StarCoder datasets, and a small amount of "high-quality Chinese and English data." The data passed through a pipeline of steps including deduplication, filtering rules, and data-quality models.

CodeShell uses a vocabulary of 70,000 tokens. Its compression ratios for Chinese, English, and code are 2.83, 3.29, and 3.21 respectively, supporting balanced and efficient encoding and decoding across all three.
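"Compression ratio" here means characters of input text per emitted token: a higher ratio means the vocabulary packs the same text into fewer tokens. A minimal sketch of the metric, using a toy whitespace tokenizer as a stand-in for CodeShell's actual BPE vocabulary:

```python
def compression_ratio(text: str, tokenize) -> float:
    """Characters of input per token produced; higher means the
    vocabulary encodes the text more compactly."""
    tokens = tokenize(text)
    return len(text) / len(tokens)

# Toy whitespace tokenizer -- a stand-in, NOT CodeShell's real tokenizer.
toy_tokenizer = lambda s: s.split()

print(compression_ratio("a b c", toy_tokenizer))  # 5 chars / 3 tokens
```

Run against a real tokenizer, the same function reproduces figures like the 3.21 reported for code.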

On training performance: to maximize distributed-training efficiency, CodeShell is built on Megatron-LM. The team says it made "deep customizations to attention-operator optimization, data preprocessing, data loading, log output, state monitoring, distributed-training management, and more," supports Flash Attention 2 acceleration, and reached an industry-leading training throughput of 3,400 tokens per second per GPU.

On the code-evaluation benchmarks HumanEval and MBPP, CodeShell surpasses CodeLlama-7B and StarCoderBase-7B, and it also leads on HumanEval's other programming languages, such as JavaScript and Java.
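HumanEval and MBPP scores are conventionally reported as pass@k: the estimated probability that at least one of k sampled completions for a problem passes its unit tests. The announcement does not say which estimator was used; the standard unbiased formula from the Codex paper is sketched below:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n completions sampled per problem, c of them
    correct; probability a random size-k subset contains a correct one."""
    if n - c < k:  # too few failures to fill a k-subset: success is certain
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 5, 1))  # 0.5 -- half the samples pass, so pass@1 is 50%
```

Benchmark harnesses average this quantity over all problems in the suite.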

▲ Source: official GitHub project

The team also introduced CodeShell-Chat, a CodeShell-based "all-round code assistant." The AI tool supports dialogue, code generation, code completion, code commenting, code review, test-case generation, and other functions.

As for the IDE plug-in, it currently supports VS Code and IntelliJ IDEA, works with a range of mainstream programming languages, and offers a "focus mode" and an "interaction mode" to improve developer efficiency.




© 2024 shulou.com SLNews company. All rights reserved.
