SpikingBrain 7B – More efficient than classic LLMs
SpikingBrain: Spiking Brain-inspired Large Models
📄 Technical Report: Chinese | English
🚀 Arxiv: arXiv:2509.05276
🧩 Models: Available Models
About SpikingBrain
Inspired by brain mechanisms, SpikingBrain integrates hybrid efficient attention, MoE modules, and spike encoding into its architecture, supported by a universal conversion pipeline compatible with the open-source model ecosystem. This enables continual pre-training with less than 2% of the data while achieving performance comparable to mainstream open-source models. We further adapt frameworks, operators, parallel strategies, and communication primitives for non-NVIDIA (MetaX) clusters, ensuring stable large-scale training and inference. SpikingBrain achieves over 100× speedup in TTFT (time to first token) for 4M-token sequences, while spiking delivers over 69% sparsity at the micro level. Combined with macro-level MoE sparsity, these advances provide valuable guidance for the design of next-generation neuromorphic chips.
Project Structure
This repository provides the full implementation and weights of SpikingBrain-7B, including the HuggingFace version, vLLM inference version, and quantized version, enabling flexible deployment and research across different scenarios.
```
SpikingBrain-7B/
├── hf_7B_model/        # HuggingFace version
├── run_model/          # Model run examples
├── vllm_hymeta/        # vLLM plugins and inference support
├── W8ASpike/           # Quantized inference version
├── setup.py
├── requirements.txt
└── README.md
```
vLLM-HyMeta
vllm-hymeta is the plugin adaptation of HyMeta (Hybrid Models built on MetaX GPUs) for the vLLM inference framework, providing efficient inference support on NVIDIA GPUs.
By leveraging the plugin mechanism in vLLM, hardware backends can be integrated in a modular fashion, bringing the following benefits (a sketch of how such a plugin is registered follows the list):
- Decoupled codebase: Backend-specific code remains independent, keeping the vLLM core cleaner.
- Reduced maintenance cost: vLLM developers can focus on general functionality without being affected by backend-specific implementations.
- Faster integration: New backends can be integrated quickly and evolve independently with less engineering effort.
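The plugin itself hooks into vLLM through Python entry points that vLLM scans at startup. Below is a minimal sketch of how such a backend plugin is typically packaged; the entry-point group and the `register` callable are illustrative assumptions, not taken from this repository's `setup.py`.

```python
# Hypothetical setup.py sketch for a vLLM out-of-tree backend plugin.
# vLLM discovers plugins via setuptools entry points; the group name and
# callable below are illustrative assumptions, not this repo's actual config.
from setuptools import setup, find_packages

setup(
    name="vllm-hymeta",
    version="0.1.0",
    packages=find_packages(),
    entry_points={
        # At startup vLLM loads every entry point in its plugin group and
        # calls it so the plugin can register custom models, ops, or platforms.
        "vllm.general_plugins": [
            "hymeta = vllm_hymeta:register",
        ],
    },
)
```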
Container Deployment (NVIDIA)
```bash
sudo docker run -itd \
  --entrypoint /bin/bash \
  --network host \
  --name hymeta-bench \
  --shm-size 160g \
  --gpus all \
  --privileged \
  -v /host_path:/container_path \
  docker.1ms.run/vllm/vllm-openai:v0.10.0
```
Plugin Installation
```bash
git clone https://github.com/BICLab/SpikingBrain-7B.git
cd SpikingBrain-7B
pip install .
```
Recommended environment for installing vllm-hymeta on NVIDIA GPUs:
```
decorator
pyyaml
scipy
setuptools
setuptools-scm
flash_attn==2.7.3
flash-linear-attention==0.1
vllm==0.10.0
torch==2.7.1
```
Run with vLLM
The simplest way to serve a model with vLLM is the `vllm serve <model_path>` command. Add `--tensor-parallel-size` and `--pipeline-parallel-size` when launching if you want to run with multiple GPUs.
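Once the server is up, it exposes an OpenAI-compatible API. The snippet below is a minimal sketch that assumes the default address `http://localhost:8000/v1` and the `openai` Python package; adjust the base URL and model name to your launch settings.

```python
# Query a vLLM OpenAI-compatible server from Python (pip install openai).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # default `vllm serve` address (assumption)
    api_key="EMPTY",                      # vLLM does not validate the key by default
)

# Ask the server which model it is serving instead of hard-coding a name.
model_name = client.models.list().data[0].id

response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Briefly introduce spiking neural networks."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```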
W8ASpike
W8ASpike is the quantized inference version of SpikingBrain-7B, aiming to reduce inference cost under low-precision settings and explore the potential of Spiking Neural Networks (SNNs).
The current implementation adopts pseudo-spiking, where activations are approximated as spike-like signals at the tensor level, rather than true asynchronous event-driven spiking on neuromorphic hardware.
- Pseudo-spiking: Efficient approximation at the tensor level, suitable for prototyping and research.
- True-spiking: Requires asynchronous hardware and event-driven operator support, which is beyond the scope of this repository.
The activation spike encoding process here is inspired by the pseudo-spiking interfaces from BICLab/Int2Spike. For additional PyTorch-based spiking interfaces, please refer to the Int2Spike library.
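For intuition only, the snippet below shows one way tensor-level pseudo-spiking can be approximated: activations are mapped to a small range of integer "spike counts" plus a per-tensor scale. This is an illustrative assumption, not the exact encoder used in W8ASpike or Int2Spike.

```python
import torch

def pseudo_spike_encode(x: torch.Tensor, num_levels: int = 15):
    """Hypothetical tensor-level pseudo-spiking: represent activations as
    integer spike counts in [-num_levels, num_levels] plus one scale factor."""
    scale = x.abs().max().clamp(min=1e-8) / num_levels
    counts = torch.round(x / scale).clamp(-num_levels, num_levels)
    return counts.to(torch.int8), scale

def pseudo_spike_decode(counts: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a dense approximation of the original activations."""
    return counts.float() * scale

x = torch.randn(4, 8)
counts, scale = pseudo_spike_encode(x)
x_hat = pseudo_spike_decode(counts, scale)
print("max reconstruction error:", (x - x_hat).abs().max().item())
```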
Available Models
The model weights are hosted on ModelScope. Please select the appropriate version based on your needs:
- Pre-trained model (7B): https://www.modelscope.cn/models/Panyuqi/V1-7B-base
- Chat model (7B-SFT): https://www.modelscope.cn/models/Panyuqi/V1-7B-sft-s3-reasoning
- Quantized weights (7B-W8ASpike): https://www.modelscope.cn/models/Abel2076/SpikingBrain-7B-W8ASpike
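If you prefer downloading programmatically, the ModelScope SDK can fetch a checkpoint by its ID (a sketch assuming `pip install modelscope`; you can equally download from the web pages above):

```python
# Download SpikingBrain-7B weights through the ModelScope SDK.
from modelscope import snapshot_download

# Pick the variant you need; IDs follow the links above.
local_dir = snapshot_download("Panyuqi/V1-7B-base")
print("Weights downloaded to:", local_dir)
```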
Usage
Example scripts are provided in run_model/ for running the model with the released checkpoints.
- Hugging Face
  Load with `AutoModelForCausalLM` and use it as a standard causal LM (forward pass or generation); see `run_model/run_model_hf.py`.
  For the SFT model, a chat template is applied; see `run_model/run_model_hf_chat_template.py`. A minimal loading sketch follows after this list.
- vLLM
  Perform inference using the provided vLLM-HyMeta plugin; see `run_model/run_model_vllm.py` and the vLLM-HyMeta section above.
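A minimal sketch of the Hugging Face path, assuming the weights have been downloaded to a local directory and that the custom SpikingBrain architecture requires `trust_remote_code=True` (see `run_model/run_model_hf.py` for the official script):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./V1-7B-base"  # local path to the downloaded weights (placeholder)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,   # assumption: bf16 inference
    trust_remote_code=True,       # loads the custom model code shipped with the weights
    device_map="auto",
)

inputs = tokenizer("Spiking neural networks are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```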
Performance Evaluation
Table 1: Performance evaluation of the SpikingBrain-7B pre-trained model. All models are tested with the HuggingFace framework and evaluated using a perplexity-based method. Except for Qwen2.5, the other baselines are trained on limited Chinese data, resulting in clear disadvantages on CMMLU and C-Eval.
Table 2: Performance evaluation of the SpikingBrain-76B pre-trained model. All models are tested with the vLLM framework and evaluated using a perplexity-based method. Except for Qwen2.5, the other baselines are trained on limited Chinese data, resulting in clear disadvantages on CMMLU and C-Eval.
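For readers unfamiliar with perplexity-based evaluation: each multiple-choice option is scored by the likelihood the model assigns to it given the question, and the option with the lowest per-token loss is selected. The helper below is a rough sketch of that idea, not the exact protocol used in the report.

```python
import torch

def option_nll(model, tokenizer, prompt: str, option: str) -> float:
    """Average negative log-likelihood of `option` given `prompt` (lower = preferred).
    Illustrative sketch; benchmark harnesses differ in prompt format and details."""
    full = tokenizer(prompt + option, return_tensors="pt").input_ids.to(model.device)
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predict token t from tokens < t
    targets = full[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return nll[:, prompt_len - 1:].mean().item()            # score only the option tokens

# usage: pick the option with the smallest option_nll(model, tokenizer, question, opt)
```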
Citation
If you find our work useful, please consider citing SpikingBrain:
```bibtex
@article{pan2025spikingbrain,
  title={SpikingBrain Technical Report: Spiking Brain-inspired Large Models},
  author={Pan, Yuqi and Feng, Yupeng and Zhuang, Jinghao and Ding, Siyu and Liu, Zehao and Sun, Bohan and Chou, Yuhong and Xu, Han and Qiu, Xuerui and Deng, Anlin and others},
  journal={arXiv preprint arXiv:2509.05276},
  year={2025}
}
```
