Commit 3b38f2b

Update index.md
1 parent 21c0e64 commit 3b38f2b

1 file changed: +10 -10 lines changed

docs/index.md

+10 -10
@@ -1,7 +1,7 @@
# SciCode: A Research Coding Benchmark Curated by Scientists

<p align="center">
4-
<strong>Minyang Tian<sup>1,2*‡</sup>, Luyu Gao<sup>3*</sup>, Shizhuo Dylan Zhang<sup>1</sup>, Xinan Chen<sup>1†</sup>, Cunwei Fan<sup>1†</sup>, Xuefei Guo<sup>1†</sup>, Roland Haas<sup>1†</sup>, Pan Ji<sup>4†</sup>, Kittithat Krongchon<sup>1†</sup>, Yao Li<sup>1†</sup>, Shengyan Liu<sup>1†</sup>, Di Luo<sup>5,6,11†</sup>, Yutao Ma<sup>7†</sup>, Hao Tong<sup>1†</sup>, Kha Trinh<sup>7†</sup>, Chenyu Tian<sup>8†</sup>, Zihan Wang<sup>1†</sup>, Bohao Wu<sup>1†</sup>, Yanyu Xiong<sup>9†</sup>, Shengzhu Yin<sup>1†</sup>, Minhui Zhu<sup>1†</sup>, Kilian Lieret<sup>10</sup>, Yanxin Lu<sup>1</sup>, Genglin Liu<sup>1</sup>, Yufeng Du<sup>1</sup>, Tianhua Tao<sup>1</sup>, Ofir Press<sup>10</sup>, Jamie Callan<sup>3</sup>, Eliu Huerta<sup>1,2,7‡</sup>, Hao Peng<sup>1‡</sup></strong>
4+
<strong>Minyang Tian<sup>1,2*‡</sup>, Luyu Gao<sup>3*</sup>, Shizhuo Dylan Zhang<sup>1</sup>, Xinan Chen<sup>1†</sup>, Cunwei Fan<sup>1†</sup>, Xuefei Guo<sup>1†</sup>, Roland Haas<sup>1†</sup>, Pan Ji<sup>4†</sup>, Kittithat Krongchon<sup>1†</sup>, Yao Li<sup>1†</sup>, Shengyan Liu<sup>1†</sup>, Di Luo<sup>5,6,11†</sup>, Yutao Ma<sup>7†</sup>, Hao Tong<sup>1†</sup>, Kha Trinh<sup>7†</sup>, Chenyu Tian<sup>8†</sup>, Zihan Wang<sup>1†</sup>, Bohao Wu<sup>1†</sup>, Yanyu Xiong<sup>9†</sup>, Shengzhu Yin<sup>1†</sup>, Minhui Zhu<sup>1†</sup>, Kilian Lieret<sup>10</sup>, Yanxin Lu<sup>1</sup>, Genglin Liu<sup>1</sup>, Yufeng Du<sup>1</sup>, Tianhua Tao<sup>1</sup>, Ofir Press<sup>10</sup>, Jamie Callan<sup>3</sup>, Eliu Huerta<sup>1,2,7‡</sup>, Hao Peng<sup>1‡</sup><b>
</p>

<p align="center">
@@ -66,7 +66,7 @@

## Introduction
<p align="justify">
69-
SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of <b> 16 </b> subdomains from </strong>6</strong> domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains </strong>338</strong> subproblems decomposed from </strong>80</strong> challenging main problems, and it offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only </strong>4.6%</strong> of the problems in the most realistic setting. Broadly, SciCode demonstrates a realistic and scientists' everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress towards helpful assistant for scientists but also helps shed light on future building and evaluation of scientific AI.
69+
SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of <b>16</b> subdomains from <b>6</b> domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains <b>338</b> subproblems decomposed from <b>80</b> challenging main problems, and it offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only <b>4.6%</b> of the problems in the most realistic setting. Broadly, SciCode demonstrates a realistic and scientists' everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress towards helpful assistant for scientists but also helps shed light on future building and evaluation of scientific AI.
</p>


@@ -76,24 +76,24 @@ SciCode sources challenging and realistic research-level coding problems across

Among various coding necessities, Scicode mainly focuses on: 1. Numerical methods 2. Simulation of systems 3. Scientific calculation. These are the tasks we believe require intense scientific knowledge and reasoning to optimally test LM’s science capability. The below figure is an example of the combination of 1 and 3.

79-
In designing test cases for evaluation, we incorporate domain-specific test cases in addition to numerical cases. These tests are extracted from real scientific workflows: scientists must design domain-specific test cases to verify code accuracy by reproducing results published in papers or matching analytical solutions derived from theoretical models. Each problem goes through </strong>3</strong> rounds of validation (i.e. by in-domain scientists, out-of-domain scientists, GPT4) for quality control.
79+
In designing test cases for evaluation, we incorporate domain-specific test cases in addition to numerical cases. These tests are extracted from real scientific workflows: scientists must design domain-specific test cases to verify code accuracy by reproducing results published in papers or matching analytical solutions derived from theoretical models. Each problem goes through <b>3</b> rounds of validation (i.e. by in-domain scientists, out-of-domain scientists, GPT4) for quality control.
</p>
![Image Title](figures/SciCode_example_problem.png)
## Benchmark Statistics

84-
| </strong>Fields</strong> | </strong>Subfields</strong> |
84+
| <b>Fields</b> | <b>Subfields</b> |
|----------------------|---------------------------------------------------------------------------------------------------------------|
86-
| </strong>Mathematics</strong> | [Numerical Linear Algebra](problems.md#numerical-linear-algebra) (8), [Computational Mechanics](problems.md#computational-mechanics) (5), [Computational Finance](problems.md#computational-finance) (1) |
87-
| </strong>Physics</strong> | [Condensed Matter Physics](problems.md#condensed-matter-physics) (13), [Optics](problems.md#optics) (10), [Quantum Information/Computing](problems.md#quantum-informationcomputing) (6), [Computational Physics](problems.md#computational-physics) (5), [Astrophysics](problems.md#astrophysics) (2), [Particle Physics](problems.md#particle-physics) (1) |
88-
| </strong>Chemistry</strong> | [Quantum Chemistry](problems.md#quantum-chemistry) (5), [Computational Chemistry](problems.md#computational-chemistry) (3) |
89-
| </strong>Biology</strong> | [Ecology](problems.md#ecology) (6), [Biochemistry](problems.md#biochemistry) (1), [Genetics](problems.md#genetics) (1) |
90-
| </strong>Material Science</strong> | [Semiconductor Materials](problems.md#semiconductor-materials) (7), [Molecular Modeling](problems.md#molecular-modeling) (6) |
86+
| <b>Mathematics</b> | [Numerical Linear Algebra](problems.md#numerical-linear-algebra) (8), [Computational Mechanics](problems.md#computational-mechanics) (5), [Computational Finance](problems.md#computational-finance) (1) |
87+
| <b>Physics</b> | [Condensed Matter Physics](problems.md#condensed-matter-physics) (13), [Optics](problems.md#optics) (10), [Quantum Information/Computing](problems.md#quantum-informationcomputing) (6), [Computational Physics](problems.md#computational-physics) (5), [Astrophysics](problems.md#astrophysics) (2), [Particle Physics](problems.md#particle-physics) (1) |
88+
| <b>Chemistry</b> | [Quantum Chemistry](problems.md#quantum-chemistry) (5), [Computational Chemistry](problems.md#computational-chemistry) (3) |
89+
| <b>Biology</b> | [Ecology](problems.md#ecology) (6), [Biochemistry](problems.md#biochemistry) (1), [Genetics](problems.md#genetics) (1) |
90+
| <b>Material Science</b> | [Semiconductor Materials](problems.md#semiconductor-materials) (7), [Molecular Modeling](problems.md#molecular-modeling) (6) |

![Image Title](figures/SciCode_chart.png)
<p style="text-align: center;">Left: Distribution of Main Problems Right: Distribution of Subproblems</p>

<p align="justify">
96-
We include several research problems that are built upon or reproduce methods used in Nobel Prize-winning studies to highlight current trends in scientific research: the self-consistent field (SCF) method for density functional theory (DFT) calculations (</strong>The Nobel Prize in Chemistry 1998</strong>), the PMNS matrix for neutrino oscillation in matter (</strong>The Nobel Prize in Physics 2015</strong>), the Haldane model for the anomalous quantum Hall effect (</strong>The Nobel Prize in Physics 2016</strong>), optical tweezer simulations for microscopic thermodynamics (</strong>The Nobel Prize in Physics 2018</strong>), and the replica method for spin glasses (</strong>The Nobel Prize in Physics 2021</strong>).
96+
We include several research problems that are built upon or reproduce methods used in Nobel Prize-winning studies to highlight current trends in scientific research: the self-consistent field (SCF) method for density functional theory (DFT) calculations (<b>The Nobel Prize in Chemistry 1998</b>), the PMNS matrix for neutrino oscillation in matter (<b>The Nobel Prize in Physics 2015</b>), the Haldane model for the anomalous quantum Hall effect (<b>The Nobel Prize in Physics 2016</b>), optical tweezer simulations for microscopic thermodynamics (<b>The Nobel Prize in Physics 2018</b>), and the replica method for spin glasses (<b>The Nobel Prize in Physics 2021</b>).
</p>
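
The counts in the Benchmark Statistics table can be cross-checked against the totals quoted in the Introduction. The sketch below is only an illustration: it assumes the number in parentheses after each subfield is that subfield's count of main problems (the table does not say so explicitly), and the names `MAIN_PROBLEMS` and `summarize` are hypothetical, not part of the SciCode codebase.

```python
# Counts copied from the Benchmark Statistics table.
# Assumption: each parenthesized number is a count of main problems.
MAIN_PROBLEMS = {
    "Mathematics": {
        "Numerical Linear Algebra": 8,
        "Computational Mechanics": 5,
        "Computational Finance": 1,
    },
    "Physics": {
        "Condensed Matter Physics": 13,
        "Optics": 10,
        "Quantum Information/Computing": 6,
        "Computational Physics": 5,
        "Astrophysics": 2,
        "Particle Physics": 1,
    },
    "Chemistry": {
        "Quantum Chemistry": 5,
        "Computational Chemistry": 3,
    },
    "Biology": {
        "Ecology": 6,
        "Biochemistry": 1,
        "Genetics": 1,
    },
    "Material Science": {
        "Semiconductor Materials": 7,
        "Molecular Modeling": 6,
    },
}


def summarize(table):
    """Tally fields, subfields, and problems from a {field: {subfield: count}} mapping."""
    per_field = {field: sum(subfields.values()) for field, subfields in table.items()}
    return {
        "fields": len(table),
        "subfields": sum(len(subfields) for subfields in table.values()),
        "main_problems": sum(per_field.values()),
        "per_field": per_field,
    }


summary = summarize(MAIN_PROBLEMS)
print(summary)
```

Under that assumption, the tally gives 16 subfields and 80 problems, matching the subdomain and main-problem totals in the Introduction, spread over the five fields listed in the table.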
## Experiment Results
<p align="justify">
