DOI: https://doi.org/10.1016/j.compind.2022.103733
50 Days Free Access Link: https://authors.elsevier.com/a/1fHOibquFR5MK
00
TL;DR
The application of AI in the architecture, engineering, and construction (AEC) field is attracting significant attention. The construction industry produces a large amount of text (e.g., engineering specifications, contracts, and construction documents) that is rich in domain concepts and semantic features and encodes complex domain knowledge. For example, every engineer can easily understand that “The fire resistance of firewalls in Class A and B factories and Class A, B, and C warehouses should not be less than 4.00h.” But can a computer understand this automatically? Natural language processing (NLP) is expected to deliver breakthroughs here: automating text processing and knowledge sharing, reducing manual input, and supporting the transformation and upgrading of the industry.
However, existing deep learning-based NLP models in this field still rely on a large amount of manual data annotation (as the saying goes: the more intelligence, the more manual work). Therefore, this paper explores how to improve model performance by using the prior knowledge contained in domain text, without additional manual annotation. Specific work includes: (1) establishing and publicly releasing a domain-specific corpus, (2) systematically exploring how different types of corpora and pre-training methods enhance models, and (3) proposing the domain pre-trained model ARCBERT, which automatically learns domain knowledge from unlabeled corpora and significantly improves the performance of various NLP tasks in the domain.
01
Abstract
Information processing and retrieval of unstructured text data based on natural language processing (NLP) is an important task in the construction industry and is receiving increasing attention. With the development of deep learning (DL) technology and open datasets for model pre-training, many NLP methods have been further developed and improved. However, there is currently little research on domain-specific pre-trained language models in the architecture, engineering, and construction (AEC) field and on their advantages. The main reasons are (1) the lack of a unified, publicly available dataset for model evaluation, and (2) the scarcity of publicly available domain corpora for further research.
In the AEC field, deep learning-based methods still require very expensive manual annotations to prepare large training datasets. Therefore, it is necessary to explore how domain unlabeled corpora affect the performance of deep learning-based NLP tasks in the AEC field, thereby further enhancing model performance.
To address the above issues, this study follows the technical roadmap shown in Figure 1. Specifically, it first developed a domain-specific corpus (open-sourced, access link: https://github.com/SkydustZ/AEC-domain-corpora/tree/main/domain%20corpus). Then, various pre-trained language models were trained using two types of domain corpora (in-domain and close-domain) and two pre-training methods (static word embedding and contextual word embedding). On this basis, this paper systematically studied the impact of pre-training corpora and methods on the performance of deep learning models, using two downstream tasks in automated rule checking (ARC), namely text classification (TC) and named entity recognition (NER), as typical cases. It is worth mentioning that the ARCBERT model, further pre-trained on the domain corpus developed in this paper, achieved the best results on typical downstream NLP tasks in ARC, outperforming all other models tested. This means that the performance of various NLP tasks can be improved significantly by automatically learning prior knowledge from the domain corpus, without additional manual annotation.
Figure 1 Technical roadmap for enhancing deep learning models with domain corpora and pre-training methods
02
Development of Domain Corpus
First, a large amount of domain text was collected and organized into two corpora. The in-domain corpus consists of civil engineering regulatory texts, which are also the target texts of ARC tasks (hence “in-domain”). The close-domain corpus consists of other civil engineering materials, including civil engineering encyclopedia entries, which are close to the ARC texts (hence “close-domain”).
The corpora were built by scraping relevant texts from the web and then cleaning the data (filtering irrelevant content and splitting long texts). The resulting domain corpus is summarized in Table 1.
Table 1 Statistics of the Domain Corpus
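The cleaning steps named above (filtering irrelevant content, splitting long texts) could look roughly like the following sketch. The concrete rules here are assumptions for illustration only: the tag-stripping regex, the `min_len` relevance filter, the `max_len` chunk size, and the function names are hypothetical and are not taken from the paper.

```python
import re

def clean_text(raw: str) -> str:
    """Remove HTML tags and collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()

def split_long_text(text: str, max_len: int = 128) -> list[str]:
    """Split after Chinese sentence-ending punctuation, then greedily pack
    sentences into chunks of at most max_len characters. A single sentence
    longer than max_len is kept whole as its own chunk."""
    parts = re.split(r"(?<=[。！？])", text)
    sentences = [s.strip() for s in parts if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_len:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks

def build_corpus(pages: list[str], min_len: int = 5, max_len: int = 128) -> list[str]:
    """Clean each scraped page, split it, and drop very short fragments
    (a crude stand-in for 'filtering irrelevant content')."""
    corpus = []
    for page in pages:
        for chunk in split_long_text(clean_text(page), max_len):
            if len(chunk) >= min_len:
                corpus.append(chunk)
    return corpus
```

A real pipeline would likely need site-specific filters (navigation text, tables of contents, duplicated clauses); the length threshold above only shows where such a filter would sit.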
03
Pre-trained Models
Pre-trained word vector models can be divided into static word embedding models and dynamic word embedding models (also known as contextual word embedding models). Static word embedding models assume that the meaning of a word does not change with context, so each word always maps to the same vector. Dynamic word embedding models assume that the meaning of a word changes with context, so the same word can receive different vectors in different sentences.
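The difference between the two families can be illustrated with a toy sketch. Everything in it is made up for illustration: the two-dimensional vectors, the vocabulary, and the "contextual" encoder, which simply mixes each word vector with the sentence mean. Real contextual models such as BERT use self-attention instead.

```python
# Toy 2-dimensional vectors; the values are invented for illustration only.
STATIC_TABLE = {
    "fire": [1.0, 0.0],
    "wall": [0.0, 1.0],
    "brick": [-1.0, 0.0],
}

def static_embed(tokens):
    """Static embedding: a plain table lookup, independent of neighbours."""
    return [STATIC_TABLE[t] for t in tokens]

def contextual_embed(tokens):
    """Toy contextual embedding: average each token's static vector with the
    mean vector of the whole sentence, so the result depends on context."""
    vectors = static_embed(tokens)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [[(v[d] + mean[d]) / 2 for d in range(dim)] for v in vectors]

sentence_a = ["fire", "wall"]
sentence_b = ["brick", "wall"]

# "wall" gets the same static vector in both sentences...
same_static = static_embed(sentence_a)[1] == static_embed(sentence_b)[1]
# ...but a different contextual vector, because its neighbours differ.
same_contextual = contextual_embed(sentence_a)[1] == contextual_embed(sentence_b)[1]
```

Here `same_static` is true and `same_contextual` is false, which is the behavioural difference the two assumptions describe.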
For static word embedding models, the technical roadmap is shown in Figure 2. Four word embedding models were trained with the skip-gram model on the Chinese Wikipedia corpus (referred to as the Wiki corpus) and the domain corpora: (1) a general model, trained on the Wiki corpus only; (2) an in-domain model, trained on the Wiki and in-domain corpora; (3) a close-domain model, trained on the Wiki and close-domain corpora; (4) a mixed-domain model, trained on the Wiki, in-domain, and close-domain corpora.
Figure 2 Method for enhancing deep learning models with architecture-specific static word embedding models
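As a rough sketch of this setup, the snippet below assembles the four training corpora and generates the (center, context) pairs that a skip-gram model learns from. The tiny tokenized sentences, the window size, and the function names are placeholders chosen for illustration; the paper's tokenizer, hyperparameters, and training library are not stated here, and the actual embedding training step is omitted.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: each word is paired with every
    neighbour at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

# Placeholder corpora: each one is a list of already tokenized sentences.
wiki = [["北京", "是", "首都"]]
in_domain = [["防火墙", "耐火", "极限"]]
close_domain = [["混凝土", "强度", "等级"]]

# The four corpus combinations described in the text.
TRAINING_CORPORA = {
    "general": wiki,
    "in-domain": wiki + in_domain,
    "close-domain": wiki + close_domain,
    "mixed-domain": wiki + in_domain + close_domain,
}

def training_pairs(model_name, window=2):
    """Collect skip-gram pairs over every sentence of the chosen corpus."""
    sentences = TRAINING_CORPORA[model_name]
    return [pair for sent in sentences for pair in skipgram_pairs(sent, window)]
```

The only thing that differs between the four models is the text they see; the training objective stays the same, which is what makes the comparison of corpus types meaningful.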
For dynamic word embedding models, the technical roadmap is shown in Figure 3. First, two benchmark models were selected: (1) bert-base-chinese[1] and (2) ERNIE[2]. Then, the constructed domain corpus was combined and subdivided to explore the influence of the type and quantity of domain text, giving five derived corpora: (1) the in-domain corpus; (2) the close-domain corpus; (3) the mixed-domain corpus (in-domain plus close-domain); (4) 1/3 of the in-domain corpus; (5) 1/5 of the in-domain corpus.
On this basis, each benchmark model was further pre-trained with the masked language model objective on each of the five domain corpora, yielding 10 pre-trained models (2 benchmark models × 5 corpora), as shown in Table 2.
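The masked language model objective hides some input tokens and trains the model to recover them. The sketch below shows only the masking step, using the standard recipe from the original BERT paper (select 15% of tokens; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged). Whether this study used exactly these rates is an assumption, and `mask_tokens` is a hypothetical helper, not code from the paper.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels). labels[i] holds the original token at
    positions selected for prediction and None elsewhere. A selected token is
    replaced by [MASK] 80% of the time, by a random vocabulary token 10% of
    the time, and left unchanged 10% of the time."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)          # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)
            elif roll < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(token)
        else:
            labels.append(None)           # position is ignored by the loss
            masked.append(token)
    return masked, labels
```

Because the labels come from the text itself, this step needs no manual annotation, which is why unlabeled domain corpora are sufficient for further pre-training.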
Figure 3 Method for enhancing deep learning models with architecture-specific dynamic word embedding models
Table 2 The 10 further pre-trained models
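The experimental grid behind the 10 models can be written down compactly. In this sketch the subsampling rule (keep the first 1/3 or 1/5 of the sentences) and all function names are assumptions for illustration; the paper may have sampled the in-domain corpus differently.

```python
from itertools import product

BASE_MODELS = ["bert-base-chinese", "ERNIE"]

def take_fraction(sentences, denominator):
    """Keep the first 1/denominator of the sentences (hypothetical rule;
    random sampling would be an equally plausible choice)."""
    return sentences[: len(sentences) // denominator]

def derive_corpora(in_domain, close_domain):
    """Build the five derived corpora described in the text."""
    return {
        "in-domain": in_domain,
        "close-domain": close_domain,
        "mixed-domain": in_domain + close_domain,
        "in-domain-1/3": take_fraction(in_domain, 3),
        "in-domain-1/5": take_fraction(in_domain, 5),
    }

def pretraining_runs(corpora):
    """Pair every base model with every derived corpus: 2 x 5 = 10 runs."""
    return list(product(BASE_MODELS, corpora))

# Example with placeholder sentences.
corpora = derive_corpora(
    in_domain=[f"rule-{i}" for i in range(15)],
    close_domain=[f"entry-{i}" for i in range(6)],
)
runs = pretraining_runs(corpora)
```

Comparing runs that share a base model isolates the effect of corpus type and size; comparing runs that share a corpus isolates the effect of the base model.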
04
Evaluation of Performance Improvement of Deep Learning Models
A series of experiments was conducted to evaluate how much the pre-trained models above improve deep learning models. The experimental configuration was as follows.
(1) Dataset selection: The methods and domain corpus in this study were evaluated on two datasets, including the regulatory TC dataset[3] and NER dataset[4].
(2) Metric selection: Weighted F1 was selected for the TC model, and macro average F1 (macro F1) for the NER model. First, the precision (P), recall (R), and F1 score (F1) of each semantic label were calculated:
$$P = \frac{N_{correct}}{N_{labeled}}, \quad R = \frac{N_{correct}}{N_{true}}, \quad F1 = \frac{2PR}{P + R}$$
Where $N_{correct}$ is the number of elements the model labeled correctly for a given label, $N_{labeled}$ is the total number of elements the model assigned to that label, and $N_{true}$ is the number of elements that actually carry that label.
Then the weighted F1 and macro average F1 were calculated:
$$F1_{weighted} = \frac{\sum_{i=1}^{m} n_i \, F1_i}{\sum_{i=1}^{m} n_i}, \quad F1_{macro} = \frac{1}{m} \sum_{i=1}^{m} F1_i$$
Where $n_i$ is the number of elements for the $i$-th semantic label, $F1_i$ is the F1 score of that label, and $m$ is the number of semantic label types.
(3) Dataset division: The two datasets were randomly split in the ratio Train: Validation: Test = 0.8: 0.1: 0.1.
(4) Model training and fine-tuning: Four types of experiments were conducted, including: 1) static word embedding model for TC; 2) static word embedding model for NER; 3) further pre-trained dynamic word embedding model for TC; 4) further pre-trained dynamic word embedding model for NER.
(5) Experimental conclusions: The experimental results are summarized in Figures 4 and 5. They indicate that, for the TC and NER tasks in ARC, the in-domain corpus can be used to train domain-specific word embedding models or to further pre-train the BERT and ERNIE models, improving model performance without additional manual annotation.
Figure 4 Performance of various models on TC tasks
Figure 5 Performance of various models on NER tasks
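The dataset division in step (3) and the metrics in step (2) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the fixed `seed`, the rounding rule for split sizes, and the function names are hypothetical, labels are treated as a flat list of elements, and $n_i$ is taken to be the number of true elements of label $i$. The authors' own evaluation scripts may differ.

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle and split into train / validation / test by the given ratios.
    The test set receives whatever remains after train and validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

def per_label_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, n_true)} for each true label."""
    scores = {}
    for label in sorted(set(y_true)):
        n_correct = sum(t == p == label for t, p in zip(y_true, y_pred))
        n_labeled = sum(p == label for p in y_pred)
        n_true = sum(t == label for t in y_true)
        precision = n_correct / n_labeled if n_labeled else 0.0
        recall = n_correct / n_true if n_true else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1, n_true)
    return scores

def weighted_f1(scores):
    """Average of per-label F1, weighted by the number of true elements."""
    total = sum(n for _, _, _, n in scores.values())
    return sum(f1 * n for _, _, f1, n in scores.values()) / total

def macro_f1(scores):
    """Unweighted average of per-label F1 over all label types."""
    return sum(f1 for _, _, f1, _ in scores.values()) / len(scores)

# Small worked example with two labels.
scores = per_label_scores(["A", "A", "A", "B"], ["A", "A", "B", "B"])
```

In the worked example, label A has P = 1, R = 2/3, F1 = 0.8 and label B has P = 0.5, R = 1, F1 = 2/3, so the weighted F1 is about 0.767 and the macro F1 about 0.733. The gap shows why the choice matters: weighted F1 favours frequent labels, while macro F1 treats rare labels as equally important.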
05
Conclusion
In this study, the impact of domain corpus on the performance of deep learning methods in NLP tasks in the architecture field was systematically studied. First, a domain corpus was developed and publicly released, and then, based on four experiments, the advantages of the developed domain corpus and deep learning models based on dynamic word embedding models (such as BERT) were demonstrated:
(1) For TC and NER tasks, the domain corpus improves deep learning models based on both static and dynamic word embedding models. For TC tasks, the weighted F1 scores improved by 11.4% (static) and 6.4% (dynamic); for NER tasks, the macro average F1 scores improved by 8.7% (static) and 5.4% (dynamic).
(2) The BERT model pre-trained on the domain corpus (ARCBERT) outperformed the deep learning models based on static word embedding, with the weighted F1 score for TC improved by 8.1%, and the macro average F1 score for NER improved by 3.8%. Dynamic word embedding-based deep learning models (such as BERT and ERNIE) performed better in NER and TC tasks than other models.
Finally, this study proposed two pre-trained models: ARCBERT Large, which reached a weighted F1 score of 94.4% on the TC task, and ARCBERT Small, which reached a macro average F1 score of 81.8% on the NER task. These two models achieved the best results among all models evaluated on the TC and NER tasks, respectively. The domain corpus and pre-trained models developed in this study performed well on various NLP tasks and may inspire future NLP-related research and applications in the field.
[1] Hugging Face, 2019. Bert-base-chinese. https://huggingface.co/bert-base-chinese/tree/main
[2] Sun, Y., Wang, S., Li, Y., Feng, S., Chen, X., Zhang, H., Tian, X., Zhu, D., Tian, H., Wu, H., 2019. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.
[3] Zheng, Z., Zhou, Y.C., Chen, K.Y., Lu, X.Z., Lin, J.R., She, Z.T., 2022. Text classification-based approach for automatically evaluating building codes’ interpretability. (in preparation).
[4] Zhou, Y.C., Zheng, Z., Lin, J.R., Lu, X.Z., 2020. Deep natural language processing-based rule transformation for automated regulatory compliance checking. Preprint. https://doi.org/10.13140/RG.2.2.22993.45921