This project is dedicated to exploring the relationship between EEG brainwave patterns and subjective cohesion during drumming sessions. The project utilizes machine learning models to uncover insights into how brain activity correlates with perceived social cohesion.
Initial models achieved above-chance classification of cohesive vs. non-cohesive pairs from dyadic EEG signals, suggesting some detectable signal in the data. However, instability and signs of overfitting mean the findings should be treated as exploratory rather than conclusive.
- Research Basis
- Repository Structure
- Machine Learning Models
- Cohesion Data Overview
- Data Cleaning
- Data Analysis
- Conclusions
- Setup & Installation
The central machine learning problem in this study is to develop predictive models that determine interpersonal cohesion from dyadic EEG signals. Given EEG recordings from two individuals engaged in a shared activity (specifically, a four-minute freestyle drumming session), the goal is to predict the pair's subjective rating of social cohesion, measured on a scale from 1 to 6. This is a supervised binary classification problem: the EEG signal features serve as input, and the self-reported cohesion scores, separated by a threshold value, act as binary labels (cohesive or not cohesive). The data was collected by the Social Neuroscience Lab at Bar-Ilan University.

Prior research has demonstrated that dyadic EEG patterns can provide insights into social and team dynamics. For example, Reinero et al. (2021) showed that EEG synchrony between individuals correlates with team performance, while Wang et al. (2024) found that it reflects emotional alignment. Ji et al. (2024) conducted a similarly structured study in which EEG signals from dyadic pairs, fed into a CNN model, were able to differentiate between friends and strangers.

Our preliminary analysis of the raw data showed that covariance between dyadic EEG signals was consistently near zero, suggesting that simple linear relationships were insufficient to capture meaningful patterns. This motivated the use of machine learning techniques to model the more complex, nonlinear dynamics we assume underlie social cohesion.

By taking a machine learning approach, our project seeks to model these neural synchrony patterns and establish a predictive link between brain activity and perceived social connection. A successful model could have broader applications in fields such as team formation, collaborative work, and even therapeutic interventions, offering an objective, neurobiological basis for assessing social cohesion.
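As a toy illustration of that covariance check, the per-pair cross-covariance can be computed as below. The array shapes and values are placeholders standing in for the real recordings, not the actual dataset.

```python
# Toy sketch of the preliminary covariance analysis: for each dyad,
# compute the covariance between the two participants' signals.
# Shapes and values are placeholders, not the real dataset.
import numpy as np

rng = np.random.default_rng(0)
pairs = rng.normal(size=(43, 2, 249))  # 43 pairs x 2 participants x 249 timepoints

# Cross-covariance between participant 1 and participant 2 for each pair
covariances = np.array([np.cov(p1, p2)[0, 1] for p1, p2 in pairs])
print(round(float(np.abs(covariances).mean()), 3))  # near zero for unrelated signals
```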
- Contains MATLAB scripts for EEG signal preprocessing. Key steps include:
- Bandpass filtering
- Bandwidth separation
- Timepoint segmentation
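The actual preprocessing lives in the MATLAB scripts. As a rough Python sketch of the filtering and segmentation steps, the following may help orient readers; the function names, the 30–80 Hz gamma band, and the 250 Hz sampling rate are all assumptions for illustration, not taken from the scripts.

```python
# Hypothetical Python equivalent of two preprocessing steps:
# bandpass filtering to an assumed gamma band, then timepoint segmentation.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_gamma(signal, fs, low=30.0, high=80.0, order=4):
    """Zero-phase Butterworth bandpass filter (assumed gamma band, 30-80 Hz)."""
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

def segment(signal, n_segments):
    """Average the signal into a fixed number of timepoints (e.g. 83, 249, 747)."""
    chunks = np.array_split(signal, n_segments)
    return np.array([c.mean() for c in chunks])

# Example: filter 4 minutes of synthetic EEG sampled at an assumed 250 Hz,
# then compress it to 249 timepoints.
fs = 250
t = np.arange(0, 240, 1 / fs)
eeg = np.sin(2 * np.pi * 40 * t) + np.sin(2 * np.pi * 10 * t)  # 40 Hz + 10 Hz
gamma = bandpass_gamma(eeg, fs)      # 10 Hz component is attenuated
features = segment(gamma, 249)
print(features.shape)  # (249,)
```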
The code is organised into two separate pipelines. The pilot_pipline is the initial pipeline created for the project, and facilitates both data restructuring and model training:
- `data_restructuring/`: Scripts detailing the restructuring process from the preprocessed data to the formats in `separated_pairs/` and `mixed_pairs/`. These are for reference only, as the restructured data is already included. The first pair is removed here, due to missing data in this pair.
- `cnn/`: Code for training convolutional neural networks (CNNs).
- `svm/`: Code for training support vector machine (SVM) models.
- `visuals/`: Visualisation scripts.
- `data/`: Data used by this pipeline.
The `dsen_pipeline/` (the main pipeline) contains:
- `data/`: The preprocessed EEG data split into 83, 249 and 747 time points, as well as `labels.csv`.
- `func/`: Functionality scripts called from the pipeline scripts, mainly `cnn_feature_extr_func.py`, `concatenate_pairs_func.py`, `model_training_func.py` and `analysis_func.py`.
- `pipeline/`: Scripts to run the pipeline: `main.py` produces the main model outputs from the input data, and `analysis_main.py` runs the analysis once the model outputs are saved.
- `saved_models/`: All models output by `main.py`, saved as `.pkl` files.
- `visualisations/`: Saved visualisations: confusion matrices and learning curves.
Repository structure:
dsen_pipeline/
├── data/ # Raw gamma data and labels
│ ├── raw_gammas_83.csv
│ ├── raw_gammas_249.csv
│ ├── raw_gammas_747.csv
│ └── labels.csv
├── func/ # Functions
│ ├── analysis_func.py
│ ├── cnn_feature_extr_func.py
│ ├── concatenate_pairs_func.py
│ └── model_training_func.py
├── pipeline/ # Main pipeline script
│ ├── main.py
│ └── analysis_main.py
├── saved_models/ # Pickled models saved
│ ├── Dataset [time_step] Time Step_[model].pkl
├── visualisations/ # Evaluation outputs
│ ├── [Dataset]_[Model]_confusion_matrix.png
│ ├── [Dataset]_[Model]_learning_curve.png
│ └── model_comparison.csv
Theoretical pipeline structure:
Three models were trained for this project (focusing on the DSEN_Pipeline), each on all three time separations of the EEG data (83, 249 and 747 timepoints):
- SVM (support vector machine)
- RF (random forest)
- MLP (multilayer perceptron)
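A minimal sketch of how these three model types can be trained with scikit-learn, assuming `X` holds flattened dyadic gamma features and `y` the binary cohesion labels. The placeholder data and the MLP layer sizes are illustrative; the SVM and RF hyperparameters follow those reported in the evaluation table below.

```python
# Illustrative training of the three model types on placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(86, 249))   # placeholder for the real feature matrix
y = rng.integers(0, 2, size=86)  # placeholder binary cohesion labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10)),
    "RF": RandomForestClassifier(n_estimators=100, max_depth=50, random_state=0),
    "MLP": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(256, 128, 64),
                                       activation="relu", max_iter=500,
                                       random_state=0)),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(model.score(X_te, y_te), 3))
```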
The repository is designed for easy usage: no preprocessing or restructuring scripts need to be run before training the models, as all necessary data is already included.
- Averaged Cohesion Scores:
  - Scale: 1 (No Cohesion) to 6 (High Cohesion), converted to a binary label of 1 (Cohesion) or 0 (No Cohesion) using a threshold of 4.7: if (Cohesion_Score_Person_1 + Cohesion_Score_Person_2) / 2 > 4.7, then Cohesion = 1; otherwise 0.
  - Method: based on participant ratings averaged across pairs after the drumming sessions.
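The binarisation rule above can be sketched as follows; the column names and the pandas layout are assumptions, not the project's actual data format.

```python
# Sketch of the cohesion-label binarisation rule (threshold 4.7 on the
# pair-averaged score). Column names are hypothetical.
import pandas as pd

THRESHOLD = 4.7

df = pd.DataFrame({
    "cohesion_p1": [5.0, 3.0, 6.0],
    "cohesion_p2": [5.0, 4.0, 2.0],
})
df["mean_score"] = (df["cohesion_p1"] + df["cohesion_p2"]) / 2
df["cohesion"] = (df["mean_score"] > THRESHOLD).astype(int)
print(df["cohesion"].tolist())  # [1, 0, 0]
```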
- Raw EEG Data:
  - Participants: 98 individuals (49 pairs).
  - Session details: 4-minute drumming sessions.
- Preprocessed EEG Data (for the DSEN pipeline):
  - Gamma-bandwidth EEG data only, for each participant pair.
  - EEG signals combined into 83, 249 or 747 timepoints.
  - Participants: 98 individuals (49 pairs) → 88 individuals (44 pairs) after preprocessing → 86 individuals (43 pairs) after data restructuring.
  - EEG recording: collected during the 4-minute drumming session.
- Model Outputs: Accuracy, precision and F1 scores were used to assess which models and datasets performed best and warranted further analysis.
(for the dataset separated into 249 timesteps only, as this produced the best-performing models)
- Cross-Validation Results: For the best-performing models, a t-test of accuracy against baseline performance (0.512) was used to assess whether the models consistently performed better than chance across 5-fold validation.
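That significance check can be sketched as follows, with random placeholder data in place of the real features (so the p-value printed here is meaningless and will differ from the reported results).

```python
# Sketch of the cross-validation significance check: 5-fold CV accuracies
# t-tested against the 0.512 baseline. Placeholder data only.
import numpy as np
from scipy.stats import ttest_1samp
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

BASELINE = 0.512

rng = np.random.default_rng(0)
X = rng.normal(size=(86, 249))   # placeholder features
y = rng.integers(0, 2, size=86)  # placeholder labels

scores = cross_val_score(SVC(kernel="rbf", C=10), X, y, cv=5, scoring="accuracy")
t_stat, p_value = ttest_1samp(scores, BASELINE, alternative="greater")
print(scores.mean(), p_value)
```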
- MLP: Since the MLP model, trained and tested on the 249-timestep data, performed best on the headline metrics, we examined its evaluation first. Its learning curves showed very unsettled patterns across training-set sizes, and general instability.
- RF and SVM Support Vector Influence: Since the MLP model was unstable and not learning properly, we analysed the second-best model, the RF. Its training-set learning curve showed very obvious overfitting (consistently 100% training accuracy). Finally, we looked into the inputs that lay closest to the SVM's decision boundaries and were therefore weighted most heavily in the model's decision-making.
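The boundary-proximity inspection can be sketched with scikit-learn's `SVC.support_` attribute, which exposes the indices of the training samples that ended up as support vectors (again with placeholder data).

```python
# Sketch of identifying which training pairs lie closest to the SVM's
# decision boundary: fitted SVC models expose them via `support_`.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(86, 249))   # placeholder features
y = rng.integers(0, 2, size=86)  # placeholder labels

svm = SVC(kernel="rbf", C=10).fit(X, y)
influential_pairs = svm.support_  # indices of the support-vector samples
print(len(influential_pairs), "of", len(X), "pairs act as support vectors")
```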
| # | Model | Accuracy | Precision | F1 | Parameters | Conclusions of model evaluation |
|---|---|---|---|---|---|---|
| - | Baseline | 0.512 | - | - | - | - |
| 1 | MLP | 0.7444 | 0.7667 | 0.7243 | ReLU, 500 max iterations, (256, 128, 645) hidden layer sizes | No signs of learning and severe underfitting |
| 2 | RF | 0.7389 | 0.85 | 0.6467 | 100 n_estimators, 50 max depth | Clear signs of severe overfitting |
| 3 | SVM | 0.7194 | 0.7933 | 0.6554 | RBF, C = 10 | Unclear learning pattern |
Was it a fluke, or is the SVM truly differentiating between cohesive and non-cohesive pairs persistently over the k-fold validations?
Conclusion of SVM
The SVM model was statistically significant in its performance, with a p-value of 0.0075, and a confidence interval that did not dip into the chance-level performance.
We went on to further analyse the SVM model: which pairs did it have the most trouble differentiating into cohesive vs. non-cohesive?
Assumptions made:
- Cohesion can be reasonably approximated using a 4.7+ threshold on subjective scores.
- Gamma-frequency EEG signals are more informative for distinguishing dyads than other bands.
- The drumming task provides a socially relevant interaction context, but does not systematically bias the EEG.
Conclusions derived:
- The architecture of the MLP most successfully classifies socially cohesive vs. non-cohesive pairs above chance level based on their EEG signal data alone.
- Particular pairs are more relevant to the model's decision boundaries, but the nature of their relevance is unknown and would benefit from further inspection in the context of a larger dataset.
- Inconsistent performance across the other datasets and the lack of a pattern among the influential pairs suggest that the model may be overfitting to this dataset's statistical structure.
- Clone the repository:
git clone https://github.com/yourusername/MLCohesionEEGProject.git
cd MLCohesionEEGProject

- Run the `main.py` file in the `dsen_pipeline/pipeline` folder to see the accuracy results of the models and the visualisations.



