{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Account for Familiy structure using Permutation testing\n", "This Python code is a conversion of a portion of the PALM (Permutation Analysis of Linear Models) package, initially created by Anderson M. Winkler. \n", "PALM serves as a robust tool for conducting permutation-based statistical analyses.\n", "If you wish to explore PALM further, please refer to the official [PALM website](http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/PALM) or read their guidelines on how the [exchangeable blocks](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/PALM/ExchangeabilityBlocks) are created in MATLAB\n", "\n", "In this Python adaptation, our primary focus lies in accommodating family structure within your dataset. \n", "We implement methods in PALM that manage exchangeability blocks, which is described in: \n", "* Winkler AM, Webster MA, Vidaurre D, Nichols TE, Smith SM. Multi-level block permutation. Neuroimage. 2015;123:253-68. (Available as Open Access). DOI: 10.1016/j.neuroimage.2015.05.092\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we will walk through a practical example of processing family-related data using the Human Connectome Project dataset in Python. \n", "\n", "Performing permutation tests with HCP data involves working with an ```EB.csv``` (Exchangeability Block) file, which is structured with multiple columns. We need to create ```EB.csv``` if we want to account for family structure when [testing across subjects](https://github.com/vidaurre/glhmm/blob/main/docs/notebooks/Testing_across_subjects.ipynb).\\\n", "These columns define blocks for each family, allowing for entire families to be shuffled collectively, and within these families, individual subjects can be permuted. \n", "Creating this file while considering the kinship relationships among hundreds of HCP subjects would be a time-consuming and error-prone task and can be construced using the function ```hcp2block``` from ```palm_functions.py```. \n", "To utilize it, follow these steps:\n", "\n", "1. Ensure that you have obtained the necessary permissions to access restricted HCP data. If you haven't obtained permission yet, please follow the instructions provided on the [HCP website](https://www.humanconnectome.org/).\n", "\n", "2. Once you have obtained permission, download the most up-to-date version of the restricted data file. This file is typically in CSV format and will be named something like ```RESTRICTED_yourname_MM_DD_YY_HH_MM_SS.csv``` after downloading.\n", "\n", "**Note:** Due to rendering issues when viewing this notebook through github, internal links, like the table of contents, may not work correctly. To ensure that the notebook renders correctly, you can view it through [this link](https://nbviewer.org/github/vidaurre/glhmm/blob/main/docs/notebooks/HCP_multi_level_block_permutation.ipynb).\n", "\n", "\n", "## Table of Contents\n", "1. [Explore data](#ex-data)\n", "2. [Create Exchangeability blocks](#ex-data2)\n", "3. [Re-index EB](#ex-data3)\n", "4. [PALM tree](#ex-data4)\n", "5. [Shuffle Permutation Tree](#ex-data5)\n", "6. [Run quick-perm function](#ex-data6)\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import libraries\n", "Let's start by importing the required libraries and modules." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import os\n", "import pandas as pd\n", "from glhmm import palm_functions \n", "import seaborn as sns\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Explore Data \n", "Here we are going to look the data from the ```RESTRICTED_yourname_MM_DD_YY_HH_MM_SS.csv```" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Sample data - Replace this with your actual data\n", "# famid, sibtype, famtype, age\n", "# Load data\n", "file_path = \"RESTRICTED_yourname_MM_DD_YY_HH_MM_SS.csv\"\n", "twin_data = pd.read_csv(file_path)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Summary statistics:\n", " Subject Age_in_Yrs Mother_ID Father_ID \\\n", "count 1206.000000 1206.000000 1206.000000 1206.000000 \n", "mean 374551.585406 28.837479 53110.315920 83166.267828 \n", "std 272686.898230 3.690534 3464.007126 3075.715552 \n", "min 100004.000000 22.000000 50263.000000 80216.000000 \n", "25% 154254.250000 26.000000 51603.000000 81476.000000 \n", "50% 212166.500000 29.000000 52312.000000 82201.000000 \n", "75% 586310.500000 32.000000 53302.250000 84431.000000 \n", "max 996782.000000 37.000000 99998.000000 99999.000000 \n", "\n", " TestRetestInterval Handedness SSAGA_Employ SSAGA_Income \\\n", "count 46.000000 1206.000000 1204.000000 1199.000000 \n", "mean 139.304348 65.621891 1.521595 5.003336 \n", "std 68.994000 44.994546 0.749862 2.172830 \n", "min 18.000000 -100.000000 0.000000 1.000000 \n", "25% 95.000000 60.000000 1.000000 3.000000 \n", "50% 133.500000 80.000000 2.000000 5.000000 \n", "75% 157.000000 95.000000 2.000000 7.000000 \n", "max 343.000000 100.000000 2.000000 8.000000 \n", "\n", " SSAGA_Educ SSAGA_InSchool ... SSAGA_Times_Used_Illicits \\\n", "count 1204.000000 1204.000000 ... 1204.000000 \n", "mean 14.863787 0.197674 ... 0.530731 \n", "std 1.819279 0.398411 ... 1.166826 \n", "min 11.000000 0.000000 ... 0.000000 \n", "25% 13.750000 0.000000 ... 0.000000 \n", "50% 16.000000 0.000000 ... 0.000000 \n", "75% 16.000000 0.000000 ... 0.000000 \n", "max 17.000000 1.000000 ... 5.000000 \n", "\n", " SSAGA_Times_Used_Cocaine SSAGA_Times_Used_Hallucinogens \\\n", "count 1204.000000 1204.000000 \n", "mean 0.214286 0.268272 \n", "std 0.939887 0.900885 \n", "min 0.000000 0.000000 \n", "25% 0.000000 0.000000 \n", "50% 0.000000 0.000000 \n", "75% 0.000000 0.000000 \n", "max 5.000000 5.000000 \n", "\n", " SSAGA_Times_Used_Opiates SSAGA_Times_Used_Sedatives \\\n", "count 1204.000000 1204.000000 \n", "mean 0.230897 0.176910 \n", "std 0.953539 0.833284 \n", "min 0.000000 0.000000 \n", "25% 0.000000 0.000000 \n", "50% 0.000000 0.000000 \n", "75% 0.000000 0.000000 \n", "max 5.000000 5.000000 \n", "\n", " SSAGA_Times_Used_Stimulants SSAGA_Mj_Use SSAGA_Mj_Ab_Dep \\\n", "count 1204.000000 1204.000000 1204.000000 \n", "mean 0.220100 0.543189 0.090532 \n", "std 0.896411 0.498338 0.287061 \n", "min 0.000000 0.000000 0.000000 \n", "25% 0.000000 0.000000 0.000000 \n", "50% 0.000000 1.000000 0.000000 \n", "75% 0.000000 1.000000 0.000000 \n", "max 5.000000 1.000000 1.000000 \n", "\n", " SSAGA_Mj_Age_1st_Use SSAGA_Mj_Times_Used \n", "count 654.000000 1204.000000 \n", "mean 2.577982 1.406146 \n", "std 0.937171 1.691599 \n", "min 1.000000 0.000000 \n", "25% 2.000000 0.000000 \n", "50% 3.000000 1.000000 \n", "75% 3.000000 3.000000 \n", "max 4.000000 5.000000 \n", "\n", "[8 rows x 184 columns]\n" ] } ], "source": [ "print(\"\\nSummary statistics:\")\n", "print(twin_data.describe())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Create Exchangeability blocks \n", "Now after looking at the ```RESTRICTED_yourname_MM_DD_YY_HH_MM_SS.csv``` we can create the exchangeability blocks (```EB```) by using the ```hcp2block``` function. Having ```EB``` makes it possible to create the Permutation Tree (```Ptree```)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1003, 201)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Filter twin_data\n", "folder_name = \"data\"\n", "data_ID_file = 'data_ID.npy'\n", "\n", "# Load behavioral data\n", "file_path = os.path.join(folder_name, data_ID_file)\n", "data_ID = np.load(file_path)\n", "\n", "# Filter the twin data\n", "twin_data_filtered = twin_data[twin_data['Subject'].isin(data_ID)]\n", "twin_data_filtered.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "These subjects have data missing and will be removed: [168240, 376247]\n" ] } ], "source": [ "# Example usage\n", "blocksfile = \"EB.csv\"\n", "tab, EB, famtype = palm_functions.hcp2block(twin_data_filtered, blocksfile, dz2sib=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Data Transformation Results**\\\n", "Display the results of using the function on the HCP twin data." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original tab shape: (1206, 201)\n", "Transformed tab shape: (1001, 6)\n", "Block structure shape: (1001, 5)\n", "Family type array shape: (1001,)\n" ] } ], "source": [ "# Display transformed data and results\n", "print(\"Original tab shape:\", twin_data.shape)\n", "print(\"Transformed tab shape:\", tab.shape)\n", "print(\"Block structure shape:\", EB.shape)\n", "print(\"Family type array shape:\", famtype.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The steps above will generate a file named ```EB.csv```. This file contains definitions for blocks corresponding to subjects with complete family information in the restricted file.\n", " \n", "Complete family information means subjects with both father and mother IDs, along with zygosity information. The order of these definitions in ```EB.csv``` mirrors the order of subjects in the restricted file. \n", "\n", "We'll use the data from ```EB``` to create a Permutation Tree (```Ptree```) that we can later use for permutation testing.\\\n", " When we examine the ```EB``` data, we can interpret its different columns as follows:\n", "* 1st col: The negative values in this column indicate that the sub-indices in the next column must remain unchanged during shuffling.\n", "* 2nd col: familie type, which is a calculated score assigned to each family. For example, a family consisting of a mother, father, 1 full sibling (FS), and 2 monozygotic twins (MZ) would have same score as other families with the same family combination.\n", "* 3rd col: family ID, providing a distinct identifier for each family.\n", "* 4th col: sibling type, which represents the score assigned to each subject.\n", "* 5th col: subject ID, which are unique identifiers for each subject in the dataset.\n", "\n", "Regarding permutations, dizygotic twins (DZ) can be handled differently from non-twins (NT). By default, DZ twins cannot be shuffled with non-twin subjects, even within the same family. However, since DZ pairs share the same kinship as NT siblings, you might want to allow them to be swapped during permutations. You can provide a third argument as 'true' to hcp2blocks to indicate this." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | 0 | \n", "1 | \n", "2 | \n", "3 | \n", "4 | \n", "
|---|---|---|---|---|---|
| 0 | \n", "-1 | \n", "2012 | \n", "-419 | \n", "10 | \n", "100206 | \n", "
| 1 | \n", "-1 | \n", "2012 | \n", "-78 | \n", "1000 | \n", "100307 | \n", "
| 2 | \n", "-1 | \n", "2012 | \n", "-135 | \n", "1000 | \n", "100408 | \n", "
| 3 | \n", "-1 | \n", "212 | \n", "-288 | \n", "100 | \n", "100610 | \n", "
| 4 | \n", "-1 | \n", "2002 | \n", "9 | \n", "1000 | \n", "101006 | \n", "