The data are a single column of entries from a text file that is a failed XML import (sample below). My challenge is to determine for each drug ("# Brand_Names:"), what are the corresponding categories ("# Drug_Category:"). This could be done manually, but it would take very long as there are >1,000,000 data rows in the column.
Here is what I have (repeated for different drugs down a single column for about 1,000,000 rows):
# Absorption: |
Bioavailability is 100% following injection. |
# Biotransformation: |
Lepirudin is thought to be metabolized by release of amino acids via catabolic hydrolysis of the parent drug. However, con-clusive data are not available. About 48% of the administration dose is excreted in the urine which consists of unchanged drug (35%) and other fragments of the parent drug. |
# Brand_Mixtures: |
Not Available |
# Brand_Names: |
Refludan |
# CAS_Registry_Number: |
120993-53-5 |
# ChEBI_ID: |
Not Available |
# Chemical_Formula: |
C287H440N80O110S6 |
# Chemical_IUPAC_Name: |
Not Available |
# Chemical_Structure: |
>DB00001 sequence |
LVYTDCTESGQNLCLCEGSNVCGQGNKCILGSDGEKNQCVTGEGTPKPQSHNDGDFEEIP |
EEYLQ |
# Creation_Date: |
######## |
# DPD_Drug_ID_Number: |
2240996 |
# Description: |
Lepirudin is identical to natural hirudin except for substitution of leucine for isoleucine at the N-terminal end of the molecule and the absence of a sulfate group on the tyrosine at position 63. It is produced via yeast cells. |
# Dosage_Forms: |
Powder, for solution |
# Drug_Category: |
Anticoagulants |
Antithrombotic Agents |
Fibrinolytic Agents |
# Drug_Interactions: |
Ginkgo biloba |
Treprostinil |
# Drug_Reference: |
16241940 |
16244762 |
16690967 |
# Drug_Type: |
Approved |
Biotech |
<colgroup><col width="64" style="width:48pt"> </colgroup><tbody>
</tbody>
And here is what I need from that data:
While there is only one brand name per drug, there is a variable number of categories (there are 3 in the example above, but there may be 1 or 5 or 3, etc.).
The best solution would be complicated and would also need a 3rd search term, "# Drug_Interactions:", and would look like this:
In Sheet1, start at R1C1 and go down the column (C1) to identify the first row with the value "# Brand_Names:"
Copy the next row (i.e. the row with the actual brand name) and paste it into R1C1 of Sheet2...this the drug name.
Keep going down C1 in Sheet1 until the row value = "# Drug_Category:"
Copy the range that goes from the next row to the row above the row with value = "# Drug_Interactions:" and paste it into R1C2:R1Cx of Sheet2, where x is the number of rows in the range...these are the drug categories.
Now loop to continue down C1 in Sheet1, looking for the next row with the value "# Brand_Names:"...etc until there are no more rows.
Sounds impossible even as I write this, let me know what you think!