This quickstart for the cSR framework assumes the user is running some variant of UNIX and has a working installation of python3 (at least 3.5.2 is expected).
For the code examples, the cSR package is assumed to be checked out to a directory ~/cSR.
Before starting, make sure the current working directory is the cSR root directory. Also test that you have a recent version of python3:
cSR$ which python3
/usr/bin/python3
cSR$ python3 --version
Python 3.5.2

First test that pip is linked to the right python executable:
cSR$ which pip3
/usr/bin/pip3
cSR$ pip3 --version
pip 19.3.1 (python 3.5.2)

To create a virtual environment called SR_env:
cSR$ mkvirtualenv -p `which python3` SR_env
cSR$ workon SR_env
(SR_env) cSR$ pip3 install -r requirements.txt

You should now be able to run cSR. To reload the environment after logging out, run:
cSR$ workon SR_env

Alternatively, to use conda, first test that conda is linked to the right python executable:
cSR$ which conda
/usr/bin/conda
cSR$ conda --version

To create a virtual environment called SR_env:
cSR$ conda create -n SR_env python==3.5.2
cSR$ activate SR_env
(SR_env) cSR$ conda install -c conda-forge --file requirements.txt

You should now be able to run cSR. To reload the environment after logging out, run:
cSR$ activate SR_env

If all dependencies have been installed correctly, the following command should run without error:
cSR$ python -m csr.Data --help

The cSR framework will use GPU acceleration if this is supported by the local installation of tensorflow (tensorflow-gpu). Installing these dependencies is beyond the scope of this readme. For more information, see the tensorflow instructions.
Using conda is recommended for installing tensorflow-gpu.
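As a quick check of whether GPU acceleration will be picked up, the following sketch probes the local tensorflow installation. It assumes a TF 1.x release (contemporary with python 3.5), where tf.test.is_gpu_available() is the documented check; it degrades gracefully if tensorflow is absent:

```python
# Hedged sketch: report whether the local tensorflow build can see a GPU.
# Assumes a TF 1.x release, where tf.test.is_gpu_available() is the
# documented check; falls back to False if tensorflow is missing.
try:
    import tensorflow as tf
    gpu_available = tf.test.is_gpu_available()
except (ImportError, AttributeError):
    gpu_available = False  # tensorflow not installed, or API differs
print("GPU acceleration available:", gpu_available)
```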
All modules and tools reside in the csr module (folder). All classes and functions can be called from python programs, and several can be used from the command line.
The documentation of the tools is available directly in the tools, accessible with the --help flag. The documentation of the framework for use as a python library is available in the doc folder (doc/index.html).
The following modules can be used as standalone tools from the command line:
- csr.Data: Class to inspect/store/load/handle data. Running the module from the command line allows inspecting and modifying existing data files
- csr.Import: Methods to convert external formats to datastream format
- csr.Export: Methods to convert datastream files to external formats
- csr.Medline: Methods to query PubMed
- csr.Train: Training, applying and evaluating machine learning models on existing datasets
- csr.ML.Evaluation: Methods to calculate ranking performance on existing data
- csr.Vocabulary: Methods to pre-construct vocabulary files from datasets to decrease resource usage during training
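The dual use described above, callable from python programs and runnable as command line tools, follows a common python pattern. The sketch below is illustrative only; all names in it are hypothetical and do not reproduce cSR's actual classes or flags:

```python
# Illustrative sketch of a module that is both a library and a CLI tool,
# mirroring the dual use described above. All names here are hypothetical;
# they are not cSR's actual API.
import argparse

def inspect_file(path, n_rows=None):
    # Stand-in for a data-inspection function.
    return "inspecting {} (first {} rows)".format(path, n_rows or "all")

def main(argv=None):
    parser = argparse.ArgumentParser(description="example dual-use tool")
    parser.add_argument("--input", required=True)
    parser.add_argument("--select", type=int, default=None)
    args = parser.parse_args(argv)
    print(inspect_file(args.input, args.select))

if __name__ == "__main__":
    main()  # enables `python -m <module> --input file.json`
```

Importing the module gives access to inspect_file() directly, while the main() guard makes the same file usable from the shell.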
As an example, to inspect the contents of data/full/COMET/COMET_update1_M.json:
cSR$ python -m csr.Data --inspect --input data/full/COMET/COMET_update1_M.json

All command line usage requires calling the tools as python modules. That is, the following does not work:
cSR$ python csr/Data.py --inspect --input data/full/COMET/COMET_update1_M.json

cSR$ bash csr/examples/small/rank_endnote_xml.sh

This example uses the pipeline csr/examples/common/pipelines/sparse_trivial.yaml and the classifier csr/examples/common/classifiers/SGD_50epochs.yaml to import data from an external data format (EndNote XML), train and apply a ranker on the data, and sort the data files. The pipeline takes input features from titles, abstracts and keywords.
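The module-style requirement noted above comes from python's import machinery: a file run by path has no package context, so relative imports inside the package fail, whereas `python -m` sets the package up correctly. The throwaway package in this sketch (pkg, with hypothetical contents) demonstrates the difference:

```python
# Demonstrate why `python -m pkg.tool` works while `python pkg/tool.py`
# fails when the module uses relative imports. The package `pkg` is a
# throwaway example, not part of cSR.
import os, subprocess, sys, tempfile

with tempfile.TemporaryDirectory() as root:
    pkg = os.path.join(root, "pkg")
    os.makedirs(pkg)
    open(os.path.join(pkg, "__init__.py"), "w").close()
    with open(os.path.join(pkg, "util.py"), "w") as f:
        f.write("VALUE = 42\n")
    with open(os.path.join(pkg, "tool.py"), "w") as f:
        f.write("from .util import VALUE\nprint(VALUE)\n")

    # Module-style invocation: package context is set up, import succeeds.
    ok = subprocess.run([sys.executable, "-m", "pkg.tool"],
                        cwd=root, capture_output=True, text=True)
    # Path-style invocation: no package context, the relative import fails.
    bad = subprocess.run([sys.executable, "pkg/tool.py"],
                         cwd=root, capture_output=True, text=True)
    print("python -m pkg.tool  ->", ok.stdout.strip())
    print("python pkg/tool.py  -> exit code", bad.returncode)
```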
cSR$ bash csr/examples/small/rank_sentences.sh

cSR$ bash csr/examples/quick/run_COMET.sh

- Inspect the entire file contents of file1.json.
  cSR$ python -m csr.Data --inspect --input file1.json
- Inspect the file contents of file1.json, limiting the results to the columns label and title, and to the first 20 rows.
  cSR$ python -m csr.Data --inspect --input file1.json --col label title --select 20
- Merge the files file1.json, file2.json and file3.json, writing the results to file_out.json. The data format (columns) must be compatible.
  cSR$ python -m csr.Data --input file1.json file2.json file3.json --output file_out.json
  OR
  cSR$ python -m csr.Data --input file*.json --output file_out.json
- Inspect all rows in file1.json where the column split is equal to train, and label is equal to Y.
  cSR$ python -m csr.Data --inspect --input file1.json --get split=train label=Y
- Open file1.json, set the column date_constructed to Jan 1, 1970 for all rows (creating the column if it does not exist), and save the results to file2.json.
  cSR$ python -m csr.Data --inspect --input file1.json --set date_constructed="Jan 1, 1970" --output file2.json
- Replace the label values yes with Y, and no with N, saving the results to file2.json.
  cSR$ python -m csr.Data --input file1.json --get label=yes --set label=Y --output file1_tempY.json
  cSR$ python -m csr.Data --input file1.json --get label=no --set label=N --output file1_tempN.json
  cSR$ python -m csr.Data --input file1_temp{Y,N}.json --output file2.json
cSR$ rm file1_temp{Y,N}.json
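Conceptually, --get selects rows by column value and --set overwrites columns on the selected rows. The sketch below expresses the yes/no relabeling above in plain python over a list of dicts; the actual on-disk format handled by csr.Data is not specified here, so the record structure is an assumption:

```python
# Conceptual sketch of `--get label=yes --set label=Y` over in-memory
# records. A list of dicts stands in for the data file; this is not
# csr.Data's actual API.
records = [
    {"title": "a", "label": "yes"},
    {"title": "b", "label": "no"},
    {"title": "c", "label": "yes"},
]

def get_set(rows, get, updates):
    """Select rows matching every (column, value) pair in `get`,
    then overwrite the selected rows with `updates`."""
    selected = [r for r in rows if all(r.get(k) == v for k, v in get.items())]
    for r in selected:
        r.update(updates)
    return selected

relabeled = get_set(records, {"label": "yes"}, {"label": "Y"})
print(relabeled)
```

Running the same function a second time with {"label": "no"} and {"label": "N"} mirrors the two-pass shell workflow above.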