40 commits
03ef009
PlanB working for windows
QJonny Jul 23, 2012
8666de0
PlanB working for windows
QJonny Jul 23, 2012
2c23544
PlanB working for windows
QJonny Jul 23, 2012
9de69fd
PlanB working for windows
QJonny Jul 23, 2012
9f2176d
itext processing added
QJonny Jul 31, 2012
1087309
itext processing added
QJonny Jul 31, 2012
63e3154
XML Parser functional for windows. Title is correctly parsed
QJonny Aug 10, 2012
fc35502
XML Parser working. Title correctly parsing
QJonny Aug 10, 2012
9137768
Abstract correctly parsing
QJonny Aug 14, 2012
9e83c3f
Parsing backup before big change in parsing structure
QJonny Aug 20, 2012
d7fbd7c
Implementation of a functional XML paragraphs detector
QJonny Aug 24, 2012
53e592a
Everything is done. Must just add some new reference formats
QJonny Aug 31, 2012
55fdf24
Commentation and structuration of the code in a more extensible way
QJonny Sep 3, 2012
480a568
Final release
QJonny Sep 11, 2012
46a62c8
Final release
QJonny Sep 11, 2012
1bffe6b
Final release
QJonny Sep 12, 2012
6ad38ba
Final release
QJonny Sep 12, 2012
96b6933
Final release for both Windows and Linux
QJonny Sep 12, 2012
919cb44
Final release
QJonny Sep 13, 2012
c9456b9
Merge pull request #1 from QJonny/xmlParser
arnfred Oct 5, 2012
5e84f5d
Merge pull request #1 from Traill/xmlParser
arnfred Oct 5, 2012
292192f
Structure transformation and enumeration system implemented
QJonny Oct 5, 2012
8d42308
Title modification
QJonny Oct 5, 2012
1be49e4
Merge pull request #2 from QJonny/xmlParser
arnfred Oct 5, 2012
47da121
Working except for graphs
QJonny Oct 5, 2012
7434bbe
Merge pull request #3 from QJonny/xmlParser
arnfred Oct 5, 2012
27616f1
Fixed an error with ExtendPaper
arnfred Oct 5, 2012
fb4d283
Added bag of Words to the project
AmineMan Oct 31, 2012
64e8c08
Added a line to Bag of Words
AmineMan Nov 1, 2012
19d5e2a
Added new code - final tests
AmineMan Nov 1, 2012
5b4f2a3
Finished merge
AmineMan Nov 2, 2012
0004199
Corrected normalization of final weights on BoW
AmineMan Nov 2, 2012
d95397b
Merge pull request #2 from AmineMan/Amine/BagofWords
arnfred Nov 6, 2012
255cc07
Merge branch 'master' of https://github.com/Traill/Papers
arnfred Nov 18, 2012
a8cc894
Deleted superflous files
arnfred Nov 18, 2012
7ee0dd7
Fixed a few errors and made everything compile
arnfred Nov 18, 2012
393ff81
Fixed something
arnfred Nov 18, 2012
37b7a9f
Made output be a AMD module (data.js)
arnfred Nov 19, 2012
8306ba7
Changed a bit in the program flow
arnfred Nov 20, 2012
3fe4db2
Added SLI
arnfred Nov 23, 2012
Binary file added .cache
Binary file not shown.
11 changes: 11 additions & 0 deletions .classpath
@@ -0,0 +1,11 @@
<?xml version="1.0" encoding="UTF-8"?>
<classpath>
<classpathentry kind="src" path=""/>
<classpathentry kind="con" path="org.scala-ide.sdt.launching.SCALA_CONTAINER"/>
<classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER"/>
<classpathentry kind="lib" path="C:/lib/breeze/dlwh-breeze-feca1cf/target/scala-2.9.2/breeze_2.9.2-0.1-SNAPSHOT.jar"/>
<classpathentry kind="lib" path="C:/lib/breeze/dlwh-breeze-feca1cf/math/target/scala-2.9.2/breeze-math_2.9.2-0.1-SNAPSHOT.jar" sourcepath="C:/lib/breeze/dlwh-breeze-feca1cf/math/src"/>
<classpathentry kind="lib" path="C:/lib/breeze/dlwh-breeze-feca1cf/learn/target/scala-2.9.2/breeze-learn_2.9.2-0.1-SNAPSHOT.jar"/>
<classpathentry kind="lib" path="C:/lib/breeze/dlwh-breeze-feca1cf/process/target/breeze-process-assembly-0.1-SNAPSHOT.jar" sourcepath="C:/lib/breeze/dlwh-breeze-feca1cf/process/src/main/scala/breeze"/>
<classpathentry kind="output" path="bin"/>
</classpath>
18 changes: 18 additions & 0 deletions .project
@@ -0,0 +1,18 @@
<?xml version="1.0" encoding="UTF-8"?>
<projectDescription>
<name>PapersProject</name>
<comment></comment>
<projects>
</projects>
<buildSpec>
<buildCommand>
<name>org.scala-ide.sdt.core.scalabuilder</name>
<arguments>
</arguments>
</buildCommand>
</buildSpec>
<natures>
<nature>org.scala-ide.sdt.core.scalanature</nature>
<nature>org.eclipse.jdt.core.javanature</nature>
</natures>
</projectDescription>
15 changes: 15 additions & 0 deletions README.txt
@@ -0,0 +1,15 @@
COMPILATION

To compile and run Trail Head, use sbt or any other build tool.
The only requirement is that the directory containing the final executable also contains:
- the "tools" folder
- the "cache" folder
- the "pdf2xml.dtd" file



EXECUTION

When you start the program and provide the input path, an exception is raised if there is no "schedule.xml" at that location. That code was not part of my tasks and I don't know exactly how it works, so I left it untouched.
However, parsing completes before the exception is raised.
To avoid the exception, comment out the code that runs after the parsing step.
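
As an alternative to commenting code out, the missing-file case described above could be checked up front. The following is a minimal sketch, not the project's code: `ScheduleGuard` and `hasSchedule` are hypothetical names, and it assumes the schedule file sits directly under the input path with the name "schedule.xml".

```scala
import java.io.File

object ScheduleGuard {
  // Hypothetical helper: true when <inputPath>/schedule.xml exists as a regular file.
  def hasSchedule(inputPath: String): Boolean =
    new File(inputPath, "schedule.xml").isFile

  def main(args: Array[String]): Unit = {
    // Fall back to the current directory when no path is given.
    val inputPath = if (args.nonEmpty) args(0) else "."
    if (hasSchedule(inputPath))
      println("schedule.xml found; the schedule step can run")
    else
      println("no schedule.xml in " + inputPath + "; skipping the schedule step")
  }
}
```

A caller would run the schedule step only when `hasSchedule` returns true, so parsing still completes and no exception reaches the user.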
14 changes: 13 additions & 1 deletion build.sbt
@@ -2,12 +2,24 @@ name := "Parsing papers"

version := "1.0"

scalaVersion := "2.9.1"
scalaVersion := "2.9.2"

scalaSource in Compile <<= baseDirectory(_ / "src/paper")

scalacOptions ++= Seq("-unchecked", "-Ywarn-dead-code", "-deprecation")

libraryDependencies ++= Seq(
// other dependencies here
// pick and choose:
"org.scalanlp" %% "breeze-process" % "0.1"
)

resolvers ++= Seq(
// other resolvers here
// if you want to use snapshot builds (currently 0.2-SNAPSHOT), use this.
"Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/"
)

initialCommands := """
import System.{currentTimeMillis => now}
def time[T](f: => T): T = {
28 changes: 28 additions & 0 deletions pdf2xml.dtd
@@ -0,0 +1,28 @@
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT pdf2xml (page+)>
<!ELEMENT page (fontspec*, text*)>
<!ATTLIST page
number CDATA #REQUIRED
position CDATA #REQUIRED
top CDATA #REQUIRED
left CDATA #REQUIRED
height CDATA #REQUIRED
width CDATA #REQUIRED
>
<!ELEMENT fontspec EMPTY>
<!ATTLIST fontspec
id CDATA #REQUIRED
size CDATA #REQUIRED
family CDATA #REQUIRED
color CDATA #REQUIRED
>
<!ELEMENT text (#PCDATA | b | i)*>
<!ATTLIST text
top CDATA #REQUIRED
left CDATA #REQUIRED
width CDATA #REQUIRED
height CDATA #REQUIRED
font CDATA #REQUIRED
>
<!ELEMENT b (#PCDATA)>
<!ELEMENT i (#PCDATA)>
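
To illustrate the shape of the documents this DTD describes, here is a minimal sketch that reads the `text` elements of a small sample with the JDK's DOM parser. The sample document, `Pdf2XmlSketch`, and `textLines` are made up for illustration; this is not the parser added in this pull request, and it does not validate against the DTD.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory

object Pdf2XmlSketch {
  // Made-up sample that uses the element and attribute names from the DTD.
  val sample: String =
    """<pdf2xml>
      |<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
      |<fontspec id="0" size="24" family="Times" color="#000000"/>
      |<text top="100" left="150" width="600" height="30" font="0"><b>A Sample Title</b></text>
      |<text top="140" left="150" width="300" height="12" font="0">Body text</text>
      |</page>
      |</pdf2xml>""".stripMargin

  // Returns (top, left, font id, text content) for every <text> element.
  def textLines(xml: String): List[(Int, Int, String, String)] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    val nodes = doc.getElementsByTagName("text")
    (0 until nodes.getLength).toList.map { i =>
      val e = nodes.item(i).asInstanceOf[org.w3c.dom.Element]
      // getTextContent flattens nested <b>/<i> children into plain text.
      (e.getAttribute("top").toInt, e.getAttribute("left").toInt,
        e.getAttribute("font"), e.getTextContent)
    }
  }

  def main(args: Array[String]): Unit =
    textLines(sample).foreach(println)
}
```

The `top`, `left`, and `font` attributes are the kind of layout information a title or paragraph detector can work from, which is why the sketch returns them alongside the text.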
98 changes: 80 additions & 18 deletions src/paper/Analyze.scala
@@ -1,56 +1,118 @@
package paper

object Analyze {
def main(args : Array[String]): Unit = {

// create analyse
def main(args : Array[String]): Unit= {
// create analyzer
val A : Analyzer = new Analyzer()

// Check that a directory is supplied (there is an argument)
if (args.length == 0) println("You really need to supply a directory as argument");
if (args.length == 0 || args.length > 2) {
println("You should provide at least a path and at most a path and an option. Type -h for help.");
}

else {
val options = readOptions(args.last)

// Then go ahead
else A.analyze(args(0))
// Then go ahead
A.analyze(args(0), options)
}
}

// Reads in the options and converts them to a map of options
def readOptions(s : String) : Map[String,Boolean] = {

var options : Map[String,Boolean] = Map();

if (s.length > 0 && s(0) == '-') {
// Every option defaults to false unless its flag is present
options = Map[String, Boolean]().withDefaultValue(false)

// Check for parsing
if (s.contains('p')) options += ("parse" -> true)

// Check for getting schedule
if (s.contains('s')) options += ("xmlschedule" -> true)

// Check for extending
if (s.contains('e')) options += ("extend" -> true)

// Check for linking
if (s.contains('l')) options += ("link" -> true)

// Check for graph creation
if (s.contains('g')) options += ("graph" -> true)

if (s.contains('h')) {
println("""How to call: Analyze [path] [parameter]?
|PARAMETERS:
|  -p : parsing
|  -s : looks for xml scheduler
|  -e : extend
|  -l : link
|  -g : create graph
|  -h : shows this help page
|  nothing : do everything""".stripMargin);
}

}

else {
// If no options are supplied we do everything by default
options = Map[String, Boolean]().withDefaultValue(true)
}

return options
}
}



class Analyzer extends Object with LoadPaper
with ParsePaper
with ExtendPaper
with ComparePaper
with BagOfWordsLSI
with XMLScheduleParser
with Graphs {

// Set a limit in percent for when papers get an edge between them
val limit : Int = 1

// Get cached papers in this order
val cache : List[String] = List(Cache.extended, Cache.linked, Cache.scheduled, Cache.parsed)
val cache : List[String] = List(Cache.linked, Cache.extended, Cache.scheduled, Cache.parsed)

// Set sources we want to extend with
//val sources : List[PaperSource] = List(TalkDates, TalkRooms, PdfLink)
val sources : List[PaperSource] = List(PdfLink)

// Analyze a paper
def analyze(paperPos: String): Unit = {
def analyze(paperPos: String, options: Map[String, Boolean]): List[Paper] = {

var papers : List[Paper] = List();

// Get a list of parsed papers
val papers : List[Paper] = load(paperPos, cache, Isit)
if (options("parse") == true) {
papers = loadAndParse(paperPos, cache, XMLParser, XMLConverterLoader)
}

// Mix in the schedule XML data
val xmlPapers : List[Paper] = getXMLSchedule(paperPos, papers)
if (options("xmlschedule") == true) {
papers = getXMLSchedule(paperPos, papers)
}

// Extend papers with tertiary data
if (options("extend") == true) {
papers = extend(paperPos, papers, sources)
}

// Compare the papers individually
val comparedPapers : List[Paper] = compare(xmlPapers, limit)
if (options("link") == true) {
papers = compareBoWLSI(paperPos, papers, limit)
}

// Extend papers with tertiary data
val extendedPapers : List[Paper] = extend(comparedPapers, sources)

// Create graph
val graph : Graph = getGraph(extendedPapers)
if (options("graph") == true) {
val graph : Graph = getGraph(paperPos, papers)

// Print graph to file 'data.json'
graph.save
// Print graph to file 'data.json'
graph.save
}

// Now return the papers as is
return papers
}
}
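
The flag handling in this file can be summarised with a small self-contained sketch. It re-implements the mapping from flag letters to pipeline stages in a more compact form, for illustration only: `OptionsSketch`, `flags`, and `enabledStages` are hypothetical names, and the stage order is assumed to follow the order of the checks in `analyze`.

```scala
object OptionsSketch {
  // Flag letters and the option names they enable, in pipeline order.
  val flags: List[(Char, String)] = List(
    'p' -> "parse", 's' -> "xmlschedule", 'e' -> "extend", 'l' -> "link", 'g' -> "graph")

  // A flag string such as "-pl" enables only the named stages;
  // any other argument (for example a bare path) enables every stage.
  def readOptions(s: String): Map[String, Boolean] = {
    if (s.startsWith("-")) {
      val enabled = flags.collect { case (c, name) if s.exists(_ == c) => name -> true }
      enabled.toMap.withDefaultValue(false)
    } else {
      Map[String, Boolean]().withDefaultValue(true)
    }
  }

  // The stages that would run for a given last argument.
  def enabledStages(s: String): List[String] = {
    val options = readOptions(s)
    flags.map(_._2).filter(options)
  }

  def main(args: Array[String]): Unit = {
    println(enabledStages("-pl"))
    println(enabledStages("papers/"))
  }
}
```

With this mapping, `enabledStages("-pl")` yields only the parse and link stages, while a call with just a path runs all five. Because the default is carried by the map itself, the `analyze` method can look up any option name without first checking that the key exists.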